Fixing Phi-3.5-mini QNN Error With Transformers

by Mei Lin

Hey everyone! It looks like we've hit a snag with the Phi-3.5-mini QNN example when using the latest transformers library. If you're following the Olive tutorial and running into a RuntimeError, you're not alone. Let's break down the issue, understand why it's happening, and, most importantly, figure out how to fix it. Along the way we'll cover Phi-3.5-mini, the QNN (Qualcomm AI Engine Direct) workflow in Olive, the transformers library, and the specific RuntimeError you'll encounter.

Understanding the Bug: A Detailed Look

The Problem

The core issue manifests as a RuntimeError during the GPTQ quantization process within Olive. Specifically, the error message states: The size of tensor a (32) must match the size of tensor b (96) at non-singleton dimension 3. This cryptic message points to a mismatch in tensor shapes during quantization, a step that is crucial for reducing model size and improving inference speed. It helps to be clear about what quantization means in the context of Phi-3.5-mini and QNN. Quantization, in simple terms, is like compressing a picture: you reduce the file size without losing too much quality. In machine learning, this means reducing the precision of the numbers used in the model, making it smaller and faster. The process has to line up exactly with the model's tensor shapes, though, or we end up with dimension mismatches like the one we're seeing here.

The Root Cause: Transformers Library Update

It turns out that a recent update to the transformers library is the culprit. The error arises from changes in how tensor shapes are handled in the Phi-3 model's attention mechanism, specifically in the rotary positional embeddings. The apply_rotary_pos_emb function in the modeling_phi3.py file is where the error surfaces. This function applies rotary positional embeddings to the query and key states in the attention mechanism, and the failure happens in the line q_embed = torch.cat([(q_rot * cos) + (rotate_half(q_rot) * sin), q_pass], dim=-1). The wording of the error ("non-singleton dimension 3") is PyTorch's broadcasting complaint, which suggests the mismatch is most likely between the rotary slice q_rot and the cos/sin tensors it is multiplied with, rather than in the concatenation itself; either way, the shapes inside this line no longer line up the way the Olive GPTQ pass expects. This highlights how tightly the Phi-3 architecture and the transformers implementation are coupled: a change in one can silently break tooling built on the other, which is why compatibility testing matters whenever library updates occur.

Reproducing the Bug

To reproduce the bug, simply follow the Phi-3.5-mini example provided in the Olive repository: https://github.com/microsoft/Olive/tree/main/examples/phi3_5. Running the command olive run --config qnn_config.json will trigger the error if you're using a version of the transformers library affected by this issue. Having a clear way to reproduce the problem is crucial for effective debugging: it lets developers isolate the failure, confirm that a fix works as expected, and guard against regressions in future updates. The specific configuration file, qnn_config.json, matters too, since it defines the quantization parameters and the overall workflow that leads to the error, so understanding the configuration is key to understanding the context in which the failure occurs.
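For reference, a minimal repro sketch looks like this (assuming the example layout in the linked repository; install Olive and the example's dependencies as its README describes before running):

git clone https://github.com/microsoft/Olive.git
cd Olive/examples/phi3_5
olive run --config qnn_config.json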

The Solution: Downgrading Transformers

The Quick Fix

The quickest and most reliable solution is to downgrade your transformers package version to 4.53.*. This version is known to be compatible with the current Olive configuration for Phi-3.5-mini. You can do this using pip:

pip install transformers==4.53.0

This command will install the specific version of the transformers library, effectively sidestepping the bug introduced in later versions. This immediate solution allows users to continue working with the Phi-3.5-mini model and QNN using Olive without being blocked by the RuntimeError. It's a practical workaround while a more permanent fix is developed and released. This emphasizes the importance of version control in software development, especially when dealing with external libraries. Pinning specific versions can help ensure stability and prevent unexpected issues caused by updates.
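After the downgrade, a quick sanity check (just a sketch; run it in the same environment you use for Olive) confirms that the pinned version is the one actually being imported:

import transformers

# verify the environment picked up the 4.53.x build installed above
assert transformers.__version__.startswith("4.53"), transformers.__version__
print("transformers version:", transformers.__version__)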

Why This Works

Downgrading works because it reverts the transformers library to a state where the tensor shape handling is compatible with the Phi-3 model's attention mechanism as implemented in Olive. The changes in later versions of transformers, while potentially beneficial in other contexts, introduced an incompatibility in this specific scenario. This highlights the complexity of software dependencies and the potential for seemingly unrelated changes to have unintended consequences. It's a reminder that libraries evolve over time, and maintaining compatibility requires ongoing effort and testing.
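If you manage dependencies with pinned requirements, you can also express the workaround as a version range so pip won't silently pull in a newer, incompatible release (the upper bound here is an assumption based on the 4.53.* guidance above):

pip install "transformers>=4.53.0,<4.54"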

Long-Term Solution and What's Next

The Plan

A proper fix involves either updating the Olive code to accommodate the changes in the latest transformers library or working with the transformers maintainers to address the incompatibility. This is crucial for ensuring that Olive remains compatible with future releases of transformers and that users can benefit from the latest features and improvements. A robust solution will likely involve modifying the way Olive handles the tensor shapes during the quantization process, specifically within the apply_rotary_pos_emb function or its equivalent. This might require a deeper understanding of the changes in transformers and how they affect the Phi-3 model's architecture. Collaboration between the Olive team and the transformers community could be valuable in identifying the best approach and implementing a sustainable solution.

Staying Updated

Keep an eye on the Olive GitHub repository (https://github.com/microsoft/Olive) for updates and a permanent fix. You can also subscribe to the issue tracker to receive notifications when there are new developments. This proactive approach ensures that you're informed about the progress and can apply the fix as soon as it's available. Open-source projects often rely on community involvement, so staying informed and contributing when possible can help improve the software for everyone. This also demonstrates the importance of community engagement in software development, as users and developers work together to identify and resolve issues.

Diving Deeper into the Error

The Technical Details

Let's dig a bit deeper into the technical details of the error. The RuntimeError occurs in the apply_rotary_pos_emb function, a crucial part of the Phi-3 model's attention mechanism. This function applies rotary positional embeddings to the query and key states, which is how the model encodes the order of tokens in a sequence. The error message The size of tensor a (32) must match the size of tensor b (96) at non-singleton dimension 3 is PyTorch telling us that two tensors being combined have incompatible sizes along their last dimension, so the element-wise operation (or the subsequent concatenation) cannot proceed. To understand why, we need to look at the tensor shapes and how they're manipulated within apply_rotary_pos_emb: the query and key states are reshaped and split before the rotary embeddings are applied, and if that split is no longer consistent with the shape of the cos/sin tensors, we get exactly this kind of dimension mismatch.
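To see how PyTorch produces this exact message, here is a tiny standalone illustration with made-up shapes; the 32 and 96 mirror the numbers in the error, but the real tensors in the model are the batched query/key rotary slices and the cos/sin caches:

import torch

# purely illustrative 4-D shapes: the last dimension of the "rotary slice" (32)
# does not match the last dimension of the cos/sin tensor it is multiplied with (96)
q_rot = torch.randn(1, 32, 8, 32)
cos = torch.randn(1, 1, 8, 96)
q_rot * cos  # RuntimeError: The size of tensor a (32) must match the size of tensor b (96) at non-singleton dimension 3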

Inspecting the Code

To further investigate, we can examine the relevant code snippets from transformers/models/phi3/modeling_phi3.py. The apply_rotary_pos_emb function looks something like this:

def apply_rotary_pos_emb(q, k, cos, sin):
    # split q/k into the rotary slice and the pass-through slice (rotate_half is defined in the same file)
    rotary_dim = cos.shape[-1]
    q_rot, q_pass = q[..., :rotary_dim], q[..., rotary_dim:]
    k_rot, k_pass = k[..., :rotary_dim], k[..., rotary_dim:]
    q_embed = torch.cat([(q_rot * cos) + (rotate_half(q_rot) * sin), q_pass], dim=-1)
    k_embed = torch.cat([(k_rot * cos) + (rotate_half(k_rot) * sin), k_pass], dim=-1)
    return q_embed, k_embed

Here, q_rot and q_pass are slices of the query tensor, and the error surfaces during the multiply/add and torch.cat in that line. By printing the shapes of q_rot, q_pass, cos, and sin just before it, we can pinpoint exactly which dimensions disagree. This kind of debugging technique, adding print statements to inspect intermediate values, is a common and effective way to diagnose tensor shape issues in PyTorch models. It lets us see the tensors as they exist at different stages of the computation, making it easier to spot where the shapes diverge from expectations.
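Concretely, a throwaway debugging edit to your local copy of modeling_phi3.py might look like this (the prints are the only addition; remove them once you've captured the shapes):

# temporary prints added just above the failing line in apply_rotary_pos_emb
print("q_rot:", q_rot.shape, "q_pass:", q_pass.shape)
print("cos:", cos.shape, "sin:", sin.shape)
q_embed = torch.cat([(q_rot * cos) + (rotate_half(q_rot) * sin), q_pass], dim=-1)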

Quantization and its Challenges

The Importance of Quantization

Quantization is a critical technique for deploying large language models like Phi-3.5-mini on resource-constrained devices. By reducing the precision of the model's weights and activations, we can significantly decrease its memory footprint and computational requirements. This makes it possible to run the model on devices with limited resources, such as mobile phones or edge devices. However, quantization is not without its challenges. It can introduce some loss of accuracy, as the reduced precision means that the model's representations are less fine-grained. Therefore, it's essential to carefully choose the quantization method and parameters to balance the trade-off between model size and accuracy. GPTQ (Generative Post-Training Quantization) is a popular method for quantizing large language models, as it aims to minimize the accuracy loss by optimizing the quantized weights based on a small calibration dataset. However, GPTQ can be sensitive to the specific architecture and implementation details of the model, which is why we're seeing this issue with the transformers library update.
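As a purely illustrative sketch of the idea (not Olive's or GPTQ's actual algorithm), symmetric 4-bit weight quantization amounts to mapping floating-point weights onto a small integer grid with a scale factor, and the accuracy loss comes from the rounding step:

import torch

def quantize_4bit_symmetric(w: torch.Tensor):
    # map weights to integers in [-8, 7] using a single per-tensor scale
    scale = w.abs().max() / 7
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return q, scale

w = torch.randn(256, 256)
q, scale = quantize_4bit_symmetric(w)
w_hat = q * scale  # dequantized approximation of the original weights
print("max quantization error:", (w - w_hat).abs().max().item())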

Quantization and Compatibility

The bug we're discussing highlights the importance of compatibility between quantization tools and the underlying model implementation. Changes in the model architecture or the way tensors are handled can break the quantization process, leading to errors like the RuntimeError we're seeing. This means that quantization tools need to be updated and tested whenever there are significant changes in the models they support. In this case, the update to the transformers library introduced changes that were not fully compatible with the GPTQ quantization process used by Olive. This emphasizes the need for thorough integration testing and version control when working with complex machine learning pipelines. It's not enough to simply quantize a model; we also need to ensure that the quantized model behaves as expected and maintains its accuracy. This often involves careful evaluation and fine-tuning of the quantization process.

Looking Ahead

This issue with the Phi-3.5-mini QNN example serves as a valuable learning experience. It underscores the importance of staying updated with library changes, understanding the technical details of the error, and having a clear solution in place. By downgrading the transformers package, we can quickly resolve the issue and continue working with the model. And by keeping an eye on the Olive GitHub repository, we can stay informed about the long-term solution. So, let's keep learning and building amazing things with AI, guys!