The Unseen Bias: How Norm Discrepancy in Pre-Norm MLLMs Leads to Visual Information Loss

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Multimodal Large Language Model; Pre-Normalization
Abstract:

Multimodal Large Language Models (MLLMs), which couple pre-trained vision encoders and language models, have shown remarkable capabilities. However, their reliance on the ubiquitous Pre-Norm architecture introduces a subtle yet critical flaw: a severe norm disparity between high-norm visual tokens and low-norm text tokens. In this work, we present a formal theoretical analysis demonstrating that this imbalance is not a static issue. Instead, it induces an "asymmetric update dynamic," in which high-norm visual tokens exhibit a "representational inertia," causing them to transform semantically much more slowly than their textual counterparts. This fundamentally impairs effective cross-modal feature fusion. Our empirical validation across a range of mainstream MLLMs confirms that this theoretical dynamic (the persistence of norm disparity and the resulting asymmetric update rates) is a prevalent phenomenon. Based on this insight, we propose a remarkably simple yet effective solution: inserting a single, carefully initialized LayerNorm layer after the visual projector to enforce norm alignment. Experiments conducted on the LLaVA-1.5 architecture show that this intervention yields significant performance gains not only on a wide suite of multimodal benchmarks but also, notably, on text-only evaluations such as MMLU, suggesting that resolving the architectural imbalance leads to a more holistically capable model.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper formalizes asymmetric update dynamics arising from norm disparity between visual and text tokens in Pre-Norm multimodal architectures, proposing a LayerNorm insertion solution. It occupies the 'Pre-Norm Architecture Norm Disparity' leaf within the 'Architectural Imbalance and Norm Disparity Analysis' branch. Notably, this leaf contains only the original paper itself—no sibling papers were identified in the taxonomy. This suggests the specific focus on Pre-Norm architectural norm disparity as a root cause of asymmetric updates represents a relatively sparse research direction within the broader field of modality imbalance.

The taxonomy reveals three main branches addressing modality imbalance: architectural analysis, gradient-based mitigation, and multi-task optimization. The paper's diagnostic stance on Pre-Norm architectures contrasts with neighboring gradient-based methods (e.g., gradient alignment, reward modulation) that intervene during training without analyzing structural causes. The 'Expert Activation Gradient Imbalance' sibling category examines mixture-of-experts architectures, indicating that architectural sources of imbalance extend beyond Pre-Norm designs. The paper's theoretical framing of 'representational inertia' bridges architectural analysis and the gradient-based mitigation branch, offering foundational insights that could inform intervention strategies.

Among nineteen candidates examined across three contributions, no refutable prior work was identified. The theoretical formalization examined ten candidates with zero refutations, suggesting the formal analysis of asymmetric update dynamics in Pre-Norm MLLMs may be novel within this limited search scope. The empirical validation (one candidate examined) and the Gradient-Aware Norm Alignment mechanism (eight candidates examined) similarly showed no clear overlap. However, the small candidate pool—especially for empirical validation—means the analysis cannot rule out relevant prior work outside the top-K semantic matches or citation network examined.

Based on the limited search scope, the work appears to occupy a relatively unexplored niche: formalizing norm disparity as a structural cause of asymmetric updates in Pre-Norm MLLMs. The absence of sibling papers in its taxonomy leaf and zero refutations across nineteen candidates suggest novelty, though the small candidate pool and sparse taxonomy structure indicate this assessment is preliminary. A broader literature search might reveal related architectural analyses or norm-based interventions not captured in the current scope.

Taxonomy

Core-task Taxonomy Papers: 4
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Papers: 0

Research Landscape Overview

Core task: norm disparity and asymmetric update dynamics in multimodal large language models.

The taxonomy suggests a field structured around three main branches that collectively address how different modalities (e.g., vision and language) interact unevenly during training and fine-tuning of large models. The first branch, Architectural Imbalance and Norm Disparity Analysis, examines structural sources of imbalance, such as how Pre-Norm architectures can lead to divergent gradient norms across modalities, providing diagnostic insight into why certain components update at different rates. The second branch, Gradient-Based Mitigation of Modality Imbalance, focuses on intervention strategies that directly manipulate gradients or learning rates to counteract these disparities, ensuring more balanced multimodal learning. The third branch, Multi-Task Optimization Imbalance in LLM Post-Training, broadens the scope to post-training scenarios (e.g., reinforcement learning from human feedback) where multiple objectives or reward signals can exacerbate imbalance, requiring careful gradient modulation or task weighting.

A particularly active line of work centers on gradient-based mitigation techniques: See-Saw Modality Balance[1] and Reward Gradient Modulation[2] both propose dynamic reweighting schemes to prevent one modality from dominating updates, while Imbalanced Gradients RL[3] extends similar ideas to the reinforcement learning setting. In contrast, architectural analyses such as Unseen Bias[0] take a more diagnostic stance, investigating how layer normalization placement and parameter initialization contribute to norm disparity in Pre-Norm architectures. Unseen Bias[0] sits squarely within the Architectural Imbalance branch, emphasizing root-cause analysis of asymmetric dynamics rather than proposing mitigation heuristics. Its focus on Pre-Norm architectures complements gradient-based methods by identifying when and why imbalance arises, offering a foundation for understanding the trade-offs between architectural choices and the need for explicit gradient interventions.

Claimed Contributions

Theoretical formalization of asymmetric update dynamics in Pre-Norm MLLMs

The authors provide a formal theoretical analysis demonstrating that norm disparity between visual and text tokens in Pre-Norm architectures induces an asymmetric update dynamic. High-norm visual tokens exhibit representational inertia, transforming semantically slower than text tokens, which impairs cross-modal feature fusion.

10 retrieved papers
Extensive empirical validation across mainstream MLLMs

The authors conduct comprehensive empirical experiments on multiple state-of-the-art MLLMs to validate that the theoretically predicted norm disparities and asymmetric update rates actually occur in real-world models, confirming their theoretical framework.

1 retrieved paper
Gradient-Aware Norm Alignment with Global Weight Compensation mechanism

The authors introduce a solution that inserts a carefully initialized LayerNorm layer after the visual projector to enforce norm alignment, combined with a Global Weight Compensation mechanism that prevents vanishing gradients during training while maintaining forward-pass norm alignment.

8 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Theoretical formalization of asymmetric update dynamics in Pre-Norm MLLMs

The authors provide a formal theoretical analysis demonstrating that norm disparity between visual and text tokens in Pre-Norm architectures induces an asymmetric update dynamic. High-norm visual tokens exhibit representational inertia, transforming semantically slower than text tokens, which impairs cross-modal feature fusion.
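The claimed dynamic follows directly from the Pre-Norm residual form x_{l+1} = x_l + F(LN(x_l)): because LayerNorm fixes the scale of the sublayer input, the magnitude of the update is roughly independent of ||x_l||, so a high-norm token changes proportionally less per layer. A minimal NumPy sketch of this effect (the norms of 100 and 10 for the "visual" and "text" tokens, the dimension, and the weight scale are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256

def layer_norm(x):
    # LayerNorm without a learnable affine: zero mean, unit variance.
    return (x - x.mean()) / (x.std() + 1e-6)

def prenorm_step(x, W):
    # Pre-Norm residual update: x <- x + F(LN(x)).
    # LN(x) has a fixed scale, so the sublayer output's magnitude
    # is (roughly) independent of ||x||.
    return x + W @ layer_norm(x)

W = 0.05 * rng.standard_normal((d, d))

# Hypothetical tokens: a high-norm "visual" token and a low-norm "text" token.
visual = 100.0 * rng.standard_normal(d) / np.sqrt(d)
text = 10.0 * rng.standard_normal(d) / np.sqrt(d)

for name, x in [("visual", visual), ("text", text)]:
    x_new = prenorm_step(x, W)
    rel = np.linalg.norm(x_new - x) / np.linalg.norm(x)
    cos = x @ x_new / (np.linalg.norm(x) * np.linalg.norm(x_new))
    print(f"{name}: ||x|| = {np.linalg.norm(x):6.1f}, "
          f"relative update = {rel:.3f}, cos(x, x') = {cos:.4f}")
```

In this toy setting the high-norm token's relative update is about an order of magnitude smaller and its direction barely moves, which is one way to read the paper's "representational inertia."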

Contribution

Extensive empirical validation across mainstream MLLMs

The authors conduct comprehensive empirical experiments on multiple state-of-the-art MLLMs to validate that the theoretically predicted norm disparities and asymmetric update rates actually occur in real-world models, confirming their theoretical framework.
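The persistence claim can likewise be illustrated in simulation rather than on real models: iterating the same Pre-Norm update over many layers, an initial norm gap closes only slowly, so the visual-to-text norm ratio stays above 1 throughout the stack. A toy sketch (depth, dimension, and initial norms are assumptions for illustration, not measurements from any MLLM):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 256, 24

def layer_norm(x):
    # LayerNorm without a learnable affine: zero mean, unit variance.
    return (x - x.mean()) / (x.std() + 1e-6)

def prenorm_step(x, W):
    # Pre-Norm residual update: x <- x + F(LN(x)).
    return x + W @ layer_norm(x)

visual = 100.0 * rng.standard_normal(d) / np.sqrt(d)  # high-norm visual token
text = 10.0 * rng.standard_normal(d) / np.sqrt(d)     # low-norm text token

ratios = []
for _ in range(n_layers):
    W = 0.05 * rng.standard_normal((d, d))  # fresh random weights per layer
    visual, text = prenorm_step(visual, W), prenorm_step(text, W)
    ratios.append(np.linalg.norm(visual) / np.linalg.norm(text))

print(f"visual/text norm ratio after layer 1: {ratios[0]:.1f}, "
      f"after layer {n_layers}: {ratios[-1]:.1f}")
```

Because each residual step adds a roughly constant-magnitude vector to both tokens, the gap shrinks only sublinearly with depth; the ratio decays but remains well above parity across all 24 simulated layers.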

Contribution

Gradient-Aware Norm Alignment with Global Weight Compensation mechanism

The authors introduce a solution that inserts a carefully initialized LayerNorm layer after the visual projector to enforce norm alignment, combined with a Global Weight Compensation mechanism that prevents vanishing gradients during training while maintaining forward-pass norm alignment.
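As a rough illustration of the insertion idea only (not the authors' implementation: the Global Weight Compensation mechanism is omitted, and the gain-initialization rule below is an assumption), one can place a LayerNorm after the visual projector whose per-dimension gain is set so that visual-token norms land at the typical text-token norm:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 256

class AlignLN:
    """LayerNorm whose gain is initialized to hit a target (text-token) norm.

    Hypothetical sketch: gamma = target_norm / sqrt(d) maps the unit-variance
    normalized vector (norm ~ sqrt(d)) onto the target norm.
    """
    def __init__(self, d, target_norm):
        self.gamma = np.full(d, target_norm / np.sqrt(d))  # per-dim gain
        self.beta = np.zeros(d)                            # per-dim bias

    def __call__(self, x):
        mu = x.mean(-1, keepdims=True)
        sigma = x.std(-1, keepdims=True)
        return self.gamma * (x - mu) / (sigma + 1e-6) + self.beta

# Simulated projector outputs: high-norm visual tokens, low-norm text tokens.
visual = 80.0 * rng.standard_normal((16, d)) / np.sqrt(d)
text = 8.0 * rng.standard_normal((16, d)) / np.sqrt(d)

text_norm = np.linalg.norm(text, axis=-1).mean()
ln = AlignLN(d, target_norm=text_norm)
aligned = ln(visual)

print("mean visual norm before:", np.linalg.norm(visual, axis=-1).mean())
print("mean text norm:         ", text_norm)
print("mean visual norm after: ", np.linalg.norm(aligned, axis=-1).mean())
```

After the aligned LayerNorm, visual-token norms match the text-token scale in the forward pass, which is the precondition the paper identifies for symmetric update dynamics; keeping gradients healthy under this rescaling is what the (omitted) compensation mechanism addresses.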