The Unseen Bias: How Norm Discrepancy in Pre-Norm MLLMs Leads to Visual Information Loss
Overview
Overall Novelty Assessment
The paper formalizes the asymmetric update dynamics that arise from the norm disparity between visual and text tokens in Pre-Norm multimodal architectures, and proposes a LayerNorm-insertion solution. It occupies the 'Pre-Norm Architecture Norm Disparity' leaf within the 'Architectural Imbalance and Norm Disparity Analysis' branch. Notably, this leaf contains only the paper itself; no sibling papers were identified in the taxonomy. This suggests that treating Pre-Norm norm disparity as a root cause of asymmetric updates is a relatively sparse research direction within the broader field of modality imbalance.
The taxonomy reveals three main branches addressing modality imbalance: architectural analysis, gradient-based mitigation, and multi-task optimization. The paper's diagnostic stance on Pre-Norm architectures contrasts with neighboring gradient-based methods (e.g., gradient alignment, reward modulation) that intervene during training without analyzing structural causes. The 'Expert Activation Gradient Imbalance' sibling category examines mixture-of-experts architectures, indicating that architectural sources of imbalance extend beyond Pre-Norm designs. The paper's theoretical framing of 'representational inertia' bridges architectural analysis and the gradient-based mitigation branch, offering foundational insights that could inform intervention strategies.
Among the nineteen candidates examined across the three contributions, no refutable prior work was identified. The theoretical formalization was checked against ten candidates with zero refutations, suggesting that the formal analysis of asymmetric update dynamics in Pre-Norm MLLMs may be novel within this limited search scope. The empirical validation (one candidate) and the Gradient-Aware Norm Alignment mechanism (eight candidates) likewise showed no clear overlap. However, the small candidate pool, particularly the single candidate for the empirical validation, means the analysis cannot rule out relevant prior work outside the examined top-K semantic matches and citation network.
Within this limited search scope, the work appears to occupy a relatively unexplored niche: formalizing norm disparity as a structural cause of asymmetric updates in Pre-Norm MLLMs. The absence of sibling papers in its taxonomy leaf and zero refutations across nineteen candidates suggest novelty, though the small candidate pool and sparse taxonomy structure make this assessment preliminary. A broader literature search might reveal related architectural analyses or norm-based interventions not captured here.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors provide a formal theoretical analysis showing that the norm disparity between visual and text tokens in Pre-Norm architectures induces an asymmetric update dynamic: high-norm visual tokens exhibit representational inertia, with their representations evolving more slowly across layers than those of text tokens, which impairs cross-modal feature fusion.
The authors conduct empirical experiments on multiple state-of-the-art MLLMs, confirming that the theoretically predicted norm disparities and asymmetric update rates occur in practice and thereby validating their theoretical framework.
The authors introduce Gradient-Aware Norm Alignment, which inserts a carefully initialized LayerNorm layer after the visual projector to enforce norm alignment, combined with a Global Weight Compensation mechanism that preserves forward-pass norm alignment while preventing vanishing gradients during training.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Theoretical formalization of asymmetric update dynamics in Pre-Norm MLLMs
The authors provide a formal theoretical analysis showing that the norm disparity between visual and text tokens in Pre-Norm architectures induces an asymmetric update dynamic: high-norm visual tokens exhibit representational inertia, with their representations evolving more slowly across layers than those of text tokens, which impairs cross-modal feature fusion.
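The mechanism behind this inertia can be illustrated with a minimal numpy sketch (not the paper's derivation): in a Pre-Norm block, x ← x + F(LN(x)), and since LayerNorm is scale-invariant, the residual branch contributes an update of roughly fixed magnitude regardless of the input's norm. A token with 10× the norm therefore receives a roughly 10× smaller relative update. The linear map `W` standing in for the sublayer F is a simplification.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

def layer_norm(x, eps=1e-5):
    # Plain LayerNorm (gain=1, bias=0): zero mean, unit variance per token.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def relative_update(x, W):
    # Pre-Norm residual branch: delta = F(LN(x)), here F is a fixed linear map.
    # LN(x) is scale-invariant, so ||delta|| is roughly constant in ||x||,
    # and the *relative* update ||delta|| / ||x|| shrinks as ||x|| grows.
    delta = layer_norm(x) @ W
    return np.linalg.norm(delta) / np.linalg.norm(x)

W = rng.normal(0, 0.1, (d, d))
text = rng.normal(0, 1, d)            # typical text token, norm ~ sqrt(d)
vision = 10.0 * rng.normal(0, 1, d)   # hypothetical high-norm visual token

# The high-norm visual token changes proportionally far less per block:
# "representational inertia" in the paper's terminology.
print(relative_update(text, W), relative_update(vision, W))
```

Because LayerNorm is exactly invariant to positive rescaling of its input, the residual update for the visual token is the same size as for a unit-scale token, so the relative-update ratio tracks the norm ratio almost exactly.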
[1] See-Saw Modality Balance: See Gradient, and Sew Impaired Vision-Language Balance to Mitigate Dominant Modality Bias
[6] Modality-Specific Learning Rates for Effective Multimodal Additive Late-Fusion
[7] Balanced Multimodal Learning via On-the-Fly Gradient Modulation
[8] Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation
[9] Wireless Interference Recognition With Multimodal Learning
[10] MMPareto: Boosting Multimodal Learning with Innocent Unimodal Assistance
[11] Gradient Decoupled Learning with Unimodal Regularization for Multimodal Remote Sensing Classification
[12] Intra- and Inter-Modal Curriculum for Multimodal Learning
[13] Contribution-Guided Asymmetric Learning for Robust Multimodal Fusion under Imbalance and Noise
[14] MGIML: Cancer Grading with Incomplete Radiology-Pathology Data via Memory Learning and Gradient Homogenization
Extensive empirical validation across mainstream MLLMs
The authors conduct empirical experiments on multiple state-of-the-art MLLMs, confirming that the theoretically predicted norm disparities and asymmetric update rates occur in practice and thereby validating their theoretical framework.
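Validation of this kind reduces to two per-layer measurements on a model's hidden states: mean token norm per modality, and mean relative update per modality. A hedged sketch of such a diagnostic follows; the helper name, the synthetic data, and the exact statistics are assumptions, since the paper's measurement protocol is not detailed here.

```python
import numpy as np

def norm_and_update_stats(hidden_states, visual_mask):
    """Per-layer diagnostics of the kind such a validation would report:
    mean token norms and mean relative update rates, split by modality.
    `hidden_states` is a list of (seq_len, d) arrays, one per layer;
    `visual_mask` is a boolean array marking visual tokens. (Hypothetical
    helper; the paper's exact measurement protocol may differ.)"""
    stats = []
    for l in range(len(hidden_states) - 1):
        h, h_next = hidden_states[l], hidden_states[l + 1]
        norms = np.linalg.norm(h, axis=-1)
        rel_update = np.linalg.norm(h_next - h, axis=-1) / (norms + 1e-8)
        stats.append({
            "layer": l,
            "visual_norm": norms[visual_mask].mean(),
            "text_norm": norms[~visual_mask].mean(),
            "visual_rel_update": rel_update[visual_mask].mean(),
            "text_rel_update": rel_update[~visual_mask].mean(),
        })
    return stats

# Toy check on synthetic states: visual tokens start at ~10x the text norm,
# and both modalities receive same-sized absolute updates between layers.
rng = np.random.default_rng(1)
seq, d = 8, 32
mask = np.zeros(seq, dtype=bool)
mask[:4] = True                          # first four tokens are "visual"
h0 = rng.normal(0, 1, (seq, d))
h0[mask] *= 10.0                         # inflate visual-token norms
h1 = h0 + rng.normal(0, 0.5, (seq, d))   # equal absolute per-token updates
stats = norm_and_update_stats([h0, h1], mask)
```

With equal absolute updates, the high-norm visual tokens show a much smaller relative update rate, which is the signature the empirical validation would look for in real hidden states.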
[5] An Information-Theoretic Evaluation of Generative Models in Learning Multi-Modal Distributions
Gradient-Aware Norm Alignment with Global Weight Compensation mechanism
The authors introduce Gradient-Aware Norm Alignment, which inserts a carefully initialized LayerNorm layer after the visual projector to enforce norm alignment, combined with a Global Weight Compensation mechanism that preserves forward-pass norm alignment while preventing vanishing gradients during training.
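The norm-alignment half of this idea can be sketched in a few lines of numpy. This is a speculative illustration, not the paper's implementation: the class name, the choice to initialize the LayerNorm gain from the text embeddings' RMS norm, and the synthetic data are all assumptions, and the Global Weight Compensation rule is omitted because its details are not given in this summary.

```python
import numpy as np

class AlignedVisualNorm:
    """Speculative sketch of norm alignment: a LayerNorm inserted after the
    visual projector, with its gain initialized so visual tokens enter the
    language model at the text embeddings' typical RMS norm. (Hypothetical;
    the paper's initialization and compensation scheme may differ.)"""

    def __init__(self, d, target_rms, eps=1e-5):
        self.gain = np.full(d, target_rms)  # per-dim gain ~ text RMS norm
        self.bias = np.zeros(d)
        self.eps = eps

    def __call__(self, v):
        mu = v.mean(-1, keepdims=True)
        var = v.var(-1, keepdims=True)
        normed = (v - mu) / np.sqrt(var + self.eps)  # unit-variance tokens
        return normed * self.gain + self.bias        # rescale to text norm

# Align hypothetical high-norm projector outputs to the text-token scale.
rng = np.random.default_rng(2)
d = 64
text_emb = rng.normal(0, 1, (16, d))
target_rms = np.sqrt((text_emb ** 2).mean())  # typical text RMS, ~1 here
visual = 10.0 * rng.normal(0, 1, (4, d))      # projector output, high norm
aligned = AlignedVisualNorm(d, target_rms)(visual)
```

After the inserted normalization, visual tokens match the text-token scale in the forward pass; the compensation mechanism described by the authors would additionally address the gradient-flow side, which this sketch does not model.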