CompoDistill: Attention Distillation for Compositional Reasoning in Multimodal LLMs

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Multimodal Large Language Model, Knowledge Distillation
Abstract:

Efficient Multimodal Large Language Models (MLLMs) have recently gained significant attention as a way to reduce the high computational cost of MLLMs and make them practical for real-world applications. In this regard, knowledge distillation (KD), which transfers the rich visual and linguistic knowledge of a larger model (teacher) to a smaller model (student), has emerged as a promising approach. However, we observe that existing KD methods struggle to effectively distill the teacher MLLM's rich visual perception abilities to the student, a challenge that has been largely overlooked in previous studies. Through a systematic analysis, we identify visual attention misalignment between student and teacher as the main cause of this issue. Based on this insight, we propose CompoDistill, a novel KD framework that explicitly aligns the student's visual attention with that of the teacher to enhance the student's visual perception abilities. Our extensive experiments show that CompoDistill significantly improves performance on compositional reasoning tasks requiring visual perception while maintaining the strong performance on visual question answering tasks achieved by existing studies. Furthermore, CompoDistill demonstrates effectiveness with a more advanced backbone, highlighting its generalizability.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes CompoDistill, a knowledge distillation framework that transfers visual perception abilities from large teacher MLLMs to smaller student models by explicitly aligning visual attention patterns. It resides in the Attention-Based Distillation leaf, which contains only two papers including this work. This leaf sits within the broader Knowledge Distillation Frameworks for MLLMs branch, which encompasses four distinct distillation strategies (attention-based, feature-level, competitive, and domain-specific). The sparse population of the attention-based leaf suggests this specific angle—using attention alignment to preserve compositional visual reasoning—is relatively underexplored compared to feature-level distillation approaches.

The taxonomy reveals that neighboring research directions include Feature-Level Distillation (three papers focusing on intermediate representations) and Competitive Distillation (two papers on bidirectional knowledge transfer). The Visual Encoding and Perception Enhancement branch offers alternative pathways to improve visual understanding through multi-encoder integration or perception tokens, rather than distillation. The Reasoning and Compositional Understanding branch addresses similar compositional challenges but through chain-of-thought methods or reward signals instead of teacher-student knowledge transfer. CompoDistill bridges these areas by framing compositional reasoning as a distillation problem requiring attention-level alignment, distinct from both pure architectural enhancements and reasoning-focused interventions.

Of the thirty candidate papers examined (ten per contribution), the claim identifying visual attention misalignment as a barrier was matched by one refutable candidate, suggesting some prior recognition of attention's role in distillation quality. The CompoDistill framework itself (the VAT and TAF modules) encountered no refutations among its ten candidates, indicating that the specific architectural combination may be novel within this limited search scope. The three-stage training strategy was likewise matched by one refutable candidate, implying that staged distillation approaches exist but may differ in implementation details. These modest refutation counts reflect the constrained search scale rather than exhaustive field coverage.

Based on the top-thirty semantic matches and citation expansion, the work appears to occupy a sparsely populated intersection of attention-based distillation and compositional visual reasoning. The taxonomy structure confirms that while distillation for MLLMs is an active area, attention-specific approaches remain less common than feature-level methods. The analysis cannot rule out relevant work outside the examined candidates, particularly in adjacent communities focused on vision transformers or general knowledge distillation beyond the MLLM context.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 2

Research Landscape Overview

Core task: distilling visual perception abilities in multimodal large language models. The field encompasses a diverse set of approaches organized around ten major branches. Knowledge Distillation Frameworks for MLLMs explore how to transfer capabilities from large teacher models to compact students, often leveraging attention mechanisms as in LLaVA-KD[4] or competitive strategies. Visual Encoding and Perception Enhancement focuses on improving how models represent and process visual inputs, with works like Visual Experts[3] and Perception Tokens[18] refining feature extraction. Visual Prompting and Steering Mechanisms investigate methods to guide model behavior through prompts or intermediate representations, exemplified by Visual Sketchpad[5] and Textual Steering Vectors[6]. Reasoning and Compositional Understanding addresses complex multi-step inference and compositional generalization, while Foundation Models and Architectural Innovations examine core design choices in systems like MM1[12]. Evaluation and Analysis branches assess visual capabilities rigorously, Cross-Modal Transfer studies knowledge sharing across modalities, Specialized Applications target domains such as medical imaging or autonomous driving, Training-Free and Few-Shot Adaptation explores efficient learning paradigms, and Survey and Conceptual Frameworks provide overarching perspectives on the landscape.

A particularly active line of work centers on distillation techniques that preserve or enhance visual understanding during model compression. CompoDistill[0] sits within the Attention-Based Distillation cluster, emphasizing how attention patterns can guide the transfer of compositional visual reasoning from teacher to student. This contrasts with approaches like LLaVA-KD[4], which also employs attention-based distillation but may prioritize different aspects of the teacher's internal representations.
Meanwhile, works such as Visual Cognition[1] and Visual Perception Reward[9] explore alternative pathways—using cognitive frameworks or reward signals to shape perception—highlighting ongoing debates about whether distillation should mimic intermediate activations, final outputs, or learned behaviors. The interplay between architectural choices, training objectives, and evaluation protocols remains a central open question, with CompoDistill[0] contributing to the discourse on how fine-grained attention alignment can support robust compositional generalization in smaller models.

Claimed Contributions

Identification of visual attention misalignment as key barrier to distilling visual perception

The authors conduct systematic analysis showing that existing knowledge distillation methods fail to transfer visual perception abilities from teacher to student MLLMs due to misalignment in visual attention distributions between the two models, particularly in visual understanding layers.

10 retrieved papers · Can Refute
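As an illustration of the kind of diagnostic this claimed analysis implies, the sketch below scores how far a student's attention distribution over visual tokens drifts from the teacher's using a symmetric KL divergence. This is a toy pure-Python example, not the paper's actual measurement; the function names, the symmetric-KL choice, and the one-to-one layer pairing are assumptions for illustration.

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions p and q."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def attention_misalignment(student_attn, teacher_attn):
    """Mean symmetric KL between per-layer attention distributions
    over visual tokens (higher = more misaligned)."""
    scores = [0.5 * (kl(s, t) + kl(t, s))
              for s, t in zip(student_attn, teacher_attn)]
    return sum(scores) / len(scores)

# Toy case: the student spreads attention uniformly over four visual
# tokens, while the teacher focuses on the second token.
student = [[0.25, 0.25, 0.25, 0.25]]
teacher = [[0.05, 0.85, 0.05, 0.05]]
print(round(attention_misalignment(student, teacher), 3))  # prints 0.85
```

A score near zero would indicate aligned attention; the uniform student above scores about 0.85 against the focused teacher, the kind of gap the paper attributes to existing KD methods.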
CompoDistill framework with VAT and TAF modules

The authors introduce CompoDistill, a knowledge distillation framework featuring two core components: the Visual ATtention alignment module to align student and teacher visual attention using group layer matching, and the Teacher Adapter Fetch module to bridge feature space gaps between models.

10 retrieved papers · No refutations found
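A minimal sketch of how group layer matching might work, under stated assumptions: partition the teacher's and student's layers into the same number of contiguous groups, average the visual-attention distributions within each group, and penalize the divergence between matched groups. The contiguous grouping, the forward-KL objective, and the names `group_average` and `vat_alignment_loss` are all hypothetical; the authors' VAT module may differ, and the TAF feature-space projection is omitted here.

```python
import math

def group_average(attn_maps, num_groups):
    """Partition per-layer attention distributions into contiguous
    groups and average within each group (group layer matching)."""
    size = len(attn_maps) // num_groups
    groups = []
    for g in range(num_groups):
        chunk = attn_maps[g * size:(g + 1) * size]
        dim = len(chunk[0])
        groups.append([sum(layer[i] for layer in chunk) / len(chunk)
                       for i in range(dim)])
    return groups

def vat_alignment_loss(student_attn, teacher_attn, num_groups, eps=1e-12):
    """KL(teacher group || student group), summed over matched groups."""
    s_groups = group_average(student_attn, num_groups)
    t_groups = group_average(teacher_attn, num_groups)
    return sum(ti * math.log((ti + eps) / (si + eps))
               for s, t in zip(s_groups, t_groups)
               for si, ti in zip(s, t))

# A 2-layer student matched against a 4-layer teacher via 2 groups:
# each student layer is aligned to the average of 2 teacher layers.
student = [[0.4, 0.6], [0.5, 0.5]]
teacher = [[0.3, 0.7], [0.5, 0.5], [0.6, 0.4], [0.4, 0.6]]
loss = vat_alignment_loss(student, teacher, num_groups=2)
```

Grouping lets a shallow student align with a deeper teacher without a one-to-one layer correspondence; in the toy case above the student already matches the teacher's group averages, so the loss is essentially zero.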
Three-stage training strategy for comprehensive visual perception distillation

The authors develop a three-stage training approach consisting of Distilled Pre-Training, Distilled Fine-Tuning with attention alignment, and Supervised Fine-Tuning to effectively transfer both visual recognition and perception abilities from teacher to student models.

10 retrieved papers · Can Refute
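The staged schedule above can be summarized as a loss-toggle sketch. The stage names follow the description; which auxiliary losses are active per stage (a generic distillation loss in the two distilled stages, attention alignment only in distilled fine-tuning) is inferred from the stage names, and `run_schedule` is a hypothetical driver, not the authors' training code.

```python
# Each stage toggles which auxiliary losses are active; the exact
# trainable modules per stage are not specified here, so only the
# loss schedule is shown.
STAGES = [
    ("distilled_pretraining", {"distill": True,  "attn_align": False}),
    ("distilled_finetuning",  {"distill": True,  "attn_align": True}),
    ("supervised_finetuning", {"distill": False, "attn_align": False}),
]

def run_schedule(stages, train_one_stage):
    """Run the stages in order; train_one_stage(name, losses) does the work."""
    log = []
    for name, flags in stages:
        losses = ["task"] + [k for k, on in flags.items() if on]
        train_one_stage(name, losses)
        log.append((name, losses))
    return log

# No-op trainer, just to show the schedule that would be executed.
log = run_schedule(STAGES, lambda name, losses: None)
```

The final supervised stage runs with the task loss alone, consistent with the description that distillation signals are confined to the first two stages.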
