CompoDistill: Attention Distillation for Compositional Reasoning in Multimodal LLMs

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Multimodal Large Language Model, Knowledge Distillation
Abstract:

Efficient Multimodal Large Language Models (MLLMs) have recently gained significant attention as a way to reduce the high computational cost of MLLMs and make them practical for real-world applications. In this regard, knowledge distillation (KD), which transfers the rich visual and linguistic knowledge of a larger model (teacher) to a smaller model (student), has emerged as a promising approach. However, we observe that existing KD methods struggle to effectively distill the teacher MLLM's rich visual perception abilities to the student, a challenge that has been largely overlooked in previous studies. Through a systematic analysis, we identify visual attention misalignment between student and teacher as the main cause of this issue. Based on this insight, we propose CompoDistill, a novel KD framework that explicitly aligns the student's visual attention with that of the teacher to enhance the student's visual perception abilities. Our extensive experiments show that CompoDistill significantly improves performance on compositional reasoning tasks requiring visual perception while maintaining the strong performance on visual question answering tasks achieved by existing studies. Furthermore, CompoDistill demonstrates effectiveness with a more advanced backbone, highlighting its generalizability.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes CompoDistill, a knowledge distillation framework that transfers visual perception abilities from large teacher MLLMs to smaller student models by explicitly aligning visual attention patterns. It resides in the Attention-Based Distillation leaf, which contains only two papers including this work. This leaf sits within the broader Knowledge Distillation Frameworks for MLLMs branch, which encompasses four distinct distillation strategies (attention-based, feature-level, competitive, and domain-specific). The sparse population of the attention-based leaf suggests this specific angle—using attention alignment to preserve compositional visual reasoning—is relatively underexplored compared to feature-level distillation approaches.

The taxonomy reveals that neighboring research directions include Feature-Level Distillation (three papers focusing on intermediate representations) and Competitive Distillation (two papers on bidirectional knowledge transfer). The Visual Encoding and Perception Enhancement branch offers alternative pathways to improve visual understanding through multi-encoder integration or perception tokens, rather than distillation. The Reasoning and Compositional Understanding branch addresses similar compositional challenges but through chain-of-thought methods or reward signals instead of teacher-student knowledge transfer. CompoDistill bridges these areas by framing compositional reasoning as a distillation problem requiring attention-level alignment, distinct from both pure architectural enhancements and reasoning-focused interventions.

Of the thirty candidate papers examined (ten per contribution), the claim identifying visual attention misalignment as a barrier was matched by one refutable candidate, suggesting some prior recognition of attention's role in distillation quality. The CompoDistill framework itself (the VAT and TAF modules) encountered no refutations among its ten candidates, indicating that the specific architectural combination may be novel within this limited search scope. The three-stage training strategy was likewise matched by one refutable candidate, implying that staged distillation approaches exist but may differ in implementation details. These modest refutation counts reflect the constrained search scale rather than exhaustive field coverage.

Based on the top-thirty semantic matches and citation expansion, the work appears to occupy a sparsely populated intersection of attention-based distillation and compositional visual reasoning. The taxonomy structure confirms that while distillation for MLLMs is an active area, attention-specific approaches remain less common than feature-level methods. The analysis cannot rule out relevant work outside the examined candidates, particularly in adjacent communities focused on vision transformers or general knowledge distillation beyond the MLLM context.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 2

Research Landscape Overview

Core task: distilling visual perception abilities in multimodal large language models. The field encompasses a diverse set of approaches organized around ten major branches. Knowledge Distillation Frameworks for MLLMs explore how to transfer capabilities from large teacher models to compact students, often leveraging attention mechanisms as in LLaVA-KD[4] or competitive strategies. Visual Encoding and Perception Enhancement focuses on improving how models represent and process visual inputs, with works like Visual Experts[3] and Perception Tokens[18] refining feature extraction. Visual Prompting and Steering Mechanisms investigate methods to guide model behavior through prompts or intermediate representations, exemplified by Visual Sketchpad[5] and Textual Steering Vectors[6]. Reasoning and Compositional Understanding addresses complex multi-step inference and compositional generalization, while Foundation Models and Architectural Innovations examine core design choices in systems like MM1[12]. Evaluation and Analysis branches assess visual capabilities rigorously, Cross-Modal Transfer studies knowledge sharing across modalities, Specialized Applications target domains such as medical imaging or autonomous driving, Training-Free and Few-Shot Adaptation explores efficient learning paradigms, and Survey and Conceptual Frameworks provide overarching perspectives on the landscape.

A particularly active line of work centers on distillation techniques that preserve or enhance visual understanding during model compression. CompoDistill[0] sits within the Attention-Based Distillation cluster, emphasizing how attention patterns can guide the transfer of compositional visual reasoning from teacher to student. This contrasts with approaches like LLaVA-KD[4], which also employs attention-based distillation but may prioritize different aspects of the teacher's internal representations.
Meanwhile, works such as Visual Cognition[1] and Visual Perception Reward[9] explore alternative pathways—using cognitive frameworks or reward signals to shape perception—highlighting ongoing debates about whether distillation should mimic intermediate activations, final outputs, or learned behaviors. The interplay between architectural choices, training objectives, and evaluation protocols remains a central open question, with CompoDistill[0] contributing to the discourse on how fine-grained attention alignment can support robust compositional generalization in smaller models.

Claimed Contributions

Identification of visual attention misalignment as key barrier to distilling visual perception

The authors conduct systematic analysis showing that existing knowledge distillation methods fail to transfer visual perception abilities from teacher to student MLLMs due to misalignment in visual attention distributions between the two models, particularly in visual understanding layers.

10 retrieved papers · Can Refute
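As an illustration of the kind of diagnostic this claimed analysis implies, the sketch below scores how far a student's attention distribution over visual tokens drifts from the teacher's using a symmetric KL divergence. This is a toy pure-Python example, not the paper's actual measurement; the function names, the symmetric-KL choice, and the one-to-one layer pairing are assumptions for illustration.

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions p and q."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def attention_misalignment(student_attn, teacher_attn):
    """Mean symmetric KL between per-layer attention distributions
    over visual tokens (higher = more misaligned)."""
    scores = [0.5 * (kl(s, t) + kl(t, s))
              for s, t in zip(student_attn, teacher_attn)]
    return sum(scores) / len(scores)

# Toy case: the student spreads attention uniformly over four visual
# tokens, while the teacher focuses on the second token.
student = [[0.25, 0.25, 0.25, 0.25]]
teacher = [[0.05, 0.85, 0.05, 0.05]]
print(round(attention_misalignment(student, teacher), 3))  # prints 0.85
```

A score near zero would indicate aligned attention; the uniform student above scores about 0.85 against the focused teacher, the kind of gap the paper attributes to existing KD methods.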
CompoDistill framework with VAT and TAF modules

The authors introduce CompoDistill, a knowledge distillation framework featuring two core components: the Visual ATtention alignment module to align student and teacher visual attention using group layer matching, and the Teacher Adapter Fetch module to bridge feature space gaps between models.

10 retrieved papers · No refutations found
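A minimal sketch of how group layer matching might work, under stated assumptions: partition the teacher's and student's layers into the same number of contiguous groups, average the visual-attention distributions within each group, and penalize the divergence between matched groups. The contiguous grouping, the forward-KL objective, and the names `group_average` and `vat_alignment_loss` are all hypothetical; the authors' VAT module may differ, and the TAF feature-space projection is omitted here.

```python
import math

def group_average(attn_maps, num_groups):
    """Partition per-layer attention distributions into contiguous
    groups and average within each group (group layer matching)."""
    size = len(attn_maps) // num_groups
    groups = []
    for g in range(num_groups):
        chunk = attn_maps[g * size:(g + 1) * size]
        dim = len(chunk[0])
        groups.append([sum(layer[i] for layer in chunk) / len(chunk)
                       for i in range(dim)])
    return groups

def vat_alignment_loss(student_attn, teacher_attn, num_groups, eps=1e-12):
    """KL(teacher group || student group), summed over matched groups."""
    s_groups = group_average(student_attn, num_groups)
    t_groups = group_average(teacher_attn, num_groups)
    return sum(ti * math.log((ti + eps) / (si + eps))
               for s, t in zip(s_groups, t_groups)
               for si, ti in zip(s, t))

# A 2-layer student matched against a 4-layer teacher via 2 groups:
# each student layer is aligned to the average of 2 teacher layers.
student = [[0.4, 0.6], [0.5, 0.5]]
teacher = [[0.3, 0.7], [0.5, 0.5], [0.6, 0.4], [0.4, 0.6]]
loss = vat_alignment_loss(student, teacher, num_groups=2)
```

Grouping lets a shallow student align with a deeper teacher without a one-to-one layer correspondence; in the toy case above the student already matches the teacher's group averages, so the loss is essentially zero.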
Three-stage training strategy for comprehensive visual perception distillation

The authors develop a three-stage training approach consisting of Distilled Pre-Training, Distilled Fine-Tuning with attention alignment, and Supervised Fine-Tuning to effectively transfer both visual recognition and perception abilities from teacher to student models.

10 retrieved papers · Can Refute
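The staged schedule above can be summarized as a loss-toggle sketch. The stage names follow the description; which auxiliary losses are active per stage (a generic distillation loss in the two distilled stages, attention alignment only in distilled fine-tuning) is inferred from the stage names, and `run_schedule` is a hypothetical driver, not the authors' training code.

```python
# Each stage toggles which auxiliary losses are active; the exact
# trainable modules per stage are not specified here, so only the
# loss schedule is shown.
STAGES = [
    ("distilled_pretraining", {"distill": True,  "attn_align": False}),
    ("distilled_finetuning",  {"distill": True,  "attn_align": True}),
    ("supervised_finetuning", {"distill": False, "attn_align": False}),
]

def run_schedule(stages, train_one_stage):
    """Run the stages in order; train_one_stage(name, losses) does the work."""
    log = []
    for name, flags in stages:
        losses = ["task"] + [k for k, on in flags.items() if on]
        train_one_stage(name, losses)
        log.append((name, losses))
    return log

# No-op trainer, just to show the schedule that would be executed.
log = run_schedule(STAGES, lambda name, losses: None)
```

The final supervised stage runs with the task loss alone, consistent with the description that distillation signals are confined to the first two stages.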
