Distilling to Hybrid Attention Models via KL-Guided Layer Selection

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Linear attention, Hybrid architectures, Distillation, Layer selection, Inference efficiency
Abstract:

Distilling pretrained softmax attention Transformers into more efficient hybrid architectures that interleave softmax and linear attention layers is a promising approach for improving the inference efficiency of LLMs without requiring expensive pretraining from scratch. A critical factor in the conversion process is layer selection, i.e., deciding which layers to convert to linear attention variants. This paper describes a simple and efficient recipe for layer selection that uses layer importance scores derived from a small amount of training on generic text data. Once the layers have been selected, we use a recent pipeline for the distillation process itself \citep[RADLADS;][]{goldstein2025radlads}, which consists of attention weight transfer, hidden state alignment, and KL-based distribution matching, followed by a small amount of finetuning. We find that this approach is more effective than existing approaches for layer selection, including heuristics that uniformly interleave linear attention layers at a fixed ratio, as well as more involved approaches that rely on specialized diagnostic datasets.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a KL-divergence-guided layer selection method for distilling pretrained softmax attention Transformers into hybrid architectures that interleave softmax and linear attention layers. It resides in the 'Layer Selection and Replacement Strategies' leaf, which contains only two papers total (including this one). This leaf sits under 'Hybrid Attention Layer Optimization', a branch focused on selectively converting specific layers rather than wholesale architecture replacement. The sparse population of this leaf suggests that principled, metric-driven layer selection for hybrid attention distillation remains an underexplored research direction within the broader field.

The taxonomy reveals neighboring work in adjacent branches: 'Hybrid Transformer-SSM Architectures' explores integrating Mamba or SSM components with attention layers, while 'Transformer-to-Linear-Attention Distillation' pursues full conversion to linear mechanisms without retaining softmax layers. The paper's approach differs by maintaining a mixed architecture and emphasizing data-driven selection criteria. Sibling work in the same leaf (one other paper) likely addresses layer replacement strategies but may use different importance metrics or selection heuristics. The taxonomy's scope_note clarifies that this leaf excludes uniform conversion strategies and sparse attention optimization, positioning the work as targeting strategic, non-uniform layer replacement.

Among the three contributions analyzed, the literature search examined 22 candidates total, with no clearly refutable prior work identified. The KL-guided layer selection method examined 7 candidates with 0 refutations; the greedy addition strategy examined 5 candidates with 0 refutations; and the architecture-dependent transferability demonstration examined 10 candidates with 0 refutations. This limited search scope (top-K semantic matches plus citation expansion) suggests that within the examined candidate pool, no prior work directly overlaps with the proposed importance-score-based selection approach. The absence of refutations across all contributions indicates potential novelty, though the small candidate pool (22 papers) means the analysis does not cover the full literature landscape.

Based on the limited search scope of 22 candidates, the work appears to occupy a relatively sparse research direction with minimal direct prior overlap among examined papers. The taxonomy structure confirms that layer-wise optimization for hybrid attention architectures is less crowded than adjacent areas like vision-domain CNN-Transformer hybrids or full SSM conversion. However, the analysis cannot rule out relevant work outside the top-K semantic matches or in specialized venues not captured by the search strategy.

Taxonomy

Core-task Taxonomy Papers: 36
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: distilling pretrained Transformers into hybrid attention architectures. The field addresses the challenge of compressing large Transformer models by transferring their knowledge into more efficient hybrid designs that combine multiple attention mechanisms or alternative computational primitives. The taxonomy reveals several major branches:

- Transformer-to-Alternative-Architecture Distillation replaces Transformers with state-space models or RNNs (e.g., Transformer-to-Mamba Distillation[17], Transformers to SSMs[21]).
- Hybrid Attention Layer Optimization strategically selects and replaces layers within a model (e.g., KL-Guided Layer Selection[0], RAD[9]).
- Hybrid Softmax-Convolution Distillation merges convolutional and attention-based components (e.g., ViT-CNN Distillation[2], Broadcasting CNN-Transformer[12]).
- Sparse and Mixed Attention Mechanisms reduce computational cost through selective attention patterns (e.g., Sparse Attention Mechanisms[18], Anchor Attention[20]).
- Multi-Teacher and Cross-Model Distillation leverages multiple sources of supervision.
- Efficient Transformer Architectures Without Distillation pursues architectural innovations independently.
- Perception and Non-Language Applications extends these techniques beyond NLP to vision and specialized domains (e.g., Distilled Face Forgery[4], LiDAR Object Detection[35]).

A particularly active line of work centers on layer-wise optimization strategies, where researchers investigate which Transformer layers to retain versus replace with cheaper alternatives. KL-Guided Layer Selection[0] sits within this branch, emphasizing principled, divergence-based criteria for deciding which layers to replace. This contrasts with approaches like RAD[9], which may employ different selection heuristics, and with broader hybrid designs such as Hybrid LSTM-Transformer[1] or Mamba in Llama[13], which integrate entirely different sequence-modeling paradigms.
Meanwhile, works like Adaptive Transformer Distillation[7] and Selective Distillation[16] explore dynamic or task-specific distillation strategies, raising questions about the trade-offs between uniform compression and context-aware adaptation. The central tension across these branches involves balancing computational savings against the risk of losing critical representational capacity, with ongoing exploration of how architectural diversity—whether through sparsity, convolution, or recurrence—can preserve performance while enabling practical deployment.

Claimed Contributions

KL-guided layer selection method for hybrid attention distillation

The authors propose a method that measures each layer's marginal utility by restoring exactly that layer to softmax attention in an otherwise all-linear student model, then briefly distilling and scoring the reduction in teacher-student KL divergence. This approach identifies which layers should remain as softmax attention when converting pretrained Transformers into hybrid architectures.

7 retrieved papers
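The scoring idea described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: the distributions below are toy next-token distributions, and the layer indices and probability values are invented for the example. The score for a layer is the drop in teacher-student KL achieved by restoring that single layer to softmax attention, relative to the all-linear baseline.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over the vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def layer_importance(teacher_probs, all_linear_probs, restored_probs_by_layer):
    """Score each layer by how much restoring it to softmax attention
    reduces teacher-student KL, relative to the all-linear baseline."""
    baseline_kl = kl_divergence(teacher_probs, all_linear_probs)
    return {
        layer: baseline_kl - kl_divergence(teacher_probs, probs)
        for layer, probs in restored_probs_by_layer.items()
    }

# Toy example: 4-token vocabulary, hypothetical distributions after a
# brief distillation with exactly one layer restored to softmax attention.
teacher = [0.5, 0.3, 0.15, 0.05]
all_linear = [0.25, 0.25, 0.25, 0.25]
restored = {
    3: [0.45, 0.3, 0.15, 0.1],  # restoring layer 3 closes most of the gap
    7: [0.3, 0.3, 0.2, 0.2],    # restoring layer 7 helps less
}
scores = layer_importance(teacher, all_linear, restored)
best = max(scores, key=scores.get)
```

In the actual method each `restored` entry would come from briefly distilling a student with that one layer kept as softmax attention; here those distributions are simply assumed.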
Greedy addition strategy using Stage-2 KL-based importance metric

The authors introduce a specific selection strategy that uses KL divergence from the second distillation stage as the importance metric and employs a greedy addition approach, starting from an all-linear baseline and adding the most impactful softmax layers one at a time. This design choice is shown to outperform alternatives like greedy removal or MSE-based metrics.

5 retrieved papers
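The greedy addition loop can be sketched as follows. This is an illustrative sketch under stated assumptions: `eval_kl` stands in for the expensive distill-and-measure step (Stage-2 KL after brief distillation with a given set of softmax layers), and `toy_eval_kl`, its layer weights, and the budget are all hypothetical.

```python
def greedy_softmax_layer_selection(num_layers, budget, eval_kl):
    """Greedy addition: start from an all-linear student and repeatedly
    add the softmax layer whose inclusion yields the lowest KL, until
    the softmax-layer budget is reached."""
    selected = set()
    while len(selected) < budget:
        candidates = [l for l in range(num_layers) if l not in selected]
        # Pick the candidate whose addition yields the lowest Stage-2 KL.
        best = min(candidates, key=lambda l: eval_kl(selected | {l}))
        selected.add(best)
    return sorted(selected)

# Toy stand-in for the distill-and-measure step: layers 2 and 5 are
# (hypothetically) the ones whose softmax form matters most.
def toy_eval_kl(selected):
    important = {2: 0.4, 5: 0.3}
    return 1.0 - sum(important.get(l, 0.01) for l in selected)

chosen = greedy_softmax_layer_selection(num_layers=8, budget=2,
                                        eval_kl=toy_eval_kl)
```

Greedy removal would instead start from an all-softmax student and delete the least harmful layers; the report notes the authors found the addition variant, scored with Stage-2 KL rather than MSE, to work better.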
Demonstration of architecture-dependent layer selection transferability

The authors discover that layer selections derived using one linear attention variant (such as GDN) can transfer effectively to other variants (such as GLA), and that certain architectures serve as better probes for identifying universally important layers in the teacher model. This finding reveals that the method's strength lies not just in specialization but in leveraging different student architectures to find fundamentally important teacher layers.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
