Distilling to Hybrid Attention Models via KL-Guided Layer Selection

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Linear attention, Hybrid architectures, Distillation, Layer selection, Inference efficiency
Abstract:

Distilling pretrained softmax attention Transformers into more efficient hybrid architectures that interleave softmax and linear attention layers is a promising approach for improving the inference efficiency of LLMs without requiring expensive pretraining from scratch. A critical factor in the conversion process is layer selection, i.e., deciding which layers to convert to linear attention variants. This paper describes a simple and efficient recipe for layer selection that uses layer importance scores derived from a small amount of training on generic text data. Once the layers have been selected, we use a recent pipeline for the distillation process itself \citep[RADLADS;][]{goldstein2025radlads}, which consists of attention weight transfer, hidden state alignment, and KL-based distribution matching, followed by a small amount of finetuning. We find that this approach is more effective than existing approaches for layer selection, including heuristics that uniformly interleave linear attention layers at a fixed ratio, as well as more involved approaches that rely on specialized diagnostic datasets.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a KL-divergence-guided layer selection method for distilling pretrained softmax attention Transformers into hybrid architectures that interleave softmax and linear attention layers. It resides in the 'Layer Selection and Replacement Strategies' leaf, which contains only two papers total (including this one). This leaf sits under 'Hybrid Attention Layer Optimization', a branch focused on selectively converting specific layers rather than wholesale architecture replacement. The sparse population of this leaf suggests that principled, metric-driven layer selection for hybrid attention distillation remains an underexplored research direction within the broader field.

The taxonomy reveals neighboring work in adjacent branches: 'Hybrid Transformer-SSM Architectures' explores integrating Mamba or SSM components with attention layers, while 'Transformer-to-Linear-Attention Distillation' pursues full conversion to linear mechanisms without retaining softmax layers. The paper's approach differs by maintaining a mixed architecture and emphasizing data-driven selection criteria. Sibling work in the same leaf (one other paper) likely addresses layer replacement strategies but may use different importance metrics or selection heuristics. The taxonomy's scope_note clarifies that this leaf excludes uniform conversion strategies and sparse attention optimization, positioning the work as targeting strategic, non-uniform layer replacement.

Among the three contributions analyzed, the literature search examined 22 candidates total, with no clearly refutable prior work identified. The KL-guided layer selection method examined 7 candidates with 0 refutations; the greedy addition strategy examined 5 candidates with 0 refutations; and the architecture-dependent transferability demonstration examined 10 candidates with 0 refutations. This limited search scope (top-K semantic matches plus citation expansion) suggests that within the examined candidate pool, no prior work directly overlaps with the proposed importance-score-based selection approach. The absence of refutations across all contributions indicates potential novelty, though the small candidate pool (22 papers) means the analysis does not cover the full literature landscape.

Based on the limited search scope of 22 candidates, the work appears to occupy a relatively sparse research direction with minimal direct prior overlap among examined papers. The taxonomy structure confirms that layer-wise optimization for hybrid attention architectures is less crowded than adjacent areas like vision-domain CNN-Transformer hybrids or full SSM conversion. However, the analysis cannot rule out relevant work outside the top-K semantic matches or in specialized venues not captured by the search strategy.

Taxonomy

Core-task Taxonomy Papers: 36
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: distilling pretrained Transformers into hybrid attention architectures. The field addresses the challenge of compressing large Transformer models by transferring their knowledge into more efficient hybrid designs that combine multiple attention mechanisms or alternative computational primitives. The taxonomy reveals several major branches:

- Transformer-to-Alternative-Architecture Distillation replaces Transformers with state-space models or RNNs (e.g., Transformer-to-Mamba Distillation[17], Transformers to SSMs[21]).
- Hybrid Attention Layer Optimization strategically selects and replaces layers within a model (e.g., KL-Guided Layer Selection[0], RAD[9]).
- Hybrid Softmax-Convolution Distillation merges convolutional and attention-based components (e.g., ViT-CNN Distillation[2], Broadcasting CNN-Transformer[12]).
- Sparse and Mixed Attention Mechanisms reduce computational cost through selective attention patterns (e.g., Sparse Attention Mechanisms[18], Anchor Attention[20]).
- Multi-Teacher and Cross-Model Distillation leverages multiple sources of supervision.
- Efficient Transformer Architectures Without Distillation pursues architectural innovations independently.
- Perception and Non-Language Applications extends these techniques beyond NLP to vision and specialized domains (e.g., Distilled Face Forgery[4], LiDAR Object Detection[35]).

A particularly active line of work centers on layer-wise optimization strategies, where researchers investigate which Transformer layers to retain versus replace with cheaper alternatives. KL-Guided Layer Selection[0] sits within this branch, emphasizing principled, divergence-based criteria for deciding which layers to replace. This contrasts with approaches like RAD[9], which may employ different selection heuristics, and with broader hybrid designs such as Hybrid LSTM-Transformer[1] or Mamba in Llama[13], which integrate entirely different sequence-modeling paradigms.
Meanwhile, works like Adaptive Transformer Distillation[7] and Selective Distillation[16] explore dynamic or task-specific distillation strategies, raising questions about the trade-offs between uniform compression and context-aware adaptation. The central tension across these branches involves balancing computational savings against the risk of losing critical representational capacity, with ongoing exploration of how architectural diversity—whether through sparsity, convolution, or recurrence—can preserve performance while enabling practical deployment.

Claimed Contributions

KL-guided layer selection method for hybrid attention distillation

The authors propose a method that measures each layer's marginal utility by restoring exactly that layer to softmax attention in an otherwise all-linear student model, then briefly distilling and scoring the reduction in teacher-student KL divergence. This approach identifies which layers should remain as softmax attention when converting pretrained Transformers into hybrid architectures.

7 retrieved papers
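The scoring idea described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: the distributions below are toy next-token distributions, and the layer indices and probability values are invented for the example. The score for a layer is the drop in teacher-student KL achieved by restoring that single layer to softmax attention, relative to the all-linear baseline.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over the vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def layer_importance(teacher_probs, all_linear_probs, restored_probs_by_layer):
    """Score each layer by how much restoring it to softmax attention
    reduces teacher-student KL, relative to the all-linear baseline."""
    baseline_kl = kl_divergence(teacher_probs, all_linear_probs)
    return {
        layer: baseline_kl - kl_divergence(teacher_probs, probs)
        for layer, probs in restored_probs_by_layer.items()
    }

# Toy example: 4-token vocabulary, hypothetical distributions after a
# brief distillation with exactly one layer restored to softmax attention.
teacher = [0.5, 0.3, 0.15, 0.05]
all_linear = [0.25, 0.25, 0.25, 0.25]
restored = {
    3: [0.45, 0.3, 0.15, 0.1],  # restoring layer 3 closes most of the gap
    7: [0.3, 0.3, 0.2, 0.2],    # restoring layer 7 helps less
}
scores = layer_importance(teacher, all_linear, restored)
best = max(scores, key=scores.get)
```

In the actual method each `restored` entry would come from briefly distilling a student with that one layer kept as softmax attention; here those distributions are simply assumed.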
Greedy addition strategy using Stage-2 KL-based importance metric

The authors introduce a specific selection strategy that uses KL divergence from the second distillation stage as the importance metric and employs a greedy addition approach, starting from an all-linear baseline and adding the most impactful softmax layers one at a time. This design choice is shown to outperform alternatives like greedy removal or MSE-based metrics.

5 retrieved papers
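The greedy addition loop can be sketched as follows. This is an illustrative sketch under stated assumptions: `eval_kl` stands in for the expensive distill-and-measure step (Stage-2 KL after brief distillation with a given set of softmax layers), and `toy_eval_kl`, its layer weights, and the budget are all hypothetical.

```python
def greedy_softmax_layer_selection(num_layers, budget, eval_kl):
    """Greedy addition: start from an all-linear student and repeatedly
    add the softmax layer whose inclusion yields the lowest KL, until
    the softmax-layer budget is reached."""
    selected = set()
    while len(selected) < budget:
        candidates = [l for l in range(num_layers) if l not in selected]
        # Pick the candidate whose addition yields the lowest Stage-2 KL.
        best = min(candidates, key=lambda l: eval_kl(selected | {l}))
        selected.add(best)
    return sorted(selected)

# Toy stand-in for the distill-and-measure step: layers 2 and 5 are
# (hypothetically) the ones whose softmax form matters most.
def toy_eval_kl(selected):
    important = {2: 0.4, 5: 0.3}
    return 1.0 - sum(important.get(l, 0.01) for l in selected)

chosen = greedy_softmax_layer_selection(num_layers=8, budget=2,
                                        eval_kl=toy_eval_kl)
```

Greedy removal would instead start from an all-softmax student and delete the least harmful layers; the report notes the authors found the addition variant, scored with Stage-2 KL rather than MSE, to work better.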
Demonstration of architecture-dependent layer selection transferability

The authors discover that layer selections derived using one linear attention variant (such as GDN) can transfer effectively to other variants (such as GLA), and that certain architectures serve as better probes for identifying universally important layers in the teacher model. This finding reveals that the method's strength lies not just in specialization but in leveraging different student architectures to find fundamentally important teacher layers.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
