What Layers When: Learning to Skip Compute in LLMs with Residual Gates
Overview
Overall Novelty Assessment
The paper proposes GateSkip, a residual-stream gating mechanism enabling token-wise layer skipping in decoder-only language models. It resides in the 'Learned Gating and Routing Approaches' leaf, which contains three papers total, indicating a moderately populated research direction within the broader taxonomy of fifteen papers. This leaf focuses on trainable gates or routers for dynamic per-token layer execution, distinguishing it from heuristic-based or test-time adaptation methods. The taxonomy structure suggests this is an active but not overcrowded area, with clear boundaries separating learned mechanisms from rule-based alternatives.
The taxonomy reveals several neighboring directions that contextualize GateSkip's positioning. Adjacent leaves include 'Heuristic-Based Selection Strategies' (two papers using predefined rules without learned parameters) and 'Test-Time Architecture Adaptation' (one paper manipulating layer ordering at inference). Sibling branches address 'Integration with Inference Optimization Frameworks' (combining layer skipping with speculative decoding or batching) and 'Architectural and Positional Insights' (analyzing redundancy patterns and depth decay). GateSkip's learned gating approach diverges from heuristic methods by requiring training, yet it shares the broader goal of dynamic layer selection with these neighboring clusters.
Among the twenty-one candidates examined, none clearly refutes any of the three contributions. For the core gating mechanism, seven candidates were examined with zero refutations, suggesting limited direct overlap within the search scope. For the compute-accuracy trade-off claim, ten candidates were reviewed without refutation, indicating either genuine novelty or insufficient coverage of closely related benchmarks. For compatibility with orthogonal techniques, four candidates were examined, also yielding no refutations. These statistics reflect a constrained literature search rather than exhaustive validation; the absence of refutations may stem from the specific semantic search strategy or from the nascent state of this research direction.
Based on the limited search scope of twenty-one top-K semantic matches, the work appears to occupy a distinct position within learned gating approaches. The taxonomy structure and sibling papers suggest the field is exploring multiple gating paradigms concurrently, with GateSkip contributing a residual-stream variant. However, the analysis does not cover all possible prior work in adaptive computation or mixture-of-experts architectures, leaving open the possibility of unexamined overlaps. The contribution-level statistics indicate no immediate refutations within the examined set, but broader claims would require more comprehensive literature coverage.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose a differentiable gating mechanism placed at the output of attention and MLP modules that uses sigmoid-activated linear projections to control information flow into the residual stream. This enables stable fine-tuning on pretrained models and allows token-level layer skipping at inference based on learned importance scores.
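As a rough illustration of the described mechanism, the sketch below implements a sigmoid-activated linear projection that gates a sub-layer's output before it enters the residual stream, with an optional threshold for inference-time skipping. All names (`GatedSublayer`, `gate_w`, `tau`) and the scalar-per-token gate shape are our assumptions for illustration, not the authors' actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GatedSublayer:
    """Hypothetical sketch of a residual-stream gate on one sub-layer
    (attention or MLP) output. Not the paper's actual code."""

    def __init__(self, d_model, rng):
        # Linear projection producing one gate logit per token.
        self.gate_w = rng.normal(scale=0.02, size=(d_model,))
        self.gate_b = 0.0

    def gate(self, h):
        # Sigmoid of a linear projection of the sub-layer output;
        # one scalar importance score per token, in (0, 1).
        return sigmoid(h @ self.gate_w + self.gate_b)  # shape: (seq_len,)

    def forward(self, residual, sublayer_out, tau=None):
        g = self.gate(sublayer_out)
        if tau is not None:
            # Inference-time skipping: tokens whose gate falls below the
            # threshold tau contribute nothing to the residual stream,
            # so the sub-layer computation can be skipped for them.
            g = np.where(g >= tau, g, 0.0)
        return residual + g[:, None] * sublayer_out

rng = np.random.default_rng(0)
layer = GatedSublayer(d_model=8, rng=rng)
residual = rng.normal(size=(4, 8))
sublayer_out = rng.normal(size=(4, 8))
out = layer.forward(residual, sublayer_out)
```

Because the gate is a smooth function of the hidden state, it stays differentiable during fine-tuning; the hard threshold `tau` is applied only at inference.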
The method achieves up to 15% compute savings while retaining over 90% of baseline accuracy on long-form reasoning tasks, and on instruction-tuned models it improves accuracy at full compute while matching baseline quality at roughly 50% savings, outperforming prior adaptive-compute approaches that fail on generative tasks.
The authors demonstrate that their gating mechanism can be combined with existing efficiency methods including 4-bit quantization, structured pruning, and self-speculative decoding, showing that GateSkip operates on a complementary axis of efficiency.
Contribution Analysis
Detailed comparisons for each claimed contribution
GateSkip residual gating mechanism for token-wise layer skipping
The authors propose a differentiable gating mechanism placed at the output of attention and MLP modules that uses sigmoid-activated linear projections to control information flow into the residual stream. This enables stable fine-tuning on pretrained models and allows token-level layer skipping at inference based on learned importance scores.
[26] Resonant pattern shaping through iterative latency induction in contextual token expansion of transformer-based language models
[27] You only cache once: Decoder-decoder architectures for language models
[28] Zarvan: An Efficient Gated Architecture for Sequence Modeling with Linear Complexity
[29] Pilot gaze intent classification using BiLSTM and gated transformer model
[30] DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning
[31] How to capture the important tokens and build sequential encoder for token-level classification models
[32] StagFormer: Time Staggering Decoder only Transformers
State-of-the-art compute-accuracy trade-offs on generative reasoning tasks
The method achieves up to 15% compute savings while retaining over 90% of baseline accuracy on long-form reasoning tasks, and on instruction-tuned models it improves accuracy at full compute while matching baseline quality at roughly 50% savings, outperforming prior adaptive-compute approaches that fail on generative tasks.
[16] Hybridflow: A flexible and efficient rlhf framework
[17] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
[18] Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference
[19] On efficient computation in active inference
[20] Scalable Adaptive Computation for Iterative Generation
[21] Flexgen: High-throughput generative inference of large language models with a single gpu
[22] InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
[23] Generating sequences by learning to self-correct
[24] Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models
[25] Temporal Dynamic Quantization for Diffusion Models
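As a back-of-envelope check on how per-token layer skipping maps to the headline savings figure claimed above, the sketch below computes the fraction of baseline forward FLOPs executed given average per-layer skip rates. The function name and the uniform per-layer-cost assumption are ours, not the paper's.

```python
def compute_fraction(skip_rates):
    """Fraction of baseline forward FLOPs executed, given each layer's
    average per-token skip rate (0.0 = never skipped, 1.0 = always skipped).
    Assumes every decoder layer costs the same number of FLOPs."""
    executed = sum(1.0 - r for r in skip_rates)
    return executed / len(skip_rates)

# Illustrative 32-layer model where later layers are skipped more often;
# the mean skip rate is 0.15, i.e. 15% compute savings.
rates = [0.0] * 8 + [0.1] * 8 + [0.2] * 8 + [0.3] * 8
frac = compute_fraction(rates)  # 0.85
```

Under this uniform-cost assumption, the savings are simply one minus the mean skip rate; real savings would depend on where in the network skips occur and on per-layer cost differences.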
Compatibility with orthogonal efficiency techniques
The authors demonstrate that their gating mechanism can be combined with existing efficiency methods including 4-bit quantization, structured pruning, and self-speculative decoding, showing that GateSkip operates on a complementary axis of efficiency.