What Layers When: Learning to Skip Compute in LLMs with Residual Gates

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: decoder-only language models, large language models, layer skipping, adaptive compute, efficient inference, LLM
Abstract:

We introduce GateSkip, a simple residual-stream gating mechanism that enables token-wise layer skipping in decoder-only LMs. Each Attention/MLP branch is equipped with a sigmoid-linear gate that compresses the branch’s output before it re-enters the residual stream. During inference we rank tokens by the gate and skip low-importance ones using a per-layer budget. While early-exit or router-based Mixture-of-Depths models are known to be unstable and need extensive retraining, our smooth, differentiable gates fine-tune stably on top of pretrained models. On long-form reasoning, we save up to 15% compute while retaining >90% of baseline accuracy. On instruction-tuned models we see accuracy gains at full compute and match baseline quality near 50% savings. The learned gates give insight into transformer information flow (e.g., BOS tokens act as anchors), and the method combines easily with quantization, pruning, and self-speculative decoding.
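The gating mechanism sketched in the abstract (a sigmoid-linear gate that compresses each branch's output before it re-enters the residual stream) can be illustrated in a few lines. The following is a hedged sketch, not the authors' implementation: the class name `GatedBranch`, the choice of the branch output as the gate's input, the scalar-per-token gate form, and the near-open bias initialization are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GatedBranch:
    """Sketch of a residual-stream gate (hypothetical, not the paper's code).

    A per-token scalar gate g = sigmoid(w . branch_out + b) scales the branch
    output before it re-enters the residual stream:
        h_out = h + g * branch(h)
    """
    def __init__(self, d_model, rng):
        # Small random gate projection; shape and init are assumptions.
        self.w = rng.standard_normal(d_model) * 0.01
        # Bias initialized near "open" (g close to 1) so fine-tuning starts
        # close to the unmodified pretrained model -- an assumed choice.
        self.b = 2.0

    def __call__(self, h, branch_out):
        # One gate value per token: (n_tokens, d_model) @ (d_model,) -> (n_tokens,)
        g = sigmoid(branch_out @ self.w + self.b)
        return h + g[:, None] * branch_out, g

# Usage: the gate values g double as per-token importance scores that an
# inference-time policy could rank and threshold.
rng = np.random.default_rng(0)
gate = GatedBranch(4, rng)
h_out, g = gate(np.zeros((3, 4)), np.ones((3, 4)))
```

Because the gate is smooth and differentiable, it can be trained with ordinary backpropagation on top of a pretrained model, which is the stability property the abstract contrasts with discrete routers.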

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes GateSkip, a residual-stream gating mechanism enabling token-wise layer skipping in decoder-only language models. It resides in the 'Learned Gating and Routing Approaches' leaf, which contains three papers total, indicating a moderately populated research direction within the broader taxonomy of fifteen papers. This leaf focuses on trainable gates or routers for dynamic per-token layer execution, distinguishing it from heuristic-based or test-time adaptation methods. The taxonomy structure suggests this is an active but not overcrowded area, with clear boundaries separating learned mechanisms from rule-based alternatives.

The taxonomy reveals several neighboring directions that contextualize GateSkip's positioning. Adjacent leaves include 'Heuristic-Based Selection Strategies' (two papers using predefined rules without learned parameters) and 'Test-Time Architecture Adaptation' (one paper manipulating layer ordering at inference). Sibling branches address 'Integration with Inference Optimization Frameworks' (combining layer skipping with speculative decoding or batching) and 'Architectural and Positional Insights' (analyzing redundancy patterns and depth decay). GateSkip's learned gating approach diverges from heuristic methods by requiring training, yet it shares the broader goal of dynamic layer selection with these neighboring clusters.

Among the twenty-one candidates examined, none clearly refutes any of the three contributions. For the core gating mechanism, seven candidates were examined with zero refutations, suggesting limited direct overlap within the search scope. For the compute-accuracy trade-off claim, ten candidates were reviewed without refutation, indicating either genuine novelty or insufficient coverage of closely related benchmarks. For compatibility with orthogonal techniques, four candidates were examined, again yielding no refutations. These statistics reflect a constrained literature search rather than exhaustive validation; the absence of refutations may stem from the specific semantic search strategy or from the nascent state of this research direction.

Based on the limited search scope of twenty-one top-K semantic matches, the work appears to occupy a distinct position within learned gating approaches. The taxonomy structure and sibling papers suggest the field is exploring multiple gating paradigms concurrently, with GateSkip contributing a residual-stream variant. However, the analysis does not cover all possible prior work in adaptive computation or mixture-of-experts architectures, leaving open the possibility of unexamined overlaps. The contribution-level statistics indicate no immediate refutations within the examined set, but broader claims would require more comprehensive literature coverage.

Taxonomy

Core-task taxonomy papers: 15
Claimed contributions: 3
Contribution candidate papers compared: 21
Refuting papers: 0

Research Landscape Overview

Core task: token-wise adaptive layer skipping in decoder-only language models.

The field has organized itself around several complementary directions. Dynamic Layer Selection Mechanisms explore how to decide which layers to execute for each token, ranging from learned gating and routing approaches to confidence-based or heuristic strategies. Integration with Inference Optimization Frameworks examines how layer skipping interacts with speculative decoding, early exiting, and other runtime techniques. Architectural and Positional Insights investigate structural properties such as depth decay patterns and positional encoding effects, while Cross-Domain and Multilingual Adaptations extend these ideas beyond English text. Complementary Efficiency Techniques address token pruning, quantization, and other orthogonal optimizations, and Theoretical and Algorithmic Foundations provide formal analysis of convergence, approximation guarantees, and training dynamics. Together, these branches reflect a maturing ecosystem in which adaptive computation is pursued from multiple angles: some emphasizing trainable routing, others leveraging static or heuristic rules, and still others focusing on hybrid or ensemble strategies.

Within the learned gating and routing cluster, a particularly active line of work centers on trainable mechanisms that predict layer necessity on a per-token basis. Residual Gates[0] introduces lightweight gating modules that modulate residual connections, allowing the model to learn which layers contribute most to each token's representation. This approach contrasts with methods like Adaptive Layer Skipping[3], which relies on confidence thresholds derived from intermediate outputs, and SkipGPT[14], which employs a separate routing network to guide skip decisions. Compared to heuristic strategies such as Depth Decay Decoding[5] or static middle-layer skipping, learned gating offers greater flexibility but requires careful training to avoid instability or overfitting. Residual Gates[0] sits squarely in this learned-routing space, emphasizing end-to-end differentiability and fine-grained control, while neighboring works explore alternative trade-offs between training overhead, inference speedup, and task-specific adaptability.

Claimed Contributions

GateSkip residual gating mechanism for token-wise layer skipping

The authors propose a differentiable gating mechanism placed at the output of attention and MLP modules that uses sigmoid-activated linear projections to control information flow into the residual stream. This enables stable fine-tuning on pretrained models and allows token-level layer skipping at inference based on learned importance scores.

7 retrieved papers
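The inference-time side of this contribution (rank tokens by their learned gate values, then skip low-importance tokens under a per-layer budget) can be sketched compactly. This is a hedged illustration under stated assumptions: the function name `skip_by_budget` and the top-k selection rule are guesses at how the budget is applied, not the paper's algorithm.

```python
def skip_by_budget(gate_scores, keep_fraction):
    """Keep the highest-gate tokens up to a per-layer budget; skip the rest.

    gate_scores: one learned gate value per token (higher = more important).
    keep_fraction: fraction of tokens this layer is budgeted to process.
    Returns a boolean keep-mask over token positions.
    """
    n = len(gate_scores)
    k = max(1, round(keep_fraction * n))  # always process at least one token
    # Rank token positions by gate value, descending, and keep the top k.
    ranked = sorted(range(n), key=lambda i: gate_scores[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(n)]

# Example: with a 50% budget, the two highest-gate tokens pass through the
# layer; the others reuse their residual-stream state unchanged.
mask = skip_by_budget([0.9, 0.1, 0.5, 0.8], keep_fraction=0.5)
```

Tokens whose mask entry is False would bypass the layer's attention/MLP branch entirely, with their residual-stream representation carried forward unchanged.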
State-of-the-art compute-accuracy trade-offs on generative reasoning tasks

The method achieves up to 15% compute savings while retaining over 90% baseline accuracy on long-form reasoning tasks, and on instruction-tuned models it improves accuracy at full compute while matching baseline quality near 50% savings, outperforming prior adaptive compute approaches that fail on generative tasks.

10 retrieved papers
Compatibility with orthogonal efficiency techniques

The authors demonstrate that their gating mechanism can be combined with existing efficiency methods including 4-bit quantization, structured pruning, and self-speculative decoding, showing that GateSkip operates on a complementary axis of efficiency.

4 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

GateSkip residual gating mechanism for token-wise layer skipping


Contribution

State-of-the-art compute-accuracy trade-offs on generative reasoning tasks


Contribution

Compatibility with orthogonal efficiency techniques
