Draft-based Approximate Inference for LLMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: long-context, sparse attention, KV cache eviction, prompt compression
Abstract:

Optimizing inference for long-context large language models (LLMs) is increasingly important due to the quadratic compute and linear memory costs of Transformers. Existing approximate inference methods, including key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on coarse predictions of token or KV pair importance. We unify and extend recent work by introducing a framework for approximate LLM inference that leverages small draft models to more accurately predict token and KV pair importance. We provide novel theoretical and empirical analyses justifying lookahead-based importance estimation techniques. Within this framework, we present: (i) SpecKV, the first method to use lookahead with a small draft model to enable precise KV cache dropping; (ii) SpecPC, which leverages draft model attention activations to identify and discard less important prompt tokens; and (iii) SpecKV-PC, a cascaded compression strategy combining both techniques. Extensive experiments on long-context benchmarks demonstrate that our methods consistently achieve higher accuracy than existing baselines while retaining the same improvements in memory usage, latency, and throughput.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a framework for approximate LLM inference that uses small draft models to predict token and KV pair importance, yielding three concrete methods: SpecKV for KV cache dropping, SpecPC for prompt compression, and SpecKV-PC as a cascaded strategy. It resides in the Importance-Based KV Cache Eviction leaf, which contains seven papers including the original submission. This leaf sits within the broader KV Cache Compression and Management branch, indicating a moderately crowded research direction focused on memory-efficient inference through selective retention of key-value pairs.

The taxonomy reveals that KV cache management is one of ten major branches addressing long-context inference challenges. Adjacent leaves include KV Cache Quantization (three papers using mixed-precision techniques) and Structured KV Cache Organization (three papers employing clustering or hierarchical schemes). Neighboring branches such as Attention Mechanism Optimization (sparse patterns, linear attention) and Prompt Compression (four papers on input-level reduction) tackle related but distinct aspects of the quadratic complexity problem. The scope note for this leaf explicitly excludes quantization and structured methods, clarifying that the focus is on importance-driven eviction policies rather than encoding or organizational strategies.

Among the three contributions analyzed, the first two, the draft-based framework and the theoretical justification for lookahead estimation, show no clear refutation across their ten and four retrieved candidates, respectively. For the third contribution, comprising the SpecKV, SpecPC, and SpecKV-PC algorithms, the analysis examined ten candidates and found one potentially overlapping prior work. Given the limited search scope of twenty-four total candidates, these statistics suggest that the core framework and theoretical analysis appear relatively novel within the examined literature, while the specific algorithmic instantiations may have closer precedents in the field.

Overall, the paper occupies a well-populated research area with multiple sibling methods addressing importance-based KV eviction. The analysis is constrained by the top-K semantic search strategy and does not constitute an exhaustive survey of related work. The finding that one of ten candidates may refute the algorithmic contribution indicates some prior overlap, though its extent and nature would require deeper inspection of the identified paper. The framework's emphasis on draft-model-driven prediction distinguishes it from the quantization- and retrieval-based approaches in neighboring taxonomy leaves.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 1

Research Landscape Overview

Core task: approximate inference for long-context large language models. The field addresses the computational and memory challenges that arise when deploying LLMs on sequences far exceeding their original training lengths.

The taxonomy reveals a diverse landscape organized around ten major branches. KV Cache Compression and Management focuses on reducing memory overhead by selectively retaining or evicting key-value pairs, with importance-based eviction strategies forming a dense cluster. Attention Mechanism Optimization explores sparse or approximate attention patterns to avoid quadratic complexity, while Prompt Compression aims to distill input contexts into shorter representations. Alternative Architectures such as state-space models (Mamba Study[4], Retentive Network[8]) and hybrid designs (Jamba[12]) offer fundamentally different ways to handle long sequences. Context Extension Techniques like positional encoding modifications (Yarn[1], LongLoRA[13]) enable models to generalize beyond their pretrained context windows. Speculative Decoding, System-Level Optimization, Diffusion Language Models, Evaluation and Benchmarking, and Domain-Specific Applications round out the taxonomy, reflecting both algorithmic innovation and practical deployment concerns.

Within KV cache management, a particularly active line of work centers on importance-based eviction policies that dynamically decide which tokens to retain. Draft Inference[0] sits squarely in this cluster, proposing a method to predict and prune less critical cache entries during generation. Nearby approaches such as Recycled Attention[5] and InfiniGen[16] similarly exploit token-level importance scores, though they differ in how they estimate relevance and handle eviction granularity. Another contrasting direction involves quantization-based compression (PQCache[22], MoSKA[38]) or retrieval-augmented schemes (RetrievalAttention[34]), which trade off precision or architectural complexity for memory savings.
The main open questions revolve around balancing accuracy, latency, and memory footprint: aggressive eviction can degrade quality on tasks requiring fine-grained context (Context Utilization[6]), while conservative policies may not sufficiently alleviate memory pressure. Draft Inference[0] emphasizes lightweight prediction overhead compared to heavier retrieval or quantization methods, positioning itself as a practical middle ground for real-time serving scenarios.

Claimed Contributions

Draft-based Approximate Inference framework

The authors introduce a unified framework that leverages small draft models to predict token and KV pair importance more accurately than existing methods, which rely only on input tokens. This framework extends recent work by using draft model lookahead to approximate future outputs for better importance estimation.
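As described, the framework scores prompt tokens (or their KV pairs) by how much attention the draft model's lookahead positions, its cheap approximation of the target model's future outputs, pay to them. The following is a minimal sketch of that scoring step under stated assumptions: the function names, the softmax normalization, and the max/mean aggregation choice are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lookahead_importance(draft_attn_logits, agg="max"):
    """Score each prompt token by the attention that draft-model lookahead
    positions (approximate future outputs) pay to it.

    draft_attn_logits: (num_lookahead, prompt_len) raw attention scores from
    the draft model's lookahead queries over the prompt keys.
    Returns one importance score per prompt token.
    """
    attn = softmax(draft_attn_logits, axis=-1)  # each lookahead row sums to 1
    if agg == "max":
        # A token is important if ANY predicted future step attends to it.
        return attn.max(axis=0)
    return attn.mean(axis=0)

# Toy example: 3 lookahead steps over a 6-token prompt.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 6))
scores = lookahead_importance(logits)
keep = np.argsort(scores)[-4:]  # retain the top-4 tokens / KV pairs
```

In an actual system the attention scores would come from the draft model's forward pass; the aggregation across lookahead steps and heads is a design choice the paper would specify.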

10 retrieved papers
Theoretical and empirical analyses justifying lookahead-based importance estimation

The authors provide novel theoretical bounds (Theorems 1 and 2) showing that error in approximate input embeddings or outputs translates proportionally to error in importance scores. They also present empirical evidence demonstrating strong correlation between draft and target model importance scores.
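The report does not reproduce the theorem statements. A bound of the claimed shape, where error in the approximate (draft) inputs translates proportionally into error in the importance scores, would take the following Lipschitz-style form; all symbols here are illustrative placeholders, not the paper's actual notation:

```latex
% Illustrative shape only: I is an importance-scoring function,
% x the target model's inputs/outputs, \hat{x} the draft approximation,
% and C a problem-dependent constant.
\left| I(\hat{x}) - I(x) \right| \;\le\; C \, \lVert \hat{x} - x \rVert
```

Under such a bound, a draft model whose outputs stay close to the target's yields importance scores that are provably close as well, which is the justification the contribution claims.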

4 retrieved papers
SpecKV, SpecPC, and SpecKV-PC algorithms

The authors present SpecKV as the first method to use draft model lookahead for KV cache optimization, SpecPC for prompt compression using draft attention activations, and SpecKV-PC as a cascaded compression strategy combining both techniques to achieve superior accuracy, latency, and memory efficiency.
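The cascade described for SpecKV-PC can be sketched as two top-k selection stages: an input-level pruning pass (the SpecPC step) followed by cache-level eviction among the survivors (the SpecKV step). Everything below is a hypothetical illustration; the function names, budgets, and the assumption that both stages consume precomputed draft-attention scores are mine, not the paper's.

```python
import numpy as np

def top_k_mask(scores, k):
    """Boolean mask keeping the k highest-scoring positions."""
    mask = np.zeros(scores.shape, dtype=bool)
    mask[np.argsort(scores)[-k:]] = True
    return mask

def cascaded_compression(prompt_scores, kv_scores, prompt_budget, kv_budget):
    """Hypothetical SpecKV-PC-style cascade.

    Stage 1 (SpecPC-like): drop low-importance prompt tokens entirely.
    Stage 2 (SpecKV-like): evict KV pairs among the surviving tokens.
    Both score vectors would come from draft-model attention in practice.
    """
    prompt_mask = top_k_mask(prompt_scores, prompt_budget)  # input-level pruning
    surviving = np.where(prompt_mask)[0]
    local_mask = top_k_mask(kv_scores[surviving], kv_budget)  # cache-level eviction
    kept_kv = surviving[local_mask]
    return surviving, kept_kv

scores_p = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7])
scores_kv = np.array([0.5, 0.4, 0.6, 0.2, 0.9, 0.1])
tokens, kv = cascaded_compression(scores_p, scores_kv, prompt_budget=4, kv_budget=2)
```

The design point the cascade illustrates: the second stage only ever sees tokens the first stage kept, so its retained KV set is always a subset of the compressed prompt.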

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Draft-based Approximate Inference framework

Contribution: Theoretical and empirical analyses justifying lookahead-based importance estimation

Contribution: SpecKV, SpecPC, and SpecKV-PC algorithms