Draft-based Approximate Inference for LLMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: long-context, sparse attention, KV cache eviction, prompt compression
Abstract:

Optimizing inference for long-context large language models (LLMs) is increasingly important due to the quadratic compute and linear memory costs of Transformers. Existing approximate inference methods, including key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on coarse predictions of token or KV pair importance. We unify and extend recent work by introducing a framework for approximate LLM inference that leverages small draft models to more accurately predict token and KV pair importance. We provide novel theoretical and empirical analyses justifying lookahead-based importance estimation techniques. Within this framework, we present: (i) SpecKV, the first method to use lookahead with a small draft model to enable precise KV cache dropping; (ii) SpecPC, which leverages draft model attention activations to identify and discard less important prompt tokens; and (iii) SpecKV-PC, a cascaded compression strategy combining both techniques. Extensive experiments on long-context benchmarks demonstrate that our methods consistently achieve higher accuracy than existing baselines while retaining the same improvements in memory usage, latency, and throughput.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a framework for approximate LLM inference that uses small draft models to predict token and KV pair importance, yielding three concrete methods: SpecKV for KV cache dropping, SpecPC for prompt compression, and SpecKV-PC as a cascaded strategy. It resides in the Importance-Based KV Cache Eviction leaf, which contains seven papers including the original submission. This leaf sits within the broader KV Cache Compression and Management branch, indicating a moderately crowded research direction focused on memory-efficient inference through selective retention of key-value pairs.

The taxonomy reveals that KV cache management is one of ten major branches addressing long-context inference challenges. Adjacent leaves include KV Cache Quantization (three papers using mixed-precision techniques) and Structured KV Cache Organization (three papers employing clustering or hierarchical schemes). Neighboring branches such as Attention Mechanism Optimization (sparse patterns, linear attention) and Prompt Compression (four papers on input-level reduction) tackle related but distinct aspects of the quadratic complexity problem. The scope note for this leaf explicitly excludes quantization and structured methods, clarifying that the focus is on importance-driven eviction policies rather than encoding or organizational strategies.

Among the three contributions analyzed, the first two, the draft-based framework and the theoretical justification for lookahead estimation, show no clear refutation across their ten and four retrieved candidates, respectively. For the third contribution, comprising the SpecKV, SpecPC, and SpecKV-PC algorithms, the analysis examined ten candidates and found one potentially overlapping prior work. Given the limited search scope of twenty-four total candidates, these statistics suggest that the core framework and theoretical analysis appear relatively novel within the examined literature, while the specific algorithmic instantiations may have closer precedents in the field.

Overall, the paper occupies a well-populated research area with multiple sibling methods addressing importance-based KV eviction. The analysis is constrained by the top-K semantic search strategy and does not constitute an exhaustive survey of related work. The finding that one of ten candidates may refute the algorithmic contribution indicates some prior overlap, though its extent and nature would require deeper inspection of the identified paper. The framework's emphasis on draft-model-driven prediction distinguishes it from the quantization- and retrieval-based approaches in neighboring taxonomy leaves.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 1

Research Landscape Overview

Core task: approximate inference for long-context large language models. The field addresses the computational and memory challenges that arise when deploying LLMs on sequences far exceeding their original training lengths.

The taxonomy reveals a diverse landscape organized around ten major branches. KV Cache Compression and Management focuses on reducing memory overhead by selectively retaining or evicting key-value pairs, with importance-based eviction strategies forming a dense cluster. Attention Mechanism Optimization explores sparse or approximate attention patterns to avoid quadratic complexity, while Prompt Compression aims to distill input contexts into shorter representations. Alternative Architectures such as state-space models (Mamba Study[4], Retentive Network[8]) and hybrid designs (Jamba[12]) offer fundamentally different ways to handle long sequences. Context Extension Techniques like positional encoding modifications (Yarn[1], LongLoRA[13]) enable models to generalize beyond their pretrained context windows. Speculative Decoding, System-Level Optimization, Diffusion Language Models, Evaluation and Benchmarking, and Domain-Specific Applications round out the taxonomy, reflecting both algorithmic innovation and practical deployment concerns.

Within KV cache management, a particularly active line of work centers on importance-based eviction policies that dynamically decide which tokens to retain. Draft Inference[0] sits squarely in this cluster, proposing a method to predict and prune less critical cache entries during generation. Nearby approaches such as Recycled Attention[5] and InfiniGen[16] similarly exploit token-level importance scores, though they differ in how they estimate relevance and handle eviction granularity. Another contrasting direction involves quantization-based compression (PQCache[22], MoSKA[38]) or retrieval-augmented schemes (RetrievalAttention[34]), which trade off precision or architectural complexity for memory savings.
The main open questions revolve around balancing accuracy, latency, and memory footprint: aggressive eviction can degrade quality on tasks requiring fine-grained context (Context Utilization[6]), while conservative policies may not sufficiently alleviate memory pressure. Draft Inference[0] emphasizes lightweight prediction overhead compared to heavier retrieval or quantization methods, positioning itself as a practical middle ground for real-time serving scenarios.

Claimed Contributions

Draft-based Approximate Inference framework

The authors introduce a unified framework that leverages small draft models to predict token and KV pair importance more accurately than existing methods, which rely only on input tokens. This framework extends recent work by using draft model lookahead to approximate future outputs for better importance estimation.
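As described, the framework scores prompt tokens (or their KV pairs) by how much attention the draft model's lookahead positions, its cheap approximation of the target model's future outputs, pay to them. The following is a minimal sketch of that scoring step under stated assumptions: the function names, the softmax normalization, and the max/mean aggregation choice are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lookahead_importance(draft_attn_logits, agg="max"):
    """Score each prompt token by the attention that draft-model lookahead
    positions (approximate future outputs) pay to it.

    draft_attn_logits: (num_lookahead, prompt_len) raw attention scores from
    the draft model's lookahead queries over the prompt keys.
    Returns one importance score per prompt token.
    """
    attn = softmax(draft_attn_logits, axis=-1)  # each lookahead row sums to 1
    if agg == "max":
        # A token is important if ANY predicted future step attends to it.
        return attn.max(axis=0)
    return attn.mean(axis=0)

# Toy example: 3 lookahead steps over a 6-token prompt.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 6))
scores = lookahead_importance(logits)
keep = np.argsort(scores)[-4:]  # retain the top-4 tokens / KV pairs
```

In an actual system the attention scores would come from the draft model's forward pass; the aggregation across lookahead steps and heads is a design choice the paper would specify.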

10 retrieved papers
Theoretical and empirical analyses justifying lookahead-based importance estimation

The authors provide novel theoretical bounds (Theorems 1 and 2) showing that error in approximate input embeddings or outputs translates proportionally to error in importance scores. They also present empirical evidence demonstrating strong correlation between draft and target model importance scores.
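The report does not reproduce the theorem statements. A bound of the claimed shape, where error in the approximate (draft) inputs translates proportionally into error in the importance scores, would take the following Lipschitz-style form; all symbols here are illustrative placeholders, not the paper's actual notation:

```latex
% Illustrative shape only: I is an importance-scoring function,
% x the target model's inputs/outputs, \hat{x} the draft approximation,
% and C a problem-dependent constant.
\left| I(\hat{x}) - I(x) \right| \;\le\; C \, \lVert \hat{x} - x \rVert
```

Under such a bound, a draft model whose outputs stay close to the target's yields importance scores that are provably close as well, which is the justification the contribution claims.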

4 retrieved papers
SpecKV, SpecPC, and SpecKV-PC algorithms

The authors present SpecKV as the first method to use draft model lookahead for KV cache optimization, SpecPC for prompt compression using draft attention activations, and SpecKV-PC as a cascaded compression strategy combining both techniques to achieve superior accuracy, latency, and memory efficiency.
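The cascade described for SpecKV-PC can be sketched as two top-k selection stages: an input-level pruning pass (the SpecPC step) followed by cache-level eviction among the survivors (the SpecKV step). Everything below is a hypothetical illustration; the function names, budgets, and the assumption that both stages consume precomputed draft-attention scores are mine, not the paper's.

```python
import numpy as np

def top_k_mask(scores, k):
    """Boolean mask keeping the k highest-scoring positions."""
    mask = np.zeros(scores.shape, dtype=bool)
    mask[np.argsort(scores)[-k:]] = True
    return mask

def cascaded_compression(prompt_scores, kv_scores, prompt_budget, kv_budget):
    """Hypothetical SpecKV-PC-style cascade.

    Stage 1 (SpecPC-like): drop low-importance prompt tokens entirely.
    Stage 2 (SpecKV-like): evict KV pairs among the surviving tokens.
    Both score vectors would come from draft-model attention in practice.
    """
    prompt_mask = top_k_mask(prompt_scores, prompt_budget)  # input-level pruning
    surviving = np.where(prompt_mask)[0]
    local_mask = top_k_mask(kv_scores[surviving], kv_budget)  # cache-level eviction
    kept_kv = surviving[local_mask]
    return surviving, kept_kv

scores_p = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7])
scores_kv = np.array([0.5, 0.4, 0.6, 0.2, 0.9, 0.1])
tokens, kv = cascaded_compression(scores_p, scores_kv, prompt_budget=4, kv_budget=2)
```

The design point the cascade illustrates: the second stage only ever sees tokens the first stage kept, so its retained KV set is always a subset of the compressed prompt.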

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Draft-based Approximate Inference framework

Contribution: Theoretical and empirical analyses justifying lookahead-based importance estimation

Contribution: SpecKV, SpecPC, and SpecKV-PC algorithms