ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Large Reasoning Models, KV Cache Compression, Quantization, Eviction, Sparsity, Thought-Aware Compression
Abstract:

Long-output generation in large reasoning models enables extended chain-of-thought (CoT) reasoning but also drives rapid growth of the key–value (KV) cache, quickly overwhelming GPU memory. To address this challenge, we propose ThinKV, a thought-adaptive KV cache compression framework. ThinKV is based on the observation that attention sparsity reveals distinct thought types of varying importance within the CoT. It applies a hybrid quantization–eviction strategy, assigning token precision by thought importance and progressively evicting tokens from less critical thoughts as reasoning trajectories evolve. To implement ThinKV efficiently, we design a kernel that extends PagedAttention to reuse evicted tokens' memory slots, eliminating compaction overheads. Extensive experiments on DeepSeek-R1-Distill, GPT-OSS, and NVIDIA AceReason across mathematics and coding benchmarks show that ThinKV achieves near-lossless accuracy with less than 5% of the original KV cache, while delivering up to 5.8x higher inference throughput than state-of-the-art baselines.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's claimed tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

ThinKV proposes a thought-adaptive KV cache compression framework combining hybrid quantization and eviction strategies tailored to reasoning models. The paper resides in the 'Reasoning and Long-Output Generation' leaf, which contains five papers total, including R-KV, R-KV Redundancy, Reasoning Path Compression, and Lethe. This leaf represents a moderately active research direction within the broader taxonomy of fifty papers, focusing specifically on compression challenges arising from extended chain-of-thought outputs rather than general long-context scenarios.

The taxonomy tree positions this work within 'Task and Context Adaptations,' distinct from general-purpose compression mechanisms and system-level optimizations. Neighboring leaves include 'Long-Context and Prefix Management' (four papers addressing shared prefixes and input length) and 'Task-Aware and Query-Adaptive Compression' (two papers on query-specific strategies). The sibling papers in the same leaf emphasize redundancy detection in reasoning traces and task-aware eviction, suggesting ThinKV's thought-type classification aligns with emerging interest in semantic structure within reasoning outputs.

Across the thirty candidates examined (ten per contribution), the ThinKV framework contribution has one refutable candidate, while the TBQ/TBE algorithms and the Continuous Thinking kernel each have zero refutations. The framework-level overlap suggests prior work touches on thought-adaptive compression concepts, though the specific algorithmic contributions (TBQ/TBE) and the kernel design appear less directly anticipated. Because the search is limited, these statistics reflect top-K semantic matches plus citation expansion, not exhaustive coverage of the reasoning-model compression literature.

Based on the examined candidates, ThinKV's algorithmic and kernel contributions appear relatively novel within the constrained search, while the broader framework concept encounters some prior overlap. The taxonomy context reveals a moderately populated research direction where thought-aware strategies are gaining traction, positioning this work as an incremental advance in a growing subfield rather than a foundational departure.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: KV cache compression for large reasoning models. The field addresses memory bottlenecks in transformer-based language models by reducing the storage footprint of key-value caches during inference. The taxonomy organizes work into five main branches: Compression Mechanisms and Techniques explores fundamental methods such as eviction policies, quantization (e.g., SKVQ[18], WKVQuant[34]), and merging strategies; Task and Context Adaptations examines how compression strategies vary with workload characteristics, including reasoning-heavy scenarios and long-output generation; Domain-Specific Adaptations targets specialized settings like retrieval-augmented generation (Beyond RAG[12]) and vision-language models (VL-cache[13]); System Architecture and Deployment focuses on practical integration, batching, and hardware-aware optimizations; and Evaluation and Benchmarking provides frameworks (Comprehensive Benchmark[40]) to assess trade-offs between compression ratio, latency, and task accuracy.

Representative works like MiniCache[3] and PyramidKV[7] illustrate how eviction and layer-wise strategies can be combined, while recent surveys (KV Cache Review[1], Acceleration Survey[25]) synthesize emerging trends across these branches.

A particularly active line of work centers on reasoning and long-output generation, where models must maintain extended context while producing multi-step solutions. Methods such as R-KV[4] and R-KV Redundancy[37] identify and exploit redundancy patterns specific to reasoning traces, while Reasoning Path Compression[38] and Lethe[44] propose task-aware eviction that preserves critical intermediate steps. ThinKV[0] sits within this cluster, emphasizing efficient cache management tailored to the unique demands of large reasoning models.
Compared to more general eviction schemes like LeanKV[5] or quantization-focused approaches (Residual Vector Quantization[9]), ThinKV[0] and its neighbors prioritize preserving logical dependencies and multi-turn coherence over uniform compression. This contrast highlights an open question: whether reasoning workloads benefit more from semantic-aware selection or from aggressive uniform compression paired with robust error correction.

Claimed Contributions

ThinKV thought-adaptive KV cache compression framework

The authors introduce ThinKV, a framework that decomposes chain-of-thought reasoning into distinct thought types (reasoning, execution, transition) based on attention sparsity patterns. It applies hybrid quantization and eviction strategies that adapt to the importance of different thought types, achieving near-lossless accuracy with less than 5% of the original KV cache while improving inference throughput.

10 retrieved papers; 1 can refute
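To make the framework's premise concrete, the following is a minimal, hypothetical sketch of classifying a thought segment by attention sparsity. The sparsity metric, the `eps`, `lo`, and `hi` thresholds, and the mapping to the three labels are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Hypothetical sketch (not the paper's code): bucket CoT segments into
# thought types by how sparse their tokens' attention rows are.

def attention_sparsity(attn_row, eps=0.01):
    """Fraction of attention weights below eps (higher = sparser row)."""
    return float(np.mean(np.asarray(attn_row) < eps))

def classify_segment(attn_rows, lo=0.5, hi=0.9):
    """Label a segment by the mean sparsity of its tokens' attention rows."""
    s = float(np.mean([attention_sparsity(r) for r in attn_rows]))
    if s < lo:
        return "reasoning"    # dense attention: integrates broad context
    if s < hi:
        return "execution"
    return "transition"       # very sparse: mostly local bookkeeping

uniform = [np.full(64, 1.0 / 64)] * 3  # no weight falls below eps
one_hot = [np.eye(64)[0]] * 3          # 63/64 weights are exactly zero
```

A uniform attention row scores sparsity 0 and lands in the densest bucket, while a one-hot row scores near 1 and lands in the sparsest; real segment boundaries and thresholds would have to come from the paper.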
Think Before you Quantize (TBQ) and Think Before You Evict (TBE) algorithms

The authors develop TBQ, a quantization method that allocates bit-precision according to thought-type importance, and TBE, an eviction policy that progressively removes tokens from less critical thoughts as reasoning trajectories evolve. These algorithms operate at the thought-segment level rather than individual tokens, preserving reasoning-critical information under high compression.

10 retrieved papers
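The segment-level decisions described above can be sketched as follows. This is a hypothetical illustration of TBQ/TBE-style logic: the bit-widths, the importance ordering, and the greedy keep-by-importance policy are assumptions for exposition, not the authors' algorithms.

```python
# Hypothetical sketch of thought-segment-level quantization and eviction
# decisions; all constants and policies below are illustrative.

BITS_BY_TYPE = {"reasoning": 8, "execution": 4, "transition": 2}
IMPORTANCE = {"reasoning": 2, "execution": 1, "transition": 0}

def tbq_bits(thought_type):
    """TBQ-style: choose a KV bit-width from thought-type importance."""
    return BITS_BY_TYPE[thought_type]

def tbe_evict(segments, budget_tokens):
    """TBE-style: keep whole segments greedily by importance, implicitly
    evicting tokens from the least critical thoughts first."""
    ranked = sorted(segments, key=lambda s: IMPORTANCE[s["type"]],
                    reverse=True)
    kept, total = [], 0
    for seg in ranked:
        if total + seg["tokens"] <= budget_tokens:
            kept.append(seg)
            total += seg["tokens"]
    return kept

segs = [{"type": "reasoning", "tokens": 4},
        {"type": "transition", "tokens": 3},
        {"type": "execution", "tokens": 2}]
survivors = tbe_evict(segs, budget_tokens=6)  # transition segment evicted
```

Operating on whole segments rather than individual tokens is what lets a policy like this preserve reasoning-critical spans intact under a tight budget.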
Continuous Thinking kernel extending PagedAttention

The authors design Continuous Thinking, a system-level kernel that extends PagedAttention to enable in-place memory reuse of evicted KV token slots. This eliminates the need for gather-based compaction operations that cause memory bandwidth contention, thereby reducing inference overhead and enabling efficient dynamic eviction during decoding.

10 retrieved papers
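The bookkeeping this kernel is said to enable can be modeled in a few lines. The sketch below is a hypothetical Python model of in-place slot reuse in a paged KV cache, not a GPU kernel and not the authors' code: evicted tokens' physical slots go on a free list and are handed to new tokens, so no gather-based compaction of live entries is needed.

```python
# Hypothetical model of free-list slot reuse in a paged KV cache.
# The class name and interface are illustrative assumptions.

class PagedSlotAllocator:
    def __init__(self, num_slots):
        self.free = list(range(num_slots))  # all physical slots start free
        self.slot_of = {}                   # token id -> physical slot

    def allocate(self, token_id):
        """Place a new token's KV entry in any free physical slot."""
        slot = self.free.pop()
        self.slot_of[token_id] = slot
        return slot

    def evict(self, token_id):
        """Return an evicted token's slot to the free list, in place,
        without moving any other token's entry."""
        self.free.append(self.slot_of.pop(token_id))

alloc = PagedSlotAllocator(num_slots=4)
for t in range(4):          # fill the cache with tokens 0..3
    alloc.allocate(t)
freed = alloc.slot_of[1]
alloc.evict(1)              # token 1's slot becomes reusable
reused = alloc.allocate(4)  # the new token lands in the freed slot
```

Because surviving entries never move, the indirection table (as in PagedAttention's block tables) stays valid and no memory-bandwidth-heavy gather is issued on eviction.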

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
