ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Large Reasoning Models, KV Cache Compression, Quantization, Eviction, Sparsity, Thought-Aware Compression
Abstract:

Long-output generation in large reasoning models enables extended chain-of-thought (CoT) reasoning but also drives rapid growth of the key–value (KV) cache, quickly overwhelming GPU memory. To address this challenge, we propose ThinKV, a thought-adaptive KV cache compression framework. ThinKV is based on the observation that attention sparsity reveals distinct thought types of varying importance within the CoT. It applies a hybrid quantization–eviction strategy, assigning token precision by thought importance and progressively evicting tokens from less critical thoughts as reasoning trajectories evolve. To implement ThinKV efficiently, we design a kernel that extends PagedAttention to reuse evicted tokens' memory slots, eliminating compaction overheads. Extensive experiments on DeepSeek-R1-Distill, GPT-OSS, and NVIDIA AceReason across mathematics and coding benchmarks show that ThinKV achieves near-lossless accuracy with less than 5% of the original KV cache, while delivering up to 5.8x higher inference throughput than state-of-the-art baselines.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's claimed tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

ThinKV proposes a thought-adaptive KV cache compression framework combining hybrid quantization and eviction strategies tailored to reasoning models. The paper resides in the 'Reasoning and Long-Output Generation' leaf, which contains five papers total, including R-KV, R-KV Redundancy, Reasoning Path Compression, and Lethe. This leaf represents a moderately active research direction within the broader taxonomy of fifty papers, focusing specifically on compression challenges arising from extended chain-of-thought outputs rather than general long-context scenarios.

The taxonomy tree positions this work within 'Task and Context Adaptations,' distinct from general-purpose compression mechanisms and system-level optimizations. Neighboring leaves include 'Long-Context and Prefix Management' (four papers addressing shared prefixes and input length) and 'Task-Aware and Query-Adaptive Compression' (two papers on query-specific strategies). The sibling papers in the same leaf emphasize redundancy detection in reasoning traces and task-aware eviction, suggesting ThinKV's thought-type classification aligns with emerging interest in semantic structure within reasoning outputs.

Across the thirty candidates examined (ten per contribution), the ThinKV framework contribution has one refutable candidate, while the TBQ/TBE algorithms and the Continuous Thinking kernel each have zero refutations. The framework-level overlap suggests prior work touches on thought-adaptive compression concepts, though the specific algorithmic contributions (TBQ/TBE) and the kernel design appear less directly anticipated. Because the search is limited, these statistics reflect top-K semantic matches plus citation expansion, not exhaustive coverage of the reasoning-model compression literature.

Based on the examined candidates, ThinKV's algorithmic and kernel contributions appear relatively novel within the constrained search, while the broader framework concept encounters some prior overlap. The taxonomy context reveals a moderately populated research direction where thought-aware strategies are gaining traction, positioning this work as an incremental advance in a growing subfield rather than a foundational departure.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: KV cache compression for large reasoning models. The field addresses memory bottlenecks in transformer-based language models by reducing the storage footprint of key-value caches during inference. The taxonomy organizes work into five main branches: Compression Mechanisms and Techniques explores fundamental methods such as eviction policies, quantization (e.g., SKVQ[18], WKVQuant[34]), and merging strategies; Task and Context Adaptations examines how compression strategies vary with workload characteristics, including reasoning-heavy scenarios and long-output generation; Domain-Specific Adaptations targets specialized settings like retrieval-augmented generation (Beyond RAG[12]) and vision-language models (VL-cache[13]); System Architecture and Deployment focuses on practical integration, batching, and hardware-aware optimizations; and Evaluation and Benchmarking provides frameworks (Comprehensive Benchmark[40]) to assess trade-offs between compression ratio, latency, and task accuracy.

Representative works like MiniCache[3] and PyramidKV[7] illustrate how eviction and layer-wise strategies can be combined, while recent surveys (KV Cache Review[1], Acceleration Survey[25]) synthesize emerging trends across these branches.

A particularly active line of work centers on reasoning and long-output generation, where models must maintain extended context while producing multi-step solutions. Methods such as R-KV[4] and R-KV Redundancy[37] identify and exploit redundancy patterns specific to reasoning traces, while Reasoning Path Compression[38] and Lethe[44] propose task-aware eviction that preserves critical intermediate steps. ThinKV[0] sits within this cluster, emphasizing efficient cache management tailored to the unique demands of large reasoning models.
Compared to more general eviction schemes like LeanKV[5] or quantization-focused approaches (Residual Vector Quantization[9]), ThinKV[0] and its neighbors prioritize preserving logical dependencies and multi-turn coherence over uniform compression. This contrast highlights an open question: whether reasoning workloads benefit more from semantic-aware selection or from aggressive uniform compression paired with robust error correction.

Claimed Contributions

ThinKV thought-adaptive KV cache compression framework

The authors introduce ThinKV, a framework that decomposes chain-of-thought reasoning into distinct thought types (reasoning, execution, transition) based on attention sparsity patterns. It applies hybrid quantization and eviction strategies that adapt to the importance of different thought types, achieving near-lossless accuracy with less than 5% of the original KV cache while improving inference throughput.

10 retrieved papers; 1 can refute
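To make the framework's premise concrete, the following is a minimal, hypothetical sketch of classifying a thought segment by attention sparsity. The sparsity metric, the `eps`, `lo`, and `hi` thresholds, and the mapping to the three labels are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Hypothetical sketch (not the paper's code): bucket CoT segments into
# thought types by how sparse their tokens' attention rows are.

def attention_sparsity(attn_row, eps=0.01):
    """Fraction of attention weights below eps (higher = sparser row)."""
    return float(np.mean(np.asarray(attn_row) < eps))

def classify_segment(attn_rows, lo=0.5, hi=0.9):
    """Label a segment by the mean sparsity of its tokens' attention rows."""
    s = float(np.mean([attention_sparsity(r) for r in attn_rows]))
    if s < lo:
        return "reasoning"    # dense attention: integrates broad context
    if s < hi:
        return "execution"
    return "transition"       # very sparse: mostly local bookkeeping

uniform = [np.full(64, 1.0 / 64)] * 3  # no weight falls below eps
one_hot = [np.eye(64)[0]] * 3          # 63/64 weights are exactly zero
```

A uniform attention row scores sparsity 0 and lands in the densest bucket, while a one-hot row scores near 1 and lands in the sparsest; real segment boundaries and thresholds would have to come from the paper.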
Think Before you Quantize (TBQ) and Think Before You Evict (TBE) algorithms

The authors develop TBQ, a quantization method that allocates bit-precision according to thought-type importance, and TBE, an eviction policy that progressively removes tokens from less critical thoughts as reasoning trajectories evolve. These algorithms operate at the thought-segment level rather than individual tokens, preserving reasoning-critical information under high compression.

10 retrieved papers
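The segment-level decisions described above can be sketched as follows. This is a hypothetical illustration of TBQ/TBE-style logic: the bit-widths, the importance ordering, and the greedy keep-by-importance policy are assumptions for exposition, not the authors' algorithms.

```python
# Hypothetical sketch of thought-segment-level quantization and eviction
# decisions; all constants and policies below are illustrative.

BITS_BY_TYPE = {"reasoning": 8, "execution": 4, "transition": 2}
IMPORTANCE = {"reasoning": 2, "execution": 1, "transition": 0}

def tbq_bits(thought_type):
    """TBQ-style: choose a KV bit-width from thought-type importance."""
    return BITS_BY_TYPE[thought_type]

def tbe_evict(segments, budget_tokens):
    """TBE-style: keep whole segments greedily by importance, implicitly
    evicting tokens from the least critical thoughts first."""
    ranked = sorted(segments, key=lambda s: IMPORTANCE[s["type"]],
                    reverse=True)
    kept, total = [], 0
    for seg in ranked:
        if total + seg["tokens"] <= budget_tokens:
            kept.append(seg)
            total += seg["tokens"]
    return kept

segs = [{"type": "reasoning", "tokens": 4},
        {"type": "transition", "tokens": 3},
        {"type": "execution", "tokens": 2}]
survivors = tbe_evict(segs, budget_tokens=6)  # transition segment evicted
```

Operating on whole segments rather than individual tokens is what lets a policy like this preserve reasoning-critical spans intact under a tight budget.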
Continuous Thinking kernel extending PagedAttention

The authors design Continuous Thinking, a system-level kernel that extends PagedAttention to enable in-place memory reuse of evicted KV token slots. This eliminates the need for gather-based compaction operations that cause memory bandwidth contention, thereby reducing inference overhead and enabling efficient dynamic eviction during decoding.

10 retrieved papers
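The bookkeeping this kernel is said to enable can be modeled in a few lines. The sketch below is a hypothetical Python model of in-place slot reuse in a paged KV cache, not a GPU kernel and not the authors' code: evicted tokens' physical slots go on a free list and are handed to new tokens, so no gather-based compaction of live entries is needed.

```python
# Hypothetical model of free-list slot reuse in a paged KV cache.
# The class name and interface are illustrative assumptions.

class PagedSlotAllocator:
    def __init__(self, num_slots):
        self.free = list(range(num_slots))  # all physical slots start free
        self.slot_of = {}                   # token id -> physical slot

    def allocate(self, token_id):
        """Place a new token's KV entry in any free physical slot."""
        slot = self.free.pop()
        self.slot_of[token_id] = slot
        return slot

    def evict(self, token_id):
        """Return an evicted token's slot to the free list, in place,
        without moving any other token's entry."""
        self.free.append(self.slot_of.pop(token_id))

alloc = PagedSlotAllocator(num_slots=4)
for t in range(4):          # fill the cache with tokens 0..3
    alloc.allocate(t)
freed = alloc.slot_of[1]
alloc.evict(1)              # token 1's slot becomes reusable
reused = alloc.allocate(4)  # the new token lands in the freed slot
```

Because surviving entries never move, the indirection table (as in PagedAttention's block tables) stays valid and no memory-bandwidth-heavy gather is issued on eviction.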

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
