Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Information bottleneck, generalisation, large language models, latent-space reasoning, representation learning, memory consolidation, KV-cache compression, predictive encoding, reasoning, information theory
Abstract:

Transformer LLMs have been shown to exhibit strong reasoning ability that scales with inference-time compute, most prominently through token-space “thinking” (i.e., chains of thought). A growing line of work pushes this extra computation into the model’s latent space, adjacent to standard decoding, which we term Auxiliary Latent-Space Computation (ALSC). Existing ALSC methods largely fall into three buckets: (i) token-mediated latent or special-token rollouts, (ii) residual/activation steering, and (iii) memory compression via cache pruning, merging, or summarisation. An underexplored alternative is memory consolidation and reconsolidation, two processes in the brain that stabilise newly formed memory traces and, upon recall, transiently render established traces plastic so that they can integrate new contextual information before restabilising. In a Transformer LLM, this is analogous to performing in-place global rewrites of incoming KV segments, and rewrites of past segments conditioned on newly observed tokens. In this work, we give a theoretical justification for why memory (re)consolidation via KV cache rewrites is beneficial for reasoning. We do this through the lens of Information Bottleneck (IB) theory, which posits that model generalisation emerges from an optimal balance between compression of input information and retention of predictive information in latent representations. Using IB theory, we prove that vanilla decoder-only Transformers are inherently constrained in their ability to form task-optimal sequence representations. We then introduce the Bottlenecked Transformer, which augments a decoder-only backbone LLM with a lightweight Cache Processor, an auxiliary Transformer that performs periodic, non-causal, in-place KV rewrites at newline-delimited reasoning step boundaries.
The processor consolidates recently written KV entries and reconsolidates a small, top-k attention-selected set of prior entries, conditioned on recent context. We evaluate the Bottlenecked Transformer architecture on seven mathematical reasoning benchmarks, with four backbone LLMs. Our model sees consistent performance gains over vanilla Transformers and pause-token-augmented Transformer baselines, with gains of up to +6.6 percentage points for selected tasks and backbones.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a memory consolidation mechanism for Transformer LLMs that performs periodic, in-place global rewrites of KV cache segments, justified through Information Bottleneck theory. It sits in the 'Information Bottleneck-Guided KV Cache Rewriting' leaf, which contains only two papers total (including this work and one sibling). This is a notably sparse research direction within a small taxonomy of seven papers across four main branches, suggesting the specific combination of IB-theoretic justification and consolidation-based rewriting is relatively underexplored compared to more established cache management strategies.

The taxonomy reveals three neighboring branches: attention-guided eviction methods that prune based on observed attention scores, bounded-capacity architectures enforcing fixed memory limits, and system-level frameworks integrating offloading or masking. The paper's consolidation approach diverges from these by emphasizing learned compression over heuristic pruning (attention-guided branch) and principled rewriting over hard capacity constraints (bounded-capacity branch). The taxonomy's scope notes clarify that consolidation-based rewrites exclude simple eviction or compression without reconsolidation mechanisms, positioning this work as conceptually distinct from the more populated eviction-focused directions.

Among twenty-three candidates examined across three contributions, none were flagged as clearly refutable. The Information Bottleneck justification examined ten candidates with zero refutations, the Bottlenecked Transformer architecture examined three with none, and the memory consolidation mechanism examined ten with none. This suggests that within the limited search scope—top-K semantic matches plus citation expansion—no prior work was found to substantially overlap with the specific combination of IB-guided periodic rewrites and brain-inspired consolidation framing. The sibling paper in the same leaf likely shares conceptual ground but was not flagged as refuting any contribution.

Based on the limited literature search of twenty-three candidates, the work appears to occupy a relatively novel position combining IB theory, periodic rewriting, and neuroscience-inspired consolidation. The sparse taxonomy leaf and absence of refutable overlaps suggest this specific synthesis is underexplored, though the small candidate pool means the analysis does not cover the full breadth of cache management or reasoning-enhancement literature. The novelty assessment is thus conditional on the examined scope rather than exhaustive field coverage.

Taxonomy

Core-task Taxonomy Papers: 7
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: improving reasoning through periodic KV cache rewriting in decoder-only transformers. The field addresses how to manage the key-value cache in transformer models to enhance reasoning capabilities while controlling memory costs. The taxonomy reveals four main branches: Memory Consolidation and Reconsolidation for Reasoning Enhancement focuses on periodically compressing or rewriting cached representations to distill essential information; Attention-Guided KV Cache Eviction and Pruning uses attention patterns to selectively discard less relevant tokens; Bounded and Constant-Complexity KV Cache Architectures enforces fixed-size caches through architectural constraints; and KV Cache Management for Long-Context Inference Systems optimizes cache strategies for extended sequences.

Representative works like Self-Attention Guided Eviction[2] illustrate pruning approaches, while TConstFormer[5] exemplifies bounded-complexity designs that maintain constant memory footprints regardless of sequence length. A central tension across these branches is the trade-off between compression aggressiveness and information retention: aggressive eviction or rewriting can reduce memory but may discard reasoning-critical context, whereas conservative strategies preserve more information at higher cost.

Within the Memory Consolidation branch, Bottlenecked Transformers[0] and its closely related variant Bottlenecked Transformers Abstraction[3] employ information bottleneck principles to periodically rewrite the KV cache, compressing historical context into compact representations that retain reasoning-relevant features. This contrasts with attention-guided methods like Self-Attention Guided Eviction[2], which rely on observed attention scores rather than learned compression, and with bounded architectures such as TConstFormer[5] or Predefined KV Capacity[7], which enforce hard capacity limits without explicit reconsolidation. Bottlenecked Transformers[0] sits squarely in the consolidation paradigm, emphasizing principled rewriting over simple eviction, and shares conceptual ground with Bottlenecked Transformers Abstraction[3] while differing from the more heuristic pruning strategies prevalent in neighboring branches.

Claimed Contributions

Information Bottleneck theoretical justification for KV cache rewrites

The authors provide an information-theoretic analysis showing that autoregressive training in decoder-only Transformers encourages the KV cache to preserve unnecessary input information, potentially hindering generalisation. They demonstrate that periodic KV rewrites can improve the balance between input compression and predictive information retention.

10 retrieved papers
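The compression/prediction balance this contribution appeals to is usually written as the standard IB Lagrangian (Tishby-style); identifying the latent variable with the KV-cache representation is the paper's framing, and the exact objective used there may differ:

```latex
\min_{p(z \mid x)} \; \mathcal{L}_{\mathrm{IB}}
  \;=\; I(X; Z) \;-\; \beta \, I(Z; Y)
```

Here $X$ is the input sequence, $Z$ the cached latent representation, $Y$ the prediction target, and $\beta > 0$ trades off compression of $X$ against retention of information predictive of $Y$. The claim is that autoregressive training alone keeps $I(X;Z)$ unnecessarily high, and periodic rewrites move $Z$ toward a better point on this trade-off.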
Bottlenecked Transformer architecture with Cache Processor

The authors introduce a novel architecture that augments pretrained LLMs with a small auxiliary Transformer module called the Cache Processor. This module periodically rewrites KV cache entries in-place at reasoning step boundaries, implementing consolidation of recent entries and reconsolidation of selectively recalled prior entries.

3 retrieved papers
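The "selectively recalled prior entries" step can be illustrated with a small sketch: score past cache entries by the attention mass they receive from the most recent step, then take the top-k for reconsolidation. All names, shapes, and the scoring rule here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N_PAST, K = 8, 12, 3  # hypothetical head dim, cache length, recall budget

past_keys = rng.standard_normal((N_PAST, D))  # cached key vectors
recent_q = rng.standard_normal((4, D))        # queries from the newest step

# Scaled dot-product attention from recent queries to past keys,
# followed by a row-wise softmax.
scores = recent_q @ past_keys.T / np.sqrt(D)          # (4, N_PAST)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Aggregate attention mass per past entry; the k most-attended
# entries are the ones selected for reconsolidation.
mass = weights.sum(axis=0)                            # (N_PAST,)
recall_idx = np.argsort(mass)[-K:][::-1]
print(len(recall_idx))  # 3 indices into the past cache
```

Summing softmax mass over recent queries is one plausible selection heuristic; the paper only specifies that the set is "top-k attention-selected", so the aggregation rule is a guess.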
Memory consolidation and reconsolidation mechanism for Transformer LLMs

The authors explore an underexplored direction in auxiliary latent-space computation by incorporating neuroscience-inspired memory consolidation and reconsolidation processes. This is realised through periodic in-place edits to the KV cache that stabilise new memories and update recalled memories with new contextual information.

10 retrieved papers
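The consolidation cycle described above (and in the abstract, which specifies newline-delimited reasoning-step boundaries) can be sketched as a decode loop that overwrites the newest cache segment in place whenever a step completes. This is a minimal toy: the fake KV writes, the linear stand-in for the Cache Processor, and all names are assumptions, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8               # illustrative head dimension
STEP_TOKEN = "\n"   # reasoning-step delimiter, per the paper's setup

def cache_processor(kv_segment):
    """Stand-in for the auxiliary Transformer.

    The real Cache Processor is a small non-causal Transformer that
    rewrites the segment jointly; a fixed linear map only illustrates
    the in-place, shape-preserving rewrite contract.
    """
    W = np.eye(D) * 0.9  # hypothetical learned weights
    return kv_segment @ W

def decode_with_consolidation(tokens):
    """Append per-token KV entries; consolidate the open segment
    in place whenever a newline-delimited step completes."""
    cache = []      # list of (1, D) KV rows, oldest first
    seg_start = 0   # index of the first entry in the open segment
    for tok in tokens:
        cache.append(rng.standard_normal((1, D)))  # fake KV write
        if tok == STEP_TOKEN:
            seg = np.concatenate(cache[seg_start:], axis=0)
            rewritten = cache_processor(seg)       # consolidation
            for i in range(seg.shape[0]):          # in-place overwrite
                cache[seg_start + i] = rewritten[i : i + 1]
            seg_start = len(cache)
    return np.concatenate(cache, axis=0)

kv = decode_with_consolidation(["a", "b", "\n", "c", "\n"])
print(kv.shape)  # (5, 8): rewrites preserve cache length
```

Reconsolidation of recalled prior entries would extend `cache_processor` to also take the selected past segment as input; it is omitted here to keep the rewrite contract (same shape, in place, at step boundaries) visible.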

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Information Bottleneck theoretical justification for KV cache rewrites

The authors provide an information-theoretic analysis showing that autoregressive training in decoder-only Transformers encourages the KV cache to preserve unnecessary input information, potentially hindering generalisation. They demonstrate that periodic KV rewrites can improve the balance between input compression and predictive information retention.

Contribution

Bottlenecked Transformer architecture with Cache Processor

The authors introduce a novel architecture that augments pretrained LLMs with a small auxiliary Transformer module called the Cache Processor. This module periodically rewrites KV cache entries in-place at reasoning step boundaries, implementing consolidation of recent entries and reconsolidation of selectively recalled prior entries.

Contribution

Memory consolidation and reconsolidation mechanism for Transformer LLMs

The authors explore an underexplored direction in auxiliary latent-space computation by incorporating neuroscience-inspired memory consolidation and reconsolidation processes. This is realised through periodic in-place edits to the KV cache that stabilise new memories and update recalled memories with new contextual information.