Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Information bottleneck, generalisation, large language models, latent-space reasoning, representation learning, memory consolidation, KV-cache compression, predictive encoding, reasoning, information theory
Abstract:

Transformer LLMs have been shown to exhibit strong reasoning ability that scales with inference-time compute, most prominently through token-space “thinking” (i.e., chains of thought). A growing line of work pushes this extra computation into the model’s latent space, adjacent to standard decoding, which we term Auxiliary Latent-Space Computation (ALSC). Existing ALSC methods largely fall into three buckets: (i) token-mediated latent or special-token rollouts, (ii) residual/activation steering, and (iii) memory compression via cache pruning, merging, or summarisation. An underexplored alternative is memory consolidation and reconsolidation, two processes in the brain that stabilise newly formed memory traces and, upon recall, transiently render established traces plastic so that they can integrate new contextual information before restabilising. In a Transformer LLM, this is analogous to performing in-place global rewrites of incoming KV segments, and rewrites of past segments conditioned on newly observed tokens. In this work, we give a theoretical justification for why memory (re)consolidation via KV cache rewrites is beneficial for reasoning. We do this through the lens of Information Bottleneck (IB) theory, which posits that model generalisation emerges from an optimal balance between compression of input information and retention of predictive information in latent representations. Using IB theory, we prove that vanilla decoder-only Transformers are inherently constrained in their ability to form task-optimal sequence representations. We then introduce the Bottlenecked Transformer, which augments a decoder-only backbone LLM with a lightweight Cache Processor, an auxiliary Transformer that performs periodic, non-causal, in-place KV rewrites at newline-delimited reasoning step boundaries.
The processor consolidates recently written KV entries and reconsolidates a small, top-k attention-selected set of prior entries, conditioned on recent context. We evaluate the Bottlenecked Transformer architecture on seven mathematical reasoning benchmarks, with four backbone LLMs. Our model sees consistent performance gains over vanilla Transformers and pause-token-augmented Transformer baselines, with gains of up to +6.6 percentage points for selected tasks and backbones.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a memory consolidation mechanism for Transformer LLMs that performs periodic, in-place global rewrites of KV cache segments, justified through Information Bottleneck theory. It sits in the 'Information Bottleneck-Guided KV Cache Rewriting' leaf, which contains only two papers total (including this work and one sibling). This is a notably sparse research direction within a small taxonomy of seven papers across four main branches, suggesting the specific combination of IB-theoretic justification and consolidation-based rewriting is relatively underexplored compared to more established cache management strategies.

The taxonomy reveals three neighboring branches: attention-guided eviction methods that prune based on observed attention scores, bounded-capacity architectures enforcing fixed memory limits, and system-level frameworks integrating offloading or masking. The paper's consolidation approach diverges from these by emphasizing learned compression over heuristic pruning (attention-guided branch) and principled rewriting over hard capacity constraints (bounded-capacity branch). The taxonomy's scope notes clarify that consolidation-based rewrites exclude simple eviction or compression without reconsolidation mechanisms, positioning this work as conceptually distinct from the more populated eviction-focused directions.

Among twenty-three candidates examined across three contributions, none were flagged as clearly refutable. The Information Bottleneck justification examined ten candidates with zero refutations, the Bottlenecked Transformer architecture examined three with none, and the memory consolidation mechanism examined ten with none. This suggests that within the limited search scope—top-K semantic matches plus citation expansion—no prior work was found to substantially overlap with the specific combination of IB-guided periodic rewrites and brain-inspired consolidation framing. The sibling paper in the same leaf likely shares conceptual ground but was not flagged as refuting any contribution.

Based on the limited literature search of twenty-three candidates, the work appears to occupy a relatively novel position combining IB theory, periodic rewriting, and neuroscience-inspired consolidation. The sparse taxonomy leaf and absence of refutable overlaps suggest this specific synthesis is underexplored, though the small candidate pool means the analysis does not cover the full breadth of cache management or reasoning-enhancement literature. The novelty assessment is thus conditional on the examined scope rather than exhaustive field coverage.

Taxonomy

Core-task Taxonomy Papers: 7
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: improving reasoning through periodic KV cache rewriting in decoder-only transformers. The field addresses how to manage the key-value cache in transformer models to enhance reasoning capabilities while controlling memory costs. The taxonomy reveals four main branches: Memory Consolidation and Reconsolidation for Reasoning Enhancement focuses on periodically compressing or rewriting cached representations to distill essential information; Attention-Guided KV Cache Eviction and Pruning uses attention patterns to selectively discard less relevant tokens; Bounded and Constant-Complexity KV Cache Architectures enforces fixed-size caches through architectural constraints; and KV Cache Management for Long-Context Inference Systems optimizes cache strategies for extended sequences.

Representative works like Self-Attention Guided Eviction[2] illustrate pruning approaches, while TConstFormer[5] exemplifies bounded-complexity designs that maintain constant memory footprints regardless of sequence length. A central tension across these branches is the trade-off between compression aggressiveness and information retention: aggressive eviction or rewriting can reduce memory but may discard reasoning-critical context, whereas conservative strategies preserve more information at higher cost.

Within the Memory Consolidation branch, Bottlenecked Transformers[0] and its closely related variant Bottlenecked Transformers Abstraction[3] employ information bottleneck principles to periodically rewrite the KV cache, compressing historical context into compact representations that retain reasoning-relevant features. This contrasts with attention-guided methods like Self-Attention Guided Eviction[2], which rely on observed attention scores rather than learned compression, and with bounded architectures such as TConstFormer[5] or Predefined KV Capacity[7], which enforce hard capacity limits without explicit reconsolidation. Bottlenecked Transformers[0] sits squarely in the consolidation paradigm, emphasizing principled rewriting over simple eviction, and shares conceptual ground with Bottlenecked Transformers Abstraction[3] while differing from the more heuristic pruning strategies prevalent in neighboring branches.

Claimed Contributions

Information Bottleneck theoretical justification for KV cache rewrites

The authors provide an information-theoretic analysis showing that autoregressive training in decoder-only Transformers encourages the KV cache to preserve unnecessary input information, potentially hindering generalisation. They demonstrate that periodic KV rewrites can improve the balance between input compression and predictive information retention.

10 retrieved papers
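The compression/prediction balance this contribution appeals to is usually written as the standard IB Lagrangian (Tishby-style); identifying the latent variable with the KV-cache representation is the paper's framing, and the exact objective used there may differ:

```latex
\min_{p(z \mid x)} \; \mathcal{L}_{\mathrm{IB}}
  \;=\; I(X; Z) \;-\; \beta \, I(Z; Y)
```

Here $X$ is the input sequence, $Z$ the cached latent representation, $Y$ the prediction target, and $\beta > 0$ trades off compression of $X$ against retention of information predictive of $Y$. The claim is that autoregressive training alone keeps $I(X;Z)$ unnecessarily high, and periodic rewrites move $Z$ toward a better point on this trade-off.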
Bottlenecked Transformer architecture with Cache Processor

The authors introduce a novel architecture that augments pretrained LLMs with a small auxiliary Transformer module called the Cache Processor. This module periodically rewrites KV cache entries in-place at reasoning step boundaries, implementing consolidation of recent entries and reconsolidation of selectively recalled prior entries.

3 retrieved papers
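The "selectively recalled prior entries" step can be illustrated with a small sketch: score past cache entries by the attention mass they receive from the most recent step, then take the top-k for reconsolidation. All names, shapes, and the scoring rule here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N_PAST, K = 8, 12, 3  # hypothetical head dim, cache length, recall budget

past_keys = rng.standard_normal((N_PAST, D))  # cached key vectors
recent_q = rng.standard_normal((4, D))        # queries from the newest step

# Scaled dot-product attention from recent queries to past keys,
# followed by a row-wise softmax.
scores = recent_q @ past_keys.T / np.sqrt(D)          # (4, N_PAST)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Aggregate attention mass per past entry; the k most-attended
# entries are the ones selected for reconsolidation.
mass = weights.sum(axis=0)                            # (N_PAST,)
recall_idx = np.argsort(mass)[-K:][::-1]
print(len(recall_idx))  # 3 indices into the past cache
```

Summing softmax mass over recent queries is one plausible selection heuristic; the paper only specifies that the set is "top-k attention-selected", so the aggregation rule is a guess.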
Memory consolidation and reconsolidation mechanism for Transformer LLMs

The authors explore an underexplored direction in auxiliary latent-space computation by incorporating neuroscience-inspired memory consolidation and reconsolidation processes. This is realised through periodic in-place edits to the KV cache that stabilise new memories and update recalled memories with new contextual information.

10 retrieved papers
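The consolidation cycle described above (and in the abstract, which specifies newline-delimited reasoning-step boundaries) can be sketched as a decode loop that overwrites the newest cache segment in place whenever a step completes. This is a minimal toy: the fake KV writes, the linear stand-in for the Cache Processor, and all names are assumptions, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8               # illustrative head dimension
STEP_TOKEN = "\n"   # reasoning-step delimiter, per the paper's setup

def cache_processor(kv_segment):
    """Stand-in for the auxiliary Transformer.

    The real Cache Processor is a small non-causal Transformer that
    rewrites the segment jointly; a fixed linear map only illustrates
    the in-place, shape-preserving rewrite contract.
    """
    W = np.eye(D) * 0.9  # hypothetical learned weights
    return kv_segment @ W

def decode_with_consolidation(tokens):
    """Append per-token KV entries; consolidate the open segment
    in place whenever a newline-delimited step completes."""
    cache = []      # list of (1, D) KV rows, oldest first
    seg_start = 0   # index of the first entry in the open segment
    for tok in tokens:
        cache.append(rng.standard_normal((1, D)))  # fake KV write
        if tok == STEP_TOKEN:
            seg = np.concatenate(cache[seg_start:], axis=0)
            rewritten = cache_processor(seg)       # consolidation
            for i in range(seg.shape[0]):          # in-place overwrite
                cache[seg_start + i] = rewritten[i : i + 1]
            seg_start = len(cache)
    return np.concatenate(cache, axis=0)

kv = decode_with_consolidation(["a", "b", "\n", "c", "\n"])
print(kv.shape)  # (5, 8): rewrites preserve cache length
```

Reconsolidation of recalled prior entries would extend `cache_processor` to also take the selected past segment as input; it is omitted here to keep the rewrite contract (same shape, in place, at step boundaries) visible.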

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Information Bottleneck theoretical justification for KV cache rewrites

The authors provide an information-theoretic analysis showing that autoregressive training in decoder-only Transformers encourages the KV cache to preserve unnecessary input information, potentially hindering generalisation. They demonstrate that periodic KV rewrites can improve the balance between input compression and predictive information retention.

Contribution

Bottlenecked Transformer architecture with Cache Processor

The authors introduce a novel architecture that augments pretrained LLMs with a small auxiliary Transformer module called the Cache Processor. This module periodically rewrites KV cache entries in-place at reasoning step boundaries, implementing consolidation of recent entries and reconsolidation of selectively recalled prior entries.

Contribution

Memory consolidation and reconsolidation mechanism for Transformer LLMs

The authors explore an underexplored direction in auxiliary latent-space computation by incorporating neuroscience-inspired memory consolidation and reconsolidation processes. This is realised through periodic in-place edits to the KV cache that stabilise new memories and update recalled memories with new contextual information.