Abstract:

Large language models are often used to answer queries grounded in large text corpora (e.g. codebases, legal documents, or chat histories) by placing the entire corpus in the context window and leveraging in-context learning (ICL). Although current models support contexts of 100K-10M tokens, this setup is costly to serve because the memory consumption of the KV cache scales with input length. We explore an alternative: training a smaller KV cache offline on each corpus. At inference time, we load this trained KV cache, which we call a Cartridge, and decode a response. Critically, the cost of training a Cartridge can be amortized across all the queries referencing the same corpus. However, we find that the naive approach of training the Cartridge with next-token prediction on the corpus is not competitive with ICL. Instead, we propose self-study, a training recipe in which we generate synthetic conversations about the corpus and train the Cartridge with a context-distillation objective. We find that Cartridges trained with self-study replicate the functionality of ICL, while being significantly cheaper to serve. On challenging long-context benchmarks, Cartridges trained with self-study match ICL performance while using 38.6x less memory and enabling 26.4x higher throughput. Self-study also extends the model's effective context length (e.g. from 128k to 484k tokens on MTOB) and surprisingly, leads to Cartridges that can be composed at inference time without retraining.
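The serving cost that motivates this work follows directly from how the KV cache scales with input length. A back-of-the-envelope sketch, assuming an illustrative Llama-style configuration (32 layers, 8 KV heads under grouped-query attention, head dimension 128, fp16) rather than any figures from the paper:

```python
# Illustrative KV cache sizing; all model dimensions are assumptions,
# not measurements from the paper.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Bytes for keys + values across all layers (grouped-query attention)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

full_corpus = kv_cache_bytes(128_000)        # entire corpus held in context (ICL)
cartridge = kv_cache_bytes(128_000 // 40)    # a cache ~40x smaller, trained offline

print(f"full ICL cache: {full_corpus / 1e9:.1f} GB")
print(f"cartridge:      {cartridge / 1e9:.2f} GB")
```

Under these assumptions, a 128K-token context alone occupies about 16.8 GB per request; a cache 40x smaller (the ballpark of the paper's reported 38.6x reduction) needs roughly 0.42 GB.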

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Cartridges, trainable KV cache representations that encode entire text corpora offline for reuse across multiple queries. This work occupies a unique position in the taxonomy, residing alone in the 'Trained Context Representations and Self-Study' leaf. Unlike the heavily populated token selection, quantization, and merging branches (which collectively contain over 40 papers), this direction represents a sparse research area focused on learning compact context encodings rather than heuristically compressing existing caches. The isolation in its own leaf suggests this approach diverges substantially from mainstream compression strategies.

The taxonomy reveals that most neighboring work pursues training-free compression: token eviction methods like Scissorhands analyze attention patterns to discard tokens, quantization techniques like KVQuant reduce precision, and merging approaches like Zsmerge consolidate similar representations. The 'Trained Context Representations' branch sits conceptually between these compression-focused directions and the 'Architectural Modifications' category, which redesigns model structures for inherent efficiency. While some hybrid frameworks combine multiple strategies, none in the examined taxonomy explicitly train offline cache representations for corpus-specific reuse, highlighting the distinctiveness of the Cartridges paradigm.

Among the 30 candidates examined, the contribution-level analysis shows varied novelty profiles. The core Cartridges concept (10 candidates, 0 refutations) and the self-study training recipe (10 candidates, 0 refutations) appear novel within the limited search scope, with no prior work explicitly training reusable KV caches on corpora. However, the memory reduction and throughput claims (10 candidates, 3 refutations) face overlap with existing compression methods that also demonstrate efficiency gains, though through different mechanisms. This suggests the technical approach is distinctive while the performance benefits align with broader field objectives.

Based on the top-30 semantic matches examined, the work appears to explore a relatively uncharted direction within KV cache optimization. The taxonomy structure confirms this is not a crowded research area, though the limited search scope means potentially relevant work in representation learning or context distillation outside the KV cache framing may exist. The analysis captures novelty relative to established compression paradigms but cannot claim exhaustive coverage of all related literature.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: efficient long-context representation through trained KV caches. The field addresses the memory bottleneck of transformer-based language models by developing methods to compress or optimize the key-value cache that grows linearly with sequence length.

The taxonomy reveals a diverse landscape organized around several main strategies. Token selection and eviction approaches (e.g., Scissorhands[10], Model Tells Discard[27]) identify and discard less important tokens based on attention scores or other heuristics. Quantization techniques (e.g., KVQuant[7], QAQ[17]) reduce precision to shrink the memory footprint. Merging and transformation methods (e.g., Zsmerge[2], FINCH[4]) consolidate similar keys or values. Low-rank factorization (e.g., ThinKV[6], Keyformer[14]) exploits redundancy in the cache structure. Hybrid frameworks combine multiple strategies, while streaming and incremental management (e.g., Chunkattention[28], Rolling Forcing[31]) handle dynamic contexts. A smaller but conceptually distinct branch focuses on trained context representations and self-study, where models learn to encode or distill context more efficiently. Domain-specific optimizations, system-level serving improvements, architectural modifications, and benchmarking studies round out the taxonomy, reflecting both algorithmic innovation and practical deployment concerns.

Among the most active lines, token selection and quantization have attracted many studies due to their simplicity and immediate memory savings, though they often face trade-offs between compression ratio and task accuracy. Merging and low-rank methods offer more structured compression but require careful tuning to preserve semantic fidelity. The trained context representations branch, where Cartridges[0] resides, takes a different tack by learning compact, reusable cache encodings rather than heuristically pruning or compressing existing caches.
This approach aligns conceptually with self-study paradigms and shares motivations with works like Minicache[12] and Locret[42], which also explore learned or retrieval-augmented context management. Compared to purely eviction-based methods such as Scissorhands[10] or quantization schemes like KVQuant[7], Cartridges[0] emphasizes training-time optimization to produce efficient representations, potentially offering better generalization across tasks at the cost of additional pretraining or fine-tuning overhead. This positions it as a complementary direction that bridges compression with representation learning.

Claimed Contributions

Cartridges: trainable KV caches for long-context representations

The authors introduce Cartridges, which are compact, trainable KV caches that represent large text corpora. These are trained offline and loaded at inference time to reduce memory consumption while maintaining the generality of in-context learning.

10 retrieved papers
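The description above can be made concrete with a minimal sketch. Here a Cartridge is assumed to be parameterized as trainable key/value states, one (K, V) pair per layer, prepended to the live KV cache at inference time (in the spirit of prefix-tuning); the paper's exact parameterization and all dimensions below are hypothetical:

```python
import random

# Toy sizes for illustration only; a real model would be far larger.
N_LAYERS, CACHE_LEN, HEAD_DIM = 4, 8, 16

def init_cartridge(seed=0):
    """cartridge[layer] = (keys, values); each is a list of CACHE_LEN vectors."""
    rng = random.Random(seed)
    return [
        ([[rng.gauss(0, 1) for _ in range(HEAD_DIM)] for _ in range(CACHE_LEN)],
         [[rng.gauss(0, 1) for _ in range(HEAD_DIM)] for _ in range(CACHE_LEN)])
        for _ in range(N_LAYERS)
    ]

def prepend(cartridge, layer, k_new, v_new):
    """Serve time: trained cache entries go ahead of fresh query-token states."""
    k_cart, v_cart = cartridge[layer]
    return k_cart + k_new, v_cart + v_new

cart = init_cartridge()
k_q = [[0.0] * HEAD_DIM for _ in range(3)]   # KV states for 3 new query tokens
k, v = prepend(cart, 0, k_q, k_q)
```

At inference the cartridge entries are fixed; only during offline training would they receive gradients, which is what lets the training cost be amortized across all queries on the same corpus.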
Self-Study: a training recipe for general-purpose Cartridges

The authors propose Self-Study, a method that generates synthetic conversations about the corpus and trains Cartridges using a context-distillation objective. This approach enables Cartridges to replicate the functionality of in-context learning across diverse query types.

10 retrieved papers
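The context-distillation objective described above can be sketched as follows: on synthetic conversations about the corpus, the student (model plus Cartridge) is trained to match the next-token distributions of a teacher that sees the full corpus in context. The logits below are toy stand-ins for model outputs, not the paper's implementation:

```python
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def context_distillation_loss(teacher_logits, student_logits):
    """KL(teacher || student) for one next-token distribution."""
    p = softmax(teacher_logits)   # teacher: full corpus in context
    q = softmax(student_logits)   # student: Cartridge only
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 0.5, -1.0, 0.1]   # toy vocabulary of 4 tokens
student = [1.0, 1.0, -0.5, 0.0]
loss = context_distillation_loss(teacher, student)
```

Matching full distributions rather than single next tokens is what distinguishes this from the naive next-token-prediction baseline the paper finds uncompetitive.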
Demonstration of memory reduction and throughput improvement

The authors demonstrate that Cartridges trained with Self-Study achieve comparable performance to in-context learning while significantly reducing memory usage and increasing serving throughput on challenging long-context benchmarks.

10 retrieved papers
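To connect the memory and throughput claims, a memory-bound serving sketch: when GPU memory for KV caches is the bottleneck, the achievable batch size (and hence decode throughput) scales roughly inversely with per-request cache size. All budgets and cache sizes below are illustrative assumptions, not the paper's measurements:

```python
# Toy serving arithmetic under an assumed memory-bound regime.

def max_concurrent_requests(kv_budget_gb, per_request_gb):
    return int(kv_budget_gb / per_request_gb)

KV_BUDGET_GB = 64.0                   # memory reserved for KV caches (assumption)
ICL_CACHE_GB = 16.8                   # full corpus in context (assumption)
CARTRIDGE_GB = ICL_CACHE_GB / 38.6    # the paper's reported 38.6x reduction

icl_batch = max_concurrent_requests(KV_BUDGET_GB, ICL_CACHE_GB)
cartridge_batch = max_concurrent_requests(KV_BUDGET_GB, CARTRIDGE_GB)
print(icl_batch, cartridge_batch)
```

Under this toy budget the cartridge server fits far more concurrent requests; the measured 26.4x throughput gain reported in the paper is lower than the raw memory ratio, plausibly because serving is not purely memory-bound.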
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Cartridges: trainable KV caches for long-context representations

Contribution

Self-Study: a training recipe for general-purpose Cartridges

Contribution

Demonstration of memory reduction and throughput improvement
