Out of the Memory Barrier: A Highly Memory-Efficient Training System for LLMs with Million-Token Contexts
Overview
Overall Novelty Assessment
The paper introduces OOMB, a memory-efficient training system for long-context LLMs that combines chunk-recurrent training with on-the-fly activation recomputation to achieve constant activation memory footprint. It resides in the 'Activation and Gradient Management' leaf under 'Memory-Efficient Training Techniques', alongside three sibling papers. This leaf represents a focused research direction within the broader taxonomy of 50 papers across approximately 36 topics, indicating a moderately active but not overcrowded area. The work directly addresses activation memory scaling, which the taxonomy identifies as a primary bottleneck distinct from KV cache management or optimizer state reduction.
The taxonomy reveals that OOMB's leaf sits within a larger branch of memory-efficient training techniques that includes distributed training parallelism (5 papers) and optimizer state reduction (1 paper). Neighboring branches address context window extension through positional encoding modifications and hybrid architectures, as well as inference-time optimizations through KV cache management systems. The scope note for OOMB's leaf explicitly excludes KV cache management methods, yet OOMB integrates paged memory management for KV cache alongside activation handling. This positions the work at the boundary between activation management and system-level KV cache optimization, potentially bridging two traditionally separate research directions within the taxonomy structure.
Among 14 candidates examined across three contributions, the analysis reveals varied novelty profiles. The chunk-recurrent training framework with constant activation memory was not evaluated against prior work (0 candidates examined). The paged memory manager for KV cache examined 4 candidates with no clear refutations, suggesting moderate novelty in this component. The asynchronous CPU offloading mechanism examined 10 candidates and found 5 refutable pairs, indicating substantial prior work exists for this specific technique. Because the search returned only the top 14 semantic matches, these findings reflect top-K coverage rather than an exhaustive survey; this caveat matters most for the offloading mechanism, where half of the examined candidates already provide overlapping prior work.
Based on the limited literature search, OOMB appears to offer incremental advances in asynchronous offloading while potentially introducing novel integration of chunk-recurrent training with paged KV cache management. The analysis covers top-14 semantic matches and does not capture the full landscape of activation management techniques or system-level optimizations. The work's positioning at the intersection of activation management and KV cache optimization may represent a meaningful synthesis, though the scope of examined candidates limits definitive assessment of its overall novelty.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose a chunk-recurrent training framework that processes sequences in segments, discarding activations after the forward pass and recomputing them during backpropagation. This design maintains constant activation memory complexity regardless of sequence length, shifting the bottleneck from activations to the KV cache.
The authors develop a paged memory management system for both the KV cache and its gradients, eliminating memory fragmentation from incremental growth. They implement specialized Triton kernels that bypass PyTorch's autograd system, enabling in-place gradient accumulation and preventing the KV cache from being stored as an activation.
The authors introduce an asynchronous offloading strategy that transfers the KV cache between GPU and CPU memory, with different approaches for dense/local versus sparse attention patterns. The mechanism uses pre-fetching and overlapping computation to hide data transfer latency, enabling training on extremely long contexts.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[37] SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training
[39] Blockwise parallel transformers for large context models
[46] SPPO: Efficient Long-sequence LLM Training via Adaptive Sequence Pipeline Parallel Offloading
Contribution Analysis
Detailed comparisons for each claimed contribution
OOMB chunk-recurrent training framework with constant activation memory
The authors propose a chunk-recurrent training framework that processes sequences in segments, discarding activations after the forward pass and recomputing them during backpropagation. This design maintains constant activation memory complexity regardless of sequence length, shifting the bottleneck from activations to the KV cache.
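The constant-memory property of this scheme can be illustrated with a toy model. The sketch below is a minimal illustration under assumed simplifications (a scalar weight and an elementwise model), not the authors' implementation: each chunk's activations are materialized only transiently, discarded after contributing to the loss, and recomputed during the backward sweep.

```python
# Toy sketch of chunk-recurrent training with recomputation.
# Model: y_i = w * x_i, loss = sum(y_i). Activations for a chunk
# exist only inside forward_chunk; backward rebuilds them on the fly,
# so peak activation memory is bounded by the chunk size, not the
# sequence length. (Illustrative names; not the OOMB code.)

def forward_chunk(w, chunk):
    # Activations for this chunk live only inside this call.
    return [w * x for x in chunk]

def train_step(w, sequence, chunk_size):
    chunks = [sequence[i:i + chunk_size]
              for i in range(0, len(sequence), chunk_size)]

    # Forward pass: accumulate the loss, keep no per-token activations.
    loss = 0.0
    for chunk in chunks:
        acts = forward_chunk(w, chunk)   # materialized...
        loss += sum(acts)                # ...reduced into the loss...
        del acts                         # ...and discarded immediately.

    # Backward pass: recompute each chunk's activations as needed.
    # d(loss)/dw = sum over tokens of x_i, since dy_i/dw = x_i.
    grad_w = 0.0
    for chunk in reversed(chunks):
        _ = forward_chunk(w, chunk)      # recomputation step
        grad_w += sum(chunk)             # local gradient contribution
    return loss, grad_w

loss, grad = train_step(2.0, [1.0, 2.0, 3.0, 4.0], chunk_size=2)
```

The trade is the classic recomputation bargain: one extra forward pass per chunk in exchange for activation memory that no longer scales with sequence length.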
Paged memory manager for KV cache and gradients with custom kernels
The authors develop a paged memory management system for both the KV cache and its gradients, eliminating memory fragmentation from incremental growth. They implement specialized Triton kernels that bypass PyTorch's autograd system, enabling in-place gradient accumulation and preventing the KV cache from being stored as an activation.
[51] Decision-Making Large Language Model for Wireless Communication: A Comprehensive Survey on Key Techniques
[52] Machine Learning Systems with Reduced Memory Requirements
[53] Opt-GPTQ: An Optimized GPTQ Combining Sparse Attention and Quantization Techniques
[54] Performance Evaluation of Generative Transformer Model Inference on Edge Devices
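The fragmentation argument behind this contribution can be made concrete with a page-table sketch. The following is a hypothetical, simplified illustration (a `PagedKV` class with a free-page pool), not the authors' Triton kernels: tokens append into fixed-size pages, so the cache grows in page granularity and never leaves fragmented gaps from incremental per-token growth.

```python
# Minimal sketch of a paged KV buffer (hypothetical PagedKV class;
# the real system uses GPU memory and custom Triton kernels).
# A page table maps logical token positions to fixed-size physical
# pages drawn from a free pool, avoiding fragmentation.

PAGE_SIZE = 4  # tokens per page (illustrative)

class PagedKV:
    def __init__(self, num_pages):
        self.pool = [[0.0] * PAGE_SIZE for _ in range(num_pages)]
        self.free = list(range(num_pages))  # indices of free pages
        self.page_table = []                # logical -> physical page
        self.length = 0                     # tokens stored so far

    def append(self, value):
        if self.length % PAGE_SIZE == 0:    # chunk boundary: new page
            self.page_table.append(self.free.pop())
        page = self.page_table[self.length // PAGE_SIZE]
        self.pool[page][self.length % PAGE_SIZE] = value
        self.length += 1

    def get(self, i):
        page = self.page_table[i // PAGE_SIZE]
        return self.pool[page][i % PAGE_SIZE]

kv = PagedKV(num_pages=8)
for t in range(10):            # 10 tokens -> ceil(10/4) = 3 pages
    kv.append(float(t))
```

In the paper's design a parallel set of pages would hold KV gradients, with the custom kernels accumulating into them in place rather than letting autograd retain the cache as an activation.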
Asynchronous CPU offloading mechanism for KV cache
The authors introduce an asynchronous offloading strategy that transfers the KV cache between GPU and CPU memory, with different approaches for dense/local versus sparse attention patterns. The mechanism uses pre-fetching and overlapping computation to hide data transfer latency, enabling training on extremely long contexts.
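The prefetch-and-overlap pattern described above can be sketched with plain threads standing in for CUDA streams. This is a schematic illustration under assumed names (`cpu_store`, `fetch`, `process_all` are all hypothetical), not the OOMB mechanism: while "attention" runs on the current KV chunk, a background thread copies the next chunk from the CPU-side store into the other half of a double buffer, hiding transfer latency behind compute.

```python
# Minimal sketch of asynchronous prefetching with double buffering.
# Threads stand in for async CPU<->GPU copies on a side stream;
# all names are illustrative, not the authors' API.

import threading

cpu_store = {i: [float(i)] * 4 for i in range(6)}  # offloaded KV chunks

def fetch(chunk_id, slot, buffers):
    # Stand-in for an asynchronous CPU -> GPU transfer.
    buffers[slot] = list(cpu_store[chunk_id])

def process_all(num_chunks):
    buffers = [None, None]                 # double buffer
    fetch(0, 0, buffers)                   # warm-up: load chunk 0
    total = 0.0
    for i in range(num_chunks):
        prefetch = None
        if i + 1 < num_chunks:             # start fetching chunk i+1
            prefetch = threading.Thread(
                target=fetch, args=(i + 1, (i + 1) % 2, buffers))
            prefetch.start()
        total += sum(buffers[i % 2])       # "attention" on chunk i
        if prefetch is not None:
            prefetch.join()                # sync before reusing slot
    return total

result = process_all(6)  # sum of 4*i for i in 0..5
```

For dense or local attention the access pattern is sequential, so this double-buffered schedule suffices; sparse patterns would instead gather only the selected pages, which is presumably why the paper treats the two cases separately.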