TNT: Improving Chunkwise Training for Test-Time Memorization

ICLR 2026 Conference Submission · Anonymous Authors
Recurrent Neural Networks · Sequence Modeling
Abstract:

Recurrent neural networks (RNNs) with deep test-time memorization modules, such as Titans and TTT, represent a promising, linearly-scaling paradigm distinct from Transformers. While these expressive models do not yet match the peak performance of state-of-the-art Transformers, their potential has been largely untapped due to prohibitively slow training and low hardware utilization. Existing parallelization methods force a fundamental conflict governed by the chunksize hyperparameter: large chunks boost speed but degrade performance, necessitating a fixed, suboptimal compromise. To solve this challenge, we introduce TNT, a novel training paradigm that decouples training efficiency from inference performance through a two-stage process. Stage one is an efficiency-focused pre-training phase utilizing a hierarchical memory. A global module processes large, hardware-friendly chunks for long-range context, while multiple parallel local modules handle fine-grained details. Crucially, by periodically resetting local memory states, we break sequential dependencies to enable massive context parallelization. Stage two is a brief fine-tuning phase where only the local memory modules are adapted to a smaller, high-resolution chunksize, maximizing accuracy with minimal overhead. Evaluated on Titans and TTT models, TNT achieves a substantial acceleration in training speed, up to 17× faster than the most accurate baseline configuration, while simultaneously improving model accuracy. This improvement removes a critical scalability barrier, establishing a practical foundation for developing expressive RNNs and facilitating future work to close the performance gap with Transformers.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces TNT, a two-stage training paradigm that decouples training efficiency from inference performance in recurrent networks with deep test-time memorization. It resides in the 'Chunkwise Training and Hierarchical Memory' leaf under 'Training Efficiency and Parallelization Methods', where it is currently the sole occupant. This sparse positioning suggests the work addresses a relatively underexplored niche: while test-time training architectures have received substantial attention (six papers across three sibling leaves), methods explicitly targeting the training-efficiency bottleneck through hierarchical chunking remain limited in the examined literature.

The taxonomy reveals a crowded landscape of test-time training architectures (Core Test-Time Training Frameworks, Optimal Memorization and Attentional Bias, Locally Optimal and Online Learning Approaches) and diverse domain applications (3D Reconstruction, Multivariate Time Series, Image Super-Resolution). TNT's hierarchical memory with periodic state resets connects conceptually to these architectural innovations but diverges by prioritizing parallelization over novel memory mechanisms. Neighboring leaves like 'Optimal Polynomial Projection Memory' and 'External Memory Networks for Question Answering' explore memory compression and retrieval from different angles, yet none explicitly tackle the chunksize-performance trade-off that TNT addresses through its two-stage approach.

Among the sixteen candidates examined, no contribution was clearly refuted: ten candidates were compared against the TNT two-stage paradigm with zero refutable overlaps, one against the hierarchical memory architecture, and five against the Q-K projection mechanism. This limited search scope (top-K semantic matches plus citation expansion) suggests that, within the immediate neighborhood of chunkwise training and test-time memorization, no prior work directly anticipates TNT's combination of efficiency-focused pre-training with fine-tuning adaptation. However, the small candidate pool of sixteen papers means the analysis cannot rule out relevant prior work in adjacent subfields or among less semantically similar publications.

Given the sparse occupancy of the 'Chunkwise Training and Hierarchical Memory' leaf and the absence of refutable candidates among sixteen examined papers, TNT appears to occupy a relatively novel position within the surveyed literature. The two-stage decoupling strategy and periodic state resets represent incremental advances over existing chunkwise methods, though the limited search scope precludes definitive claims about broader field-wide novelty. Future exhaustive searches across training efficiency and memory-augmented RNN literature would clarify whether similar hierarchical chunking schemes exist outside the top-K semantic neighborhood.

Taxonomy

Core-task Taxonomy Papers: 24
Claimed Contributions: 3
Contribution Candidate Papers Compared: 16
Refutable Papers: 0

Research Landscape Overview

Core task: efficient training of recurrent neural networks with deep test-time memorization modules. The field has evolved around the challenge of enabling RNNs to memorize long contexts at inference time without incurring prohibitive training costs. The taxonomy reflects a multifaceted landscape: one branch focuses on Test-Time Training Architectures and Mechanisms, exploring how models can adapt their internal representations during inference; another emphasizes Training Efficiency and Parallelization Methods, addressing the computational bottlenecks that arise when scaling memory-augmented recurrent models; a third examines Domain-Specific Applications, demonstrating how test-time memorization benefits tasks ranging from language modeling to time-series forecasting; and additional branches cover Foundational Memory Mechanisms (e.g., HiPPO Recurrent Memory[8], Memory Networks[12]), Memory-Augmented Networks for Language and Reasoning, and Specialized Applications with interpretability concerns. Together, these branches illustrate a progression from classical memory architectures toward modern test-time adaptation strategies that decouple training from inference-time context handling.

A particularly active line of work centers on chunkwise and hierarchical training schemes that break long sequences into manageable segments, enabling parallelization while preserving the benefits of deep memorization. TNT Chunkwise Training[0] sits squarely in this cluster, proposing methods to train recurrent layers in chunks so that gradients can flow efficiently without full-sequence unrolling. Nearby efforts such as Titans Memorize[2] and Test-Time Training Right[3] explore complementary mechanisms for updating memory states during inference, while Ttt3r Reconstruction[1] and Atlas Context Memorization[4] investigate reconstruction-based objectives and context-aware caching.

The original paper's emphasis on chunkwise parallelization distinguishes it from works like Test-Time Memorization Journey[5] or MesaNet Sequence Modeling[6], which prioritize architectural innovations over training efficiency. This positioning highlights an ongoing trade-off: whether to invest complexity in novel memory modules or in scalable training recipes that make existing recurrent designs practical for very long contexts.
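The chunkwise idea described above can be made concrete with a minimal sketch. The snippet below is an illustration, not the paper's algorithm: it assumes a simple linear decaying recurrence, for which every position inside a chunk is a closed-form function of the chunk's entry state, so only the boundary state has to be carried sequentially and the inner computation parallelizes.

```python
import numpy as np

def chunkwise_scan(x, chunk_size, decay=0.9):
    """Chunkwise evaluation of s_t = decay * s_{t-1} + x_t.

    The recurrent state is handed off sequentially only at chunk
    boundaries; within a chunk every output is computed in closed form
    from the entry state, which is what enables parallel hardware use.
    Assumes len(x) is a multiple of chunk_size for simplicity."""
    assert len(x) % chunk_size == 0
    state = np.zeros(x.shape[1])
    t_idx = np.arange(chunk_size)
    # Lower-triangular decay weights: W[t, i] = decay**(t - i) for i <= t.
    W = np.tril(decay ** (t_idx[:, None] - t_idx[None, :]))
    outputs = []
    for start in range(0, len(x), chunk_size):
        chunk = x[start:start + chunk_size]            # (chunk_size, d)
        intra = W @ chunk                              # parallel within chunk
        carry = (decay ** (t_idx + 1))[:, None] * state
        block = intra + carry
        outputs.append(block)
        state = block[-1]                              # sequential handoff
    return np.concatenate(outputs)
```

Growing the chunk size shrinks the number of sequential handoffs, which is exactly the throughput-versus-resolution trade-off the surveyed methods navigate.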

Claimed Contributions

1. TNT two-stage training paradigm for deep memory modules (10 retrieved papers)

TNT is a novel training framework that separates training into an efficiency-focused pre-training stage using large chunks and a performance-focused fine-tuning stage using smaller chunks. This decoupling resolves the fundamental tension between training throughput and inference accuracy in deep memory modules.
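The two-stage schedule can be sketched as a toy trainer (all names and the update rule are hypothetical placeholders, not the paper's code): stage one updates every parameter group with a large, hardware-friendly chunk size, while stage two updates only the local-memory group with a small one.

```python
class TwoStageTrainer:
    """Toy sketch of a two-stage schedule: pre-train everything with
    large chunks, then fine-tune only local memory with small chunks."""

    def __init__(self):
        # Hypothetical parameter groups, each reduced to a scalar.
        self.params = {"global_mem": 0.0, "local_mem": 0.0, "backbone": 0.0}

    def step(self, trainable, chunk_size):
        # Stand-in for one optimizer step; the magnitude is a placeholder
        # for a real gradient update, not a meaningful quantity.
        for name in trainable:
            self.params[name] += 1.0 / chunk_size

    def run(self, pretrain_steps=100, finetune_steps=10,
            large_chunk=2048, small_chunk=64):
        for _ in range(pretrain_steps):          # stage 1: efficiency
            self.step(list(self.params), large_chunk)
        for _ in range(finetune_steps):          # stage 2: accuracy
            self.step(["local_mem"], small_chunk)
        return self.params
```

The point of the sketch is the asymmetry: the expensive stage runs at coarse resolution, and only a small parameter subset ever sees the fine-grained chunk size.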
2. Hierarchical memory architecture with periodic state resets (1 retrieved paper)

The framework employs a global memory module operating on large chunks for long-range context alongside multiple parallel local memory modules for fine-grained details. Periodic resets of local memory states break sequential dependencies, enabling massive context parallelization even for non-linear recurrences.
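A minimal sketch of this hierarchy follows, with assumed semantics: in the actual models the memories are deep modules updated by gradient steps, whereas here they are linear decays. The key mechanism survives the simplification: local states restart from zero at every window boundary, so all windows advance independently and in parallel, while one global state takes a single cheap step per window.

```python
import numpy as np

def hierarchical_memorize(x, local_window, decay=0.9):
    """Global memory over window summaries + local memories that are
    reset at every window boundary (illustrative linear recurrences)."""
    n, d = x.shape
    assert n % local_window == 0
    windows = x.reshape(n // local_window, local_window, d)

    # Local pass: every window starts from a zero state (the periodic
    # reset), so all windows advance in lockstep, i.e., in parallel.
    local = np.zeros_like(windows)
    state = np.zeros((len(windows), d))
    for t in range(local_window):
        state = decay * state + windows[:, t]   # vectorized over windows
        local[:, t] = state

    # Global pass: one sequential update per window summary.
    g = np.zeros(d)
    global_states = np.zeros((len(windows), d))
    for w_idx, w in enumerate(windows):
        g = decay * g + w.mean(axis=0)
        global_states[w_idx] = g
    return local.reshape(n, d), global_states
```

Note how the sequential loop length drops from `n` to `n // local_window`: the resets trade cross-window recurrence in the local memories for parallelism, and the global memory is left to carry long-range context.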
3. Q-K Projection mechanism for memory compression-retrieval alignment (5 retrieved papers)

Q-K Projection addresses the inconsistency where memory is trained on keys but queried with queries by projecting queries onto the subspace of observed keys. This ensures the memory function receives inputs from the domain it was trained on, improving retrieval performance.
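One plausible linear-algebra reading of this idea (the paper's exact formulation may differ) is an orthogonal projection of each query onto the row space of the observed keys before the memory is queried:

```python
import numpy as np

def qk_project(queries, keys):
    """Project queries onto span(rows of keys), so the memory function is
    only evaluated on the subspace it was trained on.

    P = pinv(K) @ K is the orthogonal projector onto the key row space;
    the pseudoinverse keeps this well-defined for rank-deficient keys."""
    P = np.linalg.pinv(keys) @ keys          # (d, d), symmetric, idempotent
    return queries @ P
```

A query that already lies in the span of the keys passes through unchanged; any component orthogonal to every observed key, which the memory never saw during its updates, is removed.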

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: TNT two-stage training paradigm for deep memory modules

TNT is a novel training framework that separates training into an efficiency-focused pre-training stage using large chunks and a performance-focused fine-tuning stage using smaller chunks. This decoupling resolves the fundamental tension between training throughput and inference accuracy in deep memory modules.

Contribution 2: Hierarchical memory architecture with periodic state resets

The framework employs a global memory module operating on large chunks for long-range context alongside multiple parallel local memory modules for fine-grained details. Periodic resets of local memory states break sequential dependencies, enabling massive context parallelization even for non-linear recurrences.

Contribution 3: Q-K Projection mechanism for memory compression-retrieval alignment

Q-K Projection addresses the inconsistency where memory is trained on keys but queried with queries by projecting queries onto the subspace of observed keys. This ensures the memory function receives inputs from the domain it was trained on, improving retrieval performance.
