Test-Time Training Done Right

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Test-Time Training, Sequence Model, Long Context Model
Abstract:

Test-Time Training (TTT) models context dependencies by adapting part of the model's weights (often referred to as fast weights) at inference time. These adapted fast weights, similar to recurrent states in RNNs, store temporary memories of past tokens in the current sequence. Existing TTT methods have struggled to demonstrate effectiveness on long-sequence data due to their computational inefficiency on modern GPUs. The TTT layers in many of these approaches operate at extremely low FLOPs utilization (often below 5%) because they deliberately apply small online mini-batch sizes (e.g., updating fast weights every 16 or 64 tokens). Moreover, a small mini-batch implies fine-grained block-wise causal dependencies in the data, making these methods unsuitable for data beyond 1D ordered sequences, such as sets or N-dimensional grids like images and videos. In contrast, we pursue the opposite direction by proposing extremely large chunk updates, ranging from 2K to 1M tokens across tasks of varying modalities, which we refer to as Large Chunk Test-Time Training (LaCT). This approach improves hardware utilization by orders of magnitude and, more importantly, facilitates scaling of the nonlinear state size (up to 40% of the model parameter size), substantially improving state capacity, all without requiring cumbersome and error-prone custom kernel implementations. It also allows easy integration of sophisticated optimizers such as Muon for online memory updates. We validate our approach across diverse data modalities and tasks, including novel view synthesis from image sets, language modeling, and autoregressive video diffusion. Our approach scales up to 14-billion-parameter autoregressive video diffusion models handling sequences of up to 56K tokens. In our longest-sequence experiment, we perform novel view synthesis with more than one million tokens of context.
Our results highlight the computational and performance benefits of large-chunk test-time training, paving the way for more efficient and scalable long-context sequence modeling. We hope that this work will inspire and accelerate new research in the field of long-context modeling and test-time training.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Large Chunk Test-Time Training (LaCT), which updates fast weights over extremely large token chunks (2K-1M tokens) rather than small mini-batches. It resides in the 'TTT Layer Design and Theoretical Frameworks' leaf, which contains four papers total including this one. This leaf sits within the broader 'Test-Time Training Architectures and Mechanisms' branch, indicating a moderately populated research direction focused on foundational TTT layer designs. The sibling papers explore related but distinct angles: principled TTT design choices, meta-learning inspired architectures, and regression-focused applications.

The taxonomy reveals neighboring work in 'Hybrid and Memory-Augmented Architectures' (three papers combining TTT with attention or MoE) and 'Chunkwise Training and Parallelization' (one paper on training efficiency through chunking). The paper's focus on large-chunk updates at inference time bridges architectural innovation with computational efficiency concerns. The taxonomy's scope_note clarifies that this leaf excludes domain-specific implementations and training optimization methods, positioning the work as a core architectural contribution rather than an application or training-time technique. Related branches address long-context adaptation and deployment optimization, but through different mechanisms.

Among the 24 candidates examined, the contribution-level analysis shows varied novelty signals. For the large-chunk TTT concept, four candidates were examined with zero refutations, suggesting relative novelty in this specific framing. For the hybrid architecture combining large-chunk TTT with window attention, ten candidates were examined, also with zero refutations. However, for the nonlinear fast-weight update mechanisms, ten candidates were examined and one refutable match was found, indicating some overlap with prior work on update mechanisms. The limited search scope means these findings reflect top-K semantic matches rather than exhaustive coverage of the field.

Based on the 24-candidate search, the work appears to occupy a distinct position within TTT layer design, particularly in its emphasis on extremely large chunk sizes for hardware efficiency. The taxonomy structure suggests this is a moderately active research area with clear boundaries separating architectural innovations from training methods and applications. The analysis captures semantic proximity but cannot assess novelty against the full corpus of TTT literature or related sequence modeling approaches outside the examined candidates.

Taxonomy

- Core-task taxonomy papers: 43
- Claimed contributions: 3
- Contribution candidate papers compared: 24
- Refutable papers: 1

Research Landscape Overview

Core task: efficient long-context sequence modeling with test-time training. The field encompasses methods that adapt models during inference to handle extended sequences more effectively.

The taxonomy reveals several major branches: Test-Time Training Architectures and Mechanisms explores foundational layer designs and theoretical frameworks for incorporating gradient-based updates at inference; Training Efficiency and Optimization addresses computational costs and scalability; Long-Context Understanding and Reasoning examines how models process and reason over lengthy inputs; Domain-Specific Applications demonstrates TTT's utility in areas like video processing, medical imaging, and recommendation systems; Deployment and Inference Optimization focuses on practical implementation challenges; Empirical Studies and Reproducibility investigates experimental rigor; Alternative Sequence Modeling Approaches considers non-TTT methods for long contexts; and Surveys and Broad Perspectives provides overarching views.

Representative works like Learning at Test Time[7] and MesaNet[5] illustrate core architectural innovations, while domain applications such as TTT4Rec[8] and scFusionTTT[13] show the breadth of TTT's reach. A particularly active line of work centers on refining TTT layer designs and their theoretical underpinnings, where Test-Time Training Done Right[0] sits alongside neighbors like Test-time Regression[2] and MesaNet[5]. These studies grapple with trade-offs between expressive power and computational overhead, exploring how gradient-based adaptation can be made both effective and efficient. Test-Time Training Done Right[0] emphasizes principled design choices for stable and scalable TTT layers, contrasting with MesaNet[5]'s focus on meta-learning inspired architectures and Test-time Regression[2]'s application to predictive tasks.
Meanwhile, other branches tackle orthogonal challenges: domain-specific adaptations demonstrate TTT's versatility across modalities, while deployment-focused efforts like Mobile Edge LLMs[16] address real-world constraints. Open questions remain around balancing adaptation flexibility with inference speed, and understanding when TTT offers advantages over retrieval-augmented or state-space alternatives.

Claimed Contributions

Large Chunk Test-Time Training (LaCT)

The authors introduce LaCT, a test-time training approach that uses extremely large chunk sizes (2K to 1M tokens) for updating fast weights, in contrast to existing methods that use small mini-batches of 16-64 tokens. This design improves GPU utilization from below 5% to up to 70% and enables scaling of nonlinear state sizes up to 40% of model parameter size.

4 retrieved papers
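As a rough illustration of the large-chunk idea, the sketch below applies a fast weight to an entire chunk and then takes one gradient step per chunk, rather than one step every 16-64 tokens. All names, the key-to-value reconstruction loss, and the plain-SGD update are illustrative assumptions, not the paper's actual objective or architecture:

```python
import numpy as np

def lact_forward(tokens, chunk_size=2048, d=64, lr=0.01):
    """Minimal sketch of large-chunk test-time training (hypothetical API).

    A fast weight W is applied to a whole chunk, then updated once per
    chunk via a single large-batch gradient step on a simple
    key-to-value reconstruction loss L = ||k @ W.T - v||^2.
    """
    rng = np.random.default_rng(0)
    # Frozen "slow weight" projections (stand-ins for the model's k/v heads).
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)
    W = np.zeros((d, d))  # fast weight: the state adapted at test time
    outputs = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]        # (n, d)
        k, v = chunk @ Wk, chunk @ Wv
        outputs.append(k @ W.T)                         # read current state
        # One update per large chunk: dL/dW averaged over the whole chunk.
        grad = (k @ W.T - v).T @ k / len(chunk)
        W -= lr * grad
    return np.concatenate(outputs, axis=0)

out = lact_forward(np.random.default_rng(1).standard_normal((4096, 64)))
```

Because the gradient is computed over thousands of tokens at once, the update is a large dense matrix multiply, which is the source of the hardware-utilization gains the paper reports.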
Hybrid architecture combining large-chunk TTT with window attention

The authors propose a hybrid architecture that combines large-chunk test-time training layers with window attention layers. The window attention handles local structure and dependencies within chunks, while the TTT layer focuses on non-local context modeling across chunks, enabling the method to handle diverse N-dimensional data structures.

10 retrieved papers
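A minimal sketch of how such a hybrid might be wired is below. It is illustrative only: real window attention would be multi-head (and causal for autoregressive data), and the fast-weight update here is a simple delta-rule stand-in for the paper's actual rule:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def hybrid_layer(tokens, chunk_size=512, lr=0.01):
    """Sketch of the hybrid design: window attention handles local
    dependencies inside each chunk, while a fast weight (updated once
    per chunk) carries non-local context across chunks."""
    n, d = tokens.shape
    W = np.zeros((d, d))  # fast weight shared across chunks
    outs = []
    for s in range(0, n, chunk_size):
        x = tokens[s:s + chunk_size]
        # Local path: attention restricted to the chunk (the "window").
        attn = softmax((x @ x.T) / np.sqrt(d))
        local = attn @ x
        # Non-local path: read from the fast weight built on earlier chunks.
        nonlocal_read = x @ W.T
        outs.append(local + nonlocal_read)
        # Update the fast weight with this chunk (delta-rule stand-in).
        W += lr * (x.T @ x) / len(x)
    return np.concatenate(outs)

y = hybrid_layer(np.random.default_rng(0).standard_normal((2048, 64)))
```

Splitting local and non-local roles this way is what lets the chunking be coarse: order within a chunk is handled by attention, so the fast weight only needs chunk-level causality, which also suits sets and N-dimensional grids.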
Nonlinear fast-weight update mechanisms with normalization

The authors develop nonlinear update rules for fast weights, including gradient descent with L2 weight normalization and integration of the Muon optimizer. These mechanisms improve numerical stability and effectiveness of test-time training updates compared to simple linear updates used in prior work.

10 retrieved papers (one refutable match found)
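The described update can be approximated as follows. The Newton-Schulz coefficients are those commonly used in public Muon implementations, but the loss, shapes, learning rate, and the choice of row-wise L2 normalization are assumptions made for illustration:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize an update matrix (the core of
    Muon-style optimizers) via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # common Muon coefficients
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def update_fast_weight(W, grad, lr=0.1):
    """Hypothetical LaCT-style step: take a gradient step with an
    orthogonalized (Muon-style) update, then L2-normalize each row of
    the fast weight for numerical stability."""
    W = W - lr * newton_schulz(grad)
    W = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-7)
    return W

g = np.random.default_rng(0).standard_normal((8, 8))
W1 = update_fast_weight(np.zeros((8, 8)), g)
```

The normalization bounds the fast-weight magnitude regardless of how many chunks have been absorbed, which is one plausible reason such updates stay stable over very long sequences.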

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
