Test-Time Training Done Right

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Test-Time Training, Sequence Model, Long Context Model
Abstract:

Test-Time Training (TTT) models context dependencies by adapting part of the model's weights (often referred to as fast weights) at inference time. These adapted fast weights, similar to recurrent states in RNNs, store temporary memories of past tokens in the current sequence. Existing TTT methods have struggled to demonstrate effectiveness on long-sequence data due to their computational inefficiency on modern GPUs. The TTT layers in many of these approaches operate at extremely low FLOPs utilization (often below 5%) because they deliberately apply small online mini-batch sizes (e.g., updating fast weights every 16 or 64 tokens). Moreover, a small mini-batch implies fine-grained block-wise causal dependencies in the data, making these methods unsuitable for data beyond 1D ordered sequences, such as sets or N-dimensional grids like images and videos. In contrast, we pursue the opposite direction by proposing extremely large chunk updates, ranging from 2K to 1M tokens across tasks of varying modalities, which we refer to as Large Chunk Test-Time Training (LaCT). This approach improves hardware utilization by orders of magnitude and, more importantly, facilitates scaling of the nonlinear state size (up to 40% of the model parameter size), substantially improving state capacity, all without requiring cumbersome and error-prone custom kernel implementations. It also allows easy integration of sophisticated optimizers such as Muon for online memory updates. We validate our approach across diverse data modalities and tasks, including novel view synthesis from image sets, language modeling, and autoregressive video diffusion. Our approach scales up to 14-billion-parameter autoregressive video diffusion models handling sequences of up to 56K tokens. In our longest-sequence experiment, we perform novel view synthesis with more than one million tokens of context.
Our results highlight the computational and performance benefits of large-chunk test-time training, paving the way for more efficient and scalable long-context sequence modeling. We hope that this work will inspire and accelerate new research in the field of long-context modeling and test-time training.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Large Chunk Test-Time Training (LaCT), which updates fast weights over extremely large token chunks (2K-1M tokens) rather than small mini-batches. It resides in the 'TTT Layer Design and Theoretical Frameworks' leaf, which contains four papers total including this one. This leaf sits within the broader 'Test-Time Training Architectures and Mechanisms' branch, indicating a moderately populated research direction focused on foundational TTT layer designs. The sibling papers explore related but distinct angles: principled TTT design choices, meta-learning inspired architectures, and regression-focused applications.

The taxonomy reveals neighboring work in 'Hybrid and Memory-Augmented Architectures' (three papers combining TTT with attention or MoE) and 'Chunkwise Training and Parallelization' (one paper on training efficiency through chunking). The paper's focus on large-chunk updates at inference time bridges architectural innovation with computational efficiency concerns. The taxonomy's scope_note clarifies that this leaf excludes domain-specific implementations and training optimization methods, positioning the work as a core architectural contribution rather than an application or training-time technique. Related branches address long-context adaptation and deployment optimization, but through different mechanisms.

Among the 24 candidates examined, the contribution-level analysis shows varied novelty signals. For the large-chunk TTT concept, four candidates were examined with zero refutations, suggesting relative novelty in this specific framing. For the hybrid architecture combining large-chunk TTT with window attention, ten candidates were examined, also with zero refutations. However, for the nonlinear fast-weight update mechanisms, ten candidates were examined and one refutable match was found, indicating some overlap with prior work on update mechanisms. The limited search scope means these findings reflect top-K semantic matches rather than exhaustive coverage of the field.

Based on the 24-candidate search, the work appears to occupy a distinct position within TTT layer design, particularly in its emphasis on extremely large chunk sizes for hardware efficiency. The taxonomy structure suggests this is a moderately active research area with clear boundaries separating architectural innovations from training methods and applications. The analysis captures semantic proximity but cannot assess novelty against the full corpus of TTT literature or related sequence modeling approaches outside the examined candidates.

Taxonomy

- Core-task taxonomy papers: 43
- Claimed contributions: 3
- Contribution candidate papers compared: 24
- Refutable papers: 1

Research Landscape Overview

Core task: efficient long-context sequence modeling with test-time training. The field encompasses methods that adapt models during inference to handle extended sequences more effectively.

The taxonomy reveals several major branches: Test-Time Training Architectures and Mechanisms explores foundational layer designs and theoretical frameworks for incorporating gradient-based updates at inference; Training Efficiency and Optimization addresses computational costs and scalability; Long-Context Understanding and Reasoning examines how models process and reason over lengthy inputs; Domain-Specific Applications demonstrates TTT's utility in areas like video processing, medical imaging, and recommendation systems; Deployment and Inference Optimization focuses on practical implementation challenges; Empirical Studies and Reproducibility investigates experimental rigor; Alternative Sequence Modeling Approaches considers non-TTT methods for long contexts; and Surveys and Broad Perspectives provides overarching views.

Representative works like Learning at Test Time[7] and MesaNet[5] illustrate core architectural innovations, while domain applications such as TTT4Rec[8] and scFusionTTT[13] show the breadth of TTT's reach. A particularly active line of work centers on refining TTT layer designs and their theoretical underpinnings, where Test-Time Training Done Right[0] sits alongside neighbors like Test-time Regression[2] and MesaNet[5]. These studies grapple with trade-offs between expressive power and computational overhead, exploring how gradient-based adaptation can be made both effective and efficient. Test-Time Training Done Right[0] emphasizes principled design choices for stable and scalable TTT layers, contrasting with MesaNet[5]'s focus on meta-learning inspired architectures and Test-time Regression[2]'s application to predictive tasks.
Meanwhile, other branches tackle orthogonal challenges: domain-specific adaptations demonstrate TTT's versatility across modalities, while deployment-focused efforts like Mobile Edge LLMs[16] address real-world constraints. Open questions remain around balancing adaptation flexibility with inference speed, and understanding when TTT offers advantages over retrieval-augmented or state-space alternatives.

Claimed Contributions

Large Chunk Test-Time Training (LaCT)

The authors introduce LaCT, a test-time training approach that uses extremely large chunk sizes (2K to 1M tokens) for updating fast weights, in contrast to existing methods that use small mini-batches of 16-64 tokens. This design improves GPU utilization from below 5% to up to 70% and enables scaling of nonlinear state sizes up to 40% of model parameter size.

4 retrieved papers
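As a rough illustration of the large-chunk idea, the sketch below applies a fast weight to an entire chunk and then takes one gradient step per chunk, rather than one step every 16-64 tokens. All names, the key-to-value reconstruction loss, and the plain-SGD update are illustrative assumptions, not the paper's actual objective or architecture:

```python
import numpy as np

def lact_forward(tokens, chunk_size=2048, d=64, lr=0.01):
    """Minimal sketch of large-chunk test-time training (hypothetical API).

    A fast weight W is applied to a whole chunk, then updated once per
    chunk via a single large-batch gradient step on a simple
    key-to-value reconstruction loss L = ||k @ W.T - v||^2.
    """
    rng = np.random.default_rng(0)
    # Frozen "slow weight" projections (stand-ins for the model's k/v heads).
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)
    W = np.zeros((d, d))  # fast weight: the state adapted at test time
    outputs = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]        # (n, d)
        k, v = chunk @ Wk, chunk @ Wv
        outputs.append(k @ W.T)                         # read current state
        # One update per large chunk: dL/dW averaged over the whole chunk.
        grad = (k @ W.T - v).T @ k / len(chunk)
        W -= lr * grad
    return np.concatenate(outputs, axis=0)

out = lact_forward(np.random.default_rng(1).standard_normal((4096, 64)))
```

Because the gradient is computed over thousands of tokens at once, the update is a large dense matrix multiply, which is the source of the hardware-utilization gains the paper reports.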
Hybrid architecture combining large-chunk TTT with window attention

The authors propose a hybrid architecture that combines large-chunk test-time training layers with window attention layers. The window attention handles local structure and dependencies within chunks, while the TTT layer focuses on non-local context modeling across chunks, enabling the method to handle diverse N-dimensional data structures.

10 retrieved papers
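A minimal sketch of how such a hybrid might be wired is below. It is illustrative only: real window attention would be multi-head (and causal for autoregressive data), and the fast-weight update here is a simple delta-rule stand-in for the paper's actual rule:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def hybrid_layer(tokens, chunk_size=512, lr=0.01):
    """Sketch of the hybrid design: window attention handles local
    dependencies inside each chunk, while a fast weight (updated once
    per chunk) carries non-local context across chunks."""
    n, d = tokens.shape
    W = np.zeros((d, d))  # fast weight shared across chunks
    outs = []
    for s in range(0, n, chunk_size):
        x = tokens[s:s + chunk_size]
        # Local path: attention restricted to the chunk (the "window").
        attn = softmax((x @ x.T) / np.sqrt(d))
        local = attn @ x
        # Non-local path: read from the fast weight built on earlier chunks.
        nonlocal_read = x @ W.T
        outs.append(local + nonlocal_read)
        # Update the fast weight with this chunk (delta-rule stand-in).
        W += lr * (x.T @ x) / len(x)
    return np.concatenate(outs)

y = hybrid_layer(np.random.default_rng(0).standard_normal((2048, 64)))
```

Splitting local and non-local roles this way is what lets the chunking be coarse: order within a chunk is handled by attention, so the fast weight only needs chunk-level causality, which also suits sets and N-dimensional grids.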
Nonlinear fast-weight update mechanisms with normalization

The authors develop nonlinear update rules for fast weights, including gradient descent with L2 weight normalization and integration of the Muon optimizer. These mechanisms improve numerical stability and effectiveness of test-time training updates compared to simple linear updates used in prior work.

10 retrieved papers (one refutable match found)
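The described update can be approximated as follows. The Newton-Schulz coefficients are those commonly used in public Muon implementations, but the loss, shapes, learning rate, and the choice of row-wise L2 normalization are assumptions made for illustration:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize an update matrix (the core of
    Muon-style optimizers) via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # common Muon coefficients
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def update_fast_weight(W, grad, lr=0.1):
    """Hypothetical LaCT-style step: take a gradient step with an
    orthogonalized (Muon-style) update, then L2-normalize each row of
    the fast weight for numerical stability."""
    W = W - lr * newton_schulz(grad)
    W = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-7)
    return W

g = np.random.default_rng(0).standard_normal((8, 8))
W1 = update_fast_weight(np.zeros((8, 8)), g)
```

The normalization bounds the fast-weight magnitude regardless of how many chunks have been absorbed, which is one plausible reason such updates stay stable over very long sequences.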

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
