Specialization after Generalization: Towards Understanding Test-Time Training in Foundation Models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: test-time training, linear representation hypothesis, specialization, continual learning, sparse autoencoders, compressed sensing
Abstract:

Recent empirical studies have explored the idea of continuing to train a model at test time for a given task, known as test-time training (TTT), and have found it to yield significant performance improvements. However, there is limited understanding of why and when TTT is effective. Earlier explanations mostly focused on the observation that TTT may help with out-of-distribution adaptation or when used with privileged data. However, the growing scale of foundation models, with most test data being in-distribution, calls these explanations into question. We instead posit that foundation models remain globally underparameterized, with TTT providing a mechanism for specialization after generalization: focusing capacity on concepts relevant to the test task. Specifically, under the linear representation hypothesis, we propose a model in which TTT achieves a substantially smaller in-distribution test error than global training. We empirically validate our model's key assumptions by training a sparse autoencoder on ImageNet, showing that semantically related data points are explained by only a few shared concepts. Finally, we perform scaling studies across image and language tasks that confirm the practical implications of our model, identifying the regimes where specialization is most effective.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a theoretical framework explaining test-time training (TTT) as 'specialization after generalization' under the linear representation hypothesis, arguing that foundation models remain globally underparameterized and benefit from focusing capacity on test-relevant concepts. It resides in the 'Theoretical Foundations and Specialization Mechanisms' leaf, which contains only three papers total, including this one. This represents a sparse research direction within the broader taxonomy of 50 papers across 36 topics, suggesting the theoretical underpinnings of in-distribution TTT remain relatively underexplored compared to application-driven work.

The taxonomy reveals that most TTT research concentrates on modality-specific adaptations (vision-language, medical imaging, time series, speech) and practical mechanisms (prompt tuning, retrieval-augmented methods, diffusion-based adaptation). The paper's theoretical leaf sits within 'Core Test-Time Training Mechanisms and Theory,' which also includes forward-only optimization and retrieval-augmented approaches—these neighboring leaves focus on algorithmic efficiency rather than foundational explanations. The sibling papers (TTT Nonlinear Functions, TTT Transformers ICL) examine architectural mechanisms for test-time learning, whereas this work addresses the more fundamental question of why TTT improves in-distribution prediction through a specialization lens.

Among the 15 candidates examined across the three contributions, no refutable prior work was identified. For the 'specialization after generalization' framework, 10 candidates were examined with zero refutations; for the linear-representation-hypothesis model, 1 candidate; and for the sparse-autoencoder empirical validation, 4 candidates. These statistics reflect a limited search scope (top-K semantic search plus citation expansion), not an exhaustive literature review. The absence of refutable candidates suggests that, among the examined papers, none directly anticipated the theoretical model linking global underparameterization to in-distribution TTT benefits, though the small sample size limits strong conclusions.

Based on the limited search scope of 15 candidates, the theoretical framing appears novel within the examined literature, particularly the claim that TTT addresses underparameterization rather than distribution shift. However, the sparse population of the theoretical leaf (3 papers) and the modest search scale mean substantial related work may exist outside the top-K semantic matches. The analysis covers conceptual positioning but cannot definitively assess novelty against the full field.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 15
Refutable papers: 0

Research Landscape Overview

Core task: test-time training for in-distribution prediction in foundation models. This field explores how large pre-trained models can be adapted at inference time to improve performance on specific test instances or distributions without requiring extensive offline retraining. The taxonomy reveals a rich landscape organized around several major themes. At the foundation lie Core Test-Time Training Mechanisms and Theory, which develop the mathematical principles and algorithmic frameworks—such as self-supervised auxiliary tasks, parameter-efficient updates, and specialization strategies—that enable models to learn from test data on the fly. Branching outward, the taxonomy encompasses modality-specific adaptations: Vision-Language Foundation Model Adaptation addresses multimodal alignment challenges (e.g., Robust Prompt Tuning[5], Noisy VLM Adaptation[9]), while Computer Vision Task-Specific Adaptation and Medical and Healthcare Applications tackle domain-specific requirements in imaging and clinical settings (e.g., AutoMiSeg[4], TTA-FM Prostate[13]). Parallel branches cover Time Series Foundation Models (TS-RAG[14], Financial Time Series[18]), Speech and Audio Foundation Models (Dysarthric Speech Adaptation[26]), and Reinforcement Learning and Behavioral Foundation Models (Behavioral Foundation Adaptation[2], Zero-Shot Behavioral Adaptation[11]). Cross-cutting concerns appear in Continual and Online Test-Time Adaptation, Cross-Domain and Multimodal Generalization, Uncertainty Quantification and Reliability (Uncertainty-Aware Priors[16]), and Specialized Domain Foundation Models. Within the theoretical core, a particularly active line of work investigates how foundation models can specialize after broad pre-training, balancing generalization with instance-specific refinement. 
Specialization after Generalization[0] sits squarely in this theoretical branch alongside TTT Nonlinear Functions[28] and TTT Transformers ICL[33], which explore architectural mechanisms for test-time learning in transformer-based systems. These works contrast with more application-driven approaches: while Diffusion Test Adaptation[3] and Forward Pass Adaptation[6] emphasize lightweight, single-pass adjustments for efficiency, the theoretical studies examine deeper questions about what representations and learning rules enable effective specialization without catastrophic forgetting. The original paper's emphasis on theoretical foundations and specialization mechanisms positions it as a conceptual anchor for understanding when and why test-time training succeeds, complementing empirical investigations across the taxonomy's diverse application domains.

Claimed Contributions

Specialization after generalization framework for test-time training

The authors propose a conceptual framework where test-time training enables foundation models to specialize by temporarily reallocating capacity to concepts relevant to the immediate test task, rather than requiring out-of-distribution data or privileged information. This mechanism addresses global underparameterization by locally adapting the model.

10 retrieved papers

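The contribution above is a conceptual claim, but the mechanism it describes can be illustrated with a toy experiment. The sketch below is entirely synthetic (none of the data, dimensions, or names come from the paper): it fits one global linear model to data drawn from two regions that follow different local linear rules, then "test-time trains" by refitting on the test point's nearest training neighbors, i.e., specializing capacity to the locally relevant concept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "concept regions", each with its own linear rule. A single global
# linear model is underparameterized for the mixture; a model specialized
# on the test point's neighborhood can fit the local rule.
n_per, d = 200, 2
X0 = rng.standard_normal((n_per, d)) + np.array([4.0, 0.0])
X1 = rng.standard_normal((n_per, d)) + np.array([-4.0, 0.0])
w0, w1 = np.array([1.0, -2.0]), np.array([-3.0, 0.5])
X = np.vstack([X0, X1])
y = np.concatenate([X0 @ w0, X1 @ w1]) + 0.01 * rng.standard_normal(2 * n_per)

# Global training: one least-squares fit over all data.
w_glob, *_ = np.linalg.lstsq(X, y, rcond=None)

def ttt_predict(x_test, k=20):
    """Specialize on the k nearest training points before predicting."""
    idx = np.argsort(np.linalg.norm(X - x_test, axis=1))[:k]
    w_loc, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return x_test @ w_loc

# An in-distribution test point from the first region.
x_test = rng.standard_normal(d) + np.array([4.0, 0.0])
y_true = x_test @ w0
err_glob = (x_test @ w_glob - y_true) ** 2
err_ttt = (ttt_predict(x_test) - y_true) ** 2
```

On this construction the locally refit model recovers the region's true weights almost exactly, while the global model must compromise between the two incompatible rules, mirroring the underparameterization argument: the test point is in-distribution, yet specialization still helps.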
Theoretical model under the linear representation hypothesis

The authors develop a theoretical model based on the linear representation hypothesis where test-time training can achieve lower in-distribution test error than globally trained models. The model formalizes how TTT efficiently recovers the local meaning of superimposed concepts in underparameterized feature spaces.

1 retrieved paper

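"Recovering the local meaning of superimposed concepts" is closely related to sparse recovery in compressed sensing, one of the report's listed keywords. As an illustration of that connection rather than the paper's actual model, the sketch below (synthetic dictionary and data, all parameters assumed) superimposes k concept directions from an overcomplete dictionary and recovers the sparse coefficients with ISTA (iterative soft thresholding):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 20, 50, 3  # feature dim < number of concepts: superposition

# Overcomplete concept dictionary with unit-norm columns.
D = rng.standard_normal((d, m))
D /= np.linalg.norm(D, axis=0)

# A representation that superimposes only k concepts.
z_true = np.zeros(m)
support = rng.choice(m, size=k, replace=False)
z_true[support] = rng.uniform(1.0, 2.0, size=k)
x = D @ z_true

def ista(D, x, lam=0.05, iters=2000):
    """Minimize 0.5*||D z - x||^2 + lam*||z||_1 by iterative soft thresholding."""
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the smooth part
    z = np.zeros(D.shape[1])
    for _ in range(iters):
        z = z - D.T @ (D @ z - x) / L            # gradient step
        z = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return z

z_hat = ista(D, x)
```

Even though the system is underdetermined (d < m), the k active concepts are identifiable because the solution is sparse, which is the compressed-sensing intuition behind a locally specialized model succeeding in an underparameterized feature space.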
Empirical validation through sparse autoencoders and scaling studies

The authors train sparse autoencoders on ImageNet to validate that local neighborhoods are supported by few concepts and that TTT implicitly finds sparse solutions. They conduct scaling studies across vision and language tasks demonstrating that TTT provides the largest performance gains in the underparameterized regime.

4 retrieved papers
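The sparse-autoencoder experiment described above can be sketched in miniature on synthetic data; the architecture, dimensions, and penalty below are illustrative assumptions, not the paper's setup. A ReLU autoencoder with an L1 activation penalty is trained on points that are sparse mixtures of concept directions, and the fraction of active latents falls during training:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, k, n = 16, 32, 3, 512          # rep. dim, concepts, concepts per point, samples
n_lat, lam, lr, steps = 64, 0.05, 0.05, 1000

# Synthetic representations: sparse mixes of concept directions, in the
# spirit of the linear representation hypothesis.
C = rng.standard_normal((d, m))
C /= np.linalg.norm(C, axis=0)
Z = np.zeros((n, m))
for i in range(n):
    Z[i, rng.choice(m, size=k, replace=False)] = rng.uniform(0.5, 1.5, size=k)
X = Z @ C.T

# Untied sparse autoencoder: h = relu(X @ E + b), reconstruction = h @ Dec.
E = 0.2 * rng.standard_normal((d, n_lat))
b = np.zeros(n_lat)
Dec = 0.2 * rng.standard_normal((n_lat, d))

def stats():
    h = np.maximum(X @ E + b, 0.0)
    loss = 0.5 * np.sum((h @ Dec - X) ** 2) / n + lam * np.sum(np.abs(h)) / n
    return loss, np.mean(h > 0)       # objective, fraction of active latents

loss0, active0 = stats()
for _ in range(steps):
    pre = X @ E + b
    h = np.maximum(pre, 0.0)
    r = (h @ Dec - X) / n             # gradient of the reconstruction term
    g_h = r @ Dec.T + (lam / n) * (h > 0)
    g_pre = g_h * (pre > 0)           # ReLU backprop
    Dec -= lr * (h.T @ r)
    E -= lr * (X.T @ g_pre)
    b -= lr * g_pre.sum(axis=0)
loss1, active1 = stats()
```

The L1 penalty drives most latents inactive on any given point, a small-scale analogue of the paper's observation that local neighborhoods are supported by only a few shared concepts; the paper's actual experiments train such an autoencoder on ImageNet features rather than synthetic mixtures.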

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Specialization after generalization framework for test-time training

The authors propose a conceptual framework where test-time training enables foundation models to specialize by temporarily reallocating capacity to concepts relevant to the immediate test task, rather than requiring out-of-distribution data or privileged information. This mechanism addresses global underparameterization by locally adapting the model.

Contribution

Theoretical model under the linear representation hypothesis

The authors develop a theoretical model based on the linear representation hypothesis where test-time training can achieve lower in-distribution test error than globally trained models. The model formalizes how TTT efficiently recovers the local meaning of superimposed concepts in underparameterized feature spaces.

Contribution

Empirical validation through sparse autoencoders and scaling studies

The authors train sparse autoencoders on ImageNet to validate that local neighborhoods are supported by few concepts and that TTT implicitly finds sparse solutions. They conduct scaling studies across vision and language tasks demonstrating that TTT provides the largest performance gains in the underparameterized regime.