MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: sequence modeling, test-time training, RNN, transformer alternatives
Abstract:

Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require memory and compute that grow linearly with sequence length during inference. A recent stream of work has linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs, such as DeltaNet, Mamba, and xLSTM. These models can be unified by noting that their recurrent layer dynamics can all be derived from an in-context regression objective, approximately optimized through an online learning rule. Here, we join this line of work and introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer (von Oswald et al., 2024), and study it in language modeling at the billion-parameter scale. This layer again stems from an in-context loss, but one that is now minimized to optimality at every time point using a fast conjugate gradient solver. Through an extensive suite of experiments, we show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs, especially on tasks requiring long-context understanding. This performance gain comes at the cost of additional FLOPs spent at inference time. Our results are therefore intriguingly related to recent trends of increasing test-time compute to improve performance, here by spending compute to solve sequential optimization problems within the neural network itself.
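To make the abstract's distinction concrete, the following minimal sketch contrasts the two views it describes: a Mesa-style layer answers each query with the exact ridge-regression solution over all keys and values seen so far, while a delta-rule layer (DeltaNet-style) takes only a single online gradient step on the same objective. All function and variable names here are illustrative assumptions, not code from the paper.

```python
import numpy as np

def mesa_layer_step(K, V, q, lam=1.0):
    """Exact in-context ridge regression at one time step.

    Solves W_t = argmin_W sum_i ||W k_i - v_i||^2 + lam ||W||^2 in closed form
    and returns W_t @ q. K: (t, d) keys so far, V: (t, d) values so far,
    q: (d,) current query. MesaNet performs this solve with conjugate gradient
    rather than a dense factorization."""
    d = K.shape[1]
    H = K.T @ K + lam * np.eye(d)   # accumulated Gram matrix plus ridge term
    x = np.linalg.solve(H, q)       # the per-token linear solve
    return (V.T @ K) @ x            # W_t q with W_t = (V^T K) H^{-1}

def delta_rule_step(W, k, v, lr=0.5):
    """One online gradient step on the same in-context loss: the
    approximate update underlying delta-rule RNN layers."""
    return W + lr * np.outer(v - W @ k, k)
```

The contrast makes the trade-off explicit: the exact solve spends extra compute per token (amortized in the paper via conjugate gradient and chunkwise parallelism), while the delta rule is cheap but only approximately minimizes the same loss.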

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a numerically stable, chunkwise parallelizable version of the Mesa layer optimized through conjugate gradient solvers for in-context learning, and evaluates it at billion-parameter scale for language modeling. Within the taxonomy, it resides in 'Efficient Sequence Processing Strategies' under 'Efficiency and Scalability Optimization', alongside four sibling papers focused on segmentation, dual-path processing, and parallelization techniques. This leaf contains five papers total, representing a moderately active but not overcrowded research direction within the broader 50-paper taxonomy covering RNN sequence modeling.

The taxonomy reveals that 'Efficient Sequence Processing Strategies' sits adjacent to 'Model Compression and Sparsity' and 'Hardware Acceleration', forming a cluster addressing computational bottlenecks in RNN inference and training. Neighboring branches include 'State Space Models and Linear RNNs' (focused on structured recurrences) and 'Transformer-RNN Comparisons' (examining architectural trade-offs). The paper's emphasis on optimal test-time training through conjugate gradient solvers distinguishes it from siblings like SegRNN, which simplifies recurrence via segmentation, and from state space models that rely on linear recurrence formulations rather than iterative optimization.

Among 23 candidates examined across three contributions, none were flagged as clearly refuting the work. Contribution A (parallelizable Mesa layer) examined 3 candidates with 0 refutable; Contribution B (MesaNet architecture performance) examined 10 candidates with 0 refutable; Contribution C (comparative RNN analysis) examined 10 candidates with 0 refutable. This suggests that within the limited search scope—top-K semantic matches plus citation expansion—no prior work was found providing directly overlapping methods or results. The absence of refutable candidates across all contributions indicates potential novelty, though the search scale (23 papers) leaves open the possibility of relevant work outside this sample.

Based on the limited literature search, the work appears to occupy a distinct position combining optimal in-context learning with efficient RNN design. The taxonomy context shows it addresses efficiency concerns shared by neighboring papers but through a unique optimization-based approach. However, the analysis covers only top-23 semantic matches and does not exhaustively survey all RNN efficiency literature or recent state space model developments, which may contain related optimization strategies not captured here.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 23
Refutable papers: 0

Research Landscape Overview

Core task: sequence modeling with efficient recurrent neural networks. The field encompasses a broad spectrum of research directions organized around making RNNs more practical and powerful. At the highest level, the taxonomy distinguishes between foundational architecture design (exploring gating mechanisms, novel cell structures, and hybrid models), efficiency and scalability optimization (addressing computational bottlenecks through pruning, quantization, and hardware-aware strategies), training and optimization methodologies (tackling gradient issues and regularization), long-range dependency modeling (capturing temporal patterns over extended horizons), comparisons with Transformers and sequence-to-sequence frameworks, diverse application domains (from speech recognition to biomedical analytics), and supporting survey literature and tooling. Representative works such as RNN Comprehensive Review[1] and RNN Survey[5] provide overviews of architectural evolution, while studies like Gated Recurrent Networks[2] and RNN Architectures Training[3] delve into specific design and training challenges.

Several active lines of work reveal contrasting priorities and open questions. One thread focuses on architectural innovation to balance expressiveness and efficiency, exemplified by hierarchical gating schemes (Hierarchically Gated RNN[14]) and modular designs (Group Recurrent Networks[46]). Another emphasizes direct efficiency gains through segmentation and streamlined processing, as seen in SegRNN[38] and Dual Path RNN[45], which reduce redundant computation without sacrificing accuracy. MesaNet[0] sits within this efficiency-oriented cluster, proposing strategies for efficient sequence processing that align closely with works like SegRNN[38] and Dynamic Beam Width[27]. Compared to SegRNN[38], which segments inputs to simplify recurrence, MesaNet[0] appears to explore complementary mechanisms for accelerating inference or training.

Meanwhile, efforts such as RWKV Transformer Era[39] and Resurrecting RNN[49] investigate whether modern RNN variants can rival Transformer scalability, highlighting ongoing debates about the trade-offs between recurrent and attention-based paradigms in large-scale sequence modeling.

Claimed Contributions

Parallelizable and numerically stable Mesa layer with adaptive forgetting

The authors introduce a chunkwise parallelizable version of the Mesa layer that solves linear systems using conjugate gradient methods. This new formulation enables efficient training on modern accelerators, supports dynamic forgetting through gating mechanisms, and maintains numerical stability, overcoming limitations of the original sequential Mesa layer.

3 retrieved papers
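The claimed mechanism can be sketched as follows. This is a minimal sequential reference only: the variable names, the scalar form of the forgetting gate, and the conjugate-gradient details are assumptions for illustration, not the authors' implementation, and a chunkwise-parallel training kernel would compute the same quantities blockwise on accelerators.

```python
import numpy as np

def conjugate_gradient(A, b, iters=10, tol=1e-8):
    """Solve A x = b for symmetric positive-definite A by conjugate gradient."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:   # residual small enough: solution found
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def gated_mesa_forward(keys, values, queries, gates, lam=1.0, cg_iters=10):
    """Sequential semantics of a gated Mesa-style layer.

    The recurrent state (a key covariance G and a value-key correlation S)
    is decayed by a forgetting gate in (0, 1] and updated with each new
    key/value pair; every output solves (G_t + lam I) x = q_t by CG."""
    T, d = keys.shape
    G = np.zeros((d, d))
    S = np.zeros((d, d))
    outs = np.zeros((T, d))
    for t in range(T):
        G = gates[t] * G + np.outer(keys[t], keys[t])
        S = gates[t] * S + np.outer(values[t], keys[t])
        x = conjugate_gradient(G + lam * np.eye(d), queries[t], iters=cg_iters)
        outs[t] = S @ x
    return outs
```

The ridge term `lam` keeps the system positive definite (and the iteration numerically stable) even before enough keys have accumulated, while the gate lets the layer dynamically forget stale context.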
MesaNet architecture achieving strong language modeling performance

The authors train MesaNet models at 140M, 440M, and 1B parameter scales on the SlimPajama dataset. These models achieve lower validation perplexity than existing recurrent models such as Mamba2, xLSTM, DeltaNet, and Gated DeltaNet, while matching or exceeding transformer performance on various benchmarks.

10 retrieved papers
In-depth comparative analysis of modern RNN architectures

The authors conduct comprehensive analyses revealing that RNN models and transformers reduce perplexity differently across sequence positions: RNNs excel early in sequences, while transformers perform better on later tokens. They also disentangle downstream benchmarks into global versus local language-modeling requirements using controlled sliding-window attention ablations.

10 retrieved papers
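The sliding-window ablation described in the third contribution hinges on a simple masking change. The sketch below (illustrative only, not the authors' code) builds a causal attention mask restricted to a local window of `window` tokens; comparing a full-context model against such a windowed baseline indicates how much of a benchmark's score depends on tokens beyond the window.

```python
import numpy as np

def sliding_window_mask(T, window):
    """Boolean (T, T) mask where position i may attend to j iff i - window < j <= i.

    window = T recovers ordinary causal attention; a small window isolates
    the 'local' component of language modeling shared by RNNs and transformers."""
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    return (j <= i) & (j > i - window)
```

In an ablation, this mask would replace the full causal mask inside each attention layer while everything else is held fixed, so any score gap is attributable to global-context use.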

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Parallelizable and numerically stable Mesa layer with adaptive forgetting

The authors introduce a chunkwise parallelizable version of the Mesa layer that solves linear systems using conjugate gradient methods. This new formulation enables efficient training on modern accelerators, supports dynamic forgetting through gating mechanisms, and maintains numerical stability, overcoming limitations of the original sequential Mesa layer.

Contribution

MesaNet architecture achieving strong language modeling performance

The authors train MesaNet models at 140M, 440M, and 1B parameter scales on the SlimPajama dataset. These models achieve lower validation perplexity than existing recurrent models such as Mamba2, xLSTM, DeltaNet, and Gated DeltaNet, while matching or exceeding transformer performance on various benchmarks.

Contribution

In-depth comparative analysis of modern RNN architectures

The authors conduct comprehensive analyses revealing that RNN models and transformers reduce perplexity differently across sequence positions: RNNs excel early in sequences, while transformers perform better on later tokens. They also disentangle downstream benchmarks into global versus local language-modeling requirements using controlled sliding-window attention ablations.