MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: sequence modeling, test-time training, RNN, transformer alternatives
Abstract:

Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require memory and compute that grow linearly with sequence length during inference. A recent stream of work has linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs, such as DeltaNet, Mamba, and xLSTM. These models can be unified by noting that their recurrent layer dynamics can all be derived from an in-context regression objective, approximately optimized through an online learning rule. Here, we join this line of work and introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer (von Oswald et al., 2024), and study it in language modeling at the billion-parameter scale. This layer again stems from an in-context loss, but one that is now minimized to optimality at every time point using a fast conjugate gradient solver. Through an extensive suite of experiments, we show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs, especially on tasks requiring long-context understanding. This performance gain comes at the cost of additional FLOPs spent at inference time. Our results are therefore intriguingly related to recent trends of increasing test-time compute to improve performance, here by spending compute to solve sequential optimization problems within the neural network itself.
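To make the abstract's distinction concrete, the following minimal sketch contrasts the two views it describes: a Mesa-style layer answers each query with the exact ridge-regression solution over all keys and values seen so far, while a delta-rule layer (DeltaNet-style) takes only a single online gradient step on the same objective. All function and variable names here are illustrative assumptions, not code from the paper.

```python
import numpy as np

def mesa_layer_step(K, V, q, lam=1.0):
    """Exact in-context ridge regression at one time step.

    Solves W_t = argmin_W sum_i ||W k_i - v_i||^2 + lam ||W||^2 in closed form
    and returns W_t @ q. K: (t, d) keys so far, V: (t, d) values so far,
    q: (d,) current query. MesaNet performs this solve with conjugate gradient
    rather than a dense factorization."""
    d = K.shape[1]
    H = K.T @ K + lam * np.eye(d)   # accumulated Gram matrix plus ridge term
    x = np.linalg.solve(H, q)       # the per-token linear solve
    return (V.T @ K) @ x            # W_t q with W_t = (V^T K) H^{-1}

def delta_rule_step(W, k, v, lr=0.5):
    """One online gradient step on the same in-context loss: the
    approximate update underlying delta-rule RNN layers."""
    return W + lr * np.outer(v - W @ k, k)
```

The contrast makes the trade-off explicit: the exact solve spends extra compute per token (amortized in the paper via conjugate gradient and chunkwise parallelism), while the delta rule is cheap but only approximately minimizes the same loss.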

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a numerically stable, chunkwise parallelizable version of the Mesa layer optimized through conjugate gradient solvers for in-context learning, and evaluates it at billion-parameter scale for language modeling. Within the taxonomy, it resides in 'Efficient Sequence Processing Strategies' under 'Efficiency and Scalability Optimization', alongside four sibling papers focused on segmentation, dual-path processing, and parallelization techniques. This leaf contains five papers total, representing a moderately active but not overcrowded research direction within the broader 50-paper taxonomy covering RNN sequence modeling.

The taxonomy reveals that 'Efficient Sequence Processing Strategies' sits adjacent to 'Model Compression and Sparsity' and 'Hardware Acceleration', forming a cluster addressing computational bottlenecks in RNN inference and training. Neighboring branches include 'State Space Models and Linear RNNs' (focused on structured recurrences) and 'Transformer-RNN Comparisons' (examining architectural trade-offs). The paper's emphasis on optimal test-time training through conjugate gradient solvers distinguishes it from siblings like SegRNN, which simplifies recurrence via segmentation, and from state space models that rely on linear recurrence formulations rather than iterative optimization.

Among 23 candidates examined across three contributions, none were flagged as clearly refuting the work. Contribution A (parallelizable Mesa layer) examined 3 candidates with 0 refutable; Contribution B (MesaNet architecture performance) examined 10 candidates with 0 refutable; Contribution C (comparative RNN analysis) examined 10 candidates with 0 refutable. This suggests that within the limited search scope—top-K semantic matches plus citation expansion—no prior work was found providing directly overlapping methods or results. The absence of refutable candidates across all contributions indicates potential novelty, though the search scale (23 papers) leaves open the possibility of relevant work outside this sample.

Based on the limited literature search, the work appears to occupy a distinct position combining optimal in-context learning with efficient RNN design. The taxonomy context shows it addresses efficiency concerns shared by neighboring papers but through a unique optimization-based approach. However, the analysis covers only top-23 semantic matches and does not exhaustively survey all RNN efficiency literature or recent state space model developments, which may contain related optimization strategies not captured here.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 23
Refutable papers: 0

Research Landscape Overview

Core task: sequence modeling with efficient recurrent neural networks. The field encompasses a broad spectrum of research directions organized around making RNNs more practical and powerful. At the highest level, the taxonomy distinguishes between foundational architecture design (exploring gating mechanisms, novel cell structures, and hybrid models), efficiency and scalability optimization (addressing computational bottlenecks through pruning, quantization, and hardware-aware strategies), training and optimization methodologies (tackling gradient issues and regularization), long-range dependency modeling (capturing temporal patterns over extended horizons), comparisons with Transformers and sequence-to-sequence frameworks, diverse application domains (from speech recognition to biomedical analytics), and supporting survey literature and tooling. Representative works such as RNN Comprehensive Review[1] and RNN Survey[5] provide overviews of architectural evolution, while studies like Gated Recurrent Networks[2] and RNN Architectures Training[3] delve into specific design and training challenges.

Several active lines of work reveal contrasting priorities and open questions. One thread focuses on architectural innovation to balance expressiveness and efficiency, exemplified by hierarchical gating schemes (Hierarchically Gated RNN[14]) and modular designs (Group Recurrent Networks[46]). Another emphasizes direct efficiency gains through segmentation and streamlined processing, as seen in SegRNN[38] and Dual Path RNN[45], which reduce redundant computation without sacrificing accuracy. MesaNet[0] sits within this efficiency-oriented cluster, proposing strategies for efficient sequence processing that align closely with works like SegRNN[38] and Dynamic Beam Width[27]. Compared to SegRNN[38], which segments inputs to simplify recurrence, MesaNet[0] appears to explore complementary mechanisms for accelerating inference or training.

Meanwhile, efforts such as RWKV Transformer Era[39] and Resurrecting RNN[49] investigate whether modern RNN variants can rival Transformer scalability, highlighting ongoing debates about the trade-offs between recurrent and attention-based paradigms in large-scale sequence modeling.

Claimed Contributions

Parallelizable and numerically stable Mesa layer with adaptive forgetting

The authors introduce a chunkwise parallelizable version of the Mesa layer that solves linear systems using conjugate gradient methods. This new formulation enables efficient training on modern accelerators, supports dynamic forgetting through gating mechanisms, and maintains numerical stability, overcoming limitations of the original sequential Mesa layer.

3 retrieved papers
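The claimed mechanism can be sketched as follows. This is a minimal sequential reference only: the variable names, the scalar form of the forgetting gate, and the conjugate-gradient details are assumptions for illustration, not the authors' implementation, and a chunkwise-parallel training kernel would compute the same quantities blockwise on accelerators.

```python
import numpy as np

def conjugate_gradient(A, b, iters=10, tol=1e-8):
    """Solve A x = b for symmetric positive-definite A by conjugate gradient."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:   # residual small enough: solution found
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def gated_mesa_forward(keys, values, queries, gates, lam=1.0, cg_iters=10):
    """Sequential semantics of a gated Mesa-style layer.

    The recurrent state (a key covariance G and a value-key correlation S)
    is decayed by a forgetting gate in (0, 1] and updated with each new
    key/value pair; every output solves (G_t + lam I) x = q_t by CG."""
    T, d = keys.shape
    G = np.zeros((d, d))
    S = np.zeros((d, d))
    outs = np.zeros((T, d))
    for t in range(T):
        G = gates[t] * G + np.outer(keys[t], keys[t])
        S = gates[t] * S + np.outer(values[t], keys[t])
        x = conjugate_gradient(G + lam * np.eye(d), queries[t], iters=cg_iters)
        outs[t] = S @ x
    return outs
```

The ridge term `lam` keeps the system positive definite (and the iteration numerically stable) even before enough keys have accumulated, while the gate lets the layer dynamically forget stale context.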
MesaNet architecture achieving strong language modeling performance

The authors train MesaNet models at 140M, 440M, and 1B parameter scales on the SlimPajama dataset. These models achieve lower validation perplexity than existing recurrent models such as Mamba2, xLSTM, DeltaNet, and Gated DeltaNet, while matching or exceeding transformer performance on various benchmarks.

10 retrieved papers
In-depth comparative analysis of modern RNN architectures

The authors conduct comprehensive analyses revealing that RNN models and transformers reduce perplexity differently across sequence positions: RNNs excel early in sequences, while transformers perform better on later tokens. They also disentangle downstream benchmarks into global versus local language-modeling requirements using controlled sliding-window attention ablations.

10 retrieved papers
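The sliding-window ablation described in the third contribution hinges on a simple masking change. The sketch below (illustrative only, not the authors' code) builds a causal attention mask restricted to a local window of `window` tokens; comparing a full-context model against such a windowed baseline indicates how much of a benchmark's score depends on tokens beyond the window.

```python
import numpy as np

def sliding_window_mask(T, window):
    """Boolean (T, T) mask where position i may attend to j iff i - window < j <= i.

    window = T recovers ordinary causal attention; a small window isolates
    the 'local' component of language modeling shared by RNNs and transformers."""
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    return (j <= i) & (j > i - window)
```

In an ablation, this mask would replace the full causal mask inside each attention layer while everything else is held fixed, so any score gap is attributable to global-context use.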

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Parallelizable and numerically stable Mesa layer with adaptive forgetting

The authors introduce a chunkwise parallelizable version of the Mesa layer that solves linear systems using conjugate gradient methods. This new formulation enables efficient training on modern accelerators, supports dynamic forgetting through gating mechanisms, and maintains numerical stability, overcoming limitations of the original sequential Mesa layer.

Contribution

MesaNet architecture achieving strong language modeling performance

The authors train MesaNet models at 140M, 440M, and 1B parameter scales on the SlimPajama dataset. These models achieve lower validation perplexity than existing recurrent models such as Mamba2, xLSTM, DeltaNet, and Gated DeltaNet, while matching or exceeding transformer performance on various benchmarks.

Contribution

In-depth comparative analysis of modern RNN architectures

The authors conduct comprehensive analyses revealing that RNN models and transformers reduce perplexity differently across sequence positions: RNNs excel early in sequences, while transformers perform better on later tokens. They also disentangle downstream benchmarks into global versus local language-modeling requirements using controlled sliding-window attention ablations.