From Markov to Laplace: How Mamba In-Context Learns Markov Chains

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: state-space models, Markov chains, in-context learning, Laplacian smoothing
Abstract:

While transformer-based language models have driven the AI revolution thus far, their computational complexity has spurred growing interest in viable alternatives, such as structured state space sequence models (SSMs) and Selective SSMs. Among these, Mamba (S6) and its variant Mamba-2 have shown remarkable inference speed-ups over transformers while achieving comparable or superior performance on complex language modeling tasks. However, despite these architectural innovations and empirical successes, the fundamental learning capabilities of Mamba remain poorly understood. In this paper, we address this gap by studying in-context learning (ICL) on Markov chains and uncovering an interesting phenomenon: even a single-layer Mamba efficiently learns the in-context Laplacian smoothing estimator, which is both Bayes and minimax optimal. To explain this, we theoretically characterize the representation capacity of Mamba and reveal the fundamental role of convolution in enabling it to represent the optimal Laplacian smoothing. These theoretical insights align strongly with empirical results and, to the best of our knowledge, represent the first formal connection between Mamba and optimal statistical estimators. Finally, we outline promising research directions inspired by these findings.
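For reference, the Laplacian (add-beta) smoothing estimator discussed in the abstract adds a pseudo-count to every transition count before normalizing. A minimal sketch for a first-order chain (the function name and default beta are illustrative, not from the paper):

```python
import numpy as np

def laplacian_smoothing(seq, num_states, beta=1.0):
    """Add-beta (Laplacian) smoothed transition-probability estimate
    for a first-order Markov chain observed as a single sequence."""
    counts = np.zeros((num_states, num_states))
    for s, s_next in zip(seq[:-1], seq[1:]):
        counts[s, s_next] += 1
    # P_hat(j | i) = (n_ij + beta) / (n_i + beta * num_states)
    return (counts + beta) / (counts.sum(axis=1, keepdims=True) + beta * num_states)
```

With beta = 1 this is classic add-one smoothing; the paper's claim is that a single-layer Mamba's in-context predictions converge to this estimator on random Markov chains.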

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper establishes a formal connection between single-layer Mamba and the Bayes-optimal Laplacian smoothing estimator for Markov chains, demonstrating that Mamba efficiently learns this estimator through in-context learning. Within the taxonomy, it resides in the 'Optimal Statistical Estimation and Markov Chain Learning' leaf under 'Theoretical Foundations and Representation Capacity'. This leaf contains only two papers total, including one sibling work focused on ICL outlier robustness, indicating a relatively sparse but emerging research direction at the intersection of state-space models and statistical optimality theory.

The taxonomy reveals that theoretical work on Mamba spans two main directions: optimal estimation (where this paper sits) and memory mechanisms/state-space realizations. Neighboring branches examine architectural comparisons with Transformers and efficiency analysis, while application domains explore imitation learning and computer vision. The paper's focus on Markov chain learning and convolution's role in representation capacity distinguishes it from sibling work on outlier robustness, and from broader architectural studies that lack formal statistical optimality proofs. The taxonomy's scope and exclude notes clarify that this work provides theoretical guarantees rather than empirical benchmarking.

Among the twenty-one candidates examined across the three contributions, no clearly refutable prior work was identified. For the first contribution (Laplacian smoothing characterization), six candidates were examined with zero refutations; for the second (representation capacity theory), ten; for the third (convolution's architectural role), five. This suggests that, within the limited search scope (primarily top-K semantic matches and citation expansion), the specific combination of the Mamba architecture, Markov chain ICL, and formal optimality proofs appears relatively unexplored, though the search does not claim exhaustive coverage of all related theoretical work.

Based on the limited literature search of twenty-one candidates, the paper appears to occupy a novel position connecting Mamba's architectural properties to optimal statistical estimation on Markov chains. The sparse taxonomy leaf and absence of refutable candidates within the examined scope suggest this formal theoretical angle is underexplored. However, the analysis covers semantic neighbors and citations rather than exhaustive field coverage, leaving open the possibility of related work in adjacent theoretical communities not captured by this search methodology.

Taxonomy

Core-task Taxonomy Papers: 9
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 0

Research Landscape Overview

Core task: in-context learning capabilities of Mamba on Markov chains. The field explores how state-space models like Mamba perform in-context learning, i.e., adapting to new tasks from demonstration sequences without parameter updates. The taxonomy organizes work into three main branches: Theoretical Foundations and Representation Capacity examines statistical optimality and expressiveness guarantees; Architectural Analysis and Comparisons contrasts Mamba with Transformers and hybrid designs; and Application Domains investigates deployment in areas such as imitation learning, continual learning, and specialized prediction tasks. Representative works span from theoretical studies like Mamba ICL Outliers[8] to applied systems such as Mamba Temporal Imitation[1] and Mamba Location Prediction[6], illustrating both the breadth of architectural inquiry and the diversity of downstream uses.

A particularly active line of work focuses on understanding Mamba's statistical efficiency and memory mechanisms. Some studies probe optimal estimation rates on structured sequences, while others compare Mamba's selective state-space design against Transformer attention or hybrid architectures like BMOJO Hybrid Memory[3].

The original paper, Mamba Markov Chains[0], sits squarely within the Theoretical Foundations branch, specifically addressing optimal statistical estimation and Markov chain learning. It shares thematic ground with Mamba ICL Outliers[8], which also investigates in-context learning properties, but emphasizes rigorous analysis of learning dynamics on Markov chains rather than outlier robustness. Meanwhile, applied works such as Mamba Temporal Imitation[1] and Mail Imitation Learning[4] demonstrate how these theoretical insights translate into practical imitation-learning systems, highlighting an ongoing dialogue between foundational guarantees and real-world performance trade-offs.

Claimed Contributions

Characterization of Mamba's in-context learning of Laplacian smoothing on Markov chains

The authors demonstrate empirically that a single-layer Mamba model learns the optimal Laplacian smoothing estimator (which is both Bayes and minimax optimal) when trained on random Markov chains of various orders, exhibiting strong in-context learning capabilities.

6 retrieved papers
Theoretical characterization of Mamba's representation capacity for optimal estimators

The authors provide a constructive theoretical proof showing how Mamba can represent the Laplacian smoothing estimator for finite-state first-order Markov processes, highlighting the interplay of convolution, selectivity, and recurrence. They also establish fundamental limits on hidden dimension requirements for higher-order processes.

10 retrieved papers
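To make the interplay of recurrence and selectivity concrete, here is a didactic sketch (not the paper's construction) of an input-dependent state-space recurrence of the form h_t = A(x_t) * h_{t-1} + B(x_t) * x_t; all names are illustrative. With A and B held at 1, the hidden state simply accumulates its inputs, which is exactly the kind of running-count statistic a Laplacian smoothing estimator needs:

```python
import numpy as np

def selective_scan(xs, A_fn, B_fn):
    """Minimal selective SSM recurrence: h_t = A(x_t) * h_{t-1} + B(x_t) * x_t.
    A_fn and B_fn make the transition input-dependent ("selective")."""
    h = np.zeros_like(xs[0], dtype=float)
    hs = []
    for x_t in xs:
        h = A_fn(x_t) * h + B_fn(x_t) * x_t
        hs.append(h.copy())
    return np.array(hs)
```

For example, running this over one-hot token vectors with A_fn and B_fn both returning 1.0 makes the final state the per-symbol occurrence counts of the sequence.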
Identification of convolution as the key architectural component

Through ablation studies, the authors show that convolution is the most critical architectural component in Mamba for learning optimal estimators on Markovian data, with a simplified variant containing only convolution (MambaZero) matching the full model's performance while removing convolution causes failure.

5 retrieved papers
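One way to see why a short convolution suffices on first-order Markovian data (an illustrative reconstruction, not the paper's proof): a width-2 causal filter over one-hot tokens can pair each token with its predecessor, and summing those pair indicators over time yields exactly the transition counts that Laplacian smoothing normalizes.

```python
import numpy as np

def transition_counts_via_pairs(tokens, num_states):
    """Pair each one-hot token with its predecessor (the statistic a
    width-2 causal convolution can expose) and sum the resulting
    transition indicators over time: counts[i, j] = #(i -> j)."""
    onehot = np.eye(num_states)[tokens]                  # (T, k)
    pairs = onehot[:-1, :, None] * onehot[1:, None, :]   # (T-1, k, k)
    return pairs.sum(axis=0)
```

Normalizing these counts row-wise, with a pseudo-count added to each entry, recovers the Laplacian smoothing estimator; this is the role the ablations attribute to Mamba's convolution.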

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
