From Markov to Laplace: How Mamba In-Context Learns Markov Chains

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: state-space models, Markov chains, in-context learning, Laplacian smoothing
Abstract:

While transformer-based language models have driven the AI revolution thus far, their computational complexity has spurred growing interest in viable alternatives, such as structured state space sequence models (SSMs) and Selective SSMs. Among these, Mamba (S6) and its variant Mamba-2 have shown remarkable inference speed-ups over transformers while achieving comparable or superior performance on complex language modeling tasks. However, despite these architectural innovations and empirical successes, the fundamental learning capabilities of Mamba remain poorly understood. In this paper, we address this gap by studying in-context learning (ICL) on Markov chains and uncovering an interesting phenomenon: even a single-layer Mamba efficiently learns the in-context Laplacian smoothing estimator, which is both Bayes and minimax optimal. To explain this, we theoretically characterize the representation capacity of Mamba and reveal the fundamental role of convolution in enabling it to represent the optimal Laplacian smoothing. These theoretical insights align strongly with empirical results and, to the best of our knowledge, represent the first formal connection between Mamba and optimal statistical estimators. Finally, we outline promising research directions inspired by these findings.
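For reference, the Laplacian (add-beta) smoothing estimator discussed in the abstract adds a pseudo-count to every transition count before normalizing. A minimal sketch for a first-order chain (the function name and default beta are illustrative, not from the paper):

```python
import numpy as np

def laplacian_smoothing(seq, num_states, beta=1.0):
    """Add-beta (Laplacian) smoothed transition-probability estimate
    for a first-order Markov chain observed as a single sequence."""
    counts = np.zeros((num_states, num_states))
    for s, s_next in zip(seq[:-1], seq[1:]):
        counts[s, s_next] += 1
    # P_hat(j | i) = (n_ij + beta) / (n_i + beta * num_states)
    return (counts + beta) / (counts.sum(axis=1, keepdims=True) + beta * num_states)
```

With beta = 1 this is classic add-one smoothing; the paper's claim is that a single-layer Mamba's in-context predictions converge to this estimator on random Markov chains.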

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper establishes a formal connection between single-layer Mamba and the Bayes-optimal Laplacian smoothing estimator for Markov chains, demonstrating that Mamba efficiently learns this estimator through in-context learning. Within the taxonomy, it resides in the 'Optimal Statistical Estimation and Markov Chain Learning' leaf under 'Theoretical Foundations and Representation Capacity'. This leaf contains only two papers total, including one sibling work focused on ICL outlier robustness, indicating a relatively sparse but emerging research direction at the intersection of state-space models and statistical optimality theory.

The taxonomy reveals that theoretical work on Mamba spans two main directions: optimal estimation (where this paper sits) and memory mechanisms/state-space realizations. Neighboring branches examine architectural comparisons with Transformers and efficiency analysis, while application domains explore imitation learning and computer vision. The paper's focus on Markov chain learning and convolution's role in representation capacity distinguishes it from sibling work on outlier robustness, and from broader architectural studies that lack formal statistical optimality proofs. The taxonomy's scope and exclude notes clarify that this work provides theoretical guarantees rather than empirical benchmarking.

Among the twenty-one candidates examined across the three contributions, no clearly refutable prior work was identified. For the first contribution (Laplacian smoothing characterization), six candidates were examined with zero refutations; for the second (representation capacity theory), ten; for the third (convolution's architectural role), five. This suggests that, within the limited search scope (primarily top-K semantic matches and citation expansion), the specific combination of the Mamba architecture, Markov chain ICL, and formal optimality proofs appears relatively unexplored, though the search does not claim exhaustive coverage of all related theoretical work.

Based on the limited literature search of twenty-one candidates, the paper appears to occupy a novel position connecting Mamba's architectural properties to optimal statistical estimation on Markov chains. The sparse taxonomy leaf and absence of refutable candidates within the examined scope suggest this formal theoretical angle is underexplored. However, the analysis covers semantic neighbors and citations rather than exhaustive field coverage, leaving open the possibility of related work in adjacent theoretical communities not captured by this search methodology.

Taxonomy

Core-task Taxonomy Papers: 9
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 0

Research Landscape Overview

Core task: in-context learning capabilities of Mamba on Markov chains. The field explores how state-space models like Mamba perform in-context learning, i.e., adapting to new tasks from demonstration sequences without parameter updates. The taxonomy organizes work into three main branches: Theoretical Foundations and Representation Capacity examines statistical optimality and expressiveness guarantees; Architectural Analysis and Comparisons contrasts Mamba with Transformers and hybrid designs; and Application Domains investigates deployment in areas such as imitation learning, continual learning, and specialized prediction tasks. Representative works span from theoretical studies like Mamba ICL Outliers[8] to applied systems such as Mamba Temporal Imitation[1] and Mamba Location Prediction[6], illustrating both the breadth of architectural inquiry and the diversity of downstream uses.

A particularly active line of work focuses on understanding Mamba's statistical efficiency and memory mechanisms. Some studies probe optimal estimation rates on structured sequences, while others compare Mamba's selective state-space design against Transformer attention or hybrid architectures like BMOJO Hybrid Memory[3].

The original paper, Mamba Markov Chains[0], sits squarely within the Theoretical Foundations branch, specifically addressing optimal statistical estimation and Markov chain learning. It shares thematic ground with Mamba ICL Outliers[8], which also investigates in-context learning properties, but emphasizes rigorous analysis of learning dynamics on Markov chains rather than outlier robustness. Meanwhile, applied works such as Mamba Temporal Imitation[1] and Mail Imitation Learning[4] demonstrate how these theoretical insights translate into practical imitation-learning systems, highlighting an ongoing dialogue between foundational guarantees and real-world performance trade-offs.

Claimed Contributions

Characterization of Mamba's in-context learning of Laplacian smoothing on Markov chains

The authors demonstrate empirically that a single-layer Mamba model learns the optimal Laplacian smoothing estimator (which is both Bayes and minimax optimal) when trained on random Markov chains of various orders, exhibiting strong in-context learning capabilities.

6 retrieved papers
Theoretical characterization of Mamba's representation capacity for optimal estimators

The authors provide a constructive theoretical proof showing how Mamba can represent the Laplacian smoothing estimator for finite-state first-order Markov processes, highlighting the interplay of convolution, selectivity, and recurrence. They also establish fundamental limits on hidden dimension requirements for higher-order processes.

10 retrieved papers
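To make the interplay of recurrence and selectivity concrete, here is a didactic sketch (not the paper's construction) of an input-dependent state-space recurrence of the form h_t = A(x_t) * h_{t-1} + B(x_t) * x_t; all names are illustrative. With A and B held at 1, the hidden state simply accumulates its inputs, which is exactly the kind of running-count statistic a Laplacian smoothing estimator needs:

```python
import numpy as np

def selective_scan(xs, A_fn, B_fn):
    """Minimal selective SSM recurrence: h_t = A(x_t) * h_{t-1} + B(x_t) * x_t.
    A_fn and B_fn make the transition input-dependent ("selective")."""
    h = np.zeros_like(xs[0], dtype=float)
    hs = []
    for x_t in xs:
        h = A_fn(x_t) * h + B_fn(x_t) * x_t
        hs.append(h.copy())
    return np.array(hs)
```

For example, running this over one-hot token vectors with A_fn and B_fn both returning 1.0 makes the final state the per-symbol occurrence counts of the sequence.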
Identification of convolution as the key architectural component

Through ablation studies, the authors show that convolution is the most critical architectural component in Mamba for learning optimal estimators on Markovian data, with a simplified variant containing only convolution (MambaZero) matching the full model's performance while removing convolution causes failure.

5 retrieved papers
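One way to see why a short convolution suffices on first-order Markovian data (an illustrative reconstruction, not the paper's proof): a width-2 causal filter over one-hot tokens can pair each token with its predecessor, and summing those pair indicators over time yields exactly the transition counts that Laplacian smoothing normalizes.

```python
import numpy as np

def transition_counts_via_pairs(tokens, num_states):
    """Pair each one-hot token with its predecessor (the statistic a
    width-2 causal convolution can expose) and sum the resulting
    transition indicators over time: counts[i, j] = #(i -> j)."""
    onehot = np.eye(num_states)[tokens]                  # (T, k)
    pairs = onehot[:-1, :, None] * onehot[1:, None, :]   # (T-1, k, k)
    return pairs.sum(axis=0)
```

Normalizing these counts row-wise, with a pseudo-count added to each entry, recovers the Laplacian smoothing estimator; this is the role the ablations attribute to Mamba's convolution.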

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
