Transformers Learn Latent Mixture Models In-Context via Mirror Descent

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: in-context learning, Markov chain, transformers, mirror descent, mixture models, latent variables
Abstract:

Sequence modelling requires determining which past tokens are causally relevant from the context and their importance: a process inherent to the attention layers in transformers, yet whose underlying learned mechanisms remain poorly understood. In this work, we formalize the task of estimating token importance as an in-context learning problem by introducing a novel framework based on Mixture of Transition Distributions, whereby a latent variable, whose distribution is parameterized by a set of unobserved mixture weights, determines the influence of past tokens on the next. To correctly predict the next token, transformers need to learn the mixture weights in-context. We demonstrate that transformers can implement Mirror Descent to learn the mixture weights from the context. To this end, we give an explicit construction of a three-layer transformer that exactly implements one step of Mirror Descent and prove that the resulting estimator is a first-order approximation of the Bayes-optimal predictor. Corroborating our construction and its learnability via gradient descent, we empirically show that transformers trained from scratch converge to this solution: attention maps match our construction, and deeper models’ performance aligns with multi-step Mirror Descent.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper formalizes in-context learning of token importance as inference over latent mixture weights in a Mixture of Transition Distributions framework. It resides in the 'Mirror Descent and Optimization-Based ICL Analysis' leaf under 'Theoretical Foundations of In-Context Learning for Mixture Models'. Notably, this leaf contains only the original paper itself—no sibling papers appear in the taxonomy. This isolation suggests the optimization-based mechanistic interpretation of in-context learning for mixture models represents a relatively sparse research direction within the broader field of 19 papers surveyed.

The taxonomy reveals three sibling leaves within Theoretical Foundations: Bayesian Inference Perspectives (1 paper), Mixture of Linear Regressions ICL Theory (1 paper), and Latent Variable Inference and ICL Effectiveness (1 paper). These neighboring directions explore alternative theoretical lenses—probabilistic inference, regression-specific analysis, and the relationship between latent recovery and performance—rather than explicit optimization algorithms. The paper's focus on mirror descent as the mechanistic substrate distinguishes it from these Bayesian and regression-focused frameworks, while the broader taxonomy shows substantial activity in empirical HMM studies (4 papers) and neural architectures (3 papers), indicating the field balances theory with applied sequential modeling.

Among 26 candidates examined across three contributions, none were flagged as clearly refuting the paper's claims. The MTD framework examined 10 candidates with zero refutations; the three-layer transformer construction examined 10 candidates with zero refutations; the Bayes-optimal connection examined 6 candidates with zero refutations. This absence of overlapping prior work within the limited search scope suggests the specific combination—mixture of transition distributions, explicit transformer construction for mirror descent, and first-order Bayes-optimality proof—has not been directly addressed in the top-30 semantic matches and their citations. However, the search scale is modest and does not cover the full optimization or transformer theory literature.

Given the limited search scope and the paper's unique position as the sole occupant of its taxonomy leaf, the work appears to introduce a novel mechanistic perspective on in-context learning. The optimization-based framing contrasts with existing Bayesian and empirical approaches in neighboring leaves, and the explicit construction offers a concrete algorithmic interpretation. Nonetheless, the analysis reflects top-30 semantic candidates and does not exhaustively survey the broader optimization theory or transformer interpretability communities, leaving open the possibility of related work outside this search boundary.

Taxonomy

Core-task Taxonomy Papers: 19
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 0

Research Landscape Overview

Core task: learning latent mixture weights in-context for sequence modeling. The field explores how models—ranging from classical probabilistic frameworks to modern neural architectures—can infer and adapt latent structure on the fly as they process sequential data. The taxonomy organizes this landscape into four main branches. Theoretical Foundations of In-Context Learning for Mixture Models investigates the algorithmic principles underlying in-context adaptation, often connecting transformer mechanics to optimization procedures such as mirror descent or Bayesian inference (e.g., Implicit Bayesian Inference[4]). Hidden Markov Models and Sequential Latent Structure Learning encompasses classical and contextual HMM variants (Contextual Hidden Markov[8], Online Contextual HMM[10]) that explicitly model latent state transitions and mixture dynamics. Neural Architectures for Sequential Latent Variable Modeling includes recurrent and variational approaches (Latent LSTM Allocation[11], Stochastic WaveNet[3]) that embed latent variables within deep networks. Large-Scale Architectures and Applications covers modern transformer-based systems and mixture-of-experts designs (Glam[1]) that scale in-context learning to real-world tasks, bridging theory and practice.

Recent work has intensified around understanding whether and how large language models implicitly perform structured inference over latent variables. A handful of studies (LLMs Learn HMMs[5], Right Latent Variables[6]) demonstrate that transformers can recover hidden Markov structure or mixture components without explicit probabilistic machinery, raising questions about the representational capacity and inductive biases of attention mechanisms.

The original paper, Transformers Mirror Descent[0], sits squarely within the optimization-based analysis branch of Theoretical Foundations, arguing that transformer layers implement steps of mirror descent when learning mixture weights in-context.
This perspective complements probabilistic views like Implicit Bayesian Inference[4] and contrasts with empirical investigations such as LLMs Learn HMMs[5], which focus on emergent capabilities rather than mechanistic interpretations. Together, these lines of work reveal an active debate over whether in-context learning is best understood through the lens of optimization, Bayesian updating, or emergent neural computation.

Claimed Contributions

MTD framework for in-context learning of latent mixture weights

The authors introduce a framework using Mixture of Transition Distributions (MTD) models to formalize token importance estimation as an in-context learning problem. In this framework, latent mixture weights determine the influence of past tokens, and transformers must learn these weights from context to predict the next token correctly.

10 retrieved papers
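The report does not reproduce the paper's exact parameterization, but the MTD contribution can be illustrated with the classical Mixture of Transition Distributions form, in which the next-token law is p(x_t = s | past) = Σ_j w_j Q[x_{t-j}, s]. A minimal sketch, where the vocabulary size `S`, lag count `L`, and shared transition matrix `Q` are illustrative assumptions rather than the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

S, L = 4, 3                            # vocabulary size and number of lags (assumed)
Q = rng.dirichlet(np.ones(S), size=S)  # shared transition matrix; rows sum to 1
w = rng.dirichlet(np.ones(L))          # latent mixture weights over lags (simplex)

def next_token_dist(context, Q, w):
    """MTD next-token law: p(x_t = s | past) = sum_j w[j] * Q[x_{t-j}, s]."""
    return sum(w[j] * Q[context[-(j + 1)]] for j in range(len(w)))

# Sample a sequence from the MTD chain.
x = list(rng.integers(0, S, size=L))   # arbitrary initial context
for _ in range(50):
    p = next_token_dist(x, Q, w)
    x.append(rng.choice(S, p=p))
```

Under this framing, the transformer's in-context task is to recover `w` from the observed prefix, since `w` alone determines how strongly each lag influences the next token.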
Explicit three-layer transformer construction implementing one-step Mirror Descent

The authors provide a constructive proof showing that a three-layer disentangled transformer can exactly implement one step of the Mirror Descent algorithm for learning mixture weights. This construction demonstrates how attention mechanisms can compute posterior responsibilities and produce estimates matching the Mirror Descent update rule.

10 retrieved papers
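The report does not state the construction's exact step size or gradient form, but mirror descent on the simplex with the negative-entropy mirror map is the standard multiplicative-weights update, w_j ∝ w_j exp(-η ∇_j ℓ(w)). A hedged sketch of one such step for the MTD negative log-likelihood, with `eta` and the likelihood terms as assumptions:

```python
import numpy as np

def mirror_descent_step(w, seq, Q, eta=0.1):
    """One negative-entropy mirror descent (multiplicative-weights) step on the
    lag weights of an MTD model.

    w   : current weights on the simplex, shape (L,)
    seq : observed token sequence (list of ints)
    Q   : shared transition matrix, Q[a, b] = P(next = b | prev = a)
    """
    L = len(w)
    grad = np.zeros(L)
    for t in range(L, len(seq)):
        # Per-lag likelihood of token t: Q[x_{t-j}, x_t] for lags j = 1..L.
        lik = np.array([Q[seq[t - (j + 1)], seq[t]] for j in range(L)])
        denom = w @ lik              # mixture likelihood of token t
        grad += -lik / denom         # d/dw_j of -log p(x_t | past)
    w_new = w * np.exp(-eta * grad)  # exponentiated-gradient update
    return w_new / w_new.sum()       # renormalize onto the simplex
```

The quantities `lik / denom` scaled by `w` are exactly the posterior responsibilities the contribution summary mentions, which is what makes this update natural for attention heads to compute.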
Theoretical connection between one-step Mirror Descent and Bayes-optimal predictor

The authors establish that the one-step Mirror Descent estimator serves as a first-order approximation to the Bayes-optimal predictor. They prove that the Taylor expansions of both estimators coincide around the no-evidence regime, providing theoretical justification for why this simple non-iterative procedure achieves good performance.

6 retrieved papers
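The report does not give either estimator in closed form, so the following LaTeX sketch only illustrates the generic mechanism behind such first-order agreements; the prior \(\pi\), step size \(\eta\), and gradient \(g\) are placeholder symbols, not the paper's notation. A multiplicative (mirror descent) update on \(\pi\) linearizes around the no-evidence regime \(\eta \to 0\):

```latex
\[
  \hat{w}^{\mathrm{MD}}_j
  \;=\; \frac{\pi_j \, e^{\eta g_j}}{\sum_k \pi_k \, e^{\eta g_k}}
  \;=\; \pi_j \Bigl( 1 + \eta \bigl( g_j - \textstyle\sum_k \pi_k g_k \bigr) \Bigr)
        + O(\eta^2).
\]
```

The claimed result is that the Taylor expansion of the Bayes-optimal predictor around the same no-evidence point shares this linear term, so the one-step estimator and the Bayes posterior differ only at second order.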

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: MTD framework for in-context learning of latent mixture weights

Contribution 2: Explicit three-layer transformer construction implementing one-step Mirror Descent

Contribution 3: Theoretical connection between one-step Mirror Descent and Bayes-optimal predictor
