Transformers Learn Latent Mixture Models In-Context via Mirror Descent

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: in-context learning, Markov chain, transformers, mirror descent, mixture models, latent variables
Abstract:

Sequence modelling requires determining which past tokens are causally relevant from the context and their importance: a process inherent to the attention layers in transformers, yet whose underlying learned mechanisms remain poorly understood. In this work, we formalize the task of estimating token importance as an in-context learning problem by introducing a novel framework based on Mixture of Transition Distributions, whereby a latent variable, whose distribution is parameterized by a set of unobserved mixture weights, determines the influence of past tokens on the next. To correctly predict the next token, transformers need to learn the mixture weights in-context. We demonstrate that transformers can implement Mirror Descent to learn the mixture weights from the context. To this end, we give an explicit construction of a three-layer transformer that exactly implements one step of Mirror Descent and prove that the resulting estimator is a first-order approximation of the Bayes-optimal predictor. Corroborating our construction and its learnability via gradient descent, we empirically show that transformers trained from scratch converge to this solution: attention maps match our construction, and deeper models’ performance aligns with multi-step Mirror Descent.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper formalizes in-context learning of token importance as inference over latent mixture weights in a Mixture of Transition Distributions framework. It resides in the 'Mirror Descent and Optimization-Based ICL Analysis' leaf under 'Theoretical Foundations of In-Context Learning for Mixture Models'. Notably, this leaf contains only the original paper itself—no sibling papers appear in the taxonomy. This isolation suggests the optimization-based mechanistic interpretation of in-context learning for mixture models represents a relatively sparse research direction within the broader field of 19 papers surveyed.

The taxonomy reveals three sibling leaves within Theoretical Foundations: Bayesian Inference Perspectives (1 paper), Mixture of Linear Regressions ICL Theory (1 paper), and Latent Variable Inference and ICL Effectiveness (1 paper). These neighboring directions explore alternative theoretical lenses—probabilistic inference, regression-specific analysis, and the relationship between latent recovery and performance—rather than explicit optimization algorithms. The paper's focus on mirror descent as the mechanistic substrate distinguishes it from these Bayesian and regression-focused frameworks, while the broader taxonomy shows substantial activity in empirical HMM studies (4 papers) and neural architectures (3 papers), indicating the field balances theory with applied sequential modeling.

Among 26 candidates examined across three contributions, none were flagged as clearly refuting the paper's claims. The MTD framework examined 10 candidates with zero refutations; the three-layer transformer construction examined 10 candidates with zero refutations; the Bayes-optimal connection examined 6 candidates with zero refutations. This absence of overlapping prior work within the limited search scope suggests the specific combination—mixture of transition distributions, explicit transformer construction for mirror descent, and first-order Bayes-optimality proof—has not been directly addressed in the top-30 semantic matches and their citations. However, the search scale is modest and does not cover the full optimization or transformer theory literature.

Given the limited search scope and the paper's unique position as the sole occupant of its taxonomy leaf, the work appears to introduce a novel mechanistic perspective on in-context learning. The optimization-based framing contrasts with existing Bayesian and empirical approaches in neighboring leaves, and the explicit construction offers a concrete algorithmic interpretation. Nonetheless, the analysis reflects top-30 semantic candidates and does not exhaustively survey the broader optimization theory or transformer interpretability communities, leaving open the possibility of related work outside this search boundary.

Taxonomy

Core-task Taxonomy Papers: 19
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 0

Research Landscape Overview

Core task: learning latent mixture weights in-context for sequence modeling. The field explores how models—ranging from classical probabilistic frameworks to modern neural architectures—can infer and adapt latent structure on the fly as they process sequential data. The taxonomy organizes this landscape into four main branches. Theoretical Foundations of In-Context Learning for Mixture Models investigates the algorithmic principles underlying in-context adaptation, often connecting transformer mechanics to optimization procedures such as mirror descent or Bayesian inference (e.g., Implicit Bayesian Inference[4]). Hidden Markov Models and Sequential Latent Structure Learning encompasses classical and contextual HMM variants (Contextual Hidden Markov[8], Online Contextual HMM[10]) that explicitly model latent state transitions and mixture dynamics. Neural Architectures for Sequential Latent Variable Modeling includes recurrent and variational approaches (Latent LSTM Allocation[11], Stochastic WaveNet[3]) that embed latent variables within deep networks. Large-Scale Architectures and Applications covers modern transformer-based systems and mixture-of-experts designs (Glam[1]) that scale in-context learning to real-world tasks, bridging theory and practice.

Recent work has intensified around understanding whether and how large language models implicitly perform structured inference over latent variables. A handful of studies (LLMs Learn HMMs[5], Right Latent Variables[6]) demonstrate that transformers can recover hidden Markov structure or mixture components without explicit probabilistic machinery, raising questions about the representational capacity and inductive biases of attention mechanisms.

The original paper, Transformers Mirror Descent[0], sits squarely within the optimization-based analysis branch of Theoretical Foundations, arguing that transformer layers implement steps of mirror descent when learning mixture weights in-context.
This perspective complements probabilistic views like Implicit Bayesian Inference[4] and contrasts with empirical investigations such as LLMs Learn HMMs[5], which focus on emergent capabilities rather than mechanistic interpretations. Together, these lines of work reveal an active debate over whether in-context learning is best understood through the lens of optimization, Bayesian updating, or emergent neural computation.

Claimed Contributions

MTD framework for in-context learning of latent mixture weights

The authors introduce a framework using Mixture of Transition Distributions (MTD) models to formalize token importance estimation as an in-context learning problem. In this framework, latent mixture weights determine the influence of past tokens, and transformers must learn these weights from context to predict the next token correctly.

10 retrieved papers
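The report does not reproduce the paper's exact parameterization, but the MTD contribution can be illustrated with the classical Mixture of Transition Distributions form, in which the next-token law is p(x_t = s | past) = Σ_j w_j Q[x_{t-j}, s]. A minimal sketch, where the vocabulary size `S`, lag count `L`, and shared transition matrix `Q` are illustrative assumptions rather than the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

S, L = 4, 3                            # vocabulary size and number of lags (assumed)
Q = rng.dirichlet(np.ones(S), size=S)  # shared transition matrix; rows sum to 1
w = rng.dirichlet(np.ones(L))          # latent mixture weights over lags (simplex)

def next_token_dist(context, Q, w):
    """MTD next-token law: p(x_t = s | past) = sum_j w[j] * Q[x_{t-j}, s]."""
    return sum(w[j] * Q[context[-(j + 1)]] for j in range(len(w)))

# Sample a sequence from the MTD chain.
x = list(rng.integers(0, S, size=L))   # arbitrary initial context
for _ in range(50):
    p = next_token_dist(x, Q, w)
    x.append(rng.choice(S, p=p))
```

Under this framing, the transformer's in-context task is to recover `w` from the observed prefix, since `w` alone determines how strongly each lag influences the next token.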
Explicit three-layer transformer construction implementing one-step Mirror Descent

The authors provide a constructive proof showing that a three-layer disentangled transformer can exactly implement one step of the Mirror Descent algorithm for learning mixture weights. This construction demonstrates how attention mechanisms can compute posterior responsibilities and produce estimates matching the Mirror Descent update rule.

10 retrieved papers
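The report does not state the construction's exact step size or gradient form, but mirror descent on the simplex with the negative-entropy mirror map is the standard multiplicative-weights update, w_j ∝ w_j exp(-η ∇_j ℓ(w)). A hedged sketch of one such step for the MTD negative log-likelihood, with `eta` and the likelihood terms as assumptions:

```python
import numpy as np

def mirror_descent_step(w, seq, Q, eta=0.1):
    """One negative-entropy mirror descent (multiplicative-weights) step on the
    lag weights of an MTD model.

    w   : current weights on the simplex, shape (L,)
    seq : observed token sequence (list of ints)
    Q   : shared transition matrix, Q[a, b] = P(next = b | prev = a)
    """
    L = len(w)
    grad = np.zeros(L)
    for t in range(L, len(seq)):
        # Per-lag likelihood of token t: Q[x_{t-j}, x_t] for lags j = 1..L.
        lik = np.array([Q[seq[t - (j + 1)], seq[t]] for j in range(L)])
        denom = w @ lik              # mixture likelihood of token t
        grad += -lik / denom         # d/dw_j of -log p(x_t | past)
    w_new = w * np.exp(-eta * grad)  # exponentiated-gradient update
    return w_new / w_new.sum()       # renormalize onto the simplex
```

The quantities `lik / denom` scaled by `w` are exactly the posterior responsibilities the contribution summary mentions, which is what makes this update natural for attention heads to compute.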
Theoretical connection between one-step Mirror Descent and Bayes-optimal predictor

The authors establish that the one-step Mirror Descent estimator serves as a first-order approximation to the Bayes-optimal predictor. They prove that the Taylor expansions of both estimators coincide around the no-evidence regime, providing theoretical justification for why this simple non-iterative procedure achieves good performance.

6 retrieved papers
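The report does not give either estimator in closed form, so the following LaTeX sketch only illustrates the generic mechanism behind such first-order agreements; the prior \(\pi\), step size \(\eta\), and gradient \(g\) are placeholder symbols, not the paper's notation. A multiplicative (mirror descent) update on \(\pi\) linearizes around the no-evidence regime \(\eta \to 0\):

```latex
\[
  \hat{w}^{\mathrm{MD}}_j
  \;=\; \frac{\pi_j \, e^{\eta g_j}}{\sum_k \pi_k \, e^{\eta g_k}}
  \;=\; \pi_j \Bigl( 1 + \eta \bigl( g_j - \textstyle\sum_k \pi_k g_k \bigr) \Bigr)
        + O(\eta^2).
\]
```

The claimed result is that the Taylor expansion of the Bayes-optimal predictor around the same no-evidence point shares this linear term, so the one-step estimator and the Bayes posterior differ only at second order.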

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: MTD framework for in-context learning of latent mixture weights

Contribution 2: Explicit three-layer transformer construction implementing one-step Mirror Descent

Contribution 3: Theoretical connection between one-step Mirror Descent and Bayes-optimal predictor
