Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: chain of continuous thought, training dynamics, reasoning, superposition
Abstract:

Previous work shows that the chain of continuous thought (continuous CoT) improves the reasoning capability of large language models (LLMs) by enabling implicit parallel thinking, and subsequent work provided theoretical insight by showing that a two-layer transformer equipped with continuous CoT can efficiently solve directed graph reachability by maintaining a superposition of multiple reasoning traces in the continuous thought. However, it remains unclear how the superposition mechanism is naturally learned through gradient-based training. To fill this gap, we theoretically analyze the training dynamics of a simplified two-layer transformer on the directed graph reachability problem to unveil how the superposition mechanism emerges across two training stages: (i) a thought-generation stage that autoregressively expands the continuous thought, and (ii) a prediction stage that converts the thought into the final answer. Our analysis reveals that during training with continuous thought, the index-matching logit, a quantity that reflects the strength of the model's local search ability, first increases and then remains bounded under mild assumptions. The bounded index-matching logit effectively balances exploration and exploitation during reasoning: the model exploits local problem structure to identify plausible search traces, and assigns comparable weights to multiple such traces to explore when it is uncertain which one is correct, which results in superposition. Our experimental results tracking the growth of the logits further validate our theory.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes a theoretical analysis of how continuous chain-of-thought mechanisms emerge during gradient-based training in two-layer transformers solving directed graph reachability. It resides in the 'Training Dynamics and Convergence Analysis' leaf under 'Theoretical Foundations of CoT Mechanisms,' alongside three sibling papers examining gradient dynamics and convergence properties. This leaf represents a moderately populated research direction within the broader taxonomy of 50 papers across 36 topics, indicating focused but not overcrowded attention to training dynamics questions in continuous CoT.

The taxonomy reveals neighboring theoretical branches including 'Expressivity and Computational Power' (four papers proving what transformers can solve with CoT) and 'Superposition and Parallel Reasoning Theory' (one paper on maintaining multiple traces). The paper bridges these areas by explaining how superposition—previously shown to enable parallel reasoning—actually emerges through training. Nearby practical branches like 'Continuous CoT Architectures and Training' (three papers on model implementations) and 'Latent-Variable CoT Training' (one paper on unsupervised optimization) address related but distinct questions about architecture design and training objectives rather than gradient dynamics.

Among the 21 candidates examined across the three contributions, no clear refutations emerged. For the core contribution on training dynamics, 10 candidates were analyzed and none provided overlapping prior work; for the bounded index-matching-logit behavior, 1 candidate was examined without refutation; and for the superposition-emergence explanation, 10 candidates were reviewed, again with no direct overlap. This limited search scope, focused on top semantic matches and citations, suggests that the specific combination of continuous CoT, training dynamics, and superposition emergence may occupy relatively unexplored theoretical territory, though the analysis cannot claim exhaustive coverage of the gradient dynamics literature.

Based on examination of 21 semantically related papers, the work appears to address a gap between expressivity proofs and empirical continuous CoT implementations by analyzing how training naturally discovers superposition mechanisms. The bounded search scope means potentially relevant work in broader optimization theory or neural tangent kernel analyses may exist outside the examined candidates. The taxonomy positioning and sibling paper analysis suggest this represents a natural theoretical extension within an active but not saturated research direction.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 0

Research Landscape Overview

Core task: training dynamics of chain of continuous thought in transformers. The field has evolved from early discrete prompting methods like Chain of Thought Prompting[1] toward a richer landscape that spans multiple paradigms. At the top level, the taxonomy distinguishes Continuous Latent Reasoning Paradigms, where models learn implicit reasoning steps in hidden representations, from Discrete CoT Prompting and Inference Methods, which rely on explicit token sequences. Theoretical Foundations of CoT Mechanisms investigates the expressive power and convergence properties underlying these approaches, while CoT Training and Optimization Methods addresses how to effectively learn reasoning behaviors. Architectural Innovations for Reasoning explores modifications such as looped or recurrent structures (e.g., Looped Transformers[4]), and World Models and Planning-Based Reasoning connects transformers to decision-making frameworks like Reasoning as Planning[3]. Meanwhile, Transformer Learning Dynamics and Mechanisms examines gradient flow, feature evolution, and emergent behaviors during training, and Specialized Transformer Applications adapts these ideas to domains ranging from vision to reinforcement learning.

Within this landscape, a particularly active line of work focuses on understanding how transformers internalize multi-step reasoning during training. Superposition Training Dynamics[0] sits squarely in the Theoretical Foundations branch under Training Dynamics and Convergence Analysis, examining how reasoning emerges through superposed representations over the course of optimization. This contrasts with neighboring studies like Nonlinear Transformers CoT[28], which explores architectural nonlinearities to enhance reasoning capacity, and Kinetics of Reasoning[49], which applies statistical-physics perspectives to characterize the evolution of reasoning states.

Other closely related efforts include Multi Step Gradient Descent[37], which models iterative refinement processes, and works on continuous latent reasoning such as Continuous Latent Reasoning[2] and Scaling Latent Reasoning[33], which emphasize learning implicit thought chains without discrete tokens. Together, these studies reveal ongoing tensions between discrete and continuous representations, between architectural depth and recurrence, and between training objectives and the reasoning capabilities that emerge from them.

Claimed Contributions

Theoretical analysis of training dynamics for continuous chain-of-thought

The authors provide a theoretical analysis of how gradient-based training naturally leads to the superposition mechanism in continuous chain-of-thought models. They analyze two training stages: thought generation and prediction, revealing how the model learns to maintain multiple reasoning traces in parallel.

10 retrieved papers
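To make the "multiple reasoning traces in parallel" idea concrete, the following toy sketch (our illustration with one-hot node embeddings, not the paper's actual construction or training setup) stores a BFS-style frontier of reachable vertices as a single superposed vector:

```python
# Toy model: a "continuous thought" as a superposition of node embeddings
# while the frontier of reachable vertices expands one hop per step.
# The graph `edges` and one-hot embeddings are illustrative assumptions.
import numpy as np

n_nodes = 6
emb = np.eye(n_nodes)  # one-hot node embeddings (exactly orthogonal)
edges = {0: [1, 2], 1: [3], 2: [4], 3: [5], 4: [], 5: []}

def expand(frontier):
    """One reasoning step: keep the frontier and add all one-hop successors."""
    new = set(frontier)
    for v in frontier:
        new.update(edges[v])
    return new

frontier = {0}
for _ in range(2):  # two "thought" steps starting from node 0
    frontier = expand(frontier)

# Continuous thought = uniform superposition over the frontier's embeddings.
thought = emb[sorted(frontier)].mean(axis=0)
scores = emb @ thought
# With orthogonal embeddings, frontier members score 1/|frontier| and all
# other nodes score 0, so every partial search trace coexists in one vector.
print(sorted(frontier), np.round(scores, 3))
```

Under these idealized orthogonal embeddings the frontier is exactly recoverable from the thought vector by thresholding the scores; with learned, non-orthogonal embeddings the readout is only approximate, which is part of what the paper's analysis must handle.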
Discovery of bounded index-matching logit behavior

The authors discover that the index-matching logit, which quantifies local search capability, grows initially but remains bounded during training with continuous CoT. This bounded behavior contrasts with unbounded logit growth in discrete settings and enables effective exploration-exploitation balance.

1 retrieved paper
Explanation of superposition emergence through bounded logits

The authors explain how bounded index-matching logits lead to superposition by balancing exploration and exploitation. When logits remain bounded, the model assigns comparable weights to multiple plausible reasoning paths rather than over-committing to a single path, naturally producing the superposition mechanism.

10 retrieved papers
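The exploration-exploitation effect of the logit bound can be sketched numerically. In this toy example (our illustration, not the paper's proof), two outgoing edges are almost equally consistent with the current frontier (match strengths 1.0 and 0.9) and one edge is mismatched (0.0); the scalar `c` stands in for the index-matching logit magnitude:

```python
# Bounded vs. large logits: softmax attention over candidate edges.
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

# Match strengths of three candidate edges (illustrative assumption):
# two nearly tied plausible edges and one mismatch.
match = np.array([1.0, 0.9, 0.0])

# c plays the role of the index-matching logit scale.
weights = {c: softmax(c * match) for c in (2.0, 50.0)}
print({c: np.round(w, 3) for c, w in weights.items()})
# c = 2 (bounded):  weights ~ [0.512, 0.419, 0.069] -- both plausible
#   edges keep comparable mass, so their traces survive in superposition.
# c = 50 (unbounded growth): weights ~ [0.993, 0.007, 0.000] -- attention
#   collapses onto a single path, forfeiting parallel exploration.
```

This is the qualitative picture behind the claimed contribution: keeping the index-matching logit bounded keeps the softmax soft, which is what distributes weight across multiple plausible reasoning traces.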

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
