Long Chain-of-Thought Reasoning Across Languages

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Multilingual Reasoning, Long Chain-of-Thought
Abstract:

While large reasoning models have shown remarkable ability to generate long chains-of-thought (CoTs) in English, we still lack understanding of how these long-form reasoning abilities transfer to the vast majority of the world's languages. In this work, we systematically investigate four key stages of model development (scaling, pretraining, post-training, and inference) to understand how long CoT capabilities extend beyond English. We compare two reasoning settings across nine non-English target languages: En-CoT, where models process target-language inputs but reason in English; and Target-CoT, where models both process inputs and generate long CoTs in the target language. We find that scaling reasoning model size improves multilingual task performance in En-CoT, but Target-CoT performance lags behind. This gap widens for tasks requiring long, multi-step CoTs, such as mathematical reasoning. Shifting to pretraining, we find that adding a specialized reasoning stage enhances En-CoT performance but degrades Target-CoT, whereas broad multilingual pretraining improves both modes simultaneously. Given the scarcity of high-quality reasoning traces in languages other than English, we explore synthetic data curation approaches for post-training. We demonstrate that fine-tuning on reasoning traces automatically translated from gold English traces outperforms fine-tuning on target-language traces distilled from large reasoning models. Finally, we report disparities in inference efficiency between languages and uncover language-specific failure modes in CoTs. We release models, datasets, and code to foster further research.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper systematically investigates how long chain-of-thought reasoning capabilities extend beyond English across four model development stages: scaling, pretraining, post-training, and inference. It resides in the Supporting Techniques and Infrastructure leaf under Specialized Applications and Extensions, alongside four sibling papers focused on foundational techniques like embedding models, neuron localization, and rectification strategies. This leaf represents a relatively sparse research direction within the broader taxonomy of 50 papers, suggesting the paper addresses infrastructure-level challenges rather than core reasoning methods or evaluation benchmarks.

The taxonomy reveals that most multilingual reasoning research concentrates in Cross-Lingual Prompting and Alignment Strategies (14 papers) and Evaluation and Analysis (16 papers), with substantial work on prompting frameworks, benchmarking, and mechanism analysis. The paper's position in Supporting Techniques distinguishes it from these crowded areas, connecting instead to foundational infrastructure that enables reasoning across languages. Neighboring leaves include Domain-Specific Reasoning and Multimodal Multilingual Reasoning, which apply reasoning to specialized contexts, while the paper focuses on understanding how reasoning depth scales across linguistic boundaries through systematic stage-by-stage investigation.

Among 24 candidates examined across three contributions, none were identified as clearly refuting the paper's claims. The systematic investigation of long CoT reasoning examined 10 candidates with no refutations, synthetic data curation approaches examined 10 candidates with no refutations, and analysis of language-specific failure modes examined 4 candidates with no refutations. This suggests that within the limited search scope, the paper's focus on systematically comparing En-CoT versus Target-CoT across model development stages and its exploration of synthetic data curation for multilingual reasoning appear relatively distinct from examined prior work.

Based on the top-24 semantic matches examined, the paper's contributions appear to occupy a less-explored intersection of systematic stage-wise analysis and multilingual reasoning infrastructure. The analysis does not cover the full breadth of multilingual reasoning literature, particularly work published in specialized venues or non-English research communities. The taxonomy structure indicates the paper addresses foundational questions about how reasoning capabilities transfer across languages, complementing but not directly overlapping with the more populated areas of prompting methods and benchmark development.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 0

Research Landscape Overview

Core task: multilingual long chain-of-thought reasoning capabilities. The field has organized itself around several complementary branches that address different facets of enabling and understanding complex reasoning across languages. Cross-Lingual Prompting and Alignment Strategies explore how to elicit reasoning in diverse languages through prompt design and alignment techniques, with works like Cross-lingual Prompting[1] and xcot[4] investigating how to transfer reasoning patterns across linguistic boundaries. Training and Fine-Tuning Approaches focus on model adaptation methods, including distillation and specialized training regimes such as Multilingual CoT Reasoners[5] and AdaMCoT[23], which aim to build or enhance multilingual reasoning capacity. Code-Based and Program-Aided Reasoning examines hybrid methods that leverage executable code alongside natural language, exemplified by Program-of-Thought[19] and Structured Reasoning Code[21]. Evaluation and Analysis encompasses diagnostic studies like Demystifying Multilingual CoT[2] and Cross-lingual Collapse[9] that probe performance boundaries and failure modes. Finally, Specialized Applications and Extensions address domain-specific challenges and supporting infrastructure, ranging from medical reasoning to embedding models and neurosymbolic integration.

A particularly active line of work investigates the tension between language-agnostic reasoning and language-specific performance gaps, with studies like Not All Languages[10] and Performance Boundaries[7] revealing that reasoning quality varies substantially across languages. Within the Specialized Applications and Extensions branch, Long Chain-of-Thought[0] sits among supporting techniques that enable or enhance reasoning infrastructure, positioned near works like Query-Relevant Neurons[32] and Rectification Self-Optimization[35]. While Query-Relevant Neurons[32] focuses on identifying which model components are activated during reasoning tasks, Long Chain-of-Thought[0] emphasizes extending reasoning depth across multilingual contexts, complementing efforts like Embedding Models[39] that provide foundational representations. This cluster addresses the practical challenge of scaling reasoning chains in resource-constrained or linguistically diverse settings, bridging the gap between core reasoning methods and real-world deployment considerations.

Claimed Contributions

Systematic investigation of long CoT reasoning across languages

The authors conduct a comprehensive study examining how long chain-of-thought reasoning transfers to nine non-English languages across four model development stages. They introduce two reasoning settings (En-CoT and Target-CoT) to separately evaluate input comprehension and reasoning capabilities in target languages.

10 retrieved papers
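The distinction between the two settings can be made concrete. The following is an illustrative sketch, not the paper's actual evaluation code; the instruction strings and function name are hypothetical placeholders, and only the setting names (En-CoT, Target-CoT) come from the paper.

```python
def build_prompt(question: str, target_lang: str, setting: str) -> str:
    """Build an evaluation prompt for a target-language question.

    "en_cot":     the model reads the target-language input but is
                  instructed to produce its chain-of-thought in English.
    "target_cot": the model both reads the input and writes its long
                  chain-of-thought in the target language.
    """
    if setting == "en_cot":
        instruction = "Think step by step in English, then give your answer."
    elif setting == "target_cot":
        instruction = f"Think step by step in {target_lang}, then give your answer."
    else:
        raise ValueError(f"unknown setting: {setting}")
    return f"{question}\n\n{instruction}"
```

Keeping the question fixed while varying only the reasoning-language instruction is what lets the two settings separate input comprehension from reasoning ability.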
Synthetic data curation approaches for multilingual reasoning

The authors compare two methods for creating multilingual reasoning data: translating English reasoning traces versus directly distilling target-language traces from reasoning models. They show that translation-based approaches generally outperform distillation, especially for high-resource languages.

10 retrieved papers
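The two curation routes compared above can be sketched as follows. This is a hedged illustration, not the paper's pipeline: `translate` and `distill` are hypothetical stand-ins for a machine-translation system and a large reasoning model, respectively.

```python
def curate_by_translation(gold_english_traces, target_lang, translate):
    # Route 1 (reported to work better): machine-translate gold English
    # reasoning traces into the target language.
    return [translate(trace, target_lang) for trace in gold_english_traces]


def curate_by_distillation(questions, target_lang, distill):
    # Route 2: sample target-language reasoning traces directly from a
    # large reasoning model prompted in the target language.
    return [distill(question, target_lang) for question in questions]
```

The key design difference is where reasoning quality comes from: route 1 inherits it from verified English traces and bets on translation fidelity, while route 2 bets on the teacher model's native reasoning ability in the target language.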
Analysis of language-specific failure modes in long CoT

The authors provide an error taxonomy revealing that En-CoT failures primarily stem from reasoning mistakes, while Target-CoT exhibits higher rates of output generation errors and conceptual misunderstandings. They also demonstrate negative correlations between inference cost and performance across languages.

4 retrieved papers
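A negative correlation between inference cost and performance can be checked with a plain Pearson coefficient over per-language measurements. The sketch below uses made-up numbers purely for illustration; the paper's actual figures are not reproduced here.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Hypothetical per-language (CoT tokens generated, task accuracy) pairs:
tokens = [1200, 1500, 1800, 2100, 2600]
accuracy = [0.78, 0.74, 0.69, 0.61, 0.55]
r = pearson(tokens, accuracy)  # negative: longer CoTs, lower accuracy
```

A strongly negative `r` across languages would indicate that the languages requiring the most tokens to reason are also the ones where the model performs worst, which is the disparity the contribution describes.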

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Systematic investigation of long CoT reasoning across languages

The authors conduct a comprehensive study examining how long chain-of-thought reasoning transfers to nine non-English languages across four model development stages. They introduce two reasoning settings (En-CoT and Target-CoT) to separately evaluate input comprehension and reasoning capabilities in target languages.

Contribution

Synthetic data curation approaches for multilingual reasoning

The authors compare two methods for creating multilingual reasoning data: translating English reasoning traces versus directly distilling target-language traces from reasoning models. They show that translation-based approaches generally outperform distillation, especially for high-resource languages.

Contribution

Analysis of language-specific failure modes in long CoT

The authors provide an error taxonomy revealing that En-CoT failures primarily stem from reasoning mistakes, while Target-CoT exhibits higher rates of output generation errors and conceptual misunderstandings. They also demonstrate negative correlations between inference cost and performance across languages.