Zebra-CoT: A Dataset for Interleaved Vision-Language Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: visual chain of thought, interleaved text and image generation, multimodal reasoning
Abstract:

Humans often rely on visual aids, such as diagrams or sketches, when tackling complex problems. Teaching multimodal models to adopt similar strategies, a process known as Visual Chain of Thought (visual CoT), is considerably harder. The main challenges are (1) the weak performance of off-the-shelf visual CoT, which hinders reinforcement learning, and (2) the lack of high-quality visual CoT training data. We introduce Zebra-CoT, a diverse, large-scale interleaved text-image reasoning dataset with 182,384 reasoning traces across 18 domains and over 50 distinct tasks, specifically designed to train models to natively perform visual CoT. We emphasize four categories of tasks where sketching or visual reasoning is especially natural: (a) scientific questions such as geometry, physics, and algorithms; (b) 2D visual reasoning tasks like visual search and jigsaw puzzles; (c) 3D reasoning tasks including 3D multi-hop inference and embodied and robot planning; and (d) visual logic problems and strategic games like chess. Fine-tuning the Anole-7B model on Zebra-CoT yields a +12% improvement in test-set accuracy and up to +13% gains on standard VLM benchmarks. Similarly, fine-tuning Bagel-7B produces models capable of generating high-quality interleaved visual reasoning chains, underscoring Zebra-CoT's effectiveness in advancing multimodal reasoning.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: interleaved vision-language reasoning with visual chain of thought. This field explores how models can generate and leverage intermediate visual representations—such as sketches, diagrams, or annotated images—alongside textual reasoning steps to solve complex multimodal problems. The taxonomy reveals several complementary directions: Visual Chain-of-Thought Frameworks and Architectures develop core methods for interleaving visual and linguistic reasoning (e.g., Zebra-CoT[0], Visual Thoughts[2], Whiteboard-of-Thought[1]); Domain-Specific Applications adapt these techniques to areas like autonomous driving and mathematical problem-solving; Training Data Synthesis and Supervision Strategies address how to create or curate datasets that support visual reasoning traces; Benchmarking and Evaluation branches establish metrics and testbeds for assessing visual chain-of-thought capabilities; Foundational Multimodal Reasoning examines broader integration of knowledge and modalities; Grounding and Segmentation branches connect reasoning to pixel-level understanding; and Visual Programming frameworks treat reasoning as executable code over visual inputs (e.g., Visual Programming[11]).

A particularly active line of work focuses on frameworks that generate intermediate visual artifacts during reasoning. Zebra-CoT[0] sits within the Interleaved Multimodal Reasoning Frameworks cluster, emphasizing the production of visual chain-of-thought steps that bridge perception and symbolic reasoning. Nearby approaches like Vocot[3] and MIRA[5] similarly explore how to scaffold reasoning with visual intermediates, though they may differ in whether they rely on retrieval, generation, or annotation mechanisms. Another contrasting theme emerges in methods like Simple o3[7] and ThinkMorph[16], which investigate how visual transformations or iterative refinement can support multi-step problem-solving.
Open questions persist around the faithfulness of generated visual reasoning traces, the trade-offs between end-to-end learning and modular pipelines, and how to scale supervision for complex visual chain-of-thought without prohibitive annotation costs.

Claimed Contributions

Zebra-CoT dataset for interleaved vision-language reasoning

The authors present Zebra-CoT, a large-scale dataset containing 182,384 interleaved text and image reasoning traces spanning four major categories (scientific questions, 2D visual reasoning, 3D visual reasoning, and visual logic and strategic games) across 18 domains with over 50 distinct tasks. This dataset is specifically designed to train models to natively perform visual Chain of Thought reasoning.

10 retrieved papers
Data curation pipeline for diverse visual CoT

The authors developed a comprehensive data curation pipeline that combines real-world problems from multiple domains with synthetic examples generated through computer programming, simulation, and graphic rendering. The pipeline leverages frontier vision-language models to enrich reasoning traces and ensure clear logical flow between textual reasoning and visual aids.
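To make the synthetic branch of such a pipeline concrete, the sketch below builds an interleaved text-image reasoning trace for a toy grid-navigation task, with programmatically rendered grid snapshots standing in for actual images. This is an illustrative mock-up under assumed conventions, not the paper's pipeline; the names `render_grid` and `make_trace` are hypothetical.

```python
# Hypothetical sketch of a synthetic visual-CoT generator: each reasoning
# step pairs a short textual thought with a programmatically rendered
# "image" (here a text grid standing in for a rendered picture).

def render_grid(size, pos, goal):
    """Render the grid state; '@' marks the agent, 'G' the goal."""
    rows = []
    for r in range(size):
        row = []
        for c in range(size):
            if (r, c) == pos:
                row.append("@")
            elif (r, c) == goal:
                row.append("G")
            else:
                row.append(".")
        rows.append(" ".join(row))
    return "\n".join(rows)

def make_trace(size, start, goal):
    """Build an interleaved (text, image) trace by walking toward the goal."""
    trace = []
    pos = start
    while pos != goal:
        # Greedy step: close the row distance first, then the column distance.
        dr = (goal[0] > pos[0]) - (goal[0] < pos[0])
        dc = ((goal[1] > pos[1]) - (goal[1] < pos[1])) if dr == 0 else 0
        pos = (pos[0] + dr, pos[1] + dc)
        thought = f"Move to {pos}, closing in on the goal at {goal}."
        trace.append({"text": thought, "image": render_grid(size, pos, goal)})
    return trace

trace = make_trace(4, (0, 0), (3, 3))
print(len(trace))           # number of interleaved reasoning steps
print(trace[-1]["text"])    # final textual step of the trace
```

A real pipeline would replace the text grid with a graphics renderer or simulator and, as the report notes, pass the resulting trace through a frontier VLM to enrich the textual reasoning.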

10 retrieved papers
Scaffolding experiments demonstrating visual CoT value

The authors design scaffolding experiments that incrementally provide multimodal reasoning steps as context to frontier models. These experiments demonstrate the challenging nature of the dataset and the value of visual CoT, showing substantial accuracy improvements when visual reasoning steps are provided.
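The scaffolding protocol described above can be sketched as an evaluation loop that reveals the first k reasoning steps as context and measures accuracy at each budget k. This is a hypothetical mock-up: `query_model` is a stub standing in for a real frontier-model API call, and the examples are toy data.

```python
# Hypothetical sketch of a scaffolding evaluation: for each budget k, the
# first k interleaved reasoning steps are revealed as context, and accuracy
# is measured over the test set.

def query_model(question, context_steps):
    # Stub: pretend the model answers correctly once enough steps are shown.
    return "correct" if len(context_steps) >= 2 else "wrong"

def scaffold_accuracy(examples, k):
    """Accuracy when the first k reasoning steps are provided as context."""
    hits = 0
    for ex in examples:
        answer = query_model(ex["question"], ex["steps"][:k])
        hits += answer == ex["answer"]
    return hits / len(examples)

examples = [
    {"question": "q1", "steps": ["s1", "s2", "s3"], "answer": "correct"},
    {"question": "q2", "steps": ["s1", "s2"], "answer": "correct"},
]
for k in range(4):
    print(k, scaffold_accuracy(examples, k))
```

The claimed finding corresponds to accuracy rising with k: withholding the visual reasoning steps makes the tasks hard, and supplying them recovers much of the lost accuracy.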

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Zebra-CoT dataset for interleaved vision-language reasoning

The authors present Zebra-CoT, a large-scale dataset containing 182,384 interleaved text and image reasoning traces spanning four major categories (scientific questions, 2D visual reasoning, 3D visual reasoning, and visual logic and strategic games) across 18 domains with over 50 distinct tasks. This dataset is specifically designed to train models to natively perform visual Chain of Thought reasoning.

Contribution

Data curation pipeline for diverse visual CoT

The authors developed a comprehensive data curation pipeline that combines real-world problems from multiple domains with synthetic examples generated through computer programming, simulation, and graphic rendering. The pipeline leverages frontier vision-language models to enrich reasoning traces and ensure clear logical flow between textual reasoning and visual aids.

Contribution

Scaffolding experiments demonstrating visual CoT value

The authors design scaffolding experiments that incrementally provide multimodal reasoning steps as context to frontier models. These experiments demonstrate the challenging nature of the dataset and the value of visual CoT, showing substantial accuracy improvements when visual reasoning steps are provided.