Zebra-CoT: A Dataset for Interleaved Vision-Language Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: visual chain of thought, interleaved text and image generation, multimodal reasoning
Abstract:

Humans often rely on visual aids, such as diagrams or sketches, when tackling complex problems. Teaching multimodal models to adopt similar strategies, a process known as Visual Chain of Thought (visual CoT), is considerably harder. The main challenges are (1) the weak performance of off-the-shelf visual CoT, which hinders reinforcement learning, and (2) the lack of high-quality visual CoT training data. We introduce Zebra-CoT, a diverse, large-scale interleaved text-image reasoning dataset with 182,384 reasoning traces across 18 domains and over 50 distinct tasks, specifically designed to train models to natively perform visual CoT. We emphasize four categories of tasks where sketching or visual reasoning is especially natural: (a) scientific questions such as geometry, physics, and algorithms; (b) 2D visual reasoning tasks like visual search and jigsaw puzzles; (c) 3D reasoning tasks including 3D multi-hop inference and embodied and robot planning; and (d) visual logic problems and strategic games like chess. Fine-tuning the Anole-7B model on Zebra-CoT yields a +12% improvement in test-set accuracy and up to +13% gains on standard VLM benchmarks. Similarly, fine-tuning Bagel-7B produces models capable of generating high-quality interleaved visual reasoning chains, underscoring Zebra-CoT's effectiveness in advancing multimodal reasoning.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: interleaved vision-language reasoning with visual chain of thought. This field explores how models can generate and leverage intermediate visual representations—such as sketches, diagrams, or annotated images—alongside textual reasoning steps to solve complex multimodal problems. The taxonomy reveals several complementary directions: Visual Chain-of-Thought Frameworks and Architectures develop core methods for interleaving visual and linguistic reasoning (e.g., Zebra-CoT[0], Visual Thoughts[2], Whiteboard-of-Thought[1]); Domain-Specific Applications adapt these techniques to areas like autonomous driving and mathematical problem-solving; Training Data Synthesis and Supervision Strategies address how to create or curate datasets that support visual reasoning traces; Benchmarking and Evaluation branches establish metrics and testbeds for assessing visual chain-of-thought capabilities; Foundational Multimodal Reasoning examines broader integration of knowledge and modalities; Grounding and Segmentation branches connect reasoning to pixel-level understanding; and Visual Programming frameworks treat reasoning as executable code over visual inputs (e.g., Visual Programming[11]).

A particularly active line of work focuses on frameworks that generate intermediate visual artifacts during reasoning. Zebra-CoT[0] sits within the Interleaved Multimodal Reasoning Frameworks cluster, emphasizing the production of visual chain-of-thought steps that bridge perception and symbolic reasoning. Nearby approaches like Vocot[3] and MIRA[5] similarly explore how to scaffold reasoning with visual intermediates, though they may differ in whether they rely on retrieval, generation, or annotation mechanisms. Another contrasting theme emerges in methods like Simple o3[7] and ThinkMorph[16], which investigate how visual transformations or iterative refinement can support multi-step problem-solving.
Open questions persist around the faithfulness of generated visual reasoning traces, the trade-offs between end-to-end learning and modular pipelines, and how to scale supervision for complex visual chain-of-thought without prohibitive annotation costs.

Claimed Contributions

Zebra-CoT dataset for interleaved vision-language reasoning

The authors present Zebra-CoT, a large-scale dataset containing 182,384 interleaved text and image reasoning traces spanning four major categories (scientific questions, 2D visual reasoning, 3D visual reasoning, and visual logic and strategic games) across 18 domains with over 50 distinct tasks. This dataset is specifically designed to train models to natively perform visual Chain of Thought reasoning.

10 retrieved papers
Data curation pipeline for diverse visual CoT

The authors developed a comprehensive data curation pipeline that combines real-world problems from multiple domains with synthetic examples generated through computer programming, simulation, and graphic rendering. The pipeline leverages frontier vision-language models to enrich reasoning traces and ensure clear logical flow between textual reasoning and visual aids.
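To make the synthetic branch of such a pipeline concrete, the sketch below builds an interleaved text-image reasoning trace for a toy grid-navigation task, with programmatically rendered grid snapshots standing in for actual images. This is an illustrative mock-up under assumed conventions, not the paper's pipeline; the names `render_grid` and `make_trace` are hypothetical.

```python
# Hypothetical sketch of a synthetic visual-CoT generator: each reasoning
# step pairs a short textual thought with a programmatically rendered
# "image" (here a text grid standing in for a rendered picture).

def render_grid(size, pos, goal):
    """Render the grid state; '@' marks the agent, 'G' the goal."""
    rows = []
    for r in range(size):
        row = []
        for c in range(size):
            if (r, c) == pos:
                row.append("@")
            elif (r, c) == goal:
                row.append("G")
            else:
                row.append(".")
        rows.append(" ".join(row))
    return "\n".join(rows)

def make_trace(size, start, goal):
    """Build an interleaved (text, image) trace by walking toward the goal."""
    trace = []
    pos = start
    while pos != goal:
        # Greedy step: close the row distance first, then the column distance.
        dr = (goal[0] > pos[0]) - (goal[0] < pos[0])
        dc = ((goal[1] > pos[1]) - (goal[1] < pos[1])) if dr == 0 else 0
        pos = (pos[0] + dr, pos[1] + dc)
        thought = f"Move to {pos}, closing in on the goal at {goal}."
        trace.append({"text": thought, "image": render_grid(size, pos, goal)})
    return trace

trace = make_trace(4, (0, 0), (3, 3))
print(len(trace))           # number of interleaved reasoning steps
print(trace[-1]["text"])    # final textual step of the trace
```

A real pipeline would replace the text grid with a graphics renderer or simulator and, as the report notes, pass the resulting trace through a frontier VLM to enrich the textual reasoning.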

10 retrieved papers
Scaffolding experiments demonstrating visual CoT value

The authors design scaffolding experiments that incrementally provide multimodal reasoning steps as context to frontier models. These experiments demonstrate the challenging nature of the dataset and the value of visual CoT, showing substantial accuracy improvements when visual reasoning steps are provided.
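The scaffolding protocol described above can be sketched as an evaluation loop that reveals the first k reasoning steps as context and measures accuracy at each budget k. This is a hypothetical mock-up: `query_model` is a stub standing in for a real frontier-model API call, and the examples are toy data.

```python
# Hypothetical sketch of a scaffolding evaluation: for each budget k, the
# first k interleaved reasoning steps are revealed as context, and accuracy
# is measured over the test set.

def query_model(question, context_steps):
    # Stub: pretend the model answers correctly once enough steps are shown.
    return "correct" if len(context_steps) >= 2 else "wrong"

def scaffold_accuracy(examples, k):
    """Accuracy when the first k reasoning steps are provided as context."""
    hits = 0
    for ex in examples:
        answer = query_model(ex["question"], ex["steps"][:k])
        hits += answer == ex["answer"]
    return hits / len(examples)

examples = [
    {"question": "q1", "steps": ["s1", "s2", "s3"], "answer": "correct"},
    {"question": "q2", "steps": ["s1", "s2"], "answer": "correct"},
]
for k in range(4):
    print(k, scaffold_accuracy(examples, k))
```

The claimed finding corresponds to accuracy rising with k: withholding the visual reasoning steps makes the tasks hard, and supplying them recovers much of the lost accuracy.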

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Zebra-CoT dataset for interleaved vision-language reasoning

The authors present Zebra-CoT, a large-scale dataset containing 182,384 interleaved text and image reasoning traces spanning four major categories (scientific questions, 2D visual reasoning, 3D visual reasoning, and visual logic and strategic games) across 18 domains with over 50 distinct tasks. This dataset is specifically designed to train models to natively perform visual Chain of Thought reasoning.

Contribution

Data curation pipeline for diverse visual CoT

The authors developed a comprehensive data curation pipeline that combines real-world problems from multiple domains with synthetic examples generated through computer programming, simulation, and graphic rendering. The pipeline leverages frontier vision-language models to enrich reasoning traces and ensure clear logical flow between textual reasoning and visual aids.

Contribution

Scaffolding experiments demonstrating visual CoT value

The authors design scaffolding experiments that incrementally provide multimodal reasoning steps as context to frontier models. These experiments demonstrate the challenging nature of the dataset and the value of visual CoT, showing substantial accuracy improvements when visual reasoning steps are provided.