Zebra-CoT: A Dataset for Interleaved Vision-Language Reasoning
Research Landscape Overview
Claimed Contributions
The authors present Zebra-CoT, a large-scale dataset of 182,384 interleaved text-and-image reasoning traces spanning four major categories (scientific questions, 2D visual reasoning, 3D visual reasoning, and visual logic and strategic games), 18 domains, and more than 50 distinct tasks. The dataset is designed specifically to train models to natively perform visual Chain-of-Thought (Visual CoT) reasoning.
The authors develop a data curation pipeline that combines real-world problems from multiple domains with synthetic examples generated through programmatic construction, simulation, and graphics rendering. The pipeline uses frontier vision-language models to enrich reasoning traces and to ensure a clear logical flow between textual reasoning steps and their accompanying visual aids.
The authors design scaffolding experiments that incrementally provide ground-truth multimodal reasoning steps as context to frontier models. These experiments demonstrate both the difficulty of the dataset and the value of Visual CoT: accuracy improves substantially as more visual reasoning steps are supplied.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[2] Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought
[7] Simple o3: Towards Interleaved Vision-Language Reasoning
[16] ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
[25] Interleaved-Modal Chain-of-Thought
Contribution Analysis
Detailed comparisons for each claimed contribution
Zebra-CoT dataset for interleaved vision-language reasoning
The authors present Zebra-CoT, a large-scale dataset of 182,384 interleaved text-and-image reasoning traces spanning four major categories (scientific questions, 2D visual reasoning, 3D visual reasoning, and visual logic and strategic games), 18 domains, and more than 50 distinct tasks. The dataset is designed specifically to train models to natively perform Visual CoT reasoning. (A minimal sketch of the interleaved record format follows the comparison list below.)
[8] ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models
[10] Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios
[16] ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
[59] LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
[60] MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts
[61] EO-1: Interleaved Vision-Text-Action Pretraining for General Robot Control
[62] LongVideoBench: A Benchmark for Long-Context Interleaved Video-Language Understanding
[63] Interleaving Reasoning for Better Text-to-Image Generation
[64] Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models
[65] GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
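To make the interleaved trace format concrete, here is a minimal sketch of what a Zebra-CoT-style record and a simple trace walker might look like. The field names (`problem`, `trace`, `answer`) and the alternating text/image step layout are illustrative assumptions based on the paper's description, not the released schema.

```python
from PIL import Image

# Illustrative (assumed) record layout: a problem statement, a final answer,
# and a reasoning trace that alternates textual thoughts with visual aids.
record = {
    "problem": "Which piece should White move to force mate in two?",
    "trace": [
        {"type": "text",  "content": "Consider checks that restrict the king."},
        {"type": "image", "content": Image.new("RGB", (256, 256))},  # board sketch
        {"type": "text",  "content": "After the check, the king must go to the corner."},
        {"type": "image", "content": Image.new("RGB", (256, 256))},  # updated board
    ],
    "answer": "Qg4+ followed by Qg7#",
}

def walk_trace(rec):
    """Print each step of an interleaved text/image reasoning trace."""
    for step in rec["trace"]:
        if step["type"] == "text":
            print("THOUGHT:", step["content"])
        else:
            print("VISUAL :", step["content"].size)  # PIL image dimensions

walk_trace(record)
```

The key property this layout captures is that visual aids are first-class reasoning steps, interleaved in sequence with text rather than attached as static problem inputs.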
Data curation pipeline for diverse visual CoT
The authors develop a data curation pipeline that combines real-world problems from multiple domains with synthetic examples generated through programmatic construction, simulation, and graphics rendering. The pipeline uses frontier vision-language models to enrich reasoning traces and to ensure a clear logical flow between textual reasoning steps and their accompanying visual aids. (An illustrative sketch of the synthetic arm of this pipeline follows the comparison list below.)
[66] LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
[67] Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos
[68] Scalable Vision Language Model Training via High Quality Data Curation
[69] Visual Persona: Foundation Model for Full-Body Human Customization
[70] GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
[71] Text-Driven Adaptation of Foundation Models for Few-Shot Surgical Workflow Analysis
[72] IntentTuner: An Interactive Framework for Integrating Human Intentions in Fine-Tuning Text-to-Image Generative Models
[73] Eye Tracking-Enhanced Deep Learning for Medical Image Analysis: A Systematic Review on Data Efficiency, Interpretability, and Multimodal Integration
[74] Visual-Language Reasoning Large Language Models for Primary Care: Advancing Clinical Decision Support Through Multimodal AI
[75] MMLU-Reason: Benchmarking Multi-Task Multi-modal Language Understanding and Reasoning
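The synthetic arm of the pipeline can be pictured as a solve-render-annotate loop: a programmatic solver advances the task state, each intermediate state is rendered as an image, and a VLM writes the connecting textual step. The sketch below is a hypothetical reconstruction of that loop; `solve_step`, `render_state`, and `vlm_annotate` are placeholder callables, not functions from the paper's codebase.

```python
from dataclasses import dataclass, field

@dataclass
class Trace:
    problem: str
    steps: list = field(default_factory=list)  # alternating text / image entries

def build_synthetic_trace(problem, initial_state, solve_step, render_state, vlm_annotate):
    """Run a programmatic solver, emitting a visual aid and a textual
    reasoning step for every intermediate state."""
    trace = Trace(problem=problem)
    state = initial_state
    while state is not None:
        image = render_state(state)              # simulation / graphics rendering
        thought = vlm_annotate(problem, image)   # VLM supplies the textual step
        trace.steps += [{"type": "text", "content": thought},
                        {"type": "image", "content": image}]
        state = solve_step(state)                # advance the solver; None when done
    return trace
```

Because the solver is programmatic, the visual steps are correct by construction; the VLM is used only to phrase the connecting text, which keeps annotation cost low while preserving logical flow.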
Scaffolding experiments demonstrating visual CoT value
The authors design scaffolding experiments that incrementally provide ground-truth multimodal reasoning steps as context to frontier models. These experiments demonstrate both the difficulty of the dataset and the value of Visual CoT: accuracy improves substantially as more visual reasoning steps are supplied.
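A minimal sketch of the scaffolding protocol, assuming the record layout shown earlier; `query_model` is a hypothetical wrapper around a frontier VLM API, and the paper's exact prompting details are not reproduced here.

```python
def scaffold_accuracy(examples, query_model, k):
    """Final-answer accuracy when the first k gold (text, image) reasoning
    step pairs are prepended to the prompt as scaffolding."""
    correct = 0
    for ex in examples:
        context = ex["trace"][: 2 * k]  # k text steps + k visual steps
        prediction = query_model(problem=ex["problem"], context=context)
        correct += int(prediction.strip() == ex["answer"].strip())
    return correct / len(examples)

# Sweeping k from 0 (no scaffold) upward traces out how much each additional
# multimodal step contributes, e.g.:
#   for k in range(5):
#       print(k, scaffold_accuracy(eval_set, query_model, k))
```

A rising accuracy curve over k is the evidence the authors cite: if providing gold visual steps substantially helps frontier models, the tasks genuinely require visual reasoning that those models cannot yet produce on their own.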