Factuality Matters: When Image Generation and Editing Meet Structured Visuals
Overview
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors build a dataset of 1.3 million structured image pairs (charts, diagrams, mathematical figures) generated from executable code. Each sample includes both text-to-image prompts and editing instructions, along with GPT-5-generated chain-of-thought reasoning trajectories that provide explicit analysis and planning steps.
The authors propose a three-stage training pipeline that progressively aligns multimodal features from Qwen-VL with FLUX.1-Kontext via a lightweight MLP connector, infuses structured-visual knowledge, and incorporates chain-of-thought reasoning to enable inference-time scaling with an external reasoner.
The authors introduce StructBench, a benchmark with over 2,000 samples across six categories for structured image generation and editing. They also propose StructScore, a novel metric that uses VLMs in a multi-round question-answer protocol to evaluate fine-grained factual accuracy and reduce hallucinations compared to naive VLM-as-a-Judge approaches.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Large-scale structured image dataset with chain-of-thought annotations
The authors build a dataset of 1.3 million structured image pairs (charts, diagrams, mathematical figures) generated from executable code. Each sample includes both text-to-image prompts and editing instructions, along with GPT-5-generated chain-of-thought reasoning trajectories that provide explicit analysis and planning steps.
[55] STAR: A benchmark for situated reasoning in real-world videos
[56] EmbodiedGPT: Vision-language pre-training via embodied chain of thought
[57] ChartGemma: Visual instruction-tuning for chart reasoning in the wild
[58] Measuring and improving chain-of-thought reasoning in vision-language models
[59] Benchmarking multimodal CoT reward model stepwise by visual program
[60] CodePlot-CoT: Mathematical visual reasoning by thinking with code-driven images
[61] Simple o3: Towards Interleaved Vision-Language Reasoning
[62] AnnotatedTables: A Large Tabular Dataset with Language Model Annotations
[63] Translating a visual LEGO manual to a machine-executable plan
[64] MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning
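The code-driven data construction described above can be illustrated with a minimal sketch: each image is rendered from an explicit program or spec, so the "before" and "after" of an edit are factually consistent by construction. The SVG renderer, the `make_sample` fields, and the prompt templates below are illustrative assumptions, not the authors' actual pipeline.

```python
def render_bar_svg(labels, values, title, width=320, height=200):
    """Render a bar chart as an SVG string; the spec itself is the ground truth."""
    bar_w = width // max(len(values), 1)
    peak = max(values)
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">',
             f'<text x="10" y="16">{title}</text>']
    for i, (label, v) in enumerate(zip(labels, values)):
        h = int((height - 50) * v / peak)
        parts.append(f'<rect x="{i * bar_w + 8}" y="{height - 20 - h}" '
                     f'width="{bar_w - 16}" height="{h}"/>')
        parts.append(f'<text x="{i * bar_w + 8}" y="{height - 5}">{label}</text>')
    parts.append("</svg>")
    return "".join(parts)

def make_sample(labels, values, title, scale, new_title):
    """Pair a text-to-image prompt with an editing instruction and both renders."""
    edited = [v * scale for v in values]  # edit the underlying data, then re-render
    return {
        "prompt": f"A bar chart titled '{title}' with values {dict(zip(labels, values))}.",
        "image": render_bar_svg(labels, values, title),
        "edit_instruction": f"Scale every bar by {scale} and retitle the chart '{new_title}'.",
        "edited_image": render_bar_svg(labels, edited, new_title),
    }

sample = make_sample(["A", "B", "C"], [3, 5, 2], "Sales", 2, "Sales (doubled)")
```

Because the edit is applied to the data rather than to pixels, ground-truth answers about both images (bar counts, values, titles) are known exactly, which is what makes fine-grained factual evaluation possible downstream.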
Unified model with three-stage progressive training curriculum
The authors propose a three-stage training pipeline that progressively aligns multimodal features from Qwen-VL with FLUX.1-Kontext via a lightweight MLP connector, infuses structured-visual knowledge, and incorporates chain-of-thought reasoning to enable inference-time scaling with an external reasoner.
[65] Multimodal chain-of-thought reasoning in language models
[66] Aguvis: Unified pure vision agents for autonomous GUI interaction
[67] Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
[68] ReasonRec: A reasoning-augmented multimodal agent for unified recommendation
[69] Enhancing Spatial Reasoning in Multimodal Large Language Models through Reasoning-based Segmentation
[70] MUTEX: Learning unified policies from multimodal task specifications
[71] Unified multimodal chain-of-thought reward model through reinforcement fine-tuning
[72] TinyRS-R1: Compact Multimodal Language Model for Remote Sensing
[73] Universal Visuo-Tactile Video Understanding for Embodied Interaction
[74] Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing
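The first stage of the pipeline above hinges on a lightweight MLP connector that projects the understanding model's token features (e.g. from Qwen-VL) into the conditioning space expected by the generation backbone (e.g. FLUX.1-Kontext). The pure-Python sketch below shows only the shape of that idea; the real connector's depth, width, activation, and toy dimensions here are all assumptions.

```python
import math
import random

def gelu(x):
    """tanh approximation of GELU, a common choice for connector MLPs."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

class MLPConnector:
    """Two-layer MLP mapping d_vlm-dim token features to d_gen-dim features."""

    def __init__(self, d_vlm, d_hidden, d_gen, seed=0):
        rng = random.Random(seed)
        init = lambda d_in, d_out: [[rng.gauss(0.0, d_in ** -0.5) for _ in range(d_in)]
                                    for _ in range(d_out)]
        self.W1, self.b1 = init(d_vlm, d_hidden), [0.0] * d_hidden
        self.W2, self.b2 = init(d_hidden, d_gen), [0.0] * d_gen

    @staticmethod
    def _linear(x, W, b):
        """y = Wx + b for a single token vector x."""
        return [sum(wi * xi for wi, xi in zip(row, x)) + bj for row, bj in zip(W, b)]

    def __call__(self, tokens):
        """tokens: list of d_vlm-dim vectors -> list of d_gen-dim vectors."""
        out = []
        for x in tokens:
            h = [gelu(v) for v in self._linear(x, self.W1, self.b1)]
            out.append(self._linear(h, self.W2, self.b2))
        return out

# Toy dimensions for illustration; real models use thousands of channels.
connector = MLPConnector(d_vlm=8, d_hidden=16, d_gen=12)
features = connector([[0.1] * 8, [0.2] * 8])
```

Keeping the connector small is what lets stage one train the alignment cheaply while both the VLM and the generator stay frozen; later stages can then unfreeze more components.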
StructBench benchmark and StructScore evaluation metric
The authors introduce StructBench, a benchmark with over 2,000 samples across six categories for structured image generation and editing. They also propose StructScore, a novel metric that uses VLMs in a multi-round question-answer protocol to evaluate fine-grained factual accuracy and reduce hallucinations compared to naive VLM-as-a-Judge approaches.
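A metric in the spirit of StructScore can be sketched as follows: instead of one holistic VLM judgment, ask many atomic questions about the image and average per-question correctness, which localizes errors and limits hallucinated scores. The `ask_vlm` callable is a stub, and the question format and exact-match answer comparison are assumptions, not the authors' actual protocol.

```python
def struct_score(qa_pairs, ask_vlm):
    """Fraction of atomic questions a VLM answers correctly.

    qa_pairs: list of (question, expected_answer) derived from the image's
    source code/data, so expected answers are exact ground truth.
    ask_vlm: callable question -> model answer string.
    """
    if not qa_pairs:
        return 0.0
    correct = sum(
        1 for question, expected in qa_pairs
        if ask_vlm(question).strip().lower() == expected.strip().lower()
    )
    return correct / len(qa_pairs)

# Atomic facts about a rendered chart, known exactly from its generating code.
qa = [
    ("What is the title of the chart?", "Sales"),
    ("How many bars does the chart have?", "3"),
    ("Which bar is tallest?", "B"),
]

# Stub judge that answers two of the three questions correctly.
canned = {"What is the title of the chart?": "Sales",
          "How many bars does the chart have?": "3",
          "Which bar is tallest?": "A"}
score = struct_score(qa, lambda q: canned[q])  # 2 of 3 correct
```

In a real multi-round protocol the judge would see the generated image alongside each question; decomposing the evaluation this way also makes the score interpretable, since every lost point maps to a specific factual error.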