Factuality Matters: When Image Generation and Editing Meet Structured Visuals

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Generative Modeling, Unified Model, Image Editing, Text-to-Image Generation, Benchmark
Abstract:

While modern visual generation models excel at creating aesthetically pleasing natural images, they struggle with producing or editing structured visuals like charts, diagrams, and mathematical figures, which demand composition planning, text rendering, and multimodal reasoning for factual fidelity. To address this, we present the first comprehensive, systematic investigation of this domain, encompassing data construction, model training, and an evaluation benchmark. First, we construct a large-scale dataset of 1.3 million high-quality structured image pairs derived from executable drawing programs and augmented with chain-of-thought reasoning annotations. Leveraging this dataset, we train a unified model that integrates a multimodal language model with FLUX.1-Kontext via a lightweight connector for enhanced multimodal understanding. A three-stage training curriculum enables progressive feature alignment, knowledge infusion, and reasoning-augmented generation, further boosted by an external reasoner at inference time. Finally, we introduce StructBench, a novel benchmark for generation and editing with over 2,000 challenging samples, and an accompanying evaluation metric, StructScore, which employs a multi-round Q&A protocol to assess fine-grained factual accuracy. Evaluations of 15 models reveal that even state-of-the-art systems score below 50%, while our model achieves the strongest open-source performance, with consistent gains from inference-time reasoning. By releasing our dataset, model, and benchmark, we aim to advance unified multimodal foundations for structured visuals.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 0

Research Landscape Overview

Core task: structured image generation and editing. The field organizes itself around several complementary branches that reflect different ways of imposing structure on generative models. Text-Guided Generation and Editing Frameworks (e.g., VQGAN CLIP[1], Prompt to Prompt[37]) focus on leveraging natural language to steer synthesis, while Spatially-Constrained and Compositional Generation (e.g., BoxDiff[4], Composer[12]) emphasizes explicit layout or bounding-box controls to arrange multiple objects. Subject-Driven and Consistency-Preserving Editing (e.g., MasaCtrl[13], BLIP Diffusion[14]) targets identity preservation and fine-grained attribute manipulation. Domain-Specific Synthesis with Structural Constraints addresses specialized applications such as medical imaging (e.g., CT Ultrasound Synthesis[23]) or industrial defect generation (e.g., Industrial Defect Generation[49]), where domain priors guide the output. Constrained Generation with Physical and Semantic Priors incorporates scene graphs, depth maps, or other intermediate representations (e.g., Scene Graphs Survey[8], Semantic Diffusion Guidance[15]) to enforce realism. Finally, Auxiliary Capabilities and Supporting Technologies provide foundational tools—such as 3D-aware latents (Structured 3D Latents[2], SV3D[3])—that enable richer structural control across these branches. A particularly active line of work explores how to balance flexibility with fidelity: some methods prioritize training-free or plug-and-play guidance (e.g., Training Free Structured[29], Semantic Diffusion Guidance[15]), while others learn specialized modules for compositional reasoning (e.g., MIGC Plus[50], Composer[12]). Trade-offs between user control granularity and model complexity remain central, as do questions about how to verify that generated content respects factual or physical constraints. 
Factuality Structured Visuals[0] sits within the Structured Visual Generation with Factual Constraints branch, specifically under Factual Fidelity and Reasoning-Augmented Generation. It shares thematic ground with works like Text to Diagram[11], which also emphasizes correctness and structured reasoning, but distinguishes itself by foregrounding explicit factual verification mechanisms. Compared to domain-specific approaches such as IC SEM Synthesis[5], Factuality Structured Visuals[0] appears to target broader applicability across diverse factual scenarios, positioning it as a bridge between general-purpose text-guided frameworks and specialized constraint-driven synthesis.

Claimed Contributions

Large-scale structured image dataset with chain-of-thought annotations

The authors build a dataset of 1.3 million structured image pairs (charts, diagrams, mathematical figures) generated from executable code. Each sample includes both text-to-image prompts and editing instructions, along with GPT-5-generated chain-of-thought reasoning trajectories that provide explicit analysis and planning steps.

10 retrieved papers
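The pairing of executable drawing programs with editing instructions described above can be sketched as follows. This is an illustrative toy, not the paper's pipeline: a minimal "drawing program" renders a bar-chart spec to SVG, and rendering the spec before and after a scripted edit yields a (source image, edited image, instruction) triple whose factual content is known by construction. The renderer, spec fields, and edit function are all assumptions for illustration.

```python
import copy

def render_bar_chart_svg(spec):
    """Toy stand-in for a real renderer (e.g., matplotlib or TikZ):
    turn a bar-chart spec into a small SVG string."""
    color = spec["color"]
    rects = []
    for i, (_label, value) in enumerate(spec["bars"]):
        x, height = 10 + i * 30, value * 10
        rects.append(
            f'<rect x="{x}" y="{100 - height}" width="20" '
            f'height="{height}" fill="{color}"/>'
        )
    body = "".join(rects)
    return f'<svg width="120" height="110">{body}</svg>'

def make_editing_pair(spec, edit_fn, instruction):
    """Apply a programmatic edit to the spec and render both versions,
    producing a ground-truth editing sample."""
    edited = edit_fn(copy.deepcopy(spec))
    return {
        "source_svg": render_bar_chart_svg(spec),
        "edited_svg": render_bar_chart_svg(edited),
        "instruction": instruction,
    }

def bump_bar_b(spec):
    spec["bars"][1] = ("B", 7)  # the scripted edit: change one bar's value
    return spec

spec = {"bars": [("A", 3), ("B", 5), ("C", 2)], "color": "steelblue"}
pair = make_editing_pair(
    spec, bump_bar_b, "Increase the value of bar B from 5 to 7."
)
print(pair["instruction"])
```

Because the edit is applied to code rather than pixels, the instruction is guaranteed to match the visual difference, which is what makes program-derived pairs attractive for factual supervision.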
Unified model with three-stage progressive training curriculum

The authors propose a three-stage training pipeline that progressively aligns multimodal features from Qwen-VL with FLUX.1-Kontext via a lightweight MLP connector, infuses structured-visual knowledge, and incorporates chain-of-thought reasoning to enable inference-time scaling with an external reasoner.

10 retrieved papers
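The "lightweight MLP connector" idea can be sketched as a two-layer projection from MLLM hidden states into the diffusion backbone's conditioning space. A minimal NumPy sketch follows; the dimensions (3584 for Qwen-VL-style hidden states, 4096 for FLUX-style conditioning), the activation, and the architecture are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

class MLPConnector:
    """Two-layer MLP mapping multimodal token states (d_in) to
    diffusion conditioning tokens (d_out). Dimensions are illustrative."""

    def __init__(self, d_in=3584, d_hidden=4096, d_out=4096):
        # Scaled random init; a trained connector would learn these weights.
        self.w1 = rng.standard_normal((d_in, d_hidden)) * (d_in ** -0.5)
        self.w2 = rng.standard_normal((d_hidden, d_out)) * (d_hidden ** -0.5)

    def __call__(self, h):
        # h: (seq_len, d_in) hidden states from the MLLM
        z = np.maximum(h @ self.w1, 0.0)  # ReLU here; GELU/SiLU in practice
        return z @ self.w2                # (seq_len, d_out) conditioning tokens

tokens = rng.standard_normal((16, 3584))  # 16 multimodal tokens from the MLLM
connector = MLPConnector()
cond = connector(tokens)
print(cond.shape)  # (16, 4096)
```

Keeping the connector small means stage-one training can align features while both the MLLM and the diffusion backbone stay frozen, which is the usual motivation for this design.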
StructBench benchmark and StructScore evaluation metric

The authors introduce StructBench, a benchmark with over 2,000 samples across six categories for structured image generation and editing. They also propose StructScore, a novel metric that uses VLMs in a multi-round question-answer protocol to evaluate fine-grained factual accuracy and reduce hallucinations compared to naive VLM-as-a-Judge approaches.

4 retrieved papers
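The multi-round Q&A protocol behind a StructScore-style metric can be sketched as: decompose factual checking into atomic questions, query a VLM judge one question per round, and report the fraction answered correctly. In the sketch below, `ask_vlm` is a hypothetical stand-in for a real VLM API call, and the questions and answers are invented for illustration.

```python
def ask_vlm(image, question):
    """Placeholder judge: a real system would send (image, question) to a
    vision-language model and parse a short free-form answer."""
    fake_answers = {
        "How many bars are in the chart?": "3",
        "What is the title of the chart?": "Sales by region",
        "What color are the bars?": "red",
    }
    return fake_answers.get(question, "unknown")

def struct_score(image, qa_pairs):
    """Fraction of atomic factual questions the judge answers correctly.
    Asking one narrow question per round limits the compounding
    hallucinations seen with a single holistic VLM-as-a-Judge prompt."""
    correct = 0
    for question, reference in qa_pairs:
        answer = ask_vlm(image, question)
        correct += int(answer.strip().lower() == reference.strip().lower())
    return correct / len(qa_pairs)

qa = [
    ("How many bars are in the chart?", "3"),
    ("What is the title of the chart?", "Sales by region"),
    ("What color are the bars?", "steelblue"),
]
print(struct_score("generated.png", qa))  # 2 of 3 answers match the reference
```

A production metric would also need answer normalization (numbers, synonyms) and question generation from the ground-truth drawing program, both of which are elided here.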

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Large-scale structured image dataset with chain-of-thought annotations

The authors build a dataset of 1.3 million structured image pairs (charts, diagrams, mathematical figures) generated from executable code. Each sample includes both text-to-image prompts and editing instructions, along with GPT-5-generated chain-of-thought reasoning trajectories that provide explicit analysis and planning steps.

Contribution

Unified model with three-stage progressive training curriculum

The authors propose a three-stage training pipeline that progressively aligns multimodal features from Qwen-VL with FLUX.1-Kontext via a lightweight MLP connector, infuses structured-visual knowledge, and incorporates chain-of-thought reasoning to enable inference-time scaling with an external reasoner.

Contribution

StructBench benchmark and StructScore evaluation metric

The authors introduce StructBench, a benchmark with over 2,000 samples across six categories for structured image generation and editing. They also propose StructScore, a novel metric that uses VLMs in a multi-round question-answer protocol to evaluate fine-grained factual accuracy and reduce hallucinations compared to naive VLM-as-a-Judge approaches.