Factuality Matters: When Image Generation and Editing Meet Structured Visuals

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Generative Modeling, Unified Model, Image Editing, Text-to-Image Generation, Benchmark
Abstract:

While modern visual generation models excel at creating aesthetically pleasing natural images, they struggle with producing or editing structured visuals like charts, diagrams, and mathematical figures, which demand composition planning, text rendering, and multimodal reasoning for factual fidelity. To address this, we present the first comprehensive, systematic investigation of this domain, encompassing data construction, model training, and an evaluation benchmark. First, we construct a large-scale dataset of 1.3 million high-quality structured image pairs derived from executable drawing programs and augmented with chain-of-thought reasoning annotations. Leveraging this dataset, we train a unified model that integrates a multimodal language model with FLUX.1-Kontext via a lightweight connector for enhanced multimodal understanding. A three-stage training curriculum enables progressive feature alignment, knowledge infusion, and reasoning-augmented generation, further boosted by an external reasoner at inference time. Finally, we introduce StructBench, a novel benchmark for generation and editing with over 2,000 challenging samples, and an accompanying evaluation metric, StructScore, which employs a multi-round Q&A protocol to assess fine-grained factual accuracy. Evaluations of 15 models reveal that even state-of-the-art systems score below 50%, while our model achieves the strongest open-source performance, with consistent gains from inference-time reasoning. By releasing our dataset, model, and benchmark, we aim to advance unified multimodal foundations for structured visuals.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 0

Research Landscape Overview

Core task: structured image generation and editing. The field organizes itself around several complementary branches that reflect different ways of imposing structure on generative models. Text-Guided Generation and Editing Frameworks (e.g., VQGAN CLIP[1], Prompt to Prompt[37]) focus on leveraging natural language to steer synthesis, while Spatially-Constrained and Compositional Generation (e.g., BoxDiff[4], Composer[12]) emphasizes explicit layout or bounding-box controls to arrange multiple objects. Subject-Driven and Consistency-Preserving Editing (e.g., MasaCtrl[13], BLIP Diffusion[14]) targets identity preservation and fine-grained attribute manipulation. Domain-Specific Synthesis with Structural Constraints addresses specialized applications such as medical imaging (e.g., CT Ultrasound Synthesis[23]) or industrial defect generation (e.g., Industrial Defect Generation[49]), where domain priors guide the output. Constrained Generation with Physical and Semantic Priors incorporates scene graphs, depth maps, or other intermediate representations (e.g., Scene Graphs Survey[8], Semantic Diffusion Guidance[15]) to enforce realism. Finally, Auxiliary Capabilities and Supporting Technologies provide foundational tools—such as 3D-aware latents (Structured 3D Latents[2], SV3D[3])—that enable richer structural control across these branches. A particularly active line of work explores how to balance flexibility with fidelity: some methods prioritize training-free or plug-and-play guidance (e.g., Training Free Structured[29], Semantic Diffusion Guidance[15]), while others learn specialized modules for compositional reasoning (e.g., MIGC Plus[50], Composer[12]). Trade-offs between user control granularity and model complexity remain central, as do questions about how to verify that generated content respects factual or physical constraints. 
Factuality Structured Visuals[0] sits within the Structured Visual Generation with Factual Constraints branch, specifically under Factual Fidelity and Reasoning-Augmented Generation. It shares thematic ground with works like Text to Diagram[11], which also emphasizes correctness and structured reasoning, but distinguishes itself by foregrounding explicit factual verification mechanisms. Compared to domain-specific approaches such as IC SEM Synthesis[5], Factuality Structured Visuals[0] appears to target broader applicability across diverse factual scenarios, positioning it as a bridge between general-purpose text-guided frameworks and specialized constraint-driven synthesis.

Claimed Contributions

Large-scale structured image dataset with chain-of-thought annotations

The authors build a dataset of 1.3 million structured image pairs (charts, diagrams, mathematical figures) generated from executable code. Each sample includes both text-to-image prompts and editing instructions, along with GPT-5-generated chain-of-thought reasoning trajectories that provide explicit analysis and planning steps.

10 retrieved papers
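The pairing of executable drawing programs with editing instructions described above can be sketched as follows. This is an illustrative toy, not the paper's pipeline: a minimal "drawing program" renders a bar-chart spec to SVG, and rendering the spec before and after a scripted edit yields a (source image, edited image, instruction) triple whose factual content is known by construction. The renderer, spec fields, and edit function are all assumptions for illustration.

```python
import copy

def render_bar_chart_svg(spec):
    """Toy stand-in for a real renderer (e.g., matplotlib or TikZ):
    turn a bar-chart spec into a small SVG string."""
    color = spec["color"]
    rects = []
    for i, (_label, value) in enumerate(spec["bars"]):
        x, height = 10 + i * 30, value * 10
        rects.append(
            f'<rect x="{x}" y="{100 - height}" width="20" '
            f'height="{height}" fill="{color}"/>'
        )
    body = "".join(rects)
    return f'<svg width="120" height="110">{body}</svg>'

def make_editing_pair(spec, edit_fn, instruction):
    """Apply a programmatic edit to the spec and render both versions,
    producing a ground-truth editing sample."""
    edited = edit_fn(copy.deepcopy(spec))
    return {
        "source_svg": render_bar_chart_svg(spec),
        "edited_svg": render_bar_chart_svg(edited),
        "instruction": instruction,
    }

def bump_bar_b(spec):
    spec["bars"][1] = ("B", 7)  # the scripted edit: change one bar's value
    return spec

spec = {"bars": [("A", 3), ("B", 5), ("C", 2)], "color": "steelblue"}
pair = make_editing_pair(
    spec, bump_bar_b, "Increase the value of bar B from 5 to 7."
)
print(pair["instruction"])
```

Because the edit is applied to code rather than pixels, the instruction is guaranteed to match the visual difference, which is what makes program-derived pairs attractive for factual supervision.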
Unified model with three-stage progressive training curriculum

The authors propose a three-stage training pipeline that progressively aligns multimodal features from Qwen-VL with FLUX.1-Kontext via a lightweight MLP connector, infuses structured-visual knowledge, and incorporates chain-of-thought reasoning to enable inference-time scaling with an external reasoner.

10 retrieved papers
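The "lightweight MLP connector" idea can be sketched as a two-layer projection from MLLM hidden states into the diffusion backbone's conditioning space. A minimal NumPy sketch follows; the dimensions (3584 for Qwen-VL-style hidden states, 4096 for FLUX-style conditioning), the activation, and the architecture are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

class MLPConnector:
    """Two-layer MLP mapping multimodal token states (d_in) to
    diffusion conditioning tokens (d_out). Dimensions are illustrative."""

    def __init__(self, d_in=3584, d_hidden=4096, d_out=4096):
        # Scaled random init; a trained connector would learn these weights.
        self.w1 = rng.standard_normal((d_in, d_hidden)) * (d_in ** -0.5)
        self.w2 = rng.standard_normal((d_hidden, d_out)) * (d_hidden ** -0.5)

    def __call__(self, h):
        # h: (seq_len, d_in) hidden states from the MLLM
        z = np.maximum(h @ self.w1, 0.0)  # ReLU here; GELU/SiLU in practice
        return z @ self.w2                # (seq_len, d_out) conditioning tokens

tokens = rng.standard_normal((16, 3584))  # 16 multimodal tokens from the MLLM
connector = MLPConnector()
cond = connector(tokens)
print(cond.shape)  # (16, 4096)
```

Keeping the connector small means stage-one training can align features while both the MLLM and the diffusion backbone stay frozen, which is the usual motivation for this design.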
StructBench benchmark and StructScore evaluation metric

The authors introduce StructBench, a benchmark with over 2,000 samples across six categories for structured image generation and editing. They also propose StructScore, a novel metric that uses VLMs in a multi-round question-answer protocol to evaluate fine-grained factual accuracy and reduce hallucinations compared to naive VLM-as-a-Judge approaches.

4 retrieved papers
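The multi-round Q&A protocol behind a StructScore-style metric can be sketched as: decompose factual checking into atomic questions, query a VLM judge one question per round, and report the fraction answered correctly. In the sketch below, `ask_vlm` is a hypothetical stand-in for a real VLM API call, and the questions and answers are invented for illustration.

```python
def ask_vlm(image, question):
    """Placeholder judge: a real system would send (image, question) to a
    vision-language model and parse a short free-form answer."""
    fake_answers = {
        "How many bars are in the chart?": "3",
        "What is the title of the chart?": "Sales by region",
        "What color are the bars?": "red",
    }
    return fake_answers.get(question, "unknown")

def struct_score(image, qa_pairs):
    """Fraction of atomic factual questions the judge answers correctly.
    Asking one narrow question per round limits the compounding
    hallucinations seen with a single holistic VLM-as-a-Judge prompt."""
    correct = 0
    for question, reference in qa_pairs:
        answer = ask_vlm(image, question)
        correct += int(answer.strip().lower() == reference.strip().lower())
    return correct / len(qa_pairs)

qa = [
    ("How many bars are in the chart?", "3"),
    ("What is the title of the chart?", "Sales by region"),
    ("What color are the bars?", "steelblue"),
]
print(struct_score("generated.png", qa))  # 2 of 3 answers match the reference
```

A production metric would also need answer normalization (numbers, synonyms) and question generation from the ground-truth drawing program, both of which are elided here.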

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Large-scale structured image dataset with chain-of-thought annotations

The authors build a dataset of 1.3 million structured image pairs (charts, diagrams, mathematical figures) generated from executable code. Each sample includes both text-to-image prompts and editing instructions, along with GPT-5-generated chain-of-thought reasoning trajectories that provide explicit analysis and planning steps.

Contribution

Unified model with three-stage progressive training curriculum

The authors propose a three-stage training pipeline that progressively aligns multimodal features from Qwen-VL with FLUX.1-Kontext via a lightweight MLP connector, infuses structured-visual knowledge, and incorporates chain-of-thought reasoning to enable inference-time scaling with an external reasoner.

Contribution

StructBench benchmark and StructScore evaluation metric

The authors introduce StructBench, a benchmark with over 2,000 samples across six categories for structured image generation and editing. They also propose StructScore, a novel metric that uses VLMs in a multi-round question-answer protocol to evaluate fine-grained factual accuracy and reduce hallucinations compared to naive VLM-as-a-Judge approaches.