A High Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: high-quality dataset, multimodal dataset, interleaved image-text synergy, interleaved evaluation
Abstract:

Recent advances in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation. However, these models still struggle to generate tightly interleaved image-text outputs, primarily due to the limited scale, quality, and instructional richness of current training datasets. To address this, we introduce InterSyn, a dataset that features: (1) large scale, comprising 1.8M multimodal samples; (2) high quality, ensured by our proposed Self-Evaluation with Iterative Refinement (SEIR) method for rigorous automated quality refinement; (3) rich instructional diversity, achieved through well-designed question templates grounded in human preferences and covering a 3,500-topic hierarchy. These characteristics make InterSyn particularly well suited for training LMMs in interactive image-text generation. To measure these capabilities, we propose SynJudge, a reliable automatic evaluator that aligns closely with human judgment and outputs four interpretable scores: Text Content Completeness (TCC), Image Content Completeness (ICC), Image Quality (IQ), and Image-Text Synergy (ITS). The scores are complementary, covering content, quality, and cross-modal interaction, and together form a comprehensive evaluation framework. Experiments on InterSyn subsets of up to 200K samples show that 25K-50K samples already yield substantial improvements, while scaling to 100K and 200K brings further gains in TCC, ICC, and especially ITS. This highlights InterSyn's (1) scalability, as performance consistently improves with more data, and (2) efficiency, as significant gains are achievable even with smaller subsets, making it accessible to researchers with varying computational resources.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces InterSyn, a large-scale dataset (1.8M samples) for training interleaved image-text generation models, alongside SEIR (a quality refinement method) and SynJudge (an automatic evaluator). It resides in the Comprehensive Benchmarks leaf under Evaluation and Benchmarking, which contains four papers total. This leaf focuses on large-scale evaluation frameworks with diverse tasks and human-annotated instances. The paper's dual emphasis on dataset construction and evaluation tooling places it at the intersection of data curation and benchmarking infrastructure within a moderately populated research direction.

The Comprehensive Benchmarks leaf sits alongside Specialized Evaluation Metrics (five papers on targeted aspects like consistency or alignment) and Reward Modeling and Preference Learning (three papers on preference-based evaluation). Neighboring branches include Unified Multimodal Architectures and Modular Multimodal Systems, which develop the generation models that benchmarks like InterSyn aim to evaluate. The taxonomy's scope note clarifies that this leaf excludes specialized single-aspect metrics, focusing instead on holistic assessment frameworks. InterSyn's 3500-topic hierarchy and multi-dimensional scoring (TCC, ICC, IQ, ITS) align with this comprehensive evaluation philosophy, distinguishing it from narrower metric-focused work.

Among the thirty candidates examined (ten per contribution), the InterSyn dataset contribution shows no clear refutation, with zero refutable candidates among its ten, suggesting relative novelty in its scale and instructional diversity. SEIR, however, drew two refutable matches from its ten candidates, and SynJudge drew one from its ten, indicating more substantial prior work in automated quality refinement and evaluation metrics. Because the search scope is limited, these statistics reflect top-K semantic matches and citation expansion, not exhaustive coverage. The dataset's novelty therefore appears stronger than that of the methodological components, which face more direct precedents in the examined literature.

Based on the limited thirty-candidate search, InterSyn's primary novelty likely resides in its dataset scale and topic hierarchy rather than its refinement or evaluation methods. The taxonomy context shows a moderately active benchmarking area with established frameworks, suggesting incremental rather than transformative contributions. A broader literature search might reveal additional overlaps, particularly in self-evaluation techniques and multimodal scoring systems, which are active research areas beyond the examined candidates.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: Interleaved image-text generation involves producing coherent sequences where visual and textual content are interwoven, enabling richer multimodal narratives than isolated text-to-image synthesis. The field's structure reflects diverse methodological philosophies: Unified Multimodal Architectures (e.g., Janus-Pro[1], ANOLE[12]) pursue end-to-end models that handle both modalities within a single framework, while Modular Multimodal Systems combine specialized components such as separate language models and diffusion generators. Specialized Generation Methods explore targeted techniques like chain-of-thought reasoning (Interleaved Chain-of-Thought[2]) or scene graph-based planning (Interleaved Scene Graphs[7]), and Foundational Text-to-Image Models (e.g., Muse[11], GLIGEN[38]) provide the underlying image synthesis capabilities. Meanwhile, Evaluation and Benchmarking efforts establish metrics and datasets to assess generation quality, and Surveys and Cross-Domain Applications document broader trends and novel use cases such as procedural planning (Multimodal Procedural Planning[5]) or chemistry applications (Multimodal AI Chemistry[23]).

Recent activity highlights tensions between architectural simplicity and task-specific optimization. Unified models promise streamlined training and coherent cross-modal reasoning, yet modular approaches offer flexibility to swap or upgrade individual components as foundational models improve.

Within the Evaluation and Benchmarking branch, comprehensive benchmarks like OpenLEAF Benchmark[9] and GATE OpenING[24] provide standardized testbeds, while High Quality Interleaved Dataset[0] contributes curated data resources essential for training and validating these systems. Compared to neighboring works such as OpenING[6], which emphasizes open-domain interleaved generation, High Quality Interleaved Dataset[0] focuses on dataset quality and curation strategies that underpin reliable evaluation. This positions it as a foundational resource within the benchmarking cluster, complementing evaluation frameworks like Text-to-Visual Evaluation[3] that assess output fidelity and coherence across diverse interleaved scenarios.

Claimed Contributions

InterSyn dataset for interleaved image-text generation

The authors introduce InterSyn, a dataset comprising 1.8M single-turn samples and 50K multi-turn dialogues across 8 domains and 3,500 topics. It is designed to support the training of Large Multimodal Models in interactive image-text generation with rich instructional diversity and high quality; a sketch of one possible record layout follows this entry.

Retrieved papers: 10. Verdict: no refutable match found among the examined candidates.
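
For concreteness, the sketch below shows how a single InterSyn record might be laid out, with an answer that interleaves text segments and image references. All field names, the domain/topic values, and the segment encoding are hypothetical illustrations; the report does not reproduce the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class InterleavedTurn:
    """One question/answer turn whose answer interleaves text and images.

    Field names are illustrative guesses, not the dataset's actual schema.
    """
    question: str           # instruction drawn from a question template
    answer_segments: list   # ordered mix of text strings and image references

@dataclass
class InterSynSample:
    sample_id: str
    domain: str   # hypothetically, one of the 8 top-level domains
    topic: str    # hypothetically, a leaf of the ~3,500-topic hierarchy
    turns: list = field(default_factory=list)  # one turn for single-turn samples, more for dialogues

# A hypothetical single-turn sample: text and image references alternate.
sample = InterSynSample(
    sample_id="intersyn-000001",
    domain="cooking",
    topic="baking/sourdough",
    turns=[InterleavedTurn(
        question="Show me how to shape a sourdough boule.",
        answer_segments=[
            "First, fold the dough edges toward the center...",
            {"image": "img_000001_0.png"},
            "Then rotate the boule to tighten its surface...",
            {"image": "img_000001_1.png"},
        ],
    )],
)
```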
Self-Evaluation with Iterative Refinement (SEIR) method

The authors propose SEIR, a fully automated quality refinement method that embeds self-checking and feedback loops into each generation step through a Generate-Evaluate-Refine loop across three cascaded stages (Question Refinement, Answer Refinement, and Image Refinement) to enhance semantic completeness and cross-modal synergy; a minimal sketch of this control flow follows this entry.

Retrieved papers: 10. Verdict: Can Refute (two refutable matches among the examined candidates).
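
Based only on the description above, a minimal sketch of one Generate-Evaluate-Refine stage and the three-stage cascade might look like the following. The scoring threshold, iteration cap, and the `generate`/`evaluate`/`refine` callables stand in for LMM calls and are assumptions, not details taken from the paper.

```python
def generate_evaluate_refine(generate, evaluate, refine, max_iters=3, threshold=0.9):
    """One SEIR-style stage: draft, self-evaluate, and revise until the
    self-score clears a threshold or the iteration budget runs out.

    All three callables stand in for model calls; the 0.9 threshold and
    the cap of 3 iterations are illustrative defaults, not paper values.
    """
    draft = generate()
    for _ in range(max_iters):
        score, feedback = evaluate(draft)   # self-checking step
        if score >= threshold:
            break
        draft = refine(draft, feedback)     # feedback-conditioned revision
    return draft

def seir_pipeline(make_question, make_answer, make_image, evaluate, refine):
    """Three cascaded stages, each wrapped in its own refinement loop:
    Question Refinement -> Answer Refinement -> Image Refinement."""
    question = generate_evaluate_refine(make_question, evaluate, refine)
    answer = generate_evaluate_refine(lambda: make_answer(question), evaluate, refine)
    image = generate_evaluate_refine(lambda: make_image(question, answer), evaluate, refine)
    return question, answer, image
```

The cascade mirrors the claim that the stages are cascaded: the refined question conditions answer generation, and the refined question-answer pair conditions image generation.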
SynJudge evaluator for interleaved outputs

The authors introduce SynJudge, a reliable automatic evaluator that aligns closely with human judgment and outputs four interpretable scores: Text Content Completeness, Image Content Completeness, Image Quality, and Image-Text Synergy. It provides comprehensive evaluation covering content, quality, and cross-modal interaction; an illustrative output structure follows this entry.

Retrieved papers: 10. Verdict: Can Refute (one refutable match among the examined candidates).
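
The sketch below illustrates how a SynJudge-style result could be represented. The 0-10 scale and the unweighted mean are illustrative assumptions, as the report does not specify the scoring scale or how (or whether) the four scores are aggregated.

```python
from dataclasses import dataclass

@dataclass
class SynJudgeScores:
    """The four interpretable axes reported by a SynJudge-style evaluator.

    The 0-10 scale and the unweighted mean below are assumptions made
    for illustration; the paper may score and aggregate differently.
    """
    tcc: float  # Text Content Completeness
    icc: float  # Image Content Completeness
    iq: float   # Image Quality
    its: float  # Image-Text Synergy

    def overall(self) -> float:
        return (self.tcc + self.icc + self.iq + self.its) / 4

scores = SynJudgeScores(tcc=8.5, icc=7.0, iq=9.0, its=6.5)
print(f"overall: {scores.overall():.2f}")  # prints: overall: 7.75
```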

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: InterSyn dataset for interleaved image-text generation (see Claimed Contributions above)

Contribution: Self-Evaluation with Iterative Refinement (SEIR) method (see Claimed Contributions above)

Contribution: SynJudge evaluator for interleaved outputs (see Claimed Contributions above)