A High Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: high-quality dataset, multimodal dataset, interleaved image-text synergy, interleaved evaluation
Abstract:

Recent advances in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation. However, these models still struggle to generate tightly interleaved image-text outputs, primarily due to the limited scale, quality, and instructional richness of current training datasets. To address this, we introduce InterSyn, a dataset that features: (1) large scale, comprising 1.8M multimodal samples; (2) high quality, ensured by our proposed Self-Evaluation with Iterative Refinement (SEIR) method for rigorous automated quality refinement; (3) rich instructional diversity, achieved through well-designed question templates grounded in human preferences and covering a 3,500-topic hierarchy. These characteristics make InterSyn particularly well suited for training LMMs in interactive image-text generation. To measure these capabilities, we propose SynJudge, a reliable automatic evaluator that aligns closely with human judgment and outputs four interpretable scores: Text Content Completeness (TCC), Image Content Completeness (ICC), Image Quality (IQ), and Image-Text Synergy (ITS). The scores are complementary, covering content, quality, and cross-modal interaction, and together form a comprehensive evaluation framework. Experiments on InterSyn subsets of up to 200K samples show that 25K-50K samples already yield substantial improvements, while scaling to 100K and 200K brings further gains in TCC, ICC, and especially ITS. This highlights InterSyn's (1) scalability, as performance consistently improves with more data, and (2) efficiency, as significant gains are achievable even with smaller subsets, making it accessible to researchers with varying computational resources.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces InterSyn, a large-scale dataset (1.8M samples) for training interleaved image-text generation models, alongside SEIR (a quality refinement method) and SynJudge (an automatic evaluator). It resides in the Comprehensive Benchmarks leaf under Evaluation and Benchmarking, which contains four papers total. This leaf focuses on large-scale evaluation frameworks with diverse tasks and human-annotated instances. The paper's dual emphasis on dataset construction and evaluation tooling places it at the intersection of data curation and benchmarking infrastructure within a moderately populated research direction.

The Comprehensive Benchmarks leaf sits alongside Specialized Evaluation Metrics (five papers on targeted aspects like consistency or alignment) and Reward Modeling and Preference Learning (three papers on preference-based evaluation). Neighboring branches include Unified Multimodal Architectures and Modular Multimodal Systems, which develop the generation models that benchmarks like InterSyn aim to evaluate. The taxonomy's scope note clarifies that this leaf excludes specialized single-aspect metrics, focusing instead on holistic assessment frameworks. InterSyn's 3500-topic hierarchy and multi-dimensional scoring (TCC, ICC, IQ, ITS) align with this comprehensive evaluation philosophy, distinguishing it from narrower metric-focused work.

Among the thirty candidates examined (ten per contribution), the InterSyn dataset contribution shows no clear refutation, with zero refutable candidates among its ten, suggesting relative novelty in its scale and instructional diversity. SEIR, however, drew two refutable matches from its ten candidates, and SynJudge drew one from its ten, indicating more substantial prior work in automated quality refinement and evaluation metrics. Because the search scope is limited, these statistics reflect top-K semantic matches and citation expansion, not exhaustive coverage. The dataset's novelty therefore appears stronger than that of the methodological components, which face more direct precedents in the examined literature.

Based on the limited thirty-candidate search, InterSyn's primary novelty likely resides in its dataset scale and topic hierarchy rather than its refinement or evaluation methods. The taxonomy context shows a moderately active benchmarking area with established frameworks, suggesting incremental rather than transformative contributions. A broader literature search might reveal additional overlaps, particularly in self-evaluation techniques and multimodal scoring systems, which are active research areas beyond the examined candidates.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: Interleaved image-text generation involves producing coherent sequences where visual and textual content are interwoven, enabling richer multimodal narratives than isolated text-to-image synthesis. The field's structure reflects diverse methodological philosophies: Unified Multimodal Architectures (e.g., Janus-Pro[1], ANOLE[12]) pursue end-to-end models that handle both modalities within a single framework, while Modular Multimodal Systems combine specialized components such as separate language models and diffusion generators. Specialized Generation Methods explore targeted techniques like chain-of-thought reasoning (Interleaved Chain-of-Thought[2]) or scene graph-based planning (Interleaved Scene Graphs[7]), and Foundational Text-to-Image Models (e.g., Muse[11], GLIGEN[38]) provide the underlying image synthesis capabilities. Meanwhile, Evaluation and Benchmarking efforts establish metrics and datasets to assess generation quality, and Surveys and Cross-Domain Applications document broader trends and novel use cases such as procedural planning (Multimodal Procedural Planning[5]) or chemistry applications (Multimodal AI Chemistry[23]).

Recent activity highlights tensions between architectural simplicity and task-specific optimization. Unified models promise streamlined training and coherent cross-modal reasoning, yet modular approaches offer flexibility to swap or upgrade individual components as foundational models improve.

Within the Evaluation and Benchmarking branch, comprehensive benchmarks like OpenLEAF Benchmark[9] and GATE OpenING[24] provide standardized testbeds, while High Quality Interleaved Dataset[0] contributes curated data resources essential for training and validating these systems. Compared to neighboring works such as OpenING[6], which emphasizes open-domain interleaved generation, High Quality Interleaved Dataset[0] focuses on dataset quality and curation strategies that underpin reliable evaluation. This positions it as a foundational resource within the benchmarking cluster, complementing evaluation frameworks like Text-to-Visual Evaluation[3] that assess output fidelity and coherence across diverse interleaved scenarios.

Claimed Contributions

InterSyn dataset for interleaved image-text generation

The authors introduce InterSyn, a dataset comprising 1.8M single-turn samples and 50K multi-turn dialogues across 8 domains and 3,500 topics. It is designed to support the training of Large Multimodal Models in interactive image-text generation with rich instructional diversity and high quality; a sketch of one possible record layout follows this entry.

Retrieved papers: 10. Verdict: no refutable match found among the examined candidates.
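
For concreteness, the sketch below shows how a single InterSyn record might be laid out, with an answer that interleaves text segments and image references. All field names, the domain/topic values, and the segment encoding are hypothetical illustrations; the report does not reproduce the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class InterleavedTurn:
    """One question/answer turn whose answer interleaves text and images.

    Field names are illustrative guesses, not the dataset's actual schema.
    """
    question: str           # instruction drawn from a question template
    answer_segments: list   # ordered mix of text strings and image references

@dataclass
class InterSynSample:
    sample_id: str
    domain: str   # hypothetically, one of the 8 top-level domains
    topic: str    # hypothetically, a leaf of the ~3,500-topic hierarchy
    turns: list = field(default_factory=list)  # one turn for single-turn samples, more for dialogues

# A hypothetical single-turn sample: text and image references alternate.
sample = InterSynSample(
    sample_id="intersyn-000001",
    domain="cooking",
    topic="baking/sourdough",
    turns=[InterleavedTurn(
        question="Show me how to shape a sourdough boule.",
        answer_segments=[
            "First, fold the dough edges toward the center...",
            {"image": "img_000001_0.png"},
            "Then rotate the boule to tighten its surface...",
            {"image": "img_000001_1.png"},
        ],
    )],
)
```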
Self-Evaluation with Iterative Refinement (SEIR) method

The authors propose SEIR, a fully automated quality refinement method that embeds self-checking and feedback loops into each generation step through a Generate-Evaluate-Refine loop across three cascaded stages (Question Refinement, Answer Refinement, and Image Refinement) to enhance semantic completeness and cross-modal synergy; a minimal sketch of this control flow follows this entry.

Retrieved papers: 10. Verdict: Can Refute (two refutable matches among the examined candidates).
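
Based only on the description above, a minimal sketch of one Generate-Evaluate-Refine stage and the three-stage cascade might look like the following. The scoring threshold, iteration cap, and the `generate`/`evaluate`/`refine` callables stand in for LMM calls and are assumptions, not details taken from the paper.

```python
def generate_evaluate_refine(generate, evaluate, refine, max_iters=3, threshold=0.9):
    """One SEIR-style stage: draft, self-evaluate, and revise until the
    self-score clears a threshold or the iteration budget runs out.

    All three callables stand in for model calls; the 0.9 threshold and
    the cap of 3 iterations are illustrative defaults, not paper values.
    """
    draft = generate()
    for _ in range(max_iters):
        score, feedback = evaluate(draft)   # self-checking step
        if score >= threshold:
            break
        draft = refine(draft, feedback)     # feedback-conditioned revision
    return draft

def seir_pipeline(make_question, make_answer, make_image, evaluate, refine):
    """Three cascaded stages, each wrapped in its own refinement loop:
    Question Refinement -> Answer Refinement -> Image Refinement."""
    question = generate_evaluate_refine(make_question, evaluate, refine)
    answer = generate_evaluate_refine(lambda: make_answer(question), evaluate, refine)
    image = generate_evaluate_refine(lambda: make_image(question, answer), evaluate, refine)
    return question, answer, image
```

The cascade mirrors the claim that the stages are cascaded: the refined question conditions answer generation, and the refined question-answer pair conditions image generation.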
SynJudge evaluator for interleaved outputs

The authors introduce SynJudge, a reliable automatic evaluator that aligns closely with human judgment and outputs four interpretable scores: Text Content Completeness, Image Content Completeness, Image Quality, and Image-Text Synergy. It provides comprehensive evaluation covering content, quality, and cross-modal interaction; an illustrative output structure follows this entry.

Retrieved papers: 10. Verdict: Can Refute (one refutable match among the examined candidates).
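
The sketch below illustrates how a SynJudge-style result could be represented. The 0-10 scale and the unweighted mean are illustrative assumptions, as the report does not specify the scoring scale or how (or whether) the four scores are aggregated.

```python
from dataclasses import dataclass

@dataclass
class SynJudgeScores:
    """The four interpretable axes reported by a SynJudge-style evaluator.

    The 0-10 scale and the unweighted mean below are assumptions made
    for illustration; the paper may score and aggregate differently.
    """
    tcc: float  # Text Content Completeness
    icc: float  # Image Content Completeness
    iq: float   # Image Quality
    its: float  # Image-Text Synergy

    def overall(self) -> float:
        return (self.tcc + self.icc + self.iq + self.its) / 4

scores = SynJudgeScores(tcc=8.5, icc=7.0, iq=9.0, its=6.5)
print(f"overall: {scores.overall():.2f}")  # prints: overall: 7.75
```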

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: InterSyn dataset for interleaved image-text generation (see Claimed Contributions above)

Contribution: Self-Evaluation with Iterative Refinement (SEIR) method (see Claimed Contributions above)

Contribution: SynJudge evaluator for interleaved outputs (see Claimed Contributions above)