A High Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation
Overview
Overall Novelty Assessment
The paper introduces InterSyn, a large-scale dataset (1.8M samples) for training interleaved image-text generation models, alongside SEIR (a quality refinement method) and SynJudge (an automatic evaluator). It resides in the Comprehensive Benchmarks leaf under Evaluation and Benchmarking, which contains four papers total. This leaf focuses on large-scale evaluation frameworks with diverse tasks and human-annotated instances. The paper's dual emphasis on dataset construction and evaluation tooling places it at the intersection of data curation and benchmarking infrastructure within a moderately populated research direction.
The Comprehensive Benchmarks leaf sits alongside Specialized Evaluation Metrics (five papers on targeted aspects like consistency or alignment) and Reward Modeling and Preference Learning (three papers on preference-based evaluation). Neighboring branches include Unified Multimodal Architectures and Modular Multimodal Systems, which develop the generation models that resources like InterSyn and SynJudge aim to train and evaluate. The taxonomy's scope note clarifies that this leaf excludes specialized single-aspect metrics, focusing instead on holistic assessment frameworks. InterSyn's 3,500-topic hierarchy and SynJudge's multi-dimensional scoring (TCC, ICC, IQ, ITS) align with this comprehensive evaluation philosophy, distinguishing the work from narrower metric-focused efforts.
Across the thirty candidates examined (ten per contribution), the InterSyn dataset contribution shows no clear refutation, suggesting relative novelty in its scale and instructional diversity. However, the search for SEIR surfaced two refutable matches among its ten candidates, and the search for SynJudge surfaced one among its ten, indicating more substantial prior work in automated quality refinement and evaluation metrics. These statistics reflect only top-K semantic matches plus citation expansion, not exhaustive coverage. The dataset's novelty appears stronger than that of the methodological components, which face more direct precedents in the examined literature.
Based on the limited thirty-candidate search, InterSyn's primary novelty likely resides in its dataset scale and topic hierarchy rather than its refinement or evaluation methods. The taxonomy context shows a moderately active benchmarking area with established frameworks, suggesting incremental rather than transformative contributions. A broader literature search might reveal additional overlaps, particularly in self-evaluation techniques and multimodal scoring systems, which are active research areas beyond the examined candidates.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce InterSyn, a dataset comprising 1.8M single-turn samples and 50K multi-turn dialogues across 8 domains and 3,500 topics. It is designed to support training of Large Multimodal Models in interleaved image-text generation with rich instructional diversity and high quality.
The authors propose SEIR, a fully automated quality refinement method that embeds self-checking and feedback loops into each generation step through a Generate-Evaluate-Refine loop across three cascaded stages (Question Refinement, Answer Refinement, and Image Refinement) to enhance semantic completeness and cross-modal synergy.
The authors introduce SynJudge, a reliable automatic evaluator that aligns closely with human judgment and outputs four interpretable scores: Text Content Completeness, Image Content Completeness, Image Quality, and Image-Text Synergy. It provides comprehensive evaluation covering both content quality and cross-modal interaction.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[6] OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
[9] OpenLEAF: A Novel Benchmark for Open-Domain Interleaved Image-Text Generation
[24] GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
Contribution Analysis
Detailed comparisons for each claimed contribution
InterSyn dataset for interleaved image-text generation
The authors introduce InterSyn, a dataset comprising 1.8M single-turn samples and 50K multi-turn dialogues across 8 domains and 3,500 topics. It is designed to support training of Large Multimodal Models in interleaved image-text generation with rich instructional diversity and high quality.
[71] MANTIS: Interleaved Multi-Image Instruction Tuning
[72] Fine-Tuning Multimodal LLMs to Follow Zero-Shot Demonstrative Instructions
[73] Generative Multimodal Models Are In-Context Learners
[74] InstructPix2Pix: Learning to Follow Image Editing Instructions
[75] LLaVA-NeXT-Interleave: Tackling Multi-Image, Video, and 3D in Large Multimodal Models
[76] Multimodal C4: An Open, Billion-Scale Corpus of Images Interleaved with Text
[77] Emu: Generative Pretraining in Multimodality
[78] CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
[79] mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
[80] Interleave-VLA: Enhancing Robot Manipulation with Interleaved Image-Text Instructions
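To make the claimed dataset structure concrete, the following is a minimal sketch of what one InterSyn-style record could look like, assuming a simple per-turn layout. The field names, types, and example values are illustrative assumptions and are not taken from the released dataset.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Turn:
    """One instruction/response exchange; field names are illustrative only."""
    instruction: str                                           # user request for this turn
    response_text: str                                         # interleaved textual answer
    response_images: List[str] = field(default_factory=list)   # paths/URLs of generated images

@dataclass
class InterSynSample:
    """Hypothetical record layout for a sample in an InterSyn-style corpus."""
    sample_id: str
    domain: str          # one of the 8 top-level domains
    topic: str           # one of the ~3,500 fine-grained topics
    turns: List[Turn]    # length 1 for single-turn samples, >1 for multi-turn dialogues

# A single-turn example; all contents are invented placeholders.
example = InterSynSample(
    sample_id="demo-0001",
    domain="cooking",
    topic="plating techniques",
    turns=[Turn(
        instruction="Explain how to plate a pasta dish, with a photo of each step.",
        response_text="Step 1: twirl the pasta into a nest in the center of the plate...",
        response_images=["step1.jpg", "step2.jpg"],
    )],
)
assert len(example.turns) == 1  # single-turn sample
```

Under this assumed schema, the 1.8M single-turn samples and 50K multi-turn dialogues would differ only in the length of `turns`, which keeps a single loader sufficient for both splits.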
Self-Evaluation with Iterative Refinement (SEIR) method
The authors propose SEIR, a fully automated quality refinement method that embeds self-checking and feedback loops into each generation step through a Generate-Evaluate-Refine loop across three cascaded stages (Question Refinement, Answer Refinement, and Image Refinement) to enhance semantic completeness and cross-modal synergy.
[51] Self-Refine: Iterative Refinement with Self-Feedback
[58] REFINER: Reasoning Feedback on Intermediate Representations
[52] CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
[53] Implementing an Automated Socratic Method to Reduce Hallucinations in Large Language Models
[54] Empowering Large Language Model Agent through Step-Level Self-Critique and Self-Training
[55] LLMLOOP: Improving LLM-Generated Code and Tests Through Automated Iterative Feedback Loops
[56] SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement
[57] An Approach to Business Continuity Self-Assessment
[59] Self-Supervised Iterative Refinement for Anomaly Detection in Industrial Quality Control
[60] ReST Meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent
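The Generate-Evaluate-Refine pattern that SEIR is claimed to apply at each stage can be sketched as a generic control loop over three cascaded stages. The function signatures, prompts, threshold, and round budget below are assumptions made for illustration; this is not the authors' implementation.

```python
from typing import Callable, Tuple

def refine(generate: Callable[[str], str],
           evaluate: Callable[[str], Tuple[float, str]],
           prompt: str,
           threshold: float = 0.9,
           max_rounds: int = 3) -> str:
    """Generic Generate-Evaluate-Refine loop (illustrative, not SEIR's code).

    `generate` produces a candidate from a prompt; `evaluate` returns a quality
    score in [0, 1] plus textual feedback. The candidate is revised until the
    score clears the threshold or the round budget is exhausted.
    """
    candidate = generate(prompt)
    for _ in range(max_rounds):
        score, feedback = evaluate(candidate)
        if score >= threshold:
            break
        candidate = generate(
            f"{prompt}\n\nPrevious draft:\n{candidate}\n\nFeedback:\n{feedback}"
        )
    return candidate

def seir_pipeline(seed_topic: str, gen, eval_q, eval_a, eval_img) -> dict:
    """Three cascaded refinement stages, mirroring Question -> Answer -> Image."""
    question = refine(gen, eval_q, f"Write an instruction about: {seed_topic}")
    answer = refine(gen, eval_a, f"Answer with interleaved text for: {question}")
    image = refine(gen, eval_img, f"Describe the image(s) to generate for: {answer}")
    return {"question": question, "answer": answer, "image_prompt": image}

if __name__ == "__main__":
    # Dummy callables so the sketch runs end to end without any model.
    dummy_gen = lambda p: f"[draft for: {p[:40]}...]"
    dummy_eval = lambda c: (1.0, "")  # always accept on the first pass
    print(seir_pipeline("street photography", dummy_gen, dummy_eval, dummy_eval, dummy_eval))
```

The cascade matters here: the answer stage only sees a question that has already passed its own check, and the image stage only sees a vetted answer, which is how per-stage self-checking can compound into higher end-to-end quality.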
SynJudge evaluator for interleaved outputs
The authors introduce SynJudge, a reliable automatic evaluator that aligns closely with human judgment and outputs four interpretable scores: Text Content Completeness, Image Content Completeness, Image Quality, and Image-Text Synergy. It provides comprehensive evaluation covering both content quality and cross-modal interaction.
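As an illustration of the four-score output format attributed to SynJudge, the sketch below shows one plausible way to hold and aggregate the dimensions. The 0-10 scale, the field names, and the unweighted mean are assumptions for the example, not the paper's specification.

```python
from dataclasses import dataclass, asdict

@dataclass
class SynJudgeScores:
    """Hypothetical container for the four SynJudge dimensions."""
    text_content_completeness: float   # TCC
    image_content_completeness: float  # ICC
    image_quality: float               # IQ
    image_text_synergy: float          # ITS

    def overall(self) -> float:
        """Unweighted mean, shown only as one possible aggregate."""
        values = list(asdict(self).values())
        return sum(values) / len(values)

# Example judgment for one interleaved output; the numbers are invented.
scores = SynJudgeScores(8.0, 7.5, 9.0, 6.5)
print(scores.overall())  # 7.75
```

Keeping the four scores separate, rather than collapsing them immediately, is what makes the evaluator's output interpretable: a low ITS with high IQ, for instance, points to well-rendered images that do not actually support the accompanying text.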