Multimodal Dataset Distillation Made Simple by Prototype-guided Data Synthesis

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: dataset distillation, dataset condensation, vision-language models, learning-free approach
Abstract:

Recent advances in multimodal learning have achieved remarkable success across diverse vision–language tasks. However, such progress heavily relies on large-scale image–text datasets, making training costly and inefficient. Prior efforts in dataset filtering and pruning attempt to mitigate this issue, but still require relatively large subsets to maintain performance and fail under very small subsets. Dataset distillation offers a promising alternative, yet existing multimodal dataset distillation methods require full-dataset training and joint optimization of pixel and text features, making them architecture-dependent and limiting cross-architecture generalization. To overcome this, we propose a learning-free dataset distillation framework that eliminates the need for large-scale training and optimization while enhancing generalization across architectures. Our method uses CLIP to extract aligned image–text embeddings, obtains prototypes, and employs an unCLIP decoder to synthesize images, enabling efficient and scalable multimodal dataset distillation. Extensive experiments demonstrate that our approach consistently outperforms optimization-based dataset distillation and subset selection methods, achieving state-of-the-art cross-architecture generalization.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a learning-free multimodal dataset distillation framework that synthesizes compact vision-language datasets using CLIP embeddings, prototype extraction, and unCLIP-based image generation. It resides in the 'Generative Synthesis Approaches' leaf, which contains only two papers including this one. This leaf sits within the broader 'Multimodal Dataset Distillation Methods' branch, which also includes trajectory matching methods and semantic integration approaches. The sparse population of this specific leaf suggests the learning-free generative synthesis direction is relatively underexplored compared to optimization-based distillation methods.

The taxonomy reveals neighboring research directions including 'Trajectory and Distribution Matching Approaches' (three papers) and 'Vision-Language Semantic Integration' (two papers), both requiring full-dataset optimization. The paper's approach diverges by eliminating iterative training, contrasting with methods that match training trajectories or distributions. Adjacent branches like 'Model Distillation for Vision-Language Systems' focus on compressing models rather than datasets, while 'Data Selection and Filtering' curates existing data without synthesis. The framework bridges generative synthesis with prototype-based semantic knowledge, connecting to themes in semantic integration while avoiding optimization overhead.

Among the fourteen candidates examined, no contributions were clearly refuted by prior work. For the learning-free framework, ten candidates were examined with zero refutations, suggesting limited direct overlap within the literature search scope. Prototype-based synthesis via unCLIP was compared against one candidate, and cross-modal cluster matching against three, likewise without refutations. This absence of refutable prior work within the limited search scope indicates that the specific combination of learning-free distillation, CLIP-based prototyping, and unCLIP synthesis may represent a relatively unexplored configuration, though the small candidate pool limits definitive conclusions about field-wide novelty.

Based on top-fourteen semantic matches, the work appears to occupy a sparse research direction within multimodal dataset distillation. The learning-free paradigm and generative synthesis approach differentiate it from optimization-heavy trajectory matching methods, though the limited search scope means potentially relevant work in adjacent areas (e.g., prototype learning, CLIP-based generation) may not have been fully captured. The analysis covers immediate semantic neighbors but does not exhaustively survey all prototype-based or generative distillation literature.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 14
Refutable Papers: 0

Research Landscape Overview

Core task: Multimodal dataset distillation for vision-language learning. The field addresses the challenge of condensing large-scale vision-language datasets into compact, informative subsets that preserve training efficacy while reducing computational overhead.

The taxonomy reveals five main branches:

- Multimodal Dataset Distillation Methods: techniques that synthesize or select representative multimodal samples, including generative synthesis approaches and prototype-based strategies.
- Model Distillation for Vision-Language Systems: transfer of knowledge from large teacher models to smaller student architectures, often through layerwise or adaptive mechanisms such as Layerwised Multimodal[1] and Adaptive Distillation[10].
- Data Selection and Filtering for Vision-Language Pretraining: curation strategies that identify high-quality training pairs, exemplified by Filtering Hard Negatives[9].
- Specialized Vision-Language Distillation Applications: domain-specific scenarios including open-vocabulary detection, emotion recognition, and compositional reasoning.
- Auxiliary Techniques and Frameworks: supporting methodologies such as token condensation and cross-modal alignment strategies like Align before Fuse[17].

Recent activity highlights a tension between generative synthesis methods that create novel training samples and selection-based approaches that curate existing data. Generative Distillation[5] and DC-CLIP[7] exemplify efforts to synthesize compact yet expressive datasets, while Category Prototype[8] and Data Reduction[18] pursue efficient subset selection. Prototype Synthesis[0] sits within the generative synthesis cluster, closely aligned with Generative Distillation[5] in its emphasis on creating synthetic prototypes that capture essential multimodal patterns. Unlike purely selection-driven methods, Prototype Synthesis[0] generates representative samples rather than filtering existing pairs, positioning it as a complementary approach to works like Category Prototype[8] that extract prototypes from real data. This distinction reflects broader questions about whether synthesized or curated data better preserves the semantic richness needed for robust vision-language alignment.

Claimed Contributions

Learning-free multimodal dataset distillation framework (PDS)

The authors introduce PDS (Prototype-guided Data Synthesis), a learning-free framework for multimodal dataset distillation that avoids full-dataset training and joint optimization of pixel and text features. Unlike optimization-based methods, PDS is architecture-independent and achieves superior cross-architecture generalization without requiring re-distillation for new backbones.

10 retrieved papers
Prototype-based image synthesis using unCLIP decoder

The authors propose a three-stage pipeline that extracts CLIP embeddings, performs modality-specific clustering and cross-modal cluster matching to obtain image-text prototypes, then synthesizes images using an unCLIP decoder conditioned on both image prototypes and retrieved captions. This is the first dataset distillation method to generate images directly from CLIP image embeddings using an unCLIP decoder.

1 retrieved paper
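The first two stages of the pipeline described above (embedding extraction and modality-specific clustering) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the random arrays stand in for real CLIP image and text embeddings, the k-means routine is a plain NumPy version, and the prototype count K is an arbitrary choice.

```python
import numpy as np

def kmeans(X, k, iters=25, seed=0):
    """Plain NumPy k-means; returns cluster centers and labels."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance of every point to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, labels

# Stand-ins for CLIP embeddings; in the paper's pipeline these would come
# from a CLIP image encoder and text encoder sharing a 512-d space.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(1000, 512)).astype(np.float32)
text_emb = rng.normal(size=(1000, 512)).astype(np.float32)

# CLIP embeddings are typically L2-normalized before comparison.
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

K = 10  # hypothetical number of prototypes per modality

# Modality-specific clustering; the cluster centers serve as prototypes.
image_prototypes, image_labels = kmeans(image_emb, K)
text_prototypes, text_labels = kmeans(text_emb, K)

# Stage three (not runnable here): each image prototype would be fed to an
# unCLIP decoder, conditioned on a retrieved caption, to synthesize an image.
```

In the actual method the decoding step maps prototype embeddings back to pixel space; the sketch stops at the prototypes because the unCLIP decoder is a large pretrained model.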
Cross-modal cluster matching via linear assignment

The authors formulate cross-modal alignment as a linear assignment problem solved via the Hungarian algorithm, which matches image and text clusters by maximizing their overlap. This establishes semantic correspondences between modalities and produces aligned image-text prototypes for synthesis.

3 retrieved papers
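The cluster-matching step described above can be illustrated with SciPy's Hungarian-algorithm solver. This is a hedged sketch on synthetic data, not the paper's code: the label arrays simulate cluster assignments for paired samples, with the text clusters constructed as a noisily permuted copy of the image clusters.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
K = 5    # number of clusters per modality
n = 200  # number of image-text pairs

# Hypothetical cluster assignments: img_labels[i] and txt_labels[i] are the
# image-cluster and text-cluster of the i-th pair. The text side follows a
# hidden permutation of the image clusters, with 10% label noise.
perm = rng.permutation(K)
img_labels = rng.integers(0, K, size=n)
txt_labels = perm[img_labels]
noise = rng.random(n) < 0.1
txt_labels[noise] = rng.integers(0, K, size=noise.sum())

# Contingency matrix: overlap[i, j] = number of pairs whose image falls in
# image-cluster i and whose caption falls in text-cluster j.
overlap = np.zeros((K, K), dtype=int)
np.add.at(overlap, (img_labels, txt_labels), 1)

# Hungarian algorithm: one-to-one matching that maximizes total overlap.
rows, cols = linear_sum_assignment(overlap, maximize=True)
matching = dict(zip(rows.tolist(), cols.tolist()))
# With mild noise, the matching should recover the hidden permutation.
```

`linear_sum_assignment` accepts `maximize=True` (SciPy 1.4+), so the overlap matrix can be passed directly instead of negating it into a cost matrix.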

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
