Asynchronous Matching with Dynamic Sampling for Multimodal Dataset Distillation

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Trajectory Matching, Dataset Distillation
Abstract:

Multimodal Dataset Distillation (MDD) has emerged as a vital paradigm for enabling efficient training of vision-language models (VLMs) in the era of multimodal data proliferation. Unlike traditional dataset distillation methods that focus on single-modal tasks, MDD presents distinct challenges: (i) the effective distillation of heterogeneous multimodal knowledge, complicated by feature-space misalignment and asynchronous optimization dynamics; and (ii) the lack of discrete class guidance, which hinders the distribution coverage and representativeness of synthetic data due to the vastness and continuity of the semantic space. To address these challenges, this paper proposes an Asynchronous Matching with Dynamic sampling (AMD) framework. AMD enables asynchronous trajectory matching by decoupling the selection of starting points for image and text trajectories. Additionally, a Semantics-Aware Prototype Mining module is introduced, which replaces random initialization by leveraging feature-space clustering to identify representative prototypes, enhancing the coverage and representativeness of the distilled samples. Extensive experiments demonstrate that AMD achieves superior distillation performance on Flickr30k and COCO (e.g., IR@1, IR@5, and IR@10 gains of 4.5%, 9.6%, and 10.9%, respectively, on Flickr30k with 200 pairs) with negligible computational overhead.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes an Asynchronous Matching with Dynamic sampling (AMD) framework for multimodal dataset distillation, targeting vision-language models. It resides in the 'Trajectory Matching and Gradient-Based Distillation' leaf, which contains only three papers total, including this work and two siblings. This indicates a relatively sparse research direction within the broader taxonomy of 50 papers across 36 topics. The focus on asynchronous trajectory matching and semantics-aware prototype mining positions the work at the intersection of trajectory-based distillation and multimodal optimization challenges.

The taxonomy tree reveals that the paper's immediate neighbors address foundational multimodal distillation techniques and efficiency concerns, while adjacent leaves explore distribution-based methods, generative approaches, and scalability advances. The broader 'Core Dataset Distillation Methods' branch sits alongside three other major directions: model compression for VLMs, cross-modal knowledge transfer, and task-specific applications. The paper's emphasis on asynchronous optimization and prototype mining distinguishes it from distribution-matching methods in neighboring leaves, though both address the challenge of synthesizing representative multimodal data without discrete class labels.

Among 18 candidates examined across the three contributions, no clearly refutable prior work was identified: 4 candidates were examined for the Asynchronous Matching framework, 8 for Semantics-Aware Prototype Mining, and 6 for the MMD-based dynamic sampling strategy, with 0 refutations in each case. This suggests that, within the limited search scope (top-K semantic matches plus citation expansion), the specific combination of asynchronous trajectory decoupling and feature-space clustering for prototype initialization appears not to have direct precedents. However, the analysis explicitly notes this is not an exhaustive literature search.

Based on the limited examination of 18 candidates, the work appears to introduce novel mechanisms for handling multimodal distillation challenges, particularly the asynchronous optimization dynamics and prototype-based initialization. The sparse population of its taxonomy leaf and absence of refutable candidates within the search scope suggest potential novelty, though the small scale of the literature search means substantial related work may exist beyond the examined set. The contribution's distinctiveness hinges on the specific integration of asynchronous matching with semantics-aware mining rather than individual components.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 18
Refutable Papers: 0

Research Landscape Overview

Core task: multimodal dataset distillation for vision-language models. The field organizes around four main branches.

Core Dataset Distillation Methods for Vision-Language Data focuses on condensing large-scale multimodal datasets into compact, representative subsets using trajectory matching, gradient-based techniques, and other distillation strategies; works like Multimodal Dataset Distillation[5] and Efficient Multimodal Distillation[1] exemplify this direction. Model Distillation and Compression for Vision-Language Models emphasizes reducing model size and computational cost by transferring knowledge from large teacher models to smaller students, often leveraging cross-modal alignment and layer-wise distillation (e.g., AMMKD[3], PromptKD[4]). Cross-Modal Knowledge Transfer and Fusion explores how to effectively align and fuse information across the vision and language modalities, addressing challenges like modality imbalance and zero-shot transfer (e.g., Preventing Zero-shot Degradation[7], Align before Fuse[34]). Task-Specific Applications and Extensions applies distillation techniques to downstream problems such as open-vocabulary detection, autonomous driving, and video understanding, demonstrating the practical utility of these methods.

A particularly active line of work within Core Dataset Distillation Methods centers on trajectory matching and gradient-based distillation, where the goal is to synthesize small datasets that mimic the training dynamics of full-scale data. Asynchronous Matching Multimodal[0] sits squarely in this cluster, addressing the challenge of aligning asynchronous gradient trajectories across the vision and language modalities, a key bottleneck when distilling multimodal data. Compared to Multimodal Dataset Distillation[5], which introduced foundational techniques for multimodal condensation, Asynchronous Matching Multimodal[0] emphasizes temporal alignment and modality-specific optimization schedules. Meanwhile, Efficient Multimodal Distillation[1] explores computational efficiency in similar settings, highlighting trade-offs between distillation quality and resource constraints. These works collectively push toward scalable, high-fidelity dataset distillation, though open questions remain about generalization across diverse vision-language architectures and task distributions.

Claimed Contributions

Asynchronous Matching with Dynamic Sampling (AMD) Framework

The authors propose a novel framework that decouples the sampling of image and text expert trajectories during multimodal dataset distillation. This asynchronous matching strategy addresses the inherent heterogeneity in learning dynamics between visual and text modalities, allowing more flexible combinations of parameters from different training epochs to improve synthetic data optimization.

4 retrieved papers
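As a rough illustration of the decoupling described above, the sketch below samples independent starting epochs for the image and text expert trajectories and evaluates an MTT-style normalized parameter-matching loss. The function names, the toy flat-list parameter representation, and the uniform epoch sampling are illustrative assumptions, not the authors' implementation.

```python
import random

def sample_async_start_epochs(max_start_epoch_img, max_start_epoch_txt):
    """Independently sample starting epochs for the image and text expert
    trajectories instead of forcing a single shared start point."""
    t_img = random.randint(0, max_start_epoch_img)
    t_txt = random.randint(0, max_start_epoch_txt)
    return t_img, t_txt

def trajectory_matching_loss(student, target, start):
    """Squared distance between student and expert-target parameters,
    normalized by the length of the expert segment (MTT-style).
    Parameters are represented as flat lists of floats for illustration."""
    num = sum((s - e) ** 2 for s, e in zip(student, target))
    den = sum((s0 - e) ** 2 for s0, e in zip(start, target)) + 1e-12
    return num / den
```

In an asynchronous setting, this loss would be computed per modality, with each modality's expert segment anchored at its own sampled starting epoch.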
Semantics-Aware Prototype Mining (SPM) Module

The authors introduce a module that performs clustering in the joint semantic feature space to identify representative sample prototypes. These prototypes replace randomly selected initial points and are used to initialize the synthesis process, substantially enhancing the diversity and representativeness of distilled samples without discrete class guidance.

8 retrieved papers
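The prototype-mining step can be sketched as a standard k-means pass over joint embeddings that returns, for each cluster, the index of the real sample nearest the centroid. This is a minimal sketch under the assumption that prototypes are nearest-to-centroid real samples; the function name and hyperparameters are illustrative, not taken from the paper.

```python
import numpy as np

def mine_prototypes(features, k, iters=20, seed=0):
    """Cluster joint image-text embeddings (n x d) with plain k-means and
    return the index of the real sample closest to each centroid. These
    indices would seed the synthetic set in place of random initialization."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # Assign every sample to its nearest centroid.
        dists = np.linalg.norm(features[:, None] - centroids[None], axis=-1)
        assign = dists.argmin(axis=1)
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = features[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    dists = np.linalg.norm(features[:, None] - centroids[None], axis=-1)
    return dists.argmin(axis=0)  # nearest real sample per centroid
```

Selecting real nearest-to-centroid samples (rather than the centroids themselves) keeps the initialization on the data manifold, which is one common way to realize "representative prototypes" without class labels.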
Maximum Mean Discrepancy Based Dynamic Sampling Strategy

The authors develop a data-driven sampling strategy that uses Maximum Mean Discrepancy to quantify parameter update magnitudes between consecutive epochs. This approach adaptively establishes differential sampling ranges for visual and text modalities based on their relative convergence dynamics, preventing excessive asynchronicity while capturing inter-modal learning speed discrepancies.

6 retrieved papers
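A minimal sketch of the ingredients this strategy describes: a squared MMD with an RBF kernel to compare (subsampled) parameter snapshots from consecutive expert checkpoints, plus an illustrative rule that scales each modality's trajectory-sampling window by its relative update magnitude. The windowing rule and all names here are assumptions for illustration, not the authors' formula.

```python
import numpy as np

def rbf_mmd2(x, y, sigma=1.0):
    """Squared MMD between two 1-D samples under an RBF kernel. Here x and
    y would be parameter values drawn from consecutive checkpoints of one
    modality's expert trajectory; a larger value means a larger update."""
    def k(a, b):
        d2 = (a[:, None] - b[None, :]) ** 2
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def sampling_ranges(mmd_img, mmd_txt, base_range):
    """Illustrative rule: split a shared epoch budget across modalities in
    proportion to their update magnitudes, so the slower-converging
    modality is sampled from a tighter window."""
    total = mmd_img + mmd_txt + 1e-12
    r_img = max(1, round(2 * base_range * mmd_img / total))
    r_txt = max(1, round(2 * base_range * mmd_txt / total))
    return r_img, r_txt
```

Bounding the two windows jointly is what prevents "excessive asynchronicity": the image and text start epochs can differ, but only within ranges tied to their measured convergence gap.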

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
