COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Complexity, Compositionality, Visual instruction tuning
Abstract:

Visual instruction tuning (VIT) datasets consist of randomly sampled image-question pairs without regard to the informativeness of each pair. Recent dataset selection methods have shown that a small fraction of such datasets, enriched with informative samples, can enable efficient finetuning of Multimodal Large Language Models. In this work, we explore the impact of task complexity on informative data curation and introduce COMPACT (COMPositional Atomic-to-complex Visual Capability Tuning), a VIT data recipe that scales training-sample complexity by combining multiple atomic visual capabilities in a single training example. Concretely, we synthesize rich and informative text questions for each image, allowing us to significantly reduce the number of training examples required for effective visual instruction tuning. COMPACT demonstrates superior data efficiency compared to existing data-reduction methods. When applied to the LLaVA-665K VIT dataset, COMPACT reduces the data budget by 90% while still achieving 100.2% of full-VIT performance (versus 97.5% for the state-of-the-art method) across eight multimodal benchmarks. Moreover, training on the same COMPACT data even improves performance over full-scale training on particularly complex benchmarks such as MM-Vet (+8.6%) and MMStar (+2.9%). COMPACT thus offers a scalable and efficient synthetic data-generation recipe for improving performance on vision-language tasks.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces COMPACT, a data recipe that synthesizes compositional visual instruction tuning examples by combining multiple atomic visual capabilities into single training instances. It resides in the 'Compositional Data Synthesis' leaf of the taxonomy, which contains only three papers total. This leaf sits under the broader 'Data-Efficient Visual Instruction Tuning' branch, indicating a relatively focused research direction. The small number of sibling papers suggests this specific approach—synthesizing complexity through atomic capability composition—represents an emerging rather than saturated area within visual instruction tuning.

The taxonomy reveals neighboring leaves focused on 'Informative Sample Selection' (selecting from existing datasets) and 'Instruction Quality and Diversity Enhancement' (improving instruction formulation). COMPACT diverges from these by generating new compositional examples rather than curating existing ones. The broader 'Model Architecture and Training Strategies' branch addresses complementary concerns like connector design and modality balancing, while 'Compositional Reasoning and Task Execution' explores inference-time decomposition. COMPACT's synthesis approach bridges data-centric efficiency with compositional reasoning, occupying a distinct position that emphasizes training-time capability integration rather than architectural or selection-based solutions.

Among thirty candidates examined, the contribution-level analysis shows varied novelty signals. For the core COMPACT recipe (Contribution 1), ten candidates were examined with zero refutations, suggesting limited direct prior work on this specific complexity-scaling synthesis approach. For the k-value complexity metric (Contribution 2), no refutations were found among another ten candidates. However, the atomic visual capability taxonomy (Contribution 3) encountered two refutable candidates among its ten examined, indicating existing frameworks for decomposing visual tasks. These statistics reflect a bounded search scope—top-K semantic matches plus citations—not exhaustive coverage, so the absence of refutations for Contributions 1-2 indicates novelty within this limited sample rather than definitive field-wide uniqueness.

Based on the limited thirty-candidate search, COMPACT appears to occupy a relatively novel position within compositional data synthesis, particularly in its complexity-scaling recipe and metric. The taxonomy decomposition shows more substantial prior work, as expected for foundational task categorization. The sparse population of the 'Compositional Data Synthesis' leaf and the contribution-level statistics together suggest meaningful differentiation from existing methods, though the restricted search scope means adjacent work outside the top-K matches may exist. The analysis captures novelty signals within examined candidates but does not claim exhaustive field coverage.

Taxonomy

Core-task taxonomy papers: 23
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 2

Research Landscape Overview

Core task: efficient visual instruction tuning through compositional capability integration. The field addresses how to train vision-language models that can follow complex instructions without requiring massive annotated datasets. The taxonomy reveals four main branches:

- Data-Efficient Visual Instruction Tuning focuses on reducing annotation costs through synthetic data generation and compositional data synthesis approaches such as Mosaic-IT[16] and Mosaic-IT Cost-Free[23].
- Model Architecture and Training Strategies explores parameter-efficient methods such as SMoLoRA[22] and steering techniques like Modality Linear Steering[4].
- Compositional Reasoning and Task Execution examines how models decompose and execute multi-step visual tasks, often drawing on visual programming paradigms such as Visual Programming[2] and modular approaches.
- Specialized Visual Instruction Applications targets domain-specific challenges in areas such as UI understanding with UI-Ins[21] and compositional image retrieval.

These branches collectively address the tension between model capability and training efficiency, with data synthesis and architectural innovations serving as complementary strategies. A particularly active line of work centers on compositional data synthesis, where researchers generate training examples by combining simpler visual and linguistic primitives rather than collecting exhaustive human annotations. COMPACT[0] sits squarely within this branch, emphasizing the integration of compositional capabilities through carefully designed data synthesis. This approach contrasts with neighboring works such as Mosaic-IT[16], which also synthesizes compositional instruction data but may differ in how capabilities are decomposed and recombined, and Mosaic-IT Cost-Free[23], which pushes the cost-reduction agenda further by eliminating certain annotation expenses.
The central trade-off across these methods involves balancing the diversity and realism of synthetic data against the computational overhead of generation, with open questions remaining about how well compositional training transfers to truly novel task combinations and whether synthetic data can fully substitute for human-curated examples in capturing nuanced visual reasoning.

Claimed Contributions

COMPACT data recipe for complexity-aware visual instruction tuning

The authors propose a novel data curation method that synthesizes training samples by combining multiple atomic visual capabilities in single examples, thereby increasing sample complexity and information density. This approach reduces the required training data volume while maintaining or improving performance on multimodal benchmarks.

Retrieved papers compared: 10
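The composition step described above can be illustrated with a minimal sketch. Everything below is hypothetical: the capability names come from this report (only three of the paper's ten are named), and the prompt-building function stands in for the paper's actual question-generation pipeline, which is not detailed here.

```python
import random

# Hypothetical pool of atomic visual capabilities; the paper defines ten,
# of which only three are named in this report.
ATOMIC_CAPABILITIES = [
    "object recognition",
    "spatial relationship",
    "color attribution",
]

def compose_training_sample(image_id: str, k: int, rng: random.Random) -> dict:
    """Draw k distinct atomic capabilities and bundle them into a single
    composite-question request for a downstream question generator."""
    capabilities = rng.sample(ATOMIC_CAPABILITIES, k)
    prompt = (
        f"For image {image_id}, write ONE question whose answer requires "
        + " AND ".join(capabilities)
        + "."
    )
    return {"image": image_id, "k": k, "capabilities": capabilities, "prompt": prompt}

# One k=2 sample: a single question that needs two capabilities at once.
sample = compose_training_sample("img_0042", k=2, rng=random.Random(0))
```

The key design point mirrored here is that complexity is scaled per example (one question requiring k capabilities) rather than by adding more single-capability examples.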
k-value complexity metric for vision-language tasks

The authors introduce a quantitative measure of task complexity defined as the number of atomic visual capabilities required to answer a question. They use this metric to characterize existing datasets and guide the generation of complexity-controlled training data.

Retrieved papers compared: 10
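Since the k-value is defined as the number of atomic visual capabilities a question requires, computing it reduces to counting distinct capability tags. This sketch assumes such tags are available per question (how the paper actually derives them is not specified in this report):

```python
from collections import Counter

def k_value(required_capabilities: list[str]) -> int:
    """k-value of a question: the number of distinct atomic visual
    capabilities needed to answer it. Duplicate tags count once."""
    return len(set(required_capabilities))

def k_histogram(dataset: list[list[str]]) -> Counter:
    """Characterize a dataset by its distribution of k-values, as the
    paper does when comparing existing VIT data to complexity-controlled data."""
    return Counter(k_value(tags) for tags in dataset)

# A question needing object recognition plus a spatial relation has k = 2;
# repeated use of one capability still counts once.
assert k_value(["object recognition", "spatial relationship"]) == 2
assert k_value(["color attribution", "color attribution"]) == 1
```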
Taxonomy of atomic visual capabilities

The authors establish a taxonomy of 10 fundamental vision-centric skills (including object recognition, spatial relationship, color attribution, etc.) that serve as building blocks for constructing complex visual reasoning tasks through compositional combination.

Retrieved papers compared: 10
Status: can refute (2 refutable candidates identified)
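Such a taxonomy amounts to a fixed vocabulary of capability labels that compositional samples draw from. The sketch below uses only the three capabilities named in this report; the remaining seven are defined in the paper and deliberately omitted rather than guessed.

```python
from enum import Enum

class AtomicCapability(Enum):
    """Three of the paper's ten atomic visual capabilities.
    The other seven members are omitted here, not invented."""
    OBJECT_RECOGNITION = "object recognition"
    SPATIAL_RELATIONSHIP = "spatial relationship"
    COLOR_ATTRIBUTION = "color attribution"

def validate_tags(tags: list[str]) -> bool:
    """Check that a sample's capability tags all come from the taxonomy,
    the kind of closed-vocabulary check a synthesis pipeline would need."""
    known = {c.value for c in AtomicCapability}
    return all(tag in known for tag in tags)
```

Keeping the capability set closed is what makes the k-value well defined: every question's complexity is measured against the same fixed inventory of skills.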

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: COMPACT data recipe for complexity-aware visual instruction tuning
Contribution 2: k-value complexity metric for vision-language tasks
Contribution 3: Taxonomy of atomic visual capabilities