COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Complexity, Compositionality, Visual instruction tuning
Abstract:

Visual instruction tuning (VIT) datasets consist of randomly sampled image-question pairs without regard to the informativeness of each pair. Recent dataset selection methods have shown that a small fraction of such datasets, enriched with informative samples, can enable efficient finetuning of Multimodal Large Language Models. In this work, we explore the impact of task complexity on informative data curation and introduce COMPACT (COMPositional Atomic-to-complex Visual Capability Tuning), a VIT data recipe that scales training-sample complexity by combining multiple atomic visual capabilities in a single training example. Concretely, we synthesize rich and informative text questions for each image, allowing us to significantly reduce the number of training examples required for effective visual instruction tuning. COMPACT demonstrates superior data efficiency compared to existing data-reduction methods. When applied to the LLaVA-665K VIT dataset, COMPACT reduces the data budget by 90% while still achieving 100.2% of full-VIT performance (versus 97.5% for the state-of-the-art method) across eight multimodal benchmarks. Moreover, training on the same COMPACT data even improves performance over full-scale training on particularly complex benchmarks such as MM-Vet (+8.6%) and MMStar (+2.9%). COMPACT thus offers a scalable and efficient synthetic data-generation recipe for improving performance on vision-language tasks.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces COMPACT, a data recipe that synthesizes compositional visual instruction tuning examples by combining multiple atomic visual capabilities into single training instances. It resides in the 'Compositional Data Synthesis' leaf of the taxonomy, which contains only three papers total. This leaf sits under the broader 'Data-Efficient Visual Instruction Tuning' branch, indicating a relatively focused research direction. The small number of sibling papers suggests this specific approach—synthesizing complexity through atomic capability composition—represents an emerging rather than saturated area within visual instruction tuning.

The taxonomy reveals neighboring leaves focused on 'Informative Sample Selection' (selecting from existing datasets) and 'Instruction Quality and Diversity Enhancement' (improving instruction formulation). COMPACT diverges from these by generating new compositional examples rather than curating existing ones. The broader 'Model Architecture and Training Strategies' branch addresses complementary concerns like connector design and modality balancing, while 'Compositional Reasoning and Task Execution' explores inference-time decomposition. COMPACT's synthesis approach bridges data-centric efficiency with compositional reasoning, occupying a distinct position that emphasizes training-time capability integration rather than architectural or selection-based solutions.

Among thirty candidates examined, the contribution-level analysis shows varied novelty signals. For the core COMPACT recipe (Contribution 1), ten candidates were examined with zero refutations, suggesting limited direct prior work on this specific complexity-scaling synthesis approach. For the k-value complexity metric (Contribution 2), no refutations were found among another ten candidates. However, the atomic visual capability taxonomy (Contribution 3) encountered two refutable candidates among its ten examined, indicating existing frameworks for decomposing visual tasks. These statistics reflect a bounded search scope—top-K semantic matches plus citations—not exhaustive coverage, so the absence of refutations for Contributions 1-2 indicates novelty within this limited sample rather than definitive field-wide uniqueness.

Based on the limited thirty-candidate search, COMPACT appears to occupy a relatively novel position within compositional data synthesis, particularly in its complexity-scaling recipe and metric. The taxonomy decomposition shows more substantial prior work, as expected for foundational task categorization. The sparse population of the 'Compositional Data Synthesis' leaf and the contribution-level statistics together suggest meaningful differentiation from existing methods, though the restricted search scope means adjacent work outside the top-K matches may exist. The analysis captures novelty signals within examined candidates but does not claim exhaustive field coverage.

Taxonomy

Core-task taxonomy papers: 23
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 2

Research Landscape Overview

Core task: efficient visual instruction tuning through compositional capability integration. The field addresses how to train vision-language models that can follow complex instructions without requiring massive annotated datasets. The taxonomy reveals four main branches:

- Data-Efficient Visual Instruction Tuning focuses on reducing annotation costs through synthetic data generation and compositional data synthesis approaches such as Mosaic-IT[16] and Mosaic-IT Cost-Free[23].
- Model Architecture and Training Strategies explores parameter-efficient methods such as SMoLoRA[22] and steering techniques like Modality Linear Steering[4].
- Compositional Reasoning and Task Execution examines how models decompose and execute multi-step visual tasks, often drawing on visual programming paradigms such as Visual Programming[2] and modular approaches.
- Specialized Visual Instruction Applications targets domain-specific challenges in areas such as UI understanding with UI-Ins[21] and compositional image retrieval.

These branches collectively address the tension between model capability and training efficiency, with data synthesis and architectural innovations serving as complementary strategies. A particularly active line of work centers on compositional data synthesis, where researchers generate training examples by combining simpler visual and linguistic primitives rather than collecting exhaustive human annotations. COMPACT[0] sits squarely within this branch, emphasizing the integration of compositional capabilities through carefully designed data synthesis. This approach contrasts with neighboring works such as Mosaic-IT[16], which also synthesizes compositional instruction data but may differ in how capabilities are decomposed and recombined, and Mosaic-IT Cost-Free[23], which pushes the cost-reduction agenda further by eliminating certain annotation expenses.
The central trade-off across these methods involves balancing the diversity and realism of synthetic data against the computational overhead of generation, with open questions remaining about how well compositional training transfers to truly novel task combinations and whether synthetic data can fully substitute for human-curated examples in capturing nuanced visual reasoning.

Claimed Contributions

COMPACT data recipe for complexity-aware visual instruction tuning

The authors propose a novel data curation method that synthesizes training samples by combining multiple atomic visual capabilities in single examples, thereby increasing sample complexity and information density. This approach reduces the required training data volume while maintaining or improving performance on multimodal benchmarks.

Retrieved papers compared: 10
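The composition step described above can be illustrated with a minimal sketch. Everything below is hypothetical: the capability names come from this report (only three of the paper's ten are named), and the prompt-building function stands in for the paper's actual question-generation pipeline, which is not detailed here.

```python
import random

# Hypothetical pool of atomic visual capabilities; the paper defines ten,
# of which only three are named in this report.
ATOMIC_CAPABILITIES = [
    "object recognition",
    "spatial relationship",
    "color attribution",
]

def compose_training_sample(image_id: str, k: int, rng: random.Random) -> dict:
    """Draw k distinct atomic capabilities and bundle them into a single
    composite-question request for a downstream question generator."""
    capabilities = rng.sample(ATOMIC_CAPABILITIES, k)
    prompt = (
        f"For image {image_id}, write ONE question whose answer requires "
        + " AND ".join(capabilities)
        + "."
    )
    return {"image": image_id, "k": k, "capabilities": capabilities, "prompt": prompt}

# One k=2 sample: a single question that needs two capabilities at once.
sample = compose_training_sample("img_0042", k=2, rng=random.Random(0))
```

The key design point mirrored here is that complexity is scaled per example (one question requiring k capabilities) rather than by adding more single-capability examples.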
k-value complexity metric for vision-language tasks

The authors introduce a quantitative measure of task complexity defined as the number of atomic visual capabilities required to answer a question. They use this metric to characterize existing datasets and guide the generation of complexity-controlled training data.

Retrieved papers compared: 10
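Since the k-value is defined as the number of atomic visual capabilities a question requires, computing it reduces to counting distinct capability tags. This sketch assumes such tags are available per question (how the paper actually derives them is not specified in this report):

```python
from collections import Counter

def k_value(required_capabilities: list[str]) -> int:
    """k-value of a question: the number of distinct atomic visual
    capabilities needed to answer it. Duplicate tags count once."""
    return len(set(required_capabilities))

def k_histogram(dataset: list[list[str]]) -> Counter:
    """Characterize a dataset by its distribution of k-values, as the
    paper does when comparing existing VIT data to complexity-controlled data."""
    return Counter(k_value(tags) for tags in dataset)

# A question needing object recognition plus a spatial relation has k = 2;
# repeated use of one capability still counts once.
assert k_value(["object recognition", "spatial relationship"]) == 2
assert k_value(["color attribution", "color attribution"]) == 1
```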
Taxonomy of atomic visual capabilities

The authors establish a taxonomy of 10 fundamental vision-centric skills (including object recognition, spatial relationship, color attribution, etc.) that serve as building blocks for constructing complex visual reasoning tasks through compositional combination.

Retrieved papers compared: 10
Status: can refute (2 refutable candidates identified)
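Such a taxonomy amounts to a fixed vocabulary of capability labels that compositional samples draw from. The sketch below uses only the three capabilities named in this report; the remaining seven are defined in the paper and deliberately omitted rather than guessed.

```python
from enum import Enum

class AtomicCapability(Enum):
    """Three of the paper's ten atomic visual capabilities.
    The other seven members are omitted here, not invented."""
    OBJECT_RECOGNITION = "object recognition"
    SPATIAL_RELATIONSHIP = "spatial relationship"
    COLOR_ATTRIBUTION = "color attribution"

def validate_tags(tags: list[str]) -> bool:
    """Check that a sample's capability tags all come from the taxonomy,
    the kind of closed-vocabulary check a synthesis pipeline would need."""
    known = {c.value for c in AtomicCapability}
    return all(tag in known for tag in tags)
```

Keeping the capability set closed is what makes the k-value well defined: every question's complexity is measured against the same fixed inventory of skills.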

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: COMPACT data recipe for complexity-aware visual instruction tuning
Contribution 2: k-value complexity metric for vision-language tasks
Contribution 3: Taxonomy of atomic visual capabilities