COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning
Overview
Overall Novelty Assessment
The paper introduces COMPACT, a data recipe that synthesizes compositional visual instruction tuning examples by combining multiple atomic visual capabilities into single training instances. It resides in the 'Compositional Data Synthesis' leaf of the taxonomy, which contains only three papers total. This leaf sits under the broader 'Data-Efficient Visual Instruction Tuning' branch, indicating a relatively focused research direction. The small number of sibling papers suggests this specific approach—synthesizing complexity through atomic capability composition—represents an emerging rather than saturated area within visual instruction tuning.
The taxonomy reveals neighboring leaves focused on 'Informative Sample Selection' (selecting from existing datasets) and 'Instruction Quality and Diversity Enhancement' (improving instruction formulation). COMPACT diverges from these by generating new compositional examples rather than curating existing ones. The broader 'Model Architecture and Training Strategies' branch addresses complementary concerns like connector design and modality balancing, while 'Compositional Reasoning and Task Execution' explores inference-time decomposition. COMPACT's synthesis approach bridges data-centric efficiency with compositional reasoning, occupying a distinct position that emphasizes training-time capability integration rather than architectural or selection-based solutions.
Among the thirty candidates examined, the contribution-level analysis shows varied novelty signals. For the core COMPACT recipe (Contribution 1), ten candidates were examined with zero refutations, suggesting limited direct prior work on this specific complexity-scaling synthesis approach. For the k-value complexity metric (Contribution 2), the ten candidates examined likewise yielded no refutations. The atomic visual capability taxonomy (Contribution 3), however, encountered two refutable candidates among the ten examined, indicating existing frameworks for decomposing visual tasks. These statistics reflect a bounded search scope (top-K semantic matches plus citations) rather than exhaustive coverage, so the absence of refutations for Contributions 1 and 2 indicates novelty within this limited sample rather than definitive field-wide uniqueness.
Based on the limited thirty-candidate search, COMPACT appears to occupy a relatively novel position within compositional data synthesis, particularly in its complexity-scaling recipe and metric. The taxonomy decomposition shows more substantial prior work, as expected for foundational task categorization. The sparse population of the 'Compositional Data Synthesis' leaf and the contribution-level statistics together suggest meaningful differentiation from existing methods, though the restricted search scope means adjacent work outside the top-K matches may exist. The analysis captures novelty signals within examined candidates but does not claim exhaustive field coverage.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose a novel data curation method that synthesizes training samples by combining multiple atomic visual capabilities in single examples, thereby increasing sample complexity and information density. This approach reduces the required training data volume while maintaining or improving performance on multimodal benchmarks.
The authors introduce a quantitative measure of task complexity defined as the number of atomic visual capabilities required to answer a question. They use this metric to characterize existing datasets and guide the generation of complexity-controlled training data.
The authors establish a taxonomy of 10 fundamental vision-centric skills (including object recognition, spatial relationship, and color attribution) that serve as building blocks for constructing complex visual reasoning tasks through compositional combination.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[16] Mosaic-IT: Free compositional data augmentation improves instruction tuning
[23] Mosaic-IT: Cost-Free Compositional Data Synthesis for Instruction Tuning
Contribution Analysis
Detailed comparisons for each claimed contribution
COMPACT data recipe for complexity-aware visual instruction tuning
The authors propose a novel data curation method that synthesizes training samples by combining multiple atomic visual capabilities in single examples, thereby increasing sample complexity and information density. This approach reduces the required training data volume while maintaining or improving performance on multimodal benchmarks.
[43] CogVLM: Visual expert for pretrained language models
[44] In-context compositional generalization for large vision-language models
[45] Visual In-Context Learning for Large Vision-Language Models
[46] Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model
[47] mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
[48] Modality-experts coordinated adaptation for large multimodal models
[49] Visual program distillation: Distilling tools and programmatic reasoning into vision-language models
[50] Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model
[51] CLIP-Adapter: Better Vision-Language Models with Feature Adapters
[52] Natural language inference improves compositionality in vision-language models
k-value complexity metric for vision-language tasks
The authors introduce a quantitative measure of task complexity defined as the number of atomic visual capabilities required to answer a question. They use this metric to characterize existing datasets and guide the generation of complexity-controlled training data.
[24] CREPE: Can vision-language foundation models reason compositionally?
[25] AutoEval-Video: An automatic benchmark for assessing large vision language models in open-ended video question answering
[26] Crosscheck-bench: Diagnosing compositional failures in multimodal conflict resolution
[27] AVA-Bench: Atomic visual ability benchmark for vision foundation models
[28] Bear: Benchmarking and enhancing multimodal language models for atomic embodied capabilities
[29] Do vision-language models have internal world models? towards an atomic evaluation
[30] Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models?
[31] Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation
[32] Understanding Hardness of Vision-Language Compositionality from A Token-level Causal Lens
[33] Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models
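As a concrete illustration of the metric described above, here is a small sketch that assumes each example already carries annotations of which atomic capabilities its question requires; how those labels are produced (e.g., by the synthesis pipeline or a labeling model) is not shown. The `k_value` and `complexity_profile` names are hypothetical.

```python
from collections import Counter

def k_value(required_capabilities) -> int:
    """k = number of distinct atomic capabilities needed to answer a question."""
    return len(set(required_capabilities))

def complexity_profile(dataset) -> Counter:
    """Histogram of k over a dataset whose examples are annotated with the
    capabilities each question requires (label source is assumed)."""
    return Counter(k_value(example["capabilities"]) for example in dataset)

# Toy data: much existing visual instruction data would concentrate at k = 1,
# while COMPACT-style synthesis explicitly targets higher k.
toy_dataset = [
    {"question": "What color is the mug?",
     "capabilities": ["color attribution"]},
    {"question": "How many books are to the left of the mug?",
     "capabilities": ["counting", "spatial relationship", "object recognition"]},
]
print(complexity_profile(toy_dataset))  # Counter({1: 1, 3: 1})
```

Computed this way, k serves both roles described in the contribution: profiling the complexity distribution of existing datasets and specifying a target complexity when generating new data.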
Taxonomy of atomic visual capabilities
The authors establish a taxonomy of 10 fundamental vision-centric skills (including object recognition, spatial relationship, and color attribution) that serve as building blocks for constructing complex visual reasoning tasks through compositional combination.
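A capability taxonomy of this kind can be encoded directly as an enumeration, with compositional tasks specified as subsets of atoms. The sketch below assumes placeholder skill names beyond the three quoted in the contribution statement (object recognition, spatial relationship, color attribution); the authors' exact 10-skill list may differ.

```python
from enum import Enum

class AtomicCapability(Enum):
    """Illustrative encoding of a 10-skill atomic capability taxonomy. Only the
    first three names are quoted from the contribution statement; the remaining
    entries are plausible placeholders, not the authors' exact taxonomy."""
    OBJECT_RECOGNITION = "object recognition"
    SPATIAL_RELATIONSHIP = "spatial relationship"
    COLOR_ATTRIBUTION = "color attribution"
    COUNTING = "counting"
    TEXT_RECOGNITION = "text recognition"
    SHAPE_ATTRIBUTION = "shape attribution"
    SIZE_COMPARISON = "size comparison"
    ACTION_RECOGNITION = "action recognition"
    SCENE_UNDERSTANDING = "scene understanding"
    RELATIVE_DEPTH = "relative depth"

# A compositional task is then just a subset of atoms; its size is the k-value
# used by Contribution 2.
example_task = {AtomicCapability.COUNTING, AtomicCapability.SPATIAL_RELATIONSHIP}
print(len(example_task))  # 2
```

Representing the atoms as a fixed, enumerable set is what makes the compositional recipe and the k-value metric well defined: both sample from, or count over, this same vocabulary of skills.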