When MLLMs Meet Compression Distortion: A Coding Paradigm Tailored to MLLMs
Overview
Overall Novelty Assessment
The paper proposes CoTAM, an image codec designed to preserve multi-level features critical for MLLM downstream tasks, alongside a systematic analysis of how compression artifacts affect different feature levels. It resides in the 'Neural Codec Optimization for MLLMs' leaf, which contains four papers total, including the original work. This leaf sits within the broader 'Image and Video Codec Design for MLLMs' branch, indicating a moderately populated research direction focused on end-to-end compression systems rather than token-level reduction strategies explored in sibling branches.
The taxonomy reveals that codec-level compression (this paper's focus) is distinct from visual token compression methods, which dominate the field with training-free and training-based approaches across multiple leaves. Neighboring work in 'Semantic and Perceptual Compression' explores extremely low-bitrate codecs with semantic disentanglement, while 'LLM-Powered Lossless Compression' addresses entropy modeling. The paper's multi-level feature preservation approach bridges codec design and downstream task adaptation, connecting to 'Task-Oriented Feature Compression' in the context-adaptive branch, though the latter focuses on token-level rather than codec-level optimization.
Among the thirty candidates examined, none clearly refutes the three core contributions: the systematic distortion analysis (ten candidates, zero refutations), the CoTAM codec design (ten candidates, zero refutations), and hierarchical guidance for high-resolution inputs (ten candidates, zero refutations). The sibling papers in the same leaf address video codec integration, latent-space compression, and high-efficiency compression; within the limited search scope, these works appear to pursue different codec optimization strategies rather than overlapping directly with CoTAM's multi-level feature preservation and CLIP-based importance mapping.
Based on the top-thirty semantic matches and the taxonomy structure, the work appears to occupy a relatively sparse niche within codec design for MLLMs, focusing on adaptive multi-level feature protection rather than general-purpose compression or token reduction. The analysis does not exhaustively cover prior work in traditional image compression or the broader MLLM efficiency literature, leaving open how CoTAM relates to codec standards and methods outside the examined candidates.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors provide a systematic investigation revealing that compression distortion unevenly impacts different-level image features in MLLMs. They discover that tasks relying on cross-level features are highly susceptible to compression artifacts, while tasks depending on either low-level structural features or coarse high-level semantics remain relatively robust.
The authors introduce CoTAM, a novel codec framework that uses CLIP-based shallow-layer attention for semantic-guided bit allocation at the encoder and employs a lightweight adapter with multi-level loss at the decoder to preserve both low-level details and high-level semantic context for MLLMs.
The authors develop a hierarchical guidance approach that fuses global and local semantic maps to handle high-resolution images and extends the codec to video MLLMs by applying frame-by-frame semantic guidance, addressing the challenge of maintaining both local precision and global semantic awareness.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[2] When video coding meets multimodal large language models: A unified paradigm for video coding
[5] Bridging compressed image latents and multimodal large language models
[7] High efficiency image compression for large visual-language models
Contribution Analysis
Detailed comparisons for each claimed contribution
Systematic analysis of compression distortion impact on MLLMs
The authors provide a systematic investigation revealing that compression distortion unevenly impacts different-level image features in MLLMs. They discover that tasks relying on cross-level features are highly susceptible to compression artifacts, while tasks depending on either low-level structural features or coarse high-level semantics remain relatively robust.
[5] Bridging compressed image latents and multimodal large language models
[7] High efficiency image compression for large visual-language models
[8] Divico: Disentangled visual token compression for efficient large vision-language model
[15] Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language Models
[28] HybridToken-VLM: Hybrid Token Compression for Vision-Language Models
[71] Constructive distortion: Improving mllms with attention-guided image warping
[72] IPCV: Information-Preserving Compression for MLLM Visual Encoders
[73] PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
[74] Research on Intelligent System of Multimodal Deep Learning in Image Recognition
[75] Feature Compression for Cloud-Edge Multimodal 3D Object Detection
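The level-dependent impact claimed in this contribution can be probed with a toy sketch: extract features at several depths from an original and a distorted input and compare them level by level with cosine similarity. Everything below is illustrative, not the authors' code: the random-projection "feature hierarchy", the additive-noise model of compression distortion, and the helper names are all assumptions standing in for a real MLLM vision encoder and a real codec.

```python
import numpy as np

def cosine(a, b):
    # cosine similarity between two flat feature vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

rng = np.random.default_rng(0)
x = rng.standard_normal(512)                  # stand-in for an original input embedding
x_comp = x + 0.3 * rng.standard_normal(512)   # compression modeled as additive distortion

# toy 3-level feature hierarchy: each level is a fixed random projection + nonlinearity,
# so distortion can compound as it propagates to higher levels
W = [rng.standard_normal((512, 512)) / np.sqrt(512) for _ in range(3)]

def features(v):
    feats = []
    for Wi in W:
        v = np.tanh(Wi @ v)
        feats.append(v)
    return feats

for lvl, (f, g) in enumerate(zip(features(x), features(x_comp)), start=1):
    print(f"level {lvl}: cosine similarity = {cosine(f, g):.3f}")
```

Running the same probe with a real encoder (e.g. comparing per-layer activations before and after JPEG at several quality factors) is the kind of measurement the paper's analysis describes; this sketch only shows the comparison machinery.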
CoTAM: image codec tailored to MLLMs
The authors introduce CoTAM, a novel codec framework that uses CLIP-based shallow-layer attention for semantic-guided bit allocation at the encoder and employs a lightweight adapter with multi-level loss at the decoder to preserve both low-level details and high-level semantic context for MLLMs.
[61] Toward semantic communications: Deep learning-based image semantic coding
[62] Towards 360 image compression for machines via modulating pixel significance
[63] Semantic communications: Principles and challenges
[64] DLF: Extreme Image Compression with Dual-generative Latent Fusion
[65] Semantic-assisted image compression
[66] Agnostic Feature Compression with Semantic Guided Channel Importance Analysis
[67] Semantic Prior-Guided Scalable Image Coding
[68] Deep learning-based image semantic coding for semantic communications
[69] Your Demands Deserve More Bits: Referring Semantic Image Compression at Ultra-low Bitrate
[70] Learning convolutional networks for content-weighted image compression
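The encoder-side idea in this contribution, semantic-guided bit allocation, can be sketched as mapping a per-block importance map to per-block quantization-step multipliers, so that semantically important regions are quantized more finely. The sketch below is a minimal assumption-laden stand-in: the importance map would come from CLIP shallow-layer attention in CoTAM, but here it is just a random array, and `bit_allocation`, `q_min`, and `q_max` are hypothetical names and values, not the paper's.

```python
import numpy as np

def bit_allocation(importance, q_min=0.5, q_max=2.0):
    """Map a per-block importance map to quantization-step multipliers.

    High-importance blocks receive a small multiplier (finer quantization,
    hence more bits); low-importance blocks receive a coarse one.
    """
    # normalize importance to [0, 1]
    imp = (importance - importance.min()) / (np.ptp(importance) + 1e-8)
    # linearly interpolate between the coarsest and finest quantization steps
    return q_max - imp * (q_max - q_min)

# stand-in for a CLIP shallow-layer attention map over a 4x4 grid of image blocks
rng = np.random.default_rng(1)
attn = rng.random((4, 4))
q = bit_allocation(attn)
```

A real codec would feed `q` into its rate-control loop (e.g. scaling the latent quantizer per spatial block); the linear interpolation here is only one plausible mapping.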
Hierarchical guidance mechanism for high-resolution and video inputs
The authors develop a hierarchical guidance approach that fuses global and local semantic maps to handle high-resolution images and extends the codec to video MLLMs by applying frame-by-frame semantic guidance, addressing the challenge of maintaining both local precision and global semantic awareness.
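The fusion of global and local semantic maps described above can be sketched as upsampling a coarse whole-image map to a tile's resolution and blending it with that tile's local map. Everything here is an illustrative assumption: the nearest-neighbour upsampling, the linear blend, and the `alpha` weight are not taken from the paper, which fuses the maps within its own hierarchical guidance mechanism.

```python
import numpy as np

def fuse_maps(global_map, local_map, alpha=0.5):
    """Fuse a coarse global semantic map with a full-resolution local map.

    The global map is nearest-neighbour upsampled to the local map's
    resolution, then linearly blended; alpha trades global context against
    local detail (0.5 is an illustrative default, not the paper's value).
    """
    gh, gw = global_map.shape
    lh, lw = local_map.shape
    rows = (np.arange(lh) * gh) // lh   # nearest-neighbour row indices
    cols = (np.arange(lw) * gw) // lw   # nearest-neighbour column indices
    upsampled = global_map[np.ix_(rows, cols)]
    return alpha * upsampled + (1.0 - alpha) * local_map

rng = np.random.default_rng(2)
g = rng.random((4, 4))        # coarse global semantic map for the whole image
loc = rng.random((16, 16))    # fine local semantic map for one high-res tile
fused = fuse_maps(g, loc)
```

For video, the paper applies semantic guidance frame by frame, which in this sketch would simply mean calling `fuse_maps` once per frame's maps.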