When MLLMs Meets Compression Distortion: A Coding Paradigm Tailored to MLLMs

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Image Coding, Image Compression, Multimodal Large Language Models
Abstract:

The increasing deployment of powerful Multimodal Large Language Models (MLLMs), typically hosted on cloud platforms, urgently requires effective compression techniques for transmitting signal inputs (e.g., images, videos) from edge devices with minimal bandwidth usage. However, conventional image codecs are optimized for fidelity under the Human Visual System (HVS) and are ill-suited for MLLMs, whose diverse downstream tasks must be considered jointly. In this paper, we first systematically analyze the impact of compression artifacts on several mainstream MLLMs. We find that compression distortion affects image features at different levels unevenly, so its effect on an MLLM's downstream tasks varies with the feature levels each task relies on. Motivated by this finding, we propose an image Codec TAilored to MLLMs (CoTAM), designed to adaptively protect multi-level features and suit the differing demands of downstream tasks. The encoder leverages CLIP's shallow-layer attention to generate an importance map for bit allocation, preserving critical semantic regions. Concurrently, the decoder integrates a lightweight adapter with a multi-level loss function to ensure faithful reconstruction of both low-level details and high-level semantic context, enabling robust synthesis of cross-level features. Extensive experiments validate that our method achieves up to 35.99% bitrate savings while maintaining the same performance on MLLM tasks, outperforming previous SOTA neural codecs.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes CoTAM, an image codec designed to preserve multi-level features critical for MLLM downstream tasks, alongside a systematic analysis of how compression artifacts affect different feature levels. It resides in the 'Neural Codec Optimization for MLLMs' leaf, which contains four papers total, including the original work. This leaf sits within the broader 'Image and Video Codec Design for MLLMs' branch, indicating a moderately populated research direction focused on end-to-end compression systems rather than token-level reduction strategies explored in sibling branches.

The taxonomy reveals that codec-level compression (this paper's focus) is distinct from visual token compression methods, which dominate the field with training-free and training-based approaches across multiple leaves. Neighboring work in 'Semantic and Perceptual Compression' explores extremely low-bitrate codecs with semantic disentanglement, while 'LLM-Powered Lossless Compression' addresses entropy modeling. The paper's multi-level feature preservation approach bridges codec design and downstream task adaptation, connecting to 'Task-Oriented Feature Compression' in the context-adaptive branch, though the latter focuses on token-level rather than codec-level optimization.

Among thirty candidates examined, none clearly refute the three core contributions: systematic distortion analysis (ten candidates, zero refutations), the CoTAM codec design (ten candidates, zero refutations), and hierarchical guidance for high-resolution inputs (ten candidates, zero refutations). The sibling papers in the same leaf address video codec integration, latent-space compression, and high-efficiency compression, but the limited search scope suggests these works explore different codec optimization strategies rather than overlapping directly with CoTAM's multi-level feature preservation and CLIP-based importance mapping.

Based on the top-thirty semantic matches and taxonomy structure, the work appears to occupy a relatively sparse niche within codec design for MLLMs, focusing on adaptive multi-level feature protection rather than general-purpose compression or token reduction. The analysis does not cover exhaustive prior work in traditional image compression or broader MLLM efficiency literature, leaving open questions about how CoTAM relates to codec standards outside the examined candidates.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Image compression for multimodal large language models. The field addresses the challenge of efficiently encoding visual information for MLLMs, which must process high-resolution images while managing computational and memory constraints.

The taxonomy reveals several complementary research directions: Visual Token Compression Methods focus on reducing the number of tokens fed to the language model through pruning, merging, or adaptive selection strategies (e.g., Tokencarve[4], Beyond LLaVA-HD[9]); Image and Video Codec Design for MLLMs explores neural codecs and traditional compression standards optimized for downstream MLLM tasks (e.g., Video Coding MLLMs[2], Voco-llama[3]); Efficient MLLM Architectures and Training investigates architectural innovations and training regimes that inherently reduce computational overhead; while Model Compression and Quantization applies post-training techniques to shrink model size. Additional branches cover video-specific temporal modeling, alternative input representations, evaluation frameworks, domain applications, and comparative surveys, collectively forming a landscape where compression happens at multiple stages, from raw pixels to latent representations to final model weights.

A particularly active tension exists between codec-level compression and token-level reduction strategies. Works like Compressed Image Latents[5] and High Efficiency Compression[7] optimize neural or traditional codecs to preserve task-relevant information at lower bitrates, while methods such as Deco[1] and Unicode[6] focus on compressing the intermediate visual token sequences. MLLMs Compression Distortion[0] sits within the Neural Codec Optimization for MLLMs branch, closely examining how compression artifacts propagate through the MLLM pipeline, a perspective that bridges codec design and downstream task performance.
Compared to Voco-llama[3], which emphasizes video codec integration, and Compressed Image Latents[5], which explores latent-space compression, MLLMs Compression Distortion[0] appears to investigate the fundamental trade-offs between compression efficiency and model accuracy, providing insights into distortion tolerance that inform both codec designers and token reduction practitioners across the taxonomy.

Claimed Contributions

Systematic analysis of compression distortion impact on MLLMs

The authors provide a systematic investigation revealing that compression distortion unevenly impacts different-level image features in MLLMs. They discover that tasks relying on cross-level features are highly susceptible to compression artifacts, while tasks depending on either low-level structural features or coarse high-level semantics remain relatively robust.

10 retrieved papers
CoTAM: image codec tailored to MLLMs

The authors introduce CoTAM, a novel codec framework that uses CLIP-based shallow-layer attention for semantic-guided bit allocation at the encoder and employs a lightweight adapter with multi-level loss at the decoder to preserve both low-level details and high-level semantic context for MLLMs.

10 retrieved papers
Hierarchical guidance mechanism for high-resolution and video inputs

The authors develop a hierarchical guidance approach that fuses global and local semantic maps to handle high-resolution images and extends the codec to video MLLMs by applying frame-by-frame semantic guidance, addressing the challenge of maintaining both local precision and global semantic awareness.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Systematic analysis of compression distortion impact on MLLMs

The authors provide a systematic investigation revealing that compression distortion unevenly impacts different-level image features in MLLMs. They discover that tasks relying on cross-level features are highly susceptible to compression artifacts, while tasks depending on either low-level structural features or coarse high-level semantics remain relatively robust.
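The level-dependent sensitivity described above can be illustrated with a toy probe. The sketch below is not the authors' protocol: it substitutes uniform quantization for a real codec, and uses image gradients and heavy average pooling as stand-ins for low-level and high-level features, showing how the same distortion can perturb the two levels very differently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": smooth coarse structure plus fine texture.
x = np.linspace(0, 1, 64)
img = np.outer(np.sin(4 * x), np.cos(4 * x))    # coarse structure
img += 0.2 * rng.standard_normal((64, 64))      # fine texture

def quantize(im, step):
    """Uniform quantization as a crude stand-in for codec distortion."""
    return np.round(im / step) * step

def low_level_feat(im):
    """Low-level proxy: horizontal/vertical gradients (edges, texture)."""
    gy, gx = np.gradient(im)
    return np.concatenate([gx.ravel(), gy.ravel()])

def high_level_feat(im, pool=16):
    """High-level proxy: heavy average pooling (coarse 'semantics')."""
    h, w = im.shape
    return im.reshape(h // pool, pool, w // pool, pool).mean(axis=(1, 3)).ravel()

def rel_err(f_ref, f_dist):
    """Relative feature error between reference and distorted inputs."""
    return np.linalg.norm(f_ref - f_dist) / np.linalg.norm(f_ref)

for step in (0.1, 0.3, 0.5):
    dist = quantize(img, step)
    print(f"step={step}: "
          f"low-level err={rel_err(low_level_feat(img), low_level_feat(dist)):.3f}, "
          f"high-level err={rel_err(high_level_feat(img), high_level_feat(dist)):.3f}")
```

Because quantization noise is roughly zero-mean, heavy pooling averages it out, while gradient features absorb it almost fully; a task leaning on both levels at once would inherit the worst of each.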

Contribution

CoTAM: image codec tailored to MLLMs

The authors introduce CoTAM, a novel codec framework that uses CLIP-based shallow-layer attention for semantic-guided bit allocation at the encoder and employs a lightweight adapter with multi-level loss at the decoder to preserve both low-level details and high-level semantic context for MLLMs.
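A minimal sketch of the encoder-side idea, under loudly stated assumptions: CLIP's shallow-layer attention requires a pretrained model, so local variance stands in as the importance signal here, and plain uniform quantization stands in for a learned codec. Only the control flow, an importance map modulating the per-block quantization step, mirrors the described design.

```python
import numpy as np

def importance_map(img, block=8):
    """Stand-in for CLIP shallow-layer attention: local variance as a
    saliency proxy, normalized to [0, 1]."""
    h, w = img.shape
    var = img.reshape(h // block, block, w // block, block).var(axis=(1, 3))
    return var / (var.max() + 1e-8)

def encode(img, imp, base_step=0.5, block=8):
    """Spatially varying quantization: important blocks get a finer step,
    i.e. more bits; unimportant blocks are quantized coarsely."""
    h, w = img.shape
    out = np.empty_like(img)
    for i in range(h // block):
        for j in range(w // block):
            step = base_step * (1.0 - 0.8 * imp[i, j])  # finer where important
            tile = img[i*block:(i+1)*block, j*block:(j+1)*block]
            out[i*block:(i+1)*block, j*block:(j+1)*block] = (
                np.round(tile / step) * step)
    return out

# Smooth synthetic image with spatially varying detail.
rng = np.random.default_rng(0)
img = rng.standard_normal((64, 64)).cumsum(axis=0).cumsum(axis=1)
img /= np.abs(img).max()
imp = importance_map(img)
rec = encode(img, imp)
print("max reconstruction error:", np.abs(rec - img).max())
```

In a real codec the step size would instead scale a learned rate term or quantization parameter; the per-block error here is bounded by half the local step, so high-importance regions are provably reconstructed more tightly.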

Contribution

Hierarchical guidance mechanism for high-resolution and video inputs

The authors develop a hierarchical guidance approach that fuses global and local semantic maps to handle high-resolution images and extends the codec to video MLLMs by applying frame-by-frame semantic guidance, addressing the challenge of maintaining both local precision and global semantic awareness.
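The fusion of global and local guidance can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: local variance again substitutes for CLIP attention, the global view is a simple 2x downsample, and the fusion weight `alpha` is a made-up parameter; only the structure (one whole-frame map blended with per-tile maps, applied frame by frame for video) follows the description.

```python
import numpy as np

def saliency(img, block=8):
    """Stand-in saliency (local variance, normalized to [0, 1]); the paper
    uses CLIP shallow-layer attention, which needs a pretrained model."""
    h, w = img.shape
    v = img.reshape(h // block, block, w // block, block).var(axis=(1, 3))
    return v / (v.max() + 1e-8)

def hierarchical_map(frame, tile=64, block=8, alpha=0.5):
    """Fuse a global map (whole frame, downsampled) with per-tile local maps."""
    h, w = frame.shape
    # Global view: 2x downsample, compute saliency, upsample by repetition.
    ds = frame.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    g = saliency(ds, block)                    # (h//2//block, w//2//block)
    g = np.repeat(np.repeat(g, 2, 0), 2, 1)    # back to (h//block, w//block)
    # Local view: saliency computed independently inside each tile,
    # preserving detail that a global pass would wash out.
    l = np.zeros((h // block, w // block))
    t = tile // block
    for i in range(h // tile):
        for j in range(w // tile):
            crop = frame[i*tile:(i+1)*tile, j*tile:(j+1)*tile]
            l[i*t:(i+1)*t, j*t:(j+1)*t] = saliency(crop, block)
    return alpha * g + (1.0 - alpha) * l

def video_maps(frames, **kw):
    """Frame-by-frame semantic guidance for video inputs."""
    return np.stack([hierarchical_map(f, **kw) for f in frames])

frames = np.random.default_rng(1).standard_normal((2, 128, 128))
maps = video_maps(frames)
print(maps.shape)  # (2, 16, 16): one block-level guidance map per frame
```

The blend keeps local precision (per-tile maps) while the downsampled global pass retains frame-wide semantic context, which is the stated motivation for the hierarchy.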