ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Video Understanding, Visual Token Reduction, Multimodal Large Language Models
Abstract:

Multimodal Large Language Models (MLLMs) face significant computational overhead when processing long videos due to the massive number of visual tokens required. To improve efficiency, existing methods primarily reduce redundancy by pruning or merging tokens based on importance or similarity. However, these approaches largely overlook a critical dimension of video content, namely changes and turning points, and they lack joint modeling of spatio-temporal relationships. To address this, we propose a new perspective: similarity identifies redundancy, while difference captures key events. Based on this, we design a training-free framework named ST-SimDiff. We first construct a spatio-temporal graph from the visual tokens to uniformly model their complex associations. We then employ a parallel dual-selection strategy: 1) similarity-based selection uses community detection to retain representative tokens, compressing static information; 2) temporal difference-based selection precisely locates content-changing points to preserve tokens that capture key dynamic shifts. This allows ST-SimDiff to preserve both static and dynamic content with a minimal number of tokens. Extensive experiments show that our method significantly outperforms state-of-the-art approaches while substantially reducing computational costs. Our code is available at https://anonymous.4open.science/r/ST-SimDiff-7225.
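The graph construction the abstract describes can be illustrated with a minimal sketch. Everything concrete here is our own assumption for illustration, not the paper's specification: the (T, N, D) token layout, the use of cosine similarity, the `sim_threshold` cutoff, and the restriction of temporal edges to adjacent frames are all stand-in choices.

```python
import numpy as np

def build_st_graph(tokens, sim_threshold=0.8):
    """Sketch of a spatio-temporal graph over video tokens.

    tokens: array of shape (T, N, D) -- T frames, N tokens per frame,
            D-dimensional features (an assumed layout, for illustration).
    Returns a (T*N, T*N) adjacency matrix with
      - spatial edges between similar tokens within a frame, and
      - temporal edges between similar tokens in adjacent frames.
    """
    T, N, D = tokens.shape
    feats = tokens.reshape(T * N, D)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T  # cosine similarity between all token pairs

    adj = np.zeros_like(sim)
    for t in range(T):
        s = slice(t * N, (t + 1) * N)
        # spatial edges: same-frame token pairs above the threshold
        adj[s, s] = sim[s, s] > sim_threshold
        if t + 1 < T:
            s_next = slice((t + 1) * N, (t + 2) * N)
            # temporal edges: token pairs in adjacent frames
            adj[s, s_next] = sim[s, s_next] > sim_threshold
            adj[s_next, s] = adj[s, s_next].T
    np.fill_diagonal(adj, 0)  # no self-loops
    return adj
```

Note that tokens in non-adjacent frames are never connected under this simplification; the paper's actual graph may define temporal edges differently.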

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes ST-SimDiff, a training-free framework for video token compression in multimodal LLMs that combines similarity-based redundancy removal with temporal difference-based event detection. It resides in the 'Spatiotemporal Redundancy Analysis' leaf of the taxonomy, which contains only three papers total. This leaf sits within the broader 'Token Selection and Importance Criteria' branch, indicating the work focuses on defining selection criteria rather than architectural redesign. The small leaf size suggests this specific combination of spatial and temporal redundancy modeling remains relatively underexplored compared to other compression strategies.

The taxonomy reveals several neighboring research directions. The sibling leaf 'Similarity and Redundancy-Based Merging' addresses token clustering without explicit temporal modeling, while 'Attention and Explainability-Based Selection' uses LLM-derived importance scores rather than content-based redundancy. Adjacent branches include 'Temporal Modeling and Video-Specific Strategies', which encompasses explicit temporal encoders and adaptive frame sampling, and 'Training-Free and Plug-and-Play Methods', which shares the inference-time approach but may not emphasize spatiotemporal structure. ST-SimDiff bridges these areas by applying training-free selection criteria specifically to spatiotemporal redundancy patterns.

Among twenty candidates examined across three contributions, the dual perspective on similarity and difference shows one refutable candidate from ten examined, suggesting some prior work addresses similar conceptual framing. The ST-SimDiff framework itself found no refutations among ten candidates, indicating the specific combination of spatio-temporal graph modeling with dual-selection may be less directly anticipated. The parallel dual-selection strategy was not separately evaluated. These statistics reflect a limited semantic search scope rather than exhaustive coverage, meaning additional related work may exist beyond the top-twenty matches examined.

Given the constrained search scope and the sparse population of the taxonomy leaf, the work appears to occupy a relatively distinct position within spatiotemporal redundancy analysis. The explicit focus on difference-based event detection alongside similarity-based compression differentiates it from purely redundancy-focused methods. However, the limited candidate pool and single refutation suggest careful positioning relative to existing temporal modeling and training-free compression literature would strengthen claims of novelty.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Paper: 1

Research Landscape Overview

Core task: efficient video token compression for multimodal large language models. As video understanding becomes central to multimodal LLMs, the computational burden of processing dense visual tokens has driven a rich landscape of compression strategies. The field organizes around several complementary branches: Token Compression Mechanisms and Architectures develop novel structural designs (e.g., merging layers, quantization schemes) to reduce token counts; Token Selection and Importance Criteria focus on identifying which tokens matter most through attention scores, similarity metrics, or spatiotemporal redundancy analysis; Temporal Modeling and Video-Specific Strategies exploit the unique structure of video data, such as frame-level dependencies and motion patterns; Training Paradigms and Optimization Strategies address how to learn compression policies end-to-end or via auxiliary objectives; Training-Free and Plug-and-Play Methods offer lightweight alternatives that adapt existing models without retraining; and Application-Specific and Domain-Adapted Compression tailors techniques to particular downstream tasks or video domains. Representative works like LongVU[13] and Videollama 3[14] illustrate how temporal redundancy can be leveraged, while methods such as Llava-prumerge[1] and Tokencarve[4] demonstrate diverse architectural choices for token reduction.

A particularly active line of inquiry centers on spatiotemporal redundancy analysis, where methods quantify and exploit the overlap between frames and spatial regions. ST-SimDiff[0] sits squarely in this cluster, emphasizing similarity-based filtering to discard redundant tokens across both space and time. Nearby works like LongVU[13] also target long-form video efficiency through temporal pooling, while Adaretake[31] adapts token selection dynamically based on content variation.
These approaches contrast with training-free methods such as DeCo[7] and Less is more[8], which apply heuristic pruning without model updates, and with architecture-driven solutions like Mavors[5] or Voco-llama[6], which integrate compression directly into the model backbone. The central trade-off across these branches involves balancing compression ratio against semantic fidelity: aggressive pruning risks losing fine-grained details, while conservative strategies may not sufficiently reduce computational cost. ST-SimDiff[0] navigates this by focusing on spatiotemporal similarity metrics, positioning itself as a middle ground that preserves task-relevant information while achieving substantial token reduction, complementing both the temporal-centric designs of LongVU[13] and the adaptive selection logic of Adaretake[31].

Claimed Contributions

Dual perspective on similarity and difference for video token compression

The authors introduce a conceptual framework that treats similarity as a mechanism for compressing redundant static content in videos, while treating difference as essential for capturing key dynamic events and turning points that drive video narratives.

10 retrieved papers
Can Refute
ST-SimDiff framework with spatio-temporal graph modeling

The authors develop a training-free framework that constructs a spatio-temporal graph to uniformly model complex spatial and temporal relationships between video tokens, enabling joint analysis of spatio-temporal correlations that existing methods fail to capture.

10 retrieved papers
Parallel dual-selection strategy for token compression

The authors propose a novel dual token selection strategy that operates in parallel: similarity-based selection applies community detection to compress redundant static content, while difference-based selection identifies temporal turning points to preserve tokens capturing key dynamic shifts.

0 retrieved papers
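To make the claimed parallel strategy concrete, the following is a minimal sketch under our own simplifying assumptions: "community detection" is approximated by greedy clustering on cosine similarity, "turning points" by large frame-to-frame feature changes, and selection is done at the frame level rather than the token level. The paper's actual algorithms, granularity, and thresholds may differ.

```python
import numpy as np

def dual_select(frame_feats, sim_threshold=0.9, diff_quantile=0.8):
    """Parallel dual selection over per-frame features of shape (T, D).

    Branch 1 (similarity): greedily keep one representative per group of
    near-duplicate frames, compressing static content. This stands in for
    community detection on the similarity graph.
    Branch 2 (difference): keep frames whose change from the previous
    frame is large, preserving dynamic turning points.
    Returns the sorted union of selected frame indices.
    """
    feats = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)

    # Branch 1: similarity-based representatives
    reps = []
    for i in range(len(feats)):
        if all(feats[i] @ feats[j] < sim_threshold for j in reps):
            reps.append(i)

    # Branch 2: temporal-difference turning points
    diffs = np.linalg.norm(feats[1:] - feats[:-1], axis=1)
    cut = np.quantile(diffs, diff_quantile)
    turns = [i + 1 for i, d in enumerate(diffs) if d > cut]

    return sorted(set(reps) | set(turns))
```

For example, on a clip where the first five frames are identical and frame 5 cuts to a new scene, branch 1 keeps one frame per static segment while branch 2 independently flags frame 5 as a turning point; the union covers both static and dynamic content.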

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Dual perspective on similarity and difference for video token compression

ST-SimDiff framework with spatio-temporal graph modeling

Parallel dual-selection strategy for token compression