ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Video Understanding, Visual Token Reduction, Multimodal Large Language Models
Abstract:

Multimodal Large Language Models (MLLMs) face significant computational overhead when processing long videos due to the massive number of visual tokens required. To improve efficiency, existing methods primarily reduce redundancy by pruning or merging tokens based on importance or similarity. However, these approaches largely overlook a critical dimension of video content, namely changes and turning points, and they lack joint modeling of spatio-temporal relationships. To address this, we propose a new perspective: similarity identifies redundancy, while difference captures key events. Based on this, we design a training-free framework named ST-SimDiff. We first construct a spatio-temporal graph from the visual tokens to uniformly model their complex associations. We then employ a parallel dual-selection strategy: 1) similarity-based selection uses community detection to retain representative tokens, compressing static information; 2) temporal difference-based selection precisely locates content-changing points to preserve tokens that capture key dynamic shifts. This allows ST-SimDiff to preserve both static and dynamic content with a minimal number of tokens. Extensive experiments show that our method significantly outperforms state-of-the-art approaches while substantially reducing computational costs. Our code is available at https://anonymous.4open.science/r/ST-SimDiff-7225.
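The graph construction the abstract describes can be illustrated with a minimal sketch. Everything concrete here is our own assumption for illustration, not the paper's specification: the (T, N, D) token layout, the use of cosine similarity, the `sim_threshold` cutoff, and the restriction of temporal edges to adjacent frames are all stand-in choices.

```python
import numpy as np

def build_st_graph(tokens, sim_threshold=0.8):
    """Sketch of a spatio-temporal graph over video tokens.

    tokens: array of shape (T, N, D) -- T frames, N tokens per frame,
            D-dimensional features (an assumed layout, for illustration).
    Returns a (T*N, T*N) adjacency matrix with
      - spatial edges between similar tokens within a frame, and
      - temporal edges between similar tokens in adjacent frames.
    """
    T, N, D = tokens.shape
    feats = tokens.reshape(T * N, D)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T  # cosine similarity between all token pairs

    adj = np.zeros_like(sim)
    for t in range(T):
        s = slice(t * N, (t + 1) * N)
        # spatial edges: same-frame token pairs above the threshold
        adj[s, s] = sim[s, s] > sim_threshold
        if t + 1 < T:
            s_next = slice((t + 1) * N, (t + 2) * N)
            # temporal edges: token pairs in adjacent frames
            adj[s, s_next] = sim[s, s_next] > sim_threshold
            adj[s_next, s] = adj[s, s_next].T
    np.fill_diagonal(adj, 0)  # no self-loops
    return adj
```

Note that tokens in non-adjacent frames are never connected under this simplification; the paper's actual graph may define temporal edges differently.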

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes ST-SimDiff, a training-free framework for video token compression in multimodal LLMs that combines similarity-based redundancy removal with temporal difference-based event detection. It resides in the 'Spatiotemporal Redundancy Analysis' leaf of the taxonomy, which contains only three papers total. This leaf sits within the broader 'Token Selection and Importance Criteria' branch, indicating the work focuses on defining selection criteria rather than architectural redesign. The small leaf size suggests this specific combination of spatial and temporal redundancy modeling remains relatively underexplored compared to other compression strategies.

The taxonomy reveals several neighboring research directions. The sibling leaf 'Similarity and Redundancy-Based Merging' addresses token clustering without explicit temporal modeling, while 'Attention and Explainability-Based Selection' uses LLM-derived importance scores rather than content-based redundancy. Adjacent branches include 'Temporal Modeling and Video-Specific Strategies', which encompasses explicit temporal encoders and adaptive frame sampling, and 'Training-Free and Plug-and-Play Methods', which shares the inference-time approach but may not emphasize spatiotemporal structure. ST-SimDiff bridges these areas by applying training-free selection criteria specifically to spatiotemporal redundancy patterns.

Among twenty candidates examined across three contributions, the dual perspective on similarity and difference shows one refutable candidate from ten examined, suggesting some prior work addresses similar conceptual framing. The ST-SimDiff framework itself found no refutations among ten candidates, indicating the specific combination of spatio-temporal graph modeling with dual-selection may be less directly anticipated. The parallel dual-selection strategy was not separately evaluated. These statistics reflect a limited semantic search scope rather than exhaustive coverage, meaning additional related work may exist beyond the top-twenty matches examined.

Given the constrained search scope and the sparse population of the taxonomy leaf, the work appears to occupy a relatively distinct position within spatiotemporal redundancy analysis. The explicit focus on difference-based event detection alongside similarity-based compression differentiates it from purely redundancy-focused methods. However, the limited candidate pool and single refutation suggest careful positioning relative to existing temporal modeling and training-free compression literature would strengthen claims of novelty.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Paper: 1

Research Landscape Overview

Core task: efficient video token compression for multimodal large language models. As video understanding becomes central to multimodal LLMs, the computational burden of processing dense visual tokens has driven a rich landscape of compression strategies. The field organizes around several complementary branches: Token Compression Mechanisms and Architectures develop novel structural designs (e.g., merging layers, quantization schemes) to reduce token counts; Token Selection and Importance Criteria focus on identifying which tokens matter most through attention scores, similarity metrics, or spatiotemporal redundancy analysis; Temporal Modeling and Video-Specific Strategies exploit the unique structure of video data, such as frame-level dependencies and motion patterns; Training Paradigms and Optimization Strategies address how to learn compression policies end-to-end or via auxiliary objectives; Training-Free and Plug-and-Play Methods offer lightweight alternatives that adapt existing models without retraining; and Application-Specific and Domain-Adapted Compression tailors techniques to particular downstream tasks or video domains. Representative works like LongVU[13] and Videollama 3[14] illustrate how temporal redundancy can be leveraged, while methods such as Llava-prumerge[1] and Tokencarve[4] demonstrate diverse architectural choices for token reduction.

A particularly active line of inquiry centers on spatiotemporal redundancy analysis, where methods quantify and exploit the overlap between frames and spatial regions. ST-SimDiff[0] sits squarely in this cluster, emphasizing similarity-based filtering to discard redundant tokens across both space and time. Nearby works like LongVU[13] also target long-form video efficiency through temporal pooling, while Adaretake[31] adapts token selection dynamically based on content variation.
These approaches contrast with training-free methods such as DeCo[7] and Less is more[8], which apply heuristic pruning without model updates, and with architecture-driven solutions like Mavors[5] or Voco-llama[6], which integrate compression directly into the model backbone. The central trade-off across these branches involves balancing compression ratio against semantic fidelity: aggressive pruning risks losing fine-grained details, while conservative strategies may not sufficiently reduce computational cost. ST-SimDiff[0] navigates this by focusing on spatiotemporal similarity metrics, positioning itself as a middle ground that preserves task-relevant information while achieving substantial token reduction, complementing both the temporal-centric designs of LongVU[13] and the adaptive selection logic of Adaretake[31].

Claimed Contributions

Dual perspective on similarity and difference for video token compression

The authors introduce a conceptual framework that treats similarity as a mechanism for compressing redundant static content in videos, while treating difference as essential for capturing key dynamic events and turning points that drive video narratives.

10 retrieved papers
Can Refute
ST-SimDiff framework with spatio-temporal graph modeling

The authors develop a training-free framework that constructs a spatio-temporal graph to uniformly model complex spatial and temporal relationships between video tokens, enabling joint analysis of spatio-temporal correlations that existing methods fail to capture.

10 retrieved papers
Parallel dual-selection strategy for token compression

The authors propose a novel dual token selection strategy that operates in parallel: similarity-based selection applies community detection to compress redundant static content, while difference-based selection identifies temporal turning points to preserve tokens capturing key dynamic shifts.

0 retrieved papers
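To make the claimed parallel strategy concrete, the following is a minimal sketch under our own simplifying assumptions: "community detection" is approximated by greedy clustering on cosine similarity, "turning points" by large frame-to-frame feature changes, and selection is done at the frame level rather than the token level. The paper's actual algorithms, granularity, and thresholds may differ.

```python
import numpy as np

def dual_select(frame_feats, sim_threshold=0.9, diff_quantile=0.8):
    """Parallel dual selection over per-frame features of shape (T, D).

    Branch 1 (similarity): greedily keep one representative per group of
    near-duplicate frames, compressing static content. This stands in for
    community detection on the similarity graph.
    Branch 2 (difference): keep frames whose change from the previous
    frame is large, preserving dynamic turning points.
    Returns the sorted union of selected frame indices.
    """
    feats = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)

    # Branch 1: similarity-based representatives
    reps = []
    for i in range(len(feats)):
        if all(feats[i] @ feats[j] < sim_threshold for j in reps):
            reps.append(i)

    # Branch 2: temporal-difference turning points
    diffs = np.linalg.norm(feats[1:] - feats[:-1], axis=1)
    cut = np.quantile(diffs, diff_quantile)
    turns = [i + 1 for i, d in enumerate(diffs) if d > cut]

    return sorted(set(reps) | set(turns))
```

For example, on a clip where the first five frames are identical and frame 5 cuts to a new scene, branch 1 keeps one frame per static segment while branch 2 independently flags frame 5 as a turning point; the union covers both static and dynamic content.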

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Dual perspective on similarity and difference for video token compression

ST-SimDiff framework with spatio-temporal graph modeling

Parallel dual-selection strategy for token compression