InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: discrete tokenization, video representation, efficiency, information theory
Abstract:

Accurate and efficient discrete video tokenization is essential for processing long video sequences. Yet the inherent complexity and variable information density of videos present a significant bottleneck for current tokenizers, which rigidly compress all content at a fixed rate, leading to redundancy or information loss. Drawing inspiration from Shannon's information theory, this paper introduces InfoTok, a principled framework for adaptive video tokenization. We rigorously prove that existing data-agnostic training methods are suboptimal in representation length, and present a novel evidence lower bound (ELBO)-based algorithm that approaches theoretical optimality. Leveraging this framework, we develop a transformer-based adaptive compressor that enables adaptive tokenization. Empirical results demonstrate state-of-the-art compression performance, saving 20% of tokens without affecting performance, and achieving 2.3× compression rates while still outperforming prior heuristic adaptive approaches. By allocating tokens according to informational richness, InfoTok enables a more compressed yet accurate tokenization for video representation, offering valuable insights for future research.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 18
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: adaptive video tokenization via information-theoretic compression. The field addresses how to efficiently represent video data by dynamically allocating tokens based on content complexity and information density.

The taxonomy reveals several complementary directions: information-theoretic frameworks that optimize compression through entropy measures and rate-distortion principles; spatiotemporal adaptive methods that reduce tokens specifically for video-language models; neural discrete representation learning approaches that build codebooks for video tokenization; generative methods with adaptive spatial allocation; attention-based token reduction for understanding tasks; learned compression systems that adapt to content statistics; inference-time adaptive schemes that compress on the fly; and application-specific compression tailored to particular downstream tasks. Works like LongVU[1] and VQToken[2] exemplify spatiotemporal and discrete representation approaches respectively, while Ultra-low Bitrate Transformer[3] and PVC[5] demonstrate learned compression strategies that adapt to varying content characteristics. Particularly active themes include the trade-off between compression efficiency and downstream task performance, the challenge of handling temporal redundancy versus spatial detail, and the question of whether to optimize tokenization jointly with task objectives or as a separate preprocessing step.

InfoTok[0] sits within the information-theoretic adaptive compression branch, specifically focusing on entropy-based optimal tokenization. This positions it closely with works that use rate-distortion theory to guide token allocation, contrasting with purely learned approaches like DynTok[6] that adapt tokens through neural architectures without explicit information-theoretic objectives. Compared to application-specific methods such as Video Compression Commander[4] or Human-Centric Video Compression[16], InfoTok[0] appears to pursue a more general compression principle grounded in entropy optimization, aiming for broader applicability across tasks rather than tuning for specific downstream applications.

Claimed Contributions

Theoretical proof of suboptimality in existing tokenizers

The authors provide rigorous theoretical proofs demonstrating that both fixed-compression tokenizers and existing adaptive tokenizers using data-agnostic routers (such as uniform sampling) are suboptimal in expected token length relative to the information-theoretic optimum. They show these methods fail to achieve near-optimal compression rates.
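This claim tracks Shannon's source coding theorem. As a hedged sketch of the style of argument (our reconstruction from the abstract, not the paper's actual proof):

```latex
% For a video source $X$ with entropy $H(X)$, any uniquely decodable
% code has expected length bounded below by the entropy, and the
% optimal code $L^{*}$ essentially attains it:
\[
  \mathbb{E}[L] \ge H(X),
  \qquad
  H(X) \le \mathbb{E}[L^{*}] < H(X) + 1 .
\]
% A fixed-rate tokenizer spends the same budget $L_{\max}$ on every
% input, so
\[
  \mathbb{E}[L_{\text{fixed}}] = L_{\max} \ge H(X),
\]
% with equality only for a (near-)uniform source. Whenever information
% density varies across videos, $H(X) < L_{\max}$ and the fixed scheme
% wastes $L_{\max} - H(X)$ tokens per sequence in expectation; a
% data-agnostic router that ignores the input inherits the same gap.
```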

10 retrieved papers
INFOTOK framework with ELBO-based router and adaptive compressor

The authors introduce INFOTOK, a novel framework for adaptive video tokenization that uses an Evidence Lower Bound (ELBO)-based router to determine token sequence lengths based on video information complexity, combined with a transformer-based adaptive compressor that efficiently compresses embeddings into variable-length token sequences.

0 retrieved papers
Empirical validation of superior token efficiency

The authors conduct comprehensive experiments showing that INFOTOK achieves state-of-the-art compression performance, saving approximately 20% tokens without performance loss and achieving 2.3× better compression rates compared to prior adaptive approaches while maintaining or improving reconstruction quality.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Theoretical proof of suboptimality in existing tokenizers

The authors provide rigorous theoretical proofs demonstrating that both fixed-compression tokenizers and existing adaptive tokenizers using data-agnostic routers (such as uniform sampling) are suboptimal in expected token length relative to the information-theoretic optimum. They show these methods fail to achieve near-optimal compression rates.

Contribution

INFOTOK framework with ELBO-based router and adaptive compressor

The authors introduce INFOTOK, a novel framework for adaptive video tokenization that uses an Evidence Lower Bound (ELBO)-based router to determine token sequence lengths based on video information complexity, combined with a transformer-based adaptive compressor that efficiently compresses embeddings into variable-length token sequences.
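To make the routing idea concrete, here is a minimal sketch of entropy-driven token budgeting. It uses a histogram entropy of pixel intensities as a crude stand-in for the paper's learned ELBO-based complexity score; all function names and budget ranges are illustrative assumptions, not InfoTok's actual implementation.

```python
import numpy as np

def estimate_entropy(frames):
    """Rough per-clip information estimate: Shannon entropy (bits) of the
    pixel-intensity histogram. A stand-in for a learned ELBO-based score."""
    hist, _ = np.histogram(frames, bins=64, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def route_token_budgets(clips, min_tokens=4, max_tokens=64):
    """Allocate a variable token budget per clip, scaled by its estimated
    entropy, so information-rich clips get longer token sequences."""
    scores = np.array([estimate_entropy(c) for c in clips])
    norm = (scores - scores.min()) / max(np.ptp(scores), 1e-8)
    return (min_tokens + norm * (max_tokens - min_tokens)).round().astype(int)

# Low-information clip (static gray frames) vs high-information clip (noise)
rng = np.random.default_rng(0)
flat = np.full((8, 16, 16), 0.5)
busy = rng.random((8, 16, 16))
budgets = route_token_budgets([flat, busy])
print(budgets)  # the busy clip receives a larger token budget than the flat one
```

In the paper's framework, a transformer-based adaptive compressor would then compress each clip's embeddings into a token sequence of the routed length; the heuristic score above merely illustrates why variable budgets follow from variable information density.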

Contribution

Empirical validation of superior token efficiency

The authors conduct comprehensive experiments showing that INFOTOK achieves state-of-the-art compression performance, saving approximately 20% tokens without performance loss and achieving 2.3× better compression rates compared to prior adaptive approaches while maintaining or improving reconstruction quality.