InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: discrete tokenization, video representation, efficiency, information theory
Abstract:

Accurate and efficient discrete video tokenization is essential for processing long video sequences. Yet the inherent complexity and variable information density of videos present a significant bottleneck for current tokenizers, which rigidly compress all content at a fixed rate, leading to redundancy or information loss. Drawing inspiration from Shannon's information theory, this paper introduces InfoTok, a principled framework for adaptive video tokenization. We rigorously prove that existing data-agnostic training methods are suboptimal in representation length, and present a novel evidence lower bound (ELBO)-based algorithm that approaches theoretical optimality. Leveraging this framework, we develop a transformer-based adaptive compressor that enables adaptive tokenization. Empirical results demonstrate state-of-the-art compression performance, saving 20% of tokens without affecting performance, and achieving 2.3× compression rates while still outperforming prior heuristic adaptive approaches. By allocating tokens according to informational richness, InfoTok enables a more compressed yet accurate tokenization for video representation, offering valuable insights for future research.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 18
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: adaptive video tokenization via information-theoretic compression. The field addresses how to efficiently represent video data by dynamically allocating tokens based on content complexity and information density.

The taxonomy reveals several complementary directions: information-theoretic frameworks that optimize compression through entropy measures and rate-distortion principles; spatiotemporal adaptive methods that reduce tokens specifically for video-language models; neural discrete representation learning approaches that build codebooks for video tokenization; generative methods with adaptive spatial allocation; attention-based token reduction for understanding tasks; learned compression systems that adapt to content statistics; inference-time adaptive schemes that compress on the fly; and application-specific compression tailored to particular downstream tasks. Works like LongVU[1] and VQToken[2] exemplify spatiotemporal and discrete representation approaches respectively, while Ultra-low Bitrate Transformer[3] and PVC[5] demonstrate learned compression strategies that adapt to varying content characteristics. Particularly active themes include the trade-off between compression efficiency and downstream task performance, the challenge of handling temporal redundancy versus spatial detail, and the question of whether to optimize tokenization jointly with task objectives or as a separate preprocessing step.

InfoTok[0] sits within the information-theoretic adaptive compression branch, specifically focusing on entropy-based optimal tokenization. This positions it closely with works that use rate-distortion theory to guide token allocation, contrasting with purely learned approaches like DynTok[6] that adapt tokens through neural architectures without explicit information-theoretic objectives. Compared to application-specific methods such as Video Compression Commander[4] or Human-Centric Video Compression[16], InfoTok[0] appears to pursue a more general compression principle grounded in entropy optimization, aiming for broader applicability across tasks rather than tuning for specific downstream applications.

Claimed Contributions

Theoretical proof of suboptimality in existing tokenizers

The authors provide rigorous theoretical proofs demonstrating that both fixed-compression tokenizers and existing adaptive tokenizers using data-agnostic routers (such as uniform sampling) are suboptimal in expected token length relative to the information-theoretic optimum. They show these methods fail to achieve near-optimal compression rates.
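This claim tracks Shannon's source coding theorem. As a hedged sketch of the style of argument (our reconstruction from the abstract, not the paper's actual proof):

```latex
% For a video source $X$ with entropy $H(X)$, any uniquely decodable
% code has expected length bounded below by the entropy, and the
% optimal code $L^{*}$ essentially attains it:
\[
  \mathbb{E}[L] \ge H(X),
  \qquad
  H(X) \le \mathbb{E}[L^{*}] < H(X) + 1 .
\]
% A fixed-rate tokenizer spends the same budget $L_{\max}$ on every
% input, so
\[
  \mathbb{E}[L_{\text{fixed}}] = L_{\max} \ge H(X),
\]
% with equality only for a (near-)uniform source. Whenever information
% density varies across videos, $H(X) < L_{\max}$ and the fixed scheme
% wastes $L_{\max} - H(X)$ tokens per sequence in expectation; a
% data-agnostic router that ignores the input inherits the same gap.
```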

10 retrieved papers
INFOTOK framework with ELBO-based router and adaptive compressor

The authors introduce INFOTOK, a novel framework for adaptive video tokenization that uses an Evidence Lower Bound (ELBO)-based router to determine token sequence lengths based on video information complexity, combined with a transformer-based adaptive compressor that efficiently compresses embeddings into variable-length token sequences.

0 retrieved papers
Empirical validation of superior token efficiency

The authors conduct comprehensive experiments showing that INFOTOK achieves state-of-the-art compression performance, saving approximately 20% tokens without performance loss and achieving 2.3× better compression rates compared to prior adaptive approaches while maintaining or improving reconstruction quality.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Theoretical proof of suboptimality in existing tokenizers

The authors provide rigorous theoretical proofs demonstrating that both fixed-compression tokenizers and existing adaptive tokenizers using data-agnostic routers (such as uniform sampling) are suboptimal in expected token length relative to the information-theoretic optimum. They show these methods fail to achieve near-optimal compression rates.

Contribution

INFOTOK framework with ELBO-based router and adaptive compressor

The authors introduce INFOTOK, a novel framework for adaptive video tokenization that uses an Evidence Lower Bound (ELBO)-based router to determine token sequence lengths based on video information complexity, combined with a transformer-based adaptive compressor that efficiently compresses embeddings into variable-length token sequences.
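To make the routing idea concrete, here is a minimal sketch of entropy-driven token budgeting. It uses a histogram entropy of pixel intensities as a crude stand-in for the paper's learned ELBO-based complexity score; all function names and budget ranges are illustrative assumptions, not InfoTok's actual implementation.

```python
import numpy as np

def estimate_entropy(frames):
    """Rough per-clip information estimate: Shannon entropy (bits) of the
    pixel-intensity histogram. A stand-in for a learned ELBO-based score."""
    hist, _ = np.histogram(frames, bins=64, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def route_token_budgets(clips, min_tokens=4, max_tokens=64):
    """Allocate a variable token budget per clip, scaled by its estimated
    entropy, so information-rich clips get longer token sequences."""
    scores = np.array([estimate_entropy(c) for c in clips])
    norm = (scores - scores.min()) / max(np.ptp(scores), 1e-8)
    return (min_tokens + norm * (max_tokens - min_tokens)).round().astype(int)

# Low-information clip (static gray frames) vs high-information clip (noise)
rng = np.random.default_rng(0)
flat = np.full((8, 16, 16), 0.5)
busy = rng.random((8, 16, 16))
budgets = route_token_budgets([flat, busy])
print(budgets)  # the busy clip receives a larger token budget than the flat one
```

In the paper's framework, a transformer-based adaptive compressor would then compress each clip's embeddings into a token sequence of the routed length; the heuristic score above merely illustrates why variable budgets follow from variable information density.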

Contribution

Empirical validation of superior token efficiency

The authors conduct comprehensive experiments showing that INFOTOK achieves state-of-the-art compression performance, saving approximately 20% tokens without performance loss and achieving 2.3× better compression rates compared to prior adaptive approaches while maintaining or improving reconstruction quality.