AToken: A Unified Tokenizer for Vision

ICLR 2026 Conference Withdrawn Submission
Jiasen Lu, Liangchen Song, Mingze Xu, Byeongjoo Ahn, Yanjun Wang, Chen Chen, Afshin Dehghan, Yinfei Yang
Keywords: Tokenizer, Omni model
Abstract:

We present AToken, the first unified visual tokenizer to achieve both high-fidelity reconstruction and semantic understanding across images, videos, and 3D assets. Unlike existing tokenizers that specialize in either reconstruction or understanding for a single modality, AToken encodes these diverse visual inputs into a shared 4D latent space without requiring separate model designs per modality. Specifically, we introduce a pure transformer architecture with 4D rotary position embeddings to process visual inputs of arbitrary resolution and temporal duration. To ensure stable training, we introduce an adversarial-free training objective that combines perceptual and Gram matrix losses, achieving state-of-the-art reconstruction quality. Through a progressive training curriculum, AToken gradually expands from single images to videos and 3D assets, and supports both continuous and discrete latent tokens. AToken achieves 0.21 rFID with 82.2% ImageNet accuracy for images, 3.01 rFVD with 32.6% MSRVTT retrieval for videos, and 28.19 PSNR with 90.9% classification accuracy for 3D. In downstream applications, AToken enables both visual generation tasks (e.g., image generation with continuous and discrete tokens, text-to-video generation, image-to-3D synthesis) and understanding tasks (e.g., multimodal LLMs), achieving 1.44/2.23 gFID on ImageNet for continuous/discrete tokens, 48.7% on MMMU, and 64.5% on VideoMME. These results point toward next-generation multimodal AI systems built upon unified visual tokenization.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

AToken proposes a unified visual tokenizer that encodes images, videos, and 3D assets into a shared 4D latent space, targeting both high-fidelity reconstruction and semantic understanding. The paper resides in the 'Shared Latent Space Tokenizers' leaf, which contains only three papers including AToken itself. This leaf sits within the broader 'Unified Multimodal Tokenization Architectures' branch, indicating a relatively sparse but emerging research direction. The small sibling count suggests that truly unified tokenizers handling all three modalities (images, videos, 3D) within a single architecture remain uncommon, positioning AToken in a less crowded area of the field.

The taxonomy reveals that most related work either specializes in single modalities or adopts modality-specific preprocessing before unification. The 'Video-Specific Tokenization Methods' branch contains numerous papers focused solely on temporal compression and reconstruction, while '3D Scene Tokenization and Understanding' addresses point clouds and volumetric data separately. Neighboring leaves like 'Frozen Encoder Multimodal Frameworks' and 'Heterogeneous Signal Tokenization' pursue cross-modal alignment through different architectural strategies—frozen pretrained encoders or discrete token conversion for LLMs—rather than AToken's approach of learning a shared continuous latent space from scratch across all three visual domains.

Among the 29 candidates examined, the contribution-level analysis reveals mixed novelty signals. For the core unified tokenizer concept (Contribution A), 9 candidates were examined with no clear refutations, suggesting limited direct prior work on this specific three-modality unification. The 4D rotary position embeddings (Contribution B) likewise drew no refutations across 10 candidates, indicating architectural novelty. However, the adversarial-free training objective with Gram matrix loss (Contribution C) encountered 1 refutable candidate among 10 examined, pointing to some overlap with existing reconstruction training strategies. The limited search scope means these findings reflect the top-30 semantic matches rather than exhaustive coverage.

Given the sparse taxonomy leaf and limited refutations across most contributions, AToken appears to occupy a relatively novel position within the examined literature. The main uncertainty concerns whether the adversarial-free training approach represents a significant departure from prior reconstruction methods, and whether the 30-candidate search captured all relevant unified tokenization work. The analysis suggests incremental innovation in training objectives but potentially stronger novelty in the architectural unification of three visual modalities within a single learned latent space.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 1

Research Landscape Overview

Core task: unified visual tokenization across images, videos, and 3D. The field has evolved to address the challenge of representing diverse visual modalities (static images, temporal video sequences, and spatial 3D structures) within a common tokenization framework that can interface with large language models and generative systems. The taxonomy reveals several complementary directions:

- Unified Multimodal Tokenization Architectures pursue shared latent spaces that handle multiple modalities through common encoders or cross-modal alignment (e.g., OmniTokenizer[15], Show-o2[13]).
- Video-Specific Tokenization Methods focus on temporal compression and causal modeling tailored to video data (VidTok[18], Causal Video Tokenization[23]).
- 3D Scene Tokenization and Understanding develops representations for point clouds, meshes, and volumetric data (LLaVA-3D[16], Gaussian Splatting Tokenization[30]).
- Token Reduction and Efficiency branches explore adaptive pruning and sparsification to manage computational costs in multimodal LLMs (ElasticTok[41], AdaToken-3D[32]).
- Multimodal Integration and Application-Driven Tokenization emphasizes end-to-end systems for tasks like visual question answering and content synthesis (LLaVA-NeXT-Interleave[1], I2-world[3]).
- Specialized Tokenization for Downstream Tasks targets domain-specific needs such as action recognition or medical imaging (Semantic Action Tokenization[38], OmniV-Med[20]).

Within the Unified Multimodal Tokenization Architectures branch, a central theme is whether to learn a single shared encoder or to maintain modality-specific pathways that converge in a common latent space. AToken[0] exemplifies the shared latent space approach, aiming to unify image, video, and 3D representations through a cohesive tokenization strategy that balances expressiveness across modalities. This contrasts with works like Meta-Transformer[4], which employs modality-specific preprocessing before a unified transformer backbone, and Harmonizer[2], which focuses on aligning heterogeneous token distributions post-encoding. Compared to neighbors such as Show-o2[13], which integrates discrete codebook learning for joint image-text generation, and OmniTokenizer[15], which emphasizes cross-modal retrieval and alignment, AToken[0] prioritizes a more tightly coupled latent space that directly supports reasoning and generation across all three visual domains. The trade-offs revolve around reconstruction fidelity, computational overhead, and the ease of extending tokenization to new modalities or downstream tasks, with open questions about optimal codebook sizes, temporal modeling granularity, and the role of 3D geometric priors in unified frameworks.

Claimed Contributions

AToken: unified visual tokenizer for images, videos, and 3D

The authors introduce AToken, a single tokenizer that unifies reconstruction and understanding tasks across three visual modalities (images, videos, 3D assets) by encoding them into a shared 4D latent space, supporting both continuous and discrete token representations.

9 retrieved papers
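The unification claim can be made concrete with a toy sketch of how a single (t, x, y, z) coordinate scheme could cover all three modalities, with unused axes collapsed to a single index. Every name and shape below is a hypothetical illustration, not AToken's actual interface.

```python
# Hypothetical sketch: images, videos, and 3D assets addressed by one
# shared 4D coordinate system (t, x, y, z). An image is a single frame
# with no depth; a video activates the temporal axis; a 3D asset
# activates the depth axis at a single time step.

def coords_4d(modality, T=1, H=2, W=2, D=1):
    """Enumerate (t, x, y, z) token positions for a small patch grid."""
    if modality == "image":      # single frame, flat in z
        T, D = 1, 1
    elif modality == "video":    # temporal axis active, flat in z
        D = 1
    elif modality == "3d":       # depth axis active, single time step
        T = 1
    return [(t, x, y, z)
            for t in range(T) for x in range(H)
            for y in range(W) for z in range(D)]

image_tokens = coords_4d("image", H=2, W=2)        # 4 positions
video_tokens = coords_4d("video", T=2, H=2, W=2)   # 8 positions
asset_tokens = coords_4d("3d", H=2, W=2, D=2)      # 8 positions
```

The point of the sketch is that a single downstream model never needs to know which modality produced a token; only the 4D position differs.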
Pure transformer architecture with 4D rotary position embeddings

The authors propose a transformer-based encoder-decoder architecture that extends 2D image processing to a unified 4D space using rotary position embeddings, enabling native handling of arbitrary resolutions and temporal lengths across all modalities.

10 retrieved papers
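To illustrate the idea behind this contribution, here is a minimal sketch of rotary position embeddings extended to four axes: the head dimension is split into four groups, and each group is rotated by angles derived from one coordinate of (t, x, y, z). The dimension split and frequency schedule are assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

def rope_4d(q, pos, base=10000.0):
    """Apply a 4D rotary embedding to vector q (dim divisible by 8).

    pos is a (t, x, y, z) tuple; each axis rotates one quarter of the
    dimensions, so relative position along every axis is encoded.
    """
    d = q.shape[-1]
    group = d // 4                        # dims allotted to each axis
    out = np.empty_like(q)
    for a, p in enumerate(pos):           # one rotary block per axis
        seg = q[a * group:(a + 1) * group]
        half = group // 2
        freqs = base ** (-np.arange(half) / half)
        ang = p * freqs
        cos, sin = np.cos(ang), np.sin(ang)
        x1, x2 = seg[:half], seg[half:]
        out[a * group:(a + 1) * group] = np.concatenate(
            [x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return out

q = np.random.randn(32)
q_rot = rope_4d(q, pos=(1, 2, 3, 0))
# Rotations preserve the vector norm; position (0, 0, 0, 0) is the identity.
```

Because each axis gets its own rotation block, an image (t and z fixed) and a video (t varying) can share the same embedding without special cases.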
Adversarial-free training objective with Gram matrix loss

The authors develop a stable training approach that replaces adversarial training with a combination of perceptual and Gram matrix losses, directly optimizing second-order statistics to achieve high-fidelity reconstruction without GAN instabilities.

10 retrieved papers
Can Refute
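This contribution can be sketched as follows: a Gram matrix loss matches second-order statistics of feature maps between reconstruction and target, serving as an adversarial-free surrogate for a GAN discriminator. The feature shapes, normalization, and how the term is weighted against the perceptual loss are illustrative assumptions here, not the paper's exact recipe.

```python
import numpy as np

def gram(feats):
    """Gram matrix of a (C, H, W) feature map, normalized by C*H*W."""
    C, H, W = feats.shape
    F = feats.reshape(C, H * W)
    return F @ F.T / (C * H * W)

def gram_loss(feats_recon, feats_target):
    """Squared Frobenius distance between the two Gram matrices."""
    diff = gram(feats_recon) - gram(feats_target)
    return float((diff ** 2).sum())

rng = np.random.default_rng(0)
f = rng.standard_normal((8, 4, 4))      # stand-in for extracted features
assert gram_loss(f, f) == 0.0           # identical features give zero loss
loss = gram_loss(f, rng.standard_normal((8, 4, 4)))
```

In practice the feature maps would come from a frozen perceptual network rather than raw pixels; the appeal is that the objective is a fixed, differentiable statistic, avoiding the min-max instabilities of adversarial training.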

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

AToken: unified visual tokenizer for images, videos, and 3D

The authors introduce AToken, a single tokenizer that unifies reconstruction and understanding tasks across three visual modalities (images, videos, 3D assets) by encoding them into a shared 4D latent space, supporting both continuous and discrete token representations.

Contribution

Pure transformer architecture with 4D rotary position embeddings

The authors propose a transformer-based encoder-decoder architecture that extends 2D image processing to a unified 4D space using rotary position embeddings, enabling native handling of arbitrary resolutions and temporal lengths across all modalities.

Contribution

Adversarial-free training objective with Gram matrix loss

The authors develop a stable training approach that replaces adversarial training with a combination of perceptual and Gram matrix losses, directly optimizing second-order statistics to achieve high-fidelity reconstruction without GAN instabilities.