AToken: A Unified Tokenizer for Vision

ICLR 2026 Conference Withdrawn Submission
Jiasen Lu, Liangchen Song, Mingze Xu, Byeongjoo Ahn, Yanjun Wang, Chen Chen, Afshin Dehghan, Yinfei Yang
Keywords: Tokenizer, Omni model
Abstract:

We present AToken, the first unified visual tokenizer to achieve both high-fidelity reconstruction and semantic understanding across images, videos, and 3D assets. Unlike existing tokenizers that specialize in either reconstruction or understanding for a single modality, AToken encodes these diverse visual inputs into a shared 4D latent space without requiring separate model designs per modality. Specifically, we introduce a pure transformer architecture with 4D rotary position embeddings to process visual inputs of arbitrary resolution and temporal duration. To ensure stable training, we introduce an adversarial-free training objective that combines perceptual and Gram matrix losses, achieving state-of-the-art reconstruction quality. Through a progressive training curriculum, AToken gradually expands from single images to videos and 3D assets, and supports both continuous and discrete latent tokens. AToken achieves 0.21 rFID with 82.2% ImageNet accuracy for images, 3.01 rFVD with 32.6% MSRVTT retrieval for videos, and 28.19 PSNR with 90.9% classification accuracy for 3D. In downstream applications, AToken enables both visual generation tasks (e.g., image generation with continuous and discrete tokens, text-to-video generation, image-to-3D synthesis) and understanding tasks (e.g., multimodal LLMs), achieving 1.44/2.23 gFID on ImageNet for continuous/discrete tokens, 48.7% on MMMU, and 64.5% on VideoMME. These results point toward next-generation multimodal AI systems built upon unified visual tokenization.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

AToken proposes a unified visual tokenizer that encodes images, videos, and 3D assets into a shared 4D latent space, targeting both high-fidelity reconstruction and semantic understanding. The paper resides in the 'Shared Latent Space Tokenizers' leaf, which contains only three papers including AToken itself. This leaf sits within the broader 'Unified Multimodal Tokenization Architectures' branch, indicating a relatively sparse but emerging research direction. The small sibling count suggests that truly unified tokenizers handling all three modalities (images, videos, 3D) within a single architecture remain uncommon, positioning AToken in a less crowded area of the field.

The taxonomy reveals that most related work either specializes in single modalities or adopts modality-specific preprocessing before unification. The 'Video-Specific Tokenization Methods' branch contains numerous papers focused solely on temporal compression and reconstruction, while '3D Scene Tokenization and Understanding' addresses point clouds and volumetric data separately. Neighboring leaves like 'Frozen Encoder Multimodal Frameworks' and 'Heterogeneous Signal Tokenization' pursue cross-modal alignment through different architectural strategies—frozen pretrained encoders or discrete token conversion for LLMs—rather than AToken's approach of learning a shared continuous latent space from scratch across all three visual domains.

Among the 29 candidates examined, the contribution-level analysis reveals mixed novelty signals. For the core unified tokenizer concept (Contribution A), 9 candidates were examined with no clear refutations, suggesting limited direct prior work on this specific three-modality unification. The 4D rotary position embeddings (Contribution B) likewise drew no refutations across 10 candidates, indicating architectural novelty. However, the adversarial-free training objective with Gram matrix loss (Contribution C) encountered 1 refutable candidate among 10 examined, pointing to some overlap with existing reconstruction training strategies. The limited search scope means these findings reflect the top-30 semantic matches rather than exhaustive coverage.

Given the sparse taxonomy leaf and limited refutations across most contributions, AToken appears to occupy a relatively novel position within the examined literature. The main uncertainty concerns whether the adversarial-free training approach represents a significant departure from prior reconstruction methods, and whether the 30-candidate search captured all relevant unified tokenization work. The analysis suggests incremental innovation in training objectives but potentially stronger novelty in the architectural unification of three visual modalities within a single learned latent space.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 1

Research Landscape Overview

Core task: unified visual tokenization across images, videos, and 3D. The field has evolved to address the challenge of representing diverse visual modalities (static images, temporal video sequences, and spatial 3D structures) within a common tokenization framework that can interface with large language models and generative systems. The taxonomy reveals several complementary directions:

- Unified Multimodal Tokenization Architectures pursue shared latent spaces that handle multiple modalities through common encoders or cross-modal alignment (e.g., OmniTokenizer[15], Show-o2[13]).
- Video-Specific Tokenization Methods focus on temporal compression and causal modeling tailored to video data (VidTok[18], Causal Video Tokenization[23]).
- 3D Scene Tokenization and Understanding develops representations for point clouds, meshes, and volumetric data (LLaVA-3D[16], Gaussian Splatting Tokenization[30]).
- Token Reduction and Efficiency branches explore adaptive pruning and sparsification to manage computational costs in multimodal LLMs (ElasticTok[41], AdaToken-3D[32]).
- Multimodal Integration and Application-Driven Tokenization emphasizes end-to-end systems for tasks like visual question answering and content synthesis (LLaVA-NeXT-Interleave[1], I2-world[3]).
- Specialized Tokenization for Downstream Tasks targets domain-specific needs such as action recognition or medical imaging (Semantic Action Tokenization[38], OmniV-Med[20]).

Within the Unified Multimodal Tokenization Architectures branch, a central theme is whether to learn a single shared encoder or to maintain modality-specific pathways that converge in a common latent space. AToken[0] exemplifies the shared latent space approach, aiming to unify image, video, and 3D representations through a cohesive tokenization strategy that balances expressiveness across modalities. This contrasts with works like Meta-Transformer[4], which employs modality-specific preprocessing before a unified transformer backbone, and Harmonizer[2], which focuses on aligning heterogeneous token distributions post-encoding. Compared to neighbors such as Show-o2[13], which integrates discrete codebook learning for joint image-text generation, and OmniTokenizer[15], which emphasizes cross-modal retrieval and alignment, AToken[0] prioritizes a more tightly coupled latent space that directly supports reasoning and generation across all three visual domains. The trade-offs revolve around reconstruction fidelity, computational overhead, and the ease of extending tokenization to new modalities or downstream tasks, with open questions about optimal codebook sizes, temporal modeling granularity, and the role of 3D geometric priors in unified frameworks.

Claimed Contributions

AToken: unified visual tokenizer for images, videos, and 3D

The authors introduce AToken, a single tokenizer that unifies reconstruction and understanding tasks across three visual modalities (images, videos, 3D assets) by encoding them into a shared 4D latent space, supporting both continuous and discrete token representations.

9 retrieved papers
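The unification claim can be made concrete with a toy sketch of how a single (t, x, y, z) coordinate scheme could cover all three modalities, with unused axes collapsed to a single index. Every name and shape below is a hypothetical illustration, not AToken's actual interface.

```python
# Hypothetical sketch: images, videos, and 3D assets addressed by one
# shared 4D coordinate system (t, x, y, z). An image is a single frame
# with no depth; a video activates the temporal axis; a 3D asset
# activates the depth axis at a single time step.

def coords_4d(modality, T=1, H=2, W=2, D=1):
    """Enumerate (t, x, y, z) token positions for a small patch grid."""
    if modality == "image":      # single frame, flat in z
        T, D = 1, 1
    elif modality == "video":    # temporal axis active, flat in z
        D = 1
    elif modality == "3d":       # depth axis active, single time step
        T = 1
    return [(t, x, y, z)
            for t in range(T) for x in range(H)
            for y in range(W) for z in range(D)]

image_tokens = coords_4d("image", H=2, W=2)        # 4 positions
video_tokens = coords_4d("video", T=2, H=2, W=2)   # 8 positions
asset_tokens = coords_4d("3d", H=2, W=2, D=2)      # 8 positions
```

The point of the sketch is that a single downstream model never needs to know which modality produced a token; only the 4D position differs.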
Pure transformer architecture with 4D rotary position embeddings

The authors propose a transformer-based encoder-decoder architecture that extends 2D image processing to a unified 4D space using rotary position embeddings, enabling native handling of arbitrary resolutions and temporal lengths across all modalities.

10 retrieved papers
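To illustrate the idea behind this contribution, here is a minimal sketch of rotary position embeddings extended to four axes: the head dimension is split into four groups, and each group is rotated by angles derived from one coordinate of (t, x, y, z). The dimension split and frequency schedule are assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

def rope_4d(q, pos, base=10000.0):
    """Apply a 4D rotary embedding to vector q (dim divisible by 8).

    pos is a (t, x, y, z) tuple; each axis rotates one quarter of the
    dimensions, so relative position along every axis is encoded.
    """
    d = q.shape[-1]
    group = d // 4                        # dims allotted to each axis
    out = np.empty_like(q)
    for a, p in enumerate(pos):           # one rotary block per axis
        seg = q[a * group:(a + 1) * group]
        half = group // 2
        freqs = base ** (-np.arange(half) / half)
        ang = p * freqs
        cos, sin = np.cos(ang), np.sin(ang)
        x1, x2 = seg[:half], seg[half:]
        out[a * group:(a + 1) * group] = np.concatenate(
            [x1 * cos - x2 * sin, x1 * sin + x2 * cos])
    return out

q = np.random.randn(32)
q_rot = rope_4d(q, pos=(1, 2, 3, 0))
# Rotations preserve the vector norm; position (0, 0, 0, 0) is the identity.
```

Because each axis gets its own rotation block, an image (t and z fixed) and a video (t varying) can share the same embedding without special cases.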
Adversarial-free training objective with Gram matrix loss

The authors develop a stable training approach that replaces adversarial training with a combination of perceptual and Gram matrix losses, directly optimizing second-order statistics to achieve high-fidelity reconstruction without GAN instabilities.

10 retrieved papers
Can Refute
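This contribution can be sketched as follows: a Gram matrix loss matches second-order statistics of feature maps between reconstruction and target, serving as an adversarial-free surrogate for a GAN discriminator. The feature shapes, normalization, and how the term is weighted against the perceptual loss are illustrative assumptions here, not the paper's exact recipe.

```python
import numpy as np

def gram(feats):
    """Gram matrix of a (C, H, W) feature map, normalized by C*H*W."""
    C, H, W = feats.shape
    F = feats.reshape(C, H * W)
    return F @ F.T / (C * H * W)

def gram_loss(feats_recon, feats_target):
    """Squared Frobenius distance between the two Gram matrices."""
    diff = gram(feats_recon) - gram(feats_target)
    return float((diff ** 2).sum())

rng = np.random.default_rng(0)
f = rng.standard_normal((8, 4, 4))      # stand-in for extracted features
assert gram_loss(f, f) == 0.0           # identical features give zero loss
loss = gram_loss(f, rng.standard_normal((8, 4, 4)))
```

In practice the feature maps would come from a frozen perceptual network rather than raw pixels; the appeal is that the objective is a fixed, differentiable statistic, avoiding the min-max instabilities of adversarial training.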

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

AToken: unified visual tokenizer for images, videos, and 3D

The authors introduce AToken, a single tokenizer that unifies reconstruction and understanding tasks across three visual modalities (images, videos, 3D assets) by encoding them into a shared 4D latent space, supporting both continuous and discrete token representations.

Contribution

Pure transformer architecture with 4D rotary position embeddings

The authors propose a transformer-based encoder-decoder architecture that extends 2D image processing to a unified 4D space using rotary position embeddings, enabling native handling of arbitrary resolutions and temporal lengths across all modalities.

Contribution

Adversarial-free training objective with Gram matrix loss

The authors develop a stable training approach that replaces adversarial training with a combination of perceptual and Gram matrix losses, directly optimizing second-order statistics to achieve high-fidelity reconstruction without GAN instabilities.