Latent Denoising Makes Good Visual Tokenizers

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Image Tokenizer, Image Generative Models, Representation Learning
Abstract:

Despite their fundamental role, it remains unclear what properties could make tokenizers more effective for generative modeling. We observe that modern generative models share a conceptually similar training objective---reconstructing clean signals from corrupted inputs, such as signals degraded by Gaussian noise or masking---a process we term denoising. Motivated by this insight, we propose aligning tokenizer embeddings directly with the downstream denoising objective, encouraging latent embeddings that remain reconstructable even under significant corruption. To achieve this, we introduce the Latent Denoising Tokenizer (l-DeTok), a simple yet highly effective tokenizer trained to reconstruct clean images from latent embeddings corrupted via interpolative noise or random masking. Extensive experiments on class-conditioned (ImageNet 256x256 and 512x512) and text-conditioned (MSCOCO) image generation benchmarks demonstrate that our l-DeTok consistently improves generation quality across six representative generative models compared to prior tokenizers. Our findings highlight denoising as a fundamental design principle for tokenizer development, and we hope it could motivate new perspectives for future tokenizer design. Our code and models will be publicly available.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes l-DeTok, a tokenizer trained to reconstruct clean images from latent embeddings corrupted via interpolative noise or random masking, positioning denoising as a core design principle. It resides in the 'Denoising-Based Training Objectives' leaf, which contains only two papers, including this one. This is a notably sparse research direction within the broader taxonomy of 50 papers across 27 leaf nodes, suggesting that explicit denoising-based tokenizer training remains relatively underexplored compared with reconstruction-focused or post-training refinement approaches.

The taxonomy reveals that l-DeTok's immediate neighbors include reconstruction-focused training methods and post-training refinement strategies, both of which emphasize alignment with downstream objectives but differ in mechanism. Nearby branches address latent space stabilization and generation-aware refinement, indicating that the field is actively exploring how to bridge tokenizer learning and generative model requirements. The denoising-based leaf sits within a larger branch on training objectives, distinct from architectural innovations like factorized quantization or continuous latent spaces, clarifying that l-DeTok's novelty centers on the training regime rather than representational structure.

Among the 30 candidates examined (10 per contribution), the first contribution, the l-DeTok method itself, has one refutable candidate, suggesting that some prior work addresses denoising-based tokenizer training. For the second contribution, framing denoising as a unifying principle, none of the 10 candidates clearly refutes it, indicating that this conceptual framing is less directly anticipated. For the third contribution, comprehensive empirical validation across six generative models, there are likewise no refutations, suggesting that the breadth of experimental coverage is relatively distinctive within the limited search scope.

Given the sparse population of the denoising-based training leaf and the limited search scale, the work appears to occupy a relatively novel position in explicitly aligning tokenizer objectives with downstream denoising processes. However, the presence of one refutable candidate for the core method indicates that the technical approach may have partial precedent. The analysis reflects top-30 semantic matches and does not constitute an exhaustive survey of all tokenizer training strategies or denoising formulations in the broader literature.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Paper: 1

Research Landscape Overview

Core task: visual tokenizer design for generative modeling. The field organizes around several complementary dimensions. One branch explores tokenization architecture and representation space, examining how discrete codes or continuous embeddings best capture visual structure. Another focuses on training objectives and optimization strategies, including reconstruction losses, adversarial training, and denoising-based formulations that guide tokenizers toward semantically meaningful representations. A third branch investigates unified tokenizers that bridge understanding and generation tasks, while a fourth targets tokenizers optimized specifically for autoregressive generation models. Additional branches address multimodal and cross-domain tokenization, extending visual codes to language, audio, or video, and domain-specific applications such as medical imaging or robotics.

Representative works like Muse[3] and MaskGIT[31] illustrate how different training regimes shape tokenizer behavior, while efforts such as VILA-U[8] and OmniTokenizer[38] demonstrate the push toward unified multimodal representations. Within the training objectives branch, a particularly active line of work explores denoising-based formulations that encourage tokenizers to learn robust, noise-invariant features. Latent Denoising Tokenizers[0] exemplifies this approach by incorporating denoising objectives directly into the tokenization process, contrasting with purely reconstruction-driven methods. Nearby, Layton[27] also emphasizes denoising mechanisms but may differ in architectural choices or the balance between reconstruction fidelity and semantic abstraction. Meanwhile, post-training refinement strategies such as Tokenizer Post-training[5] adjust pretrained tokenizers to better align with downstream generative models, highlighting an ongoing tension between end-to-end joint training and modular design.

The original paper sits squarely in this denoising-focused cluster, contributing to the broader question of how noise-aware objectives can yield tokenizers that generalize across diverse generative architectures and data distributions.

Claimed Contributions

Latent Denoising Tokenizer (l-DeTok)

The authors propose a tokenizer training method that aligns latent embeddings with downstream generative model objectives by reconstructing clean images from corrupted latent representations. This is achieved through interpolative noise injection and optional random masking during training, encouraging robust and easily reconstructable embeddings.
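The corruption scheme described above can be sketched as follows. This is a minimal NumPy illustration of the idea, not the authors' exact recipe: the function name, the sampling range for the interpolation strength, the masking ratio, and zeroing as the masking operation are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_latents(z, tau_max=0.7, mask_ratio=0.3):
    """Corrupt latent embeddings with interpolative noise and random masking.

    z: (num_tokens, dim) latent embeddings from the tokenizer encoder.
    tau_max and mask_ratio are illustrative hyperparameters, not the
    paper's reported settings.
    """
    # Interpolative noise: blend each token toward Gaussian noise.
    tau = rng.uniform(0.0, tau_max)        # sampled corruption strength
    eps = rng.standard_normal(z.shape)
    z_noisy = (1.0 - tau) * z + tau * eps

    # Random masking: drop a fraction of tokens (here, zeroed out).
    num_tokens = z.shape[0]
    num_masked = int(mask_ratio * num_tokens)
    masked_idx = rng.choice(num_tokens, size=num_masked, replace=False)
    z_noisy[masked_idx] = 0.0
    return z_noisy, masked_idx

# The tokenizer decoder is then trained to reconstruct the clean image
# from z_noisy, e.g. loss = ||decoder(z_noisy) - image||^2.
z = rng.standard_normal((16, 8))           # 16 tokens, 8-dim latents
z_noisy, masked_idx = corrupt_latents(z)
```

Training against such corrupted latents pressures the encoder to produce embeddings that remain decodable under heavy degradation, which is the alignment with downstream denoising objectives that the contribution claims.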

10 retrieved papers; 1 can refute this contribution
Denoising as a unifying design principle for tokenizers

The authors establish that modern generative models share a common training objective of reconstructing clean signals from corrupted inputs (denoising), and propose that tokenizers should be designed to align with this principle. This conceptual framework motivates tokenizer embeddings that remain reconstructable under significant corruption.
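The shared objective this framing points to can be written schematically; the formulation below is a sketch that ignores model-specific parameterizations (e.g., diffusion models typically predict the noise $\epsilon$ rather than the clean signal directly):

```latex
\min_{\theta}\; \mathbb{E}_{x,\,c}\,
  \bigl\| f_{\theta}\bigl(c(x)\bigr) - x \bigr\|^{2},
\qquad \text{where, e.g.,} \qquad
c(x) = \sqrt{\bar{\alpha}_{t}}\,x + \sqrt{1-\bar{\alpha}_{t}}\,\epsilon
\;\;\text{(Gaussian noising)}
\quad \text{or} \quad
c(x) = m \odot x
\;\;\text{(token masking)}.
```

Under this view, diffusion models and masked generative models differ only in the corruption operator $c$, which is what motivates training the tokenizer so that its latents survive either form of corruption.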

10 retrieved papers; none refute this contribution
Comprehensive empirical validation across diverse generative models

The authors demonstrate that their tokenizer generalizes across six representative generative models (both autoregressive and non-autoregressive), multiple tokenizer architectures (2D continuous, 1D continuous, and vector-quantized), and different generation tasks, showing consistent improvements without requiring semantics distillation from external pretrained models.

10 retrieved papers; none refute this contribution

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Latent Denoising Tokenizer (l-DeTok)
Contribution 2: Denoising as a unifying design principle for tokenizers
Contribution 3: Comprehensive empirical validation across diverse generative models