Latent Denoising Makes Good Visual Tokenizers

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Image Tokenizer, Image Generative Models, Representation Learning
Abstract:

Despite their fundamental role, it remains unclear what properties could make tokenizers more effective for generative modeling. We observe that modern generative models share a conceptually similar training objective---reconstructing clean signals from corrupted inputs, such as signals degraded by Gaussian noise or masking---a process we term denoising. Motivated by this insight, we propose aligning tokenizer embeddings directly with the downstream denoising objective, encouraging latent embeddings that remain reconstructable even under significant corruption. To achieve this, we introduce the Latent Denoising Tokenizer (l-DeTok), a simple yet highly effective tokenizer trained to reconstruct clean images from latent embeddings corrupted via interpolative noise or random masking. Extensive experiments on class-conditioned (ImageNet 256x256 and 512x512) and text-conditioned (MSCOCO) image generation benchmarks demonstrate that our l-DeTok consistently improves generation quality across six representative generative models compared to prior tokenizers. Our findings highlight denoising as a fundamental design principle for tokenizer development, and we hope it could motivate new perspectives for future tokenizer design. Our code and models will be publicly available.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes l-DeTok, a tokenizer trained to reconstruct clean images from latent embeddings corrupted via interpolative noise or random masking, positioning denoising as a core design principle. It resides in the 'Denoising-Based Training Objectives' leaf, which contains only two papers, including this one. This is a notably sparse research direction within the broader taxonomy of 50 papers across 27 leaf nodes, suggesting that explicit denoising-based tokenizer training remains relatively underexplored compared with reconstruction-focused or post-training refinement approaches.

The taxonomy reveals that l-DeTok's immediate neighbors include reconstruction-focused training methods and post-training refinement strategies, both of which emphasize alignment with downstream objectives but differ in mechanism. Nearby branches address latent space stabilization and generation-aware refinement, indicating that the field is actively exploring how to bridge tokenizer learning and generative model requirements. The denoising-based leaf sits within a larger branch on training objectives, distinct from architectural innovations like factorized quantization or continuous latent spaces, clarifying that l-DeTok's novelty centers on the training regime rather than representational structure.

Among the 30 candidates examined (10 per contribution), the first contribution, the l-DeTok method itself, has one refutable candidate, suggesting that some prior work addresses denoising-based tokenizer training. For the second contribution, framing denoising as a unifying principle, none of the 10 candidates clearly refutes it, indicating that this conceptual framing is less directly anticipated. For the third contribution, comprehensive empirical validation across six generative models, there are likewise no refutations, suggesting that the breadth of experimental coverage is relatively distinctive within the limited search scope.

Given the sparse population of the denoising-based training leaf and the limited search scale, the work appears to occupy a relatively novel position in explicitly aligning tokenizer objectives with downstream denoising processes. However, the presence of one refutable candidate for the core method indicates that the technical approach may have partial precedent. The analysis reflects top-30 semantic matches and does not constitute an exhaustive survey of all tokenizer training strategies or denoising formulations in the broader literature.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Paper: 1

Research Landscape Overview

Core task: visual tokenizer design for generative modeling. The field organizes around several complementary dimensions. One branch explores tokenization architecture and representation space, examining how discrete codes or continuous embeddings best capture visual structure. Another focuses on training objectives and optimization strategies, including reconstruction losses, adversarial training, and denoising-based formulations that guide tokenizers toward semantically meaningful representations. A third branch investigates unified tokenizers that bridge understanding and generation tasks, while a fourth targets tokenizers optimized specifically for autoregressive generation models. Additional branches address multimodal and cross-domain tokenization, extending visual codes to language, audio, or video, and domain-specific applications such as medical imaging or robotics.

Representative works like Muse[3] and MaskGIT[31] illustrate how different training regimes shape tokenizer behavior, while efforts such as VILA-U[8] and OmniTokenizer[38] demonstrate the push toward unified multimodal representations. Within the training objectives branch, a particularly active line of work explores denoising-based formulations that encourage tokenizers to learn robust, noise-invariant features. Latent Denoising Tokenizers[0] exemplifies this approach by incorporating denoising objectives directly into the tokenization process, contrasting with purely reconstruction-driven methods. Nearby, Layton[27] also emphasizes denoising mechanisms but may differ in architectural choices or the balance between reconstruction fidelity and semantic abstraction. Meanwhile, post-training refinement strategies such as Tokenizer Post-training[5] adjust pretrained tokenizers to better align with downstream generative models, highlighting an ongoing tension between end-to-end joint training and modular design.

The original paper sits squarely in this denoising-focused cluster, contributing to the broader question of how noise-aware objectives can yield tokenizers that generalize across diverse generative architectures and data distributions.

Claimed Contributions

Latent Denoising Tokenizer (l-DeTok)

The authors propose a tokenizer training method that aligns latent embeddings with downstream generative model objectives by reconstructing clean images from corrupted latent representations. This is achieved through interpolative noise injection and optional random masking during training, encouraging robust and easily reconstructable embeddings.
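The corruption scheme described above can be sketched as follows. This is a minimal NumPy illustration of the idea, not the authors' exact recipe: the function name, the sampling range for the interpolation strength, the masking ratio, and zeroing as the masking operation are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_latents(z, tau_max=0.7, mask_ratio=0.3):
    """Corrupt latent embeddings with interpolative noise and random masking.

    z: (num_tokens, dim) latent embeddings from the tokenizer encoder.
    tau_max and mask_ratio are illustrative hyperparameters, not the
    paper's reported settings.
    """
    # Interpolative noise: blend each token toward Gaussian noise.
    tau = rng.uniform(0.0, tau_max)        # sampled corruption strength
    eps = rng.standard_normal(z.shape)
    z_noisy = (1.0 - tau) * z + tau * eps

    # Random masking: drop a fraction of tokens (here, zeroed out).
    num_tokens = z.shape[0]
    num_masked = int(mask_ratio * num_tokens)
    masked_idx = rng.choice(num_tokens, size=num_masked, replace=False)
    z_noisy[masked_idx] = 0.0
    return z_noisy, masked_idx

# The tokenizer decoder is then trained to reconstruct the clean image
# from z_noisy, e.g. loss = ||decoder(z_noisy) - image||^2.
z = rng.standard_normal((16, 8))           # 16 tokens, 8-dim latents
z_noisy, masked_idx = corrupt_latents(z)
```

Training against such corrupted latents pressures the encoder to produce embeddings that remain decodable under heavy degradation, which is the alignment with downstream denoising objectives that the contribution claims.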

10 retrieved papers; 1 can refute this contribution
Denoising as a unifying design principle for tokenizers

The authors establish that modern generative models share a common training objective of reconstructing clean signals from corrupted inputs (denoising), and propose that tokenizers should be designed to align with this principle. This conceptual framework motivates tokenizer embeddings that remain reconstructable under significant corruption.
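The shared objective this framing points to can be written schematically; the formulation below is a sketch that ignores model-specific parameterizations (e.g., diffusion models typically predict the noise $\epsilon$ rather than the clean signal directly):

```latex
\min_{\theta}\; \mathbb{E}_{x,\,c}\,
  \bigl\| f_{\theta}\bigl(c(x)\bigr) - x \bigr\|^{2},
\qquad \text{where, e.g.,} \qquad
c(x) = \sqrt{\bar{\alpha}_{t}}\,x + \sqrt{1-\bar{\alpha}_{t}}\,\epsilon
\;\;\text{(Gaussian noising)}
\quad \text{or} \quad
c(x) = m \odot x
\;\;\text{(token masking)}.
```

Under this view, diffusion models and masked generative models differ only in the corruption operator $c$, which is what motivates training the tokenizer so that its latents survive either form of corruption.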

10 retrieved papers; none refute this contribution
Comprehensive empirical validation across diverse generative models

The authors demonstrate that their tokenizer generalizes across six representative generative models (both autoregressive and non-autoregressive), multiple tokenizer architectures (2D continuous, 1D continuous, and vector-quantized), and different generation tasks, showing consistent improvements without requiring semantics distillation from external pretrained models.

10 retrieved papers; none refute this contribution

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Latent Denoising Tokenizer (l-DeTok)
Contribution 2: Denoising as a unifying design principle for tokenizers
Contribution 3: Comprehensive empirical validation across diverse generative models