Adapting Self-Supervised Representations as a Latent Space for Efficient Generation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: generative models, visual synthesis, diffusion, flow matching
Abstract:

We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained end-to-end using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring it remains smooth and suitable for generation. Our single-token formulation resolves the spatial redundancies of the 2D latent space, simplifies architectures, and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and extends naturally to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling. We will release our model to facilitate further research.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes RepTok, a framework representing images with a single continuous latent token derived from self-supervised vision transformers, paired with a flow-matching decoder. According to the taxonomy, it resides in the 'Single-Token and Ultra-Compact Representations' leaf, which contains only two papers total. This leaf sits under the broader 'Continuous Token Representation and Tokenization' branch, indicating a relatively sparse research direction focused on extreme compactness. The sibling paper in this leaf shares the goal of minimal latent dimensionality, suggesting this is an emerging rather than crowded area of investigation.

The taxonomy reveals neighboring leaves pursuing multi-token continuous representations (three papers) and unified multimodal tokenizers (one paper), indicating alternative strategies within continuous tokenization. Adjacent branches explore autoregressive generation with continuous tokens (eight papers across three sub-branches) and hybrid discrete-continuous approaches (two papers). The taxonomy's scope notes clarify that single-token methods explicitly exclude multi-token representations requiring dozens or hundreds of tokens, and also exclude discrete quantization approaches. This positioning suggests RepTok occupies a distinct niche: maximizing compactness while remaining purely continuous, contrasting with both longer-sequence continuous methods and hybrid quantization schemes.

Among the three contributions analyzed, the core RepTok framework (a single SSL token) was compared against only one candidate, which was found potentially refutable, indicating limited but direct overlap in this ultra-compact space. The cosine-similarity regularization for preserving SSL geometry was compared against ten candidates, none of which clearly refutes it, suggesting this specific technique is less explored. The lightweight attention-free pipeline was compared against two candidates, neither of which refutes it. The analysis explicitly notes that it is based on a limited literature search of thirteen candidates in total, not an exhaustive review. These statistics suggest that the single-token SSL approach has at least one close precedent, while the regularization strategy and architectural simplifications appear less directly anticipated.

Given the sparse taxonomy leaf (two papers) and limited search scope (thirteen candidates), the work appears to occupy a genuine frontier in ultra-compact continuous representations. However, the single refutable candidate for the core contribution indicates the fundamental idea of single-token SSL-based generation has been explored. The novelty likely resides in the specific combination of SSL fine-tuning, cosine regularization, and flow-matching integration rather than the single-token paradigm itself. The taxonomy context suggests this direction remains under-explored compared to multi-token or autoregressive alternatives, though definitive claims require broader literature coverage.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 13
Refutable Paper: 1

Research Landscape Overview

Core task: efficient image generation from a single continuous latent token. The field has evolved around the tension between discrete tokenization, long dominant in autoregressive vision models, and emerging continuous representations that promise greater compactness and expressiveness. The taxonomy reflects this landscape through several major branches: Continuous Token Representation and Tokenization explores methods that encode images into smooth latent codes, often achieving ultra-compact single-token or few-token representations (e.g., Layton[44], Adapting Self-Supervised Latent[0]). Autoregressive Generation with Continuous Tokens investigates how to apply sequential modeling without quantization (Autoregressive Without Quantization[2], HART[4]), while Hybrid and Discrete-Continuous Approaches blend both paradigms (SoftVQ VAE[9], Rethinking Discrete Tokens[10]). Unified Multimodal Generation branches (Unified Autoregressive Vision[1], Ming Univision[3]) extend these ideas to joint text-image modeling, and Diffusion-Based Generation in Continuous Latent Spaces leverages diffusion or flow models in smooth embeddings (LN3DIFF[12], V2Flow[13]). Additional branches address conditional control, domain-specific applications, representation learning properties, cross-modal synthesis, and auxiliary architectural components, collectively mapping the diverse strategies for moving beyond discrete codes.

A particularly active line of work focuses on pushing continuous tokenization to its extreme: single-token or ultra-compact representations that drastically reduce sequence length while preserving reconstruction fidelity. Adapting Self-Supervised Latent[0] sits squarely in this cluster, emphasizing how self-supervised pretraining can be adapted to yield a single continuous token per image. Nearby, Layton[44] also explores ultra-compact continuous embeddings, sharing the goal of minimal latent dimensionality.
In contrast, methods like Autoregressive Without Quantization[2] and HART[4] retain longer sequences of continuous tokens to enable autoregressive modeling, trading compactness for the flexibility of next-token prediction. Another contrast emerges with hybrid approaches (SoftVQ VAE[9]) that soften discrete codes rather than eliminating them entirely. The central open question across these branches is whether a single continuous token can capture sufficient detail for high-resolution synthesis, or whether a small handful of tokens offers a better balance between efficiency and expressiveness. Adapting Self-Supervised Latent[0] contributes to this debate by demonstrating that leveraging pretrained representations can make single-token schemes more viable, positioning it as a bridge between representation learning and ultra-compact generation.

Claimed Contributions

Representation Tokenizer (RepTok) framework using single continuous SSL token

The authors propose RepTok, a method that adapts pre-trained self-supervised learning (SSL) encoders by fine-tuning only the semantic class token embedding. This single continuous token is paired with a generative decoder trained via flow matching, enabling faithful image reconstruction and efficient generation while eliminating spatial redundancies inherent in 2D latent spaces.
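To make the training objective concrete, the sketch below shows a standard conditional flow-matching loss conditioned on a single latent token per image, as described above. This is an illustrative NumPy toy, not the authors' implementation: the placeholder `velocity_net`, all dimensions, and the linear "network" are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(x1, z, velocity_net):
    """Conditional flow matching: regress the velocity of the straight
    path x_t = (1 - t) * x0 + t * x1, conditioned on latent token z."""
    x0 = rng.standard_normal(x1.shape)       # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))   # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1            # linear interpolant
    v_target = x1 - x0                       # ground-truth velocity
    v_pred = velocity_net(x_t, t, z)         # model prediction
    return np.mean((v_pred - v_target) ** 2)

# Toy "network": a linear map of the concatenated inputs (placeholder).
D, Z = 8, 4                                  # data dim, token dim (made up)
W = rng.standard_normal((D + 1 + Z, D)) * 0.1

def velocity_net(x_t, t, z):
    return np.concatenate([x_t, t, z], axis=1) @ W

x1 = rng.standard_normal((16, D))            # batch of flattened "images"
z = rng.standard_normal((16, Z))             # one latent token per image
loss = flow_matching_loss(x1, z, velocity_net)
print(float(loss))
```

In the paper's setting, `z` would come from the fine-tuned semantic token of the SSL encoder, and the decoder would be trained end-to-end under this objective.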

1 retrieved paper
Can Refute
Cosine-similarity regularization loss for preserving SSL latent space geometry

A cosine-similarity alignment term is introduced to constrain the fine-tuned token from deviating too far from its pre-trained SSL representation. This regularization maintains the smooth, semantically structured geometry of the original SSL space, which is beneficial for generative modeling, while still allowing the token to integrate fine-grained reconstruction details.
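The regularizer described above amounts to a cosine-alignment penalty between the adapted token and its frozen pre-trained counterpart. A minimal NumPy sketch, assuming the common `1 - cos` form (the exact form and weighting in the paper may differ):

```python
import numpy as np

def cosine_alignment_loss(z_adapted, z_ssl):
    """Penalize the adapted token for drifting away (in angle) from the
    frozen pre-trained SSL token: loss = 1 - cos(z_adapted, z_ssl)."""
    num = np.sum(z_adapted * z_ssl, axis=-1)
    den = np.linalg.norm(z_adapted, axis=-1) * np.linalg.norm(z_ssl, axis=-1)
    return np.mean(1.0 - num / (den + 1e-8))

rng = np.random.default_rng(0)
z_ssl = rng.standard_normal((4, 16))                      # frozen SSL tokens
z_adapted = z_ssl + 0.05 * rng.standard_normal((4, 16))   # small adaptation

print(cosine_alignment_loss(z_adapted, z_ssl))  # near 0: tokens stay aligned
print(cosine_alignment_loss(-z_ssl, z_ssl))     # near 2: maximal misalignment
```

Because the penalty is angular rather than Euclidean, the token's norm is free to change while its direction, which carries the SSL space's semantic structure, is preserved.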

10 retrieved papers
Lightweight attention-free pipeline for latent generative modeling

The authors demonstrate that by compressing images into a single token, token-to-token interactions become unnecessary, enabling the use of simple MLP-based architectures such as MLP-Mixer instead of attention mechanisms. This drastically reduces training compute while preserving generation quality, achieving competitive ImageNet generation at a fraction of the cost of transformer-based diffusion baselines.
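To see why attention becomes unnecessary: with one token per image there are no token-to-token interactions left to model, so a mixer-style block degenerates to a per-token channel MLP. An illustrative NumPy sketch (the dimensions, block count, and ReLU activation are assumptions, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_block(x, w1, b1, w2, b2):
    """A channel-mixing MLP with a residual connection. With a single
    token per image there is nothing to attend over, so stacks of these
    blocks can replace self-attention entirely."""
    h = np.maximum(x @ w1 + b1, 0.0)   # ReLU here; GELU is typical in mixers
    return x + h @ w2 + b2             # residual connection

D, H = 32, 128                         # token dim, hidden dim (illustrative)
params = [(rng.standard_normal((D, H)) * 0.02, np.zeros(H),
           rng.standard_normal((H, D)) * 0.02, np.zeros(D))
          for _ in range(4)]           # 4 attention-free blocks

z = rng.standard_normal((8, D))        # batch of single-token latents
for w1, b1, w2, b2 in params:
    z = mlp_block(z, w1, b1, w2, b2)
print(z.shape)                         # (8, 32)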

2 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Representation Tokenizer (RepTok) framework using single continuous SSL token

The authors propose RepTok, a method that adapts pre-trained self-supervised learning (SSL) encoders by fine-tuning only the semantic class token embedding. This single continuous token is paired with a generative decoder trained via flow matching, enabling faithful image reconstruction and efficient generation while eliminating spatial redundancies inherent in 2D latent spaces.

Contribution

Cosine-similarity regularization loss for preserving SSL latent space geometry

A cosine-similarity alignment term is introduced to constrain the fine-tuned token from deviating too far from its pre-trained SSL representation. This regularization maintains the smooth, semantically structured geometry of the original SSL space, which is beneficial for generative modeling, while still allowing the token to integrate fine-grained reconstruction details.

Contribution

Lightweight attention-free pipeline for latent generative modeling

The authors demonstrate that by compressing images into a single token, token-to-token interactions become unnecessary, enabling the use of simple MLP-based architectures such as MLP-Mixer instead of attention mechanisms. This drastically reduces training compute while preserving generation quality, achieving competitive ImageNet generation at a fraction of the cost of transformer-based diffusion baselines.
