Flow Autoencoders are Effective Protein Tokenizers

ICLR 2026 Conference Submission · Anonymous Authors
Tags: flow tokenizers, proteins, generation
Abstract:

Protein structure tokenizers enable the creation of multimodal models of protein structure, sequence, and function. Current approaches to protein structure tokenization rely on bespoke components that are invariant to spatial symmetries, but that are challenging to optimize and scale. We present Kanzi, a flow-based tokenizer for tokenization and generation of protein structures. Kanzi consists of a diffusion autoencoder trained with a flow matching loss. We show that this approach simplifies several aspects of protein structure tokenizers: frame-based representations can be replaced with global coordinates, complex losses are replaced with a single flow matching loss, and SE(3)-invariant attention operations can be replaced with standard attention. We find that these changes stabilize the training of parameter-efficient models that outperform existing tokenizers on reconstruction metrics at a fraction of the model size and training cost. An autoregressive model trained with Kanzi outperforms similar generative models that operate over tokens, although it does not yet match the performance of state-of-the-art continuous diffusion models.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Kanzi, a flow-based diffusion autoencoder for protein structure tokenization that replaces vector-quantized codebooks with continuous latent representations. Within the taxonomy, it resides in the 'Flow-Based and Diffusion Autoencoders' leaf under 'Continuous Structure Representation and Embedding', sharing this leaf with only one sibling paper (Flow Autoencoders). This places Kanzi in a relatively sparse research direction—only two papers occupy this specific methodological niche—suggesting the flow-matching approach to structure tokenization remains underexplored compared to the more crowded discrete tokenization branches.

The taxonomy reveals that most structure tokenization work clusters in 'Discrete Structure Tokenization Methods', particularly 'Vector-Quantized Autoencoder Approaches' (five papers) and 'Geometry-Constrained Tokenization' (two papers). Kanzi diverges from these by avoiding explicit codebooks and geometric invariance constraints, instead learning smooth latent spaces through flow matching. Its closest conceptual neighbors are continuous embedding methods like ProteinAE and the original Flow Autoencoders, yet it differs by framing tokenization as a diffusion process rather than pure variational or normalizing-flow objectives. This positions Kanzi at the boundary between continuous representation learning and the broader tokenization ecosystem.

Among eleven candidates examined, one paper was identified as potentially refuting the core contribution of a flow-based tokenizer, while nine others were non-refutable or unclear. The simplification contribution (replacing frame-based representations with global coordinates and standard attention) was examined against one candidate with no refutation found. The reconstruction metric contribution was not examined against any candidates. These statistics reflect a limited search scope—top-K semantic matches plus citation expansion—rather than exhaustive coverage. The flow-based tokenization approach appears less contested in the examined literature, though the small candidate pool (eleven total) limits confidence in this assessment.

Given the sparse occupancy of the flow-based autoencoder leaf and the limited overlap found among eleven examined candidates, Kanzi appears to occupy a relatively novel methodological position within the surveyed literature. However, the analysis is constrained by the search scope: only top-K semantic neighbors were examined, and the taxonomy itself captures thirty-nine papers across the broader field. A more exhaustive search—particularly within diffusion-based generative modeling and continuous embedding methods—might reveal additional overlapping work not surfaced by semantic similarity alone.

Taxonomy

- Core-task Taxonomy Papers: 39
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 11
- Refutable Paper: 1

Research Landscape Overview

Core task: protein structure tokenization and generation. The field has organized itself around several complementary strategies for representing three-dimensional protein structures in forms amenable to machine learning. Discrete Structure Tokenization Methods focus on converting continuous coordinates into symbolic vocabularies—ranging from geometric clustering approaches like Geometric Byte Pair[2] to learned codebooks in works such as FoldToken2[9] and Bio2Token[7]. Continuous Structure Representation and Embedding takes an alternative path, learning smooth latent spaces through autoencoders and flow-based models that preserve geometric information without hard discretization. Multimodal Protein Language Models bridge sequence and structure by integrating both modalities, as seen in Language Protein Structure[1] and ProtTeX[5], while Protein Language Model Architectures and Training explores the underlying neural frameworks—including transformer variants and state-space models like Long-context Mamba[31]. Evaluation and Benchmarking of Tokenization, exemplified by Tokenization Benchmarking[27], provides systematic comparisons of these diverse encoding schemes, and Structure Prediction and Reconstruction addresses the inverse problem of generating plausible three-dimensional conformations from learned representations.

A central tension runs through the field between preserving fine-grained geometric detail and achieving compact, generalizable representations. Discrete tokenization methods often face trade-offs between vocabulary size and reconstruction fidelity, with works like Balancing Locality Reconstruction[3] explicitly addressing this challenge. In contrast, continuous embedding approaches—including ProteinAE[22] and the original Flow Autoencoders[0]—sidestep hard quantization by learning smooth latent manifolds, typically using variational or flow-based objectives to maintain structural coherence.
Flow Autoencoders[0] sits squarely within this continuous paradigm, employing normalizing flows to map protein backbones into tractable distributions, closely aligned with ProteinAE[22] in its emphasis on differentiable, geometry-preserving encodings. Compared to hybrid approaches like Balancing Locality Reconstruction[3], which negotiate between discrete tokens and local geometric constraints, Flow Autoencoders[0] prioritizes end-to-end continuity, offering a complementary perspective on how to compress and generate structural diversity without categorical boundaries.

Claimed Contributions

Kanzi: a flow-based protein structure tokenizer

The authors introduce Kanzi, a novel protein structure tokenizer that uses a flow matching autoencoder architecture. Unlike existing tokenizers that rely on SE(3)-invariant components and complex losses, Kanzi operates on global coordinates with standard attention and uses a single flow matching loss for training.
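To make the "single flow matching loss" concrete, the following is a minimal, self-contained sketch of a conditional flow matching objective with a linear noise-to-data path. It is illustrative only: the report does not specify Kanzi's exact parameterization, and `VelocityNet` is a hypothetical stand-in for the actual latent-conditioned transformer decoder.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Hypothetical stand-in for the velocity model; Kanzi's actual
    network is a transformer conditioned on latent tokens."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim)
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition on time by concatenating t to the noisy input.
        return self.net(torch.cat([x_t, t], dim=-1))

def flow_matching_loss(model: nn.Module, x1: torch.Tensor) -> torch.Tensor:
    """Conditional flow matching on the linear path
    x_t = (1 - t) * x0 + t * x1, whose target velocity is u = x1 - x0."""
    x0 = torch.randn_like(x1)          # noise sample
    t = torch.rand(x1.shape[0], 1)     # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * x1        # point on the interpolation path
    u = x1 - x0                        # target velocity field
    v = model(x_t, t)                  # predicted velocity
    return ((v - u) ** 2).mean()       # single MSE objective

model = VelocityNet(dim=3)
loss = flow_matching_loss(model, torch.randn(8, 3))  # 8 points in R^3
```

In a tokenizer setting, `x1` would be ground-truth atom coordinates and the model would additionally receive the encoder's latent tokens as conditioning; this sketch omits that conditioning for brevity.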

10 retrieved papers · Can Refute
Simplification of protein structure tokenization

The authors demonstrate that their flow-based approach eliminates the need for SE(3)-invariant architectural components, frame-based representations, and collections of complex reconstruction losses that are standard in existing protein tokenizers, replacing them with simpler alternatives while maintaining or improving performance.
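The claimed simplification implies the encoder can be an ordinary transformer applied to linearly embedded global coordinates, with no frames or invariant attention. A minimal sketch of that design follows; the layer sizes are illustrative and not taken from the paper, and it assumes (as the report suggests) that the flow objective or data handling absorbs pose variation rather than an SE(3)-invariant architecture.

```python
import torch
import torch.nn as nn

class CoordinateEncoder(nn.Module):
    """Plain transformer over global C-alpha coordinates: no frame-based
    representation, no SE(3)-invariant attention (sizes are hypothetical)."""
    def __init__(self, d_model: int = 64, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(3, d_model)  # global xyz -> feature vector
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=128, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (batch, n_residues, 3) in a shared global frame
        return self.encoder(self.embed(coords))

enc = CoordinateEncoder()
tokens = enc(torch.randn(2, 16, 3))  # -> (2, 16, 64) latent tokens
```

The contrast with invariant-attention tokenizers is that nothing here constrains the output under rotation of the input; the simplification trades that built-in symmetry for standard, easily scaled components.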

1 retrieved paper
Reconstruction Fréchet Protein Structure Distance (rFPSD) metric

The authors propose rFPSD, a new distribution-level metric for evaluating protein structure tokenizers. This metric extends prior work on generative evaluation to the reconstruction task, providing broader information about tokenization performance beyond point-wise metrics like RMSD.
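Metrics in this family follow the FID recipe: fit Gaussians to feature embeddings of two sets of structures and compute the Fréchet distance between them. The sketch below shows that computation; it is a generic illustration, since the report does not specify which structure feature extractor rFPSD uses, and the random arrays stand in for embeddings of ground-truth versus reconstructed proteins.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_*: (n_samples, feat_dim) arrays, e.g. structure embeddings of
    original vs. reconstructed proteins (the feature choice is assumed)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)      # matrix square root of the product
    if np.iscomplexobj(covmean):        # discard tiny numerical imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

rng = np.random.default_rng(0)
a = rng.normal(size=(256, 8))
d_same = frechet_distance(a, a)        # identical sets: distance near zero
d_diff = frechet_distance(a, a + 2.0)  # shifted set: clearly larger
```

Unlike RMSD, which scores each reconstruction against its own ground truth, this distance compares the two distributions as wholes, which is what makes it a distribution-level reconstruction metric.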

0 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

- Kanzi: a flow-based protein structure tokenizer
- Simplification of protein structure tokenization
- Reconstruction Fréchet Protein Structure Distance (rFPSD) metric