On Natural Ways to Generate and Their Provable Power

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: masked diffusion models, autoregressive models, expressiveness
Abstract:

Diffusion language models have recently emerged as a competitive alternative to autoregressive language models. Unlike next-token generation, they enable parallel and any-order token generation, making them more efficient and flexible. However, despite empirical successes, their computational power and fundamental limitations remain poorly understood. In this paper, we formally study whether non-autoregressive generation in Masked Diffusion Models (MDM) enables solving problems beyond the reach of Auto-Regressive Models (ARM). Our results show that MDM with sufficiently large context length is computationally universal, with decoding steps matching the optimal parallel time complexity in the PRAM model. However, when controlling for other factors, MDM's flexibility to generate in any order does not expand the class of problems ARM can already solve. To address this limitation, we propose a new form of generation called any-process generation, which extends MDM with the ability to remask, insert, and delete tokens. Theoretically and empirically, we demonstrate that these capabilities enable scaling to significantly harder reasoning problems that are otherwise intractable for ARM and vanilla MDM. They also prove essential for generation tasks where objects naturally evolve through non-sequential processes, which is crucial for extending current LLMs beyond natural language to domains such as coding and science.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper establishes formal computational power results for Masked Diffusion Models (MDM) and proposes any-process generation with remask, insert, and delete operations. It resides in the Computational Universality and Complexity Analysis leaf, which contains only three papers total, indicating a sparse research direction focused on foundational theory rather than empirical architecture design. This positioning suggests the work addresses underexplored theoretical questions about what non-autoregressive models can fundamentally compute, rather than incremental improvements to existing systems.

The taxonomy reveals that most non-autoregressive generation research concentrates on architectural innovations (flow-based methods, diffusion architectures, attention mechanisms) and application-specific implementations (translation, speech synthesis, code completion). The paper's theoretical focus distinguishes it from these neighboring branches. Its sibling papers in the same leaf examine expressiveness and parallel complexity foundations, but the taxonomy structure shows limited prior work directly addressing computational universality via PRAM simulation or formal separations between generation paradigms. The scope and exclude notes confirm this leaf specifically targets formal complexity analysis, not empirical performance studies.

Among twelve candidates examined across three contributions, none were found to clearly refute the paper's claims. The formal PRAM characterization examined one candidate with no refutable overlap. The any-process generation framework similarly examined one candidate without prior work providing the same remask-insert-delete operations. The theoretical separation result examined ten candidates, the largest search, yet found no prior work establishing that any-order generation fails to expand autoregressive capabilities. These statistics reflect a limited but targeted literature search, suggesting the contributions occupy relatively unexplored theoretical territory within the examined scope.

Based on the limited search of twelve candidates, the work appears to introduce novel theoretical machinery and formal results in a sparse research area. The taxonomy context confirms that foundational computational power questions receive less attention than architectural or application-driven work. However, the small candidate pool means potentially relevant complexity-theoretic results from adjacent fields may not have been captured. The analysis covers top semantic matches and immediate neighbors but cannot claim exhaustive coverage of all relevant theory.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 12
- Refutable Papers: 0

Research Landscape Overview

Core task: understanding the computational power of non-autoregressive generation methods. The field has evolved into a rich landscape organized around five main branches. Theoretical Foundations and Computational Complexity examines fundamental questions about expressiveness and universality, drawing on classical results such as Expressibility Parallel Complexity[20] and exploring what non-autoregressive (NAR) models can represent compared to their autoregressive counterparts. Architectural Innovations and Training Methods focuses on novel designs, ranging from masked transformers such as Masked Audio Transformer[2] to flow-based approaches like FlowSeq[25], that enable parallel generation while maintaining quality. Hybrid and Comparative Approaches investigates how NAR and autoregressive paradigms can be combined or contrasted, as seen in works like NAR Accelerates AR[5] and Autoregressive to NAR[42]. Application-Specific Implementations deploys NAR techniques across domains including speech synthesis (Residual Quantized Speech[11]), translation (Fully NAR Translation[14]), and code completion (NAR Code Completion[8]). Finally, Auxiliary Methods and Supporting Techniques provides enabling tools, such as attention mechanisms and token-merging strategies, that support NAR generation across tasks.

Several active lines of work reveal key trade-offs between parallelism, expressiveness, and task-specific performance. One central question is whether NAR models sacrifice representational capacity for speed, a theme explored both theoretically and empirically by studies like Powerful Generation Ways[33] and Natural Generation Power[0]. Natural Generation Power[0] sits squarely within the Computational Universality and Complexity Analysis cluster, addressing foundational questions about what NAR architectures can compute. Its emphasis on theoretical characterization contrasts with more application-driven neighbors: while Powerful Generation Ways[33] also examines expressiveness, it may lean toward empirical comparisons across generation paradigms. Meanwhile, hybrid strategies such as NAR Accelerates AR[5] and semantic approaches like Parallel Semantic IDs[3] highlight ongoing efforts to bridge the gap between pure NAR and autoregressive methods, suggesting that the field increasingly values flexible frameworks that adapt computational budgets to task demands.

Claimed Contributions

Formal characterization of MDM computational power via PRAM simulation

The authors establish that Masked Diffusion Models achieve Turing-completeness with optimal parallel time complexity by simulating PRAM algorithms, enabling exponential speedups over autoregressive models for parallelizable problems while maintaining the same asymptotic decoding steps as the parallel time bound.

1 retrieved paper
Any-process generation framework with remask, insert, and delete operations

The authors introduce a generation paradigm that extends standard masked diffusion with three operations (remask, insert, delete) enabling self-correction, length-variable editing, and adaptive parallelism. This framework is learned end-to-end without architectural changes and achieves optimal parallel time and space complexity.

1 retrieved paper
Theoretical separation showing any-order generation does not expand ARM capabilities

The authors prove that when controlling for parallelism degree and architecture, any-order generation itself provides no computational advantage over autoregressive models, as any computation by MDM can be reorganized into left-to-right order through explicit position and token specification.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Formal characterization of MDM computational power via PRAM simulation

The authors establish that Masked Diffusion Models achieve Turing-completeness with optimal parallel time complexity by simulating PRAM algorithms, enabling exponential speedups over autoregressive models for parallelizable problems while maintaining the same asymptotic decoding steps as the parallel time bound.
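The gap between sequential and parallel decoding steps can be illustrated with a toy Python sketch (this is an illustrative analogy, not the paper's actual PRAM construction): summing n values with a binary reduction tree, where each round resolves all pair-sums at once, mirrors an MDM unmasking many independent tokens per step, finishing in O(log n) rounds where left-to-right decoding would need n steps.

```python
def parallel_sum_rounds(values):
    """Binary-tree reduction: each round combines all adjacent pairs
    'in parallel', analogous to an MDM unmasking many tokens per step."""
    rounds = 0
    while len(values) > 1:
        # One parallel round: every pair-sum is resolved simultaneously.
        values = [sum(values[i:i + 2]) for i in range(0, len(values), 2)]
        rounds += 1
    return values[0], rounds

total, rounds = parallel_sum_rounds(list(range(1024)))
print(total, rounds)  # 523776 after 10 parallel rounds, vs. 1024 sequential steps
```

The round count matches the depth of the computation tree, which is the quantity the parallel time bound controls.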

Contribution

Any-process generation framework with remask, insert, and delete operations

The authors introduce a generation paradigm that extends standard masked diffusion with three operations (remask, insert, delete) enabling self-correction, length-variable editing, and adaptive parallelism. This framework is learned end-to-end without architectural changes and achieves optimal parallel time and space complexity.
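As a minimal sketch of what these primitives do (the operation names follow the paper's description, but the plain token-list encoding and the `<M>` mask symbol are assumptions for illustration):

```python
MASK = "<M>"

def remask(seq, i):
    """Replace token i with a mask so it can be re-predicted (self-correction)."""
    return seq[:i] + [MASK] + seq[i + 1:]

def insert(seq, i):
    """Insert a fresh mask at position i (the sequence grows)."""
    return seq[:i] + [MASK] + seq[i:]

def delete(seq, i):
    """Remove token i (the sequence shrinks)."""
    return seq[:i] + seq[i + 1:]

state = ["the", "cat", "sat"]
state = insert(state, 2)   # ["the", "cat", "<M>", "sat"]
state = remask(state, 0)   # ["<M>", "cat", "<M>", "sat"]
state = delete(state, 3)   # ["<M>", "cat", "<M>"]
print(state)
```

Standard masked diffusion only ever fills masks; the point of the extension is that length and previously committed tokens are no longer fixed.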

Contribution

Theoretical separation showing any-order generation does not expand ARM capabilities

The authors prove that when controlling for parallelism degree and architecture, any-order generation itself provides no computational advantage over autoregressive models, as any computation by MDM can be reorganized into left-to-right order through explicit position and token specification.
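The reorganization argument can be sketched in a few lines of hypothetical Python: if each decoding step explicitly records which position receives which token, an any-order trace becomes ordinary left-to-right data that a sequential model could emit record by record and reach the same final string.

```python
def replay_left_to_right(trace, length):
    """Consume explicit (position, token) records strictly in order;
    the final string matches the any-order generation they came from."""
    out = ["<M>"] * length
    for pos, tok in trace:
        out[pos] = tok
    return out

# An any-order model fills positions 2, then 0, then 1; serialized as
# (position, token) records, the same trace replays sequentially.
trace = [(2, "sat"), (0, "the"), (1, "cat")]
print(replay_left_to_right(trace, 3))  # ['the', 'cat', 'sat']
```

This is only the intuition behind the separation claim: once positions are specified explicitly, generation order is data rather than extra computational power.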