Self-Speculative Decoding Accelerates Lossless Inference in Any-Order and Any-Subset Autoregressive Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: speculative decoding, any-order autoregressive models, diffusion language models
Abstract:

In arbitrary-order language models, it is an open question how to sample tokens in parallel from the correct joint distribution. With discrete diffusion models, the more tokens they generate in parallel, the less their predicted distributions adhere to the originally learned data distribution, as they rely on a conditional independence assumption that holds only in the limit of infinitesimally small timesteps. We find that a different class of models, any-subset autoregressive models (AS-ARMs), holds the solution. As implied by the name, AS-ARMs can generate tokens in any order, and in parallel. Moreover, AS-ARMs support parallelized joint probability density estimation, allowing them to correct their own parallel-generated token distributions via our Any-Subset Speculative Decoding (ASSD) algorithm. ASSD provably enables generation of tokens from the correct joint distribution, with the number of neural network calls upper bounded by the number of tokens predicted; notably, previous speculative decoding algorithms lack this efficiency guarantee. We empirically verify that ASSD speeds up language generation without sacrificing quality. Furthermore, we provide a mathematically justified scheme for training AS-ARMs for generation, and show that AS-ARMs achieve state-of-the-art performance among sub-200M-parameter models on infilling benchmark tasks, and nearly match the performance of models 50X larger on code generation. Our theoretical and empirical results indicate that the once-forgotten AS-ARMs are a promising direction for language modeling.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Any-Subset Speculative Decoding (ASSD), an algorithm enabling parallel token generation from correct joint distributions using any-subset autoregressive models (AS-ARMs). It resides in the 'Any-Subset Autoregressive Models with Speculative Decoding' leaf, which contains only two papers including this one. This represents a relatively sparse research direction within the broader taxonomy of parallel generation methods, suggesting the work addresses a focused problem space where few prior solutions exist. The sibling paper in this leaf shares the any-subset architecture philosophy but differs in its speculative correction mechanism.

The taxonomy reveals neighboring approaches in adjacent branches: 'Any-Order Generation Without Speculative Correction' explores flexible factorizations without correction mechanisms, while 'Diffusion-Based Parallel Generation Methods' and 'Dynamic Multi-Token Prediction Strategies' pursue parallel sampling through fundamentally different paradigms—iterative denoising and confidence-based adaptive prediction, respectively. The paper's position bridges architectural flexibility (any-subset capability) with algorithmic guarantees (speculative decoding), distinguishing it from purely architectural contributions in sibling leaves and from diffusion methods that rely on conditional independence assumptions the authors explicitly critique.

Among 30 candidates examined, the ASSD algorithm contribution shows no clear refutation across 10 examined papers, suggesting novelty in the specific speculative decoding formulation with efficiency guarantees. The training scheme contribution similarly lacks refutable prior work among 10 candidates. However, the architectural design criteria contribution encountered one refutable candidate among 10 examined, indicating some overlap with existing AS-ARM architectural principles. The limited search scope (30 total candidates, not hundreds) means these findings reflect top-semantic-match results rather than exhaustive coverage, particularly relevant given the sparse two-paper leaf this work occupies.

Based on the top-30 semantic matches examined, the work appears to introduce novel algorithmic contributions (ASSD with provable efficiency bounds) within an emerging architectural paradigm (AS-ARMs). The single refutable pair for architectural criteria suggests partial overlap with foundational AS-ARM design principles, while the algorithm and training scheme show no clear precedent in the examined literature. The sparse taxonomy leaf and limited sibling papers reinforce that this represents early-stage exploration of a specific solution space.

Taxonomy

Core-task Taxonomy Papers: 19
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: parallel token generation from joint distributions in autoregressive models. The field addresses the fundamental tension between the sequential nature of autoregressive generation and the desire for faster, parallel inference. The taxonomy reveals several complementary strategies: Any-Order and Any-Subset Autoregressive Architectures explore flexible factorizations that permit generating tokens in arbitrary orders or subsets, enabling speculative or adaptive decoding schemes. Diffusion-Based Parallel Generation Methods blend diffusion processes with autoregressive structures to sample multiple tokens jointly. Dynamic Multi-Token Prediction Strategies focus on predicting variable numbers of tokens per step, adapting generation granularity on the fly. Autoregressive Models with Parallel Inference Mechanisms encompass techniques like speculative decoding and lookahead methods that maintain the autoregressive framework while parallelizing computation. Domain-Specific Applications demonstrate these ideas in specialized contexts such as audio codecs or retrieval-augmented generation, while Survey and Methodological Overview Literature and Counterfactual Explanation Methods provide broader perspectives and niche applications of sampling-based reasoning.

A particularly active line of work centers on any-subset autoregressive models, which train networks to handle arbitrary token subsets and enable speculative decoding without auxiliary draft models. Self-Speculative Decoding[0] exemplifies this approach by using the model's own any-subset capabilities to propose and verify multiple tokens in parallel, closely aligning with Any-Subset Autoregressive[3], which formalizes the theoretical underpinnings of subset-based factorizations. In contrast, Adaptive Parallel Decoding[2] and DynaMo[8] emphasize dynamic adjustment of the number of tokens predicted per step, trading off between parallelism and accuracy based on model confidence.
Meanwhile, diffusion-inspired methods like Guided Autoregressive Diffusion[6] and pseudo-autoregressive approaches such as Pseudo-autoregressive Codec[7] explore hybrid generation paradigms that relax strict left-to-right ordering. Self-Speculative Decoding[0] sits squarely within the any-subset branch, sharing the flexible factorization philosophy of Any-Subset Autoregressive[3] but distinguished by its self-contained speculative mechanism, avoiding the overhead of separate draft models seen in some parallel inference strategies.

Claimed Contributions

Any-Subset Speculative Decoding (ASSD) algorithm

The authors introduce ASSD, a novel algorithm that enables parallel token generation from any-subset autoregressive models while maintaining the correct joint distribution. The algorithm is mathematically guaranteed never to require more function evaluations than the number of tokens generated, and can handle exponentially more infilling patterns than traditional speculative decoding.

10 retrieved papers
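The accept/reject core that such a verification step rests on can be sketched generically. The snippet below is a minimal sketch of the standard speculative sampling correction (accept a drafted token with probability min(1, p/q), otherwise resample from the normalized residual), not ASSD's exact procedure; all function and variable names here are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_verify(draft_tokens, q_probs, p_probs):
    """Accept or reject drafted tokens so the output is distributed
    exactly according to the target distribution p.

    draft_tokens: proposed token ids, one per drafted position
    q_probs[i]:   draft distribution at position i (vocab-sized array)
    p_probs[i]:   target distribution at position i (vocab-sized array)
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept the drafted token with probability min(1, p(tok)/q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
        else:
            # On rejection, resample from the residual max(0, p - q),
            # renormalized; this makes the marginal of the emitted token
            # exactly p, which is why the scheme is lossless.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(p), p=residual)))
            break  # later drafts were conditioned on the rejected token
    return accepted
```

When the draft and target distributions coincide, every token is accepted, which is what makes self-drafting attractive: the closer the parallel proposal is to the joint target, the fewer verification rounds are needed.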
Mathematically justified training scheme for AS-ARMs

The authors develop a principled training objective based on joint conditional probability maximization with expectations over token orderings and prompt lengths. This training scheme is derived from reversing a discrete-time Markov chain and differs from conditionally independent losses used in prior work.

10 retrieved papers
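The report summarizes but does not reproduce the objective itself. One plausible form of such a joint conditional likelihood, with notation assumed here rather than taken from the paper (σ a random permutation of the L positions, k a random prompt length), is:

```latex
% Hypothetical notation, not taken from the paper:
% sigma: random ordering of positions 1..L, k: random prompt length.
\mathcal{L}(\theta)
  = \mathbb{E}_{\sigma}\, \mathbb{E}_{k}
    \Big[\, \log p_\theta\big( x_{\sigma(k+1:L)} \mid x_{\sigma(1:k)} \big) \Big]
```

The key contrast with conditionally independent losses is that the term inside the expectation is a joint log-probability over all remaining positions, rather than a sum of per-position marginals conditioned only on the prompt.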
Architectural design criteria for AS-ARMs supporting parallel sampling and density estimation

The authors establish architectural requirements for AS-ARMs that enable both parallel token generation through arbitrary positional queries and single-step joint density estimation via causal-like attention masking. These design principles allow AS-ARMs to serve as both draft and oracle models simultaneously.

10 retrieved papers
Can Refute
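The "causal-like attention masking" criterion can be illustrated with a small sketch: given an arbitrary generation order, one can build a mask under which each token attends only to tokens generated no later than itself, so a single forward pass scores the whole ordered sequence jointly, mirroring how a standard causal mask enables single-pass density estimation in left-to-right models. This is our illustration of the general principle, not the paper's implementation; the helper name and conventions are hypothetical.

```python
import numpy as np

def order_causal_mask(order):
    """Build a causal-like attention mask for an arbitrary generation
    order (hypothetical helper; names are ours, not the paper's).

    order: a permutation of all L sequence positions, listed in the
           order they are conditioned on / generated.
    Returns an L x L boolean matrix where mask[i, j] = True means
    sequence position i may attend to sequence position j.
    """
    L = len(order)
    rank = np.empty(L, dtype=int)
    rank[np.asarray(order)] = np.arange(L)  # each position's rank in the order
    # Position i attends to position j iff j is generated no later than i;
    # under this mask, one forward pass yields the log-probability of every
    # token given its predecessors in the chosen order.
    return rank[:, None] >= rank[None, :]
```

With the identity order this reduces to the usual lower-triangular causal mask, which is the sense in which the masking is "causal-like" for arbitrary orders.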

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Any-Subset Speculative Decoding (ASSD) algorithm

The authors introduce ASSD, a novel algorithm that enables parallel token generation from any-subset autoregressive models while maintaining the correct joint distribution. The algorithm is mathematically guaranteed never to require more function evaluations than the number of tokens generated, and can handle exponentially more infilling patterns than traditional speculative decoding.

Contribution

Mathematically justified training scheme for AS-ARMs

The authors develop a principled training objective based on joint conditional probability maximization with expectations over token orderings and prompt lengths. This training scheme is derived from reversing a discrete-time Markov chain and differs from conditionally independent losses used in prior work.

Contribution

Architectural design criteria for AS-ARMs supporting parallel sampling and density estimation

The authors establish architectural requirements for AS-ARMs that enable both parallel token generation through arbitrary positional queries and single-step joint density estimation via causal-like attention masking. These design principles allow AS-ARMs to serve as both draft and oracle models simultaneously.