Self-Speculative Decoding Accelerates Lossless Inference in Any-Order and Any-Subset Autoregressive Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: speculative decoding, any-order autoregressive models, diffusion language models
Abstract:

In arbitrary-order language models, it is an open question how to sample tokens in parallel from the correct joint distribution. With discrete diffusion models, the more tokens they generate in parallel, the less their predicted distributions adhere to the originally learned data distribution, as they rely on a conditional independence assumption that holds only in the limit of infinitesimally small timesteps. We find that a different class of models, any-subset autoregressive models (AS-ARMs), holds the solution. As implied by the name, AS-ARMs can generate tokens in any order, and in parallel. Moreover, AS-ARMs support parallelized joint probability density estimation, allowing them to correct their own parallel-generated token distributions via our Any-Subset Speculative Decoding (ASSD) algorithm. ASSD provably enables generation of tokens from the correct joint distribution, with the number of neural network calls upper bounded by the number of tokens predicted; notably, previous speculative decoding algorithms lack this efficiency guarantee. We empirically verify that ASSD speeds up language generation without sacrificing quality. Furthermore, we provide a mathematically justified scheme for training AS-ARMs for generation, and show that AS-ARMs achieve state-of-the-art performance among sub-200M-parameter models on infilling benchmark tasks, and nearly match the performance of models 50X larger on code generation. Our theoretical and empirical results indicate that the once-forgotten AS-ARMs are a promising direction for language modeling.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Any-Subset Speculative Decoding (ASSD), an algorithm enabling parallel token generation from correct joint distributions using any-subset autoregressive models (AS-ARMs). It resides in the 'Any-Subset Autoregressive Models with Speculative Decoding' leaf, which contains only two papers including this one. This represents a relatively sparse research direction within the broader taxonomy of parallel generation methods, suggesting the work addresses a focused problem space where few prior solutions exist. The sibling paper in this leaf shares the any-subset architecture philosophy but differs in its speculative correction mechanism.

The taxonomy reveals neighboring approaches in adjacent branches: 'Any-Order Generation Without Speculative Correction' explores flexible factorizations without correction mechanisms, while 'Diffusion-Based Parallel Generation Methods' and 'Dynamic Multi-Token Prediction Strategies' pursue parallel sampling through fundamentally different paradigms—iterative denoising and confidence-based adaptive prediction, respectively. The paper's position bridges architectural flexibility (any-subset capability) with algorithmic guarantees (speculative decoding), distinguishing it from purely architectural contributions in sibling leaves and from diffusion methods that rely on conditional independence assumptions the authors explicitly critique.

Among 30 candidates examined, the ASSD algorithm contribution shows no clear refutation across 10 examined papers, suggesting novelty in the specific speculative decoding formulation with efficiency guarantees. The training scheme contribution similarly lacks refutable prior work among 10 candidates. However, the architectural design criteria contribution encountered one refutable candidate among 10 examined, indicating some overlap with existing AS-ARM architectural principles. The limited search scope (30 total candidates, not hundreds) means these findings reflect top-semantic-match results rather than exhaustive coverage, particularly relevant given the sparse two-paper leaf this work occupies.

Based on the top-30 semantic matches examined, the work appears to introduce novel algorithmic contributions (ASSD with provable efficiency bounds) within an emerging architectural paradigm (AS-ARMs). The single refutable pair for architectural criteria suggests partial overlap with foundational AS-ARM design principles, while the algorithm and training scheme show no clear precedent in the examined literature. The sparse taxonomy leaf and limited sibling papers reinforce that this represents early-stage exploration of a specific solution space.

Taxonomy

Core-task Taxonomy Papers: 19
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: parallel token generation from joint distributions in autoregressive models. The field addresses the fundamental tension between the sequential nature of autoregressive generation and the desire for faster, parallel inference. The taxonomy reveals several complementary strategies: Any-Order and Any-Subset Autoregressive Architectures explore flexible factorizations that permit generating tokens in arbitrary orders or subsets, enabling speculative or adaptive decoding schemes. Diffusion-Based Parallel Generation Methods blend diffusion processes with autoregressive structures to sample multiple tokens jointly. Dynamic Multi-Token Prediction Strategies focus on predicting variable numbers of tokens per step, adapting generation granularity on the fly. Autoregressive Models with Parallel Inference Mechanisms encompass techniques like speculative decoding and lookahead methods that maintain the autoregressive framework while parallelizing computation. Domain-Specific Applications demonstrate these ideas in specialized contexts such as audio codecs or retrieval-augmented generation, while Survey and Methodological Overview Literature and Counterfactual Explanation Methods provide broader perspectives and niche applications of sampling-based reasoning.

A particularly active line of work centers on any-subset autoregressive models, which train networks to handle arbitrary token subsets and enable speculative decoding without auxiliary draft models. Self-Speculative Decoding[0] exemplifies this approach by using the model's own any-subset capabilities to propose and verify multiple tokens in parallel, closely aligning with Any-Subset Autoregressive[3], which formalizes the theoretical underpinnings of subset-based factorizations. In contrast, Adaptive Parallel Decoding[2] and DynaMo[8] emphasize dynamic adjustment of the number of tokens predicted per step, trading off between parallelism and accuracy based on model confidence.
Meanwhile, diffusion-inspired methods like Guided Autoregressive Diffusion[6] and pseudo-autoregressive approaches such as Pseudo-autoregressive Codec[7] explore hybrid generation paradigms that relax strict left-to-right ordering. Self-Speculative Decoding[0] sits squarely within the any-subset branch, sharing the flexible factorization philosophy of Any-Subset Autoregressive[3] but distinguished by its self-contained speculative mechanism, avoiding the overhead of separate draft models seen in some parallel inference strategies.

Claimed Contributions

Any-Subset Speculative Decoding (ASSD) algorithm

The authors introduce ASSD, a novel algorithm that enables parallel token generation from any-subset autoregressive models while maintaining the correct joint distribution. The algorithm is mathematically guaranteed never to require more function evaluations than the number of tokens generated, and can handle exponentially more infilling patterns than traditional speculative decoding.

10 retrieved papers
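The accept/reject core that such a verification step rests on can be sketched generically. The snippet below is a minimal sketch of the standard speculative sampling correction (accept a drafted token with probability min(1, p/q), otherwise resample from the normalized residual), not ASSD's exact procedure; all function and variable names here are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_verify(draft_tokens, q_probs, p_probs):
    """Accept or reject drafted tokens so the output is distributed
    exactly according to the target distribution p.

    draft_tokens: proposed token ids, one per drafted position
    q_probs[i]:   draft distribution at position i (vocab-sized array)
    p_probs[i]:   target distribution at position i (vocab-sized array)
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept the drafted token with probability min(1, p(tok)/q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
        else:
            # On rejection, resample from the residual max(0, p - q),
            # renormalized; this makes the marginal of the emitted token
            # exactly p, which is why the scheme is lossless.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(p), p=residual)))
            break  # later drafts were conditioned on the rejected token
    return accepted
```

When the draft and target distributions coincide, every token is accepted, which is what makes self-drafting attractive: the closer the parallel proposal is to the joint target, the fewer verification rounds are needed.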
Mathematically justified training scheme for AS-ARMs

The authors develop a principled training objective based on joint conditional probability maximization with expectations over token orderings and prompt lengths. This training scheme is derived from reversing a discrete-time Markov chain and differs from conditionally independent losses used in prior work.

10 retrieved papers
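The report summarizes but does not reproduce the objective itself. One plausible form of such a joint conditional likelihood, with notation assumed here rather than taken from the paper (σ a random permutation of the L positions, k a random prompt length), is:

```latex
% Hypothetical notation, not taken from the paper:
% sigma: random ordering of positions 1..L, k: random prompt length.
\mathcal{L}(\theta)
  = \mathbb{E}_{\sigma}\, \mathbb{E}_{k}
    \Big[\, \log p_\theta\big( x_{\sigma(k+1:L)} \mid x_{\sigma(1:k)} \big) \Big]
```

The key contrast with conditionally independent losses is that the term inside the expectation is a joint log-probability over all remaining positions, rather than a sum of per-position marginals conditioned only on the prompt.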
Architectural design criteria for AS-ARMs supporting parallel sampling and density estimation

The authors establish architectural requirements for AS-ARMs that enable both parallel token generation through arbitrary positional queries and single-step joint density estimation via causal-like attention masking. These design principles allow AS-ARMs to serve as both draft and oracle models simultaneously.

10 retrieved papers
Can Refute
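The "causal-like attention masking" criterion can be illustrated with a small sketch: given an arbitrary generation order, one can build a mask under which each token attends only to tokens generated no later than itself, so a single forward pass scores the whole ordered sequence jointly, mirroring how a standard causal mask enables single-pass density estimation in left-to-right models. This is our illustration of the general principle, not the paper's implementation; the helper name and conventions are hypothetical.

```python
import numpy as np

def order_causal_mask(order):
    """Build a causal-like attention mask for an arbitrary generation
    order (hypothetical helper; names are ours, not the paper's).

    order: a permutation of all L sequence positions, listed in the
           order they are conditioned on / generated.
    Returns an L x L boolean matrix where mask[i, j] = True means
    sequence position i may attend to sequence position j.
    """
    L = len(order)
    rank = np.empty(L, dtype=int)
    rank[np.asarray(order)] = np.arange(L)  # each position's rank in the order
    # Position i attends to position j iff j is generated no later than i;
    # under this mask, one forward pass yields the log-probability of every
    # token given its predecessors in the chosen order.
    return rank[:, None] >= rank[None, :]
```

With the identity order this reduces to the usual lower-triangular causal mask, which is the sense in which the masking is "causal-like" for arbitrary orders.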

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Any-Subset Speculative Decoding (ASSD) algorithm

The authors introduce ASSD, a novel algorithm that enables parallel token generation from any-subset autoregressive models while maintaining the correct joint distribution. The algorithm is mathematically guaranteed never to require more function evaluations than the number of tokens generated, and can handle exponentially more infilling patterns than traditional speculative decoding.

Contribution

Mathematically justified training scheme for AS-ARMs

The authors develop a principled training objective based on joint conditional probability maximization with expectations over token orderings and prompt lengths. This training scheme is derived from reversing a discrete-time Markov chain and differs from conditionally independent losses used in prior work.

Contribution

Architectural design criteria for AS-ARMs supporting parallel sampling and density estimation

The authors establish architectural requirements for AS-ARMs that enable both parallel token generation through arbitrary positional queries and single-step joint density estimation via causal-like attention masking. These design principles allow AS-ARMs to serve as both draft and oracle models simultaneously.