Global Resolution: Optimal Multi-Draft Speculative Sampling via Convex Optimization

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: LLMs, Inference, Optimal Transport, Speculative Decoding
Abstract:

Speculative sampling reduces the latency of autoregressive decoding for a target LLM without sacrificing inference quality, by using a cheap draft model to suggest a candidate token and a verification criterion to accept or resample this token. To improve acceptance and decoding efficiency, recent work has explored the multi-draft extension, where at each step n draft tokens are generated, and the verification criterion is a distribution conditioned on these. When this criterion maximizes the probability of accepting some draft token, it is called the optimal transport (OT). However, finding the OT is difficult, as it is the solution of a linear program (OTLP) in over V^n variables, with V being the vocabulary size. Two recent theoretical works have reframed the OTLP in terms of importance sampling or subset selection. In this work, we prove that these formulations are equivalent to an exponentially large relaxed OTLP, so it remains infeasible to solve. Then, we reverse engineer subset selection to formulate the OTLP as a max-flow problem. With a novel application of polymatroid theory, we reduce the exponentially large OTLP to a convex optimization problem in at most V variables. This allows us to devise an algorithm for optimal n-draft speculative sampling when the n tokens are chosen i.i.d. from a single draft model, which can be tuned to arbitrary accuracy. Finally, we measure acceptance rates and algorithm runtimes for various n and top-k draft sampling settings. Our findings give the first multi-draft algorithm with 90% acceptance and under 100 ms of overhead per generated token with negligible deviation from the target model distribution.
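As background for the single-draft case the abstract builds on: the standard verification criterion accepts the draft token x ~ q with probability min(1, p(x)/q(x)) and otherwise resamples from the normalized residual (p - q)_+, which leaves the output exactly distributed as the target p. A minimal sketch over a toy vocabulary (illustrative, not the paper's code):

```python
import random

def speculative_step(p, q, rng):
    """One single-draft speculative sampling step over vocabulary {0..V-1}.

    p: target distribution, q: draft distribution (lists of floats).
    Returns a token distributed exactly according to p.
    """
    V = len(p)
    x = rng.choices(range(V), weights=q, k=1)[0]   # draft token ~ q
    if rng.random() < min(1.0, p[x] / q[x]):       # accept w.p. min(1, p/q)
        return x
    # rejected: resample from the residual (p - q)_+; choices() normalizes
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(range(V), weights=residual, k=1)[0]
```

The acceptance probability of this step is the total variation overlap, sum over x of min(p(x), q(x)), which is the quantity the multi-draft extension tries to push toward 1.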

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 10
Claimed Contributions: 3
Contribution Candidate Papers Compared: 0
Refutable Papers: 0

Research Landscape Overview

Core task: optimal multi-draft speculative sampling via convex optimization. The field addresses how to accelerate large language model inference by generating multiple candidate token sequences (drafts) from cheaper models and then verifying them against a target model in parallel.

The taxonomy organizes research into several main branches. Optimal Verification and Acceptance Criteria focuses on formulating the acceptance decision as a mathematical optimization problem, often using convex methods to determine which drafts to accept. Adaptive Draft Structure and Policy Optimization explores how to dynamically adjust the number, length, or arrangement of drafts based on runtime feedback or learned policies. Multi-Model and Batch Inference Optimization examines scenarios involving multiple draft models or batched requests, seeking to balance computational resources across heterogeneous workloads. Randomized Drafting Strategies investigates stochastic approaches to draft generation, trading deterministic guarantees for potential efficiency gains. Together, these branches reflect a progression from foundational acceptance rules to more flexible, context-aware sampling schemes.

A particularly active line of work centers on convex formulations for acceptance, where Global Resolution Convex Optimization[0] and its close neighbor Global Resolution Convex Minimization[1] rigorously frame the verification step as a convex program, enabling provably optimal token acceptance under distributional constraints. Nearby, Optimal Multi-Draft Decoding[2] and SpecHub[3] explore related optimization perspectives, while works like OPT-Tree[4] and Multi-Draft Canonical Decomposition[5] investigate structured draft arrangements that can be optimized jointly.
In contrast, Hierarchical Speculative Decoding[7] and Group Tree Optimization[8] emphasize adaptive, tree-based policies that adjust draft topology on the fly, and Randomised Drafting[10] introduces stochastic elements to escape local optima. The original paper, Global Resolution Convex Optimization[0], sits squarely within the convex optimization branch, offering a principled mathematical framework for acceptance decisions. Compared to Global Resolution Convex Minimization[1], which shares a similar optimization foundation, and SpecHub[3], which also targets verification efficiency, the original work emphasizes a global resolution perspective that unifies multiple drafts under a single convex objective, distinguishing it from more heuristic or tree-centric approaches.

Claimed Contributions

Equivalence of canonical decomposition and relaxed OTLP

The authors prove that the canonical decomposition approach by Khisti et al. (2025) and the subset selection formulation by Hu et al. (2025) are mathematically equivalent to solving a relaxed optimal transport linear program. This unifies two recent theoretical frameworks for multi-draft speculative sampling.

0 retrieved papers
Max-flow and convex optimization reduction of OTLP

The authors reformulate the optimal transport linear program as a max-flow problem and then apply polymatroid theory to reduce it to a tractable convex optimization problem with at most V variables, enabling efficient computation of the optimal verification criterion.

0 retrieved papers
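The max-flow view can be sanity-checked by brute force on toy instances. Building the natural bipartite network (draft n-tuples on one side, tokens with capacity p(y) on the other, with an edge wherever the token appears in the tuple), max-flow min-cut gives the optimal acceptance probability for n i.i.d. drafts from q as the minimum over subsets S of p(S) + 1 - q(S)^n; for n = 1 this recovers the classic single-draft rate sum over x of min(p(x), q(x)). The sketch below is our reading of that reduction, not the paper's algorithm, and it enumerates all 2^V subsets, which is exactly the exponential cost the paper's polymatroid reduction to at most V variables avoids:

```python
from itertools import combinations

def optimal_acceptance(p, q, n):
    """Brute-force min-cut: max P(accept) for n i.i.d. drafts from q, target p.

    Evaluates min over all subsets S of p(S) + 1 - q(S)**n. Exponential in the
    vocabulary size V, so only usable on toy instances.
    """
    V = len(p)
    best = 1.0  # empty set and full vocabulary both give cut value 1
    for r in range(V + 1):
        for S in combinations(range(V), r):
            pS = sum(p[i] for i in S)
            qS = sum(q[i] for i in S)
            best = min(best, pS + 1.0 - qS ** n)
    return best
```

Since q(S)^n is non-increasing in n for every fixed S, the resulting acceptance probability is non-decreasing in the number of drafts, consistent with the intuition that more drafts can only help.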
Global resolution algorithm for optimal multi-draft speculative sampling

The authors develop the global resolution algorithm that computes the optimal transport solution for i.i.d. multi-draft speculative sampling to arbitrary accuracy. The algorithm achieves over 90% acceptance rates with under 100 ms overhead per token, representing the first practical multi-draft method with such performance.

0 retrieved papers
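For context on what "optimal" buys here: a simple exact (but generally suboptimal) multi-draft verifier is recursive rejection sampling, which applies the single-draft accept rule to each of the n drafts in turn against a shrinking residual target. The sketch below is an illustrative baseline, not the paper's global resolution algorithm; by induction over the drafts its output is distributed exactly as the target p, but its acceptance rate is in general below the optimal transport rate the paper computes.

```python
import random

def rrs_step(p, q, n, rng):
    """Recursive rejection sampling over n i.i.d. drafts from q, target p.

    Verifies drafts one at a time against the (unnormalized) residual target;
    the returned token is exactly distributed as p.
    """
    V = len(p)
    target = list(p)  # unnormalized residual target, starts at p
    for _ in range(n):
        x = rng.choices(range(V), weights=q, k=1)[0]  # next draft ~ q
        z = sum(target)
        if rng.random() < min(1.0, target[x] / (z * q[x])):
            return x  # draft accepted
        # subtract the covered mass and continue with the residual target
        target = [max(t - z * qi, 0.0) for t, qi in zip(target, q)]
    # all n drafts rejected: sample from the final residual
    return rng.choices(range(V), weights=target, k=1)[0]
```

Comparing this baseline's empirical acceptance rate against the optimal rate reported in the paper is one way to quantify the gap the convex-optimization solution closes.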

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the currently retrieved top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Equivalence of canonical decomposition and relaxed OTLP

No candidate papers were retrieved for this contribution, so no overlap comparison could be performed.

Contribution

Max-flow and convex optimization reduction of OTLP

No candidate papers were retrieved for this contribution, so no overlap comparison could be performed.

Contribution

Global resolution algorithm for optimal multi-draft speculative sampling

No candidate papers were retrieved for this contribution, so no overlap comparison could be performed.
