ATTS: Asynchronous Test-Time Scaling via Conformal Prediction

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Conformal Prediction, Test-Time Scaling, Speculative Decoding
Abstract:

Large language models (LLMs) benefit from test-time scaling but are often hampered by high inference latency. Speculative decoding is a natural way to accelerate the scaling process; however, scaling along both the parallel and sequential dimensions poses significant challenges, including substantial memory-bound execution and synchronization overhead. We introduce ATTS (Asynchronous Test-Time Scaling), a statistically guaranteed adaptive scaling framework that follows a hypothesis-testing procedure to address these challenges. By revisiting arithmetic intensity, ATTS identifies synchronization as the primary bottleneck. It enables asynchronous inference through online calibration and proposes an ordinal classification algorithm that supports a three-stage rejection sampling pipeline, scaling along both the sequential and parallel axes. Across experiments on the MATH, AMC23, AIME24, and AIME25 datasets and across multiple draft–target model families, we show that ATTS delivers up to a 56.7x speedup in test-time scaling and a 4.14x throughput improvement, while maintaining accurate control of the rejection rate, reducing latency and memory overhead, and incurring no accuracy loss. By scaling in both the parallel and sequential dimensions, we enable the 1.5B/70B draft/target model combination to match the performance of the state-of-the-art reasoning model o3-mini (high) on the AIME dataset. An anonymous repository is available at anonymous.4open.science/r/Asynchronous-Test-Time-Scaling-5940.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces ATTS, an asynchronous test-time scaling framework combining adaptive compute allocation with speculative decoding acceleration. It resides in the 'Adaptive and Resource-Aware Scaling Strategies' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader test-time scaling landscape. This leaf focuses specifically on dynamic resource allocation based on query complexity or confidence calibration, distinguishing it from fixed-budget parallel sampling or sequential reasoning approaches that populate neighboring branches.

The taxonomy reveals that ATTS sits at the intersection of two major research threads. Its parent branch, 'Test-Time Scaling Methodologies and Frameworks', encompasses parallel sampling, sequential reasoning, and theoretical scaling laws, while the sibling branch 'Speculative and Accelerated Decoding Techniques' addresses latency reduction through draft models and multi-head architectures. ATTS bridges these domains by applying adaptive resource allocation to speculative decoding, a combination less explored than either approach in isolation. Neighboring leaves contain substantially more papers, suggesting that adaptive strategies integrated with acceleration techniques represent an emerging rather than saturated direction.

Among the eleven candidates examined through a limited semantic search, the contribution-level analysis found no clear refutations. The conformal prediction for ordinal classification contribution was compared against ten candidates, none of which supplied overlapping prior work, while the three-stage rejection sampling pipeline was compared against a single candidate, also without refutation. The asynchronous arithmetic intensity metric was not matched against any candidate within this search scope. These statistics reflect a focused rather than exhaustive search: within the examined subset, the specific technical combinations appear distinct, though the broader literature may contain related ideas not captured here.

Based on the limited search scope of eleven semantically similar papers, the work appears to occupy a relatively novel position combining adaptive scaling with asynchronous speculative decoding. The sparse population of its taxonomy leaf and absence of refutations among examined candidates suggest originality, though the analysis does not cover the full breadth of related work in distributed inference systems, hardware-specific optimizations, or domain-specific applications that might employ similar techniques.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 11
Refutable Papers: 0

Research Landscape Overview

Core task: Accelerating test-time scaling in large language models. The field has evolved into a rich ecosystem of complementary approaches, organized around several major branches. Test-Time Scaling Methodologies and Frameworks establish foundational techniques for adaptive compute allocation and resource-aware strategies, exemplified by works like ATTS[0] and Self-Calibration[32]. Speculative and Accelerated Decoding Techniques focus on reducing latency through methods such as draft-and-verify schemes (Speculative Sampling[9], Medusa[15]) and staged pipelines (Staged Speculative Decoding[44]). Domain-Specific Applications tailor these ideas to particular problem settings, while Efficient Inference Systems and Serving Infrastructure address deployment challenges at scale (Fast Distributed Inference[1], ServerlessLLM[42]). Model Compression and Efficient Architectures pursue orthogonal gains through pruning and quantization (Model Compression Survey[8]), and Test-Time Adaptation and Learning explores dynamic model updates (Test-Time Learning[7]). Cross-Domain and Multimodal Extensions, Hardware Acceleration, and Surveys round out the taxonomy, reflecting both breadth and depth in the literature.

A particularly active line of work centers on adaptive and resource-aware scaling strategies, where the central question is how to allocate compute budgets intelligently across different test instances. ATTS[0] sits squarely in this branch, emphasizing dynamic resource allocation to balance speed and accuracy. Nearby, Self-Calibration[32] and AgentTTS[36] explore similar themes of runtime adaptation, though they differ in whether they prioritize calibration mechanisms or agent-based orchestration. In contrast, works like Scaling Test-Time Compute[10] and Scaling Test-Time Reasoning[14] investigate the theoretical and empirical limits of investing more compute at inference time, often without the fine-grained resource awareness that ATTS[0] targets.
Meanwhile, speculative decoding methods (Accelerated Speculative Sampling[21], Eagle-3[25]) offer complementary speed gains but typically assume fixed compute budgets rather than adaptive allocation. The interplay between these branches highlights ongoing trade-offs: whether to optimize for worst-case latency, average-case efficiency, or task-specific performance, and how to integrate adaptive strategies with existing acceleration techniques.

Claimed Contributions

Asynchronous arithmetic intensity metric

The authors introduce a novel variant of arithmetic intensity that accounts for synchronization overhead in test-time scaling. This metric helps identify synchronization as the primary bottleneck when scaling LLM inference along both parallel and sequential dimensions.

0 retrieved papers
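Classic arithmetic intensity is FLOPs per byte of memory traffic; the description above suggests the asynchronous variant additionally charges time lost waiting at synchronization barriers. As an illustration only (the paper's exact formula is not reproduced in this report, and all names below are hypothetical), a minimal sketch of that accounting:

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Classic roofline-style arithmetic intensity: FLOPs per byte of traffic."""
    return flops / bytes_moved


def effective_flops_per_sec(flops: float, compute_time_s: float, sync_time_s: float) -> float:
    """Throughput once synchronization stalls are charged: as sync_time_s grows,
    effective throughput collapses even when the kernel itself is fast, which is
    how synchronization shows up as the primary bottleneck."""
    return flops / (compute_time_s + sync_time_s)
```

Under this accounting, halving the synchronization time can matter more than speeding up the kernel itself, consistent with the claim that synchronization, not compute, dominates when scaling along both axes.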
Conformal prediction for ordinal classification in rejection sampling

The authors reformulate the task of constructing prediction sets as an ordinal classification problem using conformal prediction. This approach avoids normalization and global ranking operations, enabling asynchronous inference through online calibration and providing distribution-free coverage guarantees.

10 retrieved papers
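To make the idea concrete: in split conformal prediction one computes nonconformity scores on a calibration set, takes a finite-sample-corrected quantile, and keeps every label whose score falls under that threshold; for ordinal labels the kept set can be closed into a contiguous range. The following is a generic sketch of that textbook recipe, not the paper's algorithm, and the function names are illustrative:

```python
import math

def conformal_threshold(cal_scores: list[float], alpha: float) -> float:
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores.
    Taking the ceil((n+1)(1-alpha))-th smallest score yields the standard
    distribution-free coverage guarantee of at least 1 - alpha."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(cal_scores)[min(k, n) - 1]

def ordinal_prediction_set(probs: list[float], qhat: float) -> list[int]:
    """Keep every label whose nonconformity 1 - p(label) <= qhat, then close
    the kept labels into a contiguous range so the set respects the label
    ordering (an ordinal-classification prediction set)."""
    kept = [k for k, p in enumerate(probs) if 1.0 - p <= qhat]
    return list(range(min(kept), max(kept) + 1)) if kept else []
```

Because the threshold depends only on a running collection of calibration scores, it can be updated online as new verified samples arrive, which is the property that lets calibration proceed without the normalization and global ranking steps the contribution says it avoids.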
ATTS framework with three-stage rejection sampling pipeline

The authors present ATTS, a framework that combines asynchronous inference with a three-stage rejection sampling pipeline (draft model sampling, verification, and target model sampling). This framework scales along both sequential and parallel axes while maintaining accurate control of rejection rates and reducing latency and memory overhead.

1 retrieved paper
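The three stages named above can be sketched as a simple control flow: the draft model proposes candidates cheaply, a verifier filters them, and the expensive target model is consulted only when every draft is rejected. This is an illustrative skeleton under those assumptions; the callables are stand-ins, not the paper's interfaces:

```python
import random

def three_stage_rejection_sampling(prompt, draft_sample, verify, target_sample, n_drafts=8):
    """Stage 1: the draft model proposes n_drafts candidate answers cheaply.
    Stage 2: a verifier accepts or rejects each candidate.
    Stage 3: fall back to the (expensive) target model only when all drafts
    are rejected, so target-model calls scale with the rejection rate."""
    candidates = [draft_sample(prompt) for _ in range(n_drafts)]   # stage 1
    accepted = [c for c in candidates if verify(prompt, c)]        # stage 2
    if accepted:
        return random.choice(accepted)
    return target_sample(prompt)                                   # stage 3
```

Keeping the rejection rate under a calibrated threshold, as the conformal component is described as doing, directly bounds how often stage 3 fires, which is where the claimed latency and memory savings would come from.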

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Asynchronous arithmetic intensity metric

Contribution: Conformal prediction for ordinal classification in rejection sampling

Contribution: ATTS framework with three-stage rejection sampling pipeline