ROC-n-reroll: How verifier imperfection affects test-time scaling
Overview
Overall Novelty Assessment
The paper provides a theoretical characterization of test-time scaling methods (Best-of-N and Rejection Sampling) through the geometry of verifier ROC curves, proving that RS outperforms BoN at fixed compute and establishing an impossibility result for extrapolating high-compute performance. Within the taxonomy, it resides in the 'Verifier Imperfection and Scaling Limits' leaf under 'Theoretical Foundations and Scaling Laws', alongside only two sibling papers. This leaf represents a relatively sparse but foundational research direction, focusing specifically on how verifier noise constrains scaling rather than on method development or empirical optimization.
The taxonomy reveals that most neighboring work falls into adjacent branches: 'Optimality and Comparative Analysis of Scaling Strategies' examines resource-constrained comparisons without the verifier-imperfection focus, while 'Probabilistic and Statistical Frameworks' formalizes scaling as inference problems. The broader 'Verifier Design and Training' branch (containing process reward models and ensemble methods) addresses improving verifier quality rather than analyzing fundamental limits given imperfection. The paper's theoretical lens on verifier error propagation distinguishes it from the empirical, method-driven work dominating sibling branches like 'Sampling and Search Strategies' or 'Sequential Refinement'.
Among the twenty-two candidates examined through semantic search, none clearly refutes the three core contributions. For the ROC-curve characterization, two candidates were examined and no overlaps were found; for the RS-versus-BoN optimality claim, ten candidates yielded no refutations; for the extrapolation impossibility result, ten more candidates revealed no prior work establishing this negative result. This suggests that the ROC-geometry framing and the specific impossibility claim may be novel within the examined literature, though the small candidate pool (twenty-two papers) means potentially relevant work outside the top semantic matches could exist.
The analysis indicates the paper occupies a theoretically oriented niche within a field otherwise dominated by algorithmic development and empirical evaluation. The sparse population of its taxonomy leaf and absence of refutations among examined candidates suggest substantive novelty in its formal approach, though the limited search scale (twenty-two candidates from semantic retrieval) leaves open the possibility of overlooked prior work in adjacent mathematical or theoretical communities not captured by the search strategy.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors establish that for both Rejection Sampling and Best-of-N methods, the accuracy at a given query depends only on the generator's initial accuracy and the verifier's ROC curve. This provides a complete theoretical framework connecting verifier imperfection to test-time scaling performance.
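For rejection sampling, this dependence has a simple closed form at a single verifier operating point: if the generator is correct with probability p and the verifier sits at ROC point (fpr, tpr), conditioning on acceptance gives accuracy p·tpr / (p·tpr + (1 − p)·fpr). A minimal sketch of that identity (the function name and the numbers are illustrative, not taken from the paper):

```python
def rs_accuracy(p, tpr, fpr):
    """Rejection-sampling accuracy at one verifier operating point.

    p   : generator's initial accuracy at the query
    tpr : verifier true-positive rate  (a point on its ROC curve)
    fpr : verifier false-positive rate (the same ROC point)
    """
    # Condition on the verifier accepting: correct accepts over all accepts.
    return p * tpr / (p * tpr + (1 - p) * fpr)

# A weak generator (p = 0.3) filtered by a decent verifier
# (tpr = 0.9, fpr = 0.1) reaches roughly 79% accuracy.
```

Note that an uninformative operating point (tpr = fpr) leaves accuracy at the generator's baseline p, matching the intuition that only the generator's initial accuracy and the ROC geometry enter the result.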
The authors prove that Rejection Sampling achieves higher accuracy than Best-of-N when controlling for compute budget, provided the verifier has a concave ROC curve. However, both methods reach identical performance as compute approaches infinity.
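One way to see this claim in action is a small Monte Carlo sketch. The model below is an assumption for illustration, not the paper's construction: the generator is correct with probability p, and the verifier scores correct answers as N(1, 1) and incorrect ones as N(0, 1), which yields a concave binormal ROC. BoN keeps the highest-scoring of n draws; RS redraws until the score clears a threshold tau, chosen here so RS's average compute roughly matches n.

```python
import random

def compare_rs_bon(p=0.3, n=4, tau=1.05, trials=20000, seed=0):
    """Monte Carlo sketch: Best-of-N vs Rejection Sampling under a noisy
    score-based verifier (illustrative parameters, not from the paper)."""
    rng = random.Random(seed)

    def draw():
        correct = rng.random() < p          # generator accuracy p
        mu = 1.0 if correct else 0.0
        return correct, rng.gauss(mu, 1.0)  # verifier score for this sample

    bon_hits = rs_hits = rs_draws = 0
    for _ in range(trials):
        # Best-of-N: n draws, keep the sample with the highest score.
        bon_hits += max((draw() for _ in range(n)), key=lambda s: s[1])[0]
        # Rejection sampling: redraw until the score clears tau.
        while True:
            rs_draws += 1
            correct, score = draw()
            if score > tau:
                rs_hits += correct
                break
    return bon_hits / trials, rs_hits / trials, rs_draws / trials
```

With these illustrative settings (p = 0.3, n = 4, tau = 1.05), RS's accuracy comes out several points above BoN's at approximately equal average cost, consistent with the fixed-compute claim.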
The authors prove that observing test-time scaling behavior at low compute levels does not allow reliable prediction of performance at high compute levels. This holds for both RS and BoN, even when assuming concave ROC curves.
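A toy calculation in the spirit of this result (the two ROC curves below are hypothetical): using the rejection-sampling identity that accuracy at ROC point (f, t) is p·t / (p·t + (1 − p)·f) with expected compute 1 / (p·t + (1 − p)·f), one can build two concave ROC curves that are nearly indistinguishable at lax thresholds, yet have different slopes at the origin, and it is that slope that controls the high-compute limit.

```python
import math

def rs_point(p, f, t):
    """RS accuracy and expected draws per accepted answer when the
    verifier operates at ROC point (f, t) = (FPR, TPR)."""
    accept = p * t + (1 - p) * f          # per-draw acceptance probability
    return p * t / accept, 1.0 / accept   # (accuracy, expected compute)

# Two hypothetical concave ROC curves: nearly identical at lax
# thresholds, but with different slopes at the origin.
roc_steep = lambda f: math.sqrt(f)   # infinite slope at 0: accuracy -> 1
roc_flat = lambda f: f * (2.0 - f)   # slope 2 at 0: accuracy plateaus

p = 0.3
# Lax threshold (f = 0.5, low compute): the verifiers look almost alike.
acc_s_low, _ = rs_point(p, 0.5, roc_steep(0.5))
acc_f_low, _ = rs_point(p, 0.5, roc_flat(0.5))
# Strict threshold (f = 1e-4, high compute): their limits diverge.
acc_s_hi, _ = rs_point(p, 1e-4, roc_steep(1e-4))
acc_f_hi, _ = rs_point(p, 1e-4, roc_flat(1e-4))
```

Here the two verifiers differ by under two accuracy points at low compute, while their high-compute limits are roughly 0.46 versus 1, so low-compute observations alone cannot separate them.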
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[19] Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect Verifiers
[30] Test-time Verification via Optimal Transport: Coverage, ROC, & Sub-optimality
Contribution Analysis
Detailed comparisons for each claimed contribution
Theoretical characterization of test-time scaling via ROC curves
The authors establish that for both Rejection Sampling and Best-of-N methods, the accuracy at a given query depends only on the generator's initial accuracy and the verifier's ROC curve. This provides a complete theoretical framework connecting verifier imperfection to test-time scaling performance.
Proof that RS outperforms BoN at fixed compute with concave ROC curves
The authors prove that Rejection Sampling achieves higher accuracy than Best-of-N when controlling for compute budget, provided the verifier has a concave ROC curve. However, both methods reach identical performance as compute approaches infinity.
[47] Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for LLM Problem-Solving
[54] Detecting and preventing hallucinations in large vision language models
[55] Fast best-of-n decoding via speculative rejection
[56] Accelerating best-of-n via speculative rejection
[57] Is best-of-n the best of them? coverage, scaling, and optimality in inference-time alignment
[58] Optimal Stopping vs Best-of-N for Inference Time Optimization
[59] Diverse Inference and Verification for Advanced Reasoning
[60] On the Query Complexity of Verifier-Assisted Language Generation
[61] STARS: Segment-level Token Alignment with Rejection Sampling in Large Language Models
[62] Best-of-Majority: Minimax-Optimal Strategy for Pass@k Inference Scaling
Impossibility result for extrapolating high-compute performance
The authors prove that observing test-time scaling behavior at low compute levels does not allow reliable prediction of performance at high compute levels. This holds for both RS and BoN, even when assuming concave ROC curves.