ROC-n-reroll: How verifier imperfection affects test-time scaling
Overview
Overall Novelty Assessment
The paper provides a theoretical characterization of test-time scaling methods (Best-of-N and Rejection Sampling) through the geometry of verifier ROC curves, proving that RS outperforms BoN at fixed compute and establishing an impossibility result for extrapolating high-compute performance. Within the taxonomy, it resides in the 'Verifier Imperfection and Scaling Limits' leaf under 'Theoretical Foundations and Scaling Laws', alongside only two sibling papers. This leaf represents a relatively sparse but foundational research direction, focusing specifically on how verifier noise constrains scaling rather than on method development or empirical optimization.
The taxonomy reveals that most neighboring work falls into adjacent branches: 'Optimality and Comparative Analysis of Scaling Strategies' examines resource-constrained comparisons without the verifier-imperfection focus, while 'Probabilistic and Statistical Frameworks' formalizes scaling as inference problems. The broader 'Verifier Design and Training' branch (containing process reward models and ensemble methods) addresses improving verifier quality rather than analyzing fundamental limits given imperfection. The paper's theoretical lens on verifier error propagation distinguishes it from the empirical, method-driven work dominating sibling branches like 'Sampling and Search Strategies' or 'Sequential Refinement'.
Among the twenty-two candidates examined through semantic search, none clearly refutes the three core contributions. For the ROC-curve characterization, two candidates were examined and no overlaps were found; for the RS-versus-BoN optimality claim, ten candidates yielded no refutations; for the extrapolation impossibility result, ten more candidates revealed no prior work establishing this negative result. This suggests that the ROC-geometry framing and the specific impossibility claim may be novel within the examined literature, though the small candidate pool (twenty-two papers) means potentially relevant work outside the top semantic matches could exist.
The analysis indicates the paper occupies a theoretically oriented niche within a field otherwise dominated by algorithmic development and empirical evaluation. The sparse population of its taxonomy leaf and absence of refutations among examined candidates suggest substantive novelty in its formal approach, though the limited search scale (twenty-two candidates from semantic retrieval) leaves open the possibility of overlooked prior work in adjacent mathematical or theoretical communities not captured by the search strategy.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors establish that for both Rejection Sampling and Best-of-N methods, the accuracy at a given query depends only on the generator's initial accuracy and the verifier's ROC curve. This provides a complete theoretical framework connecting verifier imperfection to test-time scaling performance.
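For rejection sampling, this dependence has a simple closed form at a single verifier operating point: if the generator is correct with probability p and the verifier sits at ROC point (fpr, tpr), conditioning on acceptance gives accuracy p·tpr / (p·tpr + (1 − p)·fpr). A minimal sketch of that identity (the function name and the numbers are illustrative, not taken from the paper):

```python
def rs_accuracy(p, tpr, fpr):
    """Rejection-sampling accuracy at one verifier operating point.

    p   : generator's initial accuracy at the query
    tpr : verifier true-positive rate  (a point on its ROC curve)
    fpr : verifier false-positive rate (the same ROC point)
    """
    # Condition on the verifier accepting: correct accepts over all accepts.
    return p * tpr / (p * tpr + (1 - p) * fpr)

# A weak generator (p = 0.3) filtered by a decent verifier
# (tpr = 0.9, fpr = 0.1) reaches roughly 79% accuracy.
```

Note that an uninformative operating point (tpr = fpr) leaves accuracy at the generator's baseline p, matching the intuition that only the generator's initial accuracy and the ROC geometry enter the result.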
The authors prove that Rejection Sampling achieves higher accuracy than Best-of-N when controlling for compute budget, provided the verifier has a concave ROC curve. However, both methods reach identical performance as compute approaches infinity.
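One way to see this claim in action is a small Monte Carlo sketch. The model below is an assumption for illustration, not the paper's construction: the generator is correct with probability p, and the verifier scores correct answers as N(1, 1) and incorrect ones as N(0, 1), which yields a concave binormal ROC. BoN keeps the highest-scoring of n draws; RS redraws until the score clears a threshold tau, chosen here so RS's average compute roughly matches n.

```python
import random

def compare_rs_bon(p=0.3, n=4, tau=1.05, trials=20000, seed=0):
    """Monte Carlo sketch: Best-of-N vs Rejection Sampling under a noisy
    score-based verifier (illustrative parameters, not from the paper)."""
    rng = random.Random(seed)

    def draw():
        correct = rng.random() < p          # generator accuracy p
        mu = 1.0 if correct else 0.0
        return correct, rng.gauss(mu, 1.0)  # verifier score for this sample

    bon_hits = rs_hits = rs_draws = 0
    for _ in range(trials):
        # Best-of-N: n draws, keep the sample with the highest score.
        bon_hits += max((draw() for _ in range(n)), key=lambda s: s[1])[0]
        # Rejection sampling: redraw until the score clears tau.
        while True:
            rs_draws += 1
            correct, score = draw()
            if score > tau:
                rs_hits += correct
                break
    return bon_hits / trials, rs_hits / trials, rs_draws / trials
```

With these illustrative settings (p = 0.3, n = 4, tau = 1.05), RS's accuracy comes out several points above BoN's at approximately equal average cost, consistent with the fixed-compute claim.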
The authors prove that observing test-time scaling behavior at low compute levels does not allow reliable prediction of performance at high compute levels. This holds for both RS and BoN, even when assuming concave ROC curves.
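A toy calculation in the spirit of this result (the two ROC curves below are hypothetical): using the rejection-sampling identity that accuracy at ROC point (f, t) is p·t / (p·t + (1 − p)·f) with expected compute 1 / (p·t + (1 − p)·f), one can build two concave ROC curves that are nearly indistinguishable at lax thresholds, yet have different slopes at the origin, and it is that slope that controls the high-compute limit.

```python
import math

def rs_point(p, f, t):
    """RS accuracy and expected draws per accepted answer when the
    verifier operates at ROC point (f, t) = (FPR, TPR)."""
    accept = p * t + (1 - p) * f          # per-draw acceptance probability
    return p * t / accept, 1.0 / accept   # (accuracy, expected compute)

# Two hypothetical concave ROC curves: nearly identical at lax
# thresholds, but with different slopes at the origin.
roc_steep = lambda f: math.sqrt(f)   # infinite slope at 0: accuracy -> 1
roc_flat = lambda f: f * (2.0 - f)   # slope 2 at 0: accuracy plateaus

p = 0.3
# Lax threshold (f = 0.5, low compute): the verifiers look almost alike.
acc_s_low, _ = rs_point(p, 0.5, roc_steep(0.5))
acc_f_low, _ = rs_point(p, 0.5, roc_flat(0.5))
# Strict threshold (f = 1e-4, high compute): their limits diverge.
acc_s_hi, _ = rs_point(p, 1e-4, roc_steep(1e-4))
acc_f_hi, _ = rs_point(p, 1e-4, roc_flat(1e-4))
```

Here the two verifiers differ by under two accuracy points at low compute, while their high-compute limits are roughly 0.46 versus 1, so low-compute observations alone cannot separate them.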
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[19] Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect Verifiers
[30] Test-time Verification via Optimal Transport: Coverage, ROC, & Sub-optimality
Contribution Analysis
Detailed comparisons for each claimed contribution
Theoretical characterization of test-time scaling via ROC curves
The authors establish that for both Rejection Sampling and Best-of-N methods, the accuracy at a given query depends only on the generator's initial accuracy and the verifier's ROC curve. This provides a complete theoretical framework connecting verifier imperfection to test-time scaling performance.
Proof that RS outperforms BoN at fixed compute with concave ROC curves
The authors prove that Rejection Sampling achieves higher accuracy than Best-of-N when controlling for compute budget, provided the verifier has a concave ROC curve. However, both methods reach identical performance as compute approaches infinity.
[47] Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for LLM Problem-Solving
[54] Detecting and preventing hallucinations in large vision language models
[55] Fast best-of-n decoding via speculative rejection
[56] Accelerating best-of-n via speculative rejection
[57] Is best-of-n the best of them? coverage, scaling, and optimality in inference-time alignment
[58] Optimal Stopping vs Best-of-N for Inference Time Optimization
[59] Diverse Inference and Verification for Advanced Reasoning
[60] On the Query Complexity of Verifier-Assisted Language Generation
[61] STARS: Segment-level Token Alignment with Rejection Sampling in Large Language Models
[62] Best-of-Majority: Minimax-Optimal Strategy for Pass@k Inference Scaling
Impossibility result for extrapolating high-compute performance
The authors prove that observing test-time scaling behavior at low compute levels does not allow reliable prediction of performance at high compute levels. This holds for both RS and BoN, even when assuming concave ROC curves.