Plan and Budget: Effective and Efficient Test-Time Scaling on Reasoning Large Language Models

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Large Language Model, Test-Time Compute, Reasoning, Effectiveness, Efficiency
Abstract:

Large Language Models (LLMs) have achieved remarkable success in complex reasoning tasks, but their inference remains computationally inefficient. We observe a common failure mode in many prevalent LLMs: overthinking, in which models generate verbose, tangential reasoning traces even for simple queries. Recent works have tried to mitigate this by enforcing fixed token budgets; however, this can lead to underthinking, especially on harder problems. Through empirical analysis, we identify that this inefficiency often stems from unclear problem-solving strategies. To formalize this, we develop a theoretical model, BAM (Budget Allocation Model), which treats reasoning as a sequence of sub-questions with varying uncertainty, and we introduce the E3 metric to capture the trade-off between correctness and computational efficiency. Building on theoretical results from BAM, we propose Plan-and-Budget, a model-agnostic test-time framework that decomposes complex queries into sub-questions and allocates token budgets based on estimated complexity using adaptive scheduling. Plan-and-Budget improves reasoning efficiency across a range of tasks and models, achieving up to 70% accuracy gains, 39% token reduction, and a 193.8% improvement in E3. Notably, it elevates a smaller model (DS-Qwen-32B) to match the efficiency of a larger model (DS-LLaMA-70B), demonstrating Plan-and-Budget's ability to close performance gaps without retraining. Our code is available at anonymous.4open.science/r/P-and-B-6513/.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a decomposition-based framework that allocates token budgets adaptively by breaking complex queries into sub-questions. It resides in the 'Decomposition-Based Budget Planning' leaf, which contains three papers total, indicating a moderately sparse research direction within the broader taxonomy of fifty papers. The sibling papers—Plan and Budget LLM and Adaptive Graph Thoughts—similarly emphasize structured planning, suggesting this leaf represents a coherent but not overcrowded niche focused on upfront task decomposition rather than runtime adjustment.

The taxonomy reveals neighboring leaves in 'Adaptive Budget Allocation Frameworks' that pursue alternative strategies: 'Difficulty-Aware Budget Prediction' estimates problem complexity before reasoning, while 'Hierarchical and Multi-Level Budget Control' organizes allocation across multiple granularities. Adjacent branches, such as 'Dynamic Token Management During Inference' and 'Reinforcement Learning for Budget Optimization', address runtime adaptation and policy learning respectively. The paper's decomposition approach diverges from these by committing to a plan upfront, trading runtime flexibility for interpretability and structured resource distribution across identified sub-problems.

Among thirty candidates examined, the Budget Allocation Model (BAM) contribution shows no clear refutation across ten candidates, suggesting theoretical novelty in formalizing reasoning as uncertainty-driven sub-question sequences. However, the Plan-and-Budget framework and the characterization of reasoning miscalibration each face two refutable candidates among ten examined, indicating that decomposition-based planning and the overthinking/underthinking analysis have more substantial prior work. The limited search scope means these statistics reflect top-K semantic matches rather than exhaustive coverage, so unexamined literature may contain additional overlaps.

Given the search examined thirty candidates rather than hundreds, the analysis captures high-relevance prior work but cannot claim completeness. The theoretical BAM model appears more distinctive, while the framework and miscalibration insights align more closely with existing decomposition and efficiency studies. The paper's position in a three-paper leaf suggests it extends a recognized but not saturated research direction, though the refutation signals warrant careful comparison with the identified overlapping work to clarify incremental versus substantive contributions.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 4

Research Landscape Overview

Core task: efficient test-time reasoning with adaptive token budget allocation. The field addresses how to dynamically manage computational resources during inference, ensuring that models allocate tokens where they matter most.

The taxonomy reveals several complementary directions. Adaptive Budget Allocation Frameworks develop high-level strategies for distributing compute across reasoning steps, often through decomposition or policy-based planning (e.g., Plan and Budget[0], Token-Budget-Aware Reasoning[3]). Reinforcement Learning for Budget Optimization treats allocation as a sequential decision problem, learning policies that balance accuracy and efficiency (Budget Policy Optimization[4], Optimal Reasoning Efficiency[5]). Dynamic Token Management During Inference focuses on runtime mechanisms such as pruning, halting, or reweighting tokens to reduce waste (Dynamic Token Pruning[37], Continue-Thinking Token[40]). Meanwhile, Search and Sampling Strategies for Test-Time Scaling explore how to navigate solution spaces efficiently under budget constraints (Dual-Phase Search[25], First Finish Search[36]), and Training Methodologies for Efficient Reasoning investigate how to prepare models for adaptive behavior through specialized fine-tuning or distillation.

Recent work highlights a tension between global planning and local adaptation. Some approaches, like Plan and Budget[0] and its close neighbor Plan and Budget LLM[19], emphasize decomposition-based planning that allocates budgets upfront by breaking tasks into subtasks. This contrasts with methods such as SelfBudgeter[8] or Just Enough Thinking[7], which adapt budgets on-the-fly based on intermediate signals. Adaptive Graph Thoughts[12], a neighbor in the decomposition branch, similarly structures reasoning into graph-based plans.

The original paper sits within this decomposition-focused cluster, sharing an emphasis on structured planning with Plan and Budget LLM[19] but differing in how granularly it assigns token budgets to subproblems. Across branches, open questions persist around the trade-off between the interpretability of budget decisions and the flexibility needed to handle diverse problem difficulties at test time.

Claimed Contributions

Budget Allocation Model (BAM)

The authors introduce BAM, a theoretical framework that formalizes reasoning as a sequence of sub-problems with varying uncertainty levels and derives optimal token allocation strategies. They also propose the E3 metric (Efficiency-Aware Effectiveness Evaluation Score) to jointly measure reasoning accuracy and computational cost.

10 retrieved papers
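The report does not reproduce the E3 metric's formula. As a minimal sketch only, the following assumes one plausible instantiation of a correctness-vs-compute score, accuracy squared divided by average token count; the exact definition used by the paper may differ, and the numeric inputs below are illustrative, not results from the paper:

```python
def e3(accuracy: float, avg_tokens: float) -> float:
    """Hypothetical E3 score: rewards correctness superlinearly while
    penalizing token cost linearly. The accuracy**2 / avg_tokens form is an
    assumption for illustration; the report does not state the formula."""
    if avg_tokens <= 0:
        raise ValueError("avg_tokens must be positive")
    return accuracy ** 2 / avg_tokens

# Under this form, a simultaneous accuracy gain and token reduction
# compounds into a larger relative E3 improvement (illustrative numbers):
base = e3(accuracy=0.50, avg_tokens=1000.0)
improved = e3(accuracy=0.85, avg_tokens=610.0)
gain = improved / base - 1.0  # relative E3 improvement
```

Any metric of this shape makes the trade-off explicit: verbose-but-correct and terse-but-wrong traces both score poorly, which is the behavior the abstract attributes to E3.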
PLAN-AND-BUDGET framework

The authors develop PLAN-AND-BUDGET, a two-stage inference framework that first decomposes queries into sub-questions (Plan step) and then adaptively allocates token budgets to each sub-question based on estimated complexity (Budget step). This framework is model-agnostic and requires no retraining.

10 retrieved papers
Can Refute
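The two-stage idea described above can be sketched as a model-agnostic inference loop. This is a sketch under stated assumptions, not the paper's implementation: the `llm` and `decompose` callables are hypothetical interfaces, and proportional-to-complexity allocation is one simple instance of the "adaptive scheduling" the report mentions:

```python
from typing import Callable, List, Tuple

def plan_and_budget(
    query: str,
    llm: Callable[[str, int], str],  # hypothetical: (prompt, max_tokens) -> answer text
    decompose: Callable[[str], List[Tuple[str, float]]],  # hypothetical planner:
                                                          # query -> [(sub_question, est_complexity)]
    total_budget: int,
) -> str:
    """Plan step: break the query into sub-questions with complexity estimates.
    Budget step: split the token budget in proportion to estimated complexity,
    then answer each sub-question under its own cap, carrying context forward."""
    sub_questions = decompose(query)
    total_complexity = sum(c for _, c in sub_questions) or 1.0
    answers: List[str] = []
    context = query
    for sub_q, complexity in sub_questions:
        budget = max(1, int(total_budget * complexity / total_complexity))
        answer = llm(f"{context}\nSub-question: {sub_q}", budget)
        answers.append(answer)
        context += f"\n{sub_q}\n{answer}"  # later sub-questions see earlier answers
    # Fall back to a single undecomposed call if the planner returns nothing.
    return answers[-1] if answers else llm(query, total_budget)
```

Because the framework only wraps prompting and a per-call token cap, it requires no access to model weights, which is consistent with the "model-agnostic, no retraining" claim.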
Characterization of reasoning miscalibration

The authors identify and formalize reasoning miscalibration as a fundamental failure mode in LLMs, manifesting as either overthinking (excessive verbose reasoning) or underthinking (premature termination). They analyze this phenomenon through uncertainty decomposition and establish it as a key challenge in test-time computation.

10 retrieved papers
Can Refute
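The overthinking/underthinking dichotomy described above lends itself to a simple trace-level diagnostic. The following is a hypothetical sketch, not the paper's analysis: both the reference budget and the tolerance threshold are assumed inputs for illustration:

```python
def classify_miscalibration(tokens_used: int, reference_budget: int,
                            tol: float = 0.25) -> str:
    """Hypothetical diagnostic: label a reasoning trace by how far its length
    deviates from a reference budget for the problem. `reference_budget` and
    `tol` are assumed quantities, not values from the paper."""
    if reference_budget <= 0:
        raise ValueError("reference_budget must be positive")
    ratio = tokens_used / reference_budget
    if ratio > 1.0 + tol:
        return "overthinking"   # verbose, tangential trace
    if ratio < 1.0 - tol:
        return "underthinking"  # premature termination
    return "calibrated"
```

A diagnostic of this shape makes the failure mode measurable per query, which is the precondition for the budget-allocation remedies the report surveys.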

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Budget Allocation Model (BAM)

The authors introduce BAM, a theoretical framework that formalizes reasoning as a sequence of sub-problems with varying uncertainty levels and derives optimal token allocation strategies. They also propose the E3 metric (Efficiency-Aware Effectiveness Evaluation Score) to jointly measure reasoning accuracy and computational cost.

Contribution

PLAN-AND-BUDGET framework

The authors develop PLAN-AND-BUDGET, a two-stage inference framework that first decomposes queries into sub-questions (Plan step) and then adaptively allocates token budgets to each sub-question based on estimated complexity (Budget step). This framework is model-agnostic and requires no retraining.

Contribution

Characterization of reasoning miscalibration

The authors identify and formalize reasoning miscalibration as a fundamental failure mode in LLMs, manifesting as either overthinking (excessive verbose reasoning) or underthinking (premature termination). They analyze this phenomenon through uncertainty decomposition and establish it as a key challenge in test-time computation.