Plan and Budget: Effective and Efficient Test-Time Scaling on Reasoning Large Language Models

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Large Language Model, Test-Time Compute, Reasoning, Effectiveness, Efficiency
Abstract:

Large Language Models (LLMs) have achieved remarkable success in complex reasoning tasks, but their inference remains computationally inefficient. We observe a common failure mode in many prevalent LLMs: overthinking, in which models generate verbose, tangential reasoning traces even for simple queries. Recent works have tried to mitigate this by enforcing fixed token budgets; however, this can lead to underthinking, especially on harder problems. Through empirical analysis, we identify that this inefficiency often stems from unclear problem-solving strategies. To formalize this, we develop a theoretical model, BAM (Budget Allocation Model), which treats reasoning as a sequence of sub-questions with varying uncertainty, and we introduce the E3 metric to capture the trade-off between correctness and computational efficiency. Building on theoretical results from BAM, we propose Plan-and-Budget, a model-agnostic test-time framework that decomposes complex queries into sub-questions and allocates token budgets based on estimated complexity using adaptive scheduling. Plan-and-Budget improves reasoning efficiency across a range of tasks and models, achieving up to 70% accuracy gains, 39% token reduction, and a 193.8% improvement in E3. Notably, it elevates a smaller model (DS-Qwen-32B) to match the efficiency of a larger model (DS-LLaMA-70B), demonstrating Plan-and-Budget's ability to close performance gaps without retraining. Our code is available at anonymous.4open.science/r/P-and-B-6513/.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a decomposition-based framework that allocates token budgets adaptively by breaking complex queries into sub-questions. It resides in the 'Decomposition-Based Budget Planning' leaf, which contains three papers total, indicating a moderately sparse research direction within the broader taxonomy of fifty papers. The sibling papers—Plan and Budget LLM and Adaptive Graph Thoughts—similarly emphasize structured planning, suggesting this leaf represents a coherent but not overcrowded niche focused on upfront task decomposition rather than runtime adjustment.

The taxonomy reveals neighboring leaves in 'Adaptive Budget Allocation Frameworks' that pursue alternative strategies: 'Difficulty-Aware Budget Prediction' estimates problem complexity before reasoning, while 'Hierarchical and Multi-Level Budget Control' organizes allocation across multiple granularities. Adjacent branches, such as 'Dynamic Token Management During Inference' and 'Reinforcement Learning for Budget Optimization', address runtime adaptation and policy learning respectively. The paper's decomposition approach diverges from these by committing to a plan upfront, trading runtime flexibility for interpretability and structured resource distribution across identified sub-problems.

Among thirty candidates examined, the Budget Allocation Model (BAM) contribution shows no clear refutation across ten candidates, suggesting theoretical novelty in formalizing reasoning as uncertainty-driven sub-question sequences. However, the Plan-and-Budget framework and the characterization of reasoning miscalibration each face two refutable candidates among ten examined, indicating that decomposition-based planning and the overthinking/underthinking analysis have more substantial prior work. The limited search scope means these statistics reflect top-K semantic matches rather than exhaustive coverage, so unexamined literature may contain additional overlaps.

Given the search examined thirty candidates rather than hundreds, the analysis captures high-relevance prior work but cannot claim completeness. The theoretical BAM model appears more distinctive, while the framework and miscalibration insights align more closely with existing decomposition and efficiency studies. The paper's position in a three-paper leaf suggests it extends a recognized but not saturated research direction, though the refutation signals warrant careful comparison with the identified overlapping work to clarify incremental versus substantive contributions.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 4

Research Landscape Overview

Core task: efficient test-time reasoning with adaptive token budget allocation. The field addresses how to dynamically manage computational resources during inference, ensuring that models allocate tokens where they matter most.

The taxonomy reveals several complementary directions. Adaptive Budget Allocation Frameworks develop high-level strategies for distributing compute across reasoning steps, often through decomposition or policy-based planning (e.g., Plan and Budget[0], Token-Budget-Aware Reasoning[3]). Reinforcement Learning for Budget Optimization treats allocation as a sequential decision problem, learning policies that balance accuracy and efficiency (Budget Policy Optimization[4], Optimal Reasoning Efficiency[5]). Dynamic Token Management During Inference focuses on runtime mechanisms such as pruning, halting, or reweighting tokens to reduce waste (Dynamic Token Pruning[37], Continue-Thinking Token[40]). Meanwhile, Search and Sampling Strategies for Test-Time Scaling explore how to navigate solution spaces efficiently under budget constraints (Dual-Phase Search[25], First Finish Search[36]), and Training Methodologies for Efficient Reasoning investigate how to prepare models for adaptive behavior through specialized fine-tuning or distillation.

Recent work highlights a tension between global planning and local adaptation. Some approaches, like Plan and Budget[0] and its close neighbor Plan and Budget LLM[19], emphasize decomposition-based planning that allocates budgets upfront by breaking tasks into subtasks. This contrasts with methods such as SelfBudgeter[8] or Just Enough Thinking[7], which adapt budgets on-the-fly based on intermediate signals. Adaptive Graph Thoughts[12], a neighbor in the decomposition branch, similarly structures reasoning into graph-based plans.

The original paper sits within this decomposition-focused cluster, sharing an emphasis on structured planning with Plan and Budget LLM[19] but differing in how granularly it assigns token budgets to subproblems. Across branches, open questions persist around the trade-off between the interpretability of budget decisions and the flexibility needed to handle diverse problem difficulties at test time.

Claimed Contributions

Budget Allocation Model (BAM)

The authors introduce BAM, a theoretical framework that formalizes reasoning as a sequence of sub-problems with varying uncertainty levels and derives optimal token allocation strategies. They also propose the E3 metric (Efficiency-Aware Effectiveness Evaluation Score) to jointly measure reasoning accuracy and computational cost.

10 retrieved papers
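The report does not reproduce the E3 metric's formula. As a minimal sketch only, the following assumes one plausible instantiation of a correctness-vs-compute score, accuracy squared divided by average token count; the exact definition used by the paper may differ, and the numeric inputs below are illustrative, not results from the paper:

```python
def e3(accuracy: float, avg_tokens: float) -> float:
    """Hypothetical E3 score: rewards correctness superlinearly while
    penalizing token cost linearly. The accuracy**2 / avg_tokens form is an
    assumption for illustration; the report does not state the formula."""
    if avg_tokens <= 0:
        raise ValueError("avg_tokens must be positive")
    return accuracy ** 2 / avg_tokens

# Under this form, a simultaneous accuracy gain and token reduction
# compounds into a larger relative E3 improvement (illustrative numbers):
base = e3(accuracy=0.50, avg_tokens=1000.0)
improved = e3(accuracy=0.85, avg_tokens=610.0)
gain = improved / base - 1.0  # relative E3 improvement
```

Any metric of this shape makes the trade-off explicit: verbose-but-correct and terse-but-wrong traces both score poorly, which is the behavior the abstract attributes to E3.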
PLAN-AND-BUDGET framework

The authors develop PLAN-AND-BUDGET, a two-stage inference framework that first decomposes queries into sub-questions (Plan step) and then adaptively allocates token budgets to each sub-question based on estimated complexity (Budget step). This framework is model-agnostic and requires no retraining.

10 retrieved papers
Can Refute
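The two-stage idea described above can be sketched as a model-agnostic inference loop. This is a sketch under stated assumptions, not the paper's implementation: the `llm` and `decompose` callables are hypothetical interfaces, and proportional-to-complexity allocation is one simple instance of the "adaptive scheduling" the report mentions:

```python
from typing import Callable, List, Tuple

def plan_and_budget(
    query: str,
    llm: Callable[[str, int], str],  # hypothetical: (prompt, max_tokens) -> answer text
    decompose: Callable[[str], List[Tuple[str, float]]],  # hypothetical planner:
                                                          # query -> [(sub_question, est_complexity)]
    total_budget: int,
) -> str:
    """Plan step: break the query into sub-questions with complexity estimates.
    Budget step: split the token budget in proportion to estimated complexity,
    then answer each sub-question under its own cap, carrying context forward."""
    sub_questions = decompose(query)
    total_complexity = sum(c for _, c in sub_questions) or 1.0
    answers: List[str] = []
    context = query
    for sub_q, complexity in sub_questions:
        budget = max(1, int(total_budget * complexity / total_complexity))
        answer = llm(f"{context}\nSub-question: {sub_q}", budget)
        answers.append(answer)
        context += f"\n{sub_q}\n{answer}"  # later sub-questions see earlier answers
    # Fall back to a single undecomposed call if the planner returns nothing.
    return answers[-1] if answers else llm(query, total_budget)
```

Because the framework only wraps prompting and a per-call token cap, it requires no access to model weights, which is consistent with the "model-agnostic, no retraining" claim.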
Characterization of reasoning miscalibration

The authors identify and formalize reasoning miscalibration as a fundamental failure mode in LLMs, manifesting as either overthinking (excessive verbose reasoning) or underthinking (premature termination). They analyze this phenomenon through uncertainty decomposition and establish it as a key challenge in test-time computation.

10 retrieved papers
Can Refute
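The overthinking/underthinking dichotomy described above lends itself to a simple trace-level diagnostic. The following is a hypothetical sketch, not the paper's analysis: both the reference budget and the tolerance threshold are assumed inputs for illustration:

```python
def classify_miscalibration(tokens_used: int, reference_budget: int,
                            tol: float = 0.25) -> str:
    """Hypothetical diagnostic: label a reasoning trace by how far its length
    deviates from a reference budget for the problem. `reference_budget` and
    `tol` are assumed quantities, not values from the paper."""
    if reference_budget <= 0:
        raise ValueError("reference_budget must be positive")
    ratio = tokens_used / reference_budget
    if ratio > 1.0 + tol:
        return "overthinking"   # verbose, tangential trace
    if ratio < 1.0 - tol:
        return "underthinking"  # premature termination
    return "calibrated"
```

A diagnostic of this shape makes the failure mode measurable per query, which is the precondition for the budget-allocation remedies the report surveys.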

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Budget Allocation Model (BAM)

The authors introduce BAM, a theoretical framework that formalizes reasoning as a sequence of sub-problems with varying uncertainty levels and derives optimal token allocation strategies. They also propose the E3 metric (Efficiency-Aware Effectiveness Evaluation Score) to jointly measure reasoning accuracy and computational cost.

Contribution

PLAN-AND-BUDGET framework

The authors develop PLAN-AND-BUDGET, a two-stage inference framework that first decomposes queries into sub-questions (Plan step) and then adaptively allocates token budgets to each sub-question based on estimated complexity (Budget step). This framework is model-agnostic and requires no retraining.

Contribution

Characterization of reasoning miscalibration

The authors identify and formalize reasoning miscalibration as a fundamental failure mode in LLMs, manifesting as either overthinking (excessive verbose reasoning) or underthinking (premature termination). They analyze this phenomenon through uncertainty decomposition and establish it as a key challenge in test-time computation.