e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: LLM reasoning, test-time compute, RL, exploration
Abstract:

Test-time scaling offers a promising path to improve LLM reasoning by utilizing more compute at inference time; however, the true promise of this paradigm lies in extrapolation (i.e., improvement in performance on hard problems as LLMs keep "thinking" for longer, beyond the maximum token budget they were trained on). Surprisingly, we find that most existing reasoning models do not extrapolate well. We show that one way to enable extrapolation is by training the LLM to perform in-context exploration: training the LLM to effectively spend its test-time budget by chaining operations (such as generation, verification, refinement, etc.), or testing multiple hypotheses before it commits to an answer. To enable in-context exploration, we identify three key ingredients as part of our recipe e3: (1) chaining skills that the base LLM has asymmetric competence in, e.g., chaining verification (easy) with generation (hard), as a way to implement in-context search; (2) leveraging "negative" gradients from incorrect traces to amplify exploration during RL, resulting in longer search traces that chain additional asymmetries; and (3) coupling task difficulty with training token budget via a specifically designed curriculum to structure in-context exploration. Our recipe e3 produces the best known 1.7B model according to AIME'25 and HMMT'25 scores, and extrapolates to 2x the training token budget. Our e3-1.7B model not only attains high pass@1 scores, but also improves pass@k over the base model.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes the e3 recipe to enable test-time compute extrapolation in LLMs through in-context exploration, training models to chain operations like generation and verification or test multiple hypotheses before committing to answers. It resides in the 'In-Context Exploration for Extrapolation' leaf, which contains only two papers including this one. This represents a sparse, emerging research direction within the broader 'In-Context Exploration and Reinforcement Learning' branch, suggesting the work addresses a relatively underexplored problem space compared to more crowded areas like prompt optimization or general search-based inference.

The taxonomy reveals several neighboring directions that contextualize this work. Sibling categories include 'Bandit-Based Exploration' (three papers on contextual bandits for LLM decision-making), 'General In-Context RL' (one paper on multi-round prompting with scalar rewards), and 'Representation-Based Exploration' (one paper using hidden state guidance). The parent branch 'Search-Based Inference Optimization' contains MCTS-based reasoning methods and in-context search techniques. While these neighbors share the goal of improving test-time reasoning, they differ in mechanism: the paper's focus on chaining asymmetric skills and leveraging negative gradients for exploration distinguishes it from bandit frameworks or tree search approaches.

Among twenty-five candidates examined across three contributions, none were identified as clearly refuting the paper's claims. The e3 recipe examined nine candidates with zero refutable overlaps; the theoretical framework on asymmetries examined six candidates with zero refutations; and the coupled curriculum design examined ten candidates with zero refutations. This suggests that within the limited search scope, the specific combination of asymmetric skill chaining, negative gradient amplification, and coupled task-budget curriculum appears novel. However, the small candidate pool and sparse taxonomy leaf mean the analysis covers a narrow slice of potentially relevant prior work.

The limited search scope (twenty-five candidates from semantic search) and sparse taxonomy structure (only one sibling paper in the same leaf) constrain confidence in assessing absolute novelty. The analysis indicates no direct prior work overlap within examined candidates, but the emerging nature of this research direction means the field may lack comprehensive coverage. The work's positioning at the intersection of in-context learning, reinforcement learning, and test-time scaling suggests it synthesizes ideas from multiple established areas into a relatively unexplored combination.

Taxonomy

Core-task Taxonomy Papers: 25
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 0

Research Landscape Overview

Core task: Enabling extrapolation of test-time compute for large language models through in-context exploration. The field structure reflects a multifaceted effort to enhance LLM capabilities beyond training-time constraints.

The taxonomy organizes research into several major branches. Test-Time Compute Scaling and Reasoning Enhancement focuses on methods that allocate additional computation during inference to improve reasoning and decision-making, often through exploration strategies and reinforcement learning techniques such as those in LLMs Explore In-Context[1] and Evolve LLM Exploration[2]. Prompt Engineering and In-Context Learning Optimization addresses how to craft effective demonstrations and prompts, exemplified by works like Sample Efficient Demonstrations[10] and Repulsive Bayesian Prompt[21]. Context Propagation and Representation Structuring examines how models maintain and utilize long-range dependencies, with approaches like InfLLM Context Memory[6] and Hierarchical Contextual Manifold[24]. Inference Efficiency and Model Selection targets practical deployment concerns, including resource allocation and adaptive model choice as in Context-Aware Assistant Selection[14]. Finally, Domain-Specific Applications tailors these techniques to specialized tasks.

Within the Test-Time Compute Scaling branch, a particularly active line of work explores in-context exploration and reinforcement learning for extrapolation. These studies investigate how LLMs can dynamically explore solution spaces at test time, balancing exploration and exploitation to generalize beyond training distributions. Learning to Explore[0] sits squarely in this cluster, emphasizing mechanisms that enable models to extrapolate by leveraging in-context learning as an exploration tool. It shares thematic ground with LLMs Explore In-Context[1], which similarly examines exploration strategies within the prompt context, though Learning to Explore[0] places stronger emphasis on extrapolation. Nearby works like Strategic Exploration Exploitation[11] and Context-Guided Test-Time RL[23] highlight the trade-off between computational budget and reasoning depth, while In-Context Search Scaling[22] investigates how search-based methods scale with test-time resources. The central open question across these directions is how to allocate test-time compute efficiently enough to achieve robust generalization without prohibitive cost.

Claimed Contributions

e3 recipe for enabling test-time compute extrapolation in LLMs

The authors propose e3, a training recipe consisting of three components: exploiting asymmetric capabilities (like verification-generation gaps) in base models, using negative gradients during RL to encourage longer reasoning chains, and employing a coupled curriculum that coordinates problem difficulty with token budget. This recipe enables models to extrapolate performance beyond their training compute budget.

9 retrieved papers
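The generation-verification chaining this contribution describes can be illustrated with a minimal simulation. All names and probabilities below are hypothetical stand-ins, not the authors' implementation: the point is only that when verification is easier than generation, spending more budget on chained generate-then-verify rounds raises the success rate.

```python
import random

random.seed(0)

# Hypothetical per-step probabilities: generation is the "hard" skill,
# verification the "easy" one (the asymmetry the recipe exploits).
P_GENERATE_CORRECT = 0.3   # chance a single attempt is correct
P_VERIFY_CORRECT = 0.9     # chance the verifier judges an attempt correctly

def generate() -> bool:
    """Propose a candidate answer; True means it is actually correct."""
    return random.random() < P_GENERATE_CORRECT

def verify(is_correct: bool) -> bool:
    """Noisily check a candidate; right with probability P_VERIFY_CORRECT."""
    judged_right = random.random() < P_VERIFY_CORRECT
    return is_correct if judged_right else not is_correct

def in_context_search(budget: int) -> bool:
    """Chain generate->verify until a candidate passes or budget is spent."""
    last = False
    for _ in range(budget):
        last = generate()
        if verify(last):      # commit to the first candidate that passes
            return last
    return last               # budget exhausted: commit to the last attempt

# More budget -> more chained attempts -> higher success rate.
trials = 10_000
for budget in (1, 4, 16):
    wins = sum(in_context_search(budget) for _ in range(trials))
    print(f"budget={budget:2d}  success_rate={wins / trials:.2f}")
```

With a chance-level verifier the loop would add nothing; the gain comes entirely from the asymmetry between the two skills.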
Theoretical framework showing asymmetries enable in-context exploration

The authors develop a theoretical p_k model that formalizes how chaining asymmetric capabilities (such as verification being easier than generation) enables in-context exploration. They prove that when such asymmetries exist, models can benefit from making multiple attempts and verifying intermediate results.

6 retrieved papers
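A hedged sketch of why such a p_k quantity grows with the number of attempts (this is a standard simplification with independent attempts and a symmetric-accuracy verifier, not the paper's exact model): with per-attempt correctness p and verifier accuracy v, the probability of committing to a correct answer within k verified attempts has a closed form, and its k-to-infinity limit exceeds p only when v > 1/2, i.e., only when the asymmetry exists.

```python
def p_k(p: float, v: float, k: int) -> float:
    """Probability of committing to a correct answer within k verified attempts.

    Illustrative assumptions (not the paper's exact model): each attempt is
    correct with prob p; the verifier classifies any attempt correctly with
    prob v; search commits on the first accepted attempt.
    """
    accept_correct = p * v                 # correct attempt, accepted
    accept_wrong = (1 - p) * (1 - v)       # wrong attempt, mistakenly accepted
    stop = accept_correct + accept_wrong   # prob the search commits this round
    if stop == 0:
        return 0.0
    # Geometric sum over the round on which the search first commits.
    return (accept_correct / stop) * (1 - (1 - stop) ** k)

# With a real asymmetry (v >> 1/2), extra attempts push success well above
# the single-shot rate p; with a chance-level verifier (v = 0.5) the limit
# is exactly p, so extra attempts buy nothing over one raw attempt.
for v in (0.9, 0.5):
    print(f"v={v}: " + ", ".join(f"p_{k}={p_k(0.3, v, k):.2f}" for k in (1, 4, 16)))
```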
Coupled curriculum design for structuring exploration during RL training

The authors introduce a coupled curriculum that jointly varies problem difficulty and training token budget during RL. At each stage, they select the smallest budget that allows the model to chain asymmetries and extrapolate to twice that budget, balancing optimization efficiency with exploration incentives.

10 retrieved papers
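The budget-selection rule in this contribution can be sketched as a simple search over candidate budgets. The full curriculum also couples problem difficulty to the budget at each stage; the sketch below isolates only the selection rule, and `extrapolates` is a hypothetical stand-in for the empirical check the authors would run during training.

```python
from typing import Callable, Sequence

def pick_stage_budget(
    budgets: Sequence[int],
    extrapolates: Callable[[int], bool],
) -> int:
    """Select the smallest training budget B such that a model trained at B
    still improves when given 2*B tokens at test time.

    `extrapolates(B)` is a placeholder for that empirical check; passing it
    in as a callable lets the selection rule itself be exercised.
    """
    for b in sorted(budgets):
        if extrapolates(b):
            return b
    return max(budgets)  # fall back to the largest budget if none qualify

# Toy check: suppose extrapolation to 2*B only kicks in once the budget is
# large enough to fit at least one generate->verify->refine chain.
MIN_CHAIN_TOKENS = 4096  # hypothetical threshold for illustration
chosen = pick_stage_budget(
    budgets=[1024, 2048, 4096, 8192],
    extrapolates=lambda b: b >= MIN_CHAIN_TOKENS,
)
print(chosen)  # -> 4096
```

Choosing the smallest qualifying budget keeps RL optimization cheap at each stage while still rewarding traces long enough to chain asymmetries.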

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

e3 recipe for enabling test-time compute extrapolation in LLMs

The authors propose e3, a training recipe consisting of three components: exploiting asymmetric capabilities (like verification-generation gaps) in base models, using negative gradients during RL to encourage longer reasoning chains, and employing a coupled curriculum that coordinates problem difficulty with token budget. This recipe enables models to extrapolate performance beyond their training compute budget.

Contribution

Theoretical framework showing asymmetries enable in-context exploration

The authors develop a theoretical p_k model that formalizes how chaining asymmetric capabilities (such as verification being easier than generation) enables in-context exploration. They prove that when such asymmetries exist, models can benefit from making multiple attempts and verifying intermediate results.

Contribution

Coupled curriculum design for structuring exploration during RL training

The authors introduce a coupled curriculum that jointly varies problem difficulty and training token budget during RL. At each stage, they select the smallest budget that allows the model to chain asymmetries and extrapolate to twice that budget, balancing optimization efficiency with exploration incentives.