e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: LLM reasoning, test-time compute, RL, exploration
Abstract:

Test-time scaling offers a promising path to improve LLM reasoning by utilizing more compute at inference time; however, the true promise of this paradigm lies in extrapolation (i.e., improvement in performance on hard problems as LLMs keep "thinking" for longer, beyond the maximum token budget they were trained on). Surprisingly, we find that most existing reasoning models do not extrapolate well. We show that one way to enable extrapolation is by training the LLM to perform in-context exploration: training the LLM to effectively spend its test-time budget by chaining operations (such as generation, verification, refinement, etc.), or testing multiple hypotheses before it commits to an answer. To enable in-context exploration, we identify three key ingredients as part of our recipe e3: (1) chaining skills that the base LLM has asymmetric competence in, e.g., chaining verification (easy) with generation (hard), as a way to implement in-context search; (2) leveraging "negative" gradients from incorrect traces to amplify exploration during RL, resulting in longer search traces that chain additional asymmetries; and (3) coupling task difficulty with training token budget via a specifically designed curriculum to structure in-context exploration. Our recipe e3 produces the best known 1.7B model according to AIME'25 and HMMT'25 scores, and extrapolates to 2x the training token budget. Our e3-1.7B model not only attains high pass@1 scores, but also improves pass@k over the base model.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes the e3 recipe to enable test-time compute extrapolation in LLMs through in-context exploration, training models to chain operations like generation and verification or test multiple hypotheses before committing to answers. It resides in the 'In-Context Exploration for Extrapolation' leaf, which contains only two papers including this one. This represents a sparse, emerging research direction within the broader 'In-Context Exploration and Reinforcement Learning' branch, suggesting the work addresses a relatively underexplored problem space compared to more crowded areas like prompt optimization or general search-based inference.

The taxonomy reveals several neighboring directions that contextualize this work. Sibling categories include 'Bandit-Based Exploration' (three papers on contextual bandits for LLM decision-making), 'General In-Context RL' (one paper on multi-round prompting with scalar rewards), and 'Representation-Based Exploration' (one paper using hidden state guidance). The parent branch 'Search-Based Inference Optimization' contains MCTS-based reasoning methods and in-context search techniques. While these neighbors share the goal of improving test-time reasoning, they differ in mechanism: the paper's focus on chaining asymmetric skills and leveraging negative gradients for exploration distinguishes it from bandit frameworks or tree search approaches.

Among twenty-five candidates examined across three contributions, none were identified as clearly refuting the paper's claims. The e3 recipe examined nine candidates with zero refutable overlaps; the theoretical framework on asymmetries examined six candidates with zero refutations; and the coupled curriculum design examined ten candidates with zero refutations. This suggests that within the limited search scope, the specific combination of asymmetric skill chaining, negative gradient amplification, and coupled task-budget curriculum appears novel. However, the small candidate pool and sparse taxonomy leaf mean the analysis covers a narrow slice of potentially relevant prior work.

The limited search scope (twenty-five candidates from semantic search) and sparse taxonomy structure (only one sibling paper in the same leaf) constrain confidence in assessing absolute novelty. The analysis indicates no direct prior work overlap within examined candidates, but the emerging nature of this research direction means the field may lack comprehensive coverage. The work's positioning at the intersection of in-context learning, reinforcement learning, and test-time scaling suggests it synthesizes ideas from multiple established areas into a relatively unexplored combination.

Taxonomy

Core-task Taxonomy Papers: 25
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 0

Research Landscape Overview

Core task: Enabling extrapolation of test-time compute for large language models through in-context exploration. The field structure reflects a multifaceted effort to enhance LLM capabilities beyond training-time constraints.

The taxonomy organizes research into several major branches. Test-Time Compute Scaling and Reasoning Enhancement focuses on methods that allocate additional computation during inference to improve reasoning and decision-making, often through exploration strategies and reinforcement learning techniques such as those in LLMs Explore In-Context[1] and Evolve LLM Exploration[2]. Prompt Engineering and In-Context Learning Optimization addresses how to craft effective demonstrations and prompts, exemplified by works like Sample Efficient Demonstrations[10] and Repulsive Bayesian Prompt[21]. Context Propagation and Representation Structuring examines how models maintain and utilize long-range dependencies, with approaches like InfLLM Context Memory[6] and Hierarchical Contextual Manifold[24]. Inference Efficiency and Model Selection targets practical deployment concerns, including resource allocation and adaptive model choice as in Context-Aware Assistant Selection[14]. Finally, Domain-Specific Applications tailors these techniques to specialized tasks.

Within the Test-Time Compute Scaling branch, a particularly active line of work explores in-context exploration and reinforcement learning for extrapolation. These studies investigate how LLMs can dynamically explore solution spaces at test time, balancing exploration and exploitation to generalize beyond training distributions. Learning to Explore[0] sits squarely in this cluster, emphasizing mechanisms that enable models to extrapolate by leveraging in-context learning as an exploration tool. It shares thematic ground with LLMs Explore In-Context[1], which similarly examines exploration strategies within the prompt context, though Learning to Explore[0] places stronger emphasis on extrapolation. Nearby works like Strategic Exploration Exploitation[11] and Context-Guided Test-Time RL[23] highlight the trade-off between computational budget and reasoning depth, while In-Context Search Scaling[22] investigates how search-based methods scale with test-time resources. The central open question across these directions is how to allocate test-time compute efficiently enough to achieve robust generalization without prohibitive cost.

Claimed Contributions

e3 recipe for enabling test-time compute extrapolation in LLMs

The authors propose e3, a training recipe consisting of three components: exploiting asymmetric capabilities (like verification-generation gaps) in base models, using negative gradients during RL to encourage longer reasoning chains, and employing a coupled curriculum that coordinates problem difficulty with token budget. This recipe enables models to extrapolate performance beyond their training compute budget.

9 retrieved papers
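The generation-verification chaining this contribution describes can be illustrated with a minimal simulation. All names and probabilities below are hypothetical stand-ins, not the authors' implementation: the point is only that when verification is easier than generation, spending more budget on chained generate-then-verify rounds raises the success rate.

```python
import random

random.seed(0)

# Hypothetical per-step probabilities: generation is the "hard" skill,
# verification the "easy" one (the asymmetry the recipe exploits).
P_GENERATE_CORRECT = 0.3   # chance a single attempt is correct
P_VERIFY_CORRECT = 0.9     # chance the verifier judges an attempt correctly

def generate() -> bool:
    """Propose a candidate answer; True means it is actually correct."""
    return random.random() < P_GENERATE_CORRECT

def verify(is_correct: bool) -> bool:
    """Noisily check a candidate; right with probability P_VERIFY_CORRECT."""
    judged_right = random.random() < P_VERIFY_CORRECT
    return is_correct if judged_right else not is_correct

def in_context_search(budget: int) -> bool:
    """Chain generate->verify until a candidate passes or budget is spent."""
    last = False
    for _ in range(budget):
        last = generate()
        if verify(last):      # commit to the first candidate that passes
            return last
    return last               # budget exhausted: commit to the last attempt

# More budget -> more chained attempts -> higher success rate.
trials = 10_000
for budget in (1, 4, 16):
    wins = sum(in_context_search(budget) for _ in range(trials))
    print(f"budget={budget:2d}  success_rate={wins / trials:.2f}")
```

With a chance-level verifier the loop would add nothing; the gain comes entirely from the asymmetry between the two skills.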
Theoretical framework showing asymmetries enable in-context exploration

The authors develop a theoretical p_k model that formalizes how chaining asymmetric capabilities (such as verification being easier than generation) enables in-context exploration. They prove that when such asymmetries exist, models can benefit from making multiple attempts and verifying intermediate results.

6 retrieved papers
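A hedged sketch of why such a p_k quantity grows with the number of attempts (this is a standard simplification with independent attempts and a symmetric-accuracy verifier, not the paper's exact model): with per-attempt correctness p and verifier accuracy v, the probability of committing to a correct answer within k verified attempts has a closed form, and its k-to-infinity limit exceeds p only when v > 1/2, i.e., only when the asymmetry exists.

```python
def p_k(p: float, v: float, k: int) -> float:
    """Probability of committing to a correct answer within k verified attempts.

    Illustrative assumptions (not the paper's exact model): each attempt is
    correct with prob p; the verifier classifies any attempt correctly with
    prob v; search commits on the first accepted attempt.
    """
    accept_correct = p * v                 # correct attempt, accepted
    accept_wrong = (1 - p) * (1 - v)       # wrong attempt, mistakenly accepted
    stop = accept_correct + accept_wrong   # prob the search commits this round
    if stop == 0:
        return 0.0
    # Geometric sum over the round on which the search first commits.
    return (accept_correct / stop) * (1 - (1 - stop) ** k)

# With a real asymmetry (v >> 1/2), extra attempts push success well above
# the single-shot rate p; with a chance-level verifier (v = 0.5) the limit
# is exactly p, so extra attempts buy nothing over one raw attempt.
for v in (0.9, 0.5):
    print(f"v={v}: " + ", ".join(f"p_{k}={p_k(0.3, v, k):.2f}" for k in (1, 4, 16)))
```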
Coupled curriculum design for structuring exploration during RL training

The authors introduce a coupled curriculum that jointly varies problem difficulty and training token budget during RL. At each stage, they select the smallest budget that allows the model to chain asymmetries and extrapolate to twice that budget, balancing optimization efficiency with exploration incentives.

10 retrieved papers
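The budget-selection rule in this contribution can be sketched as a simple search over candidate budgets. The full curriculum also couples problem difficulty to the budget at each stage; the sketch below isolates only the selection rule, and `extrapolates` is a hypothetical stand-in for the empirical check the authors would run during training.

```python
from typing import Callable, Sequence

def pick_stage_budget(
    budgets: Sequence[int],
    extrapolates: Callable[[int], bool],
) -> int:
    """Select the smallest training budget B such that a model trained at B
    still improves when given 2*B tokens at test time.

    `extrapolates(B)` is a placeholder for that empirical check; passing it
    in as a callable lets the selection rule itself be exercised.
    """
    for b in sorted(budgets):
        if extrapolates(b):
            return b
    return max(budgets)  # fall back to the largest budget if none qualify

# Toy check: suppose extrapolation to 2*B only kicks in once the budget is
# large enough to fit at least one generate->verify->refine chain.
MIN_CHAIN_TOKENS = 4096  # hypothetical threshold for illustration
chosen = pick_stage_budget(
    budgets=[1024, 2048, 4096, 8192],
    extrapolates=lambda b: b >= MIN_CHAIN_TOKENS,
)
print(chosen)  # -> 4096
```

Choosing the smallest qualifying budget keeps RL optimization cheap at each stage while still rewarding traces long enough to chain asymmetries.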

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

e3 recipe for enabling test-time compute extrapolation in LLMs

The authors propose e3, a training recipe consisting of three components: exploiting asymmetric capabilities (like verification-generation gaps) in base models, using negative gradients during RL to encourage longer reasoning chains, and employing a coupled curriculum that coordinates problem difficulty with token budget. This recipe enables models to extrapolate performance beyond their training compute budget.

Contribution

Theoretical framework showing asymmetries enable in-context exploration

The authors develop a theoretical p_k model that formalizes how chaining asymmetric capabilities (such as verification being easier than generation) enables in-context exploration. They prove that when such asymmetries exist, models can benefit from making multiple attempts and verifying intermediate results.

Contribution

Coupled curriculum design for structuring exploration during RL training

The authors introduce a coupled curriculum that jointly varies problem difficulty and training token budget during RL. At each stage, they select the smallest budget that allows the model to chain asymmetries and extrapolate to twice that budget, balancing optimization efficiency with exploration incentives.