In-Context Learning for Pure Exploration

ICLR 2026 Conference Submission · Anonymous Authors
active sequential hypothesis testing, pure exploration, reinforcement learning, in-context learning, best arm identification
Abstract:

We study the active sequential hypothesis testing problem, also known as pure exploration: given a new task, the learner adaptively collects data from the environment to efficiently determine an underlying correct hypothesis. A classical instance of this problem is identifying the best arm in a multi-armed bandit (best-arm identification, BAI), where actions index hypotheses. Another important case is generalized search, the problem of determining the correct label through a sequence of strategically selected queries that indirectly reveal information about the label. In this work, we introduce In-Context Pure Exploration (ICPE), which meta-trains Transformers to map observation histories to query actions and a predicted hypothesis, yielding a model that transfers in-context. At inference time, ICPE actively gathers evidence on new tasks and infers the true hypothesis without parameter updates. Across deterministic, stochastic, and structured benchmarks, including BAI and generalized search, ICPE is competitive with adaptive baselines while requiring no explicit modeling of information structure. Our results support Transformers as practical architectures for general sequential testing.
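To make the fixed-budget best-arm identification setting concrete, here is a minimal non-adaptive baseline: spend the sampling budget uniformly across arms, then guess the arm with the highest empirical mean. This is a sketch for intuition only, not the ICPE method; all names and the Gaussian reward model are illustrative assumptions.

```python
import random

def fixed_budget_bai(arm_means, budget, rng):
    """Uniform round-robin baseline for fixed-budget best-arm
    identification: sample each arm equally often, then return the
    index of the arm with the highest empirical mean. Adaptive
    strategies (including learned policies) aim to allocate the
    same budget more efficiently."""
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(budget):
        a = t % k                          # round-robin arm choice
        sums[a] += rng.gauss(arm_means[a], 1.0)  # noisy Gaussian reward
        counts[a] += 1
    means = [s / c for s, c in zip(sums, counts)]
    return max(range(k), key=means.__getitem__)

best = fixed_budget_bai([0.0, 5.0, 1.0], budget=300, rng=random.Random(1))
print("guessed best arm:", best)
```

With a large gap between arm means relative to the noise, even this non-adaptive allocation identifies the best arm reliably; the interesting regime for adaptive methods is small gaps and tight budgets.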

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces In-Context Pure Exploration (ICPE), a meta-learning framework that trains Transformers to perform active sequential hypothesis testing by mapping observation histories to query actions and predicted hypotheses. It resides in the 'Deep Learning for Policy Design' leaf, which contains only two papers total, indicating a relatively sparse research direction within the broader taxonomy of fifty papers. This leaf sits under 'Computational and Learning-Based Approaches', a branch that contrasts with the field's dominant analytical and domain-specific methods.

The taxonomy reveals that most sequential testing work concentrates on analytical stopping rules, domain applications like clinical trials and quantum testing, or classical adaptive sampling strategies. The 'Computational and Learning-Based Approaches' branch is small, with only six papers across two leaves, suggesting that learning-based policy design remains an emerging area. ICPE's sibling paper in the same leaf likely explores similar neural policy learning, but the sparse population indicates limited prior work directly combining Transformers with pure exploration tasks, distinguishing this direction from the field's traditional information-theoretic and asymptotic analysis focus.

Among twenty-one candidates examined, none clearly refutes any of the three contributions. The ICPE framework itself was checked against two candidates with no overlaps found; for the theoretical characterization of optimal policies, ten candidates were examined without refutation; and for the extension to history-dependent models, nine candidates were reviewed, also with no clear prior work found. These statistics suggest that within the limited search scope—top-K semantic matches plus citation expansion—the specific combination of in-context learning, Transformers, and pure exploration appears relatively unexplored, though the small candidate pool means gaps in the literature search remain possible.

Given the limited search scale and the sparse taxonomy leaf, the work appears to occupy a novel intersection of meta-learning and active hypothesis testing. However, the analysis covers only a fraction of the broader machine learning and sequential decision-making literature, and the taxonomy's structure shows that learning-based approaches are underrepresented overall. A more exhaustive search across reinforcement learning, meta-learning, and bandit literature would be needed to fully assess novelty beyond the top-twenty-one semantic matches examined here.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 0

Research Landscape Overview

Core task: active sequential hypothesis testing with adaptive data collection. The field encompasses methods that iteratively decide which data to gather next and when to stop, balancing statistical power against sampling cost. The taxonomy organizes this landscape into five main branches. Adaptive Action Selection and Sampling Strategies focus on choosing informative observations or sensors at each stage, often drawing on bandit-like frameworks and information-theoretic criteria. Sequential Testing Procedures and Stopping Rules develop principled decision boundaries and error guarantees, extending classical sequential analysis to modern settings. Domain-Specific Sequential Testing Applications tailor these ideas to areas such as clinical trials, geothermal exploration, and quantum state verification, where problem structure or ethical constraints shape the design. Computational and Learning-Based Approaches leverage reinforcement learning or deep networks to learn adaptive policies from data, while Theoretical Foundations and Asymptotic Analysis provide rigorous characterizations of optimality and error exponents. Representative works include Adaptive Waveform Design[5] in sensor selection, Group Sequential Methods[20] in clinical contexts, and Deep Learning Policy[14] for learned strategies.

Recent activity highlights a tension between model-driven guarantees and data-driven flexibility. Many studies in the theoretical and procedural branches, such as Sequential Markov Testing[2] and One-Sided Markov Testing[3], derive asymptotic optimality under specific distributional assumptions, while computational approaches like Learning to Explore[23] and Deep Learning Policy[14] emphasize scalability and generalization across problem instances.
In-Context Pure Exploration[0] sits within the Computational and Learning-Based Approaches branch, specifically under Deep Learning for Policy Design, and shares this emphasis on leveraging modern learning architectures to discover adaptive sampling rules. Compared to Learning to Explore[23], which also targets policy learning, In-Context Pure Exploration[0] appears to focus on in-context mechanisms that adapt without explicit retraining, reflecting a shift toward more flexible, meta-learning-style frameworks. This positioning underscores ongoing questions about how to blend statistical rigor with the representational power of neural methods.

Claimed Contributions

In-Context Pure Exploration (ICPE) framework

The authors propose ICPE, a Transformer-based meta-learning framework that learns both a data-collection policy and an inference rule for active sequential hypothesis testing. The model operates in-context at inference time without parameter updates, handling both fixed-confidence and fixed-budget regimes.

2 retrieved papers
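The in-context protocol claimed above can be sketched as a frozen policy that maps the observation history to the next query action and a hypothesis guess, with no parameter updates at test time. The `policy_step` function below is a hypothetical stand-in for the meta-trained Transformer (a simple round-robin-plus-empirical-best rule); all names are illustrative assumptions, not the authors' code.

```python
import random

def policy_step(history, num_actions):
    # Hypothetical stand-in for the meta-trained Transformer policy:
    # it reads the observation history and returns the next query
    # action plus a current hypothesis guess. Here the "policy" just
    # cycles through actions and guesses the action with the highest
    # empirical mean; a real model would compute both from attention
    # over the history tokens.
    action = len(history) % num_actions
    totals = [0.0] * num_actions
    counts = [0] * num_actions
    for a, obs in history:
        totals[a] += obs
        counts[a] += 1
    guess = max(range(num_actions),
                key=lambda a: totals[a] / counts[a] if counts[a] else float("-inf"))
    return action, guess

def run_episode(sample_obs, num_actions, budget, rng):
    # Inference-time loop: the frozen policy gathers evidence for a
    # fixed budget, then predicts a hypothesis. No gradients, no
    # parameter updates -- adaptation happens purely in-context.
    history = []
    for _ in range(budget):
        action, _ = policy_step(history, num_actions)
        history.append((action, sample_obs(action, rng)))
    _, guess = policy_step(history, num_actions)
    return guess

# Toy task: the true hypothesis is the index of the highest-mean arm.
means = [0.0, 1.0, 0.2]
sample = lambda a, rng: rng.gauss(means[a], 0.1)
print(run_episode(sample, num_actions=3, budget=30, rng=random.Random(0)))
```

The point of the sketch is the interface, not the rule: everything task-specific enters through the history, which is what lets a trained model transfer to new tasks without retraining.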
Theoretical characterization of optimal inference and exploration policies

The authors establish that the optimal inference rule is the maximum a posteriori estimator based on the posterior distribution, and derive principled information-theoretic reward functions for training optimal data-collection policies in both fixed-budget and fixed-confidence settings.

10 retrieved papers
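The MAP inference rule over a posterior, and an information-theoretic reward, can be illustrated on a toy Bernoulli family. The uniform prior and the one-step entropy-reduction reward below are assumptions of this sketch—information gain is a common choice, not necessarily the paper's exact objective.

```python
import math

def posterior_update(prior, likelihoods):
    # Bayes rule over a finite hypothesis set:
    # p(h | obs) is proportional to p(obs | h) * p(h).
    unnorm = [p * l for p, l in zip(prior, likelihoods)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def map_hypothesis(posterior):
    # MAP inference: report the most probable hypothesis.
    return max(range(len(posterior)), key=posterior.__getitem__)

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0.0)

# Toy instance: three hypotheses about a coin's heads-probability,
# with a uniform prior (an assumption made for this sketch).
biases = [0.2, 0.5, 0.8]
post = [1.0 / 3.0] * 3
for obs in [1, 1, 0, 1, 1]:            # observed coin flips
    lik = [b if obs == 1 else 1.0 - b for b in biases]
    h_before = entropy(post)
    post = posterior_update(post, lik)
    gain = h_before - entropy(post)    # entropy reduction: one simple
                                       # information-theoretic reward signal
print(map_hypothesis(post))
```

After four heads in five flips, the posterior concentrates on the 0.8-bias hypothesis, so MAP returns its index; a data-collection policy trained to maximize cumulative information gain would be rewarded for queries that shrink posterior entropy fastest.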
Extension to history-dependent and unknown observation models

The authors extend classical active sequential hypothesis testing by allowing environment-specific, history-dependent observation kernels and learning the inference rule from data, rather than assuming memoryless dependence and known estimators as in standard formulations.

9 retrieved papers
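The distinction between memoryless and history-dependent observation kernels can be made concrete in a few lines. The noise-inflation rule below is a toy assumption chosen for illustration, not the paper's concrete model.

```python
import random

def memoryless_kernel(hypothesis_mean, history, rng):
    # Classical assumption: the observation law depends only on the
    # hypothesis; the history argument is ignored.
    return rng.gauss(hypothesis_mean, 1.0)

def history_dependent_kernel(hypothesis_mean, history, rng):
    # Generalized setting: the observation law may depend on the whole
    # history. As a toy assumption, repeated querying here inflates
    # the noise level, so early samples are more informative.
    noise = 1.0 + 0.1 * len(history)
    return rng.gauss(hypothesis_mean, noise)

rng = random.Random(0)
history = []
for _ in range(5):
    history.append(history_dependent_kernel(1.0, history, rng))
print(len(history))
```

Under a history-dependent kernel, the learner can no longer rely on fixed per-hypothesis likelihood models, which is why the inference rule itself must be learned from data.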

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
