In-Context Learning for Pure Exploration

ICLR 2026 Conference Submission · Anonymous Authors
active sequential hypothesis testing, pure exploration, reinforcement learning, in-context learning, best arm identification
Abstract:

We study the active sequential hypothesis testing problem, also known as pure exploration: given a new task, the learner adaptively collects data from the environment to efficiently determine an underlying correct hypothesis. A classical instance of this problem is identifying the best arm in a multi-armed bandit (best-arm identification, BAI), where actions index hypotheses. Another important case is generalized search, the problem of determining the correct label through a sequence of strategically selected queries that indirectly reveal information about the label. In this work, we introduce In-Context Pure Exploration (ICPE), which meta-trains Transformers to map observation histories to query actions and a predicted hypothesis, yielding a model that transfers in-context. At inference time, ICPE actively gathers evidence on new tasks and infers the true hypothesis without parameter updates. Across deterministic, stochastic, and structured benchmarks, including BAI and generalized search, ICPE is competitive with adaptive baselines while requiring no explicit modeling of information structure. Our results support Transformers as practical architectures for general sequential testing.
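To make the fixed-budget best-arm identification setting concrete, here is a minimal non-adaptive baseline: spend the sampling budget uniformly across arms, then guess the arm with the highest empirical mean. This is a sketch for intuition only, not the ICPE method; all names and the Gaussian reward model are illustrative assumptions.

```python
import random

def fixed_budget_bai(arm_means, budget, rng):
    """Uniform round-robin baseline for fixed-budget best-arm
    identification: sample each arm equally often, then return the
    index of the arm with the highest empirical mean. Adaptive
    strategies (including learned policies) aim to allocate the
    same budget more efficiently."""
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(budget):
        a = t % k                          # round-robin arm choice
        sums[a] += rng.gauss(arm_means[a], 1.0)  # noisy Gaussian reward
        counts[a] += 1
    means = [s / c for s, c in zip(sums, counts)]
    return max(range(k), key=means.__getitem__)

best = fixed_budget_bai([0.0, 5.0, 1.0], budget=300, rng=random.Random(1))
print("guessed best arm:", best)
```

With a large gap between arm means relative to the noise, even this non-adaptive allocation identifies the best arm reliably; the interesting regime for adaptive methods is small gaps and tight budgets.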

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces In-Context Pure Exploration (ICPE), a meta-learning framework that trains Transformers to perform active sequential hypothesis testing by mapping observation histories to query actions and predicted hypotheses. It resides in the 'Deep Learning for Policy Design' leaf, which contains only two papers total, indicating a relatively sparse research direction within the broader taxonomy of fifty papers. This leaf sits under 'Computational and Learning-Based Approaches', a branch that contrasts with the field's dominant analytical and domain-specific methods.

The taxonomy reveals that most sequential testing work concentrates on analytical stopping rules, domain applications like clinical trials and quantum testing, or classical adaptive sampling strategies. The 'Computational and Learning-Based Approaches' branch is small, with only six papers across two leaves, suggesting that learning-based policy design remains an emerging area. ICPE's sibling paper in the same leaf likely explores similar neural policy learning, but the sparse population indicates limited prior work directly combining Transformers with pure exploration tasks, distinguishing this direction from the field's traditional information-theoretic and asymptotic analysis focus.

Among twenty-one candidates examined, none clearly refutes any of the three contributions. The ICPE framework itself was checked against two candidates with no overlaps found; for the theoretical characterization of optimal policies, ten candidates were examined without refutation; and for the extension to history-dependent models, nine candidates were reviewed, also with no clear prior work found. These statistics suggest that within the limited search scope—top-K semantic matches plus citation expansion—the specific combination of in-context learning, Transformers, and pure exploration appears relatively unexplored, though the small candidate pool means gaps in the literature search remain possible.

Given the limited search scale and the sparse taxonomy leaf, the work appears to occupy a novel intersection of meta-learning and active hypothesis testing. However, the analysis covers only a fraction of the broader machine learning and sequential decision-making literature, and the taxonomy's structure shows that learning-based approaches are underrepresented overall. A more exhaustive search across reinforcement learning, meta-learning, and bandit literature would be needed to fully assess novelty beyond the top-twenty-one semantic matches examined here.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 0

Research Landscape Overview

Core task: active sequential hypothesis testing with adaptive data collection. The field encompasses methods that iteratively decide which data to gather next and when to stop, balancing statistical power against sampling cost. The taxonomy organizes this landscape into five main branches. Adaptive Action Selection and Sampling Strategies focus on choosing informative observations or sensors at each stage, often drawing on bandit-like frameworks and information-theoretic criteria. Sequential Testing Procedures and Stopping Rules develop principled decision boundaries and error guarantees, extending classical sequential analysis to modern settings. Domain-Specific Sequential Testing Applications tailor these ideas to areas such as clinical trials, geothermal exploration, and quantum state verification, where problem structure or ethical constraints shape the design. Computational and Learning-Based Approaches leverage reinforcement learning or deep networks to learn adaptive policies from data, while Theoretical Foundations and Asymptotic Analysis provide rigorous characterizations of optimality and error exponents. Representative works include Adaptive Waveform Design[5] in sensor selection, Group Sequential Methods[20] in clinical contexts, and Deep Learning Policy[14] for learned strategies.

Recent activity highlights a tension between model-driven guarantees and data-driven flexibility. Many studies in the theoretical and procedural branches, such as Sequential Markov Testing[2] and One-Sided Markov Testing[3], derive asymptotic optimality under specific distributional assumptions, while computational approaches like Learning to Explore[23] and Deep Learning Policy[14] emphasize scalability and generalization across problem instances.
In-Context Pure Exploration[0] sits within the Computational and Learning-Based Approaches branch, specifically under Deep Learning for Policy Design, and shares this emphasis on leveraging modern learning architectures to discover adaptive sampling rules. Compared to Learning to Explore[23], which also targets policy learning, In-Context Pure Exploration[0] appears to focus on in-context mechanisms that adapt without explicit retraining, reflecting a shift toward more flexible, meta-learning-style frameworks. This positioning underscores ongoing questions about how to blend statistical rigor with the representational power of neural methods.

Claimed Contributions

In-Context Pure Exploration (ICPE) framework

The authors propose ICPE, a Transformer-based meta-learning framework that learns both a data-collection policy and an inference rule for active sequential hypothesis testing. The model operates in-context at inference time without parameter updates, handling both fixed-confidence and fixed-budget regimes.

2 retrieved papers
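The in-context protocol claimed above can be sketched as a frozen policy that maps the observation history to the next query action and a hypothesis guess, with no parameter updates at test time. The `policy_step` function below is a hypothetical stand-in for the meta-trained Transformer (a simple round-robin-plus-empirical-best rule); all names are illustrative assumptions, not the authors' code.

```python
import random

def policy_step(history, num_actions):
    # Hypothetical stand-in for the meta-trained Transformer policy:
    # it reads the observation history and returns the next query
    # action plus a current hypothesis guess. Here the "policy" just
    # cycles through actions and guesses the action with the highest
    # empirical mean; a real model would compute both from attention
    # over the history tokens.
    action = len(history) % num_actions
    totals = [0.0] * num_actions
    counts = [0] * num_actions
    for a, obs in history:
        totals[a] += obs
        counts[a] += 1
    guess = max(range(num_actions),
                key=lambda a: totals[a] / counts[a] if counts[a] else float("-inf"))
    return action, guess

def run_episode(sample_obs, num_actions, budget, rng):
    # Inference-time loop: the frozen policy gathers evidence for a
    # fixed budget, then predicts a hypothesis. No gradients, no
    # parameter updates -- adaptation happens purely in-context.
    history = []
    for _ in range(budget):
        action, _ = policy_step(history, num_actions)
        history.append((action, sample_obs(action, rng)))
    _, guess = policy_step(history, num_actions)
    return guess

# Toy task: the true hypothesis is the index of the highest-mean arm.
means = [0.0, 1.0, 0.2]
sample = lambda a, rng: rng.gauss(means[a], 0.1)
print(run_episode(sample, num_actions=3, budget=30, rng=random.Random(0)))
```

The point of the sketch is the interface, not the rule: everything task-specific enters through the history, which is what lets a trained model transfer to new tasks without retraining.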
Theoretical characterization of optimal inference and exploration policies

The authors establish that the optimal inference rule is the maximum a posteriori estimator based on the posterior distribution, and derive principled information-theoretic reward functions for training optimal data-collection policies in both fixed-budget and fixed-confidence settings.

10 retrieved papers
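The MAP inference rule over a posterior, and an information-theoretic reward, can be illustrated on a toy Bernoulli family. The uniform prior and the one-step entropy-reduction reward below are assumptions of this sketch—information gain is a common choice, not necessarily the paper's exact objective.

```python
import math

def posterior_update(prior, likelihoods):
    # Bayes rule over a finite hypothesis set:
    # p(h | obs) is proportional to p(obs | h) * p(h).
    unnorm = [p * l for p, l in zip(prior, likelihoods)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def map_hypothesis(posterior):
    # MAP inference: report the most probable hypothesis.
    return max(range(len(posterior)), key=posterior.__getitem__)

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0.0)

# Toy instance: three hypotheses about a coin's heads-probability,
# with a uniform prior (an assumption made for this sketch).
biases = [0.2, 0.5, 0.8]
post = [1.0 / 3.0] * 3
for obs in [1, 1, 0, 1, 1]:            # observed coin flips
    lik = [b if obs == 1 else 1.0 - b for b in biases]
    h_before = entropy(post)
    post = posterior_update(post, lik)
    gain = h_before - entropy(post)    # entropy reduction: one simple
                                       # information-theoretic reward signal
print(map_hypothesis(post))
```

After four heads in five flips, the posterior concentrates on the 0.8-bias hypothesis, so MAP returns its index; a data-collection policy trained to maximize cumulative information gain would be rewarded for queries that shrink posterior entropy fastest.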
Extension to history-dependent and unknown observation models

The authors extend classical active sequential hypothesis testing by allowing environment-specific, history-dependent observation kernels and learning the inference rule from data, rather than assuming memoryless dependence and known estimators as in standard formulations.

9 retrieved papers
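The distinction between memoryless and history-dependent observation kernels can be made concrete in a few lines. The noise-inflation rule below is a toy assumption chosen for illustration, not the paper's concrete model.

```python
import random

def memoryless_kernel(hypothesis_mean, history, rng):
    # Classical assumption: the observation law depends only on the
    # hypothesis; the history argument is ignored.
    return rng.gauss(hypothesis_mean, 1.0)

def history_dependent_kernel(hypothesis_mean, history, rng):
    # Generalized setting: the observation law may depend on the whole
    # history. As a toy assumption, repeated querying here inflates
    # the noise level, so early samples are more informative.
    noise = 1.0 + 0.1 * len(history)
    return rng.gauss(hypothesis_mean, noise)

rng = random.Random(0)
history = []
for _ in range(5):
    history.append(history_dependent_kernel(1.0, history, rng))
print(len(history))
```

Under a history-dependent kernel, the learner can no longer rely on fixed per-hypothesis likelihood models, which is why the inference rule itself must be learned from data.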

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
