Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People

ICLR 2026 Conference Submission
Anonymous Authors

Keywords: Bayesian experimental design, information-seeking, question asking, Collaborative Battleship, expected information gain (EIG), explore-exploit tradeoffs, resource rationality, probabilistic inference, Monte Carlo sampling, symbolic grounding, code generation, reasoning, decision-oriented dialogue, cognitive modeling, human behavior, language model agents, scientific discovery
Abstract:

Many high-stakes applications of AI require forming data-driven hypotheses and making targeted guesses, e.g., in scientific and diagnostic settings. Given limited resources, to what extent do agents based on language models (LMs) act rationally? We develop methods to benchmark and enhance agentic information-seeking, drawing on insights from human behavior. First, we introduce a strategic decision-oriented dialogue task called Collaborative Battleship, in which a partially-informed Captain must balance exploration (asking questions) and action (taking shots), while a fully-informed Spotter must provide accurate answers under an information bottleneck. Compared to human players (N=42), we find that LM agents struggle to ground answers in context, generate informative questions, and select high-value actions. Next, to address these gaps, we develop novel Monte Carlo inference strategies for LMs based on principles from Bayesian Experimental Design (BED). For Spotter agents, our approach boosts accuracy by up to 14.7% absolute over LM-only baselines; for Captain agents, it raises expected information gain (EIG) by up to 0.227 bits (94.2% of the achievable noise ceiling). Combined, these components yield sharper targeting (+0.303–0.374 F1), and enable weaker LMs, such as Llama-4-Scout, to outperform both humans (8% → 82% win rate) and frontier models (0% → 67% win rate vs. GPT-5) at ≈1% of GPT-5's cost. We replicate these findings on Guess Who?, where our methods significantly boost accuracy (+28.3–42.4 p.p.), demonstrating their general applicability for building rational information-seeking agents.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes the paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a strategic dialogue task (Collaborative Battleship) and Monte Carlo inference methods for language model agents that balance question-asking and action-taking under uncertainty. It sits within the 'Language Model Agents for Strategic Dialogue' leaf, which contains only two papers total (including this one). This is a relatively sparse research direction within the broader taxonomy of 50 papers across 27 leaf nodes, suggesting the work addresses an emerging rather than saturated area of inquiry.

The taxonomy reveals that strategic dialogue agents occupy a distinct niche within the AI and Computational Agents branch, separated from multi-agent systems, question-answering strategy selection, and human-AI collaboration. Neighboring leaves focus on distributed architectures or strategy selection without the emphasis on balancing exploration-exploitation trade-offs in dialogue. The paper's use of Bayesian Experimental Design principles connects it conceptually to the Theoretical Foundations branch (normative models), though it remains firmly an applied AI contribution rather than a cognitive modeling effort.

Among 24 candidates examined across three contributions, the Collaborative Battleship task and evaluation framework showed no clear refutation (10 candidates each, zero refutable). However, the Monte Carlo inference strategies based on Bayesian Experimental Design encountered one refutable candidate among four examined, indicating some prior work in this methodological space. The limited search scope (top-K semantic matches plus citation expansion) means these statistics reflect a targeted rather than exhaustive literature review, particularly for the BED-based methods.

Given the sparse taxonomy leaf and the modest search scale, the work appears to occupy relatively novel ground in applying BED principles to LM-based strategic dialogue. The task design and human-agent comparison framework show stronger novelty signals than the inference methods, where at least one overlapping prior approach was identified. The analysis covers semantic neighbors and citations but does not claim comprehensive coverage of all related work in active learning or dialogue systems.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 1

Research Landscape Overview

Core task: rational information-seeking through question-asking and strategic action. This field examines how agents—whether human, computational, or robotic—decide what questions to ask and which actions to take in order to reduce uncertainty and achieve goals efficiently. The taxonomy reflects a broad interdisciplinary landscape. Theoretical Foundations and Cognitive Mechanisms explore normative models and psychological underpinnings of curiosity and inquiry, often drawing on work like Good Query Theory[4] and Causal Information Seeking[6]. Developmental and Individual Differences investigate how question-asking emerges across the lifespan, including studies on toddlers' information search strategies.

Strategic Communication and Linguistic Approaches focus on the pragmatics of dialogue, examining how speakers frame questions to elicit useful responses or even deceive, as seen in Strategic Questioning Deception[29]. Human Decision-Making Under Uncertainty addresses how people gather and weigh information before committing to choices, with contributions like Anticipated Regret Search[38] and Motivated Decision Making[40]. Meanwhile, the Artificial Intelligence and Computational Agents and Robotic and Autonomous Systems branches cover algorithmic methods for active learning, dialogue management, and sensor planning, exemplified by works such as BIG Agent[39] and POAM Robotic Mapping[14]. Domain-Specific Applications span areas from medical triage to legal discovery, while Metacognitive and Reflective Processes consider self-monitoring and adaptive inquiry.

Within the AI and computational agents branch, a particularly active line of work centers on language model agents that engage in strategic dialogue to gather information or negotiate outcomes. These systems must balance exploration—asking clarifying questions—with exploitation of known facts, often under constraints like limited interaction turns or noisy environments.
The original paper[0] sits squarely in this cluster, focusing on Language Model Agents for Strategic Dialogue. It shares thematic ground with [2], which also addresses dialogue-based information-seeking, and contrasts with earlier heuristic approaches like Shoot First Ask Later[3], which prioritized rapid action over careful inquiry. Compared to domain-specific agents such as Doctor R1[5], which targets medical reasoning, the original paper[0] appears to emphasize more general-purpose strategic communication, exploring how agents can dynamically adapt their questioning strategies across varied conversational contexts. This positioning highlights ongoing tensions between task-specific fine-tuning and broadly transferable dialogue policies.

Claimed Contributions

Collaborative Battleship task and BATTLESHIP QA dataset

The authors develop a two-player dialogue and decision-making task extending the classic Battleship game, where players ask natural language questions to gain information about hidden ships. They collect 126 full human-human game trajectories (N=42 participants) including dialogue and actions, creating the BATTLESHIP QA dataset with 931 gold yes/no questions for evaluating grounded answering and strategic gameplay.

10 retrieved papers
Monte Carlo inference strategies based on Bayesian Experimental Design

The authors formalize three Bayesian-inspired inference-time strategies that leverage sequential Monte Carlo approximation: QBayes for asking questions that maximize expected information gain, MBayes for selecting moves that maximize hit probability, and DBayes for deciding between asking questions or taking actions via one-step lookahead. These strategies enable weaker language models to achieve superhuman performance while maintaining significant cost savings.

4 retrieved papers
Can Refute
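To make the QBayes idea above concrete: scoring a noiseless yes/no question by expected information gain reduces to computing the entropy of the predicted answer distribution under the sampled posterior. The sketch below is a deliberately simplified illustration under that assumption (equally weighted hypothesis samples, deterministic answers); all names are hypothetical and it is not the authors' implementation.

```python
import math
from collections import Counter

def estimate_eig(answer_fn, hypotheses):
    """Monte Carlo estimate of the expected information gain (in bits)
    of a yes/no question over equally weighted sampled hypotheses.

    answer_fn maps a hypothesis (e.g., a candidate board) to True/False,
    i.e., the answer a fully-informed Spotter would give under it.
    """
    n = len(hypotheses)
    counts = Counter(answer_fn(h) for h in hypotheses)
    # For a noiseless binary question, EIG equals the entropy of the
    # answer distribution induced by the posterior samples.
    eig = 0.0
    for c in counts.values():
        p = c / n
        eig -= p * math.log2(p)
    return eig

# Toy posterior: four equally likely hypotheses (ship row indices).
boards = [0, 1, 2, 3]
print(estimate_eig(lambda b: b < 2, boards))   # 1.0 bits (an even 2/2 split)
print(estimate_eig(lambda b: b == 0, boards))  # ~0.811 bits (a 1/3 split)
```

The evenly splitting question attains the 1-bit ceiling for a binary answer, which is why balanced questions are preferred when samples are equiprobable.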
Evaluation framework comparing human and agent information-seeking behavior

The authors create a reusable evaluation harness that systematically compares language model agents against human behavior and idealized resource rational strategies in information-seeking tasks. The framework tests distinct agent capabilities including asking informative questions, providing grounded answers, taking strategic actions, and navigating explore/exploit tradeoffs, with demonstrated generalizability to other information-seeking games like Guess Who.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Collaborative Battleship task and BATTLESHIP QA dataset

The authors develop a two-player dialogue and decision-making task extending the classic Battleship game, where players ask natural language questions to gain information about hidden ships. They collect 126 full human-human game trajectories (N=42 participants) including dialogue and actions, creating the BATTLESHIP QA dataset with 931 gold yes/no questions for evaluating grounded answering and strategic gameplay.

Contribution

Monte Carlo inference strategies based on Bayesian Experimental Design

The authors formalize three Bayesian-inspired inference-time strategies that leverage sequential Monte Carlo approximation: QBayes for asking questions that maximize expected information gain, MBayes for selecting moves that maximize hit probability, and DBayes for deciding between asking questions or taking actions via one-step lookahead. These strategies enable weaker language models to achieve superhuman performance while maintaining significant cost savings.
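The explore/act tradeoff behind DBayes can be illustrated with a one-step lookahead: compare the best hit probability available now (the MBayes-style greedy shot) against the expected best hit probability after observing a question's answer. The sketch below is my own simplification under equally weighted hypothesis samples and a noiseless yes/no question, with no turn costs; the names and the utility choice are illustrative assumptions, not the authors' code.

```python
from collections import Counter

def best_move_hit_prob(hypotheses, cells):
    """MBayes-style greedy shot: pick the cell most likely to be a hit.
    Each hypothesis is a set of occupied ship cells; all are equally weighted."""
    counts = Counter(c for h in hypotheses for c in h if c in cells)
    best = max(cells, key=lambda c: counts.get(c, 0))
    return best, counts.get(best, 0) / len(hypotheses)

def decide(hypotheses, cells, question_fn):
    """DBayes-style one-step lookahead: ask the question only if the
    expected best hit probability after seeing its answer beats the
    best hit probability available right now."""
    _, p_now = best_move_hit_prob(hypotheses, cells)
    # Partition hypotheses by the answer the question would receive.
    groups = {}
    for h in hypotheses:
        groups.setdefault(question_fn(h), []).append(h)
    # Expectation over answers, each weighted by its predictive probability.
    p_after = sum(
        (len(g) / len(hypotheses)) * best_move_hit_prob(g, cells)[1]
        for g in groups.values()
    )
    return "ask" if p_after > p_now else "shoot"

# Toy posterior: a one-cell ship equally likely in any of four cells.
hyps = [frozenset({0}), frozenset({1}), frozenset({2}), frozenset({3})]
print(decide(hyps, [0, 1, 2, 3], lambda h: min(h) < 2))  # ask (halves the space)
print(decide(hyps, [0, 1, 2, 3], lambda h: True))        # shoot (question is uninformative)
```

An informative question doubles the post-answer hit probability here (0.25 to 0.5), so the agent asks; a vacuous question leaves it unchanged, so the agent shoots.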

Contribution

Evaluation framework comparing human and agent information-seeking behavior

The authors create a reusable evaluation harness that systematically compares language model agents against human behavior and idealized resource rational strategies in information-seeking tasks. The framework tests distinct agent capabilities including asking informative questions, providing grounded answers, taking strategic actions, and navigating explore/exploit tradeoffs, with demonstrated generalizability to other information-seeking games like Guess Who.
