Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People
Overview
Overall Novelty Assessment
The paper introduces a strategic dialogue task (Collaborative Battleship) and Monte Carlo inference methods for language model agents that balance question-asking and action-taking under uncertainty. It sits within the 'Language Model Agents for Strategic Dialogue' leaf, which contains only two papers total (including this one). This is a relatively sparse research direction within the broader taxonomy of 50 papers across 27 leaf nodes, suggesting the work addresses an emerging rather than saturated area of inquiry.
The taxonomy reveals that strategic dialogue agents occupy a distinct niche within the AI and Computational Agents branch, separate from multi-agent systems, question-answering strategy selection, and human-AI collaboration. Neighboring leaves focus on distributed architectures or strategy selection without this work's emphasis on balancing exploration-exploitation trade-offs in dialogue. The paper's use of Bayesian Experimental Design principles connects it conceptually to the Theoretical Foundations branch (normative models), though it remains firmly an applied AI contribution rather than a cognitive modeling effort.
Among 24 candidates examined across three contributions, the Collaborative Battleship task and evaluation framework showed no clear refutation (10 candidates each, zero refutable). However, the Monte Carlo inference strategies based on Bayesian Experimental Design encountered one refutable candidate among four examined, indicating some prior work in this methodological space. The limited search scope (top-K semantic matches plus citation expansion) means these statistics reflect a targeted rather than exhaustive literature review, particularly for the BED-based methods.
Given the sparse taxonomy leaf and the modest search scale, the work appears to occupy relatively novel ground in applying BED principles to LM-based strategic dialogue. The task design and human-agent comparison framework show stronger novelty signals than the inference methods, where at least one overlapping prior approach was identified. The analysis covers semantic neighbors and citations but does not claim comprehensive coverage of all related work in active learning or dialogue systems.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors develop a two-player dialogue and decision-making task extending the classic Battleship game, where players ask natural language questions to gain information about hidden ships. They collect 126 full human-human game trajectories (N=42 participants) including dialogue and actions, creating the BATTLESHIP QA dataset with 931 gold yes/no questions for evaluating grounded answering and strategic gameplay.
The authors formalize three Bayesian-inspired inference-time strategies that leverage sequential Monte Carlo approximation: QBayes for asking questions that maximize expected information gain, MBayes for selecting moves that maximize hit probability, and DBayes for deciding between asking a question or taking an action via one-step lookahead. These strategies enable weaker language models to achieve superhuman performance at significantly lower inference cost.
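To make the QBayes idea concrete, here is a minimal sketch (not the paper's implementation; the board representation, function names, and uniform particle weights are illustrative assumptions): a yes/no question is scored by its expected information gain over a particle set approximating the posterior on hidden boards, which for a deterministic question reduces to the entropy of the answer distribution.

```python
import math
from collections import Counter

def expected_information_gain(question, particles):
    """Score a yes/no question by the expected reduction in posterior
    entropy over a set of equally weighted board hypotheses.

    question:  callable board -> bool (the answer on that board)
    particles: list of hypothetical boards (e.g. sets of occupied cells)
    """
    n = len(particles)
    counts = Counter(question(board) for board in particles)
    # For a question whose answer is deterministic given the board,
    # expected information gain equals the entropy of the answer.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def qbayes_select(questions, particles):
    """Pick the candidate question with maximal expected information gain."""
    return max(questions, key=lambda q: expected_information_gain(q, particles))
```

For example, with four equally likely boards, a question that splits the particles 2/2 yields one full bit of information and would be preferred over one that splits them 1/3.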
The authors create a reusable evaluation harness that systematically compares language model agents against human behavior and idealized resource-rational strategies in information-seeking tasks. The framework tests distinct agent capabilities, including asking informative questions, providing grounded answers, taking strategic actions, and navigating explore/exploit tradeoffs, with demonstrated generalizability to other information-seeking games such as Guess Who.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[2] Doctor-R1: Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning
Contribution Analysis
Detailed comparisons for each claimed contribution
Collaborative Battleship task and BATTLESHIP QA dataset
The authors develop a two-player dialogue and decision-making task extending the classic Battleship game, where players ask natural language questions to gain information about hidden ships. They collect 126 full human-human game trajectories (N=42 participants) including dialogue and actions, creating the BATTLESHIP QA dataset with 931 gold yes/no questions for evaluating grounded answering and strategic gameplay.
[55] Manipulative underspecification
[56] Autonomous agents for collaborative task under information asymmetry
[57] The Traitors: Deception and Trust in Multi-Agent Language Model Simulations
[58] Prompt, information, and game theory: A strategic guide to existence
[59] Multi-Turn Puzzles: Evaluating Interactive Reasoning and Strategic Dialogue in LLMs
[60] Human-agent cooperation in games under incomplete information through natural language communication
[61] Conversation as action under uncertainty
[62] Steering language models with game-theoretic solvers
[63] Human-Agent Coordination in Games under Incomplete Information via Multi-Step Intent
[64] Communication and Verification in LLM Agents towards Collaboration under Information Asymmetry
Monte Carlo inference strategies based on Bayesian Experimental Design
The authors formalize three Bayesian-inspired inference-time strategies that leverage sequential Monte Carlo approximation: QBayes for asking questions that maximize expected information gain, MBayes for selecting moves that maximize hit probability, and DBayes for deciding between asking a question or taking an action via one-step lookahead. These strategies enable weaker language models to achieve superhuman performance at significantly lower inference cost.
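The move-selection side (MBayes) admits an equally simple sketch under the same illustrative assumptions (uniform particle weights, boards as sets of occupied cells; function names are mine, not the paper's): fire at the untried cell most likely to contain a ship under the particle posterior. DBayes would then compare the one-step expected value of this best move against that of the best question.

```python
def mbayes_select(candidate_cells, particles):
    """Choose the untried cell with the highest estimated hit
    probability under equally weighted board hypotheses.

    candidate_cells: iterable of cells not yet fired at
    particles:       list of boards, each a set of occupied cells
    """
    n = len(particles)
    def hit_prob(cell):
        # Fraction of posterior particles in which this cell holds a ship.
        return sum(cell in board for board in particles) / n
    return max(candidate_cells, key=hit_prob)
```

For instance, if every surviving particle places a ship at some cell, that cell has estimated hit probability 1.0 and is selected first.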
[51] BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design
[52] Doing experiments and revising rules with natural language and probabilistic reasoning
[53] Bayesian Computations in the 21st Century
[54] Reverse-Annealed Sequential Monte Carlo for Efficient Bayesian Optimal Experiment Design
Evaluation framework comparing human and agent information-seeking behavior
The authors create a reusable evaluation harness that systematically compares language model agents against human behavior and idealized resource-rational strategies in information-seeking tasks. The framework tests distinct agent capabilities, including asking informative questions, providing grounded answers, taking strategic actions, and navigating explore/exploit tradeoffs, with demonstrated generalizability to other information-seeking games such as Guess Who.