When Greedy Wins: Emergent Exploitation Bias in Meta-Bandit LLM Training

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Large Language Model; Exploration
Abstract:

While Large Language Models (LLMs) hold promise to become autonomous agents, they often explore suboptimally in sequential decision-making. Recent work has sought to enhance this capability via supervised fine-tuning (SFT) or reinforcement learning (RL), improving regret on the classic multi-armed bandit task. However, it remains unclear how these learning methods shape exploration strategies and how well they generalize. We investigate both paradigms by training LLMs with SFT on expert trajectories and RL with a range of tailored reward signals including a strategic, regret-shaped reward to reduce variance, and an algorithmic reward that enables oracle imitation. The resulting agents outperform pre-trained models and achieve performance comparable to Upper Confidence Bound (UCB) and Thompson Sampling, with robust generalization to 6× longer horizons and across bandit families. Behavioral analysis reveals that gains often stem from more sophisticated but greedier exploitation: RL/SFT agents are more prone to early catastrophic failure than pre-trained models, prematurely abandoning exploration. Furthermore, agents trained to imitate UCB learn to outperform their teacher by adopting more exploitative variants. Our findings clarify when each training paradigm is preferable and advocate tailored reward design and evaluation beyond average regret to promote robust exploratory behavior.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper investigates training LLMs to solve multi-armed bandit problems through both supervised fine-tuning on expert trajectories and reinforcement learning with tailored reward signals, including regret-shaped and algorithmic rewards. It resides in the 'Reinforcement Learning and Supervised Fine-Tuning' leaf under 'Training Paradigms for LLM Bandit Agents', which contains five papers total. This represents a moderately populated research direction within a fifty-paper taxonomy, suggesting active but not overcrowded interest in training-based approaches to LLM exploration-exploitation behavior.

The taxonomy reveals neighboring work in 'Adaptive Test-Time and Online Learning Mechanisms' (three papers) focusing on inference-time adaptation rather than offline training, and 'Zero-Shot and In-Context Exploration Capabilities' (six papers) examining pre-trained model behavior without task-specific training. The paper's dual investigation of SFT and RL bridges these areas: it starts from zero-shot baselines but applies training interventions, distinguishing it from pure capability studies while remaining within the training paradigm scope. The taxonomy's scope note explicitly includes reward shaping and imitation learning, both central to this work's methodological contributions.

Among twenty-three candidates examined across three contributions, none yielded clear refutations. The strategic reward design contribution examined ten candidates with zero refutable overlaps, as did the unified SFT-RL comparison. The behavioral analysis of exploitation bias examined three candidates, again with no refutations. This suggests that within the limited search scope—top-K semantic matches plus citation expansion—the specific combination of regret-shaped rewards, algorithmic oracle imitation, and behavioral analysis of catastrophic exploration failure appears relatively unexplored, though the broader training paradigm is well-represented in the taxonomy.

The analysis covers a focused slice of recent work rather than exhaustive historical coverage. The absence of refutations among twenty-three candidates indicates novelty within this search scope, but the taxonomy shows five sibling papers in the same leaf, suggesting the general approach of training LLMs for bandit tasks is established. The paper's distinctive angle appears to be the systematic comparison of training paradigms and the behavioral insight about emergent greediness, which may differentiate it from prior training-focused studies.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: exploration-exploitation trade-off in multi-armed bandit problems using large language models. The field structure reflects a maturing intersection of classical bandit theory and modern LLM capabilities. At the top level, one branch focuses on LLM-Enhanced Bandit Algorithm Design, where works like LLM Enhanced Bandits[1] and Bandits Meet LLMs[3] integrate language models to improve arm selection or reward prediction. A second branch examines LLM-as-Agent Bandit Behavior and Capabilities, investigating how LLMs themselves navigate exploration versus exploitation when acting as decision-makers, as seen in LLM Human Exploration[5] and Efficient LLM Exploration[6]. Training Paradigms for LLM Bandit Agents addresses how to teach these models effective bandit strategies through reinforcement learning or supervised fine-tuning, while Domain-Specific Applications and Extensions apply bandit frameworks to tasks like model selection or retrieval. Finally, Theoretical Foundations and Methodological Frameworks provide the analytical underpinnings, ensuring rigorous guarantees and principled design.

Within the training paradigms branch, a particularly active line of work explores reinforcement learning and supervised fine-tuning to shape LLM bandit agents. Greedy Wins[0] sits squarely in this cluster, examining how simple greedy strategies can be surprisingly effective when LLMs are trained appropriately, contrasting with more complex exploration schemes. Nearby, ETTRL[9] investigates efficient exploration through targeted RL updates, while Self-Evolving Curriculum[25] and Red-Bandit[46] propose adaptive training regimes that progressively refine exploration policies. A central tension across these studies is whether to rely on intrinsic LLM reasoning or to impose external algorithmic structure. Greedy Wins[0] leans toward the former, suggesting that well-tuned greedy policies can leverage LLM priors effectively, whereas works like T-POP[48] emphasize hybrid approaches that combine learned heuristics with classical bandit guarantees. This positioning highlights an open question: how much exploration complexity should be baked into training versus emergent from model inference.
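The greedy-versus-UCB tension discussed above can be made concrete with a minimal sketch of the two classical arm-selection rules. The helper names and the exploration constant `c` are illustrative choices, not taken from any of the cited papers:

```python
import math

def ucb_select(counts, means, t, c=2.0):
    """UCB1-style rule: pick the arm maximizing empirical mean plus a
    confidence bonus that shrinks as the arm is sampled more often."""
    # Play each arm once before the bonus formula is well-defined.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(counts)),
               key=lambda a: means[a] + math.sqrt(c * math.log(t) / counts[a]))

def greedy_select(counts, means):
    """Pure exploitation: always pick the current best empirical mean."""
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(means)), key=lambda a: means[a])
```

The difference is only the bonus term: greedy commits to whichever arm looks best now, which is exactly the behavior the paper reports emerging in trained agents.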

Claimed Contributions

Strategic and algorithmic reward designs for meta-bandit RL training

The authors introduce two novel reward formulations for training LLM agents on multi-armed bandit tasks: a strategic reward that uses immediate regret to simplify credit assignment and reduce variance, and an algorithmic reward that enables RL-based imitation of oracle policies such as UCB without requiring inverse reinforcement learning.

10 retrieved papers
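The two reward formulations described above can be sketched as follows. This is an interpretation of the report's summary, not the paper's exact implementation; function names and signatures are hypothetical:

```python
def strategic_reward(chosen_mean, best_mean):
    """Strategic reward: negative immediate (per-step) regret.
    Rewarding each step directly, rather than a single trajectory-level
    return, simplifies credit assignment and reduces reward variance."""
    return -(best_mean - chosen_mean)

def algorithmic_reward(agent_action, oracle_action):
    """Algorithmic reward: +1 when the agent matches the oracle policy's
    action (e.g. UCB's choice), enabling RL-based imitation without
    inverse reinforcement learning."""
    return 1.0 if agent_action == oracle_action else 0.0
```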
Unified comparison of SFT and RL paradigms for LLM exploration

The paper systematically compares supervised fine-tuning on expert demonstrations versus reinforcement learning with multiple reward designs, evaluating how each paradigm shapes exploration strategies, generalization to longer horizons, and cross-distribution transfer in multi-armed bandit environments.

10 retrieved papers
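The structural difference between the two compared paradigms can be illustrated with a toy softmax policy over arms. This is a generic sketch of SFT versus REINFORCE-style gradients, assuming nothing about the paper's actual architecture or optimizer:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sft_grad(logits, expert_action):
    """SFT: cross-entropy gradient pushes probability mass toward the
    expert's demonstrated action, regardless of observed reward."""
    p = softmax(logits)
    return [p[a] - (1.0 if a == expert_action else 0.0) for a in range(len(p))]

def reinforce_grad(logits, sampled_action, reward):
    """RL (REINFORCE): same gradient direction, but scaled by the reward
    of the sampled action, so the signal depends on outcomes rather than
    on matching a demonstration."""
    p = softmax(logits)
    return [reward * (p[a] - (1.0 if a == sampled_action else 0.0))
            for a in range(len(p))]
```

The contrast makes the report's question concrete: SFT copies the expert's exploration pattern, while RL reinforces whatever the chosen reward signal happens to favor.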
Behavioral analysis revealing emergent exploitation bias in learned policies

The authors conduct a behavioral analysis using surrogate statistics such as suffix failure rate and greedy action frequency, uncovering that while learned policies achieve lower average regret, they exhibit exploitative tendencies that can lead to premature abandonment of exploration and catastrophic failures in the long term.

3 retrieved papers
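The two surrogate statistics named above can be sketched directly from action logs. Exact definitions in the paper may differ (e.g. the suffix length); the 50% suffix and the tie-breaking convention here are assumptions:

```python
def suffix_failure_rate(episodes, best_arm, suffix_frac=0.5):
    """Fraction of episodes whose final suffix never plays the best arm,
    i.e. exploration was abandoned before the best arm was identified."""
    failures = 0
    for ep in episodes:  # episodes: list of per-episode action lists
        start = int(len(ep) * (1 - suffix_frac))
        if best_arm not in ep[start:]:
            failures += 1
    return failures / len(episodes)

def greedy_fraction(actions, rewards):
    """Fraction of rounds (after the first) where the chosen arm had the
    highest empirical mean so far; ties count as greedy."""
    totals, counts = {}, {}
    greedy, decisions = 0, 0
    for a, r in zip(actions, rewards):
        if counts:  # no empirical means exist before the first pull
            best_mean = max(totals[k] / counts[k] for k in counts)
            if a in counts and totals[a] / counts[a] >= best_mean:
                greedy += 1
            decisions += 1
        totals[a] = totals.get(a, 0.0) + r
        counts[a] = counts.get(a, 0) + 1
    return greedy / decisions if decisions else 0.0
```

A high greedy fraction together with a nonzero suffix failure rate is exactly the signature the report describes: low average regret achieved through exploitation that occasionally locks onto the wrong arm for good.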

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Strategic and algorithmic reward designs for meta-bandit RL training. As described under Claimed Contributions; none of the 10 retrieved candidate papers was refutable.

Contribution 2: Unified comparison of SFT and RL paradigms for LLM exploration. As described under Claimed Contributions; none of the 10 retrieved candidate papers was refutable.

Contribution 3: Behavioral analysis revealing emergent exploitation bias in learned policies. As described under Claimed Contributions; none of the 3 retrieved candidate papers was refutable.