When Greedy Wins: Emergent Exploitation Bias in Meta-Bandit LLM Training

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Large Language Model; Exploration
Abstract:

While Large Language Models (LLMs) hold promise to become autonomous agents, they often explore suboptimally in sequential decision-making. Recent work has sought to enhance this capability via supervised fine-tuning (SFT) or reinforcement learning (RL), improving regret on the classic multi-armed bandit task. However, it remains unclear how these learning methods shape exploration strategies and how well they generalize. We investigate both paradigms by training LLMs with SFT on expert trajectories and RL with a range of tailored reward signals including a strategic, regret-shaped reward to reduce variance, and an algorithmic reward that enables oracle imitation. The resulting agents outperform pre-trained models and achieve performance comparable to Upper Confidence Bound (UCB) and Thompson Sampling, with robust generalization to 6× longer horizons and across bandit families. Behavioral analysis reveals that gains often stem from more sophisticated but greedier exploitation: RL/SFT agents are more prone to early catastrophic failure than pre-trained models, prematurely abandoning exploration. Furthermore, agents trained to imitate UCB learn to outperform their teacher by adopting more exploitative variants. Our findings clarify when each training paradigm is preferable and advocate tailored reward design and evaluation beyond average regret to promote robust exploratory behavior.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper investigates training LLMs to solve multi-armed bandit problems through both supervised fine-tuning on expert trajectories and reinforcement learning with tailored reward signals, including regret-shaped and algorithmic rewards. It resides in the 'Reinforcement Learning and Supervised Fine-Tuning' leaf under 'Training Paradigms for LLM Bandit Agents', which contains five papers total. This represents a moderately populated research direction within a fifty-paper taxonomy, suggesting active but not overcrowded interest in training-based approaches to LLM exploration-exploitation behavior.

The taxonomy reveals neighboring work in 'Adaptive Test-Time and Online Learning Mechanisms' (three papers) focusing on inference-time adaptation rather than offline training, and 'Zero-Shot and In-Context Exploration Capabilities' (six papers) examining pre-trained model behavior without task-specific training. The paper's dual investigation of SFT and RL bridges these areas: it starts from zero-shot baselines but applies training interventions, distinguishing it from pure capability studies while remaining within the training paradigm scope. The taxonomy's scope note explicitly includes reward shaping and imitation learning, both central to this work's methodological contributions.

Among twenty-three candidates examined across three contributions, none yielded clear refutations. The strategic reward design contribution examined ten candidates with zero refutable overlaps, as did the unified SFT-RL comparison. The behavioral analysis of exploitation bias examined three candidates, again with no refutations. This suggests that within the limited search scope—top-K semantic matches plus citation expansion—the specific combination of regret-shaped rewards, algorithmic oracle imitation, and behavioral analysis of catastrophic exploration failure appears relatively unexplored, though the broader training paradigm is well-represented in the taxonomy.

The analysis covers a focused slice of recent work rather than exhaustive historical coverage. The absence of refutations among twenty-three candidates indicates novelty within this search scope, but the taxonomy shows five sibling papers in the same leaf, suggesting the general approach of training LLMs for bandit tasks is established. The paper's distinctive angle appears to be the systematic comparison of training paradigms and the behavioral insight about emergent greediness, which may differentiate it from prior training-focused studies.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: exploration-exploitation trade-off in multi-armed bandit problems using large language models. The field structure reflects a maturing intersection of classical bandit theory and modern LLM capabilities. At the top level, one branch focuses on LLM-Enhanced Bandit Algorithm Design, where works like LLM Enhanced Bandits[1] and Bandits Meet LLMs[3] integrate language models to improve arm selection or reward prediction. A second branch examines LLM-as-Agent Bandit Behavior and Capabilities, investigating how LLMs themselves navigate exploration versus exploitation when acting as decision-makers, as seen in LLM Human Exploration[5] and Efficient LLM Exploration[6]. Training Paradigms for LLM Bandit Agents addresses how to teach these models effective bandit strategies through reinforcement learning or supervised fine-tuning, while Domain-Specific Applications and Extensions apply bandit frameworks to tasks like model selection or retrieval. Finally, Theoretical Foundations and Methodological Frameworks provide the analytical underpinnings, ensuring rigorous guarantees and principled design.

Within the training paradigms branch, a particularly active line of work explores reinforcement learning and supervised fine-tuning to shape LLM bandit agents. Greedy Wins[0] sits squarely in this cluster, examining how simple greedy strategies can be surprisingly effective when LLMs are trained appropriately, contrasting with more complex exploration schemes. Nearby, ETTRL[9] investigates efficient exploration through targeted RL updates, while Self-Evolving Curriculum[25] and Red-Bandit[46] propose adaptive training regimes that progressively refine exploration policies. A central tension across these studies is whether to rely on intrinsic LLM reasoning or to impose external algorithmic structure. Greedy Wins[0] leans toward the former, suggesting that well-tuned greedy policies can leverage LLM priors effectively, whereas works like T-POP[48] emphasize hybrid approaches that combine learned heuristics with classical bandit guarantees. This positioning highlights an open question: how much exploration complexity should be baked into training versus emergent from model inference.
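The greedy-versus-UCB tension discussed above can be made concrete with a minimal sketch of the two classical arm-selection rules. The helper names and the exploration constant `c` are illustrative choices, not taken from any of the cited papers:

```python
import math

def ucb_select(counts, means, t, c=2.0):
    """UCB1-style rule: pick the arm maximizing empirical mean plus a
    confidence bonus that shrinks as the arm is sampled more often."""
    # Play each arm once before the bonus formula is well-defined.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(counts)),
               key=lambda a: means[a] + math.sqrt(c * math.log(t) / counts[a]))

def greedy_select(counts, means):
    """Pure exploitation: always pick the current best empirical mean."""
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(means)), key=lambda a: means[a])
```

The difference is only the bonus term: greedy commits to whichever arm looks best now, which is exactly the behavior the paper reports emerging in trained agents.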

Claimed Contributions

Strategic and algorithmic reward designs for meta-bandit RL training

The authors introduce two novel reward formulations for training LLM agents on multi-armed bandit tasks: a strategic reward that uses immediate regret to simplify credit assignment and reduce variance, and an algorithmic reward that enables RL-based imitation of oracle policies such as UCB without requiring inverse reinforcement learning.

10 retrieved papers
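The two reward formulations described above can be sketched as follows. This is an interpretation of the report's summary, not the paper's exact implementation; function names and signatures are hypothetical:

```python
def strategic_reward(chosen_mean, best_mean):
    """Strategic reward: negative immediate (per-step) regret.
    Rewarding each step directly, rather than a single trajectory-level
    return, simplifies credit assignment and reduces reward variance."""
    return -(best_mean - chosen_mean)

def algorithmic_reward(agent_action, oracle_action):
    """Algorithmic reward: +1 when the agent matches the oracle policy's
    action (e.g. UCB's choice), enabling RL-based imitation without
    inverse reinforcement learning."""
    return 1.0 if agent_action == oracle_action else 0.0
```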
Unified comparison of SFT and RL paradigms for LLM exploration

The paper systematically compares supervised fine-tuning on expert demonstrations versus reinforcement learning with multiple reward designs, evaluating how each paradigm shapes exploration strategies, generalization to longer horizons, and cross-distribution transfer in multi-armed bandit environments.

10 retrieved papers
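The structural difference between the two compared paradigms can be illustrated with a toy softmax policy over arms. This is a generic sketch of SFT versus REINFORCE-style gradients, assuming nothing about the paper's actual architecture or optimizer:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sft_grad(logits, expert_action):
    """SFT: cross-entropy gradient pushes probability mass toward the
    expert's demonstrated action, regardless of observed reward."""
    p = softmax(logits)
    return [p[a] - (1.0 if a == expert_action else 0.0) for a in range(len(p))]

def reinforce_grad(logits, sampled_action, reward):
    """RL (REINFORCE): same gradient direction, but scaled by the reward
    of the sampled action, so the signal depends on outcomes rather than
    on matching a demonstration."""
    p = softmax(logits)
    return [reward * (p[a] - (1.0 if a == sampled_action else 0.0))
            for a in range(len(p))]
```

The contrast makes the report's question concrete: SFT copies the expert's exploration pattern, while RL reinforces whatever the chosen reward signal happens to favor.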
Behavioral analysis revealing emergent exploitation bias in learned policies

The authors conduct a behavioral analysis using surrogate statistics such as suffix failure rate and greedy action frequency, uncovering that while learned policies achieve lower average regret, they exhibit exploitative tendencies that can lead to premature abandonment of exploration and catastrophic failures in the long term.

3 retrieved papers
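The two surrogate statistics named above can be sketched directly from action logs. Exact definitions in the paper may differ (e.g. the suffix length); the 50% suffix and the tie-breaking convention here are assumptions:

```python
def suffix_failure_rate(episodes, best_arm, suffix_frac=0.5):
    """Fraction of episodes whose final suffix never plays the best arm,
    i.e. exploration was abandoned before the best arm was identified."""
    failures = 0
    for ep in episodes:  # episodes: list of per-episode action lists
        start = int(len(ep) * (1 - suffix_frac))
        if best_arm not in ep[start:]:
            failures += 1
    return failures / len(episodes)

def greedy_fraction(actions, rewards):
    """Fraction of rounds (after the first) where the chosen arm had the
    highest empirical mean so far; ties count as greedy."""
    totals, counts = {}, {}
    greedy, decisions = 0, 0
    for a, r in zip(actions, rewards):
        if counts:  # no empirical means exist before the first pull
            best_mean = max(totals[k] / counts[k] for k in counts)
            if a in counts and totals[a] / counts[a] >= best_mean:
                greedy += 1
            decisions += 1
        totals[a] = totals.get(a, 0.0) + r
        counts[a] = counts.get(a, 0) + 1
    return greedy / decisions if decisions else 0.0
```

A high greedy fraction together with a nonzero suffix failure rate is exactly the signature the report describes: low average regret achieved through exploitation that occasionally locks onto the wrong arm for good.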

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Strategic and algorithmic reward designs for meta-bandit RL training. As described under Claimed Contributions; none of the 10 retrieved candidate papers was refutable.

Contribution 2: Unified comparison of SFT and RL paradigms for LLM exploration. As described under Claimed Contributions; none of the 10 retrieved candidate papers was refutable.

Contribution 3: Behavioral analysis revealing emergent exploitation bias in learned policies. As described under Claimed Contributions; none of the 3 retrieved candidate papers was refutable.