When Greedy Wins: Emergent Exploitation Bias in Meta-Bandit LLM Training
Overview
Overall Novelty Assessment
The paper investigates training LLMs to solve multi-armed bandit problems through both supervised fine-tuning on expert trajectories and reinforcement learning with tailored reward signals, including regret-shaped and algorithmic rewards. It resides in the 'Reinforcement Learning and Supervised Fine-Tuning' leaf under 'Training Paradigms for LLM Bandit Agents', which contains five papers total. This represents a moderately populated research direction within a fifty-paper taxonomy, suggesting active but not overcrowded interest in training-based approaches to LLM exploration-exploitation behavior.
The taxonomy reveals neighboring work in 'Adaptive Test-Time and Online Learning Mechanisms' (three papers) focusing on inference-time adaptation rather than offline training, and 'Zero-Shot and In-Context Exploration Capabilities' (six papers) examining pre-trained model behavior without task-specific training. The paper's dual investigation of SFT and RL bridges these areas: it starts from zero-shot baselines but applies training interventions, distinguishing it from pure capability studies while remaining within the training paradigm scope. The taxonomy's scope note explicitly includes reward shaping and imitation learning, both central to this work's methodological contributions.
Among the twenty-three candidates examined across the three contributions, none yielded clear refutations. The strategic reward design contribution examined ten candidates with zero refutable overlaps, as did the unified SFT-RL comparison; the behavioral analysis of exploitation bias examined three candidates, again with no refutations. This suggests that, within the limited search scope (top-K semantic matches plus citation expansion), the specific combination of regret-shaped rewards, algorithmic oracle imitation, and behavioral analysis of catastrophic exploration failure appears relatively unexplored, though the broader training paradigm is well represented in the taxonomy.
The analysis covers a focused slice of recent work rather than exhaustive historical coverage. The absence of refutations among twenty-three candidates indicates novelty within this search scope, but the taxonomy shows five sibling papers in the same leaf, suggesting the general approach of training LLMs for bandit tasks is established. The paper's distinctive angle appears to be the systematic comparison of training paradigms and the behavioral insight about emergent greediness, which may differentiate it from prior training-focused studies.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce two novel reward formulations for training LLM agents on multi-armed bandit tasks: a strategic reward that uses immediate regret to simplify credit assignment and reduce variance, and an algorithmic reward that enables RL-based imitation of oracle policies such as UCB without requiring inverse reinforcement learning.
The paper systematically compares supervised fine-tuning on expert demonstrations versus reinforcement learning with multiple reward designs, evaluating how each paradigm shapes exploration strategies, generalization to longer horizons, and cross-distribution transfer in multi-armed bandit environments.
The authors conduct a behavioral analysis using surrogate statistics such as suffix failure rate and greedy action frequency, uncovering that while learned policies achieve lower average regret, they exhibit exploitative tendencies that can lead to premature abandonment of exploration and catastrophic failures in the long term.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[9] ETTRL: Balancing Exploration and Exploitation in LLM Test-Time Reinforcement Learning Via Entropy Mechanism
[25] Self-Evolving Curriculum for LLM Reasoning
[46] Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts
[48] T-POP: Test-Time Personalization with Online Preference Feedback
Contribution Analysis
Detailed comparisons for each claimed contribution
Strategic and algorithmic reward designs for meta-bandit RL training
The authors introduce two novel reward formulations for training LLM agents on multi-armed bandit tasks: a strategic reward that uses immediate regret to simplify credit assignment and reduce variance, and an algorithmic reward that enables RL-based imitation of oracle policies such as UCB without requiring inverse reinforcement learning.
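As a concrete illustration only (not the authors' implementation), the two reward signals can be sketched for a Bernoulli bandit with a UCB1 oracle. All function names below are hypothetical, and the exploration constant is an assumption:

```python
import math

def strategic_reward(chosen_mean: float, best_mean: float) -> float:
    """Regret-shaped reward: the negative immediate regret of the pull.

    Scoring each step by (best_mean - chosen_mean) rather than the raw
    stochastic payoff localizes credit assignment to the step where a
    suboptimal pull occurred and removes payoff noise from the signal.
    """
    return -(best_mean - chosen_mean)

def ucb_action(counts, sums, t, c=2.0):
    """UCB1 oracle: pull every arm once, then maximize the upper bound."""
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(counts)),
               key=lambda a: sums[a] / counts[a]
               + math.sqrt(c * math.log(t) / counts[a]))

def algorithmic_reward(agent_action, counts, sums, t) -> float:
    """Oracle-agreement reward: +1 iff the agent's action matches what
    UCB1 would do in the same state, so plain RL imitates the oracle
    without any inverse-RL step."""
    return 1.0 if agent_action == ucb_action(counts, sums, t) else 0.0
```

The key design point the contribution claims is that both signals are dense per-step quantities, unlike trajectory-level return, which is what simplifies credit assignment.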
[54] Reinforcement and Imitation Learning via Interactive No-Regret Learning
[55] Contextual bandits and imitation learning with preference-based active queries
[56] Contextual Bandits and Imitation Learning via Preference-Based Active Queries
[57] Multi-Agent Imitation Learning: Value is Easy, Regret is Hard
[58] REBEL: A regularization-based solution for reward overoptimization in robotic reinforcement learning from human feedback
[59] Path-Analysis-Based Reinforcement Learning Algorithm for Imitation Filming
[60] Leveraging demonstrations to improve online learning: Quality matters
[61] I2RL: online inverse reinforcement learning under occlusion
[62] A Joint Imitation-Reinforcement Learning Framework for Reduced Baseline Regret
[63] Distilled Thompson Sampling: Practical and Efficient Thompson Sampling via Imitation Learning
Unified comparison of SFT and RL paradigms for LLM exploration
The paper systematically compares supervised fine-tuning on expert demonstrations versus reinforcement learning with multiple reward designs, evaluating how each paradigm shapes exploration strategies, generalization to longer horizons, and cross-distribution transfer in multi-armed bandit environments.
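The SFT arm of this comparison trains on expert demonstrations. A minimal sketch of how such demonstrations could be generated, assuming a Bernoulli bandit and a UCB1 oracle (the `rollout_ucb_trajectory` helper is illustrative, not from the paper):

```python
import math
import random

def rollout_ucb_trajectory(means, horizon, seed=0):
    """Roll out a UCB1 oracle on a Bernoulli bandit and record
    (history, oracle_action) pairs -- the kind of expert demonstrations
    an SFT paradigm would fine-tune on."""
    rng = random.Random(seed)
    k = len(means)
    counts, sums, history, demos = [0] * k, [0.0] * k, [], []
    for t in range(1, horizon + 1):
        unpulled = [a for a in range(k) if counts[a] == 0]
        if unpulled:
            action = unpulled[0]
        else:
            action = max(range(k), key=lambda a: sums[a] / counts[a]
                         + math.sqrt(2 * math.log(t) / counts[a]))
        demos.append((list(history), action))  # supervised (state, label) pair
        reward = 1.0 if rng.random() < means[action] else 0.0
        counts[action] += 1
        sums[action] += reward
        history.append((action, reward))
    return demos
```

Generating demonstrations at one horizon and evaluating at a longer one is one simple way to probe the horizon-generalization question the paper raises; the RL arm would instead optimize a reward signal directly on rollouts.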
[17] Evolve: Evaluating and optimizing llms for exploration
[64] Multi-turn reinforcement learning with preference human feedback
[65] Learning to Clarify by Reinforcement Learning Through Reward-Weighted Fine-Tuning
[66] AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy
[67] Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards
[68] A critical evaluation of ai feedback for aligning large language models
[69] Generalizable End-to-End Tool-Use RL with Synthetic CodeGym
[70] Exploring Multi-Armed Bandit (MAB) as an AI Tool for Optimising GMA-WAAM Path Planning
[71] Beyond Fine-Tuning: Transferring Behavior in Reinforcement Learning
[72] Use Of User Feedback for Adaptive Model Tuning
Behavioral analysis revealing emergent exploitation bias in learned policies
The authors conduct a behavioral analysis using surrogate statistics such as suffix failure rate and greedy action frequency, uncovering that while learned policies achieve lower average regret, they exhibit exploitative tendencies that can lead to premature abandonment of exploration and catastrophic failures in the long term.
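Both surrogate statistics are straightforward to compute from logged trajectories. A minimal sketch under one plausible reading of the metrics (the paper's exact definitions, e.g. the suffix length, may differ):

```python
def suffix_failure_rate(trajectories, best_arm, suffix_frac=0.5):
    """Fraction of episodes in which the best arm is never pulled during
    the final suffix of the horizon -- a signature of the irreversible
    over-exploitation the analysis highlights."""
    failures = 0
    for actions in trajectories:
        start = int(len(actions) * (1 - suffix_frac))
        if best_arm not in actions[start:]:
            failures += 1
    return failures / len(trajectories)

def greedy_fraction(actions, rewards):
    """Fraction of decision steps where the agent pulled an arm whose
    empirical mean matched the best empirical mean observed so far."""
    counts, sums, greedy, scored = {}, {}, 0, 0
    for a, r in zip(actions, rewards):
        if counts:  # score only steps where some history exists
            scored += 1
            best = max(sums[x] / counts[x] for x in counts)
            if counts.get(a, 0) and sums[a] / counts[a] >= best:
                greedy += 1
        counts[a] = counts.get(a, 0) + 1
        sums[a] = sums.get(a, 0.0) + r
    return greedy / scored if scored else 0.0
```

A policy can score well on average regret while still showing a nonzero suffix failure rate, which is exactly the dissociation between mean performance and tail behavior that the contribution emphasizes.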