Inverse Linear Bandits via Linear Programs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Inverse Reinforcement Learning, Linear Bandits
Abstract:

Inverse reinforcement learning (IRL) is a well-established paradigm for circumventing the need for an explicitly specified reward. In this paper, we study the problem of estimating the reward function from a single sequence of actions (i.e., a demonstration) of a stochastic linear bandit algorithm. Our main result is a unified approach for inverse linear bandits, based on the idea of formulating a linear program by tightly characterizing the confidence intervals of pulled actions. We show that the estimation error of our algorithms matches the information-theoretic lower bound up to polynomial factors in d and log T, where d is the dimensionality of the feature space and T is the length of the demonstration. Compared to prior approaches, our approach (i) gives a unified reward estimator that works when the demonstrator employs LinUCB or Phased Elimination, two popular algorithms for stochastic linear bandits, whereas the existing estimator works only for Phased Elimination; (ii) does not require access to hyperparameters or internal states of the demonstrator algorithm, as prior work does; and (iii) works for general action sets, whereas the existing estimator requires assumptions on the density and geometry of the action set. We further demonstrate the practicality of our approach by validating our algorithms on synthetic data and on demonstrations constructed from real-world datasets, where our estimators significantly outperform existing ones.
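
To make the abstract's construction concrete, here is a minimal, hypothetical sketch of an LP of the kind described: each pulled action is treated as near-optimal within some confidence width, which yields linear constraints on the unknown reward vector. The function name, the constraint form, and the fixed confidence widths are illustrative assumptions, not the paper's actual estimator.

```python
import numpy as np
from scipy.optimize import linprog

def estimate_reward_lp(pulled, alternatives, widths):
    """Hypothetical sketch of an LP-based reward estimate from a demonstration.

    pulled[t]       : feature vector (d,) of the action chosen at round t
    alternatives[t] : array (k_t, d) of the other available actions at round t
    widths[t]       : assumed confidence widths; the paper derives these from
                      the demonstrator's confidence intervals, here they are
                      given constants purely for illustration
    """
    d = pulled[0].shape[0]
    A_ub, b_ub = [], []
    for x_t, others, w_t in zip(pulled, alternatives, widths):
        # Illustrative constraint: the chosen action is within w_t of optimal,
        # i.e. <theta, a - x_t> <= w_t for every alternative action a.
        for a in np.atleast_2d(others):
            A_ub.append(a - x_t)
            b_ub.append(w_t)
    # Any linear objective turns this feasibility problem into an LP; the box
    # bounds keep the program bounded.
    res = linprog(c=np.ones(d), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(-1.0, 1.0)] * d, method="highs")
    return res.x if res.success else None
```

In the actual method the objective and the widths would come from the paper's confidence-interval characterization; the sketch only shows how optimality-gap constraints assemble into a standard-form LP.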

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a unified linear program approach to estimate reward functions from single demonstration sequences of stochastic linear bandit algorithms, specifically LinUCB and Phased Elimination. Within the taxonomy, it occupies the 'Linear Program-Based Inverse Estimation' leaf under 'Inverse Learning for Linear Bandits'. This leaf contains only the original paper itself, indicating a relatively sparse research direction. The sibling leaf 'One-Shot Inverse Learning' contains one other paper, suggesting that inverse learning for linear bandits remains an emerging area with limited prior work addressing the specific formulation of LP-based confidence interval characterization.

The taxonomy reveals that inverse reward learning from bandit demonstrations branches into linear bandits and multi-armed bandits without linear structure. The original work sits within the linear bandits branch, which contrasts with multi-armed bandit approaches like 'Exploration-Phase Reward Estimation' that leverage demonstrator exploration behavior. Neighboring branches include 'Reward-Biased Maximum Likelihood Estimation for Bandits', which frames inverse problems probabilistically rather than through optimization, and 'Transformer-Based In-Context Bandit Learning', which uses neural architectures for policy adaptation. The taxonomy's scope notes clarify that the original work excludes likelihood-based methods and multi-task settings, focusing instead on optimization-centric inverse techniques for linear reward recovery.

Across the three contributions analyzed, the literature search examined 15 candidate papers in total. For the 'Unified linear program approach', 3 candidates were examined and none were refuting, suggesting novelty in the LP formulation. For the 'Information-theoretic lower bound', 10 candidates were examined and 2 refutable matches were found, indicating that lower-bound analysis for inverse reward estimation has some prior coverage. For the 'Optimal unified reward estimator', 2 candidates were examined with no refutations. These statistics reflect a limited search scope (15 papers, not exhaustive), and the 2 refutable matches for the lower-bound contribution suggest that theoretical guarantees in this space have been partially explored, though the unified LP approach itself appears less contested.

Based on the limited search of 15 candidates, the work appears to introduce a novel optimization framework for inverse linear bandits, particularly in its unified treatment of multiple demonstrator algorithms without requiring hyperparameter access. The taxonomy structure confirms this sits in a sparse research direction, with only one sibling paper in the broader inverse linear bandits category. However, the information-theoretic lower bound contribution shows overlap with prior theoretical work, suggesting that while the LP methodology is distinctive, the fundamental limits of inverse reward estimation have received some attention in the examined literature.

Taxonomy

Core-task Taxonomy Papers: 9
Claimed Contributions: 3
Contribution Candidate Papers Compared: 15
Refutable Papers: 2

Research Landscape Overview

Core task: Estimating reward functions from demonstrations of stochastic linear bandit algorithms. The field addresses how to recover unknown reward structures when observing an agent's sequential decision-making behavior in bandit settings.

The taxonomy reveals four main branches that capture distinct methodological perspectives. Inverse Reward Learning from Bandit Demonstrations focuses on classical inverse problems where one infers reward parameters directly from observed arm-pull sequences, often employing optimization or linear programming techniques as seen in Inverse Linear Bandits[0]. Reward-Biased Maximum Likelihood Estimation for Bandits, exemplified by Reward-Biased Maximum Likelihood[6], frames the problem through a probabilistic lens, modeling how demonstration likelihoods depend on underlying rewards. Transformer-Based In-Context Bandit Learning explores modern neural architectures that can adapt to bandit tasks in-context, with works like Pretraining Decision Transformers[3] and Prompt Tuning Transformers[4] leveraging large-scale pretraining. Preference-Based Reward Learning shifts attention to learning from comparative feedback rather than direct demonstrations, addressing robustness concerns as in Causally Robust Preference[9].

Several active themes emerge across these branches. One central question is how to handle suboptimal or exploratory demonstrators: Learning Exploring Demonstrator[5] and In-Context Suboptimal Data[8] both grapple with the reality that observed behavior may not be perfectly rational. Another contrast lies between model-free neural approaches, such as Neural Contextual Bandits[7], and model-based inverse methods that explicitly solve for reward parameters.

Inverse Linear Bandits[0] sits squarely within the classical inverse learning branch, using linear programming to estimate reward vectors from bandit trajectories. Its emphasis on tractable optimization distinguishes it from likelihood-based methods like Reward-Biased Maximum Likelihood[6] and from transformer-driven in-context learners such as Pretraining Decision Transformers[3]. By focusing on linear program formulations, the original work aligns with a small cluster of optimization-centric inverse techniques, offering theoretical guarantees that complement the more flexible but less interpretable neural alternatives.

Claimed Contributions

Unified linear program approach for inverse linear bandits

The authors propose a unified framework that formulates inverse linear bandits as a linear program by characterizing confidence intervals of actions selected by the demonstrator. This approach works for both LinUCB and Phased Elimination algorithms, does not require access to hyperparameters or internal states, and applies to general action sets without density or geometry assumptions.

3 retrieved papers
Information-theoretic lower bound for inverse reward estimation

The authors establish a fundamental lower bound showing that the estimation error of any inverse learner is bounded below by a quantity involving the inverse expected feature covariance matrix (a toy computation of such a quantity is sketched after this entry). This lower bound serves as a baseline for analyzing specific reward estimators and generalizes prior hardness results from multi-armed bandits to linear bandits.

10 retrieved papers
Can Refute
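
As a rough illustration of the kind of quantity such a bound involves, the snippet below computes a covariance-based error scale from the pulled actions. This is an assumed, empirical stand-in for the inverse expected feature covariance matrix mentioned above; the paper's exact statement (norms, scaling in T, constants) is not reproduced here.

```python
import numpy as np

def covariance_baseline(pulled_actions, reg=1e-6):
    """Illustrative error scale based on the pulled actions' feature covariance.

    The lower bound is described as involving the inverse *expected* feature
    covariance matrix; this sketch uses the empirical covariance plus a small
    regularizer so that the inverse always exists.
    """
    X = np.asarray(pulled_actions)            # shape (T, d)
    T, d = X.shape
    cov = X.T @ X / T + reg * np.eye(d)       # empirical feature covariance
    return np.trace(np.linalg.inv(cov)) / T   # one plausible scale for the error
```
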
Optimal unified reward estimator matching lower bound

The authors develop a unified reward estimator whose estimation error matches the information-theoretic lower bound up to polynomial factors in the dimension d and log T (see the schematic after this entry). The estimator requires only an approximation of the best reward, works without access to the demonstrator's hyperparameters or internal states, and places no assumptions on the density or geometry of the action set.

2 retrieved papers
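
Taken together with the previous contribution, the matching claim can be written schematically as below, where L(Σ) stands for the lower-bound quantity involving the inverse expected feature covariance. This is a paraphrase of the claim as summarized above, not the paper's exact theorem.

```latex
% Schematic only: err(.) and L(.) are placeholders for the paper's precise
% error metric and covariance-dependent lower-bound quantity.
\Omega\bigl(L(\Sigma)\bigr)
  \;\le\;
  \mathrm{err}(\hat{\theta})
  \;\le\;
  \mathrm{poly}(d)\,\log T \cdot L(\Sigma)
```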

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one limited by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Unified linear program approach for inverse linear bandits (3 retrieved papers, no refuting matches found)

Contribution: Information-theoretic lower bound for inverse reward estimation (10 retrieved papers, 2 refutable matches found)

Contribution: Optimal unified reward estimator matching lower bound (2 retrieved papers, no refuting matches found)