Learning to Answer from Correct Demonstrations

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Prompt-Completion, Imitation Learning, Likelihood Maximization
Abstract:

We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers, any one of which is acceptable at test time. Learning is based on demonstrations of some correct answer to each training question, as in Supervised Fine Tuning (SFT). We formalize the problem as offline imitation learning in contextual bandits, with demonstrations from some optimal policy, without explicitly observed rewards. Prior work assumes that the demonstrator belongs to a low-complexity policy class, which motivates maximum likelihood estimation (i.e., log-loss minimization). In contrast, we propose relying only on the reward model (specifying which answers are correct) being in a low-cardinality class, which we argue is a weaker assumption. We show that likelihood maximization methods can fail in this case, and instead suggest an alternative novel approach that learns with sample complexity logarithmic in the cardinality of the reward class. Our work motivates looking beyond likelihood maximization when learning from demonstrations.
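To ground the setup, the following is a minimal illustrative sketch (not from the paper) of the SFT baseline the abstract critiques: maximum likelihood, i.e., log-loss minimization over demonstrated answers, which for a tabular policy reduces to the empirical conditional distribution. All prompts and answers below are invented.

```python
from collections import Counter, defaultdict

# Invented (prompt, answer) demonstrations: each demonstrated answer is
# assumed correct, but a prompt may have several correct answers in general.
demos = [("2+2", "4"), ("capital of France", "Paris"), ("2+2", "4")]

# Tabular maximum-likelihood estimate: the policy minimizing log-loss on the
# demonstrations is just the empirical conditional distribution per prompt.
counts = defaultdict(Counter)
for prompt, answer in demos:
    counts[prompt][answer] += 1

def mle_policy(prompt):
    c = counts[prompt]
    total = sum(c.values())
    return {a: n / total for a, n in c.items()}

print(mle_policy("2+2"))  # {'4': 1.0}
```

Note that this estimator imitates the demonstrator's distribution over answers; it carries no notion of which other answers would also be correct, which is the gap the abstract's reward-class assumption targets.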

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's stated tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes a novel framework for learning from demonstrations in contextual bandits by assuming the reward model belongs to a low-cardinality class rather than requiring the demonstrator policy to be low-complexity. It resides in the Imitation Learning in Contextual Bandits leaf, which currently contains no other papers in the taxonomy. This isolation suggests the specific framing—correct demonstrations with low-cardinality reward classes—represents a relatively unexplored niche within the broader policy learning from demonstrations literature, though the parent branch contains related work on inverse learning and meta-learning strategies.

The taxonomy reveals neighboring research directions that provide important context. The sibling leaf Meta-Learning and Exploration Strategies addresses learning exploration policies from offline tasks, while the parent branch includes Inverse Bandit Problems and Preference-Based Reward Learning, which infer rewards from demonstrator behavior or comparative feedback. The Interactive Learning branch explores user-triggered supervision and conversational feedback mechanisms. The paper's focus on reward model cardinality rather than policy complexity distinguishes it from these approaches, which typically assume demonstrator optimality or learn through active querying rather than passive observation of correct examples.

Among the twenty-one candidates examined through limited semantic search, none clearly refuted any of the three core contributions. The low-cardinality reward model assumption was examined against seven candidates with zero refutations, as were the novel algorithm with logarithmic sample complexity and the demonstration of MLE failures. The absence of overlapping prior work within the search scope suggests that these theoretical contributions, particularly the shift from policy-class to reward-class complexity assumptions, may represent genuine departures from existing approaches. However, the limited search scale (twenty-one papers, not hundreds) means that relevant work outside the top semantic matches may have been missed.

The analysis reflects a focused literature search rather than exhaustive coverage of all bandit learning literature. The taxonomy structure shows active research in inverse learning and interactive feedback, but the specific combination of correct demonstrations, low-cardinality reward classes, and critique of likelihood maximization appears underexplored within the examined scope. The theoretical nature of the contributions and their positioning between pure imitation and reward inference suggests potential novelty, though broader searches in optimization theory or statistical learning might reveal additional connections.

Taxonomy

Core-task Taxonomy Papers: 16
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 0

Research Landscape Overview

Core task: learning to generate correct answers from demonstrations in contextual bandits. The field structure reflects three main branches that together address how agents can learn effective policies when demonstrations or feedback are available.

The first branch, Theoretical Foundations and Algorithm Design, develops principled methods for policy learning from demonstrations, including imitation learning in contextual bandits and inverse reinforcement learning approaches such as Inverse RL Bandits[6] and Exploring Demonstrator Bandits[8]. The second branch, Interactive Learning with User Feedback, focuses on systems that actively solicit or incorporate human guidance, spanning conversational interfaces like Conversational Bandits[15], user-triggered supervision mechanisms as in User Triggered Supervision[10], and preference-based queries exemplified by Preference Active Queries[11]. The third branch, Application Domains and Empirical Studies, examines real-world deployments in education (Tutoring Feedback Bandits[2], Adaptive Tutoring Strategies[12], Adaptive Vocabulary Learning[7]), personalized decision support (Personalized Decision Support[3], Personalized Multimodal Feedback[5]), and embodied or robotic tasks (P3nav Embodied Navigation[4], Deep RL Manipulation[9]).

A particularly active theme across these branches concerns the trade-off between leveraging expert demonstrations and exploring beyond demonstrated behaviors, especially when demonstrations may be suboptimal or context-dependent. Works in the theoretical branch investigate how to extract reward signals or policies from imperfect demonstrators, while interactive learning studies explore when and how to query users for additional feedback.

Learning from Correct Demonstrations[0] sits naturally within the policy learning from demonstrations cluster, emphasizing the use of correct (rather than arbitrary) demonstrations to guide contextual bandit algorithms. Compared to nearby works like Exploring Demonstrator Bandits[8], which balances exploration with demonstrator advice, or Inverse RL Bandits[6], which infers rewards from demonstrations, it appears to focus more directly on the correctness guarantee of the provided examples. This positions the paper as a bridge between pure imitation approaches and methods that must handle noisy or exploratory demonstrator behavior.

Claimed Contributions

Low-cardinality reward model class assumption

The authors introduce an alternative assumption that the reward model class has low cardinality, rather than assuming the demonstrator policy belongs to a low-capacity class. They argue this is strictly weaker and more realistic for learning from demonstrations where multiple correct answers exist.

Retrieved papers compared: 7
Novel learning algorithm with logarithmic sample complexity

The authors present a new learning algorithm (Algorithm 1) whose sample complexity is logarithmic in the cardinality of the reward class, independent of the action-space size or support size. This is an exponential improvement over natural baseline methods.

Retrieved papers compared: 7
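The paper's Algorithm 1 is not reproduced in this report. As intuition for why a finite reward class can admit sample complexity logarithmic in its cardinality, here is a hypothetical version-space sketch: eliminate every reward model that contradicts a demonstration, then answer with something all surviving models deem correct. The reward models, prompts, and the `answer` helper are all invented for illustration.

```python
# Hypothetical version-space sketch over a finite reward class (NOT the
# paper's Algorithm 1). Each reward model maps a prompt to the set of
# answers it deems correct; all entries are invented.
reward_class = [
    {"2+2": {"4"}, "1+1": {"2"}},
    {"2+2": {"4", "four"}, "1+1": {"2"}},
    {"2+2": {"5"}, "1+1": {"2", "two"}},
]

demos = [("2+2", "4")]  # each demonstrated answer is correct under the truth

# Eliminate every reward model that calls some demonstrated answer incorrect.
# In favorable cases each demonstration removes a constant fraction of the
# survivors, which is the usual intuition behind log-cardinality sample
# complexity (an assumption here, not the paper's proof).
version_space = [r for r in reward_class
                 if all(a in r[x] for x, a in demos)]

def answer(prompt):
    # Play any answer that every surviving model still deems correct.
    consistent = set.intersection(*(r[prompt] for r in version_space))
    return next(iter(consistent)) if consistent else None

print(answer("2+2"))  # prints 4
```

Unlike a likelihood-based fit, this sketch never models the demonstrator's distribution over correct answers; it only uses the demonstrations as correctness constraints on the reward class.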
Demonstration of MLE failures under low-cardinality reward classes

The authors prove that maximum likelihood estimation, which is optimal under low-capacity policy class assumptions, can fail to generalize even in simple situations when only the reward model class has low cardinality (Theorems 1 and 7).

Retrieved papers compared: 7
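Theorems 1 and 7 are not reproduced here, but the general failure mode can be illustrated with an invented toy: when the demonstrator spreads mass over several correct answers, log-loss can prefer a policy that leaks probability onto incorrect answers over one that always answers correctly. The prompt, answers, and candidate policies below are all hypothetical.

```python
import math

# Invented toy (not the paper's construction): one prompt, correct answers
# {a, b}, incorrect answer c. The demonstrator mixes over correct answers.
correct = {"a", "b"}
demos = ["a", "b", "a", "b"]

# Two candidate policies (answer -> probability).
pi1 = {"a": 1.0, "b": 0.0, "c": 0.0}    # always answers correctly
pi2 = {"a": 0.5, "b": 0.25, "c": 0.25}  # incorrect 25% of the time

def log_loss(pi):
    # Total negative log-likelihood of the demonstrations under pi.
    return sum(-math.log(pi[d]) if pi[d] > 0 else math.inf for d in demos)

def expected_reward(pi):
    # Probability that pi produces some correct answer.
    return sum(p for a, p in pi.items() if a in correct)

# MLE (log-loss minimization) prefers pi2, because pi1 assigns zero
# probability to the demonstrated answer "b" -- even though pi1 never errs.
assert log_loss(pi2) < log_loss(pi1)
assert expected_reward(pi1) == 1.0
assert expected_reward(pi2) == 0.75
```

This only illustrates the tension between likelihood and correctness; the paper's actual counterexamples concern generalization under a low-cardinality reward class and are stated in Theorems 1 and 7.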

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the top-K retrieved core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though a signal constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Low-cardinality reward model class assumption

The authors introduce an alternative assumption that the reward model class has low cardinality, rather than assuming the demonstrator policy belongs to a low-capacity class. They argue this is strictly weaker and more realistic for learning from demonstrations where multiple correct answers exist.

Contribution

Novel learning algorithm with logarithmic sample complexity

The authors present a new learning algorithm (Algorithm 1) whose sample complexity is logarithmic in the cardinality of the reward class, independent of the action-space size or support size. This is an exponential improvement over natural baseline methods.

Contribution

Demonstration of MLE failures under low-cardinality reward classes

The authors prove that maximum likelihood estimation, which is optimal under low-capacity policy class assumptions, can fail to generalize even in simple situations when only the reward model class has low cardinality (Theorems 1 and 7).