Reinforced Preference Optimization for Recommendation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Recommender System, Large Language Model, Reinforcement Learning with Verifiable Reward
Abstract:

Recent breakthroughs in large language models (LLMs) have fundamentally shifted recommender systems from discriminative to generative paradigms, where user behavior modeling is achieved by generating target items conditioned on historical interactions. Yet current generative recommenders still suffer from two core limitations: the lack of high-quality negative modeling and the reliance on implicit rewards. Reinforcement learning with verifiable rewards (RLVR) offers a natural solution by enabling on-policy sampling of harder negatives and grounding optimization in explicit reward signals. However, applying RLVR to generative recommenders remains non-trivial. Its unique generation space often leads to invalid or repetitive items that undermine sampling efficiency, and ranking supervision is sparse since most items receive identical zero rewards. To address these challenges, we propose Reinforced Preference Optimization for Recommendation (ReRe), a reinforcement-based paradigm tailored to LLM-based recommenders, an important direction in generative recommendation. ReRe incorporates constrained beam search to improve sampling efficiency and diversify hard negatives, while augmenting rule-based accuracy rewards with auxiliary ranking rewards for finer-grained supervision. Extensive experiments on three real-world datasets demonstrate that ReRe consistently outperforms both traditional and LLM-based recommenders in ranking performance. Further analysis shows that ReRe not only enhances performance across both base and SFT-initialized models but also generalizes robustly across different backbone families and scales. Beyond empirical gains, we systematically investigate the design space of RLVR in recommendation across generation, sampling strategy, reward modeling, and optimization algorithm, offering insights for future research. Our code is available at https://anonymous.4open.science/r/ReRe-E1B0.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Reinforced Preference Optimization for Recommendation (ReRe), applying reinforcement learning with verifiable rewards to LLM-based generative recommenders. It resides in the 'Recommendation-Specific LLM RL Training' leaf, which contains six papers including the original work. This leaf sits within the broader 'RL-Enhanced LLM Training and Alignment' branch, indicating a moderately populated research direction focused on adapting pretrained language models to recommendation objectives through policy optimization and reward shaping.

The taxonomy reveals neighboring leaves addressing LLM-based reward modeling, state representation, and agentic recommendation policies. The 'Generative Recommendation with LLMs' branch explores end-to-end item generation, while 'Optimization Objectives and Metrics' focuses on diversity and controllability. ReRe's emphasis on constrained sampling and ranking reward augmentation bridges these areas, connecting training methodology with generation quality and fine-grained supervision. The taxonomy's scope_note clarifies that this leaf excludes general LLM training not focused on recommendation, positioning ReRe within a specialized but active subfield.

Among the sixteen candidates examined, no papers clearly refute the three core contributions. The ReRe paradigm itself was assessed against ten candidates with zero refutable overlaps; constrained beam search was compared against two candidates and ranking reward augmentation against four, neither yielding clear prior work. This suggests that within the limited search scope (top-K semantic matches plus citation expansion), the specific combination of constrained sampling and ranking-based reward augmentation for LLM recommenders appears relatively unexplored, though the broader paradigm of RL-enhanced LLM training for recommendation is well-established.

The analysis covers a focused slice of the literature rather than an exhaustive survey. The taxonomy shows that while RL for LLM-based recommendation is an active area with multiple sibling papers, the specific technical mechanisms proposed here—constrained beam search to address invalid generation and ranking reward augmentation for sparse supervision—do not appear prominently in the examined candidates. This suggests incremental novelty in execution details within a recognized research direction, though broader literature may contain related techniques not captured by the search.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 16
Refutable papers: 0

Research Landscape Overview

Core task: Reinforcement learning for generative recommendation with language models. The field structure reflects a convergence of RL techniques and large language models to produce personalized, explainable, and interactive recommendations. The taxonomy organizes work into several major branches:

- RL-Enhanced LLM Training and Alignment: adapting pretrained models to recommendation objectives through policy optimization and reward shaping (e.g., Reinforced Preference Optimization[0], Explainable Quality Reward[15]).
- LLM-Powered Recommendation Agents and Policies: agentic systems that reason and plan over user contexts (e.g., Tunable Proactive Agent[4], Reason-to-Recommend[34]).
- Generative Recommendation with LLMs: end-to-end generation of items or explanations (e.g., Generative Job Recommendations[2], Generative Job LLM[7]).
- Simulation and Evaluation Environments: testbeds for interactive learning (e.g., Recommender System Environment[8], User Simulator[13]).

Additional branches address optimization objectives, cross-domain applications, and practical deployment, reflecting the breadth of challenges from training stability to real-world integration.

A particularly active line of work centers on aligning LLMs with recommendation-specific rewards, balancing user satisfaction, diversity, and explainability. Reinforced Preference Optimization[0] sits squarely within Recommendation-Specific LLM RL Training, emphasizing direct policy optimization tailored to recommendation feedback. Nearby efforts such as Explainable Quality Reward[15] and Rank-GRPO Conversational[33] similarly refine reward signals to capture nuanced user preferences, while Re2llm Reflective[22] and Reinforced Latent Reasoning[35] incorporate reasoning steps to improve interpretability and long-term engagement. Trade-offs emerge between sample efficiency, computational cost, and the richness of generated explanations.

Open questions include how to scale these RL methods to massive item catalogs, integrate multi-modal signals, and ensure robustness across diverse user populations without overfitting to narrow reward proxies.

Claimed Contributions

Reinforced Preference Optimization for Recommendation (ReRe) paradigm

The authors introduce ReRe, a novel reinforcement learning framework specifically designed for LLM-based recommender systems. This paradigm addresses limitations in existing generative recommenders by enabling on-policy sampling of harder negatives and grounding optimization in explicit reward signals rather than implicit ones.

10 retrieved papers
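The report does not specify which policy-optimization algorithm ReRe uses, but RLVR training of this kind commonly relies on a group-normalized (GRPO-style) advantage computed over on-policy samples for the same user prompt. A minimal sketch under that assumption (the function name and reward values are illustrative, not from the paper):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    # GRPO-style advantage: normalize each sampled candidate's reward
    # against the group of candidates drawn for the same user prompt.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four on-policy samples for one user; only the first matches the
# ground-truth target, so it alone earns the verifiable accuracy reward.
rewards = [1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Under this scheme the correct item receives a positive advantage and every sampled miss a mildly negative one, which is how on-policy hard negatives enter the gradient without any separate negative-sampling heuristic.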
Constrained beam search for efficient sampling

The method employs constrained beam search as a sampling strategy to generate diverse candidate items in a single pass. This approach ensures both sampling efficiency and exposure to informative negatives, addressing the challenge of repetitive item generation in the constrained recommendation space.

2 retrieved papers
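No implementation details are given in the report, but constrained decoding over a fixed catalog is typically realized with a prefix trie over tokenized item titles, plugged into beam search as an allowed-token filter (e.g., via Hugging Face's `prefix_allowed_tokens_fn`). A hypothetical sketch with made-up item tokens:

```python
def build_trie(items):
    # Prefix trie over tokenized item titles; constrains decoding so
    # every beam can only extend toward a valid catalog item.
    root = {}
    for tokens in items:
        node = root
        for t in tokens:
            node = node.setdefault(t, {})
        node["<eos>"] = {}  # mark a complete item title
    return root

def allowed_tokens(trie, prefix):
    # Tokens that keep the generated prefix inside the catalog.
    node = trie
    for t in prefix:
        node = node.get(t)
        if node is None:
            return []  # prefix already left the catalog: prune the beam
    return list(node.keys())

catalog = [["the", "matrix"], ["the", "godfather"], ["up"]]
trie = build_trie(catalog)
```

Because every beam prefix is guaranteed to extend to a real catalog item, no samples are wasted on invalid titles, and the surviving beams directly yield distinct, high-probability hard negatives in a single decoding pass.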
Ranking reward augmentation for fine-grained supervision

ReRe introduces an auxiliary ranking reward that assigns additional penalties to hard negatives according to their generation probabilities. This augmentation provides finer-grained supervision beyond binary correctness signals, enhancing the model's discriminative ability for ranking tasks.

4 retrieved papers
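One plausible reading of this reward design can be sketched as follows; the penalty weight `alpha` and the linear scaling by generation probability are illustrative assumptions, not the paper's exact formula:

```python
def ranking_augmented_rewards(candidates, target, alpha=0.5):
    # Binary accuracy reward plus a ranking penalty: non-target items
    # the model generates with high probability (hard negatives) receive
    # a larger negative reward than low-probability ones.
    rewards = []
    for item, prob in candidates:
        if item == target:
            rewards.append(1.0)            # verifiable accuracy reward
        else:
            rewards.append(-alpha * prob)  # penalty scales with model confidence
    return rewards

# Three sampled items with their generation probabilities; the model is
# most confident about a wrong item, so that negative is penalized hardest.
cands = [("item_a", 0.6), ("item_b", 0.3), ("item_c", 0.1)]
r = ranking_augmented_rewards(cands, target="item_b")
```

This breaks the tie among the many zero-reward negatives that plain rule-based accuracy would produce, giving the policy a graded signal to demote exactly the items it currently ranks too high.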

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
