Learning to summarize user information for personalized reinforcement learning from human feedback

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: pluralistic preference alignment, RL finetuning of LLMs, pluralistic reward modeling
Abstract:

As everyday use cases of large language model (LLM) AI assistants have expanded, it is becoming increasingly important to personalize responses to align with different users' preferences and goals. While reinforcement learning from human feedback (RLHF) is effective at improving LLMs to be generally more helpful and fluent, it does not account for variability across users: it models the entire user population with a single reward model, implicitly assuming that everyone's preferences are the same. We present a novel framework, Preference Learning Using Summarization (PLUS), that uses reinforcement learning (RL) to learn to produce text-based summaries of each user's preferences, characteristics, and past conversations. These summaries condition the reward model, enabling it to make personalized predictions about the types of responses valued by each user. The user-summarization model and reward model are trained simultaneously, creating an online co-adaptation loop. We show that, in contrast to the standard Bradley–Terry model, summaries produced by PLUS capture diverse aspects of user preferences, achieving an 11–77% improvement in reward model accuracy. Key strengths of PLUS are: (1) robust performance with new users and conversation topics, achieving a 25% improvement over the best personalized reward model technique used for RLHF; (2) zero-shot personalization with state-of-the-art proprietary models like GPT-4 (e.g., PLUS-summary-conditioned responses achieved a 72% win rate versus 28% for default GPT-4o); (3) learning from flexible user contexts beyond preference labels; and (4) interpretable representation of users, enabling greater transparency and user control in pluralistic LLM alignment.
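The abstract describes PLUS's key mechanism: the reward model scores responses conditioned on a natural-language summary of the user, and is trained with a Bradley–Terry objective over preference pairs. The sketch below is illustrative only, not the authors' implementation: it replaces the LLM-based reward model with a toy linear scorer, and `featurize`, its features, and the demo weights are all assumptions.

```python
import math

def bradley_terry_nll(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of the observed preference under the
    Bradley-Terry model: P(chosen > rejected) = sigmoid(r_chosen - r_rejected)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def featurize(user_summary: str, response: str) -> list[float]:
    """Toy joint features of (summary, response); a real reward model would
    encode both with an LLM. Here: word overlap with the summary, and length."""
    s = set(user_summary.lower().split())
    r = response.lower().split()
    overlap = len(s & set(r)) / max(len(set(r)), 1)
    return [overlap, len(r) / 50.0]

def reward(weights: list[float], user_summary: str, response: str) -> float:
    """Summary-conditioned reward: a linear score over the joint features."""
    return sum(w * f for w, f in zip(weights, featurize(user_summary, response)))

# A user whose (hypothetical) PLUS summary says they value brevity and examples:
summary = "prefers short concrete answers with examples"
w = [2.0, -0.5]  # arbitrary demo weights: reward overlap, penalize length
r_a = reward(w, summary, "short concrete answer with examples")
r_b = reward(w, summary, "a very long meandering reply " * 5)
loss = bradley_terry_nll(r_a, r_b)  # small when the model ranks A above B
```

Because the summary enters the score, the same pair of responses can receive a different ranking under a different user's summary, which is the personalization the abstract claims the single-reward-model Bradley–Terry setup cannot express.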

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces PLUS, a framework that learns text-based summaries of individual user preferences to condition personalized reward models in RLHF. Within the taxonomy, it resides in the 'Text-Based User Summary Learning' leaf under 'Personalized RLHF with User Modeling'. This leaf contains only two papers in total: the paper under review and one sibling (Premium). This indicates a relatively sparse research direction focused specifically on generating natural language user profiles for alignment, in contrast with the broader 'Personalized RLHF with User Modeling' branch, which encompasses four distinct approaches to user representation.

The taxonomy reveals neighboring work in sibling leaves: 'Lightweight User Model Integration' (two papers using compact joint-trained models), 'Variational and Probabilistic User Preference Modeling' (one paper with probabilistic frameworks), and 'Optimized Natural Language Preference Inference' (one paper on preference extraction). These adjacent directions share the goal of personalized alignment but differ in representation strategy—PLUS uses explicit text summaries while siblings employ embeddings, probabilistic samples, or optimized inference. The broader taxonomy shows parallel efforts in 'Rich Natural Language Feedback for RLHF' (seven papers) that leverage textual signals but without personalized user modeling, highlighting PLUS's unique intersection of natural language representation and individual preference learning.

Among the three contributions analyzed, the core PLUS framework examined ten candidates with zero refutations, suggesting novelty in the specific approach of learned text summaries for personalized reward modeling. The online co-adaptation training procedure examined only two candidates with no refutations, indicating limited prior work on simultaneous summarizer-reward model training. The empirical validation contribution examined ten candidates and found one refutable match, likely reflecting overlap in benchmark usage rather than methodological duplication. These statistics are based on twenty-two total candidates from a limited semantic search, not an exhaustive literature review, so the analysis captures top-K similarity rather than comprehensive field coverage.

Given the sparse taxonomy leaf (two papers) and limited refutations across contributions, PLUS appears to occupy a relatively novel position within personalized RLHF research. The framework's combination of text-based user summaries with online co-adaptation distinguishes it from existing user modeling approaches. However, the analysis scope—twenty-two candidates from semantic search—means adjacent work in related conferences or emerging preprints may not be fully represented. The empirical validation shows some benchmark overlap, which is typical for establishing comparative baselines in alignment research.

Taxonomy

- Core-task Taxonomy Papers: 20
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 22
- Refutable Papers: 1

Research Landscape Overview

Core task: personalized reinforcement learning from human feedback using text-based user summaries. The field addresses how to tailor language model behavior to individual user preferences by leveraging natural language descriptions of user characteristics or feedback. The taxonomy reveals several complementary directions:

- Personalized RLHF with User Modeling explores methods that explicitly represent and learn from user-specific information, often through textual summaries or profiles.
- Rich Natural Language Feedback for RLHF investigates how to incorporate diverse forms of linguistic guidance beyond simple preference labels, including critiques and instructions (e.g., Natural Language Guidance[3], Text2Reward[8]).
- Language Model Reward Shaping and Generation focuses on using language models themselves to construct or refine reward signals.
- Real-World Human Interaction and Feedback Collection examines practical systems for gathering authentic user input at scale (Real World Interaction[6]).
- Domain-Specific Personalized Learning Applications targets concrete use cases such as recommendation or content generation where personalization is critical (SumRecom[12], Predilect[13]).

A particularly active line of work centers on how to efficiently encode and utilize user-specific information without requiring exhaustive data from each individual. Some approaches learn shared representations across users while adapting to personal preferences (Shared LoRA RLHF[18], Personalized Language Modeling[1]), whereas others emphasize converting natural language feedback into actionable training signals (Text2Grad[4], Feedback Goal Conditioning[19]). Personalized RLHF Summarization[0] sits within the Text-Based User Summary Learning cluster, closely related to Premium[7]; both emphasize the use of concise textual user profiles to guide model alignment. Compared to methods that rely on large-scale preference datasets (Data Efficient Alignment[9]) or aggregate feedback across populations (Uni-RLHF[11]), this work prioritizes interpretable, text-driven personalization that can adapt quickly to individual user contexts with minimal additional data collection.

Claimed Contributions

PLUS framework for personalized RLHF using learned text summaries

PLUS is a new RLHF framework that jointly trains a summarizer (via PPO) to generate natural-language user summaries and a reward model conditioned on those summaries. This co-adaptive loop enables personalized preference modeling without requiring fixed user identifiers or embeddings.

Candidate papers retrieved: 10
Online co-adaptation training procedure for summarizer and reward model

The authors introduce a training procedure that alternates between updating the summarizer (using PPO with rewards from the reward model) and updating the reward model (conditioned on generated summaries), allowing both components to adapt to each other iteratively.

Candidate papers retrieved: 2
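The alternation described in this contribution can be sketched schematically. Everything below is a toy stand-in rather than the paper's method: `SummarizerStub`, `RewardModelStub`, the scalar "quality" and "accuracy" fields, and the simplistic update rules are assumptions standing in for the actual PPO step on the summarizer and the Bradley–Terry training step on the reward model.

```python
from dataclasses import dataclass

@dataclass
class SummarizerStub:
    """Stand-in for the PPO-trained summarizer policy."""
    quality: float = 0.0  # scalar proxy for how informative its summaries are

    def summarize(self, user_history: list[str]) -> str:
        return f"summary(q={self.quality:.2f}) of {len(user_history)} turns"

    def ppo_step(self, scalar_reward: float, lr: float = 0.1) -> None:
        # Real PLUS: a PPO update using reward-model scores as the reward signal.
        self.quality += lr * scalar_reward

@dataclass
class RewardModelStub:
    """Stand-in for the summary-conditioned reward model."""
    accuracy: float = 0.5

    def fit_step(self, summary: str, preference_pairs: list[tuple],
                 lr: float = 0.05) -> None:
        # Real PLUS: a gradient step on a Bradley-Terry loss over the pairs,
        # conditioned on `summary`. Here the stub just "improves" each step.
        self.accuracy = min(1.0, self.accuracy + lr)

    def score(self, summary: str) -> float:
        return self.accuracy  # proxy reward fed back to the summarizer

def co_adapt(summarizer, reward_model, user_history, pairs, rounds=3):
    """Alternate the two updates so each component adapts to the other."""
    for _ in range(rounds):
        s = summarizer.summarize(user_history)      # 1. generate a user summary
        reward_model.fit_step(s, pairs)             # 2. update RM on pairs given s
        summarizer.ppo_step(reward_model.score(s))  # 3. PPO step on RM feedback
    return summarizer, reward_model
```

The point of the loop is the ordering: the reward model is always trained on summaries the current summarizer actually produces, and the summarizer is always rewarded by the current reward model, which is the online co-adaptation the contribution claims.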
Empirical validation on pluralistic benchmarks and real-world PRISM dataset

The authors demonstrate that PLUS achieves substantial accuracy improvements over existing methods on established pluralistic preference datasets (Pets, UltraFeedback) and extend evaluation to the challenging real-world PRISM dataset, showing robustness to new users and conversation topics.

Candidate papers retrieved: 10 (one refutable match)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
