Learning to summarize user information for personalized reinforcement learning from human feedback

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: pluralistic preference alignment, RL finetuning of LLMs, pluralistic reward modeling
Abstract:

As everyday use cases of large language model (LLM) AI assistants have expanded, it is becoming increasingly important to personalize responses to align with different users' preferences and goals. While reinforcement learning from human feedback (RLHF) is effective at improving LLMs to be generally more helpful and fluent, it does not account for variability across users: it models the entire user population with a single reward model, implicitly assuming that everyone's preferences are the same. We present a novel framework, Preference Learning Using Summarization (PLUS), that uses reinforcement learning (RL) to learn to produce text-based summaries of each user's preferences, characteristics, and past conversations. These summaries condition the reward model, enabling it to make personalized predictions about the types of responses valued by each user. The user-summarization model and reward model are trained simultaneously, creating an online co-adaptation loop. We show that, in contrast to the standard Bradley–Terry model, summaries produced by PLUS capture diverse aspects of user preferences, achieving an 11–77% improvement in reward model accuracy. Key strengths of PLUS are: (1) robust performance with new users and conversation topics, achieving a 25% improvement over the best personalized reward model technique used for RLHF; (2) zero-shot personalization with state-of-the-art proprietary models like GPT-4 (e.g., PLUS-summary-conditioned responses achieved a 72% win rate versus 28% for default GPT-4o); (3) learning from flexible user contexts beyond preference labels; and (4) interpretable representation of users, enabling greater transparency and user control in pluralistic LLM alignment.
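The abstract describes PLUS's key mechanism: the reward model scores responses conditioned on a natural-language summary of the user, and is trained with a Bradley–Terry objective over preference pairs. The sketch below is illustrative only, not the authors' implementation: it replaces the LLM-based reward model with a toy linear scorer, and `featurize`, its features, and the demo weights are all assumptions.

```python
import math

def bradley_terry_nll(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of the observed preference under the
    Bradley-Terry model: P(chosen > rejected) = sigmoid(r_chosen - r_rejected)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def featurize(user_summary: str, response: str) -> list[float]:
    """Toy joint features of (summary, response); a real reward model would
    encode both with an LLM. Here: word overlap with the summary, and length."""
    s = set(user_summary.lower().split())
    r = response.lower().split()
    overlap = len(s & set(r)) / max(len(set(r)), 1)
    return [overlap, len(r) / 50.0]

def reward(weights: list[float], user_summary: str, response: str) -> float:
    """Summary-conditioned reward: a linear score over the joint features."""
    return sum(w * f for w, f in zip(weights, featurize(user_summary, response)))

# A user whose (hypothetical) PLUS summary says they value brevity and examples:
summary = "prefers short concrete answers with examples"
w = [2.0, -0.5]  # arbitrary demo weights: reward overlap, penalize length
r_a = reward(w, summary, "short concrete answer with examples")
r_b = reward(w, summary, "a very long meandering reply " * 5)
loss = bradley_terry_nll(r_a, r_b)  # small when the model ranks A above B
```

Because the summary enters the score, the same pair of responses can receive a different ranking under a different user's summary, which is the personalization the abstract claims the single-reward-model Bradley–Terry setup cannot express.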

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces PLUS, a framework that learns text-based summaries of individual user preferences to condition personalized reward models in RLHF. Within the taxonomy, it resides in the 'Text-Based User Summary Learning' leaf under 'Personalized RLHF with User Modeling'. This leaf contains only two papers in total: the paper under review and one sibling (Premium). This indicates a relatively sparse research direction focused specifically on generating natural language user profiles for alignment, in contrast with the broader 'Personalized RLHF with User Modeling' branch, which encompasses four distinct approaches to user representation.

The taxonomy reveals neighboring work in sibling leaves: 'Lightweight User Model Integration' (two papers using compact joint-trained models), 'Variational and Probabilistic User Preference Modeling' (one paper with probabilistic frameworks), and 'Optimized Natural Language Preference Inference' (one paper on preference extraction). These adjacent directions share the goal of personalized alignment but differ in representation strategy—PLUS uses explicit text summaries while siblings employ embeddings, probabilistic samples, or optimized inference. The broader taxonomy shows parallel efforts in 'Rich Natural Language Feedback for RLHF' (seven papers) that leverage textual signals but without personalized user modeling, highlighting PLUS's unique intersection of natural language representation and individual preference learning.

Among the three contributions analyzed, the core PLUS framework examined ten candidates with zero refutations, suggesting novelty in the specific approach of learned text summaries for personalized reward modeling. The online co-adaptation training procedure examined only two candidates with no refutations, indicating limited prior work on simultaneous summarizer-reward model training. The empirical validation contribution examined ten candidates and found one refutable match, likely reflecting overlap in benchmark usage rather than methodological duplication. These statistics are based on twenty-two total candidates from a limited semantic search, not an exhaustive literature review, so the analysis captures top-K similarity rather than comprehensive field coverage.

Given the sparse taxonomy leaf (two papers) and limited refutations across contributions, PLUS appears to occupy a relatively novel position within personalized RLHF research. The framework's combination of text-based user summaries with online co-adaptation distinguishes it from existing user modeling approaches. However, the analysis scope—twenty-two candidates from semantic search—means adjacent work in related conferences or emerging preprints may not be fully represented. The empirical validation shows some benchmark overlap, which is typical for establishing comparative baselines in alignment research.

Taxonomy

- Core-task Taxonomy Papers: 20
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 22
- Refutable Papers: 1

Research Landscape Overview

Core task: personalized reinforcement learning from human feedback using text-based user summaries. The field addresses how to tailor language model behavior to individual user preferences by leveraging natural language descriptions of user characteristics or feedback. The taxonomy reveals several complementary directions:

- Personalized RLHF with User Modeling explores methods that explicitly represent and learn from user-specific information, often through textual summaries or profiles.
- Rich Natural Language Feedback for RLHF investigates how to incorporate diverse forms of linguistic guidance beyond simple preference labels, including critiques and instructions (e.g., Natural Language Guidance[3], Text2Reward[8]).
- Language Model Reward Shaping and Generation focuses on using language models themselves to construct or refine reward signals.
- Real-World Human Interaction and Feedback Collection examines practical systems for gathering authentic user input at scale (Real World Interaction[6]).
- Domain-Specific Personalized Learning Applications targets concrete use cases such as recommendation or content generation where personalization is critical (SumRecom[12], Predilect[13]).

A particularly active line of work centers on how to efficiently encode and utilize user-specific information without requiring exhaustive data from each individual. Some approaches learn shared representations across users while adapting to personal preferences (Shared LoRA RLHF[18], Personalized Language Modeling[1]), whereas others emphasize converting natural language feedback into actionable training signals (Text2Grad[4], Feedback Goal Conditioning[19]). Personalized RLHF Summarization[0] sits within the Text-Based User Summary Learning cluster, closely related to Premium[7]; both emphasize the use of concise textual user profiles to guide model alignment. Compared to methods that rely on large-scale preference datasets (Data Efficient Alignment[9]) or aggregate feedback across populations (Uni-RLHF[11]), this work prioritizes interpretable, text-driven personalization that can adapt quickly to individual user contexts with minimal additional data collection.

Claimed Contributions

PLUS framework for personalized RLHF using learned text summaries

PLUS is a new RLHF framework that jointly trains a summarizer (via PPO) to generate natural-language user summaries and a reward model conditioned on those summaries. This co-adaptive loop enables personalized preference modeling without requiring fixed user identifiers or embeddings.

Candidate papers retrieved: 10
Online co-adaptation training procedure for summarizer and reward model

The authors introduce a training procedure that alternates between updating the summarizer (using PPO with rewards from the reward model) and updating the reward model (conditioned on generated summaries), allowing both components to adapt to each other iteratively.

Candidate papers retrieved: 2
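The alternation described in this contribution can be sketched schematically. Everything below is a toy stand-in rather than the paper's method: `SummarizerStub`, `RewardModelStub`, the scalar "quality" and "accuracy" fields, and the simplistic update rules are assumptions standing in for the actual PPO step on the summarizer and the Bradley–Terry training step on the reward model.

```python
from dataclasses import dataclass

@dataclass
class SummarizerStub:
    """Stand-in for the PPO-trained summarizer policy."""
    quality: float = 0.0  # scalar proxy for how informative its summaries are

    def summarize(self, user_history: list[str]) -> str:
        return f"summary(q={self.quality:.2f}) of {len(user_history)} turns"

    def ppo_step(self, scalar_reward: float, lr: float = 0.1) -> None:
        # Real PLUS: a PPO update using reward-model scores as the reward signal.
        self.quality += lr * scalar_reward

@dataclass
class RewardModelStub:
    """Stand-in for the summary-conditioned reward model."""
    accuracy: float = 0.5

    def fit_step(self, summary: str, preference_pairs: list[tuple],
                 lr: float = 0.05) -> None:
        # Real PLUS: a gradient step on a Bradley-Terry loss over the pairs,
        # conditioned on `summary`. Here the stub just "improves" each step.
        self.accuracy = min(1.0, self.accuracy + lr)

    def score(self, summary: str) -> float:
        return self.accuracy  # proxy reward fed back to the summarizer

def co_adapt(summarizer, reward_model, user_history, pairs, rounds=3):
    """Alternate the two updates so each component adapts to the other."""
    for _ in range(rounds):
        s = summarizer.summarize(user_history)      # 1. generate a user summary
        reward_model.fit_step(s, pairs)             # 2. update RM on pairs given s
        summarizer.ppo_step(reward_model.score(s))  # 3. PPO step on RM feedback
    return summarizer, reward_model
```

The point of the loop is the ordering: the reward model is always trained on summaries the current summarizer actually produces, and the summarizer is always rewarded by the current reward model, which is the online co-adaptation the contribution claims.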
Empirical validation on pluralistic benchmarks and real-world PRISM dataset

The authors demonstrate that PLUS achieves substantial accuracy improvements over existing methods on established pluralistic preference datasets (Pets, UltraFeedback) and extend evaluation to the challenging real-world PRISM dataset, showing robustness to new users and conversation topics.

Candidate papers retrieved: 10 (one refutable match)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
