What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: rlhf, explaining datasets, interpretability, reward modeling, personalization
Abstract:

Preference data is widely used for aligning language models, but remains largely opaque. While prior work has studied specific aspects of annotator preference (e.g., length or sycophancy), automatically inferring preferences without pre-specifying hypotheses remains challenging. We introduce What's In My Human Feedback (WIMHF), a method that produces human-interpretable, natural language features from preference data using sparse autoencoders. We show that a sparse set of interpretable features can account for two-thirds of the preference signal achieved by black-box models. Applying WIMHF to 7 widely-used datasets, we precisely characterize both (1) which preferences are even possible to measure from each dataset and (2) which preferences humans actually display. WIMHF surfaces preferences that are unintentional or even actively harmful, like a preference for toxic outputs in Chatbot Arena. We show how these findings enable interpretable data curation: re-labeling the examples that contain the harmful preference yields large safety gains (+37%) with no cost to general performance. We also demonstrate a new approach to personalization: on the Community Alignment dataset, we identify preferences that are subjective across annotators, and use the features as interpretable knobs to adjust model behavior along these axes.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces WIMHF, a method that extracts human-interpretable natural language features from preference data using sparse autoencoders. It occupies the 'Interpretable Preference Representations' leaf within the 'Preference Modeling Frameworks' branch of the taxonomy. Notably, this leaf contains only the original paper itself—no sibling papers exist in this specific category. This positioning suggests the work addresses a relatively sparse research direction: while the broader field includes numerous preference modeling approaches (Beyond Bradley-Terry models, multi-objective frameworks), the specific focus on extracting interpretable features from preference data appears less explored within the examined literature.

The taxonomy reveals substantial activity in adjacent areas. The parent branch 'Preference Modeling Frameworks' includes work on complex preference structures (intransitivity, game-theoretic approaches) and multi-objective modeling, but these typically remain black-box representations. Neighboring branches address preference optimization algorithms (DPO variants, reward-based RL) and data quality methods (influence functions, annotation efficiency), yet these focus on algorithmic refinement rather than interpretability. The 'Alignment Evaluation and Analysis' branch includes factor-level preference analysis, which shares interpretability goals but approaches the problem from an evaluation rather than modeling perspective. WIMHF's use of sparse autoencoders to surface interpretable features bridges preference modeling and analysis in a way that appears distinct from existing categorical boundaries.

Among 30 candidates examined across three contributions, none clearly refute the core claims. The WIMHF method itself (10 candidates examined, 0 refutable) appears novel in its application of sparse autoencoders to preference data interpretation. The interpretable data curation contribution (10 candidates, 0 refutable) demonstrates practical safety improvements through targeted re-labeling, a use case not prominently covered in the examined literature. The personalization approach (10 candidates, 0 refutable) similarly shows no substantial prior overlap. The limited search scope means these findings reflect top-30 semantic matches rather than exhaustive coverage, but within this sample, the work's combination of interpretability techniques and preference data analysis appears distinctive.

Based on the examined candidates and taxonomy structure, the work occupies a relatively unexplored niche at the intersection of interpretability and preference modeling. The absence of sibling papers in its taxonomy leaf and the lack of refutable prior work among 30 candidates suggest meaningful novelty, though the limited search scope prevents definitive claims about the broader literature. The practical applications to safety and personalization extend beyond pure modeling contributions, addressing gaps in how preference data is understood and curated.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: interpreting human preference data for language model alignment.

The field has evolved into a rich ecosystem organized around several major themes. At the highest level, researchers address data quality and selection—ensuring that preference signals are informative and representative—while simultaneously developing preference modeling frameworks that translate raw comparisons into learnable representations. Parallel branches focus on preference optimization algorithms, which refine model behavior given these signals, and on online or adaptive learning schemes that update models as new feedback arrives. Additional branches explore personalized and pluralistic alignment to accommodate diverse user values, methods for acquiring annotations and feedback (including AI-generated alternatives), and domain-specific tuning for specialized tasks. Complementary work examines evaluation strategies, continual learning paradigms, and the interplay between pre-training and fine-tuning, with surveys and practical systems rounding out the taxonomy.

Within this landscape, a particularly active line of inquiry concerns how to represent and leverage preference information more effectively. Some studies question the sufficiency of standard pairwise comparisons, proposing listwise or ranking-based formulations (Listwise Preference Optimization[8], Preference Ranking Optimization[7]) or revisiting foundational assumptions like the Bradley-Terry model (Beyond Bradley-Terry[11], Rethinking Reward Modeling[3]). Others investigate what makes preference data valuable (Valuable Preference Data[1]) or how strategic annotator behavior shapes feedback (Strategic Human Feedback[28]). Interpretable Preference Descriptions[0] sits squarely in the preference modeling frameworks branch, emphasizing interpretable representations that make the underlying structure of human judgments more transparent. This focus on interpretability contrasts with purely algorithmic approaches like Self-Play Preference Optimization[2] and aligns closely with efforts to understand the role and limitations of pairwise signals (Pairwise Preference Role[5]), offering a complementary lens on how preference data can be both modeled and explained.

Claimed Contributions

What's In My Human Feedback (WIMHF) method

The authors propose WIMHF, a three-step procedure that uses sparse autoencoders to automatically discover interpretable natural language features from preference datasets, enabling analysis of both measurable preferences (features that vary between responses) and realized preferences (features that affect human labels) without pre-specifying hypotheses.

10 retrieved papers
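The three steps described above can be sketched end-to-end. The following is a minimal toy illustration, not the authors' implementation: random vectors stand in for text embeddings, a small top-k sparse autoencoder is trained on the embedding difference of each response pair, "measurable" features are those that vary across pairs, and "realized" features are those that predict the human label via a logistic head. All sizes, learning rates, and the top-k rule are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: WIMHF would embed the two responses of each preference
# pair; here random vectors play that role (all sizes are assumptions).
d, n_feats, k, n_pairs = 32, 64, 8, 512
emb_a = rng.normal(size=(n_pairs, d))
emb_b = rng.normal(size=(n_pairs, d))

# Hypothetical ground truth: the label follows one latent direction.
w_true = rng.normal(size=d)
labels = ((emb_a - emb_b) @ w_true > 0).astype(float)  # 1 = "a" preferred

# Step 1: train a top-k sparse autoencoder on the pair difference.
diff = emb_a - emb_b
W_enc = rng.normal(size=(d, n_feats)) * 0.1
W_dec = W_enc.T.copy()

def encode(x):
    z = x @ W_enc
    thresh = np.sort(np.abs(z), axis=1)[:, -k][:, None]
    return np.where(np.abs(z) >= thresh, z, 0.0)  # keep k largest per row

for _ in range(400):  # plain SGD on reconstruction error
    z = encode(diff)
    err = z @ W_dec - diff
    W_dec -= 0.05 * z.T @ err / n_pairs
    W_enc -= 0.05 * diff.T @ ((err @ W_dec.T) * (z != 0)) / n_pairs

# Step 2: "measurable" preferences = features that vary across pairs.
z = encode(diff)
measurable = z.std(axis=0) > 1e-3

# Step 3: "realized" preferences = features that predict the label.
beta = np.zeros(n_feats)
for _ in range(500):  # logistic regression on feature activations
    p = 1 / (1 + np.exp(-(z @ beta)))
    beta -= 0.5 * z.T @ (p - labels) / n_pairs
acc = ((z @ beta > 0) == (labels > 0.5)).mean()
print(f"{measurable.sum()} measurable features, label accuracy {acc:.2f}")
```

In the paper's setting the discovered features come with natural-language descriptions; the sketch only shows the mechanical split between features that vary and features that carry label signal.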
Interpretable data curation for safety improvement

The authors demonstrate that WIMHF enables targeted data curation by identifying and correcting misaligned preferences in datasets. For example, flipping labels on examples with harmful anti-refusal preferences in Chatbot Arena substantially improves safety metrics while preserving overall model performance.

10 retrieved papers
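As a concrete illustration of this curation step, the sketch below flips labels on a toy dataset wherever a flagged harmful feature (e.g., an anti-refusal or toxicity feature) strongly favours the chosen response. The feature activations, the activation cutoff, and the flipping rule are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy preference dataset: per-pair activation of one WIMHF-style feature
# (hypothetical "chosen response is toxic / anti-refusal") and a label
# where 1 means annotators preferred response a.
n = 200
harmful_act = rng.normal(size=n)          # >0: response a exhibits the feature
labels = (harmful_act + 0.3 * rng.normal(size=n) > 0).astype(int)

# Curation rule sketched from the paper's description: where the harmful
# feature clearly favours the chosen response, flip the label so the
# refusing/non-toxic response is preferred instead.
threshold = 1.0  # assumed activation cutoff, not from the paper
flip = (harmful_act > threshold) & (labels == 1)
curated = np.where(flip, 0, labels)

print(f"flipped {flip.sum()} of {n} labels")
```

Only the examples expressing the flagged feature are touched, which is why this kind of targeted re-labeling can improve safety metrics without disturbing the rest of the preference signal.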
Interpretable personalization approach

The authors introduce an interpretable personalization method that identifies subjective preferences across annotators and learns user-specific coefficients for selected features. This approach allows practitioners to personalize models on acceptable attributes while preventing undesirable personalization, such as creating echo chambers.

10 retrieved papers
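A minimal sketch of this idea, under assumed simplifications: fit shared coefficients over interpretable feature activations plus one per-user coefficient on a designated subjective feature; the learned per-user coefficient then acts as an interpretable knob. The data, feature layout, and training loop are invented for illustration and are not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup: activations of a few interpretable features per pair;
# feature 0 is "subjective" (annotators disagree on its sign), the
# rest carry a shared preference. All values are invented.
n_users, n_pairs, n_feats = 5, 300, 4
z = rng.normal(size=(n_users, n_pairs, n_feats))
user_sign = np.array([1.0, -1.0, 1.0, -1.0, 1.0])   # per-user taste
w_shared_true = np.array([0.0, 1.0, -0.5, 0.8])

logits = z @ w_shared_true + user_sign[:, None] * z[:, :, 0]
labels = (logits > 0).astype(float)

# Fit shared coefficients plus one per-user coefficient ("knob") on the
# subjective feature, by gradient descent on the logistic loss.
w = np.zeros(n_feats)
u = np.zeros(n_users)
for _ in range(400):
    p = 1 / (1 + np.exp(-(z @ w + u[:, None] * z[:, :, 0])))
    g = p - labels
    w -= 0.1 * np.einsum('upf,up->f', z, g) / (n_users * n_pairs)
    u -= 0.1 * (g * z[:, :, 0]).sum(axis=1) / n_pairs

# u recovers each user's direction on the subjective axis; because the
# feature is interpretable, a practitioner can choose which axes to
# personalize and leave the shared coefficients fixed.
print(np.sign(u), user_sign)
```

Restricting the per-user term to vetted subjective features is what lets a practitioner allow personalization on style-like attributes while blocking it on attributes that would create echo chambers.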

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

The three claimed contributions (the WIMHF method, interpretable data curation for safety improvement, and the interpretable personalization approach) are restated above under Claimed Contributions. For each contribution, 10 candidate papers were retrieved and compared, and none was found to refute the claim.