What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: rlhf, explaining datasets, interpretability, reward modeling, personalization
Abstract:

Preference data is widely used for aligning language models, but remains largely opaque. While prior work has studied specific aspects of annotator preference (e.g., length or sycophancy), automatically inferring preferences without pre-specifying hypotheses remains challenging. We introduce What's In My Human Feedback (WIMHF), a method that produces human-interpretable, natural language features from preference data using sparse autoencoders. We show that a sparse set of interpretable features can account for two-thirds of the preference signal achieved by black-box models. Applying WIMHF to 7 widely-used datasets, we precisely characterize both (1) which preferences are even possible to measure from each dataset and (2) which preferences humans actually display. WIMHF surfaces preferences that are unintentional or even actively harmful, like a preference for toxic outputs in Chatbot Arena. We show how these findings enable interpretable data curation: re-labeling the examples that contain the harmful preference yields large safety gains (+37%) with no cost to general performance. We also demonstrate a new approach to personalization: on the Community Alignment dataset, we identify preferences that are subjective across annotators, and use the features as interpretable knobs to adjust model behavior along these axes.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces WIMHF, a method that extracts human-interpretable natural language features from preference data using sparse autoencoders. It occupies the 'Interpretable Preference Representations' leaf within the 'Preference Modeling Frameworks' branch of the taxonomy. Notably, this leaf contains only the original paper itself—no sibling papers exist in this specific category. This positioning suggests the work addresses a relatively sparse research direction: while the broader field includes numerous preference modeling approaches (Beyond Bradley-Terry models, multi-objective frameworks), the specific focus on extracting interpretable features from preference data appears less explored within the examined literature.

The taxonomy reveals substantial activity in adjacent areas. The parent branch 'Preference Modeling Frameworks' includes work on complex preference structures (intransitivity, game-theoretic approaches) and multi-objective modeling, but these typically remain black-box representations. Neighboring branches address preference optimization algorithms (DPO variants, reward-based RL) and data quality methods (influence functions, annotation efficiency), yet these focus on algorithmic refinement rather than interpretability. The 'Alignment Evaluation and Analysis' branch includes factor-level preference analysis, which shares interpretability goals but approaches the problem from an evaluation rather than modeling perspective. WIMHF's use of sparse autoencoders to surface interpretable features bridges preference modeling and analysis in a way that appears distinct from existing categorical boundaries.

Among 30 candidates examined across three contributions, none clearly refute the core claims. The WIMHF method itself (10 candidates examined, 0 refutable) appears novel in its application of sparse autoencoders to preference data interpretation. The interpretable data curation contribution (10 candidates, 0 refutable) demonstrates practical safety improvements through targeted re-labeling, a use case not prominently covered in the examined literature. The personalization approach (10 candidates, 0 refutable) similarly shows no substantial prior overlap. The limited search scope means these findings reflect top-30 semantic matches rather than exhaustive coverage, but within this sample, the work's combination of interpretability techniques and preference data analysis appears distinctive.

Based on the examined candidates and taxonomy structure, the work occupies a relatively unexplored niche at the intersection of interpretability and preference modeling. The absence of sibling papers in its taxonomy leaf and the lack of refutable prior work among 30 candidates suggest meaningful novelty, though the limited search scope prevents definitive claims about the broader literature. The practical applications to safety and personalization extend beyond pure modeling contributions, addressing gaps in how preference data is understood and curated.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: interpreting human preference data for language model alignment.

The field has evolved into a rich ecosystem organized around several major themes. At the highest level, researchers address data quality and selection—ensuring that preference signals are informative and representative—while simultaneously developing preference modeling frameworks that translate raw comparisons into learnable representations. Parallel branches focus on preference optimization algorithms, which refine model behavior given these signals, and on online or adaptive learning schemes that update models as new feedback arrives. Additional branches explore personalized and pluralistic alignment to accommodate diverse user values, methods for acquiring annotations and feedback (including AI-generated alternatives), and domain-specific tuning for specialized tasks. Complementary work examines evaluation strategies, continual learning paradigms, and the interplay between pre-training and fine-tuning, with surveys and practical systems rounding out the taxonomy.

Within this landscape, a particularly active line of inquiry concerns how to represent and leverage preference information more effectively. Some studies question the sufficiency of standard pairwise comparisons, proposing listwise or ranking-based formulations (Listwise Preference Optimization[8], Preference Ranking Optimization[7]) or revisiting foundational assumptions like the Bradley-Terry model (Beyond Bradley-Terry[11], Rethinking Reward Modeling[3]). Others investigate what makes preference data valuable (Valuable Preference Data[1]) or how strategic annotator behavior shapes feedback (Strategic Human Feedback[28]). Interpretable Preference Descriptions[0] sits squarely in the preference modeling frameworks branch, emphasizing interpretable representations that make the underlying structure of human judgments more transparent. This focus on interpretability contrasts with purely algorithmic approaches like Self-Play Preference Optimization[2] and aligns closely with efforts to understand the role and limitations of pairwise signals (Pairwise Preference Role[5]), offering a complementary lens on how preference data can be both modeled and explained.

Claimed Contributions

What's In My Human Feedback (WIMHF) method

The authors propose WIMHF, a three-step procedure that uses sparse autoencoders to automatically discover interpretable natural language features from preference datasets, enabling analysis of both measurable preferences (features that vary between responses) and realized preferences (features that affect human labels) without pre-specifying hypotheses.

10 retrieved papers
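The three steps described above can be sketched end-to-end. The following is a minimal toy illustration, not the authors' implementation: random vectors stand in for text embeddings, a small top-k sparse autoencoder is trained on the embedding difference of each response pair, "measurable" features are those that vary across pairs, and "realized" features are those that predict the human label via a logistic head. All sizes, learning rates, and the top-k rule are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: WIMHF would embed the two responses of each preference
# pair; here random vectors play that role (all sizes are assumptions).
d, n_feats, k, n_pairs = 32, 64, 8, 512
emb_a = rng.normal(size=(n_pairs, d))
emb_b = rng.normal(size=(n_pairs, d))

# Hypothetical ground truth: the label follows one latent direction.
w_true = rng.normal(size=d)
labels = ((emb_a - emb_b) @ w_true > 0).astype(float)  # 1 = "a" preferred

# Step 1: train a top-k sparse autoencoder on the pair difference.
diff = emb_a - emb_b
W_enc = rng.normal(size=(d, n_feats)) * 0.1
W_dec = W_enc.T.copy()

def encode(x):
    z = x @ W_enc
    thresh = np.sort(np.abs(z), axis=1)[:, -k][:, None]
    return np.where(np.abs(z) >= thresh, z, 0.0)  # keep k largest per row

for _ in range(400):  # plain SGD on reconstruction error
    z = encode(diff)
    err = z @ W_dec - diff
    W_dec -= 0.05 * z.T @ err / n_pairs
    W_enc -= 0.05 * diff.T @ ((err @ W_dec.T) * (z != 0)) / n_pairs

# Step 2: "measurable" preferences = features that vary across pairs.
z = encode(diff)
measurable = z.std(axis=0) > 1e-3

# Step 3: "realized" preferences = features that predict the label.
beta = np.zeros(n_feats)
for _ in range(500):  # logistic regression on feature activations
    p = 1 / (1 + np.exp(-(z @ beta)))
    beta -= 0.5 * z.T @ (p - labels) / n_pairs
acc = ((z @ beta > 0) == (labels > 0.5)).mean()
print(f"{measurable.sum()} measurable features, label accuracy {acc:.2f}")
```

In the paper's setting the discovered features come with natural-language descriptions; the sketch only shows the mechanical split between features that vary and features that carry label signal.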
Interpretable data curation for safety improvement

The authors demonstrate that WIMHF enables targeted data curation by identifying and correcting misaligned preferences in datasets. For example, flipping labels on examples with harmful anti-refusal preferences in Chatbot Arena substantially improves safety metrics while preserving overall model performance.

10 retrieved papers
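As a concrete illustration of this curation step, the sketch below flips labels on a toy dataset wherever a flagged harmful feature (e.g., an anti-refusal or toxicity feature) strongly favours the chosen response. The feature activations, the activation cutoff, and the flipping rule are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy preference dataset: per-pair activation of one WIMHF-style feature
# (hypothetical "chosen response is toxic / anti-refusal") and a label
# where 1 means annotators preferred response a.
n = 200
harmful_act = rng.normal(size=n)          # >0: response a exhibits the feature
labels = (harmful_act + 0.3 * rng.normal(size=n) > 0).astype(int)

# Curation rule sketched from the paper's description: where the harmful
# feature clearly favours the chosen response, flip the label so the
# refusing/non-toxic response is preferred instead.
threshold = 1.0  # assumed activation cutoff, not from the paper
flip = (harmful_act > threshold) & (labels == 1)
curated = np.where(flip, 0, labels)

print(f"flipped {flip.sum()} of {n} labels")
```

Only the examples expressing the flagged feature are touched, which is why this kind of targeted re-labeling can improve safety metrics without disturbing the rest of the preference signal.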
Interpretable personalization approach

The authors introduce an interpretable personalization method that identifies subjective preferences across annotators and learns user-specific coefficients for selected features. This approach allows practitioners to personalize models on acceptable attributes while preventing undesirable personalization, such as creating echo chambers.

10 retrieved papers
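A minimal sketch of this idea, under assumed simplifications: fit shared coefficients over interpretable feature activations plus one per-user coefficient on a designated subjective feature; the learned per-user coefficient then acts as an interpretable knob. The data, feature layout, and training loop are invented for illustration and are not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup: activations of a few interpretable features per pair;
# feature 0 is "subjective" (annotators disagree on its sign), the
# rest carry a shared preference. All values are invented.
n_users, n_pairs, n_feats = 5, 300, 4
z = rng.normal(size=(n_users, n_pairs, n_feats))
user_sign = np.array([1.0, -1.0, 1.0, -1.0, 1.0])   # per-user taste
w_shared_true = np.array([0.0, 1.0, -0.5, 0.8])

logits = z @ w_shared_true + user_sign[:, None] * z[:, :, 0]
labels = (logits > 0).astype(float)

# Fit shared coefficients plus one per-user coefficient ("knob") on the
# subjective feature, by gradient descent on the logistic loss.
w = np.zeros(n_feats)
u = np.zeros(n_users)
for _ in range(400):
    p = 1 / (1 + np.exp(-(z @ w + u[:, None] * z[:, :, 0])))
    g = p - labels
    w -= 0.1 * np.einsum('upf,up->f', z, g) / (n_users * n_pairs)
    u -= 0.1 * (g * z[:, :, 0]).sum(axis=1) / n_pairs

# u recovers each user's direction on the subjective axis; because the
# feature is interpretable, a practitioner can choose which axes to
# personalize and leave the shared coefficients fixed.
print(np.sign(u), user_sign)
```

Restricting the per-user term to vetted subjective features is what lets a practitioner allow personalization on style-like attributes while blocking it on attributes that would create echo chambers.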

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

The three claimed contributions (the WIMHF method, interpretable data curation for safety improvement, and the interpretable personalization approach) are restated above under Claimed Contributions. For each contribution, 10 candidate papers were retrieved and compared, and none was found to refute the claim.