Beyond Pairwise: Empowering LLM Alignment With (Ranked) Choice Modeling

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Language Models Fine-tuning, Discrete Choice Model, Ranked Choice Model, Alignment, Preference Optimization, Learning From Human Feedback
Abstract:

Alignment of large language models (LLMs) has predominantly relied on pairwise preference optimization, where annotators select the better of two responses to a prompt. While simple, this approach overlooks the opportunity to learn from richer forms of human feedback, such as multiwise comparisons and top-k rankings. We introduce Ranked Choice Preference Optimization (RCPO), a unified framework that bridges preference optimization with (ranked) choice modeling via maximum likelihood estimation. RCPO supports both utility-based and rank-based models, subsumes several pairwise methods (such as DPO and SimPO) as special cases, and provides principled training objectives for richer feedback formats. We instantiate this framework with two representative models (Multinomial Logit and Mallows-RMJ). Experiments on Llama-3-8B-Instruct, Gemma-2-9B-it, and Mistral-7B-Instruct across in-distribution and out-of-distribution settings show that RCPO consistently outperforms competitive baselines. These results indicate that directly leveraging ranked preference data, combined with the right choice models, yields more effective alignment. RCPO offers an extensible foundation for incorporating (ranked) choice modeling into LLM training.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces RCPO, a unified framework that extends preference optimization to ranked choice data using maximum likelihood estimation over choice models. It resides in the 'Direct Preference Optimization and Variants' leaf, which contains eight papers including DPO extensions, curriculum strategies, and focal weighting methods. This is a moderately populated research direction within the broader algorithmic core of preference optimization, suggesting active interest in refining objective functions and training procedures beyond standard pairwise comparisons.

The taxonomy reveals neighboring leaves addressing reward-model-based RLHF, multi-objective optimization, and listwise ranking methods. RCPO bridges the 'Direct Preference Optimization' cluster with the 'Preference Ranking and Listwise Optimization' category by formalizing ranked data through choice modeling rather than heuristic reordering or curriculum design. The scope note for its leaf explicitly excludes reward-model-based methods and multi-objective approaches, positioning RCPO as a direct policy optimization technique that leverages richer feedback formats without introducing separate reward components or Pareto frontiers.

Among the three contributions analyzed, the RCPO framework examined nine candidates with zero refutations, the model instantiations examined three candidates with zero refutations, and the choice-modeling connection examined ten candidates with zero refutations. The total search scope covered twenty-two candidates from top-K semantic retrieval and citation expansion. This limited search suggests that within the examined literature, no prior work directly anticipates RCPO's unified treatment of ranked preferences via Multinomial Logit and Mallows-RMJ models, though the modest candidate pool leaves open the possibility of relevant work outside this scope.

Based on the twenty-two candidates examined, RCPO appears to occupy a distinct niche by formalizing ranked preference optimization through established choice-modeling frameworks. The analysis does not cover exhaustive citation networks or domain-specific applications, and the absence of refutations reflects the limited search rather than definitive novelty. The framework's positioning between direct optimization and listwise methods suggests incremental but principled progress in leveraging richer feedback structures.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: Aligning large language models using ranked choice preference data. The field has evolved into a rich ecosystem of complementary research directions. At the highest level, one finds branches dedicated to algorithmic innovation—developing new preference optimization objectives and training procedures—alongside efforts focused on data quality, acquisition strategies, and the curation of high-quality preference signals. Other branches address personalization and diversity, ensuring that alignment respects varied user needs and cultural contexts, while domain-specific work tailors methods to applications such as translation, multimodal reasoning, or safety-critical deployments. Additional lines of inquiry examine evaluation frameworks, robustness under noisy or adversarial preferences, continual learning paradigms that adapt models over time, and alternative representation-based approaches that bypass traditional reward modeling. Surveys and meta-analyses synthesize these threads, offering comprehensive overviews of the rapidly maturing landscape.

Within the algorithmic core, a particularly active cluster centers on direct preference optimization and its many variants. Works such as Preference Ranking Optimization[17] and Soft Preference Optimization[21] explore how to refine objective functions to better capture nuanced preference structures, while Curry DPO[35] and FocalPO[36] investigate curriculum strategies and focal weighting to handle noisy or imbalanced data. Ranked Choice Modeling[0] sits naturally in this cluster, emphasizing principled treatment of ranked preference lists rather than simple pairwise comparisons. Compared to neighbors like Preference Re-ranking[37] or Curriculum Ranked Preferences[39], which focus on reordering or scheduling training examples, Ranked Choice Modeling[0] directly models the probabilistic structure of multi-alternative rankings.
This distinction highlights an ongoing tension in the field: whether to engineer better training curricula and data pipelines or to fundamentally rethink the loss functions and probabilistic assumptions underlying preference learning.

Claimed Contributions

Ranked Choice Preference Optimization (RCPO) framework

RCPO is a general framework that connects LLM preference optimization with choice modeling theory through maximum likelihood estimation. It supports both utility-based and rank-based models, subsumes pairwise methods like DPO and SimPO as special cases, and provides principled training objectives for richer feedback formats including single-best and top-k rankings.
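To make the maximum-likelihood view concrete, the following is a minimal sketch of a Multinomial Logit choice loss over per-response rewards. This is an illustration only, not the paper's implementation: in RCPO the rewards would come from the policy's log-probabilities, which is not reproduced here, and the function name is hypothetical.

```python
import math

def mnl_choice_nll(rewards, chosen_idx):
    """Negative log-likelihood of selecting `chosen_idx` from a candidate
    set under a Multinomial Logit model over scalar rewards.
    Sketch only; RCPO's actual reward parameterization is not shown."""
    m = max(rewards)  # log-sum-exp shift for numerical stability
    log_z = m + math.log(sum(math.exp(r - m) for r in rewards))
    return log_z - rewards[chosen_idx]
```

With exactly two candidates this reduces algebraically to the familiar pairwise logistic loss, -log(sigmoid(r_w - r_l)), which is how DPO- and SimPO-style objectives arise as special cases of the multiwise model.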

9 retrieved papers
Instantiation with Multinomial Logit and Mallows-RMJ models

The authors derive concrete alignment objectives for two broad classes of choice models: the Multinomial Logit model for utility-based choices and the Mallows-RMJ model for rank-based choices. For both models, they provide training objectives under single-best and top-k settings, demonstrating the framework's flexibility across diverse preference structures.
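For the top-k setting, the Multinomial Logit model factorizes a ranking into a sequence of choices from the alternatives still remaining (the Plackett-Luce factorization). The sketch below illustrates that objective under that assumption; the Mallows-RMJ objective is a distinct, distance-based rank model and is not shown, and the function name is illustrative.

```python
import math

def top_k_mnl_nll(rewards, ranking):
    """NLL of a top-k ranking (indices, best first) under a sequential
    Multinomial Logit / Plackett-Luce factorization: each ranked item is
    an MNL choice among the alternatives not yet ranked. Sketch only."""
    remaining = list(range(len(rewards)))
    nll = 0.0
    for idx in ranking:
        m = max(rewards[j] for j in remaining)  # log-sum-exp shift
        log_z = m + math.log(sum(math.exp(rewards[j] - m) for j in remaining))
        nll += log_z - rewards[idx]
        remaining.remove(idx)
    return nll
```

With a ranking of length one this collapses to the single-best choice likelihood, so single-best feedback is the k = 1 special case of the same objective.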

3 retrieved papers
Systematic connection between LLM fine-tuning and choice modeling

The authors formalize how LLM alignment can be viewed as maximum likelihood estimation of choice models by interpreting prompts as contexts, responses as items, and candidate sets as assortments. This conceptual insight enables any choice model satisfying certain regularity conditions to be integrated into preference optimization.
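The stated mapping (prompts as contexts, responses as items, candidate sets as assortments) can be sketched as a plain data record plus a pluggable log-likelihood, showing how any choice model meeting the interface slots into the same objective. All names below are hypothetical, not the paper's API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class ChoiceObservation:
    """One feedback record under the stated mapping (names hypothetical):
    prompt -> context, responses -> items, candidate set -> assortment."""
    prompt: str
    responses: Sequence[str]   # the assortment shown to the annotator
    ranking: Sequence[int]     # observed top-k preference, best first

def rcpo_nll(obs: ChoiceObservation,
             rank_log_likelihood: Callable[[Sequence[float], Sequence[int]], float],
             rewards: Sequence[float]) -> float:
    """Any choice model that exposes a log-likelihood of the observed
    ranking given per-item rewards can be swapped in unchanged."""
    return -rank_log_likelihood(rewards, obs.ranking)
```

The design point is that the training loop never needs to know which choice model is in use; swapping Multinomial Logit for Mallows-RMJ only changes the `rank_log_likelihood` argument.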

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
