Beyond Pairwise: Empowering LLM Alignment With (Ranked) Choice Modeling

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Language Models Fine-tuning, Discrete Choice Model, Ranked Choice Model, Alignment, Preference Optimization, Learning From Human Feedback
Abstract:

Alignment of large language models (LLMs) has predominantly relied on pairwise preference optimization, where annotators select the better of two responses to a prompt. While simple, this approach overlooks the opportunity to learn from richer forms of human feedback, such as multiwise comparisons and top-k rankings. We introduce Ranked Choice Preference Optimization (RCPO), a unified framework that bridges preference optimization with (ranked) choice modeling via maximum likelihood estimation. RCPO supports both utility-based and rank-based models, subsumes several pairwise methods (such as DPO and SimPO) as special cases, and provides principled training objectives for richer feedback formats. We instantiate this framework with two representative models (Multinomial Logit and Mallows-RMJ). Experiments on Llama-3-8B-Instruct, Gemma-2-9B-it, and Mistral-7B-Instruct across in-distribution and out-of-distribution settings show that RCPO consistently outperforms competitive baselines. These results indicate that directly leveraging ranked preference data, combined with the right choice models, yields more effective alignment. RCPO offers an extensible foundation for incorporating (ranked) choice modeling into LLM training.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces RCPO, a unified framework that extends preference optimization to ranked choice data using maximum likelihood estimation over choice models. It resides in the 'Direct Preference Optimization and Variants' leaf, which contains eight papers including DPO extensions, curriculum strategies, and focal weighting methods. This is a moderately populated research direction within the broader algorithmic core of preference optimization, suggesting active interest in refining objective functions and training procedures beyond standard pairwise comparisons.

The taxonomy reveals neighboring leaves addressing reward-model-based RLHF, multi-objective optimization, and listwise ranking methods. RCPO bridges the 'Direct Preference Optimization' cluster with the 'Preference Ranking and Listwise Optimization' category by formalizing ranked data through choice modeling rather than heuristic reordering or curriculum design. The scope note for its leaf explicitly excludes reward-model-based methods and multi-objective approaches, positioning RCPO as a direct policy optimization technique that leverages richer feedback formats without introducing separate reward components or Pareto frontiers.

Among the three contributions analyzed, the RCPO framework examined nine candidates with zero refutations, the model instantiations examined three candidates with zero refutations, and the choice-modeling connection examined ten candidates with zero refutations. The total search scope covered twenty-two candidates from top-K semantic retrieval and citation expansion. This limited search suggests that within the examined literature, no prior work directly anticipates RCPO's unified treatment of ranked preferences via Multinomial Logit and Mallows-RMJ models, though the modest candidate pool leaves open the possibility of relevant work outside this scope.

Based on the twenty-two candidates examined, RCPO appears to occupy a distinct niche by formalizing ranked preference optimization through established choice-modeling frameworks. The analysis does not cover exhaustive citation networks or domain-specific applications, and the absence of refutations reflects the limited search rather than definitive novelty. The framework's positioning between direct optimization and listwise methods suggests incremental but principled progress in leveraging richer feedback structures.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: Aligning large language models using ranked choice preference data. The field has evolved into a rich ecosystem of complementary research directions. At the highest level, one finds branches dedicated to algorithmic innovation—developing new preference optimization objectives and training procedures—alongside efforts focused on data quality, acquisition strategies, and the curation of high-quality preference signals. Other branches address personalization and diversity, ensuring that alignment respects varied user needs and cultural contexts, while domain-specific work tailors methods to applications such as translation, multimodal reasoning, or safety-critical deployments. Additional lines of inquiry examine evaluation frameworks, robustness under noisy or adversarial preferences, continual learning paradigms that adapt models over time, and alternative representation-based approaches that bypass traditional reward modeling. Surveys and meta-analyses synthesize these threads, offering comprehensive overviews of the rapidly maturing landscape.

Within the algorithmic core, a particularly active cluster centers on direct preference optimization and its many variants. Works such as Preference Ranking Optimization[17] and Soft Preference Optimization[21] explore how to refine objective functions to better capture nuanced preference structures, while Curry DPO[35] and FocalPO[36] investigate curriculum strategies and focal weighting to handle noisy or imbalanced data. Ranked Choice Modeling[0] sits naturally in this cluster, emphasizing principled treatment of ranked preference lists rather than simple pairwise comparisons. Compared to neighbors like Preference Re-ranking[37] or Curriculum Ranked Preferences[39], which focus on reordering or scheduling training examples, Ranked Choice Modeling[0] directly models the probabilistic structure of multi-alternative rankings.
This distinction highlights an ongoing tension in the field: whether to engineer better training curricula and data pipelines or to fundamentally rethink the loss functions and probabilistic assumptions underlying preference learning.

Claimed Contributions

Ranked Choice Preference Optimization (RCPO) framework

RCPO is a general framework that connects LLM preference optimization with choice modeling theory through maximum likelihood estimation. It supports both utility-based and rank-based models, subsumes pairwise methods like DPO and SimPO as special cases, and provides principled training objectives for richer feedback formats including single-best and top-k rankings.
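To make the maximum-likelihood view concrete, the following is a minimal sketch of a Multinomial Logit choice loss over per-response rewards. This is an illustration only, not the paper's implementation: in RCPO the rewards would come from the policy's log-probabilities, which is not reproduced here, and the function name is hypothetical.

```python
import math

def mnl_choice_nll(rewards, chosen_idx):
    """Negative log-likelihood of selecting `chosen_idx` from a candidate
    set under a Multinomial Logit model over scalar rewards.
    Sketch only; RCPO's actual reward parameterization is not shown."""
    m = max(rewards)  # log-sum-exp shift for numerical stability
    log_z = m + math.log(sum(math.exp(r - m) for r in rewards))
    return log_z - rewards[chosen_idx]
```

With exactly two candidates this reduces algebraically to the familiar pairwise logistic loss, -log(sigmoid(r_w - r_l)), which is how DPO- and SimPO-style objectives arise as special cases of the multiwise model.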

9 retrieved papers
Instantiation with Multinomial Logit and Mallows-RMJ models

The authors derive concrete alignment objectives for two broad classes of choice models: the Multinomial Logit model for utility-based choices and the Mallows-RMJ model for rank-based choices. For both models, they provide training objectives under single-best and top-k settings, demonstrating the framework's flexibility across diverse preference structures.
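For the top-k setting, the Multinomial Logit model factorizes a ranking into a sequence of choices from the alternatives still remaining (the Plackett-Luce factorization). The sketch below illustrates that objective under that assumption; the Mallows-RMJ objective is a distinct, distance-based rank model and is not shown, and the function name is illustrative.

```python
import math

def top_k_mnl_nll(rewards, ranking):
    """NLL of a top-k ranking (indices, best first) under a sequential
    Multinomial Logit / Plackett-Luce factorization: each ranked item is
    an MNL choice among the alternatives not yet ranked. Sketch only."""
    remaining = list(range(len(rewards)))
    nll = 0.0
    for idx in ranking:
        m = max(rewards[j] for j in remaining)  # log-sum-exp shift
        log_z = m + math.log(sum(math.exp(rewards[j] - m) for j in remaining))
        nll += log_z - rewards[idx]
        remaining.remove(idx)
    return nll
```

With a ranking of length one this collapses to the single-best choice likelihood, so single-best feedback is the k = 1 special case of the same objective.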

3 retrieved papers
Systematic connection between LLM fine-tuning and choice modeling

The authors formalize how LLM alignment can be viewed as maximum likelihood estimation of choice models by interpreting prompts as contexts, responses as items, and candidate sets as assortments. This conceptual insight enables any choice model satisfying certain regularity conditions to be integrated into preference optimization.
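The stated mapping (prompts as contexts, responses as items, candidate sets as assortments) can be sketched as a plain data record plus a pluggable log-likelihood, showing how any choice model meeting the interface slots into the same objective. All names below are hypothetical, not the paper's API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class ChoiceObservation:
    """One feedback record under the stated mapping (names hypothetical):
    prompt -> context, responses -> items, candidate set -> assortment."""
    prompt: str
    responses: Sequence[str]   # the assortment shown to the annotator
    ranking: Sequence[int]     # observed top-k preference, best first

def rcpo_nll(obs: ChoiceObservation,
             rank_log_likelihood: Callable[[Sequence[float], Sequence[int]], float],
             rewards: Sequence[float]) -> float:
    """Any choice model that exposes a log-likelihood of the observed
    ranking given per-item rewards can be swapped in unchanged."""
    return -rank_log_likelihood(rewards, obs.ranking)
```

The design point is that the training loop never needs to know which choice model is in use; swapping Multinomial Logit for Mallows-RMJ only changes the `rank_log_likelihood` argument.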

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
