Semi-Supervised Preference Optimization with Limited Feedback
Overview
Overall Novelty Assessment
The paper proposes a semi-supervised preference optimization framework that learns from both limited labeled preference pairs and large-scale unlabeled data. Within the taxonomy, it resides in the 'Pseudo-Labeling and Self-Training Frameworks' leaf under 'Semi-Supervised and Data-Efficient Preference Learning'. This leaf contains only three papers, suggesting a relatively sparse research direction compared to the broader field. The work directly addresses the challenge of reducing annotation costs while maintaining alignment quality, positioning itself alongside sibling papers that similarly explore self-training and knowledge propagation strategies for preference learning.
The taxonomy reveals neighboring research directions that tackle data efficiency through different mechanisms. The sibling leaf 'Uncertainty-Guided and Robust Preference Optimization' focuses on filtering unreliable preferences through uncertainty estimation rather than pseudo-labeling. Another sibling, 'Synthetic and Few-Shot Preference Data Generation', generates artificial preference data instead of leveraging unlabeled real data. The parent branch 'Active and Adaptive Preference Acquisition' represents an alternative paradigm emphasizing strategic annotation selection over unlabeled data exploitation. These structural relationships indicate that while semi-supervised approaches exist, the field explores multiple complementary strategies for addressing annotation scarcity.
Across the three contributions, thirty candidate papers were examined, and none was identified as clearly refuting the proposed work. Ten candidates were checked against the SSPO framework with zero refutable overlaps, and the same held for the theoretical threshold guarantee and the adaptive scheduling contributions. The absence of refutation within this limited search scope suggests that the specific combination of semi-supervised learning with theoretical threshold guarantees may be relatively unexplored. However, the search covered only top-K semantic matches and citations, not an exhaustive literature review. The theoretical contribution appears particularly distinctive within this limited sample, though broader searches might reveal related threshold-based or semi-supervised reward modeling work.
Based on the limited thirty-candidate search, the work appears to occupy a relatively novel position combining semi-supervised learning with principled pseudo-labeling for preference optimization. The sparse population of its taxonomy leaf and absence of refuting candidates suggest potential originality, though the search scope prevents definitive conclusions about the broader landscape. The theoretical threshold guarantee and integration with curriculum learning represent potentially distinctive elements within the examined sample.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose SSPO, a framework that combines limited labeled preference data with abundant unpaired data for language model alignment. This approach addresses the high cost of acquiring paired preference annotations by leveraging semi-supervised learning to improve data efficiency.
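The exact objective is not reproduced in this summary; the sketch below shows one way such a semi-supervised loss could be assembled, assuming a DPO-style Bradley-Terry pairwise term on labeled pairs plus a thresholded pseudo-label term on unpaired responses. All function names, the threshold `tau`, and the weight `lam` are illustrative assumptions, not the paper's definitions.

```python
import math

def dpo_loss(r_w, r_l, beta=1.0):
    # Bradley-Terry pairwise loss on one labeled preference pair:
    # r_w / r_l are implicit rewards of the preferred / dispreferred response.
    return math.log(1.0 + math.exp(-beta * (r_w - r_l)))

def pseudo_label(reward, tau):
    # Label an unpaired response as winning (1) or losing (0) via threshold tau.
    return 1 if reward >= tau else 0

def semi_supervised_loss(labeled_pairs, unpaired_rewards, tau, lam, beta=1.0):
    # Supervised term over labeled preference pairs.
    sup = sum(dpo_loss(rw, rl, beta) for rw, rl in labeled_pairs) / len(labeled_pairs)
    # Pseudo-labeled term: cross-entropy of each unpaired response
    # against its thresholded label.
    unsup = 0.0
    for r in unpaired_rewards:
        y = pseudo_label(r, tau)
        p = 1.0 / (1.0 + math.exp(-beta * (r - tau)))
        unsup += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    unsup /= len(unpaired_rewards)
    return sup + lam * unsup
```

Setting `lam = 0` recovers purely supervised training on the labeled pairs, which is the natural baseline such a framework would be compared against.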
The authors provide a theoretical result (Theorem 1) establishing that under mild distributional assumptions, there exists an optimal reward threshold that separates winning and losing responses with high probability. This theoretical foundation justifies their pseudo-labeling strategy for unpaired data.
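Theorem 1 itself is not reproduced here. As a hedged numerical illustration of the threshold claim, assume winning and losing rewards follow equal-variance Gaussians (an assumption for this sketch, not necessarily the paper's distributional condition); the midpoint between the two means then maximizes balanced separation accuracy.

```python
import math

def normal_cdf(x, mu, sigma):
    # CDF of a Gaussian N(mu, sigma^2), via the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def separation_prob(tau, mu_w, mu_l, sigma):
    # Balanced accuracy of threshold tau, assuming winning rewards follow
    # N(mu_w, sigma^2) and losing rewards follow N(mu_l, sigma^2).
    p_win_above = 1.0 - normal_cdf(tau, mu_w, sigma)   # winners scored >= tau
    p_lose_below = normal_cdf(tau, mu_l, sigma)        # losers scored < tau
    return 0.5 * (p_win_above + p_lose_below)

# With mu_w = 1, mu_l = -1 and equal variances, the optimal threshold
# is the midpoint tau* = 0, and separation improves as sigma shrinks.
tau_star = 0.0
acc = separation_prob(tau_star, 1.0, -1.0, 0.5)
```

The general statement of the theorem would replace the Gaussian assumption with whatever "mild distributional assumptions" the paper actually imposes; the sketch only illustrates why a high-probability separating threshold can exist.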
The authors develop an adaptive scheduling mechanism that dynamically balances the influence of paired and unpaired data during training. The scheduler initially prioritizes reliable paired data and gradually increases the weight of pseudo-labeled unpaired data as the reward function improves, implementing a principled curriculum learning approach.
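The schedule's exact form is not given in this summary; a minimal sketch of a warmup-then-ramp weight for the pseudo-labeled term follows. The linear ramp, `warmup_frac`, and `lam_max` are illustrative choices standing in for whatever adaptive rule the paper uses.

```python
def unlabeled_weight(step, total_steps, lam_max=1.0, warmup_frac=0.2):
    # Curriculum-style schedule: rely only on labeled pairs early, then
    # ramp the pseudo-labeled weight toward lam_max as training progresses
    # (and, presumably, as the reward function becomes more reliable).
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return 0.0                        # trust only labeled pairs at first
    progress = (step - warmup) / max(1, total_steps - warmup)
    return lam_max * min(1.0, progress)   # linear ramp up to lam_max
```

A truly adaptive variant would tie the ramp to a measured quantity (e.g. reward-model validation accuracy) rather than the raw step count; the step-based ramp is only the simplest stand-in.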
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Semi-Supervised Preference Optimization (SSPO) framework
The authors propose SSPO, a framework that combines limited labeled preference data with abundant unpaired data for language model alignment. This approach addresses the high cost of acquiring paired preference annotations by leveraging semi-supervised learning to improve data efficiency.
[30] A critical evaluation of AI feedback for aligning large language models
[31] The delta learning hypothesis: Preference tuning on weak data can yield strong gains
[32] The political preferences of LLMs
[33] Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation
[34] Direct preference optimization: Your language model is secretly a reward model
[35] Aligning large language models with preference privacy
[36] Reinforced Preference Optimization for Recommendation
[37] Towards a unified view of preference learning for large language models: A survey
[38] Data-centric human preference with rationales for direct preference alignment
[39] Modality-balancing preference optimization of large multimodal models by adversarial negative mining
Theoretical guarantee for optimal reward threshold existence
The authors provide a theoretical result (Theorem 1) establishing that under mild distributional assumptions, there exists an optimal reward threshold that separates winning and losing responses with high probability. This theoretical foundation justifies their pseudo-labeling strategy for unpaired data.
[20] Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs
[21] Self-rewarding correction for mathematical reasoning
[22] GRPO-LEAD: A difficulty-aware reinforcement learning approach for concise mathematical reasoning in language models
[23] DPO-Shift: Shifting the distribution of direct preference optimization
[24] Length-controlled margin-based preference optimization without reference model
[25] LEGEND: Leveraging representation engineering to annotate safety margin for preference datasets
[26] Not all preference pairs are created equal: A recipe for annotation-efficient iterative preference learning
[27] A Correctness and Incorrectness Program Logic
[28] Can DPO Learn Diverse Human Values? A Theoretical Scaling Law
[29] Towards understanding the influence of reward margin on preference model performance
Adaptive scheduling strategy for curriculum learning
The authors develop an adaptive scheduling mechanism that dynamically balances the influence of paired and unpaired data during training. The scheduler initially prioritizes reliable paired data and gradually increases the weight of pseudo-labeled unpaired data as the reward function improves, implementing a principled curriculum learning approach.