Semi-Supervised Preference Optimization with Limited Feedback

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Preference Optimization, Semi-Supervised Learning
Abstract:

The field of preference optimization has made substantial contributions to aligning language models with human preferences. Despite these advances, recent methods still rely heavily on large amounts of paired (labeled) feedback data, which is costly to acquire. To address this challenge, we study Semi-Supervised Preference Optimization (SSPO), which learns simultaneously from a small number of pairwise preference labels and a large pool of unpaired samples. Our key theoretical contribution proves the existence of an optimal reward threshold that separates winning and losing responses with high probability, enabling principled pseudo-labeling of unpaired data. By leveraging these pseudo-labels, SSPO distills latent preferences from large-scale unpaired data, maintaining human alignment while drastically reducing annotation costs. Extensive experiments across datasets validate this data efficiency; for instance, SSPO trained with Mistral-7B-Instruct on just 1% of UltraFeedback consistently surpasses strong baselines trained on 10% of UltraFeedback.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a semi-supervised preference optimization framework that learns from both limited labeled preference pairs and large-scale unlabeled data. Within the taxonomy, it resides in the 'Pseudo-Labeling and Self-Training Frameworks' leaf under 'Semi-Supervised and Data-Efficient Preference Learning'. This leaf contains only three papers total, suggesting a relatively sparse research direction compared to the broader field. The work directly addresses the challenge of reducing annotation costs while maintaining alignment quality, positioning itself alongside sibling papers that similarly explore self-training and knowledge propagation strategies for preference learning.

The taxonomy reveals neighboring research directions that tackle data efficiency through different mechanisms. The sibling leaf 'Uncertainty-Guided and Robust Preference Optimization' focuses on filtering unreliable preferences through uncertainty estimation rather than pseudo-labeling. Another sibling, 'Synthetic and Few-Shot Preference Data Generation', generates artificial preference data instead of leveraging unlabeled real data. The parent branch 'Active and Adaptive Preference Acquisition' represents an alternative paradigm emphasizing strategic annotation selection over unlabeled data exploitation. These structural relationships indicate that while semi-supervised approaches exist, the field explores multiple complementary strategies for addressing annotation scarcity.

Among the thirty candidates examined across the three contributions, none clearly refuted the proposed work. The SSPO framework contribution was compared against ten candidates with zero refutable overlaps, as were the theoretical threshold guarantee and the adaptive scheduling contributions. The absence of refutation within this limited search scope suggests that the specific combination of semi-supervised learning with theoretical threshold guarantees may be relatively unexplored. However, the search examined only top-K semantic matches and citations, not an exhaustive literature review. The theoretical contribution appears particularly distinctive within this limited sample, though broader searches might reveal related threshold-based or semi-supervised reward modeling work.

Based on the limited thirty-candidate search, the work appears to occupy a relatively novel position combining semi-supervised learning with principled pseudo-labeling for preference optimization. The sparse population of its taxonomy leaf and absence of refuting candidates suggest potential originality, though the search scope prevents definitive conclusions about the broader landscape. The theoretical threshold guarantee and integration with curriculum learning represent potentially distinctive elements within the examined sample.

Taxonomy

Core-task Taxonomy Papers: 19
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Semi-supervised preference optimization for language model alignment. The field addresses how to align large language models with human preferences when labeled preference data is scarce or expensive to obtain.

The taxonomy organizes approaches into three main branches. Semi-Supervised and Data-Efficient Preference Learning encompasses methods that leverage unlabeled data through pseudo-labeling, self-training, and reward modeling techniques, exemplified by works like Semi-supervised fine-tuning[2] and Semi-supervised reward modeling[4]. Active and Adaptive Preference Acquisition focuses on intelligently selecting which examples to label, reducing annotation costs through strategic querying strategies such as Active Preference Learning[1] and Annotation-Efficient Preference[5]. Specialized Preference Optimization Paradigms explores alternative formulations and problem settings, including group-based methods like Group Preference Optimization[12] and novel alignment frameworks such as Alignment as Distribution[15]. These branches reflect a shared goal of making preference learning more practical and scalable while addressing different bottlenecks in the alignment pipeline.

A central tension across these branches involves the trade-off between annotation efficiency and model quality. Many studies explore how to bootstrap from limited labels, with some works emphasizing self-evolutionary approaches like Self-evolutionary LLMs[3] that iteratively refine preferences, while others such as MAPLE[7] and CURATRON[9] focus on curating high-quality subsets for labeling. Semi-Supervised Preference Optimization[0] sits squarely within the pseudo-labeling and self-training cluster, sharing methodological DNA with Semi-supervised fine-tuning[2] and Semi-supervised reward modeling[4].
Compared to these neighbors, the original work emphasizes leveraging unlabeled preference pairs to augment sparse human annotations, a strategy that contrasts with purely active methods like Active Preference Learning[1] which prioritize selective querying over pseudo-label generation. This positioning highlights an ongoing question in the field: whether to invest effort in smarter data selection or in better exploitation of abundant unlabeled data.

Claimed Contributions

Semi-Supervised Preference Optimization (SSPO) framework

The authors propose SSPO, a framework that combines limited labeled preference data with abundant unpaired data for language model alignment. This approach addresses the high cost of acquiring paired preference annotations by leveraging semi-supervised learning to improve data efficiency.

10 retrieved papers
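The report does not reproduce SSPO's exact objective. As a minimal sketch of the semi-supervised idea, assuming a DPO-style pairwise loss over (winner minus loser) reward margins and a hypothetical weight `lam` that mixes in pseudo-labeled pairs (the function names and combination rule here are illustrative, not the paper's implementation):

```python
import math

def logsigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def pairwise_loss(margins, beta=0.1):
    """Mean DPO-style loss over (winner - loser) reward margins."""
    return -sum(logsigmoid(beta * m) for m in margins) / len(margins)

def sspo_style_loss(labeled_margins, pseudo_margins, lam):
    """Hypothetical semi-supervised objective: supervised loss on
    labeled preference pairs plus a lam-weighted loss on pairs
    pseudo-labeled from the unpaired pool."""
    return pairwise_loss(labeled_margins) + lam * pairwise_loss(pseudo_margins)
```

With `lam = 0` this reduces to training on labeled pairs only; increasing `lam` lets the pseudo-labeled pool contribute, which is the lever the paper's scheduler (third contribution below) would control.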
Theoretical guarantee for optimal reward threshold existence

The authors provide a theoretical result (Theorem 1) establishing that under mild distributional assumptions, there exists an optimal reward threshold that separates winning and losing responses with high probability. This theoretical foundation justifies their pseudo-labeling strategy for unpaired data.

10 retrieved papers
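In the spirit of Theorem 1, a threshold tau that separates winning from losing responses turns scored unpaired data into preference pairs. The following sketch assumes responses have already been scored by some reward function; the pairing rule and function signature are illustrative assumptions, not the paper's algorithm:

```python
def pseudo_label(responses, rewards, tau):
    """Split unpaired responses at a reward threshold tau: scores above
    tau become pseudo-winners, scores at or below become pseudo-losers
    (Theorem 1 guarantees such a separating tau exists w.h.p.)."""
    winners = [r for r, s in zip(responses, rewards) if s > tau]
    losers = [r for r, s in zip(responses, rewards) if s <= tau]
    # Pair each pseudo-winner with a pseudo-loser to form training pairs.
    return list(zip(winners, losers))
```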
Adaptive scheduling strategy for curriculum learning

The authors develop an adaptive scheduling mechanism that dynamically balances the influence of paired and unpaired data during training. The scheduler initially prioritizes reliable paired data and gradually increases the weight of pseudo-labeled unpaired data as the reward function improves, implementing a principled curriculum learning approach.

10 retrieved papers
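The description suggests a curriculum that starts on paired data and ramps up the pseudo-label weight as the reward improves. One plausible shape, with the warmup fraction and cosine ramp chosen purely for illustration (the paper's actual schedule may differ):

```python
import math

def schedule_weight(step, total_steps, max_weight=1.0, warmup_frac=0.2):
    """Hypothetical ramp for the pseudo-label loss weight: hold at 0
    during a warmup phase on reliable paired data, then increase
    smoothly toward max_weight as training matures."""
    warmup = warmup_frac * total_steps
    if step < warmup:
        return 0.0
    progress = min((step - warmup) / (total_steps - warmup), 1.0)
    # Cosine ramp from 0 at end of warmup to max_weight at the end.
    return max_weight * 0.5 * (1.0 - math.cos(math.pi * progress))
```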

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Semi-Supervised Preference Optimization (SSPO) framework

The authors propose SSPO, a framework that combines limited labeled preference data with abundant unpaired data for language model alignment. This approach addresses the high cost of acquiring paired preference annotations by leveraging semi-supervised learning to improve data efficiency.

Contribution

Theoretical guarantee for optimal reward threshold existence

The authors provide a theoretical result (Theorem 1) establishing that under mild distributional assumptions, there exists an optimal reward threshold that separates winning and losing responses with high probability. This theoretical foundation justifies their pseudo-labeling strategy for unpaired data.

Contribution

Adaptive scheduling strategy for curriculum learning

The authors develop an adaptive scheduling mechanism that dynamically balances the influence of paired and unpaired data during training. The scheduler initially prioritizes reliable paired data and gradually increases the weight of pseudo-labeled unpaired data as the reward function improves, implementing a principled curriculum learning approach.