Accessible, Realistic, and Fair Evaluation of Positive-Unlabeled Learning Algorithms

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: positive-unlabeled learning, weakly supervised learning.
Abstract:

Positive-unlabeled (PU) learning is a weakly supervised binary classification problem in which the goal is to learn a binary classifier from only positive and unlabeled data, without access to negative data. In recent years, many PU learning algorithms have been developed to improve model performance. However, their experimental settings are highly inconsistent, making it difficult to identify which algorithm actually performs better. In this paper, we propose the first PU learning benchmark to systematically compare PU learning algorithms. During our implementation, we identify subtle yet critical factors that affect the realistic and fair evaluation of PU learning algorithms. On the one hand, many PU learning algorithms rely on a validation set that includes negative data for model selection. This is unrealistic in traditional PU learning settings, where no negative data are available. To handle this problem, we systematically investigate model selection criteria for PU learning. On the other hand, the problem settings and solutions of PU learning fall into two distinct families, i.e., the one-sample and two-sample settings. Existing evaluation protocols, however, are heavily biased towards the one-sample setting and neglect the significant difference between the two. We identify the internal label shift problem of unlabeled training data in the one-sample setting and propose a simple yet effective calibration approach to ensure fair comparisons within and across families. We hope our framework will provide an accessible, realistic, and fair environment for evaluating PU learning algorithms in the future.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes the first systematic benchmarking framework for positive-unlabeled learning algorithms, addressing inconsistent experimental settings that hinder fair comparison. Within the taxonomy, it resides in the 'Comprehensive Benchmarking and Evaluation Protocols' leaf under 'Evaluation Methodologies and Benchmarking Frameworks'. This leaf contains only two papers, including the original work and one sibling ('Evaluating PU Learning'). The sparse population suggests this is an emerging research direction rather than a crowded subfield, with limited prior work establishing standardized evaluation protocols for PU learning.

The taxonomy reveals that while core PU algorithms and specialized paradigms are well-developed (with multiple leaves containing 3-5 papers each), the evaluation methodology branch remains relatively underpopulated. Neighboring leaves address 'Performance Metrics and Measurement' (3 papers) and 'Model Selection and Validation Strategies' (1 paper), indicating that measurement and validation are recognized challenges but lack comprehensive benchmarking frameworks. The paper bridges evaluation infrastructure with algorithmic diversity across risk estimation, iterative methods, and robustness challenges, connecting to multiple branches while focusing specifically on systematic comparison protocols.

Among 30 candidates examined, none clearly refute the three main contributions. The first contribution (systematic benchmark) examined 10 candidates with zero refutable matches, suggesting novelty in establishing comprehensive evaluation infrastructure. The second contribution (model selection without negative validation data) also examined 10 candidates with no refutations, indicating this addresses a previously unresolved practical challenge. The third contribution (internal label shift identification and calibration) similarly found no overlapping prior work among 10 candidates. The limited search scope means these findings reflect top-30 semantic matches rather than exhaustive coverage, but the absence of refutations across all contributions suggests substantive novelty within the examined literature.

Given the sparse taxonomy leaf (2 papers total) and zero refutations across 30 candidates examined, the work appears to address a recognized gap in PU learning evaluation. The analysis covers top-K semantic matches and does not claim exhaustive field coverage, so additional related work may exist beyond this scope. However, the convergence of taxonomy structure and contribution-level statistics suggests the paper occupies relatively unexplored territory in establishing standardized, fair benchmarking protocols for a mature algorithmic landscape.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: benchmarking and evaluation of positive-unlabeled learning algorithms. The field of positive-unlabeled (PU) learning addresses scenarios where only a subset of positive examples is labeled while the remaining data is unlabeled, mixing hidden positives with true negatives. The taxonomy reflects a mature research landscape organized around several complementary dimensions. Evaluation Methodologies and Benchmarking Frameworks focus on systematic protocols for comparing algorithms, as seen in works like Evaluating PU Learning[1] and Accessible Fair PU Evaluation[0]. Core PU Learning Algorithms and Theoretical Foundations encompass foundational methods and theoretical guarantees, including risk estimators and class-prior estimation techniques such as those explored in PU versus PN Theory[7] and Mixture Proportion Estimation[26]. Specialized PU Learning Paradigms branch into extensions like online learning (Online PU Learning[4]), meta-learning (Meta PU Learning[5]), and self-paced strategies (Self PU[3]). Robustness and Data Quality Challenges address noise, label corruption, and distribution shifts, while Advanced Representation and Regularization Techniques explore contrastive learning (PU Contrastive Learning[11]) and geometric constraints (Angular Regularization Hypersphere[10]). Domain-Specific Applications span cybersecurity, healthcare, and recommendation systems, and Methodological Surveys provide comprehensive overviews like PU Classifier Review[6].

Recent activity highlights tensions between algorithmic innovation and rigorous evaluation. Many studies introduce novel loss functions, regularization schemes, or architectural designs, yet standardized benchmarking remains challenging due to varying assumptions about class priors, label noise, and data distributions.
Works addressing robustness, such as Positive Distribution Pollution[12] and Confidence Instance Noise[15], underscore the difficulty of maintaining performance when positive labels are corrupted or when unlabeled data deviates from training assumptions. Accessible Fair PU Evaluation[0] sits squarely within the Evaluation Methodologies branch, emphasizing the need for transparent, reproducible benchmarking protocols that can fairly compare diverse PU algorithms. Its focus on accessibility and fairness in evaluation complements earlier efforts like Evaluating PU Learning[1], which laid groundwork for systematic comparison, and contrasts with algorithm-centric works such as Meta PU Learning[5] or Self PU[3] that prioritize novel learning strategies over evaluation infrastructure. By addressing gaps in how PU methods are assessed, Accessible Fair PU Evaluation[0] aims to provide the community with tools to navigate the growing diversity of approaches and ensure that empirical claims rest on solid experimental foundations.
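The risk-estimation line referenced above (e.g., PU versus PN Theory[7]) centers on rewriting the classification risk using only positive and unlabeled data plus an assumed class prior. As a hedged illustration (standard non-negative PU risk in the literature, not code from the paper under review), the estimate can be sketched as:

```python
import numpy as np

def sigmoid_loss(scores, label):
    # Sigmoid surrogate loss l(z, y) = 1 / (1 + exp(y * z)):
    # small when the score agrees in sign with the label y in {+1, -1}.
    return 1.0 / (1.0 + np.exp(label * scores))

def nnpu_risk(scores_p, scores_u, prior):
    """Non-negative PU risk estimate from positive/unlabeled scores.

    scores_p: classifier scores on labeled-positive data
    scores_u: classifier scores on unlabeled data
    prior:    assumed class prior pi = P(y = +1)
    """
    risk_p_pos = np.mean(sigmoid_loss(scores_p, +1))  # positives treated as +1
    risk_p_neg = np.mean(sigmoid_loss(scores_p, -1))  # positives treated as -1
    risk_u_neg = np.mean(sigmoid_loss(scores_u, -1))  # unlabeled treated as -1
    # Clamp the implied negative-class risk at zero; the unbiased version
    # (without max) can go negative and cause severe overfitting.
    neg_part = max(0.0, risk_u_neg - prior * risk_p_neg)
    return prior * risk_p_pos + neg_part
```

Varying assumptions about `prior` are exactly one of the inconsistencies a benchmark must control for: two papers using the same estimator with different priors report incomparable numbers.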

Claimed Contributions

First PU learning benchmark for systematic algorithm comparison

The authors develop a unified experimental framework that enables systematic and fair comparison of state-of-the-art positive-unlabeled learning algorithms. This benchmark provides careful and unified implementations of data generation, algorithm training, and evaluation processes.

10 retrieved papers
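The unified pipeline described above (data generation, algorithm training, evaluation) can be pictured as a small registry-driven loop. The names below are illustrative stand-ins, not the benchmark's actual API:

```python
from typing import Callable, Dict

# Hypothetical registries; the real benchmark's interfaces will differ.
ALGORITHMS: Dict[str, Callable] = {}
DATASETS: Dict[str, Callable] = {}

def register(registry, name):
    def deco(fn):
        registry[name] = fn
        return fn
    return deco

@register(DATASETS, "toy_pu")
def toy_pu():
    # Data generation: one fixed protocol producing (positives, unlabeled).
    return [1.0, 2.0], [0.5, -1.0, -2.0]

@register(ALGORITHMS, "threshold")
def threshold(pos, unl):
    # Training: a trivial stand-in "algorithm" thresholding scores at 0.
    return lambda x: 1 if x > 0 else 0

def run_benchmark():
    # Evaluation: every algorithm on every dataset, same protocol and metric.
    results = {}
    for dname, make_data in DATASETS.items():
        pos, unl = make_data()
        for aname, fit in ALGORITHMS.items():
            clf = fit(pos, unl)
            recall = sum(clf(x) for x in pos) / len(pos)
            results[(aname, dname)] = recall
    return results
```

The point of the registry pattern is that every algorithm sees identical data splits and is scored by identical code, which is what makes cross-algorithm comparisons fair.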
Model selection criteria for PU learning without negative validation data

The authors address the unrealistic practice of using negative data in validation sets by proposing and analyzing model selection criteria (proxy accuracy and proxy AUC score) that rely only on positive and unlabeled validation data, with theoretical and empirical validation.

10 retrieved papers
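The paper's exact proxy accuracy and proxy AUC criteria are its own; as a hedged example in the same spirit, a classic PU-only model-selection score (due to Lee and Liu) combines recall on a positive-only validation set with the fraction of unlabeled validation points predicted positive, requiring no negatives:

```python
import numpy as np

def pu_f1_proxy(preds_pos_val, preds_unl_val):
    """Lee-Liu style PU criterion r^2 / P(y_hat = 1).

    preds_pos_val: 0/1 predictions on a positive-only validation set
    preds_unl_val: 0/1 predictions on an unlabeled validation set
    The score tracks F1 on fully labeled data, yet uses no negatives.
    """
    recall = np.mean(preds_pos_val)          # r = P(y_hat = 1 | y = 1)
    frac_pred_pos = np.mean(preds_unl_val)   # estimate of P(y_hat = 1)
    if frac_pred_pos == 0.0:
        return 0.0                           # degenerate all-negative model
    return recall ** 2 / frac_pred_pos
```

Both quantities are estimable from positive and unlabeled data alone, which is what makes such criteria admissible for realistic PU model selection.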
Identification of internal label shift problem and calibration approach

The authors identify for the first time that the one-sample setting causes an internal label shift in unlabeled training data, which degrades performance of two-sample algorithms. They propose a calibration technique (Algorithm 1) with theoretical guarantees to ensure fair cross-family comparisons.

10 retrieved papers
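The paper's Algorithm 1 is not reproduced here. As a minimal sketch of why the shift arises, under the standard selected-completely-at-random assumption: in the two-sample setting the unlabeled set keeps the true positive fraction pi, whereas in the one-sample setting labeled positives are removed from the single draw, depressing the positive fraction of what remains:

```python
def unlabeled_prior_one_sample(prior, label_freq):
    """Positive fraction left in the unlabeled pool of a one-sample draw.

    prior:      true class prior pi = P(y = +1)
    label_freq: probability c that a positive example gets labeled (SCAR)

    P(y = +1 | unlabeled) = pi * (1 - c) / (1 - pi * c), which is
    strictly below pi whenever 0 < c and 0 < pi < 1: this gap is the
    internal label shift that hurts two-sample algorithms.
    """
    return prior * (1.0 - label_freq) / (1.0 - prior * label_freq)
```

A calibration in this spirit would hand a two-sample algorithm the shifted prior above (rather than pi) when it is trained on a one-sample unlabeled pool; the paper's actual calibration technique may differ in its details.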

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: First PU learning benchmark for systematic algorithm comparison

Contribution 2: Model selection criteria for PU learning without negative validation data

Contribution 3: Identification of internal label shift problem and calibration approach

(Full contribution descriptions appear under Claimed Contributions above; 10 candidate papers were compared per contribution, with no refutations found.)