PU-BENCH: A UNIFIED BENCHMARK FOR RIGOROUS AND REPRODUCIBLE PU LEARNING

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: PU learning, semi-supervised learning, benchmark
Abstract:

Positive-Unlabeled (PU) learning, a challenging paradigm for training binary classifiers from only positive and unlabeled samples, is fundamental to many applications. While numerous PU learning methods have been proposed, research is systematically hindered by the lack of a standardized, comprehensive benchmark for rigorous evaluation. Inconsistent data generation, disparate experimental settings, and divergent metrics have led to irreproducible findings and unsubstantiated performance claims. To address this foundational challenge, we introduce PU-Bench, the first unified open-source benchmark for PU learning. PU-Bench provides: 1) a unified data generation pipeline that ensures consistent input across configurable sampling schemes, label ratios, and labeling mechanisms; 2) an integrated framework of 16 state-of-the-art PU methods; and 3) standardized protocols for reproducible assessment. Through a large-scale empirical study on 8 diverse datasets (2,560 evaluations in total), PU-Bench reveals a complex yet intuitive performance landscape, uncovering critical trade-offs between effectiveness and efficiency, as well as between robustness, label frequency, and selection bias. We anticipate that PU-Bench will serve as a foundational resource to catalyze reproducible, rigorous, and impactful research in the PU learning community.
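To make the data-generation step concrete, the sketch below shows one plausible shape for the kind of configurable PU labeling pipeline the abstract describes. The function name `make_pu_dataset`, its arguments, and the SCAR/SAR mechanism names are illustrative assumptions, not PU-Bench's actual API.

```python
import numpy as np

def make_pu_dataset(X, y, label_frequency=0.5, mechanism="scar",
                    propensity=None, seed=None):
    """Turn a fully labeled binary dataset (y in {0, 1}) into a PU one.

    Hypothetical sketch, not PU-Bench's real interface. Returns s, where
    s[i] = 1 marks a labeled positive and s[i] = 0 marks an unlabeled
    sample (a mix of hidden positives and negatives).
    """
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    if mechanism == "scar":
        # Selected Completely At Random: every positive is labeled
        # with the same probability c (the label frequency).
        p_label = np.full(pos.size, label_frequency)
    elif mechanism == "sar":
        # Selected At Random: the labeling probability depends on x
        # through a user-supplied propensity function e(x).
        p_label = propensity(X[pos])
    else:
        raise ValueError(f"unknown labeling mechanism: {mechanism}")
    s = np.zeros(y.size, dtype=int)
    s[pos] = (rng.random(pos.size) < p_label).astype(int)
    return s

# Example: a toy dataset with a biased (SAR) labeling mechanism.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] > 0).astype(int)
s = make_pu_dataset(X, y, mechanism="sar",
                    propensity=lambda Xp: 1 / (1 + np.exp(-Xp[:, 1])))
```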

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces PU-Bench, a unified benchmarking framework for positive-unlabeled learning that standardizes data generation, integrates sixteen state-of-the-art methods, and provides reproducible evaluation protocols. Within the taxonomy, it resides in the 'Theoretical Analysis and Comparative Studies' leaf under 'Core PU Learning Methodologies and Theoretical Frameworks,' alongside only one sibling paper (PU versus PN theoretical comparison). This leaf represents a sparse research direction focused on rigorous comparative analysis rather than novel algorithmic contributions, suggesting that systematic benchmarking efforts remain underexplored in the PU learning literature despite the proliferation of methods across other branches.

The taxonomy reveals a densely populated field with over fifty papers distributed across methodological innovations (risk estimation, sample selection, representation learning), data challenges (selection bias, class prior estimation), and domain applications (biomedical, fraud detection, computer vision). PU-Bench connects most directly to the 'Core Methodologies' branch by evaluating methods from multiple subtopics—unbiased risk estimators, pseudo-labeling strategies, and contrastive approaches—but diverges by focusing on empirical comparison rather than proposing new algorithms. Neighboring leaves like 'Unbiased Risk Estimation Approaches' and 'Reliable Negative Identification' contain the algorithmic work that PU-Bench evaluates, positioning this contribution as infrastructure for the broader research ecosystem.

Among thirty candidates examined through semantic search and citation expansion, none clearly refute any of the three core contributions: the unified benchmarking framework (ten candidates examined, zero refutable), the large-scale empirical study (ten candidates, zero refutable), and the analysis with actionable guidelines (ten candidates, zero refutable). This absence of overlapping prior work within the limited search scope suggests that comprehensive, standardized benchmarking infrastructure for PU learning has not been previously established at this scale. The framework contribution appears most distinctive, as existing comparative studies like the sibling paper focus on theoretical guarantees rather than reproducible empirical evaluation across diverse methods and datasets.

Based on the top-thirty semantic matches and taxonomy structure, the work addresses a recognized gap in PU learning research: the lack of standardized evaluation infrastructure that has led to inconsistent experimental settings and irreproducible findings. While the limited search scope cannot guarantee exhaustiveness, the absence of refuting candidates across all contributions and the sparse population of the 'Theoretical Analysis and Comparative Studies' leaf suggest that this benchmarking effort occupies relatively uncontested ground within the field's current landscape.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Benchmarking positive-unlabeled learning methods across diverse datasets and experimental conditions.

The field of positive-unlabeled (PU) learning has evolved into a rich landscape organized around several major themes. At the highest level, the taxonomy distinguishes Core PU Learning Methodologies and Theoretical Frameworks, encompassing foundational algorithmic approaches such as cost-sensitive methods, class-prior estimation techniques, and theoretical analyses like PU versus PN[16], from branches addressing Handling Data Characteristics and Distributional Challenges, which tackle issues like label noise (e.g., PULNS[4], Noisy PU Self-Training[9]), selection bias (Selection Bias PU[11]), and domain shift (PU Domain Adaptation[12]). A third branch, Extended PU Learning Paradigms and Hybrid Frameworks, explores connections to semi-supervised learning (PU by Semi-Supervised[10]), contrastive learning (PU Contrastive Learning[14]), and multi-class or multi-positive settings (Multi-Positive Unlabeled[37]). Finally, the fourth branch, Domain-Specific PU Learning Applications, illustrates how these methods are deployed in areas ranging from bioinformatics (Disease Gene Identification[41], Secreted Proteins Prediction[2]) to fraud detection (Financial Misstatement Detection[7], Fraud Detection Contrastive[46]) and computer vision (Object Detection PU[40], Dense-PU[19]).

Within this landscape, one particularly active line of work focuses on robust loss functions and class-prior-free approaches that relax traditional assumptions, as seen in GradPU[3], Class Prior-Free PU[8], and Adaptive Asymmetric Loss[33]; a contrasting line explicitly models labeling mechanisms or propensity scores, as in Labeling Bias Estimation[13] and Propensity Score Recovery[38].

PU-BENCH[0] situates itself squarely in the theoretical and comparative studies cluster, providing a systematic empirical evaluation that complements foundational comparisons like PU versus PN[16]. Where earlier theoretical work established when PU learning is preferable to traditional supervised settings, PU-BENCH[0] extends this by rigorously benchmarking a wide array of modern methods across varied datasets and experimental conditions, offering practitioners evidence-based guidance on algorithm selection and highlighting open questions around generalization, scalability, and the interplay between algorithmic design and data characteristics.
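Much of the risk-estimation work referenced above revolves around a single construction: rewriting the negative-class risk in terms of positive and unlabeled data alone. As a point of orientation, here is a minimal PyTorch sketch of the non-negative PU risk in the style of nnPU (Kiryo et al., 2017); the sigmoid surrogate loss and the function signature are illustrative choices, not PU-Bench's implementation.

```python
import torch

def nnpu_risk(out_pos, out_unl, prior):
    """Non-negative PU risk estimator, a minimal sketch.

    out_pos: classifier outputs g(x) on labeled positives
    out_unl: classifier outputs g(x) on unlabeled samples
    prior:   class prior pi = P(y = +1), assumed known or estimated
    """
    loss = lambda z: torch.sigmoid(-z)  # sigmoid surrogate loss l(z)
    # pi * R_p^+ : risk of misclassifying positives as negative
    risk_pos = prior * loss(out_pos).mean()
    # R_u^- - pi * R_p^- : unbiased estimate of the negative-class risk
    risk_neg = loss(-out_unl).mean() - prior * loss(-out_pos).mean()
    # Clamping at zero is the "non-negative" correction that keeps
    # flexible models from driving the empirical risk below zero.
    return risk_pos + torch.clamp(risk_neg, min=0.0)
```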

Claimed Contributions

PU-Bench unified open-source benchmarking framework

The authors present PU-Bench, an open-source framework that provides a configurable PU data generator, an integrated suite of 16 state-of-the-art PU methods, and standardized protocols for reproducible assessment of positive-unlabeled learning algorithms.

10 retrieved papers
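One way such a framework can keep 16 heterogeneous methods comparable is a shared training/evaluation interface. The sketch below is a guess at what that interface could look like; `PUMethod`, `fit`, `decision_function`, and `evaluate` are hypothetical names, not PU-Bench's documented API.

```python
from typing import Protocol
import numpy as np

class PUMethod(Protocol):
    """Hypothetical common interface every benchmarked method implements."""

    def fit(self, X: np.ndarray, s: np.ndarray, prior: float) -> "PUMethod":
        """Train from features X and PU labels s (1 = labeled positive)."""
        ...

    def decision_function(self, X: np.ndarray) -> np.ndarray:
        """Return real-valued scores; > 0 predicts the positive class."""
        ...

def evaluate(method: PUMethod, X_tr, s_tr, X_te, y_te, prior: float) -> float:
    """Standardized protocol: same split and same metric for every method."""
    method.fit(X_tr, s_tr, prior)
    y_hat = (method.decision_function(X_te) > 0).astype(int)
    tp = np.sum((y_hat == 1) & (y_te == 1))
    precision = tp / max(np.sum(y_hat == 1), 1)
    recall = tp / max(np.sum(y_te == 1), 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)  # F1
```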
Large-scale comprehensive empirical study

The authors perform a systematic evaluation benchmarking 16 representative PU methods across 8 diverse datasets with 15 distinct labeling ratios under 4 labeling assumptions, totaling 2,560 evaluations.

10 retrieved papers
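For scale, the study is essentially a cross product of experimental factors. The snippet below enumerates a hypothetical grid of that shape; every factor level is a placeholder, and in particular the five-point ratio sweep is only an assumption chosen so the product reproduces the abstract's 2,560 total, since this report does not spell out the exact grid.

```python
from itertools import product

# Placeholder factor levels; the real method/dataset names and the
# exact label-ratio sweep come from the PU-Bench paper itself.
METHODS = [f"method_{i:02d}" for i in range(16)]
DATASETS = [f"dataset_{i}" for i in range(8)]
MECHANISMS = ["scar", "sar_a", "sar_b", "sar_c"]  # assumed 4 mechanisms
LABEL_RATIOS = [0.02, 0.05, 0.1, 0.2, 0.5]        # assumed 5-point sweep

runs = list(product(METHODS, DATASETS, MECHANISMS, LABEL_RATIOS))
print(len(runs))  # 16 * 8 * 4 * 5 = 2560, matching the abstract's total
```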
In-depth analysis and actionable guidelines

The authors deliver comprehensive analysis revealing strengths and limitations of current PU methods and propose practical, data-driven guidelines for algorithm selection and design based on effectiveness, efficiency, and robustness considerations.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

PU-Bench unified open-source benchmarking framework

The authors present PU-Bench, an open-source framework that provides a configurable PU data generator, an integrated suite of 16 state-of-the-art PU methods, and standardized protocols for reproducible assessment of positive-unlabeled learning algorithms.

Contribution

Large-scale comprehensive empirical study

The authors perform a systematic evaluation benchmarking 16 representative PU methods across 8 diverse datasets with 15 distinct labeling ratios under 4 labeling assumptions, totaling 2,560 evaluations.

Contribution

In-depth analysis and actionable guidelines

The authors deliver comprehensive analysis revealing strengths and limitations of current PU methods and propose practical, data-driven guidelines for algorithm selection and design based on effectiveness, efficiency, and robustness considerations.