PU-BENCH: A UNIFIED BENCHMARK FOR RIGOROUS AND REPRODUCIBLE PU LEARNING

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: PU learning, semi-supervised learning, benchmark
Abstract:

Positive-Unlabeled (PU) learning, a challenging paradigm for training binary classifiers from only positive and unlabeled samples, is fundamental to many applications. While numerous PU learning methods have been proposed, research is systematically hindered by the lack of a standardized, comprehensive benchmark for rigorous evaluation. Inconsistent data generation, disparate experimental settings, and divergent metrics have led to irreproducible findings and unsubstantiated performance claims. To address this foundational challenge, we introduce PU-Bench, the first unified open-source benchmark for PU learning. PU-Bench provides: 1) a unified data generation pipeline that ensures consistent input across configurable sampling schemes, label ratios, and labeling mechanisms; 2) an integrated framework of 16 state-of-the-art PU methods; and 3) standardized protocols for reproducible assessment. Through a large-scale empirical study on 8 diverse datasets (2,560 evaluations in total), PU-Bench reveals a complex yet intuitive performance landscape, uncovering critical trade-offs between effectiveness and efficiency, as well as between robustness, label frequency, and selection bias. We anticipate that PU-Bench will serve as a foundational resource to catalyze reproducible, rigorous, and impactful research in the PU learning community.
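To make the data-generation step concrete, the sketch below shows one plausible shape for the kind of configurable PU labeling pipeline the abstract describes. The function name `make_pu_dataset`, its arguments, and the SCAR/SAR mechanism names are illustrative assumptions, not PU-Bench's actual API.

```python
import numpy as np

def make_pu_dataset(X, y, label_frequency=0.5, mechanism="scar",
                    propensity=None, seed=None):
    """Turn a fully labeled binary dataset (y in {0, 1}) into a PU one.

    Hypothetical sketch, not PU-Bench's real interface. Returns s, where
    s[i] = 1 marks a labeled positive and s[i] = 0 marks an unlabeled
    sample (a mix of hidden positives and negatives).
    """
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    if mechanism == "scar":
        # Selected Completely At Random: every positive is labeled
        # with the same probability c (the label frequency).
        p_label = np.full(pos.size, label_frequency)
    elif mechanism == "sar":
        # Selected At Random: the labeling probability depends on x
        # through a user-supplied propensity function e(x).
        p_label = propensity(X[pos])
    else:
        raise ValueError(f"unknown labeling mechanism: {mechanism}")
    s = np.zeros(y.size, dtype=int)
    s[pos] = (rng.random(pos.size) < p_label).astype(int)
    return s

# Example: a toy dataset with a biased (SAR) labeling mechanism.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] > 0).astype(int)
s = make_pu_dataset(X, y, mechanism="sar",
                    propensity=lambda Xp: 1 / (1 + np.exp(-Xp[:, 1])))
```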

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces PU-Bench, a unified benchmarking framework for positive-unlabeled learning that standardizes data generation, integrates sixteen state-of-the-art methods, and provides reproducible evaluation protocols. Within the taxonomy, it resides in the 'Theoretical Analysis and Comparative Studies' leaf under 'Core PU Learning Methodologies and Theoretical Frameworks,' alongside only one sibling paper (PU versus PN theoretical comparison). This leaf represents a sparse research direction focused on rigorous comparative analysis rather than novel algorithmic contributions, suggesting that systematic benchmarking efforts remain underexplored in the PU learning literature despite the proliferation of methods across other branches.

The taxonomy reveals a densely populated field with over fifty papers distributed across methodological innovations (risk estimation, sample selection, representation learning), data challenges (selection bias, class prior estimation), and domain applications (biomedical, fraud detection, computer vision). PU-Bench connects most directly to the 'Core Methodologies' branch by evaluating methods from multiple subtopics—unbiased risk estimators, pseudo-labeling strategies, and contrastive approaches—but diverges by focusing on empirical comparison rather than proposing new algorithms. Neighboring leaves like 'Unbiased Risk Estimation Approaches' and 'Reliable Negative Identification' contain the algorithmic work that PU-Bench evaluates, positioning this contribution as infrastructure for the broader research ecosystem.

Among thirty candidates examined through semantic search and citation expansion, none clearly refute any of the three core contributions: the unified benchmarking framework (ten candidates examined, zero refutable), the large-scale empirical study (ten candidates, zero refutable), and the analysis with actionable guidelines (ten candidates, zero refutable). This absence of overlapping prior work within the limited search scope suggests that comprehensive, standardized benchmarking infrastructure for PU learning has not been previously established at this scale. The framework contribution appears most distinctive, as existing comparative studies like the sibling paper focus on theoretical guarantees rather than reproducible empirical evaluation across diverse methods and datasets.

Based on the top-thirty semantic matches and taxonomy structure, the work addresses a recognized gap in PU learning research: the lack of standardized evaluation infrastructure that has led to inconsistent experimental settings and irreproducible findings. While the limited search scope cannot guarantee exhaustiveness, the absence of refuting candidates across all contributions and the sparse population of the 'Theoretical Analysis and Comparative Studies' leaf suggest that this benchmarking effort occupies relatively uncontested ground within the field's current landscape.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Benchmarking positive-unlabeled learning methods across diverse datasets and experimental conditions.

The field of positive-unlabeled (PU) learning has evolved into a rich landscape organized around several major themes. At the highest level, the taxonomy distinguishes Core PU Learning Methodologies and Theoretical Frameworks, encompassing foundational algorithmic approaches such as cost-sensitive methods, class-prior estimation techniques, and theoretical analyses like PU versus PN[16], from branches addressing Handling Data Characteristics and Distributional Challenges, which tackle issues like label noise (e.g., PULNS[4], Noisy PU Self-Training[9]), selection bias (Selection Bias PU[11]), and domain shift (PU Domain Adaptation[12]). A third branch, Extended PU Learning Paradigms and Hybrid Frameworks, explores connections to semi-supervised learning (PU by Semi-Supervised[10]), contrastive learning (PU Contrastive Learning[14]), and multi-class or multi-positive settings (Multi-Positive Unlabeled[37]). Finally, the fourth branch, Domain-Specific PU Learning Applications, illustrates how these methods are deployed in areas ranging from bioinformatics (Disease Gene Identification[41], Secreted Proteins Prediction[2]) to fraud detection (Financial Misstatement Detection[7], Fraud Detection Contrastive[46]) and computer vision (Object Detection PU[40], Dense-PU[19]).

Within this landscape, one particularly active line of work focuses on robust loss functions and class-prior-free approaches that relax traditional assumptions, as seen in GradPU[3], Class Prior-Free PU[8], and Adaptive Asymmetric Loss[33]; a contrasting line explicitly models labeling mechanisms or propensity scores, as in Labeling Bias Estimation[13] and Propensity Score Recovery[38].

PU-BENCH[0] situates itself squarely in the theoretical and comparative studies cluster, providing a systematic empirical evaluation that complements foundational comparisons like PU versus PN[16]. Where earlier theoretical work established when PU learning is preferable to traditional supervised settings, PU-BENCH[0] extends this by rigorously benchmarking a wide array of modern methods across varied datasets and experimental conditions, offering practitioners evidence-based guidance on algorithm selection and highlighting open questions around generalization, scalability, and the interplay between algorithmic design and data characteristics.
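Much of the risk-estimation work referenced above revolves around a single construction: rewriting the negative-class risk in terms of positive and unlabeled data alone. As a point of orientation, here is a minimal PyTorch sketch of the non-negative PU risk in the style of nnPU (Kiryo et al., 2017); the sigmoid surrogate loss and the function signature are illustrative choices, not PU-Bench's implementation.

```python
import torch

def nnpu_risk(out_pos, out_unl, prior):
    """Non-negative PU risk estimator, a minimal sketch.

    out_pos: classifier outputs g(x) on labeled positives
    out_unl: classifier outputs g(x) on unlabeled samples
    prior:   class prior pi = P(y = +1), assumed known or estimated
    """
    loss = lambda z: torch.sigmoid(-z)  # sigmoid surrogate loss l(z)
    # pi * R_p^+ : risk of misclassifying positives as negative
    risk_pos = prior * loss(out_pos).mean()
    # R_u^- - pi * R_p^- : unbiased estimate of the negative-class risk
    risk_neg = loss(-out_unl).mean() - prior * loss(-out_pos).mean()
    # Clamping at zero is the "non-negative" correction that keeps
    # flexible models from driving the empirical risk below zero.
    return risk_pos + torch.clamp(risk_neg, min=0.0)
```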

Claimed Contributions

PU-Bench unified open-source benchmarking framework

The authors present PU-Bench, an open-source framework that provides a configurable PU data generator, an integrated suite of 16 state-of-the-art PU methods, and standardized protocols for reproducible assessment of positive-unlabeled learning algorithms.

10 retrieved papers
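One way such a framework can keep 16 heterogeneous methods comparable is a shared training/evaluation interface. The sketch below is a guess at what that interface could look like; `PUMethod`, `fit`, `decision_function`, and `evaluate` are hypothetical names, not PU-Bench's documented API.

```python
from typing import Protocol
import numpy as np

class PUMethod(Protocol):
    """Hypothetical common interface every benchmarked method implements."""

    def fit(self, X: np.ndarray, s: np.ndarray, prior: float) -> "PUMethod":
        """Train from features X and PU labels s (1 = labeled positive)."""
        ...

    def decision_function(self, X: np.ndarray) -> np.ndarray:
        """Return real-valued scores; > 0 predicts the positive class."""
        ...

def evaluate(method: PUMethod, X_tr, s_tr, X_te, y_te, prior: float) -> float:
    """Standardized protocol: same split and same metric for every method."""
    method.fit(X_tr, s_tr, prior)
    y_hat = (method.decision_function(X_te) > 0).astype(int)
    tp = np.sum((y_hat == 1) & (y_te == 1))
    precision = tp / max(np.sum(y_hat == 1), 1)
    recall = tp / max(np.sum(y_te == 1), 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)  # F1
```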
Large-scale comprehensive empirical study

The authors perform a systematic evaluation benchmarking 16 representative PU methods across 8 diverse datasets with 15 distinct labeling ratios under 4 labeling assumptions, totaling 2,560 evaluations.

10 retrieved papers
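For scale, the study is essentially a cross product of experimental factors. The snippet below enumerates a hypothetical grid of that shape; every factor level is a placeholder, and in particular the five-point ratio sweep is only an assumption chosen so the product reproduces the abstract's 2,560 total, since this report does not spell out the exact grid.

```python
from itertools import product

# Placeholder factor levels; the real method/dataset names and the
# exact label-ratio sweep come from the PU-Bench paper itself.
METHODS = [f"method_{i:02d}" for i in range(16)]
DATASETS = [f"dataset_{i}" for i in range(8)]
MECHANISMS = ["scar", "sar_a", "sar_b", "sar_c"]  # assumed 4 mechanisms
LABEL_RATIOS = [0.02, 0.05, 0.1, 0.2, 0.5]        # assumed 5-point sweep

runs = list(product(METHODS, DATASETS, MECHANISMS, LABEL_RATIOS))
print(len(runs))  # 16 * 8 * 4 * 5 = 2560, matching the abstract's total
```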
In-depth analysis and actionable guidelines

The authors deliver comprehensive analysis revealing strengths and limitations of current PU methods and propose practical, data-driven guidelines for algorithm selection and design based on effectiveness, efficiency, and robustness considerations.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

PU-Bench unified open-source benchmarking framework

The authors present PU-Bench, an open-source framework that provides a configurable PU data generator, an integrated suite of 16 state-of-the-art PU methods, and standardized protocols for reproducible assessment of positive-unlabeled learning algorithms.

Contribution

Large-scale comprehensive empirical study

The authors perform a systematic evaluation benchmarking 16 representative PU methods across 8 diverse datasets with 15 distinct labeling ratios under 4 labeling assumptions, totaling 2,560 evaluations.

Contribution

In-depth analysis and actionable guidelines

The authors deliver comprehensive analysis revealing strengths and limitations of current PU methods and propose practical, data-driven guidelines for algorithm selection and design based on effectiveness, efficiency, and robustness considerations.