PepBenchmark: A Standardized Benchmark for Peptide Machine Learning

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: peptide machine learning, benchmark, protein language models
Abstract:

Peptide therapeutics are widely regarded as the “third generation” of drugs, yet progress in peptide Machine Learning (ML) is hindered by the absence of standardized benchmarks. Here we present PepBenchmark, which standardizes datasets, preprocessing, and evaluation protocols for peptide drug discovery. PepBenchmark comprises three components: (1) PepBenchData, a well-curated collection of 29 canonical-peptide and 6 non-canonical-peptide datasets across 7 groups, systematically covering key aspects of peptide drug development and representing, to the best of our knowledge, the most comprehensive AI-ready dataset resource to date; (2) PepBenchPipeline, a standardized preprocessing pipeline that ensures consistent cleaning, representation conversion, and dataset splitting, addressing the quality issues that often arise from ad-hoc pipelines; and (3) PepBenchLeaderboard, a unified evaluation protocol and leaderboard with strong baselines across 4 major methodological families: fingerprint-based, GNN-based, PLM-based, and SMILES-based models. Together, PepBenchmark provides the first standardized and comparable foundation for peptide drug discovery, facilitating methodological advances and translation into real-world applications. Code is included in the supplementary material and will be made publicly available.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

PepBenchmark introduces a comprehensive benchmarking platform for peptide machine learning comprising curated datasets (PepBenchData), standardized preprocessing (PepBenchPipeline), and unified evaluation protocols (PepBenchLeaderboard). The work resides in the 'Comprehensive Multi-Property Benchmarking Platforms' leaf, which contains only two papers: this submission and PeptiVerse. This represents a notably sparse research direction within the broader taxonomy of fifty papers, suggesting that unified multi-property benchmarking remains an underexplored area despite the proliferation of task-specific prediction methods across neighboring branches.

The taxonomy reveals a field heavily weighted toward specialized prediction methods: twenty-nine papers across five leaves address bioactive peptide classification (anticancer, antimicrobial, cell-penetrating, immunogenic), while structural prediction and mass spectrometry applications occupy separate branches. PepBenchmark's positioning in Benchmark Frameworks distinguishes it from these application-focused efforts. The scope note for this leaf emphasizes 'standardized datasets and evaluation across multiple peptide properties and model families,' explicitly excluding single-property benchmarks that populate the adjacent 'Property-Specific Benchmarking Studies' leaf (seven papers). This structural separation highlights the paper's ambition to bridge fragmented evaluation practices across diverse peptide tasks.

Among thirteen candidates examined through limited semantic search, no papers clearly refute the three core contributions. For PepBenchData, ten candidates were examined without finding overlapping comprehensive dataset collections; for PepBenchPipeline, no candidates were retrieved, indicating limited prior work on standardized preprocessing protocols; for PepBenchLeaderboard, three candidates were examined with no refutations. The small search scope (thirteen total candidates versus fifty papers in the taxonomy) means this analysis captures immediate semantic neighbors rather than exhaustive field coverage. The absence of refutations among examined candidates suggests these contributions address gaps in current practice, though the limited search cannot rule out relevant work outside the top-ranked semantic matches.

Given the sparse population of the target leaf and the absence of refutations among examined candidates, the work appears to occupy a genuine gap in peptide machine learning infrastructure. However, the analysis reflects a constrained literature search (top-thirteen semantic matches) rather than comprehensive field coverage. The taxonomy structure itself—showing only one sibling paper in a fifty-paper field—provides independent evidence that unified multi-property benchmarking platforms remain rare, lending credibility to the novelty assessment despite search limitations.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 13
Refutable Papers: 0

Research Landscape Overview

Core task: Standardized benchmarking for peptide property prediction using machine learning. The field has organized itself around four main branches that reflect distinct methodological and application priorities. Benchmark Frameworks and Standardization focuses on creating unified evaluation platforms that enable fair comparison across diverse prediction tasks, exemplified by comprehensive resources like PepBenchmark[0] and PeptiVerse[43]. Prediction Methods for Bioactive Peptides encompasses a dense collection of specialized algorithms targeting antimicrobial, anticancer, cell-penetrating, and other therapeutic properties, with works ranging from early efforts like MLACP[4] to recent deep learning approaches such as PLMACPred[32] and sAMP-PFPDeep[45]. Structural Prediction and Representation addresses the challenge of modeling peptide three-dimensional conformations and embeddings, including benchmarking studies like Structure Predictor Benchmarking[6] and applications of tools such as AlphaFold Peptides[7]. Mass Spectrometry Applications bridges computational prediction with experimental proteomics, featuring AI-driven methods like MassNet[17] and Mass Spectrometry AI[12] for spectrum interpretation and property inference.

A central tension across these branches involves balancing task-specific optimization against generalizability: many bioactive peptide predictors achieve strong performance on narrow properties but struggle when evaluated on standardized multi-property benchmarks. PepBenchmark[0] sits squarely within the Benchmark Frameworks branch alongside PeptiVerse[43], both aiming to provide comprehensive evaluation suites that span multiple peptide functions. While PeptiVerse[43] emphasizes breadth across diverse bioactivity classes, PepBenchmark[0] appears to prioritize rigorous standardization protocols and reproducibility metrics. This contrasts with works in Prediction Methods, such as Antimicrobial Assessment[2] or AnOxPePred[3], which typically validate on single-property datasets.

The interplay between structural representations—whether sequence-based, topology-enhanced as in Topology-enhanced ML[37], or informed by tools like AlphaFold—and downstream prediction accuracy remains an active question, particularly for properties like membrane permeability studied in Cyclic Peptide Permeability[5]. Establishing common ground rules through platforms like PepBenchmark[0] helps clarify which modeling choices genuinely improve generalization versus overfitting to specific assay conditions.

Claimed Contributions

PepBenchData: Comprehensive AI-ready peptide dataset collection

The authors curate and standardize 35 datasets (29 canonical and 6 non-canonical peptides) spanning 7 task groups related to peptide drug discovery, including activity modeling, pharmacokinetics profiling, and safety assessment. They also develop a tool to convert non-canonical peptide representations into unified SMILES format.

10 retrieved papers
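The representation-conversion tool mentioned above is not described in detail in this report. As a rough illustration of the underlying idea, the sketch below chains per-residue backbone fragments into a linear peptide SMILES; the side-chain table, function name, and fragment choices are hypothetical simplifications, and the actual tool presumably handles stereochemistry, terminal caps, and non-canonical residues far more carefully (e.g. via a cheminformatics toolkit):

```python
# Illustrative sketch (not the paper's tool): build a linear SMILES for a
# canonical peptide by joining per-residue backbone fragments via amide bonds.
# Only a few side chains are included; real converters cover all residues.
SIDE_CHAINS = {
    "G": "",           # glycine: no side chain (achiral backbone)
    "A": "C",          # alanine: methyl
    "S": "CO",         # serine: hydroxymethyl
    "F": "Cc1ccccc1",  # phenylalanine: benzyl
}

def peptide_to_smiles(seq: str) -> str:
    """Chain residues as N[C@@H](R)C(=O)... and terminate with a free acid OH."""
    parts = []
    for aa in seq.upper():
        r = SIDE_CHAINS[aa]
        if r:
            parts.append(f"N[C@@H]({r})C(=O)")
        else:
            parts.append("NCC(=O)")  # glycine has no stereocenter
    return "".join(parts) + "O"
```

For example, `peptide_to_smiles("GA")` yields the dipeptide Gly-Ala as `NCC(=O)N[C@@H](C)C(=O)O`.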
PepBenchPipeline: Standardized preprocessing with BDNegSamp and hybrid-split

The authors propose a novel four-step preprocessing pipeline featuring Biologically-informed and Distribution-controlled Negative Sampling (BDNegSamp) to avoid false negatives and artifacts, plus a hybrid-split strategy combining kmer-based and similarity-based clustering to prevent data leakage and ensure rigorous evaluation.

0 retrieved papers
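The hybrid-split strategy itself is not specified in this report. The leakage-prevention idea it alludes to can be sketched as clustering sequences by k-mer Jaccard similarity and assigning whole clusters to one side of the split, so near-duplicate peptides never straddle train and test; the threshold, k, and function names below are illustrative assumptions, not the paper's actual parameters:

```python
# Illustrative sketch of a similarity-aware split: greedy single-pass
# clustering on 3-mer Jaccard similarity, then cluster-level assignment.
def kmers(seq, k=3):
    """Set of overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def cluster_split(seqs, threshold=0.5, test_frac=0.2):
    clusters = []  # each entry: (representative k-mer set, member list)
    for s in seqs:
        ks = kmers(s)
        for rep, members in clusters:
            union = len(ks | rep)
            if union and len(ks & rep) / union >= threshold:
                members.append(s)  # similar enough: join existing cluster
                break
        else:
            clusters.append((ks, [s]))  # start a new cluster
    # fill the test set cluster by cluster until the target size is reached,
    # so an entire cluster always lands on one side of the split
    target = int(len(seqs) * test_frac)
    train, test = [], []
    for _, members in clusters:
        (test if len(test) < target else train).extend(members)
    return train, test
```

Production pipelines would typically use dedicated clustering tools (e.g. CD-HIT or MMseqs2-style approaches) rather than this greedy pass, but the invariant is the same: similar sequences share a split.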
PepBenchLeaderboard: Unified evaluation protocol with systematic model comparison

The authors establish a standardized evaluation framework that benchmarks four model families (fingerprint-based, GNN-based, PLM-based, and SMILES-based) using consistent metrics across all datasets, revealing that PLMs achieve superior performance and can be enhanced through peptide-specific fine-tuning.

3 retrieved papers
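The concrete metrics used by the leaderboard are not listed in this report. The core idea of a unified protocol, i.e. every model family scored by the same metric on the same split, can be sketched as below; MCC is a common choice for imbalanced peptide classification, but the metric and function names here are illustrative assumptions:

```python
import math

# Illustrative sketch: one shared metric applied uniformly to all model
# families (fingerprint-, GNN-, PLM-, SMILES-based) on an identical split.
def mcc(y_true, y_pred):
    """Matthews correlation coefficient from binary labels/predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def evaluate(models, dataset):
    """Run every model through the same metric on the same held-out data."""
    X, y = dataset
    return {name: round(mcc(y, model(X)), 4) for name, model in models.items()}
```

Because the split and metric are fixed, any score difference between families reflects the models rather than the evaluation setup.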

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

PepBenchData: Comprehensive AI-ready peptide dataset collection

The authors curate and standardize 35 datasets (29 canonical and 6 non-canonical peptides) spanning 7 task groups related to peptide drug discovery, including activity modeling, pharmacokinetics profiling, and safety assessment. They also develop a tool to convert non-canonical peptide representations into unified SMILES format.

Contribution

PepBenchPipeline: Standardized preprocessing with BDNegSamp and hybrid-split

The authors propose a novel four-step preprocessing pipeline featuring Biologically-informed and Distribution-controlled Negative Sampling (BDNegSamp) to avoid false negatives and artifacts, plus a hybrid-split strategy combining kmer-based and similarity-based clustering to prevent data leakage and ensure rigorous evaluation.

Contribution

PepBenchLeaderboard: Unified evaluation protocol with systematic model comparison

The authors establish a standardized evaluation framework that benchmarks four model families (fingerprint-based, GNN-based, PLM-based, and SMILES-based) using consistent metrics across all datasets, revealing that PLMs achieve superior performance and can be enhanced through peptide-specific fine-tuning.
