CausalNovo: Advancing De Novo Peptide Sequencing via a Causality-Informed Framework

ICLR 2026 Conference Submission, Anonymous Authors

OpenReview Score: 6.0

De Novo Peptide Sequencing, Causality, Proteomics

Abstract:

De novo peptide sequencing is a foundational computational technique in proteomics, which is critical for discovering and characterizing novel peptides and proteins within complex biological systems. To predict peptide sequences directly from tandem mass spectra, mainstream deep learning approaches aim to model the relationship between mass spectra and corresponding peptides. However, these models face significant challenges, particularly under noisy conditions. These deep learning models often capture superficial correlations within noisy spectral data, failing to identify the underlying causal mechanisms that link true signal fragment ions to peptide sequences. Consequently, these models tend to learn spurious associations that cannot generalize in practice, where noise peaks are prone to change due to different co-elutions or chemical contaminants. To tackle this, we introduce CausalNovo, a model-agnostic framework designed to learn the causal representations of mass spectra in peptide sequencing models by focusing on signal fragment ions. Specifically, grounded in two practical and general principles, independence and sufficiency, CausalNovo employs causal interventions and information-theoretic objectives to disentangle causal representations from spurious noise peaks. Extensive experiments on three public datasets show that CausalNovo effectively generalizes across varying Noise Signal Ratios (NSR) and remains relatively stable against non-causal peak changes. Consequently, CausalNovo yields consistent and significant performance gains of up to 10% in amino acid, peptide, and PTM-level performance. Code is available at https://anonymous.4open.science/r/CausalNovo-C134.

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CausalNovo, a model-agnostic framework applying causal reasoning to de novo peptide sequencing from tandem mass spectra. Within the taxonomy, it occupies a newly defined leaf node labeled 'Causality-Informed and Robust Learning Frameworks' under the broader 'Transformer and Attention-Based Models' branch. Notably, this leaf contains only the original paper itself, with no sibling papers identified, suggesting this represents a relatively sparse and emerging research direction within the deep learning-based sequencing landscape.

The taxonomy tree reveals that CausalNovo sits within a well-populated parent branch of transformer and attention-based models, which includes neighboring leaves such as 'Bidirectional and Encoder-Decoder Architectures' containing five papers. These sibling directions focus on architectural innovations like bidirectional prediction and encoder-decoder frameworks, whereas CausalNovo's leaf explicitly targets causal reasoning and robustness mechanisms. The taxonomy's scope note clarifies that standard transformer models without explicit causality components belong elsewhere, positioning this work as a methodological departure from purely architectural advances toward principled handling of noisy spectra and spurious correlations.

Across three identified contributions—the CausalNovo framework, structural causal model formalization, and independence-sufficiency principles—the analysis examined twenty candidate papers total, with five, six, and nine candidates respectively. Critically, zero refutable pairs were found for any contribution, meaning that among the limited set of top-K semantic matches and citation expansions examined, no prior work was identified that clearly overlaps with or anticipates these specific causal intervention strategies. This suggests that within the examined scope, the causal framing and information-theoretic objectives appear distinct from existing transformer-based sequencing methods.

Based on the limited literature search covering twenty candidates, the work appears to introduce a novel angle within deep learning-based peptide sequencing by explicitly incorporating causal reasoning. However, the analysis does not claim exhaustive coverage of all relevant prior work in causality or robustness for mass spectrometry, and the absence of sibling papers in the taxonomy leaf may reflect either genuine novelty or incomplete taxonomy construction rather than definitive field-wide uniqueness.

Taxonomy

[Taxonomy figure not reproduced. Legend: Core-task Taxonomy Papers; Claimed Contributions; Contribution Candidate Papers Compared; Refutable Paper]

Research Landscape Overview

Core task: de novo peptide sequencing from tandem mass spectra. The field has evolved from traditional algorithmic approaches, such as dynamic programming and graph-based methods exemplified by early works like DeNovo Tandem[4] and Database Searches[10], toward modern deep learning-based sequencing methods that now dominate recent research. The taxonomy reflects this shift, with top-level branches spanning Deep Learning-Based Sequencing Methods, Traditional Algorithmic Sequencing Approaches, Data-Independent Acquisition Sequencing, Hybrid and Database-Assisted Sequencing, and several specialized contexts.

Within the deep learning branch, transformer and attention-based models have become particularly prominent, leveraging architectures inspired by natural language processing to decode complex spectral patterns. Representative works include Transformer DeNovo[8], InstaNovo[34], and Bidirectional Transformer[2], which demonstrate how sequence-to-sequence frameworks can effectively map mass spectra to peptide sequences. Meanwhile, traditional methods and hybrid approaches continue to provide foundational insights, and specialized branches address niche experimental settings such as top-down proteomics, cross-linking studies, and immunopeptidomics.

Recent attention has focused on improving model robustness, generalization, and interpretability within transformer-based architectures. PowerNovo[3] and Fully Convolutional[5] models explore different neural designs to handle noisy or incomplete spectra, while works like BERT Sequencing[9] adapt masked language modeling strategies to peptide prediction. CausalNovo[0] sits within the causality-informed and robust learning frameworks subgroup, emphasizing principled approaches to model training that account for causal relationships and distributional shifts in mass spectrometry data. Compared to nearby transformer models such as PowerNovo[3] or InstaNovo[34], CausalNovo[0] distinguishes itself by integrating causal reasoning to enhance reliability and reduce biases that can arise from spurious correlations in training data. This focus on robustness addresses a key challenge across the field: ensuring that deep learning models generalize well beyond the specific datasets and experimental conditions on which they were trained.

Claimed Contributions

CausalNovo framework for de novo peptide sequencing

5 retrieved papers

The authors propose CausalNovo, a model-agnostic framework that applies causal principles to de novo peptide sequencing. The framework learns causal representations from mass spectra by distinguishing signal fragment ions from spurious noise peaks, improving robustness and generalization across different noise conditions.
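To make the distinction between signal fragment ions and noise peaks concrete, the following is a minimal sketch and not the authors' code. It assumes that "signal" means a peak matching a theoretical singly charged b- or y-ion of the known peptide within a mass tolerance, and that the noise-signal ratio is a simple peak-count ratio; the report does not state either definition. The function names, the 0.02 Da tolerance, and the reduced residue table are hypothetical choices for illustration.

```python
# Monoisotopic residue masses in Da (a small subset, for illustration only).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "L": 113.08406, "K": 128.09496,
}
PROTON = 1.007276   # mass of a proton, Da
WATER = 18.010565   # mass of H2O, Da


def fragment_mzs(peptide):
    """Singly charged b- and y-ion m/z values of an unmodified peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    mzs = []
    for i in range(1, len(masses)):
        mzs.append(sum(masses[:i]) + PROTON)           # b-ion: N-terminal prefix
        mzs.append(sum(masses[i:]) + WATER + PROTON)   # y-ion: C-terminal suffix
    return sorted(mzs)


def split_peaks(peak_mzs, peptide, tol=0.02):
    """Label each observed peak as signal (matches a b/y ion) or noise."""
    theoretical = fragment_mzs(peptide)
    signal, noise = [], []
    for mz in peak_mzs:
        if any(abs(mz - t) <= tol for t in theoretical):
            signal.append(mz)
        else:
            noise.append(mz)
    return signal, noise


def noise_signal_ratio(peak_mzs, peptide, tol=0.02):
    """ASSUMED definition: number of noise peaks / number of signal peaks."""
    signal, noise = split_peaks(peak_mzs, peptide, tol)
    return len(noise) / len(signal) if signal else float("inf")


# Toy spectrum: all four b/y ions of "GAK" plus two unrelated peaks.
toy_spectrum = fragment_mzs("GAK") + [100.0, 200.0]
toy_nsr = noise_signal_ratio(toy_spectrum, "GAK")
```

In this toy case four peaks match fragment ions and two do not, so the count-based ratio is 0.5. An intensity-weighted ratio would be an equally plausible reading of NSR.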

Structural Causal Model formalization for peptide sequencing

6 retrieved papers

The authors formalize de novo peptide sequencing using Structural Causal Models to explicitly represent causal relationships between mass spectra and peptide sequences. This formalization distinguishes causal factors from non-causal spurious correlations, providing a principled foundation for robust model design.

Independence and sufficiency principles with information-theoretic objectives

9 retrieved papers

The authors derive two fundamental principles—independence (ensuring representations are invariant to non-causal factors) and sufficiency (retaining predictive information)—and operationalize them through causal interventions and information-theoretic objectives. These principles guide the disentanglement of causal signal from noise in the latent representation space.
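The two principles can be illustrated with a toy mutual-information check. This is a sketch under stated assumptions, not the paper's objective, which this report does not spell out: it assumes sufficiency can be read as high mutual information between a representation and the label, and independence as near-zero mutual information between the representation and the noise factor. The estimator, variable names, and data are all hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi


# Toy data: a binary label and an independent binary noise factor.
label = [0, 0, 1, 1]
noise = [0, 1, 0, 1]

# A "causal" representation that keeps only label information,
# and an "entangled" one that also encodes the noise factor.
z_causal = list(label)
z_entangled = list(zip(label, noise))

sufficiency_causal = mutual_information(z_causal, label)         # 1.0 bit
independence_causal = mutual_information(z_causal, noise)        # 0.0 bits
sufficiency_entangled = mutual_information(z_entangled, label)   # 1.0 bit
independence_entangled = mutual_information(z_entangled, noise)  # 1.0 bit
```

Both representations are sufficient for the label, but only the first is independent of the noise factor, which is the property the independence principle is described as enforcing.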

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: CausalNovo framework for de novo peptide sequencing

[51] Prediction of peptide mass spectral libraries with machine learning
     Verdict: Cannot Refute

[52] PepNovo: de novo peptide sequencing via probabilistic network modeling
     Verdict: Cannot Refute

[53] Towards automated scientific discovery: Knowledge representation and reasoning in cell signalling networks
     Verdict: Cannot Refute

[54] Distilling Non-Autoregressive Model Knowledge for Autoregressive De Novo Peptide Sequencing
     Verdict: Cannot Refute

[55] Characterization and de novo sequencing of multi-charge MS/MS spectra
     Verdict: Cannot Refute

Contribution: Structural Causal Model formalization for peptide sequencing

[56] Peptidomics-based analysis and preparation of umami peptides from enzymatically digested chicken bone fluid
     Verdict: Cannot Refute

[57] … screening of umami peptides from skipjack tuna (Katsuwonus pelamis) hydrolysates using EAD/CID based micro-UPLC-QTOF-MS and the molecular interaction with …
     Verdict: Cannot Refute

[58] Umami peptides screened based on peptidomics and virtual screening from Ruditapes philippinarum and Mactra veneriformis clams
     Verdict: Cannot Refute

[59] Graphical Models for Peptide Identification of Tandem Mass Spectra
     Verdict: Cannot Refute

[60] Modeling peptide fragmentation with dynamic Bayesian networks for peptide identification
     Verdict: Cannot Refute

[61] Faster graphical model identification of tandem mass spectra using peptide word lattices
     Verdict: Cannot Refute

Contribution: Independence and sufficiency principles with information-theoretic objectives

[63] Partial disentanglement for domain adaptation
     Verdict: Cannot Refute

[64] On Causally Disentangled State Representation Learning for Reinforcement Learning based Recommender Systems
     Verdict: Cannot Refute

[65] Learning Independent Causal Mechanisms
     Verdict: Cannot Refute

[66] Causal Discovery with Continuous Additive Noise Models
     Verdict: Cannot Refute

[67] Learning Causally Disentangled Representations via the Principle of Independent Causal Mechanisms
     Verdict: Cannot Refute

[68] Cause-effect inference in location-scale noise models: Maximum likelihood vs. independence testing
     Verdict: Cannot Refute

[69] Kernel-Based Independence Tests for Causal Structure Learning on Functional Data
     Verdict: Cannot Refute

[70] Identifying Independencies in Causal Graphs with Feedback
     Verdict: Cannot Refute

[71] C²DR: Robust Cross-Domain Recommendation based on Causal Disentanglement
     Verdict: Cannot Refute
