CausalNovo: Advancing De Novo Peptide Sequencing via a Causality-Informed Framework

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: De Novo Peptide Sequencing, Causality, Proteomics
Abstract:

De novo peptide sequencing is a foundational computational technique in proteomics, which is critical for discovering and characterizing novel peptides and proteins within complex biological systems. To predict peptide sequences directly from tandem mass spectra, mainstream deep learning approaches aim to model the relationship between mass spectra and corresponding peptides. However, these models face significant challenges, particularly under noisy conditions. These deep learning models often capture superficial correlations within noisy spectral data, failing to identify the underlying causal mechanisms that link true signal fragment ions to peptide sequences. Consequently, these models tend to learn spurious associations that cannot generalize in practice, where noise peaks are prone to change due to different co-elutions or chemical contaminants. To tackle this, we introduce CausalNovo, a model-agnostic framework designed to learn the causal representations of mass spectra in peptide sequencing models by focusing on signal fragment ions. Specifically, grounded in two practical and general principles, independence and sufficiency, CausalNovo employs causal interventions and information-theoretic objectives to disentangle causal representations from spurious noise peaks. Extensive experiments on three public datasets show that CausalNovo effectively generalizes across varying Noise Signal Ratios (NSR) and remains relatively stable against non-causal peak changes. Consequently, CausalNovo yields consistent and significant performance gains of up to 10% in amino acid, peptide, and PTM-level performance. Code is available at https://anonymous.4open.science/r/CausalNovo-C134.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CausalNovo, a model-agnostic framework applying causal reasoning to de novo peptide sequencing from tandem mass spectra. Within the taxonomy, it occupies a newly defined leaf node labeled 'Causality-Informed and Robust Learning Frameworks' under the broader 'Transformer and Attention-Based Models' branch. Notably, this leaf contains only the original paper itself, with no sibling papers identified, suggesting this represents a relatively sparse and emerging research direction within the deep learning-based sequencing landscape.

The taxonomy tree reveals that CausalNovo sits within a well-populated parent branch of transformer and attention-based models, which includes neighboring leaves such as 'Bidirectional and Encoder-Decoder Architectures' containing five papers. These sibling directions focus on architectural innovations like bidirectional prediction and encoder-decoder frameworks, whereas CausalNovo's leaf explicitly targets causal reasoning and robustness mechanisms. The taxonomy's scope note clarifies that standard transformer models without explicit causality components belong elsewhere, positioning this work as a methodological departure from purely architectural advances toward principled handling of noisy spectra and spurious correlations.

Across three identified contributions—the CausalNovo framework, structural causal model formalization, and independence-sufficiency principles—the analysis examined twenty candidate papers total, with five, six, and nine candidates respectively. Critically, zero refutable pairs were found for any contribution, meaning that among the limited set of top-K semantic matches and citation expansions examined, no prior work was identified that clearly overlaps with or anticipates these specific causal intervention strategies. This suggests that within the examined scope, the causal framing and information-theoretic objectives appear distinct from existing transformer-based sequencing methods.

Based on the limited literature search covering twenty candidates, the work appears to introduce a novel angle within deep learning-based peptide sequencing by explicitly incorporating causal reasoning. However, the analysis does not claim exhaustive coverage of all relevant prior work in causality or robustness for mass spectrometry, and the absence of sibling papers in the taxonomy leaf may reflect either genuine novelty or incomplete taxonomy construction rather than definitive field-wide uniqueness.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: de novo peptide sequencing from tandem mass spectra. The field has evolved from traditional algorithmic approaches, such as dynamic programming and graph-based methods exemplified by early works like DeNovo Tandem[4] and Database Searches[10], toward modern deep learning-based sequencing methods that now dominate recent research. The taxonomy reflects this shift, with top-level branches spanning Deep Learning-Based Sequencing Methods, Traditional Algorithmic Sequencing Approaches, Data-Independent Acquisition Sequencing, Hybrid and Database-Assisted Sequencing, and several specialized contexts. Within the deep learning branch, transformer and attention-based models have become particularly prominent, leveraging architectures inspired by natural language processing to decode complex spectral patterns. Representative works include Transformer DeNovo[8], InstaNovo[34], and Bidirectional Transformer[2], which demonstrate how sequence-to-sequence frameworks can effectively map mass spectra to peptide sequences. Meanwhile, traditional methods and hybrid approaches continue to provide foundational insights, and specialized branches address niche experimental settings such as top-down proteomics, cross-linking studies, and immunopeptidomics.

Recent attention has focused on improving model robustness, generalization, and interpretability within transformer-based architectures. PowerNovo[3] and Fully Convolutional[5] models explore different neural designs to handle noisy or incomplete spectra, while works like BERT Sequencing[9] adapt masked language modeling strategies to peptide prediction. CausalNovo[0] sits within the causality-informed and robust learning frameworks subgroup, emphasizing principled approaches to model training that account for causal relationships and distributional shifts in mass spectrometry data. Compared to nearby transformer models such as PowerNovo[3] or InstaNovo[34], CausalNovo[0] distinguishes itself by integrating causal reasoning to enhance reliability and reduce biases that can arise from spurious correlations in training data. This focus on robustness addresses a key challenge across the field: ensuring that deep learning models generalize well beyond the specific datasets and experimental conditions on which they were trained.

Claimed Contributions

CausalNovo framework for de novo peptide sequencing

The authors propose CausalNovo, a model-agnostic framework that applies causal principles to de novo peptide sequencing. The framework learns causal representations from mass spectra by distinguishing signal fragment ions from spurious noise peaks, improving robustness and generalization across different noise conditions.

5 retrieved papers
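
To make the signal-versus-noise distinction concrete, the following is a minimal sketch, not the authors' implementation. It labels a peak as "signal" when its m/z lies within a tolerance of a theoretical singly charged b- or y-ion of the known peptide, and as "noise" otherwise. The function names, the 0.02 Da tolerance, the reduced residue table, and the definition of NSR as the count of noise peaks divided by the count of signal peaks are all assumptions made for illustration; the report does not state how the paper defines NSR.

```python
# Monoisotopic residue masses in Da (standard values; only a few residues listed).
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "K": 128.09496}
PROTON = 1.007276
WATER = 18.010565


def fragment_mzs(peptide):
    """Theoretical m/z of singly charged b- and y-ions for a peptide string."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = []
    for i in range(1, len(masses)):
        ions.append(sum(masses[:i]) + PROTON)           # b_i: N-terminal prefix
        ions.append(sum(masses[-i:]) + WATER + PROTON)  # y_i: C-terminal suffix
    return sorted(ions)


def split_peaks(peak_mzs, peptide, tol=0.02):
    """Partition observed peaks into (signal, noise) by matching theoretical ions."""
    theoretical = fragment_mzs(peptide)
    signal, noise = [], []
    for mz in peak_mzs:
        matched = any(abs(mz - t) <= tol for t in theoretical)
        (signal if matched else noise).append(mz)
    return signal, noise


def noise_signal_ratio(peak_mzs, peptide, tol=0.02):
    """ASSUMED definition: number of noise peaks / number of signal peaks."""
    signal, noise = split_peaks(peak_mzs, peptide, tol)
    return len(noise) / len(signal) if signal else float("inf")


# Hypothetical spectrum for the peptide "GAK": four fragment ions plus two noise peaks.
peaks = [58.03, 129.07, 147.11, 218.15, 100.0, 175.5]
print(split_peaks(peaks, "GAK"))
print(noise_signal_ratio(peaks, "GAK"))
```

Labels like these are only available when the peptide is known, as in training data. At inference time a sequencing model has to separate signal from noise without them, which is the setting the framework targets.
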
Structural Causal Model formalization for peptide sequencing

The authors formalize de novo peptide sequencing using Structural Causal Models to explicitly represent causal relationships between mass spectra and peptide sequences. This formalization distinguishes causal factors from non-causal spurious correlations, providing a principled foundation for robust model design.

6 retrieved papers
Independence and sufficiency principles with information-theoretic objectives

The authors derive two fundamental principles—independence (ensuring representations are invariant to non-causal factors) and sufficiency (retaining predictive information)—and operationalize them through causal interventions and information-theoretic objectives. These principles guide the disentanglement of causal signal from noise in the latent representation space.

9 retrieved papers
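
The independence principle can be illustrated with a toy intervention. This is a hedged sketch under stated assumptions and is not taken from the paper: the noise peaks of a spectrum are swapped for noise peaks from a hypothetical donor spectrum while the signal peaks stay fixed, and a representation is then checked for change. The binned-intensity representation, the squared-difference "gap", and every function name here are illustrative stand-ins for a learned encoder and training objective. Sufficiency is not modelled, since it would require a predictive loss on the peptide sequence.

```python
def swap_noise(peaks, is_signal, donor_noise):
    """Intervention: keep signal peaks, replace noise peaks with donor noise peaks."""
    kept = [p for p, sig in zip(peaks, is_signal) if sig]
    flags = [True] * len(kept) + [False] * len(donor_noise)
    return kept + list(donor_noise), flags


def histogram_repr(peaks, flags=None, lo=50.0, hi=500.0, n_bins=9):
    """Toy representation: summed intensity per m/z bin.

    If flags is given, only peaks flagged True (signal) contribute.
    """
    width = (hi - lo) / n_bins
    bins = [0.0] * n_bins
    for k, (mz, intensity) in enumerate(peaks):
        if flags is not None and not flags[k]:
            continue
        idx = max(0, min(int((mz - lo) / width), n_bins - 1))
        bins[idx] += intensity
    return bins


def independence_gap(rep_a, rep_b):
    """Squared distance between representations before and after the intervention."""
    return sum((a - b) ** 2 for a, b in zip(rep_a, rep_b))


# Hypothetical spectrum: (m/z, intensity) pairs, four signal peaks then two noise peaks.
peaks = [(58.03, 1.0), (129.07, 0.8), (147.11, 0.9), (218.15, 0.7),
         (100.0, 0.5), (175.5, 0.4)]
is_signal = [True, True, True, True, False, False]
donor_noise = [(310.2, 0.6), (402.7, 0.3)]

new_peaks, new_flags = swap_noise(peaks, is_signal, donor_noise)

# A representation built from all peaks changes under the intervention ...
gap_all = independence_gap(histogram_repr(peaks), histogram_repr(new_peaks))
# ... while one restricted to signal peaks is invariant to it.
gap_signal = independence_gap(histogram_repr(peaks, is_signal),
                              histogram_repr(new_peaks, new_flags))
print(gap_all, gap_signal)
```

In this toy setting the signal-only representation has a gap of exactly zero because the oracle mask removes every noise peak. A learned model has no such mask, so a gap of this kind would presumably be used as a penalty to minimise rather than a quantity that vanishes by construction.
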

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.
