Patient-Specific Biomolecular Instruction Tuning of Graph-LLMs

ICLR 2026 Conference Submission • Anonymous Authors

OpenReview Score: 6.0

Large Language Models, Foundation Models, Graph-LLM, Instruction Tuning, Multi-modal LLMs, Bioinformatics, Proteomics

Abstract:

Proteomics data is imperative to pathogenic understanding of a disease phenotype. In cancer, analysis of molecular signatures enables precision medicine through the identification of biological processes that drive individualized tumor progression, therapeutic resistance, and clinical heterogeneity. Recent advances in multimodal large language models (LLMs) have shown remarkable capacity to integrate and reason across heterogeneous data modalities. However, performing multi-modal language modeling for molecular understanding of patient-specific proteomics remains a significant challenge due to two barriers: (1) the lack of instruction-tuning datasets that enable clinical interpretation from proteomics data, and (2) the absence of language-modeling architectures designed to capture the rich heterogeneity of molecular data. In this work, we introduce cptac-prot-instruct, the first patient-centric instruction tuning dataset for molecular understanding of oncology, comprising over 370k open-ended examples derived from individualized proteomic profiles curated from the largest national proteomics cancer study (CPTAC). Additionally, we propose KRONOS (Knowledge Representation of individualized Omics Networks via Structured tuning), a novel graph-LLM framework that leverages molecular interaction topology with proteomics to learn patient-specific graph representations for enhanced clinical reasoning. We show that KRONOS achieves consistent improvements across benchmark clinical tasks, with AUC performance of up to $0.857\pm0.025$ in prognostic tasks such as mortality prediction, cancer type OS prediction, and tumor stage classification from proteomics data. Ultimately, this approach empowers LLMs to understand patient-level pathogenesis, advancing precision medicine through more accurate diagnosis, prognosis, and treatment stratification.

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a multimodal large language model framework that integrates patient-specific proteomics with graph-structured molecular interactions for clinical reasoning in oncology. It resides in the 'Multimodal LLM Integration for Patient-Specific Proteomics' leaf, which contains only two papers including this one. This represents a sparse and emerging research direction within the broader taxonomy of 50 papers across 36 topics, indicating that the intersection of instruction-tuned LLMs and individualized proteomic networks remains relatively unexplored compared to more established branches like multi-omics GNN architectures or disease-specific profiling.

The taxonomy reveals that neighboring research directions pursue related but distinct goals. The sibling leaf 'Individualized Protein Interaction Network Construction' focuses on network inference algorithms without language modeling, while 'Personalized Pathway Activity Profiling' emphasizes pathway scoring methods. Nearby branches such as 'Multi-Omics Cancer Subtyping' and 'AI-Driven Precision Medicine Frameworks' prioritize predictive accuracy over natural language interpretability. The paper's positioning suggests it bridges patient-specific network inference with clinical reasoning systems, addressing a gap between complex molecular graphs and narrative-style clinical interpretation that other branches do not directly tackle.

Among the 25 candidates examined through a limited semantic search, none clearly refutes any of the three core contributions. For the CPTAC-PROTSTRUCT dataset, 10 candidates were examined and none was found to refute it, suggesting novelty in creating patient-centric instruction tuning data from national proteomics studies. For the KRONOS graph-LLM framework, likewise, none of the 10 candidates examined refutes it, indicating architectural distinctiveness in combining graph encoders with language models for proteomics. For the two-stage curriculum learning approach, 5 candidates were examined without refutation, though the limited search scope means potentially relevant prior work in curriculum learning for biomedical LLMs may exist beyond the top-25 semantic matches.

Based on the limited literature search covering 25 candidates, the work appears to occupy a novel position at the intersection of instruction-tuned language models and patient-specific molecular networks. The sparse taxonomy leaf and the absence of refuting candidates suggest originality, though the analysis does not exhaustively cover prior work in broader LLM instruction tuning or graph-based biomedical reasoning beyond the top semantic matches examined.

Taxonomy

[Figure: taxonomy tree. Legend: Core-task Taxonomy Papers; Claimed Contributions; Contribution Candidate Papers Compared; Refutable Paper]

Research Landscape Overview

Core task: Clinical reasoning from patient-specific proteomics data using graph-structured molecular interactions.

The field encompasses a diverse set of approaches that leverage network representations of molecular data to support personalized clinical decision-making. At the highest level, the taxonomy organizes work into branches focused on graph neural network architectures for multi-omics integration, knowledge graph construction and clinical integration, patient-specific network inference and personalized modeling, protein interaction network discovery from proteomics, spatial proteomics analysis and geometric deep learning, disease-specific proteomic profiling and biomarker discovery, multimodal clinical prediction and precision medicine frameworks, clinical reasoning systems and interpretability, and methodological reviews and design principles. Some branches emphasize computational architectures, such as GNN-based methods that fuse heterogeneous omics layers (e.g., MoGCN[8], Explainable Multi-Omics GNN[11]), while others concentrate on constructing structured knowledge resources (e.g., Clinical Proteomics Knowledge Graph[1], Clinical Knowledge Graph[22]) or inferring individualized interaction networks (e.g., Individualized Interactomes[16], Patient-Specific Pathways[41]). Disease-specific profiling branches gather studies targeting particular conditions through proteomic signatures, and spatial proteomics branches explore geometric representations of tissue-level molecular data.

Within this landscape, a particularly active line of work centers on integrating large language models with patient-specific molecular networks to enable interpretable clinical reasoning. Biomolecular Instruction Tuning[0] exemplifies this direction by combining instruction-tuned language models with graph-structured proteomics, aiming to generate clinically actionable insights that are both personalized and explainable. This approach contrasts with purely data-driven multi-omics frameworks like Multi-Omics Precision Medicine[3] or Multi-Omics Precision Oncology[4], which prioritize predictive accuracy over natural language interpretability. Compared to ExplainMIX[5], which focuses on post-hoc explanations of multi-omics predictions, Biomolecular Instruction Tuning[0] embeds reasoning directly into the model's generative process. The work sits at the intersection of patient-specific network inference and multimodal clinical prediction, addressing the open question of how to bridge the gap between complex molecular graphs and the narrative reasoning style familiar to clinicians, while maintaining fidelity to individual patient biology.

Claimed Contributions

CPTAC-PROTSTRUCT instruction tuning dataset

10 retrieved papers

The authors create the first patient-level instruction-tuning dataset for molecular oncology, containing over 370,000 examples that bridge individualized proteomic profiles with clinical reasoning tasks. The dataset includes schema alignment questions for navigating proteomics data and clinical reasoning questions for prognostic interpretation.


KRONOS graph-LLM framework

10 retrieved papers

The authors introduce a unified architecture that integrates protein-protein interaction network topology with patient-specific proteomics data through graph neural networks, enabling language models to perform semantic reasoning over structured biological interactions for clinical predictions.


Two-stage curriculum learning approach for proteomics instruction tuning

5 retrieved papers

The authors develop a curriculum learning strategy with two stages: schema alignment training to bridge the modality gap between text and proteomics, followed by clinical reasoning training to enable advanced molecular interpretation for patient prognosis.
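
The staging idea in this contribution can be sketched as a simple ordering rule: every schema alignment example is seen before any clinical reasoning example. The snippet below is only a minimal illustration under that assumption and is not the authors' training code; the `Example` class, the stage labels, the `curriculum_batches` helper, and the placeholder prompts are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Example:
    stage: str   # hypothetical label: "schema_alignment" or "clinical_reasoning"
    prompt: str
    answer: str


# Stage 1 comes first, stage 2 second (assumed from the description above).
STAGE_ORDER = ["schema_alignment", "clinical_reasoning"]


def curriculum_batches(examples: List[Example], batch_size: int) -> Iterator[List[Example]]:
    """Yield every batch of stage 1 before any batch of stage 2."""
    for stage in STAGE_ORDER:
        stage_examples = [ex for ex in examples if ex.stage == stage]
        for start in range(0, len(stage_examples), batch_size):
            yield stage_examples[start:start + batch_size]


# Placeholder records only; no real dataset content is reproduced here.
examples = [
    Example("clinical_reasoning", "placeholder reasoning question", "placeholder answer"),
    Example("schema_alignment", "placeholder schema question 1", "placeholder answer"),
    Example("schema_alignment", "placeholder schema question 2", "placeholder answer"),
]

batches = list(curriculum_batches(examples, batch_size=2))
stages_seen = [batch[0].stage for batch in batches]
```

In a real pipeline each yielded batch would be passed to a fine-tuning step; here the batches are only collected so the ordering can be inspected.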

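
As a rough illustration of the KRONOS contribution listed above, the following sketch shows one generic way to combine a shared protein-protein interaction (PPI) topology with a single patient's protein abundances: one round of normalized graph convolution followed by mean pooling into a patient-level vector. It is not the authors' architecture. The function names, the three-protein toy network, the abundance values, and the identity weight matrix (standing in for learned parameters) are all assumptions made for illustration.

```python
import numpy as np


def normalized_adjacency(adj: np.ndarray) -> np.ndarray:
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 of an undirected graph."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_hat.sum(axis=1)                 # node degrees including self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def patient_graph_embedding(adj: np.ndarray, abundances: np.ndarray,
                            weight: np.ndarray) -> np.ndarray:
    """One graph-convolution layer with ReLU, then mean pooling over proteins."""
    hidden = np.maximum(normalized_adjacency(adj) @ abundances @ weight, 0.0)
    return hidden.mean(axis=0)


# Hypothetical PPI topology over three proteins: A-B and B-C interact.
ppi = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])

# Hypothetical abundances for one patient: one feature per protein.
patient_x = np.array([[1.0], [0.5], [2.0]])

# Identity weights stand in for parameters that would normally be learned.
w = np.eye(1)

embedding = patient_graph_embedding(ppi, patient_x, w)
```

The topology is shared across patients while the node features change per patient, so the pooled vector is patient-specific. How such a vector would be passed to a language model is not specified in this report and is left out of the sketch.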

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[15] Patient-specific Biomolecular Instruction Tuning

Irsyad Adam, Zekai Chen, David Laub, Shaun Porwal, Arda Pekis, Kevin Brown (2025) • arXiv.org

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: CPTAC-PROTSTRUCT instruction tuning dataset

[15] Patient-specific Biomolecular Instruction Tuning

Cannot Refute

[51] Towards multimodal foundation models in molecular cell biology

Cannot Refute

[52] The potential of large language models to advance precision oncology

Cannot Refute

[53] Decoding Breast Cancer Heterogeneity via Multi-Omics Integration and Language Model-Based Interpretation

Cannot Refute

[54] OPI: An Open Instruction Dataset for Adapting Large Language Models to Protein-Related Tasks

Cannot Refute

[55] Adversary-aware multimodal neural networks for cancer susceptibility prediction from multiomics data

Cannot Refute

[56] A cross-level information transmission network for predicting phenotype from new genotype: Application to cancer precision medicine

Cannot Refute

[57] Postoperative Complications Prediction of Lung Cancer Multimodal Fusion

Cannot Refute

[58] Language Model-Based Representation Learning for Venom Protein Identification and Therapeutic Target Discovery in Cancer

Cannot Refute

[59] Multi-Modal Data Analysis for Patient Outcome Prediction in Colorectal Cancer

Cannot Refute

Contribution: KRONOS graph-LLM framework

[32] PPIxGPN: plasma proteomic profiling of neurodegenerative biomarkers with protein-protein interaction-based eXplainable graph propagational network

Cannot Refute

[65] Leveraging protein-protein interactions in phenotype prediction through graph neural networks

Cannot Refute

[66] Spatially resolved subcellular protein-protein interactomics in drug-perturbed lung-cancer cultures and tissues

Cannot Refute

[67] A graph neural network approach for hierarchical mapping of breast cancer protein communities

Cannot Refute

[68] MGPPI: multiscale graph neural networks for explainable protein-protein interaction prediction

Cannot Refute

[69] Identification of molecular subtypes of dementia by using blood-proteins interaction-aware graph propagational network

Cannot Refute

[70] MVMSGAT: Integrating Multiview, Multi-Scale Graph Convolutional Networks with Biological Prior Knowledge for Predicting Bladder Cancer Response to Neoadjuvant Therapy

Cannot Refute

[71] DriverOmicsNet: an integrated graph convolutional network for multi-omics exploration of cancer driver genes

Cannot Refute

[72] … for Parkinson's Disease Diagnosis: A Graph Neural Network (GNN) Based Classification Approach with Graph Wavelet Transform (GWT) Using Protein …

Cannot Refute

[73] Graph Neural Network Model for Prediction of Non-Small Cell Lung Cancer Lymph Node Metastasis Using Protein-Protein Interaction Network and 18F-FDG …

Cannot Refute

Contribution: Two-stage curriculum learning approach for proteomics instruction tuning

[60] An adaptive, continuous-learning framework for clinical decision-making from proteome-wide biofluid data

Cannot Refute

[61] Deciphering Early and Progressive Molecular Signatures in Alzheimer's Disease through Integrated Longitudinal Proteomic and Pathway Analysis in a …

Cannot Refute

[62] Designing proteins with reduced T-cell epitopes through policy optimization

Cannot Refute

[63] From Neural Connectopathy to a Therapeutic Path: An Integrated Multi-omics Framework Identifies a Causal Gene and Drug Candidates for Parkinson's Disease with …

Cannot Refute

[64] A Review of Graph Neural Networks for Brain Diseases Analysis

Cannot Refute
