Patient-Specific Biomolecular Instruction Tuning of Graph-LLMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Large Language Models, Foundation Models, Graph-LLM, Instruction Tuning, Multi-modal LLMs, Bioinformatics, Proteomics
Abstract:

Proteomics data are essential to understanding the pathogenesis of a disease phenotype. In cancer, analysis of molecular signatures enables precision medicine through the identification of biological processes that drive individualized tumor progression, therapeutic resistance, and clinical heterogeneity. Recent advances in multimodal large language models (LLMs) have shown remarkable capacity to integrate and reason across heterogeneous data modalities. However, multimodal language modeling for molecular understanding of patient-specific proteomics remains a significant challenge because of two barriers: (1) the lack of instruction-tuning datasets that enable clinical interpretation from proteomics data, and (2) the absence of language-modeling architectures designed to capture the rich heterogeneity of molecular data. In this work, we introduce cptac-prot-instruct, the first patient-centric instruction-tuning dataset for molecular understanding of oncology, comprising over 370k open-ended examples derived from individualized proteomic profiles curated from the largest national proteomics cancer study (CPTAC). Additionally, we propose KRONOS (Knowledge Representation of individualized Omics Networks via Structured tuning), a novel graph-LLM framework that combines molecular interaction topology with proteomics to learn patient-specific graph representations for enhanced clinical reasoning. We show that KRONOS achieves consistent improvements across benchmark clinical tasks, with an AUC of up to 0.857 ± 0.025 in prognostic tasks such as mortality prediction, cancer type OS prediction, and tumor stage classification from proteomics data. Ultimately, this approach empowers LLMs to understand patient-level pathogenesis, advancing precision medicine through more accurate diagnosis, prognosis, and treatment stratification.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a multimodal large language model framework that integrates patient-specific proteomics with graph-structured molecular interactions for clinical reasoning in oncology. It resides in the 'Multimodal LLM Integration for Patient-Specific Proteomics' leaf, which contains only two papers including this one. This represents a sparse and emerging research direction within the broader taxonomy of 50 papers across 36 topics, indicating that the intersection of instruction-tuned LLMs and individualized proteomic networks remains relatively unexplored compared to more established branches like multi-omics GNN architectures or disease-specific profiling.

The taxonomy reveals that neighboring research directions pursue related but distinct goals. The sibling leaf 'Individualized Protein Interaction Network Construction' focuses on network inference algorithms without language modeling, while 'Personalized Pathway Activity Profiling' emphasizes pathway scoring methods. Nearby branches such as 'Multi-Omics Cancer Subtyping' and 'AI-Driven Precision Medicine Frameworks' prioritize predictive accuracy over natural language interpretability. The paper's positioning suggests it bridges patient-specific network inference with clinical reasoning systems, addressing a gap between complex molecular graphs and narrative-style clinical interpretation that other branches do not directly tackle.

Among the 25 candidates examined through a limited semantic search, none clearly refutes any of the three core contributions. For the CPTAC-PROTSTRUCT dataset, 10 candidates were examined with zero refutable matches, suggesting novelty in creating patient-centric instruction-tuning data from national proteomics studies. For the KRONOS graph-LLM framework, none of the 10 candidates examined was refutable either, indicating architectural distinctiveness in combining graph encoders with language models for proteomics. For the two-stage curriculum learning approach, 5 candidates were examined without refutation, though the limited search scope means that relevant prior work on curriculum learning for biomedical LLMs may exist beyond the top-25 semantic matches.

Based on the limited literature search covering 25 candidates, the work appears to occupy a novel position at the intersection of instruction-tuned language models and patient-specific molecular networks. The sparse taxonomy leaf and absence of refutable candidates suggest originality, though the analysis does not cover exhaustive prior work in broader LLM instruction tuning or graph-based biomedical reasoning beyond the top semantic matches examined.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 0

Research Landscape Overview

Core task: Clinical reasoning from patient-specific proteomics data using graph-structured molecular interactions. The field encompasses a diverse set of approaches that leverage network representations of molecular data to support personalized clinical decision-making. At the highest level, the taxonomy organizes work into branches focused on graph neural network architectures for multi-omics integration, knowledge graph construction and clinical integration, patient-specific network inference and personalized modeling, protein interaction network discovery from proteomics, spatial proteomics analysis and geometric deep learning, disease-specific proteomic profiling and biomarker discovery, multimodal clinical prediction and precision medicine frameworks, clinical reasoning systems and interpretability, and methodological reviews and design principles. Some branches emphasize computational architectures, such as GNN-based methods that fuse heterogeneous omics layers (e.g., MoGCN[8], Explainable Multi-Omics GNN[11]), while others concentrate on constructing structured knowledge resources (e.g., Clinical Proteomics Knowledge Graph[1], Clinical Knowledge Graph[22]) or inferring individualized interaction networks (e.g., Individualized Interactomes[16], Patient-Specific Pathways[41]). Disease-specific profiling branches gather studies targeting particular conditions through proteomic signatures, and spatial proteomics branches explore geometric representations of tissue-level molecular data.

Within this landscape, a particularly active line of work centers on integrating large language models with patient-specific molecular networks to enable interpretable clinical reasoning. Biomolecular Instruction Tuning[0] exemplifies this direction by combining instruction-tuned language models with graph-structured proteomics, aiming to generate clinically actionable insights that are both personalized and explainable. This approach contrasts with purely data-driven multi-omics frameworks like Multi-Omics Precision Medicine[3] or Multi-Omics Precision Oncology[4], which prioritize predictive accuracy over natural language interpretability. Compared to ExplainMIX[5], which focuses on post-hoc explanations of multi-omics predictions, Biomolecular Instruction Tuning[0] embeds reasoning directly into the model's generative process. The work sits at the intersection of patient-specific network inference and multimodal clinical prediction, addressing the open question of how to bridge the gap between complex molecular graphs and the narrative reasoning style familiar to clinicians, while maintaining fidelity to individual patient biology.

Claimed Contributions

CPTAC-PROTSTRUCT instruction tuning dataset

The authors create the first patient-level instruction-tuning dataset for molecular oncology, containing over 370,000 examples that bridge individualized proteomic profiles with clinical reasoning tasks. The dataset includes schema alignment questions for navigating proteomics data and clinical reasoning questions for prognostic interpretation.

10 retrieved papers
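
To make the two question types in this contribution concrete, the following is a minimal sketch of how a single patient profile might be turned into one schema-alignment example and one clinical-reasoning example. The record fields, question templates, protein names, and values are all hypothetical placeholders chosen for illustration; they are not taken from the dataset itself.

```python
# Hypothetical sketch: field names, templates, and values are assumptions,
# not the actual format of the dataset described in the paper.

def schema_alignment_example(patient_id, abundance, protein):
    """Question that only requires reading a value out of the proteomic profile."""
    return {
        "patient_id": patient_id,
        "task": "schema_alignment",
        "instruction": f"What is the measured abundance of {protein} for this patient?",
        "answer": f"{abundance[protein]:.2f}",
    }

def clinical_reasoning_example(patient_id, label_name, label_value):
    """Question that asks for a clinical label to be inferred from the profile."""
    return {
        "patient_id": patient_id,
        "task": "clinical_reasoning",
        "instruction": f"Based on this patient's proteomic profile, what is the {label_name}?",
        "answer": label_value,
    }

# Toy profile with placeholder protein names and made-up abundances.
profile = {"PROT_A": 1.5, "PROT_B": -0.25}
examples = [
    schema_alignment_example("patient-001", profile, "PROT_A"),
    clinical_reasoning_example("patient-001", "tumor stage", "Stage II"),
]
```

Both examples share the same patient identifier, so a model would see lookup-style and prognosis-style questions grounded in the same profile.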
KRONOS graph-LLM framework

The authors introduce a unified architecture that integrates protein-protein interaction network topology with patient-specific proteomics data through graph neural networks, enabling language models to perform semantic reasoning over structured biological interactions for clinical predictions.

10 retrieved papers
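
The idea of combining a shared interaction topology with per-patient measurements can be illustrated with a small sketch. It assumes a simple mean-aggregation message-passing step and a mean readout, which stand in for whatever graph neural network the authors actually use; the protein names, edges, and abundances are made-up placeholders.

```python
# Illustrative sketch only: mean aggregation is an assumed stand-in for the
# paper's graph encoder, and all names and numbers below are placeholders.

def build_patient_graph(abundance, ppi_edges):
    """Keep only interaction edges whose two endpoints were measured for this patient."""
    adjacency = {protein: [] for protein in abundance}
    for a, b in ppi_edges:
        if a in abundance and b in abundance:
            adjacency[a].append(b)
            adjacency[b].append(a)
    return adjacency

def message_pass(abundance, adjacency):
    """One round of message passing: blend each node's value with its neighbors' mean."""
    updated = {}
    for protein, value in abundance.items():
        neighbors = adjacency[protein]
        if neighbors:
            neighbor_mean = sum(abundance[n] for n in neighbors) / len(neighbors)
            updated[protein] = 0.5 * value + 0.5 * neighbor_mean
        else:
            updated[protein] = value  # isolated node keeps its own value
    return updated

def readout(node_values):
    """Collapse node values into a single patient-level summary (mean pooling)."""
    return sum(node_values.values()) / len(node_values)

# Shared topology (same for every patient) and one patient's measurements.
ppi_edges = [("PROT_A", "PROT_B"), ("PROT_A", "PROT_C"), ("PROT_C", "PROT_D")]
patient = {"PROT_A": 2.0, "PROT_B": 0.0, "PROT_C": 4.0}  # PROT_D not measured

adjacency = build_patient_graph(patient, ppi_edges)
smoothed = message_pass(patient, adjacency)
graph_summary = readout(smoothed)
```

In a graph-LLM setting, a learned vector readout would typically be projected into the language model's embedding space; the scalar summary here only shows where topology and patient-specific values meet.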
Two-stage curriculum learning approach for proteomics instruction tuning

The authors develop a curriculum learning strategy with two stages: schema alignment training to bridge the modality gap between text and proteomics, followed by clinical reasoning training to enable advanced molecular interpretation for patient prognosis.

5 retrieved papers
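
The two-stage ordering described above can be sketched as a simple training schedule. The stage names follow the description; the `Example` record, the per-stage epoch counts, and the prompts are hypothetical, and the actual optimization step is omitted.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple

@dataclass
class Example:
    stage: str   # "schema_alignment" or "clinical_reasoning"
    prompt: str
    answer: str

# Stage order from the description: alignment first, reasoning second.
STAGE_ORDER = ["schema_alignment", "clinical_reasoning"]

def curriculum(examples: List[Example],
               epochs_per_stage: Dict[str, int]) -> Iterator[Tuple[str, int, Example]]:
    """Yield (stage, epoch, example) so that every schema-alignment step
    comes before any clinical-reasoning step."""
    for stage in STAGE_ORDER:
        stage_examples = [ex for ex in examples if ex.stage == stage]
        for epoch in range(epochs_per_stage[stage]):
            for ex in stage_examples:
                yield stage, epoch, ex

# Toy data with placeholder prompts; epoch counts are arbitrary assumptions.
data = [
    Example("clinical_reasoning", "What is the tumor stage?", "Stage II"),
    Example("schema_alignment", "What is the abundance of PROT_A?", "1.50"),
]
schedule = list(curriculum(data, {"schema_alignment": 2, "clinical_reasoning": 1}))
stages_seen = [stage for stage, _, _ in schedule]
```

Even though the reasoning example appears first in `data`, the schedule emits it last, which is the property a curriculum of this kind relies on.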
