Fusing Pixels and Genes: Spatially-Aware Learning in Computational Pathology

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Computational Pathology, Multimodal Learning, Contrastive Learning
Abstract:

Recent years have witnessed remarkable progress in multimodal learning within computational pathology. Existing models primarily rely on vision and language modalities; however, language alone lacks molecular specificity and offers limited pathological supervision, leading to representational bottlenecks. In this paper, we propose STAMP, a Spatial Transcriptomics-Augmented Multimodal Pathology representation learning framework that integrates spatially-resolved gene expression profiles to enable molecule-guided joint embedding of pathology images and transcriptomic data. Our study shows that self-supervised, gene-guided training provides a robust and task-agnostic signal for learning pathology image representations. Incorporating spatial context and multi-scale information further enhances model performance and generalizability. To support this, we constructed SpaVis-6M, the largest Visium-based spatial transcriptomics dataset to date, and trained a spatially-aware gene encoder on this resource. Leveraging hierarchical multi-scale contrastive alignment and cross-scale patch localization mechanisms, STAMP effectively aligns spatial transcriptomics with pathology images, capturing spatial structure and molecular variation. We validate STAMP across six datasets and four downstream tasks, where it consistently achieves strong performance. These results highlight the value and necessity of integrating spatially resolved molecular supervision for advancing multimodal learning in computational pathology. The code is included in the supplementary materials. The pretrained weights and SpaVis-6M will be released for community development after the review process concludes.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

STAMP proposes a foundation model that integrates spatial transcriptomics with histopathology images through self-supervised, gene-guided contrastive learning. The paper resides in the 'Pan-Cancer and Multi-Organ Foundation Models' leaf, which contains seven papers including the original work. This leaf represents a moderately populated research direction within the broader taxonomy of fifty papers, indicating active but not overcrowded exploration of large-scale cross-modal pretraining approaches that aim for generalizability across diverse tissue types and cancer contexts.

The taxonomy reveals that STAMP's immediate neighbors pursue similar pan-cancer foundation modeling goals, while adjacent leaves explore contrastive image-gene alignment and specialized pretraining paradigms. The 'Contrastive Learning for Image-Gene Alignment' leaf contains six papers focused on latent space alignment, and the 'Specialized Pretraining Paradigms' leaf includes five papers using alternative objectives like pathway-level alignment. STAMP appears to bridge these directions by combining contrastive alignment with spatial context modeling, distinguishing itself through explicit incorporation of spatially-resolved gene expression rather than bulk or pathway-level representations.

Among thirty candidates examined through semantic search, none clearly refuted any of STAMP's three core contributions. The STAMP framework itself was assessed against ten candidates with zero refutable overlaps; the SpaVis-6M dataset construction similarly showed no prior work among ten examined papers; and the unified alignment loss combining spatial and multi-scale objectives found no refuting evidence across ten candidates. These statistics suggest that within the limited search scope, STAMP's specific combination of spatial transcriptomics integration, large-scale dataset construction, and hierarchical multi-scale alignment appears relatively unexplored, though the search does not cover the entire literature landscape.

Based on the top-thirty semantic matches examined, STAMP's contributions appear to occupy a distinct position within the foundation model space. The absence of refuting candidates across all three contributions indicates potential novelty in the specific technical approach, though this assessment is constrained by the search methodology and does not preclude the existence of related work outside the examined set. The moderately populated taxonomy leaf suggests the paper enters an active research area with established precedents but room for methodological differentiation.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: multimodal representation learning integrating pathology images and spatial transcriptomics. This field seeks to bridge high-resolution histology with spatially resolved gene expression profiles, enabling richer characterizations of tissue architecture and molecular function.

The taxonomy organizes research into several major branches. Foundation Models and Cross-Modal Pretraining encompasses large-scale efforts that learn joint embeddings across imaging and genomic modalities, often leveraging contrastive or generative objectives to capture pan-cancer or multi-organ patterns (e.g., Past Multimodal Foundation[1], STPath Foundation Model[10]). Spatial Domain Identification and Tissue Segmentation focuses on delineating biologically meaningful regions within tissue sections, combining graph-based and deep learning techniques (e.g., SpaGCN Spatial Domains[5]). Gene Expression Prediction from Histology aims to infer transcriptomic profiles directly from morphology, using regression, diffusion, or ranking-based models (e.g., Diffusion Gene Expression[3]). Multi-Modal Disentanglement and Integration addresses the challenge of separating and recombining modality-specific versus shared information (e.g., Multimodal Disentanglement[8]). Finally, Datasets, Benchmarks, and Methodological Reviews provide standardized resources and comparative analyses to guide method development (e.g., Stimage Dataset[6], Cross-Modal Benchmark[9]).

Several active lines of work highlight key trade-offs and open questions. Foundation model approaches emphasize scalability and transferability, training on diverse cohorts to produce general-purpose representations, yet they must balance pretraining complexity with downstream task performance. In contrast, spatial domain methods prioritize interpretability and biological fidelity, often incorporating graph structures or topological constraints, but may struggle with heterogeneity across tissue types. Gene expression prediction methods explore whether morphology alone suffices for transcriptomic inference or whether explicit spatial context is essential.

Within this landscape, Fusing Pixels Genes[0] sits naturally among pan-cancer foundation models, sharing the ambition of Pan-Cancer Histology-Genomic[14] and spEMO Foundation Models[15] to learn cross-modal embeddings at scale. Compared to these neighbors, Fusing Pixels Genes[0] likely emphasizes tighter integration of pixel-level histology features with spatially indexed gene profiles, positioning itself as a bridge between large-scale pretraining and spatially aware representation learning.

Claimed Contributions

STAMP framework for spatially-aware multimodal pathology learning

The authors introduce STAMP, a novel framework that combines pathology images with spatial transcriptomics data through spatially-aware and multi-scale contrastive learning. The framework uses hierarchical multi-scale contrastive alignment and cross-scale patch localization to capture spatial structure and molecular variation.

10 retrieved papers
SpaVis-6M dataset construction

The authors constructed SpaVis-6M, the largest Visium-based spatial transcriptomics dataset to date, containing 5.75 million spatial transcriptomics entries drawn from 35 organs, 1,982 slices, and 262 datasets or publications. This resource supports the training of a robust spatially-aware gene encoder.

10 retrieved papers
Unified alignment loss combining spatial and multi-scale objectives

The authors develop a unified alignment loss function that integrates multiple objectives including cross-scale patch positioning, inter-modal contrastive alignment between images and genes, and intra-modal alignment between patches and regions. This design enables the model to learn spatial relationships and multi-scale features effectively.

10 retrieved papers
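The report does not reproduce the loss formula, but the three named objectives suggest a weighted sum of a cross-entropy positioning term and two InfoNCE-style contrastive terms. The sketch below is an illustration under those assumptions only; every function and variable name (`info_nce`, `unified_alignment_loss`, the weights `w_pos`, `w_inter`, `w_intra`) is hypothetical and not taken from the paper.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric-style InfoNCE over two batches of embeddings, shape (N, d).
    Positives are assumed to sit on the diagonal: row i of `a` pairs with row i of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                       # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

def unified_alignment_loss(img_patch, gene_patch, img_region,
                           pos_logits, pos_labels,
                           w_pos=1.0, w_inter=1.0, w_intra=1.0):
    """Hypothetical combination of the three objectives named in the contribution:
    cross-scale patch positioning, inter-modal image-gene alignment,
    and intra-modal patch-region alignment."""
    # Cross-scale positioning: classify which cell of the region grid a patch came from.
    p = pos_logits - pos_logits.max(axis=1, keepdims=True)
    log_sm = p - np.log(np.exp(p).sum(axis=1, keepdims=True))
    l_pos = -np.mean(log_sm[np.arange(len(pos_labels)), pos_labels])
    l_inter = info_nce(img_patch, gene_patch)   # image embedding <-> gene embedding
    l_intra = info_nce(img_patch, img_region)   # patch embedding <-> region embedding
    return w_pos * l_pos + w_inter * l_inter + w_intra * l_intra
```

In such a design the weights would typically be tuned so that no single objective dominates early training; the paper's actual weighting and temperature choices are not stated in this report.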

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

STAMP framework for spatially-aware multimodal pathology learning

The authors introduce STAMP, a novel framework that combines pathology images with spatial transcriptomics data through spatially-aware and multi-scale contrastive learning. The framework uses hierarchical multi-scale contrastive alignment and cross-scale patch localization to capture spatial structure and molecular variation.

Contribution

SpaVis-6M dataset construction

The authors constructed SpaVis-6M, the largest Visium-based spatial transcriptomics dataset to date, containing 5.75 million spatial transcriptomics entries drawn from 35 organs, 1,982 slices, and 262 datasets or publications. This resource supports the training of a robust spatially-aware gene encoder.

Contribution

Unified alignment loss combining spatial and multi-scale objectives

The authors develop a unified alignment loss function that integrates multiple objectives including cross-scale patch positioning, inter-modal contrastive alignment between images and genes, and intra-modal alignment between patches and regions. This design enables the model to learn spatial relationships and multi-scale features effectively.