Fusing Pixels and Genes: Spatially-Aware Learning in Computational Pathology

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Computational Pathology, Multimodal Learning, Contrastive Learning
Abstract:

Recent years have witnessed remarkable progress in multimodal learning within computational pathology. Existing models primarily rely on vision and language modalities; however, language alone lacks molecular specificity and offers limited pathological supervision, leading to representational bottlenecks. In this paper, we propose STAMP, a Spatial Transcriptomics-Augmented Multimodal Pathology representation learning framework that integrates spatially-resolved gene expression profiles to enable molecule-guided joint embedding of pathology images and transcriptomic data. Our study shows that self-supervised, gene-guided training provides a robust and task-agnostic signal for learning pathology image representations. Incorporating spatial context and multi-scale information further enhances model performance and generalizability. To support this, we constructed SpaVis-6M, the largest Visium-based spatial transcriptomics dataset to date, and trained a spatially-aware gene encoder on this resource. Leveraging hierarchical multi-scale contrastive alignment and cross-scale patch localization mechanisms, STAMP effectively aligns spatial transcriptomics with pathology images, capturing spatial structure and molecular variation. We validate STAMP across six datasets and four downstream tasks, where it consistently achieves strong performance. These results highlight the value and necessity of integrating spatially resolved molecular supervision for advancing multimodal learning in computational pathology. The code is included in the supplementary materials. The pretrained weights and SpaVis-6M will be released for community development after the review process concludes.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

STAMP proposes a foundation model that integrates spatial transcriptomics with histopathology images through self-supervised, gene-guided contrastive learning. The paper resides in the 'Pan-Cancer and Multi-Organ Foundation Models' leaf, which contains seven papers including the original work. This leaf represents a moderately populated research direction within the broader taxonomy of fifty papers, indicating active but not overcrowded exploration of large-scale cross-modal pretraining approaches that aim for generalizability across diverse tissue types and cancer contexts.

The taxonomy reveals that STAMP's immediate neighbors pursue similar pan-cancer foundation modeling goals, while adjacent leaves explore contrastive image-gene alignment and specialized pretraining paradigms. The 'Contrastive Learning for Image-Gene Alignment' leaf contains six papers focused on latent space alignment, and the 'Specialized Pretraining Paradigms' leaf includes five papers using alternative objectives like pathway-level alignment. STAMP appears to bridge these directions by combining contrastive alignment with spatial context modeling, distinguishing itself through explicit incorporation of spatially-resolved gene expression rather than bulk or pathway-level representations.

Among thirty candidates examined through semantic search, none clearly refuted any of STAMP's three core contributions. The STAMP framework itself was assessed against ten candidates with zero refutable overlaps; the SpaVis-6M dataset construction similarly showed no prior work among ten examined papers; and the unified alignment loss combining spatial and multi-scale objectives found no refuting evidence across ten candidates. These statistics suggest that within the limited search scope, STAMP's specific combination of spatial transcriptomics integration, large-scale dataset construction, and hierarchical multi-scale alignment appears relatively unexplored, though the search does not cover the entire literature landscape.

Based on the top-thirty semantic matches examined, STAMP's contributions appear to occupy a distinct position within the foundation model space. The absence of refuting candidates across all three contributions indicates potential novelty in the specific technical approach, though this assessment is constrained by the search methodology and does not preclude the existence of related work outside the examined set. The moderately populated taxonomy leaf suggests the paper enters an active research area with established precedents but room for methodological differentiation.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: multimodal representation learning integrating pathology images and spatial transcriptomics. This field seeks to bridge high-resolution histology with spatially resolved gene expression profiles, enabling richer characterizations of tissue architecture and molecular function.

The taxonomy organizes research into several major branches. Foundation Models and Cross-Modal Pretraining encompasses large-scale efforts that learn joint embeddings across imaging and genomic modalities, often leveraging contrastive or generative objectives to capture pan-cancer or multi-organ patterns (e.g., Past Multimodal Foundation[1], STPath Foundation Model[10]). Spatial Domain Identification and Tissue Segmentation focuses on delineating biologically meaningful regions within tissue sections, combining graph-based and deep learning techniques (e.g., SpaGCN Spatial Domains[5]). Gene Expression Prediction from Histology aims to infer transcriptomic profiles directly from morphology, using regression, diffusion, or ranking-based models (e.g., Diffusion Gene Expression[3]). Multi-Modal Disentanglement and Integration addresses the challenge of separating and recombining modality-specific versus shared information (e.g., Multimodal Disentanglement[8]). Finally, Datasets, Benchmarks, and Methodological Reviews provide standardized resources and comparative analyses to guide method development (e.g., Stimage Dataset[6], Cross-Modal Benchmark[9]).

Several active lines of work highlight key trade-offs and open questions. Foundation model approaches emphasize scalability and transferability, training on diverse cohorts to produce general-purpose representations, yet they must balance pretraining complexity with downstream task performance. In contrast, spatial domain methods prioritize interpretability and biological fidelity, often incorporating graph structures or topological constraints, but may struggle with heterogeneity across tissue types. Gene expression prediction methods explore whether morphology alone suffices for transcriptomic inference or whether explicit spatial context is essential.

Within this landscape, Fusing Pixels Genes[0] sits naturally among pan-cancer foundation models, sharing the ambition of Pan-Cancer Histology-Genomic[14] and spEMO Foundation Models[15] to learn cross-modal embeddings at scale. Compared to these neighbors, Fusing Pixels Genes[0] likely emphasizes tighter integration of pixel-level histology features with spatially indexed gene profiles, positioning itself as a bridge between large-scale pretraining and spatially aware representation learning.

Claimed Contributions

STAMP framework for spatially-aware multimodal pathology learning

The authors introduce STAMP, a novel framework that combines pathology images with spatial transcriptomics data through spatially-aware and multi-scale contrastive learning. The framework uses hierarchical multi-scale contrastive alignment and cross-scale patch localization to capture spatial structure and molecular variation.

10 retrieved papers
SpaVis-6M dataset construction

The authors constructed SpaVis-6M, the largest Visium-based spatial transcriptomics dataset to date, containing 5.75 million spatial transcriptomics entries drawn from 35 organs, 1,982 slices, and 262 datasets or publications. This resource supports the training of a robust spatially-aware gene encoder.

10 retrieved papers
Unified alignment loss combining spatial and multi-scale objectives

The authors develop a unified alignment loss function that integrates multiple objectives including cross-scale patch positioning, inter-modal contrastive alignment between images and genes, and intra-modal alignment between patches and regions. This design enables the model to learn spatial relationships and multi-scale features effectively.

10 retrieved papers
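The report does not reproduce the loss formula, but the three named objectives suggest a weighted sum of a cross-entropy positioning term and two InfoNCE-style contrastive terms. The sketch below is an illustration under those assumptions only; every function and variable name (`info_nce`, `unified_alignment_loss`, the weights `w_pos`, `w_inter`, `w_intra`) is hypothetical and not taken from the paper.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric-style InfoNCE over two batches of embeddings, shape (N, d).
    Positives are assumed to sit on the diagonal: row i of `a` pairs with row i of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                       # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

def unified_alignment_loss(img_patch, gene_patch, img_region,
                           pos_logits, pos_labels,
                           w_pos=1.0, w_inter=1.0, w_intra=1.0):
    """Hypothetical combination of the three objectives named in the contribution:
    cross-scale patch positioning, inter-modal image-gene alignment,
    and intra-modal patch-region alignment."""
    # Cross-scale positioning: classify which cell of the region grid a patch came from.
    p = pos_logits - pos_logits.max(axis=1, keepdims=True)
    log_sm = p - np.log(np.exp(p).sum(axis=1, keepdims=True))
    l_pos = -np.mean(log_sm[np.arange(len(pos_labels)), pos_labels])
    l_inter = info_nce(img_patch, gene_patch)   # image embedding <-> gene embedding
    l_intra = info_nce(img_patch, img_region)   # patch embedding <-> region embedding
    return w_pos * l_pos + w_inter * l_inter + w_intra * l_intra
```

In such a design the weights would typically be tuned so that no single objective dominates early training; the paper's actual weighting and temperature choices are not stated in this report.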

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

STAMP framework for spatially-aware multimodal pathology learning

The authors introduce STAMP, a novel framework that combines pathology images with spatial transcriptomics data through spatially-aware and multi-scale contrastive learning. The framework uses hierarchical multi-scale contrastive alignment and cross-scale patch localization to capture spatial structure and molecular variation.

Contribution

SpaVis-6M dataset construction

The authors constructed SpaVis-6M, the largest Visium-based spatial transcriptomics dataset to date, containing 5.75 million spatial transcriptomics entries drawn from 35 organs, 1,982 slices, and 262 datasets or publications. This resource supports the training of a robust spatially-aware gene encoder.

Contribution

Unified alignment loss combining spatial and multi-scale objectives

The authors develop a unified alignment loss function that integrates multiple objectives including cross-scale patch positioning, inter-modal contrastive alignment between images and genes, and intra-modal alignment between patches and regions. This design enables the model to learn spatial relationships and multi-scale features effectively.