HYPED: A Multimodal HYbrid Perturbation Gene Expression and Imaging Dataset

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: cellular reprogramming, multimodal dataset, Gene Expression, Cell Cycle Imaging
Abstract:

Integrating multimodal, high-resolution biological data is a useful way to characterize biological processes, such as how cells respond to perturbations. Cell perturbation prediction is a major experimental challenge and has motivated substantial research in machine learning for biology. In this work, we generated a multimodal benchmark dataset that captures the dynamic response of human fibroblasts to transient transcription factor perturbations. We performed time-series live cell imaging with fluorescent cell cycle reporters over 72 hours and collected long-read single-cell RNA sequencing data from the same population of cells. We release the processed dataset, preprocessing pipelines and benchmarking code along with the evaluation of existing models using our data as ground truth. This work supports the development and evaluation of machine learning methods for modeling dynamical systems from multimodal datasets. HYPED consists of RNA sequencing data from approximately 20,000 cells and 203 imaging timepoints across four experimental conditions, totaling 2030 imaging frames. HYPED makes the cell perturbation problem accessible to machine learning researchers with state-of-the-art experimental data.
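The abstract quotes 203 imaging timepoints, four experimental conditions, and 2030 frames without stating how the frames are distributed. The short check below only works through the arithmetic of those quoted figures. The reading that each timepoint contributes ten frames (for example, several fields of view) is an assumption consistent with the numbers, not something the abstract confirms.

```python
# Figures quoted in the abstract.
timepoints = 203
conditions = 4
total_frames = 2030

# How many frames does each timepoint account for, and does it divide evenly?
frames_per_timepoint, remainder = divmod(total_frames, timepoints)
print(frames_per_timepoint, remainder)  # 10 0

# For comparison: one frame per condition per timepoint would give this total.
one_frame_per_condition = timepoints * conditions
print(one_frame_per_condition)  # 812
```

The stated total is therefore ten frames per timepoint rather than one frame per condition per timepoint. The exact acquisition layout would need to be checked against the paper itself.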

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces HYPED, a multimodal benchmark dataset combining time-series live cell imaging with fluorescent cell cycle reporters and long-read single-cell RNA sequencing from human fibroblasts subjected to transient transcription factor perturbations. Within the taxonomy, it resides in the 'Multimodal Perturbation Datasets' leaf under 'Experimental Perturbation Platforms and Resources'. This leaf contains only two papers total, indicating a relatively sparse research direction compared to more crowded computational branches like 'Machine Learning Models for Gene Expression Prediction' or 'Foundation Models and Generative Approaches'.

The neighboring 'CRISPR-Based Perturbation Libraries' leaf focuses on genome-scale CRISPR screens with transcriptomic readouts, while HYPED employs transient RNA-based perturbations with multimodal measurements. The broader 'Experimental Perturbation Platforms and Resources' branch sits alongside computational prediction methods and regulatory network inference, serving as the empirical substrate for model development. The taxonomy's scope note explicitly distinguishes multimodal datasets from single-modality transcriptomic platforms, positioning HYPED's integration of imaging and sequencing as a defining characteristic within this sparse experimental niche.

Among 26 candidates examined through limited semantic search, none clearly refuted any of the three contributions. The 'HYPED multimodal benchmark dataset' contribution examined 6 candidates with no refutations. The 'first perturbation dataset using transient RNA-based methods' claim examined 10 candidates without finding prior work demonstrating this specific combination. The 'processed dataset with preprocessing pipelines and benchmarking code' contribution similarly examined 10 candidates with no clear overlaps. These statistics suggest novelty within the examined scope, though the search was not exhaustive.
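As a quick consistency check on the figures above, the per-contribution counts can be tallied in a few lines of Python. This is only an illustrative sketch: the dictionary names are hypothetical, and the numbers are copied from this report rather than produced by the retrieval pipeline.

```python
# Candidate papers examined per claimed contribution, as reported above.
candidates_examined = {
    "HYPED multimodal benchmark dataset": 6,
    "First perturbation dataset using transient RNA-based methods": 10,
    "Processed dataset with preprocessing pipelines and benchmarking code": 10,
}

# The report found no candidate that clearly refuted any contribution.
refuting_candidates = {name: 0 for name in candidates_examined}

total_examined = sum(candidates_examined.values())
total_refuting = sum(refuting_candidates.values())

for name, examined in candidates_examined.items():
    print(f"{name}: {examined} examined, {refuting_candidates[name]} refuting")
print(f"total: {total_examined} examined, {total_refuting} refuting")
```

The totals match the summary figures: 26 candidates examined and 0 refuting papers.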

Given the limited search scale and the sparse population of the 'Multimodal Perturbation Datasets' leaf, the work appears to occupy a relatively underexplored niche combining transient perturbations with multimodal temporal measurements. The absence of refutations among 26 candidates supports novelty claims, though a broader literature search might reveal additional context. The dataset's emphasis on benchmarking code and preprocessing pipelines addresses practical reproducibility concerns in this emerging experimental domain.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 0

Research Landscape Overview

Core task: predicting cellular responses to transcription factor (TF) perturbations. The field has organized itself around several complementary branches. Computational Prediction Methods develop machine learning and deep learning frameworks, ranging from foundation models like Transcription Foundation Model[3] to specialized architectures such as Neural ODEs GRN[4], that forecast gene expression changes following TF interventions. Regulatory Network Inference and Causal Discovery focuses on reconstructing the underlying wiring diagrams and causal relationships that govern TF activity. Experimental Perturbation Platforms and Resources provide the empirical substrate: large-scale libraries (e.g., Human TF Library[2]), systematic perturbation screens, and multimodal datasets that capture diverse cellular contexts. Benchmarking and Evaluation Frameworks, exemplified by efforts like PertEval scFM[7], establish standardized metrics and test beds to compare predictive models. Mechanistic and Systems Biology Models integrate biophysical principles and dynamical systems to explain how TF dynamics encode information. Finally, Domain-Specific Applications and Case Studies translate these tools into concrete biological questions, from cancer pathways (p53 DNA Damage[12]) to vascular disease (Vascular Cell Atherosclerosis[15]).

Within the Experimental Perturbation Platforms branch, a small cluster of works has emerged around multimodal perturbation datasets that combine high-throughput screening with rich molecular readouts. The HYPED Dataset[0] exemplifies this trend by integrating multiple data modalities to capture TF perturbation effects across four experimental conditions, positioning itself alongside resources like the Perturbation Atlas[10], which similarly aggregates large-scale perturbation profiles. These datasets address a critical bottleneck: computational models require extensive training data that span varied cell types, stimuli, and genetic backgrounds. In contrast, earlier platforms often focused on single modalities or narrower experimental designs. By providing a more comprehensive empirical foundation, HYPED Dataset[0] and related efforts enable more robust benchmarking of predictive algorithms and facilitate transfer learning approaches (Transfer Learning TF[9]) that generalize across contexts. The interplay between such resource-building initiatives and computational method development remains a central theme, as richer datasets continually push the frontier of what models can learn and predict.

Claimed Contributions

HYPED multimodal benchmark dataset

The authors created a new dataset combining time-series live cell imaging with fluorescent cell cycle reporters and long-read single-cell RNA sequencing from the same population of cells undergoing transient transcription factor perturbations. This dataset includes approximately 20,000 cells and 203 imaging timepoints across four experimental conditions.

6 retrieved papers

First perturbation dataset using transient RNA-based methods

The authors provide the first multimodal cell perturbation dataset generated using non-integrating transient RNA delivery methods (modified mRNA and siRNA) rather than permanent genome modification approaches like viral vectors or CRISPR, offering safer experimental conditions that better reflect clinical translation potential.

10 retrieved papers

Processed dataset with preprocessing pipelines and benchmarking code

The authors provide not only the raw and processed multimodal data but also complete preprocessing pipelines and benchmarking code, enabling machine learning researchers to evaluate and develop models for cell perturbation prediction with standardized evaluation protocols.

10 retrieved papers

