A Scalable Distributed Framework for Multimodal GigaVoxel Image Registration

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: image registration, distributed optimization, CUDA kernels, neuroanatomy
Abstract:

In this work, we propose FFDP, a set of IO-aware non-GEMM fused kernels supplemented with a distributed framework for image registration at unprecedented scales. Image registration is an inverse problem fundamental to the biomedical and life sciences, but algorithms have not scaled in tandem with image acquisition capabilities. Our framework complements existing model parallelism techniques proposed for large-scale transformer training by optimizing non-GEMM bottlenecks and enabling convolution-aware tensor sharding. We demonstrate unprecedented capabilities by performing multimodal registration of a 100μm ex-vivo human brain MRI volume at native resolution, an inverse problem more than 570× larger than a standard clinical datum, in about a minute using only 8 A6000 GPUs. FFDP accelerates existing state-of-the-art optimization and deep learning registration pipelines by up to 6–7× while reducing peak memory consumption by 20–59%. Comparative analysis on a 250μm dataset shows that FFDP can fit up to 64× larger problems than existing SOTA methods on a single GPU, highlighting both the performance and efficiency gains of FFDP compared to SOTA image registration methods.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 0

Research Landscape Overview

Core task: large-scale multimodal deformable image registration. The field has evolved into a rich ecosystem organized around several complementary dimensions. Deep Learning Architecture and Feature Representation explores how neural networks encode cross-modal correspondences, ranging from transformer-based hierarchical designs (Transformer Hierarchical[28]) to modality-agnostic feature extractors (Modality-agnostic Learning[6]) and foundation models adapted from vision tasks (Dino-reg[3]). Learning Paradigms and Optimization Strategies addresses training regimes—unsupervised methods that bypass manual labels (Unsupervised Multimodal[24]), weakly-supervised approaches leveraging partial annotations (Label-driven Weakly-supervised[21]), and reinforcement learning formulations (Contextual Reinforcement Learning[46]). Regularization and Deformation Modeling focuses on enforcing plausible transformations through biomechanical constraints (Biomechanically Regularized[43]) or diffeomorphic flows (Neural ODEs[29]). Application Domains and Modality-Specific Methods tailors solutions to clinical contexts such as lung CT (Lung CT Registration[27]), retinal imaging (Two-step Retinal[30]), or PET-MR fusion (PET-MR Integration[20]). Multi-Task and Joint Learning Frameworks combine registration with synthesis or segmentation, while Surveys and Comprehensive Reviews synthesize progress (Deep Learning Survey[10], Deep Learning Review[35]). Finally, Computational Efficiency and Scalability tackles the challenge of processing massive datasets through distributed frameworks and parallel computing strategies.

A central tension runs through the literature: balancing model expressiveness with computational feasibility at scale. Many architectures achieve strong accuracy on moderate-sized volumes but struggle when datasets grow to gigavoxel resolutions or require real-time throughput (Real-time Whole Slides[45]).
GigaVoxel Registration[0] directly confronts this bottleneck by developing distributed and parallel computing frameworks that enable deformable alignment of extremely large multimodal images. It sits within the Computational Efficiency and Scalability branch, closely aligned with Efficient Large Scale[22], which similarly prioritizes speed and memory optimization for high-resolution data. Whereas works like Dino-reg[3] or Modality-agnostic Learning[6] emphasize feature learning to handle cross-modal appearance shifts, GigaVoxel Registration[0] assumes that architectural innovations alone may not suffice and instead re-engineers the computational pipeline itself. This focus on scalability complements the broader ecosystem: as new learning paradigms and regularization techniques emerge, efficient execution frameworks become essential to deploy them on the massive, heterogeneous datasets encountered in modern medical imaging and remote sensing applications.

Claimed Contributions

IO-aware non-GEMM fused kernels for image registration

The authors introduce specialized fused CUDA kernels (composite implicit grid sampler, implicit Parzen windowing for mutual information, and fused cross-correlation) that minimize high-bandwidth memory usage by computing intermediate variables in registers or shared memory rather than materializing them in global memory, enabling problems up to 64× larger on a single GPU.

7 retrieved papers
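The core idea behind the claimed "implicit" Parzen windowing can be illustrated with a toy NumPy sketch. This is not the authors' CUDA kernel; function names, the chunk size, and parameters here are illustrative. The point is that each voxel's contribution to the joint histogram is accumulated on the fly, so the large per-voxel weight tensor a naive implementation would materialize in global memory never exists, the analogue of keeping intermediates in registers or shared memory:

```python
import numpy as np

def parzen_mi(fixed, moving, bins=16, sigma=0.5):
    """Mutual information via Parzen (Gaussian) window smoothing.

    Illustrative sketch: joint-histogram contributions are accumulated
    chunk by chunk, so only a small chunk-sized intermediate ever
    exists instead of an (n_voxels x bins) weight tensor per image.
    """
    f = fixed.ravel().astype(np.float64)
    m = moving.ravel().astype(np.float64)
    # map intensities to continuous bin coordinates in [0, bins-1]
    fb = (f - f.min()) / (f.max() - f.min() + 1e-12) * (bins - 1)
    mb = (m - m.min()) / (m.max() - m.min() + 1e-12) * (bins - 1)
    centers = np.arange(bins)
    joint = np.zeros((bins, bins))
    for s in range(0, f.size, 4096):  # stream voxels in chunks
        wf = np.exp(-0.5 * ((fb[s:s + 4096, None] - centers) / sigma) ** 2)
        wm = np.exp(-0.5 * ((mb[s:s + 4096, None] - centers) / sigma) ** 2)
        wf /= wf.sum(1, keepdims=True)
        wm /= wm.sum(1, keepdims=True)
        joint += wf.T @ wm  # accumulate joint histogram, discard weights
    joint /= joint.sum()
    pf, pm = joint.sum(1), joint.sum(0)
    nz = joint > 0
    return float((joint[nz]
                  * np.log(joint[nz] / (pf[:, None] * pm[None, :])[nz])).sum())
```

As a sanity check, the mutual information of an image with itself exceeds that of two independent noise images.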
Distributed framework for multimodal gigavoxel image registration

The authors develop a distributed optimization framework that includes Grid Parallel (GP) for boundary-synchronized tensor sharding, a distributed ring sampler for memory-efficient interpolation across GPUs, and distributed loss functions, enabling registration of images with over 11 billion parameters across multiple GPUs.

10 retrieved papers
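The routing idea behind a distributed ring sampler can be sketched in one dimension with plain NumPy. This is a hedged simulation of the concept, not the paper's implementation: each rank owns a contiguous shard, the global coordinate batch is (conceptually) passed around the ring, and every rank fills in the samples that land inside its shard, so no rank ever gathers the full image. Here the ring steps are simulated by a loop over ranks, and nearest-neighbour interpolation stands in for the real interpolator:

```python
import numpy as np

def ring_sample_1d(shards, coords):
    """Simulated ring sampler over a 1-D image sharded across ranks.

    Illustrative only: at ring step t, rank r would handle the batch
    originating at rank (r - t) % world; here a centralized loop over
    ranks plays all steps. Samples use nearest-neighbour lookup.
    """
    world = len(shards)
    sizes = [len(s) for s in shards]
    offsets = np.cumsum([0] + sizes[:-1])
    out = np.full(len(coords), np.nan)
    idx = np.rint(coords).astype(int)  # nearest-neighbour indices
    for r in range(world):
        lo, hi = offsets[r], offsets[r] + sizes[r]
        mine = (idx >= lo) & (idx < hi)   # samples this rank owns
        out[mine] = shards[r][idx[mine] - lo]
    return out
```

Splitting `np.arange(16.0)` into four shards and sampling at coordinates 0.2, 5.6, and 15.0 returns the values 0, 6, and 15 without any rank holding more than its own quarter of the image.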
Grid Parallel abstraction for convolution-aware tensor sharding

The authors introduce Grid Parallel as a new parallelism technique that complements existing model parallelism methods by enabling boundary synchronization between sharded tensors, which is necessary for convolutional operations in image registration but not addressed by transformer-focused parallelism strategies.

10 retrieved papers
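The boundary-synchronization requirement that motivates Grid Parallel can be made concrete with a minimal halo-exchange sketch in NumPy, again a hedged illustration of the general technique rather than the authors' abstraction. Each rank holds a contiguous shard plus a halo of neighbouring boundary values, so a size-3 convolution is exact at shard boundaries without gathering the whole tensor (the halo fetch is simulated here by indexing the global array):

```python
import numpy as np

def conv1d_same(x, k):
    # reference: 'same' correlation with zero padding
    pad = len(k) // 2
    xp = np.pad(x, pad)
    return np.array([np.dot(xp[i:i + len(k)], k) for i in range(len(x))])

def grid_parallel_conv(x, k, world=4):
    """Boundary-synchronized sharded convolution (illustrative sketch).

    Each rank pads its shard with halo values from its neighbours
    (zeros at the global edges, matching zero padding), then convolves
    locally; concatenating the local outputs reproduces the global
    result exactly.
    """
    halo = len(k) // 2
    shards = np.array_split(x, world)
    outs, start = [], 0
    for s in shards:
        stop = start + len(s)
        # halo exchange, simulated by reading the neighbours' boundaries
        left = x[max(start - halo, 0):start]
        left = np.pad(left, (halo - len(left), 0))
        right = x[stop:stop + halo]
        right = np.pad(right, (0, halo - len(right)))
        local = np.concatenate([left, s, right])
        outs.append(np.array([np.dot(local[i:i + len(k)], k)
                              for i in range(len(s))]))
        start = stop
    return np.concatenate(outs)
```

Without the halo exchange, each shard would compute wrong values at its edges; with it, the sharded result matches the unsharded convolution element for element, which is the property transformer-focused parallelism schemes do not provide.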

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
