A Scalable Distributed Framework for Multimodal GigaVoxel Image Registration

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: image registration, distributed optimization, CUDA kernels, neuroanatomy
Abstract:

In this work, we propose FFDP, a set of IO-aware non-GEMM fused kernels supplemented with a distributed framework for image registration at unprecedented scales. Image registration is an inverse problem fundamental to the biomedical and life sciences, but algorithms have not scaled in tandem with image acquisition capabilities. Our framework complements existing model parallelism techniques proposed for large-scale transformer training by optimizing non-GEMM bottlenecks and enabling convolution-aware tensor sharding. We demonstrate unprecedented capabilities by performing multimodal registration of a 100μm ex-vivo human brain MRI volume at native resolution, an inverse problem more than 570× larger than a standard clinical datum, in about a minute using only 8 A6000 GPUs. FFDP accelerates existing state-of-the-art optimization and deep learning registration pipelines by up to 6–7× while reducing peak memory consumption by 20–59%. Comparative analysis on a 250μm dataset shows that FFDP can fit up to 64× larger problems than existing SOTA methods on a single GPU, highlighting both the performance and efficiency gains of FFDP compared to SOTA image registration methods.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 0

Research Landscape Overview

Core task: large-scale multimodal deformable image registration. The field has evolved into a rich ecosystem organized around several complementary dimensions. Deep Learning Architecture and Feature Representation explores how neural networks encode cross-modal correspondences, ranging from transformer-based hierarchical designs (Transformer Hierarchical[28]) to modality-agnostic feature extractors (Modality-agnostic Learning[6]) and foundation models adapted from vision tasks (Dino-reg[3]). Learning Paradigms and Optimization Strategies addresses training regimes—unsupervised methods that bypass manual labels (Unsupervised Multimodal[24]), weakly-supervised approaches leveraging partial annotations (Label-driven Weakly-supervised[21]), and reinforcement learning formulations (Contextual Reinforcement Learning[46]). Regularization and Deformation Modeling focuses on enforcing plausible transformations through biomechanical constraints (Biomechanically Regularized[43]) or diffeomorphic flows (Neural ODEs[29]). Application Domains and Modality-Specific Methods tailors solutions to clinical contexts such as lung CT (Lung CT Registration[27]), retinal imaging (Two-step Retinal[30]), or PET-MR fusion (PET-MR Integration[20]). Multi-Task and Joint Learning Frameworks combine registration with synthesis or segmentation, while Surveys and Comprehensive Reviews synthesize progress (Deep Learning Survey[10], Deep Learning Review[35]). Finally, Computational Efficiency and Scalability tackles the challenge of processing massive datasets through distributed frameworks and parallel computing strategies.

A central tension runs through the literature: balancing model expressiveness with computational feasibility at scale. Many architectures achieve strong accuracy on moderate-sized volumes but struggle when datasets grow to gigavoxel resolutions or require real-time throughput (Real-time Whole Slides[45]).
GigaVoxel Registration[0] directly confronts this bottleneck by developing distributed and parallel computing frameworks that enable deformable alignment of extremely large multimodal images. It sits within the Computational Efficiency and Scalability branch, closely aligned with Efficient Large Scale[22], which similarly prioritizes speed and memory optimization for high-resolution data. Whereas works like Dino-reg[3] or Modality-agnostic Learning[6] emphasize feature learning to handle cross-modal appearance shifts, GigaVoxel Registration[0] assumes that architectural innovations alone may not suffice and instead re-engineers the computational pipeline itself. This focus on scalability complements the broader ecosystem: as new learning paradigms and regularization techniques emerge, efficient execution frameworks become essential to deploy them on the massive, heterogeneous datasets encountered in modern medical imaging and remote sensing applications.

Claimed Contributions

IO-aware non-GEMM fused kernels for image registration

The authors introduce specialized fused CUDA kernels (composite implicit grid sampler, implicit Parzen windowing for mutual information, and fused cross-correlation) that minimize high-bandwidth memory usage by computing intermediate variables in registers or shared memory rather than materializing them in global memory, enabling problems up to 64× larger on a single GPU.

7 retrieved papers
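The core idea behind the claimed "implicit" Parzen windowing can be illustrated with a toy NumPy sketch. This is not the authors' CUDA kernel; function names, the chunk size, and parameters here are illustrative. The point is that each voxel's contribution to the joint histogram is accumulated on the fly, so the large per-voxel weight tensor a naive implementation would materialize in global memory never exists, the analogue of keeping intermediates in registers or shared memory:

```python
import numpy as np

def parzen_mi(fixed, moving, bins=16, sigma=0.5):
    """Mutual information via Parzen (Gaussian) window smoothing.

    Illustrative sketch: joint-histogram contributions are accumulated
    chunk by chunk, so only a small chunk-sized intermediate ever
    exists instead of an (n_voxels x bins) weight tensor per image.
    """
    f = fixed.ravel().astype(np.float64)
    m = moving.ravel().astype(np.float64)
    # map intensities to continuous bin coordinates in [0, bins-1]
    fb = (f - f.min()) / (f.max() - f.min() + 1e-12) * (bins - 1)
    mb = (m - m.min()) / (m.max() - m.min() + 1e-12) * (bins - 1)
    centers = np.arange(bins)
    joint = np.zeros((bins, bins))
    for s in range(0, f.size, 4096):  # stream voxels in chunks
        wf = np.exp(-0.5 * ((fb[s:s + 4096, None] - centers) / sigma) ** 2)
        wm = np.exp(-0.5 * ((mb[s:s + 4096, None] - centers) / sigma) ** 2)
        wf /= wf.sum(1, keepdims=True)
        wm /= wm.sum(1, keepdims=True)
        joint += wf.T @ wm  # accumulate joint histogram, discard weights
    joint /= joint.sum()
    pf, pm = joint.sum(1), joint.sum(0)
    nz = joint > 0
    return float((joint[nz]
                  * np.log(joint[nz] / (pf[:, None] * pm[None, :])[nz])).sum())
```

As a sanity check, the mutual information of an image with itself exceeds that of two independent noise images.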
Distributed framework for multimodal gigavoxel image registration

The authors develop a distributed optimization framework that includes Grid Parallel (GP) for boundary-synchronized tensor sharding, a distributed ring sampler for memory-efficient interpolation across GPUs, and distributed loss functions, enabling registration of images with over 11 billion parameters across multiple GPUs.

10 retrieved papers
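The routing idea behind a distributed ring sampler can be sketched in one dimension with plain NumPy. This is a hedged simulation of the concept, not the paper's implementation: each rank owns a contiguous shard, the global coordinate batch is (conceptually) passed around the ring, and every rank fills in the samples that land inside its shard, so no rank ever gathers the full image. Here the ring steps are simulated by a loop over ranks, and nearest-neighbour interpolation stands in for the real interpolator:

```python
import numpy as np

def ring_sample_1d(shards, coords):
    """Simulated ring sampler over a 1-D image sharded across ranks.

    Illustrative only: at ring step t, rank r would handle the batch
    originating at rank (r - t) % world; here a centralized loop over
    ranks plays all steps. Samples use nearest-neighbour lookup.
    """
    world = len(shards)
    sizes = [len(s) for s in shards]
    offsets = np.cumsum([0] + sizes[:-1])
    out = np.full(len(coords), np.nan)
    idx = np.rint(coords).astype(int)  # nearest-neighbour indices
    for r in range(world):
        lo, hi = offsets[r], offsets[r] + sizes[r]
        mine = (idx >= lo) & (idx < hi)   # samples this rank owns
        out[mine] = shards[r][idx[mine] - lo]
    return out
```

Splitting `np.arange(16.0)` into four shards and sampling at coordinates 0.2, 5.6, and 15.0 returns the values 0, 6, and 15 without any rank holding more than its own quarter of the image.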
Grid Parallel abstraction for convolution-aware tensor sharding

The authors introduce Grid Parallel as a new parallelism technique that complements existing model parallelism methods by enabling boundary synchronization between sharded tensors, which is necessary for convolutional operations in image registration but not addressed by transformer-focused parallelism strategies.

10 retrieved papers
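The boundary-synchronization requirement that motivates Grid Parallel can be made concrete with a minimal halo-exchange sketch in NumPy, again a hedged illustration of the general technique rather than the authors' abstraction. Each rank holds a contiguous shard plus a halo of neighbouring boundary values, so a size-3 convolution is exact at shard boundaries without gathering the whole tensor (the halo fetch is simulated here by indexing the global array):

```python
import numpy as np

def conv1d_same(x, k):
    # reference: 'same' correlation with zero padding
    pad = len(k) // 2
    xp = np.pad(x, pad)
    return np.array([np.dot(xp[i:i + len(k)], k) for i in range(len(x))])

def grid_parallel_conv(x, k, world=4):
    """Boundary-synchronized sharded convolution (illustrative sketch).

    Each rank pads its shard with halo values from its neighbours
    (zeros at the global edges, matching zero padding), then convolves
    locally; concatenating the local outputs reproduces the global
    result exactly.
    """
    halo = len(k) // 2
    shards = np.array_split(x, world)
    outs, start = [], 0
    for s in shards:
        stop = start + len(s)
        # halo exchange, simulated by reading the neighbours' boundaries
        left = x[max(start - halo, 0):start]
        left = np.pad(left, (halo - len(left), 0))
        right = x[stop:stop + halo]
        right = np.pad(right, (0, halo - len(right)))
        local = np.concatenate([left, s, right])
        outs.append(np.array([np.dot(local[i:i + len(k)], k)
                              for i in range(len(s))]))
        start = stop
    return np.concatenate(outs)
```

Without the halo exchange, each shard would compute wrong values at its edges; with it, the sharded result matches the unsharded convolution element for element, which is the property transformer-focused parallelism schemes do not provide.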

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
