A Scalable Distributed Framework for Multimodal GigaVoxel Image Registration
Claimed Contributions
The authors introduce specialized fused CUDA kernels (composite implicit grid sampler, implicit Parzen windowing for mutual information, and fused cross-correlation) that minimize high-bandwidth memory usage by computing intermediate variables in registers or shared memory rather than materializing them in global memory, enabling problems up to 64× larger on a single GPU.
The authors develop a distributed optimization framework that includes Grid Parallel (GP) for boundary-synchronized tensor sharding, a distributed ring sampler for memory-efficient interpolation across GPUs, and distributed loss functions, enabling registration of images with over 11 billion parameters across multiple GPUs.
The authors introduce Grid Parallel as a new parallelism technique that complements existing model parallelism methods by enabling boundary synchronization between sharded tensors, which is necessary for convolutional operations in image registration but not addressed by transformer-focused parallelism strategies.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[22] Efficient Large Scale Multimodal Image Registration
Contribution Analysis
Detailed comparisons for each claimed contribution
IO-aware non-GEMM fused kernels for image registration
The authors introduce specialized fused CUDA kernels (composite implicit grid sampler, implicit Parzen windowing for mutual information, and fused cross-correlation) that minimize high-bandwidth memory usage by computing intermediate variables in registers or shared memory rather than materializing them in global memory, enabling problems up to 64× larger on a single GPU.
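The idea of computing intermediates on the fly rather than storing them can be illustrated with the mutual-information case. The sketch below is plain Python, not the authors' CUDA kernel, and only mirrors the accumulation pattern: each voxel's Parzen window weights are computed, folded into the joint histogram, and discarded, so no per-voxel weight array is ever stored. The Gaussian window, the `sigma` and `num_bins` values, intensities normalized to [0, 1], and all function names are assumptions made for illustration.

```python
import math

def parzen_weights(value, num_bins, sigma):
    """Normalized Gaussian window weights of one intensity over the bin centers."""
    centers = [(b + 0.5) / num_bins for b in range(num_bins)]
    w = [math.exp(-0.5 * ((value - c) / sigma) ** 2) for c in centers]
    total = sum(w)
    return [x / total for x in w]

def joint_hist_streaming(fixed, moving, num_bins=8, sigma=0.1):
    """Accumulate the joint histogram voxel by voxel.

    Only one pair of weight vectors is alive at a time; the
    (num_voxels x num_bins) weight arrays are never materialized.
    """
    hist = [[0.0] * num_bins for _ in range(num_bins)]
    for f, m in zip(fixed, moving):
        wf = parzen_weights(f, num_bins, sigma)  # short-lived, like a register value
        wm = parzen_weights(m, num_bins, sigma)
        for i in range(num_bins):
            for j in range(num_bins):
                hist[i][j] += wf[i] * wm[j]
    n = len(fixed)
    return [[h / n for h in row] for row in hist]

def mutual_information(p):
    """Mutual information (in nats) of a joint probability table."""
    px = [sum(row) for row in p]
    py = [sum(col) for col in zip(*p)]
    mi = 0.0
    for i, row in enumerate(p):
        for j, pij in enumerate(row):
            if pij > 0.0:
                mi += pij * math.log(pij / (px[i] * py[j]))
    return mi
```

Working memory here is the `num_bins` × `num_bins` histogram plus two short weight vectors, independent of the number of voxels. A fused GPU kernel would presumably hold the equivalent state in registers or shared memory, which is the saving the contribution describes.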
[51] RoMa v2: Harder Better Faster Denser Feature Matching
[52] Exploring HW/SW co-optimizations for accelerating large-scale texture identification on distributed GPUs
[53] High-speed CUDA Algorithm for GPU-based Feature Extraction in Iris Images
[54] Speeding up mutual information computation using NVIDIA CUDA hardware
[55] Optimizing memory usage and accesses on CUDA-based recurrent pattern matching image compression
[56] Fast organization of large photo collections using CUDA
[57] Efficient Implementation of Nonrigid Registration Methods on commodity Hardware with CUDA
Distributed framework for multimodal gigavoxel image registration
The authors develop a distributed optimization framework that includes Grid Parallel (GP) for boundary-synchronized tensor sharding, a distributed ring sampler for memory-efficient interpolation across GPUs, and distributed loss functions, enabling registration of images with over 11 billion parameters across multiple GPUs.
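The summary does not spell out how the distributed ring sampler works, so the following is only one plausible reading, simulated in a single process: every rank keeps its own sample coordinates, image shards circulate around a ring, and each rank adds the interpolation taps that fall inside the shard currently visiting it. After one full pass the result matches sampling the whole image, although no rank ever held more than one shard. The 1D linear interpolation, zero padding outside the image, equal-length shards, and the names `ring_sample` and `linear_sample` are all assumptions.

```python
import math

def linear_sample(image, x):
    """Reference: 1D linear interpolation with zero padding outside the image."""
    i0 = math.floor(x)
    t = x - i0
    total = 0.0
    for idx, w in ((i0, 1.0 - t), (i0 + 1, t)):
        if 0 <= idx < len(image):
            total += w * image[idx]
    return total

def ring_sample(shards, coords_per_rank):
    """Simulate a ring pass over equal-length contiguous image shards.

    shards[r] is the image slice owned by rank r; coords_per_rank[r]
    holds the global sample coordinates that rank r must evaluate.
    """
    num_ranks = len(shards)
    shard_len = len(shards[0])  # assumption: all shards have this length
    out = [[0.0] * len(c) for c in coords_per_rank]
    for step in range(num_ranks):
        for rank in range(num_ranks):
            src = (rank + step) % num_ranks  # shard visiting this rank at this step
            lo = src * shard_len             # global index of the shard's first voxel
            shard = shards[src]
            for k, x in enumerate(coords_per_rank[rank]):
                i0 = math.floor(x)
                t = x - i0
                # Each tap is counted once, when the shard that owns it arrives.
                for idx, w in ((i0, 1.0 - t), (i0 + 1, t)):
                    if lo <= idx < lo + shard_len:
                        out[rank][k] += w * shard[idx - lo]
    return out
```

In a real multi-GPU setting the inner `rank` loop would run concurrently and the `src` shard would arrive via point-to-point communication; the sketch only checks that per-shard accumulation reproduces the full-image result.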
[58] Recurrent Inference Machine for Medical Image Registration
[59] Multimodal continuous ant colony optimization for multisensor remote sensing image registration with local search
[60] Multimodal Medical Image Fusion Using a Progressive Parallel Strategy Based on Deep Learning
[61] Fast computation of mutual information in the frequency domain with applications to global multimodal image alignment
[62] M4oE: A Foundation Model for Medical Multimodal Image Segmentation with Mixture of Experts
[63] RA-SIFA: Unsupervised domain adaptation multi-modality cardiac segmentation network combining parallel attention module and residual attention unit
[64] Distributed-memory large deformation diffeomorphic 3D image registration
[65] A Robust Multi-Modal Wide-Field Satellite Image Registration Algorithm Based on Weighted Random Partition Optimization
[66] Parallel computation of mutual information on the GPU with application to real-time registration of 3D medical images
[67] A fully parallel algorithm for multimodal image registration using normalized gradient fields
Grid Parallel abstraction for convolution-aware tensor sharding
The authors introduce Grid Parallel as a new parallelism technique that complements existing model parallelism methods by enabling boundary synchronization between sharded tensors, which is necessary for convolutional operations in image registration but not addressed by transformer-focused parallelism strategies.
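Why convolutions need boundary synchronization can be seen in a small 1D sketch. Each shard is extended with a halo of `radius` values copied from its neighbours before the local convolution runs; without that exchange, outputs near shard edges would miss contributions from the adjacent shard. This is a generic halo-exchange illustration under stated assumptions (zero padding at the outer edges, an odd-length kernel, shards at least `radius` long, hypothetical function names), not the authors' Grid Parallel implementation.

```python
def conv1d_same(signal, kernel):
    """Reference: zero-padded 'same' correlation with an odd-length kernel."""
    r = len(kernel) // 2
    padded = [0.0] * r + list(signal) + [0.0] * r
    return [sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
            for i in range(len(signal))]

def exchange_halos(shards, radius):
    """Extend every shard with `radius` boundary values from each neighbour.

    Outer edges receive zeros. Assumes each shard holds at least `radius` values.
    """
    extended = []
    for r, shard in enumerate(shards):
        if r > 0:
            prev = shards[r - 1]
            left = list(prev[len(prev) - radius:])  # neighbour's trailing boundary
        else:
            left = [0.0] * radius
        if r < len(shards) - 1:
            right = list(shards[r + 1][:radius])    # neighbour's leading boundary
        else:
            right = [0.0] * radius
        extended.append(left + list(shard) + right)
    return extended

def sharded_conv1d(shards, kernel):
    """Convolve each halo-extended shard locally; outputs keep the shard layout."""
    radius = len(kernel) // 2
    outputs = []
    for ext in exchange_halos(shards, radius):
        n = len(ext) - 2 * radius  # number of voxels this shard owns
        outputs.append([sum(kernel[j] * ext[i + j] for j in range(len(kernel)))
                        for i in range(n)])
    return outputs
```

Concatenating the per-shard outputs reproduces `conv1d_same` on the unsharded signal. Only `radius` values per boundary cross between shards, which is what keeps the communication volume small compared with gathering the full tensor.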