Large Scale Ultrasound Simulations on Accelerated Clusters

Filip Vaverka, supervised by Jiri Jaros
Faculty of Information Technology, Brno University of Technology, Centre of Excellence IT4Innovations, CZ

Overview

The synergy between advancements in medicine, ultrasound modeling, and high performance computing leads to the emergence of many new applications of biomedical ultrasound. Applications such as HIFU treatment planning, (photo-)acoustic imaging, and transcranial ultrasound therapy require accurate large-scale ultrasound simulations, while significant pressure is put on the cost and time needed to compute these simulations. The focal point of the present thesis is the development of k-space pseudo-spectral time-domain (KSTD) methods scalable across a variety of accelerated cluster architectures and dense compute nodes.

Ultrasound Wave Propagation Model

Ultrasound wave propagation in soft tissue can be understood as wave propagation in a fluid, heterogeneous, and absorbing medium with weak non-linear effects. The governing acoustic equations for such a problem can be written as:

\[ \frac{\partial \mathbf{u}}{\partial t} = -\frac{1}{\rho_0} \nabla p + \mathbf{S}_\mathrm{F} \qquad \text{(momentum conservation)} \]

\[ \frac{\partial \rho}{\partial t} = -(2\rho + \rho_0)\, \nabla \cdot \mathbf{u} - \mathbf{u} \cdot \nabla \rho_0 + S_\mathrm{M} \qquad \text{(mass conservation)} \]

\[ p = c_0^2 \left( \rho + \mathbf{d} \cdot \nabla \rho_0 + \frac{B}{2A} \frac{\rho^2}{\rho_0} - L\rho \right) \qquad \text{(pressure-density relation)} \]

where u is the acoustic particle velocity, d the acoustic particle displacement, p the acoustic pressure, ρ and ρ₀ the acoustic and ambient densities, c₀ the ambient sound speed, B/A the parameter of non-linearity, L the absorption operator, and S_F, S_M the force and mass source terms.

The KSTD method is a highly efficient approach to discretizing this model. It uses the known analytic solution of the homogeneous variant of the problem to correct for time-stepping errors, thus allowing for simple second-order time stepping.

Local Fourier Basis

However, implementing the KSTD method on distributed-memory architectures is tricky due to the global nature of spectral methods. Efficient decomposition on modern accelerated clusters is achieved by using a local Fourier basis (LFB) decomposition. This approach limits the communication to nearest neighbors.

[Figure: the domain is split into subdomains; each subdomain holds its local data plus an overlap region in which a bell function tapers the data so that the subdomain becomes smoothly periodic.]

Local Fourier Basis Accuracy

When the gradient is not calculated on the whole domain, a numerical error is introduced. The error level can be controlled by the shape of the bell function and the size of the overlap region.

[Figure: L∞ error versus overlap size (2 to 32 grid points) for an erf bell and an optimised bell (left); minimum overlap size required to keep the L∞ error below 1e-3 or 1e-4 as a function of the number of subdomain cuts crossed (right).]

Traditional CPU Cluster (MPI + OpenMP)

Consider a traditional CPU-based cluster where each compute node consists of two Intel Xeon E5-2680v3 CPUs and nodes are connected in a 7D hypercube topology with FDR InfiniBand interconnect.

[Figure: weak scaling of the global (left, domain sizes 2^21 to 2^28 grid points) and local (right, domain sizes 2^18 to 2^25 grid points) simulation on Salomon; execution time per time step [ms] versus number of CPUs (×12 cores).]

The transition from the global (left) to the local (right) KSTD method offers up to a 5× speedup while the same number of CPUs is used. This is in significant part due to the reduction in the order of communication complexity, as all-to-all communication is reduced to nearest-neighbor communication; a minimal sketch of this exchange pattern follows.
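The thesis implementations exchange 3-D overlap regions between distributed subdomains; the following 1-D mpi4py sketch only illustrates the nearest-neighbor communication pattern itself. The overlap width, local array size, and dummy contents are illustrative assumptions, not values from the thesis.

```python
# Minimal sketch of the nearest-neighbor overlap (halo) exchange implied
# by the LFB decomposition, in 1-D with blocking MPI calls.
# Run with e.g.: mpirun -np 4 python halo_exchange.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.rank, comm.size

overlap = 8       # points shared with each neighbor (assumed)
local_n = 64      # interior points owned by this rank (assumed)

# Local array with halo regions on both sides of the interior.
u = np.zeros(local_n + 2 * overlap)
u[overlap:-overlap] = rank + 1.0  # dummy interior data

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# Send the right edge of the interior to the right neighbor while
# receiving the left halo, then the mirror image -- this is the only
# communication the local method needs per gradient evaluation.
comm.Sendrecv(sendbuf=u[-2 * overlap:-overlap], dest=right,
              recvbuf=u[:overlap], source=left)
comm.Sendrecv(sendbuf=u[overlap:2 * overlap], dest=left,
              recvbuf=u[-overlap:], source=right)
```

Because each rank only ever communicates with its two immediate neighbors, the per-step communication volume is independent of the total number of ranks, in contrast to the all-to-all transpose required by the global method.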
Multi-GPU Dense Nodes (MPI + CUDA + P2P)

An Nvidia DGX-2 compute node contains 16 Nvidia V100 GPUs, each with 32 GB of HBM2 memory at 900 GB/s. All 16 GPUs are connected together via NVLink 2.0, which provides 300 GB/s of connectivity to each GPU and a bisection bandwidth of 2.4 TB/s.

[Figure: strong scaling of the simulation on the DGX-2 for domain sizes 2^24 to 2^31 grid points; execution time per time step [ms] versus number of GPUs (left). Breakdown of the time step into computation, FFTs, and overlap exchange versus domain size (right).]

GPU Accelerated Cluster (MPI + CUDA)

[Figure: Piz Daint architecture; each node pairs a CPU (RAM) with one P100 GPU (HBM) over PCI-E, and nodes are connected through Aries ASICs in a Dragonfly topology.]

[Figure: strong scaling of the simulation on Piz Daint for domain sizes 2^24 to 2^36 grid points; execution time per time step [ms] versus number of GPUs (1 GPU per node, left). Breakdown of the time step into computation and overlap exchange versus domain size (right).]

Impact and Outlook

The LFB decomposition allows for an efficient implementation of KSTD ultrasound wave propagation simulation codes that is scalable on a wide range of cluster architectures. The practical implementation of the described approach showed that:

• Up to a 5× speedup was achieved on a traditional CPU-based cluster.
• The proposed method scales to simulations with over 7 × 10^10 unknowns across 1024 GPUs while maintaining efficiency over 50%.
• The method is able to take advantage of the high-speed interconnects in multi-GPU nodes to achieve efficiency over 65%.

In the realm of personalized ultrasound medical procedures, these advancements amount to:

• A 6× speedup and a 10× cost reduction for typical photoacoustic tomography image reconstruction, achieved by enabling efficient use of multi-GPU servers such as the Nvidia DGX-2.
• A 17× price reduction for simulations used in HIFU treatment planning.

Conclusion

The presented approach shows how the use of the LFB decomposition in the context of KSTD ultrasound wave propagation models allows these models to scale efficiently to GPU-accelerated nodes and clusters. This leap in efficiency and reduction in cost allow for a wider spread of novel ultrasound medical procedures such as HIFU and photoacoustic tomography. The method can also be extended to similar models, such as wave propagation in elastic media.

k-Wave: A MATLAB toolbox for the time-domain simulation of acoustic wave fields.

Acknowledgements. This work was supported by The Ministry of Education, Youth and Sports from the National Programme of Sustainability (NPU II) project "IT4Innovations excellence in science - LQ1602" and by the IT4Innovations infrastructure, which is supported from the Large Infrastructures for Research, Experimental Development and Innovations project "IT4Innovations National Supercomputing Center - LM2015070". This project has received funding from the European Union's Horizon 2020 research and innovation programme H2020 ICT 2016-2017 under grant agreement No 732411 and is an initiative of the Photonics Public Private Partnership.
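As a closing illustration of the k-space correction described in the model section, the following self-contained NumPy sketch compares a plain pseudo-spectral leapfrog scheme with its k-space corrected variant on the 1-D homogeneous wave equation, where the correction makes the time stepping exact. The grid size, time step, pulse width, and sound speed are illustrative assumptions only, not the thesis configuration.

```python
# 1-D homogeneous wave equation p_tt = c^2 p_xx on a periodic domain,
# solved with spectral derivatives and second-order leapfrog stepping.
# The k-space correction kappa = sinc(c k dt / 2) removes the
# time-stepping error entirely in the homogeneous case.
import numpy as np

N, L, c = 256, 1.0, 1500.0
dx = L / N
dt = 0.5 * dx / c                          # deliberately coarse time step
x = np.arange(N) * dx
k = 2.0 * np.pi * np.fft.fftfreq(N, d=dx)  # spectral wavenumbers

# np.sinc(x) = sin(pi x)/(pi x), hence the division by pi.
kappa = np.sinc(c * k * dt / (2.0 * np.pi))

p0 = np.exp(-(((x - L / 2) / (0.02 * L)) ** 2))  # Gaussian pulse at rest

def run(correction, steps):
    """Leapfrog stepping with a spectral Laplacian scaled by `correction`."""
    lap = -(k * correction) ** 2           # Laplacian symbol in k-space
    p = p0.copy()
    # Exact p(-dt) for a pulse initially at rest (spectral propagator).
    p_prev = np.real(np.fft.ifft(np.fft.fft(p0) * np.cos(c * k * dt)))
    for _ in range(steps):
        p_xx = np.real(np.fft.ifft(lap * np.fft.fft(p)))
        p, p_prev = 2.0 * p - p_prev + (c * dt) ** 2 * p_xx, p
    return p

steps = 400
exact = np.real(np.fft.ifft(np.fft.fft(p0) * np.cos(c * k * steps * dt)))
for name, corr in [("plain PSTD", np.ones_like(k)), ("k-space PSTD", kappa)]:
    err = np.max(np.abs(run(corr, steps) - exact))
    print(f"{name:12s}  L_inf error after {steps} steps: {err:.3e}")
```

With the correction in place, the dispersion error of the leapfrog update vanishes for a homogeneous medium, which is what permits the simple second-order time stepping mentioned in the model section; in heterogeneous media the correction is no longer exact but still markedly reduces the time-stepping error.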