Porting Scientific Research Codes to GPUs with CUDA Fortran: Incompressible Fluid Dynamics using the Immersed Boundary Method
Josh Romero, Massimiliano Fatica - NVIDIA
Vamsi Spandan, Roberto Verzicco - Physics of Fluids, University of Twente
HPC Advisory Council Workshop, Stanford, CA, February 2018
Outline
● Introduction and Motivation
● Solver Details
● GPU implementation in CUDA Fortran
● Benchmarking and Results
● Conclusions
Introduction and Motivation
● Increased availability of GPU compute resources:
  ○ Explosion of interest in Machine Learning
  ○ Focus on energy efficiency for exascale
● Lots of choices to make:
  ○ OpenACC vs. CUDA
  ○ CUDA C vs. CUDA Fortran
● Getting existing Fortran codes up and running on GPUs can be easy if you use the right tools
● This talk focuses on getting up and running with low effort
Solver Details
● Incompressible CFD solver for DNS computations in structured domains
● IB + structural solver using method described in [1]
○ Immersed interface contributes forcing term to fluid
○ Interface structural dynamics treated as a triangulated network of springs (a generic example is sketched below)
[1] Spandan et al., Journal of Computational Physics, 2017
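As a rough illustration of the spring-network idea, the sketch below accumulates generic Hookean forces along the edges of a triangulated interface. This is not the specific scheme from [1]; the routine name, argument layout, and the simple linear spring law are assumptions for illustration.

! Hypothetical sketch: Hookean forces along the edges of a triangulated
! interface; the actual structural model is described in Spandan et al. [1].
subroutine spring_forces(nedges, edge, x, l0, k, f)
  implicit none
  integer, intent(in) :: nedges
  integer, intent(in) :: edge(2, nedges)  ! node index pairs forming each edge
  real, intent(in)    :: x(:, :)          ! node positions (3 x nnodes)
  real, intent(in)    :: l0(nedges)       ! rest length of each edge
  real, intent(in)    :: k                ! spring stiffness
  real, intent(inout) :: f(:, :)          ! accumulated nodal forces (3 x nnodes)
  integer :: e, i, j
  real :: d(3), dlen
  do e = 1, nedges
    i = edge(1, e); j = edge(2, e)
    d = x(:, j) - x(:, i)
    dlen = sqrt(sum(d * d))
    ! Hooke's law along the edge, relative to the rest length l0(e)
    f(:, i) = f(:, i) + k * (dlen - l0(e)) * d / dlen
    f(:, j) = f(:, j) - k * (dlen - l0(e)) * d / dlen
  end do
end subroutine spring_forces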
Solver Details
[Flow diagram: Initialize Solver, then a Timestep Loop containing an RK Loop of: Compute RK step → Compute IB forcing term → Structural update]
GPU Implementation in CUDA Fortran
CUDA Fortran
● Baseline CPU code is written in Fortran, so the natural choice for a GPU port is CUDA Fortran
● Benefits:
  ○ More control than OpenACC:
    ■ Explicit GPU kernels written natively in Fortran are supported (see the sketch after this list)
    ■ Full control of host/device data movement
  ○ Directive-based programming available via CUF kernels
  ○ Easier to maintain than mixed CUDA C and Fortran approaches
● Requires PGI compiler (community edition available now for free)
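As a minimal sketch of writing an explicit kernel natively in Fortran, consider a hypothetical saxpy (not from the solver; all names here are illustrative). It shows device code in Fortran, explicit host/device data movement through device arrays, and a chevron kernel launch.

module saxpy_mod
  use cudafor
contains
  ! Explicit GPU kernel written natively in Fortran
  attributes(global) subroutine saxpy(n, a, x, y)
    integer, value :: n
    real, value :: a
    real :: x(n), y(n)
    integer :: i
    i = threadIdx%x + (blockIdx%x - 1) * blockDim%x
    if (i <= n) y(i) = a * x(i) + y(i)
  end subroutine saxpy
end module saxpy_mod

program test_saxpy
  use cudafor
  use saxpy_mod
  implicit none
  integer, parameter :: n = 1024
  real :: x(n), y(n)
  real, device :: x_d(n), y_d(n)   ! device arrays; data movement is explicit
  x = 1.0; y = 2.0
  x_d = x; y_d = y                 ! host-to-device copies via assignment
  call saxpy<<<(n + 255) / 256, 256>>>(n, 2.0, x_d, y_d)
  y = y_d                          ! device-to-host copy
  print *, 'max error:', maxval(abs(y - 4.0))
end program test_saxpy

With PGI, saving this as a .cuf file enables CUDA Fortran automatically (or compile with -Mcuda).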
Profiling with NVPROF + NVVP + NVTX
● NVPROF:
  ○ Can be used to gather detailed kernel properties and timing information
● NVIDIA Visual Profiler (NVVP):
  ○ Graphical interface to visualize and analyze NVPROF-generated profiles
  ○ Does not show CPU activity out of the box
● NVIDIA Tools EXtension (NVTX) markers:
  ○ Enable annotation with labeled ranges within the program
  ○ Useful for categorizing parts of the profile to put activity into context
  ○ Can be used to visualize normally hidden CPU activity (e.g. MPI communication); see the binding sketch after this list
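A minimal sketch of calling NVTX from Fortran through ISO C bindings, along the lines of NVIDIA's published nvtx module for CUDA Fortran. Note the C function nvtxRangePushA actually returns an int, which this binding ignores; link with -lnvToolsExt.

module nvtx
  use iso_c_binding
  implicit none
  interface
    ! Push a labeled range onto the NVTX stack (C return value ignored)
    subroutine nvtxRangePushA(name) bind(C, name='nvtxRangePushA')
      use iso_c_binding
      character(kind=c_char) :: name(*)
    end subroutine nvtxRangePushA
    ! Pop the most recently pushed range
    subroutine nvtxRangePop() bind(C, name='nvtxRangePop')
    end subroutine nvtxRangePop
  end interface
end module nvtx

Typical usage brackets a region of interest, e.g. call nvtxRangePushA('halo exchange'//c_null_char) before an MPI exchange and call nvtxRangePop() after it; the labeled range then appears in NVVP when the profile is collected with, for example, nvprof -o profile.nvvp ./app.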
NVIDIA Visual Profiler with NVTX Markers
GPU Porting of Key Computational Routines
● In many CFD (and similar) codes, common code patterns appear:
○ Tightly-nested loop computations (computation of derivatives using stencils)
○ Common mathematical computations (Fourier transforms, matrix algebra), well served by CUDA-enabled libraries (see the cuFFT sketch after this list)
● But there are also unique patterns specific to a given application:
○ Computation of IB forcing on flow field
○ Computation of interface structural forces
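For the library-shaped patterns, PGI provides Fortran interface modules to the CUDA libraries. Below is a minimal cuFFT sketch (hypothetical size, input left uninitialized, error checking omitted); compile with -Mcudalib=cufft.

program fft_sketch
  use cudafor
  use cufft
  implicit none
  integer, parameter :: n = 256
  complex, device :: a_d(n), b_d(n)
  integer :: plan, ierr
  ! Plan and run a single-precision complex-to-complex forward transform
  ierr = cufftPlan1D(plan, n, CUFFT_C2C, 1)
  ierr = cufftExecC2C(plan, a_d, b_d, CUFFT_FORWARD)
  ierr = cufftDestroy(plan)
end program fft_sketch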
Case 1: Tightly-nested loops
Consider the original CPU subroutine to compute the divergence.
subroutine divg
  use param
  use local_arrays, only: q1, q2, q3, &
                          dph, jpv, ipv, &
                          udx3m
  ...
  do kc = kstart,kend
    do jc = 1,n2m
      do ic = 1,n1m
        kp = kc+1; jp = jpv(jc); ip = ipv(ic)
        dqcap = (q1(ip,jc,kc) - q1(ic,jc,kc)) * dx1 &
               +(q2(ic,jp,kc) - q2(ic,jc,kc)) * dx2 &
               +(q3(ic,jc,kp) - q3(ic,jc,kc)) * udx3m(kc)
        dph(ic,jc,kc) = dqcap*usdtal
      enddo
    enddo
  enddo
end subroutine divg
Case 1: Tightly-nested loops
Now, consider the version for GPU using CUF kernel directives.
subroutine divg
  use param
  use local_arrays, only: q1=>q1_d, q2=>q2_d, q3=>q3_d, &
                          dph=>dph_d, jpv=>jpv_d, ipv=>ipv_d, &
                          udx3m=>udx3m_d
  ...
  !$cuf kernel do(3)
  do kc = kstart,kend
    do jc = 1,n2m
      do ic = 1,n1m
        kp = kc+1; jp = jpv(jc); ip = ipv(ic)
        dqcap = (q1(ip,jc,kc) - q1(ic,jc,kc)) * dx1 &
               +(q2(ic,jp,kc) - q2(ic,jc,kc)) * dx2 &
               +(q3(ic,jc,kp) - q3(ic,jc,kc)) * udx3m(kc)
        dph(ic,jc,kc) = dqcap*usdtal
      enddo
    enddo
  enddo
end subroutine divg
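Note that relative to the CPU version, only the !$cuf kernel directive and the module-level renames change; the loop body is untouched. For the renames to work, local_arrays would hold device copies alongside the host arrays, sketched here with assumed names and shapes:

module local_arrays
  ! Host arrays, as used by the CPU code
  real, allocatable, dimension(:,:,:) :: q1, q2, q3, dph
  integer, allocatable, dimension(:) :: jpv, ipv
  real, allocatable, dimension(:)    :: udx3m
  ! Device copies (assumed naming); with q1=>q1_d etc. in the use
  ! statement, the compute loops need no further changes
  real, allocatable, dimension(:,:,:), device :: q1_d, q2_d, q3_d, dph_d
  integer, allocatable, dimension(:), device  :: jpv_d, ipv_d
  real, allocatable, dimension(:), device     :: udx3m_d
end module local_arrays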
Conclusions
● Porting research codes to GPUs is worth the investment
  ○ Faster runtimes enable larger cases and more rapid experimentation
● Large performance gains can be achieved with low effort using CUDA Fortran
  ○ CUF kernel directives
  ○ CUDA-enabled libraries
  ○ Custom kernels when all else fails
● Working with developers to apply current code to challenging research cases
● Some previous work with these developers can be found on GitHub: https://github.com/PhysicsofFluids/AFiD_GPU_opensource