Porting Scientific Research Codes to GPUs with CUDA Fortran: Incompressible Fluid Dynamics using the Immersed Boundary Method
Josh Romero, Massimiliano Fatica - NVIDIA
Vamsi Spandan, Roberto Verzicco - Physics of Fluids, University of Twente
HPC Advisory Council Workshop, Stanford, CA, February 2018
Outline
● Introduction and Motivation
● Solver Details
● GPU implementation in CUDA Fortran
● Benchmarking and Results
● Conclusions
Introduction and Motivation
● Increased availability of GPU compute resources:
  ○ Explosion of interest in Machine Learning
  ○ Focus on energy efficiency for exascale
● Lots of choices to make:
  ○ OpenACC vs. CUDA
  ○ CUDA C vs. CUDA Fortran
● Getting existing Fortran codes up and running on GPUs can be easy if you use the right tools
● This talk focuses on getting up and running with low effort
Solver Details
● Incompressible CFD solver for DNS computations in structured domains
● IB + structural solver using method described in [1]
○ Immersed interface contributes forcing term to fluid
○ Interface structural dynamics treated as a triangulated network of springs (a generic example is sketched below)
[1] Spandan et al., Journal of Computational Physics, 2017
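As a rough illustration of the spring-network idea, the sketch below accumulates generic Hookean forces along the edges of a triangulated interface. This is not the specific scheme from [1]; the routine name, argument layout, and the simple linear spring law are assumptions for illustration.

! Hypothetical sketch: Hookean forces along the edges of a triangulated
! interface; the actual structural model is described in Spandan et al. [1].
subroutine spring_forces(nedges, edge, x, l0, k, f)
  implicit none
  integer, intent(in) :: nedges
  integer, intent(in) :: edge(2, nedges)  ! node index pairs forming each edge
  real, intent(in)    :: x(:, :)          ! node positions (3 x nnodes)
  real, intent(in)    :: l0(nedges)       ! rest length of each edge
  real, intent(in)    :: k                ! spring stiffness
  real, intent(inout) :: f(:, :)          ! accumulated nodal forces (3 x nnodes)
  integer :: e, i, j
  real :: d(3), dlen
  do e = 1, nedges
    i = edge(1, e); j = edge(2, e)
    d = x(:, j) - x(:, i)
    dlen = sqrt(sum(d * d))
    ! Hooke's law along the edge, relative to the rest length l0(e)
    f(:, i) = f(:, i) + k * (dlen - l0(e)) * d / dlen
    f(:, j) = f(:, j) - k * (dlen - l0(e)) * d / dlen
  end do
end subroutine spring_forces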
Solver Details
[Flow diagram: Initialize Solver, then a Timestep Loop containing an RK Loop of: Compute RK step → Compute IB forcing term → Structural update]
GPU Implementation in CUDA Fortran
CUDA Fortran
● Baseline CPU code is written in Fortran, so the natural choice for a GPU port is CUDA Fortran
● Benefits:
  ○ More control than OpenACC:
    ■ Explicit GPU kernels written natively in Fortran are supported (see the sketch after this list)
    ■ Full control of host/device data movement
  ○ Directive-based programming available via CUF kernels
  ○ Easier to maintain than mixed CUDA C and Fortran approaches
● Requires PGI compiler (community edition available now for free)
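As a minimal sketch of writing an explicit kernel natively in Fortran, consider a hypothetical saxpy (not from the solver; all names here are illustrative). It shows device code in Fortran, explicit host/device data movement through device arrays, and a chevron kernel launch.

module saxpy_mod
  use cudafor
contains
  ! Explicit GPU kernel written natively in Fortran
  attributes(global) subroutine saxpy(n, a, x, y)
    integer, value :: n
    real, value :: a
    real :: x(n), y(n)
    integer :: i
    i = threadIdx%x + (blockIdx%x - 1) * blockDim%x
    if (i <= n) y(i) = a * x(i) + y(i)
  end subroutine saxpy
end module saxpy_mod

program test_saxpy
  use cudafor
  use saxpy_mod
  implicit none
  integer, parameter :: n = 1024
  real :: x(n), y(n)
  real, device :: x_d(n), y_d(n)   ! device arrays; data movement is explicit
  x = 1.0; y = 2.0
  x_d = x; y_d = y                 ! host-to-device copies via assignment
  call saxpy<<<(n + 255) / 256, 256>>>(n, 2.0, x_d, y_d)
  y = y_d                          ! device-to-host copy
  print *, 'max error:', maxval(abs(y - 4.0))
end program test_saxpy

With PGI, saving this as a .cuf file enables CUDA Fortran automatically (or compile with -Mcuda).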
Profiling with NVPROF + NVVP + NVTX
● NVPROF:
  ○ Can be used to gather detailed kernel properties and timing information
● NVIDIA Visual Profiler (NVVP):
  ○ Graphical interface to visualize and analyze NVPROF-generated profiles
  ○ Does not show CPU activity out of the box
● NVIDIA Tools EXtension (NVTX) markers:
  ○ Enable annotation with labeled ranges within the program
  ○ Useful for categorizing parts of the profile to put activity into context
  ○ Can be used to visualize normally hidden CPU activity (e.g. MPI communication); see the binding sketch after this list
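A minimal sketch of calling NVTX from Fortran through ISO C bindings, along the lines of NVIDIA's published nvtx module for CUDA Fortran. Note the C function nvtxRangePushA actually returns an int, which this binding ignores; link with -lnvToolsExt.

module nvtx
  use iso_c_binding
  implicit none
  interface
    ! Push a labeled range onto the NVTX stack (C return value ignored)
    subroutine nvtxRangePushA(name) bind(C, name='nvtxRangePushA')
      use iso_c_binding
      character(kind=c_char) :: name(*)
    end subroutine nvtxRangePushA
    ! Pop the most recently pushed range
    subroutine nvtxRangePop() bind(C, name='nvtxRangePop')
    end subroutine nvtxRangePop
  end interface
end module nvtx

Typical usage brackets a region of interest, e.g. call nvtxRangePushA('halo exchange'//c_null_char) before an MPI exchange and call nvtxRangePop() after it; the labeled range then appears in NVVP when the profile is collected with, for example, nvprof -o profile.nvvp ./app.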
NVIDIA Visual Profiler with NVTX Markers
GPU Porting of Key Computational Routines
● In many CFD (and similar) codes, common code patterns appear:
○ Tightly-nested loop computations (computation of derivatives using stencils)
○ Common mathematical computations (Fourier transforms, matrix algebra), well served by CUDA-enabled libraries (see the cuFFT sketch after this list)
● But there are also unique patterns specific to a given application:
○ Computation of IB forcing on flow field
○ Computation of interface structural forces
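For the library-shaped patterns, PGI provides Fortran interface modules to the CUDA libraries. Below is a minimal cuFFT sketch (hypothetical size, input left uninitialized, error checking omitted); compile with -Mcudalib=cufft.

program fft_sketch
  use cudafor
  use cufft
  implicit none
  integer, parameter :: n = 256
  complex, device :: a_d(n), b_d(n)
  integer :: plan, ierr
  ! Plan and run a single-precision complex-to-complex forward transform
  ierr = cufftPlan1D(plan, n, CUFFT_C2C, 1)
  ierr = cufftExecC2C(plan, a_d, b_d, CUFFT_FORWARD)
  ierr = cufftDestroy(plan)
end program fft_sketch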
Case 1: Tightly-nested loops
Consider the original CPU subroutine to compute the divergence.
subroutine divg
  use param
  use local_arrays, only: q1, q2, q3, &
                          dph, jpv, ipv, &
                          udx3m
  ...
  do kc = kstart,kend
    do jc = 1,n2m
      do ic = 1,n1m
        kp = kc+1; jp = jpv(jc); ip = ipv(ic)
        dqcap = (q1(ip,jc,kc) - q1(ic,jc,kc)) * dx1 &
               +(q2(ic,jp,kc) - q2(ic,jc,kc)) * dx2 &
               +(q3(ic,jc,kp) - q3(ic,jc,kc)) * udx3m(kc)
        dph(ic,jc,kc) = dqcap*usdtal
      enddo
    enddo
  enddo
end subroutine divg
Case 1: Tightly-nested loops
Now, consider the version for GPU using CUF kernel directives.
subroutine divg
  use param
  use local_arrays, only: q1=>q1_d, q2=>q2_d, q3=>q3_d, &
                          dph=>dph_d, jpv=>jpv_d, ipv=>ipv_d, &
                          udx3m=>udx3m_d
  ...
  !$cuf kernel do(3)
  do kc = kstart,kend
    do jc = 1,n2m
      do ic = 1,n1m
        kp = kc+1; jp = jpv(jc); ip = ipv(ic)
        dqcap = (q1(ip,jc,kc) - q1(ic,jc,kc)) * dx1 &
               +(q2(ic,jp,kc) - q2(ic,jc,kc)) * dx2 &
               +(q3(ic,jc,kp) - q3(ic,jc,kc)) * udx3m(kc)
        dph(ic,jc,kc) = dqcap*usdtal
      enddo
    enddo
  enddo
end subroutine divg
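Note that relative to the CPU version, only the !$cuf kernel directive and the module-level renames change; the loop body is untouched. For the renames to work, local_arrays would hold device copies alongside the host arrays, sketched here with assumed names and shapes:

module local_arrays
  ! Host arrays, as used by the CPU code
  real, allocatable, dimension(:,:,:) :: q1, q2, q3, dph
  integer, allocatable, dimension(:) :: jpv, ipv
  real, allocatable, dimension(:)    :: udx3m
  ! Device copies (assumed naming); with q1=>q1_d etc. in the use
  ! statement, the compute loops need no further changes
  real, allocatable, dimension(:,:,:), device :: q1_d, q2_d, q3_d, dph_d
  integer, allocatable, dimension(:), device  :: jpv_d, ipv_d
  real, allocatable, dimension(:), device     :: udx3m_d
end module local_arrays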
Conclusions
● Porting research codes to GPUs is worth the investment
  ○ Faster runtimes enable larger cases and more rapid experimentation
● Large performance gains can be achieved with low effort using CUDA Fortran
  ○ CUF kernel directives
  ○ CUDA-enabled libraries
  ○ Custom kernels when all else fails
● Working with developers to apply current code to challenging research cases
● Some previous work with these developers can be found on GitHub: https://github.com/PhysicsofFluids/AFiD_GPU_opensource