WRF Physics Models Using GP-GPUs with CUDA Fortran
Youngtae Kim
Gangneung-Wonju National University
Nov 11, 2015
Agenda
1 Background
  1.1 The future of high performance computing
  1.2 GP-GPU
  1.3 CUDA Fortran
2 Implementation
  2.1 Fortran implementation of WRF physics
  2.2 Execution profile of WRF physics
  2.3 Implementation of the parallel programs
3 Performance
  3.1 Performance comparison
  3.2 Performance of WRF
4 Conclusions
1 Background 1.1 The future of High Performance Computing
H. Meuer, Scientific Computing World, June/July 2009: a thousand-fold performance increase over each 11-year period.
1986: Gigaflops
1997: Teraflops
2008: Petaflops
2019: Exaflops
For the near future, we expect that the hardware architecture will be a combination of specialized CPU and GPU type cores.
1 Background GP-GPU performance
FLOPS/Memory bandwidth for the CPU and GP-GPU (*FLOPS: floating-point operations per second)
1 Background GP-GPU Acceleration of WRF WSM5
1 Background 1.2 GP-GPU (General-Purpose Graphics Processing Unit)
Originally designed for graphics processing
A grid of multiprocessors
Communicates with the host CPU over the PCI bus
[Figure: the grid (data domain) is divided into thread blocks, each computed in parallel]
1 Background
Callee: i = blockDim%x*(blockIdx%x-1) + threadIdx%x
        j = blockDim%y*(blockIdx%y-1) + threadIdx%y
Caller: function(...)
[Figure: a 2-D grid of thread blocks covering the data domain, with block indices (1,1) through (2,3); each 3x3 block of threads computes the global (i, j) indices given by the formulas above. A CUDA Fortran sketch of this mapping follows.]
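A minimal, self-contained CUDA Fortran sketch of this index mapping and the caller/callee relationship; the module, kernel, and array names are illustrative, and the 6x6 domain with 3x3 thread blocks simply mirrors the figure:

   module index_demo
     use cudafor
     implicit none
   contains
     ! Each thread computes one (i, j) point of a 2-D array.
     attributes(global) subroutine fill_kernel(a, ni, nj)
       integer, value :: ni, nj
       real, device :: a(ni, nj)
       integer :: i, j
       i = blockDim%x*(blockIdx%x-1) + threadIdx%x
       j = blockDim%y*(blockIdx%y-1) + threadIdx%y
       if (i <= ni .and. j <= nj) a(i, j) = real(10*i + j)
     end subroutine fill_kernel
   end module index_demo

   program demo
     use cudafor
     use index_demo
     implicit none
     integer, parameter :: ni = 6, nj = 6
     real, device :: a_d(ni, nj)                 ! array in GPU memory
     real :: a(ni, nj)                           ! array in CPU memory
     type(dim3) :: grd, blk
     blk = dim3(3, 3, 1)                         ! 3x3 threads per block, as in the figure
     grd = dim3((ni+blk%x-1)/blk%x, (nj+blk%y-1)/blk%y, 1)   ! enough blocks to cover the domain
     call fill_kernel<<<grd, blk>>>(a_d, ni, nj) ! caller: launch the kernel on the GPU
     a = a_d                                     ! copy the result back to the CPU
     print *, a(1,1), a(6,6)
   end program demo

Each thread derives its own global (i, j) from its block and thread indices, so the explicit loops over i and j disappear from the kernel.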
1 Background 1.3 CUDA Fortran (PGI Fortran version 10)
Developed by The Portland Group (PGI) and Nvidia (December 2009)
Supports CUDA
Uses Fortran 90 (95/2003) syntax
Some limitations of CUDA Fortran:
Does not support automatic arrays or module variables
Does not support COMMON or EQUIVALENCE
CUDA (Compute Unified Device Architecture): GP-GPU programming interface by Nvidia
2 Implementation 2.1 Fortran implementation of WRF (v3.4) physics routines that run on GP-GPUs
Microphysics: WSM6 and WDM6
Boundary-layer physics: YSUPBL
Radiation physics: RRTMG_LW, RRTMG_SW
Surface-layer physics: SFCLAY
2 Implementation 2.2 Execution profile of WRF physics routines
[Pie chart: WRF execution profile by routine. Labeled slices: RRTMG_LW, RRTMG_SW, WDM6, YSUPBL, and others; the labeled physics routines account for 22.5%, 21.1%, 14.4%, and 2.1% of the run time.]
2 Implementation 2.3 Implementation of parallel programs
2.3.1 Running environment
Modification of configure.wrf (the environment set-up file), compatible with the original WRF program:
ARCH_LOCAL = -DRUN_ON_GPU (the GP-GPU code is compiled only if -DRUN_ON_GPU is defined)
A separate directory, cuda/, holds the GP-GPU source code and has its own Makefile (a sketch of the configure.wrf change follows below).
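A minimal sketch of the kind of configure.wrf fragment this implies; only the ARCH_LOCAL line and the -DRUN_ON_GPU macro come from the slides, and the remaining variable names and paths are illustrative assumptions rather than actual WRF settings:

   # configure.wrf (build settings included by the WRF Makefiles)
   ARCH_LOCAL   =  -DRUN_ON_GPU          # GP-GPU code paths are compiled only when this is defined
   # Illustrative assumptions below (not real configure.wrf variables):
   CUDA_SRC_DIR =  cuda                  # separate directory with its own Makefile
   CUDA_LIBS    =  -L/usr/local/cuda/lib64 -lcudart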
2 Implementation 2.3.2 Structure of the GP-GPU program
(Original code)
   Initialize
   Time steps:
      Physics routine:
         do j = ...
            call 2-D routine(..., j, ...)
         enddo

(GPU code)
   Initialize: dynamic allocation of GP-GPU variables
   Time steps:
      Physics routine:
         copy CPU variables to the GPU
         call 3-D routine (on the GPU)
         copy GPU variables back to the CPU
   Finalize: deallocation of GPU variables

(A CUDA Fortran sketch of this structure follows.)
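A minimal CUDA Fortran sketch of the GPU-side structure shown above; the module, variable, and routine names (phys_gpu, t_d, phys_init_gpu, phys_driver_gpu, phys_kernel, phys_finalize_gpu) and the 16x16 block shape are illustrative assumptions, not WRF code:

   module phys_gpu
     use cudafor
     implicit none
     real, allocatable, device :: t_d(:,:,:)        ! GPU copy of one physics field
   contains
     ! Initialize: allocate GPU variables once, before the time loop
     subroutine phys_init_gpu(ni, nk, nj)
       integer, intent(in) :: ni, nk, nj
       allocate(t_d(ni, nk, nj))
     end subroutine phys_init_gpu

     ! Called every time step: copy in, run the 3-D kernel, copy out
     subroutine phys_driver_gpu(t)
       real, intent(inout) :: t(:,:,:)
       type(dim3) :: grd, blk
       integer :: ni, nk, nj
       ni = size(t,1); nk = size(t,2); nj = size(t,3)
       blk = dim3(16, 16, 1)
       grd = dim3((ni+15)/16, (nj+15)/16, 1)
       t_d = t                                      ! copy CPU variable to the GPU
       call phys_kernel<<<grd, blk>>>(t_d, ni, nk, nj)
       t = t_d                                      ! copy GPU variable back to the CPU
     end subroutine phys_driver_gpu

     ! The 3-D routine: one thread per (i, j) column, k handled by a loop
     attributes(global) subroutine phys_kernel(t, ni, nk, nj)
       integer, value :: ni, nk, nj
       real, device :: t(ni, nk, nj)
       integer :: i, j, k
       i = blockDim%x*(blockIdx%x-1) + threadIdx%x
       j = blockDim%y*(blockIdx%y-1) + threadIdx%y
       if (i <= ni .and. j <= nj) then
         do k = 1, nk
           t(i, k, j) = 2.0*t(i, k, j)              ! placeholder physics update
         end do
       end if
     end subroutine phys_kernel

     ! Finalize: free GPU variables after the time loop
     subroutine phys_finalize_gpu()
       deallocate(t_d)
     end subroutine phys_finalize_gpu
   end module phys_gpu

The point of this structure is that allocation and deallocation happen once, outside the time loop, while the host-device copies and the kernel launch happen every time step.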
2 Implementation
2.3.3 Initialize & Finalize

phys/module_physics_init.F:
   #ifndef RUN_ON_GPU
      CALL rrtmg_lwinit(...)
   #else
      CALL rrtmg_lwinit_gpu(...)
   #endif

cuda/module_ra_rrtmg_lw_gpu.F (initialization of constants, allocation of GPU device variables):
   subroutine rrtmg_lwinit_gpu(...)
      call rrtmg_lw_ini(cp)
      allocate(p8w_d(dime, djme), stat=istate)
      ...

main/module_wrf_top.F:
   #ifdef RUN_ON_GPU
      call rrtmg_lwfinalize_gpu(...)
   #endif

cuda/module_ra_rrtmg_lw_gpu.F (deallocation of GPU device variables):
   subroutine rrtmg_lwfinalize_gpu(...)
      deallocate(p8w_d)
      ...
2 Implementation 2.3.4 Calling GPU Functions
phys/module_radiation_driver.F
#ifndef RUN_ON_GPU
   USE module_ra_rrtmg_lw,     only: rrtmg_lwrad
#else
   USE module_ra_rrtmg_lw_gpu, only: rrtmg_lwrad_gpu
#endif
...
#ifndef RUN_ON_GPU
   CALL RRTMG_LWRAD(...)
#else
   CALL RRTMG_LWRAD_GPU(...)
#endif
2 Implementation 2.3.5 Translation into GPU code
Use a 3-dimensional domain
Remove the horizontal (i, j) loops from the GPU (global) function: each thread handles one (i, j) column (compare the CPU and GPU versions below; a launch sketch follows the code)
(Original 2-D CPU code)
SUBROUTINE wdm62D(...)
   do k = kts, kte
      do i = its, ite
         cpm(i,k) = cpmcal(q(i,k))
         xl(i,k) = xlcal(t(i,k))
      enddo
   enddo
   do k = kts, kte
      do i = its, ite
         delz_tmp(i,k) = delz(i,k)
         den_tmp(i,k) = den(i,k)
      enddo
   enddo
END SUBROUTINE wdm62D

(GPU kernel)
attributes(global) subroutine wdm6_gpu_kernel(...)
   i = (blockIdx%x-1)*blockDim%x + threadIdx%x
   j = (blockIdx%y-1)*blockDim%y + threadIdx%y
   if (((i.ge.its).and.(i.le.ite)).and. &
       ((j.ge.jts).and.(j.le.jte))) then
      do k = kts, kte
         cpm(i,k,j) = cpmcal(q(i,k,j))
         xl(i,k,j) = xlcal(t(i,k,j))
      enddo
      do k = kts, kte
         delz_tmp(i,k,j) = delz(i,k,j)
         den_tmp(i,k,j) = den(i,k,j)
      enddo
   endif
end subroutine wdm6_gpu_kernel
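The kernel above still has to be launched over a grid of thread blocks that covers the horizontal domain. A minimal sketch of the host-side launch, assuming 16x16 thread blocks (the block shape and the variable names grd and blk are illustrative; the kernel's argument list is elided as on the slide):

   type(dim3) :: grd, blk
   blk = dim3(16, 16, 1)                     ! assumed thread-block shape
   grd = dim3((ite + blk%x - 1)/blk%x, &     ! threads span i = 1..ite
              (jte + blk%y - 1)/blk%y, 1)    ! and j = 1..jte; the if-test in the
                                             ! kernel masks points outside its:ite, jts:jte
   call wdm6_gpu_kernel<<<grd, blk>>>(...)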
2 Implementation 2.3.6 Memory allocation of arrays and copying of CPU data
(Host side: declare device arrays, copy data in, call the kernel)
   real, allocatable, device :: rthratenlw_d(:,:,:), emiss_d(:,:), ...

   rthratenlw_d = rthratenlw
   emiss_d = emiss
   call rrtmg_lwrad_gpu_kernel(rthratenlw_d, emiss_d, ...)

(GPU kernel: works on the device copies)
   attributes(global) subroutine rrtmg_lwrad_gpu_kernel(rthratenlw, emiss, ...)
      rthratenlw(i,k,j) = ...
      emiss(i,k,j) = ...

(Original CPU routine, for comparison)
   subroutine rrtmg_lwrad(rthratenlw, emiss, ...)
      rthratenlw(i,k,j) = ...
      emiss(i,k,j) = ...
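Putting 2.3.6 together, a minimal sketch of the host-side sequence around one GPU physics call; the array names follow the slide, while the allocation bounds (dkme), block shape, and chevron launch configuration are illustrative assumptions:

   real, allocatable, device :: rthratenlw_d(:,:,:), emiss_d(:,:)
   type(dim3) :: grd, blk
   integer :: istate

   ! Allocate the device arrays once, during initialization
   ! (dime/djme follow the allocate in 2.3.3; dkme is an assumed vertical dimension)
   allocate(rthratenlw_d(dime, dkme, djme), emiss_d(dime, djme), stat=istate)

   ! Each call: copy inputs to the GPU, launch the kernel, copy results back
   rthratenlw_d = rthratenlw                          ! host -> device
   emiss_d      = emiss                               ! host -> device
   blk = dim3(16, 16, 1)                              ! assumed thread-block shape
   grd = dim3((dime+blk%x-1)/blk%x, (djme+blk%y-1)/blk%y, 1)
   call rrtmg_lwrad_gpu_kernel<<<grd, blk>>>(rthratenlw_d, emiss_d)   ! real argument list is longer
   rthratenlw = rthratenlw_d                          ! device -> host (assuming it is a kernel output)

In CUDA Fortran the host-to-device and device-to-host copies are written as ordinary array assignments between a host array and its device counterpart.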
3 Performance 3.1 Performance comparison
System specification used for performance checking
GPU: Tesla C1060 (1.3 GHz)
Global memory: 4 GB
Multiprocessors: 30
Cores: 240
Registers/block: 16,384
Max. threads/block: 512
CPU: Intel Xeon E5405 (2.0 GHz)
3 Performance Performance of WRF physics routines
[Bar chart: execution time of WSM6, WDM6, YSUPBL, RRTMG_LW, RRTMG_SW, and SFCLAY on the CPU vs. on the GPU; vertical axis 0 to 4000.]
3 Performance Performance comparison of CUDA C and CUDA Fortran
[Bar chart: execution time (microsec) for WSM5 and WSM6, CPU vs. GPU; vertical axis 0 to 300,000.]
3 Performance 3.2 Performance of WRF
[Bar chart: total WRF execution time, CPU vs. GPU; vertical axis 0 to 18,000.]
4 Conclusions
Pros
GP-GPUs can be used as efficient hardware accelerators.
GP-GPUs are cheap and energy efficient.
Cons
Communication between the CPU and the GPU is slow.
Data transfer between the CPU and the GP-GPU is a bottleneck, so overlapping communication with computation is necessary.
Translation into GP-GPU code is not trivial: parameter-passing methods and local resources are limited.
CUDA Fortran needs to be improved.
Thank you