WRF Physics Models Using GP-GPUs with CUDA Fortran
Youngtae Kim
Gangneung-Wonju National University
Nov 11, 2015
Agenda
1 Background
  1.1 The future of high performance computing
  1.2 GP-GPU
  1.3 CUDA Fortran
2 Implementation
  2.1 Fortran implementation of WRF physics
  2.2 Execution profile of WRF physics
  2.3 Implementation of the parallel programs
3 Performance
  3.1 Performance comparison
  3.2 Performance of WRF
4 Conclusions
1 Background 1.1 The future of High Performance Computing
H. Meuer, Scientific Computing World, June/July 2009: a thousand-fold performance increase over each 11-year period.
1986: Gigaflops
1997: Teraflops
2008: Petaflops
2019: Exaflops
For the near future, we expect that the hardware architecture will be a combination of specialized CPU and GPU type cores.
1 Background GP-GPU performance
FLOPS/Memory bandwidth for the CPU and GP-GPU (*FLOPS: floating-point operations per second)
1 Background GP-GPU Acceleration of WRF WSM5
1 Background 1.2 GP-GPU (General-Purpose Graphics Processing Unit)
Originally designed for graphics processing
A grid of multiprocessors
Communicates with the host CPU over the PCI bus
[Figure: the grid (data domain) is divided into thread blocks, each computed in parallel]
1 Background
Callee: i = blockDim%x*(blockIdx%x-1) + threadIdx%x
        j = blockDim%y*(blockIdx%y-1) + threadIdx%y
Caller: function(...)
[Figure: a 2-D grid of thread blocks covering the data domain, with block indices (1,1) through (2,3); each 3x3 block of threads computes the global (i, j) indices given by the formulas above. A CUDA Fortran sketch of this mapping follows.]
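A minimal, self-contained CUDA Fortran sketch of this index mapping and the caller/callee relationship; the module, kernel, and array names are illustrative, and the 6x6 domain with 3x3 thread blocks simply mirrors the figure:

   module index_demo
     use cudafor
     implicit none
   contains
     ! Each thread computes one (i, j) point of a 2-D array.
     attributes(global) subroutine fill_kernel(a, ni, nj)
       integer, value :: ni, nj
       real, device :: a(ni, nj)
       integer :: i, j
       i = blockDim%x*(blockIdx%x-1) + threadIdx%x
       j = blockDim%y*(blockIdx%y-1) + threadIdx%y
       if (i <= ni .and. j <= nj) a(i, j) = real(10*i + j)
     end subroutine fill_kernel
   end module index_demo

   program demo
     use cudafor
     use index_demo
     implicit none
     integer, parameter :: ni = 6, nj = 6
     real, device :: a_d(ni, nj)                 ! array in GPU memory
     real :: a(ni, nj)                           ! array in CPU memory
     type(dim3) :: grd, blk
     blk = dim3(3, 3, 1)                         ! 3x3 threads per block, as in the figure
     grd = dim3((ni+blk%x-1)/blk%x, (nj+blk%y-1)/blk%y, 1)   ! enough blocks to cover the domain
     call fill_kernel<<<grd, blk>>>(a_d, ni, nj) ! caller: launch the kernel on the GPU
     a = a_d                                     ! copy the result back to the CPU
     print *, a(1,1), a(6,6)
   end program demo

Each thread derives its own global (i, j) from its block and thread indices, so the explicit loops over i and j disappear from the kernel.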
1 Background 1.3 CUDA Fortran (PGI Fortran version 10)
Developed by The Portland Group (PGI) and Nvidia (December 2009)
Supports CUDA
Uses Fortran 90 (95/2003) syntax
Some limitations of CUDA Fortran:
Does not support automatic arrays or module variables
Does not support COMMON or EQUIVALENCE
CUDA (Compute Unified Device Architecture): GP-GPU programming interface by Nvidia
2 Implementation 2.1 Fortran implementation of WRF (v3.4) physics routines that run on GP-GPUs
Microphysics: WSM6 and WDM6
Boundary-layer physics: YSUPBL
Radiation physics: RRTMG_LW, RRTMG_SW
Surface-layer physics: SFCLAY
2 Implementation 2.2 Execution profile of WRF physics routines
[Pie chart: WRF execution profile by routine. Labeled slices: RRTMG_LW, RRTMG_SW, WDM6, YSUPBL, and others; the labeled physics routines account for 22.5%, 21.1%, 14.4%, and 2.1% of the run time.]
2 Implementation 2.3 Implementation of parallel programs
2.3.1 Running environment
Modification of configure.wrf (the environment set-up file), compatible with the original WRF program:
ARCH_LOCAL = -DRUN_ON_GPU (the GP-GPU code is compiled only if -DRUN_ON_GPU is defined)
A separate directory, cuda/, holds the GP-GPU source code and has its own Makefile (a sketch of the configure.wrf change follows below).
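A minimal sketch of the kind of configure.wrf fragment this implies; only the ARCH_LOCAL line and the -DRUN_ON_GPU macro come from the slides, and the remaining variable names and paths are illustrative assumptions rather than actual WRF settings:

   # configure.wrf (build settings included by the WRF Makefiles)
   ARCH_LOCAL   =  -DRUN_ON_GPU          # GP-GPU code paths are compiled only when this is defined
   # Illustrative assumptions below (not real configure.wrf variables):
   CUDA_SRC_DIR =  cuda                  # separate directory with its own Makefile
   CUDA_LIBS    =  -L/usr/local/cuda/lib64 -lcudart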
2 Implementation 2.3.2 Structure of the GP-GPU program
(Original code)
   Initialize
   Time steps:
      Physics routine:
         do j = ...
            call 2-D routine(..., j, ...)
         enddo

(GPU code)
   Initialize: dynamic allocation of GP-GPU variables
   Time steps:
      Physics routine:
         copy CPU variables to the GPU
         call 3-D routine (on the GPU)
         copy GPU variables back to the CPU
   Finalize: deallocation of GPU variables

(A CUDA Fortran sketch of this structure follows.)
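A minimal CUDA Fortran sketch of the GPU-side structure shown above; the module, variable, and routine names (phys_gpu, t_d, phys_init_gpu, phys_driver_gpu, phys_kernel, phys_finalize_gpu) and the 16x16 block shape are illustrative assumptions, not WRF code:

   module phys_gpu
     use cudafor
     implicit none
     real, allocatable, device :: t_d(:,:,:)        ! GPU copy of one physics field
   contains
     ! Initialize: allocate GPU variables once, before the time loop
     subroutine phys_init_gpu(ni, nk, nj)
       integer, intent(in) :: ni, nk, nj
       allocate(t_d(ni, nk, nj))
     end subroutine phys_init_gpu

     ! Called every time step: copy in, run the 3-D kernel, copy out
     subroutine phys_driver_gpu(t)
       real, intent(inout) :: t(:,:,:)
       type(dim3) :: grd, blk
       integer :: ni, nk, nj
       ni = size(t,1); nk = size(t,2); nj = size(t,3)
       blk = dim3(16, 16, 1)
       grd = dim3((ni+15)/16, (nj+15)/16, 1)
       t_d = t                                      ! copy CPU variable to the GPU
       call phys_kernel<<<grd, blk>>>(t_d, ni, nk, nj)
       t = t_d                                      ! copy GPU variable back to the CPU
     end subroutine phys_driver_gpu

     ! The 3-D routine: one thread per (i, j) column, k handled by a loop
     attributes(global) subroutine phys_kernel(t, ni, nk, nj)
       integer, value :: ni, nk, nj
       real, device :: t(ni, nk, nj)
       integer :: i, j, k
       i = blockDim%x*(blockIdx%x-1) + threadIdx%x
       j = blockDim%y*(blockIdx%y-1) + threadIdx%y
       if (i <= ni .and. j <= nj) then
         do k = 1, nk
           t(i, k, j) = 2.0*t(i, k, j)              ! placeholder physics update
         end do
       end if
     end subroutine phys_kernel

     ! Finalize: free GPU variables after the time loop
     subroutine phys_finalize_gpu()
       deallocate(t_d)
     end subroutine phys_finalize_gpu
   end module phys_gpu

The point of this structure is that allocation and deallocation happen once, outside the time loop, while the host-device copies and the kernel launch happen every time step.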
2 Implementation
2.3.3 Initialize & Finalize

phys/module_physics_init.F:
   #ifndef RUN_ON_GPU
      CALL rrtmg_lwinit(...)
   #else
      CALL rrtmg_lwinit_gpu(...)
   #endif

cuda/module_ra_rrtmg_lw_gpu.F (initialization of constants, allocation of GPU device variables):
   subroutine rrtmg_lwinit_gpu(...)
      call rrtmg_lw_ini(cp)
      allocate(p8w_d(dime, djme), stat=istate)
      ...

main/module_wrf_top.F:
   #ifdef RUN_ON_GPU
      call rrtmg_lwfinalize_gpu(...)
   #endif

cuda/module_ra_rrtmg_lw_gpu.F (deallocation of GPU device variables):
   subroutine rrtmg_lwfinalize_gpu(...)
      deallocate(p8w_d)
      ...
2 Implementation 2.3.4 Calling GPU Functions
phys/module_radiation_driver.F
#ifndef RUN_ON_GPU
   USE module_ra_rrtmg_lw,     only: rrtmg_lwrad
#else
   USE module_ra_rrtmg_lw_gpu, only: rrtmg_lwrad_gpu
#endif
...
#ifndef RUN_ON_GPU
   CALL RRTMG_LWRAD(...)
#else
   CALL RRTMG_LWRAD_GPU(...)
#endif
2 Implementation 2.3.5 Translation into GPU code
Use a 3-dimensional domain
Remove the horizontal (i, j) loops from the GPU (global) function: each thread handles one (i, j) column (compare the CPU and GPU versions below; a launch sketch follows the code)
(Original 2-D CPU code)
SUBROUTINE wdm62D(...)
   do k = kts, kte
      do i = its, ite
         cpm(i,k) = cpmcal(q(i,k))
         xl(i,k) = xlcal(t(i,k))
      enddo
   enddo
   do k = kts, kte
      do i = its, ite
         delz_tmp(i,k) = delz(i,k)
         den_tmp(i,k) = den(i,k)
      enddo
   enddo
END SUBROUTINE wdm62D

(GPU kernel)
attributes(global) subroutine wdm6_gpu_kernel(...)
   i = (blockIdx%x-1)*blockDim%x + threadIdx%x
   j = (blockIdx%y-1)*blockDim%y + threadIdx%y
   if (((i.ge.its).and.(i.le.ite)).and. &
       ((j.ge.jts).and.(j.le.jte))) then
      do k = kts, kte
         cpm(i,k,j) = cpmcal(q(i,k,j))
         xl(i,k,j) = xlcal(t(i,k,j))
      enddo
      do k = kts, kte
         delz_tmp(i,k,j) = delz(i,k,j)
         den_tmp(i,k,j) = den(i,k,j)
      enddo
   endif
end subroutine wdm6_gpu_kernel
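The kernel above still has to be launched over a grid of thread blocks that covers the horizontal domain. A minimal sketch of the host-side launch, assuming 16x16 thread blocks (the block shape and the variable names grd and blk are illustrative; the kernel's argument list is elided as on the slide):

   type(dim3) :: grd, blk
   blk = dim3(16, 16, 1)                     ! assumed thread-block shape
   grd = dim3((ite + blk%x - 1)/blk%x, &     ! threads span i = 1..ite
              (jte + blk%y - 1)/blk%y, 1)    ! and j = 1..jte; the if-test in the
                                             ! kernel masks points outside its:ite, jts:jte
   call wdm6_gpu_kernel<<<grd, blk>>>(...)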
2 Implementation 2.3.6 Memory allocation of arrays and copying of CPU data
(Host side: declare device arrays, copy data in, call the kernel)
   real, allocatable, device :: rthratenlw_d(:,:,:), emiss_d(:,:), ...

   rthratenlw_d = rthratenlw
   emiss_d = emiss
   call rrtmg_lwrad_gpu_kernel(rthratenlw_d, emiss_d, ...)

(GPU kernel: works on the device copies)
   attributes(global) subroutine rrtmg_lwrad_gpu_kernel(rthratenlw, emiss, ...)
      rthratenlw(i,k,j) = ...
      emiss(i,k,j) = ...

(Original CPU routine, for comparison)
   subroutine rrtmg_lwrad(rthratenlw, emiss, ...)
      rthratenlw(i,k,j) = ...
      emiss(i,k,j) = ...
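Putting 2.3.6 together, a minimal sketch of the host-side sequence around one GPU physics call; the array names follow the slide, while the allocation bounds (dkme), block shape, and chevron launch configuration are illustrative assumptions:

   real, allocatable, device :: rthratenlw_d(:,:,:), emiss_d(:,:)
   type(dim3) :: grd, blk
   integer :: istate

   ! Allocate the device arrays once, during initialization
   ! (dime/djme follow the allocate in 2.3.3; dkme is an assumed vertical dimension)
   allocate(rthratenlw_d(dime, dkme, djme), emiss_d(dime, djme), stat=istate)

   ! Each call: copy inputs to the GPU, launch the kernel, copy results back
   rthratenlw_d = rthratenlw                          ! host -> device
   emiss_d      = emiss                               ! host -> device
   blk = dim3(16, 16, 1)                              ! assumed thread-block shape
   grd = dim3((dime+blk%x-1)/blk%x, (djme+blk%y-1)/blk%y, 1)
   call rrtmg_lwrad_gpu_kernel<<<grd, blk>>>(rthratenlw_d, emiss_d)   ! real argument list is longer
   rthratenlw = rthratenlw_d                          ! device -> host (assuming it is a kernel output)

In CUDA Fortran the host-to-device and device-to-host copies are written as ordinary array assignments between a host array and its device counterpart.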
3 Performance 3.1 Performance comparison
System specification used for performance checking
GPU: Tesla C1060 (1.3 GHz)
Global memory: 4 GB
Multiprocessors: 30
Cores: 240
Registers/block: 16,384
Max. threads/block: 512
CPU: Intel Xeon E5405 (2.0 GHz)
3 Performance Performance of WRF physics routines
[Bar chart: execution time of WSM6, WDM6, YSUPBL, RRTMG_LW, RRTMG_SW, and SFCLAY on the CPU vs. on the GPU; vertical axis 0 to 4000.]
3 Performance Performance comparison of CUDA C and CUDA Fortran
[Bar chart: execution time (microsec) for WSM5 and WSM6, CPU vs. GPU; vertical axis 0 to 300,000.]
3 Performance 3.2 Performance of WRF
[Bar chart: total WRF execution time, CPU vs. GPU; vertical axis 0 to 18,000.]
4 Conclusions
Pros
GP-GPUs can be used as efficient hardware accelerators.
GP-GPUs are cheap and energy efficient.
Cons
Communication between the CPU and the GPU is slow.
Data transfer between the CPU and the GP-GPU is a bottleneck, so overlapping communication with computation is necessary.
Translation into GP-GPU code is not trivial: parameter-passing methods and local resources are limited.
CUDA Fortran needs to be improved.
Thank you