Transcript
  • WRF Physics models using GP-GPUs with CUDA Fortran

    Gangneung-Wonju National University
    Youngtae Kim

  • Agenda

    1 Background
      1.1 The future of high performance computing
      1.2 GP-GPU
      1.3 CUDA Fortran

    2 Implementation
      2.1 Implementation of the Fortran program of WRF
      2.2 Execution profile of WRF physics
      2.3 Implementation of parallel programs

    3 Performance
      3.1 Performance comparison
      3.2 Performance of WRF

    4 Conclusions

  • 1 Background 1.1 The future of High Performance Computing

    H. Meuer, Scientific Computing World, June/July 2009: a thousand-fold performance increase over an 11-year time period.

        1986  Gigaflops
        1997  Teraflops
        2008  Petaflops
        2019  Exaflops

    For the near future, we expect that the hardware architecture will be a combination of specialized CPU- and GPU-type cores.

  • 1 Background GP-GPU performance

    [Figure: FLOPS and memory bandwidth for the CPU and the GP-GPU]
    *FLOPS: Floating-Point Operations per Second

  • 1 Background GP-GPU Acceleration of WRF WSM5


  • 1 Background 1.2 GP-GPU (General-Purpose Graphics Processing Unit)

    Originally designed for graphics processing

    A grid of multiprocessors

    Connected to the host via PCI

    [Figure: a grid (the data domain) decomposed into thread blocks, which are computed in parallel]

  • 1 Background

    Grid and thread indexing:

    Callee (kernel): each thread derives its global indices from its block and thread coordinates:

        i = blockDim%x*(blockIdx%x-1) + threadIdx%x
        j = blockDim%y*(blockIdx%y-1) + threadIdx%y

    Caller: launches the kernel function over the whole grid.

    [Figure: a 2x3 grid of 3x3 thread blocks, blocks indexed (1,1) through (2,3); each cell shows a thread's global index in unevaluated form, e.g. thread (1,2) of block (1,2) gets (3*0+1, 3*1+2), since blockDim = 3.]
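    A minimal, self-contained CUDA Fortran sketch of the figure (the array and routine names are illustrative, not from the WRF code): a 2x3 grid of 3x3 blocks is launched, and each thread stores a value derived from its global (i,j), computed exactly as above.

        module index_demo
          use cudafor
          implicit none
        contains
          attributes(global) subroutine fill_indices(a, n, m)
            integer, value :: n, m
            integer :: a(n,m)
            integer :: i, j
            i = blockDim%x*(blockIdx%x-1) + threadIdx%x   ! global i, as in the figure
            j = blockDim%y*(blockIdx%y-1) + threadIdx%y   ! global j
            if (i <= n .and. j <= m) a(i,j) = 100*i + j
          end subroutine fill_indices
        end module index_demo

        program demo
          use cudafor
          use index_demo
          implicit none
          integer, parameter :: n = 6, m = 9   ! 2 blocks of 3 in x, 3 blocks of 3 in y
          integer, device :: a_d(n,m)
          integer :: a(n,m), i

          call fill_indices<<<dim3(2,3,1), dim3(3,3,1)>>>(a_d, n, m)
          a = a_d                              ! device-to-host copy by assignment
          do i = 1, n
            print '(9i5)', a(i,:)              ! each thread wrote 100*i + j
          end do
        end program demo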

  • 1 Background 1.3 CUDA Fortran (PGI Fortran version 10)

    Developed by Portland Group Inc. and Nvidia (December 2009)

    Supports CUDA

    Uses Fortran 90 (95/03) syntax

    Some limitations of CUDA Fortran:
    - No support for automatic arrays and module variables
    - No support for common and equivalence

    CUDA (Compute Unified Device Architecture): the GP-GPU programming interface by Nvidia

  • 2 Implementation 2.1 Implementation of the Fortran program of WRF (v3.4): physics run on GP-GPUs

    Microphysics: WSM6 and WDM6

    Boundary-layer physics: YSUPBL

    Radiation physics: RRTMG_LW, RRTMG_SW

    Surface-layer physics: SFCLAY

  • 2 Implementation 2.2 Execution profile of WRF physics routines

    [Pie chart: WRF execution profile. RRTMG_LW 22.5%, RRTMG_SW 14.4%, WDM6 21.1%, YSUPBL 2.1%, remainder: other routines]

  • 2 Implementation 2.3 Implementation of parallel programs

    2.3.1 Running environment

    Modification of configure.wrf (the environment set-up file), staying compatible with the original WRF program:

        ARCH_LOCAL = -DRUN_ON_GPU    # GP-GPU code is compiled only if -DRUN_ON_GPU is defined

    A directory for the CUDA codes only, cuda/, holding the GP-GPU source codes and an exclusive Makefile.

  • 2 Implementation 2.3.2 Structure of the GP-GPU program

    (Original code)
        Initialize
        Time steps:
            Physics routine:
                do j = ...
                   call 2D routine(.., j, ..)
                enddo

    (GPU code)
        Initialize: dynamic allocation of GP-GPU variables
        Time steps:
            Physics routine:
                Copy CPU variables to GPU
                call 3D routine (GPU)
                Copy GPU variables to CPU
        Finalize: deallocation of GPU variables
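    A hedged CUDA Fortran skeleton of the GPU-side control flow (routine and variable names are illustrative, not the actual WRF driver):

        call physics_init_gpu()                ! allocate device arrays once

        do step = 1, num_steps                 ! time-step loop
           t_d = t                             ! copy CPU variables to GPU
           q_d = q
           call physics_3d_kernel<<<grid, tblock>>>(t_d, q_d)   ! one 3-D kernel
                                               ! replaces the per-j 2-D calls
           t = t_d                             ! copy results back to the CPU
           q = q_d
        end do

        call physics_finalize_gpu()            ! deallocate device arrays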

  • 2 Implementation 2.3.3 Initialize & Finalize

    phys/module_physics_init.F:

        #ifndef RUN_ON_GPU
          CALL rrtmg_lwinit(...)
        #else
          CALL rrtmg_lwinit_gpu(...)
        #endif

    cuda/module_ra_rrtmg_lw_gpu.F (initialization of constants; allocation of GPU device variables):

        subroutine rrtmg_lwinit_gpu(...)
          call rrtmg_lw_ini(cp)
          allocate(p8w_d(dime,djme), stat=istate)

    main/module_wrf_top.F:

        #ifdef RUN_ON_GPU
          call rrtmg_lwfinalize_gpu()
        #endif

    cuda/module_ra_rrtmg_lw_gpu.F (deallocation of GPU device variables):

        subroutine rrtmg_lwfinalize_gpu(...)
          deallocate(p8w_d)

  • 2 Implementation 2.3.4 Calling GPU functions

    phys/module_radiation_driver.F:

        #ifndef RUN_ON_GPU
          USE module_ra_rrtmg_lw, only: rrtmg_lwrad
        #else
          USE module_ra_rrtmg_lw_gpu, only: rrtmg_lwrad_gpu
        #endif
        ...
        #ifndef RUN_ON_GPU
          CALL RRTMG_LWRAD(...)
        #else
          CALL RRTMG_LWRAD_GPU(...)
        #endif

  • 2 Implementation 2.3.5 Translation into GPU code

    Use a 3-dimensional domain.

    Remove the horizontal loops (i- and j-loops) from the GPU (global) function; see the launch sketch after the code.

    (Original CPU code)
        SUBROUTINE wdm62D(...)
          do k = kts, kte
            do i = its, ite
              cpm(i,k) = cpmcal(q(i,k))
              xl(i,k)  = xlcal(t(i,k))
            enddo
          enddo
          do k = kts, kte
            do i = its, ite
              delz_tmp(i,k) = delz(i,k)
              den_tmp(i,k)  = den(i,k)
            enddo
          enddo
        END SUBROUTINE wdm62D

    (GPU kernel)
        attributes(global) subroutine wdm6_gpu_kernel(...)
          i = (blockIdx%x-1)*blockDim%x + threadIdx%x
          j = (blockIdx%y-1)*blockDim%y + threadIdx%y
          if (((i.ge.its).and.(i.le.ite)).and. &
              ((j.ge.jts).and.(j.le.jte))) then
            do k = kts, kte
              cpm(i,k,j) = cpmcal(q(i,k,j))
              xl(i,k,j)  = xlcal(t(i,k,j))
            enddo
            do k = kts, kte
              delz_tmp(i,k,j) = delz(i,k,j)
              den_tmp(i,k,j)  = den(i,k,j)
            enddo
          endif
        end subroutine wdm6_gpu_kernel
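    The slides show the kernel but not its launch. A hedged sketch of the host-side launch configuration (the 16x16 tile size and the variable names are assumptions, not from the source):

        ! Cover the (its:ite, jts:jte) horizontal domain with 16x16 thread
        ! blocks; out-of-range threads are filtered by the if-test above.
        type(dim3) :: grid, tblock
        integer :: nx, ny

        nx = ite - its + 1
        ny = jte - jts + 1
        tblock = dim3(16, 16, 1)
        grid   = dim3((nx + tblock%x - 1)/tblock%x, &   ! ceiling division so
                      (ny + tblock%y - 1)/tblock%y, 1)  ! every column gets a thread

        call wdm6_gpu_kernel<<<grid, tblock>>>(...)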

  • 2 Implementation 2.3.6 Memory allocation of arrays and copying CPU data

        real, allocatable, device :: rthratenlw_d(:,:,:), emiss_d(:,:), ...

        rthratenlw_d = rthratenlw      ! copy host data to the device
        emiss_d = emiss
        call rrtmg_lwrad_gpu_kernel(rthratenlw_d, emiss_d, ...)

    (GPU version)
        attributes(global) subroutine rrtmg_lwrad_gpu_kernel(rthratenlw, emiss, ...)
          rthratenlw(i,k,j) = ...
          emiss(i,k,j) = ...

    (CPU version)
        subroutine rrtmg_lwrad(rthratenlw, emiss, ...)
          rthratenlw(i,k,j) = ...
          emiss(i,k,j) = ...
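    In CUDA Fortran, assignment between a host array and a device array performs the transfer implicitly. A minimal self-contained sketch (array names and sizes are illustrative):

        program copy_demo
          use cudafor
          implicit none
          real, allocatable         :: emiss(:,:)      ! host array
          real, allocatable, device :: emiss_d(:,:)    ! device mirror
          integer :: istat

          allocate(emiss(64,64), emiss_d(64,64), stat=istat)
          emiss   = 0.5            ! initialize on the host
          emiss_d = emiss          ! host-to-device copy (implicit cudaMemcpy)
          emiss   = emiss_d        ! device-to-host copy back
          print *, 'max value:', maxval(emiss)
          deallocate(emiss, emiss_d)
        end program copy_demo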

  • 3 Performance 3.1 Performance comparison

    System specification used for performance testing:

        GPU                     Tesla C1060 (1.3 GHz)
          Global memory         4 GB
          Multiprocessors       30
          Cores                 240
          Registers/block       16384
          Max threads/block     512

        CPU                     Intel Xeon E5405 (2.0 GHz)
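    These figures can be confirmed at run time with the device-query API from the cudafor module (a minimal sketch; the output formatting is illustrative):

        program query
          use cudafor
          implicit none
          type(cudaDeviceProp) :: prop
          integer :: istat

          istat = cudaGetDeviceProperties(prop, 0)   ! properties of device 0
          print *, 'name:              ', trim(prop%name)
          print *, 'multiprocessors:   ', prop%multiProcessorCount
          print *, 'global memory (MB):', prop%totalGlobalMem / (1024**2)
          print *, 'registers/block:   ', prop%regsPerBlock
          print *, 'max threads/block: ', prop%maxThreadsPerBlock
        end program query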

  • 3 Performance Performance of WRF physics routines

    [Bar chart: CPU vs. GPU execution times for WSM6, WDM6, YSUPBL, RRTMG_LW, RRTMG_SW, and SFCLAY; vertical axis 0 to 4000]

  • 3 Performance Performance comparison of CUDA C and CUDA Fortran

    [Bar chart: CPU and GPU execution times in microseconds for WSM5 and WSM6, comparing the CUDA C and CUDA Fortran implementations; vertical axis 0 to 300,000]

  • 3 Performance 3.2 Performance of WRF

    [Bar chart: total WRF execution time, CPU vs. GPU; vertical axis 0 to 18000]

  • 4 Conclusions

    Pros:
    - GP-GPUs can be used as efficient hardware accelerators.
    - GP-GPUs are cheap and energy efficient.

    Cons:
    - Communication between CPUs and GPUs is slow: data transfer between the CPU and the GP-GPU is a bottleneck, so overlap of communication and computation is necessary (see the sketch after this list).
    - Translation into GP-GPU code is not trivial: parameter-passing methods and local resources are limited.
    - CUDA Fortran needs to be improved.
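    A hedged sketch of such overlap using CUDA streams and pinned host memory (all names are illustrative; this is not part of the ported WRF code):

        module overlap_demo
          use cudafor
          implicit none
        contains
          attributes(global) subroutine scale_chunk(d, n)
            real :: d(n)
            integer, value :: n
            integer :: i
            i = (blockIdx%x-1)*blockDim%x + threadIdx%x
            if (i <= n) d(i) = 2.0*d(i)
          end subroutine scale_chunk
        end module overlap_demo

        program overlap
          use cudafor
          use overlap_demo
          implicit none
          integer, parameter :: n = 1024*1024
          real, pinned, allocatable :: h(:,:)    ! page-locked host buffer, 2 chunks
          real, device, allocatable :: d(:,:)
          integer(kind=cuda_stream_kind) :: s(2)
          integer :: istat, c

          allocate(h(n,2), d(n,2))
          h = 1.0
          istat = cudaStreamCreate(s(1))
          istat = cudaStreamCreate(s(2))

          ! each chunk gets its own stream, so chunk 2's host-to-device copy
          ! can overlap chunk 1's kernel execution
          do c = 1, 2
             istat = cudaMemcpyAsync(d(:,c), h(:,c), n, cudaMemcpyHostToDevice, s(c))
             call scale_chunk<<<(n+255)/256, 256, 0, s(c)>>>(d(:,c), n)
             istat = cudaMemcpyAsync(h(:,c), d(:,c), n, cudaMemcpyDeviceToHost, s(c))
          end do

          istat = cudaDeviceSynchronize()
          print *, 'h(1,1) =', h(1,1), '  h(1,2) =', h(1,2)   ! both should be 2.0
        end program overlap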
