John Levesque, Nov 16, 2011, [email protected]
“We see that the GPU is the best device available for us today to be able to get to the performance we want and meet our users’ requirements for a very high performance node with very high memory bandwidth.”
Buddy Bland, ORNL Project Director, OLCF-3, HPCWire interview, October 14, 2011
XK6 Compute Node Characteristics
Host processor: AMD Series 6200 (Interlagos)
Tesla X2090 performance: 665 Gflops
Host memory: 16, 32, or 64 GB, 1600 MHz DDR3
Tesla X2090 memory: 6 GB GDDR5, 170 GB/sec
Gemini high speed interconnect
Upgradeable to the Kepler many-core processor
Accelerator Tools
Accelerator tools: statistics gathering for identification of potential accelerator kernels, and statistics gathering for code running on the accelerator
Optimized libraries: utilization of an autotuning framework for generating optimized accelerator libraries
Analysis and scoping tools, compiler directives: whole-program analysis for performance scoping with OpenMP and OpenACC directives
OpenACC
An open standard for addressing the acceleration of Fortran, C, and C++ applications
Originally designed by Cray, PGI, and Nvidia
Directives can be ignored on systems without an accelerator
Can be used to target accelerators from Nvidia, AMD, and Intel
Final Configuration
Name: Titan
Architecture: XK6
Processor: 16-core AMD
Cabinets: 200
Nodes: 18,688
Cores/node: 16
Total cores: 299,008
Memory/node: 32 GB
Memory/core: 2 GB
Interconnect: Gemini
GPUs: TBD
Early Science Applications
CAM-SE, Denovo, LAMMPS, PFLOTRAN, S3D, WL-LSMS
Key code kernels have been ported, and their performance projects a 4X speedup on the XK6 over Jaguar.
CAM-SE
Major REMAP kernel
All times in milliseconds (OpenMP parallel do, 24 threads, Magny-Cours):
Original REMAP: 65.30
Rewritten for porting to the accelerator: 32.65
Hand-coded CUDA: 10.2
OpenACC directives: 10.6
WL-LSMS
The kernel responsible for 95% of the compute time on the CPU has been ported and shows a 2.5X speedup over the replaced CPU.
WL-LSMS
• First-principles statistical mechanics of magnetic materials
• Identified kernel for initial GPU work
– zblock_lu (95% of wall time on the CPU)
– Kernel performance determined by BLAS and LAPACK: ZGEMM, ZGETRS, ZGETRF
• Preliminary performance of zblock_lu for 12 atoms/node of JaguarPF or 12 atoms/GPU
– For the Fermi C2050, times include host-GPU PCIe transfers
– Currently the GPU node does not utilize the AMD Magny-Cours host for compute

Time (sec): JaguarPF node (12-core AMD Istanbul) 13.5 | Fermi C2050 using CUBLAS 11.6 | Fermi C2050 using Cray LibSci 6.4
Denovo
The 3-D sweep kernel, 90% of the runtime, runs 40X faster on Fermi than on an Opteron core. The new GPU-aware sweeper also runs 2X faster on CPUs than the previous CPU-based sweeper, due to performance optimizations.
Single Major Kernel - SWEEP
• The sweep code is written in C++ using MPI and CUDA runtime calls.
• CUDA constructs are employed to enable generation of both CPU and GPU object code from a single source code.
• C++ template metaprogramming is used to generate highly optimized code at compile time, using techniques such as function inlining and constant propagation to optimize for specific use cases.
Denovo performance data
[Figure: time in seconds vs. node count (1 to 1024) for Jaguar with the old Denovo sweeper, Jaguar with the new sweeper, Jaguar with the standalone new sweeper, Fermi with the standalone new sweeper (extrapolated), and Fermi + Gemini with the standalone sweeper (estimated)]
LAMMPS
Currently seeing a 2X-5X speedup over the replaced CPU
Host-Device Load Balancing
• Split work further by spatial domain to improve data locality on the GPU
• Further split work not ported to the GPU across more CPU cores
• Concurrent calculation of routines not ported to the GPU with the GPU force calculation
• Concurrent calculation of force on the CPU and GPU
[Figure: two stacked-bar charts of runtime fraction (0-100%) vs. node count (1-15), breaking time into Pair+Neigh+GPU-Comm, Comm, and Other; and a log-scale plot of loop time in seconds (10-1000) vs. node count (1-16) for CPU (12 ppn), GPU (2 ppn), GPU LB (12 ppn), GPU-N (2 ppn), and GPU-N LB (12 ppn)]
S3D
Full application running using the new OpenACC directives. Target performance: 4X JaguarPF.
Convert OpenMP Regions to OpenACC
All times in seconds | OpenMP parallel do, 16 threads, Interlagos | OpenACC parallel construct
Getrates | 1.18 | 0.235
Diffusive flux | 1.99 | 1.21
Pointwise compute | 0.234 | 0.174
Total run/cycle (entire application) | 4.37 | 2.80
Copyright 2011 Cray Inc. Supercomputing 2011
Legend: transfer from host to accelerator; communication on host; computation on the accelerator; computation on host; transfer from accelerator to host

Within the timestep loop / RK loop:
!$acc data in integrate (major arrays on accelerator)
!$acc initialization in rhsf (1,2)
!$acc update host(U, YSPECIES, TEMP)
!$acc parallel loop in rhsf (3)
MPI halo update for U, YSPECIES, TEMP
!$acc update device(grad_U, grad_Ys, grad_T)
!$acc parallel loop in rhsf (4-5)
!$acc update host(mixMW)
MPI halo update for mixMW
!$acc update device(grad_mixMW)
!$acc parallel loop in rhsf (6,7,8,9)
MPI halo update for TEMP
Fill RHS array on host
!$acc update device(diffFlux)
!$acc parallel loop in rhsf (10)
!$acc update host(diffFlux)
MPI halo update for diffFlux
!$acc update device(diffFlux, rhs)
!$acc parallel loop in rhsf (11,12)
!$acc update host(rhs)
Host | Host | Acc | Acc Copy | Acc Copy | Calls |Function
Time% | Time | Time | In | Out | | PE=HIDE
| | | (MBytes) | (MBytes) | |
100.0% | 283.637 | 220.669 | 276847.501 | 111725.395 | 10420 |Total
|-------------------------------------------------------------------------------------------------
| 21.2% | 60.213 | -- | -- | -- | 120 |[email protected]
| 6.9% | 19.587 | -- | -- | -- | 120 |[email protected]
| 6.8% | 19.209 | -- | -- | -- | 120 |[email protected]
| 5.0% | 14.306 | 14.306 | 32602.500 | -- | 120 |[email protected]
| 5.0% | 14.157 | 14.157 | 32805.000 | -- | 120 |[email protected]
| 4.8% | 13.533 | 13.533 | 30881.250 | -- | 120 |[email protected]
| 4.8% | 13.506 | 13.506 | 30881.250 | -- | 120 |[email protected]
| 4.8% | 13.478 | 13.478 | 30881.250 | -- | 120 |[email protected]
| 3.4% | 9.758 | 9.758 | 22376.250 | -- | 120 |[email protected]
| 3.4% | 9.738 | -- | -- | -- | 120 |[email protected]
| 3.0% | 8.509 | 8.509 | -- | 32602.500 | 120 |[email protected]
| 2.6% | 7.388 | 7.388 | 17010.000 | -- | 120 |[email protected]
| 2.6% | 7.372 | 7.372 | 17010.000 | -- | 120 |[email protected]
| 2.5% | 7.078 | 7.078 | 16402.500 | -- | 120 |[email protected]
| 2.4% | 6.862 | 6.862 | 15795.000 | -- | 120 |[email protected]
| 2.4% | 6.856 | 6.856 | 15795.000 | -- | 120 |[email protected]
| 2.1% | 5.834 | -- | -- | -- | 120 |[email protected]
| 1.9% | 5.499 | 5.499 | -- | 22376.250 | 120 |[email protected]
| 1.8% | 5.157 | 5.157 | -- | 21060.000 | 120 |[email protected]
| 1.5% | 4.167 | 4.167 | -- | 16605.000 | 120 |[email protected]
| 1.3% | 3.792 | -- | -- | -- | 120 |[email protected]
| 1.3% | 3.656 | -- | -- | -- | 120 |[email protected]
| 1.1% | 2.996 | -- | -- | -- | 120 |[email protected]
| 1.0% | 2.707 | -- | -- | -- | 20 |[email protected]
| 0.9% | 2.531 | 2.531 | -- | -- | 120 |[email protected]
| 0.9% | 2.474 | 2.474 | 5670.000 | -- | 120 |[email protected]
| 0.8% | 2.242 | 2.242 | 4906.645 | -- | 20 |[email protected]
| 0.6% | 1.626 | 1.626 | 3729.375 | -- | 20 |[email protected]
| 0.5% | 1.452 | 1.452 | -- | 5670.000 | 120 |[email protected]
| 0.5% | 1.444 | 1.444 | -- | 5670.000 | 120 |[email protected]
| 0.5% | 1.388 | 1.388 | -- | 4906.645 | 20 |[email protected]
| 0.4% | 1.039 | -- | -- | -- | 120 |[email protected]
| 0.2% | 0.705 | -- | -- | -- | 20 |[email protected]
| 0.2% | 0.691 | 0.691 | -- | 2835.000 | 20 |[email protected]
| 0.2% | 0.659 | 0.659 | -- | -- | 120 |[email protected]
| 0.2% | 0.498 | 0.498 | -- | -- | 120 |[email protected]
| 0.2% | 0.497 | 0.497 | -- | -- | 120 |[email protected]
| 0.1% | 0.390 | -- | -- | -- | 120 |[email protected]
| 0.1% | 0.289 | 0.289 | 0.135 | -- | 120 |[email protected]
| 0.0% | 0.072 | 0.072 | 101.250 | -- | 120 |[email protected]
| 0.0% | 0.048 | 13.827 | -- | -- | 120 |[email protected]
#ifdef GPU
!$acc parallel loop private(i, ml, mu) present(temp, pressure, yspecies, rb, rf, cgetrates)
#else
!$omp parallel private(i, ml, mu)
!$omp do
#endif
      do i = 1, nx*ny*nz, ms
        ml = i
        mu = min(i+ms-1, nx*ny*nz)
        call reaction_rate_vec_1( temp, pressure, yspecies, ml, mu, rb, rf, cgetrates )
      end do
#ifdef GPU
!$acc end parallel loop
#else
!$omp end parallel
#endif
#ifdef GPU
!$acc update device(grad_u, mixmw)
!$acc parallel private(i, ml, mu)
!$acc loop
#else
!$omp parallel private(i, ml, mu)
!$omp do
#endif
      do i = 1, nx*ny*nz, ms
        ml = i
        mu = min(i+ms-1, nx*ny*nz)
        if (jstage .eq. 1) then
          call computeCoefficients_r( pressure, Temp, yspecies, q(:,:,:,4), ds_mxvg, vscsty, mixmw, ml, mu )
        endif
        call computeStressTensor_r( grad_u, vscsty, ml, mu )
      enddo
#ifdef GPU
!$acc end loop
!$acc end parallel
#else
!$omp end parallel
#endif