Porting GE's Turbomachinery CFD Solver to the Cray XK7
Dr. Brian E. Mitchell, General Electric Global Research
Aaron Vose, John Levesque, Cray, Inc.
Jeff Larkin, NVIDIA
19 Nov 2014
GE Today
Business segments: Oil & Gas; Power & Water; Energy Management; Transportation; Healthcare; Aviation; Home & Business Solutions; GE Capital
Founded 1892; 300,000 employees; $150B revenue. GE turbomachinery products are powered by ~$150B/yr. of fuel.
TACOMA … GE’s Turbomachinery RANS Code
Robust & scalable algorithm
• 2nd-order accurate, "explicit", finite-volume, multigrid
• Block structured grids
• RANS: k-omega, SST, etc.
Deployed across GE
• Aviation, Power & Water, Oil & Gas, Research
• All NPI (new product introduction) programs
20 years of active development
• Hundreds of thousands of lines of Fortran 95
• Numerous active developers
Deployed on GE & gov’t HPC
Active Interest in Performance & Platforms
2013 EOY Status
• MPI capability
• Newly added OpenMP
• Scaling to 87K cores on ORNL Jaguar
• Research explorations of CUDA, OpenACC, GPUs, Intel Phi, etc.
Mira & Scalability
• 1H14 port to Mira
• Focus on memory & temporal scaling
• Scaling > 100K cores
Titan & GPUs
• Gained access to Titan time & experts from Cray & NVIDIA
• Collaborative port of TACOMA
Porting Process
Basic readiness
• MPI scaling
• OpenMP capability
• "Jugular" experiments to show possible benefit
• Identify kernels for GPU
Revisit race conditions
Systematic building of data regions, ACC loop directives (correctness)
Code optimizations (data motion, loops) (speed)
do iter = 1, max_iterations
   q0(:) = q(:)                    ! save the state at the start of the iteration
   do istep = 1, 3                 ! three-stage explicit update
      call flux_calc               ! dq = f(q)
      q(:) = q0(:) + dt * dq(:)
      call boundary_conditions
   end do
end do
subroutine flux_calc
• 60-70% of CPU time
• Minimal MPI calls
• ~30 subroutines
• ~50 triple loops
• Stable code
Focus on Correctness & Speed & Maintainability
Address Race Conditions
Pre-OpenMP code
do i = 0, n+1          ! cells
   dq(i) = 0.0
end do
do i = 1, n+1          ! faces
   flux = ...
   dq(i-1) = dq(i-1) - flux
   dq(i  ) = dq(i  ) + flux
end do
Minimizes expensive computation of “flux”
OpenMP Approach
!$omp parallel do
do i = 0, n+1          ! cells
   dq(i) = 0.0
end do
!$omp end parallel do

!$omp parallel private(k, i, flux)
k = omp_get_thread_num()
do i = lb(k), ub(k)    ! faces touching cells owned by thread k
   flux = ...
   if (color(i-1) == k) dq(i-1) = dq(i-1) - flux
   if (color(i  ) == k) dq(i  ) = dq(i  ) + flux
end do
!$omp end parallel
Color each cell 1..nThreads; effective for nThreads < 12
Current CPU approach (a possible setup for lb, ub, and color is sketched below)
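The lb/ub face ranges and the cell colors used above are computed once per block. A minimal sketch, assuming a contiguous 1-D partition of the cells among threads; the routine name setup_coloring and the exact partitioning are illustrative, not TACOMA's actual code:

! Hypothetical sketch: give each thread a contiguous range of cells, record the
! owning thread of every cell, and let each thread loop over every face that
! touches one of its cells. Faces on a thread boundary are computed twice, but
! each cell is updated only by its owner (the color test above), so there is no race.
subroutine setup_coloring(n, nthreads, lb, ub, color)
  implicit none
  integer, intent(in)  :: n, nthreads
  integer, intent(out) :: lb(0:nthreads-1), ub(0:nthreads-1)
  integer, intent(out) :: color(0:n+1)
  integer :: k, c_lo, c_hi, chunk
  chunk = (n + 2 + nthreads - 1) / nthreads     ! cells 0..n+1 split evenly
  do k = 0, nthreads - 1
     c_lo = k * chunk                           ! first cell owned by thread k
     c_hi = min((k + 1) * chunk - 1, n + 1)     ! last cell owned by thread k
     if (c_lo <= c_hi) color(c_lo:c_hi) = k
     lb(k) = max(c_lo, 1)                       ! faces 1..n+1 touching owned cells
     ub(k) = min(c_hi + 1, n + 1)
  end do
end subroutine setup_coloring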
GPU Approach
allocate(f(1:n+1))     ! temporary per-face flux array

!$acc parallel loop
do i = 1, n+1          ! faces: first pass computes and stores every flux
   flux = ...
   f(i) = flux
end do
!$acc end parallel loop

!$acc parallel loop
do i = 1, n            ! cells: second pass assembles dq(i) from its two faces
   dq(i) = f(i) - f(i+1)
end do
!$acc end parallel loop

deallocate(f)
Two-pass structure (current GPU approach): every flux is stored in the first pass, and each dq(i) is written by exactly one iteration of the second pass, so the race is eliminated.
Systematic Porting Process – Starting point
subroutine flux_calc
call section1
call section2
call section3
etc.
subroutine section1
call foo1
call foo2
etc.
subroutine foo1
do i = 1,n
end do
8 sections
~30 foo’s
~50 triple loops
Systematic Porting Process – Step 1
Bottom-up process
Loop level directives & data regions
• Priority is correctness, not speed
• Basic optimizations as guided by the listing file (CCE option "-rm"); an illustrative example follows the sketch below
• Capture all motion of global variables
• Leverage OpenMP directives
subroutine flux_calc
call section1
call section2
call section3
etc.
subroutine section1
call foo1
call foo2
etc.
subroutine foo1
!$acc data copy(...)
!$acc parallel loop private(...)
do i = 1, n
end do
!$acc end parallel loop
!$acc end data
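To make Step 1 concrete, a hedged illustration of what one of the ~50 triple loops might look like after this pass; the routine and array names (q, vol, res) are hypothetical, and the collapse clause is one plausible choice rather than what TACOMA necessarily uses:

subroutine foo1(ni, nj, nk, q, vol, res)
  implicit none
  integer, intent(in)  :: ni, nj, nk
  real(8), intent(in)  :: q(ni,nj,nk), vol(ni,nj,nk)
  real(8), intent(out) :: res(ni,nj,nk)
  integer :: i, j, k
  !$acc data copyin(q, vol) copyout(res)        ! capture all data motion at loop level
  !$acc parallel loop collapse(3)               ! one kernel per triple loop
  do k = 1, nk
     do j = 1, nj
        do i = 1, ni
           res(i,j,k) = q(i,j,k) / vol(i,j,k)   ! placeholder flux-like update
        end do
     end do
  end do
  !$acc end parallel loop
  !$acc end data
end subroutine foo1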
Systematic Porting Process – Step 2
Move data region up the call chain
Collapse data regions from foo1, foo2, etc.
Primary goal is still correctness, though speed will be improving
subroutine flux_calc
call section1
call section2
call section3
etc.
subroutine section1
!$acc data copy(...)
call foo1
call foo2
etc.
!$acc end data

subroutine foo1
!$acc data present(...)
!$acc parallel loop private(...)
do i = 1, n
end do
!$acc end parallel loop
!$acc end data
Systematic Porting Process – Steps 3 & 4
Collapse all data transfer
Optimize data motion
• Replace blanket copy clauses with copyin / copyout where the algorithm allows
• Selective updates … based on algorithm
• Use async clauses
• Initiate data motion from CPU regions
Optimize loop performance
• Timing studies; whack-a-mole on the remaining hot spots
subroutine flux_calc
call data_to_device
call section1
call section2
call section3
etc.
call data_to_host
subroutine section1
!$acc data present(...)
call foo1
call foo2
etc.
!$acc end data

subroutine foo1
!$acc data present(...)
!$acc parallel loop private(...)
do i = 1, n
end do
!$acc end parallel loop
!$acc end data

subroutine data_to_device
if (first) then
!$acc enter data create(q, dq, p)
end if
!$acc update device(q, p) async
!$acc wait
end subroutine data_to_device
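A plausible counterpart to data_to_device (a hedged sketch; the slides do not show TACOMA's actual data_to_host) updates back only the arrays the host-side code needs after flux_calc:

subroutine data_to_host
! Hypothetical sketch: copy back only what the CPU consumes, e.g. the residual.
!$acc update host(dq) async
!$acc wait
end subroutine data_to_host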
Optimizations … A Sampling
• Use of "loop seq" for utility loops
• Convert array syntax to explicit loops
• Avoid small private arrays … made a 2x speed difference in a key routine (see the sketch at the end of this slide)
• Where possible, use firstprivate / copyin clauses on "!$acc parallel loop"
• Essential to have access to experienced application optimization experts!
! "loop seq" on a short utility loop inside the kernel
!$acc parallel loop
do i = 1, N
   ...
   !$acc loop seq
   do iv = 1, 5
      g(iv) = ...
   end do
   ...
end do
!$acc end parallel loop

! Array syntax converted to an explicit inner loop ...
!$acc parallel loop
do i = 1, N
   do iv = 1, 5
      g(iv) = ...
   end do
end do

! ... instead of the original array-syntax form
!$acc parallel loop
do i = 1, N
   g(1:5) = ...
end do
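As an illustration of the "avoid small private arrays" bullet, a hedged before/after sketch (the arrays a and b and the arithmetic are hypothetical): replacing a 5-element private array with scalar temporaries lets the compiler keep the values in registers instead of materializing a per-thread array.

! Before (slower): a small private array per iteration
!$acc parallel loop private(g)
do i = 1, n
   g(1:5) = a(i,1:5)
   b(i) = g(1) + g(2) + g(3) + g(4) + g(5)
end do

! After: scalar temporaries (this kind of change gave ~2x in a key routine)
!$acc parallel loop private(g1, g2, g3, g4, g5)
do i = 1, n
   g1 = a(i,1); g2 = a(i,2); g3 = a(i,3); g4 = a(i,4); g5 = a(i,5)
   b(i) = g1 + g2 + g3 + g4 + g5
end do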
Results … 1 CPU core vs 1 GPU
flux_calc … 21x speedup of 1 GPU vs. 1 CPU core
A key routine (VFLX) is not well optimized on GPU
Single blade row steady "BICON" test case
• 12.5 million grid cells
• 128 structured grid blocks
• Perfect MPI load balance
1 CPU core vs. 1 GPU
• 4 nodes
• 1 rank / node, no OMP threads
[Chart: speedup for each subroutine called by flux_calc, plus the overall speedup]
Results obtained on an internal machine at Cray (similar to Titan)
Results … Fully Utilized Nodes
Best-case speed-up on Titan is 1.3x
Run CPU using optimal MPI / OpenMP ratio
Run GPU
• CRAY_CUDA_PROXY=1
• Optimal MPI : OMP ratio
Results obtained on Titan
[Chart: timings at 200K cells / CPU core and at 12K cells / CPU core, with bars labeled by MPI ranks / node x OMP threads / rank (16 x 1, 8 x 2, 4 x 4); annotations note the work / kernel down 2x and the # of kernels / GPU]
Final Thoughts
Porting process
• OpenMP was a solid base
• Be systematic … focus on correctness and then speed
• Build infrastructure to regularly check correctness (a minimal sketch follows this list)
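A minimal sketch (hypothetical helper, not TACOMA's actual harness) of the kind of check such infrastructure performs: comparing a ported field against a trusted CPU reference to within a relative tolerance.

! Hypothetical helper: .true. if every entry of new agrees with ref to within a
! relative tolerance tol (with an absolute floor for near-zero reference values).
logical function fields_match(ref, new, tol)
  implicit none
  real(8), intent(in) :: ref(:), new(:), tol
  fields_match = all( abs(new - ref) <= tol * max(abs(ref), 1.0d-12) )
end function fields_match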
Performance
• Ensure correct loops get partitioned and vectorized
• Reduce memory transfer as much as possible
Next steps
• Can we improve the VFLX routine?
• Opportunity to improve CPU code by 1.1x by revisiting the race-condition solution
Acknowledgements
GE gratefully acknowledges the close collaboration & support of Cray, NVIDIA, and ORNL.
Cray, Inc.: Aaron Vose, John Levesque
Use of internal Cray development clusters
NVIDIA: Jeff Larkin
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.