Porting GE's Turbomachinery CFD Solver to the Cray XK7
Dr. Brian E. Mitchell, General Electric Global Research
Aaron Vose, John Levesque, Cray, Inc.
Jeff Larkin, NVIDIA
19 Nov 2014
GE Today
Business segments: Oil & Gas; Power & Water; Energy Management; Transportation; Healthcare; Aviation; Home & Business Solutions; GE Capital
Founded 1892; 300,000 employees; $150B revenue. GE turbomachinery products are powered by ~$150B/yr. of fuel.
TACOMA … GE’s Turbomachinery RANS Code
Robust & scalable algorithm
• 2nd-order accurate, "explicit", finite-volume, multigrid
• Block structured grids
• RANS: k-omega, SST, etc.
Deployed across GE
• Aviation, Power & Water, Oil & Gas, Research
• All NPI (new product introduction) programs
20 years of active development
• Hundreds of thousands of lines of Fortran 95
• Numerous active developers
Deployed on GE & gov’t HPC
Active Interest in Performance & Platforms
2013 EOY Status
• MPI capability
• Newly added OpenMP
• Scaling to 87K cores on ORNL Jaguar
• Research explorations of CUDA, OpenACC, GPUs, Intel Phi, etc.
Mira & Scalability
• 1H14 port to Mira
• Focus on memory & temporal scaling
• Scaling > 100K cores
Titan & GPUs
• Gained access to Titan time & experts from Cray & NVIDIA
• Collaborative port of TACOMA
Porting Process
Basic readiness
• MPI scaling
• OpenMP capability
• "Jugular" experiments to show possible benefit
• Identify kernels for GPU
Revisit race conditions
Systematic building of data regions, ACC loop directives (correctness)
Code optimizations (data motion, loops) (speed)
do iter = 1, max_iterations
   q0(:) = q(:)                    ! save the state at the start of the iteration
   do istep = 1, 3                 ! three-stage explicit update
      call flux_calc               ! dq = f(q)
      q(:) = q0(:) + dt * dq(:)
      call boundary_conditions
   end do
end do
subroutine flux_calc
• 60-70% of CPU time
• Minimal MPI calls
• ~30 subroutines
• ~50 triple loops
• Stable code
Focus on Correctness & Speed & Maintainability
Address Race Conditions
Pre-OpenMP code
do i = 0, n+1          ! cells
   dq(i) = 0.0
end do
do i = 1, n+1          ! faces
   flux = ...
   dq(i-1) = dq(i-1) - flux
   dq(i  ) = dq(i  ) + flux
end do
Minimizes expensive computation of “flux”
OpenMP Approach
!$omp parallel do
do i = 0, n+1          ! cells
   dq(i) = 0.0
end do
!$omp end parallel do

!$omp parallel private(k, i, flux)
k = omp_get_thread_num()
do i = lb(k), ub(k)    ! faces touching cells owned by thread k
   flux = ...
   if (color(i-1) == k) dq(i-1) = dq(i-1) - flux
   if (color(i  ) == k) dq(i  ) = dq(i  ) + flux
end do
!$omp end parallel
Color each cell 1..nThreads; effective for nThreads < 12
Current CPU approach (a possible setup for lb, ub, and color is sketched below)
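The lb/ub face ranges and the cell colors used above are computed once per block. A minimal sketch, assuming a contiguous 1-D partition of the cells among threads; the routine name setup_coloring and the exact partitioning are illustrative, not TACOMA's actual code:

! Hypothetical sketch: give each thread a contiguous range of cells, record the
! owning thread of every cell, and let each thread loop over every face that
! touches one of its cells. Faces on a thread boundary are computed twice, but
! each cell is updated only by its owner (the color test above), so there is no race.
subroutine setup_coloring(n, nthreads, lb, ub, color)
  implicit none
  integer, intent(in)  :: n, nthreads
  integer, intent(out) :: lb(0:nthreads-1), ub(0:nthreads-1)
  integer, intent(out) :: color(0:n+1)
  integer :: k, c_lo, c_hi, chunk
  chunk = (n + 2 + nthreads - 1) / nthreads     ! cells 0..n+1 split evenly
  do k = 0, nthreads - 1
     c_lo = k * chunk                           ! first cell owned by thread k
     c_hi = min((k + 1) * chunk - 1, n + 1)     ! last cell owned by thread k
     if (c_lo <= c_hi) color(c_lo:c_hi) = k
     lb(k) = max(c_lo, 1)                       ! faces 1..n+1 touching owned cells
     ub(k) = min(c_hi + 1, n + 1)
  end do
end subroutine setup_coloring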
GPU Approach
allocate(f(1:n+1))     ! temporary per-face flux array

!$acc parallel loop
do i = 1, n+1          ! faces: first pass computes and stores every flux
   flux = ...
   f(i) = flux
end do
!$acc end parallel loop

!$acc parallel loop
do i = 1, n            ! cells: second pass assembles dq(i) from its two faces
   dq(i) = f(i) - f(i+1)
end do
!$acc end parallel loop

deallocate(f)
Two-pass structure (current GPU approach): every flux is stored in the first pass, and each dq(i) is written by exactly one iteration of the second pass, so the race is eliminated.
Systematic Porting Process – Starting point
subroutine flux_calc
call section1
call section2
call section3
etc.
subroutine section1
call foo1
call foo2
etc.
subroutine foo1
do i = 1,n
end do
8 sections
~30 foo’s
~50 triple loops
Systematic Porting Process – Step 1
Bottom-up process
Loop level directives & data regions
• Priority is correctness, not speed
• Basic optimizations as guided by the listing file (CCE option "-rm"); an illustrative example follows the sketch below
• Capture all motion of global variables
• Leverage OpenMP directives
subroutine flux_calc
call section1
call section2
call section3
etc.
subroutine section1
call foo1
call foo2
etc.
subroutine foo1
!$acc data copy(...)
!$acc parallel loop private(...)
do i = 1, n
end do
!$acc end parallel loop
!$acc end data
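To make Step 1 concrete, a hedged illustration of what one of the ~50 triple loops might look like after this pass; the routine and array names (q, vol, res) are hypothetical, and the collapse clause is one plausible choice rather than what TACOMA necessarily uses:

subroutine foo1(ni, nj, nk, q, vol, res)
  implicit none
  integer, intent(in)  :: ni, nj, nk
  real(8), intent(in)  :: q(ni,nj,nk), vol(ni,nj,nk)
  real(8), intent(out) :: res(ni,nj,nk)
  integer :: i, j, k
  !$acc data copyin(q, vol) copyout(res)        ! capture all data motion at loop level
  !$acc parallel loop collapse(3)               ! one kernel per triple loop
  do k = 1, nk
     do j = 1, nj
        do i = 1, ni
           res(i,j,k) = q(i,j,k) / vol(i,j,k)   ! placeholder flux-like update
        end do
     end do
  end do
  !$acc end parallel loop
  !$acc end data
end subroutine foo1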
Systematic Porting Process – Step 2
Move data region up the call chain
Collapse data regions from foo1, foo2, etc.
Primary goal is still correctness, though speed will be improving
subroutine flux_calc
call section1
call section2
call section3
etc.
subroutine section1
!$acc data copy(...)
call foo1
call foo2
etc.
!$acc end data

subroutine foo1
!$acc data present(...)
!$acc parallel loop private(...)
do i = 1, n
end do
!$acc end parallel loop
!$acc end data
Systematic Porting Process – Steps 3 & 4
Collapse all data transfer
Optimize data motion
• Replace blanket copy clauses with copyin / copyout where the algorithm allows
• Selective updates … based on algorithm
• Use async clauses
• Initiate data motion from CPU regions
Optimize loop performance
• Timing studies; whack-a-mole on the remaining hot spots
subroutine flux_calc
call data_to_device
call section1
call section2
call section3
etc.
call data_to_host
subroutine section1
!$acc data present(...)
call foo1
call foo2
etc.
!$acc end data

subroutine foo1
!$acc data present(...)
!$acc parallel loop private(...)
do i = 1, n
end do
!$acc end parallel loop
!$acc end data

subroutine data_to_device
if (first) then
!$acc enter data create(q, dq, p)
end if
!$acc update device(q, p) async
!$acc wait
end subroutine data_to_device
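A plausible counterpart to data_to_device (a hedged sketch; the slides do not show TACOMA's actual data_to_host) updates back only the arrays the host-side code needs after flux_calc:

subroutine data_to_host
! Hypothetical sketch: copy back only what the CPU consumes, e.g. the residual.
!$acc update host(dq) async
!$acc wait
end subroutine data_to_host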
Optimizations … A Sampling
• Use of "loop seq" for utility loops
• Convert array syntax to explicit loops
• Avoid small private arrays … made a 2x speed difference in a key routine (see the sketch at the end of this slide)
• Where possible, use firstprivate / copyin clauses on "!$acc parallel loop"
• Essential to have access to experienced application optimization experts!
! "loop seq" on a short utility loop inside the kernel
!$acc parallel loop
do i = 1, N
   ...
   !$acc loop seq
   do iv = 1, 5
      g(iv) = ...
   end do
   ...
end do
!$acc end parallel loop

! Array syntax converted to an explicit inner loop ...
!$acc parallel loop
do i = 1, N
   do iv = 1, 5
      g(iv) = ...
   end do
end do

! ... instead of the original array-syntax form
!$acc parallel loop
do i = 1, N
   g(1:5) = ...
end do
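As an illustration of the "avoid small private arrays" bullet, a hedged before/after sketch (the arrays a and b and the arithmetic are hypothetical): replacing a 5-element private array with scalar temporaries lets the compiler keep the values in registers instead of materializing a per-thread array.

! Before (slower): a small private array per iteration
!$acc parallel loop private(g)
do i = 1, n
   g(1:5) = a(i,1:5)
   b(i) = g(1) + g(2) + g(3) + g(4) + g(5)
end do

! After: scalar temporaries (this kind of change gave ~2x in a key routine)
!$acc parallel loop private(g1, g2, g3, g4, g5)
do i = 1, n
   g1 = a(i,1); g2 = a(i,2); g3 = a(i,3); g4 = a(i,4); g5 = a(i,5)
   b(i) = g1 + g2 + g3 + g4 + g5
end do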
Results … 1 CPU core vs 1 GPU
flux_calc … 21x speedup of 1 GPU vs. 1 CPU core
A key routine (VFLX) is not well optimized on GPU
Single blade row steady "BICON" test case
• 12.5 million grid cells
• 128 structured grid blocks
• Perfect MPI load balance
1 CPU core vs. 1 GPU
• 4 nodes
• 1 rank / node, no OMP threads
[Chart: speedup for each subroutine called by flux_calc, plus the overall speedup]
Results obtained on an internal machine at Cray (similar to Titan)
Results … Fully Utilized Nodes
Best-case speed-up on Titan is 1.3x
Run CPU using optimal MPI / OpenMP ratio
Run GPU
• CRAY_CUDA_PROXY=1
• Optimal MPI : OMP ratio
Results obtained on Titan
[Chart: timings at 200K cells / CPU core and at 12K cells / CPU core, with bars labeled by MPI ranks / node x OMP threads / rank (16 x 1, 8 x 2, 4 x 4); annotations note the work / kernel down 2x and the # of kernels / GPU]
Final Thoughts
Porting process
• OpenMP was a solid base
• Be systematic … focus on correctness and then speed
• Build infrastructure to regularly check correctness (a minimal sketch follows this list)
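A minimal sketch (hypothetical helper, not TACOMA's actual harness) of the kind of check such infrastructure performs: comparing a ported field against a trusted CPU reference to within a relative tolerance.

! Hypothetical helper: .true. if every entry of new agrees with ref to within a
! relative tolerance tol (with an absolute floor for near-zero reference values).
logical function fields_match(ref, new, tol)
  implicit none
  real(8), intent(in) :: ref(:), new(:), tol
  fields_match = all( abs(new - ref) <= tol * max(abs(ref), 1.0d-12) )
end function fields_match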
Performance
• Ensure correct loops get partitioned and vectorized
• Reduce memory transfer as much as possible
Next steps
• Can we improve the VFLX routine?
• Opportunity to improve CPU code by 1.1x by revisiting the race-condition solution
Acknowledgements
GE gratefully acknowledges the close collaboration & support of Cray, NVIDIA, and ORNL.
Cray, Inc.: Aaron Vose, John Levesque
Use of internal Cray development clusters
NVIDIA: Jeff Larkin
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.