Performance Results from Multi-core Platforms
Nicholas J. Wright Advanced Technology Group, NERSC/LBNL
njwright@lbl.gov
Programming weather, climate, and earth-system models on heterogeneous multi-core platforms
19-20 September 2013, NCAR
Rich @ SC12 – “Show some Data!”
NERSC users: I want
• Robust code changes – I don’t want to add things in only to take them out again two years later
• Performance portability – Changes made today for one platform should help on all
• Given hardware trends, what should I do? – Understand what is limiting my application's performance
• Roofline Model – Identify and exploit parallelism
• OpenMP • Vectors • Tasks
Understanding Performance on Today’s Machines – Per socket comparison
• Edison Cray XC30 – Intel Ivybridge – 2.4 GHz – 12 cores, 212 gflops, 50 GB/s* per socket
• Hopper Cray XE6 – AMD Magny Cours – 2.1 GHz – 12 cores, 95 gflops, 35 GB/s* per socket
• Mira BG/Q – IBM PowerPC – 1.6 GHz – 16 cores, 172 gflops, 29 GB/s* per socket
• Intel Xeon Phi (KNC) – 1.238 GHz – 61 cores, 1.06TF, 174 GB/s* per socket
• NVIDIA Kepler K20X – XXX cores, 1.22TF, 171 GB/s per socket
*DGEMM & STREAM triad
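The starred figures come from DGEMM and STREAM triad measurements. For reference, a minimal sketch of the STREAM triad kernel (array size, scalar value, and the missing timing harness are my simplifications, not the official benchmark code):

    #include <stdlib.h>
    #include <stdio.h>

    /* STREAM triad: a[i] = b[i] + scalar * c[i].
       Sustained bandwidth = 3 arrays * N * sizeof(double) bytes moved,
       divided by the measured kernel time (timing loop omitted here). */
    int main(void) {
        const size_t N = 1 << 25;            /* large enough to exceed cache */
        const double scalar = 3.0;
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        #pragma omp parallel for
        for (size_t i = 0; i < N; i++)
            a[i] = b[i] + scalar * c[i];     /* the triad kernel */

        printf("a[0] = %f\n", a[0]);
        free(a); free(b); free(c);
        return 0;
    }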
Performance Per Node
[Per-node bar charts, redrawn as a table: sustained STREAM triad bandwidth and DGEMM performance]

                   STREAM (GB/s)   DGEMM (gflops)
    XC30                102              423
    XE6                  50              189
    BG/Q                 29              172
    Xeon Phi            174            1,060
    NVIDIA K20X         175            1,220
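Dividing the DGEMM column by the STREAM column gives each node's machine balance (sustained flops per byte), i.e. the operational intensity below which a code is bandwidth-bound. A quick worked check using the table values (the arithmetic below is mine, not from the slides):

    #include <stdio.h>

    /* Machine balance = sustained DGEMM (gflops) / sustained STREAM (GB/s),
       using the per-node values from the table above. */
    int main(void) {
        const char  *sys[]    = { "XC30", "XE6", "BG/Q", "Xeon Phi", "K20X" };
        const double dgemm[]  = { 423, 189, 172, 1060, 1220 };   /* gflops */
        const double stream[] = { 102,  50,  29,  174,  175 };   /* GB/s   */
        for (int i = 0; i < 5; i++)
            printf("%-9s balance = %.1f flops/byte\n", sys[i], dgemm[i] / stream[i]);
        return 0;
    }
    /* Output: XC30 ~4.1, XE6 ~3.8, BG/Q ~5.9, Xeon Phi ~6.1, K20X ~7.0 flops/byte */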
Roofline for Test Systems
[Roofline plot: GFlop/s per core vs. operational intensity (flops/byte) for Hopper, Edison, and Mira, with the operational intensities of GTC, MILC, miniDFT, miniGhost, miniFE, SNAP, and AMG marked]
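The roofline bound is: attainable GFlop/s = min(peak GFlop/s, bandwidth × operational intensity). A sketch of how per-core rooflines like those in the plot can be computed, using the per-socket DGEMM and STREAM figures from the earlier slide divided by core count (that normalization is my assumption about how the plot is scaled):

    #include <stdio.h>

    /* Roofline: attainable GFlop/s = min(peak, BW * operational intensity).
       Per-core numbers derived from the per-socket DGEMM and STREAM figures:
       Edison 212/12 GFlop/s and 50/12 GB/s, Hopper 95/12 and 35/12,
       Mira 172/16 and 29/16. */
    static double roofline(double peak, double bw, double oi) {
        double mem_bound = bw * oi;
        return mem_bound < peak ? mem_bound : peak;
    }

    int main(void) {
        const char  *sys[]  = { "Edison", "Hopper", "Mira" };
        const double peak[] = { 212.0/12, 95.0/12, 172.0/16 };  /* GFlop/s per core */
        const double bw[]   = {  50.0/12, 35.0/12,  29.0/16 }; /* GB/s per core    */
        for (int s = 0; s < 3; s++)
            for (double oi = 0.25; oi <= 16.0; oi *= 4)         /* flops/byte */
                printf("%-6s OI=%5.2f -> %5.2f GFlop/s per core\n",
                       sys[s], oi, roofline(peak[s], bw[s], oi));
        return 0;
    }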
Flash Code on Edison, Hopper, Mira
• ~4x more parallelism needed for equivalent performance on BG/Q compared to Cray XC30
• Energy – XC30 280 W/node – BG/Q 80 W/node – Factor of 3.5x
[Plots: FLASH runtime (seconds, log scale) and parallel efficiency vs. node count (512–32768) for BG/Q (16xMPI, 4xOpenMP), BG/Q (1xMPI, 64xOpenMP), Hopper (24xMPI), Hopper (4xMPI, 6xOpenMP), Edison (24xMPI), and Edison (2xMPI, 12xOpenMP)]
• Same performance on BG/Q and XC30 achievable
• Need to work harder on BG/Q
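The legend's "16xMPI, 4xOpenMP"-style labels describe how each node's cores are split between MPI ranks and OpenMP threads. A minimal hybrid skeleton that reports the decomposition it was launched with (rank and thread counts are chosen by the launcher and OMP_NUM_THREADS, not in the code; this is a generic sketch, not FLASH):

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    /* Hybrid MPI+OpenMP skeleton: each rank reports how many OpenMP threads
       it will use, so a run launched as 16 ranks x 4 threads per node
       (or 1 rank x 64 threads) can be verified at startup. */
    int main(int argc, char **argv) {
        int provided, rank, nranks;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        #pragma omp parallel
        {
            #pragma omp master
            printf("rank %d of %d: %d OpenMP threads\n",
                   rank, nranks, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }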
Performance Tuning of NWChem Texas Integrals
• Two-electron repulsion integrals and construction of the Fock matrix are key NWChem components (PMBS'13, submitted)
• Node-level performance normalized to Hopper reference
• SMT and vectorization are key for MIC & BG/Q
• Code does not lend itself well to vectorization; likely a new algorithmic approach is required
• Optimizations include: dynamic load balancing, improved data locality, loop transformations, multi-threading, compiler-directed vectorization (a load-balancing sketch follows this list)
• Overall performance gain up to 2.5x
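As a rough illustration of the dynamic load balancing item above: integral batches vary widely in cost, so a dynamic OpenMP schedule keeps threads busy instead of giving each a fixed, possibly unlucky, chunk. This is a generic sketch with invented work, not NWChem's Texas-integral code:

    #include <stdio.h>
    #include <stddef.h>

    /* Stand-in for one shell-quartet batch; deliberately uneven cost. */
    static double batch_cost(size_t b) {
        double x = 0.0;
        for (size_t i = 0; i < (b % 1000 + 1) * 1000; i++)
            x += 1.0 / (double)(i + 1);
        return x;
    }

    int main(void) {
        const size_t nbatches = 4096;
        double total = 0.0;
        /* schedule(dynamic) hands out batches one at a time as threads
           finish, balancing the uneven per-batch work automatically. */
        #pragma omp parallel for schedule(dynamic, 1) reduction(+:total)
        for (size_t b = 0; b < nbatches; b++)
            total += batch_cost(b);
        printf("total = %f\n", total);
        return 0;
    }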
Optimization of Geometric Multigrid
See: S. Williams, D. Kalamkar, A. Singh, A. Deshpande, B. Van Straalen, M. Smelyanskiy, A. Almgren, P. Dubey, J. Shalf, L. Oliker. "Optimization of Geometric Multigrid for Emerging Multi- and Manycore Processors", Supercomputing (SC), November 2012.
GTC on Homogeneous and Heterogeneous Platforms
See: Bei Wang, Stephane Ethier, William Tang, Timothy Williams, Khaled Z. Ibrahim, Kamesh Madduri, Samuel Williams, Leonid Oliker. "Kinetic Turbulence Simulations at Extreme Scale on Leadership-Class Systems", Supercomputing (SC), November 2013.
Vectorization (Hopper vs Edison)
[Bar chart: runtime change (%) from enabling vectorization for NERSC benchmark codes on Hopper and Edison]
Vectorization doesn't help most NERSC benchmark codes as written today.
NWChem Vectorization on Intel MIC and BG/Q
• Top ten subroutines accounted for 73% of total running time
• Erintsp and ssssm benefit from vectorized functions (inverse square root)
• Obassi, wt2wt2, trac12, amshf benefit from vectorized data access
• Assem, xwpq, pre4n suffer from indirect data access (see the sketch below)
• Destbul cannot be automatically vectorized by the compiler due to serialization
• Both platforms show similar effects
[Charts: per-subroutine vectorization impact on Intel MIC and BG/Q]
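The contrast between vectorized and indirect data access above can be seen in a toy pair of loops: a unit-stride load plus an inverse square root vectorizes well, while a gather through an index array usually defeats the auto-vectorizer. Function and array names below are illustrative, not the NWChem routines:

    #include <math.h>
    #include <stddef.h>

    /* Vectorizes well: unit-stride access plus an inverse square root,
       the pattern that helps erintsp/ssssm-style kernels. */
    void rsqrt_contiguous(double *out, const double *in, size_t n) {
        #pragma omp simd
        for (size_t i = 0; i < n; i++)
            out[i] = 1.0 / sqrt(in[i]);
    }

    /* Usually does not vectorize (or needs a gather): the load address
       depends on idx[i], the "indirect data access" that hurts
       assem/xwpq/pre4n-style kernels. */
    void rsqrt_indirect(double *out, const double *in,
                        const int *idx, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = 1.0 / sqrt(in[idx[i]]);
    }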
The BSP execution model wastes resources packing buffers
Shifter routine ~30% faster → GTC overall ~5% faster
[Bar chart: relative time of the shift routine before (old) and after (new) the optimization, broken down into serial, OpenMP, and MPI components]
Cost of repacking data is a significant fraction of the execution time. This is a waste of resources as well as detrimental to programmer productivity. Example: by using OpenMP tasking we can use spare resources to repack buffers while messages are being sent.
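A minimal sketch of that idea, not GTC's actual code: an OpenMP task packs block b while the thread driving MPI waits out the nonblocking send of block b-1; double buffering keeps the in-flight buffer untouched. Block size, function names, and message layout are illustrative:

    #include <mpi.h>
    #include <stddef.h>

    #define BLK 1024   /* illustrative block size */

    /* Stand-in for the real repacking of buffer data. */
    static void pack_block(double *dst, const double *src, int n) {
        for (int i = 0; i < n; i++) dst[i] = src[i];
    }

    /* Requires MPI initialized with at least MPI_THREAD_SERIALIZED,
       since only the thread in the single region makes MPI calls. */
    void send_blocks(const double *src, int nblocks, int dest, MPI_Comm comm) {
        double buf[2][BLK];
        #pragma omp parallel
        #pragma omp single
        {
            MPI_Request req = MPI_REQUEST_NULL;
            for (int b = 0; b < nblocks; b++) {
                double *cur = buf[b % 2];
                #pragma omp task                       /* a spare thread packs block b ... */
                pack_block(cur, src + (size_t)b * BLK, BLK);
                MPI_Wait(&req, MPI_STATUS_IGNORE);     /* ... while send b-1 is in flight  */
                #pragma omp taskwait                   /* packing of block b finished?     */
                MPI_Isend(cur, BLK, MPI_DOUBLE, dest, b, comm, &req);
            }
            MPI_Wait(&req, MPI_STATUS_IGNORE);         /* drain the last send */
        }
    }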
Summary
• Disruptive technology changes are coming
– Understand how they will affect you!
• Modify your code with a mind to the future
– Make sure you understand what the limiting factors are
– OpenMP
– Vectorization
– Tasking
• Early results seem to indicate that this approach will be beneficial on today's machines and tomorrow's!
Acknowledgements
• US Department of Energy Contract No. DE-AC02-05CH11231
• Matthew Cordery, Chris Daley, Brian Austin – NERSC ATG Group
• Lenny Oliker, Sam Williams, Khaled Ibrahim – LBNL FTG Group