Performance Results from Multi-core Platforms
Nicholas J. Wright Advanced Technology Group, NERSC/LBNL
njwright@lbl.gov
Programming weather, climate, and earth-system models on heterogeneous multi-core platforms
19-20 September 2013, NCAR
Rich @ SC12 – “Show some Data!”
NERSC users: I want
• Robust code changes – I don’t want to add things in only to take them out again two years later
• Performance portability – Changes made today for one platform should help on all
• Given hardware trends, what should I do? – Understand what is limiting my application's performance
• Roofline Model – Identify and exploit parallelism
• OpenMP • Vectors • Tasks
Understanding Performance on Today’s Machines – Per socket comparison
• Edison Cray XC30 – Intel Ivybridge – 2.4 GHz – 12 cores, 212 gflops, 50 GB/s* per socket
• Hopper Cray XE6 – AMD Magny Cours – 2.1 GHz – 12 cores, 95 gflops, 35 GB/s* per socket
• Mira BG/Q – IBM PowerPC – 1.6 GHz – 16 cores, 172 gflops, 29 GB/s* per socket
• Intel Xeon Phi (KNC) – 1.238 GHz – 61 cores, 1.06TF, 174 GB/s* per socket
• NVIDIA Kepler K20X – XXX cores, 1.22TF, 171 GB/s per socket
*DGEMM & STREAM triad
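The starred figures come from DGEMM and STREAM triad measurements. For reference, a minimal sketch of the STREAM triad kernel (array size, scalar value, and the missing timing harness are my simplifications, not the official benchmark code):

    #include <stdlib.h>
    #include <stdio.h>

    /* STREAM triad: a[i] = b[i] + scalar * c[i].
       Sustained bandwidth = 3 arrays * N * sizeof(double) bytes moved,
       divided by the measured kernel time (timing loop omitted here). */
    int main(void) {
        const size_t N = 1 << 25;            /* large enough to exceed cache */
        const double scalar = 3.0;
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        #pragma omp parallel for
        for (size_t i = 0; i < N; i++)
            a[i] = b[i] + scalar * c[i];     /* the triad kernel */

        printf("a[0] = %f\n", a[0]);
        free(a); free(b); free(c);
        return 0;
    }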
Performance Per Node
[Per-node bar charts, redrawn as a table: sustained STREAM triad bandwidth and DGEMM performance]

                   STREAM (GB/s)   DGEMM (gflops)
    XC30                102              423
    XE6                  50              189
    BG/Q                 29              172
    Xeon Phi            174            1,060
    NVIDIA K20X         175            1,220
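Dividing the DGEMM column by the STREAM column gives each node's machine balance (sustained flops per byte), i.e. the operational intensity below which a code is bandwidth-bound. A quick worked check using the table values (the arithmetic below is mine, not from the slides):

    #include <stdio.h>

    /* Machine balance = sustained DGEMM (gflops) / sustained STREAM (GB/s),
       using the per-node values from the table above. */
    int main(void) {
        const char  *sys[]    = { "XC30", "XE6", "BG/Q", "Xeon Phi", "K20X" };
        const double dgemm[]  = { 423, 189, 172, 1060, 1220 };   /* gflops */
        const double stream[] = { 102,  50,  29,  174,  175 };   /* GB/s   */
        for (int i = 0; i < 5; i++)
            printf("%-9s balance = %.1f flops/byte\n", sys[i], dgemm[i] / stream[i]);
        return 0;
    }
    /* Output: XC30 ~4.1, XE6 ~3.8, BG/Q ~5.9, Xeon Phi ~6.1, K20X ~7.0 flops/byte */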
Roofline for Test Systems
[Roofline plot: GFlop/s per core vs. operational intensity (flops/byte) for Hopper, Edison, and Mira, with the operational intensities of GTC, MILC, miniDFT, miniGhost, miniFE, SNAP, and AMG marked]
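The roofline bound is: attainable GFlop/s = min(peak GFlop/s, bandwidth × operational intensity). A sketch of how per-core rooflines like those in the plot can be computed, using the per-socket DGEMM and STREAM figures from the earlier slide divided by core count (that normalization is my assumption about how the plot is scaled):

    #include <stdio.h>

    /* Roofline: attainable GFlop/s = min(peak, BW * operational intensity).
       Per-core numbers derived from the per-socket DGEMM and STREAM figures:
       Edison 212/12 GFlop/s and 50/12 GB/s, Hopper 95/12 and 35/12,
       Mira 172/16 and 29/16. */
    static double roofline(double peak, double bw, double oi) {
        double mem_bound = bw * oi;
        return mem_bound < peak ? mem_bound : peak;
    }

    int main(void) {
        const char  *sys[]  = { "Edison", "Hopper", "Mira" };
        const double peak[] = { 212.0/12, 95.0/12, 172.0/16 };  /* GFlop/s per core */
        const double bw[]   = {  50.0/12, 35.0/12,  29.0/16 }; /* GB/s per core    */
        for (int s = 0; s < 3; s++)
            for (double oi = 0.25; oi <= 16.0; oi *= 4)         /* flops/byte */
                printf("%-6s OI=%5.2f -> %5.2f GFlop/s per core\n",
                       sys[s], oi, roofline(peak[s], bw[s], oi));
        return 0;
    }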
Flash Code on Edison, Hopper, Mira
• ~4x more parallelism needed for equivalent performance on BG/Q compared to Cray XC30
• Energy – XC30 280 W/node – BG/Q 80 W/node – Factor of 3.5x
[Plots: FLASH runtime (seconds, log scale) and parallel efficiency vs. node count (512–32768) for BG/Q (16xMPI, 4xOpenMP), BG/Q (1xMPI, 64xOpenMP), Hopper (24xMPI), Hopper (4xMPI, 6xOpenMP), Edison (24xMPI), and Edison (2xMPI, 12xOpenMP)]
• Same performance on BG/Q and XC30 achievable
• Need to work harder on BG/Q
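The legend's "16xMPI, 4xOpenMP"-style labels describe how each node's cores are split between MPI ranks and OpenMP threads. A minimal hybrid skeleton that reports the decomposition it was launched with (rank and thread counts are chosen by the launcher and OMP_NUM_THREADS, not in the code; this is a generic sketch, not FLASH):

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    /* Hybrid MPI+OpenMP skeleton: each rank reports how many OpenMP threads
       it will use, so a run launched as 16 ranks x 4 threads per node
       (or 1 rank x 64 threads) can be verified at startup. */
    int main(int argc, char **argv) {
        int provided, rank, nranks;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        #pragma omp parallel
        {
            #pragma omp master
            printf("rank %d of %d: %d OpenMP threads\n",
                   rank, nranks, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }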
Performance Tuning of NWChem Texas Integrals
• Two-electron repulsion integrals and construction of the Fock matrix are key NWChem components (PMBS'13, submitted)
• Node-level performance normalized to Hopper reference
• SMT and vectorization are key for MIC & BG/Q
• Code does not lend itself well to vectorization; likely a new algorithmic approach is required
• Optimizations include: dynamic load balancing, improved data locality, loop transformations, multi-threading, compiler-directed vectorization (a load-balancing sketch follows this list)
• Overall performance gain up to 2.5x
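As a rough illustration of the dynamic load balancing item above: integral batches vary widely in cost, so a dynamic OpenMP schedule keeps threads busy instead of giving each a fixed, possibly unlucky, chunk. This is a generic sketch with invented work, not NWChem's Texas-integral code:

    #include <stdio.h>
    #include <stddef.h>

    /* Stand-in for one shell-quartet batch; deliberately uneven cost. */
    static double batch_cost(size_t b) {
        double x = 0.0;
        for (size_t i = 0; i < (b % 1000 + 1) * 1000; i++)
            x += 1.0 / (double)(i + 1);
        return x;
    }

    int main(void) {
        const size_t nbatches = 4096;
        double total = 0.0;
        /* schedule(dynamic) hands out batches one at a time as threads
           finish, balancing the uneven per-batch work automatically. */
        #pragma omp parallel for schedule(dynamic, 1) reduction(+:total)
        for (size_t b = 0; b < nbatches; b++)
            total += batch_cost(b);
        printf("total = %f\n", total);
        return 0;
    }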
Optimization of Geometric Multigrid
See: S. Williams, D. Kalamkar, A. Singh, A. Deshpande, B. Van Straalen, M. Smelyanskiy, A. Almgren, P. Dubey, J. Shalf, L. Oliker. "Optimization of Geometric Multigrid for Emerging Multi- and Manycore Processors", Supercomputing (SC), November 2012.
GTC on Homogeneous and Heterogeneous Platforms
See: Bei Wang, Stephane Ethier, William Tang, Timothy Williams, Khaled Z. Ibrahim, Kamesh Madduri, Samuel Williams, Leonid Oliker. "Kinetic Turbulence Simulations at Extreme Scale on Leadership-Class Systems", Supercomputing (SC), November 2013.
Vectorization (Hopper vs Edison)
[Bar chart: runtime change (%) from enabling vectorization for NERSC benchmark codes on Hopper and Edison]
Vectorization doesn't help most NERSC benchmark codes as written today.
NWChem Vectorization on Intel MIC and BG/Q
• Top ten subroutines accounted for 73% of total running time
• Erintsp and ssssm benefit from vectorized functions (inverse square root)
• Obassi, wt2wt2, trac12, amshf benefit from vectorized data access
• Assem, xwpq, pre4n suffer from indirect data access (see the sketch below)
• Destbul cannot be automatically vectorized by the compiler due to serialization
• Both platforms show similar effects
[Charts: per-subroutine vectorization impact on Intel MIC and BG/Q]
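The contrast between vectorized and indirect data access above can be seen in a toy pair of loops: a unit-stride load plus an inverse square root vectorizes well, while a gather through an index array usually defeats the auto-vectorizer. Function and array names below are illustrative, not the NWChem routines:

    #include <math.h>
    #include <stddef.h>

    /* Vectorizes well: unit-stride access plus an inverse square root,
       the pattern that helps erintsp/ssssm-style kernels. */
    void rsqrt_contiguous(double *out, const double *in, size_t n) {
        #pragma omp simd
        for (size_t i = 0; i < n; i++)
            out[i] = 1.0 / sqrt(in[i]);
    }

    /* Usually does not vectorize (or needs a gather): the load address
       depends on idx[i], the "indirect data access" that hurts
       assem/xwpq/pre4n-style kernels. */
    void rsqrt_indirect(double *out, const double *in,
                        const int *idx, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = 1.0 / sqrt(in[idx[i]]);
    }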
The BSP execution model wastes resources packing buffers
Shifter routine ~30% faster → GTC overall ~5% faster
[Bar chart: relative time of the shift routine before (old) and after (new) the optimization, broken down into serial, OpenMP, and MPI components]
Cost of repacking data is a significant fraction of the execution time. This is a waste of resources as well as detrimental to programmer productivity. Example: by using OpenMP tasking we can use spare resources to repack buffers while messages are being sent.
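A minimal sketch of that idea, not GTC's actual code: an OpenMP task packs block b while the thread driving MPI waits out the nonblocking send of block b-1; double buffering keeps the in-flight buffer untouched. Block size, function names, and message layout are illustrative:

    #include <mpi.h>
    #include <stddef.h>

    #define BLK 1024   /* illustrative block size */

    /* Stand-in for the real repacking of buffer data. */
    static void pack_block(double *dst, const double *src, int n) {
        for (int i = 0; i < n; i++) dst[i] = src[i];
    }

    /* Requires MPI initialized with at least MPI_THREAD_SERIALIZED,
       since only the thread in the single region makes MPI calls. */
    void send_blocks(const double *src, int nblocks, int dest, MPI_Comm comm) {
        double buf[2][BLK];
        #pragma omp parallel
        #pragma omp single
        {
            MPI_Request req = MPI_REQUEST_NULL;
            for (int b = 0; b < nblocks; b++) {
                double *cur = buf[b % 2];
                #pragma omp task                       /* a spare thread packs block b ... */
                pack_block(cur, src + (size_t)b * BLK, BLK);
                MPI_Wait(&req, MPI_STATUS_IGNORE);     /* ... while send b-1 is in flight  */
                #pragma omp taskwait                   /* packing of block b finished?     */
                MPI_Isend(cur, BLK, MPI_DOUBLE, dest, b, comm, &req);
            }
            MPI_Wait(&req, MPI_STATUS_IGNORE);         /* drain the last send */
        }
    }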
Summary
• Disruptive technology changes are coming
– Understand how they will affect you!
• Modify your code with a mind to the future
– Make sure you understand what the limiting factors are
– OpenMP
– Vectorization
– Tasking
• Early results seem to indicate that this approach will be beneficial on today's machines and tomorrow's!
Acknowledgements
• US Department of Energy Contract No. DE-AC02-05CH11231
• Matthew Cordery, Chris Daley, Brian Austin – NERSC ATG Group
• Lenny Oliker, Sam Williams, Khaled Ibrahim – LBNL FTG Group