Trends in systems and how to get efficient performance
Martin Hilgeman
HPC Consultant
[email protected]
The landscape is changing
“We are no longer in the general purpose era… the argument of tuning software for hardware is
moot. Now, to get the best bang for the buck, you have to tune both.”
- Kushagra Vaid, general manager of server engineering, Microsoft Cloud Solutions
https://www.nextplatform.com/2017/03/08/arm-amd-x86-server-chips-get-mainstream-lift-microsoft/amp/
3 of 48 Accelerating Understanding Pisa, September 2017
System trend over the years (1)
[Diagram: ~1970 - 2000, single-core processors; 2005, multi-core processors. Multi-core: TOCK]
System trend over the years (2)
[Diagram: 2007, integrated memory controller: TOCK; 2012, integrated PCIe controller: TOCK]
Future
[Diagram: integrated network fabric adapter: TOCK; SoC designs: TOCK]
Moore’s Law vs. Amdahl’s Law
• The clock speed plateau
• The power ceiling
• The IPC limit
• Industry is applying Moore’s Law by adding more cores
• Meanwhile, Amdahl’s Law says that you cannot use them all efficiently
Chuck Moore, "Data Processing in Exascale-Class Computer Systems", The Salishan Conference on High Speed Computing, 2011
[Chart: wall-clock time (serial + parallel parts) vs. number of cores (1, 2, 4, 12, 16, 24, 32) for an 80% parallel application; the resulting speedups are 1.00, 1.67, 2.50, 3.75, 4.00, 4.29 and 4.44]
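The speedups in the chart follow directly from Amdahl's Law, S(n) = 1 / ((1 - p) + p / n), with parallel fraction p = 0.8. A minimal sketch reproducing the numbers:

```python
def amdahl_speedup(p, n):
    """Amdahl's Law: speedup of an application with parallel
    fraction p when spread over n cores."""
    return 1.0 / ((1.0 - p) + p / n)

# Core counts from the chart, for an 80% parallel application
cores = [1, 2, 4, 12, 16, 24, 32]
speedups = [round(amdahl_speedup(0.8, n), 2) for n in cores]
print(speedups)  # [1.0, 1.67, 2.5, 3.75, 4.0, 4.29, 4.44]
```

Even with 32 cores the speedup saturates below 5x: the 20% serial fraction caps it at 1 / 0.2 = 5 no matter how many cores are added.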
Moore’s Law vs. Amdahl’s Law - “Too Many Cooks in the Kitchen”
Industry is applying Moore’s Law by adding more cores; meanwhile, Amdahl’s Law says that you cannot use them all efficiently.
Improving performance - what levers do we have?
• Challenge: sustain the performance trajectory without massive increases in cost, power, real estate, and unreliability
• Solution: there is no single answer; you must intelligently turn the “architectural knobs”:

Perf = Freq × (cores / socket) × #sockets × (inst or ops / core / clock) × Efficiency
        (1)         (2)             (3)                  (4)                   (5)

Knobs 1 - 4 determine the hardware performance; knob 5 - software efficiency - determines what you really get.
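Plugging numbers into the knobs formula reproduces the peak figures quoted later in the deck. A sketch, assuming a dual-socket E5-2699 v4 node and 16 double-precision FLOP per core per clock from two AVX2 FMA units (the per-clock FLOP count is an assumption about the core, not stated on the slide):

```python
def peak_gflops(freq_ghz, cores_per_socket, sockets, flop_per_core_per_clock):
    """Knobs 1-4 of the formula: theoretical peak in GFLOP/s,
    before knob 5 (software efficiency) takes its share."""
    return freq_ghz * cores_per_socket * sockets * flop_per_core_per_clock

# Broadwell E5-2699 v4: 2.1 GHz x 22 cores x 2 sockets x 16 FLOP/clock
print(peak_gflops(2.1, 22, 2, 16))  # ~1478 GFLOP/s, the "~1.4 TF" Broadwell peak
```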
Turning the knobs 1 - 4
1. Frequency is unlikely to change much: thermal, power, and leakage challenges
2. Moore’s Law still holds (130 nm -> 14 nm): LOTS of transistors, hence more cores per socket
3. The number of sockets per system is the easiest knob, but challenging for power, density, cooling, and networking
4. IPC still grows: FMA, AVX, and accelerator implementations for algorithms, but challenging for the user/developer
Turning knob #5
Hardware tuning knobs are limited, but there’s far more possible in the software layers (roughly ordered from easy to hard):
• Hardware: BIOS, P-states, memory profile, I/O cache tuning
• Operating system: process affinity, memory allocation
• Middleware: MPI (parallel) tuning, use of performance libs (math, I/O, IPP)
• Application: compiler hints, source changes, adding parallelism
New capabilities according to Intel
2007: SSSE3 | 2009: SSE4 | 2012: AVX | 2013: AVX | 2014: AVX2 | 2015: AVX2 | 2017: AVX-512
The state of ISV software

Segment             | Applications                  | Vectorization support
CFD                 | Fluent, LS-DYNA, STAR-CCM+    | Limited SSE2 support
CSM                 | CFX, RADIOSS, Abaqus          | Limited SSE2 support
Weather             | WRF, UM, NEMO, CAM            | Yes
Oil and gas         | Seismic processing            | Not applicable
                    | Reservoir simulation          | Yes
Chemistry           | Gaussian, GAMESS, Molpro      | Not applicable
Molecular dynamics  | NAMD, GROMACS, Amber,…        | PME kernels support SSE2
Biology             | BLAST, Smith-Waterman         | Not applicable
Molecular mechanics | CPMD, VASP, CP2k, CASTEP      | Yes

Bottom line: ISV support for new instructions is poor. This is less of an issue for in-house developed codes, but programming them is hard.
Add to this the Memory Bandwidth and System Balance
Obtained from: http://sc16.supercomputing.org/2016/10/07/sc16-invited-talk-spotlight-dr-john-d-mccalpin-presents-memory-bandwidth-system-balance-hpc-systems/
What does Intel do about these trends?

Problem          | Westmere   | Sandy Bridge                  | Ivy Bridge                       | Haswell                                | Broadwell                                                                      | Skylake
QPI bandwidth    | No problem | Even better                   | Two snoop modes                  | Three snoop modes                      | Four (!) snoop modes                                                           | UPI; COD snoop modes
Memory bandwidth | No problem | Extra memory channel          | Larger cache                     | Extra load/store units                 | Larger cache                                                                   | Extra load/store units; +50% memory channels
Core frequency   | No problem | More cores; AVX; better Turbo | Even more cores; above-TDP Turbo | Still more cores; AVX2; per-core Turbo | Again even more cores; optimized FMA; per-core Turbo based on instruction type | More cores; larger OOO engine; AVX-512; 3 different core frequency modes
The roofline model
Predicting performance - the roofline model
Predict system performance as a function of peak performance, maximum memory bandwidth, and arithmetic intensity.
Obtained from: https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/
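The model itself is one line: attainable performance is the lesser of the compute ceiling and the bandwidth diagonal. A minimal sketch, with peak and bandwidth numbers picked purely for illustration:

```python
def roofline(peak_gflops, bandwidth_gbs, intensity):
    """Attainable GFLOP/s of a kernel with the given arithmetic
    intensity (FLOP/byte): min(compute roof, bandwidth slope)."""
    return min(peak_gflops, bandwidth_gbs * intensity)

# Illustrative node: 1400 GFLOP/s peak, 150 GB/s memory bandwidth
print(roofline(1400.0, 150.0, 0.25))  # 37.5   - memory bound
print(roofline(1400.0, 150.0, 16.0))  # 1400.0 - compute bound
```

The ridge point, peak / bandwidth = 1400 / 150 ≈ 9.3 FLOP/byte, is where the two regimes meet: kernels to its left are memory bound, kernels to its right are compute bound.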
Roofline variables and terms
• “Work”: the number of floating-point operations (FLOPs) a kernel performs
• “Memory Traffic”: the number of bytes the kernel moves to and from memory
The last variable: Arithmetic intensity
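Arithmetic intensity is Work divided by Memory Traffic, in FLOP/byte. A sketch for a STREAM-triad-like loop a[i] = b[i] + s * c[i] in double precision, counting 2 FLOPs and three 8-byte accesses per iteration (whether write-allocate traffic is also counted is a modeling choice ignored here):

```python
def arithmetic_intensity(flops, bytes_moved):
    """Work / Memory Traffic, in FLOP/byte."""
    return flops / bytes_moved

# Triad: one multiply + one add = 2 FLOPs per iteration;
# two 8-byte loads (b[i], c[i]) plus one 8-byte store (a[i]) = 24 bytes
print(arithmetic_intensity(2, 24))  # ~0.083 FLOP/byte: deep in memory-bound territory
```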
Roofline model of a modern system
[Chart: log-log plot of GFLOP/s vs. arithmetic intensity (FLOP/byte). At low intensity a kernel is memory bound, sitting under the bandwidth diagonal; at high intensity it is compute bound, capped by the FP peak ceiling. Better use of caches moves a kernel to the right; micro-optimization moves it up.]
Williams et al.: “Roofline: An Insightful Visual Performance Model“, CACM 2009
The Broadwell baseline
[Chart: roofline, GFLOP/s (1 - 4096, log scale) vs. FLOP/byte (0.01 - 100), for an E5-2699 v4 22C 2.1 GHz; FP peak (~1.4 TF)]
Intel Skylake-SP
[Chart: roofline, GFLOP/s vs. FLOP/byte, adding a Xeon 8180 28C 2.5 GHz (FP peak ~3.3 TF) to the E5-2699 v4 22C 2.1 GHz baseline]
What about AMD Epyc?
[Chart: roofline, GFLOP/s vs. FLOP/byte, adding an AMD Epyc 7601 32C 2.2 GHz (FP peak ~1 TF) to the E5-2699 v4 and Xeon 8180 curves]
What about accelerators?
[Chart: roofline, GFLOP/s vs. FLOP/byte, comparing Xeon Phi 7250 out of MCDRAM and out of DDR (FP peak ~3.3 TF) with an NVIDIA P100 (FP peak ~4.7 TF)]
Williams et al.: “Roofline: An Insightful Visual Performance Model“, CACM 2009
Ideal roofline on a sunny day
[Chart: roofline, GFLOP/s vs. FLOP/byte, with the full AVX-512 compute ceiling]
Might need to repair some rooftiles…
[Chart: as above, with an additional AVX2 ceiling at half the AVX-512 height: half the register width]
It’s worse than we thought…
[Chart: as above, with a further ceiling for code without FMA]
Everything is failing on us!
[Chart: as above, with a further ceiling for code without SIMD]
The reality after hurricane Irma
[Chart: as above, with the lowest ceiling for code with no instruction-level parallelism (ILP)]
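The successive ceilings of the last few charts can be modeled as multiplicative factors of per-core throughput. A sketch, assuming a Skylake-SP-like core: 8 double-precision SIMD lanes with AVX-512, an FMA counting as 2 FLOPs, and 2 FMA-capable ports providing the instruction-level parallelism (the factor values are illustrative assumptions):

```python
def flop_per_cycle(simd_lanes, fma, ports):
    """Per-core DP FLOP/cycle as the product of the three features
    the roofline ceilings strip away one by one."""
    return simd_lanes * (2 if fma else 1) * ports

ceilings = {
    "AVX-512": flop_per_cycle(8, True, 2),   # 32
    "AVX2":    flop_per_cycle(4, True, 2),   # 16 - half the register width
    "no FMA":  flop_per_cycle(4, False, 2),  # 8
    "no SIMD": flop_per_cycle(1, False, 2),  # 2
    "no ILP":  flop_per_cycle(1, False, 1),  # 1
}
print(ceilings)
```

Each lost feature halves the compute roof (or worse), which is why unvectorized, FMA-free code sits orders of magnitude below the sunny-day roofline.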
A few Broadwell results
[Chart: roofline, GFLOP/s vs. arithmetic intensity (FLOP/byte), FP peak (~1 TF), with the "no FMA" and "no SIMD" ceilings; measured points for STREAM Triad, HPCG, sparse solvers (CFD), OpenFOAM, CP2k, GROMACS, and HPL]
A case study on power
Key Aspects of Acceleration
We have lots of transistors… Moore’s Law is holding; this isn’t necessarily the problem.
We don’t really need lower power per transistor; we need lower power per operation.
How to do this?
Performance and Efficiency with Intel® AVX-512

LINPACK performance on Xeon Platinum 8180:

Instruction set | GFLOPs | Power (W) | Frequency (GHz) | GFLOPs/Watt (normalized to SSE4.2) | GFLOPs/GHz (normalized to SSE4.2)
SSE4.2          | 669    | 760       | 3.1             | 1.00                               | 1.00
AVX             | 1178   | 768       | 2.8             | 1.74                               | 1.95
AVX2            | 2034   | 791       | 2.5             | 2.92                               | 3.77
AVX-512         | 3259   | 767       | 2.1             | 4.83                               | 7.19

Intel® AVX-512 delivers significant performance and efficiency gains.
Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, UPI=10.4, SNC1, 6x32GB DDR4-2666 per CPU, 1 DPC.
Slide taken from Intel
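The normalized efficiency figures follow directly from the raw GFLOPs, power, and frequency numbers on the slide. A quick sketch recomputing them:

```python
# (GFLOPs, system power in W, core frequency in GHz) from the Intel LINPACK slide
data = {
    "SSE4.2":  (669, 760, 3.1),
    "AVX":     (1178, 768, 2.8),
    "AVX2":    (2034, 791, 2.5),
    "AVX-512": (3259, 767, 2.1),
}

base_gf, base_w, base_ghz = data["SSE4.2"]
for isa, (gf, watt, ghz) in data.items():
    per_watt = (gf / watt) / (base_gf / base_w)    # normalized GFLOPs/Watt
    per_ghz = (gf / ghz) / (base_gf / base_ghz)    # normalized GFLOPs/GHz
    print(f"{isa:8s} {per_watt:.2f}x GFLOPs/Watt, {per_ghz:.2f}x GFLOPs/GHz")
```

AVX-512 comes out at roughly 4.8x the GFLOPs/Watt and 7.2x the GFLOPs/GHz of SSE4.2, matching the slide: wider vectors do more work per instruction, so power per operation drops even though frequency falls from 3.1 to 2.1 GHz.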
Powerful instructions to save power
For LINPACK, powerful instructions bring significant performance gains. But what about real applications?
The NAS parallel benchmarks (NPB) are “mini applications” containing kernels for the major HPC workload types.
NPB kernels

Kernel | Description                     | Workload
CG     | Conjugate gradient              | Memory latency bound
MG     | Multigrid                       | Memory intensive
FT     | Fourier transform               | Compute and transpose
BT     | Block tridiagonal solver        |
SP     | Scalar pentadiagonal solver     |
LU     | Lower-upper Gauss-Seidel solver |
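To make the first kernel concrete, here is a minimal conjugate gradient solver in plain Python on a toy dense 2x2 system (the NPB CG kernel applies the same recurrence to a large unstructured sparse matrix, which is what makes it memory latency bound):

```python
def conjugate_gradient(A, b, max_iters=50, tol=1e-12):
    """Solve A x = b for a symmetric positive definite matrix A.
    Toy dense version of the recurrence at the heart of NPB CG."""
    n = len(b)
    matvec = lambda M, v: [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))

    x = [0.0] * n
    r = b[:]       # residual r = b - A x, with x = 0
    p = r[:]       # initial search direction
    for _ in range(max_iters):
        rr = dot(r, r)
        if rr < tol:
            break  # converged
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        beta = dot(r, r) / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
    return x

x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
print(x)  # ~[0.0909, 0.6364], i.e. [1/11, 7/11]
```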
How to benchmark?
There are three ways that the kernels can be run:
• Serial - N copies on N cores
• OpenMP - shared memory parallelism
• MPI - parallelism across multiple systems (nodes)
There are five classes of problem sizes, from A (tiny) to E (very large, scaling to 1000s of cores).
Conjugate Gradient
[Charts: core frequency in MHz (scale 0 - 4000) and power draw in Watt (scale 0 - 250) for the no-vec, SSE4.2, AVX, AVX2 and AVX-512 builds]
Multigrid
[Charts: core frequency in MHz (scale 0 - 4000) and power draw in Watt (scale 0 - 250) for the no-vec, SSE4.2, AVX, AVX2 and AVX-512 builds]
Block tridiagonal solver
[Charts: core frequency in MHz (scale 0 - 4000) and power draw in Watt (scale 0 - 250) for the no-vec, SSE4.2, AVX, AVX2 and AVX-512 builds]
Scalar pentadiagonal solver
[Charts: core frequency in MHz (scale 0 - 4000) and power draw in Watt (scale 0 - 250) for the no-vec, SSE4.2, AVX, AVX2 and AVX-512 builds]
Lower-upper Gauss-Seidel solver
[Charts: core frequency in MHz (scale 0 - 4000) and power draw in Watt (scale 0 - 250) for the no-vec, SSE4.2, AVX, AVX2 and AVX-512 builds]
Fourier Transformation
[Charts: core frequency in MHz (scale 0 - 4000) and power draw in Watt (scale 0 - 250) for the no-vec, SSE4.2, AVX, AVX2 and AVX-512 builds]