Kei Davis and Fabrizio Petrini {kei,fabrizio}@lanl.gov
Europar 2004, Pisa, Italy
CCS-3 PAL

A CASE STUDY
Section 3
Overview
In this section we show the negative consequences of a lack of coordination in a large-scale machine
We analyze the behavior of a complex scientific application, representative of the ASCI workload, on a large-scale supercomputer
A case study that emphasizes the importance of coordination in the network and in the system software
ASCI Q
2,048 ES45 AlphaServers, with 4 processors per node
16 GB of memory per node
8,192 processors in total
2 independent network rails, Quadrics Elan3
> 8,192 cables
20 Tflops peak, #2 in the Top 500 list
A complex human artifact
Dealing with the complexity of a real system
In this section of the tutorial we provide insight into the methodology we used to substantially improve the performance of ASCI Q.
This methodology is based on an arsenal of:
analytical models
custom microbenchmarks
full applications
discrete event simulators
We deal with the complexity of the machine and the complexity of a real parallel application, SAGE, with > 150,000 lines of Fortran & MPI code
Overview
Our performance expectations for ASCI Q and the reality
Identification of performance factors
Application performance and breakdown into components
Detailed examination of system effects
A methodology to identify operating-system effects
Effect of scaling: up to 2,000 nodes / 8,000 processors
Quantification of the impact
Towards the elimination of overheads: demonstrated over 2x performance improvement
Generalization of our results: application resonance
Bottom line: the importance of integrating the various system activities across nodes
SAGE Performance (QA & QB)
[Figure: SAGE cycle time (s) vs. # PEs, 0 to 4096, for the model and two measured runs (Sep-21-02, Nov-25-02)]
Performance of SAGE on 1024 nodes
Performance consistent across QA and QB (the two segments of ASCI Q, with 1024 nodes / 4096 processors each)
Measured time 2x greater than model (4096 PEs)
There is a difference: why?
Lower is better!
Using fewer PEs per Node
Test performance using 1, 2, 3, and 4 PEs per node
[Figure: Sage on QB (timing.input): cycle time (s) vs. # PEs, 1 to 10,000 (log scale), for 1, 2, 3, and 4 PEs per node]
Lower is better!
Using fewer PEs per node (2)
Measurements match the model almost exactly for 1, 2, and 3 PEs per node!
[Figure: Sage on QB (timing.input): error (s) = measured minus model vs. # PEs, for 1, 2, 3, and 4 PEs per node]
Performance issue only occurs when using 4 PEs per node
Mystery #1
SAGE performs significantly worse on ASCI Q than was predicted by our model
SAGE performance components
Look at SAGE in terms of its main components:
Put/Get (point-to-point boundary exchange)
Collectives (allreduce, broadcast, reduction)
[Figure: SAGE on QB, breakdown (timing.input): time per cycle (s) vs. # PEs, split into token_allreduce, token_bcast, token_get, token_put, token_reduction, and total cyc_time]
The performance issue seems to occur only in the collective operations
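A breakdown like the one above can be gathered with simple timing wrappers around each class of operation. A minimal sketch of the idea (the names `component` and `token_allreduce` are illustrative, not SAGE's actual instrumentation):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# accumulated wall time per component, e.g. "token_allreduce"
timers = defaultdict(float)

@contextmanager
def component(name):
    """Accumulate the wall time spent inside the with-block."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        timers[name] += time.perf_counter() - t0

# usage: wrap each class of operation inside the main cycle
with component("token_allreduce"):
    sum(range(1000))  # stand-in for the real collective call
```

Summing each timer over a cycle and plotting it against # PEs yields the kind of stacked breakdown shown in the figure.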
Performance of the collectives
Measure collective performance separately
[Figure: Allreduce latency (ms) vs. number of nodes, 0 to 1,000, for 1, 2, 3, and 4 processes per node]
Collectives (e.g., allreduce and barrier) mirror the performance of the application
Identifying the problem within Sage
[Diagram: simplify SAGE down to its allreduce]
Exposing the problems with simple benchmarks
[Diagram: start from allreduce benchmarks and add complexity]
Challenge: identify the simplest benchmark that exposes the problem
Interconnection network and communication libraries
The initial (obvious) suspects were the interconnection network and the MPI implementation
We tested in depth the network, the low-level transmission protocols, and several allreduce algorithms
We also implemented allreduce in the network interface card
By changing the synchronization mechanism we were able to reduce the latency of an allreduce benchmark by a factor of 7
But we obtained only a small improvement in SAGE (5%)
Mystery #2
Although SAGE spends half of its time in allreduce (at 4,096 processors), making allreduce 7 times faster leads to a small performance improvement
Computational noise
After having ruled out the network and MPI, we focused our attention on the compute nodes
Our hypothesis was that computational noise is generated inside the processing nodes
This noise “freezes” a running process for a certain amount of time, generating a “computational” hole
Computational noise: intuition
Running 4 processes on the 4 processors of an AlphaServer ES45
[Diagram: processes P0, P1, P2, P3, one per processor]
The computation of one process is interrupted by an external event (e.g., a system daemon or the kernel)
Computational noise: 3 processes on 3 processors
Running 3 processes on 3 processors of an AlphaServer ES45
[Diagram: processes P0, P1, P2; the fourth processor is idle]
The “noise” can run on the 4th processor without interrupting the other 3 processes
Coarse grained measurement
We execute a computational loop for 1,000 seconds on all 4,096 processors of QB
[Diagram: processes P1–P4 running from START to END over TIME]
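The coarse-grained test amounts to running a fixed amount of work and comparing the elapsed wall time against a calibrated noise-free ideal; the excess is the noise overhead. A minimal sketch, with an illustrative toy kernel rather than the original benchmark:

```python
import time

def coarse_grained_slowdown(chunks, ideal_chunk_s):
    """Run `chunks` identical work units and compare total wall
    time to the calibrated ideal; the excess is attributed to
    noise. `ideal_chunk_s` must be measured beforehand on an
    unloaded processor (here it is just a parameter)."""
    x = 0
    start = time.perf_counter()
    for _ in range(chunks):
        for i in range(10_000):  # stand-in compute kernel
            x += i * i
    elapsed = time.perf_counter() - start
    ideal = chunks * ideal_chunk_s
    return (elapsed - ideal) / ideal  # fractional slowdown
```

On Q this kind of test, run for 1,000 seconds on every processor, yields the 1% to 2.5% per-process slowdown reported on the next slide.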
Coarse grained computational overhead per process
The slowdown per process is small, between 1% and 2.5%
lower is better
Mystery #3
Although the “noise” hypothesis could explain SAGE’s suboptimal performance, the microbenchmarks of per-processor noise indicate that at most 2.5% of performance is lost to noise
Fine grained measurement
We run the same benchmark for 1,000 seconds, but measure the run time every millisecond
This fine granularity is representative of many ASCI codes
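The fine-grained version times each nominally identical ~1 ms chunk separately; any chunk that takes much longer than the best case was hit by noise. A sketch (the toy kernel and the 1.5x threshold are illustrative choices):

```python
import time

def chunk_durations(n_chunks, chunk_work):
    """Time n_chunks identical pieces of work individually;
    calibrate chunk_work so one chunk takes about 1 ms."""
    durations, x = [], 0
    for _ in range(n_chunks):
        t0 = time.perf_counter()
        for i in range(chunk_work):  # stand-in compute kernel
            x += i * i
        durations.append(time.perf_counter() - t0)
    return durations

def delayed(durations, factor=1.5):
    """Chunks slower than `factor` times the best (undisturbed)
    chunk; these are the outliers caused by external noise."""
    base = min(durations)
    return [d for d in durations if d > factor * base]
```

Plotting the histogram of these durations, per node, is exactly what reveals the per-cluster noise structure on the following slides.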
Fine grained computational overhead per node
We now compute the slowdown per node, rather than per process
The noise has a clear per-cluster structure
Optimum is 0 (lower is better)
Finding #1
Analyzing noise on a per-node basis reveals a regular structure across nodes
Noise in a 32 Node Cluster
The Q machine is organized in 32-node clusters (TruCluster)
In each cluster there is a cluster manager (node 0), a quorum node (node 1), and the RMS data collection (node 31)
Per node noise distribution
Plot the distribution of one million 1 ms computational chunks
In an ideal, noiseless machine the distribution is a single bar at 1 ms: 1 million points per process (4 million per node)
Every outlier identifies a computation that was delayed by external interference
We show the distributions for a standard cluster node and for nodes 0, 1, and 31
Cluster Node (2-30)
10% of the time, the execution of the 1 ms chunk of computation is delayed
Node 0, Cluster Manager
We can identify 4 main sources of noise
Node 1, Quorum Node
One source of heavyweight noise (335 ms!)
Node 31
Many fine-grained interruptions, between 6 and 8 milliseconds
The effect of the noise
An application is usually a sequence of a computation followed by a synchronization (collective)
But if a noise event happens on a single node, it can delay all the other nodes
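This amplification is easy to see in a toy model of one bulk-synchronous step: every node computes, then all meet at a collective, so the step ends only when the slowest node arrives (the numbers below are illustrative):

```python
def bsp_step_time(compute_ms, delays_ms):
    """One bulk-synchronous step: compute, then synchronize.
    delays_ms[i] is the noise delay hitting node i this step;
    the collective completes when the slowest node arrives."""
    return compute_ms + max(delays_ms)

# a single 5 ms daemon wake-up on one node out of 4096
# stalls every node at the collective
delays = [0.0] * 4096
delays[1234] = 5.0
step = bsp_step_time(1.0, delays)  # 6.0 ms instead of 1.0 ms
```

The per-node cost of the event is tiny, but the synchronized machine pays it in full, which is why per-process slowdowns of 2.5% can coexist with a 2x application slowdown.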
Effect of System Size
The probability of a random event occurring increases with the node count.
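Under the simplifying assumption that noise events hit nodes independently, the chance that some node is hit during a given step grows quickly with the node count:

```python
def p_step_delayed(p_node, n_nodes):
    """Probability that at least one of n_nodes is hit by a
    noise event during a step, if each node is hit
    independently with probability p_node (an idealized
    model, not measured Q data)."""
    return 1.0 - (1.0 - p_node) ** n_nodes

# with a 0.1% per-node chance per step:
#   32 nodes   -> roughly 3% of steps are delayed
#   2048 nodes -> roughly 87% of steps are delayed
```

At the scale of ASCI Q, essentially every synchronization interval is hit by some noise event somewhere in the machine.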
Tolerating Noise: Buffered Coscheduling (BCS)
We can tolerate the noise by coscheduling the activities of the system software on each node
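A toy simulation of the idea, under the assumption that system activity can be aligned into a common slot on every node (all parameters illustrative):

```python
import random

def total_noise_cost(steps, n_nodes, p, d_ms, coscheduled, rng):
    """Extra time over `steps` bulk-synchronous steps.
    Uncoordinated: each node pauses independently with
    probability p for d_ms, and the step pays d_ms if any
    node pauses. Coscheduled: all nodes pause in the same
    slot, so the step pays d_ms only when that shared slot
    is active (probability p)."""
    extra = 0.0
    for _ in range(steps):
        if coscheduled:
            hit = rng.random() < p
        else:
            hit = any(rng.random() < p for _ in range(n_nodes))
        extra += d_ms if hit else 0.0
    return extra

uncoord = total_noise_cost(1000, 1024, 0.01, 5.0, False, random.Random(0))
coord   = total_noise_cost(1000, 1024, 0.01, 5.0, True,  random.Random(0))
# with 1024 nodes, almost every uncoordinated step is hit;
# coscheduling pays the cost only about 1% of the time
```

Aligning the pauses makes them overlap instead of accumulating across nodes, which is the essence of buffered coscheduling.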
Discrete Event Simulator: used to model noise
The DES is used to examine and identify the impact of noise: it takes as input the harmonics that characterize the noise
The noise model closely approximates the experimental data
The primary bottleneck is the fine-grained noise generated by the compute nodes (Tru64)
Lower is better
Finding #2
On fine-grained applications, more performance is lost to short but frequent noise on all nodes than to long but less frequent noise on just a few nodes
Incremental noise reduction
1. Removed about 10 daemons from all nodes (including envmod, insightd, snmpd, lpd, and niff)
2. Decreased the RMS monitoring frequency by a factor of 2 on each node (from an interval of 30 s to 60 s)
3. Moved several daemons from nodes 1 and 2 to node 0 on each cluster
Improvements in the Barrier Synchronization Latency
Resulting SAGE Performance
Nodes 0 and 31 were also configured out in the optimization
[Figure: SAGE cycle time (s) vs. # PEs, 0 to 4096: Model, Sep-21-02, Nov-25-02, Jan-27-03 (Min), Jan-27-03]
[Figure: SAGE cycle time (s) vs. # PEs, 0 to 8192: Model, Sep-21-02, Nov-25-02, Jan-27-03, May-01-03, May-01-03 (min)]
Finding #3
We were able to double SAGE’s performance by selectively removing noise caused by several types of system activities
Generalizing our results: application resonance
The computational granularity of a balanced bulk-synchronous application correlates with the type of noise that affects it.
Intuition: while any noise source has a negative impact, a few noise sources tend to have a major impact on a given application.
Rule of thumb: the computational granularity of the application “enters into resonance” with noise of the same order of magnitude.
Performance can be enhanced by selectively removing sources of noise.
We can provide a reasonable estimate of the performance improvement, knowing the computational granularity of a given application.
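A back-of-the-envelope sketch of such an estimate (our own illustration, not the authors' actual model): treat each noise source as a rate and a duration, and charge a synchronized step roughly one event duration whenever some node is hit during the step.

```python
def expected_step_overhead(granularity_s, sources, n_nodes):
    """Rough expected extra time per bulk-synchronous step.
    `sources` is a list of (events_per_second_per_node,
    duration_s) pairs. A step of length granularity_s is hit
    by a source with probability about
    min(rate * granularity_s * n_nodes, 1); a hit costs about
    one event duration, since all nodes end up waiting."""
    extra = 0.0
    for rate, duration in sources:
        p_hit = min(rate * granularity_s * n_nodes, 1.0)
        extra += p_hit * duration
    return extra

# hypothetical sources on 2048 nodes, 1 ms granularity:
# a frequent 0.1 Hz daemon with 5 ms events, versus a rare
# 0.001 Hz source with 300 ms events
fine  = expected_step_overhead(0.001, [(0.1, 0.005)], 2048)
heavy = expected_step_overhead(0.001, [(0.001, 0.3)], 2048)
# here the frequent fine-grained source costs more per step
```

With these (made-up) numbers the frequent fine-grained source dominates, echoing Finding #2: removing the source whose scale matches the application's granularity gives the largest payoff.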
Cumulative Noise Distribution, Sequence of Barriers with No Computation
Most of the latency is generated by the fine-grained, high-frequency noise of the cluster nodes
Conclusions
Combination of measurement, simulation, and modeling to identify and resolve performance issues on Q
Used modeling to determine that a problem exists
Developed computation kernels to quantify O/S events:
The effect increases with the number of nodes
The impact is determined by the computational granularity of the application
Application performance has significantly improved
The method is also being applied to other large systems