Page 1: A CASE STUDY

Kei Davis and Fabrizio Petrini
{kei,fabrizio}@lanl.gov
Europar 2004, Pisa, Italy
CCS-3 / PAL

A CASE STUDY

Page 2: A CASE STUDY

Section 3

Overview

In this section we show the negative consequences of a lack of coordination in a large-scale machine.

We analyze the behavior of a complex scientific application, representative of the ASCI workload, on a large-scale supercomputer.

A case study that emphasizes the importance of coordination in the network and in the system software.

Page 3: A CASE STUDY

ASCI Q

2,048 ES45 AlphaServers, with 4 processors per node
16 GB of memory per node
8,192 processors in total
2 independent network rails, Quadrics Elan3
> 8,192 cables
20 Tflops peak, #2 in the Top 500 list
A complex human artifact

Page 4: A CASE STUDY

Dealing with the complexity of a real system

In this section of the tutorial we provide insight into the methodology we used to substantially improve the performance of ASCI Q.

This methodology is based on an arsenal of tools: analytical models, custom microbenchmarks, full applications, and discrete event simulators.

We deal with the complexity of the machine and the complexity of a real parallel application, SAGE, with > 150,000 lines of Fortran & MPI code.

Page 5: A CASE STUDY

Overview

Our performance expectations for ASCI Q and the reality

Identification of performance factors: application performance and its breakdown into components; detailed examination of system effects

A methodology to identify operating system effects: the effect of scaling, up to 2,000 nodes / 8,000 processors, and quantification of the impact

Towards the elimination of overheads: demonstrated over 2x performance improvement

Generalization of our results: application resonance

Bottom line: the importance of the integration of the various system components across nodes

Page 6: A CASE STUDY


SAGE Performance (QA & QB)

[Figure: performance of SAGE on 1024 nodes; cycle time (s) vs. # PEs from 0 to 4096, comparing the model prediction with measurements from Sep-21-02 and Nov-25-02. Lower is better.]

Performance is consistent across QA and QB (the two segments of ASCI Q, with 1024 nodes / 4096 processors each).

Measured time is 2x greater than the model at 4096 PEs. There is a difference. Why?

Page 7: A CASE STUDY

Using fewer PEs per node

Test performance using 1, 2, 3, and 4 PEs per node

[Figure: SAGE on QB (timing.input), cycle time (s) vs. #PEs on a log scale from 1 to 10,000, for 1, 2, 3, and 4 PEs per node. Lower is better.]

Page 8: A CASE STUDY

Using fewer PEs per node (2)

Measurements match the model almost exactly for 1, 2, and 3 PEs per node!

[Figure: SAGE on QB (timing.input), error (s) = measured - model, vs. #PEs on a log scale from 1 to 10,000, for 1, 2, 3, and 4 PEs per node.]

The performance issue occurs only when using 4 PEs per node.

Page 9: A CASE STUDY

Mystery #1

SAGE performs significantly worse on ASCI Q than was predicted by our model

Page 10: A CASE STUDY

SAGE performance components

Look at SAGE in terms of its main components: Put/Get (point-to-point boundary exchange) and collectives (allreduce, broadcast, reduction).

[Figure: SAGE on QB, breakdown (timing.input), time/cycle (s) vs. #PEs on a log scale from 1 to 10,000; series: token_allreduce, token_bcast, token_get, token_put, token_reduction, cyc_time.]

The performance issue seems to occur only in collective operations.

Page 11: A CASE STUDY

Performance of the collectives

Measure collective performance separately.

[Figure: allreduce latency (ms) vs. number of nodes, from 0 to 1000, for 1, 2, 3, and 4 processes per node.]

Collectives (e.g., allreduce and barrier) mirror the performance of the application.

Page 12: A CASE STUDY

Identifying the problem within SAGE

[Diagram: simplify from the full SAGE application down to its allreduce operations.]

Page 13: A CASE STUDY

Exposing the problems with simple benchmarks

[Diagram: start from simple allreduce benchmarks and incrementally add complexity.]

Challenge: identify the simplest benchmark that exposes the problem.

Page 14: A CASE STUDY

Interconnection network and communication libraries

The initial (obvious) suspects were the interconnection network and the MPI implementation.

We tested in depth the network, the low-level transmission protocols, and several allreduce algorithms.

We also implemented allreduce in the network interface card.

By changing the synchronization mechanism we were able to reduce the latency of an allreduce benchmark by a factor of 7.

But we obtained only a small improvement in SAGE (5%).

Page 15: A CASE STUDY

Mystery #2

Although SAGE spends half of its time in allreduce (at 4,096 processors), making allreduce 7 times faster leads to only a small performance improvement.
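As a sanity check on this mystery, Amdahl's law gives the speedup we should have seen if allreduce really dominated the run time. This is our own illustrative arithmetic, not a computation from the slides:

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when a given fraction of the run time is
    accelerated by the given factor (Amdahl's law)."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

# If allreduce were truly half of SAGE's run time and became 7x faster,
# the whole application should have sped up by about 1.75x...
expected = amdahl_speedup(0.5, 7.0)   # = 1.75
# ...yet the measured improvement was only ~5%. The time attributed to
# allreduce must therefore be spent elsewhere (waiting, as it turns out).
```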

Page 16: A CASE STUDY

Computational noise

After ruling out the network and MPI, we focused our attention on the compute nodes.

Our hypothesis is that computational noise is generated inside the processing nodes.

This noise "freezes" a running process for a certain amount of time and generates a "computational hole".

Page 17: A CASE STUDY

Computational noise: intuition

Running 4 processes on all 4 processors of an AlphaServer ES45. [Diagram: processes P0, P1, P2, P3, one per processor.]

The computation of one process is interrupted by an external event (e.g., a system daemon or the kernel).

Page 18: A CASE STUDY

Computational noise: 3 processes on 3 processors

Running 3 processes on 3 processors of an AlphaServer ES45. [Diagram: processes P0, P1, P2 on three processors; the fourth processor is idle.]

The "noise" can run on the 4th processor without interrupting the other 3 processes.

Page 19: A CASE STUDY

Coarse-grained measurement

We execute a computational loop for 1,000 seconds on all 4,096 processors of QB

[Diagram: processes P1–P4 running over time, from START to END.]
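The coarse-grained benchmark can be sketched as follows; the loop sizes and trial counts here are placeholder values for illustration, not the ones used on QB:

```python
import time

def busy_work(n):
    # A fixed amount of pure computation: no I/O, no system calls.
    acc = 0.0
    for i in range(n):
        acc += i * 0.5
    return acc

def coarse_slowdown(n=100_000, trials=5):
    """Run the same fixed workload several times and report the fractional
    slowdown of the mean over the best run; on a noisy node the mean
    exceeds the best observed (noise-free) time."""
    times = []
    for _ in range(trials):
        t0 = time.perf_counter()
        busy_work(n)
        times.append(time.perf_counter() - t0)
    best = min(times)
    return (sum(times) / len(times) - best) / best

slowdown = coarse_slowdown()   # small on a quiet node, e.g. a few percent
```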

Page 20: A CASE STUDY

Coarse-grained computational overhead per process

The slowdown per process is small, between 1% and 2.5%.

[Figure: coarse-grained overhead per process. Lower is better.]

Page 21: A CASE STUDY

Mystery #3

Although the "noise" hypothesis could explain SAGE's suboptimal performance, the microbenchmarks of per-processor noise indicate that at most 2.5% of performance is lost to noise.

Page 22: A CASE STUDY

Fine-grained measurement

We run the same benchmark for 1,000 seconds, but now we measure the run time of each 1 ms chunk of computation.

This fine granularity is representative of many ASCI codes.
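The fine-grained variant can be sketched like this: instead of one long run, time many nominally identical short chunks and look at the distribution. Chunk counts and sizes are again placeholders:

```python
import time
import statistics

def time_chunks(n_chunks=500, iters=2_000):
    """Time many nominally identical short chunks of computation; on a
    noisy node a fraction of the chunks takes far longer than the median."""
    durations_ms = []
    for _ in range(n_chunks):
        t0 = time.perf_counter()
        acc = 0.0
        for i in range(iters):
            acc += i * 0.5
        durations_ms.append((time.perf_counter() - t0) * 1000.0)
    return durations_ms

samples = time_chunks()
median = statistics.median(samples)
# Every chunk well above the median is a computation that was delayed by
# external interference (a daemon, the kernel, an interrupt).
outliers = [d for d in samples if d > 2.0 * median]
```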

Page 23: A CASE STUDY

Fine-grained computational overhead per node

We now compute the slowdown per node, rather than per process.

The noise has a clear per-cluster structure.

[Figure: fine-grained overhead per node; the optimum is 0 (lower is better).]

Page 24: A CASE STUDY

Finding #1

Analyzing noise on a per-node basis reveals a regular structure across nodes.

Page 25: A CASE STUDY

Noise in a 32-Node Cluster

The Q machine is organized in 32-node clusters (TruCluster). In each cluster there is a cluster manager (node 0), a quorum node (node 1), and the RMS data collection (node 31).

Page 26: A CASE STUDY

Per-node noise distribution

Plot the distribution of one million 1 ms computational chunks.

In an ideal, noiseless machine the distribution is a single bar at 1 ms: 1 million points per process (4 million per node).

Every outlier identifies a computation that was delayed by external interference.

We show the distributions for a standard cluster node, and also for nodes 0, 1, and 31.

Page 27: A CASE STUDY

Cluster Node (2-30)

10% of the time, the execution of the 1 ms chunk of computation is delayed.

Page 28: A CASE STUDY

Node 0, Cluster Manager

We can identify 4 main sources of noise

Page 29: A CASE STUDY

Node 1, Quorum Node

One source of heavyweight noise (335 ms!)

Page 30: A CASE STUDY

Node 31

Many fine-grained interruptions, between 6 and 8 milliseconds.

Page 31: A CASE STUDY

The effect of the noise

An application is usually a sequence of computation phases, each followed by a synchronization (collective).

But if an event happens on a single node, it can affect all the other nodes.
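The mechanism above can be sketched in a few lines: a bulk-synchronous step ends at a collective, so every node waits for the slowest one. The node count and noise duration below are illustrative:

```python
def step_time(compute, noise):
    """One bulk-synchronous step ends at a collective, so it lasts as long
    as the SLOWEST node: its compute time plus any noise it absorbed."""
    return max(c + d for c, d in zip(compute, noise))

n_nodes = 4096
compute = [1.0] * n_nodes        # every node has 1 ms of work
noise = [0.0] * n_nodes
noise[17] = 5.0                  # a single 5 ms daemon wakeup on ONE node

# All 4096 nodes now wait 6 ms instead of 1 ms for this step.
delayed = step_time(compute, noise)
ideal = step_time(compute, [0.0] * n_nodes)
```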

Page 32: A CASE STUDY

Effect of System Size

The probability of a random event occurring increases with the node count.

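Quantitatively: if each node is hit independently, the chance that some node (and hence the whole synchronized step) is delayed grows quickly with the node count. The per-node probability below is a hypothetical value for illustration:

```python
def prob_step_delayed(p_node, n_nodes):
    """If each node independently suffers a noise event during a compute
    phase with probability p_node, the chance that SOME node (and hence
    the whole synchronized step) is delayed is 1 - (1 - p_node)**n_nodes."""
    return 1.0 - (1.0 - p_node) ** n_nodes

p = 0.001                       # hypothetical per-node, per-phase probability
small = prob_step_delayed(p, 32)     # one 32-node cluster: ~3% of steps
large = prob_step_delayed(p, 4096)   # a full QB segment: ~98% of steps
```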

Page 33: A CASE STUDY

Tolerating Noise: Buffered Coscheduling (BCS)

We can tolerate the noise by coscheduling the activities of the system software on each node.
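The payoff of coscheduling can be sketched with a toy model (our own illustration, with made-up numbers): if every node's daemons fire in the same step, only that one step is stretched, instead of every step:

```python
def total_time(steps):
    """Total run time of a sequence of bulk-synchronous steps; each step
    lasts as long as its slowest node."""
    return sum(max(node_times) for node_times in steps)

compute, noise, n_nodes, n_steps = 1.0, 0.5, 4, 4

# Uncoordinated: in each step a DIFFERENT node runs its daemon, so every
# step is stretched from 1.0 to 1.5.
uncoordinated = [[compute + (noise if node == step else 0.0)
                  for node in range(n_nodes)] for step in range(n_steps)]

# Coscheduled: all nodes run their daemons in the SAME step, so only that
# one step is stretched.
coscheduled = [[compute + (noise if step == 0 else 0.0)
                for node in range(n_nodes)] for step in range(n_steps)]

t_uncoordinated = total_time(uncoordinated)   # 4 x 1.5 = 6.0
t_coscheduled = total_time(coscheduled)       # 1.5 + 3 x 1.0 = 4.5
```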

Page 34: A CASE STUDY

Discrete Event Simulator: used to model noise

The DES is used to examine and identify the impact of noise: it takes as input the harmonics that characterize the noise.

The noise model closely approximates the experimental data. The primary bottleneck is the fine-grained noise generated by the compute nodes (Tru64).

[Figure: simulated vs. measured latency. Lower is better.]

Page 35: A CASE STUDY

Finding #2

On fine-grained applications, more performance is lost to short but frequent noise on all nodes than to long but less frequent noise on just a few nodes.

Page 36: A CASE STUDY

Incremental noise reduction

1. Removed about 10 daemons from all nodes (including envmod, insightd, snmpd, lpd, and niff).

2. Decreased the RMS monitoring frequency by a factor of 2 on each node (from an interval of 30 s to 60 s).

3. Moved several daemons from nodes 1 and 2 to node 0 in each cluster.

Page 37: A CASE STUDY

Improvements in the Barrier Synchronization Latency

Page 38: A CASE STUDY

Resulting SAGE Performance

Nodes 0 and 31 were also configured out in the optimization.

[Figure, left: SAGE cycle time (s) vs. # PEs from 0 to 4096; series: Model, Sep-21-02, Nov-25-02, Jan-27-03 (Min), Jan-27-03. Figure, right: cycle time (s) vs. # PEs from 0 to 8192; series: Model, Sep-21-02, Nov-25-02, Jan-27-03, May-01-03, May-01-03 (min). Lower is better.]

Page 39: A CASE STUDY

Finding #3

We were able to double SAGE's performance by selectively removing noise caused by several types of system activities.

Page 40: A CASE STUDY

Generalizing our results: application resonance

The computational granularity of a balanced bulk-synchronous application correlates with the type of noise it is sensitive to.

Intuition: while any noise source has a negative impact, a few noise sources tend to have a major impact on a given application.

Rule of thumb: the computational granularity of the application "enters into resonance" with noise of the same order of magnitude.

Performance can be enhanced by selectively removing sources of noise.

We can provide a reasonable estimate of the performance improvement, knowing the computational granularity of a given application.
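The resonance rule of thumb can be illustrated with a rough model of our own (not the tutorial's simulator); the noise periods and node counts below are hypothetical illustrative values, except for the 335 ms heavyweight event and the machine scale, which appear on earlier slides:

```python
def relative_slowdown(granularity, noise_len, noise_period, n_nodes):
    """Rough resonance model: a compute phase of the given granularity on
    each of n_nodes ends in a barrier; a periodic noise source of length
    noise_len and period noise_period (random phase, per node) delays the
    whole step if it hits ANY node. Returns extra time relative to the
    granularity."""
    p_node = min(1.0, granularity / noise_period)
    p_any = 1.0 - (1.0 - p_node) ** n_nodes
    return noise_len * p_any / granularity

# Fine-grained noise (~1 ms, hypothetical 10 ms period) on ALL 4096
# processors vs. heavyweight noise (335 ms, hypothetical 100 s period)
# on only the 64 quorum nodes: for a 1 ms granularity application, the
# short frequent noise dominates, matching Finding #2.
fine_all = relative_slowdown(1e-3, 1e-3, 1e-2, 4096)
heavy_few = relative_slowdown(1e-3, 0.335, 100.0, 64)
```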

Page 41: A CASE STUDY

Cumulative Noise Distribution, Sequence of Barriers with No Computation

Most of the latency is generated by the fine-grained, high-frequency noise of the cluster nodes.

Page 42: A CASE STUDY

Conclusions

Combination of measurement, simulation, and modeling to identify and resolve performance issues on Q.

Used modeling to determine that a problem exists.

Developed computation kernels to quantify OS events: the effect increases with the number of nodes, and the impact is determined by the computational granularity of the application.

Application performance has significantly improved.

The method is also being applied to other large systems.