A. Hoisie & H. Wasserman, ICS 2002, New York, NY

Performance Analysis and Prediction of Large-Scale Scientific Applications
Adolfy Hoisie and Harvey Wasserman
Computer & Computational Sciences Division (CCS)
University of California
Los Alamos National Laboratory
Los Alamos, New Mexico 87545
ICS2002 June 22, 2002
{hoisie, hjw}@lanl.gov
Introduction and Motivation
What is This Tutorial About?
• Overview of performance modeling.
  – Analytical techniques that encapsulate as parameters the performance characteristics of applications and machines
  – Techniques developed at LANL
  – Emphasis on full applications
  – No dependence on specific tools (although data collection is vital)
• Applications of performance models: performance prediction.
  – Tuning roadmap for current bottlenecks
  – Architecture exploration for future systems
  – Software / algorithm changes
  – System installation diagnostics: "Rational System Integration"
What is This Tutorial Really About?
• Insight into performance issues.
  – Performance modeling is the only practical way to obtain quantitative information on how to map real applications to parallel architectures rapidly and with high accuracy
• With this insight you become a more educated buyer/seller/user of computer systems.
  – Help you become a "performance skeptic."
  – Show how to integrate information from various levels of the benchmark hierarchy.
  – Show why "naïve" approaches sometimes don't work.
Why Evaluate Performance?

[Diagram: a vendor evaluates performance to sell a machine; a user evaluates performance to buy a machine and to adjust for technology shifts; new technologies inspire new applications.]

Overall goal: advance the state of the art of computer architecture.
Why Performance Modeling?
• Other performance analysis methods fall short in either accuracy or practicality:
  – Experimental
    • Simulation (UCLA, Dartmouth, Los Alamos)*
      – Greatest architectural flexibility, but takes too long for real applications
    • Trace-driven experiments (UIUC, Barcelona)*
      – Results often lack generality
    • Benchmarking (~ everybody)
      – Limited to current implementation of the code
      – Limited to currently-available architectures
      – Difficult to distinguish between real performance and machine idiosyncrasies

* Partial lists                                             Continued…
Why Performance Modeling?
• Other performance analysis methods (cont'd):
  – Queuing theory: takes too long for real apps
  – Statistical (Wisconsin, ORNL, IBM)*
    • Mean value analysis

• Two questions:
  – Does the code contain enough floating-point operations?
  – Does the compiler make the floating-point parallelism available to the architecture?
• Two answers:
  – Typically only 1 instruction in 3 is a FP instruction
  – See section 3 of this tutorial.
Peak Speed is a Poor Metric

[Figure: if peak performance were a reliable metric, all the blue bars would equal all the red bars in height.]
CPU Performance
CPU Time = Ninst * CPI / Clock rate

  CPU Time = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

  – Instructions / Program: determined by the application and the compiler
  – Cycles / Instruction: determined by the instruction set architecture
  – Seconds / Cycle: determined by the technology
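As a sketch of the arithmetic behind this three-factor decomposition (the instruction count, CPI, and clock below are hypothetical, not measurements from the tutorial):

```python
def cpu_time(n_inst, cpi, clock_hz):
    """CPU time = instructions x cycles/instruction x seconds/cycle."""
    return n_inst * cpi / clock_hz

# Hypothetical workload: 1e9 instructions at CPI 0.9 on a 500 MHz clock.
print(cpu_time(1e9, 0.9, 500e6))  # 1.8 seconds
```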
CPI as a Metric and CPI Profiling

• "Average cycles per instruction" (sum over all instruction types)
• Dangerous for cross-platform comparison, but useful for comparison with optimal values
• CPI Profiling:
  – Look primarily at CPIcompute (CPIo), CPIstall, CPImem
• Two excellent papers:
  – Bhandarkar, D. and Cvetanovic, Z., "Performance Characterization of the Alpha 21164 Microprocessor Using TP and SPEC Workloads," Proc. Second Int. Symp. on High-Perf. Comp. Arch., IEEE Computer Society Press, Los Alamitos, CA, 1996.
  – Bhandarkar, D. and Ding, J., "Performance Characterization of the Pentium Pro Processor," Proc. Third Int. Symp. on High-Perf. Comp. Arch., IEEE Computer Society Press, Los Alamitos, CA, pp. 288-297, 1997.
CPI Profiling
• Use HW counters to obtain:
  – CP, total instructions, cache misses, total mem ops, others.
  – http://icl.cs.utk.edu/projects/papi
• CPI = CPIo + CPImem, but the problem is overlap
  – Old way: use counters, infer CPImem (e.g., SGI Origin)
    • CPI = CPIo + #memops * (miss_rate * miss_penalty)
    • (NOTE: miss_rate is only an indirect measure of performance. Just as we can't measure CPU performance by Ninst only, we can't measure memory performance by hit rate.)
  – Need to know how hits/misses contribute to CPU time
  – Average memory access time for hits/misses
  – New way: measure CPImem, CPIstall directly (e.g., Itanium)
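The "old way" above amounts to one line of arithmetic; a sketch with hypothetical counter-derived values (the memops/instruction ratio, miss rate, and penalty below are illustrative only):

```python
def inferred_cpi(cpi_o, memops_per_inst, miss_rate, miss_penalty_cycles):
    """Infer total CPI from compute CPI plus the memory stall component."""
    return cpi_o + memops_per_inst * miss_rate * miss_penalty_cycles

# Hypothetical: CPIo = 0.25, 0.3 memops/inst, 5% miss rate, 60-cycle penalty.
print(inferred_cpi(0.25, 0.3, 0.05, 60))  # 0.25 + 0.9 = 1.15 cycles/inst
```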
CPI Profiling: SGI Origin2000

Optimal CPI = 0.25 for the MIPS R10000 (issue rate is four instructions per CP).

  Code      Infinite-Cache CPI**   Single-CPU MFLOPS (SGI O2K)   Single-CPU Effic. (SGI O2K)
  HEAT      0.74                   35                            9%
  SWEEP     0.88                   45                            11%
  HYDRO     0.89                   31                            8%
  HYDRO-T   0.90                   63                            16%
  NEUT      0.77                   49                            12%

** H. Wasserman, O. Lubeck, Y. Luo, and F. Bassetti, Proc. SC97, 1997.
CPI Profiling with Memory Model

O. Lubeck, Y. Luo, H. Wasserman and F. Bassetti, "Performance Evaluation of the SGI Origin2000: A Memory-Centric Characterization of LANL ASCI Applications," Proc. SC97.

[Figure: CPI profile for HEAT, HYDRO, HYDRO-T, SWEEP, and NEUT at several problem sizes (e.g. HEAT-50 through NEUT-40), split into % of CPI spent idle waiting on memory, % of CPI spent idle waiting on L2 cache, and % of CPI doing computation (not idle).]

  Code       CPIo   T2   Tm
  HEAT       .74    0    60
  HYDRO      .89    2    50
  HYDRO-T    .90    0    11
  SWEEP      .88    11   43
  NEUT       .77
  (NOMINAL)  .25    11   80
CPI Profiling with PAPI

[Figure: percent of CPI by category for each code and problem size.]

• HYDRO is stride-n (n = linear grid size); HYDRO-T is stride-1
• 100x100 fits in cache; 300x300 does not
MFLOPS as a Metric
• Useful means of characterizing performance, especially to demonstrate efficiency
• Problems:
  – Can be artificially inflated (by algorithm, by code, by compiler)
  – Doesn't work for codes with small numbers of FLOPS
  – No convention for counting FLOPS, and FLOP instruction sets differ:
    A = B * C + D?   A = B * C?   A = B?   A = A / B?
• Use with care
Benchmark Hierarchy

[Diagram: a hierarchy from HW demo, kernels, and basic routines through stripped-down app and full app to full workload; one axis labeled "Understanding Increases," the other "Integration (reality) Increases."]
Kernels or "MicroBenchmarks"

• Small, specially-written programs to isolate and measure one specific performance attribute at a time. Examples:
  – Cache / memory throughput
  – Floating-point processing rate
  – Communication operations
  – I/O
  – Application-specific microkernels
• Most important use is in explanation of observed performance of application benchmarks
• Problems with implementation:

      CALL STARTCLOCK
      DO 20 II=1,LOOPS
        DO 21 I=1,LEN
          R(I)=V1(I)*S1
   21   CONTINUE
   20 CONTINUE
      CALL ENDCLOCK

  Many compilers optimize this entire microbenchmark away.
LINPACK

• Not really a benchmark; a library of linear algebra routines.
• Main advantages: easy => lots of data
• Measures rate for solving dense systems with Gaussian Elimination
• Enormous database of timing data.
• Question: Can you use LINPACK to estimate performance of your application?
  – What portion of your application consists of LINPACK?
  – Knowing this portion, and knowing pure LINPACK performance, how do you relate one to the other?
• Answer: Amdahl's Law (ca. 1968)
Amdahl's Law

Given a machine with 2 modes of computing, V and S, that differ in relative speeds:

[Diagram: with V mode off, the total time T splits into Tv (can use V) and Ts (cannot use V); with V mode on, the total time T' splits into T'v (used V) and Ts (didn't use V).]

Run the code with V off; measure T, Tv, and Ts.
Run the code with V on; measure T'.

  T  = Tv + Ts
  T' = T'v + T's = T'v + Ts

Define r = ratio of the two speeds = Tv / T'v > 1
Define fv = fraction of the code that can take advantage of V = Tv / T = 1 - fs

The resulting speedup, S, is:

  S = T / T' = 1 / [ fv / r + (1 - fv) ]        (Amdahl's Law)
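A minimal sketch of the law as a function, with made-up values of fv and r for illustration:

```python
def amdahl_speedup(f_v, r):
    """Speedup when a fraction f_v of the run can use a mode r times faster."""
    return 1.0 / (f_v / r + (1.0 - f_v))

# 90% of the work sped up 10x gives nowhere near 10x overall:
print(amdahl_speedup(0.9, 10))   # ~5.26
# Even r -> infinity is capped at 1 / (1 - f_v):
print(amdahl_speedup(0.9, 1e9))  # approaches 10.0
```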
Amdahl’s Law Applications
Choose one from column "A" and one from column "B":

  Faster Mode ("A")         Slower Mode ("B")
  Vector                    Scalar, non-vector
  Parallel                  Serial, sequential
  Computation-only          I/O
  Cache hit                 Cache miss
  LINPACK                   non-LINPACK
  Non-conditional comp.     Conditional comp.
  SUBROUTINE DO_FFT         SUBROUTINE SETUP
  etc.                      non-etc.
Amdahl’s Law Applications
• Amdahl's Law bounds the speedup due to any improvement. Example: What will the speedup be if 20% of the exec. time is in inter-processor communications, which we can improve by 10X?

  S = T/T' = 1 / [.2/10 + .8] = 1/0.82 ≈ 1.22
  (even an infinite improvement gives only 1/0.8 = 1.25)

=> Invest resources where time is spent. The slowest portion will dominate.
• Amdahl's Law forces the law of diminishing returns on performance.
  – HPC and Murphy: "If any system component can damage performance, it will."
Amdahl's Law as a Special Case: Bottleneckology*

* J. Worlton, "Toward a Science of Parallel Computation." See also, Bucher & Simmons, "

  Ravg = 1 / [ Σ (i = 1 to N) fi / Ri ]

where Ravg = average execution rate, fi = fraction of work done in mode i, Ri = rate of mode i.

This is a weighted harmonic mean. Use it for averaging rates!

• Emphasizes reciprocals of small numbers.
• Result: computer systems require balance.
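A sketch of the weighted harmonic mean, with made-up fractions and rates showing how the slow mode dominates:

```python
def avg_rate(fractions, rates):
    """Weighted harmonic mean: Ravg = 1 / sum(f_i / R_i)."""
    assert abs(sum(fractions) - 1.0) < 1e-9
    return 1.0 / sum(f / r for f, r in zip(fractions, rates))

# 90% of the work at 100 MFLOPS, 10% at 10 MFLOPS:
print(avg_rate([0.9, 0.1], [100.0, 10.0]))  # ~52.6 MFLOPS, not 91
```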
Summary: Single-CPU Performance Metrics

• Peak rate and clock rate say extremely little about actual performance.
• Benchmarking is the process by which we determine computer performance on a specific workload of interest.
  => Be careful generalizing results from a workload.
• You cannot represent the performance of a high-performance computer with only a single number.
Parallel Performance
Parallel Performance Metrics

• Absolute: Elapsed (wall-clock) time = T(n)
• Speedup = S(n) = T(1) / T(n), where T(1) is the time for the best serial implementation.
  => Performance improvement due to parallelism
• Ideal Speedup = SI(n) = n
  – Theoretical limit; obtainable rarely
  – Ignores all of real life
• Parallel Efficiency = E(n) = T(1) / [n T(n)]
• These definitions apply to a fixed-problem experiment.
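These definitions translate directly into code (the timings in the example are hypothetical):

```python
def speedup(t_serial, t_parallel):
    """S(n) = T(1) / T(n)."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_procs):
    """E(n) = T(1) / [n * T(n)]."""
    return t_serial / (n_procs * t_parallel)

# Hypothetical run: 100 s serial, 20 s on 8 processors.
print(speedup(100.0, 20.0))        # 5.0
print(efficiency(100.0, 20.0, 8))  # 0.625
```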
A Pictorial of Parallel Metrics

[Figure: speedup S vs. number of processors, compared with the ideal speedup SI and Amdahl's upper limit; the gaps are attributed to the fraction of serial code, communication HW, and communication SW.]

              Sequential Work
  S <= -----------------------------------------
        Max(Work + Sync Time + Comm Cost)
Parallel Performance Metrics: Speedup

Speedup is only one characteristic of a program - it is not synonymous with performance. In this comparison of two machines the code achieves comparable speedups, but one of the machines is faster.

• Amdahl Limit: SA(n) = T(1) / T(n) = 1 / [ f / n + (1 - f) ]
• This bounds the speedup based only on the fraction of the code that cannot use parallelism (1 - f); it ignores all other factors.
• SA --> 1 / (1 - f) as n --> infinity
Fixed-Problem Size Scaling (Cont’d)
• Efficiency (n) = T(1) / [ T(n) * n]
• Memory requirements decrease with n
• Surface-to-volume ratio increases with n
• Superlinear speedup possible from cache effects
• Motivation: what is the largest # of procs I can use effectively, and what is the fastest time in which I can solve a given problem?
• Problems:
  – Sequential runs often not possible (large problems)
  – Speedup (and efficiency) is misleading if processors are slow
Fixed-Problem Size Scaling: Examples

S. Goedecker and Adolfy Hoisie, "Achieving High Performance in Numerical Computations on RISC Workstations and Parallel Systems," International Conference on Computational Physics PC'97, Santa Cruz, August 25-28, 1997.
Fixed-Problem Size Scaling: Examples
• This shows why choice of problem size is important for performance studies.
  – But problem sizes should be chosen to reflect workload, not to portray the
• Motivation: want to use a larger machine to solve a larger global problem in the same amount of time.
• Memory and surface-to-volume effects remain constant.
Scaled Speedup Experiments
• Be wary of benchmarks that scale problems to unreasonably large sizes:
  – scale the problem to fill the machine when a smaller size will do;
  – simplify the science in order to add computation
    -> "World's largest MD simulation - 10 gazillion particles!"
  – run grid sizes for only a few cycles because the full run won't finish during this lifetime, or because the resolution makes no sense compared with the resolution of the input data
• Suggested alternate approach (Gustafson): constant-time benchmarks
  – run the code for a fixed time and measure the work done
[Table residue: Processors, NChains, Time, Natoms, Time per atom, Time per PE, Time per atom, Efficiency.]

Amdahl Limit: SA(n) = T(1) / T(n) = 1 / [ f / n + (1 - f) ], where f = fraction parallel.
Q: What about other factors?
• Complexity of parallelism surfaces in the large number of metrics utilized for analyzing it.
• Speedup can be misleading, intrinsically, and due to its various definitions.
• Amdahl's Law applies, leading to a case of potentially diminishing returns due to a variety of factors.
Modeling Communication
  Tcomm = Nmsg * Tmsg

Nmsg is the frequency, or "non-overlapped" # of messages.
Tmsg is the time for one point-to-point communication, measured by a "ping-pong" experiment:

  Tmsg = ts + tw * L

where Tmsg is the time to send a message of length L, ts is the "start-up" time (size-independent), and tw is the (asymptotic) time per word (1/BW).

This model helps pinpoint the comm bottleneck: the latency term dominates for "short" messages; the BW term for "long" messages.

Problem: often tw depends on L because of buffering.
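A sketch of the linear model as code; the latency and per-word time below are illustrative, not measurements from the tutorial:

```python
def t_msg(length_words, t_s, t_w):
    """Linear (latency-bandwidth) model: Tmsg = ts + tw * L."""
    return t_s + t_w * length_words

# Illustrative parameters: ts = 10 us, tw = 10 ns/word.
# The latency and bandwidth terms are equal at L = ts/tw = 1000 words:
print(t_msg(1000, 10e-6, 10e-9))  # 2e-05 seconds
```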
Modeling Communication (Cont’d)
[Figure: average rate (MB/s) on 2 procs vs. message size (KB), ASCI Red MPI, fit by:]

  rate = r_inf / (1 + n_1/2 / n)

  r_inf = 1 / tw
  n_1/2 = ts / tw

• Note the difficulties with this model:
  – Meaning of "bandwidth" vs. effective bandwidth.
  – Difficulty in resolving effects due to different protocols.
  – The two parameters are not truly independent.
Modeling Communication (Cont'd)

[Figure: T = time vs. L = message length; slope = tw = cost per word = 1/BW; intercept: ts = startup cost = time to send a message of length 0.]

  Machine         ts     Bandwidth (MB/s)
  IBM SP          13     68
  SGI O2K intra   12     120
  SGI O2K inter   125    70
  ASCI Red        28     340
  CRAY T3E        16     250
  Quadrics        5      200
Modeling Communication (Cont’d)
Quadrics QsNet,Compaq ES-40
Piece-wise linear model:

  0 <= n <= 32:    T ~ 5 µs
  64 <= n <= 512:  T ~ 7 µs + 18 ns / word
  n > 512:         T ~ 9 µs + 5 ns / word
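The piece-wise fit can be sketched as a function. Two assumptions on my part: the start-up constants are in microseconds (the original "ms" looks like mangled "µs"), and the gap between 32 and 64 words is covered by extending the first segment:

```python
def t_quadrics(n_words):
    """Piece-wise linear message time (seconds) vs. length in words.

    Constants are from the slide; microsecond units and the handling
    of 32 < n < 64 are assumptions.
    """
    if n_words < 64:
        return 5e-6
    elif n_words <= 512:
        return 7e-6 + 18e-9 * n_words
    else:
        return 9e-6 + 5e-9 * n_words

print(t_quadrics(16))    # 5e-06
print(t_quadrics(1024))  # 1.412e-05
```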
Modeling Communication: The LogP Model

David Culler, Richard Martin, et al., UC Berkeley

• L (latency): latency in sending a message between modules
• o (overhead): overhead felt by the processor on sending or receiving a msg - proc busy!
• g (gap): gap between successive sends or receives (1/rate)
• P: processors

[Diagram: P processor/memory (P-M) modules connected by an interconnection network.]

• Round-trip time: 2 x (2o + L)
• ts = o + L + o = time that the node is busy and cannot perform other ops
• LogGP model: G = 1/BW
Comparison of Communication Models

• The linear model doesn't answer questions related to overlap of messages and how soon a second msg can be started after the first.
  – The application model can include these effects.
  – Ex.: the app tries to send messages faster than the network interface can handle; this shows up as (artifactual) increased communication time in the linear model.
• The linear model abstracts some details of the architecture.
  – However, all mere mortals deal with latency / BW.
• Need to know how much time is spent from user space to user space - can use either model.
Twelve* Ways to Fool the Masses When Giving Results on Parallel Computers†

1. Quote only 32-bit performance results, not 64-bit, and compare your 32-bit results with others' 64-bit results.
2. Present inner kernel performance figures as the performance of the entire application.
3. Quietly employ assembly code, and compare your assembly-coded results with others' Fortran or C implementations.
4. Scale up the problem size with the number of processors, but don't clearly disclose this fact.
5. Quote performance results linearly projected to a full system.
6. Compare your results against scalar, unoptimized, single-processor code on Crays.
7. Compare with an old code on an obsolete system.
Twelve* Ways (Cont’d)
8. Base MFLOPS operation counts on the parallel implementation instead of on the best sequential implementation.
9. Quote performance in terms of processor utilization, parallel speedup, or MFLOPS (peak)/dollar.
10. Mutilate the algorithm used in the parallel implementation to match the architecture.
11. Measure parallel run times on a dedicated system, but measure "conventional" run times on a busy system.
12. If all else fails, show pretty pictures and animated videos and don't talk about performance.
13. (*hjw) If all else fails, rely on peak speed.

† David H. Bailey (LBNL), Supercomputing Review, Aug. 1991; Scientific Programming, 1 (2), 1993; NASA Ames NAS Report RNR-91-020 (http://www.nas.nasa.gov/)
Twelve* Ways (Cont'd)

There are at least 3 of the 12 "Ways to Fool the Masses" used in this preprint. Can you find them?
Performance Modeling Case Studies
Modeling Parallel Performance
• Single-Node Effects:
  – The CPU and its memory subsystem.
  – Parameterize by single-node speed, subgrid size, # of FLOPS per cell, and possibly memory parameters.
• Multi-Node Effects:
  – Algorithmic scalability: features of the algorithm in the absence of implementation issues (comm ∝ N^(1/3) / P^(1/2); comp ∝ N^(2/3) / P)
  – Parallel scalability: the real, measurable behavior of the code on a real system, with parallel overhead, communications, and load imbalance included
• Modeling these effects allows identification of current bottlenecks for tuning, and prediction of performance on future systems - an "experiment-ahead" approach.
• Use care in choosing problem sizes!
General Strategy (Review)
• Use microkernels and models to understand low-level behavior
• Use an application model to account for the occurrence of low-level primitives
• Integrate the two in an overall model
• Use care in choosing problem sizes!
"Fundamental Equation" of Modeling

  Trun = Tcomputation + Tcommunication - Toverlap

• Tcomputation is easiest to model. A coarse approximation is based on the number of grid points, a characteristic Mflop rate, and a sensitivity analysis for cache behavior.
• Tcommunication is trickier. It depends on the type of communication kernels (blocking, non-blocking), point-to-point or global communications, communication parameters, network topology, and contention. The linear model (latency-bandwidth) or LogGP can be utilized.
• Toverlap is the hardest. It depends on algorithmic overlap, communication/computation overlap in hardware, load balancing, contention, runtime variability, and overall machine load.
• Solve the particle transport equation, where the density distribution of particles N(x, E, Ω, t) is the unknown.
• Use discrete directions Ω (e.g. S6 has 6 per octant).
• Solve each spatial cell in a specified order for a single ordinate (direction), subject to the constraint that a cell cannot be solved for a particular direction until its "upstream" neighbors have been solved.
• 2D domain decomposition with "blocking":

  for each octant
    for each angle-block
      for each z-block
        receive east
        receive north
        compute subgrid
        send west
        send south
      end for
    end for
  end for

• The # of active cells (processors) varies from one diagonal to the next.
• Blocking in "z" leads to a tradeoff: parallel efficiency vs. communication intensity.
Basic Pipeline Model

• Nsweep wavefronts "scan" the processor grid.
• Each scan requires Ns steps.
• There's a delay of d between scans.
• The total number of steps, S, for all wavefronts is:

  S = Ns + d (Nsweep - 1)

• The challenge is to find Ns and d.
• For Sn: Nsweep = zblocks * angleblocks * octants
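The step count is a one-liner; a sketch with illustrative numbers:

```python
def total_steps(n_s, d, n_sweep):
    """S = Ns + d * (Nsweep - 1): Ns steps for the first wavefront,
    plus a delay of d before each of the remaining Nsweep - 1 scans."""
    return n_s + d * (n_sweep - 1)

# E.g. a scan of Ns = 10 steps, Nsweep = 3 wavefronts, delay d = 4:
print(total_steps(10, 4, 3))  # 18
```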
Communication Pipeline

  Ns,comm = 2(Px - 1) + 2(Py - 1)
  d_comm  = 4
  Tmsg    = t0 + N / B      (start-up time plus message size over bandwidth)

  Tcomm = [2(Px + Py - 2) + 4(Nsweep - 1)] * Tmsg

[Figure: numbered message steps sweeping the Px x Py grid of processor nodes.]
Computation Pipeline

  Ns,comp = Px + Py - 1
  d_comp  = 1

  Tcpu = (Nx/Px * Ny/Py * Nz/Kb * Na/Ab) * Nflops / Rflops

  Tcomp = [(Px + Py - 1) + (Nsweep - 1)] * Tcpu

[Figure: computation wavefronts N and N+1 sweeping the Px x Py grid.]
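Combining the two pipelines gives the core of the SWEEP3D model; a sketch (the grid size, wavefront count, and per-step costs in the example are hypothetical):

```python
def t_comp(px, py, n_sweep, t_cpu):
    """Computation pipeline: Ns,comp = Px + Py - 1 stages, d_comp = 1."""
    return ((px + py - 1) + (n_sweep - 1)) * t_cpu

def t_comm(px, py, n_sweep, t_msg):
    """Communication pipeline: Ns,comm = 2(Px + Py - 2) stages, d_comm = 4."""
    return (2 * (px + py - 2) + 4 * (n_sweep - 1)) * t_msg

# Hypothetical 8x8 processor grid, 48 wavefronts, per-step costs:
print(t_comp(8, 8, 48, 1e-3))  # (15 + 47) * 1 ms   = 0.062 s
print(t_comm(8, 8, 48, 1e-4))  # (28 + 188) * 0.1 ms = 0.0216 s
```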
Alternative Modeling Approaches?

  (Py-1)*Px procs have South neighbors: all send
  (Py-1)*Px procs have North neighbors: all receive
  (Px-1)*Py procs have East neighbors: all send
  (Px-1)*Py procs have West neighbors: all receive
  ==========================================
  Nmsg = [(Py-1)*Px + (Px-1)*Py] pairs of sends/receives

  A) T = Nmsg * Tmsg + (Px * Py) * Tcpu
  B) T = Px * Py * 2 * Tmsg + (Px * Py) * Tcpu

Do you see any problem with either of these two alternative approaches?

A) is a (wrong) upper bound. B) is a (wrong) lower bound. Both fail to accurately describe the overlap in communication and computation. Both fail to account for the delays due to the different repetition rates of the two types of wavefronts. Both are wrong… but don't feel bad if you almost agreed with one of them… we struggled with this for quite some time.
VAMPIR Analysis with Two Wavefronts
Validation Regimes

  Tcomp = [(Px + Py - 1) + (Nsweep - 1)] * Tcpu
  Tcomm = [2(Px + Py - 2) + 4(Nsweep - 1)] * Tmsg

• Nsweep = 1: validates the number of pipeline stages in Tcomp and Tcomm, as a function of (Px + Py), in the available range of processor configurations.
• Nsweep ~ (Px + Py): validates the case where the contributions of the (Px + Py) and Nsweep terms are comparable.
• Nsweep >> (Px + Py): validates the repetition rate of the pipeline.

In each regime, cases can be identified where:
  Tcomp >> Tcomm
  Tcomp = 0
  Tcomp ~ Tcomm
Validation: Weak Scalability

[Figures: measured vs. model time (seconds) as a function of Px + Py, for: Tcomp dominant, Nsweep = 10, SGI Origin; Tcomp = 0, Nsweep = 10, CRAY T3E; and Tcomp dominant, Nsweep = 10, SGI Origin, with Tcomp from the model also shown.]
Validation: Strong Scalability

[Figure: measured vs. model time (seconds) as a function of Px + Py on the CRAY T3E.]
Model for Multiple Octants

• The model so far represents sweeps generated by the angle/k-block loops.
• The application consists of multiple octants and multiple iterations.
• Multiple octants extend the pipeline length and include dependences between octants.

Pipelined wavefront abstraction:

  for each octant
    for each angle-block
      for each z-block
        receive east
        receive north
        compute sub-grid
        send west
        send south
      end for
    end for
  end for
Multiple Octant Processing

[Figure: processing order of the eight octants (-i -j -k, -i -j +k, -i +j -k, -i +j +k, +i -j -k, +i -j +k, +i +j -k, +i +j +k) shown as numbered wavefront steps across the processor grid.]
Multiple Octant Processing

Table 1. Octant Ordering and Consequent Wavefront Delay in SWEEP3D

  Octant From Which Sweep Originates    Delay from Previous Sweep
  -i -j -k
  -i -j +k                              1
  -i +j -k                              2 + (Py - 1)
  -i +j +k                              1
  +i -j -k                              2 + (Py - 1) + (Px - 1)
  +i -j +k                              1
  +i +j -k                              2 + (Py - 1)
  +i +j +k                              1
  Total steps for 8 octants             10 + (2Px + 4Py - 6)

• Result: the pipeline length is 3 times longer than that of 1 octant (but much less than 8 times longer).
• Result: the pipeline length is asymmetric with respect to the processor grid.
Multiple Octant & Iteration Processing

[Figure: SWEEP3D on QSC, mk=10, mmi=3, all CPUs per node; total elapsed runtime (Tmodel vs. Tmeasured) as a function of Px + Py.]

Compaq ES40 cluster (4 processors per SMP). Model parameters included a CPU processing rate = 330 MFLOPS (16.5% of peak), MPI latency = 11 µs, MPI BW = 290 MB/s (message size = 1500 words).
Blocking Strategies

• Larger block sizes lead to an increased computation / communication ratio.
• For wavefront algorithms, smaller blocks yield higher parallel efficiency.

[Figure: SWEEP on CRAY T3E for several k-block sizes (10, 100, 500, 1000); measured time vs. model as a function of Px + Py.]
MPPs vs. Clusters of SMPs

• SWEEP3D results so far assumed that a logical processor mesh could be embedded into the machine topology such that:
  – each mesh node maps to a unique processor, and
  – each mesh edge maps to a unique router link.
  – Required to maintain comm. concurrency within wavefronts.
• Q: What happens to d and Nsteps on a cluster of SMPs with reduced connectivity?
  – Obvious latency & BW effects, but what else?
  – Obvious relevance to ASCI Blue systems as well as others.

[Figure: wavefront message steps on the Px x Py grid when processors share SMP nodes.]
Clusters of SMPs

[Diagram: notation - SMP, Sx, Sy, m, n, Lx, Ly.]

• This problem was solved by induction/emulation.
• Wavefronts are delayed at inter-SMP boundaries if a message from a previous wavefront is already using an inter-SMP link.
• The communication step for a given message will be "bumped" as wavefronts "collide" with one another. Wavefronts eventually scan the PE grid at a slower pace than in the "MPP" case.
• One result: don't need full connectivity - only need 1/2 of S
SWEEP Model: Key Practices

• Creation of application code microbenchmarks
  – Isolate the communication pipeline and the computation pipeline
  – Model "from the inside-out": understand one sweep, then many
• Wide validation range, controlled by blocking input parameters:
  – Computation-dominant
  – Communication-dominant
  – Pipeline-dominant
  – Sweep-dominant
Case Study Conclusions (I)
• The SWEEP3D models account for the overlap in the communication and computation components of the algorithm.
• The models are parametric: basic machine performance numbers (latency, MFLOPS rate, bandwidth) and application characteristics (problem size, etc.) serve as input.
• The MPP model was validated in all "regimes," and on three parallel architectures of wide practical interest (IBM SP2, Cray T3E and SGI Origin 2000).
Case Study Conclusions (II)
• The SWEEP3D cluster model required an induction process and emulation after examination of wavefront behavior.
• It is the first model demonstrating the effect of reduced-connectivity networks on application performance of which we are aware.
• A relatively simple change in the system - the cluster topology - resulted in a much more complex communication model.
• SAGE – SAIC's Adaptive Grid Eulerian hydrocode
• Hydrodynamics code with AMR
• Applied to: water shock, energy coupling, hydro instability problems, etc.
• Comes from the Los Alamos CRESTONE project
• Represents a large class of production ASCI applications at Los Alamos

[Figure caption: a one-kilometer iron asteroid, struck with an impact equal to about 1.5 trillion tons of TNT, produced a jet of water more than 12 miles high. Wave velocities for the largest asteroid will be roughly 380 miles an hour. Initial tsunami waves are more than half a mile high, abating to about two-thirds of that height 40 miles in all directions from the point of impact.]
Performance Model

• Understand the key characteristics of the code
  – Main data structures and their decomposition
  – Processing stages
• Slab parallelization strategy
  – How communication scales
  – Communication patterns
  – Effect of network topology
• Processing stages
  – Gather data, computation, scatter data
Cell and Block Decomposition

[Figure: the global X-Y-Z grid divided among PE1…PE4 as slabs in the Y-Z plane; blocks numbered 1, 2, 3, …, M+1.]
One Bit of Algebra: Scaling Analysis

• The total volume is: V = E·P = L³
• The volume of each sub-grid is: E = l·L²
  where P is the number of PEs, l is the short side of the slab (in the Z dimension), and L is the side of the slab in the X and Y directions (assuming a square grid in the X-Y plane).
• The surface of the slab, L², in the X-Y plane is: L² = V^(2/3) = (E·P)^(2/3)
  => communication grows with the number of processors!
• Consider again the volume of the entire grid: V = E·P = (l·L²)·P
• This is partitioned across PEs such that there will be L/(2P) foils of width 2 on each PE:
  L/(2P) = (E·P)^(1/3) / (2P) = (E / 8P²)^(1/3)
• When this has a value less than one, a processor will contain less than a single foil, i.e. when P > sqrt(E/8) the number of processors involved in boundary exchange increases!
• There is a maximum distance between the processors that hold a foil, termed the "PE Distance" (PED).
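The scaling relations above can be sketched directly; the E = 13,500 check reproduces the threshold of P > 41 quoted on a later slide:

```python
import math

def foils_per_pe(e, p):
    """Width-2 foils per PE: (E / 8P^2)^(1/3); drops below 1 when P > sqrt(E/8)."""
    return (e / (8.0 * p * p)) ** (1.0 / 3.0)

def pe_distance(e, p):
    """Maximum distance between PEs holding one foil: ceil((8P^2 / E)^(1/3))."""
    return math.ceil((8.0 * p * p / e) ** (1.0 / 3.0))

e = 13500.0
print(foils_per_pe(e, 41) > 1)  # True: each PE still holds a whole foil
print(foils_per_pe(e, 42) > 1)  # False: boundary exchange now spreads out
print(pe_distance(e, 1000))
```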
Slab Decomposition

• First E cells -> PE1 … (E = numcells_PE)
• Volume = E·P; communication surface in Z = (E·P)^(2/3)

[Figure: slab assignment to PEs for 2, 8, 64, and 256 PEs.]
Slab Scaling: (1) Surface Size

• Surface split across PEs when: P > sqrt(E/8)   (for E = 13,500: P > 41)

[Figure: comparison of boundary surfaces - total surface (cells) vs. # PEs, for the grid surface and the PE surface.]
Slab Scaling: (2) PE Distance

• PE distance = ceil[ (8P²/E)^(1/3) ]

[Figure: comparison of inter-PE communication (neighbor) distances vs. # PEs.]
Effect of Network Topology: ASCI Blue Mountain

• PE distance impacts performance on ASCI Blue Mountain.
• PE distance results in many PEs communicating across a small number of links.

[Figure: PEs spread across 128-node SMPs connected by n HiPPI links.]
Effect of Network Topology: Compaq (Quadrics)

• PE distance is hidden on the Compaq (Quadrics) network.
• PE distance has maximum effect when all PEs are communicating out of the SMP node.

[Figure: 4-processor SMP nodes on the Quadrics network.]
• SAGE consists of many stages per cycle:
  – Gather (1+) - obtain data from remote PEs
  – Compute
  – Scatter (1+) - update data on remote PEs
• Tokens act as data templates for data transfers
[Figure: "Gather/Scatter Comms" - cells n-4 … n+4 shown through the Gather, Compute, and Scatter stages]

  Direction   Size (cells)
  X, LO       4
  Y, LO       152
  Z, LO       6860
  X, HI       4
  Y, HI       152
  Z, HI       6860
Processing Stages in SAGE
• Encapsulates code characteristics
• Parameterized in terms of:
  – Code parameters (e.g. cells per PE)
  – System parameters (CPU speed, communication latency & bandwidth, memory contention)
• Validated on:
  – ASCI Blue Mountain (SGI Origin 2000)
  – CRAY T3E
  – ASCI Red (Intel)
  – ASCI White (IBM SP3)
  – Compaq AlphaServer SMP clusters
• Single parameterized model
  – system-specific parameters
• Model is highly accurate (< 10% error)
Model Validation
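As a rough illustration of what such a single parameterized model looks like, here is a toy cycle-time formula with code parameters (cells and boundary surface per PE) and system parameters (CPU rate, latency, bandwidth). The cost expression and every constant are assumptions for the sketch, not the actual SAGE model:

```python
def cycle_time(cells_per_pe, flops_per_cell, cpu_rate,
               latency, bandwidth, surface_cells,
               bytes_per_cell=8, msgs_per_exchange=2):
    """Toy model: T_cycle = T_compute + T_gather + T_scatter."""
    t_comp = cells_per_pe * flops_per_cell / cpu_rate
    # each gather/scatter exchanges the boundary surface as a few messages
    t_exchange = msgs_per_exchange * (
        latency + surface_cells * bytes_per_cell / bandwidth)
    return t_comp + 2 * t_exchange  # one gather + one scatter

# predicting another machine means swapping in its system parameters:
t = cycle_time(cells_per_pe=13_500, flops_per_cell=5_000, cpu_rate=500e6,
               latency=10e-6, bandwidth=100e6, surface_cells=6860)
```

Validation then amounts to comparing the modeled cycle time against measured cycle times across PE counts, which is what the following plots do.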
[Figure: "SAGE Performance (ASCI Blue Mountain)" - cycle time (0 to 6 s) vs. # PEs (100 to 10,000), Prediction vs. Measurement]
i) ASCI Blue Mountain
[Figure: "SAGE Performance (ASCI White)" - cycle time (0 to 2 s) vs. # PEs (1 to 10,000), Prediction vs. Measurement]
ii) ASCI White
At time of testing, only 8 nodes available
[Figure: "SAGE Performance (Compaq ES45 AlphaServer)" - cycle time (0 to 0.6 s) vs. # PEs (1 to 100), Prediction vs. Measurement]
iii) Compaq ES45 AlphaServer Cluster
Applications of Performance Models
Predictive Value of Models
• Fast exploration of design space:
  – New architectures with increased comm. BW, decreased MPI latency, upgrade of CPU speed
    • Example: SN transport on hypothetical 100-TF system
  – New algorithms / coding strategies
    • Example: different parallel decomposition method
• Can estimate improvement prior to coding effort
• Example: SAGE
• SN transport has "well-defined" performance goals:
  – (1000)^3 cells, 10,000 time steps, 40 TB memory, 5000 unknowns per cell, 30-hour execution time goal.
• Best performance at time of model: 0.1 µsec grind time per phase-space cell
  – Total execution time for 10^4 time steps: 54 years
• Design space includes: problem size, # of processors, geometry of cluster (size and topology), communication parameters, computation parameters, optimal (problem) blocking sizes, target optimization (e.g., runtime, problem size)
Estimates of SWEEP3D Performance on a Hypothetical Future-Generation (100-TFLOPS) System as a Function of MPI Latency and Sustained Per-Processor Computing Rate:

  MPI Latency    Runtime (hours) at 10% of Peak    Runtime (hours) at 50% of Peak
  0.1 ms         185                               58
  1.0 ms         205                               78
  10 ms          297                               104
SWEEP SMP-Cluster Results: 1-Billion Cells
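A design-space study like the table above can be scripted once a model exists. The estimator below captures only the shape of that study (compute time at a sustained fraction of peak, plus a latency term on the critical path); every constant is a placeholder of mine and it is not calibrated to reproduce the table's numbers:

```python
def sweep_runtime_hours(total_cells, unknowns_per_cell, timesteps,
                        peak_flops, sustained_fraction, flops_per_unknown,
                        mpi_latency, msgs_per_step):
    """Rough SN-transport runtime estimate (illustrative only)."""
    work = total_cells * unknowns_per_cell * timesteps * flops_per_unknown
    t_compute = work / (peak_flops * sustained_fraction)
    t_latency = msgs_per_step * timesteps * mpi_latency  # critical path
    return (t_compute + t_latency) / 3600.0

# sweep the two table axes: sustained rate and MPI latency
for frac in (0.10, 0.50):
    for lat in (0.1e-3, 1.0e-3, 10e-3):
        hours = sweep_runtime_hours(1e9, 5000, 1e4, 100e12, frac, 100,
                                    lat, msgs_per_step=10)
        print(frac, lat, round(hours, 1))
```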
Architecture Exploration (I)
• Predictions for the 30T
[Figure: "SAGE Performance Model (ES45)" - time for 1 cycle (0.0 to 0.9 s) vs. # PEs (1 to 100,000), Prediction vs. Measurement]
50 MFLOPS/cpu, L = 100 us, BW = 100 MB/s, 4 x 4 x 100 subgrid, optimal blocking, 10e7 cells total, 1 link each direction between hosts.
[Figure: "Origin Cluster" - modeled time (0.00 to 0.30 s) vs. # procs (0 to 25,000), showing Tcomm-SMP, Tcomm-MPP, and Tcomp]
Architecture Exploration (II)
Sensitivity to the number of links:
[Figure: "Model-MPP" - modeled execution time (0 to 700 hr) vs. Px x Py (0 to 25,000), for L = 1, 2, 3, and 4 links]
50 MFLOPS/cpu, L = 100 us, BW = 100 MB/s, 4 x 4 x 100 subgrid, optimal blocking, 10e7 cells total, NG = 30, 12 iters, 10e4 timesteps
Architecture Exploration (III)
[Figure: time (100 to 300 hours) vs. # processors (0 to 8000), for 16, 32, 64, 128, and 256 CPUs per SMP; L = min(sx, sy)/4]
Sensitivity Analysis on SMP Size
Architecture Exploration (IV)
• Minimum surface-to-volume ratio
  – minimizes communication time (Gather & Scatter)
[Figure: three example sub-grid decompositions - Case 1: 2x2x1, Case 2: 2x1x1, Case 3: 1x1x1]
Application Optimization
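The surface-to-volume argument can be checked directly by counting the boundary cells a PE exchanges under different decompositions; the grid size and shapes below are illustrative choices of mine:

```python
def pe_surface(nx, ny, nz, px, py, pz):
    """Boundary cells one interior PE exchanges for a (px, py, pz)
    decomposition of an nx x ny x nz grid (assumes exact divisibility
    and face-neighbour exchange only)."""
    sx, sy, sz = nx // px, ny // py, nz // pz  # local sub-grid sides
    surface = 0
    if px > 1: surface += 2 * sy * sz  # two X faces
    if py > 1: surface += 2 * sx * sz  # two Y faces
    if pz > 1: surface += 2 * sx * sy  # two Z faces
    return surface

# 64 PEs on a 192^3 grid: the cube-like split exchanges the least data
for shape in [(64, 1, 1), (8, 8, 1), (4, 4, 4)]:
    print(shape, pe_surface(192, 192, 192, *shape))
```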
Cube vs Slab (Compaq ES45)
• Cube Surface: 4 times smaller than Slab
[Figure: "Comparison of Slab and Cube PE Surface Sizes" - total surface (0 to 1.8e4 cells) vs. # PEs (1 to 100,000), Slab vs. Cube]
Application Optimization
Cube vs Slab (Compaq ES45)
• Cube PE distance > Slab PE distance
[Figure: "Comparison of Slab and Cube PE Distances" - PE distance (1 to 1000, log scale) vs. # PEs (1 to 100,000): Slab (Z), Cube (Y), Cube (Z)]
Application Optimization
[Figure: "SAGE Performance Model - Comparison of Slab and Cube (ES45)" - time for 1 cycle (0.0 to 0.9 s) vs. # PEs (1 to 100,000), Slab vs. Cube]
• Expect performance improvement using cube
Cube vs Slab (Compaq ES45)
Application Optimization
SAGE - Performance Data (Compaq ES45)
[Figure series: cycle time (0 to 1.6 s) vs. # PEs (1 to 1000); the model curve is shown first, then measurements from Sept 9th 01, Oct 2nd 01, and Oct 24th 01 are added in successive slides]
• Model gives an expectation of performance
• Model used to validate measurements!
Rational System Integration
Final Thoughts
Performance Engineering
Performance-engineered system: The components (application and system) are parameterized and modeled, and a constitutive model is proposed and validated.
Predictions are made based on the model. The model is meant to be updated, refined, and further validated as new factors come into play.
Final Thoughts (1 of 4)
• Application / architecture mapping is the key - not lists of raw basic machine characteristics.
• Point design studies need to address a specific workload.
• Performance and scalability modeling is an effective "tool" for workload characterization, system design, application optimization, and algorithm-architecture mapping.
• Back-of-the-envelope performance predictions are risky (outright wrong?), given the complexity of analysis in a multidimensional performance space.
• Applications and systems at this scale need to be performance-engineered -- modeling is the means to analysis.
• We offered a practical methodology for performance analysis of large-scale scientific applications.
  – Adds insight into current performance
  – Allows prediction of performance on future systems
  – Combines application- and system-dependent information.
• Naïve metrics for parallel scaling don't work.
• The methodology is not tied to any particular tool(s)
  – The model is the tool!
Final Thoughts (2 of 4)
Final Thoughts (3 of 4)
• Performance evaluation of supercomputer systems: if done "properly" you can get any answer you want. :)
  – "One way to get good performance is to redefine 'good.'" (K. Kennedy, 6/7/99)
• Performance analysis requires information from several levels of the benchmarking hierarchy:
  – Kernels alone do not characterize the performance of a supercomputer.
  – Use kernels to help understand performance of parts of your application.
Final Thoughts (4 of 4)
• Amdahl's Law requires balance in all components that contribute to performance.
• Modeling is crucial.
• Single-processor performance is the bottleneck much of the time.
• Responsible, careful evaluation of high-performance machines is a necessary condition for continued progress and future success of parallel computing.
• Be an educated buyer!
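The Amdahl's Law point above is easy to make concrete with the law's one-line form:

```python
def amdahl_speedup(parallel_fraction, n_procs):
    """Amdahl's Law: speedup = 1 / ((1 - f) + f / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_procs)

# a 5% serial fraction caps speedup below 20x, however many PEs you buy:
print(amdahl_speedup(0.95, 1_000_000))  # just under 20
```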
Acknowledgments and Disclaimers
• Other members, LANL Parallel Architectures and Algorithms Team:
– Eitan Frachtenberg (Hebrew Univ, Israel)
– Vladimir Getov (Univ. of Westminster, UK)
– Darren Kerbyson
– Michael Lang
– Scott Pakin
– Juan Fernandez Peinador (Univ. of Valencia, Spain)
– Fabrizio Petrini
• Thanks to:
  – US Department of Energy through Los Alamos National Laboratory contract W-7405-ENG-36
– Los Alamos Computer Science Institute
• Note: Any benchmark results presented herein reflect our workload. Results from other workloads may vary.
Important Resources (1 of 4)
• Hennessy, J. L. and Patterson, D. A., Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Mateo, CA, 1990. Third edition, 2002.
• Patterson, D. A. and Hennessy, J. L., Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann Publishers, San Mateo, CA, 1993. Second edition, 1998.
• D. Kuck, "High Performance Computing," Oxford U. Press (New York), 1996.
• D. Culler, A. Gupta, and J. P. Singh, "Parallel Computer Architecture," Morgan Kaufmann (San Francisco), 1998.
• K. Hwang, "Advanced Computer Architecture," McGraw-Hill (New York), 1993.
• D. Culler, R. Karp, D. Patterson, A. Sahay, E. Santos, K. Schauser, R. Subramonian, and T. von Eicken, "LogP: A Practical Model of Parallel Computation," Communications of the ACM, 39(11):78-85, Nov. 1996.
• R. Martin, A. Vahdat, D. Culler, T. Anderson, "Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture," International Symposium on Computer Architecture, Denver, CO, June 1997. http://www.cs.berkeley.edu/~rmartin/papers/logp.ps
• C. Holt, M. Heinrich, J. P. Singh, E. Rothberg, and J. L. Hennessy, "Effects of Latency, Occupancy, and Bandwidth in DSM Multiprocessors," Stanford Univ. Comp. Sci. Report CSL-TR-95-660, 1/95.
• D. A. Patterson: http://www.cs.berkeley.edu/~pattrsn/252S98/index.html
• Ian Foster, "Designing and Building Parallel Programs," Addison Wesley (), 1995, and http://www.mcs.anl.gov/dbpp/
• G. Fox, R. Williams, and P. Messina, "Parallel Computing Works!" Morgan Kaufmann, 1994.
• P. H. Worley, "The Effects of Time Constraints on Scaled Speedup," SIAM J. Sci. Stat. Comput., 11(5):838-858, 1990.
• Zagha, M., Larson, B., Turner, S., and Itzkowitz, M., "Performance Analysis Using MIPS R10000 Performance Counters," Proc. SC96, IEEE Computer Society.
• Torrellas, J., Solihin, Y., and Lam, V., "Scal-Tool: Pinpointing and Quantifying Scalability Bottlenecks in DSM Multiprocessors," SC99.
Important Resources (2 of 4)
Important Resources (3 of 4)
Darren J. Kerbyson, Shawn Pautz, and Adolfy Hoisie, "Predictive Modeling of Parallel Sn Sweeps on Unstructured Grids," Los Alamos National Laboratory Unclassified Report LA-UR-02-2662.
Salvador Coll, Fabrizio Petrini, Eitan Frachtenberg and Adolfy Hoisie, "Performance Evaluation of I/O Traffic and Placement of I/O Nodes on a High Performance Network," Workshop on Communication Architecture for Clusters 2002 (CAC '02), International Parallel and Distributed Processing Symposium 2002 (IPDPS '02), Fort Lauderdale, FL, April 2002.
Fabrizio Petrini, Salvador Coll, Eitan Frachtenberg, Adolfy Hoisie, Leonid Gurvits, "Using Multirail Networks in High-Performance Clusters," IEEE Cluster 2001, Newport Beach, CA, October 2001.
Fabrizio Petrini, Salvador Coll, Eitan Frachtenberg and Adolfy Hoisie, "Hardware- and Software-Based Collective Communication on the Quadrics Network," IEEE International Symposium on Network Computing and Applications 2001 (NCA 2001), Boston, MA, October 2001.
Eitan Frachtenberg, Fabrizio Petrini, Salvador Coll, Wu-chun Feng, "Gang Scheduling with Lightweight User-Level Communication," 2001 International Conference on Parallel Processing (ICPP 2001), Workshop on Scheduling and Resource Management for Cluster Computing, Valencia, Spain, September 2001.
http://www.c3.lanl.gov/par_arch/Publications.html
Eitan Frachtenberg and Fabrizio Petrini, "Overlapping Communication and Computation in the Quadrics Network," LAUR 01-4695, August 2001.
Eitan Frachtenberg and Fabrizio Petrini, "Scheduler Testbed System Design," LAUR 01-4694, August 2001.
Fabrizio Petrini, Wu-chun Feng, Adolfy Hoisie, Salvador Coll, Eitan Frachtenberg, "The Quadrics Network (QsNet): High-Performance Clustering Technology," Hot Interconnects 9, Stanford University, Palo Alto, CA, August 2001.
Darren J. Kerbyson, Hank J. Alme, Adolfy Hoisie, Fabrizio Petrini, Harvey J. Wasserman, Michael Gittings, "Predictive Performance and Scalability Modeling of a Large-Scale Application," Proceedings of SC2001 (LAUR-01-4337, July 2001).
Fabrizio Petrini, Adolfy Hoisie, Wu-chun Feng, Richard Graham, "A Performance Evaluation of the Quadrics Interconnection Network," LAUR-00308, Workshop on Communication Architecture for Clusters (CAC '01), International Parallel and Distributed Processing Symposium (IPDPS '01), April 23-27, 2001, San Francisco.
Adolfy Hoisie, Olaf Lubeck, Harvey Wasserman, Fabrizio Petrini, Hank Alme, "A General Predictive Performance Model for Wavefront Algorithms on Clusters of SMPs," LAUR-00308, Proceedings of ICPP 2000, August 20-25, 2000, Toronto, Canada.
Adolfy Hoisie, Olaf Lubeck, Harvey Wasserman, "Performance and Scalability Analysis of Teraflop-Scale Parallel Architectures Using Multidimensional Wavefront Applications," The International Journal of High Performance Computing Applications, Sage Science Press, Vol. 14:4, Winter 2000.
http://www.c3.lanl.gov/par_arch/Publications.html
Important Resources (4 of 4)
About the Authors
Adolfy Hoisie is a Staff Scientist and the Leader of the Parallel Architectures and Performance Team in the Computer and Computational Sciences Division at LANL. From 1987 until he joined LANL in 1997, he was a researcher at Cornell University. His area of research is performance evaluation of high-performance architectures. He has published extensively, lectured at numerous conferences and workshops, often as an invited speaker, and taught tutorials in this field at important events worldwide. He won the Gordon Bell Award in 1996.
Harvey Wasserman has been a Staff Scientist in the Computing, Information, and Communications Division at Los Alamos National Laboratory since 1985. His research interests involve supercomputer architecture and performance evaluation, and he has participated in benchmarks of almost all significant high-performance computing architectures, including single-processor workstations, parallel vector supercomputers, and massively parallel systems. In a prior life he was a chemist: he holds a Ph.D. in Inorganic Chemistry from the State University of New York and was a Postdoctoral Research Associate at Los Alamos in 1982-1984. He is a co-author of over 50 articles and has presented numerous invited and contributed lectures and tutorials. In 1999, during a one-year sabbatical at LANL, he developed and taught a curriculum on ASCI system usage.
Twelve Ways Test Answers
• The only performance comparison given is for the sqrt function. No comparison is given for the whole code.
  – Instead of comparing performance with other systems, a pretty picture is shown.
  – Instead of comparing performance with other systems, utilization is quoted.
• The comparison uses microprocessors that are 2-3 generations older than the system of interest.