A. Hoisie & H. Wasserman, ICS 2002, New York, NY

Performance Analysis and Prediction of Large-Scale Scientific Applications
Adolfy Hoisie and Harvey Wasserman
Computer & Computational Sciences Division (CCS)
University of California
Los Alamos National Laboratory
Los Alamos, New Mexico 87545
ICS2002 June 22, 2002
{hoisie, hjw}@lanl.gov
Introduction and Motivation
What is This Tutorial About?
• Overview of performance modeling.
  – Analytical techniques that encapsulate as parameters the performance characteristics of applications and machines
  – Techniques developed at LANL
  – Emphasis on full applications
  – No dependence on specific tools (although data collection is vital)
• Applications of performance models: performance prediction.
  – Tuning roadmap for current bottlenecks
  – Architecture exploration for future systems
  – Software / algorithm changes
  – System installation diagnostics: "Rational System Integration"
What is This Tutorial Really About?
• Insight into performance issues.
  – Performance modeling is the only practical way to obtain quantitative information on how to map real applications to parallel architectures rapidly and with high accuracy
• With this insight you become a more educated buyer/seller/user of computer systems.
  – Help you become a "performance skeptic."
  – Show how to integrate information from various levels of the benchmark hierarchy.
  – Show why "naïve" approaches sometimes don't work.
Why Evaluate Performance?

[Diagram: a vendor evaluates performance to sell a machine; a user evaluates performance to buy a machine and to adjust for technology shifts; new technologies inspire new applications.]

Overall goal: advance the state of the art of computer architecture.
Why Performance Modeling?
• Other performance analysis methods fall short in either accuracy or practicality:
  – Experimental
    • Simulation (UCLA, Dartmouth, Los Alamos)*
      – Greatest architectural flexibility, but takes too long for real applications
    • Trace-driven experiments (UIUC, Barcelona)*
      – Results often lack generality
    • Benchmarking (~ everybody)
      – Limited to current implementation of the code
      – Limited to currently-available architectures
      – Difficult to distinguish between real performance and machine idiosyncrasies

* Partial lists                                             Continued…
Why Performance Modeling?
• Other performance analysis methods (cont'd):
  – Queuing theory: takes too long for real apps
  – Statistical (Wisconsin, ORNL, IBM)*
    • Mean value analysis

• Two questions:
  – Does the code contain enough floating-point operations?
  – Does the compiler make the floating-point parallelism available to the architecture?
• Two answers:
  – Typically only 1 instruction in 3 is a FP instruction
  – See section 3 of this tutorial.
Peak Speed is a Poor Metric

[Figure: if peak performance were a reliable metric, all the blue bars would equal all the red bars in height.]
CPU Performance
CPU Time = Ninst * CPI / Clock rate

  CPU Time = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

  – Instructions / Program: determined by the application and the compiler
  – Cycles / Instruction: determined by the instruction set architecture
  – Seconds / Cycle: determined by the technology
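As a sketch of the arithmetic behind this three-factor decomposition (the instruction count, CPI, and clock below are hypothetical, not measurements from the tutorial):

```python
def cpu_time(n_inst, cpi, clock_hz):
    """CPU time = instructions x cycles/instruction x seconds/cycle."""
    return n_inst * cpi / clock_hz

# Hypothetical workload: 1e9 instructions at CPI 0.9 on a 500 MHz clock.
print(cpu_time(1e9, 0.9, 500e6))  # 1.8 seconds
```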
CPI as a Metric and CPI Profiling

• "Average cycles per instruction" (sum over all instruction types)
• Dangerous for cross-platform comparison, but useful for comparison with optimal values
• CPI Profiling:
  – Look primarily at CPIcompute (CPIo), CPIstall, CPImem
• Two excellent papers:
  – Bhandarkar, D. and Cvetanovic, Z., "Performance Characterization of the Alpha 21164 Microprocessor Using TP and SPEC Workloads," Proc. Second Int. Symp. on High-Perf. Comp. Arch., IEEE Computer Society Press, Los Alamitos, CA, 1996.
  – Bhandarkar, D. and Ding, J., "Performance Characterization of the Pentium Pro Processor," Proc. Third Int. Symp. on High-Perf. Comp. Arch., IEEE Computer Society Press, Los Alamitos, CA, pp. 288-297, 1997.
CPI Profiling
• Use HW counters to obtain:
  – CP, total instructions, cache misses, total mem ops, others.
  – http://icl.cs.utk.edu/projects/papi
• CPI = CPIo + CPImem, but the problem is overlap
  – Old way: use counters, infer CPImem (e.g., SGI Origin)
    • CPI = CPIo + #memops * (miss_rate * miss_penalty)
    • (NOTE: miss_rate is only an indirect measure of performance. Just as we can't measure CPU performance by Ninst only, we can't measure memory performance by hit rate.)
  – Need to know how hits/misses contribute to CPU time
  – Average memory access time for hits/misses
  – New way: measure CPImem, CPIstall directly (e.g., Itanium)
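The "old way" above amounts to one line of arithmetic; a sketch with hypothetical counter-derived values (the memops/instruction ratio, miss rate, and penalty below are illustrative only):

```python
def inferred_cpi(cpi_o, memops_per_inst, miss_rate, miss_penalty_cycles):
    """Infer total CPI from compute CPI plus the memory stall component."""
    return cpi_o + memops_per_inst * miss_rate * miss_penalty_cycles

# Hypothetical: CPIo = 0.25, 0.3 memops/inst, 5% miss rate, 60-cycle penalty.
print(inferred_cpi(0.25, 0.3, 0.05, 60))  # 0.25 + 0.9 = 1.15 cycles/inst
```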
CPI Profiling: SGI Origin2000

Optimal CPI = 0.25 for the MIPS R10000 (issue rate is four instructions per CP).

  Code      Infinite-Cache CPI**   Single-CPU MFLOPS (SGI O2K)   Single-CPU Effic. (SGI O2K)
  HEAT      0.74                   35                            9%
  SWEEP     0.88                   45                            11%
  HYDRO     0.89                   31                            8%
  HYDRO-T   0.90                   63                            16%
  NEUT      0.77                   49                            12%

** H. Wasserman, O. Lubeck, Y. Luo, and F. Bassetti, Proc. SC97, 1997.
CPI Profiling with Memory Model

O. Lubeck, Y. Luo, H. Wasserman and F. Bassetti, "Performance Evaluation of the SGI Origin2000: A Memory-Centric Characterization of LANL ASCI Applications," Proc. SC97.

[Figure: CPI profile for HEAT, HYDRO, HYDRO-T, SWEEP, and NEUT at several problem sizes (e.g. HEAT-50 through NEUT-40), split into % of CPI spent idle waiting on memory, % of CPI spent idle waiting on L2 cache, and % of CPI doing computation (not idle).]

  Code       CPIo   T2   Tm
  HEAT       .74    0    60
  HYDRO      .89    2    50
  HYDRO-T    .90    0    11
  SWEEP      .88    11   43
  NEUT       .77
  (NOMINAL)  .25    11   80
CPI Profiling with PAPI

[Figure: percent of CPI by category for each code and problem size.]

• HYDRO is stride-n (n = linear grid size); HYDRO-T is stride-1
• 100x100 fits in cache; 300x300 does not
MFLOPS as a Metric
• Useful means of characterizing performance, especially to demonstrate efficiency
• Problems:
  – Can be artificially inflated (by algorithm, by code, by compiler)
  – Doesn't work for codes with small numbers of FLOPS
  – No convention for counting FLOPS, and FLOP instruction sets differ:
    A = B * C + D?   A = B * C?   A = B?   A = A / B?
• Use with care
Benchmark Hierarchy

[Diagram: a hierarchy from HW demo, kernels, and basic routines through stripped-down app and full app to full workload; one axis labeled "Understanding Increases," the other "Integration (reality) Increases."]
Kernels or "MicroBenchmarks"

• Small, specially-written programs to isolate and measure one specific performance attribute at a time. Examples:
  – Cache / memory throughput
  – Floating-point processing rate
  – Communication operations
  – I/O
  – Application-specific microkernels
• Most important use is in explanation of observed performance of application benchmarks
• Problems with implementation:

      CALL STARTCLOCK
      DO 20 II=1,LOOPS
        DO 21 I=1,LEN
          R(I)=V1(I)*S1
   21   CONTINUE
   20 CONTINUE
      CALL ENDCLOCK

  Many compilers optimize this entire microbenchmark away.
LINPACK

• Not really a benchmark; a library of linear algebra routines.
• Main advantages: easy => lots of data
• Measures rate for solving dense systems with Gaussian Elimination
• Enormous database of timing data.
• Question: Can you use LINPACK to estimate performance of your application?
  – What portion of your application consists of LINPACK?
  – Knowing this portion, and knowing pure LINPACK performance, how do you relate one to the other?
• Answer: Amdahl's Law (ca. 1968)
Amdahl's Law

Given a machine with 2 modes of computing, V and S, that differ in relative speeds:

[Diagram: with V mode off, the total time T splits into Tv (can use V) and Ts (cannot use V); with V mode on, the total time T' splits into T'v (used V) and Ts (didn't use V).]

Run the code with V off; measure T, Tv, and Ts.
Run the code with V on; measure T'.

  T  = Tv + Ts
  T' = T'v + T's = T'v + Ts

Define r = ratio of the two speeds = Tv / T'v > 1
Define fv = fraction of the code that can take advantage of V = Tv / T = 1 - fs

The resulting speedup, S, is:

  S = T / T' = 1 / [ fv / r + (1 - fv) ]        (Amdahl's Law)
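A minimal sketch of the law as a function, with made-up values of fv and r for illustration:

```python
def amdahl_speedup(f_v, r):
    """Speedup when a fraction f_v of the run can use a mode r times faster."""
    return 1.0 / (f_v / r + (1.0 - f_v))

# 90% of the work sped up 10x gives nowhere near 10x overall:
print(amdahl_speedup(0.9, 10))   # ~5.26
# Even r -> infinity is capped at 1 / (1 - f_v):
print(amdahl_speedup(0.9, 1e9))  # approaches 10.0
```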
Amdahl’s Law Applications
Choose one from column "A" and one from column "B":

  Faster Mode ("A")         Slower Mode ("B")
  Vector                    Scalar, non-vector
  Parallel                  Serial, sequential
  Computation-only          I/O
  Cache hit                 Cache miss
  LINPACK                   non-LINPACK
  Non-conditional comp.     Conditional comp.
  SUBROUTINE DO_FFT         SUBROUTINE SETUP
  etc.                      non-etc.
Amdahl’s Law Applications
• Amdahl's Law bounds the speedup due to any improvement. Example: What will the speedup be if 20% of the exec. time is in inter-processor communications, which we can improve by 10X?

  S = T/T' = 1 / [.2/10 + .8] = 1/0.82 ≈ 1.22
  (even an infinite improvement gives only 1/0.8 = 1.25)

=> Invest resources where time is spent. The slowest portion will dominate.
• Amdahl's Law forces the law of diminishing returns on performance.
  – HPC and Murphy: "If any system component can damage performance, it will."
Amdahl's Law as a Special Case: Bottleneckology*

* J. Worlton, "Toward a Science of Parallel Computation." See also, Bucher & Simmons, "

  Ravg = 1 / [ Σ (i = 1 to N) fi / Ri ]

where Ravg = average execution rate, fi = fraction of work done in mode i, Ri = rate of mode i.

This is a weighted harmonic mean. Use it for averaging rates!

• Emphasizes reciprocals of small numbers.
• Result: computer systems require balance.
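A sketch of the weighted harmonic mean, with made-up fractions and rates showing how the slow mode dominates:

```python
def avg_rate(fractions, rates):
    """Weighted harmonic mean: Ravg = 1 / sum(f_i / R_i)."""
    assert abs(sum(fractions) - 1.0) < 1e-9
    return 1.0 / sum(f / r for f, r in zip(fractions, rates))

# 90% of the work at 100 MFLOPS, 10% at 10 MFLOPS:
print(avg_rate([0.9, 0.1], [100.0, 10.0]))  # ~52.6 MFLOPS, not 91
```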
Summary: Single-CPU Performance Metrics

• Peak rate and clock rate say extremely little about actual performance.
• Benchmarking is the process by which we determine computer performance on a specific workload of interest.
  => Be careful generalizing results from a workload.
• You cannot represent the performance of a high-performance computer with only a single number.
Parallel Performance
Parallel Performance Metrics

• Absolute: Elapsed (wall-clock) time = T(n)
• Speedup = S(n) = T(1) / T(n), where T(1) is the time for the best serial implementation.
  => Performance improvement due to parallelism
• Ideal Speedup = SI(n) = n
  – Theoretical limit; obtainable rarely
  – Ignores all of real life
• Parallel Efficiency = E(n) = T(1) / [n T(n)]
• These definitions apply to a fixed-problem experiment.
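These definitions translate directly into code (the timings in the example are hypothetical):

```python
def speedup(t_serial, t_parallel):
    """S(n) = T(1) / T(n)."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_procs):
    """E(n) = T(1) / [n * T(n)]."""
    return t_serial / (n_procs * t_parallel)

# Hypothetical run: 100 s serial, 20 s on 8 processors.
print(speedup(100.0, 20.0))        # 5.0
print(efficiency(100.0, 20.0, 8))  # 0.625
```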
A Pictorial of Parallel Metrics

[Figure: speedup S vs. number of processors, compared with the ideal speedup SI and Amdahl's upper limit; the gaps are attributed to the fraction of serial code, communication HW, and communication SW.]

              Sequential Work
  S <= -----------------------------------------
        Max(Work + Sync Time + Comm Cost)
Parallel Performance Metrics: Speedup

Speedup is only one characteristic of a program - it is not synonymous with performance. In this comparison of two machines the code achieves comparable speedups, but one of the machines is faster.

• Amdahl Limit: SA(n) = T(1) / T(n) = 1 / [ f / n + (1 - f) ]
• This bounds the speedup based only on the fraction of the code that cannot use parallelism (1 - f); it ignores all other factors.
• SA --> 1 / (1 - f) as n --> infinity
Fixed-Problem Size Scaling (Cont’d)
• Efficiency (n) = T(1) / [ T(n) * n]
• Memory requirements decrease with n
• Surface-to-volume ratio increases with n
• Superlinear speedup possible from cache effects
• Motivation: what is the largest # of procs I can use effectively, and what is the fastest time in which I can solve a given problem?
• Problems:
  – Sequential runs often not possible (large problems)
  – Speedup (and efficiency) is misleading if processors are slow
Fixed-Problem Size Scaling: Examples

S. Goedecker and Adolfy Hoisie, "Achieving High Performance in Numerical Computations on RISC Workstations and Parallel Systems," International Conference on Computational Physics PC'97, Santa Cruz, August 25-28, 1997.
Fixed-Problem Size Scaling: Examples
• This shows why choice of problem size is important for performance studies.
  – But problem sizes should be chosen to reflect workload, not to portray the
• Motivation: want to use a larger machine to solve a larger global problem in the same amount of time.
• Memory and surface-to-volume effects remain constant.
Scaled Speedup Experiments
• Be wary of benchmarks that scale problems to unreasonably large sizes:
  – scale the problem to fill the machine when a smaller size will do;
  – simplify the science in order to add computation
    -> "World's largest MD simulation - 10 gazillion particles!"
  – run grid sizes for only a few cycles because the full run won't finish during this lifetime, or because the resolution makes no sense compared with the resolution of the input data
• Suggested alternate approach (Gustafson): constant-time benchmarks
  – run the code for a fixed time and measure the work done
[Table residue: Processors, NChains, Time, Natoms, Time per atom, Time per PE, Time per atom, Efficiency.]

Amdahl Limit: SA(n) = T(1) / T(n) = 1 / [ f / n + (1 - f) ], where f = fraction parallel.
Q: What about other factors?
• Complexity of parallelism surfaces in the large number of metrics utilized for analyzing it.
• Speedup can be misleading, intrinsically, and due to its various definitions.
• Amdahl's Law applies, leading to a case of potentially diminishing returns due to a variety of factors.
Modeling Communication
  Tcomm = Nmsg * Tmsg

Nmsg is the frequency, or "non-overlapped" # of messages.
Tmsg is the time for one point-to-point communication, measured by a "ping-pong" experiment:

  Tmsg = ts + tw * L

where Tmsg is the time to send a message of length L, ts is the "start-up" time (size-independent), and tw is the (asymptotic) time per word (1/BW).

This model helps pinpoint the comm bottleneck: the latency term dominates for "short" messages; the BW term for "long" messages.

Problem: often tw depends on L because of buffering.
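A sketch of the linear model as code; the latency and per-word time below are illustrative, not measurements from the tutorial:

```python
def t_msg(length_words, t_s, t_w):
    """Linear (latency-bandwidth) model: Tmsg = ts + tw * L."""
    return t_s + t_w * length_words

# Illustrative parameters: ts = 10 us, tw = 10 ns/word.
# The latency and bandwidth terms are equal at L = ts/tw = 1000 words:
print(t_msg(1000, 10e-6, 10e-9))  # 2e-05 seconds
```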
Modeling Communication (Cont’d)
[Figure: average rate (MB/s) on 2 procs vs. message size (KB), ASCI Red MPI, fit by:]

  rate = r_inf / (1 + n_1/2 / n)

  r_inf = 1 / tw
  n_1/2 = ts / tw

• Note the difficulties with this model:
  – Meaning of "bandwidth" vs. effective bandwidth.
  – Difficulty in resolving effects due to different protocols.
  – The two parameters are not truly independent.
Modeling Communication (Cont'd)

[Figure: T = time vs. L = message length; slope = tw = cost per word = 1/BW; intercept: ts = startup cost = time to send a message of length 0.]

  Machine         ts     Bandwidth (MB/s)
  IBM SP          13     68
  SGI O2K intra   12     120
  SGI O2K inter   125    70
  ASCI Red        28     340
  CRAY T3E        16     250
  Quadrics        5      200
Modeling Communication (Cont’d)
Quadrics QsNet,Compaq ES-40
Piece-wise linear model:

  0 <= n <= 32:    T ~ 5 µs
  64 <= n <= 512:  T ~ 7 µs + 18 ns / word
  n > 512:         T ~ 9 µs + 5 ns / word
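The piece-wise fit can be sketched as a function. Two assumptions on my part: the start-up constants are in microseconds (the original "ms" looks like mangled "µs"), and the gap between 32 and 64 words is covered by extending the first segment:

```python
def t_quadrics(n_words):
    """Piece-wise linear message time (seconds) vs. length in words.

    Constants are from the slide; microsecond units and the handling
    of 32 < n < 64 are assumptions.
    """
    if n_words < 64:
        return 5e-6
    elif n_words <= 512:
        return 7e-6 + 18e-9 * n_words
    else:
        return 9e-6 + 5e-9 * n_words

print(t_quadrics(16))    # 5e-06
print(t_quadrics(1024))  # 1.412e-05
```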
Modeling Communication: The LogP Model

David Culler, Richard Martin, et al., UC Berkeley

• L (latency): latency in sending a message between modules
• o (overhead): overhead felt by the processor on sending or receiving a msg - proc busy!
• g (gap): gap between successive sends or receives (1/rate)
• P: processors

[Diagram: P processor/memory (P-M) modules connected by an interconnection network.]

• Round-trip time: 2 x (2o + L)
• ts = o + L + o = time that the node is busy and cannot perform other ops
• LogGP model: G = 1/BW
Comparison of Communication Models

• The linear model doesn't answer questions related to overlap of messages and how soon a second msg can be started after the first.
  – The application model can include these effects.
  – Ex.: the app tries to send messages faster than the network interface can handle; this shows up as (artifactual) increased communication time in the linear model.
• The linear model abstracts some details of the architecture.
  – However, all mere mortals deal with latency / BW.
• Need to know how much time is spent from user space to user space - can use either model.
Twelve* Ways to Fool the Masses When Giving Results on Parallel Computers†

1. Quote only 32-bit performance results, not 64-bit, and compare your 32-bit results with others' 64-bit results.
2. Present inner kernel performance figures as the performance of the entire application.
3. Quietly employ assembly code, and compare your assembly-coded results with others' Fortran or C implementations.
4. Scale up the problem size with the number of processors, but don't clearly disclose this fact.
5. Quote performance results linearly projected to a full system.
6. Compare your results against scalar, unoptimized, single-processor code on Crays.
7. Compare with an old code on an obsolete system.
Twelve* Ways (Cont’d)
8. Base MFLOPS operation counts on the parallel implementation instead of on the best sequential implementation.
9. Quote performance in terms of processor utilization, parallel speedup, or MFLOPS (peak)/dollar.
10. Mutilate the algorithm used in the parallel implementation to match the architecture.
11. Measure parallel run times on a dedicated system, but measure "conventional" run times on a busy system.
12. If all else fails, show pretty pictures and animated videos and don't talk about performance.
13. (*hjw) If all else fails, rely on peak speed.

† David H. Bailey (LBNL), Supercomputing Review, Aug. 1991; Scientific Programming, 1 (2), 1993; NASA Ames NAS Report RNR-91-020 (http://www.nas.nasa.gov/)
Twelve* Ways (Cont'd)

There are at least 3 of the 12 "Ways to Fool the Masses" used in this preprint. Can you find them?
Performance Modeling Case Studies
Modeling Parallel Performance
• Single-Node Effects:
  – The CPU and its memory subsystem.
  – Parameterize by single-node speed, subgrid size, # of FLOPS per cell, and possibly memory parameters.
• Multi-Node Effects:
  – Algorithmic scalability: features of the algorithm in the absence of implementation issues (comm ∝ N^(1/3) / P^(1/2); comp ∝ N^(2/3) / P)
  – Parallel scalability: the real, measurable behavior of the code on a real system, with parallel overhead, communications, and load imbalance included
• Modeling these effects allows identification of current bottlenecks for tuning, and prediction of performance on future systems - an "experiment-ahead" approach.
• Use care in choosing problem sizes!
General Strategy (Review)
• Use microkernels and models to understand low-level behavior
• Use an application model to account for the occurrence of low-level primitives
• Integrate the two in an overall model
• Use care in choosing problem sizes!
"Fundamental Equation" of Modeling

  Trun = Tcomputation + Tcommunication - Toverlap

• Tcomputation is easiest to model. A coarse approximation is based on the number of grid points, a characteristic Mflop rate, and a sensitivity analysis for cache behavior.
• Tcommunication is trickier. It depends on the type of communication kernels (blocking, non-blocking), point-to-point or global communications, communication parameters, network topology, and contention. The linear model (latency-bandwidth) or LogGP can be utilized.
• Toverlap is the hardest. It depends on algorithmic overlap, communication/computation overlap in hardware, load balancing, contention, runtime variability, and overall machine load.
• Solve the particle transport equation, where the density distribution of particles N(x, E, Ω, t) is the unknown.
• Use discrete directions Ω (e.g. S6 has 6 per octant).
• Solve each spatial cell in a specified order for a single ordinate (direction), subject to the constraint that a cell cannot be solved for a particular direction until its "upstream" neighbors have been solved.
• 2D domain decomposition with "blocking":

  for each octant
    for each angle-block
      for each z-block
        receive east
        receive north
        compute subgrid
        send west
        send south
      end for
    end for
  end for

• The # of active cells (processors) varies from one diagonal to the next.
• Blocking in "z" leads to a tradeoff: parallel efficiency vs. communication intensity.
Basic Pipeline Model

• Nsweep wavefronts "scan" the processor grid.
• Each scan requires Ns steps.
• There's a delay of d between scans.
• The total number of steps, S, for all wavefronts is:

  S = Ns + d (Nsweep - 1)

• The challenge is to find Ns and d.
• For Sn: Nsweep = zblocks * angleblocks * octants
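The step count is a one-liner; a sketch with illustrative numbers:

```python
def total_steps(n_s, d, n_sweep):
    """S = Ns + d * (Nsweep - 1): Ns steps for the first wavefront,
    plus a delay of d before each of the remaining Nsweep - 1 scans."""
    return n_s + d * (n_sweep - 1)

# E.g. a scan of Ns = 10 steps, Nsweep = 3 wavefronts, delay d = 4:
print(total_steps(10, 4, 3))  # 18
```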
Communication Pipeline

  Ns,comm = 2(Px - 1) + 2(Py - 1)
  d_comm  = 4
  Tmsg    = t0 + N / B      (start-up time plus message size over bandwidth)

  Tcomm = [2(Px + Py - 2) + 4(Nsweep - 1)] * Tmsg

[Figure: numbered message steps sweeping the Px x Py grid of processor nodes.]
Computation Pipeline

  Ns,comp = Px + Py - 1
  d_comp  = 1

  Tcpu = (Nx/Px * Ny/Py * Nz/Kb * Na/Ab) * Nflops / Rflops

  Tcomp = [(Px + Py - 1) + (Nsweep - 1)] * Tcpu

[Figure: computation wavefronts N and N+1 sweeping the Px x Py grid.]
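Combining the two pipelines gives the core of the SWEEP3D model; a sketch (the grid size, wavefront count, and per-step costs in the example are hypothetical):

```python
def t_comp(px, py, n_sweep, t_cpu):
    """Computation pipeline: Ns,comp = Px + Py - 1 stages, d_comp = 1."""
    return ((px + py - 1) + (n_sweep - 1)) * t_cpu

def t_comm(px, py, n_sweep, t_msg):
    """Communication pipeline: Ns,comm = 2(Px + Py - 2) stages, d_comm = 4."""
    return (2 * (px + py - 2) + 4 * (n_sweep - 1)) * t_msg

# Hypothetical 8x8 processor grid, 48 wavefronts, per-step costs:
print(t_comp(8, 8, 48, 1e-3))  # (15 + 47) * 1 ms   = 0.062 s
print(t_comm(8, 8, 48, 1e-4))  # (28 + 188) * 0.1 ms = 0.0216 s
```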
Alternative Modeling Approaches?

  (Py-1)*Px procs have South neighbors: all send
  (Py-1)*Px procs have North neighbors: all receive
  (Px-1)*Py procs have East neighbors: all send
  (Px-1)*Py procs have West neighbors: all receive
  ==========================================
  Nmsg = [(Py-1)*Px + (Px-1)*Py] pairs of sends/receives

  A) T = Nmsg * Tmsg + (Px * Py) * Tcpu
  B) T = Px * Py * 2 * Tmsg + (Px * Py) * Tcpu

Do you see any problem with either of these two alternative approaches?

A) is a (wrong) upper bound. B) is a (wrong) lower bound. Both fail to accurately describe the overlap in communication and computation. Both fail to account for the delays due to the different repetition rates of the two types of wavefronts. Both are wrong… but don't feel bad if you almost agreed with one of them… we struggled with this for quite some time.
VAMPIR Analysis with Two Wavefronts
Validation Regimes

  Tcomp = [(Px + Py - 1) + (Nsweep - 1)] * Tcpu
  Tcomm = [2(Px + Py - 2) + 4(Nsweep - 1)] * Tmsg

• Nsweep = 1: validates the number of pipeline stages in Tcomp and Tcomm, as a function of (Px + Py), in the available range of processor configurations.
• Nsweep ~ (Px + Py): validates the case where the contributions of the (Px + Py) and Nsweep terms are comparable.
• Nsweep >> (Px + Py): validates the repetition rate of the pipeline.

In each regime, cases can be identified where:
  Tcomp >> Tcomm
  Tcomp = 0
  Tcomp ~ Tcomm
Validation: Weak Scalability

[Figures: measured vs. model time (seconds) as a function of Px + Py, for: Tcomp dominant, Nsweep = 10, SGI Origin; Tcomp = 0, Nsweep = 10, CRAY T3E; and Tcomp dominant, Nsweep = 10, SGI Origin, with Tcomp from the model also shown.]
Validation: Strong Scalability

[Figure: measured vs. model time (seconds) as a function of Px + Py on the CRAY T3E.]
Model for Multiple Octants

• The model so far represents sweeps generated by the angle/k-block loops.
• The application consists of multiple octants and multiple iterations.
• Multiple octants extend the pipeline length and include dependences between octants.

Pipelined wavefront abstraction:

  for each octant
    for each angle-block
      for each z-block
        receive east
        receive north
        compute sub-grid
        send west
        send south
      end for
    end for
  end for
Multiple Octant Processing

[Figure: processing order of the eight octants (-i -j -k, -i -j +k, -i +j -k, -i +j +k, +i -j -k, +i -j +k, +i +j -k, +i +j +k) shown as numbered wavefront steps across the processor grid.]
Multiple Octant Processing

Table 1. Octant Ordering and Consequent Wavefront Delay in SWEEP3D

  Octant From Which Sweep Originates    Delay from Previous Sweep
  -i -j -k
  -i -j +k                              1
  -i +j -k                              2 + (Py - 1)
  -i +j +k                              1
  +i -j -k                              2 + (Py - 1) + (Px - 1)
  +i -j +k                              1
  +i +j -k                              2 + (Py - 1)
  +i +j +k                              1
  Total steps for 8 octants             10 + (2Px + 4Py - 6)

• Result: the pipeline length is 3 times longer than that of 1 octant (but much less than 8 times longer).
• Result: the pipeline length is asymmetric with respect to the processor grid.
Multiple Octant & Iteration Processing

[Figure: SWEEP3D on QSC, mk=10, mmi=3, all CPUs per node; total elapsed runtime (Tmodel vs. Tmeasured) as a function of Px + Py.]

Compaq ES40 cluster (4 processors per SMP). Model parameters included a CPU processing rate = 330 MFLOPS (16.5% of peak), MPI latency = 11 µs, MPI BW = 290 MB/s (message size = 1500 words).
Blocking Strategies

• Larger block sizes lead to an increased computation / communication ratio.
• For wavefront algorithms, smaller blocks yield higher parallel efficiency.

[Figure: SWEEP on CRAY T3E for several k-block sizes (10, 100, 500, 1000); measured time vs. model as a function of Px + Py.]
MPPs vs. Clusters of SMPs

• SWEEP3D results so far assumed that a logical processor mesh could be embedded into the machine topology such that:
  – each mesh node maps to a unique processor, and
  – each mesh edge maps to a unique router link.
  – Required to maintain comm. concurrency within wavefronts.
• Q: What happens to d and Nsteps on a cluster of SMPs with reduced connectivity?
  – Obvious latency & BW effects, but what else?
  – Obvious relevance to ASCI Blue systems as well as others.

[Figure: wavefront message steps on the Px x Py grid when processors share SMP nodes.]
Clusters of SMPs

[Diagram: notation - SMP, Sx, Sy, m, n, Lx, Ly.]

• This problem was solved by induction/emulation.
• Wavefronts are delayed at inter-SMP boundaries if a message from a previous wavefront is already using an inter-SMP link.
• The communication step for a given message will be "bumped" as wavefronts "collide" with one another. Wavefronts eventually scan the PE grid at a slower pace than in the "MPP" case.
• One result: don't need full connectivity - only need 1/2 of S
SWEEP Model: Key Practices

• Creation of application code microbenchmarks
  – Isolate the communication pipeline and the computation pipeline
  – Model "from the inside-out": understand one sweep, then many
• Wide validation range, controlled by blocking input parameters:
  – Computation-dominant
  – Communication-dominant
  – Pipeline-dominant
  – Sweep-dominant
Case Study Conclusions (I)
• The SWEEP3D models account for the overlap in the communication and computation components of the algorithm.
• The models are parametric: basic machine performance numbers (latency, MFLOPS rate, bandwidth) and application characteristics (problem size, etc.) serve as input.
• The MPP model was validated in all "regimes," and on three parallel architectures of wide practical interest (IBM SP2, Cray T3E and SGI Origin 2000).
Case Study Conclusions (II)
• The SWEEP3D cluster model required an induction process and emulation after examination of wavefront behavior.
• It is the first model demonstrating the effect of reduced-connectivity networks on application performance of which we are aware.
• A relatively simple change in the system - the cluster topology - resulted in a much more complex communication model.
• SAGE – SAIC's Adaptive Grid Eulerian hydrocode
• Hydrodynamics code with AMR
• Applied to: water shock, energy coupling, hydro instability problems, etc.
• Comes from the Los Alamos CRESTONE project
• Represents a large class of production ASCI applications at Los Alamos

[Figure caption: a one-kilometer iron asteroid, struck with an impact equal to about 1.5 trillion tons of TNT, produced a jet of water more than 12 miles high. Wave velocities for the largest asteroid will be roughly 380 miles an hour. Initial tsunami waves are more than half a mile high, abating to about two-thirds of that height 40 miles in all directions from the point of impact.]
Performance Model

• Understand the key characteristics of the code
  – Main data structures and their decomposition
  – Processing stages
• Slab parallelization strategy
  – How communication scales
  – Communication patterns
  – Effect of network topology
• Processing stages
  – Gather data, computation, scatter data
Cell and Block Decomposition

[Figure: the global X-Y-Z grid divided among PE1…PE4 as slabs in the Y-Z plane; blocks numbered 1, 2, 3, …, M+1.]
One Bit of Algebra: Scaling Analysis

• The total volume is: V = E·P = L³
• The volume of each sub-grid is: E = l·L²
  where P is the number of PEs, l is the short side of the slab (in the Z dimension), and L is the side of the slab in the X and Y directions (assuming a square grid in the X-Y plane).
• The surface of the slab, L², in the X-Y plane is: L² = V^(2/3) = (E·P)^(2/3)
  => communication grows with the number of processors!
• Consider again the volume of the entire grid: V = E·P = (l·L²)·P
• This is partitioned across PEs such that there will be L/(2P) foils of width 2 on each PE:
  L/(2P) = (E·P)^(1/3) / (2P) = (E / 8P²)^(1/3)
• When this has a value less than one, a processor will contain less than a single foil, i.e. when P > sqrt(E/8) the number of processors involved in boundary exchange increases!
• There is a maximum distance between the processors that hold a foil, termed the "PE Distance" (PED).
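The scaling relations above can be sketched directly; the E = 13,500 check reproduces the threshold of P > 41 quoted on a later slide:

```python
import math

def foils_per_pe(e, p):
    """Width-2 foils per PE: (E / 8P^2)^(1/3); drops below 1 when P > sqrt(E/8)."""
    return (e / (8.0 * p * p)) ** (1.0 / 3.0)

def pe_distance(e, p):
    """Maximum distance between PEs holding one foil: ceil((8P^2 / E)^(1/3))."""
    return math.ceil((8.0 * p * p / e) ** (1.0 / 3.0))

e = 13500.0
print(foils_per_pe(e, 41) > 1)  # True: each PE still holds a whole foil
print(foils_per_pe(e, 42) > 1)  # False: boundary exchange now spreads out
print(pe_distance(e, 1000))
```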
Slab Decomposition

• First E cells -> PE1 … (E = numcells_PE)
• Volume = E·P; communication surface in Z = (E·P)^(2/3)

[Figure: slab assignment to PEs for 2, 8, 64, and 256 PEs.]
Slab Scaling: (1) Surface Size

• Surface split across PEs when: P > sqrt(E/8)   (for E = 13,500: P > 41)

[Figure: comparison of boundary surfaces - total surface (cells) vs. # PEs, for the grid surface and the PE surface.]
Slab Scaling: (2) PE Distance

• PE distance = ceil[ (8P²/E)^(1/3) ]

[Figure: comparison of inter-PE communication (neighbor) distances vs. # PEs.]
Effect of Network Topology: ASCI Blue Mountain

• PE distance impacts performance on ASCI Blue Mountain.
• PE distance results in many PEs communicating across a small number of links.

[Figure: PEs spread across 128-node SMPs connected by n HiPPI links.]
Effect of Network Topology: Compaq (Quadrics)

• PE distance is hidden on the Compaq (Quadrics) network.
• PE distance has maximum effect when all PEs are communicating out of the SMP node.

[Figure: 4-processor SMP nodes on the Quadrics network.]
• SAGE consists of many stages per cycle:
  – Gather (1+) - obtain data from remote PEs
  – Compute
  – Scatter (1+) - update data on remote PEs
• Tokens act as data templates for data transfers
[Figure: "Gather/Scatter Comms" - cells n-4 … n+4 shown through the Gather, Compute, and Scatter stages]

  Direction   Size (cells)
  X, LO       4
  Y, LO       152
  Z, LO       6860
  X, HI       4
  Y, HI       152
  Z, HI       6860
Processing Stages in SAGE
• Encapsulates code characteristics
• Parameterized in terms of:
  – Code parameters (e.g. cells per PE)
  – System parameters (CPU speed, communication latency & bandwidth, memory contention)
• Validated on:
  – ASCI Blue Mountain (SGI Origin 2000)
  – CRAY T3E
  – ASCI Red (Intel)
  – ASCI White (IBM SP3)
  – Compaq AlphaServer SMP clusters
• Single parameterized model
  – system-specific parameters
• Model is highly accurate (< 10% error)
Model Validation
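As a rough illustration of what such a single parameterized model looks like, here is a toy cycle-time formula with code parameters (cells and boundary surface per PE) and system parameters (CPU rate, latency, bandwidth). The cost expression and every constant are assumptions for the sketch, not the actual SAGE model:

```python
def cycle_time(cells_per_pe, flops_per_cell, cpu_rate,
               latency, bandwidth, surface_cells,
               bytes_per_cell=8, msgs_per_exchange=2):
    """Toy model: T_cycle = T_compute + T_gather + T_scatter."""
    t_comp = cells_per_pe * flops_per_cell / cpu_rate
    # each gather/scatter exchanges the boundary surface as a few messages
    t_exchange = msgs_per_exchange * (
        latency + surface_cells * bytes_per_cell / bandwidth)
    return t_comp + 2 * t_exchange  # one gather + one scatter

# predicting another machine means swapping in its system parameters:
t = cycle_time(cells_per_pe=13_500, flops_per_cell=5_000, cpu_rate=500e6,
               latency=10e-6, bandwidth=100e6, surface_cells=6860)
```

Validation then amounts to comparing the modeled cycle time against measured cycle times across PE counts, which is what the following plots do.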
[Figure: "SAGE Performance (ASCI Blue Mountain)" - cycle time (0 to 6 s) vs. # PEs (100 to 10,000), Prediction vs. Measurement]
i) ASCI Blue Mountain
[Figure: "SAGE Performance (ASCI White)" - cycle time (0 to 2 s) vs. # PEs (1 to 10,000), Prediction vs. Measurement]
ii) ASCI White
At time of testing, only 8 nodes available
[Figure: "SAGE Performance (Compaq ES45 AlphaServer)" - cycle time (0 to 0.6 s) vs. # PEs (1 to 100), Prediction vs. Measurement]
iii) Compaq ES45 AlphaServer Cluster
Applications of Performance Models
Predictive Value of Models
• Fast exploration of design space:
  – New architectures with increased comm. BW, decreased MPI latency, upgrade of CPU speed
    • Example: SN transport on hypothetical 100-TF system
  – New algorithms / coding strategies
    • Example: different parallel decomposition method
• Can estimate improvement prior to coding effort
• Example: SAGE
• SN transport has "well-defined" performance goals:
  – (1000)^3 cells, 10,000 time steps, 40 TB memory, 5000 unknowns per cell, 30-hour execution time goal.
• Best performance at time of model: 0.1 µsec grind time per phase-space cell
  – Total execution time for 10^4 time steps: 54 years
• Design space includes: problem size, # of processors, geometry of cluster (size and topology), communication parameters, computation parameters, optimal (problem) blocking sizes, target optimization (e.g., runtime, problem size)
Estimates of SWEEP3D Performance on a Hypothetical Future-Generation (100-TFLOPS) System as a Function of MPI Latency and Sustained Per-Processor Computing Rate:

  MPI Latency    Runtime (hours) at 10% of Peak    Runtime (hours) at 50% of Peak
  0.1 ms         185                               58
  1.0 ms         205                               78
  10 ms          297                               104
SWEEP SMP-Cluster Results: 1-Billion Cells
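A design-space study like the table above can be scripted once a model exists. The estimator below captures only the shape of that study (compute time at a sustained fraction of peak, plus a latency term on the critical path); every constant is a placeholder of mine and it is not calibrated to reproduce the table's numbers:

```python
def sweep_runtime_hours(total_cells, unknowns_per_cell, timesteps,
                        peak_flops, sustained_fraction, flops_per_unknown,
                        mpi_latency, msgs_per_step):
    """Rough SN-transport runtime estimate (illustrative only)."""
    work = total_cells * unknowns_per_cell * timesteps * flops_per_unknown
    t_compute = work / (peak_flops * sustained_fraction)
    t_latency = msgs_per_step * timesteps * mpi_latency  # critical path
    return (t_compute + t_latency) / 3600.0

# sweep the two table axes: sustained rate and MPI latency
for frac in (0.10, 0.50):
    for lat in (0.1e-3, 1.0e-3, 10e-3):
        hours = sweep_runtime_hours(1e9, 5000, 1e4, 100e12, frac, 100,
                                    lat, msgs_per_step=10)
        print(frac, lat, round(hours, 1))
```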
Architecture Exploration (I)
• Predictions for the 30T
[Figure: "SAGE Performance Model (ES45)" - time for 1 cycle (0.0 to 0.9 s) vs. # PEs (1 to 100,000), Prediction vs. Measurement]
50 MFLOPS/cpu, L = 100 us, BW = 100 MB/s, 4 x 4 x 100 subgrid, optimal blocking, 10e7 cells total, 1 link each direction between hosts.
[Figure: "Origin Cluster" - modeled time (0.00 to 0.30 s) vs. # procs (0 to 25,000), showing Tcomm-SMP, Tcomm-MPP, and Tcomp]
Architecture Exploration (II)
Sensitivity to the number of links:
[Figure: "Model-MPP" - modeled execution time (0 to 700 hr) vs. Px x Py (0 to 25,000), for L = 1, 2, 3, and 4 links]
50 MFLOPS/cpu, L = 100 us, BW = 100 MB/s, 4 x 4 x 100 subgrid, optimal blocking, 10e7 cells total, NG = 30, 12 iters, 10e4 timesteps
Architecture Exploration (III)
[Figure: time (100 to 300 hours) vs. # processors (0 to 8000), for 16, 32, 64, 128, and 256 CPUs per SMP; L = min(sx, sy)/4]
Sensitivity Analysis on SMP Size
Architecture Exploration (IV)
• Minimum surface-to-volume ratio
  – minimizes communication time (Gather & Scatter)
[Figure: three example sub-grid decompositions - Case 1: 2x2x1, Case 2: 2x1x1, Case 3: 1x1x1]
Application Optimization
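The surface-to-volume argument can be checked directly by counting the boundary cells a PE exchanges under different decompositions; the grid size and shapes below are illustrative choices of mine:

```python
def pe_surface(nx, ny, nz, px, py, pz):
    """Boundary cells one interior PE exchanges for a (px, py, pz)
    decomposition of an nx x ny x nz grid (assumes exact divisibility
    and face-neighbour exchange only)."""
    sx, sy, sz = nx // px, ny // py, nz // pz  # local sub-grid sides
    surface = 0
    if px > 1: surface += 2 * sy * sz  # two X faces
    if py > 1: surface += 2 * sx * sz  # two Y faces
    if pz > 1: surface += 2 * sx * sy  # two Z faces
    return surface

# 64 PEs on a 192^3 grid: the cube-like split exchanges the least data
for shape in [(64, 1, 1), (8, 8, 1), (4, 4, 4)]:
    print(shape, pe_surface(192, 192, 192, *shape))
```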
Cube vs Slab (Compaq ES45)
• Cube Surface: 4 times smaller than Slab
[Figure: "Comparison of Slab and Cube PE Surface Sizes" - total surface (0 to 1.8e4 cells) vs. # PEs (1 to 100,000), Slab vs. Cube]
Application Optimization
Cube vs Slab (Compaq ES45)
• Cube PE distance > Slab PE distance
[Figure: "Comparison of Slab and Cube PE Distances" - PE distance (1 to 1000, log scale) vs. # PEs (1 to 100,000): Slab (Z), Cube (Y), Cube (Z)]
Application Optimization
[Figure: "SAGE Performance Model - Comparison of Slab and Cube (ES45)" - time for 1 cycle (0.0 to 0.9 s) vs. # PEs (1 to 100,000), Slab vs. Cube]
• Expect performance improvement using cube
Cube vs Slab (Compaq ES45)
Application Optimization
SAGE - Performance Data (Compaq ES45)
[Figure series: cycle time (0 to 1.6 s) vs. # PEs (1 to 1000); the model curve is shown first, then measurements from Sept 9th 01, Oct 2nd 01, and Oct 24th 01 are added in successive slides]
• Model gives an expectation of performance
• Model used to validate measurements!
Rational System Integration
Final Thoughts
Performance Engineering
Performance-engineered system: The components (application and system) are parameterized and modeled, and a constitutive model is proposed and validated.
Predictions are made based on the model. The model is meant to be updated, refined, and further validated as new factors come into play.
Final Thoughts (1 of 4)
• Application / architecture mapping is the key - not lists of raw basic machine characteristics.
• Point design studies need to address a specific workload.
• Performance and scalability modeling is an effective "tool" for workload characterization, system design, application optimization, and algorithm-architecture mapping.
• Back-of-the-envelope performance predictions are risky (outright wrong?), given the complexity of analysis in a multidimensional performance space.
• Applications and systems at this scale need to be performance-engineered -- modeling is the means to analysis.
• We offered a practical methodology for performance analysis of large-scale scientific applications.
  – Adds insight into current performance
  – Allows prediction of performance on future systems
  – Combines application- and system-dependent information.
• Naïve metrics for parallel scaling don't work.
• The methodology is not tied to any particular tool(s)
  – The model is the tool!
Final Thoughts (2 of 4)
Final Thoughts (3 of 4)
• Performance evaluation of supercomputer systems: if done "properly" you can get any answer you want. :)
  – "One way to get good performance is to redefine 'good.'" (K. Kennedy, 6/7/99)
• Performance analysis requires information from several levels of the benchmarking hierarchy:
  – Kernels alone do not characterize the performance of a supercomputer.
  – Use kernels to help understand performance of parts of your application.
Final Thoughts (4 of 4)
• Amdahl's Law requires balance in all components that contribute to performance.
• Modeling is crucial.
• Single-processor performance is the bottleneck much of the time.
• Responsible, careful evaluation of high-performance machines is a necessary condition for continued progress and future success of parallel computing.
• Be an educated buyer!
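The Amdahl's Law point above is easy to make concrete with the law's one-line form:

```python
def amdahl_speedup(parallel_fraction, n_procs):
    """Amdahl's Law: speedup = 1 / ((1 - f) + f / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_procs)

# a 5% serial fraction caps speedup below 20x, however many PEs you buy:
print(amdahl_speedup(0.95, 1_000_000))  # just under 20
```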
Acknowledgments and Disclaimers
• Other members, LANL Parallel Architectures and Algorithms Team:
– Eitan Frachtenberg (Hebrew Univ, Israel)
– Vladimir Getov (Univ. of Westminster, UK)
– Darren Kerbyson
– Michael Lang
– Scott Pakin
– Juan Fernandez Peinador (Univ. of Valencia, Spain)
– Fabrizio Petrini
• Thanks to:
  – US Department of Energy through Los Alamos National Laboratory contract W-7405-ENG-36
– Los Alamos Computer Science Institute
• Note: Any benchmark results presented herein reflect our workload. Results from other workloads may vary.
Important Resources (1 of 4)
• Hennessy, J. L. and Patterson, D. A., Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Mateo, CA, 1990. Third edition, 2002.
• Patterson, D. A. and Hennessy, J. L., Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann Publishers, San Mateo, CA, 1993. Second edition, 1998.
• D. Kuck, "High Performance Computing," Oxford U. Press (New York), 1996.
• D. Culler, A. Gupta, and J. P. Singh, "Parallel Computer Architecture," Morgan Kaufmann (San Francisco), 1998.
• K. Hwang, "Advanced Computer Architecture," McGraw-Hill (New York), 1993.
• D. Culler, R. Karp, D. Patterson, A. Sahay, E. Santos, K. Schauser, R. Subramonian, and T. von Eicken, "LogP: A Practical Model of Parallel Computation," Communications of the ACM, 39(11):78-85, Nov. 1996.
• R. Martin, A. Vahdat, D. Culler, T. Anderson, "Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture," International Symposium on Computer Architecture, Denver, CO, June 1997. http://www.cs.berkeley.edu/~rmartin/papers/logp.ps
• C. Holt, M. Heinrich, J. P. Singh, E. Rothberg, and J. L. Hennessy, "Effects of Latency, Occupancy, and Bandwidth in DSM Multiprocessors," Stanford Univ. Comp. Sci. Report CSL-TR-95-660, 1/95.
• D. A. Patterson: http://www.cs.berkeley.edu/~pattrsn/252S98/index.html
• Ian Foster, "Designing and Building Parallel Programs," Addison Wesley (), 1995, and http://www.mcs.anl.gov/dbpp/
• G. Fox, R. Williams, and P. Messina, "Parallel Computing Works!" Morgan Kaufmann, 1994.
• P. H. Worley, "The Effects of Time Constraints on Scaled Speedup," SIAM J. Sci. Stat. Comput., 11(5):838-858, 1990.
• Zagha, M., Larson, B., Turner, S., and Itzkowitz, M., "Performance Analysis Using MIPS R10000 Performance Counters," Proc. SC96, IEEE Computer Society.
• Torrellas, J., Solihin, Y., and Lam, V., "Scal-Tool: Pinpointing and Quantifying Scalability Bottlenecks in DSM Multiprocessors," SC99.
Important Resources (2 of 4)
Important Resources (3 of 4)
Darren J. Kerbyson, Shawn Pautz, and Adolfy Hoisie, "Predictive Modeling of Parallel Sn Sweeps on Unstructured Grids," Los Alamos National Laboratory Unclassified Report LA-UR-02-2662.
Salvador Coll, Fabrizio Petrini, Eitan Frachtenberg and Adolfy Hoisie, "Performance Evaluation of I/O Traffic and Placement of I/O Nodes on a High Performance Network," Workshop on Communication Architecture for Clusters 2002 (CAC '02), International Parallel and Distributed Processing Symposium 2002 (IPDPS '02), Fort Lauderdale, FL, April 2002.
Fabrizio Petrini, Salvador Coll, Eitan Frachtenberg, Adolfy Hoisie, Leonid Gurvits, "Using Multirail Networks in High-Performance Clusters," IEEE Cluster 2001, Newport Beach, CA, October 2001.
Fabrizio Petrini, Salvador Coll, Eitan Frachtenberg and Adolfy Hoisie, "Hardware- and Software-Based Collective Communication on the Quadrics Network," IEEE International Symposium on Network Computing and Applications 2001 (NCA 2001), Boston, MA, October 2001.
Eitan Frachtenberg, Fabrizio Petrini, Salvador Coll, Wu-chun Feng, "Gang Scheduling with Lightweight User-Level Communication," 2001 International Conference on Parallel Processing (ICPP 2001), Workshop on Scheduling and Resource Management for Cluster Computing, Valencia, Spain, September 2001.
http://www.c3.lanl.gov/par_arch/Publications.html
Eitan Frachtenberg and Fabrizio Petrini, "Overlapping Communication and Computation in the Quadrics Network," LAUR 01-4695, August 2001.
Eitan Frachtenberg and Fabrizio Petrini, "Scheduler Testbed System Design," LAUR 01-4694, August 2001.
Fabrizio Petrini, Wu-chun Feng, Adolfy Hoisie, Salvador Coll, Eitan Frachtenberg, "The Quadrics Network (QsNet): High-Performance Clustering Technology," Hot Interconnects 9, Stanford University, Palo Alto, CA, August 2001.
Darren J. Kerbyson, Hank J. Alme, Adolfy Hoisie, Fabrizio Petrini, Harvey J. Wasserman, Michael Gittings, "Predictive Performance and Scalability Modeling of a Large-Scale Application," Proceedings of SC2001 (LAUR-01-4337, July 2001).
Fabrizio Petrini, Adolfy Hoisie, Wu-chun Feng, Richard Graham, "A Performance Evaluation of the Quadrics Interconnection Network," LAUR-00308, Workshop on Communication Architecture for Clusters (CAC '01), International Parallel and Distributed Processing Symposium (IPDPS '01), April 23-27, 2001, San Francisco.
Adolfy Hoisie, Olaf Lubeck, Harvey Wasserman, Fabrizio Petrini, Hank Alme, "A General Predictive Performance Model for Wavefront Algorithms on Clusters of SMPs," LAUR-00308, Proceedings of ICPP 2000, August 20-25, 2000, Toronto, Canada.
Adolfy Hoisie, Olaf Lubeck, Harvey Wasserman, "Performance and Scalability Analysis of Teraflop-Scale Parallel Architectures Using Multidimensional Wavefront Applications," The International Journal of High Performance Computing Applications, Sage Science Press, Vol. 14:4, Winter 2000.
http://www.c3.lanl.gov/par_arch/Publications.html
Important Resources (4 of 4)
About the Authors
Adolfy Hoisie is a Staff Scientist and the Leader of the Parallel Architectures and Performance Team in the Computer and Computational Sciences Division at LANL. From 1987 until he joined LANL in 1997, he was a researcher at Cornell University. His area of research is performance evaluation of high-performance architectures. He has published extensively, lectured at numerous conferences and workshops, often as an invited speaker, and taught tutorials in this field at important events worldwide. He won the Gordon Bell Award in 1996.
Harvey Wasserman has been a Staff Scientist in the Computing, Information, and Communications Division at Los Alamos National Laboratory since 1985. His research interests involve supercomputer architecture and performance evaluation, and he has participated in benchmarks of almost all significant high-performance computing architectures, including single-processor workstations, parallel vector supercomputers, and massively parallel systems. In a prior life he was a chemist: he holds a Ph.D. in Inorganic Chemistry from the State University of New York and was a Postdoctoral Research Associate at Los Alamos in 1982-1984. He is a co-author of over 50 articles and has presented numerous invited and contributed lectures and tutorials. In 1999, during a one-year sabbatical at LANL, he developed and taught a curriculum on ASCI system usage.
Twelve Ways Test Answers
• The only performance comparison given is for the sqrt function. No comparison is given for the whole code.
  – Instead of comparing performance with other systems, a pretty picture is shown.
  – Instead of comparing performance with other systems, utilization is quoted.
• The comparison uses microprocessors that are 2-3 generations older than the system of interest.