Transcript
Page 1: cs546perf


Performance Evaluation of

Parallel Processing

Xian-He Sun

Illinois Institute of Technology

[email protected]

Page 2: cs546perf


Outline

• Performance metrics
 – Speedup
 – Efficiency
 – Scalability
• Examples
• Reading: Kumar – ch. 5

Page 3: cs546perf


Performance Evaluation (improving performance is the goal)

• Performance Measurement
 – Metric, Parameter
• Performance Prediction
 – Model, Application-Resource
• Performance Diagnosis/Optimization
 – Post-execution analysis, algorithm improvement, architecture improvement, state-of-the-art techniques, resource management/scheduling

Page 4: cs546perf


Parallel Performance Metrics (run-time is the dominant metric)

• Run-Time (Execution Time)
• Speed: mflops, mips, cpi
• Efficiency: throughput
• Speedup
• Parallel Efficiency
• Scalability: the ability to maintain performance gain when system and problem size increase
• Others: portability, programmability, etc.

$S_p = \dfrac{\text{Uniprocessor Execution Time}}{\text{Parallel Execution Time}}$

Page 5: cs546perf


Models of Speedup

• Speedup
• Scaled Speedup
 – Parallel processing gain over sequential processing, where problem size scales up with computing power (having sufficient workload/parallelism)

$S_p = \dfrac{\text{Uniprocessor Execution Time}}{\text{Parallel Execution Time}}$


Page 6: cs546perf


Speedup

• Ts = time for the best serial algorithm
• Tp = time for the parallel algorithm using p processors

$S_p = \dfrac{T_s}{T_p}$

Page 7: cs546perf


Example

[Figure: time diagrams. (a) One processor finishes the job in 100 time units. (b) Four processors, 25 time units each. (c) Four processors, 35 time units each.]

(b) $S_p = \frac{100}{25} = 4.0$ (perfect parallelization)

(c) $S_p = \frac{100}{35} \approx 2.85$ (perfect load balancing, but synchronization cost is 10)

Page 8: cs546perf


Example (cont.)

[Figure: time diagrams. (d) Four processors run 30, 20, 40, and 10 time units. (e) Four processors, 50 time units each.]

(d) $S_p = \frac{100}{40} = 2.5$ (no synchronization, but load imbalance)

(e) $S_p = \frac{100}{50} = 2.0$ (load imbalance and synchronization cost)

Page 9: cs546perf


What Is “Good” Speedup?

• Linear speedup: $S_p = p$
• Superlinear speedup: $S_p > p$
• Sub-linear speedup: $S_p < p$

Page 10: cs546perf


Speedup

[Figure: speedup plotted against the number of processors p.]

Page 11: cs546perf


Sources of Parallel Overheads 

• Interprocessor communication

• Load imbalance

• Synchronization

• Extra computation

Page 12: cs546perf


Degradations of Parallel Processing

Unbalanced Workload

Communication Delay 

Overhead Increases with the Ensemble Size 

Page 13: cs546perf


Degradations of Distributed Computing

Unbalanced Computing Power and Workload

Shared Computing and Communication Resource 

Uncertainty, Heterogeneity, and Overhead Increases

with the Ensemble Size 

Page 14: cs546perf


Causes of Superlinear Speedup

• Cache size increased

• Overhead reduced

• Latency hidden

• Randomized algorithms
• Mathematical inefficiency of the serial algorithm
• Higher memory access cost in sequential processing

• X.H. Sun and J. Zhu, "Performance Considerations of Shared Virtual Memory Machines," IEEE Trans. on Parallel and Distributed Systems, Nov. 1995.

Page 15: cs546perf


• Fixed-Size Speedup (Amdahl’s law)
 – Emphasis on turnaround time
 – Problem size, W, is fixed

$S_p = \dfrac{\text{Uniprocessor Time of Solving } W}{\text{Parallel Time of Solving } W}$

Page 16: cs546perf


Amdahl’s Law

• The performance improvement that can be gained by a parallel implementation is limited by the fraction of time parallelism can actually be used in an application
• Let α = the fraction of the program (algorithm) that is serial and cannot be parallelized. For instance:
 – Loop initialization
 – Reading/writing to a single disk
 – Procedure call overhead
• Parallel run time is given by

$T_p = \alpha T_s + (1-\alpha)\dfrac{T_s}{p}$

Page 17: cs546perf


Amdahl’s Law

• Amdahl’s law gives a limit on speedup in terms of α:

$S_p = \dfrac{T_s}{T_p} = \dfrac{T_s}{\alpha T_s + (1-\alpha)T_s/p} = \dfrac{1}{\alpha + (1-\alpha)/p}$

Page 18: cs546perf


Enhanced Amdahl’s Law

• To include overhead:

$S_p^{FS} = \dfrac{T_s}{\alpha T_s + (1-\alpha)T_s/p + T_{overhead}} = \dfrac{1}{\alpha + (1-\alpha)/p + T_{overhead}/T_s}$

• The overhead includes parallelism and interaction overheads

Amdahl’s law: an argument against massively parallel systems

Page 19: cs546perf


• Fixed-Size Speedup (Amdahl’s Law, 67)

[Figure: two charts versus the number of processors p = 1…5. Left, amount of work: the sequential part W1 and the parallel part Wp stay constant as p grows. Right, elapsed time: the sequential time T1 stays constant while the parallel time Tp shrinks with p.]

Page 20: cs546perf


Amdahl’s Law

• The speedup that is achievable on p processors is:

$S_p = \dfrac{T_s}{T_p} = \dfrac{1}{\alpha + (1-\alpha)/p}$

• If we assume that the serial fraction α is fixed, then the speedup for infinitely many processors is limited by 1/α:

$\lim_{p \to \infty} S_p = \dfrac{1}{\alpha}$

• For example, if α = 10%, then the maximum speedup is 10, even if we use an infinite number of processors
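A minimal C sketch (not from the slides; the function name and sample processor counts are illustrative) that evaluates this bound:

#include <stdio.h>

/* Amdahl speedup for serial fraction alpha on p processors:
   S_p = 1 / (alpha + (1 - alpha) / p). */
double amdahl_speedup(double alpha, int p) {
    return 1.0 / (alpha + (1.0 - alpha) / p);
}

int main(void) {
    int procs[] = {1, 10, 100, 1000, 100000};
    /* With alpha = 10%, S_p saturates near 1/alpha = 10. */
    for (int i = 0; i < 5; i++)
        printf("p = %6d  S_p = %.3f\n", procs[i], amdahl_speedup(0.10, procs[i]));
    return 0;
}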

Page 21: cs546perf


Comments on Amdahl’s Law

• The Amdahl fraction in practice depends on the problem size n and the number of processors p
• An effective parallel algorithm has: $\alpha(n,p) \to 0$ as $n \to \infty$
• For such a case, even if one fixes p, we can get linear speedups by choosing a suitably large problem size:

$S_p = \dfrac{1}{\alpha(n,p) + (1-\alpha(n,p))/p} \to p \quad \text{as } n \to \infty$

• Scalable speedup
• Practically, the problem size that we can run for a particular problem is limited by the time and memory of the parallel computer

Page 22: cs546perf


• Fixed-Time Speedup (Gustafson, 88)
 ° Emphasis on work finished in a fixed time
 ° Problem size is scaled from W to W'
 ° W': work finished within the fixed time with parallel processing

$S_p' = \dfrac{\text{Uniprocessor Time of Solving } W'}{\text{Parallel Time of Solving } W'} = \dfrac{\text{Uniprocessor Time of Solving } W'}{\text{Uniprocessor Time of Solving } W} = \dfrac{W'}{W}$

(The parallel time of solving W' is, by construction, the fixed time, which equals the uniprocessor time of solving W.)

Page 23: cs546perf


Gustafson’s Law (Without Overhead)

[Figure: within the fixed time, the sequential run does work $\alpha W + (1-\alpha)W$; with p processors the parallel portion scales to $(1-\alpha)pW$.]

$\text{Speedup}_{FT} = \dfrac{\text{Work}(p)}{\text{Work}(1)} = \dfrac{\alpha W + (1-\alpha)pW}{W} = \alpha + (1-\alpha)p$
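As a quick check, a small C sketch (illustrative, not from the slides) contrasting Gustafson's unbounded growth with Amdahl's bound for the same α:

#include <stdio.h>

/* Fixed-time (Gustafson) speedup: alpha is the serial fraction
   measured on the parallel run; S_FT = alpha + (1 - alpha) * p. */
double gustafson_speedup(double alpha, int p) {
    return alpha + (1.0 - alpha) * p;
}

int main(void) {
    /* Unlike Amdahl's 1/alpha limit, S_FT grows without bound in p. */
    for (int p = 1; p <= 1024; p *= 4)
        printf("p = %4d  S_FT = %.1f\n", p, gustafson_speedup(0.10, p));
    return 0;
}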

Page 24: cs546perf


• Fixed-Time Speedup (Gustafson)

[Figure: two charts versus the number of processors p = 1…5. Left, amount of work: the parallel part Wp grows linearly with p while W1 stays constant. Right, elapsed time: the total time T1 + Tp stays fixed as p grows.]

Page 25: cs546perf


Converting α’s between Amdahl’s and Gustafson’s Laws

Setting the two speedups equal,

$\alpha_G + (1-\alpha_G)p = \dfrac{1}{\alpha_A + (1-\alpha_A)/p}$

gives

$\alpha_A = \dfrac{1}{1 + \frac{(1-\alpha_G)}{\alpha_G}\,p} = \dfrac{\alpha_G}{\alpha_G + (1-\alpha_G)p}$

Based on this observation, Amdahl’s and Gustafson’s laws are identical.
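A one-line C helper (illustrative) for the conversion; plugging the converted fraction into Amdahl's formula reproduces the Gustafson speedup:

#include <stdio.h>

/* Convert a Gustafson (scaled) fraction alpha_G to the equivalent
   Amdahl (fixed-size) fraction for p processors. */
double amdahl_fraction(double alpha_G, int p) {
    return alpha_G / (alpha_G + (1.0 - alpha_G) * p);
}

int main(void) {
    double aG = 0.10; int p = 10;
    double aA = amdahl_fraction(aG, p);
    /* Both sides print 9.10: the two laws agree. */
    printf("Gustafson: %.2f  Amdahl: %.2f\n",
           aG + (1 - aG) * p, 1.0 / (aA + (1 - aA) / p));
    return 0;
}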

Page 26: cs546perf


Memory Constrained Scaling: Sun and Ni’s Law

• Scale to the largest possible solution limited by the memory space; or, fix memory usage per processor
 – (ex) N-body problem
• Problem size is scaled from W to W*
• W* is the work executed under the memory limitation of a parallel computer
• For a simple profile where the work is W = g(M) for memory capacity M, the scaled work is $W^* = g(pM) = G(p)\,W$, where G(p) is the increase of parallel workload as the memory capacity increases p times

Page 27: cs546perf


Sun & Ni’s Law

$\text{Speedup}_{MB} = \dfrac{\text{Work}(p)/\text{Time}(p)}{\text{Work}(1)/\text{Time}(1)} = \dfrac{\text{Increase in work}}{\text{Increase in time}}$

With serial fraction α and memory-driven workload growth G(p):

$\text{Speedup}_{MB} = \dfrac{\alpha + (1-\alpha)G(p)}{\alpha + (1-\alpha)G(p)/p}$

[Figure: the parallel portion of the work grows from $(1-\alpha)W$ to $(1-\alpha)G(p)W$ under the memory bound.]
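A C sketch (illustrative) of the memory-bounded formula; note how G(p) = 1 recovers Amdahl's law and G(p) = p recovers Gustafson's:

#include <stdio.h>

/* Sun-Ni memory-bounded speedup with serial fraction alpha and
   workload growth factor G as memory scales p times. */
double sun_ni_speedup(double alpha, int p, double G) {
    return (alpha + (1.0 - alpha) * G) / (alpha + (1.0 - alpha) * G / p);
}

int main(void) {
    int p = 64; double alpha = 0.10;
    printf("G(p)=1 (Amdahl):    %.2f\n", sun_ni_speedup(alpha, p, 1.0));
    printf("G(p)=p (Gustafson): %.2f\n", sun_ni_speedup(alpha, p, (double)p));
    printf("G(p)=p^1.5:         %.2f\n", sun_ni_speedup(alpha, p, 512.0));
    return 0;
}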

Page 28: cs546perf


• Memory-Bounded Speedup (Sun & Ni, 90)
 ° Emphasis on work finished under the current physical limitation
 ° Problem size is scaled from W to W*
 ° W*: work executed under the memory limitation with parallel processing

$S_p^* = \dfrac{\text{Uniprocessor Time of Solving } W^*}{\text{Parallel Time of Solving } W^*}$

• X.H. Sun and L. Ni, "Scalable Problems and Memory-Bounded Speedup," Journal of Parallel and Distributed Computing, Vol. 19, pp. 27-37, Sept. 1993 (SC90).

Page 29: cs546perf


• Memory-Bounded Speedup (Sun & Ni)

[Figure: two charts versus the number of processors p = 1…5. Left, amount of work: the parallel part Wp grows with p as memory allows. Right, elapsed time: the elapsed time also grows with p.]

 – Work executed under memory limitation
 – Hierarchical memory

Page 30: cs546perf


Characteristics

• Connection to other scaling models
 – G(p) = 1: problem constrained scaling
 – G(p) = p: time constrained scaling
• With overhead, G(p) > p can lead to a large increase in execution time
 – (ex) 10K x 10K matrix factorization: 800 MB, 1 hr on a uniprocessor; with 1024 processors, a 320K x 320K matrix takes 32 hrs

Page 31: cs546perf


Why Scalable Computing

 – Scalable
  • More accurate solution
  • Sufficient parallelism
  • Maintain efficiency
 – Efficient in parallel computing
  • Load balance
  • Communication
 – Mathematically effective
  • Adaptive
  • Accuracy

Page 32: cs546perf


• Memory-Bounded Speedup
 ° Natural for domain-decomposition-based computing
 ° Shows the potential of parallel processing (in general, the computing requirement increases faster with problem size than the communication requirement does)
 ° Impacts extend to architecture design: the trade-off of memory size and computing speed

Page 33: cs546perf


Why Scalable Computing (2)

Small Work
• Appropriate for a small machine
 – Parallelism overheads begin to dominate benefits for larger machines
  • Load imbalance
  • Communication-to-computation ratio
 – May even achieve slowdowns
 – Does not reflect real usage, and is inappropriate for a large machine
  • Can exaggerate benefits of improvements

Page 34: cs546perf


Why Scalable Computing (3)

Large Work
• Appropriate for a big machine
 – Difficult to measure improvement
 – May not fit on a small machine
  • Can’t run
  • Thrashing to disk
  • Working set doesn’t fit in cache
 – Fits at some p, leading to superlinear speedup

Page 35: cs546perf


Demonstrating Scaling Problems

[Figure: two speedup plots on the SGI Origin2000. A small Ocean problem shows parallelism overhead dominating; a big equation-solver problem shows superlinear speedup.]

Users want to scale problems as machines grow!

Page 36: cs546perf


How to Scale

• Scaling a machine
 – Make a machine more powerful
 – Machine size
  • <processor, memory, communication, I/O>
 – Scaling a machine in parallel processing
  • Add more identical nodes
• Problem size
 – Input configuration
 – Data set size: the amount of storage required to run it on a single processor
 – Memory usage: the amount of memory used by the program

Page 37: cs546perf


Two Key Issues in Problem Scaling

• Under what constraints should the problem be scaled?
 – Some properties must be fixed as the machine scales
• How should the problem be scaled?
 – Which parameters?
 – How?

Page 38: cs546perf


Constraints To Scale

• Two types of constraints
 – Problem-oriented
  • Ex) Time
 – Resource-oriented
  • Ex) Memory
• Work to scale
 – Metric-oriented
  • Floating point operations, instructions
 – User-oriented
  • Easy to change, but may be difficult to compare
  • Ex) particles, rows, transactions
  • Difficult cross comparison

Page 39: cs546perf


Rethinking of Speedup

• Speedup

$S_p = \dfrac{\text{Uniprocessor Execution Time}}{\text{Parallel Execution Time}}$

• Why is it called speedup when it compares times?
• Could we compare speeds directly?
• Generalized speedup

$S_p = \dfrac{\text{Parallel Speed}}{\text{Sequential Speed}}$

• X.H. Sun and J. Gustafson, "Toward A Better Parallel Performance Metric," Parallel Computing, Vol. 17, pp. 1093-1109, Dec. 1991.


Page 41: cs546perf


Compute π: Problem

• Consider a parallel algorithm for computing the value of π = 3.1415… through the following numerical integration:

$\pi = \int_0^1 \dfrac{4}{1+x^2}\,dx$

Page 42: cs546perf


Compute π: Sequential Algorithm

double computepi() {              /* n: number of intervals, assumed given */
    double h = 1.0 / n;           /* width of each interval */
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double x = h * (i + 0.5);         /* midpoint of interval i */
        sum = sum + 4.0 / (1 + x * x);    /* evaluate 4/(1+x^2) */
    }
    return h * sum;               /* midpoint-rule approximation of pi */
}
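A hypothetical driver for the routine above (not in the slides; n is the global the listing assumes):

#include <stdio.h>

int n;                    /* global problem size used by computepi() */
double computepi();       /* the sequential routine above */

int main(void) {
    n = 10000000;         /* 10 million intervals */
    printf("pi ~ %.10f\n", computepi());
    return 0;
}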

Page 43: cs546perf


Compute : Parallel Algorithm

• Each processor computes on a set of about n/p

points which are allocated to each processor in acyclic manner

• Finally, we assume that the local values of areaccumulated among the p processors under

synchronization

  01 2 3

  01 2 3

  01 2 3

  01 2 3   0

1 2 3

Page 44: cs546perf


Compute π: Parallel Algorithm

computepi() {
    id = my_proc_id();
    nprocs = number_of_procs();
    h = 1.0 / n;
    sum = 0.0;
    for (i = id; i < n; i = i + nprocs) {  /* cyclic allocation of points */
        x = h * (i + 0.5);
        sum = sum + 4.0 / (1 + x * x);
    }
    localpi = sum * h;
    use_tree_based_combining_for_critical_section();
    pi = pi + localpi;                     /* accumulate the global value */
    end_critical_section();
}
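The slides use generic SPMD primitives; as one concrete realization (an assumption, not the slides' code), MPI_Reduce performs the tree-based combining of the local sums:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    const int n = 10000000;                  /* number of points */
    int id, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    double h = 1.0 / n, sum = 0.0;
    for (int i = id; i < n; i += nprocs) {   /* cyclic point allocation */
        double x = h * (i + 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    double localpi = sum * h, pi = 0.0;
    /* Tree-based combining of the local values in log2(p) steps. */
    MPI_Reduce(&localpi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (id == 0) printf("pi ~ %.10f\n", pi);
    MPI_Finalize();
    return 0;
}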

Page 45: cs546perf


Compute π: Analysis

• Assume that the computation of π is performed over n points
• The sequential algorithm performs 6 operations (two multiplications, one division, three additions) per point on the x-axis. Hence, for n points and time t₀ per operation, the sequential runtime is:

$T_s = 6n\,t_0$

for (i = 0; i < n; i++) {
    x = h * (i + 0.5);              /* 1 multiplication, 1 addition */
    sum = sum + 4.0 / (1 + x * x);  /* 1 division, 1 multiplication, 2 additions */
}

Page 46: cs546perf


Compute π: Analysis

• The parallel algorithm uses p processors with static interleaved scheduling. Each processor computes on a set of m points, which are allocated to each processor in a cyclic manner
• The expression for m is given by $m = \lfloor n/p \rfloor + 1$ if p does not exactly divide n. The runtime for the parallel computation of the local values of π is:

$T_p = 6m\,t_0 = \left(6\frac{n}{p} + 6\right)t_0$

Page 47: cs546perf


Compute π: Analysis

• The accumulation of the local values of π using tree-based combining can be optimally performed in log₂(p) steps
• The total runtime for the parallel algorithm, including the parallel computation and the combining (with t_c the cost of each combining step), is:

$T_p = 6m\,t_0 + \log(p)\,t_c = \left(6\frac{n}{p} + 6\right)t_0 + \log(p)\,t_c$

• The speedup of the parallel algorithm is:

$S_p = \dfrac{T_s}{T_p} = \dfrac{6n}{6\frac{n}{p} + 6 + \log(p)(t_c/t_0)}$

Page 48: cs546perf


Compute π: Analysis

• The Amdahl fraction for this parallel algorithm can be determined by rewriting the previous equation as:

$S_p = \dfrac{1}{\alpha(n,p) + (1-\alpha(n,p))/p}$

• Hence, with $c = t_c/t_0$, the Amdahl fraction α(n,p) is:

$\alpha(n,p) = \dfrac{p}{(p-1)n} + \dfrac{c\,p\log(p)}{6(p-1)n}$

• The parallel algorithm is effective because:

$\alpha(n,p) \to 0 \text{ as } n \to \infty \text{ for fixed } p$
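A small C sketch (illustrative) evaluating the reconstructed α(n,p); the cost ratio c = t_c/t₀ is an assumed input:

#include <stdio.h>
#include <math.h>

/* Amdahl fraction of the parallel pi algorithm:
   alpha(n,p) = p/((p-1)*n) * (1 + (c/6) * log2(p)), with c = t_c/t_0. */
double alpha_np(double n, double p, double c) {
    return p / ((p - 1.0) * n) * (1.0 + (c / 6.0) * log2(p));
}

int main(void) {
    double c = 10.0;                   /* assumed combining/compute ratio */
    for (double n = 1e3; n <= 1e9; n *= 1000.0)
        printf("n = %.0e  alpha(n,64) = %.2e\n", n, alpha_np(n, 64.0, c));
    return 0;  /* alpha -> 0 as n grows: the algorithm is effective */
}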

Page 49: cs546perf


Finite Differences: Problem

• Consider a finite difference iterative method applied to a 2D grid, where:

$X_{i,j}^{t+1} = \omega\left(X_{i,j-1}^{t} + X_{i,j+1}^{t} + X_{i-1,j}^{t} + X_{i+1,j}^{t}\right) + (1-\omega)\,X_{i,j}^{t}$

Page 50: cs546perf


Finite Differences: Serial Algorithm

finitediff() {
    for (t = 0; t < T; t++) {
        for (i = 0; i < n; i++) {
            for (j = 0; j < n; j++) {
                x[i,j] = w_1*(x[i,j-1] + x[i,j+1] + x[i-1,j] + x[i+1,j]) + w_2*x[i,j];
            }
        }
    }
}

Page 51: cs546perf


Finite Differences: Parallel Algorithm

• Each processor computes on a sub-grid of $\frac{n}{\sqrt{p}} \times \frac{n}{\sqrt{p}}$ points
• Synchronization between processors after every iteration ensures correct values being used for subsequent iterations

[Figure: the n×n grid is partitioned into √p × √p blocks, one per processor, each of side n/√p.]

Page 52: cs546perf


Finite Differences: Parallel Algorithm

finitediff() {
    row_id = my_processor_row_id();
    col_id = my_processor_col_id();
    p = number_of_processors();
    sp = sqrt(p);
    rows = cols = ceil(n/sp);
    row_start = row_id*rows;
    col_start = col_id*cols;
    for (t = 0; t < T; t++) {
        for (i = row_start; i < min(row_start+rows, n); i++) {
            for (j = col_start; j < min(col_start+cols, n); j++) {
                x[i,j] = w_1*(x[i,j-1] + x[i,j+1] + x[i-1,j] + x[i+1,j]) + w_2*x[i,j];
            }
        }
        barrier();   /* synchronize after each iteration */
    }
}

Page 53: cs546perf


Finite Differences: Analysis

• The sequential algorithm performs 6 operations (2 multiplications, 4 additions) every iteration per point on the grid. Hence, for an n×n grid, T iterations, and time t₀ per operation, the sequential runtime is:

$T_s = 6n^2T\,t_0$

x[i,j] = w_1*(x[i,j-1] + x[i,j+1] + x[i-1,j] + x[i+1,j]) + w_2*x[i,j];  /* 2 multiplications, 4 additions */

Page 54: cs546perf


Finite Differences: Analysis

• The parallel algorithm uses p processors with static blockwise scheduling. Each processor computes on an m×m sub-grid allocated to each processor in a blockwise manner
• The expression for m is given by $m = \lceil n/\sqrt{p} \rceil$. The runtime for the parallel computation over the same T iterations is:

$T_p = 6m^2T\,t_0 = 6\left(\frac{n^2}{p}\right)T\,t_0$

Page 55: cs546perf


Finite Differences: Analysis

• The barrier synchronization needed for each iteration can be optimally performed in log(p) steps
• The total runtime for the parallel algorithm, including the barrier in each of the T iterations, is:

$T_p = \left(6\frac{n^2}{p}\,t_0 + \log(p)\,t_c\right)T$

• The speedup of the parallel algorithm (T cancels) is:

$S_p = \dfrac{T_s}{T_p} = \dfrac{6n^2}{6\frac{n^2}{p} + \log(p)(t_c/t_0)}$

Page 56: cs546perf


Finite Differences: Analysis

• The Amdahl fraction for this parallel algorithm can be determined by rewriting the previous equation as:

$S_p = \dfrac{1}{\alpha(n,p) + (1-\alpha(n,p))/p}$

• Hence, with $c = t_c/t_0$, the Amdahl fraction α(n,p) is:

$\alpha(n,p) = \dfrac{c\,p\log(p)}{6(p-1)n^2}$

• We finally note that:

$\alpha(n,p) \to 0 \text{ as } n \to \infty \text{ for fixed } p$

• Hence, the parallel algorithm is effective


Page 57: cs546perf


Equation Solver

A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j])

procedure solve(A)             /* A is an n-by-n grid */
  …
  while (!done) do
    diff = 0;
    for i = 1 to n do
      for j = 1 to n do
        temp = A[i,j];
        A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j]);
        diff += abs(A[i,j] - temp);
      end for
    end for
    if (diff/(n*n) < TOL) then done = 1;
  end while
end procedure


Page 58: cs546perf


Workloads

• Basic properties
 – Memory requirement: O(n²)
 – Computational complexity: O(n³), assuming the number of iterations to converge to be O(n)
• Assume the speedup equals the number of processors p
• Grid size
 – Fixed-size: fixed at n
 – Fixed-time: scaled to k, where $k^3 = pn^3$, i.e., $k = \sqrt[3]{p}\,n$
 – Memory-bound: scaled to k, where $k^2 = pn^2$, i.e., $k = \sqrt{p}\,n$

Page 59: cs546perf


Memory Requirement of Equation Solver

Fixed-size: total $n^2$, i.e., $\frac{n^2}{p}$ per processor

Fixed-time: $k^2 = (\sqrt[3]{p}\,n)^2 = p^{2/3}n^2$, i.e., $\frac{n^2}{p^{1/3}}$ per processor

Memory-bound: $k^2 = (\sqrt{p}\,n)^2 = pn^2$, i.e., $n^2$ per processor

Page 60: cs546perf


Time Complexity of Equation Solver

(Sequential time complexity is $n^3$; parallel time assumes speedup p.)

Fixed-size: sequential $n^3$, parallel $\frac{n^3}{p}$

Fixed-time: sequential $k^3 = (\sqrt[3]{p}\,n)^3 = pn^3$, parallel $n^3$ (fixed)

Memory-bound: sequential $k^3 = (\sqrt{p}\,n)^3 = p^{3/2}n^3$, parallel $\sqrt{p}\,n^3$

Page 61: cs546perf


Concurrency

Concurrency is proportional to the number of grid points.

Fixed-size: $n^2$

Fixed-time: $k^2 = (\sqrt[3]{p}\,n)^2 = p^{2/3}n^2$

Memory-bound: $k^2 = pn^2$
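The three scaling models can be tabulated with a short C sketch (illustrative; the base grid side is an assumed input). It prints the scaled grid side k, per-processor memory k²/p, and parallel time k³/p, with speedup p assumed as on the Workloads slide:

#include <stdio.h>
#include <math.h>

int main(void) {
    double n = 1000.0;                         /* assumed base grid side */
    const char *model[] = {"fixed-size", "fixed-time", "memory-bound"};
    for (double p = 1; p <= 1024; p *= 32) {
        double k[3] = {n, cbrt(p) * n, sqrt(p) * n};  /* scaled grid sides */
        for (int m = 0; m < 3; m++)
            printf("p=%5.0f  %-12s k=%7.0f  mem/proc=%10.0f  time=%12.0f\n",
                   p, model[m], k[m], k[m]*k[m]/p, k[m]*k[m]*k[m]/p);
    }
    return 0;
}

Running it shows the trade-off in numbers: fixed-time keeps the parallel time constant, while memory-bound keeps per-processor memory constant at the cost of time growing as √p.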

Page 62: cs546perf


Communication to Computation Ratio

Each processor holds a $\frac{k}{\sqrt{p}} \times \frac{k}{\sqrt{p}}$ sub-grid; per iteration, communication is proportional to the sub-grid perimeter and computation to its area, so $CCR \sim \sqrt{p}/k$.

Fixed-size ($k = n$): $CCR = \dfrac{\sqrt{p}}{n}$

Fixed-time ($k = \sqrt[3]{p}\,n$): $CCR = \dfrac{\sqrt{p}}{\sqrt[3]{p}\,n} = \dfrac{p^{1/6}}{n}$

Memory-bound ($k = \sqrt{p}\,n$): $CCR = \dfrac{\sqrt{p}}{\sqrt{p}\,n} = \dfrac{1}{n}$
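A C sketch (illustrative) of the three ratios, using CCR ~ √p/k for a sub-grid of side k/√p:

#include <stdio.h>
#include <math.h>

int main(void) {
    double n = 1000.0;                    /* assumed base grid side */
    for (double p = 4; p <= 4096; p *= 16)
        printf("p=%5.0f  fixed-size=%.4f  fixed-time=%.4f  memory-bound=%.4f\n",
               p,
               sqrt(p) / n,               /* k = n:        CCR grows as sqrt(p) */
               sqrt(p) / (cbrt(p) * n),   /* k = p^(1/3)n: CCR = p^(1/6)/n */
               sqrt(p) / (sqrt(p) * n));  /* k = sqrt(p)n: CCR = 1/n, constant */
    return 0;
}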

Page 63: cs546perf


Scalability

• The need for new metrics
 – Comparison of performances with different workloads
 – Availability of massively parallel processing
• Scalability: the ability to maintain parallel processing gain when both problem size and system size increase

Page 64: cs546perf


Parallel Efficiency

• The achieved fraction of the total potential parallel processing gain
 – Assuming linear speedup, p is the ideal case
• The ability to maintain efficiency when the problem size increases

$E_p = \dfrac{S_p}{p}$

Page 65: cs546perf


Maintain Efficiency

• Efficiency of adding n numbers in parallel: $E = \dfrac{1}{1 + 2p\log p/n}$
 – For an efficiency of 0.80 on 4 procs, n = 64
 – For an efficiency of 0.80 on 8 procs, n = 192
 – For an efficiency of 0.80 on 16 procs, n = 512

[Figure: efficiency versus number of processors (1 to 32) for data sizes n = 64, 192, 320, 512; larger n sustains higher efficiency.]
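Solving E = 1/(1 + 2p log p/n) for n gives the problem size needed to sustain a target efficiency; a C sketch (illustrative) reproducing the numbers above:

#include <stdio.h>
#include <math.h>

/* Problem size needed so that adding n numbers on p processors
   achieves efficiency E: n = 2*p*log2(p) * E / (1 - E). */
double isoefficiency_n(int p, double E) {
    return 2.0 * p * log2((double)p) * E / (1.0 - E);
}

int main(void) {
    int procs[] = {4, 8, 16};
    for (int i = 0; i < 3; i++)   /* prints n = 64, 192, 512 */
        printf("p = %2d  n = %.0f\n", procs[i], isoefficiency_n(procs[i], 0.80));
    return 0;
}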

Page 66: cs546perf


• Ideally Scalable

T(mp, mW) = T(p, W)

 – T: execution time
 – W: work executed
 – p: number of processors used
 – m: scale up m times
 – work: flop count based on the best practical serial algorithm

• Fact: T(mp, mW) = T(p, W) if and only if the average unit speed is fixed

Page 67: cs546perf


 – Definition: The average unit speed is the achieved speed divided by the number of processors
 – Definition (Isospeed Scalability): An algorithm-machine combination is scalable if the achieved average unit speed can remain constant with increasing numbers of processors, provided the problem size is increased proportionally

Page 68: cs546perf


• Isospeed Scalability (Sun & Rover, 91)
 – W: work executed when p processors are employed
 – W': work executed when p' > p processors are employed to maintain the average speed

$\text{Scalability: } \psi(p,p') = \dfrac{p'\,W}{p\,W'}$

 – Ideal case: $W' = \dfrac{p'W}{p}$, giving $\psi(p,p') = 1$
 – Scalability in terms of time:

$\psi(p,p') = \dfrac{\text{time of solving work } W \text{ on } p \text{ processors}}{\text{time of solving work } W' \text{ on } p' \text{ processors}}$

Page 69: cs546perf


• Isospeed Scalability (Sun & Rover)
 – W: work executed when p processors are employed
 – W': work executed when p' > p processors are employed to maintain the average speed

$\psi(p,p') = \dfrac{p'\,W}{p\,W'}$

 – Ideal case: $W' = \dfrac{p'W}{p}$, $\psi(p,p') = 1$

• X. H. Sun and D. Rover, "Scalability of Parallel Algorithm-Machine Combinations," IEEE Trans. on Parallel and Distributed Systems, May 1994 (Ames TR91).
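A C sketch of the isospeed metric (illustrative, following the reconstruction above; the work values are made-up inputs):

#include <stdio.h>

/* Isospeed scalability: W and W' are the work amounts that keep the
   average unit speed constant on p and p' processors.
   Ideal scaling (W' = p'*W/p) gives psi = 1. */
double isospeed(double p, double W, double p2, double W2) {
    return (p2 * W) / (p * W2);
}

int main(void) {
    /* If work must triple when processors double, psi < 1. */
    printf("ideal:     %.2f\n", isospeed(64, 1000, 128, 2000));
    printf("sublinear: %.2f\n", isospeed(64, 1000, 128, 3000));
    return 0;
}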

Page 70: cs546perf


The Relation of Scalability and Time

• More scalable leads to smaller time
 – Better initial run-time and higher scalability lead to superior run-time
 – Same initial run-time and same scalability lead to the same scaled performance
 – Superior initial performance may not last long if scalability is low
• Range Comparison

• X.H. Sun, "Scalability Versus Execution Time in Scalable Systems," Journal of Parallel and Distributed Computing, Vol. 62, No. 2, pp. 173-192, Feb 2002.

Page 71: cs546perf


Range Comparison Via Performance Crossing Point

Assume program 1 is α times slower than program 2 at the initial state.

Begin (Range Comparison)
  p' = p;
  Repeat
    p' = p' + 1;
    Compute the scalability of program 1, Φ(p, p');
    Compute the scalability of program 2, Ψ(p, p');
  Until (Φ(p,p') > α·Ψ(p,p') or p' = the limit of ensemble size)
  If Φ(p,p') > α·Ψ(p,p') Then
    p' is the smallest scaled crossing point;
    program 2 is superior at any ensemble size p†, p ≤ p† < p'
  Else
    program 2 is superior at any ensemble size p†, p ≤ p† ≤ p'
  End {If}
End {Range Comparison}

Page 72: cs546perf


• Range Comparison

[Figure: crossing-point examples showing the influence of communication speed and the influence of computing speed.]

• X.H. Sun, M. Pantano, and T. Fahringer, "Integrated Range Comparison for Data-Parallel Compilation Systems," IEEE Trans. on Parallel and Distributed Processing, May 1999.

Page 73: cs546perf


The SCALA (SCALability Analyzer) System

• Design Goals
 – Predict performance
 – Support program optimization
 – Estimate the influence of hardware variations
• Uniqueness
 – Designed to be integrated into advanced compiler systems
 – Based on scalability analysis

Page 74: cs546perf


• Vienna Fortran Compilation System
 – A data-parallel restructuring compilation system
 – Consists of a parallelizing compiler for VF/HPF and tools for program analysis and restructuring
 – Under a major upgrade for HPF2
• Performance prediction is crucial for appropriate program restructuring

Page 75: cs546perf


The Structure of SCALA

[Figure: block diagram of the SCALA system.]

Page 76: cs546perf


Prototype Implementation

• Automatic range comparison for different data distributions
• The P³T static performance estimator
• Test cases: Jacobi and Redblack

[Figure: two comparison plots, one with no crossing point and one with a crossing point.]

Page 77: cs546perf


Summary

• Relation between iso-speed scalability and iso-efficiency scalability
 – Both measure the ability to maintain parallel efficiency, defined as:

$E = \dfrac{S_p}{p}$

 – where iso-efficiency’s speedup is the traditional speedup, defined as:

$S_p = \dfrac{\text{Uniprocessor Execution Time}}{\text{Parallel Execution Time}}$

 – and iso-speed’s speedup is the generalized speedup, defined as:

$S_p = \dfrac{\text{Parallel Speed}}{\text{Sequential Speed}}$

 – If the sequential execution speed is independent of problem size, iso-speed and iso-efficiency are equivalent
 – Due to memory hierarchy, sequential execution performance varies largely with problem size

Page 78: cs546perf


Summary

• Predicting sequential execution performance becomes a major task of SCALA due to the advanced memory hierarchy
 – The Memory-LogP model is introduced for data access cost
• New challenges in distributed computing
• Generalized iso-speed scalability
• Generalized performance tool: GHS

• K. Cameron and X.-H. Sun, "Quantifying Locality Effect in Data Access Delay: Memory logP," Proc. of IEEE IPDPS 2003, Nice, France, April 2003.
• X.-H. Sun and M. Wu, "Grid Harvest Service: A System for Long-Term, Application-Level Task Scheduling," Proc. of IEEE IPDPS 2003, Nice, France, April 2003.