I Why Parallel Computers?
II What are the problems of Parallel Computing?
III Architecture of Parallel Computers
IV Performance Evaluation
V References

Introduction and Motivation
M. Garbey, COSC
I.1 Why Parallel Computers?
A parallel computer is for:
Solving a LARGER problem
Solving the problem FASTER?
Naive example: the construction of a wall with many workers.
[Figure: time to build the wall vs. the dimension/size of the problem, as workers are added.]
I.2 Why Parallel Computers?
The price of a single CPU grows faster than linearly with its speed.
One hopes that the price of a parallel computer will increase only linearly with speed!
The hardware of a parallel computer is easy to build from off-the-shelf components and processors, reducing development time and costs.
A parallel computer is « ten times » cheaper than a vector computer of the same performance.
A network of PCs is « ten times » cheaper than a parallel computer.
What about parallel software?
I.3 Why Parallel Computers?
Computer clocks are approaching fundamental speed limits (with currently known technology):
Speed of light: 30 cm per nanosecond.
In copper wire the limiting speed is 9 cm per nanosecond.
Miniaturization of CPU chips adds heat-dissipation problems.
Clock speeds on workstations have increased over the past few years from 1 MHz to 100 MHz, and now up to 1 GHz and more.
According to Moore's Law, processor speed AND memory increase by a factor of two every 18 months.
Note: this exponential growth gives a factor of 1000 in 15 years (ten doublings, 2^10 ≈ 1000) and 1,000,000 in 30 years. However, 2010 seems to be the limit for this type of growth.
I.3 Why Parallel Computers?
Computer          Year   Cycle time (ns)   Memory
IBM 7090          60's   2000              32 K
CDC 6600          1964   100               128 K
CDC 7600          1969   27.5              128 K + 512 K
CRAY 1            1976   12.5              2 M
CDC Cyber 205     1979   20                4 M
Fujitsu VP200     1983   7.5               64 M
CRAY 2            1984   4.1               256 M
CRAY X-MP/4       1985   8.5               16 M
NEC SX-2          1984   6                 128 M
CRAY Y-MP/8       1988   6                 128 M
Fujitsu VP 2600   1990   3.2               256 M
NEC SX-3          1990   2.9               256 M
CRAY Y-MP 16      1992   4                 512 M
[Figure: chip density and temperature.]
I.4 Why Parallel Computers?
There is a need for large-scale computing.
Examples from industry:
• Weather forecast
• Direct numerical simulation of an airplane
• Car crash
• Safety of nuclear reactors
• Drug design
• Genome
Examples from the academic world:
• Modeling of molecules: structure, chemical reactions, biochemistry (linear algebra and eigenvalue problems)
• Simulation of superconductors: vortex dynamics, phase transitions (and many other examples in materials science)
• Climate modeling and air pollution
• Genome
More applications:
• Nuclear explosion simulations
• Nuclear waste disposal
II What are the problems of Parallel Computing?
Asking the right questions may help to solve the right problems…
• Computer science point of view
• Numerical analysis point of view
• Scientific computing point of view
• Industrial point of view
We briefly review these four points of view.
II.1 What are the problems of Parallel Computing? Computer Science Point of View
Partitioning: the data must be distributed among the memory units in order to be used in parallel by the CPUs (see the sketch after the diagram below).
Access to data and bandwidth: the CPUs must have parallel paths to the data they need simultaneously.
Latency: latency must not become a bottleneck, i.e. simultaneous accesses to data may cause pauses, but they should not decrease the processing speed.
[Diagram: CPUs connected through a network to memory units; a control unit?]
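As a concrete illustration of partitioning, here is a minimal Python sketch of my own (the helper block_range is hypothetical, not part of the course material): each of the p processors owns one contiguous block of the n data items.

```python
def block_range(n, p, rank):
    """Return the half-open index range [lo, hi) owned by processor `rank`
    when n items are split into p nearly equal contiguous blocks."""
    base, extra = divmod(n, p)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi

if __name__ == "__main__":
    n, p = 10, 4
    for rank in range(p):
        print(rank, block_range(n, p, rank))
    # prints: 0 (0, 3)   1 (3, 6)   2 (6, 8)   3 (8, 10)
```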
II.2 What are the problems of Parallel Computing? Numerical Analysis Point of View
There is a family of « universal algorithms », such as the Gaussian elimination algorithm used to solve linear systems.
Is it possible to design a parallel version of the algorithm?
Main difficulty: three levels of loops with variable sizes (see the sketch below).
Look for parallel libraries…
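To make the difficulty concrete, here is a minimal Python sketch (mine, not the course's) of Gaussian elimination without pivoting: the two inner loops shrink as the pivot index k grows, so the work per step is variable and an even distribution of work among processors is non-trivial.

```python
import numpy as np

def gauss_eliminate(A, b):
    """Reduce A x = b to upper triangular form, in place.
    Sketch only: no pivoting, assumes A[k, k] != 0."""
    n = len(b)
    for k in range(n - 1):            # level 1: loop over the pivots
        for i in range(k + 1, n):     # level 2: rows below the pivot (shrinks with k)
            m = A[i, k] / A[k, k]
            for j in range(k, n):     # level 3: entries right of the pivot (shrinks with k)
                A[i, j] -= m * A[k, j]
            b[i] -= m * b[k]
    return A, b
```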
Many algorithms involve scalar products, computations of norms, and more generally reductions:
data size → mapping → data size
N×N distributed data → one number in memory, or N data
Bottleneck? (See the sketch below.)
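For illustration (my sketch, not from the slides): a dot product reduces 2N distributed data to a single number. The partial sums proceed in parallel, but combining the p partial results is a global operation and the potential bottleneck.

```python
import numpy as np

def parallel_dot(x, y, p):
    """Emulate a p-processor dot product: local partial sums (parallel),
    followed by a global reduction of p numbers (the serial bottleneck)."""
    chunks = zip(np.array_split(x, p), np.array_split(y, p))
    partial = [float(xi @ yi) for xi, yi in chunks]   # done in parallel
    return sum(partial)                               # global reduction

x = np.arange(6.0)
y = np.ones(6)
print(parallel_dot(x, y, 3))   # 15.0
```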
II.2 What are the problems of Parallel Computing? Numerical Analysis Point of View
« Best numerical methods » have:
• adaptivity → irregular data structures and/or irregular meshes
• implicit schemes → global communications
• high-order accuracy → the size of the problem might be limited by the condition number, i.e. by the available arithmetic
« Best algorithms for parallel computers with many CPUs » have:
• regular data structures
• local communications
• the largest problem that fits in memory!
II.2 What are the problems of Parallel Computing? Numerical Analysis Point of View
New computer technology has always led to:
• new methods and algorithms
• new theorems
• new implementation styles!
What about evaluating the performance of a parallel algorithm?
II.3 What are the problems of Parallel Computing? Scientific Computing Point of View
Example from the numerical simulation of superconductivity:
• large-scale computation
• not a short-term issue
• numerical simulation is a fundamental tool
• pluridisciplinary team: materials science + numerics + parallel computing
One has a set of problemS, a set of modelS (assuming there are good models!), and a set of numerical methodS.
Questions:
• For which problems is a parallel architecture useful?
• What is the best model for a given parallel architecture?
• What is the best method for a given parallel architecture?
Further question: validation and verification of the code.
II.4 What are the problems of Parallel Computing? Industrial Point of View
Hypothesis: one has a production code that is
• more than 10,000 lines, and usually much more for pre- and post-processing;
• more than 10 man-years of development;
• running efficiently on a vector machine, or possibly on a RISC workstation.
Goal: get the production code running on a parallel computer at a much LOWER COST (including development, cost of the computer, and cost of maintenance), with or without reduced elapsed time, with or without increased accuracy, and with STABILITY and ROBUSTNESS.
Further question: validation and verification of the code in a parallel computing environment.
Problems:
• Which fraction of the code can be parallelized?
• Which fraction of the code can be solved with a new (parallel) algorithm?
• Can we know a priori what the cost will be? The computing time?
III Architecture of Parallel Computers:
• Organization of the memory: distributed vs shared
• Memory addressing: local vs global
• Memory access: uniform (UMA) vs non-uniform (NUMA)
• Granularity: flops performance of the individual processors
• Topology of the communication network: torus, hypercube, grid, …
[Diagrams: butterfly topology; multiprocessor with shared memory; multiprocessor with distributed memory and local addressing; multiprocessor with distributed memory and global addressing.]
III Architecture of Parallel Computers:
Partitioning of the data:
• Shared memory: the memory is divided into modules, and the data are distributed among the modules.
• Distributed memory: the data must be stored in, or moved to, the private memory of the right processor.
Access to the data:
• Shared memory: the data are available to all processors.
• Distributed memory: the data move through a high-speed network.
Latency (as a consequence of simultaneous accesses to the data):
• Shared memory: one has (partial) copies of the data in cache memory. Coherence of the data must be maintained, and this may slow down the process!
• Distributed memory: the data should be sent to the target processors as soon as they are produced by the « source processors ».
Mechanism of control: Single Instruction Multiple Data (SIMD)
vs
Multiple Instruction Multiple Data (MIMD).
The set of CPUs can process
each instruction for all the data (cf. Fortran 90),
example: process an array at once, A = B + C, where A, B and C are arrays;
or
execute different instructions or pieces of code on different data,
example: coupling models / coupling codes (see the sketch below):
if processor is one
    advance chemistry in time
else
    advance fluid in time
end
exchange(flow field, heat)
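Here is a sketch of both mechanisms in Python (my illustration, not the course's; the MIMD half assumes mpi4py, the physics routines are placeholders, and it would be launched with `mpiexec -n 2`):

```python
import numpy as np

# SIMD style (cf. Fortran 90 array syntax): one instruction, all the data.
B = np.ones((4, 4))
C = 2.0 * np.ones((4, 4))
A = B + C                        # whole arrays combined in a single statement

# MIMD style: different code paths on different data, e.g. two coupled codes.
from mpi4py import MPI

def advance_chemistry():         # hypothetical chemistry solver step
    return {"heat": 1.0}

def advance_fluid():             # hypothetical fluid solver step
    return {"flow_field": np.zeros(4)}

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    out = advance_chemistry()
    comm.send(out, dest=1)       # exchange (flow field, heat)
    other = comm.recv(source=1)
else:
    out = advance_fluid()
    other = comm.recv(source=0)
    comm.send(out, dest=0)
```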
III Influence of Architecture on Programming and Performance
Memory organization    Shared   Distributed
Programming            easy     hard
Parallel efficiency    hard     easy
III Memory Hierarchy:
Level   Name          Size     Technology                            Access time (ns)   Bandwidth (MB/s)   Managed by         Backed up by
1       Registers     < 1 KB   CMOS or BiCMOS                        2-5                4,000-32,000       compiler           cache
2       Cache         < 4 MB   on-chip or off-chip CMOS SRAM         3-10               800-5,000          hardware           main memory
3       Main memory   < 4 GB   CMOS DRAM                             80-400             400-2,000          operating system   disk
4       Network       -        ATM, FDDI, memory channel, crossbar   1,000-100,000      10-800             operating system   disk/tape
5       Disk backup   > 1 GB   magnetic or optical disk              5,000,000          4-32               operating system   -
[Diagram: CPU with registers and cache, connected via the memory bus to main memory, and via the I/O bus to disks and the network.]
The architectures of high-performance computers, and of high-end processors, are characterized by a hierarchy of several layers of memory: going up the hierarchy, memory speed and memory cost increase while memory size decreases.
III Memory Hierarchy:
Consequently, locality and reuse of data are critical to obtaining good performance. Recall that one usually gets only a small fraction of the peak performance of the processors (see the sketch below).
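As an illustration of locality and reuse (my sketch, not from the course): blocking a matrix product so that three bs × bs sub-blocks fit in cache, letting every loaded element be reused about bs times. In practice one would simply call A @ B, which dispatches to a blocked vendor BLAS; the explicit loops only expose the structure.

```python
import numpy as np

def blocked_matmul(A, B, bs=64):
    """C = A B computed block by block: the working set of three
    bs x bs blocks fits in cache and each block is reused bs times."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            for k in range(0, n, bs):
                # each block of A and B is loaded once and reused
                C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
    return C

A = np.random.rand(128, 128)
B = np.random.rand(128, 128)
assert np.allclose(blocked_matmul(A, B), A @ B)
```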
Basic Linear Algebra Subroutines (BLAS; cf. J. Dongarra et al.)
Goal: increase the portability and performance of a code; the kernels are often provided by the vendors.
Example 1: a x + y (BLAS 1, operation on vectors).
Example 2: A x + y (BLAS 2, matrix/vector).
Example 3: A B + C (BLAS 3, matrix/matrix).
BLAS      loads and stores   flops   ratio
level 1   3n                 2n      3:2
level 2   n² + 3n            2n²     1:2
level 3   4n²                2n³     2:n
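The three kernel shapes written with NumPy (a sketch; NumPy delegates the matrix-vector and matrix-matrix products to an underlying BLAS):

```python
import numpy as np

n = 1000
a = 2.0
x, y = np.random.rand(n), np.random.rand(n)
A, B, C = (np.random.rand(n, n) for _ in range(3))

y1 = a * x + y     # BLAS 1 (axpy): 2n flops for ~3n loads/stores
y2 = A @ x + y     # BLAS 2 (gemv): 2n^2 flops for ~n^2 loads/stores
C3 = A @ B + C     # BLAS 3 (gemm): 2n^3 flops for ~4n^2 loads/stores
```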
IV Performance Evaluation:
Definitions of speed-up and efficiency:
T_1 = elapsed time of the sequential process on 1 processor,
T_p = elapsed time of the parallel process on p processors,
speed-up:    S_p = T_1 / T_p,
efficiency:  E_p = T_1 / (p T_p),
therefore    E_p = S_p / p.
Pseudo-theorem: S_p ≤ p and E_p ≤ 1. But in practice cache memory effects may give super-linear speed-up for large problem sizes.
Amdahl's Law
Hypothesis: fraction of sequential process = s, 0 ≤ s ≤ 1;
fraction of parallel process = 1 − s.
Then
T_p = s T_1 + (1 − s) T_1 / p,
S_p = 1 / (s + (1 − s)/p),
E_p = 1 / (s p + (1 − s)).
Corollary: E_p ≤ 1/(s p), and E_p → 0 when p → ∞.
However, Amdahl's law is valid only if the fraction of the work that has to be sequential is independent of the size of the problem, which is not true in general. Most likely the fraction of sequential work decreases when the size of the problem increases; then:
Gustafson's Law
The maximum performance of the parallel code corresponds to the largest problem that fits into the main memory (i.e. no swap between memory and disk).
Hypothesis: T_p = 1;
fraction of sequential time: s;
fraction of parallel work: 1 − s.
Then
T_1 = s + (1 − s) p,
S_p = T_1 / T_p = s + (1 − s) p.
Example: s = 0.1, p = 10. If the fraction of sequential time is fixed (Gustafson's law), S = 9.1; if the fraction of sequential time grows linearly with p, Amdahl's law gives S = 5.3. In this last case the speed-up will not increase beyond 10 even if p goes to infinity! (See the sketch below.)
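The example above, checked numerically (a small sketch of both laws):

```python
def amdahl(s, p):
    """Fixed-size speed-up: the sequential fraction s of the work is constant."""
    return 1.0 / (s + (1.0 - s) / p)

def gustafson(s, p):
    """Scaled speed-up: the sequential fraction s of the *time* is constant."""
    return s + (1.0 - s) * p

s, p = 0.1, 10
print(f"Gustafson: {gustafson(s, p):.1f}")          # 9.1
print(f"Amdahl:    {amdahl(s, p):.1f}")             # 5.3
print(f"Amdahl limit as p -> infinity: {1/s:.0f}")  # 10
```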
IV Performance Evaluation
Scalability:
Scalability characterizes the ability of an algorithm to make useful use of additional processors.
An algorithm is scalable when, as the number of processors increases, the efficiency can be kept constant by increasing the size of the problem. (Hopefully, this constant is close to one!)
Example: consider an algorithm of complexity n: the algorithm is perfectly scalable if the elapsed time stays constant when n → 2n and p → 2p.
Remark: an algorithm might be scalable but useless if the elapsed time increases too much for its practical use.
[Plots: speed-up vs. number of processors, log-log scales up to 10^3 and 10^4 processors, comparing the ideal speed-up with algorithms 1, 2 and 3 below.]
IV Modeling of the parallel efficiency of an algorithm:
One considers 3 algorithms to solve a problem of size n with arithmetic complexity n + n²:
1. T_p = n + n²/p
2. T_p = (n + n²)/p + 100
3. T_p = (n + n²)/p + 0.6 p²
The speed-up is about 10.8 for p = 12 processors and n = 100 in all three cases; however, the behaviors of the algorithms diverge as n and p increase (see the sketch below).
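A small sketch evaluating the three models numerically (based on the complexities reconstructed above):

```python
def speedups(n, p):
    """Speed-up of the three model algorithms for problem size n on p CPUs."""
    t1 = n + n**2                       # sequential time
    tp = [n + n**2 / p,                 # algorithm 1: the n term is not parallelized
          (n + n**2) / p + 100,         # algorithm 2: fixed overhead
          (n + n**2) / p + 0.6 * p**2]  # algorithm 3: overhead grows with p
    return [t1 / t for t in tp]

print([round(s, 1) for s in speedups(100, 12)])     # all close to 10.8
print([round(s, 1) for s in speedups(1000, 1000)])  # behaviors now differ widely
```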
IV Modeling of the parallel performance of an algorithm:
Analysis and prediction of performance by means of asymptotic analysis and extrapolation is not quite safe!
Most often, mathematical models are idealized and neglect many architectural and software complexities.
Asymptotic estimates may hold only for n and p too large to be realistic.
Example: if the arithmetic complexity is 10n + n log(n),
10n is larger than n log(n) for n < 1024.
Constant factors in the estimates can make the difference in practice:
Example: a complexity of 10 n² is better than a complexity of 1000 n log(n) for n < 996.
IV.2 Modeling of the Elapsed Time of a Parallel Algorithm
The execution time is the elapsed time between the moment when the first processor starts its work and the moment when the last processor ends its work.
During this execution time a processor can either compute, communicate, wait (or be on strike?).
The total execution time for processor j is therefore:
T^j = T^j_comp + T^j_comm + T^j_idle.
This is a very naive point of view: we will see more realistic models based on applications! (A sketch of the additive model follows.)
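A minimal sketch of this additive model (my illustration): the elapsed time of the run is the largest per-processor total.

```python
def execution_time(t_comp, t_comm, t_idle):
    """Elapsed time of a parallel run: the first processor starts, the last
    one finishes, so take the max over processors of T_comp + T_comm + T_idle."""
    per_proc = [c + m + i for c, m, i in zip(t_comp, t_comm, t_idle)]
    return max(per_proc)

# Four processors with unbalanced work: the slowest one sets the pace.
print(execution_time([4.0, 3.0, 5.0, 2.0],
                     [0.5, 0.5, 0.5, 0.5],
                     [1.5, 2.5, 0.0, 3.5]))  # 6.0
```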
V References
• D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, Prentice Hall, 1989.
• K. Dowd and C. R. Severance, High Performance Computing, 2nd ed., O'Reilly, 1998.
• J. J. Dongarra, I. S. Duff, D. C. Sorensen and H. A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, SIAM, 1991.
• W. Gropp, E. Lusk and A. Skjellum, Using MPI, Scientific and Engineering Computation, MIT Press, 1994.
• I. Foster, Designing and Building Parallel Programs, Addison-Wesley, 1994.
• Y. Robert, The Impact of Vector and Parallel Architectures on the Gaussian Elimination Algorithm, Manchester University Press and Wiley, 1990.
• E. F. Van de Velde, Concurrent Scientific Computing, TAM 16, Springer-Verlag, 1992.
• Parallel Processing for Scientific Computing (SIAM conference series).
• Parallel CFD Conference, http://www.parcfd.org.
• Domain Decomposition Conference, http://www.ddm.org.
• The journal Parallel Computing.