I Why Parallel Computers?
II What are the problems of Parallel Computing?
III Architecture of Parallel Computers
IV Performance Evaluation
V References

Introduction and Motivation
M. Garbey, COSC
I.1 Why Parallel Computers?
A parallel computer is for:
Solving a LARGER problem
Solving the problem FASTER?
Naive example: the construction of a wall with many workers.
[Figure: time to build the wall vs. the dimension/size of the problem, as workers are added.]
I.2 Why Parallel Computers?
The price of a single CPU grows faster than linearly with its speed.
One hopes that the price of a parallel computer will increase only linearly with speed!
The hardware of a parallel computer is easy to build from off-the-shelf components and processors, reducing development time and costs.
A parallel computer is « ten times » cheaper than a vector computer of the same performance.
A network of PCs is « ten times » cheaper than a parallel computer.
What about parallel software?
I.3 Why Parallel Computers?
Computer clocks are approaching fundamental speed limits (with currently known technology):
Speed of light: 30 cm per nanosecond.
In copper wire the limiting speed is 9 cm per nanosecond.
Miniaturization of CPU chips adds heat-dissipation problems.
Clock speeds on workstations have increased over the past few years from 1 MHz to 100 MHz, and now up to 1 GHz and more.
According to Moore's Law, processor speed AND memory increase by a factor of two every 18 months.
Note: this exponential growth gives a factor of 1000 in 15 years (ten doublings, 2^10 ≈ 1000) and 1,000,000 in 30 years. However, 2010 seems to be the limit for this type of growth.
I.3 Why Parallel Computers?
Computer          Year   Cycle time (ns)   Memory
IBM 7090          60's   2000              32 K
CDC 6600          1964   100               128 K
CDC 7600          1969   27.5              128 K + 512 K
CRAY 1            1976   12.5              2 M
CDC Cyber 205     1979   20                4 M
Fujitsu VP200     1983   7.5               64 M
CRAY 2            1984   4.1               256 M
CRAY X-MP/4       1985   8.5               16 M
NEC SX-2          1984   6                 128 M
CRAY Y-MP/8       1988   6                 128 M
Fujitsu VP 2600   1990   3.2               256 M
NEC SX-3          1990   2.9               256 M
CRAY Y-MP 16      1992   4                 512 M
[Figure: chip density and temperature.]
I.4 Why Parallel Computers?
There is a need for large-scale computing.
Examples from industry:
• Weather forecast
• Direct numerical simulation of an airplane
• Car crash
• Safety of nuclear reactors
• Drug design
• Genome
Examples from the academic world:
• Modeling of molecules: structure, chemical reactions, biochemistry (linear algebra and eigenvalue problems)
• Simulation of superconductors: vortex dynamics, phase transitions (and many other examples in materials science)
• Climate modeling and air pollution
• Genome
More applications:
• Nuclear explosion simulations
• Nuclear waste disposal
II What are the problems of Parallel Computing?
Asking the right questions may help to solve the right problems…
• Computer science point of view
• Numerical analysis point of view
• Scientific computing point of view
• Industrial point of view
We briefly review these four points of view.
II.1 What are the problems of Parallel Computing? Computer Science Point of View
Partitioning: the data must be distributed among the memory units in order to be used in parallel by the CPUs (see the sketch after the diagram below).
Access to data and bandwidth: the CPUs must have parallel paths to the data they need simultaneously.
Latency: latency must not become a bottleneck, i.e. simultaneous accesses to data may cause pauses, but they should not decrease the processing speed.
[Diagram: CPUs connected through a network to memory units; a control unit?]
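As a concrete illustration of partitioning, here is a minimal Python sketch of my own (the helper block_range is hypothetical, not part of the course material): each of the p processors owns one contiguous block of the n data items.

```python
def block_range(n, p, rank):
    """Return the half-open index range [lo, hi) owned by processor `rank`
    when n items are split into p nearly equal contiguous blocks."""
    base, extra = divmod(n, p)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi

if __name__ == "__main__":
    n, p = 10, 4
    for rank in range(p):
        print(rank, block_range(n, p, rank))
    # prints: 0 (0, 3)   1 (3, 6)   2 (6, 8)   3 (8, 10)
```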
II.2 What are the problems of Parallel Computing? Numerical Analysis Point of View
There is a family of « universal algorithms », such as the Gaussian elimination algorithm used to solve linear systems.
Is it possible to design a parallel version of the algorithm?
Main difficulty: three levels of loops with variable sizes (see the sketch below).
Look for parallel libraries…
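To make the difficulty concrete, here is a minimal Python sketch (mine, not the course's) of Gaussian elimination without pivoting: the two inner loops shrink as the pivot index k grows, so the work per step is variable and an even distribution of work among processors is non-trivial.

```python
import numpy as np

def gauss_eliminate(A, b):
    """Reduce A x = b to upper triangular form, in place.
    Sketch only: no pivoting, assumes A[k, k] != 0."""
    n = len(b)
    for k in range(n - 1):            # level 1: loop over the pivots
        for i in range(k + 1, n):     # level 2: rows below the pivot (shrinks with k)
            m = A[i, k] / A[k, k]
            for j in range(k, n):     # level 3: entries right of the pivot (shrinks with k)
                A[i, j] -= m * A[k, j]
            b[i] -= m * b[k]
    return A, b
```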
Many algorithms involve scalar products, computations of norms, and more generally reductions:
data size → mapping → data size
N×N distributed data → one number in memory, or N data
Bottleneck? (See the sketch below.)
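For illustration (my sketch, not from the slides): a dot product reduces 2N distributed data to a single number. The partial sums proceed in parallel, but combining the p partial results is a global operation and the potential bottleneck.

```python
import numpy as np

def parallel_dot(x, y, p):
    """Emulate a p-processor dot product: local partial sums (parallel),
    followed by a global reduction of p numbers (the serial bottleneck)."""
    chunks = zip(np.array_split(x, p), np.array_split(y, p))
    partial = [float(xi @ yi) for xi, yi in chunks]   # done in parallel
    return sum(partial)                               # global reduction

x = np.arange(6.0)
y = np.ones(6)
print(parallel_dot(x, y, 3))   # 15.0
```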
II.2 What are the problems of Parallel Computing? Numerical Analysis Point of View
« Best numerical methods » have:
• adaptivity → irregular data structures and/or irregular meshes
• implicit schemes → global communications
• high-order accuracy → the size of the problem might be limited by the condition number, i.e. by the available arithmetic
« Best algorithms for parallel computers with many CPUs » have:
• regular data structures
• local communications
• the largest problem that fits in memory!
II.2 What are the problems of Parallel Computing? Numerical Analysis Point of View
New computer technology has always led to:
• new methods and algorithms
• new theorems
• new implementation styles!
What about evaluating the performance of a parallel algorithm?
II.3 What are the problems of Parallel Computing? Scientific Computing Point of View
Example from the numerical simulation of superconductivity:
• large-scale computation
• not a short-term issue
• numerical simulation is a fundamental tool
• pluridisciplinary team: materials science + numerics + parallel computing
One has a set of problemS, a set of modelS (assuming there are good models!), and a set of numerical methodS.
Questions:
• For which problems is a parallel architecture useful?
• What is the best model for a given parallel architecture?
• What is the best method for a given parallel architecture?
Further question: validation and verification of the code.
II.4 What are the problems of Parallel Computing? Industrial Point of View
Hypothesis: one has a production code that is
• more than 10,000 lines, and usually much more for pre- and post-processing;
• more than 10 man-years of development;
• running efficiently on a vector machine, or possibly on a RISC workstation.
Goal: get the production code running on a parallel computer at a much LOWER COST (including development, cost of the computer, and cost of maintenance), with or without reduced elapsed time, with or without increased accuracy, and with STABILITY and ROBUSTNESS.
Further question: validation and verification of the code in a parallel computing environment.
Problems:
• Which fraction of the code can be parallelized?
• Which fraction of the code can be solved with a new (parallel) algorithm?
• Can we know a priori what the cost will be? The computing time?
III Architecture of Parallel Computers:
• Organization of the memory: distributed vs shared
• Memory addressing: local vs global
• Memory access: uniform (UMA) vs non-uniform (NUMA)
• Granularity: flops performance of the individual processors
• Topology of the communication network: torus, hypercube, grid, …
[Diagrams: butterfly topology; multiprocessor with shared memory; multiprocessor with distributed memory and local addressing; multiprocessor with distributed memory and global addressing.]
III Architecture of Parallel Computers:
Partitioning of the data:
• Shared memory: the memory is divided into modules, and the data are distributed among the modules.
• Distributed memory: the data must be stored in, or moved to, the private memory of the right processor.
Access to the data:
• Shared memory: the data are available to all processors.
• Distributed memory: the data move through a high-speed network.
Latency (as a consequence of simultaneous accesses to the data):
• Shared memory: one has (partial) copies of the data in cache memory. Coherence of the data must be maintained, and this may slow down the process!
• Distributed memory: the data should be sent to the target processors as soon as they are produced by the « source processors ».
Mechanism of control: Single Instruction Multiple Data (SIMD)
vs
Multiple Instruction Multiple Data (MIMD).
The set of CPUs can process
each instruction for all the data (cf. Fortran 90),
example: process an array at once, A = B + C, where A, B and C are arrays;
or
execute different instructions or pieces of code on different data,
example: coupling models / coupling codes (see the sketch below):
if processor is one
    advance chemistry in time
else
    advance fluid in time
end
exchange(flow field, heat)
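Here is a sketch of both mechanisms in Python (my illustration, not the course's; the MIMD half assumes mpi4py, the physics routines are placeholders, and it would be launched with `mpiexec -n 2`):

```python
import numpy as np

# SIMD style (cf. Fortran 90 array syntax): one instruction, all the data.
B = np.ones((4, 4))
C = 2.0 * np.ones((4, 4))
A = B + C                        # whole arrays combined in a single statement

# MIMD style: different code paths on different data, e.g. two coupled codes.
from mpi4py import MPI

def advance_chemistry():         # hypothetical chemistry solver step
    return {"heat": 1.0}

def advance_fluid():             # hypothetical fluid solver step
    return {"flow_field": np.zeros(4)}

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    out = advance_chemistry()
    comm.send(out, dest=1)       # exchange (flow field, heat)
    other = comm.recv(source=1)
else:
    out = advance_fluid()
    other = comm.recv(source=0)
    comm.send(out, dest=0)
```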
III Influence of Architecture on Programming and Performance
Memory organization    Shared   Distributed
Programming            easy     hard
Parallel efficiency    hard     easy
III Memory Hierarchy:
Level   Name          Size     Technology                            Access time (ns)   Bandwidth (MB/s)   Managed by         Backed up by
1       Registers     < 1 KB   CMOS or BiCMOS                        2-5                4,000-32,000       compiler           cache
2       Cache         < 4 MB   on-chip or off-chip CMOS SRAM         3-10               800-5,000          hardware           main memory
3       Main memory   < 4 GB   CMOS DRAM                             80-400             400-2,000          operating system   disk
4       Network       -        ATM, FDDI, memory channel, crossbar   1,000-100,000      10-800             operating system   disk/tape
5       Disk backup   > 1 GB   magnetic or optical disk              5,000,000          4-32               operating system   -
[Diagram: CPU with registers and cache, connected via the memory bus to main memory, and via the I/O bus to disks and the network.]
The architectures of high-performance computers, and of high-end processors, are characterized by a hierarchy of several layers of memory: going up the hierarchy, memory speed and memory cost increase while memory size decreases.
III Memory Hierarchy:
Consequently, locality and reuse of data are critical to obtaining good performance. Recall that one usually gets only a small fraction of the peak performance of the processors (see the sketch below).
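As an illustration of locality and reuse (my sketch, not from the course): blocking a matrix product so that three bs × bs sub-blocks fit in cache, letting every loaded element be reused about bs times. In practice one would simply call A @ B, which dispatches to a blocked vendor BLAS; the explicit loops only expose the structure.

```python
import numpy as np

def blocked_matmul(A, B, bs=64):
    """C = A B computed block by block: the working set of three
    bs x bs blocks fits in cache and each block is reused bs times."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            for k in range(0, n, bs):
                # each block of A and B is loaded once and reused
                C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
    return C

A = np.random.rand(128, 128)
B = np.random.rand(128, 128)
assert np.allclose(blocked_matmul(A, B), A @ B)
```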
Basic Linear Algebra Subroutines (BLAS; cf. J. Dongarra et al.)
Goal: increase the portability and performance of a code; the kernels are often provided by the vendors.
Example 1: a x + y (BLAS 1, operation on vectors).
Example 2: A x + y (BLAS 2, matrix/vector).
Example 3: A B + C (BLAS 3, matrix/matrix).
BLAS      loads and stores   flops   ratio
level 1   3n                 2n      3:2
level 2   n² + 3n            2n²     1:2
level 3   4n²                2n³     2:n
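The three kernel shapes written with NumPy (a sketch; NumPy delegates the matrix-vector and matrix-matrix products to an underlying BLAS):

```python
import numpy as np

n = 1000
a = 2.0
x, y = np.random.rand(n), np.random.rand(n)
A, B, C = (np.random.rand(n, n) for _ in range(3))

y1 = a * x + y     # BLAS 1 (axpy): 2n flops for ~3n loads/stores
y2 = A @ x + y     # BLAS 2 (gemv): 2n^2 flops for ~n^2 loads/stores
C3 = A @ B + C     # BLAS 3 (gemm): 2n^3 flops for ~4n^2 loads/stores
```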
IV Performance Evaluation:
Definitions of speed-up and efficiency:
T_1 = elapsed time of the sequential process on 1 processor,
T_p = elapsed time of the parallel process on p processors,
speed-up:    S_p = T_1 / T_p,
efficiency:  E_p = T_1 / (p T_p),
therefore    E_p = S_p / p.
Pseudo-theorem: S_p ≤ p and E_p ≤ 1. But in practice cache memory effects may give super-linear speed-up for large problem sizes.
Amdahl's Law
Hypothesis: fraction of sequential process = s, 0 ≤ s ≤ 1;
fraction of parallel process = 1 − s.
Then
T_p = s T_1 + (1 − s) T_1 / p,
S_p = 1 / (s + (1 − s)/p),
E_p = 1 / (s p + (1 − s)).
Corollary: E_p ≤ 1/(s p), and E_p → 0 when p → ∞.
However, Amdahl's law is valid only if the fraction of the work that has to be sequential is independent of the size of the problem, which is not true in general. Most likely the fraction of sequential work decreases when the size of the problem increases; then:
Gustafson's Law
The maximum performance of the parallel code corresponds to the largest problem that fits into the main memory (i.e. no swap between memory and disk).
Hypothesis: T_p = 1;
fraction of sequential time: s;
fraction of parallel work: 1 − s.
Then
T_1 = s + (1 − s) p,
S_p = T_1 / T_p = s + (1 − s) p.
Example: s = 0.1, p = 10. If the fraction of sequential time is fixed (Gustafson's law), S = 9.1; if the fraction of sequential time grows linearly with p, Amdahl's law gives S = 5.3. In this last case the speed-up will not increase beyond 10 even if p goes to infinity! (See the sketch below.)
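The example above, checked numerically (a small sketch of both laws):

```python
def amdahl(s, p):
    """Fixed-size speed-up: the sequential fraction s of the work is constant."""
    return 1.0 / (s + (1.0 - s) / p)

def gustafson(s, p):
    """Scaled speed-up: the sequential fraction s of the *time* is constant."""
    return s + (1.0 - s) * p

s, p = 0.1, 10
print(f"Gustafson: {gustafson(s, p):.1f}")          # 9.1
print(f"Amdahl:    {amdahl(s, p):.1f}")             # 5.3
print(f"Amdahl limit as p -> infinity: {1/s:.0f}")  # 10
```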
IV Performance Evaluation
Scalability:
Scalability characterizes the ability of an algorithm to make useful use of additional processors.
An algorithm is scalable when, as the number of processors increases, the efficiency can be kept constant by increasing the size of the problem. (Hopefully, this constant is close to one!)
Example: consider an algorithm of complexity n: the algorithm is perfectly scalable if the elapsed time stays constant when n → 2n and p → 2p.
Remark: an algorithm might be scalable but useless if the elapsed time increases too much for its practical use.
[Plots: speed-up vs. number of processors, log-log scales up to 10^3 and 10^4 processors, comparing the ideal speed-up with algorithms 1, 2 and 3 below.]
IV Modeling of the parallel efficiency of an algorithm:
One considers 3 algorithms to solve a problem of size n with arithmetic complexity n + n²:
1. T_p = n + n²/p
2. T_p = (n + n²)/p + 100
3. T_p = (n + n²)/p + 0.6 p²
The speed-up is about 10.8 for p = 12 processors and n = 100 in all three cases; however, the behaviors of the algorithms diverge as n and p increase (see the sketch below).
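A small sketch evaluating the three models numerically (based on the complexities reconstructed above):

```python
def speedups(n, p):
    """Speed-up of the three model algorithms for problem size n on p CPUs."""
    t1 = n + n**2                       # sequential time
    tp = [n + n**2 / p,                 # algorithm 1: the n term is not parallelized
          (n + n**2) / p + 100,         # algorithm 2: fixed overhead
          (n + n**2) / p + 0.6 * p**2]  # algorithm 3: overhead grows with p
    return [t1 / t for t in tp]

print([round(s, 1) for s in speedups(100, 12)])     # all close to 10.8
print([round(s, 1) for s in speedups(1000, 1000)])  # behaviors now differ widely
```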
IV Modeling of the parallel performance of an algorithm:
Analysis and prediction of performance by means of asymptotic analysis and extrapolation is not quite safe!
Most often, mathematical models are idealized and neglect many architectural and software complexities.
Asymptotic estimates may hold only for n and p too large to be realistic.
Example: if the arithmetic complexity is 10n + n log(n),
10n is larger than n log(n) for n < 1024.
Constant factors in the estimates can make the difference in practice:
Example: a complexity of 10 n² is better than a complexity of 1000 n log(n) for n < 996.
IV.2 Modeling of the Elapsed Time of a Parallel Algorithm
The execution time is the elapsed time between the moment when the first processor starts its work and the moment when the last processor ends its work.
During this execution time a processor can either compute, communicate, wait (or be on strike?).
The total execution time for processor j is therefore:
T^j = T^j_comp + T^j_comm + T^j_idle.
This is a very naive point of view: we will see more realistic models based on applications! (A sketch of the additive model follows.)
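A minimal sketch of this additive model (my illustration): the elapsed time of the run is the largest per-processor total.

```python
def execution_time(t_comp, t_comm, t_idle):
    """Elapsed time of a parallel run: the first processor starts, the last
    one finishes, so take the max over processors of T_comp + T_comm + T_idle."""
    per_proc = [c + m + i for c, m, i in zip(t_comp, t_comm, t_idle)]
    return max(per_proc)

# Four processors with unbalanced work: the slowest one sets the pace.
print(execution_time([4.0, 3.0, 5.0, 2.0],
                     [0.5, 0.5, 0.5, 0.5],
                     [1.5, 2.5, 0.0, 3.5]))  # 6.0
```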
V References
• D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, Prentice Hall, 1989.
• K. Dowd and C. R. Severance, High Performance Computing, 2nd ed., O'Reilly, 1998.
• J. J. Dongarra, I. S. Duff, D. C. Sorensen and H. A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, SIAM, 1991.
• W. Gropp, E. Lusk and A. Skjellum, Using MPI, Scientific and Engineering Computation, MIT Press, 1994.
• I. Foster, Designing and Building Parallel Programs, Addison-Wesley, 1994.
• Y. Robert, The Impact of Vector and Parallel Architectures on the Gaussian Elimination Algorithm, Manchester University Press and Wiley, 1990.
• E. F. Van de Velde, Concurrent Scientific Computing, TAM 16, Springer-Verlag, 1992.
• Parallel Processing for Scientific Computing (SIAM conference series).
• Parallel CFD Conference, http://www.parcfd.org.
• Domain Decomposition Conference, http://www.ddm.org.
• The journal Parallel Computing.