Page 1: PARALLEL COMPUTING:   Models and Algorithms

PARALLEL COMPUTING:

Models and Algorithms

Course for Undergraduate Students in the 4th year

(Major in Computer Science-Software)

Instructors: Mihai L. Mocanu, Ph.D., Professor; Cristian M. Mihăescu, Ph.D., Lecturer; Cosmin M. Poteraș, Ph.D. Student, Assistant

E-mail: [email protected]  Office: Room 303  Office hours: Thursday 12:00-14:00  Course page: http://software.ucv.ro/~mocanu_mihai

(ask for the password and use the appropriate entry)

Page 2: PARALLEL COMPUTING:   Models and Algorithms

Course objectives

Understanding of basic concepts of parallel computing

• understand various approaches to parallel hardware architectures and their strong/weak points

• become familiar with typical software/programming approaches

• learn basic parallel algorithms and algorithmic techniques

• learn the jargon … so you understand what people are talking about

• be able to apply this knowledge

Page 3: PARALLEL COMPUTING:   Models and Algorithms

Course objectives (cont.)

Familiarity with Parallel Concepts and Techniques

• drastically flattening the learning curve in a parallel environment

Broad Understanding of Parallel Architectures and Programming Techniques

• be able to quickly adapt to any parallel programming environment

Flexibility

Page 4: PARALLEL COMPUTING:   Models and Algorithms

Textbooks and Working

Textbooks:

1. Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis - Introduction to Parallel Computing, Benjamin/Cummings 2003 (2nd Edition, ISBN 0-201-64865-2) or Benjamin/Cummings 1994 (1st Edition, ISBN 0-8053-3170-0)

2. Behrooz Parhami - Introduction to Parallel Processing: Algorithms and

Architectures, Kluwer Academic Publ, 2002

3. Dan Grigoras – Parallel Computing. From Systems to Applications, Computer Libris Agora, 2000, ISBN 973-97534-6-9

4. Mihai Mocanu – Algorithms and Languages for Parallel Processing, Publ.

University of Craiova, 1995

Laboratory and Projects:

1. Mihai Mocanu, Alexandru Patriciu – Parallel Computing in C for Unix and

Windows NT Networks, Publ. University of Craiova, 1998

2. Christopher H. Nevison et al. - Laboratories for Parallel Computing, Jones and

Bartlett, 1994

Other resources are on the web page

Page 5: PARALLEL COMPUTING:   Models and Algorithms

Topics Covered (overview)

• Fundamental Models (C 1..5)

• Introduction

• Parallel Programming Platforms

• Principles of Parallel Algorithm Design

• Basic Communication Operations

• Analytical Modeling of Parallel Programs

• Parallel Programming (C 6 & a part of C 7)

• Programming using Message Passing Paradigm

• Parallel Algorithms (C 8, 9, 10 & a part of C 11)

• Dense Matrix Algorithms

• Sorting

• Graph Algorithms

Page 6: PARALLEL COMPUTING:   Models and Algorithms

Topics in Detail I

1. Parallel Programming Platforms & Parallel Models

• logical and physical organization

• interconnection networks for parallel machines

• communication costs in parallel machines

• process-processor mappings, graph embeddings

Why?

It is better to be aware of the physical and economic constraints and tradeoffs of the parallel system you are designing for, than to be sorry later.

Page 7: PARALLEL COMPUTING:   Models and Algorithms

Topics in Detail II

2. Quick Introduction to PVM (Parallel Virtual Machine) and MPI

(Message Passing Interface)

• semantics and syntax of basic communication operations

• setting up your PVM/MPI environment, compiling and running PVM or MPI

programs

Why?

You can start to program simple parallel programs early on.
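For a first taste, here is a minimal MPI program (an illustrative sketch only; it assumes a standard MPI installation where programs are compiled with mpicc and launched with mpirun):

#include <stdio.h>
#include <mpi.h>

/* Minimal MPI program: every process reports its rank and the total
   number of processes in the job. */
int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);               /* set up the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* id of this process         */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* number of processes        */
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize();                       /* shut MPI down              */
    return 0;
}

Run, for example, with mpirun -np 4 ./hello; each of the four processes prints its own line.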

Page 8: PARALLEL COMPUTING:   Models and Algorithms

Topics in Detail III

3. Principles of Parallel Algorithm Design

• decomposition techniques

• load balancing

• techniques for reducing communication overhead

• parallel algorithm models

Why?

These are fundamental issues that appear in, and apply to, every parallel program. You really should learn this material by heart.

Page 9: PARALLEL COMPUTING:   Models and Algorithms

Topics in Detail IV

4. Implementation and Cost of Basic Communication Operations

• broadcast, reduction, scatter, gather, parallel prefix, …

Why?

These are fundamental primitives that you will use often, and you should know them well: not only what they do, but also how much they cost and when and how to use them.

Going through the details of their implementation lets us see how the principles from the previous topic are applied to relatively simple problems.
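As an illustration of two of these primitives, the sketch below (assuming MPI, introduced in topic 2) broadcasts a value from process 0 and then reduces the sum of all ranks back onto process 0:

#include <stdio.h>
#include <mpi.h>

/* One-to-all broadcast followed by an all-to-one sum reduction. */
int main(int argc, char *argv[]) {
    int rank, size, value = 0, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) value = 42;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);          /* broadcast */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM,
               0, MPI_COMM_WORLD);                             /* reduction */

    if (rank == 0)
        printf("broadcast value = %d, sum of ranks = %d\n", value, sum);
    MPI_Finalize();
    return 0;
}

Knowing how such collectives are implemented (covered later) tells you, for instance, that a tree-based broadcast to p processes is much cheaper than p − 1 separate sends.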

Page 10: PARALLEL COMPUTING:   Models and Algorithms

Topics in Detail V

5. Analytical Modeling of Parallel Programs

• sources of overhead

• execution time, speedup, efficiency, cost, Amdahl's law

Why?

Parallel programming is done to increase performance.

Debugging and profiling are extremely difficult in a parallel setting, so it is better to understand from the beginning what performance to expect from a given parallel program and, more generally, how to design parallel programs with low execution time. It is also important to know the limits of what can and cannot be done.

Page 11: PARALLEL COMPUTING:   Models and Algorithms

Topics in Detail VI

6. Parallel Dense Matrix Algorithms

• matrix vector multiplication

• matrix matrix multiplication

• solving systems of linear equations

7. Parallel Sorting

• odd-even transposition sort

• sorting networks, bitonic sort

• parallel quicksort

• bucket and sample sort

Why?

Classical problems with lots of applications, many interesting and useful

techniques exposed.

Page 12: PARALLEL COMPUTING:   Models and Algorithms

Topics in Detail VII

8. Parallel Graph Algorithms

• minimum spanning tree

• single-source shortest paths

• all-pairs shortest paths

• connected components

• algorithms for sparse graphs

9. Search Algorithms for Discrete Optimization Problems

• search overhead factor, speedup anomalies

• parallel depth-first search

Why?

As before, plus shows many examples of hard-to-parallelize problems.

Page 13: PARALLEL COMPUTING:   Models and Algorithms

Grading (tentative)

• 20% continuous test quizzes (T)
• 20% continuous practical laboratory assignments (L)
• 20% continuous practical evaluation through projects (P)
• 40% final written exam (E)

You have to get at least 50% on each continuous evaluation form (T, L and P) in order to be allowed to take the final exam during the session.

You have to get at least 50% on the final exam (E) to pass and obtain a mark greater than 5. All the grades obtained enter the computation of the final mark with the specified weights.

Page 14: PARALLEL COMPUTING:   Models and Algorithms

Assignments and evaluations

• assignments from your project for a total of 20 points

• mostly programming in C or C++ with threads or multiple processes, PVM or MPI, etc., implementing (relatively) simple algorithms and load balancing techniques, so make sure to check the lab info as soon as possible

• continuous evaluation based on some theoretical questions thrown in to prepare you better for the final exam

If you have problems with setting up your working environment and/or running your programs, ask the TA for help/advice. He is there to help you with that.

Use him, but do not abuse him with normal programming bugs.

Page 15: PARALLEL COMPUTING:   Models and Algorithms

Project (tentative)

Project:

• may be individual or done in groups of 2-3

• intermediary reports or presentations weigh 30% of the final grade

• required: programs + written documentation + final and intermediary presentations (2 by the end of the semester)

• three main types

• report on interesting non-covered algorithms

• report on interesting parallel applications

• not-so-trivial programming project

• final written report and presentation – due date: end of Jan.

Page 16: PARALLEL COMPUTING:   Models and Algorithms

Introduction

• Background

• Speedup. Amdahl’s Law

• The Context and Difficulties of Present-Day Parallel Computing

• Demand for computational speed. Grand challenge problems

• Global weather forecasting

• N-body problem: modeling motion of astronomical bodies

Page 17: PARALLEL COMPUTING:   Models and Algorithms

Background

• Parallel Computing: using more than one computer, or a computer with more than one processor, to solve a task

• Parallel computers (computers with more than one processor), and their way of programming - parallel programming - have been around for more than 40 years! Motives:

– Usually faster computation - the very simple idea that n computers operating simultaneously can achieve the result n times faster - it will not be n times faster, for various reasons

– Other motives include: fault tolerance, larger amount of memory available, ...

Page 18: PARALLEL COMPUTING:   Models and Algorithms

“... There is therefore nothing new in the idea of parallel

programming, but its application to computers. The author

cannot believe that there will be any insuperable difficulty in

extending it to computers. It is not to be expected that the

necessary programming techniques will be worked out

overnight. Much experimenting remains to be done. After all,

the techniques that are commonly used in programming today

were only won at the cost of considerable toil several years

ago. In fact the advent of parallel programming may do

something to revive the pioneering spirit in programming

which seems at the present to be degenerating into a rather

dull and routine occupation ...”

Gill, S. (1958), “Parallel Programming,” The Computer Journal, vol. 1, April 1958, pp. 2-10.

Page 19: PARALLEL COMPUTING:   Models and Algorithms

Speedup Factor

S(p) = Execution time using one processor (best sequential algorithm) / Execution time using a multiprocessor with p processors = ts / tp

The speedup factor can also be cast in terms of computational steps:

S(p) = Number of computational steps using one processor / Number of parallel computational steps with p processors

• S(p) gives the increase in speed obtained by using "a multiprocessor"

Hints:

• Use the best sequential algorithm with the single-processor system

• The underlying algorithm for the parallel implementation might be (and usually is) different

Page 20: PARALLEL COMPUTING:   Models and Algorithms

Maximum Speedup

• Is usually p with p processors (linear speedup)

The speedup factor is given by:

S(p) = ts / (f·ts + (1 − f)·ts/p) = p / (1 + (p − 1)·f)

This equation is known as Amdahl's law

Remark: It is possible, but unusual, to get superlinear speedup (greater than p), due to a specific reason such as:

– Extra memory in the multiprocessor system

– Nondeterministic algorithm

Page 21: PARALLEL COMPUTING:   Models and Algorithms

Maximum Speedup – Amdahl's Law

[Figure: (a) one processor – a serial section of length f·ts followed by parallelizable sections of total length (1 − f)·ts, overall time ts; (b) p processors – the parallelizable part takes (1 − f)·ts/p, giving a parallel time tp = f·ts + (1 − f)·ts/p]

Page 22: PARALLEL COMPUTING:   Models and Algorithms

Speedup against number of processors

• Even with an infinite number of processors, the maximum speedup is limited to 1/f
• Ex: With only 5% of the computation being serial, the maximum speedup is 20
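A small numerical check of this bound (an illustrative sketch in C; the 5% serial fraction is the figure from the slide above):

#include <stdio.h>

/* Amdahl's law: S(p) = p / (1 + (p-1)f), where f is the serial fraction. */
double speedup(double f, int p) {
    return p / (1.0 + (p - 1) * f);
}

int main(void) {
    int procs[] = {4, 16, 256, 4096};
    for (int i = 0; i < 4; i++)   /* with f = 0.05, S(p) approaches 1/f = 20 */
        printf("p = %4d  ->  S(p) = %5.2f\n", procs[i], speedup(0.05, procs[i]));
    return 0;
}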

Page 23: PARALLEL COMPUTING:   Models and Algorithms

Superlinear Speedup - Searching

(a) Searching each sub-space sequentially

[Figure: the sequential search examines the p sub-spaces one after another, each taking ts/p; the solution is found in sub-space x (x indeterminate) after a time of x · ts/p plus a further ∆t spent inside the sub-space holding the solution]

Page 24: PARALLEL COMPUTING:   Models and Algorithms

(b) Searching each sub-space in parallel

Speedup is given by:

S(p) = (x · ts/p + ∆t) / ∆t

Worst case for the sequential search is when the solution is found in the last sub-space searched. Then the parallel version offers the greatest benefit, i.e.

S(p) = ( ((p − 1)/p) · ts + ∆t ) / ∆t → ∞   as ∆t tends to zero

Least advantage for the parallel version is when the solution is found in the first sub-space searched by the sequential search, i.e.

S(p) = ∆t / ∆t = 1

Page 25: PARALLEL COMPUTING:   Models and Algorithms

The Context of Parallel Processing

Facts:

• The explosive growth of digital computer architectures

• The need for:

• a better understanding of various forms/ degrees of

concurrency

• user-friendliness, compactness and simplicity of code

• high performance but low cost, low power consumption a.o.

High-performance uniprocessors are increasingly complex and

expensive, and they have high power-consumption

They may also be under-utilized - mainly due to the lack of

appropriate software.

Page 26: PARALLEL COMPUTING:   Models and Algorithms

Possible trade-offs to achieve efficiency

What’s better?

• The use of one or a small number of such complex processors,

at one extreme, OR

• A moderate to very large number of simpler processors, at the

other

• The answer may seem simple, but there is a catch that forces us to first answer another question: how "good" is the communication between processors?

So:

• When combined with a high-bandwidth, but logically simple,

inter-processor communication facility, the latter approach may

lead to significant increase in efficiency, not only at the execution

but also in earlier stages (i.e. in the design process)

Page 27: PARALLEL COMPUTING:   Models and Algorithms

The Difficulties of Parallel Processing

Two major problems have, over the years, prevented the immediate and widespread adoption of such (moderately to) massively parallel architectures:

• the inter-processor communication bottleneck

• the difficulty, and thus high cost, of algorithmic/software development

How were these problems overcome?

• At very high clock rates, the link between the processor and

memory becomes very critical

→ integrated processor/memory design optimization

→ emergence of multiple-processor microchips

• The emergence of standard programming and communication

models has removed some of the concerns with compatibility

and software design issues in parallel processing


Page 29: PARALLEL COMPUTING:   Models and Algorithms

Demand for Computational Speed

• Continuous demand for greater computational speed

from a computer system than is usually possible

• Areas requiring great computational speed include

numerical modeling and simulation, scientific and

engineering problems etc.

• Remember: Computations must not only be completed,

but completed within a “reasonable” time period

Page 30: PARALLEL COMPUTING:   Models and Algorithms

Grand Challenge Problems

One that cannot be solved in a reasonable amount of

time with today’s computers. Obviously, an

execution time of 2 months is always unreasonable

Examples

• Modeling large DNA structures

• Global weather forecasting

• Modeling motion of astronomical bodies.

Page 31: PARALLEL COMPUTING:   Models and Algorithms

Global Weather Forecasting

• The atmosphere is modeled by dividing it into 3-dimensional cells

• Computations in each cell are repeated many times to model time passing

• Suppose the whole global atmosphere is divided into cells of size 1 mile × 1 mile × 1 mile to a height of 10 miles (10 cells high) - about 5 × 10^8 cells

• Suppose each calculation requires 200 floating point operations. In one time step, 10^11 floating point operations are necessary.

• To forecast the weather over 7 days using 1-minute intervals, a computer operating at 1 Gflops (10^9 flops) takes 10^6 s (> 10 days)

• To perform the calculation in 5 minutes requires a computer operating at 3.4 Tflops (3.4 × 10^12 flops)
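A quick arithmetic check of these estimates, using only the figures quoted on the slide (illustrative sketch):

#include <stdio.h>

/* Work = cells x flops-per-cell x time steps; then derive the running time
   at 1 Gflops and the speed needed for a 5-minute forecast. */
int main(void) {
    double cells = 5e8;                    /* 1-mile cells, 10 high      */
    double flops_per_cell = 200.0;         /* per cell, per time step    */
    double steps = 7.0 * 24 * 60;          /* 7 days, 1-minute intervals */
    double total = cells * flops_per_cell * steps;   /* ~1e15 flops      */

    printf("total work : %.2e flops\n", total);
    printf("at 1 Gflops: %.2e s (%.1f days)\n", total / 1e9, total / 1e9 / 86400);
    printf("in 5 min   : needs %.2e flops/s (~3.4 Tflops)\n", total / 300.0);
    return 0;
}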

Page 32: PARALLEL COMPUTING:   Models and Algorithms

Modeling Motion of Astronomical Bodies

• Bodies are attracted to each other by gravitational forces

• The movement of each body is predicted by calculating the total force on each body

• With N bodies, there are N − 1 forces to calculate for each body, or approx. N^2 calculations (N log2 N for an efficient approximate algorithm); after determining the new positions of the bodies, the calculations are repeated

• If a galaxy has, say, 10^11 stars, even if each calculation is done in 1 ms (an extremely optimistic figure), it takes 10^9 years for one iteration using the N^2 algorithm and almost a year for one iteration using an efficient N log2 N approximate algorithm.
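To see where the saving of the approximate algorithm comes from, the sketch below compares the number of force calculations per iteration for N = 10^11 bodies (only the operation counts from the slide are used; no timing claims):

#include <stdio.h>
#include <math.h>

/* Force calculations per iteration: direct N^2 method vs. an N log2 N
   approximation (e.g. a tree-based method), for N = 1e11 bodies. */
int main(void) {
    double N = 1e11;
    double direct = N * N;          /* ~1e22 pairwise calculations */
    double approx = N * log2(N);    /* ~3.7e12 calculations        */
    printf("direct N^2 : %.2e calculations\n", direct);
    printf("N log2 N   : %.2e calculations\n", approx);
    printf("ratio      : %.1e times fewer\n", direct / approx);
    return 0;
}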

Page 33: PARALLEL COMPUTING:   Models and Algorithms

Astrophysical N-body simulation – screen snapshot

Page 34: PARALLEL COMPUTING:   Models and Algorithms
Page 35: PARALLEL COMPUTING:   Models and Algorithms
Page 36: PARALLEL COMPUTING:   Models and Algorithms

PARALLEL COMPUTING: Models

and Algorithms

A course for the 4th year students

(Major in Computer Science - Software)

Parallel Programming Platforms

Page 37: PARALLEL COMPUTING:   Models and Algorithms

Contents

• Parallel Computing : definitions and terminology

• Historical evolution

• A taxonomy of parallel solutions

• Pipelining

• Functional parallelism

• Vector parallelism

• Multi-processing

• Multi-computing

Page 38: PARALLEL COMPUTING:   Models and Algorithms

Von Neumann constraints

[Figure: block diagram of the von Neumann machine – Input Unit, Output Unit, External Memory, Internal Memory, and a CPU comprising the Control Unit and the Arithmetic-Logic (A.L.) Unit]

Page 39: PARALLEL COMPUTING:   Models and Algorithms

Parallel Computing – What is it? (here, from the

platform point of view)

• Try a simple definition, fit for our purposes

• A historical overview: How did parallel platforms evolve?

Page 40: PARALLEL COMPUTING:   Models and Algorithms

WHAT IS PARALLEL

COMPUTING?

From the platform point of view, it is:

• Use of several processors/execution units in parallel to collectively solve a problem

• Ability to employ different processors/computers/machines to execute concurrently different parts of a single program

• Questions:

  • How big are the parts? (grain of parallelism) They can be an instruction, a statement, a procedure, or another size.

  • Parallelism defined in this way is loose, with plenty of overlap with distributed computing

Page 41: PARALLEL COMPUTING:   Models and Algorithms

PARALLEL COMPUTING AND

PROGRAMMING PLATFORMS

Definition for our purposes:

• We will mainly focus on relatively coarse grain

• Main goal: shorter running time!

• The processors are contributing to the solution of the same problem

• In distributed systems the problem is often one of coordination (e.g. leader election, commit, termination detection, ...)

• In parallel computing a problem involves lots of data and computation (e.g. matrix multiplication, sorting); communication is to be kept to an optimum

Page 42: PARALLEL COMPUTING:   Models and Algorithms

Terminology

Distributed System: A collection of multiple autonomous

computers, communicating through a computer network, that

interact with each other in order to achieve a common goal.

Parallel System: An optimized collection of processors, dedicated to the execution of complex tasks; each processor executes a subtask in a semi-independent manner, and coordination may be needed from time to time. The primary goal of parallel processing is a significant increase in performance.

Remark. Parallel processing in distributed environments is not only possible, but a cost-effective and attractive alternative.

Page 43: PARALLEL COMPUTING:   Models and Algorithms

Do we need powerful computer platforms?

Yes, to solve much bigger problems much faster! Coarse-grain parallelism is mainly applicable to long-running, scientific programs.

Performance

- there are problems which can use any amount of computing (e.g. simulation)

Capability

- to solve previously unsolvable problems (such as prime number factorization): data sizes too big, real-time constraints

Capacity

- to handle a lot of processing much faster, perform more precise computer simulations (e.g. weather prediction)

Page 44: PARALLEL COMPUTING:   Models and Algorithms

Measures of Performance

• To computer scientists: speedup, execution time.
• To applications people: size of problem, accuracy of solution, etc.

Speedup of algorithm

= sequential execution time / execution time on p processors (with the same data set).

Speedup on problem

= sequential execution time of the best known sequential algorithm / execution time on p processors.

• A more honest measure of performance.
• Avoids picking an easily parallelizable algorithm with poor sequential execution time.

Page 45: PARALLEL COMPUTING:   Models and Algorithms

How did parallel platforms evolve?

Execution Speed

• With a 10^2 times increase of (floating point) execution speed every 10 years

Communication Technology

• A factor which is critical to the performance of parallel computing platforms

• 1985 – 1990: in spite of an average 20x increase in processor performance, the communication speed remained constant

Page 46: PARALLEL COMPUTING:   Models and Algorithms

Parallel Computing – How platforms evolved

[Figure: time (s) per floating-point instruction, decreasing over the years]

Motto: "I think there is a world market for maybe five computers" (Thomas Watson, IBM Chairman, 1943)

Page 47: PARALLEL COMPUTING:   Models and Algorithms

Towards Parallel Computing – The 5 ERAs

Page 48: PARALLEL COMPUTING:   Models and Algorithms

Why are powerful computers parallel?

From Transistors to FLOPS

• by Moore's law, the number of transistors per unit area doubles every 18 months

• how to make use of these transistors?

• more execution units, graphical pipelines, etc.

• more processors

So, technology is not the only key, computer structure (architecture)

and organization are also important!

Inhibitors of parallelism:

•Dependencies

Page 49: PARALLEL COMPUTING:   Models and Algorithms

Why are powerful computers parallel? (cont.)

The Data Communication Argument

• for huge data it is cheaper and more feasible to move

computation towards data

The Memory/Disk Speed Argument

• parallel platforms typically yield better memory system

performance, because they have

• larger aggregate caches

• higher aggregate bandwidth to memory system

Page 50: PARALLEL COMPUTING:   Models and Algorithms

Explicit Parallel Programming Platforms

• physical organization – hardware view

• communication network

• logical organization - programmer’s view of the

platform

• process-processors mappings

Page 51: PARALLEL COMPUTING:   Models and Algorithms

A bit of historical perspective

Parallel computing has been here since the early days of computing.

Traditionally: custom HW, custom SW, high prices

The "doom" of Moore's law:

- custom HW has a hard time catching up with commodity processors

Current trend: use commodity HW components, standardize SW

⇒ Parallelism sneaking into commodity computers:

• Instruction Level Parallelism - wide issue, pipelining, OOO

• Data Level Parallelism – 3DNow, Altivec

• Thread Level Parallelism – Hyper-threading in Pentium IV

⇒ Transistor budgets allow for multiple processor cores on a chip.

Page 52: PARALLEL COMPUTING:   Models and Algorithms

A bit of historical perspective (cont.)

Most applications would benefit from being parallelized and

executed on a parallel computer.

• even PC applications, especially the most demanding ones – games,

multimedia

Chicken & Egg Problem:

1. Why build parallel computers when the applications are sequential?

2. Why parallelize applications when there are no parallel commodity

computers?

Answers:

1. What else to do with all those transistors?

2. Applications already are a bit parallel (wide issue, multimedia

instructions, hyper-threading), and this bit is growing.

Page 53: PARALLEL COMPUTING:   Models and Algorithms

Parallel Solutions: A Taxonomy

Pipelining

- instructions are decomposed into elementary operations; different operations

belonging to several instructions may be at a given moment in execution

Functional parallelism

- independent units are provided to execute specialized functions

Vector parallelism

- identical units are provided to execute under unique control the same operation on

different data items

Multi-processing

- several “tightly coupled” processors execute independent instructions,

communicating through a common shared memory

Multi-computing

- several "loosely coupled" processors execute independent instructions, and usually

communicate with each other by sending messages

Page 54: PARALLEL COMPUTING:   Models and Algorithms

Pipelining (often complemented by functional/vector parallelism)

Ex. IBM 360/195, CDC 6600/7600, Cray 1

Page 55: PARALLEL COMPUTING:   Models and Algorithms

Vector Processors

• Early parallel computers used vector processors; their design was MISD, their programming was SIMD (see Flynn's taxonomy next)

• Most significant representatives of this class:

• CDC Cyber 205, CDC 6600

• Cray-1, Cray-2, Cray XMP, Cray YMP etc.

• IBM 3090 Vector

• Innovative aspects:

• Superior organization

• Use of performant technologies (not CMOS), e.g. cooling

• Use of “peripheral processors” (minicomputers)

• Generally, do not rely on usual techniques for paging/ segmentation, that slow down computations

Page 56: PARALLEL COMPUTING:   Models and Algorithms

[Photos: Cray X-MP/4 and Cray-2]

Page 57: PARALLEL COMPUTING:   Models and Algorithms

Flynn’s Taxonomy

Instr. Flow \ Data Flow    Single    Multiple
Single                     SISD      SIMD
Multiple                   MISD      MIMD

Page 58: PARALLEL COMPUTING:   Models and Algorithms

SIMD (Single Instruction stream, Multiple Data stream)

Global control unit

Interconnection network

PE PE PE PE PE…

Ex: early parallel machines

• Illiac IV, MPP, CM-2, MasPar MP-1

Modern settings

• multimedia extensions - MMX, SSE

• DSP chips

Page 59: PARALLEL COMPUTING:   Models and Algorithms

SIMD (cont.)

Positives:

• less hardware needed (compared to MIMD computers, they

have only one global control unit)

• less memory needed (must store only a copy of the program)

• less startup time to communicate with neighboring processors

• easy to understand and reason about

Negatives:

• proprietary hardware needed – fast obsolescence, high

development costs/time

• rigid structure suitable only for highly structured problems

• inherent inefficiency due to selective turn-off

Page 60: PARALLEL COMPUTING:   Models and Algorithms

SIMD and Data-Parallelism

SIMD computers are naturally suited for data-parallel

programs

• programs in which the same set of instructions are executed on a large data set

Example:

for (i=0; i<1000; i++) pardo

c[i] = a[i]+b[i];

Processor k executes c[k] = a[k]+b[k]

Page 61: PARALLEL COMPUTING:   Models and Algorithms

SIMD – inefficiency example (1)

Example:

for (i=0; i<10; i++)
    if (a[i]<b[i])
        c[i] = a[i]+b[i];
    else
        c[i] = 0;

Different processors cannot execute distinct instructions in the same clock cycle

a[]: 4 1 7 2 9 3 3 0 6 7
b[]: 5 3 4 1 4 5 3 1 4 8
c[]:
     p0 p1 p2 p3 p4 p5 p6 p7 p8 p9

Page 62: PARALLEL COMPUTING:   Models and Algorithms

SIMD – inefficiency example (2)

Example:

for (i=0; i<10; i++) pardo
    if (a[i]<b[i])
        c[i] = a[i]+b[i];
    else
        c[i] = 0;

a[]: 4 1 7 2 9 3 3 0 6 7
b[]: 5 3 4 1 4 5 3 1 4 8
c[]:
     p0 p1 p2 p3 p4 p5 p6 p7 p8 p9

Page 63: PARALLEL COMPUTING:   Models and Algorithms

SIMD – inefficiency example (3)

Example:

for (i=0; i<10; i++) pardo
    if (a[i]<b[i])
        c[i] = a[i]+b[i];
    else
        c[i] = 0;

a[]: 4 1 7 2 9 3 3 0 6 7
b[]: 5 3 4 1 4 5 3 1 4 8
c[]: 9 4 _ _ _ 8 _ 1 _ 15   (only the processors with a[i]<b[i] are active in this step)
     p0 p1 p2 p3 p4 p5 p6 p7 p8 p9

Page 64: PARALLEL COMPUTING:   Models and Algorithms

SIMD – inefficiency example (4)

Example:

for (i=0; i<10; i++) pardo
    if (a[i]<b[i])
        c[i] = a[i]+b[i];
    else
        c[i] = 0;

a[]: 4 1 7 2 9 3 3 0 6 7
b[]: 5 3 4 1 4 5 3 1 4 8
c[]: 9 4 0 0 0 8 0 1 0 15
     p0 p1 p2 p3 p4 p5 p6 p7 p8 p9
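The four snapshots above can be reproduced with the small sequential sketch below, which mimics how a SIMD machine handles the branch: every PE evaluates the condition, then the two branches run in two masked phases (the mask array is only an illustration of the selective turn-off mechanism, not a specific machine's instruction set):

#include <stdio.h>

int main(void) {
    int a[10] = {4, 1, 7, 2, 9, 3, 3, 0, 6, 7};
    int b[10] = {5, 3, 4, 1, 4, 5, 3, 1, 4, 8};
    int c[10], mask[10], i;

    for (i = 0; i < 10; i++)            /* all PEs evaluate the condition   */
        mask[i] = (a[i] < b[i]);
    for (i = 0; i < 10; i++)            /* phase 1: only PEs with mask == 1 */
        if (mask[i]) c[i] = a[i] + b[i];
    for (i = 0; i < 10; i++)            /* phase 2: only PEs with mask == 0 */
        if (!mask[i]) c[i] = 0;

    for (i = 0; i < 10; i++) printf("%d ", c[i]);   /* 9 4 0 0 0 8 0 1 0 15 */
    printf("\n");
    return 0;
}

Roughly half of the PEs are idle in each phase, which is exactly the inefficiency the slides point out.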

Page 65: PARALLEL COMPUTING:   Models and Algorithms

MIMD (Multiple Instruction stream, Multiple Data stream)

[Figure: MIMD organization – processing elements, each with its own control unit, connected by an interconnection network]

Single Program, Multiple Data

• a popular way to program MIMD computers

• simplifies code maintenance/program distribution

• equivalent to MIMD (big switch at the beginning)

Page 66: PARALLEL COMPUTING:   Models and Algorithms

MIMD (cont)

Positives:

• can be easily/fast/cheaply built from existing microprocessors

• very flexible (suitable for irregular problems)

• can have extra hardware to provide fast synchronization,

which enables them to operate in SIMD mode (ex. CM5)

Negatives:

• more complex (each processor has its own control unit)

• requires more resources (duplicated program, OS, …)

• more difficult to reason about/design correct programs

Page 67: PARALLEL COMPUTING:   Models and Algorithms

Address-Space Organization

Aka Bell’s Taxonomy (only for MIMD computers)

•Multiprocessors

(single address space, communication uses common memory)

• Scalable (distributed memory)

• Not scalable (centralized memory)

•Multicomputers

(multiple address space, communication uses transfer of messages)

• Distributed

• Centralized

Page 68: PARALLEL COMPUTING:   Models and Algorithms

Vector Parallelism

• Is based on “primary” high-level, efficient operations, able to process in one step whole linear arrays (vectors)

• It may be extended to matrix processing etc.

Page 69: PARALLEL COMPUTING:   Models and Algorithms

Multiprocessors

Ex. Compaq SystemPro, Sequent Symmetry 2000

Page 70: PARALLEL COMPUTING:   Models and Algorithms

Multicomputers

Ex. nCube, Intel iPSC/860

Page 71: PARALLEL COMPUTING:   Models and Algorithms

Multiprocessor Architectures

• Typical examples are the Connection Machines (CM-2, CM-5)

Page 72: PARALLEL COMPUTING:   Models and Algorithms

Organization

[Figure: Host Computer → Microcontroller → CM Processors and Memories]

• Host sends commands/ data to a microcontroller

• The microcontroller broadcasts control signals and

data back to the processor network

• It also collects data from the network

Page 73: PARALLEL COMPUTING:   Models and Algorithms

CM* Processors and Memory

• Bit dimension (this means the memory is

addressable at bit level)

• Operations are bit serialized

• Data organization in fields is arbitrary (may

include any number of bits, starts anywhere)

• A set of contextual bits (flags) in all processors

determines their activation

Page 74: PARALLEL COMPUTING:   Models and Algorithms

Programming

• PARIS - PArallel Instruction Set, similar to an assembly language

• *LISP – Common Lisp extension that includes explicit parallel operations

• C* - C extension with explicit parallel data and implicit parallel operations

• CM-Fortran – the implemented dialect of Fortran 90

Page 75: PARALLEL COMPUTING:   Models and Algorithms

CM2 Architecture

[Figure: CM-2 architecture – a front end connected through the Nexus to four sequencers (0-3), each driving its own bank of Connection Machine processors]

Page 76: PARALLEL COMPUTING:   Models and Algorithms

Interconnection Network

of CM2 Processors

• Any node in the network is a cluster ("chip"), with:

– 16 data processors on a chip

– Memory

– Routing node

• Nodes are connected in a 12D hypercube

– There are 4096 nodes; each has direct links to 12 other nodes

– The maximal dimension of a CM is thus 16 × 4096, or 64K processors

Page 77: PARALLEL COMPUTING:   Models and Algorithms

CM5

• Starting with the CM-5, Thinking Machines Co. went (in 1991) from a hypercube architecture of simple processors to a completely new MIMD one, based on a "fat tree" of RISC processors (SPARC)

• A few years later the CM-5E replaced the SPARC processors with faster SuperSPARCs

Page 78: PARALLEL COMPUTING:   Models and Algorithms

Levels of parallelism

Implicit Parallelism in Modern Microprocessors

• pipelining, superscalar execution, VLIW

Hardware parallelism

- as given by machine architecture and hardware multiplicity (Hwang)

- reflects a model of resource utilization by operations with a potential of simultaneous execution, or refers to the resources' peak performance

Software parallelism

- acts at job, program, instruction or even bit (arithmetic) level

Page 79: PARALLEL COMPUTING:   Models and Algorithms

Limitations of Memory System Performance

• Problem: high latency of memory vs. speed of computing

• Solutions: caches, latency hiding using multithreading and

prefetching

Granularity

- is a measure of the amount of computation within a process

- usually described as coarse, medium and fine

Latency

- opposed to granularity, measures the overhead due to communication

between fragments of code

Page 80: PARALLEL COMPUTING:   Models and Algorithms

PARALLEL COMPUTING:

Models and Algorithms

A course for the 4th year students

(Major in Computer Science - Software)

Communication in Parallel Systems

Page 81: PARALLEL COMPUTING:   Models and Algorithms

Contents

• Role of communication in parallel systems

• Types of interconnection networks

• General topologies: clique, star, linear array, ring,

tree & fat tree, 2D & 3D mesh/torus, hypercube,

butterfly

• Evaluating interconnection networks: diameter,

connectivity, bandwidth, cost

Page 82: PARALLEL COMPUTING:   Models and Algorithms

Communication

• plays a major role, for both:

• Shared Address Space Platforms (multiprocessors)

• Uniform Memory Access multiprocessors

• Non-Uniform Memory Access multiprocessors

• cache coherence issues

• Message Passing Platforms

• network characteristics are important

• mapping between parallel processes and processors is

critical

Page 83: PARALLEL COMPUTING:   Models and Algorithms

Sequential Programming Paradigm

Page 84: PARALLEL COMPUTING:   Models and Algorithms

Message-Passing Programming Paradigm

Page 85: PARALLEL COMPUTING:   Models and Algorithms

Shared Address Space Platforms

[Figure: (left) shared-memory UMA organization – processors P connected through an interconnection network to shared memory modules M; (right) distributed-memory NUMA organization – nodes, each with a processor P, cache C and local memory M, connected by an interconnection network]

Page 86: PARALLEL COMPUTING:   Models and Algorithms

Interconnection Networks for Parallel Computers

Static networks

• point-to-point communication links among processing nodes

• also called direct networks

Dynamic networks

• communication links are connected dynamically by switches to

create paths between processing nodes and memory banks/other

processing nodes

• also called indirect networks

Quasi-static/ Pseudo-dynamic networks

• to be introduced later

Page 87: PARALLEL COMPUTING:   Models and Algorithms

Interconnection Networks

[Figure: in a static/direct network, processing nodes are connected point-to-point through their network interfaces/switches; in a dynamic/indirect network, processing nodes communicate through intermediate switching elements]

Page 88: PARALLEL COMPUTING:   Models and Algorithms

Static Interconnection Networks

Just the most usual topologies:

• Complete network (clique)

• Star network

• Linear array

• Ring

• Tree

• 2D & 3D mesh/torus

• Hypercube

• Butterfly

• Fat tree

Page 89: PARALLEL COMPUTING:   Models and Algorithms

Clique, Star, Linear Array, Ring, Tree

[Figure: example topologies over nodes p0 … pn-1]

Page 90: PARALLEL COMPUTING:   Models and Algorithms

Clique, Star, Linear Array, Ring, Tree

- important logical topologies, as many common communication patterns correspond to these topologies:

- clique: all-to-all broadcast

- star: master – slave, broadcast

- line, ring: pipelined execution

- tree: hierarchical decomposition

- none of them is very practical

- clique: cost

- star, line, ring, tree: low bisection width

- line, ring: high diameter

- actual execution is performed on the embedding into the physical network

Page 91: PARALLEL COMPUTING:   Models and Algorithms

2D & 3D Array & Torus

- good match for discrete simulation and matrix operations

- easy to manufacture and extend

Examples: Cray T3D (3D torus), Intel Paragon (2D mesh)

Page 92: PARALLEL COMPUTING:   Models and Algorithms

Hypercube

- good graph-theoretic properties (low diameter, high bisection width)

- nice recursive structure

- good for simulating other topologies (they can be efficiently embedded into

hypercube)

- degree log (n), diameter log (n), bisection width n/2

- costly/difficult to manufacture for high n, not so popular nowadays

[Figure: a 3-dimensional hypercube with nodes labeled 000-111; nodes whose labels differ in exactly one bit are connected]
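Since two hypercube nodes are adjacent exactly when their labels differ in one bit, the neighbours of a node can be enumerated by XOR-ing its label with single-bit masks (a small sketch for d = 3):

#include <stdio.h>

/* Print the neighbours of every node of a d-dimensional hypercube:
   flipping bit `bit` of a node's label gives its neighbour along that
   dimension. */
int main(void) {
    int d = 3;
    for (int node = 0; node < (1 << d); node++) {
        printf("node %d:", node);
        for (int bit = 0; bit < d; bit++)
            printf(" %d", node ^ (1 << bit));   /* flip one bit */
        printf("\n");
    }
    return 0;
}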

Page 93: PARALLEL COMPUTING:   Models and Algorithms

Butterfly

- a hypercube-derived network with log(n) diameter and constant degree

- a perfect match for some complex algorithms (like the Fast Fourier Transform)

- there are other hypercube-related networks (Cube Connected Cycles, Shuffle-Exchange, De Bruijn and Beneš networks)

[Figure: recursive construction – two copies of Bn combine into Bn+1]

Page 94: PARALLEL COMPUTING:   Models and Algorithms

Fat Tree

Main idea: exponentially increase the multiplicity of links as the distance

from the bottom increases

- keeps nice properties of the binary tree (low diameter)

- solves the low bisection and bottleneck at the top levels

Example: CM5

Page 95: PARALLEL COMPUTING:   Models and Algorithms

Dynamic Interconnection Networks

Page 96: PARALLEL COMPUTING:   Models and Algorithms

BUS – Based Interconnection Networks

• processors and the memory modules are connected to a shared bus

Advantages:

• simple, low cost

Disadvantages:

• only one processor can access memory at a given time

• bandwidth does not scale with the number of processors/memory

modules

Example:

• quad Pentium Xeon

Page 97: PARALLEL COMPUTING:   Models and Algorithms

Crossbar

Advantages:

• non blocking network

Disadvantages:

• cost O(pm)

Page 98: PARALLEL COMPUTING:   Models and Algorithms

Evaluating Interconnection Networks

diameter

• the longest distance (number of hops) between any two nodes

• gives lower bound on time for algorithms communicating only with direct neighbours

connectivity

• multiplicity of paths between any two nodes

• high connectivity lowers contention for communication resources

bisection width (bisection bandwidth)

• the minimal number of links (resp. their aggregate bandwidth) that must be removed to partition the network into two equal halves

• provides lower bound on time when the data must be shuffled from one half of the network to another half

• VLSI complexity: a network with bisection width w requires O(w^2) area in 2D and O(w^(3/2)) volume in 3D

Page 99: PARALLEL COMPUTING:   Models and Algorithms

Evaluating Interconnection Networks

Network                 Diameter           Bisection Width   Arc Connectivity   Cost (# of links)
clique                  1                  p^2/4             p-1                p(p-1)/2
star                    2                  1                 1                  p-1
complete binary tree    2 log((p+1)/2)     1                 1                  p-1
linear array            p-1                1                 1                  p-1
2D mesh                 2(√p-1)            √p                2                  2(p-√p)
2D torus                2⌊√p/2⌋            2√p               4                  2p
hypercube               log p              p/2               log p              (p log p)/2

Page 100: PARALLEL COMPUTING:   Models and Algorithms

So, the Logical View of PP Platform :

Control Structure - how to express parallel tasks

• Single Instruction stream, Multiple Data stream

• Multiple Instruction stream, Multiple Data stream

• Single Program Multiple Data

Communication Model - how to specify interactions between tasks

• Shared Address Space Platforms (multiprocessors)

• Uniform Memory Access multiprocessors

• Non-Uniform Memory Access multiprocessors

• Cache-Only Memory Access multiprocessors (+ cache coherence issues)

• Message Passing Platforms (multicomputers)

Page 101: PARALLEL COMPUTING:   Models and Algorithms

PARALLEL COMPUTING:

Models and Algorithms

A course for the 4th year students

(Major in Computer Science - Software)

Parallel Programming Models

Page 102: PARALLEL COMPUTING:   Models and Algorithms

Contents

• The "ideal parallel computer": PRAM

• Categories of PRAMs

• PRAM algorithm examples

• Algorithmic Models

  • Data-Parallel Model

  • Task Graph Model

  • Work Pool Model

  • Master-Slave Model

  • Pipeline (Producer-Consumer) Model

• Parallel Algorithm Design

  • Performance Models

  • Decomposition Techniques

Page 103: PARALLEL COMPUTING:   Models and Algorithms

Explicit Parallel Programming

• Platforms & physical organization – hardware view

• Communication network

• Logical organization – programmer's view of the platform

• Process-processors mappings

Page 104: PARALLEL COMPUTING:   Models and Algorithms

The Ideal Parallel Computer

PRAM - Parallel Random Access Machine

• consists of:

  • p processors, working in a lock-step, synchronous manner on the same program instructions

  • each with its local memory

  • each connected to an unbounded shared memory

• the access time to shared memory costs one step

• PRAM abstracts away communication and allows us to focus on the parallel tasks

Page 105: PARALLEL COMPUTING:   Models and Algorithms

Why PRAM is an Ideal Parallel Computer?

• PRAM is a natural extension of the sequential model of computation (RAM); it provides a means of interaction between processors at no cost

• it is not feasible to manufacture PRAMs:

  • the real cost of connecting p processors to m memory cells such that their accesses do not interfere is O(pm), which is huge for any practical value of m

• an algorithm for PRAM might lead to a good algorithm for a real machine

• if something cannot be efficiently solved on PRAM, it cannot be efficiently done on any practical machine (based on current technology)

Page 106: PARALLEL COMPUTING:   Models and Algorithms

Categories of PRAMs

• Restrictions may be imposed for simultaneous read/write operations in the common memory

• There are 4 main classes, depending on how simultaneous accesses are handled:

  • Exclusive read, exclusive write - EREW PRAM

  • Concurrent read, exclusive write - CREW PRAM

  • Exclusive read, concurrent write - ERCW PRAM (for completeness)

  • Concurrent read, concurrent write - CRCW PRAM

Page 107: PARALLEL COMPUTING:   Models and Algorithms

Resolving concurrent writes

• Allowing concurrent read access does not create semantic discrepancies in the program

• Concurrent write access to the same memory location requires arbitration

• Ways of resolving concurrent writes:

  • Common – all writes must write the same value

  • Arbitrary – an arbitrary write succeeds

  • Priority – the write with the highest priority succeeds

  • Sum – the sum of the written values is stored

Page 108: PARALLEL COMPUTING:   Models and Algorithms

PRAM Algorithm Example 1

Problem (parallel prefix): use an EREW PRAM to sum the numbers stored at m0, m1, …, mn-1, where n = 2^k for some k. The result should be stored at m0.

Algorithm for processor pi:

for (j=0; j<k; j++)
    if (i % 2^(j+1) == 0) {
        a = read(mi);
        b = read(mi+2^j);
        write(a+b, mi);
    }

Example for k=3:

initial:              1  8  3  2  7  3  1  4
after p0, p2, p4, p6: 9  8  5  2 10  3  5  4
after p0, p4:        14  8  5  2 15  3  5  4
after p0:            29  8  5  2 15  3  5  4
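A sequential C simulation of the rounds above (one pass of the inner loop plays the role of one synchronous PRAM step; on the real EREW PRAM all active processors execute it simultaneously):

#include <stdio.h>

/* Sequential simulation of the EREW PRAM summation: in round j, every
   processor i with i % 2^(j+1) == 0 adds m[i+2^j] into m[i]. After
   k = log2(n) rounds the total is in m[0]. */
int main(void) {
    int m[8] = {1, 8, 3, 2, 7, 3, 1, 4};   /* n = 2^k with k = 3 */
    int n = 8, k = 3;

    for (int j = 0; j < k; j++) {
        for (int i = 0; i < n; i++)         /* "all active processors" */
            if (i % (1 << (j + 1)) == 0)
                m[i] = m[i] + m[i + (1 << j)];
        for (int i = 0; i < n; i++) printf("%d ", m[i]);
        printf("\n");
    }
    /* prints: 9 8 5 2 10 3 5 4 / 14 8 5 2 15 3 5 4 / 29 8 5 2 15 3 5 4 */
    return 0;
}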

Page 109: PARALLEL COMPUTING:   Models and Algorithms

PRAM Example Notes

• the program is written in SIMD (and SPMD) format

• the inefficiency caused by idling processors is clearly visible

• can be easily extended for n not a power of 2

• takes log2(n) rounds to execute

Important!

→ using a similar approach to parallel prefix (+ some other ideas) it can be shown that:

Any CRCW PRAM can be simulated by an EREW PRAM with a slowdown factor of O(log n)

Page 110: PARALLEL COMPUTING:   Models and Algorithms

PRAM Algorithm Example 2

Problem: use a Sum-CRCW PRAM with n^2 processors to sort n numbers stored at x0, x1, …, xn-1.

CRCW condition: processors can write concurrently 0s and 1s into a location; the sum of the values will actually be written.

Question: How many steps would it take?

1. O(n log n)

2. O(n)

3. O(log n)

4. O(1)

5. less than (n log n)/n^2

Page 111: PARALLEL COMPUTING:   Models and Algorithms

PRAM Example 2

Note: We will mark the processors pi,j for 0 <= i,j < n

Algorithm for processor pi,j:

a = read(xi);
b = read(xj);
if ((a>b) || ((a==b)&&(i>j)))
    write(1, mi);
if (j==0) {
    b = read(mi);
    write(a, xb);
}

x[]: 1 7 3 9 3 0

Comparison matrix (the concurrent writes of the processors pi,j; column i sums to mi):

0 1 1 1 1 0
0 0 0 1 0 0
0 1 0 1 0 0
0 0 0 0 0 0
0 1 0 1 1 0
1 1 1 1 1 0

m[]: 1 4 2 5 3 0

x[] after the final write: 0 1 3 3 7 9

An O(1) sorting algorithm! (Chaudhuri, p. 90-91)

Find the small error in the matrix!
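For checking your answer, here is a sequential simulation of the algorithm (the double loop stands in for the n^2 processors, and the Sum-CRCW write is modelled by accumulating into m[i]):

#include <stdio.h>

/* m[i] ends up counting how many elements are smaller than x[i] (ties
   broken by index), i.e. the rank of x[i]; writing x[i] to position m[i]
   then sorts the array. On the CRCW PRAM this takes O(1) steps. */
int main(void) {
    int x[6] = {1, 7, 3, 9, 3, 0}, m[6] = {0}, out[6];
    int n = 6;

    for (int i = 0; i < n; i++)          /* processor p(i,j)           */
        for (int j = 0; j < n; j++)
            if (x[i] > x[j] || (x[i] == x[j] && i > j))
                m[i] += 1;               /* concurrent writes summed   */

    for (int i = 0; i < n; i++)          /* processors p(i,0)          */
        out[m[i]] = x[i];

    for (int i = 0; i < n; i++) printf("%d ", out[i]);   /* 0 1 3 3 7 9 */
    printf("\n");
    return 0;
}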

Page 112: PARALLEL COMPUTING:   Models and Algorithms

The beauty and challenge of parallel algorithms

Problems that are trivial in sequential setting can be quite

interesting and challenging to parallelize.

Homework: Compute sum of n numbers

How would you do it in parallel?

• using n processors

• using p processors

• when communication is cheap

• when communication is expensive

Page 113: PARALLEL COMPUTING:   Models and Algorithms

Algorithmic Models

• try to offer a common base for the development, expression and comparison of parallel algorithms

• generally, they use the architectural model of a shared-memory parallel machine (multi-processor)

• shared memory is a useful abstraction from the programmer's point of view, especially for the early phases of algorithm design

• communication is kept as simple as possible

• usual causes of inefficiency are eliminated

Page 114: PARALLEL COMPUTING:   Models and Algorithms

Parallel Algorithmic Models

• Data-Parallel Model

• Task Graph Model

• Work Pool Model

• Master-Slave Model

• Pipeline (Producer-Consumer) Model

Page 115: PARALLEL COMPUTING:   Models and Algorithms

Data Parallel Model

• Working principle

  • divide the data up amongst processors

  • process different data segments in parallel

  • communicate boundary information, if necessary

• Features

  • includes loop parallelism

  • well suited for SIMD machines

  • communication is often implicit

Page 116: PARALLEL COMPUTING:   Models and Algorithms

Task Graph Model

• decompose the algorithm into different sections

• assign sections to different processors

• often uses fork()/join()/spawn()

• usually does not lend itself to a high level of parallelism

Page 117: PARALLEL COMPUTING:   Models and Algorithms

Work Pool Model

• dynamic mapping of tasks to processes

• typically a small amount of data per task

• the pool of tasks (priority queue, hash table, tree) can be centralized or distributed

[Figure: processes P0-P3 repeatedly get a task from the work pool (tasks t0, t2, t3, t7, t8, ...), process it, and possibly add new tasks to the pool]

Page 118: PARALLEL COMPUTING:   Models and Algorithms

Master-Slave Model

• master generates and allocates tasks

• can also be hierarchical/multilayer

• master is potentially a bottleneck

• overlapping communication and computation at the master is often useful

Page 119: PARALLEL COMPUTING:   Models and Algorithms

Pipelining

• a sequence of tasks whose execution can overlap

• a sequential processor must execute them sequentially, without overlap

• a parallel computer can overlap the tasks, increasing throughput (but not decreasing latency)

Page 120: PARALLEL COMPUTING:   Models and Algorithms

Parallel Algorithms Performance

Granularity
• fine grained: large number of small tasks
• coarse grained: small number of large tasks

Degree of Concurrency
• the maximal number of tasks that can be executed simultaneously

Critical Path
• the costliest directed path between any pair of start and finish nodes in the task dependency graph
• the cost of the path is the sum of the weights of the nodes

Task Interaction Graph
• tasks correspond to nodes and an edge connects two tasks if they communicate/interact with each other

Page 121: PARALLEL COMPUTING:   Models and Algorithms

• directed acyclic graph capturing causal dependencies between tasks

• a task corresponding to a node can be executed only after all tasks on the other sides of the incoming edges have already been executed

Task Dependency Graphs

sequential summation, traversal, …

binary summation, merge sort, …

Page 122: PARALLEL COMPUTING:   Models and Algorithms

5-step Guide to Parallelization

Identify computational hotspots
• find what is worth parallelizing

Partition the problem into smaller semi-independent tasks
• find/create parallelism

Identify communication requirements between these tasks
• realize the constraints communication puts on parallelism

Agglomerate smaller tasks into larger tasks
• group the basic tasks together so that communication is minimized, while still allowing good load balancing properties

Translate (map) tasks/data to actual processors
• balance the load of the processors, while trying to minimize communication

Page 123: PARALLEL COMPUTING:   Models and Algorithms

Parallel Algorithm Design

Involves all of the following:

1. identifying the portions of the work that can be performed concurrently

2. mapping the concurrent pieces of work onto multiple processes running in parallel

3. distributing the input, output and intermediate data associated with the program

4. managing access to data shared by multiple processes

5. synchronizing the processes in various stages of parallel program execution

Optimal choices depend on the parallel architecture

Page 124: PARALLEL COMPUTING:   Models and Algorithms

Platform dependency example

Problem:
• process each element of an array, with interaction between neighbouring elements

1st Setting: message passing computer
Solution: distribute the array into blocks of size n/p

2nd Setting: shared memory computer with shared cache
Solution: striped partitioning

[Figure: the array distributed among processors p0-p3]

Page 125: PARALLEL COMPUTING:   Models and Algorithms

Decomposition techniques

• Recursive Decomposition

• Data Decomposition

• Task Decomposition

• Exploratory Decomposition

• Speculative Decomposition

Page 126: PARALLEL COMPUTING:   Models and Algorithms

Recursive Decomposition

Divide and conquer leads to natural concurrency.

• quick sort:  6 2 5 8 9 5 1 7 3 4 3 0

               2 5 5 1 3 4 3 0 6 8 9 7

               1 0 2 5 5 3 4 3 6 7 8 9

• finding the minimum recursively:
  rMin(A[0..n-1]) = min(rMin(A[0..n/2-1]), rMin(A[n/2..n-1]));
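A sequential sketch of the recursive minimum; the two half-range calls are independent, so on a parallel machine they could be executed as separate tasks:

#include <stdio.h>

/* Recursive-decomposition minimum: rMin splits the range in half; the two
   recursive calls are independent and could be run by different
   processors, with a final min() to combine the partial results. */
int rMin(int A[], int lo, int hi) {
    if (lo == hi) return A[lo];
    int mid   = (lo + hi) / 2;
    int left  = rMin(A, lo, mid);       /* candidate parallel task 1 */
    int right = rMin(A, mid + 1, hi);   /* candidate parallel task 2 */
    return left < right ? left : right;
}

int main(void) {
    int A[] = {6, 2, 5, 8, 9, 5, 1, 7, 3, 4, 3, 0};
    printf("min = %d\n", rMin(A, 0, 11));   /* prints 0 */
    return 0;
}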

Page 127: PARALLEL COMPUTING:   Models and Algorithms

Data Decomposition

• Begin by focusing on the largest data structures, or the ones that

are accessed most frequently

• Divide the data into small pieces, if possible of similar size

• Strive for more aggressive partitioning than your target computer

will allow

• Use data partitioning as a guideline for partitioning the computation

into separate tasks; associate some computation with each data

element

• Take communication requirements into account when partitioning

data

Page 128: PARALLEL COMPUTING:   Models and Algorithms

Data Decomposition (cont.)

Partitioning according to

• input data (e.g. find minimum, sorting)

• output data (e.g. matrix multiplication)

• intermediate data (bucket sort)

Associate tasks with the data

• do as much as you can with the data before further

communication

• owner computes rule

Partition in a way that minimizes communication

costs

Page 129: PARALLEL COMPUTING:   Models and Algorithms

Task Decomposition

• Partition the computation into many small tasks of

approximately uniform computational requirements

• Associate data with each task

• Common in problems where data structures are highly

unstructured, or no obvious data structures to partition exist

Page 130: PARALLEL COMPUTING:   Models and Algorithms

Exploratory Decomposition

• commonly used in search space exploration

• unlike the data decomposition, the search space is not known

beforehand

• computation can terminate as soon as a solution is found

• the amount of work can be more or less than in the sequential case

[Figure: 15-puzzle example – different processors explore different successor configurations of the 4×4 board in parallel]

Page 131: PARALLEL COMPUTING:   Models and Algorithms

Speculative Decomposition

Example: Discrete event simulation – state-space vertical partitioning

• execute branches concurrently, assuming certain restrictions are met (i.e. lcc), then keep the executions that are correct, re-executing the others under the new conditions

• the total amount of work is always more than in the sequential case, but the execution time can be less

Page 132: PARALLEL COMPUTING:   Models and Algorithms

Task Characteristics

• Task generation: static vs dynamic

• Task sizes: uniform, non-uniform, known, unknown

• Size of Data Associated with Tasks: influences mapping decisions, input/output sizes

Page 133: PARALLEL COMPUTING:   Models and Algorithms

InterTask Communication Characteristics

• static vs dynamic

• regular vs irregular

• read-only vs read-write

• one way vs two way

Page 134: PARALLEL COMPUTING:   Models and Algorithms

Load Balancing

Efficiency adversely affected by uneven workload:

[Figure: execution timeline for processors P0-P4 – unequal computation times leave several processors idle (wasted) while they wait for the most heavily loaded one]

Page 135: PARALLEL COMPUTING:   Models and Algorithms

Load Balancing (cont.)

Load balancing: shifting work from heavily loaded processors to lightly loaded ones.

[Figure: the same timeline after moving work from heavily loaded processors to lightly loaded ones – the moved computation fills the idle slots and shortens the overall execution time]

Static load balancing - before execution
Dynamic load balancing - during execution

Page 136: PARALLEL COMPUTING:   Models and Algorithms

Static Load Balancing

Map data and tasks into processors prior to execution

• the tasks must be known beforehand (static task generation)

• usually task sizes need to be known in order to work well

• even if the sizes are known (but non-uniform), the problem of optimal mapping is NP-hard (but there are reasonable approximation schemes)

Page 137: PARALLEL COMPUTING:   Models and Algorithms

1D Array Partitioning

(Figure: a 1D array distributed over processors p0–p4 under block partitioning, cyclic (striped) partitioning, and block-cyclic partitioning)
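The three 1D schemes can be summarised by an "owner of element" function; the helpers below are a sketch with names of our own choosing (boundary handling can be done in other ways):

/* Owner of element i (0-based) among p processors, n elements in total. */

int block_owner(int i, int n, int p) {        /* block partitioning         */
    int b = (n + p - 1) / p;                  /* ceiling block size         */
    return i / b;
}

int cyclic_owner(int i, int p) {              /* cyclic (striped)           */
    return i % p;
}

int block_cyclic_owner(int i, int b, int p) { /* block-cyclic, block size b */
    return (i / b) % p;
}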

Page 138: PARALLEL COMPUTING:   Models and Algorithms

2D Array Partitioning

(Figure: several ways of distributing a 2D array over processors p0–p5: blocks of rows, blocks of columns, 2D blocks, and block-cyclic variants)

Page 139: PARALLEL COMPUTING:   Models and Algorithms

Example: Geometric Operations

Image filtering, geometric transformations, …

Trivial observation:

• Workload is directly proportional to the number of objects.

• If dealing with pixels, the workload is proportional to area.

Load balancing achieved by assigning to processors blocks of the same area.

Page 140: PARALLEL COMPUTING:   Models and Algorithms

Dynamic Load Balancing

Centralized Schemes

• master-slave:

• master generates tasks and distributes workload

• easy to program, prone to master becoming bottleneck

• self scheduling

• take a task from the work pool when you are ready

• chunk scheduling

• self scheduling taking a single task at a time can be costly

• take a chunk of tasks at once

• when there are few tasks left, the chunk size decreases
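A small C sketch of the chunk-scheduling idea (our own illustrative structure and names; in a real run-time the pool would be protected by a lock or updated atomically):

typedef struct { int next; int total; } WorkPool;

/* A ready worker grabs a chunk of tasks; the chunk size is proportional
   to the number of tasks still left, so chunks shrink towards the end.
   Returns the index of the first task in the chunk, or -1 if none left;
   *len receives the chunk size. */
int grab_chunk(WorkPool *pool, int nworkers, int *len) {
    int remaining = pool->total - pool->next;
    if (remaining <= 0) { *len = 0; return -1; }
    int size = remaining / nworkers;
    if (size < 1) size = 1;
    int first = pool->next;
    pool->next += size;
    *len = size;
    return first;
}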

Page 141: PARALLEL COMPUTING:   Models and Algorithms

Dynamic Load Balancing

Distributed Schemes

Distributively share workload with other processors.

Issues:

• how to pair sending and receiving processors

• transfer of workload initiated by sender or receiver?

• how much work to transfer?

• when to decide to transfer?

Page 142: PARALLEL COMPUTING:   Models and Algorithms

Example: Computing Mandelbrot Set

Colour of each pixel c is defined solely from its coordinates:

#include <complex.h>                /* C99 complex numbers: cabs(), I            */
#define MAX_COLOUR 256              /* iteration limit ("max"), an assumed value  */

int getColour(double complex c) {
    int colour = 0;
    double complex z = 0;
    while (cabs(z) < 2 && colour < MAX_COLOUR) {
        z = z * z + c;              /* z := z^2 + c                              */
        colour++;
    }
    return colour;
}

(Figure: the Mandelbrot set plotted over the region real in [-2, +1], imaginary in [-1.5, +1.5])

Page 143: PARALLEL COMPUTING:   Models and Algorithms

Mandelbrot Set Example (cont.)

Possible partitioning strategies:

• Partition by individual output pixels; most aggressive partitioning.

• Partition by rows.

• Partition by columns.

• Partition by 2-D blocks.

Assignment: Evaluate Mandelbrot set partitioning strategies
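As a starting point for the assignment, one possible strategy (a sketch only, with function and parameter names of our own) is to assign rows cyclically, which tends to balance the very uneven per-pixel work; it reuses the getColour() routine from the previous slide and assumes C99 <complex.h>:

#include <complex.h>

/* Processor 'rank' of 'nprocs' computes every nprocs-th row of a
   W x H image covering real in [-2,+1], imaginary in [-1.5,+1.5]. */
void render_rows(int rank, int nprocs, int W, int H, int *image) {
    for (int y = rank; y < H; y += nprocs) {              /* cyclic rows */
        for (int x = 0; x < W; x++) {
            double re = -2.0 + 3.0 * x / (W - 1);
            double im = -1.5 + 3.0 * y / (H - 1);
            image[y * W + x] = getColour(re + im * I);
        }
    }
}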

Page 144: PARALLEL COMPUTING:   Models and Algorithms

A course for the 4th year students

(Major in Computer Science - Software)

PARALLEL COMPUTING:

Models and Algorithms

Parallel Performance: System and

Software Measures

Page 145: PARALLEL COMPUTING:   Models and Algorithms

Remember:

The Development of Parallel Programs

Involves ALL of the following:

1. identifying the portions of the work that can be performed concurrently

2. mapping the concurrent pieces of work onto multiple processes running in parallel

3. distributing the input, output and intermediate data associated with the program

4. managing access to data shared by multiple processes

5. synchronizing the processes in various stages of parallel program execution

The goal is to attain good performance in all stages.

Page 146: PARALLEL COMPUTING:   Models and Algorithms

Predicting and Measuring

Parallel Performance

• Building parallel versions of software can enable applications:

– to run a given data set in significantly less time

– run multiple data sets in a fixed amount of time

– or run large-scale data sets that are prohibitive with

sequential software

• OK, these are visible cases, but how do we measure the

performance of a parallel system in the other cases?

– Traditional measures like MIPS and MFLOPS don't really capture the performance of a parallel computation

– E.g., clusters that can have very high FLOPS may still be poor at accessing all the data in the cluster

Page 147: PARALLEL COMPUTING:   Models and Algorithms

Metrics for Parallel Systems and

Algorithms Performance

• Systematic ways to measure parallel performance are needed:

– Execution Time

– Speedup

– Efficiency

– System Throughput

– Cost-effectiveness

– Utilization

– Data access speed etc.

Page 148: PARALLEL COMPUTING:   Models and Algorithms

Execution time and overhead

• The response time measures the interval between the submission of a request and the moment the first response is produced

• Execution Time

– Parallel runtime, Tp

– Sequential runtime, Ts

• Total Parallel Overhead:

• The overhead is any combination of excess or

indirect computation time, memory, bandwidth,

etc.

T_o = p · T_p − T_s

Page 149: PARALLEL COMPUTING:   Models and Algorithms

Minimizing Overhead in Parallel

Computing

• Sources of overhead:

– Invoking a function incurs the overhead of

branching and modifying the stack pointer

regardless of what that function does

– Recursion

– When we can choose among several algorithms,

each of which has known characteristics, their

overhead is different

• Overhead can influence the decision whether or

not to parallelize a piece of code!

Page 150: PARALLEL COMPUTING:   Models and Algorithms

Speedup

• Speedup is the most commonly used measure of parallel performance

• If Ts is the best possible serial time and Tp is the time taken by a parallel algorithm on p processors, then

S = Ts / Tp

• Linear speedup, occurring when Tp = Ts/p (i.e. S = p), is considered ideal

• Superlinear speedup can happen in some cases

Page 151: PARALLEL COMPUTING:   Models and Algorithms

Speedup Definition Variability (1)

• Exactly what is meant by Ts (i.e. the time taken to run the fastest serial algorithm on one processor)

– One processor of the parallel computer?

– The fastest serial machine available?

– A parallel algorithm run on a single processor?

– Is the serial algorithm the best one?

• To keep things fair, Ts should be the best possible time in the serial world

Page 152: PARALLEL COMPUTING:   Models and Algorithms

Speedup Definition Variability (2)

A slightly different definition of speedup:

• The time taken by the parallel algorithm on one

processor divided by the time taken by the parallel

algorithm on N processors

• However this is misleading since many parallel

algorithms contain extra operations to accommodate

the parallelism (e.g the communication)

• Result: Ts is increased thus exaggerating the speedup

Page 153: PARALLEL COMPUTING:   Models and Algorithms

Factors That Limit Speedup

• Computational (Software) Overhead

– Even with a completely equivalent algorithm, software overhead arises in the concurrent implementation

• Poor Load Balancing

– Speedup is generally limited by the speed of the slowest node. So an important consideration is to ensure that each node performs the same amount of work

• Communication Overhead

– Assuming that communication and calculation cannot be overlapped, then any time spent communicating the data between processors directly degrades the speedup

Page 154: PARALLEL COMPUTING:   Models and Algorithms

Linear Speedup

• Whichever definition is used, the ideal is to produce linear speedup (N, using N cores)

• However, in practice the speedup is reduced from its ideal value of N

• For applications that scale well, the speedup

should increase at or close to the same rate of

increase in the number of processors (threads)

• Superlinear speedup results when

– unfair values are used for Ts

– there are differences in the nature of the hardware used (e.g. cache effects)

Page 155: PARALLEL COMPUTING:   Models and Algorithms

Speedup Curves

(Figure: speedup vs. number of processors – linear speedup, superlinear speedup, and a typical speedup curve that flattens out)

Page 156: PARALLEL COMPUTING:   Models and Algorithms

Efficiency

• Speedup does not measure how efficiently the

processors are being used

– Is it worth using 100 processors to get a speedup of 2?

• Efficiency is defined as the ratio of speedup to the number of processors required to achieve it:

E = S / p

– The efficiency is bounded from above by 1, measuring the fraction of time for which a processor is usefully employed

• In the ideal case, S = p, and it follows that E = 1
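The two metrics in code form (trivial helpers with names of our own, written directly from the definitions above):

/* Speedup and efficiency from measured serial and parallel times. */
double speedup(double t_serial, double t_parallel) {
    return t_serial / t_parallel;
}

double efficiency(double t_serial, double t_parallel, int p) {
    return speedup(t_serial, t_parallel) / p;   /* bounded above by 1 */
}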

Page 157: PARALLEL COMPUTING:   Models and Algorithms

Amdahl’s Law

• Used to compute an upper bound of speedup

• A parallel algorithm has 2 types of operations:

– Those which must be executed in serial

– Those which can be executed in parallel

• The speedup of a parallel algorithm is limited

by the percentage of operations which must be

performed sequentially

• Amdahl's Law assumes a fixed data set size,

and same % of overall serial execution time

Page 158: PARALLEL COMPUTING:   Models and Algorithms

Amdahl’s Law

• Let the time taken to do the serial calculations be some fraction σ of the total time (0 < σ ≤ 1)

– The parallelizable portion is then 1 − σ of the total

• Assuming linear speedup of the parallelizable portion:

– Tserial = σT1

– Tparallel = (1 − σ)T1 / N

• By substitution:

Speedup = 1 / (σ + (1 − σ)/N)
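A small C sketch of the bound (illustrative helper with a name of our own), showing how the speedup saturates as N grows:

#include <stdio.h>

/* Amdahl's Law: upper bound on speedup for serial fraction sigma on N procs. */
double amdahl(double sigma, int N) {
    return 1.0 / (sigma + (1.0 - sigma) / N);
}

int main(void) {
    for (int N = 1; N <= 1024; N *= 4)          /* bound tends to 1/sigma = 5 */
        printf("N = %4d   speedup <= %.2f\n", N, amdahl(0.2, N));
    return 0;
}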

Page 159: PARALLEL COMPUTING:   Models and Algorithms

Consequences of Amdahl’s Law

• Say we have a program containing 100

operations each of which take 1 time unit.

• Suppose σ = 0.2, using 100 processors

– Speedup = 100 / (20 + 80/100) = 100 / 20.8 < 5

– A speedup of at most 5 is possible no matter how

many processors are available

• So why bother with parallel computing?...

Just wait for a faster processor ☺

Page 160: PARALLEL COMPUTING:   Models and Algorithms

Limitations of Amdahl’s Law

• To avoid the limitations of Amdahl’s law:

– Concentrate on parallel algorithms with small serial

components

• Amdahl's Law has been criticized for ignoring real-world overheads such as communication, synchronization and thread management, as well as the assumption of infinite-core processors

• It is not complete in that it does not take into account

problem size

• As the no. of processors increases, the amount of data

handled is likely to increase as well

Page 161: PARALLEL COMPUTING:   Models and Algorithms

Gustafson's Law

• If a parallel application using 32 processors is able to

compute a data set 32 times the size of the original,

does the execution time of the serial portion increase?

– It does not grow in the same proportion as the data set

– Real-world data suggests that the serial execution time will remain almost constant

• Gustafson's Law, aka scaled speedup, considers an

increase in the data size in proportion to the increase in

the number of processors, computing the (upper bound)

speedup of the application, as if the larger data set could

be executed in serial

Page 162: PARALLEL COMPUTING:   Models and Algorithms

Gustafson's Law Formula

Speedup ≤ p + (1-p)·s

where:

– p is the number of processors

– s is the percentage of serial execution time in the parallel application for a given data set size

• Since the % of serial time within the parallel execution

must be known, a typical usage for this formula is to

compute the speedup of the scaled parallel execution

(larger data sets as the number of processors increases)

to the serial execution of the same sized problem

Page 163: PARALLEL COMPUTING:   Models and Algorithms

Comparative Results

• E.g., if 1% of execution time on 32 cores will be spent

in serial execution, the speedup of this application

over the same data set being run on a single core with

a single thread (assuming that to be possible) is:

Speedup ≤ 32 + (1-32)·0.01 = 32 - 0.31 = 31.69

• Assuming the serial execution percentage to be 1%,

the equation for Amdahl's Law yields:

Speedup ≤ 1/(0.01 + (0.99/32)) = 24.43

• This is a false computation, however, since the given

% of serial time is relative to the 32-core execution
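The two formulas side by side, reproducing the numbers above (a sketch; the helper names are ours):

#include <stdio.h>

double gustafson(double s, int p) { return p + (1 - p) * s; }          /* scaled speedup */
double amdahl(double s, int p)    { return 1.0 / (s + (1.0 - s) / p); }

int main(void) {
    /* s = 1% serial fraction, p = 32 cores, as on this slide */
    printf("Gustafson: %.2f\n", gustafson(0.01, 32));   /* 31.69 */
    printf("Amdahl   : %.2f\n", amdahl(0.01, 32));      /* 24.43 */
    return 0;
}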

Page 164: PARALLEL COMPUTING:   Models and Algorithms

Redundancy

• Hardware redundancy: more processors are employed

for a single application, at least one acting as standby

– Very costly, but often very effective, solution

• Redundancy can be planned at a finer grain

– Individual servers can be replicated

– Redundant hardware can be used for non-critical activities

when no faults are present

• Software redundancy: software must be designed so

that the state of permanent data can be recovered or

“rolled back” when a fault is detected

Page 165: PARALLEL COMPUTING:   Models and Algorithms

Granularity of Parallelism

• Given by the average size of a sequential component in a parallel computation

• Independent parallelism: independent processes. No need to synchronize.

• Coarse-grained parallelism: relatively independent processes with occasional synchronization.

• Medium-grained parallelism. E.g. multi-threads which synchronize frequently.

• Fine-grained parallelism: synchronization every few instructions.

Page 166: PARALLEL COMPUTING:   Models and Algorithms

Degree of Parallelism

• Is given by the number of operations which can be

scheduled for simultaneous (parallel) execution

• For pipeline parallelism, where data is vector-shaped, the degree is coincident with the vector size (length)

• It may be constant throughout the steps of an

algorithm, but most often it varies

• It is best illustrated by the representation of parallel computations as DAGs

Page 167: PARALLEL COMPUTING:   Models and Algorithms

• In parallel programming there is a large gap:

Problem Structure <--…………….--> Solution Structure

• We may try an intermediate step:

Problem ---> Directed Acyclic Graph (DAG) ---> Solution

DAGs (Directed Acyclic Graphs)

- very simple, yet powerful tools -

(Particular DAGs are the so-called Task Graphs)

Problem ---> DAG:

split problem into tasks

DAG ---> Solution:

map tasks to parallel architecture

Page 168: PARALLEL COMPUTING:   Models and Algorithms

What Is A Task Graph?

A task graph is a graph which has: 1 root, 1 leaf, no cycles, and all nodes connected

(Figure: four example graphs A, B, C and D which are graphs but are not task graphs)

Page 169: PARALLEL COMPUTING:   Models and Algorithms

• The standard algorithm to create task graphs:

1. Divide problem into set of n tasks

2. Every task becomes a node in the task graph

How to go from Problem ---> Task Graph?

3. If task(xx) cannot start before task(yy) has finished

then draw a line from node(yy) to node(xx)

4. Identify (or create) starting and finishing tasks

• The process (execution) flows through the task graph like pipelining in a single processor system
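A minimal C representation of a task graph built by the algorithm above (an illustrative sketch; the type and function names are ours):

#define MAX_TASKS 64

/* edge[y][x] = 1 means task x cannot start before task y has finished. */
typedef struct {
    int n;                              /* number of tasks (nodes)    */
    int edge[MAX_TASKS][MAX_TASKS];     /* precedence arcs of the DAG */
} TaskGraph;

void add_dependency(TaskGraph *g, int y, int x) { g->edge[y][x] = 1; }

/* A task is ready to be scheduled when all its predecessors have finished. */
int is_ready(const TaskGraph *g, const int finished[], int x) {
    for (int y = 0; y < g->n; y++)
        if (g->edge[y][x] && !finished[y]) return 0;
    return 1;
}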

Page 170: PARALLEL COMPUTING:   Models and Algorithms

Memory Performance

• Capacity: How many bytes can be held

• Bandwidth: How many bytes can be

transferred per second

• Latency: How much time is needed to fetch a word

• Routing delays: when data must be gathered

from different parts of memory

• Contention: resolved by memory blocking

Page 171: PARALLEL COMPUTING:   Models and Algorithms

A course for the 4th year students

(Major in Computer Science - Software)

PARALLEL COMPUTING:

Models and Algorithms

Principles of Parallel Algorithms

Design

Page 172: PARALLEL COMPUTING:   Models and Algorithms

Contents

• Combinational Circuits (CCs): Metrics

• Parallel Design for List Operations Using CCs

• SPLIT(list1,property) --- O(size(list1))

• MERGE(list1,list2) --- O (max(size(list1),size(list2)))

• SORT (list1) ---- O(size(list1)^2)

• SEARCH(key,directory) --- O(size(directory))

• Metrics for Communication Networks (CNs):

• clique, mesh, torus, linear array, ring

• hypercube, shuffle exchange

• Designing a Connection Network

Page 173: PARALLEL COMPUTING:   Models and Algorithms

Remember our Base (Ideal) Platform: the Parallel Random Access Machine

• consists of:

• p processors, working in lock-step, synchronous

manner on the same program instructions

• each has local memory

• each is connected to unbounded shared memory

• access time to shared memory costs only one step

• an algorithm for PRAM might lead to a good algorithm for a real machine

• if something cannot be efficiently solved on PRAM, it cannot be efficiently done on any practical machine (based on current technology)

Page 174: PARALLEL COMPUTING:   Models and Algorithms

Parallel Algorithms Modeling

• Can be done in different ways

• We have already seen some models for parallel computations:

– Directed Acyclic Graphs (DAGs)

– Task Graphs

– We will next examine modeling by means of Combinational Circuits (CCs)

• Modeling computations is not enough – we must also include in the model the communication needs of the parallel algorithm

Page 175: PARALLEL COMPUTING:   Models and Algorithms

Modeling Parallel Computations: using

Combinational Circuits (CCs)

CCs - a family of models of computation, consisting of:

• A number of inputs at one end

• A number of outputs at the other end

• A number of interconnected components (internally) arranged in columns called stages

• Each component can be viewed as a single (logical) processor with constant fan-in and constant fan-out.

• Components synchronise their computations (input to output) in a constant time unit (independent of input values) – like PRAMs!

• Computations are usually simple logical operations (directly implementable in hardware for speed!), but they may be used for more complex operations as well

• There must be no feedback

Page 176: PARALLEL COMPUTING:   Models and Algorithms

Metrics for Combinational Circuits

• Width

– Shows the most efficient use of parallel resources during

execution

• Depth

– Measures the complexity of the parallel algorithm implemented

using a CC

• Size

– Measures the constructive complexity of the CC, which is

equivalent with the total number of fundamental operations

– May be an indicator for the total number of operations in the

algorithm

Page 177: PARALLEL COMPUTING:   Models and Algorithms

List Processing using Combinational

Circuits

• Imagine we have direct hardware implementation of “some” list

processing functions

• Fundamental operations of these hardware computers correspond to

fundamental components in our CCs

• Processing tasks which are non-fundamental on a standard single processor architecture can be parallelised (to reduce complexity)

• Classic processing examples – searching, sorting, permuting, ….

• Implementing them on a different parallel machine may be done

using a number of components set up in a combinational circuit.

• Question: what components are useful for implementation in a CC?

• Answering this will help us to reveal some fundamental principles in

parallel algorithm design

Page 178: PARALLEL COMPUTING:   Models and Algorithms

• Example: consider the following fundamental operations:

• (BI)PARTITION(list1) --- constant time (no need to parallelise)

• APPEND(list1,list2) --- constant time (no need to parallelise)

and the following non-fundamental operations:

• SPLIT(list1,property) --- O(size(list1))

Parallel Design for List Operations

• MERGE(list1,list2) --- O (max(size(list1),size(list2)))

• SORT (list1) ---- O(size(list1)^2)

• SEARCH(key,directory) --- O(size(directory))

• Compositional Analysis - use the analysis of each component to construct the design,

with further analysis – of speedup and efficiency

• Advantage - re-use of already done analysis

• Requires - complexity analysis for each component.

What can we do here to attack the complexity?

Page 179: PARALLEL COMPUTING:   Models and Algorithms

Parallel Design - the SPLIT operation

Consider the component SPLIT(property), with input list L = L1 … Ln and output lists M = M1 … Mp and N = N1 … Nq, where split partitions L into M and N such that:

• Forall Mx, Property(Mx)

• Forall Ny, Not(Property(Ny))

• Append(M,N) IsA permutation of L

(Figure: the SPLIT(property) component with inputs L1 … Ln and outputs M1 … Mp, N1 … Nq)

Question: Can we use the property structure to help parallelise the

design? EXAMPLE: A ^ B, A v B, ‘any boolean expression’

Page 180: PARALLEL COMPUTING:   Models and Algorithms

Example: Splitting on structured

property A ^ B

(Figure: a combinational circuit for SPLIT(A ^ B), built from a BIP (bipartition) component, SPLIT(A) and SPLIT(B) components, and append (app) components)

Page 181: PARALLEL COMPUTING:   Models and Algorithms

Example: Splitting on property A ^ B

(cont.)

NEED TO DO PROBABILISTIC ANALYSIS:

Typical gain when P(A) = 0.5 and P(B) = 0.5

(Figure: the same SPLIT(A ^ B) circuit annotated with the expected list sizes n, n/2 and n/4 flowing between its stages)

Depth of circuit is 1 + (n/2) + (n/4) + 1 + 1 = 3 + (3n/4)

Page 182: PARALLEL COMPUTING:   Models and Algorithms

Example: Splitting on an unstructured

property (integer list into evens and odds)

(Figure: a circuit for splitting an integer list into evens and odds, built from BIP, SPLIT(even/odd) and append (app) components)

Question: what is the average speedup for the design?

Page 183: PARALLEL COMPUTING:   Models and Algorithms

Example: Splitting on an unstructured

property (cont.)

(Figure: the same circuit annotated with the expected list sizes n and n/2 flowing between its stages)

Answer: Doing probabilistic analysis as before … Depth = 2 + n/2

Page 184: PARALLEL COMPUTING:   Models and Algorithms

Parallel Design - the MERGE operation

• Merging is applied on 2 sorted sequences; let them in the average,

typical case, be of equal length m = 2^n.

• Recursive implementation:

– Base case, n=0 => m = 1.

– Precondition is met: a list with 1 element is already sorted!– Precondition is met: a list with 1 element is already sorted!

– The component required is actually a comparison operator:

Merge(1) = Compare

(Figure: the base merge component M1 – a comparator C (or CAE) with inputs X = [x1], Y = [y1] and outputs [min(x1,y1)], [max(x1,y1)])

Page 185: PARALLEL COMPUTING:   Models and Algorithms

MERGE - the first recursive composition

QUESTION:

Using only component M1 (the comparison C), how can we construct

a circuit for merging lists of length 2 (M2)?

Useful Measures: Width, Depth, Size (all = 1 for the base case)

ANALYSIS:

• How many M1s (the size) are needed in total?

• What is the complexity (based on the depth)?

• What is the most efficient use of parallel resources (based on

width) during execution?

Page 186: PARALLEL COMPUTING:   Models and Algorithms

MERGE - building M2 recursively

from a number of M1s

• We may use 2 M1s to initially merge the odd and the even input elements

• Then another M1 is used to compare the middle values

(Figure: the M2 circuit – inputs X = [x1,x2] and Y = [y1,y2], three comparators C (M1s), outputs z1 … z4)

Width = 2   Depth = 2   Size = 3

Page 187: PARALLEL COMPUTING:   Models and Algorithms

MERGE - proving M2 is correct

• Validation is based on testing the CC with different input values for X and Y

• This does not prove that the CC is correct for all possible cases

• Clearly, there are equivalence classes of tests

• These must be identified and correctness must be proved only for the classes

• Here we have 3 equivalence classes (use symmetry to swap X,Y)

(Figure: the three equivalence classes shown as interval diagrams for X = [x1,x2] and Y = [y1,y2] – DISJOINT, OVERLAP, CONTAINMENT)

Page 188: PARALLEL COMPUTING:   Models and Algorithms

MERGE - The next recursive step: M4

• A 2-layer architecture can be used for constructing M4 from M2s and M1s (Cs)

• Consequently we can say M4 is constructed just from M1s!

(Figure: the M4 circuit – inputs X and Y feed a first layer of two M2s, followed by a layer of three comparators C)

Questions: 1.how can you prove the validity of the construction?

2. what are the size, width and depth (in terms of M1s)?

Page 189: PARALLEL COMPUTING:   Models and Algorithms

MERGE – Measures for recursive step

on M4

(Figure: the M4 circuit again – two M2s followed by three comparators C – annotated for the analysis)

Depth (M4) = Depth (M2) +1

Width (M4) = Max (2*Width(M2), 3)

Size (M4) = 2*Size(M2) + 3

that gives: Depth = 3, Width = 4, Size = 9

Page 190: PARALLEL COMPUTING:   Models and Algorithms

MERGE – The general recursive

construction

Now we consider the general case:

“Given any number of Mms how do we construct an M2m?”

(Figure: the M2m circuit – inputs x1 … x2m and y1 … y2m feed two Mm components, followed by a final layer of 2m-1 comparators C)

Page 191: PARALLEL COMPUTING:   Models and Algorithms

MERGE – Measures and recursive

analysis on general merge circuit Mm

Width: width(Mm) = 2 · width(Mm/2) = … = m

Depth: Let d(2m) = depth(M2m);

then d(2m) = 1 + d(m), for m ≥ 1, and d(1) = 1

… => d(m) = 1 + log(m)

Size: Let s(2m) = size(M2m);

now s(2m) = 2·s(m) + (2m − 1), for m ≥ 1, and s(1) = 1

(the final comparator layer of M2m contains the 2m − 1 C's shown in the figure)

… => s(m) = 1 + m·log(m)
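The recurrences can be checked mechanically; a small C sketch (names ours) that reproduces the measures of M1, M2, M4 and M8:

#include <stdio.h>

/* Width, depth and size of the merge circuit Mm (m a power of 2). */
int m_width(int m) { return m; }
int m_depth(int m) { return m == 1 ? 1 : 1 + m_depth(m / 2); }           /* 1 + log2(m)   */
int m_size (int m) { return m == 1 ? 1 : 2 * m_size(m / 2) + (m - 1); }  /* 1 + m*log2(m) */

int main(void) {
    for (int m = 1; m <= 8; m *= 2)
        printf("M%-2d: width = %d, depth = %d, size = %d\n",
               m, m_width(m), m_depth(m), m_size(m));
    return 0;  /* M2: 2,2,3   M4: 4,3,9   M8: 8,4,25 */
}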

Page 192: PARALLEL COMPUTING:   Models and Algorithms

Parallel Design - the SORT operation

• Sorting can be done in a lot of manners – choosing to sort by merging exhibits two important advantages:

– this is a method with great potential of parallelism

– we may use the parallel implementation of another non-fundamental operation (MERGE)

• Also a good example of recursively constructing CCs; the same technique can be applied to all CC synthesis and analysis

• This requires understanding of standard non-parallel (sequential) algorithm and shows that some sequential algorithms are better suited to parallel implementation than others

• Well suited to formal reasoning (preconditions, invariants, induction …)

Page 193: PARALLEL COMPUTING:   Models and Algorithms

Sorting by Merging

We can use the merge circuits to sort arrays - for example, sorting an array of 8 numbers:

(Figure: the S8 sorting circuit – a first stage of four M1s, a second stage of two M2s, and a final M4)

Page 194: PARALLEL COMPUTING:   Models and Algorithms

Sorting by Merging – the Analysis

• Analyse the base case for sorting a 2 integer list (S2)

• Synthesise and analyse S4

• What are the width, depth and size of Sn?

• What about cases when n is not a power of 2?

Question: is there a more efficient means of sorting using the

merge components? If so, why?
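One possible answer sketch, assuming Sn is built (as in the figure) from two S(n/2) circuits in parallel followed by one M(n/2) merger; the m_* helpers repeat the merge recurrences from the previous sketch:

/* Merge circuit measures (as before). */
int m_depth(int m) { return m == 1 ? 1 : 1 + m_depth(m / 2); }
int m_size (int m) { return m == 1 ? 1 : 2 * m_size(m / 2) + (m - 1); }

/* Sorting circuit Sn = two S(n/2) in parallel, then one M(n/2). */
int s_depth(int n) { return n == 1 ? 0 : s_depth(n / 2) + m_depth(n / 2); }
int s_size (int n) { return n == 1 ? 0 : 2 * s_size(n / 2) + m_size(n / 2); }

/* For n = 8 these give depth = 6 and size = 19 comparators;
   the width is 4 (the four M1s of the first stage). */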

Page 195: PARALLEL COMPUTING:   Models and Algorithms

Parallel Design - the SEARCH operation

Searching is fundamentally different from all the other components:

• the structure to be used for parallelisation is found in the component

(directory), not in the input data

• we need to be able to cut up state not just communication channels

• also, we need some sort of synchronisation mechanism

(Figure: a SEARCH(directory) component with input key and output data, and the open question of how to split it into several SEARCH components, each holding part of the directory)

Page 196: PARALLEL COMPUTING:   Models and Algorithms

Homework on Parallel Design using CCs

• Try to prove the correctness for sorting by merging, first for the

given case n=8, and then in the general case

• Look for information on parallel sorting on the web, and apply CCs

in your own parallel sorting method; at least two different methods

for sorting should be implemented, if possible in different manner:

– one recursive

– the other non-recursive

• Give a recursive implementation for SEARCH parallel operation

using CCs, like we did for MERGE and SORT

• Perform a recursive analysis on all the recursive methods designed

to compute the parallel complexity measures width, depth and size

Page 197: PARALLEL COMPUTING:   Models and Algorithms

Remember: Our Methodology

(Parallelization Guide)

Identify computational hotspots

• find what is worth to parallelize

Partition the problem into small semi-independent tasks

• find/create parallelism

Identify communication requirements between tasks

• realize the constraints communication puts on parallelism

Agglomerate smaller tasks into larger tasks

• group the basic tasks together so that the communication is minimized, while still allowing good load balancing properties

Translate (map) tasks/data to actual processors

• balance the load of processors, while trying to minimize communication

Page 198: PARALLEL COMPUTING:   Models and Algorithms

Parallel Algorithm Design and

Communication Networks (CNs)

• Only in the early phases of the parallel design of logical processes (LPs) can we afford to ignore the CN

• The communication network and the parallel architecture for which

a parallel algorithm is destined also play an important role in its

selection, i.e.:selection, i.e.:

• Matrix algorithms are best suited for meshes

• Divide and conquer, recursive a.o. are appropriate for trees

• Greater flexibility in the algorithm can benefit from hypercube

topologies

• Engineering choices and compromises, as well as correct estimation of the metrics, play a great role during the late phases of parallel design

Page 199: PARALLEL COMPUTING:   Models and Algorithms

Metrics for Communication Networks

• Degree:

– The degree of a LP (CN node) is its number of (direct) neighbours

in the CN graph

– The degree of the whole algorithm (CN graph) is the maximum of

all processor degrees in the network

– A high degree has theoretical power, a low degree is more practical

• Connectivity:

– Since a network node and/or link may fail, the network should still

continue to function with reduced capacity

– The node connectivity is the minimum number of nodes that must

be removed in order to partition (divide) the network

– The link connectivity is the minimum number of links that must be

removed in order to partition the network

Page 200: PARALLEL COMPUTING:   Models and Algorithms

Metrics for CNs (cont.)

• Diameter:

– Is the maximum distance between two nodes – that is, the maximum

number of nodes that must be traversed to send a message to any node

along a shortest path

– Lower diameter implies shorter time to send messages across the network

• Narrowness:

– This is a measure of (potential) congestion, defined as below

– We partition the CN into 2 groups of LPs (let’s say A and B)

– In each group the number of processors is denoted as Na and Nb (Nb<=Na)

– We count the number of interconnections between A and B (call this I)

– The maximum value of Nb/I for all possible partitions is the narrowness.

Page 201: PARALLEL COMPUTING:   Models and Algorithms

Metrics for CNs (cont.)

• Expansion increments:

– This is a measure of (potential) expansion

– A network should be expandable – that is, it should be possible to

create larger systems (of the same topology) by simply adding new

nodes

– It is better to have the option of small increments (why?)

Page 202: PARALLEL COMPUTING:   Models and Algorithms

Fully Connected Networks

A common topology: each node is

connected (directly) to all other

nodes (by 2-way links)

Example: we have 10 links with 5 nodes

Question: how many links are there with n nodes?

(Figure: the fully connected network on 5 nodes, with a partition into groups A and B used for the narrowness computation)

Metrics for n = 5:

Degree = 4

Diameter = 1

Node Connectivity = 4, Link connectivity = 4

Narrowness = 1/3: Narrowness(1) = 2/6 = 1/3, Narrowness(2) = 1/4

Expansion Increment = 1

Page 203: PARALLEL COMPUTING:   Models and Algorithms

Fully Connected Networks (cont.)

General case (we may still have to differentiate between n even and odd)

(Figure: a fully connected network on nodes 1 … n)

If n is even:

• Degree = n-1

• Connectivity = n-1

• Diameter = 1

• Narrowness = 2/n

• Expansion Increment = 1

If n is odd:

… ?

Page 204: PARALLEL COMPUTING:   Models and Algorithms

Mesh and Torus

• In a mesh, the nodes are arranged in a k-dimensional lattice of width w, giving a total of w^k nodes; we may have in particular:

• k = 1 … giving a linear array, or

• k = 2 … giving a 2-dimensional matrix

• Communication is allowed only between neighbours (no diagonal connections)

• A mesh with wraparound is called a torus

Page 205: PARALLEL COMPUTING:   Models and Algorithms

The Linear Array and the Ring

A simple ring

Question: what are the metrics

•for n = 6

•in the general case?

A chordal ring

Page 206: PARALLEL COMPUTING:   Models and Algorithms

Hypercube Connections (Binary n-Cubes)

The networks consist of N = 2^k nodes arranged in a k-dimensional hypercube. The nodes are numbered 0, 1, …, 2^k - 1 and two nodes are connected if their binary labels differ by exactly 1 bit

(Figure: a 1-D hypercube (2 nodes), a 2-D hypercube (4 nodes), a 3-D hypercube (8 nodes, labels 000 … 111) and a 4-D hypercube)

Question: what are the metrics of an n-dimensional hypercube?
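Part of the answer can be written down directly (standard facts about the binary k-cube; the helper names are ours):

/* In a k-dimensional hypercube (N = 2^k nodes) two nodes are adjacent
   iff their binary labels differ in exactly one bit. */
int popcount(unsigned x) { int c = 0; while (x) { c += x & 1; x >>= 1; } return c; }

int hypercube_adjacent(unsigned a, unsigned b) { return popcount(a ^ b) == 1; }

int hypercube_degree(int k)   { return k; }   /* k neighbours per node            */
int hypercube_diameter(int k) { return k; }   /* Hamming distance between corners */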

Page 207: PARALLEL COMPUTING:   Models and Algorithms

Shuffle Exchange

(Figure: processors p0 … p7 with their shuffle connections)

A 1-way communication line links PI to PJ, where (here N = 8):

• J = 2I for 0 <= I <= N/2 - 1

• J = 2I + 1 - N for N/2 <= I <= N - 1

2-way exchange links may be added between every even processor and its successor
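The shuffle connection is just a left rotation of the k-bit processor label; a one-line C sketch (the function name is ours):

/* Perfect-shuffle successor of processor I among N = 2^k processors:
   J = 2I for I < N/2, and J = 2I + 1 - N otherwise. */
int shuffle(int I, int N) { return (I < N / 2) ? 2 * I : 2 * I + 1 - N; }

/* e.g. N = 8:  0->0, 1->2, 2->4, 3->6, 4->1, 5->3, 6->5, 7->7 */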

Page 208: PARALLEL COMPUTING:   Models and Algorithms

Shuffle Exchange – from another view

(Figure: the same shuffle-exchange network on processors p0 … p7, redrawn from another view)

Question: what are the metrics for

• the case n = 8

• the general case, for any n which is a power of 2?

Page 209: PARALLEL COMPUTING:   Models and Algorithms

Designing a Connection Network

Typically, requirements are specified as bounds on a subset of metrics:

• min_nodes < number_nodes < max_nodes

• min_links < number_links < max_links

• connectivity > c_min

• diameter < d_max

• narrowness < n_max

Normally, experience might tell if a classic CN fits. Otherwise a CN which is close to meeting the requirements must be refined; or 2 (or more) CNs must be combined in a complementary fashion (if possible!)

Page 210: PARALLEL COMPUTING:   Models and Algorithms

A course for the 4th year students

(Major in Computer Science - Software)

PARALLEL COMPUTING:

Models and Algorithms

Parallel Virtual Computing

Environments

Page 211: PARALLEL COMPUTING:   Models and Algorithms

Contents

• Historical Background

• HPC-VCE Architectures

• HPC–VCE Programming Model

• Parallel Execution Issues in a VCE

• PVM

• MPI

Page 212: PARALLEL COMPUTING:   Models and Algorithms

Historical Background (1)

• Complex scientific research has always been looking for “immense” computing resources

• Supercomputers have been used traditionally to provide processing capability (‘60s – ’90s)

• In recent years, it has been more feasible to use “commodity computers”, that is, to build supercomputers by connecting 100s of cheap workstations to get high processing capability

• Example: the Beowulf system, created from desktop PCs linked by a high-speed network

Page 213: PARALLEL COMPUTING:   Models and Algorithms

Historical Background (2)

• High-speed networks enabled the integration of resources, geographically distributed and managed at different domains.

• In the late 1990s, Foster & Kesselman proposed a

“plug in the wall” approach: Grid Computing

– It was aimed to make globally dispersed computer

power as easy to access as an electric power grid

• In the next decade, Cloud Computing was

introduced – it refers to a technology providing

virtualized distributed resources over Internet

Page 214: PARALLEL COMPUTING:   Models and Algorithms

HPC-VCE Architectures/

Programming Paradigms

There are several widely spread types of systems for parallel processing, which are fundamentally different from the programming point of view:

• SMP (Symmetric Multi-Processing)

• MPP (Massively Parallel Processing)

• (Computer) Clusters

• Grids and Clouds

• NOW (Network of Workstations)

Page 215: PARALLEL COMPUTING:   Models and Algorithms

Symmetric Multi-Processing

• An architecture in which multiple CPUs, residing in one cabinet, are driven from a single O/S image

Page 216: PARALLEL COMPUTING:   Models and Algorithms

A Pool of Resources

• Each processor is a peer (one is not favored more than another)

– Shared bus

– Shared memory address space

– Common I/O channels and disks

– Separate caches per processor, synchronized via various techniques

• But if one CPU fails, the entire SMP system is down

– Clusters of two or more SMP systems can be used to provide high availability (fault resilience)

Page 217: PARALLEL COMPUTING:   Models and Algorithms

Scalability of SMPs

• Is limited (2-32), reduced by several factors, such as:

– Inter-processor communication

– Bus contention between CPUs and serialization points

– Kernel serialization

• Most vendors have SMP models on the market:

– Sequent, Pyramid, Encore pioneered SMP on Unix platforms

– IBM, HP, NCR, Unisys also provide SMP servers

– Many versions of Unix, Windows NT, NetWare and OS/2 have been designed or adapted for SMP

Page 218: PARALLEL COMPUTING:   Models and Algorithms

Speedup and Efficiency of SMPs

• SMPs help with overall throughput, not a single job, speeding up whatever processes can be overlapped

– In a desktop computer, it would speed up the running of multiple applications simultaneously

– If an application is multithreaded, it will improve the performance of that single application

• The OS controls all CPUs, executing simultaneously, either processing data or in an idle loop waiting to do something

– CPUs are assigned to the next available task or thread that can run concurrently

Page 219: PARALLEL COMPUTING:   Models and Algorithms

Massively Parallel Processing

• Architecture in which each available processing node runs a separate copy of the O/S

Page 220: PARALLEL COMPUTING:   Models and Algorithms

Distributed Resources

• Each CPU is a subsystem with its own memory and copy of the OS and application

• Each subsystem communicates with the others via a high-speed interconnect

• There are independent cache/memory/and I/O subsystems per node

• Data is shared via function, from node-to-node generally

• Sometimes this is referred as “shared-nothing” architecture (example: IBM SP2)

Page 221: PARALLEL COMPUTING:   Models and Algorithms

Integrated MPP and SMP

Is possible: the Reliant computer from Pyramid Technology combined both MPP and SMP processing.

Page 222: PARALLEL COMPUTING:   Models and Algorithms

Speedup and Efficiency

• Nodes communicate by passing messages, using

standards such as MPI

• Nearly all supercomputers as of 2005 are massively

parallel, and may have x100,000 CPUs

• The cumulative output of the many constituent CPUs

can result in large total peak FLOPS

– The true amount of computation accomplished depends on

the nature of the computational task and its implementation

– Some problems are more intrinsically able to be separated

into parallel computational tasks than others

• Single chip implementations of massively parallel

architectures are becoming cost effective

Page 223: PARALLEL COMPUTING:   Models and Algorithms

Further Comments

• To use MPP effectively, a problem must be breakable

into pieces that can all be solved simultaneously

• It is the case of scientific environments: simulations or

mathematical problems can be split apart and each part

processed at the same time

• In the business world: a parallel data query (PDQ) can

divide a large database into pieces (parallel groups)

• In contrast: applications that support parallel operations

(multithreading) may immediately take advantage of

SMPs - and performance gains are available to all

applications simply because there are more processors

Page 224: PARALLEL COMPUTING:   Models and Algorithms

Computer Clusters

• Composed of multiple computing nodes working

together closely so that in many respects they form

a single computer to process computational jobs

• Clusters are increasingly built by assembling the

same or similar type of commodity machines that

have one or several CPUs and CPU cores

• Clusters are used typically when the tasks of a job

are relatively independent of each other so that they

can be farmed out to different nodes of the cluster

• In some cases, tasks of a job may still need to be

processed in a parallel manner, i.e. tasks may be

required to interact with each other during execution

Page 225: PARALLEL COMPUTING:   Models and Algorithms

Computer vs. Data Clusters

• Computer clusters should not be confused with data

clusters, that refer to allocation for files/ directories

• They are loosely coupled sets of independent

processors functioning as a single system to provide:

– Higher Availability (remember: clusters of 2+ SMP

systems are used to provide fault resilience)

– Performance and Load Balancing

– Maintainability

• Examples: RS/6000 (up to 8 nodes), DEC Open

VMS Cluster (up to 16 nodes), IBM Sysplex (up to

32 nodes), Sun SparcCluster

Page 226: PARALLEL COMPUTING:   Models and Algorithms

Clustering Issues (valid for both computing and data)

• A cluster of servers may

provide fault tolerance

and/or load balancing

– If one server fails, one or

more additional servers

are still available

– Load balancing is used to

distribute the workload

over multiple systems

Page 227: PARALLEL COMPUTING:   Models and Algorithms

How It Works

• The allocation of jobs to individual nodes of a cluster

is handled by a Distributed Resource Manager (DRM)

• The DRM allocates a task to a node using the resource

allocation policies that may consider node availability,

user priority, job waiting time, etc.

• Typically, DRMs also provide submission and monitor

interface, enabling users to specify jobs to be executed

and keep track of the progress of execution

• Examples of popular resource managers are: Condor,

the Sun Grid Engine (SGE) and the Portable Batch

Queuing System (PBS)

Page 228: PARALLEL COMPUTING:   Models and Algorithms

Types of Clusters

The primary distinction within computer clusters is how

tightly-coupled the individual nodes are:

• The Beowulf Cluster Design: densely located, sharing

a dedicated network, probably has homogenous nodes

• "Grid" Computing: when a compute task uses one or

few nodes, and needs little inter-node communication

Middleware such as MPI (Message Passing Interface) or

PVM (Parallel Virtual Machine) allows well designed

programs to be portable to a wide variety of clusters

Page 229: PARALLEL COMPUTING:   Models and Algorithms

Speedup and Efficiency

• The TOP500 list includes many clusters

• Tightly-coupled computer clusters are often designed

for "supercomputing“

– The central concept of a Beowulf cluster is the use of

commercial off-the-shelf (COTS) computers to produce a

cost-effective alternative to a traditional supercomputer

• But clusters, that can have very high Flops, may be

poorer in accessing all data in the cluster

– They are excellent for parallel computation, but inferior to

traditional supercomputers at non-parallel computation

Page 230: PARALLEL COMPUTING:   Models and Algorithms

Grids

• “A computational grid is a hardware and software infrastructure providing dependable, consistent, pervasive and cheap access to high-end computational capabilities.” (Foster, Kesselman, “The Grid: Blueprint for a New Computing Infrastructure”, 1998)

• The key concept in a grid is the ability to negotiate resource-sharing arrangements among a set of resources from participating parties and then to use the resulting resource pool for some purpose

• The ancestor of the Grid is Metacomputing, which tried

to interconnect supercomputer centers with the purpose

to obtain superior processing resources

Page 231: PARALLEL COMPUTING:   Models and Algorithms

A Grid Checklist

1) Coordination of resources that are not subject to centralized control …

2) Use of standard, open, general-purpose protocols and interfaces

3) Delivery of nontrivial qualities of service (response time, throughput, availability, security)

Grid computing is concerned with “coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations” (Foster, Tuecke, “The Anatomy of the Grid,” 2000), and/or co-allocation of multiple resource types to meet complex user demands, so that the utility of the combined system is significantly greater than that of the sum of its parts

Page 232: PARALLEL COMPUTING:   Models and Algorithms

How Grids Work

• A typical grid computing architecture includes a meta-

scheduler connecting a number of geographically

distributed clusters that are managed by local DRMs

• The meta-scheduler (e.g. GridWay, GridSAM) aims

to optimize computational workloads by combining an organization's multiple DRMs into an aggregated

single view, allowing jobs to be directed to the best

location (cluster) for execution

• It integrates computational resources into a global

infrastructure, so that users no longer need to be aware

of which resources are used for their jobs

Page 233: PARALLEL COMPUTING:   Models and Algorithms

Grid Middleware

• Grid computing tried to introduce common interfaces

and standards that eliminate the heterogeneity from

the resource access in different domains

• Therefore, several grid middleware systems have been

developed to resolve the differences that exist between

submission, monitoring and query interfaces of DRMs

– The Globus Toolkit provides: a platform-independent job

submission interface, GRAM (Globus Resource Allocation

Manager), which cooperates underlying DRMs to integrate

the job submission method; a security framework, GSI (the

Grid Security Infrastructure), and a resource information

mechanism, MDS (the Monitoring and Discovery Service)

Page 234: PARALLEL COMPUTING:   Models and Algorithms

The Globus Toolkit

• Interactions with its components are mapped to local

management system specific calls; support is provided

for many DRMs, including Condor, SGE and PBS

– GridSAM provides a common job submission/monitoring interface to multiple underlying DRMs

– As a Web Service based submission service, it implements the Job Submission Description Language (JSDL) and a collection of DRM plug-ins that map JSDL requests and monitoring calls to system-specific calls

• In addition, a set of new open standards and protocols

like OGSA (Open Grid Services Architecture), WSRF

(Web Services Resource Framework) are introduced

to facilitate mapping between independent systems

Page 235: PARALLEL COMPUTING:   Models and Algorithms

Grid Challenges

Computing in grid environments may be difficult due to:

• Resource heterogeneity: results in differing capability

of processing jobs, making the execution performance

difficult to assess

• Resource dynamic behavior: it exists in both the

networks and computational resources

• Resource co-allocation: the required resources must be

offered at the same time, or the computation cannot proceed

• Resource access security: important things need to be

managed, i.e. access policy (what is shared? to whom?

when?), authentication (how users/resources identify?),

authorization (are operations consistent with the rules?)

Page 236: PARALLEL COMPUTING:   Models and Algorithms

Clouds

• Cloud computing uses the Web server facilities of a 3rd

party provider on the Internet to store, deploy and run applications

• It takes two main forms:

1. “Infrastructure as a Service” (IaaS): only the hardware/software infrastructure (OS, databases) is offered

– Includes “Utility Computing”, “DeskTop Virtualization”

2. “Software as a Service" (SaaS), which includes the business applications as well

• Regardless whether the cloud is infrastructure only or includes applications, major features are self service, scalability and speed

Page 237: PARALLEL COMPUTING:   Models and Algorithms

Speedup and Performance

• Customers log into the cloud and run their applications as desired; although a representative of the provider may be involved in setting up the service, customers make all configuration changes from their browsers

• In most cases, everything is handled online from start to finish by the customer

• The cloud provides virtually unlimited computing capacity and supports extra workloads on demand

• Cloud providers may be connected to multiple Tier 1 Internet backbones for fast response times/ availability

Page 238: PARALLEL COMPUTING:   Models and Algorithms

Infrastructure Only (IaaS/PaaS)

• Using the cloud for computing power only can be

cheap to support new projects or seasonal increases

• When constructing a new datacenter, there are very big

security, environmental and management issues, not to

mention hardware/software maintenance forever after

• In addition, commercial cloud facilities may be able to

withstand natural disasters

• Infrastructure-only cloud computing is also named

infrastructure as a service (IaaS), platform as a service

(PaaS), cloud hosting, utility computing, grid hosting

Page 239: PARALLEL COMPUTING:   Models and Algorithms

Infrastructure & Applications (SaaS)

• More often, cloud computing refers to application

service providers (ASPs) that offer everything: the

infrastructure as outlined below and the applications,

relieving the organization of virtually all maintenance

• Google Apps and Salesforce.com's CRM products are

examples of this "software-as-a-service" model (SaaS)

• This is a paradigm shift because company data are

stored externally; even if data are duplicated in-house,

copies "in the cloud" create security and privacy issues

• Companies may create private clouds within their own

datacenters, or use hybrid clouds (both private/public)

Page 240: PARALLEL COMPUTING:   Models and Algorithms

Networks of Workstations (NOW)

• Uses network based architecture (even the Internet), even when working in Massively Parallel model

• More appropriate to distributed computing, therefore they are seen sometimes as distributed computers

– NOW formed the hardware/ software foundation used by the Inktomi search engine (Inktomi was acquired by Yahoo! in 2002)

– This led to a multi-tier architecture for Internet services based on distributed systems, in use today

Page 241: PARALLEL COMPUTING:   Models and Algorithms

NOW Working in Parallel

• Application partitions task into manageable subtasks

• Application asks participating nodes to post available resources and computational burdens

– Network bandwidth

– Available RAM

– Processing power available

• Nodes respond and application parses out subtasks to nodes with less computational burden and most available resources

• Application must parse out subtasks and synchronize answer

Page 242: PARALLEL COMPUTING:   Models and Algorithms

HPC–VCE Programming Model

• High Performance Computing environments (HPCe) have to deliver a tremendous amount of power over a short period of time

• A Virtual Computing Environment (VCE) :

– Uses existing software to build a programming model on top of which rapid parallel/distributed application development is made possible

– Provides tools to create, debug, and execute applications on heterogeneous hardware

– Lets the software map high-level descriptions of the problems onto the available hardware

– Low-level issues are no longer a concern of the programmer

Page 243: PARALLEL COMPUTING:   Models and Algorithms

Parallel Computing/ Programming in

Distributed Environments

• The Bad News

– Too many architectures

– Existing architectures are too specific

– Programs too closely tied to architecture

– Software was developed using an obsolete mentality

• The Good News

– Centralized systems are a thing of the past

• Computing was evolving towards cycle servers

– Each user has his/her own computer

– Workstations are networked

• Typical LAN speeds are ≥ 100 Mb/s

Page 244: PARALLEL COMPUTING:   Models and Algorithms

Workstation Users in VCEs

• All VCE configuration include workstations

• Workstations are chronically underutilized

• Workstation users can be classified as:

– Casual Users

– Sporadic Users

– Frustrated Users

• The VCE must help frustrated users without

hurting casual and sporadic users

Page 245: PARALLEL COMPUTING:   Models and Algorithms

Other Considerations

• The VCE must be cost effective

– Use existing tools like NFS, PVM, MPI whenever possible

– Must not require tremendous amounts of processor powerprocessor power

• The VCE must coexist with other software

– Non-VCE applications should not be impacted by the VCE

• The VCE must avoid kernel modes

Page 246: PARALLEL COMPUTING:   Models and Algorithms

A VCE Minimal Configuration

(Figure: the VCE pipeline – Problem Specification, Design Stage and Coding Level, grouped under the SDM, followed by the Compilation Manager and Runtime Manager, grouped under the SEM)

• The SDM (software development module) provides tools to build the application task graph

• The SEM (software execution module) compiles applications and dispatches tasks

Page 247: PARALLEL COMPUTING:   Models and Algorithms

Parallel Execution Issues in a VCE

• Compilation Issues

– Executables must be prepared to maximize scheduling

flexibility

– Compilations must be scheduled to maximize

application performance and hardware utilization

• Runtime Issues

– Task Placement: the criteria for automatically selecting

machines to host tasks must consider both hardware

utilization and application throughput

– Programmers may improve task placement decisions

Page 248: PARALLEL COMPUTING:   Models and Algorithms

Processor Utilization and Task

Migration

• Free parallelism: parallel applications with low

efficiency benefit when run on idle machines

• Load balancing: a central issue in the execution module

• Various migration strategies are possible

– Redundant execution

– Check-pointing

– Dump and migrate

– Recompilation

– Byte coded tasks

Page 249: PARALLEL COMPUTING:   Models and Algorithms

Parallel VCE systems in a decade

• P4

• Chameleon

• Parmacs

• PVM

• MPI

• CHIMP

• NX (Intel i860, Paragon)

• …

Page 250: PARALLEL COMPUTING:   Models and Algorithms

PVM : What is it?

Heterogeneous Virtual Machine support for:

• Resource Management

– add/delete hosts from a virtual machine (VM)

• Process Control

– spawn/kill tasks dynamically

• Communication using Message Passing

– blocking send, blocking and non-blocking receive, mcast

• Dynamic Task Groups

– task can join or leave a group at any time

• Fault Tolerance

– VM automatically detects faults and adjusts

Page 251: PARALLEL COMPUTING:   Models and Algorithms

Popular PVM Uses

• Poorman’s Supercomputer

– PC clusters, Linux, Solaris, NT

– Cobble together whatever resources you can get

• Metacomputer linking multiple (super)computers

– ultimate performance: e.g. combining 1000s of processors and up to 50 supercomputers

• Education Tool

– teaching parallel programming

– academic and thesis research

Page 252: PARALLEL COMPUTING:   Models and Algorithms

PVM In a Nutshell

• PVM is set on top of different architectures running

different operating systems (Hosts)

• Each host runs a PVM daemon (PVMD)

• A collection of PVMDs defines the VM

• Once configured, tasks can be started (spawned),

killed, signaled from a console

• Communicate using basic message passing

• Performance is good

• API Semantics limit optimizations
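A minimal PVM master sketch of the spawn/send/receive cycle (illustrative only; it assumes a worker binary called "worker" is installed on the virtual machine, and the message tags 1 and 2 are our own convention):

#include <stdio.h>
#include "pvm3.h"

int main(void) {
    int tids[4], n, i;

    pvm_mytid();                                    /* enrol in the VM           */
    n = pvm_spawn("worker", NULL, PvmTaskDefault, "", 4, tids);

    for (i = 0; i < n; i++) {                       /* one integer per worker    */
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&i, 1, 1);
        pvm_send(tids[i], 1);                       /* message tag 1             */
    }
    for (i = 0; i < n; i++) {                       /* collect one result each   */
        int result;
        pvm_recv(-1, 2);                            /* any sender, tag 2         */
        pvm_upkint(&result, 1, 1);
        printf("got %d\n", result);
    }
    pvm_exit();                                     /* leave the virtual machine */
    return 0;
}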

Page 253: PARALLEL COMPUTING:   Models and Algorithms

Inside View of PVM

• Every process has a unique, virtual-machine-wide,

identifier called a task ID (TID)

• A single master PVMD disseminates current

virtual machine configuration and holds the so-

called PVM mailbox.

• The VM can grow and shrink around the master

(if the master dies, the machine falls apart)

• Dynamic configuration is used whenever practical

Page 254: PARALLEL COMPUTING:   Models and Algorithms

PVM Design

(Figure: the PVM design – one PVM daemon (pvmd) per host (one per IP address); each task is linked to the PVM library (libpvm); intra-host messages use Unix domain sockets; tasks on different hosts may use direct TCP connections; the pvmds are fully connected using UDP over the OS network interface; on a shared memory multiprocessor tasks use a shared memory transport, and on a distributed memory MPP the internal interconnect)

Page 255: PARALLEL COMPUTING:   Models and Algorithms

Multiple Transports of PVM Tasks

• PVM uses sockets mostly

– Unix-domain on host

– TCP between tasks on different hosts

– UDP between Daemons (custom reliability)

• SysV Shared Memory Transport for SMPs

– Tasks still use pvm_send(), pvm_recv()

• Native MPP

– PVM can ride atop a native MPI implementation

Page 256: PARALLEL COMPUTING:   Models and Algorithms

• PVM uses the tid to identify pvmds, tasks and groups

• Fits into a 32-bit integer

Task ID (tid) layout:  S | G | host ID (12 bits) | local part (18 bits)

• The S bit addresses a pvmd, the G bit forms a multicast address

• The local part is defined by each pvmd – e.g.

S | G | host ID (12 bits) | process (7 bits) | node ID (11 bits)

giving 4096 hosts, each with 2048 nodes

Page 257: PARALLEL COMPUTING:   Models and Algorithms

PVM Addressing

Strengths/Weaknesses

• Addresses contain routing information by

virtue of the host part

– Transport selection at runtime is simplified:

Bit-mask + table lookup

• Moving a PVM task is very difficult

• Group/multicast bit makes it straightforward

to implement multicast within point-to-point

infrastructure

Page 258: PARALLEL COMPUTING:   Models and Algorithms

MPI : Design Goals

• Make it go faster than PVM (as fast as possible)

• Operate in a serverless (daemonless) environment

• Specify portability but not interoperability

• Standardize best practices of parallel VCEs

• Encourage competing implementations

• Enable the building of safe libraries

• Make it the “assembly language” of Message

Passing
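In the same spirit, a minimal MPI sketch – no daemons, just processes exchanging one message (illustrative only):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, value = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0 && size > 1) {
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}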

Page 259: PARALLEL COMPUTING:   Models and Algorithms

MPI Implementations

• MPICH (Mississippi-Argonne) open source

– A top-quality reference implementation

– http://www-unix.mcs.anl.gov/mpi/mpich/

• High Performance Cluster MPIs

– AM-MPI, FM-MPI, PM-MPI, GM-MPI, BIP-MPI

• 10us latency, 100MB/sec on Myrinet

• Vendor supported MPI

– SGI, Cray, IBM, Fujitsu, Sun, Hitachi, …

Page 260: PARALLEL COMPUTING:   Models and Algorithms

PVM vs. MPI

Each API has its unique strengths

PVM:

• easy to use

• interoperable

• fault tolerant

• heterogeneity support

• resource control

• dynamic model

• good for experiment

• Best: Distributed Computing

MPI:

• is a standard

• widely supported

• MPP performance

• many comm. methods

• topology support

• static model (SPMD)

• good to build scalable products

• Best: Large Multiprocessor

Evaluate the needs of your application then choose