
PARALLEL AND DISTRIBUTED COMPUTING OVERVIEW

Fall 2003

TOPICS: Parallel computing requires an understanding of parallel algorithms, parallel languages, and parallel architectures, all of which are covered in this class through the following topics.

• Fundamental concepts in parallel computation

• Synchronous Computation
  – SIMD, Vector, Pipeline Computing
  – Associative Computing
  – Fortran 90
  – ASC Language
  – MultiC Language

• Asynchronous (MIMD or Multiprocessor) Shared Memory Computation
  – OpenMP language

• Distributed Memory MIMD/Multiprocessor Computation
  – Sometimes called Multicomputers
  – Programming using Message Passing
  – MPI language

• Interconnection Networks (SIMD and MIMD)

• Comparison of MIMD and SIMD Computation in Real-Time Computation Applications

GRADING FOR PDC

• Five or six homework assignments

• Midterm and Final Examination

• Grading: Homework 50%, Midterm 20%, Final 30%


Introduction to Parallel Computing (Chapter One)

• References: [1]-[4], given below.

1. Chapter 1, “Parallel Programming” by Wilkinson et al.
2. Chapter 1, “Parallel Computation” by Akl
3. Chapters 1-2, “Parallel Computing” by Quinn, 1994
4. Chapter 2, “Parallel Processing & Parallel Algorithms” by Roosta

• Need for Parallelism
  – Numerical modeling and simulation of scientific and engineering problems.
  – Solutions for problems with deadlines
    • Command & control problems like ATC (air traffic control).
  – Grand Challenge Problems
    • Sequential solutions may take months or years.

• Weather Prediction - a Grand Challenge Problem
  – The atmosphere is divided into 3D cells.
  – Data such as temperature, pressure, humidity, wind speed and direction, etc. are recorded at regular time intervals in each cell.
  – There are about 5 × 10^8 cells of (1 mile)^3.
  – It would take a modern computer over 100 days to perform the necessary calculations for a 10-day forecast.

• Parallel Programming - a viable way to increase computational speed.
  – The overall problem can be split into parts, each of which is solved by a single processor.


– Ideally, n processors would have n times the computational power of one processor, with each doing 1/nth of the computation.

– Such gains in computational power are rare, due to reasons such as

  • Inability to partition the problem perfectly into n parts of the same computational size.

  • Necessary data transfer between processors.

  • Necessary synchronization of processors.

• Two major styles of partitioning problems

– (Job) Control parallel programming

• The problem is divided into different, non-identical tasks that have to be performed.

• The tasks are divided among the processors so that their work load is roughly balanced.

• This is considered to be coarse grained parallelism.

– Data parallel programming

• Each processor performs the same computation on different data sets.

• Computations do not necessarily have to be synchronous.

• This is considered to be fine grained parallelism.
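The contrast can be made concrete with a small shared-memory sketch in C with OpenMP (the function and array names below are illustrative, not from the notes): the sections construct gives different, non-identical tasks to different threads (control parallelism), while the parallel for loop applies the same operation to each thread's share of the data (data parallelism).

/* Minimal sketch contrasting control parallelism and data parallelism
   with OpenMP; build_mesh/assemble_matrix and the arrays are
   illustrative placeholders. Compile with: gcc -fopenmp styles.c */
#include <omp.h>
#include <stdio.h>

#define N 1000000
static double a[N], b[N];

static void build_mesh(void)      { /* one distinct task        */ }
static void assemble_matrix(void) { /* a different, second task */ }

int main(void) {
    /* Control (job) parallelism: different, non-identical tasks are
       given to different threads (coarse grained). */
    #pragma omp parallel sections
    {
        #pragma omp section
        build_mesh();

        #pragma omp section
        assemble_matrix();
    }

    /* Data parallelism: every thread performs the same computation
       on its own portion of the data (fine grained). */
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        b[i] = 2.0 * a[i] + 1.0;

    printf("b[0] = %f\n", b[0]);
    return 0;
}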


Shared Memory Multiprocessors (SMPs)

• All processors have access to all memory locations.

• The processors access memory through some type of interconnection network.

• This type of memory access is called uniform memory access (UMA).

• A data-parallel programming language, based on a language like FORTRAN or C/C++, may be available.

• Alternatively, programming using threads is sometimes used.

• More programming details will be discussed later.

• The difficulty of providing fast access to all memory locations in the SMP architecture results in most SMPs having hierarchical or distributed memory systems.
  – This type of memory access is called nonuniform memory access (NUMA).

• Normally, fast cache is used with NUMA systems to reduce the problem of different memory access times for PEs.
  – This creates the problem of ensuring that all copies of the same data in different memory locations are identical.

– Numerous complex algorithms have been designed for this problem.


Message-Passing Multiprocessors (Multicomputers)

• Processors are connected by an interconnection network (which will be discussed later in the chapter).

• Each processor has a local memory and can only access its own local memory.

• Data is passed between processors using messages, as dictated by the program.

• Note: If the processors run in SIMD mode (i.e., synchronously), then the movement of data over the network can be synchronous:

– Movement of the data can be controlled by program steps.

– Much of the message-passing overhead (e.g., routing, hot-spots, headers, etc.) can be avoided.

• A common approach to programming multiprocessors is to use message-passing library routines in addition to conventional sequential programs (e.g., MPI, PVM)

• The problem is divided into processes that can be executed concurrently on individual processors. A processor is normally assigned multiple processes.

• Multicomputers can be scaled to larger sizes much easier than shared memory multiprocessors.
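As a minimal illustration of this model, the following hypothetical MPI sketch (run with at least two processes, e.g., mpirun -np 2) moves one integer between two processors; processor 1 cannot read processor 0's local memory, so the value must be sent and received explicitly.

/* Minimal message-passing sketch (MPI): each processor has only local
   memory, and data moves between processors via explicit send/receive
   calls. The value 42 is illustrative. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                       /* data local to processor 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Processor 1 must receive its own copy of the data. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}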


Multicomputers (cont.)

• Programming disadvantages of message-passing

– Programmers must make explicit message-passing calls in the code

– This is low-level programming and is error prone.

– Data is not shared but copied, which increases the total data size.

– Data Integrity: difficulty in maintaining the correctness of multiple copies of a data item.

• Programming advantages of message-passing

– No problem with simultaneous access to data.

– Allows different PCs to operate on the same data independently.

– Allows PCs on a network to be easily upgraded when faster processors become available.

• Mixed “distributed shared memory” systems

– Lots of current interest in a cluster of SMPs.

• See Dr. David Bader’s or Dr. Joseph JaJa’s website

– Other mixed systems have been developed.


Flynn’s Classification Scheme

• SISD - single instruction stream, single data stream
  – Primarily sequential processors

• MIMD - multiple instruction stream, multiple data stream
  – Includes SMPs and multicomputers.
  – Processors are asynchronous, since they can independently execute different programs on different data sets.
  – Considered by most researchers to contain the most powerful, least restricted computers.
  – Have very serious message-passing (or shared-memory) problems that are often ignored
    • when compared to SIMDs
    • when computing algorithmic complexity
  – May be programmed using a multiple programs, multiple data (MPMD) technique.
  – A common way to program MIMDs is to use a single program, multiple data (SPMD) method
    • the normal technique when the number of processors is large
    • a data-parallel programming style for MIMDs

• SIMD - single instruction stream, multiple data streams

– One instruction stream is broadcast to all processors.


Flynn’s Taxonomy (cont.)

• SIMD (cont.)

– Each processor (also called a processing element or PE) is very simplistic and is essentially an ALU;

• PEs do not store a copy of the program nor have a program control unit.

– Individual processors can be inhibited from participating in an instruction (based on a data test).

– All active processors execute the same instruction synchronously, but on different data.

– On a memory access, all active processors must access the same location in their local memory.

– The data items form an array and an instruction can act on the complete array in one cycle.

• MISD - Multiple Instruction streams, single data stream.

– This category is not used very often.

– Some include pipelined architectures in this category.


Interconnection Network Overview

References: Texts [1-4] discuss these network examples, but reference 3 (Quinn) is particularly good.

• Only an overview of interconnection networks is included here. It will be covered in greater depth later.

• The PEs (processing elements) are called nodes.

• A link is the connection between two nodes.
  – Links may be bidirectional, or two unidirectional links may be used.
  – Either one wire to carry one bit, or parallel wires (one wire for each bit in a word), can be used.
  – The above choices do not have a major impact on the concepts presented in this course.

• The diameter is the minimal number of links between the two farthest nodes in the network.
  – The diameter of a network gives the maximal distance a single message may have to travel.

• Completely Connected Network
  – Each of n nodes has a link to every other node.
  – Requires n(n-1)/2 links.
  – Impractical, unless there are very few processors.

• Line/Ring Network
  – A line consists of a row of n nodes, with connections to adjacent nodes.
  – Called a ring when a link is added to connect the two end nodes of a line.
  – The line/ring networks have many applications.
  – The diameter of a line is n-1 and of a ring is n/2 (rounded down).


• The Mesh Interconnection Network

– The nodes are in rows and columns in a rectangle.

– The nodes are connected by links that form a 2D mesh. (Give diagram on board.)

– Each interior node in a 2D mesh is connected to its four nearest neighbors.

– A square mesh with n nodes has √n rows and √n columns.

  • The diameter of a √n × √n mesh is 2(√n - 1).

– If the horizontal and vertical ends of a mesh are connected to the opposite sides, the network is called a torus.

– Meshes have been used more on actual computers than any other network.

– A 3D mesh is a generalization of a 2D mesh and has been used in several computers.

– The fact that 2D and 3D meshes model physical space make them useful for many scientific and engineering problems.

• Binary Tree Network

– A binary tree network is normally assumed to be a complete binary tree.

– It has a root node, and each interior node has two links connecting it to nodes in the level below it.

– The height of the tree is lg n and its diameter is 2 lg n.
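As a quick reference, the diameters quoted above can be collected in a small C sketch (names illustrative; n is the number of nodes, and the mesh is assumed square):

/* Sketch collecting the network diameters quoted in these notes.
   n is the number of nodes; the mesh is assumed to be sqrt(n) x sqrt(n).
   Compile with: gcc diameters.c -lm */
#include <math.h>
#include <stdio.h>

static int diam_completely_connected(int n) { return n > 1 ? 1 : 0; }
static int diam_line(int n) { return n - 1; }
static int diam_ring(int n) { return n / 2; }                  /* floor(n/2)   */
static int diam_mesh(int n) {                                  /* 2(sqrt(n)-1) */
    int side = (int)sqrt((double)n);
    return 2 * (side - 1);
}
static int diam_binary_tree(int n) {                           /* about 2 lg n */
    return 2 * (int)log2((double)n);
}

int main(void) {
    int n = 64;
    printf("completely connected: %d\n", diam_completely_connected(n));
    printf("line: %d  ring: %d  mesh: %d  tree: %d\n",
           diam_line(n), diam_ring(n), diam_mesh(n), diam_binary_tree(n));
    return 0;
}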


Metrics for Evaluating Parallelism

References: All references cover most topics in this section and have useful information not contained in others. Ref. [2, Akl] includes new research and is the main reference used, although others (esp. [3, Quinn] and [1,Wilkinson]) are also used.

Granularity: the amount of computation done between communication or synchronization steps; it is ranked as fine, intermediate, or coarse.

• SIMDs are built for efficient communications and handle fine-grained solutions well.

• SMPs or message-passing MIMDs handle communications less efficiently than SIMDs but more efficiently than clusters, and can handle intermediate-grained solutions well.

• Clusters of workstations or distributed systems have slower communications among PEs and are better suited for coarse-grained computations.

• For asynchronous computations, increasing the granularity

– reduces expensive communications

– reduces costs of process creation

– but reduces the number of concurrent processes


Parallel Metrics (continued)

Speedup

• A measure of the increase in speed (i.e., the reduction in running time) obtained through parallelism.

• Based on measured running times, S(n) = ts/tp, where

  – ts is the execution time on a single processor, using the fastest known sequential algorithm

  – tp is the execution time using a parallel computer with n processors.

• In theoretical analysis, S(n) = ts/tp, where

  – ts is the worst-case running time of the fastest known sequential algorithm for the problem

  – tp is the worst-case running time of the parallel algorithm using n PEs.
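A small worked instance with hypothetical numbers: if the fastest sequential program takes ts = 120 seconds and a run on n = 4 processors takes tp = 40 seconds, then S(4) = ts/tp = 120/40 = 3, somewhat below the ideal value of 4.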


Parallel Metrics (continued)

• Linear Speedup is optimal for most problems

– Claim: The maximum possible speedup for parallel computers with n PEs for ‘normal problems’ is n.

– Proof of claim

• Assume a computation is partitioned perfectly into n processes of equal duration.

• Assume no overhead is incurred as a result of this partitioning of the computation.

• Then, under these ideal conditions, the parallel computation will execute n times faster than the sequential computation.

• The parallel running time is ts /n.

• Then the parallel speedup of this computation is S(n) = ts /(ts /n) = n.

– We shall later see that this “proof” is not valid for certain types of nontraditional problems.

– Unfortunately, the best speedup possible for most applications is much less than n, because

  • the above assumptions are usually invalid.

• Usually some parts of programs are sequential and only one PE is active.

• Sometimes a large number of processors are idle for certain portions of the program.

– E.g., during parts of the execution, many PEs may be waiting to receive or to send data.


Parallel Metrics (cont)

Superlinear speedup (i.e., when S(n) > n): Most texts besides [2,3] argue that

• Linear speedup is the maximum speedup obtainable.
  – The preceding “proof” is used to argue that superlinearity is impossible.

• Occasionally speedup that appears to be superlinear may occur, but it can be explained by other reasons, such as
  – the extra memory in the parallel system.
  – a sub-optimal sequential algorithm being used for comparison.
  – luck, in the case of an algorithm that has a random aspect in its design (e.g., random selection).

• Selim Akl has shown that superlinear algorithms can be given for some less standard problems.
  – Some problems cannot be solved without the use of parallel computation.
  – Some problems are natural to solve using parallelism, and sequential solutions are inefficient.

– The final chapter of Akl’s textbook and several journal papers have been written to establish that these claims are valid, but it may still be a long time before they are fully accepted.

– Superlinearity has been a hotly debated topic for too long to be accepted quickly.


Amdahl’s Law

• Assumes that the speedup is not superlinear; i.e.,

  S(n) = ts/tp ≤ n

– Assumption only valid for traditional problems.

• By Figure 1.29 in [1] (or slide #40), if f denotes the fraction of the computation that must be sequential,

  tp ≥ f ts + (1-f) ts/n

• Substituting the above inequality into the equation for S(n) and simplifying (see slide #41 or the text) yields

  S(n) ≤ n / ((n-1) f + 1)

• Amdahl’s “law”: S(n) ≤ 1/f, where f is as above.

• See Slide #41 or Fig. 1.30 for related details.

• Note that S(n) never exceeds 1/f and approaches 1/f as n increases.

• Example: If only 5% of the computation is serial, the maximum speedup is 20, no matter how many processors are used.

• Observations: Amdahl’s law limitations to parallelism:

– For a long time, Amdahl’s law was viewed as a fatal limit to the usefulness of parallelism.



– Amdahl’s law is valid, and some textbooks discuss how it can be used to increase the efficiency of many parallel algorithms.

– Shows that efforts required to further reduce the fraction of the code that is sequential may pay off in large performance gains.

– Hardware that allows even a small decrease in the percent of things executed sequentially may be considerably more efficient.

– A key flaw in past arguments that Amdahl’s law is a fatal limit to the future of parallelism is

• Gustafson’s Law: The proportion of the computation that is sequential normally decreases as the problem size increases.

– Other limitations in applying Amdahl’s Law:

• Its proof focuses on the steps in a particular algorithm, and does not consider that other algorithms with more parallelism may exist.

• Amdahl’s law applies only to ‘standard’ problems where superlinearity doesn’t occur.

– For more details on superlinearity, see [2] “Parallel Computation: Models and Methods”, Selim Akl, pgs 14-20 (Speedup Folklore Theorem) and Chapter 12.
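The bound S(n) ≤ n/((n-1)f + 1) derived above is easy to tabulate; the small C sketch below (the serial fraction and processor counts are illustrative choices) reproduces the 5%-serial example and shows S(n) approaching 1/f = 20 as n grows.

/* Sketch evaluating the Amdahl bound S(n) = n / ((n-1)f + 1) from the
   notes; the serial fraction f and processor counts are illustrative. */
#include <stdio.h>

static double amdahl_bound(double f, int n) {
    return n / ((n - 1) * f + 1.0);
}

int main(void) {
    const double f = 0.05;                 /* 5% of the work is serial */
    const int counts[] = { 4, 16, 64, 256, 1024 };
    const int m = sizeof counts / sizeof counts[0];

    for (int i = 0; i < m; i++)
        printf("n = %4d   S(n) <= %.2f\n", counts[i], amdahl_bound(f, counts[i]));
    printf("limit as n grows: 1/f = %.1f\n", 1.0 / f);
    return 0;
}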


More Metrics for Parallelism

• Efficiency is defined by

  E(n) = S(n)/n = ts/(n × tp)

  – Efficiency gives the percentage of full utilization of the parallel processors on the computation, assuming a speedup of n is the best possible.

• Cost: The cost of a parallel algorithm or parallel execution is defined by

  Cost = (running time) × (number of PEs) = tp × n

  – Observe that E = ts / Cost.

  – Cost allows the quality of parallel algorithms to be compared to that of sequential algorithms.
    • Compare the cost of the parallel algorithm to the running time of the sequential algorithm.
    • The advantage that parallel algorithms have in using multiple processors is removed by multiplying their running time by the number n of processors they are using.

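Continuing the hypothetical numbers used earlier for speedup (ts = 120 s, tp = 40 s, n = 4): E = S(4)/4 = 3/4 = 0.75, i.e., 75% utilization, and Cost = tp × n = 40 × 4 = 160 s, compared with ts = 120 s for the sequential run.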


More Metrics (cont.)

• Cost-Optimal Parallel Algorithm: A parallel algorithm for a problem is said to be cost-optimal if its cost is proportional to the running time of an optimal sequential algorithm for the same problem.

– By proportional, we mean that

  Cost = tp × n = k × ts

  where k is a constant. (See pg 67 of [1].)

– Equivalently, a parallel algorithm is cost-optimal if

  parallel cost = O(f(t)),

  where f(t) is the running time of an optimal sequential algorithm.

– In cases where no optimal sequential algorithm is known, then the “fastest known” sequential algorithm is often used instead.
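A standard illustration (not from these slides): adding n numbers sequentially takes O(n) time. A parallel sum using n/lg n PEs, where each PE first adds lg n values and the partial sums are then combined in a tree, runs in O(lg n) time, so its cost is (n/lg n) × O(lg n) = O(n) and it is cost-optimal. The same sum on n PEs still needs O(lg n) time, giving cost O(n lg n), which is not cost-optimal.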


Sieve of Eratosthenes (A Data-Parallel vs Control-Parallel Example)

• Reference [3, Quinn, Ch. 1], pages 10-17

• A prime number is a positive integer with exactly two factors, itself and 1.

• The Sieve (pronounced “siv”) algorithm finds the prime numbers less than or equal to some positive integer n.
  – Begin with a list of the natural numbers 2, 3, 4, …, n.
  – Remove composite numbers from the list by striking multiples of 2, 3, 5, and successive primes.
  – After each striking, the next unmarked natural number is prime.
  – The sieve terminates after multiples of the largest prime less than or equal to √n have been struck from the list.

• The sequential implementation uses 3 data structures:
  – A Boolean array indexed by the numbers being sieved.
  – An integer holding the latest prime found so far.
  – A loop index that is incremented as multiples of the current prime are marked as composite numbers.
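A minimal sequential sketch of the algorithm in C (illustrative names; the Boolean array, the current prime, and the inner striking loop correspond to the three data structures just listed):

/* Minimal sequential sieve sketch; "composite" plays the role of the
   Boolean array and "prime" the latest prime found so far. */
#include <stdio.h>
#include <stdbool.h>

#define N 1000

int main(void) {
    static bool composite[N + 1];        /* false = still assumed prime */

    for (long prime = 2; prime * prime <= N; prime++) {
        if (composite[prime]) continue;  /* next unmarked value is prime */
        for (long m = prime * prime; m <= N; m += prime)
            composite[m] = true;         /* strike multiples             */
    }

    for (long i = 2; i <= N; i++)        /* remaining unmarked = primes  */
        if (!composite[i]) printf("%ld ", i);
    printf("\n");
    return 0;
}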


A Control-Parallel Approach

• Control parallelism involves applying a different sequence of operations to different data elements

• Useful for

– Shared-memory MIMD

– Distributed-memory MIMD

– Asynchronous PRAM

• Control-parallel sieve
  – Each processor works with a different prime and is responsible for striking multiples of that prime and identifying a new prime number.
  – Each processor starts marking…
  – Shared memory contains
    • the Boolean array containing the numbers being sieved,
    • an integer holding the largest prime found so far.
  – Each PE's local memory contains a local loop index keeping track of multiples of its current prime (since each is working with a different prime).


A Control-Parallel Approach (cont.)

• Problems and inefficiencies

– Algorithm for shared-memory MIMD: each processor repeatedly

  1. accesses the variable holding the current prime,
  2. searches for the next unmarked value, and
  3. updates the variable containing the current prime.

– Two processors must be prevented from doing this at the same time.

– A processor could waste time sieving multiples of a composite number.

• How much speedup can we get?

– Suppose n = 1000.

– Sequential algorithm:

  • The time to strike out the multiples of prime p is (n + 1 - p²)/p steps (integer division).

  • Multiples of 2: ((1000+1) - 4)/2 = 997/2 = 498

  • Multiples of 3: ((1000+1) - 9)/3 = 992/3 = 330

  • Total sum = 1411 strikeout “steps”

– 2 PEs give speedup 1411/706 = 2.00.

– 3 PEs give speedup 1411/499 = 2.83.

– 3 PEs require 499 strikeout time units, so no more speedup is possible using additional PEs.

  • Multiples of 2 dominate with 498 strikeout steps.
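A hypothetical shared-memory sketch of this control-parallel scheme, using OpenMP with illustrative names, is given below. Each thread claims the next unmarked value inside a critical section and then strikes its multiples; as noted above, a thread can occasionally claim a not-yet-marked composite, which wastes time but does not affect correctness.

/* Control-parallel sieve sketch (OpenMP). The current-prime variable is
   shared and updated inside a critical section; each thread keeps its own
   striking loop. _Atomic marks keep concurrent reads/writes well defined.
   Compile with: gcc -std=c11 -fopenmp csieve.c */
#include <omp.h>
#include <stdio.h>
#include <stdatomic.h>

#define N 1000

static _Atomic char composite[N + 1];      /* 0 = still assumed prime     */

int main(void) {
    int current = 1;                       /* last value claimed so far   */
    const int limit = 31;                  /* largest integer <= sqrt(N)  */

    #pragma omp parallel
    {
        for (;;) {
            int p = 0;
            /* Only one thread at a time may advance to the next prime. */
            #pragma omp critical
            {
                do { current++; } while (current <= limit && composite[current]);
                if (current <= limit) p = current;
            }
            if (p == 0) break;             /* nothing left to claim       */
            for (int m = p * p; m <= N; m += p)
                composite[m] = 1;          /* strike multiples of p       */
        }
    }

    int count = 0;
    for (int i = 2; i <= N; i++)
        if (!composite[i]) count++;
    printf("%d primes <= %d\n", count, N); /* 168 for N = 1000            */
    return 0;
}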


A Data-Parallel Approach

• Data parallelism refers to using multiple PEs to apply the same sequence of operations to different data elements.

• Useful for the following types of parallel operations

– SIMD

– PRAM

– Shared-memory MIMD,

– Distributed-memory MIMD.

• Generally not useful for pipeline operations.

• A data-parallel sieve algorithm:
  – Each processor works with the same prime and is responsible for striking multiples of that prime from its own segment of the array of natural numbers.
  – Assume we have k processors, where k << √n (i.e., k is much less than √n).
    • Each processor gets no more than ⌈n/k⌉ natural numbers.
    • All primes less than √n, as well as the first prime greater than √n, are in the list controlled by the first processor.


A Data-Parallel Approach (cont.)

• Data-parallel Sieve (cont.)

– Distributed-memory MIMD Algorithm
  • Processor 1 finds the next prime and broadcasts it to all PEs.
  • Each PE goes through its part of the array, striking multiples of that prime (performing the same operation).
  • This continues until the first processor reaches a prime greater than √n. (An MPI sketch of this algorithm appears at the end of this section.)

• How much speedup can we get?

– Suppose n = 1,000,000 and we have k PEs.

– There are 168 primes less than 1,000, the largest of which is 997.

– The maximum execution time to strike out multiples of the primes is about

  (⌈1,000,000/k⌉/2 + ⌈1,000,000/k⌉/3 + ⌈1,000,000/k⌉/5 + … + ⌈1,000,000/k⌉/997) × etime

  where etime is the time for one strikeout step.

– The sequential execution time is the above sum with k = 1.

– The communication time is 168 × (k-1) × ctime, where ctime is the time for one communication.

– We assume that each communication takes 100 times longer than a strikeout step (ctime = 100 × etime).


A Data-Parallel Approach (cont.)

• How much speedup can we get? (cont.)

  – Speedup is not directly proportional to the number of PEs; it is highest at 11 PEs.
    • Computation time is inversely proportional to the number of processors used.
    • Communication time increases linearly with the number of processors.
    • After 11 processors, the increase in communication time exceeds the decrease in computation time, and the total execution time increases.

  • Study Figures 1-11 and 1-12 on pg 15 of [3].

– What about parallel I/O time?
  • Practically, the primes generated must be stored on an external device.
  • Assume access to the device is sequential.
  • I/O time is constant because the output must be performed sequentially.
  • This sequential code severely limits the parallel speedup, according to Amdahl's law:
    – the fraction of operations that must be performed sequentially limits the maximum speedup possible.
  • Parallel I/O is an important topic in parallel computing.
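The MPI sketch promised earlier: a hypothetical implementation of the data-parallel sieve with a block decomposition of 2..n across the ranks (all names are illustrative). Rank 0 owns the low block, so it finds each successive prime and broadcasts it; every rank then strikes that prime's multiples in its own block. This assumes k << √n, as stated earlier, so that all primes up to √n lie in rank 0's block.

/* Data-parallel sieve sketch (MPI), block-decomposing 2..n across ranks.
   Rank 0 finds each prime and broadcasts it; all ranks strike multiples
   in their own blocks. Assumes the number of ranks k << sqrt(n). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, size;
    long n = 1000000;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank owns the block [low, high] of the range 2..n. */
    long count = n - 1;
    long low  = 2 + rank * count / size;
    long high = 2 + (rank + 1) * count / size - 1;
    char *marked = calloc(high - low + 1, 1);   /* 0 = not yet struck */

    long prime = 2;
    while (prime * prime <= n) {
        /* Strike local multiples of the current prime, starting at the
           first multiple in this block that is at least prime*prime. */
        long first = (low % prime == 0) ? low : low + (prime - low % prime);
        if (first < prime * prime) first = prime * prime;
        for (long m = first; m <= high; m += prime)
            marked[m - low] = 1;

        /* Rank 0 owns the small numbers, so it finds the next prime. */
        if (rank == 0) {
            do { prime++; } while (marked[prime - low]);
        }
        MPI_Bcast(&prime, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    }

    /* Count local unmarked values and reduce to a global prime count. */
    long local = 0, global = 0;
    for (long i = low; i <= high; i++)
        if (!marked[i - low]) local++;
    MPI_Reduce(&local, &global, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("%ld primes <= %ld\n", global, n);

    free(marked);
    MPI_Finalize();
    return 0;
}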


Future Additions Needed

• To be added:

– The work metric and work optimal concept.

– The speedup and slowdown results from [2].

– Data parallel vs control parallel example

• Possibly a simpler example than sieve in Quinn

– Look at chapter 2 of [8, Jordan] for possible new information.
