EECC756 - Shaaban
#1 Exam Review, Spring 2001, 5-10-2001
Parallel Computer Architecture
• A parallel computer is a collection of processing elements that cooperate to solve large problems.
• Broad issues involved:
– Resource Allocation:
• Number of processing elements (PEs).
• Computing power of each element.
• Amount of physical memory used.
– Data access, Communication and Synchronization:
• How the elements cooperate and communicate.
• How data is transmitted between processors.
• Abstractions and primitives for cooperation.
– Performance and Scalability:
• Performance enhancement of parallelism: Speedup.
• Scalability of performance to larger systems/problems.
Fundamental Design Issues
• At any layer there are interface (contract) aspects and performance aspects:
– Naming: How are logically shared data and/or processes referenced?
– Operations: What operations are provided on these data?
– Ordering: How are accesses to data ordered and coordinated?
– Replication: How are data replicated to reduce communication?
– Communication Cost: Latency, bandwidth, overhead, occupancy.
• Understand these at the programming model level first, since that sets the requirements.
• Other issues:
– Node Granularity: How to split between processors and memory?
Conditions of Parallelism: Data Dependence
1. True Data (Flow) Dependence: A statement S2 is data dependent on statement S1 if an execution path exists from S1 to S2 and if at least one output variable of S1 feeds in as an input operand used by S2.
Denoted by S1 → S2.
2. Antidependence: Statement S2 is antidependent on S1 if S2 follows S1 in program order and if the output of S2 overlaps the input of S1.
Denoted by S1 ⇸ S2 (a crossed arrow).
3. Output Dependence: Two statements are output dependent if they produce (write) the same output variable.
Conditions of Parallelism: Data Dependence (continued)
4 I/O dependence: Read and write are I/O statements. I/O dependence occurs not because the same variable is involved but because the same file is referenced by both I/O statements.
5 Unknown dependence: the dependence relation cannot be determined in situations such as:
• The subscript of a variable is itself subscripted (indirect addressing).
• The subscript does not contain the loop index.
• A variable appears more than once with subscripts having different coefficients of the loop variable.
• The subscript is nonlinear in the loop index variable.
Data and I/O Dependence: Examples

Example 1 (data dependence):
S1: Load R1, A
S2: Add R2, R1
S3: Move R1, R3
S4: Store B, R1

Dependence graph: S1 → S2 (flow on R1), S2 ⇸ S3 (anti on R1), S1 → S3 (output on R1), S3 → S4 (flow on R1).

Example 2 (I/O dependence):
S1: Read (4), A(I) /Read array A from tape unit 4/
S2: Rewind (4) /Rewind tape unit 4/
S3: Write (4), B(I) /Write array B into tape unit 4/
S4: Rewind (4) /Rewind tape unit 4/

S1 and S3 are I/O dependent: the I/O dependence is caused not by a shared variable but by the read and write statements accessing the same file (tape unit 4).
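The dependence classes above can be checked mechanically from per-statement read/write sets. A minimal Python sketch, not from the slides; the read/write sets below are my encoding of the four-statement example, with registers and memory locations treated alike:

```python
# Sketch: recovering the dependence graph of the four-statement example
# above from per-statement read/write sets (my encoding of the slide's
# code; an assumption made for illustration).
reads  = {"S1": {"A"},  "S2": {"R1"}, "S3": {"R3"}, "S4": {"R1"}}
writes = {"S1": {"R1"}, "S2": {"R2"}, "S3": {"R1"}, "S4": {"B"}}
order  = ["S1", "S2", "S3", "S4"]

flow, anti, out = [], [], []
last_writer = {}                       # most recent writer of each variable
for s in order:
    for v in reads[s]:                 # flow dependence: consume the value
        if v in last_writer:           # produced by the most recent writer
            flow.append((last_writer[v], s))
    for v in writes[s]:
        last_writer[v] = s
for i, s1 in enumerate(order):
    for s2 in order[i + 1:]:
        if reads[s1] & writes[s2]:     # antidependence: S2 overwrites S1's input
            anti.append((s1, s2))
        if writes[s1] & writes[s2]:    # output dependence: same output variable
            out.append((s1, s2))

print(flow)   # [('S1', 'S2'), ('S3', 'S4')]
print(anti)   # [('S2', 'S3')]
print(out)    # [('S1', 'S3')]
```

Tracking the most recent writer means the flow list excludes S1 → S4: S3 rewrites R1 before S4 reads it.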
• Input:
– n × n matrix A; vector x of order n.
– The processor number i; the number of processors p.
– The ith submatrix B = A(1:n, (i−1)r+1 : ir) of size n × r, where r = n/p.
– The ith subvector w = x((i−1)r+1 : ir) of size r.
• Output:
– Processor Pi computes the vector y = A1x1 + … + Aixi and passes the result to the right.
– Upon completion, P1 will hold the product Ax.
Begin
1. Compute the matrix-vector product z = Bw.
2. If i = 1 then set y := 0; else receive(y, left).
3. Set y := y + z.
4. send(y, right).
5. If i = 1 then receive(y, left).
End
Tcomp = k(n²/p)
Tcomm = p(l + mn)
T = Tcomp + Tcomm = k(n²/p) + p(l + mn)
Example: Asynchronous Matrix Vector Product on a Ring
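The ring algorithm above can be simulated sequentially. A sketch in Python, assuming p evenly divides n and replacing real message passing with a loop over the processors in ring order (the function name and structure are mine, not from the slides):

```python
# Sequential simulation of the ring algorithm above: each "processor" i
# holds column block B (columns (i-1)r+1 .. ir) and subvector w, computes
# z = B w, adds it to the partial sum received from the left, and passes
# the result to the right. Assumes p evenly divides n.
def ring_matvec(A, x, p):
    n = len(A)
    r = n // p
    y = [0.0] * n                      # partial sum circulating on the ring
    for i in range(p):                 # visit processors in ring order
        cols = range(i * r, (i + 1) * r)
        z = [sum(A[row][c] * x[c] for c in cols) for row in range(n)]
        y = [yk + zk for yk, zk in zip(y, z)]    # step 3: y := y + z
    return y                           # after the round trip, P1 holds Ax

print(ring_matvec([[1, 2], [3, 4]], [1, 1], p=2))   # [3.0, 7.0]
```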
Limited Concurrency: Amdahl’s Law
– The most fundamental limitation on parallel speedup.
– If fraction s of sequential execution is inherently serial, then Speedup ≤ 1/s.
– Example: 2-phase calculation:
• Sweep over an n-by-n grid and do some independent computation.
• Sweep again and add each value to a global sum.
– Time for the first phase = n²/p.
– The second phase is serialized at the global variable, so its time = n².
– Speedup ≤ 2n² / (n²/p + n²), or at most 2.
– Possible trick: divide the second phase into two:
• Accumulate into a private sum during the sweep.
• Add the per-process private sums into the global sum.
– Parallel time is n²/p + n²/p + p, and speedup is at best 2n² / (2n²/p + p), which approaches p when n is large.
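A quick numeric check of the two-phase example; the particular n and p values are arbitrary illustrative choices:

```python
# Numeric check of the two-phase Amdahl example above.
def naive_speedup(n, p):
    """Phase 1 parallel (n^2/p), phase 2 serialized at the global sum (n^2)."""
    return 2 * n**2 / (n**2 / p + n**2)

def improved_speedup(n, p):
    """Phase 2 split: private sums in parallel, then p serial additions."""
    return 2 * n**2 / (2 * n**2 / p + p)

n, p = 1000, 100
print(naive_speedup(n, p))     # just under 2, no matter how large p gets
print(improved_speedup(n, p))  # close to p while n >> p
```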
Summary of Parallel Algorithms Analysis
• Requires characterization of the multiprocessor system and the algorithm.
• Historical focus on algorithmic aspects: partitioning, mapping.
• PRAM model: data access and communication are free:
– Only load balance (including serialization) and extra work matter.
– Useful for early development, but unrealistic for real performance.
– Ignores communication and also the imbalances it causes.
– Can lead to a poor choice of partitions as well as orchestration.
– More recent models incorporate communication costs: BSP, LogP, ...
Speedup < Sequential Instructions / Max (Instructions + Synch Wait Time + Extra Instructions)
Synchronous Iteration
• Iteration-based computation is a powerful method for solving numerical (and some non-numerical) problems.
• For numerical problems, a calculation is repeated, and each time a result is obtained that is used in the next execution. The process is repeated until the desired result is obtained.
• Though iterative methods are sequential in nature, parallel implementation can be successfully employed when there are multiple independent instances of the iteration. In some cases this is part of the problem specification, and sometimes one must rearrange the problem to obtain multiple independent instances.
• The term "synchronous iteration" describes solving a problem by iteration where different tasks may be performing separate iterations, but the iterations must be synchronized using point-to-point synchronization, barriers, or other synchronization mechanisms.
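A minimal sketch of the synchronous-iteration pattern using Python threads and a barrier. The Jacobi-style averaging update is my choice of computation; the slides describe only the pattern. Two barrier waits per iteration separate the compute and publish phases:

```python
# Sketch: synchronous (Jacobi-style) iteration with p threads and a barrier.
# The averaging update is an illustrative choice, not from the slides.
import threading

def synchronous_iteration(x, p, iters):
    """Each thread owns a contiguous block of x; two barrier waits per
    iteration separate computing new values from publishing them."""
    n = len(x)
    nxt = x[:]                                   # next-iteration values
    barrier = threading.Barrier(p)

    def worker(i):
        lo, hi = i * n // p, (i + 1) * n // p
        for _ in range(iters):
            for k in range(lo, hi):              # read current values only
                left = x[k] if k == 0 else x[k - 1]
                right = x[k] if k == n - 1 else x[k + 1]
                nxt[k] = (left + right) / 2
            barrier.wait()                       # everyone finished computing
            x[lo:hi] = nxt[lo:hi]                # publish own block
            barrier.wait()                       # everyone finished publishing

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x

# the values drift toward a common fixed point as iterations proceed
print(synchronous_iteration([0.0, 0.0, 4.0, 0.0], p=2, iters=200))
```

Without the second barrier, one thread could start reading `x` for its next iteration while another is still publishing; the double barrier is the standard fix.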
– Physical interconnection structure of the network graph:
• Node degree.
• Network diameter: the longest minimum routing distance between any two nodes, in hops.
• Average distance between nodes.
• Bisection width: the minimum number of links whose removal disconnects the graph and cuts it into two halves.
• Symmetry: the property that the network looks the same from every node.
• Homogeneity: whether all the nodes and links are identical or not.
– Type of interconnection:
• Static or direct interconnects: nodes connected directly using static point-to-point links.
• Dynamic or indirect interconnects: switches are usually used to realize dynamic links between nodes:
– Each node is connected to a specific subset of switches (e.g., multistage interconnection networks, MINs).
– Blocking or non-blocking; permutations realized.
• Shared-, broadcast-, or bus-based connections (e.g., Ethernet-based).
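The static-network metrics above (node degree, diameter, average distance) can be computed for a small example network by breadth-first search. A sketch, not from the slides, using a 4-node ring as the example:

```python
# Sketch: computing degree, diameter, and average distance (in hops) for
# a small static network by BFS; the 4-node ring is my example choice.
from collections import deque

def hop_distances(adj, src):
    """BFS hop counts from src in an adjacency-list graph."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def metrics(adj):
    n = len(adj)
    degree = max(len(nbrs) for nbrs in adj.values())
    all_d = [d for s in adj for d in hop_distances(adj, s).values()]
    # self-distances are 0, so dividing by n(n-1) averages over node pairs
    return degree, max(all_d), sum(all_d) / (n * (n - 1))

ring4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(metrics(ring4))   # degree 2, diameter 2, average distance 4/3
```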
Dynamic Networks Definitions
• Permutation networks: can provide any one-to-one mapping between sources and destinations.
• Strictly non-blocking: Any attempt to create a valid connection succeeds. These include Clos networks and the crossbar.
• Wide Sense non-blocking: In these networks any connection succeeds if a careful routing algorithm is followed. The Benes network is the prime example of this class.
• Rearrangeably non-blocking: Any attempt to create a valid connection eventually succeeds, but some existing links may need to be rerouted to accommodate the new connection. Batcher's bitonic sorting network is one example.
• Blocking: Once certain connections are established it may be impossible to create other specific connections. The Banyan and Omega networks are examples of this class.
• Single-stage networks: Crossbar switches are single-stage, strictly non-blocking, and can implement not only the N! permutations, but also the N^N combinations of non-overlapping broadcast.
Permutations
• For n objects there are n! permutations by which the n objects can be reordered.
• The set of all permutations forms a permutation group with respect to a composition operation.
• One can use cycle notation to specify a permutation function. For example, the permutation π = (a, b, c)(d, e) stands for the bijection mapping a → b, b → c, c → a, d → e, e → d in a circular fashion. The cycle (a, b, c) has a period of 3 and the cycle (d, e) has a period of 2. Combining the two cycles, the permutation π has a period of 2 × 3 = 6. If one applies the permutation six times, the identity mapping I = (a)(b)(c)(d)(e) is obtained.
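The cycle example can be checked directly. A short sketch confirming that π = (a, b, c)(d, e) has period 6:

```python
# Check of the cycle example above: pi = (a b c)(d e) should have period
# lcm(3, 2) = 6, i.e. only the 6th power is the identity mapping.
pi = {"a": "b", "b": "c", "c": "a", "d": "e", "e": "d"}

def apply_times(perm, k):
    """Compose perm with itself k times."""
    result = {x: x for x in perm}              # start from the identity
    for _ in range(k):
        result = {x: perm[result[x]] for x in result}
    return result

identity = {x: x for x in pi}
periods = [k for k in range(1, 7) if apply_times(pi, k) == identity]
print(periods)   # [6]
```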
Perfect Shuffle
• The perfect shuffle is a special permutation function suggested by Harold Stone (1971) for parallel processing applications.
• Obtained by rotating the binary address of an object one position left.
• The perfect shuffle and its inverse for 8 objects are shown here: [figure omitted]
• In the Omega network, perfect shuffle is used as an inter-stage connection pattern for all log2N stages.
• Routing is simply a matter of using the destination's address bits to set switches at each stage.
• The Omega network is a single-path network: There is just one path between an input and an output.
• It is equivalent to the Banyan, Staran Flip Network, Shuffle Exchange Network, and many others that have been proposed.
• The Omega can only implement N^(N/2) of the N! permutations between inputs and outputs, so it is possible to have permutations that cannot be provided (i.e., paths that can be blocked).
– For N = 8, 8^4/8! = 4096/40320 = 0.1016, i.e., 10.16% of the permutations can be implemented.
• It can take log2N passes of reconfiguration to provide all links. Because there are log2N stages, the worst-case time to provide all desired connections can be (log2N)².
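The perfect shuffle and the destination-tag routing described above can be sketched as follows; the helper names are mine, and switch setting 0 means straight, 1 means exchange:

```python
# Sketch of the perfect shuffle (left rotation of the binary address) and
# destination-tag routing through the log2(N) stages of an Omega network.
def shuffle(addr, bits):
    """Rotate a bits-bit address one position left."""
    msb = (addr >> (bits - 1)) & 1
    return ((addr << 1) & ((1 << bits) - 1)) | msb

def omega_route(src, dst, bits):
    """Return the switch setting used at each stage on the unique path
    from src to dst; the routing tag is simply dst's bits, MSB first."""
    settings = []
    addr = src
    for stage in range(bits):
        addr = shuffle(addr, bits)              # inter-stage shuffle pattern
        want = (dst >> (bits - 1 - stage)) & 1  # next destination address bit
        settings.append((addr & 1) ^ want)      # exchange iff low bit differs
        addr = (addr & ~1) | want               # the switch emits that bit
    assert addr == dst                          # the single path always arrives
    return settings

print(shuffle(0b001, 3))      # 2 (binary 010)
print(omega_route(0, 5, 3))   # [1, 0, 1]
```

Blocking arises only between simultaneous connections that need the same switch; any single source-destination pair always has its one path.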
Multi-Stage Networks: The Omega Network
– Symmetric access to all of main memory from any processor.
• Currently dominate the high-end server market:
– Building blocks for larger systems; arriving at the desktop.
• Attractive as high-throughput servers and for parallel programs:
– Fine-grain resource sharing.
– Uniform access via loads/stores.
– Automatic data movement and coherent replication in caches.
• Normal uniprocessor mechanisms are used to access data (reads and writes):
– The key is the extension of the memory hierarchy to support multiple processors.
Caches And Cache Coherence In Shared Memory Multiprocessors
• Caches play a key role in all shared memory multiprocessor system variations:
– Reduce average data access time.
– Reduce bandwidth demands placed on the shared interconnect.
• Private processor caches create a problem:
– Copies of a variable can be present in multiple caches.
– A write by one processor may not become visible to others:
• Processors may keep accessing a stale value in their private caches.
– Process migration.
– I/O activity.
– This is the cache coherence problem.
– Hardware and/or software actions are needed to ensure write visibility to all processors, thus maintaining cache coherence.
– Adherence of processors and the memory system to the expected memory access behavior.
• Consistency Models: specify the order by which the shared memory access events of one process should be observed by other processes in the system.
– Sequential Consistency Model.
– Weak Consistency Models.
• Program Order: The order in which memory accesses appear in the execution of a single process without program reordering.
• Event Ordering: Used to declare whether a memory event is legal when several processes access a common set of memory locations.
[Hardware is sequentially consistent if] the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
• Sufficient conditions to achieve SC in shared-memory access:
– Every process issues memory operations in program order.
– After a write operation is issued, the issuing process waits for the write to complete before issuing its next operation.
– After a read operation is issued, the issuing process waits for the read to complete, and for the write whose value is being returned by the read to complete, before issuing its next operation (this provides write atomicity).
• Regarding these sufficient, but not necessary, conditions:
– Clearly, compilers should not reorder for SC, but they do!
• Loop transformations, register allocation (eliminates accesses!).
– Even if operations are issued in order, hardware may violate SC for better performance:
• Write buffers, out-of-order execution.
– Reason: uniprocessors care only about dependences to the same location.
– This makes the sufficient conditions very restrictive for performance.
• Memory access order between processors is determined by a hardware memory access “switch”.
• Stores and swaps issued by a processor are placed in a dedicated FIFO store buffer for that processor. The order of memory operations is the same as the processor issue order.
• A load by a processor first checks its store buffer to see if it contains a store to the same location:
– If it does, then the load returns the value of the most recent such store.
– Otherwise, the load goes directly to memory.
– A processor is logically blocked from issuing further operations until the load returns a value.
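A toy model of the store-buffer behavior described above; the class and method names are mine, and this is a sketch of the mechanism, not of any real machine:

```python
# Toy model of the store buffer above: stores enter a per-processor FIFO,
# and a load searches its own buffer (newest first) before going to memory.
from collections import deque

class Processor:
    def __init__(self, memory):
        self.memory = memory
        self.store_buffer = deque()             # FIFO of (addr, value)

    def store(self, addr, value):
        self.store_buffer.append((addr, value))

    def load(self, addr):
        for a, v in reversed(self.store_buffer):   # most recent store wins
            if a == addr:
                return v
        return self.memory.get(addr, 0)            # otherwise go to memory

    def drain_one(self):
        """The memory 'switch' retires this processor's oldest store."""
        addr, value = self.store_buffer.popleft()
        self.memory[addr] = value

mem = {}
p0, p1 = Processor(mem), Processor(mem)
p0.store("x", 1)
print(p0.load("x"))   # 1: forwarded from p0's own store buffer
print(p1.load("x"))   # 0: the store has not reached memory yet
p0.drain_one()
print(p1.load("x"))   # 1: now visible to all processors
```

The middle load shows why buffered stores can violate sequential consistency: p0 already observes its own write while p1 does not.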
Write-invalidate Snoopy Bus Protocol For Write-Back Caches
State Transition Diagram
[Diagram: three block states RW, RO, INV with transitions driven by the events below.]
RW: Read-Write
RO: Read-Only
INV: Invalidated or not in cache
W(i) = Write to block by processor i
W(j) = Write to block copy in cache j by processor j ≠ i
R(i) = Read block by processor i
R(j) = Read block copy in cache j by processor j ≠ i
Z(i) = Replace block in cache i
Z(j) = Replace block copy in cache j ≠ i
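The transitions in the legend can be sketched as a small function over the per-cache states of one block; this is a simplified model in which a write invalidates all other copies and a read miss downgrades a dirty copy to read-only after write-back:

```python
# Sketch of the RW/RO/INV write-invalidate transitions (simplified model).
RW, RO, INV = "RW", "RO", "INV"

def step(states, op, i):
    """Apply R(i), W(i), or Z(i) by processor i to a list of cache states."""
    states = list(states)
    if op == "W":                      # W(i): cache i gets the only RW copy;
        states = [INV] * len(states)   # W(j) invalidates every other copy
        states[i] = RW
    elif op == "R" and states[i] == INV:                 # read miss
        states = [RO if s == RW else s for s in states]  # dirty copy downgrades
        states[i] = RO
    elif op == "Z":                    # Z(i): replacement
        states[i] = INV
    return states

s = step([INV, INV, INV], "W", 0); print(s)   # ['RW', 'INV', 'INV']
s = step(s, "R", 1); print(s)                 # ['RO', 'RO', 'INV']
s = step(s, "W", 2); print(s)                 # ['INV', 'INV', 'RW']
```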
Efficiency, Utilization, Redundancy, Quality of Parallelism
• System Efficiency: Let O(n) be the total number of unit operations performed by an n-processor system and T(n) be the execution time in unit time steps:
– Speedup factor: S(n) = T(1) /T(n)
– System efficiency for an n-processor system: E(n) = S(n)/n = T(1)/[nT(n)]
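A quick numeric example of these definitions; the timing values are hypothetical:

```python
# Numeric example of the definitions above; T(1) and T(n) are made up.
def speedup(T1, Tn):
    return T1 / Tn                     # S(n) = T(1) / T(n)

def efficiency(T1, Tn, n):
    return speedup(T1, Tn) / n         # E(n) = S(n) / n = T(1) / (n T(n))

T1, T8 = 100.0, 20.0
print(speedup(T1, T8))        # S(8) = 5.0
print(efficiency(T1, T8, 8))  # E(8) = 0.625
```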
Parallel Performance Metrics Revisited: Amdahl’s Law
• Harmonic Mean Speedup (with fi the probability of using i processors): S = 1 / (Σi fi/i).
• In the case w = {fi for i = 1, 2, …, n} = (α, 0, 0, …, 1−α), the system is running sequential code with probability α and utilizing all n processors with probability (1−α), with the other processor modes not utilized.
Amdahl’s Law:
S = 1 / (α + (1−α)/n) → 1/α as n → ∞
Under these conditions the best speedup is upper-bounded by 1/α.
Scalability Metrics
• The study of scalability is concerned with determining the degree of matching between a computer architecture and an application algorithm, and whether this degree of matching continues to hold as problem and machine sizes are scaled up.
• Basic scalability metrics affecting the scalability of the system for a given problem:
Parallel System Scalability
• Scalability (informal, restrictive definition):
A system architecture is scalable if the system efficiency E(s, n) = 1 for all algorithms with any number of processors n and any problem size s.
• Scalability definition (more formal):
The scalability Φ(s, n) of a machine for a given algorithm is defined as the ratio of the asymptotic speedup S(s, n) on the real machine to the asymptotic speedup SI(s, n) on an ideal realization of an EREW PRAM: Φ(s, n) = S(s, n) / SI(s, n).
Goal: Parallel machines that can be scaled to hundreds or thousands of processors.
• Design choices:
– Custom-designed or commodity nodes?
– Network scalability.
– Capability of the node-to-network interface (critical).
– Supported programming models?
• What does hardware scalability mean?
– Avoids inherent design limits on resources.
– Bandwidth increases with machine size P.
– Latency should not increase with machine size P.
– Cost should increase slowly with P.
– Processor-cache-memory nodes connected by a scalable network.
– Distributed shared physical address space.
– The communication assist must interpret network transactions, forming the shared address space.
• For a system with a shared physical address space:
– A cache miss must be satisfied transparently from local or remote memory, depending on the address.
– By its normal operation, the cache replicates data locally, resulting in a potential cache coherence problem between local and remote copies of data.
– A coherence solution must be in place for correct operation.
• Standard snoopy protocols studied earlier may not apply, for lack of a bus or a broadcast medium to snoop on.
• For this type of system to be scalable, in addition to latency and bandwidth scalability, the cache coherence protocol or solution used must also scale.
Scalable Cache Coherence
• A scalable cache coherence approach may have cache line states and state transition diagrams similar to those of bus-based coherence protocols.
• However, additional mechanisms other than broadcasting must be devised to manage the coherence protocol.
• Possible approaches:
– Approach #1: Hierarchical snooping.
– Approach #2: Directory-based cache coherence.
– Approach #3: A combination of the above two.
Approach #1: Hierarchical Snooping
• Extend the snooping approach with a hierarchy of broadcast media:
– Tree of buses or rings (e.g., KSR-1).
– Processors are in the bus- or ring-based multiprocessors at the leaves.
– Parents and children connected by two-way snoopy interfaces:
• Snoop both buses and propagate relevant transactions.
– Main memory may be centralized at the root or distributed among the leaves.
• Issues (a)-(c) handled similarly to the bus case, but without full broadcast:
– The faulting processor sends out a “search” bus transaction on its bus.
– The search propagates up and down the hierarchy based on snoop results.
• Problems:
– High latency: multiple levels, and a snoop/lookup at every level.
– Bandwidth bottleneck at the root.
• This approach has, for the most part, been abandoned.