CMPE 421 Parallel Computer Architecture Multi Processing 1.

Post on 12-Jan-2016









CMPE 421 Parallel Computer Architecture

Multi Processing


Multi Processing

Goal: connecting multiple computers to get higher performance

Multiprocessors: create powerful computers by connecting many existing smaller ones
- Scalability: H/W and S/W are designed to be sold with a variable number of processors
- Availability: if one processor fails, the others should continue
- Power efficiency

Job-level (process-level) parallelism (Threads)
- High throughput for independent jobs

Parallel processing program
- Single program run on multiple processors

Multi-core microprocessors
- Chips with multiple processors (cores)

Clusters: a set of computers connected over a local area network that function as a single large multiprocessor


Hardware and Software

Hardware
- Serial: e.g., Pentium 4
- Parallel: e.g., Core 2 Duo, quad-core, Xeon

Software
- Sequential: e.g., matrix multiplication
- Concurrent: e.g., operating system

Sequential/concurrent software can run on serial/parallel hardware

Challenge: making effective use of parallel hardware

Examples
- Databases
- File servers
- Computer-aided design packages
- Multiprocessing operating systems
- Google's PC clusters


Parallel Programming

Parallel software is the problem

Need to get significant performance improvement
- Otherwise, just use a faster uniprocessor, since it's easier

Difficulties
- Partitioning (loops)
- Coordination
- Communications overhead


Scaling Example

Workload: sum of 10 scalars, and 10 × 10 matrix sum

Speed up from 10 to 100 processors

Single processor: Time = (10 + 100) × tadd

10 processors
- Time = 10 × tadd + 100/10 × tadd = 20 × tadd
- Speedup = 110/20 = 5.5

100 processors
- Time = 10 × tadd + 100/100 × tadd = 11 × tadd
- Speedup = 110/11 = 10

Assumes load can be balanced across processors


Scaling Example (cont)

What if matrix size is 100 × 100?

Single processor: Time = (10 + 10000) × tadd

10 processors
- Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
- Speedup = 10010/1010 = 9.9

100 processors
- Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
- Speedup = 10010/110 = 91

Assuming load balanced


Strong vs Weak Scaling

Strong scaling: problem size fixed
- As in the example above

Weak scaling: problem size proportional to number of processors
- 10 processors, 10 × 10 matrix: Time = 20 × tadd
- 100 processors, 32 × 32 matrix (about 1000 elements): Time = 10 × tadd + 1000/100 × tadd = 20 × tadd

Constant performance in this example
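The arithmetic in the scaling examples can be checked with a short sketch. The function names `parallel_time` and `speedup` are hypothetical, and the model assumes (as the slides do) that the scalar sum stays sequential, that the matrix sum divides evenly across processors, and that the load is balanced. Times are in units of tadd.

```python
def parallel_time(scalars, matrix_elems, p):
    """Time in units of tadd: sequential scalar sum plus matrix sum split over p processors."""
    return scalars + matrix_elems / p


def speedup(scalars, matrix_elems, p):
    """Speedup of p processors relative to a single processor."""
    return parallel_time(scalars, matrix_elems, 1) / parallel_time(scalars, matrix_elems, p)


# Strong scaling: fixed problem size, more processors.
for elems in (10 * 10, 100 * 100):
    for p in (10, 100):
        print(elems, p, round(speedup(10, elems, p), 1))

# Weak scaling: the problem grows with the processor count, so time stays near 20 tadd.
print(parallel_time(10, 10 * 10, 10), parallel_time(10, 32 * 32, 100))
```

With the exact 32 × 32 = 1024 elements the weak-scaling time comes out slightly above 20 × tadd; the slide rounds the matrix to 1000 elements.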


Questions that drive the design of multiprocessors and clusters

1. How do parallel processors share data?

2. How do parallel processors coordinate when operating on shared data?

3. How many processors?


How do parallel processors share data?

Shared memory processors
- Single address space that all processors share
- Processors communicate through shared variables in memory
- Any memory location is accessed via loads and stores

Single bus


How do parallel processors coordinate?

The difficulty of coordination between processors:
- One processor could start working on data before another is finished with it. Coordinating this is called synchronization.
- When sharing is supported with a single address space, there must be a separate mechanism for synchronization.

Caches are used to reduce latency and to lower bus traffic
- Write-back caches are used to keep bus traffic at a minimum
- Hardware must ensure that caches and memory are consistent (cache coherency; discussed later)

One approach uses a lock: only one processor at a time can acquire the lock, and other processors interested in the shared data must wait until the original processor unlocks the variable.
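A minimal sketch of the lock idea, using Python threads as stand-ins for processors and `threading.Lock` as the lock. The function name `count_with_lock` and the shared counter are illustrative assumptions, not part of the slides.

```python
import threading


def count_with_lock(n_threads, increments):
    """Each thread adds to one shared counter; the lock admits one updater at a time."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:            # acquire; other threads wait here until release
                counter += 1      # read-modify-write on the shared variable

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


print(count_with_lock(4, 10000))
```

Without the lock, two threads could read the same old value of the counter and one update could be lost, which is the "working on data before another is finished with it" problem described above.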


Types of Memory Access

UMAs (uniform memory access), also called SMPs (symmetric multiprocessors)
- All accesses to main memory take the same amount of time, no matter which processor makes the request or which location is requested

NUMAs (nonuniform memory access)
- Some main memory accesses are faster than others, depending on the processor making the request and which location is requested
- Can scale to larger sizes than UMAs, so potentially offer higher performance


Example: Sum Reduction

Suppose we have a single-bus UMA multiprocessor with 100 processors. Write a parallel processing program to calculate the sum of 100,000 numbers.

Each processor has an ID: 0 ≤ Pn ≤ 99

Partition: 1000 numbers per processor

Initial summation on each processor:

    sum[Pn] = 0;
    for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
        sum[Pn] = sum[Pn] + A[i];

Vectors A and sum are shared variables, Pn is the processor's number, i is a private variable

Now need to add these partial sums
- Reduction: divide and conquer
- Half the processors add pairs, then a quarter, …
- Need to synchronize between reduction steps


Example: Sum Reduction (cont)

    half = 100;
    repeat
        synch();
        if (half%2 != 0 && Pn == 0)
            sum[0] = sum[0] + sum[half-1];
            /* Conditional sum needed when half is odd;
               Processor0 gets missing element */
        half = half/2; /* dividing line on who sums */
        if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
    until (half == 1);

Processors start by running a loop that sums their subset of vector A numbers (vectors A and sum are shared variables, Pn is the processor's number, i is a private variable)

The processors then coordinate in adding together the partial sums (half is a private variable initialized to 100, the number of processors)
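The pseudocode above can be tried out with a sketch that uses Python threads as the processors and `threading.Barrier` as a stand-in for `synch()`. The function name `parallel_sum` is hypothetical, and the sketch assumes at least two processors and an array length divisible by the processor count.

```python
import threading


def parallel_sum(A, P):
    """Shared-memory sum reduction following the slide's pseudocode (assumes P >= 2)."""
    if P < 2 or len(A) % P != 0:
        raise ValueError("sketch assumes P >= 2 and len(A) divisible by P")
    n = len(A) // P
    sums = [0] * P                     # shared vector of partial sums
    barrier = threading.Barrier(P)     # plays the role of synch()

    def worker(pn):
        # Initial summation over this processor's slice of A.
        for i in range(n * pn, n * (pn + 1)):
            sums[pn] += A[i]
        half = P                       # private variable
        while True:
            barrier.wait()             # wait until every partial sum is ready
            if half % 2 != 0 and pn == 0:
                sums[0] += sums[half - 1]   # processor 0 picks up the odd element
            half //= 2                 # dividing line on who sums
            if pn < half:
                sums[pn] += sums[pn + half]
            if half == 1:
                break

    threads = [threading.Thread(target=worker, args=(pn,)) for pn in range(P)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sums[0]


print(parallel_sum(list(range(100000)), 100) == sum(range(100000)))
```

Within one step each processor writes only its own `sums[pn]` in the lower half and reads only from the upper half, so the barrier between steps is the only synchronization needed.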


An Example with 10 Processors

[Figure: reduction tree over the partial sums sum[P0] … sum[P9]]
- half = 10: P0 P1 P2 P3 P4 P5 P6 P7 P8 P9
- half = 5: P0 P1 P2 P3 P4
- half = 2: P0 P1
- half = 1: P0


Alternative Model (Message Passing)

Distributed memory multiprocessors
- Each processor has a private physical address space
- Hardware sends/receives messages between processors
- The sending processor knows when a message is sent
- The receiving processor knows when a message arrives
- Sometimes the receiving processor is asked to confirm receipt

Example: clusters (desktop computers)
- Network of independent computers
- Each has private memory and OS
- Connected using the I/O system, e.g., Ethernet/switch, Internet
- Suitable for applications with independent tasks: web servers, databases, simulations, …


Sum Reduction (Again)

Suppose we have a network-connected multiprocessor with 100 processors. Write a parallel processing program to calculate the sum of 100,000 numbers.

Divide the 100,000 numbers into 100 subsets of 1000 numbers each; each subset is summed by an individual processor.
- Different subsets of the 100,000 numbers are copied to the different individual memories
- All processors have the same copy of the program

First do the partial sums (AN is the local subset on each processor):

    sum = 0;
    for (i = 0; i < 1000; i = i + 1)
        sum = sum + AN[i];

Then add the partial sums together in log2(100) steps (rounded up to 7)

Reduction
- Half the processors send, the other half receive and add
- Then a quarter send, a quarter receive and add, …


Sum Reduction (Again)

Communication and coordination between processors through message passing

Given send() and receive() operations:

    limit = 100; half = 100; /* 100 processors */
    repeat
        half = (half+1)/2; /* send vs. receive dividing line */
        if (Pn >= half && Pn < limit) send(Pn - half, sum);
        if (Pn < (limit/2)) sum = sum + receive();
        limit = half; /* upper limit of senders */
    until (half == 1); /* exit with final sum */

Send/receive also provide synchronization

Assumes send/receive take similar time to addition
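The message-passing reduction can be simulated on one machine with a sketch in which each thread is a node that touches only its own data and a per-node `queue.Queue` inbox stands in for the network. The names `message_passing_sum`, `inbox`, and `node` are assumptions made for illustration; `put` and `get` play the roles of send() and receive().

```python
import queue
import threading


def message_passing_sum(chunks):
    """Each node sums its private chunk, then partial sums are reduced by messages."""
    P = len(chunks)
    inbox = [queue.Queue() for _ in range(P)]   # one mailbox per node
    result = []

    def node(pn, local):
        total = sum(local)                      # private partial sum
        limit = half = P
        while True:
            half = (half + 1) // 2              # send vs. receive dividing line
            if half <= pn < limit:
                inbox[pn - half].put(total)     # send(Pn - half, sum)
            if pn < limit // 2:
                total += inbox[pn].get()        # receive() blocks until data arrives
            limit = half                        # upper limit of senders
            if half == 1:
                break
        if pn == 0:
            result.append(total)                # node 0 exits with the final sum

    threads = [threading.Thread(target=node, args=(pn, c)) for pn, c in enumerate(chunks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]


chunks = [list(range(i * 1000, (i + 1) * 1000)) for i in range(100)]
print(message_passing_sum(chunks) == sum(range(100000)))
```

No barrier is needed here: the blocking `get` makes each receiver wait for its sender, which is the sense in which send/receive also provide synchronization.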


Cache Coherence

Shared memory multiprocessors have a cache coherence problem
- Multiple copies of the same memory data can exist in different caches simultaneously.
- An update in one of the caches will cause an inconsistent view of memory.

Solutions

Software oriented
- Compilers identify the data items that may cause cache inconsistency and instruct the hardware not to cache them

Hardware oriented
- Directory protocol: the directory acts as a filter through which a processor must ask permission to load an entry from primary memory into its cache. When an entry is changed, the directory either updates or invalidates the other caches holding that entry.
- Snoopy protocol: the individual caches monitor the address lines for accesses to memory locations that they have cached. When a write is observed to a location that a cache has a copy of, the cache controller invalidates its own copy of the snooped memory location. Two variants:
  - Write-update
  - Write-invalidate


Cache Coherence

Ex:
1. P1 reads A from shared MEM (A=5)
2. P2 reads A from shared MEM (A=5)
3. P1 modifies A to 10 (A=10 in cache 1, A=5 in cache 2 and MEM): inconsistent

We cannot apply a write-through policy for every update. Why? P1 would have to tell P2 and MEM that A=10 on every write, which would be too slow and too costly in terms of MEM accesses and bus traffic.

We need cache coherence protocols to solve this problem. Multiple copies are not a problem when reading; they become a problem when writing.


Snooping

Cache controllers monitor (snoop) the shared bus to determine whether or not they have a copy of the shared block that is being updated (written)

Shared reads are no problem

So, snoopy cache coherence protocols must find all caches that share an object to be written

When there is a "write" to a shared data block, all other copies of that block can be updated or invalidated. (How this is done is explained later.)

Tags are duplicated to reduce the demands of snooping on the cache: the duplicate tags are used to check whether or not the cache has a copy of the desired block

Snoopy caches sit on a broadcast medium: each cache listens to all invalidates and read requests and performs the appropriate coherence operations locally


Read Misses

If the request is a "read" that missed (the value could not be found in the local cache), then all other caches check whether they have a copy of the requested block; if one does, it supplies the data to the requesting processor.


Handling Writes

Ensuring that all other processors sharing data are informed of writes can be handled in two ways:

1. Write-update (write-broadcast): the writing processor broadcasts the new data over the bus, and all copies are updated
- All writes go to the bus, so bus traffic is higher
- Since new values appear in caches sooner, latency can be reduced

2. Write-invalidate (other "owners" are told their copies are no longer valid): the writing processor issues an invalidation signal on the bus; cache snoops check whether they have a copy of the data and, if so, invalidate their cache block containing the word (this allows multiple readers but only one writer)
- Uses the bus only on the first write, so bus traffic is lower and bus bandwidth is better used
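The traffic difference between the two policies can be illustrated with a rough counting sketch. It is a simplification, not a full protocol: it tracks a single shared block, counts one bus transaction per read miss, per broadcast write, or per invalidating write, and ignores victim write-backs. The name `bus_transactions` and the trace format are assumptions.

```python
def bus_transactions(trace, policy):
    """Count bus transactions for accesses to one shared block.

    trace  : list of (processor, op) pairs, where op is "r" or "w"
    policy : "update" or "invalidate"
    """
    if policy not in ("update", "invalidate"):
        raise ValueError("policy must be 'update' or 'invalidate'")
    valid = set()        # processors currently holding a valid copy
    owner = None         # write-invalidate only: the single writer, if any
    count = 0
    for proc, op in trace:
        if op == "r":
            if proc not in valid:          # read miss goes to the bus
                count += 1
                if owner is not None:      # writer writes back and drops its copy
                    valid.discard(owner)
                    owner = None
                valid.add(proc)
        elif policy == "update":
            count += 1                     # every write is broadcast on the bus
            valid.add(proc)
        else:
            if owner != proc:              # only the first write uses the bus
                count += 1
                valid = {proc}
                owner = proc
    return count


trace = [(0, "r"), (1, "r")] + [(0, "w")] * 5
print(bus_transactions(trace, "update"), bus_transactions(trace, "invalidate"))
```

Appending a later read by processor 1 to the trace adds a miss only under write-invalidate, which is the latency cost paid for the lower write traffic.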


An Example of a Cache Coherence (CC) Protocol

Finite state transition diagram for a write-invalidation protocol based on a write-back policy. Each cache block is in one of three states:

1. Shared (read only): This cache block is clean (not written) and may be shared.

2. Modified (read/write): This cache block is dirty (written) and may not be shared.

3. Invalid: This cache block does not have valid data.


A Write-Invalidate CC Protocol

[Figure: state transition diagram of the write-invalidate protocol. Recoverable transition labels: read (miss); write (hit or miss); read (hit or miss); read (hit) or write (hit); send invalidate; receives invalidate (write by another processor to this block).]

Legend: write-back caching protocol in black; signals from the processor (coherence additions) in red; signals from the bus (coherence additions) in blue


CC Protocol

A read miss causes the cache to acquire the bus and write back the victim block (if it was in the Modified (dirty) state). All the other caches monitor the read miss to see if this block is in their cache. If one has a copy and it is in the Modified state, then the block is written back and its state is changed to Invalid. The read miss is then satisfied by reading the block from memory, and the state of the block is set to Shared.

Read hits do not change the cache state.


CC Protocol

A write miss to an Invalid block causes the cache to acquire the bus, read the block, modify the portion of the block being written, and change the block's state to Modified. A write miss to a Shared block causes the cache to acquire the bus, send an invalidate signal to invalidate any other existing copies in other caches, modify the portion of the block being written, and change the block's state to Modified.

A write hit to a Modified block causes no action. A write hit to a Shared block causes the cache to acquire the bus, send an invalidate signal to invalidate any other existing copies in other caches, modify the part of the block being written, and change the state to Modified.
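The transitions described above can be sketched as a small simulation. It is a simplified model under stated assumptions: one memory block holding a single value, so a write replaces the whole block and no victim write-back arises. The class name `SnoopyBus` and its methods are hypothetical.

```python
class SnoopyBus:
    """One memory block shared by n caches; per-cache state is 'I', 'S', or 'M'."""

    def __init__(self, n_caches, value):
        self.memory = value
        self.state = ["I"] * n_caches
        self.data = [None] * n_caches

    def _snoop(self, requester, invalidate_shared):
        # Every other cache reacts to the request it sees on the bus.
        for c in range(len(self.state)):
            if c == requester:
                continue
            if self.state[c] == "M":            # dirty copy: write back, then Invalid
                self.memory = self.data[c]
                self.state[c] = "I"
            elif self.state[c] == "S" and invalidate_shared:
                self.state[c] = "I"             # invalidate signal from the writer

    def read(self, p):
        if self.state[p] == "I":                # read miss
            self._snoop(p, invalidate_shared=False)
            self.data[p] = self.memory
            self.state[p] = "S"
        return self.data[p]                     # read hits do not change the state

    def write(self, p, value):
        if self.state[p] != "M":                # a write to a Modified block needs no bus
            self._snoop(p, invalidate_shared=True)
        self.data[p] = value
        self.state[p] = "M"


bus = SnoopyBus(2, 5)      # A = 5 in memory; caches 0 and 1 stand for P1 and P2
bus.read(0)
bus.read(1)
bus.write(0, 10)           # cache 1 is invalidated; memory still holds 5
print(bus.state, bus.memory, bus.read(1), bus.state, bus.memory)
```

The usage lines replay the earlier A=5 example: after the write, the stale copy is invalid rather than inconsistent, and the next read by the other processor forces the write-back.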


Write-Invalidate CC Examples

I = invalid (many), S = shared (many), M = modified (only one)

Proc 1 | Main Mem (A) | Proc 2

1. read miss for A
2. read request for A
3. snoop sees read request for A & lets MM supply A
4. gets A from MM & changes its state to S

Proc 1 | Main Mem (A) | Proc 2

1. write miss for A
2. writes A & changes its state to M

Proc 1 | Main Mem (A) | Proc 2

1. read miss for A
2. read request for A
3. snoop sees read request for A, writes back A to MM
4. gets A from MM & changes its state to M
5. P2 sends invalidate for A
6. change A state to I

Proc 1 | Main Mem (A) | Proc 2

1. write miss for A
2. writes A & changes its state to M
3. P2 sends invalidate for A
4. change A state to I


Grid Computing

Separate computers interconnected by long-haul networks
- E.g., Internet connections
- Work units farmed out, results sent back

Can make use of idle time on PCs
- E.g., SETI@home, World Community Grid



Multithreading

Performing multiple threads of execution in parallel
- Replicate registers, PC, etc.
- Fast switching between threads

Fine-grain multithreading
- Switch threads after each cycle
- Interleave instruction execution
- If one thread stalls, others are executed

Coarse-grain multithreading
- Only switch on long stall (e.g., L2-cache miss)
- Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)


Simultaneous Multithreading

In a multiple-issue, dynamically scheduled processor
- Schedule instructions from multiple threads
- Instructions from independent threads execute when function units are available
- Within threads, dependencies are handled by scheduling and register renaming

Example: Intel Pentium-4 HT
- Two threads: duplicated registers, shared function units and caches


Multithreading Example


Future of Multithreading

Will it survive? In what form?

Power considerations lead to simplified microarchitectures
- Simpler forms of multithreading

Tolerating cache-miss latency
- Thread switch may be most effective

Multiple simple cores might share resources more effectively


Instruction and Data Streams

An alternate classification

                                  Data Streams
                                  Single                    Multiple
Instruction Streams   Single      SISD: Intel Pentium 4     SIMD: SSE instructions of x86
                      Multiple    MISD: No examples today   MIMD: Intel Xeon e5345

SPMD: Single Program Multiple Data
- A parallel program on a MIMD computer
- Conditional code for different processors
