Multiprocessor Systems

Jan 19, 2016

Transcript

Page 1: Multiprocessor Systems

Page 2: Conventional Wisdom (CW) in Computer Architecture

• Old CW: Power is free, transistors expensive
• New CW: "Power wall" - power expensive, transistors free
  (can put more on a chip than we can afford to turn on)
• Old CW: Multiplies are slow, memory access is fast
• New CW: "Memory wall" - memory slow, multiplies fast
  (200 clocks to DRAM memory, 4 clocks for an FP multiply)
• Old CW: Increasing instruction-level parallelism via compiler and hardware innovation (out-of-order execution, speculation, VLIW, ...)
• New CW: "ILP wall" - diminishing returns on more ILP
• New CW: Power wall + memory wall + ILP wall = brick wall
• Old CW: Uniprocessor performance 2X / 1.5 years
• New CW: Uniprocessor performance only 2X / 5 years?

Page 3: Uniprocessor Performance (SPECint)

[Figure: performance relative to the VAX-11/780 (log scale, 1 to 10,000) versus year, 1978-2006; growth of 25%/year before 1986, 52%/year from 1986 to 2002, and ??%/year since 2002, leaving roughly a 3X gap below the earlier trend.]

• RISC + x86: 52%/year from 1986 to 2002
• RISC + x86: ??%/year from 2002 to present

From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006

Sea change in chip design: multiple "cores" or processors per chip

Page 4: Multiprocessor Systems. 2 Old CW: Power is free, Transistors expensive New CW: “Power wall” Power expensive, Xtors free (Can put more on chip than can.

Evolution from Single Core to Multi-Core“… today’s processors … are nearing an impasse as technologies approach the

speed of light..” David Mitchell, The Transputer: The Time Is Now (1989)

Procrastination rewarded: 2X seq. perf. / 1.5 years “We are dedicating all of our future product development to multicore designs. …

This is a sea change in computing” Paul Otellini, President, Intel (2005)

• All microprocessor companies switch to MP (2X CPUs / 2 yrs) Procrastination penalized: 2X sequential perf. / 5 yrsManufacturer/Year AMD/’05 Intel/’06 IBM/’04 Sun/’05Processors/chip 2 2 2 8Threads/Processor 1 2 2 4

Threads/chip 2 4 4 32

  Procrastination : to keep delaying something that must be done

Page 5: Flynn's Taxonomy

• Flynn classified machines by their data and instruction (control) streams in 1966
• SIMD: data-level parallelism
• MIMD: thread-level parallelism
• MIMD is popular because it is
  • Flexible: N programs, or 1 multithreaded program
  • Cost-effective: the same MPU is used in desktops and in MIMD machines

Single Instruction, Single Data (SISD): uniprocessor
Single Instruction, Multiple Data (SIMD): single PC; vector machines, CM-2
Multiple Instruction, Single Data (MISD): (????)
Multiple Instruction, Multiple Data (MIMD): clusters, SMP servers

M.J. Flynn, "Very High-Speed Computing Systems", Proc. of the IEEE, vol. 54, pp. 1901-1909, Dec. 1966.

Page 6: Flynn's Taxonomy

• SISD (Single Instruction, Single Data)
  • Uniprocessors
• MISD (Multiple Instruction, Single Data)
  • A single data stream operated on by successive functional units
• SIMD (Single Instruction, Multiple Data)
  • One instruction stream executed by multiple processors on different data
  • Simple programming model, low overhead
  • Examples: the Connection Machine and vector processors
• MIMD (Multiple Instruction, Multiple Data) is the most general
  • Flexibility for parallel applications and multi-programmed systems
  • Processors can run independent processes or applications
  • Processors can also run threads belonging to one parallel application
  • Uses off-the-shelf microprocessors and components

Page 7: Major MIMD Styles

• Symmetric Multiprocessors (SMP)
  • Main memory is shared and equally accessible by all processors
  • Also called Uniform Memory Access (UMA)
  • Bus-based or interconnection-network-based
• Distributed-memory multiprocessors
  • Distributed Shared Memory (DSM) multiprocessors
    • Distributed memories are shared and can be accessed by all processors
    • Non-Uniform Memory Access (NUMA): latency varies between local and remote memory access
  • Message-passing multiprocessors, multicomputers, and clusters
    • Distributed memories are NOT shared
    • Each processor can access only its own local memory
    • Processors communicate by sending and receiving messages

Page 8: Shared Memory Architecture

• Any processor can directly reference any physical memory
  • Any I/O controller can reach any physical memory
  • The operating system can run on any processor
  • The OS uses shared memory to coordinate
• Communication occurs implicitly as a result of loads and stores
• Wide range of scale
  • Few to hundreds of processors
  • Memory may be physically distributed among processors
• History dates to the early 1960s

[Figure: several processors and I/O controllers all connected to a shared physical memory.]

Page 9: Shared Memory Organizations

[Figure: four shared-memory organizations, each built from processors P1..Pn with caches ($):
 - Dance Hall (UMA): processors and caches on one side of an interconnection network, memories on the other.
 - Distributed Shared Memory (NUMA): a memory attached to each processor/cache node, with the nodes joined by an interconnection network.
 - Bus-based Shared Memory: processors and caches share a bus with memory and I/O devices.
 - Shared Cache: processors connect through a switch to a shared cache and interleaved main memory.]

Page 10: Bus-Based Symmetric Multiprocessors

• Symmetric access to main memory from any processor
• Dominate the server market
• Building blocks for larger systems
• Attractive as throughput servers and for parallel programs
• Uniform access via loads/stores
• Automatic data movement and coherent replication in caches
• Cheap and powerful extension to uniprocessors
• Key is the extension of the memory hierarchy to support multiple processors

[Figure: processors P1..Pn, each with a multilevel cache, sharing a bus with main memory and the I/O system.]

Page 11: Shared Address Space Programming Model

• A process is a virtual address space with one or more threads of control
• Part of the virtual address space can be shared by processes
• Multiple threads share the address space of a single process
• All communication is through shared memory (a pthreads sketch follows the figure below)
  • Achieved by loads and stores
  • Writes by one process/thread are visible to others
  • Special atomic operations are used for synchronization
  • The OS uses shared memory to coordinate processes

[Figure: virtual address spaces of processes P0..Pn communicating via shared addresses; a shared portion of each address space maps to common physical addresses in the machine's physical address space, while each process keeps its own private portion; loads and stores to the shared portion implement the communication.]
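
As a hedged illustration of this model, here is a minimal POSIX-threads sketch in C (not from the slides; NTHREADS, shared_counter, and worker are illustrative names): threads of one process share its address space, a plain store by one thread is seen by the others, and a mutex stands in for the slide's "special atomic operations".

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    /* Shared portion of the address space: visible to every thread. */
    static int shared_counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        long id = (long)arg;                  /* private to this thread          */
        pthread_mutex_lock(&lock);            /* atomic operation for synch.     */
        shared_counter = shared_counter + 1;  /* ordinary store, visible to all  */
        pthread_mutex_unlock(&lock);
        printf("thread %ld saw counter = %d\n", id, shared_counter);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("final counter = %d\n", shared_counter);  /* communication happened via loads/stores */
        return 0;
    }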

Page 12: Medium and Large Scale Multiprocessors

• The problem is the interconnect: high cost (crossbar) or limited bandwidth (bus)
• Centralized memory, or uniform memory access (UMA)
  • Latencies to memory are uniform, but uniformly large
  • Interconnection network: crossbar or multi-stage
• Distributed shared memory, or non-uniform memory access (NUMA)
  • Distributed memories form one global shared physical address space
  • Access to local memory is faster than access to remote memory

[Figure: Centralized Memory - processor/cache nodes on one side of an interconnection network with the memories on the other; Distributed Shared Memory - a memory attached to each processor/cache node, with the nodes joined by an interconnection network.]

Page 13: Message Passing Architectures

• A complete computer is the building block
  • Includes processor, memory, and I/O system
  • Easier to build and scale than shared-memory architectures
• Communication via explicit I/O operations
  • Communication is integrated at the I/O level, not into the memory system
• Much in common with networks of workstations or clusters
  • However, tight integration between processor and network
  • The network is of higher capability than a local area network
• Programming model
  • Direct access only to local memory (private address space)
  • Communication via explicit send/receive messages (library or system calls)

[Figure: nodes, each consisting of a processor, cache, and memory, connected by an interconnection network.]

Page 14: Message-Passing Abstraction

• Send specifies the receiving process and the buffer to be transmitted
• Receive specifies the sending process and the buffer to receive into
• Optional tag on the send and matching rule on the receive
  • Matching rule: match a specific tag t, or any tag, or any process
• The combination of a send and a matching receive achieves:
  • A pairwise synchronization event
  • A memory-to-memory copy of the message
• Overheads: copying, buffer management, protection
• Examples: the MPI and PVM message-passing libraries (a minimal MPI sketch follows the figure below)

[Figure: process P executes Send Q, X, t to transmit the buffer at address X in its local address space; process Q executes Receive P, Y, t, and the matching message is copied into address Y in Q's local address space.]
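
As a hedged sketch of this abstraction with MPI (one of the libraries named above), the fragment below lets rank 0 play process P and rank 1 play process Q; the buffer size and the tag value 42 are illustrative.

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int rank;
        char x[64] = "", y[64] = "";

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                          /* process P: Send Q, X, t     */
            strcpy(x, "hello from P");
            MPI_Send(x, 64, MPI_CHAR, 1, 42, MPI_COMM_WORLD);
        } else if (rank == 1) {                   /* process Q: Receive P, Y, t  */
            MPI_Recv(y, 64, MPI_CHAR, 0, 42, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Q received: %s\n", y);        /* memory-to-memory copy done  */
        }

        MPI_Finalize();
        return 0;
    }

The matching rule corresponds to the source and tag arguments of MPI_Recv; MPI_ANY_SOURCE and MPI_ANY_TAG relax them.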

Page 15: Variants of Send and Receive

• Parallel programs using send and receive are quite structured
  • Most often, all nodes execute identical copies of a program
  • Processes can name each other using a simple linear ordering
• Blocking send: the sender sends a request and waits until the reply is returned
• Non-blocking send: the sender sends a message and continues without waiting for a reply
• Blocking receive: the receiver blocks if it tries to receive a message that has not arrived
• Non-blocking receive: the receiver simply posts a receive without waiting for the sender
  (a sketch of the non-blocking variants in MPI follows)
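
A hedged sketch of the non-blocking variants, again using MPI (the function exchange and its arguments are illustrative):

    #include <mpi.h>

    /* MPI_Send/MPI_Recv are the blocking forms; MPI_Isend/MPI_Irecv return
       immediately, letting the caller overlap computation until MPI_Wait
       confirms that the buffers may be reused or read. */
    void exchange(int peer, double *out, double *in, int n)
    {
        MPI_Request sreq, rreq;

        MPI_Irecv(in,  n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &rreq);  /* non-blocking receive */
        MPI_Isend(out, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &sreq);  /* non-blocking send    */

        /* ... computation that touches neither 'in' nor 'out' can run here ... */

        MPI_Wait(&sreq, MPI_STATUS_IGNORE);   /* 'out' may now be reused     */
        MPI_Wait(&rreq, MPI_STATUS_IGNORE);   /* 'in' now holds the message  */
    }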

Page 16: Evolution of Message-Passing Machines

• Early machines: a FIFO on each link
  • Hardware close to the programming model
  • Synchronous send/receive operations
  • Topology was central (hypercube algorithms)

[Figure: a 3-dimensional hypercube with nodes labeled 000 through 111 - the CalTech Cosmic Cube (Seitz).]

Page 17: Example: Intel Paragon

[Figure: an Intel Paragon node - two i860 processors with L1 caches, a memory controller with 4-way interleaved DRAM, and a network interface (NI) with a DMA engine, all on a 64-bit, 50 MHz memory bus; nodes attach to a 2D grid network (8-bit, 175 MHz, bidirectional links) with a processing node at every switch. Shown alongside Sandia's Intel Paragon XP/S-based supercomputer.]

Each card is an SMP with two or more i860 processors and a network interface chip connected to the cache-coherent memory bus.

Each node has a DMA engine to transfer contiguous chunks of data to and from the network at a high rate.

Page 18: Example 1: Matrix Addition (1/3)

Assume an MIMD shared-memory multiprocessor (SMP) with P processors. The objective is to write a parallel program for matrix addition C = A + B, where A, B, and C are all NxN matrices, using P processors.

(1) How should the array of results (C) be partitioned for parallel processing among P processes?

A few approaches: (1) block-row partitioning, where each process is assigned N/P rows and N columns, or (2) block-column partitioning, where each process is assigned N rows and N/P columns. The first is preferred because of row-major storage.

(2) Find the number of results assigned to each process for load balancing.

There are N^2 results to compute in array C. Each process should get N^2/P of them for load balancing. Using block-row partitioning, each process is assigned a block-row of (N/P)*N results.

(3) Determine the range of results that must be computed by each process as a function of the process ID (Pid).

Using block-row partitioning, each process is assigned N/P rows and N columns. This can be achieved if we partition the rows such that each process gets the range:

Range = (Imin, Imax) = [Pid*N/P, Imin + N/P - 1], where Imin = Pid*N/P

Each process plugs its own Pid into this formula to find its range, i.e., a parallel SPMD program. For example, with N = 8 and P = 4, process 1 computes rows 2 and 3.

Page 19: Example 1: Matrix Addition (2/3)

(4) Using a shared-memory multiprocessor, write an SPMD program to compute the matrix addition with P processes (threads), assuming that all arrays are declared as shared data.

The SPMD program that runs on each processor is:

    Imin = Pid*N/P;  Imax = Imin + N/P - 1;
    for (i = Imin; i <= Imax; i++)
        for (j = 0; j < N; j++)
            c[i][j] = a[i][j] + b[i][j];

Note: the range of data computed by each process depends on its Pid.
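
For completeness, here is a hedged sketch of how this SPMD fragment could be launched as P threads with POSIX threads on a shared-memory multiprocessor (N, P, add_block, and the static arrays are illustrative; N is assumed divisible by P):

    #include <pthread.h>

    #define N 1024
    #define P 4

    /* Shared data: visible to all threads of the process. */
    static double a[N][N], b[N][N], c[N][N];

    /* Each thread runs the same SPMD body on its own block of rows. */
    static void *add_block(void *arg)
    {
        long pid  = (long)arg;          /* this thread's Pid        */
        int  imin = pid * N / P;        /* first row of its block   */
        int  imax = imin + N / P - 1;   /* last row of its block    */

        for (int i = imin; i <= imax; i++)
            for (int j = 0; j < N; j++)
                c[i][j] = a[i][j] + b[i][j];
        return NULL;
    }

    int main(void)
    {
        pthread_t t[P];
        for (long pid = 0; pid < P; pid++)
            pthread_create(&t[pid], NULL, add_block, (void *)pid);
        for (int pid = 0; pid < P; pid++)
            pthread_join(t[pid], NULL);
        return 0;
    }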

Page 20: Example 1: Matrix Addition (3/3)

(5) Using a distributed-memory multiprocessor, write an SPMD program using the MPI library to parallelize C = A + B, where each is an NxN matrix, using P processes. Assume block-row partitioning, that each process works only on its own range of data, and that the results return to the master process (process 0) after all iterations complete.

A sketch (assuming N is divisible by P, that a and b hold the full matrices on process 0, and that my_a, my_b, my_c are per-process buffers of N/P rows):

    MPI_Init(&argc, &argv);                              /* initialize MPI           */
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);            /* number of processes, P   */
    MPI_Comm_rank(MPI_COMM_WORLD, &mypid);               /* this process's id (Pid)  */

    my_range = N / numprocs;                             /* rows handled per process */

    /* Process 0 scatters block-rows of A and B over all the processes */
    MPI_Scatter(a, my_range*N, MPI_DOUBLE, my_a, my_range*N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(b, my_range*N, MPI_DOUBLE, my_b, my_range*N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each process adds its own block of rows */
    for (i = 0; i < my_range; i++)
        for (j = 0; j < N; j++)
            my_c[i][j] = my_a[i][j] + my_b[i][j];

    /* Gather the partial results back into C on process 0 */
    MPI_Gather(my_c, my_range*N, MPI_DOUBLE, c, my_range*N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Finalize();

Page 21: Example 2: Summing 100,000 Numbers on 100 Processors

The processors start by running a loop that sums their subset of vector A (vector A and array sum are shared variables, Pid is the processor's id, i is a private variable):

    sum[Pid] = 0;
    for (i = 1000*Pid; i < 1000*(Pid+1); i = i + 1)
        sum[Pid] = sum[Pid] + A[i];

The processors then coordinate in adding together the partial sums (P is a private variable initialized to 100, the number of processors):

    repeat
        synch();                          /* synchronize first       */
        if (P % 2 != 0 && Pid == 0)
            sum[0] = sum[0] + sum[P-1];   /* fold in the odd element */
        P = P/2;
        if (Pid < P)
            sum[Pid] = sum[Pid] + sum[Pid+P];
    until (P == 1);                       /* final sum is in sum[0]  */
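
A hedged sketch of the same computation in C with POSIX threads, where a pthreads barrier stands in for synch() (NPROCS, NUMS, reduce, and the array declarations are illustrative):

    #include <pthread.h>

    #define NPROCS 100
    #define NUMS   100000

    static double A[NUMS];                 /* shared: the numbers to sum     */
    static double sum[NPROCS];             /* shared: per-processor partials */
    static pthread_barrier_t barrier;      /* plays the role of synch()      */

    static void *reduce(void *arg)
    {
        long pid = (long)arg;
        int  p   = NPROCS;                 /* private copy of P              */

        sum[pid] = 0;
        for (int i = 1000 * pid; i < 1000 * (pid + 1); i++)
            sum[pid] = sum[pid] + A[i];

        do {
            pthread_barrier_wait(&barrier);        /* synchronize first       */
            if (p % 2 != 0 && pid == 0)
                sum[0] = sum[0] + sum[p - 1];      /* fold in the odd element */
            p = p / 2;
            if (pid < p)
                sum[pid] = sum[pid] + sum[pid + p];
        } while (p != 1);                          /* final sum in sum[0]     */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NPROCS];
        pthread_barrier_init(&barrier, NULL, NPROCS);
        for (long pid = 0; pid < NPROCS; pid++)
            pthread_create(&t[pid], NULL, reduce, (void *)pid);
        for (int pid = 0; pid < NPROCS; pid++)
            pthread_join(t[pid], NULL);
        /* sum[0] now holds the total of A[0..NUMS-1] */
        return 0;
    }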

Page 22: An Example with 10 Processors

[Figure: tree reduction of the partial sums with 10 processors. Level P = 10: processors P0-P9 each hold a partial sum sum[P0]..sum[P9]. Level P = 5: P0-P4 have each folded in one partial sum from P5-P9. Level P = 2: P0 and P1 hold the combined sums (P0 having also absorbed the leftover element from the odd level). Level P = 1: P0 holds the final sum in sum[0].]

Page 23: Message Passing Multiprocessors

• Each processor has its own private address space
• Q1: Processors share data by explicitly sending and receiving information (messages)
• Q2: Coordination is built into the message-passing primitives (send and receive)