11/14 Multiprocessing.1
Multiprocessing & Cache Coherency
Jan 08, 2018
11/14 Multiprocessing.2
What is Multiprocessing? (REVIEW)
• Computer system - supports several simultaneous processes
• All OSes support multiprocessing
• More complex - must share system resources
• ILP is running out of steam
• Today's CPUs are Chip MultiProcessors (CMPs)
11/14 Multiprocessing.3
Multiple Processes – One CPU (review)
(Figure: one processor with memory holding a context for each process - its stack, task priority, and saved CPU registers)
11/14 Multiprocessing.4
Context-Switch to Share CPU (review)
• Time-slicing
– Time-slice: period of time a task runs before a context-switch
– Hardware interrupt from the system timer
– Kernel scheduling
• Preemption
– Current task is halted and switched out by a higher-priority task
– Typical in embedded, real-time systems
(Figure: tasks alternating on the CPU, with context switches separating each time-slice)
11/14 Multiprocessing.5
Process State (review)
• A process can be in one of many states
(Figure: state diagram with states Dormant, Ready, Running, Delayed, Waiting for Event, and Interrupted; transitions include task create/delete, context switch, interrupt, delay task for n ticks / delay expired, and wait for event / event occurred)
11/14 Multiprocessing.6
Extensions of the Memory System
(Figure: two organizations, each with processors P1..Pn and private caches ($) on an interconnection network - centralized memory ("dance hall", UMA) vs. distributed memory (NUMA); the distributed design scales better)
11/14 Multiprocessing.7
Symmetric Multiprocessors
• Symmetric: all memory is equally far away from all processors
• Any processor can do any I/O (set up a DMA transfer)
(Figure: processors on a CPU-memory bus with memory, a bus bridge, graphics output, and I/O controllers on an I/O bus connecting to networks)
11/14 Multiprocessing.8
Bus-Based Symmetric Shared Memory
• On-chip building blocks for larger systems; already on the desktop
• Attractive for servers and parallel programs
– Fine-grain resource sharing
– Uniform access via loads/stores
– Automatic data movement and coherent replication in caches
– Cheap and powerful extension
• Normal uniprocessor mechanisms to access data
(Figure: processors P1..Pn with caches ($), memory, and I/O devices on a shared bus)
11/14 Multiprocessing.9
SMP example: connecting IBM Power chips
• 8-way SMP
• Each CMP has 2 cores
11/14 Multiprocessing.10
Parallel Programming Models
• Programming model: languages and libraries create an abstract view of the machine
• Control
– How is parallelism created?
– Operation ordering
– Synchronization control
• Data
– Private vs. shared
– How is shared data accessed and communicated?
• Synchronization
– What operations can be used?
– What are atomic (indivisible) operations?
11/14 Multiprocessing.11
Programming Model 1: Shared Memory
• Program: collection of threads with private variables AND shared variables (e.g., static variables, shared common blocks)
– Threads communicate implicitly by writing and reading shared variables
– Threads coordinate by synchronizing on shared variables
(Figure: threads P0..Pn, each with private memory holding its own copy of i (2, 5, 8), all reading and writing a shared variable s in shared memory)
11/14 Multiprocessing.12
Synchronization Techniques
• Mutexes - mutual exclusion locks (binary semaphores)
– Threads are mostly independent and must access common data:
    lock *l = alloc_and_init();  /* shared */
    lock(l);
    ... access data ...
    unlock(l);
• Barrier - global (coordinated) synchronization
– Simple use of barriers: all threads hit the same one
    work_on_my_subgrid();
    barrier;
    read_neighboring_values();
    barrier;
• Need atomic operations bigger than loads/stores
– Atomic swap, test-and-test-and-set
• Transactional memory
– Hardware equivalent of optimistic concurrency
– Solves many parallel programming problems
11/14 Multiprocessing.13
Programming Model 2: Message Passing
• Program: a collection of processes
– Usually fixed at program startup
– Each has a local address space - NO shared data
– Logically shared data is partitioned across processes
• Processes communicate by explicit send/receive pairs
– Coordination is implicit in every communication event
– MPI (Message Passing Interface) is the most commonly used SW
(Figure: processes P0..Pn, each with its own private memory holding its own copies of s and i, communicating over a network via explicit pairs such as "send P1,s" matched by a receive)
11/14 Multiprocessing.14
MPI – the de facto standard
• MPI has become the de facto standard for parallel computing using message passing
• Example: (FYI)
    for (i = 1; i < numprocs; i++) {
        sprintf(buff, "Hello %d! ", i);
        MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
    }
    for (i = 1; i < numprocs; i++) {
        MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);
        printf("%d: %s\n", myid, buff);
    }
• Pros and cons of standards
– MPI is a standard for development in the HPC community: portability
– But the MPI standard was built on mid-80s technology
11/14 Multiprocessing.15
Shared Memory vs. Message Passing
• Advantages of shared memory:
– Implicit communication (loads/stores)
– Low overhead when cached
• Disadvantages of shared memory:
– Complex to scale well
– Requires synchronization operations
• Advantages of message passing:
– Explicit communication (sending/receiving of messages)
– Easier to control data placement (no automatic caching)
• Disadvantages of message passing:
– High message-passing overhead
– Complex to program
• Due to CMPs, cache-coherent shared memory systems will be the dominant form of multiprocessor
11/14 Multiprocessing.16
Caches and Cache Coherence
• Caches play a key role
– Reduce average data access time
– Reduce bandwidth demands placed on the shared interconnect
• But private processor caches create a problem
– Copies of a variable can be present in multiple caches
– A write by one processor may not become visible to others
» they keep reading the stale value in their caches
• Solutions
– Cache snoop architectures & protocols
11/14 Multiprocessing.17
Example Cache Coherence Problem
(Figure: P1, P2, and P3 with caches share memory over a bus; memory holds u = 5. Events: (1) P1 reads u, (2) P3 reads u, (3) P3 writes u = 7, (4) P1 reads u again, (5) P2 reads u)
Notes:
– Processors see different values for u after event 3
– With write-back caches, the value written back to memory depends on which cache flushes or writes back its value, and when
– Processes accessing main memory see a stale value
11/14 Multiprocessing.18
Problems with Parallel I/O
• Memory → Disk: physical memory may be stale if the cache copy is dirty
• Disk → Memory: the cache may hold stale data and not see memory writes
• Use non-cacheable paging to solve
(Figure: disk DMA transfers to/from physical memory over the memory bus, bypassing the processor cache, which may hold cached portions of the page)
11/14 Multiprocessing.19
Snoopy Cache-Coherence Protocols
• The cache controller "snoops" all transactions on the shared bus
– A transaction is relevant if it is for a block the cache contains
– Take action to ensure coherence
» invalidate, update, or supply the value
– The action depends on the state of the block and the protocol
(Figure: processors P1..Pn with caches on a shared bus with memory and I/O devices; each cache line holds state, address, and data; each cache both performs cache-memory transactions and snoops the bus)
11/14 Multiprocessing.20
Write-through Invalidate Protocol
• Basic bus-based protocol
– Each processor has a cache with per-block state
– All transactions over the bus are snooped
• Writes invalidate all other caches
– Can have multiple simultaneous readers of a block, but a write invalidates them
• Two states per block in each cache
– State bits are associated with blocks that are in the cache
– Other blocks can be seen as being in the invalid (not-present) state in that cache
(Figure: P1..Pn with caches (state, tag, data per line), memory, and I/O devices on a shared bus)
11/14 Multiprocessing.21
Example: Write-through Invalidate
(Figure: same sequence as before - (1) P1 reads u = 5, (2) P3 reads u = 5, (3) P3 writes u = 7, which updates memory and invalidates P1's copy, (4) P1 re-reads u and gets 7, (5) P2 reads u = 7)
11/14 Multiprocessing.22
Write-through vs. Write-back
• The write-through protocol is simple
– Every write is observable
• Every write goes on the bus
– Only one write can take place at a time in any processor
• Uses a lot of bandwidth!
Example: 200 MHz dual-issue CPU, CPI = 1, 15% stores of 8 bytes
→ 30 M stores per second per processor
→ 240 MB/s per processor!
(Figure: P1..Pn with caches (state, tag, data), memory, and I/O devices on a shared bus)
11/14 Multiprocessing.23
Invalidate vs. Update
• Basic question of program behavior:
– Is a block written by one processor later read by others before it is overwritten?
• Invalidate:
– Yes: readers will take a miss
– No: multiple writes without additional traffic
• Update:
– Yes: avoids misses on later references
– No: multiple useless updates
11/14 Multiprocessing.24
Coherent Memory System
• Reading a location should return the latest value written by any process
• Easy in uniprocessors, except for I/O - infrequent, so software solutions work
– e.g., non-cacheable operations, ..
• The coherence problem is more pervasive and performance-critical in multiprocessors
11/14 Multiprocessing.25
Coherence Means: as if no cache exists
1. Operations issued by any process occur in the order issued by that process, and
2. The value returned by a read is the last value written to that location in the serial order
Two necessary features:
• Write propagation: a written value must become visible to others
• Write serialization: writes to a location are seen in the same order by all
– if I see w1 after w2, you should not see w1 before w2
11/14 Multiprocessing.26
Two Hardware Cache Coherence Solutions
• "Snoopy" schemes
» rely on broadcast to observe all coherence traffic
» well suited for buses and small-scale systems
• Directory schemes
» use centralized information to avoid broadcast
» scale well to large numbers of processors
11/14 Multiprocessing.27
Snoopy Cache Protocols
• All coherence-related activity is broadcast to all processors on a bus (e.g., the MESI protocol)
• Each processor monitors ("snoops") bus actions
• A processor reacts when the activity is relevant to its current cache contents
» if another processor wishes to write to a line, you may need to "invalidate" (i.e., discard) the copy in your own cache
» if another processor wishes to read a line for which you have a dirty copy, you may need to supply it
11/14 Multiprocessing.28
MESI Invalidate Cache Protocol
• 4 states (per cache block/line)
– Invalid (I)
– Shared (S): two or more caches have a copy
– Dirty or Modified (M): only one cache has a copy, modified
– Exclusive (E): only this cache has a copy, not modified
• Implemented in most commercial processors: Core Duo, Core 2, IBM Power, ..
• Each cache line has an address tag and state bits:
    M: Modified Exclusive
    E: Exclusive, unmodified
    S: Shared
    I: Invalid
11/14 Multiprocessing.29
MESI Protocol
• Modified / Exclusive / Shared / Invalid
• Upon loading, a line is marked E; subsequent reads are OK; a write marks it M
• If another processor's load is seen, mark the line S
• On a write to an S line, send an invalidate (I) to all others and mark it M
• If another processor reads an M line, write it back and mark it S
• A read or write to an I line misses
11/14 Multiprocessing.30
Snoop with Level-2 Caches Possible
• Processors have two-level caches
• Inclusion property: entries in IL1 & DL1 are also in L2
– an invalidation in L2 triggers an invalidation in L1
• Snooping on L2 does not affect CPU-L1 bandwidth
(Figure: four CPUs, each with an L1 and an L2 cache; a snooper sits at each L2, watching the shared bus)
11/14 Multiprocessing.31
Cache Coherent System Summary
• Provide a set of states, a state transition diagram, and actions
• Manage the coherence protocol
– (0) Determine when to invoke the coherence protocol
– (a) Find info about the state of the block in other caches to determine the action - whether we need to communicate with other cached copies
– (b) Locate the other copies
– (c) Communicate with those copies (invalidate/update)
• (0) is done the same way on all systems
– The state of the line is maintained in the cache
– The protocol is invoked if an "access fault" occurs on the line
• Different approaches are distinguished by (a) to (c)
11/14 Multiprocessing.32
Bus-based Coherence Summary
• All of (a), (b), (c) done through broadcast on the bus
– The faulting processor sends out a "search"
– Others respond to the search probe and take necessary action
• Could do it in a scalable network too
– Broadcast to all processors, and let them respond
• Conceptually simple, but broadcast doesn't scale with p
– On a bus, bus bandwidth doesn't scale
– On a scalable network, every fault leads to at least p network transactions
• Scalable coherence:
– Can have the same cache states and state transition diagram
– Different mechanisms to manage the protocol
11/14 Multiprocessing.33
More Scalable Coherence Approach: Directories
• Every memory block/line has an associated directory entry
– Tracks copies of cached blocks and their states
– On a miss, find the directory entry; communicate only with the nodes that have copies
– In scalable networks, communication with the directory and the copies is through network transactions
• There are alternatives for organizing directory information
• For k processors:
– With each cache block in memory: k presence bits, 1 dirty bit
– With each cache block in a cache: 1 valid bit and 1 dirty (owner) bit
(Figure: processors with caches on an interconnection network; each node's memory has a directory with presence bits and a dirty bit per block)
11/14 Multiprocessing.34
Directory Operation
(Figure: same directory organization as the previous slide - k presence bits and one dirty bit per memory block)
• Read from memory by processor i:
– If dirty-bit OFF then { read from main memory; turn p[i] ON; }
– If dirty-bit ON then { recall the line from the dirty processor; update memory; turn dirty-bit OFF; turn p[i] ON; supply the recalled data to i; }
• Write to memory by processor i:
– If dirty-bit OFF then { send invalidations to all caches that have the block; turn dirty-bit ON; supply data to i; turn p[i] ON; ... }