11/14 Multiprocessing.1
Multiprocessing & Cache Coherency
Jan 08, 2018
11/14 Multiprocessing.2
What is Multiprocessing? (REVIEW)
• Computer system - supports several simultaneous processes
• All OSes support multiprocessing
• More complex - must share system resources
• ILP is running out of steam
• Today's CPUs are Chip MultiProcessors (CMPs)
11/14 Multiprocessing.3
Multiple Processes – One CPU (review)
(Figure: one processor with memory holding a context for each process - its stack, task priority, and saved CPU registers)
11/14 Multiprocessing.4
Context-Switch to Share CPU (review)
• Time-slicing
– Time-slice: period of time a task runs before a context-switch
– Hardware interrupt from the system timer
– Kernel scheduling
• Preemption
– Current task is halted and switched out by a higher-priority task
– Typical in embedded, real-time systems
(Figure: tasks alternating on the CPU, with context switches separating each time-slice)
11/14 Multiprocessing.5
Process State (review)
• A process can be in one of many states
(Figure: state diagram with states Dormant, Ready, Running, Delayed, Waiting for Event, and Interrupted; transitions include task create/delete, context switch, interrupt, delay task for n ticks / delay expired, and wait for event / event occurred)
11/14 Multiprocessing.6
Extensions of the Memory System
(Figure: two organizations, each with processors P1..Pn and private caches ($) on an interconnection network - centralized memory ("dance hall", UMA) vs. distributed memory (NUMA); the distributed design scales better)
11/14 Multiprocessing.7
Symmetric Multiprocessors
• Symmetric: all memory is equally far away from all processors
• Any processor can do any I/O (set up a DMA transfer)
(Figure: processors on a CPU-memory bus with memory, a bus bridge, graphics output, and I/O controllers on an I/O bus connecting to networks)
11/14 Multiprocessing.8
Bus-Based Symmetric Shared Memory
• On-chip building blocks for larger systems; already on the desktop
• Attractive for servers and parallel programs
– Fine-grain resource sharing
– Uniform access via loads/stores
– Automatic data movement and coherent replication in caches
– Cheap and powerful extension
• Normal uniprocessor mechanisms to access data
(Figure: processors P1..Pn with caches ($), memory, and I/O devices on a shared bus)
11/14 Multiprocessing.9
SMP example: connecting IBM Power chips
• 8-way SMP
• Each CMP has 2 cores
11/14 Multiprocessing.10
Parallel Programming Models
• Programming model: languages and libraries create an abstract view of the machine
• Control
– How is parallelism created?
– Operation ordering
– Synchronization control
• Data
– Private vs. shared
– How is shared data accessed and communicated?
• Synchronization
– What operations can be used?
– What are atomic (indivisible) operations?
11/14 Multiprocessing.11
Programming Model 1: Shared Memory
• Program: collection of threads with private variables AND shared variables (e.g., static variables, shared common blocks)
– Threads communicate implicitly by writing and reading shared variables
– Threads coordinate by synchronizing on shared variables
(Figure: threads P0..Pn, each with private memory holding its own copy of i (2, 5, 8), all reading and writing a shared variable s in shared memory)
11/14 Multiprocessing.12
Synchronization Techniques
• Mutexes - mutual exclusion locks (binary semaphores)
– Threads are mostly independent and must access common data:
    lock *l = alloc_and_init();  /* shared */
    lock(l);
    ... access data ...
    unlock(l);
• Barrier - global (coordinated) synchronization
– Simple use of barriers: all threads hit the same one
    work_on_my_subgrid();
    barrier;
    read_neighboring_values();
    barrier;
• Need atomic operations bigger than loads/stores
– Atomic swap, test-and-test-and-set
• Transactional memory
– Hardware equivalent of optimistic concurrency
– Solves many parallel programming problems
11/14 Multiprocessing.13
Programming Model 2: Message Passing
• Program: a collection of processes
– Usually fixed at program startup
– Each has a local address space - NO shared data
– Logically shared data is partitioned across processes
• Processes communicate by explicit send/receive pairs
– Coordination is implicit in every communication event
– MPI (Message Passing Interface) is the most commonly used SW
(Figure: processes P0..Pn, each with its own private memory holding its own copies of s and i, communicating over a network via explicit pairs such as "send P1,s" matched by a receive)
11/14 Multiprocessing.14
MPI – the de facto standard
• MPI has become the de facto standard for parallel computing using message passing
• Example: (FYI)
    for (i = 1; i < numprocs; i++) {
        sprintf(buff, "Hello %d! ", i);
        MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
    }
    for (i = 1; i < numprocs; i++) {
        MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);
        printf("%d: %s\n", myid, buff);
    }
• Pros and cons of standards
– MPI is a standard for development in the HPC community: portability
– But the MPI standard was built on mid-80s technology
11/14 Multiprocessing.15
Shared Memory vs. Message Passing
• Advantages of shared memory:
– Implicit communication (loads/stores)
– Low overhead when cached
• Disadvantages of shared memory:
– Complex to scale well
– Requires synchronization operations
• Advantages of message passing:
– Explicit communication (sending/receiving of messages)
– Easier to control data placement (no automatic caching)
• Disadvantages of message passing:
– High message-passing overhead
– Complex to program
• Due to CMPs, cache-coherent shared memory systems will be the dominant form of multiprocessor
11/14 Multiprocessing.16
Caches and Cache Coherence
• Caches play a key role
– Reduce average data access time
– Reduce bandwidth demands placed on the shared interconnect
• But private processor caches create a problem
– Copies of a variable can be present in multiple caches
– A write by one processor may not become visible to others
» they keep reading the stale value in their caches
• Solutions
– Cache snoop architectures & protocols
11/14 Multiprocessing.17
Example Cache Coherence Problem
(Figure: P1, P2, and P3 with caches share memory over a bus; memory holds u = 5. Events: (1) P1 reads u, (2) P3 reads u, (3) P3 writes u = 7, (4) P1 reads u again, (5) P2 reads u)
Notes:
– Processors see different values for u after event 3
– With write-back caches, the value written back to memory depends on which cache flushes or writes back its value, and when
– Processes accessing main memory see a stale value
11/14 Multiprocessing.18
Problems with Parallel I/O
• Memory → Disk: physical memory may be stale if the cache copy is dirty
• Disk → Memory: the cache may hold stale data and not see memory writes
• Use non-cacheable paging to solve
(Figure: disk DMA transfers to/from physical memory over the memory bus, bypassing the processor cache, which may hold cached portions of the page)
11/14 Multiprocessing.19
Snoopy Cache-Coherence Protocols
• The cache controller "snoops" all transactions on the shared bus
– A transaction is relevant if it is for a block the cache contains
– Take action to ensure coherence
» invalidate, update, or supply the value
– The action depends on the state of the block and the protocol
(Figure: processors P1..Pn with caches on a shared bus with memory and I/O devices; each cache line holds state, address, and data; each cache both performs cache-memory transactions and snoops the bus)
11/14 Multiprocessing.20
Write-through Invalidate Protocol
• Basic bus-based protocol
– Each processor has a cache with per-block state
– All transactions over the bus are snooped
• Writes invalidate all other caches
– Can have multiple simultaneous readers of a block, but a write invalidates them
• Two states per block in each cache
– State bits are associated with blocks that are in the cache
– Other blocks can be seen as being in the invalid (not-present) state in that cache
(Figure: P1..Pn with caches (state, tag, data per line), memory, and I/O devices on a shared bus)
11/14 Multiprocessing.21
Example: Write-through Invalidate
(Figure: same sequence as before - (1) P1 reads u = 5, (2) P3 reads u = 5, (3) P3 writes u = 7, which updates memory and invalidates P1's copy, (4) P1 re-reads u and gets 7, (5) P2 reads u = 7)
11/14 Multiprocessing.22
Write-through vs. Write-back
• The write-through protocol is simple
– Every write is observable
• Every write goes on the bus
– Only one write can take place at a time in any processor
• Uses a lot of bandwidth!
Example: 200 MHz dual-issue CPU, CPI = 1, 15% stores of 8 bytes
→ 30 M stores per second per processor
→ 240 MB/s per processor!
(Figure: P1..Pn with caches (state, tag, data), memory, and I/O devices on a shared bus)
11/14 Multiprocessing.23
Invalidate vs. Update
• Basic question of program behavior:
– Is a block written by one processor later read by others before it is overwritten?
• Invalidate:
– Yes: readers will take a miss
– No: multiple writes without additional traffic
• Update:
– Yes: avoids misses on later references
– No: multiple useless updates
11/14 Multiprocessing.24
Coherent Memory System
• Reading a location should return the latest value written by any process
• Easy in uniprocessors, except for I/O - infrequent, so software solutions work
– e.g., non-cacheable operations, ..
• The coherence problem is more pervasive and performance-critical in multiprocessors
11/14 Multiprocessing.25
Coherence Means: as if no cache exists
1. Operations issued by any process occur in the order issued by that process, and
2. The value returned by a read is the last value written to that location in the serial order
Two necessary features:
• Write propagation: a written value must become visible to others
• Write serialization: writes to a location are seen in the same order by all
– if I see w1 after w2, you should not see w1 before w2
11/14 Multiprocessing.26
Two Hardware Cache Coherence Solutions
• "Snoopy" schemes
» rely on broadcast to observe all coherence traffic
» well suited for buses and small-scale systems
• Directory schemes
» use centralized information to avoid broadcast
» scale well to large numbers of processors
11/14 Multiprocessing.27
Snoopy Cache Protocols
• All coherence-related activity is broadcast to all processors on a bus (e.g., the MESI protocol)
• Each processor monitors ("snoops") bus actions
• A processor reacts when the activity is relevant to its current cache contents
» if another processor wishes to write to a line, you may need to "invalidate" (i.e., discard) the copy in your own cache
» if another processor wishes to read a line for which you have a dirty copy, you may need to supply it
11/14 Multiprocessing.28
MESI Invalidate Cache Protocol
• 4 states (per cache block/line)
– Invalid (I)
– Shared (S): two or more caches have a copy
– Dirty or Modified (M): only one cache has a copy, modified
– Exclusive (E): only this cache has a copy, not modified
• Implemented in most commercial processors: Core Duo, Core 2, IBM Power, ..
• Each cache line has an address tag and state bits:
    M: Modified Exclusive
    E: Exclusive, unmodified
    S: Shared
    I: Invalid
11/14 Multiprocessing.29
MESI Protocol
• Modified / Exclusive / Shared / Invalid
• Upon loading, a line is marked E; subsequent reads are OK; a write marks it M
• If another processor's load is seen, mark the line S
• On a write to an S line, send an invalidate (I) to all others and mark it M
• If another processor reads an M line, write it back and mark it S
• A read or write to an I line misses
11/14 Multiprocessing.30
Snoop with Level-2 Caches Possible
• Processors have two-level caches
• Inclusion property: entries in IL1 & DL1 are also in L2
– an invalidation in L2 triggers an invalidation in L1
• Snooping on L2 does not affect CPU-L1 bandwidth
(Figure: four CPUs, each with an L1 and an L2 cache; a snooper sits at each L2, watching the shared bus)
11/14 Multiprocessing.31
Cache Coherent System Summary
• Provide a set of states, a state transition diagram, and actions
• Manage the coherence protocol
– (0) Determine when to invoke the coherence protocol
– (a) Find info about the state of the block in other caches to determine the action - whether we need to communicate with other cached copies
– (b) Locate the other copies
– (c) Communicate with those copies (invalidate/update)
• (0) is done the same way on all systems
– The state of the line is maintained in the cache
– The protocol is invoked if an "access fault" occurs on the line
• Different approaches are distinguished by (a) to (c)
11/14 Multiprocessing.32
Bus-based Coherence Summary
• All of (a), (b), (c) done through broadcast on the bus
– The faulting processor sends out a "search"
– Others respond to the search probe and take necessary action
• Could do it in a scalable network too
– Broadcast to all processors, and let them respond
• Conceptually simple, but broadcast doesn't scale with p
– On a bus, bus bandwidth doesn't scale
– On a scalable network, every fault leads to at least p network transactions
• Scalable coherence:
– Can have the same cache states and state transition diagram
– Different mechanisms to manage the protocol
11/14 Multiprocessing.33
More Scalable Coherence Approach: Directories
• Every memory block/line has an associated directory entry
– Tracks copies of cached blocks and their states
– On a miss, find the directory entry; communicate only with the nodes that have copies
– In scalable networks, communication with the directory and the copies is through network transactions
• There are alternatives for organizing directory information
• For k processors:
– With each cache block in memory: k presence bits, 1 dirty bit
– With each cache block in a cache: 1 valid bit and 1 dirty (owner) bit
(Figure: processors with caches on an interconnection network; each node's memory has a directory with presence bits and a dirty bit per block)
11/14 Multiprocessing.34
Directory Operation
(Figure: same directory organization as the previous slide - k presence bits and one dirty bit per memory block)
• Read from memory by processor i:
– If dirty-bit OFF then { read from main memory; turn p[i] ON; }
– If dirty-bit ON then { recall the line from the dirty processor; update memory; turn dirty-bit OFF; turn p[i] ON; supply the recalled data to i; }
• Write to memory by processor i:
– If dirty-bit OFF then { send invalidations to all caches that have the block; turn dirty-bit ON; supply data to i; turn p[i] ON; ... }