Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System. Ben Gamsa, Orran Krieger, Jonathan Appavoo, Michael Stumm (Department of Electrical and Computer Engineering, University of Toronto, 1999). Presented by: Anusha Muthiah, 4th Dec 2013
Transcript
Tornado
Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System
Ben Gamsa, Orran Krieger, Jonathan Appavoo, Michael Stumm
(Department of Electrical and Computer Engineering, University of Toronto, 1999)
Presented by: Anusha Muthiah, 4th Dec 2013
Why Tornado?
Previous shared-memory multiprocessor operating systems evolved from designs for architectures that existed back then
Locality – Traditional System
Memory latency was low: fetch – decode – execute without stalling
Memory faster than CPU = okay to leave data in memory
[Diagram: CPUs issue loads & stores over a shared bus to shared memory]
Locality – Traditional System
Memory got faster, but the CPU got faster than memory, so memory latency was now high. A fast CPU stalled on memory = useless!
Locality – Traditional System
Reading memory quickly was very important: the fetch – decode – execute cycle depends on it
[Diagram: a cache added between each CPU and the shared bus]
Cache: extremely close, fast memory
• Hitting a lot of addresses within a small region
• E.g. a program: a sequence of instructions, adjacent words on the same page of memory
• Even more so if in a loop
Repeated access to the same page of memory was good!
Memory latency
The CPU executes at the speed of the cache, making up for memory latency
[Diagram: memory hierarchy, CPU → cache → bus → memory]
The cache is really small. How can it help us more than 2% of the time?
Locality
Locality: same value or storage location being frequently accessed
Temporal Locality: reusing some data/resource within a small duration
Spatial Locality: use of data elements within relatively close storage locations
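Both kinds of locality show up in even a trivial traversal. The sketch below (the grid and its size are illustrative, not from the presentation) touches consecutive addresses, so each cache-line fill serves several iterations:

```c
#define N 256

int grid[N][N];   /* laid out row-major: grid[i][0..N-1] are adjacent in memory */

/* Row-major traversal: consecutive j values touch adjacent words, so one
 * cache-line fill serves several iterations (spatial locality). The running
 * sum is reused on every iteration and stays in a register (temporal locality). */
long sum_row_major(void) {
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += grid[i][j];
    return sum;
}
```

Swapping the two loops would visit the array column by column, striding N words per access and wasting most of each fetched cache line.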
Cache
All CPUs accessing same page in memory
Example: Shared Counter
[Diagram: a counter in shared memory, shared by both CPU-1 & CPU-2; a copy of the counter exists in both processors' caches]
Animation:
1. CPU-1 reads the counter (0) into its cache; CPU-2 reads it too
2. CPU-1 increments: write in exclusive mode, its cached copy becomes 1
3. CPU-2 reads: OK in shared mode, both caches hold 1
4. CPU-2 increments to 2: invalidate, the line goes from shared back to exclusive, and CPU-1's stale copy is discarded
Example: Shared Counter: terrible performance!
Problem: the counter bounces between CPU caches, leading to a high cache-miss rate
Try an alternate approach:
• Convert the counter to an array: each CPU has its own counter
• Updates can be local; the total count is preserved by the commutativity of addition
• To read, you need all the counters (add them up)
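The per-CPU array scheme can be sketched like this (the names and the fixed CPU count are assumptions for illustration): each CPU increments only its own slot, and a read sums every slot.

```c
#define NCPUS 4

static long counters[NCPUS];   /* one counter slot per CPU */

/* An update is local: only the calling CPU's slot is written. */
void counter_inc(int cpu) {
    counters[cpu]++;
}

/* A read must visit every slot and add them up; the total is correct
 * in any interleaving because addition is commutative. */
long counter_read(void) {
    long total = 0;
    for (int i = 0; i < NCPUS; i++)
        total += counters[i];
    return total;
}
```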
Example: Array-Based Counter (animation)
1. Memory holds an array of counters, one for each CPU, initially [0, 0]
2. CPU-1 increments its own element and CPU-2 increments its own: the array becomes [1, 1]
3. To read, a CPU adds up all the counters: 1 + 1 = 2
Performance: the array-based counter performs no better than the shared counter!
Why doesn't the array-based counter work?
• Data is transferred between main memory and caches in fixed-size blocks called cache lines
• If two counters sit in the same cache line, only one processor at a time can hold the line for writing
• Ultimately, the line still has to bounce between CPU caches
[Diagram: the two counters, 1 and 1, share one cache line in memory]
False sharing (animation)
1. Both counters live in one cache line, initially (0,0); each CPU caches the line in shared mode
2. CPU-1 increments its counter: the line becomes (1,0) and CPU-2's copy is invalidated
3. CPU-2 re-reads the line (sharing again), then increments its own counter to (1,1), invalidating CPU-1's copy
4. The line bounces between the caches even though neither CPU ever touches the other's counter
Ultimate solution: pad the array so each counter gets an independent cache line; spread the counter components out in memory
Example: Padded Array (animation)
1. Memory holds an individual cache line for each counter, initially 0 and 0
2. CPU-1 and CPU-2 each increment their own counter to 1; the updates are independent of each other
Performance: Padded Array
Works better
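The padding can be sketched in C11 by aligning each slot to a cache-line boundary (the 64-byte line size, the struct layout, and all names are assumptions for illustration; real line sizes vary by machine):

```c
#define CACHE_LINE 64
#define NCPUS 4

/* Each counter is padded out to a full cache line, so two CPUs
 * incrementing their own slots never invalidate each other's line. */
struct padded_counter {
    _Alignas(CACHE_LINE) long value;
    char pad[CACHE_LINE - sizeof(long)];
};

_Static_assert(sizeof(struct padded_counter) == CACHE_LINE,
               "exactly one cache line per counter");

static struct padded_counter padded[NCPUS];

void padded_inc(int cpu) {
    padded[cpu].value++;       /* touches only this CPU's line */
}

long padded_read(void) {
    long total = 0;
    for (int i = 0; i < NCPUS; i++)
        total += padded[i].value;
    return total;
}
```

The interface is identical to the unpadded array; only the memory layout changes, which is exactly the point of the slide.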
Cache
All CPUs accessing the same page in memory: write sharing is very destructive
Ideally:
• Each CPU uses data from its own cache (good cache hit rate)
• Cache contents stay in the cache and are not invalidated by other CPUs
• Each CPU has good locality of reference (it frequently accesses its cached data)
• Sharing is minimized
Object-Oriented Approach
Goal: minimize access to shared data structures; minimize the use of shared locks
Operating systems are driven by requests from applications on virtual resources
For good performance, requests to different virtual resources must be handled independently
How do we handle requests independently?
• Represent each resource by a different object
• Try to reduce sharing
• If an entire resource is locked, requests get queued up
• Avoid making the resource a source of contention: use fine-grain locking instead
Coarse/fine-grain locking example
The Process object maintains the list of mapped memory regions in the process's address space
On a page fault, it searches the process table to find the responsible region and forwards the page fault to it
[Diagram: Process object with a table of Regions 1…n; Thread 1's fault locks the whole table even though it only needs one region, so other threads get queued up]
Coarse/fine-grain locking example
[Diagram: with individual locks, one per region, Thread 1 and Thread 2 proceed in parallel on different regions]
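The fine-grain scheme can be sketched as follows (the struct layout and names are invented for illustration): the fault handler locks only the region that covers the faulting address, leaving the table and the other regions free for concurrent faults.

```c
#include <pthread.h>
#include <stddef.h>

struct region {
    unsigned long start, end;   /* address range [start, end) */
    pthread_mutex_t lock;       /* one lock per region, not per table */
};

/* Find the region covering addr and return it locked; other regions
 * stay available to faults on other threads. Returns NULL if no
 * region maps the address. */
struct region *lock_region_for(struct region *tab, int n, unsigned long addr) {
    for (int i = 0; i < n; i++) {
        if (addr >= tab[i].start && addr < tab[i].end) {
            pthread_mutex_lock(&tab[i].lock);
            return &tab[i];
        }
    }
    return NULL;
}
```

A coarse-grain version would instead take one mutex around the whole search and the fault handling, serializing every page fault in the process.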
Key memory management object relationships in Tornado
1. A page fault is delivered to the Process object
2. The Process object forwards the request to the responsible Region
3. The Region translates the fault address into a file offset and forwards the request to the FCM
4. If the file data is not cached in memory, the FCM requests a new physical page frame and asks the Cached Object Rep to fill the page from the file
5. If the file data is cached in memory, the address of the corresponding physical page frame is returned to the Region, which makes a call to the Hardware Address Translation object
Advantage of the object-oriented approach: the performance-critical case of an in-core page fault (a TLB miss fault for a page resident in memory)
• Only objects specific to the faulting resource are invoked
• The locks acquired and data structures accessed are internal to those objects
In contrast, many OSes use a global page cache or a single HAT layer, which becomes a source of contention
Clustered Object Approach
Object orientation is good, but some resources are just too widely shared
Enter clustering
E.g. the thread dispatch queue: if a single list is used, there is high contention (a bottleneck, plus more cache-coherence traffic)
Solution: partition the queue and give each processor a private list (no contention!)
Clustered Object Approach
Looks like a single object, but is actually made up of several component objects (reps); each rep handles calls from a subset of processors
All clients access a clustered object using a common clustered object reference
Each call to a clustered object is automatically directed to the appropriate local rep
Clustered object Shared counter Example
It looks like a shared counter, but is actually made up of representative counters that each CPU can access independently
Degree of Clustering
• One rep for the entire system
• One rep per processor
• One rep for a cluster of neighbouring processors
Examples:
• Cached Object Rep: read-mostly; all processors share a single rep
• Region: read-mostly; on the critical path for all page faults
• FCM: maintains the state of the pages of a file cached in memory; its hash table for the cache is split across many reps
Clustered Object Approach
Clustering through replication
• One way to share data safely is to make it read-only: replicate it, and all CPUs can read their copy from their own cache
• Writing? How do we maintain consistency among replicas? Fundamentally the same problem as cache coherence
• Take advantage of the semantics of the object you are replicating
• Use that semantic knowledge to build a specific implementation (an object-specific algorithm to maintain distributed state)
Advantages of the Clustered Object Approach
• Supports replication, partitioning and locks (essential on SMMPs)
• Built on object-oriented design: no need to worry about the location or organization of objects, just use clustered object references
• Allows incremental implementation: depending on the performance needed, change the degree of clustering (initially one rep serving all requests, then optimize when widely shared)
Clustered Object Implementation
Each processor has its own set of translation tables
A clustered object reference is just a pointer into the translation table
[Diagram: the references for Clustered Obj 1, Clustered Obj 2, … are pointers into a per-processor translation table; each entry points to the rep responsible for handling method invocations for the local processor]
Clustered Object Implementation
Each per-processor copy of the table is located at the same virtual address, so the same pointer into the table yields, on each processor, the rep responsible for handling the invocation there
Clustered Object Implementation
We do not know in advance which reps will be needed, so reps are created on first access:
1. Each clustered object defines a miss-handling object
2. All entries in the translation table are initialized to point to a global miss-handling object
3. On the first invocation, the global miss-handling object is called
4. The global miss-handling object saves state and calls the clustered object's own miss-handling object
5. The object miss handler: if the rep exists, a pointer to it is installed in the translation table; else the rep is created and a pointer to it is installed in the translation table
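The first-access path can be sketched as a simplified, single-level version of the scheme above (the rep contents and all names are invented; here a NULL table entry plays the role of the global miss handler's entry):

```c
#include <stdlib.h>

#define NCPUS 4

typedef struct rep {
    int cpu;        /* which processor this rep serves */
    long count;     /* some per-rep state */
} rep_t;

/* Per-processor translation table: NULL means "no rep installed yet",
 * i.e. the next invocation from that processor takes the miss path. */
static rep_t *xlat[NCPUS];

/* Object miss handler: create the rep on first access and install
 * a pointer to it in the translation table. */
static rep_t *miss_handler(int cpu) {
    rep_t *r = calloc(1, sizeof *r);
    r->cpu = cpu;
    xlat[cpu] = r;
    return r;
}

/* Every invocation indirects through the table: a hit goes straight
 * to the local rep, a miss falls into the handler. */
rep_t *get_rep(int cpu) {
    return xlat[cpu] ? xlat[cpu] : miss_handler(cpu);
}
```

After the first call from a processor, every subsequent call from it is a plain pointer dereference with no shared state touched.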
Counter as a Clustered Object (animation)
1. The common object reference directs each CPU to its own rep (Rep 1, Rep 2)
2. Each CPU increments its local rep independently of the other (e.g. Rep 1 goes from 1 to 2 while Rep 2 stays at 1)
3. To read the counter, the reps' values are added up (1 + 1 = 2)
Synchronization
Two issues:
• Locking: concurrency control with respect to modifications of data structures
• Existence guarantee: ensuring the data structure containing a variable is not deallocated during an update; any concurrent access to a shared data structure requires an existence guarantee
Locking has lots of overhead: the basic instruction overhead, plus extra cache-coherence traffic from write sharing of the lock
Tornado: encapsulate all locks within individual objects
• Reduces the scope of each lock and limits contention
• Splitting objects into multiple representatives reduces contention further!
Existence Guarantee
[Diagram: the scheduler's ready list of thread control blocks; Thread 1 is traversing the list while Thread 2 tries to delete element 2, whose memory is then garbage collected and possibly reallocated]
How do we stop a thread from deallocating an object being used by another?
Semi-automatic garbage collection scheme
Garbage Collection Implementation
Two kinds of references:
• Temporary: held privately by a single thread; destroyed when the thread terminates
• Persistent: stored in shared memory; accessed by multiple threads
Clustered object destruction
• Phase 1: the object makes sure all persistent references to it have been removed (from lists, tables, etc.)
• Phase 2: the object makes sure all temporary references to it have been removed
  - Uniprocessor: the number of active operations is maintained in a per-processor counter; count = 0 means none are active
  - Multiprocessor: the clustered object knows which set of processors can access it; the counter must become zero on all of those processors (a circulating token scheme)
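The phase-2 check can be sketched like this (all names are invented, and this simplified version just reads every counter directly; the real scheme circulates a token so that no processor has to read the others' counters):

```c
#define NCPUS 4

/* Per-processor count of operations currently active in the object:
 * incremented when a temporary reference is taken, decremented when
 * it is dropped. */
static int active_ops[NCPUS];

void op_enter(int cpu) { active_ops[cpu]++; }
void op_exit(int cpu)  { active_ops[cpu]--; }

/* Phase 2: the object may be freed only once every processor's
 * active-operation count has been observed at zero. */
int safe_to_destroy(void) {
    for (int i = 0; i < NCPUS; i++)
        if (active_ops[i] != 0)
            return 0;
    return 1;
}
```

Because the counters are per-processor, entering and exiting an operation never writes shared state; only the rare destruction path has to look at all of them.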
IPC
• Acts like a clustered object call that crosses from the protection domain of the client to that of the server and back
• Client requests are always serviced on their local processor
• Processor sharing: like handoff scheduling
Performance
Test machines:
• 16-processor NUMAchine prototype
• SimOS simulator
Component Results
Average number of cycles required for n threads:
• Performs quite well for memory allocation and miss handling
• Lots of variation in the garbage collection and PPC tests
• Multiple data structures occasionally map to the same cache block on some processors
Component Results
Average cycles required on SimOS with a 4-way associative cache
4-way associativity: a cache entry from main memory can go into any one of 4 places
Highly scalable; compares favourably to commercial operating systems (which show up to a 100x slowdown on 16 processors)
First generation: Hurricane (a coarse-grain approach to scalability)
Third generation: K42 (IBM & University of Toronto)
References
• Jonathan Appavoo. Clustered Objects: Initial Design, Implementation and Evaluation. Master's thesis, University of Toronto, 1998.
• Benjamin Gamsa. Tornado: Maximizing Locality and Concurrency in a Shared-Memory Multiprocessor Operating System. PhD thesis, University of Toronto, 1999.