Transcript
Page 1: Tornado

Tornado

Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System

Ben Gamsa, Orran Krieger, Jonathan Appavoo, Michael Stumm

(Department of Electrical and Computer Engineering, University of Toronto, 1999)

Presented by: Anusha Muthiah, 4th Dec 2013

Page 2: Tornado

Why Tornado?

Previous shared-memory multiprocessor operating systems evolved from designs for the architectures that existed back then.

Page 3: Tornado

Locality – Traditional System

Memory latency was low: fetch – decode – execute without stalling

Memory faster than CPU = okay to leave data in memory

[Diagram: CPUs issuing loads and stores over a shared bus to shared memory]

Page 4: Tornado

Locality – Traditional System

Memory got faster, but CPUs got faster still, so memory latency became high relative to the CPU. A fast CPU that stalls on memory is useless!

[Diagram: CPUs on a shared bus to shared memory]

Page 5: Tornado

Locality – Traditional System

Reading memory quickly became very important to keep the fetch–decode–execute cycle running.

[Diagram: CPUs on a shared bus to shared memory]

Page 6: Tornado

[Diagram: a private cache is added between each CPU and the shared bus to shared memory]

• Extremely close, fast memory
• Pays off when hitting a lot of addresses within a small region
• E.g. a program is a sequence of instructions: adjacent words on the same page of memory
• Even more reuse inside a loop

Page 7: Tornado

[Diagram: each CPU has a private cache on the shared bus to shared memory]

Repeated access to the same page of memory was good!

Page 8: Tornado

Memory latency

The CPU executes at the speed of the cache, making up for memory latency.

[Diagram: CPU, cache, bus, memory hierarchy]

But the cache is really small. How can it help us more than 2% of the time?

Page 9: Tornado

Locality

Locality: same value or storage location being frequently accessed

Temporal Locality: reusing some data/resource within a small duration

Spatial Locality: use of data elements within relatively close storage locations
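As an illustration (not from the slides), a simple C++ loop shows both kinds of locality at once: the sequential array walk is spatial, and the reuse of the accumulator is temporal.

    #include <vector>

    // Summing a contiguous array: the sequential walk touches adjacent
    // words (spatial locality), and `sum` is reused on every iteration
    // (temporal locality), so almost every access hits in the cache.
    long sum_array(const std::vector<long>& v) {
        long sum = 0;
        for (long x : v)
            sum += x;
        return sum;
    }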

Page 10: Tornado

Cache

All CPUs accessing same page in memory

[Diagram: all CPUs caching the same page of shared memory over the shared bus]

Page 11: Tornado

Example: Shared Counter

[Diagram: a counter in shared memory, shared by both CPU-1 and CPU-2; a copy of the counter exists in both processors' caches]

Page 12: Tornado

Example: Shared Counter

[Diagram: counter = 0 in memory; neither CPU has cached it yet]

Page 13: Tornado

Example: Shared Counter

[Diagram: both CPU-1 and CPU-2 have cached the counter value 0]

Page 14: Tornado

Example: Shared Counter

[Diagram: CPU-1 increments the counter to 1, writing the cache line in exclusive mode]

Page 15: Tornado

Example: Shared Counter

[Diagram: CPU-2 reads the counter; reads are OK, and the line drops to shared mode with value 1 in both caches]

Page 16: Tornado

Example: Shared Counter

[Diagram: CPU-2 increments the counter to 2; the line goes from shared to exclusive mode, invalidating CPU-1's copy]

Page 17: Tornado

Example: Shared Counter. Terrible performance!

Page 18: Tornado

Problem: Shared Counter

The counter bounces between CPU caches, leading to a high cache-miss rate.

Try an alternate approach (a sketch follows below):
• Convert the counter to an array, with one counter per CPU
• Updates can be local
• The total count is preserved by the commutativity of addition
• To read, you add up all the counters
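A minimal C++ sketch of this per-CPU array (my illustration, not the paper's code; kNumCpus and the use of std::atomic are assumptions):

    #include <atomic>

    constexpr int kNumCpus = 4;            // assumed CPU count for illustration

    // One counter slot per CPU: each CPU increments only its own slot,
    // and a reader sums every slot. Increments commute, so the sum is
    // exact. As the following slides show, adjacent slots still share a
    // cache line, so the line keeps bouncing between caches.
    std::atomic<long> counters[kNumCpus];

    void increment(int cpu) {
        counters[cpu].fetch_add(1, std::memory_order_relaxed);
    }

    long read_total() {
        long total = 0;
        for (auto& c : counters)
            total += c.load(std::memory_order_relaxed);
        return total;
    }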

Page 19: Tornado

Example: Array-based Counter

[Diagram: an array of counters in memory, one per CPU, both initialized to 0]

Page 20: Tornado

Example: Array-based Counter

[Diagram: CPU-1 increments its own slot; the array now holds 1, 0]

Page 21: Tornado

Example: Array-based Counter

[Diagram: CPU-2 increments its own slot; the array now holds 1, 1]

Page 22: Tornado

Example: Array-based Counter

[Diagram: to read the counter, a CPU adds all the slots: 1 + 1 = 2]

Page 23: Tornado

Performance: Array-based Counter. Performs no better than the shared counter!

Page 24: Tornado

Why doesn't the array-based counter work? Data is transferred between main memory and the caches in fixed-size blocks called cache lines.

If two counters sit in the same cache line, only one processor at a time can hold the line for writing.

Ultimately, the line still has to bounce between CPU caches.

[Diagram: the two counters (1, 1) share a single cache line in memory]

Page 25: Tornado

False sharing

[Diagram: two counters (0,0) in one cache line in memory; no cached copies yet]

Page 26: Tornado

False sharing

[Diagram: both CPUs cache the line holding (0,0)]

Page 27: Tornado

False sharing

[Diagram: the line holding (0,0) is cached by both CPUs in shared mode]

Page 28: Tornado

False sharing

[Diagram: CPU-1 increments its counter, making the line (1,0); CPU-2's copy is invalidated]

Page 29: Tornado

False sharing

[Diagram: CPU-2 re-fetches the line (1,0); it is shared again]

Page 30: Tornado

False sharing

[Diagram: CPU-2 increments its counter, making the line (1,1); CPU-1's copy is invalidated]

Page 31: Tornado

Ultimate Solution

Pad the array so each counter gets an independent cache line, spreading the counter components out in memory (a padded sketch follows below).
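A hedged C++ sketch of the padding idea (the 64-byte line size and kNumCpus are assumptions; real line sizes vary by machine):

    #include <atomic>

    constexpr int kNumCpus = 4;            // assumed CPU count for illustration

    // Align each per-CPU counter to its own 64-byte cache line, so a
    // write by one CPU never invalidates another CPU's cached counter
    // (no false sharing).
    struct alignas(64) PaddedCounter {
        std::atomic<long> value{0};
    };

    PaddedCounter counters[kNumCpus];

    void increment(int cpu) {
        counters[cpu].value.fetch_add(1, std::memory_order_relaxed);
    }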

Page 32: Tornado

Example: Padded Array

[Diagram: padded array with an individual cache line for each counter, both 0]

Page 33: Tornado

Example: Padded Array

[Diagram: each CPU increments its own padded counter to 1; the updates are independent of each other]

Page 34: Tornado

Performance: Padded Array

Works better

Page 35: Tornado

Cache

All CPUs accessing the same page in memory: write sharing is very destructive.

[Diagram: CPUs with private caches on a shared bus to shared memory]

Page 36: Tornado

Lessons from the Shared Counter Example

• Minimize read/write sharing and write sharing, to minimize cache-coherence traffic
• Minimize false sharing

Page 37: Tornado

Now and Then

• Then (uniprocessor): locality is good. Now (multiprocessor): locality of shared data is bad; don't share!
• A traditional OS is implemented to have good locality
• Running the same code on the new architecture causes cache interference
• Adding more CPUs is detrimental
• Each CPU wants to run from its own cache, without interference from others

Page 38: Tornado

Modularity

• Minimize write sharing and false sharing
• How do you structure a system so that CPUs don't share the same memory?
• Then: no objects. Now: split everything up and keep it modular
• This paper's approach: object orientation, clustered objects, and existence guarantees

Page 39: Tornado

Goal

Ideally:
• Each CPU uses data from its own cache
• The cache hit rate is good
• Cache contents stay in the cache and are not invalidated by other CPUs
• Each CPU has good locality of reference (it frequently accesses its cached data)
• Sharing is minimized

Page 40: Tornado

Object-Oriented Approach

Goal:
• Minimize access to shared data structures
• Minimize use of shared locks

Operating systems are driven by applications' requests on virtual resources. For good performance, requests to different virtual resources must be handled independently.

Page 41: Tornado

How do we handle requests independently?
• Represent each resource by a different object
• Try to reduce sharing
• If an entire resource is locked, requests get queued up
• Avoid making the resource a source of contention: use fine-grain locking instead (see the sketch below)
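A hypothetical C++ sketch of the contrast (names like Region::handle_fault are mine, not Tornado's interfaces): with one mutex per region, faults on different regions no longer serialize on a single table lock.

    #include <mutex>
    #include <vector>

    // Fine-grain locking: each Region carries its own lock, so two
    // threads faulting on different regions proceed in parallel. A
    // coarse design would instead take one lock over the whole table.
    struct Region {
        std::mutex lock;
        void handle_fault(/* fault address */) {
            std::lock_guard<std::mutex> guard(lock);
            // ... resolve the fault using only this region's state ...
        }
    };

    struct Process {
        std::vector<Region*> regions;  // the mapped-region table
    };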

Page 42: Tornado

Coarse/Fine-Grain Locking Example

The process object maintains the list of mapped memory regions in the process's address space. On a page fault, it searches this table to find the responsible region and forwards the page fault to it.

[Diagram: Thread 1's page fault locks the process object's whole region table, even though it needs only one region; other threads get queued up]

Page 43: Tornado

Coarse/Fine grain locking example

[Diagram: with individual locks per region, Thread 1 and Thread 2 can fault on different regions in parallel]

Page 44: Tornado

Key memory management object relationships in Tornado

1. A page fault is delivered to the Process object.
2. The Process object forwards the request to the responsible Region.
3. The Region translates the fault address into a file offset and forwards the request to the File Cache Manager (FCM).
4. If the file data is not cached in memory, the FCM requests a new physical page frame and asks the Cached Object Representative to fill the page from the file.
5. If the file data is cached in memory, the address of the corresponding physical page frame is returned to the Region, which makes a call to the Hardware Address Translation (HAT) object.

Page 45: Tornado

Advantage of the Object-Oriented Approach

Consider the performance-critical case of an in-core page fault (a TLB-miss fault for a page resident in memory):
• The objects invoked are specific to the faulting resource
• The locks acquired and the data structures accessed are internal to those objects

In contrast, many OSes use a global page cache or a single HAT layer, which becomes a source of contention.

Page 46: Tornado

Clustered Object Approach

The object-oriented approach is good, but some resources are just too widely shared. Enter clustering.

E.g. the thread dispatch queue:
• If a single list is used: high contention (a bottleneck, plus more cache-coherence traffic)
• Solution: partition the queue and give each processor a private list (no contention!)

Page 47: Tornado

Clustered Object Approach

Looks like a single object Actually made up of several component objects. Each rep handles calls from a subset of processors

All clients access a clusteredobject using a common clustered obj reference

Each call to a Clustered Obj automatically directed to appropriate local rep

Page 48: Tornado

Clustered Object: Shared Counter Example

It looks like a single shared counter, but it is actually made up of representative counters that each CPU can access independently.
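A minimal C++ sketch of such a clustered counter (my illustration; real Tornado reps are reached through per-processor translation tables rather than a simple array):

    #include <atomic>

    constexpr int kNumCpus = 4;            // assumed CPU count for illustration

    // Callers see one counter object, but each CPU's increments go to a
    // local, cache-line-aligned representative, so updates never touch
    // a remote cache line. Reads visit every rep and sum them.
    class ClusteredCounter {
        struct alignas(64) Rep {
            std::atomic<long> value{0};
        };
        Rep reps_[kNumCpus];
    public:
        void increment(int cpu) {          // cpu = calling processor's id
            reps_[cpu].value.fetch_add(1, std::memory_order_relaxed);
        }
        long read() const {                // slower path: add all the reps
            long total = 0;
            for (const Rep& r : reps_)
                total += r.value.load(std::memory_order_relaxed);
            return total;
        }
    };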

Page 49: Tornado

Degree of Clustering

• One rep for the entire system
• One rep per processor
• One rep per cluster of neighbouring processors

Examples:
• Cached Object Rep: read-mostly, so all processors share a single rep
• Region: read-mostly; on the critical path for all page faults
• FCM: maintains the state of the pages of a file cached in memory; its hash table is split across many reps

Page 50: Tornado

Clustered Object Approach

[Diagram: a clustered object and its per-processor reps]

Page 51: Tornado

Clustering through Replication

• One way to share data safely is to share it read-only: replicate it, and all CPUs can read their own copy from cache
• Writing? How do we maintain consistency among the replicas? That is fundamentally the same problem as cache coherence
• Take advantage of the semantics of the object you are replicating: use that knowledge to build an object-specific algorithm for maintaining the distributed state

Page 52: Tornado

Advantages of the Clustered Object Approach

• Supports replication, partitioning, and locks (essential for shared-memory multiprocessors)
• Built on the object-oriented design: no need to worry about the location or organization of objects, just use clustered-object references
• Allows incremental implementation: depending on the performance needed, change the degree of clustering (start with one rep serving all requests, then optimize when the object turns out to be widely shared)

Page 53: Tornado

Clustered Object Implementation

Each processor has its own set of translation tables. A clustered-object reference is just a pointer into the translation table.

[Diagram: clustered-object references point into a per-processor translation table; each entry points to the rep responsible for handling method invocations on the local processor]

Page 54: Tornado

Clustered Object Approach

Looks like a single object Actually made up of several component objects. Each rep handles calls from a subset of processors

All clients access a clusteredObject using a common clustered obj reference

Each call to a Clustered Obj automatically directed to appropriate local rep

Page 55: Tornado

Clustered Object Implementation

Each per-processor copy of the table is located at the same virtual address, so the same pointer into the table yields, on each processor, the rep responsible for handling the invocation there.

Page 56: Tornado

Clustered Object Implementation

We do not know in advance which reps will be needed, so reps are created on first access:
1. Each clustered object defines a miss-handling object.
2. All entries in the translation table are initialized to point to a global miss-handling object.
3. On the first invocation, the global miss-handling object is called.
4. The global miss-handling object saves state and calls the clustered object's own miss-handling object.
5. The object's miss handler: if the rep already exists, install a pointer to it in the translation table; otherwise create the rep, then install a pointer to it.

(A sketch of this first-access path follows below.)
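A hedged C++ sketch of the first-access path under simplified assumptions (one global table instead of true per-processor tables; Rep and all the names are hypothetical):

    #include <array>
    #include <mutex>

    struct Rep { /* per-processor state and methods */ };

    constexpr int kNumCpus = 4;                      // assumed CPU count
    std::array<Rep*, kNumCpus> translation_table{};  // null entry = "miss"
    std::mutex miss_lock;

    // Stand-in for the object's miss handler: create the local rep on
    // first use and patch the table so later calls skip this path.
    Rep* handle_miss(int cpu) {
        std::lock_guard<std::mutex> guard(miss_lock);
        if (translation_table[cpu] == nullptr)       // rep may already exist
            translation_table[cpu] = new Rep{};      // otherwise create it
        return translation_table[cpu];
    }

    // Fast path of a clustered-object call: follow the table entry.
    Rep* invoke(int cpu) {
        Rep* rep = translation_table[cpu];
        return rep != nullptr ? rep : handle_miss(cpu);
    }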

Page 57: Tornado

Counter: Clustered Object

[Diagram: a counter clustered object; both CPUs hold the same object reference, which resolves locally to Rep 1 or Rep 2]

Page 58: Tornado

Counter: Clustered Object

[Diagram: each CPU increments its local rep; Rep 1 and Rep 2 both hold 1]

Page 59: Tornado

Counter: Clustered Object

[Diagram: CPU-1 increments its rep again (Rep 1 = 2, Rep 2 = 1); the updates are independent of each other]

Page 60: Tornado

Counter: Clustered Object

[Diagram: a read request arrives through the object reference]

Page 61: Tornado

Counter: Clustered Object

[Diagram: the read adds all the rep counters: 1 + 1]

Page 62: Tornado

Synchronization

Two distinct issues:
• Locking: concurrency control for modifications to data structures
• Existence guarantees: ensuring that the data structure containing a variable is not deallocated while it is being updated; any concurrent access to a shared data structure must provide an existence guarantee

Page 63: Tornado

Locking

Locks carry a lot of overhead:
• The basic instruction overhead
• Extra cache-coherence traffic from write sharing of the lock itself

Tornado: encapsulate all locks within individual objects
• Reduces the scope of each lock and limits contention
• Splitting objects into multiple representatives reduces contention further!

Page 64: Tornado

Existence Guarantee

[Diagram: the scheduler's ready list of thread control blocks. Thread 1 is traversing the list while Thread 2 tries to delete element 2; if element 2 is garbage-collected, its memory may be reallocated out from under Thread 1]

Page 65: Tornado

Existence Guarantee

How do we stop one thread from deallocating an object still in use by another? Tornado uses a semi-automatic garbage-collection scheme.

Page 66: Tornado

Garbage Collection Implementation

References are divided into two classes:
• Temporary: held privately by a single thread; destroyed when the thread terminates
• Persistent: stored in shared memory; accessed by multiple threads

Page 67: Tornado

Clustered Object Destruction

• Phase 1: the object makes sure all persistent references to it have been removed (from lists, tables, etc.)
• Phase 2: the object makes sure all temporary references to it have been removed
  - Uniprocessor: the number of active operations is maintained in a per-processor counter; count = 0 means none are active
  - Multiprocessor: the clustered object knows which set of processors can access it; the counter must be observed to be zero on all of those processors (a circulating-token scheme)

(A sketch of the per-processor counting follows below.)
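A minimal C++ sketch of the per-processor activity counting (my illustration; the token circulation itself is omitted, and kNumCpus is an assumption):

    #include <atomic>

    constexpr int kNumCpus = 4;            // assumed CPU count for illustration

    // Each processor counts the operations currently active on the
    // object. The object may be freed only after every processor's
    // counter has been observed at zero; Tornado circulates a token so
    // the processors are checked one at a time.
    struct alignas(64) ActiveCount {
        std::atomic<int> n{0};
    };
    ActiveCount active[kNumCpus];

    void enter_op(int cpu) { active[cpu].n.fetch_add(1); }  // operation begins
    void leave_op(int cpu) { active[cpu].n.fetch_sub(1); }  // operation ends

    // Called by the current token holder for its own processor.
    bool quiescent_on(int cpu) {
        return active[cpu].n.load() == 0;
    }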

Page 68: Tornado

IPC

• Acts like a clustered-object call that crosses from the protection domain of the client to that of the server, and back
• Client requests are always serviced on their local processors
• Processor sharing works like handoff scheduling

Page 69: Tornado

Performance

Test machines:
• A 16-processor NUMAchine prototype
• The SimOS simulator

Page 70: Tornado

Component Results

[Chart: average number of cycles required for n threads]

• Performs quite well for memory allocation and miss handling
• Lots of variation in the garbage-collection and PPC tests, caused by multiple data structures occasionally mapping to the same cache block on some processors

Page 71: Tornado

Component Results

[Chart: average cycles required on SimOS with a 4-way set-associative cache]

4-way associativity: a block from main memory can be placed in any one of 4 locations in the cache.

Page 72: Tornado

• Highly scalable; compares favourably with commercial operating systems (which suffer up to 100x slowdown on 16 processors)
• First generation: Hurricane (a coarse-grain approach to scalability)
• Third generation: K42 (IBM & University of Toronto)

Page 73: Tornado

References

Jonathan Appavoo. Clustered Objects: Initial Design, Implementation and Evaluation. Master's thesis, University of Toronto, 1998.

Benjamin Gamsa. Tornado: Maximizing Locality and Concurrency in a Shared-Memory Multiprocessor Operating System. PhD thesis, University of Toronto, 1999.

Shared-counter illustration: CS533 slides, 2012.