Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System. Ben Gamsa, Orran Krieger, Jonathan Appavoo, Michael Stumm (Department of Electrical and Computer Engineering, University of Toronto, 1999). Presented by: Anusha Muthiah, 4th Dec 2013
Transcript
Tornado
Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System
Ben Gamsa, Orran Krieger, Jonathan Appavoo, Michael Stumm
(Department of Electrical and Computer Engineering, University of Toronto, 1999)
Presented by: Anusha Muthiah, 4th Dec 2013
Why Tornado?
Previous shared-memory multiprocessor operating systems evolved from designs for architectures that existed back then
Locality – Traditional System
Memory latency was low: fetch – decode – execute without stalling
Memory faster than CPU = okay to leave data in memory
[Diagram: CPUs issue loads & stores over a shared bus to shared memory]
Locality – Traditional System
Memory got faster, but the CPU got faster than memory, so memory latency was now high. A fast CPU stalled on memory = useless!
Locality – Traditional System
Reading memory quickly was very important: the fetch – decode – execute cycle depends on it
[Diagram: a cache added between each CPU and the shared bus]
Cache: extremely close, fast memory
• Hitting a lot of addresses within a small region
• E.g. a program: a sequence of instructions, adjacent words on the same page of memory
• Even more so if in a loop
Repeated access to the same page of memory was good!
Memory latency
The CPU executes at the speed of the cache, making up for memory latency
[Diagram: memory hierarchy, CPU → cache → bus → memory]
The cache is really small. How can it help us more than 2% of the time?
Locality
Locality: same value or storage location being frequently accessed
Temporal Locality: reusing some data/resource within a small duration
Spatial Locality: use of data elements within relatively close storage locations
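Both kinds of locality show up in even a trivial traversal. The sketch below (the grid and its size are illustrative, not from the presentation) touches consecutive addresses, so each cache-line fill serves several iterations:

```c
#define N 256

int grid[N][N];   /* laid out row-major: grid[i][0..N-1] are adjacent in memory */

/* Row-major traversal: consecutive j values touch adjacent words, so one
 * cache-line fill serves several iterations (spatial locality). The running
 * sum is reused on every iteration and stays in a register (temporal locality). */
long sum_row_major(void) {
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += grid[i][j];
    return sum;
}
```

Swapping the two loops would visit the array column by column, striding N words per access and wasting most of each fetched cache line.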
Cache
All CPUs accessing same page in memory
Example: Shared Counter
[Diagram: a counter in shared memory, shared by both CPU-1 & CPU-2; a copy of the counter exists in both processors' caches]
Animation:
1. CPU-1 reads the counter (0) into its cache; CPU-2 reads it too
2. CPU-1 increments: write in exclusive mode, its cached copy becomes 1
3. CPU-2 reads: OK in shared mode, both caches hold 1
4. CPU-2 increments to 2: invalidate, the line goes from shared back to exclusive, and CPU-1's stale copy is discarded
Example: Shared Counter: terrible performance!
Problem: the counter bounces between CPU caches, leading to a high cache-miss rate
Try an alternate approach:
• Convert the counter to an array: each CPU has its own counter
• Updates can be local; the total count is preserved by the commutativity of addition
• To read, you need all the counters (add them up)
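The per-CPU array scheme can be sketched like this (the names and the fixed CPU count are assumptions for illustration): each CPU increments only its own slot, and a read sums every slot.

```c
#define NCPUS 4

static long counters[NCPUS];   /* one counter slot per CPU */

/* An update is local: only the calling CPU's slot is written. */
void counter_inc(int cpu) {
    counters[cpu]++;
}

/* A read must visit every slot and add them up; the total is correct
 * in any interleaving because addition is commutative. */
long counter_read(void) {
    long total = 0;
    for (int i = 0; i < NCPUS; i++)
        total += counters[i];
    return total;
}
```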
Example: Array-Based Counter (animation)
1. Memory holds an array of counters, one for each CPU, initially [0, 0]
2. CPU-1 increments its own element and CPU-2 increments its own: the array becomes [1, 1]
3. To read, a CPU adds up all the counters: 1 + 1 = 2
Performance: the array-based counter performs no better than the shared counter!
Why doesn't the array-based counter work?
• Data is transferred between main memory and caches in fixed-size blocks called cache lines
• If two counters sit in the same cache line, only one processor at a time can hold the line for writing
• Ultimately, the line still has to bounce between CPU caches
[Diagram: the two counters, 1 and 1, share one cache line in memory]
False sharing (animation)
1. Both counters live in one cache line, initially (0,0); each CPU caches the line in shared mode
2. CPU-1 increments its counter: the line becomes (1,0) and CPU-2's copy is invalidated
3. CPU-2 re-reads the line (sharing again), then increments its own counter to (1,1), invalidating CPU-1's copy
4. The line bounces between the caches even though neither CPU ever touches the other's counter
Ultimate solution: pad the array so each counter gets an independent cache line; spread the counter components out in memory
Example: Padded Array (animation)
1. Memory holds an individual cache line for each counter, initially 0 and 0
2. CPU-1 and CPU-2 each increment their own counter to 1; the updates are independent of each other
Performance: Padded Array
Works better
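The padding can be sketched in C11 by aligning each slot to a cache-line boundary (the 64-byte line size, the struct layout, and all names are assumptions for illustration; real line sizes vary by machine):

```c
#define CACHE_LINE 64
#define NCPUS 4

/* Each counter is padded out to a full cache line, so two CPUs
 * incrementing their own slots never invalidate each other's line. */
struct padded_counter {
    _Alignas(CACHE_LINE) long value;
    char pad[CACHE_LINE - sizeof(long)];
};

_Static_assert(sizeof(struct padded_counter) == CACHE_LINE,
               "exactly one cache line per counter");

static struct padded_counter padded[NCPUS];

void padded_inc(int cpu) {
    padded[cpu].value++;       /* touches only this CPU's line */
}

long padded_read(void) {
    long total = 0;
    for (int i = 0; i < NCPUS; i++)
        total += padded[i].value;
    return total;
}
```

The interface is identical to the unpadded array; only the memory layout changes, which is exactly the point of the slide.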
Cache
All CPUs accessing the same page in memory: write sharing is very destructive
Ideally:
• Each CPU uses data from its own cache (good cache hit rate)
• Cache contents stay in the cache and are not invalidated by other CPUs
• Each CPU has good locality of reference (it frequently accesses its cached data)
• Sharing is minimized
Object-Oriented Approach
Goal: minimize access to shared data structures; minimize the use of shared locks
Operating systems are driven by requests from applications on virtual resources
For good performance, requests to different virtual resources must be handled independently
How do we handle requests independently?
• Represent each resource by a different object
• Try to reduce sharing
• If an entire resource is locked, requests get queued up
• Avoid making the resource a source of contention: use fine-grain locking instead
Coarse/fine-grain locking example
The Process object maintains the list of mapped memory regions in the process's address space
On a page fault, it searches the process table to find the responsible region and forwards the page fault to it
[Diagram: Process object with a table of Regions 1…n; Thread 1's fault locks the whole table even though it only needs one region, so other threads get queued up]
Coarse/fine-grain locking example
[Diagram: with individual locks, one per region, Thread 1 and Thread 2 proceed in parallel on different regions]
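The fine-grain scheme can be sketched as follows (the struct layout and names are invented for illustration): the fault handler locks only the region that covers the faulting address, leaving the table and the other regions free for concurrent faults.

```c
#include <pthread.h>
#include <stddef.h>

struct region {
    unsigned long start, end;   /* address range [start, end) */
    pthread_mutex_t lock;       /* one lock per region, not per table */
};

/* Find the region covering addr and return it locked; other regions
 * stay available to faults on other threads. Returns NULL if no
 * region maps the address. */
struct region *lock_region_for(struct region *tab, int n, unsigned long addr) {
    for (int i = 0; i < n; i++) {
        if (addr >= tab[i].start && addr < tab[i].end) {
            pthread_mutex_lock(&tab[i].lock);
            return &tab[i];
        }
    }
    return NULL;
}
```

A coarse-grain version would instead take one mutex around the whole search and the fault handling, serializing every page fault in the process.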
Key memory management object relationships in Tornado
1. A page fault is delivered to the Process object
2. The Process object forwards the request to the responsible Region
3. The Region translates the fault address into a file offset and forwards the request to the FCM
4. If the file data is not cached in memory, the FCM requests a new physical page frame and asks the Cached Object Rep to fill the page from the file
5. If the file data is cached in memory, the address of the corresponding physical page frame is returned to the Region, which makes a call to the Hardware Address Translation object
Advantage of the object-oriented approach: the performance-critical case of an in-core page fault (a TLB miss fault for a page resident in memory)
• Only objects specific to the faulting resource are invoked
• The locks acquired and data structures accessed are internal to those objects
In contrast, many OSes use a global page cache or a single HAT layer, which becomes a source of contention
Clustered Object Approach
Object orientation is good, but some resources are just too widely shared
Enter clustering
E.g. the thread dispatch queue: if a single list is used, there is high contention (a bottleneck, plus more cache-coherence traffic)
Solution: partition the queue and give each processor a private list (no contention!)
Clustered Object Approach
Looks like a single object, but is actually made up of several component objects (reps); each rep handles calls from a subset of processors
All clients access a clustered object using a common clustered object reference
Each call to a clustered object is automatically directed to the appropriate local rep
Clustered object Shared counter Example
It looks like a shared counter, but is actually made up of representative counters that each CPU can access independently
Degree of Clustering
• One rep for the entire system
• One rep per processor
• One rep for a cluster of neighbouring processors
Examples:
• Cached Object Rep: read-mostly; all processors share a single rep
• Region: read-mostly; on the critical path for all page faults
• FCM: maintains the state of the pages of a file cached in memory; its hash table for the cache is split across many reps
Clustered Object Approach
Clustering through replication
• One way to share data safely is to make it read-only: replicate it, and all CPUs can read their copy from their own cache
• Writing? How do we maintain consistency among replicas? Fundamentally the same problem as cache coherence
• Take advantage of the semantics of the object you are replicating
• Use that semantic knowledge to build a specific implementation (an object-specific algorithm to maintain distributed state)
Advantages of the Clustered Object Approach
• Supports replication, partitioning and locks (essential on SMMPs)
• Built on object-oriented design: no need to worry about the location or organization of objects, just use clustered object references
• Allows incremental implementation: depending on the performance needed, change the degree of clustering (initially one rep serving all requests, then optimize when widely shared)
Clustered Object Implementation
Each processor has its own set of translation tables
A clustered object reference is just a pointer into the translation table
[Diagram: the references for Clustered Obj 1, Clustered Obj 2, … are pointers into a per-processor translation table; each entry points to the rep responsible for handling method invocations for the local processor]
Clustered Object Implementation
Each per-processor copy of the table is located at the same virtual address, so the same pointer into the table yields, on each processor, the rep responsible for handling the invocation there
Clustered Object Implementation
We do not know in advance which reps will be needed, so reps are created on first access:
1. Each clustered object defines a miss-handling object
2. All entries in the translation table are initialized to point to a global miss-handling object
3. On the first invocation, the global miss-handling object is called
4. The global miss-handling object saves state and calls the clustered object's own miss-handling object
5. The object miss handler: if the rep exists, a pointer to it is installed in the translation table; else the rep is created and a pointer to it is installed in the translation table
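The first-access path can be sketched as a simplified, single-level version of the scheme above (the rep contents and all names are invented; here a NULL table entry plays the role of the global miss handler's entry):

```c
#include <stdlib.h>

#define NCPUS 4

typedef struct rep {
    int cpu;        /* which processor this rep serves */
    long count;     /* some per-rep state */
} rep_t;

/* Per-processor translation table: NULL means "no rep installed yet",
 * i.e. the next invocation from that processor takes the miss path. */
static rep_t *xlat[NCPUS];

/* Object miss handler: create the rep on first access and install
 * a pointer to it in the translation table. */
static rep_t *miss_handler(int cpu) {
    rep_t *r = calloc(1, sizeof *r);
    r->cpu = cpu;
    xlat[cpu] = r;
    return r;
}

/* Every invocation indirects through the table: a hit goes straight
 * to the local rep, a miss falls into the handler. */
rep_t *get_rep(int cpu) {
    return xlat[cpu] ? xlat[cpu] : miss_handler(cpu);
}
```

After the first call from a processor, every subsequent call from it is a plain pointer dereference with no shared state touched.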
Counter as a Clustered Object (animation)
1. The common object reference directs each CPU to its own rep (Rep 1, Rep 2)
2. Each CPU increments its local rep independently of the other (e.g. Rep 1 goes from 1 to 2 while Rep 2 stays at 1)
3. To read the counter, the reps' values are added up (1 + 1 = 2)
Synchronization
Two issues:
• Locking: concurrency control with respect to modifications of data structures
• Existence guarantee: ensuring the data structure containing a variable is not deallocated during an update; any concurrent access to a shared data structure requires an existence guarantee
Locking has lots of overhead: the basic instruction overhead, plus extra cache-coherence traffic from write sharing of the lock
Tornado: encapsulate all locks within individual objects
• Reduces the scope of each lock and limits contention
• Splitting objects into multiple representatives reduces contention further!
Existence Guarantee
[Diagram: the scheduler's ready list of thread control blocks; Thread 1 is traversing the list while Thread 2 tries to delete element 2, whose memory is then garbage collected and possibly reallocated]
How do we stop a thread from deallocating an object being used by another?
Semi-automatic garbage collection scheme
Garbage Collection Implementation
Two kinds of references:
• Temporary: held privately by a single thread; destroyed when the thread terminates
• Persistent: stored in shared memory; accessed by multiple threads
Clustered object destruction
• Phase 1: the object makes sure all persistent references to it have been removed (from lists, tables, etc.)
• Phase 2: the object makes sure all temporary references to it have been removed
  - Uniprocessor: the number of active operations is maintained in a per-processor counter; count = 0 means none are active
  - Multiprocessor: the clustered object knows which set of processors can access it; the counter must become zero on all of those processors (a circulating token scheme)
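The phase-2 check can be sketched like this (all names are invented, and this simplified version just reads every counter directly; the real scheme circulates a token so that no processor has to read the others' counters):

```c
#define NCPUS 4

/* Per-processor count of operations currently active in the object:
 * incremented when a temporary reference is taken, decremented when
 * it is dropped. */
static int active_ops[NCPUS];

void op_enter(int cpu) { active_ops[cpu]++; }
void op_exit(int cpu)  { active_ops[cpu]--; }

/* Phase 2: the object may be freed only once every processor's
 * active-operation count has been observed at zero. */
int safe_to_destroy(void) {
    for (int i = 0; i < NCPUS; i++)
        if (active_ops[i] != 0)
            return 0;
    return 1;
}
```

Because the counters are per-processor, entering and exiting an operation never writes shared state; only the rare destruction path has to look at all of them.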
IPC
• Acts like a clustered object call that crosses from the protection domain of the client to that of the server and back
• Client requests are always serviced on their local processor
• Processor sharing: like handoff scheduling
Performance
Test machines:
• 16-processor NUMAchine prototype
• SimOS simulator
Component Results
Average number of cycles required for n threads:
• Performs quite well for memory allocation and miss handling
• Lots of variation in the garbage collection and PPC tests
• Multiple data structures occasionally map to the same cache block on some processors
Component Results
Average cycles required on SimOS with a 4-way associative cache
4-way associativity: a cache entry from main memory can go into any one of 4 places
Highly scalable; compares favourably to commercial operating systems (which show up to a 100x slowdown on 16 processors)
First generation: Hurricane (a coarse-grain approach to scalability)
Third generation: K42 (IBM & University of Toronto)
References
• Jonathan Appavoo. Clustered Objects: Initial Design, Implementation and Evaluation. Master's thesis, University of Toronto, 1998.
• Benjamin Gamsa. Tornado: Maximizing Locality and Concurrency in a Shared-Memory Multiprocessor Operating System. PhD thesis, University of Toronto, 1999.