Transactional-Memory Real Time Systems. Leeor Peled, Advanced Topics 049011, Technion, December 2014.

Lock-freedom

• Shared data that does not require mutual exclusion.
– Avoids common problems such as deadlocks, livelocks, priority inversion, and convoying, and helps with fault tolerance and async-signal safety.
– Allows interruption/preemption without blocking the objects being operated upon.

• LF algorithms vs. LF data structures

[Figure: progress guarantees, labeled Lock-Free, Wait-Free, Wait-Free bounded]

Synchronization Paradigms

• Classification:
– Blocking
  • Blocking
  • Starvation-Free
– Obstruction-Free
– Lock-Free
– Wait-Free
  • Wait-Free
  • Wait-Free Bounded
  • Wait-Free Population Oblivious

Synchronization for lawyers

• Starvation-Free: Every thread that wants to enter the critical section eventually succeeds, provided that no thread halts inside the critical section.

• Obstruction-Free: A function is Obstruction-Free if, from any point after which it executes in isolation, it finishes in a finite number of steps.

• Lock-Free: A method is Lock-Free if it guarantees that infinitely often some thread calling this method finishes in a finite number of steps.

• Wait-Free: A method is Wait-Free if it guarantees that every call finishes its execution in a finite number of steps.

• Wait-Free Bounded: A method is Wait-Free Bounded if it guarantees that every call finishes its execution in a finite and bounded number of steps. This bound may depend on the number of threads.

• Wait-Free Population Oblivious: A Wait-Free method whose bound on the number of steps does not depend on the number of active threads.
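The gap between Lock-Free and Wait-Free in the definitions above can be seen in a compare-and-swap retry loop. The sketch below is only an illustration: `AtomicCell`, `lock_free_increment` and `rival_wins_first` are hypothetical names. Python exposes no user-level CAS, so the atomicity of `compare_and_set` is emulated with an internal lock. That is an assumption made so the example runs, not a claim about how real lock-free code is built.

```python
import threading


class AtomicCell:
    """Hypothetical stand-in for a hardware CAS word (atomicity emulated)."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()  # emulation only, never held across calls

    def load(self):
        with self._guard:
            return self._value

    def compare_and_set(self, expected, new):
        # Atomically write `new` only if the cell still holds `expected`.
        with self._guard:
            if self._value != expected:
                return False
            self._value = new
            return True


def lock_free_increment(cell, interfere=None):
    """Increment `cell` with a CAS retry loop and return the retry count.

    `interfere` is a hypothetical hook called between the load and the CAS
    to model another thread slipping in an update.
    """
    retries = 0
    while True:
        seen = cell.load()
        if interfere is not None:
            interfere(cell, retries)
        if cell.compare_and_set(seen, seen + 1):
            return retries
        # The CAS failed only because some other update succeeded, so the
        # system as a whole made progress (lock-free), but nothing bounds
        # the retries of this particular caller (not wait-free).
        retries += 1


def rival_wins_first(k):
    """Build a hook in which a rival updates the cell on the first k attempts."""
    def hook(cell, attempt):
        if attempt < k:
            current = cell.load()
            cell.compare_and_set(current, current + 1)
    return hook
```

With `rival_wins_first(3)` the caller retries exactly three times before its own update lands. A wait-free bounded method would have to cap that count regardless of what rivals do.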

Real Time approved

Synchronization Paradigms (2)

• Are lock-free algorithms completely useless in an RT context?
– Bounded number of retries in priority-based systems (Anderson, ’97)
  • A hard-RT scheduler based on lock-free objects often incurs less overhead than a wait-free implementation
– Nonblocking synchronization for RT systems (Hohmuth & Härtig, ’01)
  • Implement Linux kernel benchmarks with LF/WF algorithms, demonstrating RT capabilities

Alternative: Transactional Memory

• Originally proposed by Herlihy & Moss, ’93
– Earlier idea by Knight, ’86

• HW concept based on a cache coherency extension
– Speculative work: writes are marked in the cache and can’t become external/visible until commit
  • Upon commit, allow snoops/WB
  • Upon abort, invalidate spec lines and roll back
  • Reads are also marked to monitor conflicts

Example – deadlock prevention

• Consider implementations of move(A, B, elem)
– Moves a single element from data structure A to B

Non-TM:
  Lock A
  Lock B
  A.remove(elem)
  B.insert(elem)
  Unlock B
  Unlock A

TM:
  atomic {
    A.remove(elem)
    B.insert(elem)
  }

• Drawbacks? Think of a linked list
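The non-TM variant can deadlock when one thread runs move(A, B, x) while another runs move(B, A, y), because each may hold one lock and wait for the other. Python has no transactional memory, so the sketch below shows the usual lock-based workaround instead: acquire both locks in one global order. `LockedList`, `rank` and `ping_pong` are hypothetical names introduced for illustration.

```python
import threading


class LockedList:
    """A list guarded by its own lock (hypothetical helper for this sketch)."""
    _next_rank = 0

    def __init__(self, items=()):
        self.items = list(items)
        self.lock = threading.Lock()
        # A fixed rank gives every lock a place in one total order.
        self.rank = LockedList._next_rank
        LockedList._next_rank += 1


def move(src, dst, elem):
    """Move `elem` from src to dst, taking both locks in rank order."""
    if src is dst:
        return elem in src.items
    first, second = sorted((src, dst), key=lambda s: s.rank)
    with first.lock, second.lock:
        if elem in src.items:
            src.items.remove(elem)
            dst.items.append(elem)
            return True
        return False


def ping_pong(a, b, rounds):
    """Run two movers in opposite directions; return (len(a), len(b))."""
    def shuttle(src, dst, elem):
        for _ in range(rounds):
            move(src, dst, elem)
            move(dst, src, elem)

    t1 = threading.Thread(target=shuttle, args=(a, b, "x"))
    t2 = threading.Thread(target=shuttle, args=(b, a, "y"))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return len(a.items), len(b.items)
```

The ordering discipline has to be known and followed by every caller. The `atomic { ... }` form removes that burden by leaving conflict handling to the TM system.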

Overflow…

• Assume [a]..[e] all map to the same L1 set
– Limited capacity
– Worse: nondeterminism

Instruction sequence:
  store 0,[a]
  TX_begin
  store 1,[a]
  store 1,[b]
  store 1,[c]
  store 1,[d]
  store 1,[e]
  TX_end

4-way L1 cache, one set (address, value, state):
  Way 0: [a], 1, w (before the transaction: [a], 0, M)
  Way 1: [b], 1, w
  Way 2: [c], 1, w
  Way 3: [d], 1, w
  No way left for [e], 1

• What happens if a write hits a spec/non-spec line?
• Other resources are also limited
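The overflow scenario can be replayed with a toy model of a single cache set. This is a sketch under simplifying assumptions, not a description of any real HTM: `CacheSet` is a hypothetical class, the backing `memory` holds only committed values, and a transaction aborts as soon as every way holds speculative data and another line is needed.

```python
class CacheSet:
    """One set of a set-associative L1 with speculative-write marking."""

    def __init__(self, ways=4):
        self.ways = ways
        self.lines = {}    # address -> [value, is_speculative]
        self.memory = {}   # backing store, committed values only
        self.in_tx = False

    def tx_begin(self):
        self.in_tx = True

    def store(self, addr, value):
        """Return True on success, False if the transaction overflowed."""
        if addr in self.lines:
            old_value, spec = self.lines[addr]
            if self.in_tx and not spec:
                # Write back the committed value so an abort can restore it.
                self.memory[addr] = old_value
        elif len(self.lines) >= self.ways and not self._evict_committed():
            # Every way holds speculative data that may not leave the
            # cache before commit, so the transaction cannot continue.
            self.tx_abort()
            return False
        self.lines[addr] = [value, self.in_tx]
        return True

    def _evict_committed(self):
        for addr, (value, spec) in list(self.lines.items()):
            if not spec:
                self.memory[addr] = value
                del self.lines[addr]
                return True
        return False

    def tx_abort(self):
        # Invalidate speculative lines; committed lines survive.
        self.lines = {a: l for a, l in self.lines.items() if not l[1]}
        self.in_tx = False

    def tx_end(self):
        for line in self.lines.values():
            line[1] = False
        self.in_tx = False

    def load(self, addr):
        if addr in self.lines:
            return self.lines[addr][0]
        return self.memory.get(addr)


def run_slide_sequence(ways=4):
    """Replay store 0,[a]; TX_begin; store 1,[a]..[e]; TX_end."""
    s = CacheSet(ways)
    s.store("a", 0)
    s.tx_begin()
    for addr in "abcde":
        if not s.store(addr, 1):
            return s, addr   # address whose store forced the abort
    s.tx_end()
    return s, None
```

With four ways the store to [e] forces an abort and [a] reads as 0 again. With eight ways the same code commits. That dependence on address mapping is the nondeterminism the slide warns about.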

Software Transactional Memory

• Proposed by Shavit and Touitou (’95)
– Manage the data structure through a SW intermediate layer
– Log all reads/writes to track conflicts

• Enhanced in TL2
– Rely on a versioned clock for commits

• Standalone approach or temporary solution until HW catches up?
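The read/write logging and versioned-clock ideas can be sketched in a few lines. This is a heavily simplified illustration and not TL2 itself: one global lock serializes both reads and commits, and the names `TVar`, `Tx`, `Conflict` and `atomically` are hypothetical.

```python
import threading

_state = {"clock": 0}        # global version clock
_lock = threading.Lock()     # serializes reads and commits (simplification)


class TVar:
    """A transactional variable carrying a version stamp."""
    def __init__(self, value):
        self.value = value
        self.version = 0


class Conflict(Exception):
    """Raised when a transaction observes a newer version than its snapshot."""


class Tx:
    def __init__(self):
        self.start = _state["clock"]  # clock value when the transaction began
        self.reads = set()            # read log
        self.writes = {}              # write log: TVar -> pending value

    def read(self, tvar):
        if tvar in self.writes:
            return self.writes[tvar]
        with _lock:
            if tvar.version > self.start:
                raise Conflict()      # committed to after we began
            self.reads.add(tvar)
            return tvar.value

    def write(self, tvar, value):
        self.writes[tvar] = value     # buffered, invisible until commit

    def commit(self):
        with _lock:
            # Validate the read log against the snapshot.
            if any(t.version > self.start for t in self.reads):
                raise Conflict()
            _state["clock"] += 1
            for tvar, value in self.writes.items():
                tvar.value = value
                tvar.version = _state["clock"]


def atomically(body):
    """Run body(tx) and retry from scratch whenever a conflict is detected."""
    while True:
        tx = Tx()
        try:
            result = body(tx)
            tx.commit()
            return result
        except Conflict:
            continue
```

A call such as `atomically(lambda tx: tx.write(b, tx.read(a) + 1))` either commits as a whole or is retried. The retry loop is also why the number of attempts has no fixed bound without further analysis.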

TM flavors

• TM (Herlihy, Moss, ’93) - original design, best effort
• SLE (Rajwar, Goodman, ’01) - simplify interface: avoid locks, no TM ISA required
• LTM (Ananian, ’03) - physical memory spilling by HW
• UTM (Ananian, ’03) - virtual memory, context switch support, very heavy (virtualizes each line)
• VTM (Rajwar, Herlihy, ’05) - another unbounded flavor, virtualizes Txs like virtual memory
• HyTM (Moir, Sun Labs, ’05) - attempt HTM, fall back on STM. Special consideration to syncing between instances of both types.
• DSTM (Kumar) - similar to HyTM (although both are trying hard to deny it)
• TL2 (Dice, Shavit, ’06) - another hybrid, very popular as a baseline for others
• PhTM (Lev, ’07) - another hybrid, no simultaneous HW/SW transactions
• USTM (Baugh, ’08) - another hybrid: user fault-on STM, with unbounded HTM based on HW memory protection
• TLE (Dice, ’08) - TM version of SLE
• TTM, LogTM, etc. (Moore)

Bottom line: Most of the above are still best-effort HTMs – no success (forward progress) guaranteed, some level of SW support required

HTM: Industry Trends

• Sun Microsystems: Rock CPU
– Feat. Hybrid-TM and lots of other goodies such as spec-lookahead, OOO retirement, and a built-in desk warmer (250W!). Allows a mix of Tx and non-Tx code inside Tx boundaries, but retains TSO.
– R.I.P. as of May 2010

• Azul: Vega 2/3 - “Java Compute Appliance (JCA)”
– Released 2007/8. RISC, in-order, CMP (48/54 cores per die)
– JVM oriented, >100k threads
– Simple HTM, no register rollback (relies on SW), no STM fallback

• AMD: Advanced Synchronization Facility (ASF)
– Spec released in 2009. ISA includes speculate/commit, locked-mov
– Very resource constrained (4 atomic lines), flat nesting; also allows a mix of Tx and non-Tx code inside Tx boundaries, but may break x86 memory consistency.

• Intel:
– TM compiler with HW support (HASTM based on RSM)
– TSX on Haswell! Oops, sorry - only as of HSW-EX due to errata

Sun: http://labs.oracle.com/scalable/pubs/ASPLOS2006.pdf
Azul: http://sss.cs.purdue.edu/projects/tm/tmw2010/talks/Click-2010_TMW.pdf
AMD: http://www.amd64.org/fileadmin/user_upload/pub/transact-2010-asfooo.pdf
http://www-ali.cs.umass.edu/~moss/transact-2010/public-papers/08.pdf
http://llvm.org/pubs/2010-04-EUROSYS-DresdenTM.pdf

RTTM (Schoeberl ’10) - premise

• “RTTM brings the benefits of transactional memories into the real-time systems world”.

• Paper contributions:
– Design of a time-predictable hardware transactional memory
– Analysis of the worst-case number of retries in a periodic thread model
– Suggestions for analysis to reduce the number of possible conflicting transactions
– First evaluation of RTTM on a simulation within a Java-based CMP

• Optimized for WCET, not average performance
• Implemented on the Java Optimized Processor (JOP)

Java optimized processor (Schoeberl ‘07)

• Unlike the JVM, JOP is “a RISC stack architecture”

WCET-friendly CPU

• Time-predictable computer architecture, Schoeberl ’08
– A collection of simplifications for CPU design to reduce the bounds on WCET, at a small penalty to ACET/BCET
– Provides some reasoning (but no concrete proof)

WCET-friendly CPU - 2

• Time Division Multiple Access (TDMA) memory access scheduling (Pitter and Schoeberl, ’09; Rosen, ’07)

• Memory access is granted in one slot per core
– Transactions may only start during the access window
– A gap allows completion (depends on the memory access time)

Memory access WCET

• Fixed priority (WCET for high-priority requests)
• Fair priority
• TDMA
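For the TDMA case the worst-case access latency can be found by brute force over all arrival phases. The model below is an assumption made for illustration and is not taken from the cited papers. Time is split into rounds of `n_cores` slots of `slot` cycles each, core i owns the i-th slot, and an access of `access` cycles may start only if it also finishes inside that slot. All function and parameter names are hypothetical.

```python
def tdma_latency(arrival, core, n_cores, slot, access):
    """Cycles from `arrival` until the core's memory access completes."""
    if not 1 <= access <= slot:
        raise ValueError("access must take between 1 and `slot` cycles")
    period = n_cores * slot
    t = arrival
    while True:
        # Position of time t relative to the start of this core's slot.
        offset = (t - core * slot) % period
        if offset + access <= slot:
            return t - arrival + access
        t += 1


def tdma_wcet(core, n_cores, slot, access):
    """Worst-case latency over every arrival phase within one round."""
    period = n_cores * slot
    return max(tdma_latency(a, core, n_cores, slot, access)
               for a in range(period))
```

In this model the worst case is a request that just misses the last usable start point of its own slot, which gives `(n_cores - 1) * slot + 2 * access - 1` cycles. The bound is the same for every core and does not depend on what the other cores do, which is the property that makes TDMA attractive for WCET analysis.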

OS scheduling

• “Real-Time Specification for Java”
– RT threads are assigned a deadline
– Scheduler is preemptive, based on priority
  • Threads of the same priority are scheduled FIFO
– Scheduler guarantees that all threads meet their deadlines
  • Estimation of blocking bounds

RTTM - proposal

• Transaction buffering - fully associative
• Read set caching (tags only)
• Word granularity (no false conflicts)
• Commit in bursts
– All other cores listen (conflict checks)
– Protected by a global lock (“commit token”)
  • (What is the overhead for short transactions?)
• No aborts on overflow! Grab the commit token on the fly
• On true abort - mark as a zombie transaction
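The commit-time conflict check and the effect of word granularity can be sketched as follows. This is a toy model, not the RTTM hardware design: `Core`, `commit` and `LINE_WORDS` are hypothetical names, and the caller is assumed to already hold the commit token.

```python
LINE_WORDS = 4   # assumed words per cache line, used only for comparison


class Core:
    def __init__(self, name):
        self.name = name
        self.read_set = set()   # word addresses read inside the transaction
        self.write_buf = {}     # buffered writes: word address -> value
        self.zombie = False

    def tx_read(self, memory, addr):
        if addr in self.write_buf:
            return self.write_buf[addr]
        self.read_set.add(addr)
        return memory.get(addr, 0)

    def tx_write(self, addr, value):
        self.write_buf[addr] = value


def commit(committer, others, memory, granularity=1):
    """Burst-commit the buffered writes and mark conflicting cores as zombies.

    granularity=1 compares word addresses; a larger value compares line
    numbers instead, which can report false conflicts.
    """
    written = {addr // granularity for addr in committer.write_buf}
    for core in others:
        # Each listening core checks the broadcast against its read set.
        if any(addr // granularity in written for addr in core.read_set):
            core.zombie = True
    memory.update(committer.write_buf)
    committer.read_set.clear()
    committer.write_buf.clear()
    return [c.name for c in others if c.zombie]
```

If one core reads word 1 and another commits a write to word 0, only line-granularity checking (for example `granularity=LINE_WORDS`) flags a conflict. With word granularity the reader survives.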

RTTM Analysis

• Assume n threads, each executing a single atomic region once
– thread period
– execution time (cost)
– atomic region time
– r - max retries

• WCET assumes:
– Always conflict (all atomic regions use the same variable)
– Worst phase - all threads attempt to enter the atomic region simultaneously

RTTM Analysis (2)

• Single transaction per thread per period
– Time of transaction resolution (all threads)

• Period per transaction (same thread)

• Max number of retries
– Assuming …
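The worst-phase assumption can be explored with a small simulation. This is a toy model and not the paper's derivation. All threads enter their atomic region at time 0, every pair conflicts, a commit aborts all transactions still running, aborted transactions restart immediately, and commit cost is ignored. The function name and its inputs are hypothetical.

```python
def simulate_worst_phase(atomic_times):
    """Return (retries per thread, commit time per thread)."""
    n = len(atomic_times)
    retries = [0] * n
    commit_time = [None] * n
    pending = set(range(n))
    now = 0
    while pending:
        # Every pending transaction (re)started at `now`, so the shortest
        # one reaches its commit point first; ties go to the lower index.
        winner = min(pending, key=lambda i: (atomic_times[i], i))
        now += atomic_times[winner]
        commit_time[winner] = now
        pending.remove(winner)
        for i in pending:
            retries[i] += 1   # aborted by the winner's commit
    return retries, commit_time
```

For atomic region times `[3, 5, 2]` the longest transaction is aborted twice, that is n - 1 times, and commits at time 10, the sum of all atomic region times. Both figures hold only under the assumptions listed above.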

Preliminary analysis

• Possible directions
– Context-sensitive points-to analysis
– Static detection of race conditions
– Simulation-based analysis of buffer overflows

• RTTM’s analysis was based on the WALA analyzer (open source from IBM, ’06)

http://wala.sourceforge.net/wiki/index.php/Main_Page

Experiment methodology

• Implemented over JOP, simulated on a JVM
• 3 tasks
– Producer enqueues into a buffer
– Consumer removes elements from its buffer
– Mover atomically moves elements between the buffers

• Buffer types
– Standard Java Vector
– Bounded queue

Results

STM example (Fahmy, ‘09)

• EDF scheduling
• Response time analysis
– Predicted vs. simulated with random alignments (> 1)
– Utilization: task time vs. period (< 1)

Bibliography

• M. Herlihy, J. E. B. Moss. Transactional memory: architectural support for lock-free data structures. ISCA ’93.

• J. H. Anderson, S. Ramamurthy, K. Jeffay. Real-time computing with lock-free shared objects. ACM TOCS, May ’97.

• M. Hohmuth, H. Härtig. Pragmatic nonblocking synchronization for real-time systems. USENIX ’01.

• M. Schoeberl, F. Brandner, J. Vitek. RTTM: Real-Time Transactional Memory. SAC ’10.

• M. Schoeberl. A Java processor architecture for embedded real-time systems. Journal of Systems Architecture, volume 54, Jan 2008, 265-286.

• M. Schoeberl. Time-predictable computer architecture. EURASIP J. Embedded Syst. 2009, Article 2 (January 2009).

• C. Pitter, M. Schoeberl. A real-time Java chip-multiprocessor. Trans. on Embedded Computing Sys., accepted for publication 2009.

• Manson (’05). Preemptible atomic regions (uni-processor).

Memory ordering rules

Type | Alpha | ARMv7 | PA-RISC | POWER | SPARC RMO | SPARC PSO | SPARC TSO | x86 | x86 oostore | AMD64 | IA-64 | zSeries
Loads reordered after loads | Y | Y | Y | Y | Y | | | | Y | | Y |
Loads reordered after stores | Y | Y | Y | Y | Y | | | | Y | | Y |
Stores reordered after stores | Y | Y | Y | Y | Y | Y | | | Y | | Y |
Stores reordered after loads | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y
Atomic reordered with loads | Y | Y | | Y | Y | | | | | | Y |
Atomic reordered with stores | Y | Y | | Y | Y | Y | | | | | Y |
Dependent loads reordered | Y | | | | | | | | | | |
Incoherent instruction cache pipeline | Y | Y | | Y | Y | Y | Y | Y | Y | | Y | Y

Source: http://en.wikipedia.org/wiki/Memory_ordering
