Transactional-Memory Real-Time Systems
Leeor Peled, Advanced topics 049011
Technion, December 2014
Dec 16, 2015
Lock-freedom
• Shared data that does not require mutual exclusion.
  – Avoids common lock problems: deadlock, livelock, priority inversion, convoying, and poor fault tolerance and async-signal safety
  – Allows interruption/preemption without blocking the objects being operated upon.
• LF algorithms vs. LF data structures
Synchronization Paradigms
• Classification:
  – Blocking
    • Blocking
    • Starvation-Free
  – Obstruction-Free
  – Lock-Free
  – Wait-Free
    • Wait-Free
    • Wait-Free Bounded
    • Wait-Free Population Oblivious
[Figure: diagram labels: Lock-Free, Wait-Free, Wait-Free Bounded, Wait-Free Population Oblivious]
Synchronization for lawyers
• Starvation-Free: Every thread that wants to enter the critical section eventually succeeds, provided that threads inside the critical section eventually leave it.
• Obstruction-Free: A function is Obstruction-Free if, from any point after which it executes in isolation, it finishes in a finite number of steps.
• Lock-Free: A method is Lock-Free if it guarantees that infinitely often some thread calling this method finishes in a finite number of steps.
• Wait-Free: A method is Wait-Free if it guarantees that every call finishes its execution in a finite number of steps.
• Wait-Free Bounded: A method is Wait-Free Bounded if it guarantees that every call finishes its execution in a finite and bounded number of steps. This bound may depend on the number of threads.
• Wait-Free Population Oblivious: A Wait-Free method whose performance does not depend on the number of active threads.
Real Time approved
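The difference between Lock-Free and Wait-Free Population Oblivious in the definitions above can be illustrated with a shared counter. This is a minimal sketch, not taken from the slides: `SimulatedAtomicInt` is a hypothetical stand-in for a hardware atomic word, and its internal lock only models the atomicity of a single CAS or fetch-and-add instruction. It assumes the hardware offers fetch-and-add as one instruction.

```python
import threading


class SimulatedAtomicInt:
    """Hypothetical stand-in for a hardware atomic word (assumption).

    The internal lock only models the atomicity of one instruction.
    """

    def __init__(self, value=0):
        self._value = value
        self._hw = threading.Lock()

    def load(self):
        return self._value

    def compare_and_set(self, expected, new):
        with self._hw:
            if self._value == expected:
                self._value = new
                return True
            return False

    def fetch_and_add(self, delta):
        with self._hw:
            old = self._value
            self._value = old + delta
            return old


def lock_free_increment(counter):
    """Lock-free: a failed CAS means another thread succeeded, so the system
    as a whole progresses, but this caller has no bound on its retries."""
    retries = 0
    while True:
        seen = counter.load()
        if counter.compare_and_set(seen, seen + 1):
            return retries
        retries += 1


def wait_free_increment(counter):
    """Wait-free population oblivious: one step, whatever the thread count."""
    counter.fetch_and_add(1)
    return 0


def run(increment, n_threads=4, per_thread=1000):
    """Run `increment` concurrently and return the final counter value."""
    counter = SimulatedAtomicInt()

    def worker():
        for _ in range(per_thread):
            increment(counter)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.load()
```

Both variants produce the same final count; what differs is the per-call step bound, which is the property a real-time analysis needs.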
Synchronization Paradigms (2)
• Are lock-free algorithms completely useless in an RT context?
  – Bounded number of retries in priority-based systems (Anderson, ’97)
    • A hard-RT scheduler based on lock-free objects often incurs less overhead than a wait-free implementation
  – Nonblocking synchronization for RT systems (Hohmuth & Härtig, ’01)
    • Implements Linux kernel benchmarks with LF/WF algorithms, demonstrating RT capabilities
Alternative: Transactional Memory
• Originally proposed by Herlihy & Moss, ’93
  – Earlier idea by Knight, ’86
• HW concept based on a cache-coherency extension
  – Speculative work: writes are marked in the cache and can’t become externally visible until commit
    • Upon commit, allow snoops/WB
    • Upon abort, invalidate spec lines and roll back
    • Reads are also marked to monitor conflicts
Example – deadlock prevention
• Consider implementations of move(A, B, elem)
  – Moves a single element from data structure A to B
• Drawbacks? Think of a linked list

Non-TM:
  Lock A
  Lock B
  A.remove(elem)
  B.insert(elem)
  Unlock B
  Unlock A

TM:
  atomic {
    A.remove(elem)
    B.insert(elem)
  }
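The non-TM version deadlocks if one thread runs move(A, B, x) while another runs move(B, A, y), since each holds one lock and waits for the other. A common workaround without TM is to acquire the locks in a fixed global order. The sketch below is one possible illustration, not the slides' code: `LockedList`, `move`, and `demo` are hypothetical names, and ordering by `id()` is an arbitrary choice of global order.

```python
import threading


class LockedList:
    """A list guarded by its own lock (hypothetical helper for this sketch)."""

    def __init__(self, items=()):
        self.items = list(items)
        self.lock = threading.Lock()


def move(src, dst, elem):
    """Move elem from src to dst while holding both locks."""
    if src is dst:
        return
    # Always lock in the same global order (here: by id) so that
    # move(A, B, ...) and move(B, A, ...) cannot wait on each other in a cycle.
    first, second = sorted((src, dst), key=id)
    with first.lock:
        with second.lock:
            src.items.remove(elem)
            dst.items.append(elem)


def demo():
    """Two threads move elements in opposite directions at the same time."""
    a = LockedList(range(100))
    b = LockedList(range(100, 200))

    def a_to_b():
        for e in range(100):
            move(a, b, e)

    def b_to_a():
        for e in range(100, 200):
            move(b, a, e)

    threads = [threading.Thread(target=a_to_b), threading.Thread(target=b_to_a)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(a.items), sorted(b.items)
```

The ordering rule has to be known to every caller and every data structure involved, which is the composability burden the atomic block removes.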
Overflow…
• Assume [a]..[e] all map to the same L1 set
  – Limited capacity
  – Worse: non-determinism
• What happens if a write hits a spec/non-spec line?
• Other resources are also limited

Instruction sequence:
  store 0,[a]
  TX_begin
  store 1,[a]
  store 1,[b]
  store 1,[c]
  store 1,[d]
  store 1,[e]
  TX_end

[Figure: one set of a 4-way L1 cache. Way 0: [a], 1, w; Way 1: [b], 1, w; Way 2: [c], 1, w; Way 3: [d], 1, w. Also shown: the earlier line [a], 0, M and the fifth speculative line [e], 1.]
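The overflow scenario can be replayed on a toy model of a single cache set. This is a sketch under stated assumptions, not a description of any real HTM: `SpecCacheSet` and `run_slide_sequence` are hypothetical names, the abort-on-overflow policy is an assumption, and a speculative write that hits a non-speculative dirty line is assumed to write the old value back first so that an abort can recover it.

```python
class SpecCacheSet:
    """Toy model of one set of a set-associative L1 with speculative marking."""

    def __init__(self, ways=4):
        self.ways = ways
        self.lines = {}    # address -> [value, is_speculative]
        self.memory = {}   # backing store that receives write-backs
        self.in_tx = False

    def tx_begin(self):
        self.in_tx = True

    def store(self, addr, value):
        """Return True on success, False if the transaction was aborted."""
        line = self.lines.get(addr)
        if line is None and len(self.lines) == self.ways:
            # Need a victim; speculative lines must not leave the cache.
            victim = next((a for a, l in self.lines.items() if not l[1]), None)
            if victim is None:
                self.tx_abort()          # overflow: every way is speculative
                return False
            self.memory[victim] = self.lines.pop(victim)[0]
        elif line is not None and self.in_tx and not line[1]:
            # Speculative write hits a non-speculative line:
            # save the old value so an abort can fall back to it.
            self.memory[addr] = line[0]
        self.lines[addr] = [value, self.in_tx]
        return True

    def tx_abort(self):
        # Invalidate speculative lines; non-speculative lines survive.
        self.lines = {a: l for a, l in self.lines.items() if not l[1]}
        self.in_tx = False

    def tx_end(self):
        # Commit: speculative lines become ordinary modified lines.
        for line in self.lines.values():
            line[1] = False
        self.in_tx = False


def run_slide_sequence():
    """Replay: store 0,[a]; TX_begin; store 1 to [a]..[e]."""
    cache = SpecCacheSet(ways=4)
    results = [cache.store("a", 0)]
    cache.tx_begin()
    for addr in "abcde":
        results.append(cache.store(addr, 1))
    return results, cache
```

In this model the fifth speculative store finds no evictable way and aborts the transaction, even though no other thread interfered; whether that happens depends on address mapping, which is the non-determinism the slide warns about.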
Software Transactional Memory
• Proposed by Shavit and Touitou (’95)
  – Manage the data structure through a SW intermediate layer
  – Log all reads/writes to track conflicts
• Enhanced in TL2
  – Relies on a global version clock for commits
• Standalone approach, or a temporary solution until HW catches up?
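The ideas on this slide (a software layer that logs reads and writes, and a global version clock used at commit) can be sketched in a few dozen lines. This is a simplified illustration in the spirit of TL2, not the published algorithm: `TVar`, `Transaction`, `atomically`, and `demo` are hypothetical names, and contention management, read-only fast paths, and memory reclamation are all left out.

```python
import threading

_clock = 0                       # global version clock
_clock_lock = threading.Lock()   # models an atomic increment of the clock


class Conflict(Exception):
    """Raised when a transaction must be retried."""


class TVar:
    """A transactional variable with a version stamp and a commit-time lock."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self.lock = threading.Lock()


class Transaction:
    def __init__(self):
        self.read_version = _clock   # snapshot of the clock at start
        self.reads = set()           # read log
        self.writes = {}             # write log: TVar -> new value

    def read(self, tvar):
        if tvar in self.writes:
            return self.writes[tvar]
        version_before = tvar.version
        value = tvar.value
        # Reject values that are being written or are newer than our snapshot.
        if (tvar.lock.locked() or tvar.version != version_before
                or version_before > self.read_version):
            raise Conflict()
        self.reads.add(tvar)
        return value

    def write(self, tvar, value):
        self.writes[tvar] = value    # buffered until commit

    def commit(self):
        global _clock
        locked = []
        try:
            for tvar in sorted(self.writes, key=id):
                if not tvar.lock.acquire(blocking=False):
                    raise Conflict()
                locked.append(tvar)
            with _clock_lock:
                _clock += 1
                write_version = _clock
            for tvar in self.reads:
                locked_by_other = tvar.lock.locked() and tvar not in self.writes
                if tvar.version > self.read_version or locked_by_other:
                    raise Conflict()
            for tvar, value in self.writes.items():
                tvar.value = value
                tvar.version = write_version
        finally:
            for tvar in locked:
                tvar.lock.release()


def atomically(body):
    """Run body(tx) and retry until it commits."""
    while True:
        tx = Transaction()
        try:
            result = body(tx)
            tx.commit()
            return result
        except Conflict:
            continue


def demo(n_threads=4, per_thread=200):
    """Concurrent increments of one TVar; returns the final value."""
    counter = TVar(0)

    def bump(tx):
        tx.write(counter, tx.read(counter) + 1)

    def worker():
        for _ in range(per_thread):
            atomically(bump)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

Note that the retry loop in `atomically` is unbounded, which is exactly the property a real-time analysis has to replace with a worst-case retry count.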
TM flavors
• TM (Herlihy, Moss, ’93) – original design, best effort
• SLE (Rajwar, Goodman, ’01) – simplify interface: avoid locks, no TM ISA required
• LTM (Ananian, ’03) – physical memory spilling by HW
• UTM (Ananian, ’03) – virtual memory, context switch support, very heavy (virtualizes each line)
• VTM (Rajwar, Herlihy, ’05) – another unbounded flavor, virtualizes Txs like virt-mem
• HyTM (Moir, Sun Labs, ’05) – attempt HTM, fall back on STM. Special consideration to syncing between instances of both types.
• DSTM (Koomar) – similar to HyTM (although both are trying hard to deny it)
• TL2 (Dice, Shavit, ’06) – another hybrid, very popular as baseline for others
• PhTM (Lev, ’07) – another hybrid, no simultaneous HW/SW transactions
• USTM (Baugh, ’08) – another hybrid: user fault-on STM, with unbounded HTM based on HW memory protection
• TLE (Dice, ’08) – TM version of SLE
• TTM, LogTM, etc. (Moore)
Bottom line: Most of the above are still best-effort HTMs – no success (forward progress) guaranteed, some level of SW support required
HTM: Industry Trends
• Sun Microsystems: Rock CPU
  – Features hybrid TM and lots of other goodies such as spec-lookahead, OOO retirement, and a built-in desk warmer (250W!). Allows a mix of Tx and non-Tx code inside Tx boundaries, but retains TSO.
  – R.I.P. as of May 2010
• Azul: Vega 2/3, “Java Compute Appliance (JCA)”
  – Released 2007/8. RISC, in-order, CMP (48/54 cores per die)
  – JVM-oriented, >100k threads
  – Simple HTM, no register rollback (relies on SW), no STM fallback
• AMD: Advanced Synchronization Facility (ASF)
  – Spec released in 2009. ISA includes speculate/commit and locked-mov
  – Very resource constrained (4 atomic lines), flat nesting; also allows a mix of Tx and non-Tx code inside Tx boundaries, but may break x86 memory consistency.
• Intel:
  – TM compiler with HW support (HASTM based on RSM)
  – TSX on Haswell! Oops, sorry: only as of HSW-EX due to errata

Sources:
Sun: http://labs.oracle.com/scalable/pubs/ASPLOS2006.pdf
Azul: http://sss.cs.purdue.edu/projects/tm/tmw2010//talks/Click-2010_TMW.pdf
AMD: http://www.amd64.org/fileadmin/user_upload/pub/transact-2010-asfooo.pdf
     http://www-ali.cs.umass.edu/~moss/transact-2010/public-papers/08.pdf
     http://llvm.org/pubs/2010-04-EUROSYS-DresdenTM.pdf
RTTM (Schoeberl ’10) – premise
• “RTTM brings the benefits of transactional memories into the real-time systems world”.
• Paper contributions:
  – Design of a time-predictable hardware transactional memory
  – Analysis of the worst-case number of retries in a periodic thread model
  – Suggestions for analysis to reduce the number of possible conflicting transactions
  – First evaluation of RTTM on a simulation within a Java-based CMP.
• Optimized for WCET, not average performance
• Implemented on the Java Optimized Processor (JOP)
Java optimized processor (Schoeberl ‘07)
• Unlike the JVM, JOP is “a RISC stack architecture”
WCET-friendly CPU
• Time-predictable computer architecture, Schoeberl ’08
  – A collection of simplifications for CPU design to reduce the bounds on WCET, at a small penalty to ACET/BCET
  – Provides some reasoning (but no concrete proof)
WCET-friendly CPU - 2
• Time Division Multiple Access (TDMA) memory access scheduling (Pitter and Schoeberl, ’09; Rosen ’07)
• Each core is allotted a memory access slot
  – Transactions may only start during the access window
  – A gap allows completion (depends on memory access time)
Memory access WCET
• Fixed priority (WCET for high-prio req)
• Fair priority
• TDMA
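To make the TDMA case concrete, the worst-case latency of one memory access can be computed by brute force over every possible arrival phase. This is a simplified model and not the formula from the cited papers: it assumes equal slots of `slot_len` cycles, that a request may only be issued at the start of its core's own slot, and that `access_time` fits inside a slot. All function and parameter names are hypothetical.

```python
def tdma_wait(arrival, core, n_cores, slot_len):
    """Cycles a request arriving at `arrival` waits for the start of its slot."""
    round_len = n_cores * slot_len
    slot_start = core * slot_len
    return (slot_start - arrival) % round_len


def worst_case_access(core, n_cores, slot_len, access_time):
    """Worst-case wait plus access time, over every arrival phase in a round."""
    if access_time > slot_len:
        raise ValueError("access must complete within one slot in this model")
    round_len = n_cores * slot_len
    worst_wait = max(tdma_wait(t, core, n_cores, slot_len)
                     for t in range(round_len))
    return worst_wait + access_time


def closed_form(n_cores, slot_len, access_time):
    """Same bound in closed form: just missing the slot costs a full round minus one."""
    return n_cores * slot_len - 1 + access_time
```

Under these assumptions the bound is the same for every core and does not depend on what the other cores do, which is what makes TDMA attractive for WCET analysis despite its poor average-case latency.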
OS scheduling
• “Real-Time Specification for Java”
  – RT threads are assigned a deadline
  – Scheduler is preemptive, based on priority
    • Threads of the same priority are served in FIFO order
  – Scheduler guarantees that all threads meet their deadlines
    • Estimates of blocking bounds
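The dispatch rule described here (highest priority first, FIFO among equal priorities, preemption when a higher-priority thread becomes ready) can be sketched with a heap. This is an illustrative model only and not an RTSJ API: `ReadyQueue` and its methods are hypothetical names, and a larger number is assumed to mean a higher priority.

```python
import heapq
import itertools


class ReadyQueue:
    """Priority ready queue with FIFO order among equal priorities."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()   # tie-breaker preserving FIFO order

    def release(self, name, priority):
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._arrival), name))

    def should_preempt(self, running_priority):
        """True if a ready thread has strictly higher priority than the running one."""
        return bool(self._heap) and -self._heap[0][0] > running_priority

    def pick(self):
        """Remove and return the next thread to run, or None if none is ready."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

A thread of equal priority does not preempt the running one in this sketch, which matches the FIFO behaviour within a priority level.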
RTTM - proposal
• Transaction buffering: fully associative
• Read-set caching (tags only)
• Word granularity (no false conflicts)
• Commit in bursts
  – All other cores listen (conflict checks)
  – Protected by a global lock (“commit token”)
    • (What is the overhead for short transactions?)
• No aborts on overflow! Grab the commit token on the fly
• On a true abort, mark as a zombie transaction
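The commit protocol outlined above can be modelled sequentially: each core keeps a word-granular read set and a write buffer, and a committing core broadcasts its written addresses so that every listener with a matching read-set entry is marked as a zombie. This is a toy model under assumptions, not the RTTM hardware: `CoreTx` and `try_commit` are hypothetical names, the commit token is implied by running commits one at a time, and a zombie is assumed to discover its state only when it tries to commit.

```python
class CoreTx:
    """Per-core transaction state: read-set tags and a write buffer."""

    def __init__(self, name):
        self.name = name
        self.read_set = set()     # word addresses only (tags, no data)
        self.write_buffer = {}    # word address -> buffered value
        self.zombie = False

    def read(self, memory, addr):
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        self.read_set.add(addr)
        return memory.get(addr, 0)

    def write(self, addr, value):
        self.write_buffer[addr] = value

    def reset(self):
        self.read_set.clear()
        self.write_buffer.clear()
        self.zombie = False


def try_commit(tx, others, memory):
    """Commit tx while (conceptually) holding the commit token.

    Returns False if tx was a zombie and must be retried.
    """
    if tx.zombie:
        tx.reset()
        return False
    for addr, value in tx.write_buffer.items():
        memory[addr] = value                 # burst write to shared memory
        for other in others:                 # all other cores listen
            if addr in other.read_set:
                other.zombie = True          # true conflict on this word
    tx.reset()
    return True


def increment(tx, memory, addr):
    """Transaction body used in the example: memory[addr] += 1."""
    tx.write(addr, tx.read(memory, addr) + 1)
```

For example, if two cores both run `increment` on word 0 and a third only reads word 1, the first commit turns the second core into a zombie while the third is unaffected, because conflicts are tracked per word.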
RTTM Analysis
• Assume n threads, each executing a single atomic region once
  – Thread period
  – Execution time (cost)
  – Atomic region time
  – r: max retries
• WCET assumes:
  – Always conflict (all atomic regions use the same variable)
  – Worst phase: all threads attempt to enter the atomic region simultaneously
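The two WCET assumptions can be played out in a small simulation. This is a sketch and not the analysis from the RTTM paper: it additionally assumes that an aborted transaction restarts the moment another one commits, that the shortest pending atomic region therefore commits next, and that periods are long enough that no thread starts a second transaction before all are resolved. `simulate_worst_phase` and the sample times are hypothetical.

```python
def simulate_worst_phase(atomic_times):
    """All threads enter their atomic region at time 0 and always conflict.

    atomic_times[i] is the length of thread i's atomic region.
    Returns (retries per thread, time at which the last transaction commits).
    """
    pending = dict(enumerate(atomic_times))
    retries = {i: 0 for i in pending}
    now = 0
    while pending:
        # Everyone still pending (re)started at `now`; the shortest region wins.
        winner = min(pending, key=lambda i: (pending[i], i))
        now += pending.pop(winner)
        # The commit conflicts with every transaction still in flight.
        for i in pending:
            retries[i] += 1
    return retries, now
```

With four threads and atomic region times 3, 5, 2 and 4, the unluckiest thread retries three times, that is n - 1, and all transactions are resolved after 14 time units in this model.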
RTTM Analysis (2)
• Single transaction per thread per period
  – Time of transaction resolution (all threads)
• Period per transaction (same thread)
• Max number of retries
  – Assuming …
Preliminary analysis
• Possible directions
  – Context-sensitive points-to analysis
  – Static detection of race conditions
  – Simulation-based analysis of buffer overflows
• RTTM’s analysis was based on the WALA analyzer (open source from IBM, ’06)
Experiment methodology
• Implemented over JOP, simulated on a JVM
• 3 tasks
  – Producer enqueues into a buffer
  – Consumer removes elements from its buffer
  – Mover atomically moves elements between the buffers
• Buffer types
  – Standard Java vector
  – Bounded queue
Results
STM example (Fahmy, ’09)
• EDF scheduling
• Response time analysis
  – Predicted vs. simulated with random alignments (> 1)
  – Utilization: task time vs. period (< 1)
Bibliography
• Maurice Herlihy and J. Eliot B. Moss. Transactional memory: architectural support for lock-free data structures. ISCA ’93.
• J. H. Anderson, S. Ramamurthy, K. Jeffay. Real-time computing with lock-free shared objects. ACM ToCS, May ’97.
• M. Hohmuth, H. Härtig. Pragmatic nonblocking synchronization for real-time systems. USENIX ’01.
• M. Schoeberl, F. Brandner, J. Vitek. RTTM: Real-Time Transactional Memory. SAC ’10.
• M. Schoeberl. A Java processor architecture for embedded real-time systems. Journal of Systems Architecture, volume 54, Jan 2008, 265-286.
• M. Schoeberl. Time-predictable computer architecture. EURASIP J. Embedded Syst. 2009, Article 2 (January 2009).
• C. Pitter and M. Schoeberl. A real-time Java chip-multiprocessor. Trans. on Embedded Computing Sys., accepted for publication 2009.
• Manson (’05). Preemptible atomic regions (uni-processor).
Memory ordering rules
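Architectures compared: Alpha, ARMv7, PA-RISC, POWER, SPARC RMO, SPARC PSO, SPARC TSO, x86, x86 oostore, AMD64, IA-64, zSeries
Each row lists the architectures marked Y (reordering permitted):
• Loads reordered after loads: Alpha, ARMv7, PA-RISC, POWER, SPARC RMO, x86 oostore, IA-64
• Loads reordered after stores: Alpha, ARMv7, PA-RISC, POWER, SPARC RMO, x86 oostore, IA-64
• Stores reordered after stores: Alpha, ARMv7, PA-RISC, POWER, SPARC RMO, SPARC PSO, x86 oostore, IA-64
• Stores reordered after loads: all twelve
• Atomic reordered with loads: Alpha, ARMv7, POWER, SPARC RMO, IA-64
• Atomic reordered with stores: Alpha, ARMv7, POWER, SPARC RMO, SPARC PSO, IA-64
• Dependent loads reordered: Alpha
• Incoherent instruction cache pipeline: Alpha, ARMv7, POWER, SPARC RMO, SPARC PSO, SPARC TSO, x86, x86 oostore, IA-64, zSeries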
Source: http://en.wikipedia.org/wiki/Memory_ordering