CS4021 Lockless Algorithms - Trinity College Dublin

xadd reg, mem ; tmp = reg + mem, reg = mem, mem = tmp
• without a LOCK prefix, XADD is executed non-atomically
School of Computer Science and Statistics, Trinity College Dublin 2
Why be concerned? • clock rate of a single CPU core appears to be limited to ≈ 4GHz • single CPU core processing power now falls far short of doubling every 18 months • Intel, AMD, Sun, IBM, … producing multicore CPUs instead • typical desktop has 4 cores with each core capable of executing 2 threads [hyper-threading] giving a total of 8 concurrent threads
• typical desktop in 2014 16 threads, 2016 32 threads, … [Moore's Law and Joy's Law] • need to be able to exploit cheap threads on multicore CPUs • lock-based solutions are simply not scalable as a lock inhibits parallelism
• need to explore lockless data structures and algorithms
Spin Lock Implementations • implementations should minimise bus traffic, especially when a lock is heavily contended
• CPUs waiting for a lock are idle and shouldn't generate unnecessary bus traffic which slows down the CPUs doing real work
• spin lock implementations usually rely on atomic instructions which comprise an indivisible read-modify-write [RMW] access to a shared memory location
• in a single CPU system, many instructions are effectively atomic because interrupts can only be serviced on instruction boundaries
Spin Lock Implementations… • consider a spinlock implementation based on an IA32 logical shift right instruction [shr]
;
; simple spin lock (NB: 1 == free, 0 == taken)
;
wait    shr lock, 1     ; lock in memory
        jnc wait        ; jump no carry (retry if C == 0)
        ret             ; return
free    mov lock, 1     ; lock = 1 (free)
        ret             ; return
• works in a single CPU system, but not in a multiprocessor • why? determined by how the CPU updates memory
if the lock is free and "shr lock, 1" is executed, the instruction atomically/simultaneously sets the lock as taken and returns the fact that the lock has been acquired in the carry flag
Atomic Instructions • atomic RMW memory accesses [read cycle followed by a write cycle] must NOT be
interleaved with memory accesses made by other CPUs • CPUs generally have special atomic instructions which indicate externally that an
atomic RMW memory access is being performed • if bus cycles are arbitrated on a cycle by cycle basis [i.e. NON atomic] then
a CPU could read a lock and find it free; on the next bus cycle another CPU could also read the lock and find it still free before the first CPU has been given a bus cycle to set the lock; this would result in the lock being allocated to both CPUs
• an IA32/x64 CPU asserts a /LOCK signal [external pin on chip] to inform the bus arbiter that it is trying to perform an atomic RMW memory access
• the bus arbiter must simply keep the CPU locked onto the bus while the /LOCK signal is asserted
IA32/x64 Atomic Instructions • XCHG [exchange] instruction generates an atomic read-modify-write memory access • use variant which exchanges [swaps] a register with a memory location
;
; testAndSet lock [NB: 0 == free, 1 == taken]
;
wait    mov eax, 1      ; eax = 1
        xchg eax, lock  ; exchange eax and lock in memory
        test eax, eax   ; test eax [result of xchg]
        jne wait        ; re-try if unsuccessful
        ret             ; return
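The testAndSet lock above can be sketched portably with C++ `std::atomic_flag` standing in for the raw xchg instruction; the names `TASLock`, `acquire` and `release` are illustrative, not from the notes.

```cpp
#include <atomic>

// Hedged portable sketch of the testAndSet spin lock
struct TASLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;   // clear == free, set == taken
    void acquire() {
        // test_and_set is an atomic RMW, like xchg eax, lock
        while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ }
    }
    void release() {
        flag.clear(std::memory_order_release);  // like mov lock, 0
    }
};
```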
Volatile • lock must be declared as volatile
• description of volatile from Visual Studio 2012 documentation
objects that are declared as volatile are not used in certain optimizations because their values can change at any time. The system always reads the current value of a volatile object when it is requested, even if a previous instruction asked for a value from the same object. Also, the value of the object is written immediately on assignment.
• to declare the object pointed to by a pointer as volatile use:
volatile int *p; // what p points to is volatile
• to declare the pointer itself volatile use:
int * volatile p; // the pointer p itself is volatile
• to declare both use:
volatile int * volatile p; // p and what p points to are both volatile
Serializing Instructions… • need to consider memory read and write ordering if locks are to work correctly • the CPU must NOT read ahead data in the shared data structure before it has obtained the lock [otherwise the CPU holding the lock may not have finished updating the shared data structure and out of date data will be read]
• CPU must not release the lock until ALL its writes to the shared data structure have
been completed [otherwise next lock holder could read out of date data] • LOCKED instructions [e.g. xchg, lock xadd] act implicitly as a memory barrier or fence • reads/writes cannot pass [be carried out ahead of] locked [serialising] instructions
Serializing Instructions… • CPUs often have explicit memory barrier or fence instructions to flush the write buffer
and to enforce ordering • IA32/x64 have the following fence instructions
SFENCE   store fence    flush all writes before executing instruction
LFENCE   load fence     don't read ahead until instruction executed
MFENCE   memory fence   flush all writes before executing instruction and don't read ahead until instruction executed
• see section 8.2 on Memory Ordering in Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1
• and also Intel® 64 Architecture Memory Ordering White Paper
Serializing Instructions… • why does the previous testAndSet code work on an IA32/x64 CPU?
1) writes are made to memory in program order so that when the lock is cleared and visible [mov lock, 0] ALL previous writes to the shared data structure are also visible
2) lock obtained using a serialising instruction [xchg eax, lock] which prevents read ahead so that data in the shared data structure will not be read until the lock is obtained
3) executing serialising instructions reduces CPU performance as it prevents the CPU from reading and writing ahead
Load Locked / Store Conditional Instructions • alternative approach for performing atomic RMW accesses to memory • a load locked [LL] instruction followed by a store conditional [SC] instruction is used to perform an atomic RMW access to memory
• first used by MIPS CPU [ll/sc]
• also used by Alpha [ldq_l/stq_c], IBM Power PC [lwarx/stwcx] and ARM [ldrex/strex] CPUs
Alpha LL/SC Implementation • each CPU has a lockFlag [LF] and a lockPhysicalAddressRegister [LPAR] used by the LL
and SC instructions • LDQ_L Ra, va ; load quadword locked
lockFlag = 1
lockPhysicalAddressRegister = physicalAddress(va)
Ra = [va]
• STQ_C Ra, va ; conditionally store quadword
if (lockFlag == 1)   ; check lock flag
    [va] = Ra        ; conditional store if lockFlag is set
Ra = lockFlag        ; used to test if store occurred
lockFlag = 0         ; clear lock flag
Alpha LL/SC Implementation… • where is the magic?
• if the per CPU lockFlag is still set when an associated STQ_C is executed, the store occurs
otherwise NO store takes place [conditional store] • what clears the lockFlag? if any CPU does a store [write] to the physical memory address contained in a
lockPhysicalAddressRegister, the associated CPU clears its lockFlag
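On CAS-based machines the LL/SC retry pattern described above corresponds to a compare-exchange retry loop. The following is an illustrative sketch only; `atomicAdd` is a hypothetical helper name, not from the notes.

```cpp
#include <atomic>

// LL/SC-style atomic add expressed as a CAS retry loop
long atomicAdd(std::atomic<long> *a, long v) {
    long old = a->load();                            // like LDQ_L: read current value
    while (!a->compare_exchange_weak(old, old + v)) {
        // like a failed STQ_C: another CPU wrote the location between the
        // load and the store, so retry [old is reloaded on failure]
    }
    return old;                                      // value before the add
}
```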
Cost of Sharing Data Between Threads… • for 25% sharing, for example, each thread executes
InterlockedExchangeAdd(GINDX(thread), 1);    // thread specific
InterlockedExchangeAdd(GINDX(thread), 1);    // thread specific
InterlockedExchangeAdd(GINDX(thread), 1);    // thread specific
InterlockedExchangeAdd(GINDX(maxThread), 1); // shared
NB: threads numbered from 0 .. maxThread-1
• use _aligned_malloc to allocate data on a cache line boundary
volatile long *g; // NB: position of volatile
g = (long*) _aligned_malloc((maxThread+1)*lineSz, lineSz); // shared global variable
• GINDX macro defined as follows #define GINDX(n) (g + n*lineSz/sizeof(long)) // index into g
Worker Function

DWORD WINAPI worker(LPVOID thread) {
    long long ops = 0;                                  // 64 bit local counter
    while (1) {
        for (int i = 0; i < NOPS / 4; i++) {            // NOPS/4 since work comprises...
            // do some work                             // 4 InterlockedExchange operations
        }
        ops += NOPS;                                    // local to thread
        if (clock() - tstart > NSECONDS*CLOCKS_PER_SEC) // NSECONDS of work?
            break;
    }
    cnt[(int) thread] = ops;                            // remember in global cnt array
    return 0;
}
TestAndSet Lock… • ALL waiting CPUs repeatedly execute an xchg instruction trying to get hold of the lock • the memory accesses made by the xchg instruction don't benefit from having a cache since the shared cache line is continually overwritten [even if the lock is a 1, it is overwritten with a 1]; this invalidates the entries in the other caches, which results in bus cycles for both the read and write parts of ALL xchg instructions [think MESI]
• ALL the xchg reads and writes will be to memory
• a write-update cache coherency protocol would allow the reads to be local cache reads [Firefly]
• the lock is overwritten even if there is NO chance of obtaining the lock
• why is there not an instruction which conditionally writes a 1 only if the value read is 0?
The PAUSE instruction improves the performance of processors supporting Hyper-Threading Technology when executing “spin-wait loops” and other routines where one thread is accessing a shared lock or semaphore in a tight polling loop. When executing a spin-wait loop, the processor can suffer a severe performance penalty when exiting the loop because it detects a possible memory order violation and flushes the core processor’s pipeline. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation and prevent the pipeline flush. In addition, the PAUSE instruction de-pipelines the spin-wait loop to prevent it from consuming execution resources excessively. (See Section 7.11.6.1, “Use the PAUSE Instruction in Spin-Wait Loops,” for more information about using the PAUSE instruction with IA-32 processors supporting Hyper-Threading Technology.)
TestAndTestAndSet Lock… • the advantage is that the test of the lock [lock == 1] is executed entirely within the
cache and the xchg instruction is only executed when the lock is known to be free and there is a chance of acquiring the lock
• the cached lock variable will be invalidated or updated when the lock is released and
only then is an attempt made to obtain the lock by executing a xchg instruction
• if the release of the lock invalidates the other shared cache lines then O(n²) [where n is the number of CPUs waiting for the lock] bus cycles will supposedly be generated [a claim quoted from the literature]
• ALL n waiting CPUs continuously read the lock [from their own local cache]; these
cache lines will be invalidated when the lock is released; subsequent reads of the lock will appear on bus which will be serialised by a typical round-robin bus arbiter and each CPU, in turn, will see the lock free
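The testAndTestAndSet idea above can be sketched portably: spin on a plain read [which hits in the local cache] and only execute the atomic exchange when the lock looks free. `TTASLock` and its members are illustrative names, assuming `std::atomic` in place of the xchg instruction.

```cpp
#include <atomic>

// Hedged sketch of a testAndTestAndSet lock
struct TTASLock {
    std::atomic<int> lock{0};                        // 0 == free, 1 == taken
    void acquire() {
        while (true) {
            // test: spin locally until the lock looks free
            while (lock.load(std::memory_order_relaxed) == 1) { /* spin in cache */ }
            // testAndSet: only now attempt the atomic exchange
            if (lock.exchange(1, std::memory_order_acquire) == 0)
                return;                              // exchange returned 0: lock acquired
        }
    }
    void release() { lock.store(0, std::memory_order_release); }
};
```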
TestAndTestAndSet Locks... • an individual CPU executes its xchg instruction but then sees the remaining CPUs executing their xchg instructions, which will invalidate its cache line, so a bus cycle has to be performed to read the lock again, i.e. O(n²)
• however, won't the bus cycles for the xchg be such that all CPUs execute them one after another [thanks to the round robin arbiter] so that a CPU's cache line is effectively invalidated only once? i.e. O(n)
• if the release of the lock updates the other caches directly then the generated bus traffic will only be of O(n)
• either way there will be enough bus activity to interfere with the process in the critical
section as well as the other processes not involved with the lock
• if the lock is held for a long time the impact is unimportant, but for short critical sections the lock will be released before the last spurt of activity has subsided resulting in continued bus saturation
TestAndSet Lock with Exponential Back Off • don't continuously try to acquire the lock, delay between attempts to acquire it:
d = 1;                                   // initialise back off delay
while (InterlockedExchange(&lock, 1)) {  // if unsuccessful…
    delay(d);                            // delay d time units
    d *= 2;                              // exponential back off
}
• testAndTestAndSet lock NOT necessary when using a back off scheme • the longer the CPU has been waiting for the lock, the longer it will have to wait before it attempts to acquire the lock again; possibility of starvation • supposed to work well in practice
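The back off scheme above can be sketched portably, with `std::atomic::exchange` standing in for InterlockedExchange and `sleep_for` for delay(); the function name and the microsecond time unit are assumptions.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Sketch of a testAndSet acquire with exponential back off
void acquireWithBackOff(std::atomic<int> *lock) {
    int d = 1;                                             // initialise back off delay
    while (lock->exchange(1, std::memory_order_acquire)) { // if unsuccessful…
        std::this_thread::sleep_for(std::chrono::microseconds(d)); // delay d time units
        d *= 2;                                            // exponential back off
    }
}
```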
Ticket Lock with Proportional Back Off

class TicketLock {
public:
    volatile long ticket;      // initialise to 0
    volatile long nowServing;  // initialise to 0
};

inline void acquire(TicketLock *lock)                           // acquire lock
{
    long myTicket = InterlockedExchangeAdd(&lock->ticket, 1);   // get ticket [atomic]
    while (myTicket != lock->nowServing)                        // if not our turn…
        delay(myTicket - lock->nowServing);                     // delay relative to…
}                                                               // position in Q

inline void release(TicketLock *lock)                           // release lock
{
    lock->nowServing++;                                         // give lock to next CPU
}                                                               // NB: not atomic
Ticket Lock with Proportional Back Off… • think of waiting in a Q in the Andrews St. tourist office, ISS computer help desk, A&E, … • deterministic • ONLY 1 atomic instruction executed per lock acquisition • FAIR, locks granted in order of request which eliminates the possibility of starvation • back off proportional to position in Q
• if the time in the critical section is constant, the delay can be calculated such that the subsequent test of lock->nowServing will just succeed
• still polls a common location [lock->nowServing] which will cause some bus traffic with an invalidate protocol
an invalidate protocol • delay not necessary with a write-update protocol [Firefly]
MCS Lock [Mellor-Crummey and Scott] • lockless queue of waiting threads • each thread has its own QNode which is linked into a Q of QNodes waiting for lock • a global variable lock points to tail of Q • acquire lock by adding a thread’s QNode [qn] to tail of Q and waiting until
qn->waiting==0 • release lock by setting qn->next->waiting=0 [if qn not at the tail of Q]
Compare and Swap [CAS]
• pseudo C version of CAS
atomic long CAS(long *a, long e, long n) // memory address, expected value, new value
{
    long r = *a;    // read contents of memory address
    if (r == e)     // compare with expected value and if equal…
        *a = n;     // update memory with new value
    return r;       // success if e returned
}
• NB: returns expected value if exchange took place
• CAS can be mapped onto the IA32/x64 compare and exchange instruction
cmpxchg mem, reg   // if (eax == mem)
                   //     ZF = 1, mem = reg
                   // else
                   //     ZF = 0, eax = mem
Compare and Swap…
• make use of the following intrinsic defined in intrin.h
long InterlockedCompareExchange(long volatile *a, long n, long e);
NB: different parameter order than the previous/normal definition of CAS
• for convenience can always define
#define CAS(a, e, n) InterlockedCompareExchange(a, n, e)
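The slide's CAS can also be expressed with C++ `std::atomic`'s compare_exchange_strong, which returns a bool and writes the observed value back into its first argument on failure; this hedged wrapper restores the (address, expected, new) convention used above.

```cpp
#include <atomic>

// CAS with the slide's semantics: returns the value read,
// so success is indicated by the expected value being returned
long CAS(std::atomic<long> *a, long e, long n) {
    long r = e;
    a->compare_exchange_strong(r, n);   // on failure r holds the current value
    return r;                           // success if e returned
}
```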
How to allocate objects aligned on a cache line • can allocate objects in their own cache line(s) to avoid false sharing • one straightforward approach is to use a template class to override new and delete
//
// derive from ALIGNEDMA for aligned memory allocation
//
template <class T>
class ALIGNEDMA {
public:
    void* operator new(size_t);    // override new
    void operator delete(void*);   // override delete
};
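One possible implementation of the ALIGNEDMA overrides, sketched with the portable `std::aligned_alloc` in place of _aligned_malloc; the 64 byte line size is an assumption.

```cpp
#include <cstdlib>
#include <cstdint>
#include <new>

constexpr std::size_t LINESZ = 64;                  // assumed cache line size

template <class T>
class ALIGNEDMA {
public:
    void *operator new(std::size_t sz) {            // override new
        sz = (sz + LINESZ - 1) / LINESZ * LINESZ;   // aligned_alloc needs sz to be a multiple of the alignment
        if (void *p = std::aligned_alloc(LINESZ, sz)) return p;
        throw std::bad_alloc();
    }
    void operator delete(void *p) { std::free(p); } // override delete
};
```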
MCS Lock acquire
• pred = InterlockedExchange(lock, qn) performed atomically (1) • think about what happens if two or more threads try to acquire lock simultaneously • if pred is NULL [previous value of lock] then at head of Q so have lock otherwise… • set qn->waiting = 1 and… • link thread’s QNode to tail of existing Q by setting pred->next = qn (2) • wait until qn->waiting == 0
MCS Lock release
• if (qn->next != NULL) set qn->waiting = 0 which passes lock to next thread in Q
• if (qn->next == NULL) use InterlockedCompareExchangePointer(lock, NULL, qn) to atomically set lock = NULL if lock == qn and return if successful [there are no more threads waiting for the lock] otherwise…
• a call to acquire() by another thread must have added a QNode between qn and lock • follow qn->next until not NULL and assign to succ which then points to next QNode in Q • set succ->waiting = 0 to pass lock to next thread [no explicit removal of QNodes from Q]
Testing Framework • create a framework to compare the performance of locked and lockless lists
• will use VC++
• source code on CS4021 web site [single source for Win32 and x64] • implement an ordered list with add(key) and remove(key) operations • create n threads which pseudo randomly add or remove items from a list • add and remove operations occur with equal probability • generate keys pseudo randomly in range 0 .. maxkey-1 • changing key range controls the length of list and also the amount of contention
between threads [less contention with longer lists]
C++ Node and List class definitions
• develop test framework for testing the performance of a list protected by different kinds of locks [CriticalSection, testAndSet, testAndTestAndSet,…]
class Node: public ALIGNEDMA<Node> {   // derive from ALIGNEDMA
public:
    int key;                           // key
    Node *next;                        // points to next node in list
};

class List: public ALIGNEDMA<List> {   // derive from ALIGNEDMA
private:
    Node *head;                        // head of list
    DECLARE();                         // macro to declare CriticalSection, testAndSet lock, …
public:
    List();                            // constructor
    ~List();                           // destructor
    int add(int key);                  // return 1 if successful
    int remove(int key);               // return 1 if successful
};
testAndSet Results… • do the results make sense? and why are they so poor?
one thread will be updating the list while all others will be trying to obtain the lock
each attempt to acquire the lock requires the execution of an xchg instruction
each xchg instruction not only reads memory but also writes a 1 to the lock [even if it's already a 1], invalidating copies of the lock in other caches [MESI protocol]
this greatly increases the bus traffic [reads and writes of the lock will be to/from memory] which significantly reduces the speed of the thread that has the lock
if a thread is pre-empted holding the lock, it will obstruct other threads from making progress [this effect is probably not too significant]
significantly reduced performance due to increased bus traffic from (1) continuously executing the xchg instruction and (2) sharing modified list nodes
Ticket Lock Results…
• idealised diagram of what is happening • to simplify diagram assume 4 cores and 8 threads
• threads run for an OS time quantum • need to wait for quantum to end before tickets 4, 8, … start to run • hence 4 tickets/updates per OS time quantum • what is the OS time quantum?
Lockless List Implementation using CAS
• if 2 threads try to add nodes at the same position
CAS(&a->next, b, c) // assume this CAS executes first and succeeds…
CAS(&a->next, b, d) // consequently this CAS will fail
• first CAS executed succeeds, second fails as a->next != b • on failure need to RETRY operation • search AGAIN for insertion point and, if found, re-execute CAS [costly if list long]
Using CAS to remove nodes • search for node and then execute CAS with correct parameters • consider 2 threads removing non-adjacent nodes [disjoint-access parallelism]
CAS(&a->next, b, c) // both will succeed
CAS(&c->next, d, 0) // both will succeed
A Pragmatic Implementation of Non-Blocking Linked Lists Tim Harris [2001] • two step removal [consider remove(20)] • node atomically marked [logically deleted] before updating pointer using CAS
• marked node indicated by an odd address in next field [possible as nodes normally
aligned on 4 byte boundaries]
is_marked_reference(r)      // returns 1 if marked
get_marked_reference(r)     // convert to marked reference
get_unmarked_reference(r)   // convert to unmarked reference
• tests, sets and clears LSB of address [which is stored in next field]
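The three helpers can be sketched as below: the mark is the LSB of the pointer stored in a node's next field, legal because nodes are aligned on at least 4 byte boundaries. `Node` here is a minimal stand-in for the list node type.

```cpp
#include <cstdint>

struct Node { int key; Node *next; };   // minimal stand-in for the list node

// the mark lives in the LSB of the address stored in a next field
int   is_marked_reference(Node *r)    { return (std::uintptr_t)r & 1; }
Node *get_marked_reference(Node *r)   { return (Node*)((std::uintptr_t)r | 1); }
Node *get_unmarked_reference(Node *r) { return (Node*)((std::uintptr_t)r & ~(std::uintptr_t)1); }
```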
remove [delete]… • CAS to remove node will fail
• since node is logically deleted there is no point in calling delete again…. • BUT calling search again will remove any marked node(s) immediately before key
• NOT calling search would simply mean that the marked node(s) would remain in the list
until another node is inserted after 20 [in this example state]
• how could the list get into the following state?
find [search] • step 1: iterates along the list to find the first unmarked node >= key; this is the right node;
the left node refers to the previous unmarked node found
• step 2: if the left node is the immediate predecessor of the right node, the search returns [returns with no marked nodes between left and right]
• step 3: use CAS to remove marked node(s) between the left and right nodes; on failure the search is retried
• the optimisation checks if the right node has become marked [logically deleted] and performs the search again rather than returning and then failing in add or remove
A Pragmatic Implementation of Non-Blocking Linked Lists… • what is NOT said! • insert allocates a new node even if insertion fails • NO code for freeing or re-using nodes • nodes never become unmarked • avoids ABA problem by not re-using nodes which also… • avoids problem of threads traversing list using pointers to freed nodes • assumes nodes are garbage collected in a safe way [not an easy problem to solve] • ONLY a partial solution without memory management [perhaps the harder problem]
Memory Management
• use garbage collection [Java, but not yet in C++] • reference counting • deferred freeing of nodes [see end of section 6 in Harris paper]
each node contains an additional link field so that it can be added to a per thread retireQ and reuseQ
each thread takes a copy of a global timer [e.g. clock()] before starting an add or remove operation and saves it in a global startOp array [each thread startOp stored in its own cache line for speed]
add and remove operations add any freed nodes to the retireQ and set the key field to the startOp of the thread
add and remove operations, before they exit, can traverse the retireQ and transfer
nodes to a reuseQ if their startOp is less than the minimum startOp of any thread since no thread can still have a reference to the node
Memory Management
• add retired nodes to end of retireQ • the minimum thread startOp time is 129 • can transfer all nodes in retireQ with startOp < 129 to reuseQ [first three nodes] • allocate nodes from per thread reuseQ and only call new/malloc if empty
Hazard Pointers
• Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects Maged Michael (2004) IEEE Transactions on Parallel and Distributed Systems 15 (8): 491–504
• in terms of an ordered linked list, there are two active pointers as the list is traversed during a find operation [number will be different for other algorithms]
• these active pointers called hazard pointers [used to save cur and next p499 Fig 9]
• idea is not to reuse/delete/free nodes if they have hazard pointers pointing to them
Hazard Pointers… • maintain a global array of per thread hazard pointers [each thread saving its hazard
pointers in its own cache line for speed]
• use a per thread retireQ and reuseQ as per previous example
• retire node by adding to retireQ and when length >= 2*nthreads*HAZARDSPERTHREAD
make a local copy of all hazard pointers in global array [allocate a local array]
sort hazard pointers in local array [optional]
for each node on retireQ, if node address doesn’t match any hazard pointer in local array transfer to reuseQ
• again need to allocate nodes from per thread reuseQ and only call new/malloc if empty
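The retire scan described above can be sketched as follows: take a local copy of the global hazard pointer array, sort it, then move every retired node not matched by a hazard pointer to the reuse list. The container types and the name `scan` are assumptions, not from the paper.

```cpp
#include <algorithm>
#include <vector>

// Illustrative sketch of the hazard pointer retire scan
void scan(const std::vector<void*> &hazard,
          std::vector<void*> &retireQ, std::vector<void*> &reuseQ) {
    std::vector<void*> hp(hazard);                  // local copy of hazard pointers
    std::sort(hp.begin(), hp.end());                // optional: enables binary search
    std::vector<void*> still;                       // nodes that must stay retired
    for (void *n : retireQ) {
        if (std::binary_search(hp.begin(), hp.end(), n))
            still.push_back(n);                     // a thread may still hold a reference
        else
            reuseQ.push_back(n);                    // safe to reuse
    }
    retireQ.swap(still);
}
```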
Transactional Memory • locks hard to manage effectively
pessimistic – inhibits parallelism
priority inversion – lower priority thread pre-empted while holding a lock needed by a higher priority thread
convoying – thread holding a lock is descheduled and other threads queue up unable to progress
deadlock – can be difficult to avoid in complex systems
• atomic primitives such as CAS operate on one word at a time resulting in complex algorithms
• MCAS [multiple compare and swap] is of some help
• no hardware implementation
• a list of addresses, expected values and new values
• can be implemented using CAS
Conflict Detection Example • in both sequences, eager detection would
detect a conflict at Read X because the other transaction has already written to X
• (a)
lazy conflict detection would detect a conflict in T1 because T2 commits first implying that T1 should have used the result of the T2 Write X operation
• (b)
lazy conflict detection would allow both T1 and T2 to commit because T1 commits first and its Read X need not use the result of the T2 Write X
Hardware Transactional Memory
• Transactional Memory: Architectural Support of Lock-Free Data Structures Maurice Herlihy and J. Eliot B. Moss Proceedings of the 20th Annual International Symposium on Computer Architecture 1993
• motivations
lock-free – operations on a data structure will not be prevented if one process/thread stalls mid execution
avoids common problems with mutual exclusion
outperforms best known locking techniques
Hardware Transaction Memory • basic idea is that any cache coherency protocol capable of detecting accessibility
conflicts can also detect transaction conflicts at no extra cost
• instructions added to the CPU instruction set for handling transactions – would be automatically generated by a compiler
• Load transactional [LT] reads value from a shared memory location into transaction cache [and CPU register]
• Load transactional exclusive [LTX] reads a value from a shared memory location into the transaction cache and marks it as RESERVED [use LTX if location likely to be updated]
• Store transactional [ST] tentatively writes a value to a copy of the data in the transaction cache which does NOT become visible to other processors until the transaction successfully commits
Hardware Transactional Memory…
• commit [COMMIT] attempts to make a transaction’s tentative changes permanent and visible to other caches succeeds ONLY if no other transaction has written to any location in the
transaction's read or write set [and no other transaction has read any location in this transaction’s write set]
on failure all tentative changes to the write set are discarded
returns success or failure
• Abort [ABORT] discards all updates to the write set • Validate [VALIDATE] tests the current transaction’s status
returns true if the transaction has not aborted [thus far] returns false if the current transaction has aborted, discards tentative updates
• CPU also keeps a TACTIVE flag indicating a transaction is in progress and a TSTATUS flag
indicating if the transaction is active or aborted; VALIDATE returns TSTATUS
Transaction Cache States
• transaction cache lines have a write-once state AND a transaction state
• a memory location cannot be in a CPU’s normal cache and transaction cache simultaneously [exclusive caches]
• transactional cache states
EMPTY     contains no data [invalid]
NORMAL    contains committed data
XCOMMIT   [discard on commit] contains original value read from “memory”
XABORT    [discard on abort] holds the tentative writes made to the cache line during a transaction [always paired with an XCOMMIT cache line]
• if a transaction commits successfully, the XCOMMIT lines are set to EMPTY and the XABORT lines switch to NORMAL
• must occur atomically using appropriate hardware support so ALL changes become visible “instantaneously”
• a compiler generated code sequence for a transaction:

tstart: ltx r1, a0      // know a0 will be modified
        ltx r2, a1      // know a1 will be modified
        add r1, 3, r1   // add 3
        sub r2, 3, r2   // sub 3
        st  r1, a0      // tentative store
        st  r2, a1      // tentative store
        commit          // commit
        jeq tstart      // retry on failure
• could add validate instructions to test for abort status earlier