CS4021 Lockless Algorithms - Trinity College Dublin

xadd reg, mem ; tmp = reg + mem, reg = mem, mem = tmp
• without a LOCK prefix, XADD is executed non-atomically
School of Computer Science and Statistics, Trinity College Dublin 2
Why be concerned? • clock rate of a single CPU core appears to be limited to ≈ 4GHz • single CPU core processing power now falls far short of doubling every 18 months • Intel, AMD, Sun, IBM, … producing multicore CPUs instead • typical desktop has 4 cores with each core capable of executing 2 threads [hyper-threading] giving a total of 8 concurrent threads
• typical desktop in 2014 16 threads, 2016 32 threads, … [Moore's Law and Joy's Law] • need to be able to exploit cheap threads on multicore CPUs • lock-based solutions are simply not scalable as a lock inhibits parallelism
• need to explore lockless data structures and algorithms
Spin Lock Implementations • implementations should minimise bus traffic, especially when a lock is heavily contended
• CPUs waiting for a lock are idle and shouldn't generate unnecessary bus traffic which slows down the CPUs doing real work
• spin lock implementations usually rely on atomic instructions which comprise an indivisible read-modify-write [RMW] access to a shared memory location
• in a single CPU system, many instructions are effectively atomic because interrupts can only be serviced on instruction boundaries
Spin Lock Implementations… • consider a spinlock implementation based on an IA32 logical shift right instruction [shr]
;
; simple spin lock (NB: 1 == free, 0 == taken)
;
wait    shr lock, 1     ; lock in memory
        jnc wait        ; jump no carry (retry if C == 0)
        ret             ; return
free    mov lock, 1     ; lock = 1 (free)
        ret             ; return
• works in a single CPU system, but not in a multiprocessor • why? determined by how the CPU updates memory
if the lock is free and "shr lock, 1" is executed, the instruction atomically/simultaneously sets the lock as taken and returns the fact that the lock has been acquired in the carry flag
Atomic Instructions • atomic RMW memory accesses [read cycle followed by a write cycle] must NOT be
interleaved with memory accesses made by other CPUs • CPUs generally have special atomic instructions which indicate externally that an
atomic RMW memory access is being performed • if bus cycles are arbitrated on a cycle by cycle basis [i.e. NON atomic] then
a CPU could read a lock and find it free; on the next bus cycle another CPU could also read the lock and find it still free before the first CPU has been given a bus cycle to set the lock; this would result in the lock being allocated to both CPUs
• an IA32/x64 CPU asserts a /LOCK signal [external pin on chip] to inform the bus arbiter that it is trying to perform an atomic RMW memory access
• the bus arbiter must simply keep the CPU locked onto the bus while the /LOCK signal is asserted
IA32/x64 Atomic Instructions • XCHG [exchange] instruction generates an atomic read-modify-write memory access • use variant which exchanges [swaps] a register with a memory location
;
; testAndSet lock [NB: 0 == free, 1 == taken]
;
wait    mov eax, 1      ; eax = 1
        xchg eax, lock  ; exchange eax and lock in memory
        test eax, eax   ; test eax [result of xchg]
        jne wait        ; re-try if unsuccessful
        ret             ; return
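The testAndSet lock above can be sketched portably with C++ `std::atomic_flag` standing in for the raw xchg instruction; the names `TASLock`, `acquire` and `release` are illustrative, not from the notes.

```cpp
#include <atomic>

// Hedged portable sketch of the testAndSet spin lock
struct TASLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;   // clear == free, set == taken
    void acquire() {
        // test_and_set is an atomic RMW, like xchg eax, lock
        while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ }
    }
    void release() {
        flag.clear(std::memory_order_release);  // like mov lock, 0
    }
};
```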
Volatile • lock must be declared as volatile
• description of volatile from Visual Studio 2012 documentation
objects that are declared as volatile are not used in certain optimizations because their values can change at any time. The system always reads the current value of a volatile object when it is requested, even if a previous instruction asked for a value from the same object. Also, the value of the object is written immediately on assignment.
• to declare the object pointed to by a pointer as volatile use:
volatile int *p; // what p points to is volatile
• to declare the pointer itself volatile use:
int * volatile p; // the pointer p itself is volatile
• to declare both use:
volatile int * volatile p; // p and what p points to are both volatile
Serializing Instructions… • need to consider memory read and write ordering if locks are to work correctly • the CPU must NOT read ahead data in the shared data structure before it has obtained the lock [otherwise the CPU holding the lock may not have finished updating the shared data structure and out of date data will be read]
• CPU must not release the lock until ALL its writes to the shared data structure have
been completed [otherwise next lock holder could read out of date data] • LOCKED instructions [e.g. xchg, lock xadd] act implicitly as a memory barrier or fence • reads/writes cannot pass [be carried out ahead of] locked [serialising] instructions
Serializing Instructions… • CPUs often have explicit memory barrier or fence instructions to flush the write buffer
and to enforce ordering • IA32/x64 have the following fence instructions
SFENCE   store fence    flush all writes before executing instruction
LFENCE   load fence     don't read ahead until instruction executed
MFENCE   memory fence   flush all writes before executing instruction and don't read ahead until instruction executed
• see section 8.2 on Memory Ordering in Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1
• and also Intel® 64 Architecture Memory Ordering White Paper
Serializing Instructions… • why does the previous testAndSet code work on an IA32/x64 CPU?
1) writes are made to memory in program order so that when the lock is cleared and visible [mov lock, 0] ALL previous writes to the shared data structure are also visible
2) lock obtained using a serialising instruction [xchg eax, lock] which prevents read ahead so that data in the shared data structure will not be read until the lock is obtained
3) executing serialising instructions reduces CPU performance as it prevents the CPU from reading and writing ahead
Load Locked / Store Conditional Instructions • alternative approach for performing atomic RMW accesses to memory • a load locked [LL] instruction followed by a store conditional [SC] instruction is used to perform an atomic RMW access to memory
• first used by MIPS CPU [ll/sc]
• also used by Alpha [ldq_l/stq_c], IBM Power PC [lwarx/stwcx] and ARM [ldrex/strex] CPUs
Alpha LL/SC Implementation • each CPU has a lockFlag [LF] and a lockPhysicalAddressRegister [LPAR] used by the LL
and SC instructions • LDQ_L Ra, va ; load quadword locked
lockFlag = 1
lockPhysicalAddressRegister = physicalAddress(va)
Ra = [va]
• STQ_C Ra, va ; conditionally store quadword
if (lockFlag == 1)   ; check lock flag
    [va] = Ra        ; conditional store if lockFlag is set
Ra = lockFlag        ; used to test if store occurred
lockFlag = 0         ; clear lock flag
Alpha LL/SC Implementation… • where is the magic?
• if the per CPU lockFlag is still set when an associated STQ_C is executed, the store occurs
otherwise NO store takes place [conditional store] • what clears the lockFlag? if any CPU does a store [write] to the physical memory address contained in a
lockPhysicalAddressRegister, the associated CPU clears its lockFlag
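On CAS-based machines the LL/SC retry pattern described above corresponds to a compare-exchange retry loop. The following is an illustrative sketch only; `atomicAdd` is a hypothetical helper name, not from the notes.

```cpp
#include <atomic>

// LL/SC-style atomic add expressed as a CAS retry loop
long atomicAdd(std::atomic<long> *a, long v) {
    long old = a->load();                            // like LDQ_L: read current value
    while (!a->compare_exchange_weak(old, old + v)) {
        // like a failed STQ_C: another CPU wrote the location between the
        // load and the store, so retry [old is reloaded on failure]
    }
    return old;                                      // value before the add
}
```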
Cost of Sharing Data Between Threads… • for 25% sharing, for example, each thread executes
InterlockedExchangeAdd(GINDX(thread), 1);    // thread specific
InterlockedExchangeAdd(GINDX(thread), 1);    // thread specific
InterlockedExchangeAdd(GINDX(thread), 1);    // thread specific
InterlockedExchangeAdd(GINDX(maxThread), 1); // shared
NB: threads numbered from 0 .. maxThread-1
• use _aligned_malloc to allocate data on a cache line boundary
volatile long *g; // NB: position of volatile
g = (long*) _aligned_malloc((maxThread+1)*lineSz, lineSz); // shared global variable
• GINDX macro defined as follows #define GINDX(n) (g + n*lineSz/sizeof(long)) // index into g
Worker Function

DWORD WINAPI worker(LPVOID thread) {
    long long ops = 0;                                  // 64 bit local counter
    while (1) {
        for (int i = 0; i < NOPS / 4; i++) {            // NOPS/4 since work comprises...
            // do some work                             // 4 InterlockedExchange operations
        }
        ops += NOPS;                                    // local to thread
        if (clock() - tstart > NSECONDS*CLOCKS_PER_SEC) // NSECONDS of work?
            break;
    }
    cnt[(int) thread] = ops;                            // remember in global cnt array
    return 0;
}
TestAndSet Lock… • ALL waiting CPUs repeatedly execute an xchg instruction trying to get hold of the lock • the memory accesses made by the xchg instruction don't benefit from having a cache since the shared cache line is continually overwritten [even if the lock is a 1, it is overwritten with a 1]; this invalidates the entries in the other caches, which results in bus cycles for both the read and write parts of ALL xchg instructions [think MESI]
• ALL the xchg reads and writes will be to memory
• a write-update cache coherency protocol would allow the reads to be local cache reads [Firefly]
• the lock is overwritten even if there is NO chance of obtaining the lock
• why is there not an instruction which conditionally writes a 1 only if the value read is 0?
The PAUSE instruction improves the performance of processors supporting Hyper-Threading Technology when executing “spin-wait loops” and other routines where one thread is accessing a shared lock or semaphore in a tight polling loop. When executing a spin-wait loop, the processor can suffer a severe performance penalty when exiting the loop because it detects a possible memory order violation and flushes the core processor’s pipeline. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation and prevent the pipeline flush. In addition, the PAUSE instruction de-pipelines the spin-wait loop to prevent it from consuming execution resources excessively. (See Section 7.11.6.1, “Use the PAUSE Instruction in Spin-Wait Loops,” for more information about using the PAUSE instruction with IA-32 processors supporting Hyper-Threading Technology.)
TestAndTestAndSet Lock… • the advantage is that the test of the lock [lock == 1] is executed entirely within the
cache and the xchg instruction is only executed when the lock is known to be free and there is a chance of acquiring the lock
• the cached lock variable will be invalidated or updated when the lock is released and
only then is an attempt made to obtain the lock by executing a xchg instruction
• if the release of the lock invalidates the other shared cache lines then O(n²) [where n is the number of CPUs waiting for the lock] bus cycles will supposedly be generated [a claim quoted from the literature]
• ALL n waiting CPUs continuously read the lock [from their own local cache]; these
cache lines will be invalidated when the lock is released; subsequent reads of the lock will appear on bus which will be serialised by a typical round-robin bus arbiter and each CPU, in turn, will see the lock free
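The testAndTestAndSet idea above can be sketched portably: spin on a plain read [which hits in the local cache] and only execute the atomic exchange when the lock looks free. `TTASLock` and its members are illustrative names, assuming `std::atomic` in place of the xchg instruction.

```cpp
#include <atomic>

// Hedged sketch of a testAndTestAndSet lock
struct TTASLock {
    std::atomic<int> lock{0};                        // 0 == free, 1 == taken
    void acquire() {
        while (true) {
            // test: spin locally until the lock looks free
            while (lock.load(std::memory_order_relaxed) == 1) { /* spin in cache */ }
            // testAndSet: only now attempt the atomic exchange
            if (lock.exchange(1, std::memory_order_acquire) == 0)
                return;                              // exchange returned 0: lock acquired
        }
    }
    void release() { lock.store(0, std::memory_order_release); }
};
```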
TestAndTestAndSet Locks... • an individual CPU executes its xchg instruction but then sees the remaining CPUs executing their xchg instructions, which will invalidate its cache line, so a bus cycle has to be performed to read the lock again, i.e. O(n²)
• however, won't the bus cycles for the xchg be such that all CPUs execute them one after another [thanks to the round robin arbiter] so that a CPU's cache line is effectively invalidated only once? i.e. O(n)
• if the release of the lock updates the other caches directly then the generated bus traffic will only be of O(n)
• either way there will be enough bus activity to interfere with the process in the critical
section as well as the other processes not involved with the lock
• if the lock is held for a long time the impact is unimportant, but for short critical sections the lock will be released before the last spurt of activity has subsided resulting in continued bus saturation
TestAndSet Lock with Exponential Back Off • don't continuously try to acquire the lock, delay between attempts to acquire it:
d = 1;                                   // initialise back off delay
while (InterlockedExchange(&lock, 1)) {  // if unsuccessful…
    delay(d);                            // delay d time units
    d *= 2;                              // exponential back off
}
• testAndTestAndSet lock NOT necessary when using a back off scheme • the longer the CPU has been waiting for the lock, the longer it will have to wait before it attempts to acquire the lock again; possibility of starvation • supposed to work well in practice
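The back off scheme above can be sketched portably, with `std::atomic::exchange` standing in for InterlockedExchange and `sleep_for` for delay(); the function name and the microsecond time unit are assumptions.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Sketch of a testAndSet acquire with exponential back off
void acquireWithBackOff(std::atomic<int> *lock) {
    int d = 1;                                             // initialise back off delay
    while (lock->exchange(1, std::memory_order_acquire)) { // if unsuccessful…
        std::this_thread::sleep_for(std::chrono::microseconds(d)); // delay d time units
        d *= 2;                                            // exponential back off
    }
}
```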
Ticket Lock with Proportional Back Off

class TicketLock {
public:
    volatile long ticket;      // initialise to 0
    volatile long nowServing;  // initialise to 0
};

inline void acquire(TicketLock *lock)                           // acquire lock
{
    long myTicket = InterlockedExchangeAdd(&lock->ticket, 1);   // get ticket [atomic]
    while (myTicket != lock->nowServing)                        // if not our turn…
        delay(myTicket - lock->nowServing);                     // delay relative to…
}                                                               // position in Q

inline void release(TicketLock *lock)                           // release lock
{
    lock->nowServing++;                                         // give lock to next CPU
}                                                               // NB: not atomic
Ticket Lock with Proportional Back Off… • think of waiting in a Q in the Andrews St. tourist office, ISS computer help desk, A&E, … • deterministic • ONLY 1 atomic instruction executed per lock acquisition • FAIR, locks granted in order of request which eliminates the possibility of starvation • back off proportional to position in Q
• if the time in the critical section is constant, the delay can be calculated such that the subsequent test of lock->nowServing will just succeed
• still polls a common location [lock->nowServing] which will cause some bus traffic with an invalidate protocol
an invalidate protocol • delay not necessary with a write-update protocol [Firefly]
MCS Lock [Mellor-Crummey and Scott] • lockless queue of waiting threads • each thread has its own QNode which is linked into a Q of QNodes waiting for lock • a global variable lock points to tail of Q • acquire lock by adding a thread’s QNode [qn] to tail of Q and waiting until
qn->waiting==0 • release lock by setting qn->next->waiting=0 [if qn not at the tail of Q]
Compare and Swap [CAS]
• pseudo C version of CAS
atomic long CAS(long *a, long e, long n) // memory address, expected value, new value
{
    long r = *a;    // read contents of memory address
    if (r == e)     // compare with expected value and if equal…
        *a = n;     // update memory with new value
    return r;       // success if e returned
}
• NB: returns expected value if exchange took place
• CAS can be mapped onto the IA32/x64 compare and exchange instruction
cmpxchg mem, reg   // if (eax == mem)
                   //     ZF = 1, mem = reg
                   // else
                   //     ZF = 0, eax = mem
Compare and Swap…
• make use of the following intrinsic defined in intrin.h
long InterlockedCompareExchange(long volatile *a, long n, long e);
NB: different parameter order than the previous/normal definition of CAS
• for convenience can always define
#define CAS(a, e, n) InterlockedCompareExchange(a, n, e)
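The slide's CAS can also be expressed with C++ `std::atomic`'s compare_exchange_strong, which returns a bool and writes the observed value back into its first argument on failure; this hedged wrapper restores the (address, expected, new) convention used above.

```cpp
#include <atomic>

// CAS with the slide's semantics: returns the value read,
// so success is indicated by the expected value being returned
long CAS(std::atomic<long> *a, long e, long n) {
    long r = e;
    a->compare_exchange_strong(r, n);   // on failure r holds the current value
    return r;                           // success if e returned
}
```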
How to allocate objects aligned on a cache line • can allocate objects in their own cache line(s) to avoid false sharing • one straightforward approach is to use a template class to override new and delete
//
// derive from ALIGNEDMA for aligned memory allocation
//
template <class T>
class ALIGNEDMA {
public:
    void* operator new(size_t);    // override new
    void operator delete(void*);   // override delete
};
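One possible implementation of the ALIGNEDMA overrides, sketched with the portable `std::aligned_alloc` in place of _aligned_malloc; the 64 byte line size is an assumption.

```cpp
#include <cstdlib>
#include <cstdint>
#include <new>

constexpr std::size_t LINESZ = 64;                  // assumed cache line size

template <class T>
class ALIGNEDMA {
public:
    void *operator new(std::size_t sz) {            // override new
        sz = (sz + LINESZ - 1) / LINESZ * LINESZ;   // aligned_alloc needs sz to be a multiple of the alignment
        if (void *p = std::aligned_alloc(LINESZ, sz)) return p;
        throw std::bad_alloc();
    }
    void operator delete(void *p) { std::free(p); } // override delete
};
```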
MCS Lock acquire
• pred = InterlockedExchange(lock, qn) performed atomically (1) • think about what happens if two or more threads try to acquire lock simultaneously • if pred is NULL [previous value of lock] then at head of Q so have lock otherwise… • set qn->waiting = 1 and… • link thread’s QNode to tail of existing Q by setting pred->next = qn (2) • wait until qn->waiting == 0
MCS Lock release
• if (qn->next != NULL) set qn->waiting = 0 which passes lock to next thread in Q
• if (qn->next == NULL) use InterlockedCompareExchangePointer(lock, NULL, qn) to atomically set lock = NULL if lock == qn and return if successful [there are no more threads waiting for the lock] otherwise…
• a call to acquire() by another thread must have added a QNode between qn and lock • follow qn->next until not NULL and assign to succ which then points to next QNode in Q • set succ->waiting = 0 to pass lock to next thread [no explicit removal of QNodes from Q]
Testing Framework • create a framework to compare the performance of locked and lockless lists
• will use VC++
• source code on CS4021 web site [single source for Win32 and x64] • implement an ordered list with add(key) and remove(key) operations • create n threads which pseudo randomly add or remove items from a list • add and remove operations occur with equal probability • generate keys pseudo randomly in range 0 .. maxkey-1 • changing key range controls the length of list and also the amount of contention
between threads [less contention with longer lists]
C++ Node and List class definitions
• develop test framework for testing the performance of a list protected by different kinds of locks [CriticalSection, testAndSet, testAndTestAndSet,…]
class Node: public ALIGNEDMA<Node> {   // derive from ALIGNEDMA
public:
    int key;                           // key
    Node *next;                        // points to next node in list
};

class List: public ALIGNEDMA<List> {   // derive from ALIGNEDMA
private:
    Node *head;                        // head of list
    DECLARE();                         // macro to declare CriticalSection, testAndSet lock, …
public:
    List();                            // constructor
    ~List();                           // destructor
    int add(int key);                  // return 1 if successful
    int remove(int key);               // return 1 if successful
};
testAndSet Results… • do the results make sense? and why are they so poor?
one thread will be updating the list while all others will be trying to obtain the lock
each attempt to acquire the lock requires the execution of an xchg instruction
each xchg instruction not only reads memory but also writes a 1 to the lock [even if it's already a 1], invalidating copies of the lock in other caches [MESI protocol]
this greatly increases the bus traffic [reads and writes of the lock will be to/from memory] which significantly reduces the speed of the thread that has the lock
if a thread is pre-empted holding the lock, it will obstruct other threads from making progress [this effect is probably not too significant]
significantly reduced performance due to increased bus traffic from (1) continuously executing the xchg instruction and (2) sharing modified list nodes
Ticket Lock Results…
• idealised diagram of what is happening • to simplify diagram assume 4 cores and 8 threads
• threads run for an OS time quantum • need to wait for quantum to end before tickets 4, 8, … start to run • hence 4 tickets/updates per OS time quantum • what is the OS time quantum?
Lockless List Implementation using CAS
• if 2 threads try to add nodes at the same position
CAS(&a->next, b, c) // assume this CAS executes first and succeeds…
CAS(&a->next, b, d) // consequently this CAS will fail
• first CAS executed succeeds, second fails as a->next != b • on failure need to RETRY operation • search AGAIN for insertion point and, if found, re-execute CAS [costly if list long]
Using CAS to remove nodes • search for node and then execute CAS with correct parameters • consider 2 threads removing non-adjacent nodes [disjoint-access parallelism]
CAS(&a->next, b, c) // both will succeed
CAS(&c->next, d, 0) // both will succeed
A Pragmatic Implementation of Non-Blocking Linked Lists Tim Harris [2001] • two step removal [consider remove(20)] • node atomically marked [logically deleted] before updating pointer using CAS
• marked node indicated by an odd address in next field [possible as nodes normally
aligned on 4 byte boundaries]
is_marked_reference(r)      // returns 1 if marked
get_marked_reference(r)     // convert to marked reference
get_unmarked_reference(r)   // convert to unmarked reference
• tests, sets and clears LSB of address [which is stored in next field]
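The three helpers can be sketched as below: the mark is the LSB of the pointer stored in a node's next field, legal because nodes are aligned on at least 4 byte boundaries. `Node` here is a minimal stand-in for the list node type.

```cpp
#include <cstdint>

struct Node { int key; Node *next; };   // minimal stand-in for the list node

// the mark lives in the LSB of the address stored in a next field
int   is_marked_reference(Node *r)    { return (std::uintptr_t)r & 1; }
Node *get_marked_reference(Node *r)   { return (Node*)((std::uintptr_t)r | 1); }
Node *get_unmarked_reference(Node *r) { return (Node*)((std::uintptr_t)r & ~(std::uintptr_t)1); }
```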
remove [delete]… • CAS to remove node will fail
• since node is logically deleted there is no point in calling delete again…. • BUT calling search again will remove any marked node(s) immediately before key
• NOT calling search would simply mean that the marked node(s) would remain in the list
until another node is inserted after 20 [in this example state]
• how could the list get into the following state?
find [search] • step 1: iterates along the list to find the first unmarked node >= key; this is the right node;
the left node refers to the previous unmarked node found
• step 2: if the left node is the immediate predecessor of the right node, the search returns [returns with no marked nodes between left and right]
• step 3: use CAS to remove marked node(s) between the left and right nodes; on failure the search is retried
• the optimisation checks if the right node has become marked [logically deleted] and performs the search again rather than returning and then failing in add or remove
A Pragmatic Implementation of Non-Blocking Linked Lists… • what is NOT said! • insert allocates a new node even if insertion fails • NO code for freeing or re-using nodes • nodes never become unmarked • avoids ABA problem by not re-using nodes which also… • avoids problem of threads traversing list using pointers to freed nodes • assumes nodes are garbage collected in a safe way [not an easy problem to solve] • ONLY a partial solution without memory management [perhaps the harder problem]
Memory Management
• use garbage collection [Java, but not yet in C++] • reference counting • deferred freeing of nodes [see end of section 6 in Harris paper]
each node contains an additional link field so that it can be added to a per thread retireQ and reuseQ
each thread takes a copy of a global timer [e.g. clock()] before starting an add or remove operation and saves it in a global startOp array [each thread startOp stored in its own cache line for speed]
add and remove operations add any freed nodes to the retireQ and set the key field to the startOp of the thread
add and remove operations, before they exit, can traverse the retireQ and transfer
nodes to a reuseQ if their startOp is less than the minimum startOp of any thread since no thread can still have a reference to the node
Memory Management
• add retired nodes to end of retireQ • the minimum thread startOp time is 129 • can transfer all nodes in retireQ with startOp < 129 to reuseQ [first three nodes] • allocate nodes from per thread reuseQ and only call new/malloc if empty
Hazard Pointers
• Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects Maged Michael (2004) IEEE Transactions on Parallel and Distributed Systems 15 (8): 491–504
• in terms of an ordered linked list, there are two active pointers as the list is traversed during a find operation [number will be different for other algorithms]
• these active pointers called hazard pointers [used to save cur and next p499 Fig 9]
• idea is not to reuse/delete/free nodes if they have hazard pointers pointing to them
Hazard Pointers… • maintain a global array of per thread hazard pointers [each thread saving its hazard
pointers in its own cache line for speed]
• use a per thread retireQ and reuseQ as per previous example
• retire node by adding to retireQ and when length >= 2*nthreads*HAZARDSPERTHREAD
make a local copy of all hazard pointers in global array [allocate a local array]
sort hazard pointers in local array [optional]
for each node on retireQ, if node address doesn’t match any hazard pointer in local array transfer to reuseQ
• again need to allocate nodes from per thread reuseQ and only call new/malloc if empty
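The retire scan described above can be sketched as follows: take a local copy of the global hazard pointer array, sort it, then move every retired node not matched by a hazard pointer to the reuse list. The container types and the name `scan` are assumptions, not from the paper.

```cpp
#include <algorithm>
#include <vector>

// Illustrative sketch of the hazard pointer retire scan
void scan(const std::vector<void*> &hazard,
          std::vector<void*> &retireQ, std::vector<void*> &reuseQ) {
    std::vector<void*> hp(hazard);                  // local copy of hazard pointers
    std::sort(hp.begin(), hp.end());                // optional: enables binary search
    std::vector<void*> still;                       // nodes that must stay retired
    for (void *n : retireQ) {
        if (std::binary_search(hp.begin(), hp.end(), n))
            still.push_back(n);                     // a thread may still hold a reference
        else
            reuseQ.push_back(n);                    // safe to reuse
    }
    retireQ.swap(still);
}
```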
Transactional Memory • locks hard to manage effectively
pessimistic – inhibits parallelism
priority inversion – lower priority thread pre-empted while holding a lock needed by a higher priority thread
convoying – thread holding a lock is descheduled and other threads queue up unable to progress
deadlock – can be difficult to avoid in complex systems
• atomic primitives such as CAS operate on one word at a time resulting in complex algorithms
• MCAS [multiple compare and swap] is of some help
• no hardware implementation
• a list of addresses, expected values and new values
• can be implemented using CAS
Conflict Detection Example • in both sequences, eager detection would
detect a conflict at Read X because the other transaction has already written to X
• (a)
lazy conflict detection would detect a conflict in T1 because T2 commits first implying that T1 should have used the result of the T2 Write X operation
• (b)
lazy conflict detection would allow both T1 and T2 to commit because T1 commits first and its Read X need not use the result of the T2 Write X
Hardware Transactional Memory
• Transactional Memory: Architectural Support of Lock-Free Data Structures Maurice Herlihy and J. Eliot B. Moss Proceedings of the 20th Annual International Symposium on Computer Architecture 1993
• motivations
lock-free – operations on a data structure will not be prevented if one process/thread stalls mid execution
avoids common problems with mutual exclusion
outperforms best known locking techniques
Hardware Transaction Memory • basic idea is that any cache coherency protocol capable of detecting accessibility
conflicts can also detect transaction conflicts at no extra cost
• instructions added to the CPU instruction set for handling transactions – would be automatically generated by a compiler
• Load transactional [LT] reads value from a shared memory location into transaction cache [and CPU register]
• Load transactional exclusive [LTX] reads a value from a shared memory location into the transaction cache and marks it as RESERVED [use LTX if location likely to be updated]
• Store transactional [ST] tentatively writes a value to a copy of the data in the transaction cache which does NOT become visible to other processors until the transaction successfully commits
Hardware Transactional Memory…
• commit [COMMIT] attempts to make a transaction’s tentative changes permanent and visible to other caches succeeds ONLY if no other transaction has written to any location in the
transaction's read or write set [and no other transaction has read any location in this transaction’s write set]
on failure all tentative changes to the write set are discarded
returns success or failure
• Abort [ABORT] discards all updates to the write set • Validate [VALIDATE] tests the current transaction’s status
returns true if the transaction has not aborted [thus far] returns false if the current transaction has aborted, discards tentative updates
• CPU also keeps a TACTIVE flag indicating a transaction is in progress and a TSTATUS flag
indicating if the transaction is active or aborted; VALIDATE returns TSTATUS
Transaction Cache States
• transaction cache lines have a write-once state AND a transaction state
• a memory location cannot be in a CPU’s normal cache and transaction cache simultaneously [exclusive caches]
• transactional cache states
EMPTY     contains no data [invalid]
NORMAL    contains committed data
XCOMMIT   [discard on commit] contains original value read from “memory”
XABORT    [discard on abort] holds the tentative writes made to the cache line during a transaction [always paired with an XCOMMIT cache line]
• if a transaction commits successfully, the XCOMMIT lines are set to EMPTY and the XABORT lines switch to NORMAL
• must occur atomically using appropriate hardware support so ALL changes become visible “instantaneously”
• a compiler generated code sequence for a transaction:

tstart: ltx r1, a0      // know a0 will be modified
        ltx r2, a1      // know a1 will be modified
        add r1, 3, r1   // add 3
        sub r2, 3, r2   // sub 3
        st  r1, a0      // tentative store
        st  r2, a1      // tentative store
        commit          // commit
        jeq tstart      // retry on failure
• could add validate instructions to test for abort status earlier