EE202C Networked Embedded Systems Design Lecture 12 Multiprocessor Synchronization

Apr 05, 2018


EE202C LECTURE 12: EMBEDDED PLATFORMS AND OPERATING SYSTEMS

MULTIPROCESSOR SYNCHRONIZATION

TABLE OF CONTENTS

1. MULTIPROCESSOR / MULTICORE SYSTEMS: EMBEDDED AND MOBILE PLATFORM PROCESSORS
2. SYMMETRIC MULTIPROCESSING (SMP) KERNEL THREADING
3. SMP KERNEL THREAD CREATION AND MANAGEMENT
4. UNCOORDINATED KERNEL THREAD EXAMPLE
5. LOCK CLASSES, APPLICATIONS, AND IMPLEMENTATION
6. PROCESSOR SUPPORT FOR SYNCHRONIZATION
7. ATOMIC PRIMITIVES FOR MULTIWORD OPERATIONS
8. ARM ATOMIC OPERATIONS
9. ATOMIC PRIMITIVES FOR BIT OPERATIONS: X86
10. ATOMIC PRIMITIVES FOR BIT OPERATIONS: ARM
11. SPIN LOCK: X86
    - Implementation of spinlocks: Setting Locks
    - Spinlock Energy Optimization
    - Controlling synchronization and interrupts
    - Lock operations and load management
    - Implementation of spinlocks: Releasing Locks
12. SPINLOCK SYNCHRONIZED KERNEL THREAD EXAMPLE
13. SPIN LOCK: ARM
14. RW SPINLOCKS
    - Writers (in ARM architecture)
    - Readers (in ARM architecture)
    - Writers and Trylock
15. KERNEL SEMAPHORES
    - Background
    - Implementation
    - Operations
16. SEMAPHORE SYNCHRONIZED KERNEL THREAD EXAMPLE
17. RW SEMAPHORES
18. COMPLETION VARIABLES
19. SEQ LOCK
20. SYNCHRONIZATION AGAINST OUT OF ORDER EXECUTION


1. MULTIPROCESSOR / MULTICORE SYSTEMS: EMBEDDED AND MOBILE PLATFORM PROCESSORS

1. Intel Atom Architecture
   a. Dual Die Processors
      i. Hyperthreading architecture
         1. Shared instruction cache
         2. Shared data cache
         3. Parallel decode and issue of instructions
         4. Parallel register file
         5. Parallel integer and floating point execution units
   b. Pipeline
      i. 16 stages
   c. Two cache levels
      i. L1 compact cache
         1. 32KB I-Cache
         2. 24KB D-Cache
      ii. 512KB L2 cache per core
         1. Dual core architecture provides 1MB total
         2. Each L2 cache shared with both threads
      iii. Includes prefetch units that detect stride lengths and optimize prefetch operations


2. SYMMETRIC MULTIPROCESSING (SMP) KERNEL THREADING

1. SMP architecture benefits
   a. Parallel processing energy and performance benefits
2. Synchronization challenges
   a. Many examples in kernel and applications for independent processes and threads
   b. However, many of the most important applications introduce constraints of data dependence
   c. Most important applications (for example database systems) exhibit severe reduction in throughput as thread count rises above several threads per processor
3. Synchronization requirements
   a. Lock time resolution
      i. Lock acquisition and release may require high time resolution
      ii. Lock testing may require high time resolution
         1. Encouraging a polling method
      iii. Reduced time resolution requirements must be exploited
         1. Sleep timing upon lock detection greater than one clock tick permits sleeping
         2. Sleep timing less than one clock period requires busy wait
   b. Lock footprint
      i. Lock fetch, decode cost
      ii. Lock cache footprint cost
   c. Lock optimization methods
      i. Locking for readers (consumers) and writers (suppliers)
         1. Read lock
         2. Write lock
         3. Read and Write lock


4. Architecture challenges
   a. Managing thread creation and removal
   b. Lock integration with interrupt management
   c. Lock integration with preemption
   d. Lock integration with computational load systems
      i. Management of load bursts due to lock release
   e. Energy efficient lock processor resources


3. SMP KERNEL THREAD CREATION AND MANAGEMENT

Kernel thread descriptors

   o Create information data structure
      - This is managed for the thread by the keventd daemon
      - The started and result fields are populated by the keventd daemon at runtime
      - The result task structure will include the name and arguments associated with the thread function

struct kthread_create_info {
        int (*threadfn)(void *data);
        void *data;
        struct completion started;
        struct task_struct *result;
        struct completion done;
};

   o Stop information data structure
      - Done state written by keventd

    struct kthread_stop_info{

    struct task_struct *k;

    int err;

    struct completion done;

    };


Kernel thread creation

struct task_struct *kthread_create(int (*threadfn)(void *data), void *data,
                                   const char namefmt[], ...)
/* Note: ... indicates a list of variable arguments referenced
 * by the name format (namefmt) field. These may include, for
 * example, the current pid value. These arguments will be displayed
 * in process status information.
 */

    {

    struct kthread_create_info create;

    DECLARE_WORK(work, keventd_create_kthread, &create);

    create.threadfn = threadfn; /* function argument */

    create.data = data; /* function data */

    init_completion(&create.started);

    init_completion(&create.done);

    /*

    * Start the workqueue system below

    */

    if (!helper_wq)

    work.func(work.data);

    else {

    queue_work(helper_wq, &work);

    wait_for_completion(&create.done);

    }

    if (!IS_ERR(create.result)) {

/* following code prepares the process table string */
va_list args;

    va_start(args, namefmt);

    vsnprintf(create.result->comm, sizeof(create.result->comm),

    namefmt, args);

    va_end(args);

    }

    return create.result;

    }

    EXPORT_SYMBOL(kthread_create);


Kernel thread binding to CPU
   o Called after creation and before wakeup

    void kthread_bind(struct task_struct *k, unsigned int cpu)

    {

wait_task_inactive(k); /* wait for task to be unscheduled */
set_task_cpu(k, cpu);

    k->cpus_allowed = cpumask_of_cpu(cpu);

    }

    EXPORT_SYMBOL(kthread_bind);

    static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)

    {

    task_thread_info(p)->cpu = cpu;

    }

Kernel thread wakeup (enqueue task)

int fastcall wake_up_process(task_t *p)

    {

    /* places stopped or sleeping task on run queue */

    return try_to_wake_up(p, TASK_STOPPED | TASK_TRACED |

    TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0);

    }

    EXPORT_SYMBOL(wake_up_process);

    Kernel thread checks for stop status that may be applied by another thread

   o Kernel thread calls kthread_should_stop()

int kthread_should_stop(void)

    {

    return (kthread_stop_info.k == current);

    }

    EXPORT_SYMBOL(kthread_should_stop);
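
For reference, a minimal sketch of a thread function built around this check is shown below; the name worker_fn, the one second sleep, and the omitted work are illustrative assumptions, not part of the lecture code.

/* Sketch only: a kernel thread body that runs until another thread
 * requests a stop via kthread_stop(). Names are illustrative.
 */
static int worker_fn(void *data)
{
        while (!kthread_should_stop()) {
                /* perform one unit of work here */
                set_current_state(TASK_UNINTERRUPTIBLE);
                schedule_timeout(HZ);   /* sleep about one second per cycle */
        }
        return 0;                       /* becomes the return value of kthread_stop() */
}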



A kernel thread may apply stop state for another thread

int kthread_stop(struct task_struct *k)

    {

    return kthread_stop_sem(k, NULL);

    }

    EXPORT_SYMBOL(kthread_stop);

The implementation of kthread_stop_sem is important
   o The mutex_lock ensures that only one CPU may apply the stop condition
   o Also, the thread must receive a signal (as required) in order to initiate its completion

    int kthread_stop_sem(struct task_struct *k, struct semaphore *s)

    {

    int ret;

mutex_lock(&kthread_stop_lock);
get_task_struct(k);

    init_completion(&kthread_stop_info.done);

    smp_wmb();

    kthread_stop_info.k = k; /* sets kthread pointer indicating stop */

    if (s)

    up(s); /* release the semaphore */

    else

    wake_up_process(k); /* start thread to enable completion */

    put_task_struct(k); /* atomic decrement of task usage */

    wait_for_completion(&kthread_stop_info.done);

    kthread_stop_info.k = NULL;

    ret = kthread_stop_info.err;

mutex_unlock(&kthread_stop_lock);
return ret;

    }

    EXPORT_SYMBOL(kthread_stop_sem);
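
The stop handshake above is one concrete use of completion variables (revisited in Section 18). As a hedged, stripped-down sketch of the same wait/complete pairing, with hypothetical names:

/* Sketch of the completion handshake pattern (hypothetical names). */
static DECLARE_COMPLETION(work_done);           /* statically initialized completion */

static int worker(void *data)
{
        /* ... perform the work ... */
        complete(&work_done);                   /* signal the waiting thread */
        return 0;
}

static void wait_for_worker(void)
{
        wait_for_completion(&work_done);        /* block until complete() runs */
}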


4. UNCOORDINATED KERNEL THREAD EXAMPLE

    /** kthread_mod_uncoord.c

    *

    * Demonstration of multiple kernel thread

    * creation and binding on multicore system

    *

    */

/* Header names were stripped in the source scan; the following set is a
 * likely assumption for a module using kthreads, printk, and scheduling. */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/delay.h>

    /* array of pointers to thread task structures */

    #define MAX_CPU 16

    #define LOOP_MAX 10

    #define BASE_PERIOD 200

    #define INCREMENTAL_PERIOD 330

    #define WAKE_UP_DELAY 0

    static struct task_struct *kthread_cycle_[MAX_CPU];

    static int kthread_cycle_state = 0;

    static int num_threads;

    static int cycle_count = 0;

    static int cycle(void *thread_data)

    {

    int delay, residual_delay;

    int this_cpu;

    int loops;

    delay = BASE_PERIOD;

    for (loops = 0; loops < LOOP_MAX; loops++) {

    this_cpu = get_cpu();

    delay = delay + this_cpu*INCREMENTAL_PERIOD;

printk("kthread_mod: no lock pid %i cpu %i delay %i count %i\n",
       current->pid, this_cpu, delay, cycle_count);

    cycle_count++;

    set_current_state(TASK_UNINTERRUPTIBLE);


    residual_delay = schedule_timeout(delay);

    cycle_count--;

printk("kthread_mod: no lock pid %i cpu %i delay %i count %i\n",
       current->pid, this_cpu, delay, cycle_count);

    }

    kthread_cycle_state--;

    /*

    * exit loop poll stop state with sleep cycle

    */

    while (!kthread_should_stop()) {

    delay = 1 * HZ;

    set_current_state(TASK_UNINTERRUPTIBLE);

    residual_delay = schedule_timeout(delay); /* prepare to yield */

printk("kthread_mod: wait for stop pid %i cpu %i \n",
       current->pid, this_cpu);
}
printk("kthread_mod: cycle function: stop state detected for cpu %i\n",
       this_cpu);

    return 0;

    }

    int init_module(void)

    {

    int cpu = 0;

    int count;

    int this_cpu;

    int num_cpu;

    int delay_val;

    int *kthread_arg = 0;

    int residual_delay;

    const char thread_name[] = "cycle_th";

    const char name_format[] = "%s/%d"; /* format name and cpu id */

    num_threads = 0;

    num_cpu = num_online_cpus();

    printk("kthread_mod: number of operating processors: %i\n",

    num_cpu);

this_cpu = get_cpu();
printk("thread_mod: kthread_mod init: current task is %i on cpu %i \n",
       current->pid, this_cpu);

    for (count = 0; count < num_cpu; count++) {

    cpu = count;

    num_threads++;


    kthread_cycle_state++;

    delay_val = WAKE_UP_DELAY;

    set_current_state(TASK_UNINTERRUPTIBLE); /* prepare to yield */

    residual_delay = schedule_timeout(delay_val);

    kthread_cycle_[count]=kthread_create(cycle, (void *) kthread_arg,

    thread_name, name_format, cpu);

    if (kthread_cycle_[count] == NULL) {

    printk("kthread_mod: thread creation error\n");

    }

    kthread_bind(kthread_cycle_[count], cpu); /* sets cpu in task */

    /* struct */

    wake_up_process(kthread_cycle_[count]);

    this_cpu = get_cpu();

printk("kthread_mod: execution after wake_up_process, current task "
       "pid %i on cpu %i\n", current->pid, this_cpu);
printk("kthread_mod: current task is %i on cpu %i creating and "
       "waking next thread after delay of 1s \n", current->pid,
       this_cpu);

    }

    return 0;

    }

    void cleanup_module(void)

    {

    int ret;

    int count;

    int this_cpu;

    /*

    * determine if module removal terminated thread creation cycle early

    *

    * also must determine if cpu is suspended

    */

    printk("kthread_mod: number of threads to stop %i and active %i\n",

    num_threads, kthread_cycle_state);

    this_cpu = get_cpu();

printk("kthread_mod: kthread_stop requests being applied by task %i on "
       "cpu %i \n", current->pid, this_cpu);

    for (count = 0; count < num_threads; count++) {

ret = kthread_stop(kthread_cycle_[count]); /* set done in completion field */
printk("kthread_mod: kthread_stop request for cpu count returned "
       "with value %i \n", ret);

    }

    }

    MODULE_LICENSE("GPL");


Start up

[60937.707450] kthread_mod: number of operating processors: 4
[60937.707486] thread_mod: kthread_mod init: current task is 16243 on cpu 3
[60937.709822] kthread_mod: execution after wake_up_process, current task pid 16243 on cpu 3
[60937.709841] kthread_mod: no lock pid 16244 cpu 0 delay 200 count 0
[60937.709919] kthread_mod: current task is 16243 on cpu 3 creating and waking next thread after delay of 1s
[60937.713666] kthread_mod: execution after wake_up_process, current task pid 16243 on cpu 3
[60937.713678] kthread_mod: no lock pid 16245 cpu 1 delay 530 count 1
[60937.713738] kthread_mod: current task is 16243 on cpu 3 creating and waking next thread after delay of 1s
[60937.717799] kthread_mod: execution after wake_up_process, current task pid 16243 on cpu 3
[60937.717815] kthread_mod: no lock pid 16246 cpu 2 delay 860 count 2
[60937.717893] kthread_mod: current task is 16243 on cpu 3 creating and waking next thread after delay of 1s
[60937.721661] kthread_mod: execution after wake_up_process, current task pid 16243 on cpu 3
[60937.721712] kthread_mod: current task is 16243 on cpu 3 creating and waking next thread after delay of 1s
[60937.721950] kthread_mod: no lock pid 16247 cpu 3 delay 1190 count 3
[60938.508041] kthread_mod: no lock pid 16244 cpu 0 delay 200 count 3
[60938.508084] kthread_mod: no lock pid 16244 cpu 0 delay 200 count 3
[60939.308039] kthread_mod: no lock pid 16244 cpu 0 delay 200 count 3
[60939.832050] kthread_mod: no lock pid 16245 cpu 1 delay 530 count 2
[60939.832086] kthread_mod: no lock pid 16245 cpu 1 delay 860 count 2
[60940.308037] kthread_mod: wait for stop pid 16244 cpu 0
[60941.156046] kthread_mod: no lock pid 16246 cpu 2 delay 860 count 2
[60941.156082] kthread_mod: no lock pid 16246 cpu 2 delay 1520 count 2
[60941.308064] kthread_mod: wait for stop pid 16244 cpu 0
[60942.308038] kthread_mod: wait for stop pid 16244 cpu 0
[60942.480041] kthread_mod: no lock pid 16247 cpu 3 delay 1190 count 2
[60942.480074] kthread_mod: no lock pid 16247 cpu 3 delay 2180 count 2
[60943.272042] kthread_mod: no lock pid 16245 cpu 1 delay 860 count 2

Note the lack of coordination above.

Completion and removal phase

[61085.496049] kthread_mod: wait for stop pid 16247 cpu 3
[61086.236050] kthread_mod: wait for stop pid 16246 cpu 2
[61086.276046] kthread_mod: wait for stop pid 16245 cpu 1
[61086.308040] kthread_mod: wait for stop pid 16244 cpu 0
[61086.496039] kthread_mod: wait for stop pid 16247 cpu 3
[61087.236041] kthread_mod: wait for stop pid 16246 cpu 2
[61087.276048] kthread_mod: wait for stop pid 16245 cpu 1
[61087.308040] kthread_mod: wait for stop pid 16244 cpu 0
[61150.308036] kthread_mod: wait for stop pid 16244 cpu 0
[61150.720049] kthread_mod: wait for stop pid 16247 cpu 3
[61151.134948] kthread_mod: number of threads to stop 4 and active 0


[61151.134984] kthread_mod: kthread_stop requests being applied by task 16332 on cpu 0
[61151.135024] kthread_mod: wait for stop pid 16244 cpu 0
[61151.135049] kthread_mod: cycle function: stop state detected for cpu 0
[61151.135118] kthread_mod: kthread_stop request for cpu count returned with value 0
[61151.135171] kthread_mod: wait for stop pid 16245 cpu 1
[61151.135220] kthread_mod: cycle function: stop state detected for cpu 1
[61151.135267] kthread_mod: kthread_stop request for cpu count returned with value 0
[61151.135331] kthread_mod: wait for stop pid 16246 cpu 2
[61151.135357] kthread_mod: cycle function: stop state detected for cpu 2
[61151.135398] kthread_mod: kthread_stop request for cpu count returned with value 0
[61151.135456] kthread_mod: wait for stop pid 16247 cpu 3
[61151.135499] kthread_mod: cycle function: stop state detected for cpu 3
[61151.135541] kthread_mod: kthread_stop request for cpu count returned with value 0

Computing load: top -d 0.1
   o Upon start of top execution, enter 1 for the per-CPU display

Cpu0 : 0.0%us,  9.1%sy, 0.0%ni, 90.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us,  0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.0%us,  0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.0%us,  0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

Note that one task presents a user space task load while all others are sleeping (note the presence of 100% in the idle state).


5. LOCK CLASSES, APPLICATIONS, AND IMPLEMENTATION

Synchronization between tasks is critical to multiprocess and multiprocessor systems
   o A requirement exists for ensuring that operations occur only according to design constraints and are not subject to race conditions

Synchronized access to a resource may be managed by controlling access to the code segment using a variable that may implement a lock
   o Enables a form of communication between processes

Hierarchy
   o Spinlocks
      - Fast acquisition and release
      - Resource intensive for extended lock times
   o RW Spinlocks
      - High efficiency lock favoring readers
      - Often read but rarely written
      - Multiple readers
      - One writer
   o Kernel Semaphore
      - Complex implementation
      - Increased latency
      - Efficient for long delays
   o RW Semaphore
      - Semaphore attributes with reader/writer resolution


   o Seqlock
      - High efficiency lock that favors writers
   o Completion Variables
      - Synchronization against out-of-order execution

Applications
   o List manipulation (memory, tasks)
   o Timer interrupts
   o Interrupt service
   o System call
   o Scheduler operations

Implementation
   o Relies on processor hardware
   o Important optimizations for performance
   o Recent advances in optimization for energy


6. PROCESSOR SUPPORT FOR SYNCHRONIZATION

Implementation of reliable synchronization methods requires the support of processor hardware
   o The presence of unscheduled interrupts implies that any sequence of control is uncertain

Processor architectures may enable the implementation of atomic operations
   o Atomic operations complete without interruption of the sequence of control under all circumstances
   o An example is the increment of a memory register
   o For this to be atomic, the fetch, decode, fetch of operand from memory, increment, and write back must occur contiguously
   o The arrival of an interrupt must not induce an interruption of this sequence of control

A class of Intel IA-32 instructions is always atomic. These include:
   o Byte length read or write from memory
   o 32 bit aligned read or write from memory of a 32 or 64 bit word
   o 64 bit aligned read or write from memory of a 128 bit word
   o Reading or writing to cache
      - The cache is accessible as cache lines of 32 bytes
      - An unaligned read or write falling within this limit will be atomic
   o Memory management
      - Updating segment registers
      - Updating page tables
   o Interrupts
      - The data bus is locked after an interrupt, only allowing a selected APIC to write

Other operations are not atomic
   o Thus other methods must apply


   o Here, one processor may acquire and lock the address/data bus, preventing any other processor (or device) from accessing memory

Assert LOCK prefix
   o Instructions are listed with the identifier lock prepending the instruction
      - The assembler will ensure that object code includes a bus lock operation during execution
      - Will add an opcode modifier 0xF0 to the instruction
      - Only one processor may access memory during the lock

7. ATOMIC PRIMITIVES FOR MULTIWORD OPERATIONS

For IA-32, Linux atomic operations are defined in /include/asm-i386/atomic.h

#ifdef CONFIG_SMP

    #define LOCK "lock ; "

    #else

    #define LOCK ""

    #endif

First, there is an atomic data type

typedef struct { volatile int counter; } atomic_t;

   o One data member: counter

The operations are defined as static inline
   o Standalone object code for the functions may be created by the compiler, if required

Examples of atomic increment and decrement

static __inline__ void atomic_inc(atomic_t *v)
{
        __asm__ __volatile__(

    LOCK "incl %0"

    :"=m" (v->counter)

    :"m" (v->counter));

    }


    static __inline__ void atomic_dec(atomic_t *v)

    {

    __asm__ __volatile__(

        LOCK "decl %0"
        :"=m" (v->counter)
        :"m" (v->counter));
}

   o Atomic add of i to atomic type v
      - The "ir" constraint indicates that an immediate value or register is to be assigned by the compiler to the integer i

    static __inline__ void atomic_add(int i, atomic_t *v)

    {

    __asm__ __volatile__(

    LOCK "addl %1,%0"

    :"=m" (v->counter)

    :"ir" (i), "m" (v->counter));

    }

   o Atomic subtract of i from atomic type v

static __inline__ void atomic_sub(int i, atomic_t *v)

    {

    __asm__ __volatile__(

    LOCK "subl %1,%0"

    :"=m" (v->counter)

    :"ir" (i), "m" (v->counter)); // i indicates 32b for reg

    }

   o Since the variable in question is actually a data structure member, it must be accessed via a read operation

    #define atomic_read(v) ((v)->counter)

   o atomic_set must be used to write
   o This sets the value of v to that of the integer i

    #define atomic_set(v,i) (((v)->counter) = (i))
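
As a usage illustration (not from the lecture code), atomic_t is commonly combined with these primitives to build a lock-free reference counter; the names below are assumptions:

/* Sketch: atomic_t as a simple reference counter (illustrative names). */
static atomic_t refcount = ATOMIC_INIT(1);      /* one initial reference */

static void get_ref(void)
{
        atomic_inc(&refcount);                  /* atomic increment, no lock required */
}

static void put_ref(void)
{
        if (atomic_dec_and_test(&refcount))     /* atomic decrement; true when it reaches zero */
                printk("last reference dropped\n");
}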


8. ARM ATOMIC OPERATIONS

The ARMv6 architecture implements atomic operations using a unique approach
   o The lock method differs in that memory bus locking is not applied

ARM atomic instructions
   o Read operations are inherently atomic
   o Write operations must be protected

Here is an example of atomic_set
   o This sets v equal to i
   o Its functionality as a kernel library function is the same as its i386 counterpart
   o However, due to processor differences between IA-32 and ARM, its underlying implementation is quite different

The instruction Load Exclusive, LDREX R1, [R2], is implemented on ARM
   o This loads R1 with the contents of the memory register addressed by the contents of R2
   o Then, this initializes a monitor
      - The monitor observes any write action on the address-data bus that may occur on the 32b memory block pointed to by the contents of R2
      - This write action may occur due to the operation of another CPU that shares the memory space and address-data bus
      - The occurrence of a write action can then be detected subsequently by STREX

The instruction Store Exclusive, STREX R1, R2, [R3]
   o Stores R2 into the memory register addressed by R3
   o If the write is successful, in the sense that the previously initialized monitor shows no intervening writes, then the data pointed to by [R3] has been written atomically
   o A successful write is returned as a zero value in R1

If a failure is detected, this function continues to attempt to initialize a monitor, store, and verify. An example is atomic_set() for ARM:


   o First read the value of v->counter (sets the monitor)
      - This is a guard instruction
      - The value of counter is not needed
   o Then start the store operation with STREX
   o Check the monitor
   o Loop until success is detected

Note this sequence is inserted inline in the code, not called as a function
   o By declaring it static inline, this code can be included in any kernel function

static inline void atomic_set(atomic_t *v, int i)
{
        unsigned long tmp;

        __asm__ __volatile__("@ atomic_set\n"
"1:     ldrex   %0, [%1]\n"     /* tmp receives the contents at the memory     */
                                /* address containing counter; this            */
                                /* initializes the monitor                     */
"       strex   %0, %2, [%1]\n" /* i (third on the arg list) is stored into    */
                                /* the memory register at the address of       */
                                /* v->counter; %0 receives the success flag    */
"       teq     %0, #0\n"       /* test if reg %0 is cleared (store succeeded) */
"       bne     1b"             /* if not successful, branch back to label 1   */
        : "=&r" (tmp)           /* & requires separation of output and input   */
                                /* register choices by the compiler            */
        : "r" (&v->counter), "r" (i)
        : "cc");                /* sequence may have modified the cpsr; the    */
                                /* compiler must ensure the cpsr is protected  */
}
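
The same LDREX/STREX pattern extends to read-modify-write primitives. Below is a sketch modeled on the mainline ARMv6 atomic_add(): the exclusive load arms the monitor, the add is performed in a register, and the exclusive store succeeds (writes 0 to the status register) only if no other observer wrote the location in the meantime; otherwise the loop retries. Exact register constraints may differ slightly from the kernel source.

static inline void atomic_add(int i, atomic_t *v)
{
        unsigned long tmp;
        int result;

        __asm__ __volatile__("@ atomic_add\n"
"1:     ldrex   %0, [%2]\n"     /* result = v->counter, monitor armed        */
"       add     %0, %0, %3\n"   /* result += i                               */
"       strex   %1, %0, [%2]\n" /* try exclusive store; tmp = 0 on success   */
"       teq     %1, #0\n"       /* did another writer intervene?             */
"       bne     1b"             /* yes: retry the whole sequence             */
        : "=&r" (result), "=&r" (tmp)
        : "r" (&v->counter), "Ir" (i)
        : "cc");
}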


9. ATOMIC PRIMITIVES FOR BIT OPERATIONS: X86

Atomic bit manipulations are defined in include/asm-i386/bitops.h

Aside:
   o Note the use of the volatile type qualifier
   o Consider an example where it is desired to send two words in succession to a memory mapped I/O port
   o volatile informs the compiler that the memory is volatile and must be read from main memory at each referencing instruction
      - Avoids a reference to cache that would create an error in this case, since the memory corresponds to an I/O port
      - Prevents any compiler optimization that may eliminate a read
   o Suppresses code optimization that may appear when these functions are inlined
   o Example: consider the sequence

volatile unsigned long *output_port = memory_mapped_interface_address;

    *output_port = CONTROL_WORD_1; /* set high */

    *output_port = CONTROL_WORD_2; /* set low */

Note that the first write would otherwise be eliminated by the compiler

   o Clear a bit in memory

    static inline void clear_bit(int nr, volatile unsigned long * addr)

    {

    __asm__ __volatile__( LOCK_PREFIX

    "btrl %1,%0" // Bit Test and Reset Long

                                        // %1 is the bit offset, nr
                                        // %0 is the addressed word; the bit
                                        // indicated by nr is cleared
        :"=m" (ADDR)                    // output operand
        :"Ir" (nr));                    // I identifies a constant in the
                                        // range of 0 to 31

    }


    o Change a bit in memory

    static inline void change_bit(int nr, volatile unsigned long * addr)

    {

    __asm__ __volatile__( LOCK_PREFIX

    "btcl %1,%0":"=m" (ADDR)

    :"Ir" (nr));

    }

   o Test and Change bit in memory

static inline int test_and_change_bit(int nr, volatile unsigned long *addr);

   o Test and Clear bit in memory

static inline int test_and_clear_bit(int nr, volatile unsigned long *addr);
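
As a hedged usage sketch (kernel context assumed; the names are illustrative), the test-and-set variant of these primitives can serve as a one-bit busy flag:

/* Sketch: bit 0 of device_flags used as a busy flag (illustrative names). */
static unsigned long device_flags;

static int try_claim_device(void)
{
        if (test_and_set_bit(0, &device_flags)) /* atomically set bit 0, return old value */
                return -EBUSY;                  /* bit was already set: someone owns it */
        return 0;                               /* bit was clear: we now own the flag */
}

static void release_device(void)
{
        clear_bit(0, &device_flags);            /* atomic clear of the busy flag */
}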


10. ATOMIC PRIMITIVES FOR BIT OPERATIONS: ARM

Atomic bit manipulations are defined in include/asm-arm/bitops.h

   o Clear a bit in memory
      - Operates on the word referenced by pointer p
      - If bit > 31, then the pointer is advanced to a following 32b word
      - The low five bits of bit select the location within the word; the remaining bits select the word offset
      - Five lines of code without a conditional or loop
   o Example: bit = 4
      - bit = 0000 0000 0000 0100
      - bit & 31 = 0000 0000 0000 0100 & 0000 0000 0001 1111 = 0000 0000 0000 0100
      - mask = 1UL << (bit & 31) = 0000 0000 0001 0000
      - bit >> 5 = 0, so p = p + 0
      - *p = *p & 1111 1111 1110 1111, which clears the fifth bit (bit 4)

static inline void ____atomic_clear_bit(unsigned int bit, volatile unsigned long *p)
{
        unsigned long flags;
        unsigned long mask = 1UL << (bit & 31); /* bit position within the word */

        p += bit >> 5;                          /* advance to the containing word */

        local_irq_save(flags);
        *p &= ~mask;
        local_irq_restore(flags);
}


   o Example: bit = 34
      - bit = 0000 0000 0010 0010
      - bit & 31 = 2, so mask = 1UL << 2 = 0000 0000 0000 0100
      - bit >> 5 = 1, so p = p + 1, which advances the pointer to the next word
      - *p = *p & 1111 1111 1111 1011, which clears bit 2 in the second word (bit location 34)

   o Set a bit in memory
      - Replace the AND operation with an OR, setting the bit

static inline void ____atomic_set_bit(unsigned int bit, volatile unsigned long *p)
{
        unsigned long flags;
        unsigned long mask = 1UL << (bit & 31);

        p += bit >> 5;

    local_irq_save(flags);

    *p |= mask;

    local_irq_restore(flags);

    }
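
The word-index and mask arithmetic used by both routines can be checked in isolation; the following is a small user-space sketch (not kernel code) that reproduces the bit = 34 example:

#include <stdio.h>

int main(void)
{
        unsigned int bit = 34;
        unsigned long mask = 1UL << (bit & 31); /* position of the bit within its 32-bit word */
        unsigned int word = bit >> 5;           /* index of the 32-bit word holding the bit */

        printf("bit %u -> word %u, mask 0x%lx\n", bit, word, mask);
        /* prints: bit 34 -> word 1, mask 0x4 */
        return 0;
}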


11. SPIN LOCK: X86

Atomic operations are adequate for the example of controlling a word or bit

In general, methods are needed to lock a sequence of control for atomic operation
   o Examples include examination of a list, for example of tasks or timers

In the hierarchy of control sequence locking, the spinlock is the most efficient in its initialization and use, but it also represents the most significant impact on the kernel
   o Design requirements
      - Intended for application to locking where resource hold time is short
      - Fast initialization
      - Fast access and return
      - Small cache footprint
      - Design will tolerate processing overhead

There are two primary alternatives for code sequence locking
   o A process operating on one CPU (in an SMP system) may seek to acquire a resource or enter a sequence of instructions. If the lock is not accessible, the process (a kernel thread) may be designed to be dequeued until the lock is available.
      - If the lock is anticipated to not be available for an extended period, then this is acceptable
      - The process of dequeue and then enqueue incurs latency: a context switch exiting and entering
   o An alternative, for design requirements where it is known in advance that the lock acquisition delay will be short, is the spin lock
      - The kernel thread process is not dequeued during the lock delay
      - Rather, it continues to test the lock


Operations
   o A process that may wish to protect a code sequence sets a spinlock
      - Only one spinlock is available per thread
   o A second process requesting the lock makes repeated attempts to gain the lock. It remains in a busy loop, testing the lock during each period that it is scheduled.
      - If the previous task releases the lock, it will be discovered to be available when the new task seeking the lock is scheduled

Characteristics
   o The spinlock is central in kernel code
   o The acquisition of a spinlock does not disable interrupt operations
      - This results from our having inserted an interruptable NOP loop
      - An example of a potential deadlock failure results from a process having acquired a spinlock and then being interrupted and replaced by an ISR that seeks the same spinlock
   o Thus, we very often observe the use of the spin_lock_irqsave() function (a usage sketch follows the rules below)

Usage rules
   o Spinlocks are appropriate for fast execution (less than the time required for two context switches)
   o Sleep operations should not be started in a sequence of execution after a lock is acquired


      - The probability that an interrupt service routine may require the same lock resource is high
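
A minimal usage sketch of the spin_lock_irqsave() pattern referenced above, assuming a driver-private lock and a counter shared with an interrupt handler (the names are assumptions, not lecture code):

static spinlock_t dev_lock = SPIN_LOCK_UNLOCKED;        /* protects dev_count */
static int dev_count;                                   /* also updated from the ISR */

static void update_from_process_context(void)
{
        unsigned long flags;

        spin_lock_irqsave(&dev_lock, flags);      /* disable local interrupts, saving prior state */
        dev_count++;                              /* critical section shared with the ISR */
        spin_unlock_irqrestore(&dev_lock, flags); /* restore the saved interrupt state */
}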

    IMPLEMENTATION OF SPINLOCKS: SETTING LOCKS

First, setting the lock (in kernel/spinlock.c)

__lockfunc defines fastcall, the directive that the first three function arguments are to be placed in registers as opposed to the stack (which would be the compiler default)
   o FASTCALL macro setting: #define fastcall __attribute__((regparm(3)))
      - Pass up to three parameters via registers, the remainder on the stack

    #define spin_lock(lock) _spin_lock(lock)

    void __lockfunc _spin_lock(spinlock_t *lock)

    {

    preempt_disable();

    _raw_spin_lock(lock);

    }

   o Preemption disabled (from /linux/preempt.h)

#define preempt_disable() \

    do { \

    inc_preempt_count(); \

    barrier(); \

    } while (0)

   o The actual spinlock
      - slock = 1 if the lock is available
      - slock = 0 after the decrement on a successful lock request

    static inline void _raw_spin_lock(spinlock_t *lock)

    {

    __asm__ __volatile__(spin_lock_string

    :"=m" (lock->slock) : : "memory");

    }


    #define spin_lock_string \

    "\n1:\t" \

    "lock ; decb %0\n\t" \

    "jns 3f\n" \

    "2:\t" \

    "rep;nop\n\t" \"cmpb $0,%0\n\t" \

    "jle 2b\n\t" \

    "jmp 1b\n" \

    "3:\n\t"

    o gcc preprocessor will produce

    static inline void _raw_spin_lock(spinlock_t *lock)

    {

    __asm__ __volatile__(

    "1:\t" \

    "lock ; decb %0\n\t" \

    "jns 3f\n" \

    "2:\t" \

    "rep;nop\n\t" \

    "cmpb $0,%0\n\t" \

    "jle 2b\n\t" \

    "jmp 1b\n" \

    "3:\n\t"

    : "=m" (lock->slock) :

    : "memory");

    }

   o decb decrements the spinlock value; note the argument %0 points to lock->slock
      - Tests if the decrement of the lock reaches zero (the lock state was therefore one at the time of access)
   o Note it is not adequate to merely set slock
      - Consider multiple CPUs in a race to set the bit
      - The decrement removes the race condition
         o Each CPU can decrement the lock
         o No CPU can exit the spinlock until the lock becomes set to one and may be decremented
   o Checks the sign flag


      - If the sign flag is not set, then the spinlock was 1 and is now zero, so jump to label 3 and continue
         o The thread executing this sequence now owns the lock
      - If the spinlock value was zero, the decrement yields a negative value
         o The lock was therefore taken previously
         o Then, the system compares the memory register with zero
         o If less than or equal to zero, remain in the loop, since other CPU processes may be decrementing slock
         o If greater than zero, the lock must be set to 1 and is free
   o However, this system does not exit immediately; a race condition with multiple CPUs is in progress
      - Another CPU has set the lock to 1
      - But yet another CPU may now have decremented the lock
   o The test is: can this thread successfully decrement the lock to zero from one
      - If so, this is the only thread that owns the lock
      - If not, this CPU has lost the race
   o Hence, this system performs one more test to ensure that
      - The current thread acquires the lock
      - No other thread has or can acquire the lock
   A C-level sketch of this acquisition loop is given below.
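
The sketch below restates that logic in C-style pseudocode for explanation only; atomic_decrement() is a placeholder for the bus-locked decb, and the real lock is implemented in assembly as shown above, not this way.

/* Pseudocode of the acquisition logic (placeholder helpers, not kernel code). */
void spin_acquire(volatile signed char *slock)
{
        for (;;) {
                if (atomic_decrement(slock) >= 0)  /* "lock; decb": 1 -> 0 means we won */
                        return;                    /* sign flag clear: lock acquired */
                while (*slock <= 0)                /* held or contended: spin (rep;nop) */
                        cpu_relax();
                /* slock observed > 0: lock looks free, race to decrement it again */
        }
}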


    SPINLOCK ENERGY OPTIMIZATION

    Note the rep;nop sequence

static inline void rep_nop(void)
{

    __asm__ __volatile__("rep;nop": : :"memory");

    }

    #define cpu_relax() rep_nop()

Detail point on optimization
   o Analysis has shown that some systems spend a significant fraction of time in the spinlock state
      - The delay may be unavoidable
      - However, it introduces undesired power dissipation
   o Now, the rep;nop sequence introduces a method for signaling the CPU that the current thread is executing and waiting for a spinlock
   o The rep prefix causes a number of NOPs to be introduced equal to the contents of the cx register
      - The assembler introduces the rep opcode modifier byte 0xF2; prepending the instruction causes the instruction to be called repeatedly
      - This applies only to string instructions; it defaults to a single NOP otherwise
   o However, the processor observes the presence of rep nop
   o The processor then may adjust clock frequency and core voltage, and reduce energy usage

The cpu_relax() macro, which includes the rep nop sequence, has appeared in recent kernel versions
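
A hedged sketch of how cpu_relax() is typically used in such a polling loop (the status flag and function name are illustrative):

/* Busy-wait on a hardware status flag while hinting the CPU that this is
 * a spin-wait loop, allowing it to reduce speculation and power.
 */
static void wait_for_ready(volatile int *status)
{
        while (!*status)
                cpu_relax();    /* expands to "rep; nop" (PAUSE) on x86 */
}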


    CONTROLLING SYNCHRONIZATION AND INTERRUPTS

Setting a lock while disabling interrupts
   o _spin_lock_irq(spinlock_t *lock)
   o Interrupts are enabled unconditionally at the time the lock is released
      - This may create an error condition if interrupts were previously disabled

#define _spin_lock_irq(lock) \

    do { \

    local_irq_disable(); \

    preempt_disable(); \

    _raw_spin_lock(lock); \

    } while (0)

Setting a lock while disabling interrupts and storing the interrupt state
   o _spin_lock_irqsave(spinlock_t *lock, unsigned long flags)
   o This enables the state of interrupts to be stored at the time the lock is set
   o Interrupts are enabled at the time the lock is released only if they were initially enabled

#define _spin_lock_irqsave(lock, flags)         \
        do {                                    \
                local_irq_save(flags);          \
                preempt_disable();              \
                _raw_spin_lock(lock);           \
        } while (0)

#define local_irq_save(x)                       \
        __asm__ __volatile__(                   \
                "pushfl ; popl %0 ; cli"        \
                : "=g" (x)                      \
                :                               \
                : "memory")     /* disable interrupts, saving the flags in x */

From /asm-i386/system.h
   o This is called with a flags argument
      - The flags are first saved on the stack (pushfl)


      - The flag value is then popped into a general purpose register; the stack pointer is returned to its initial value
         o Thus, the flags are now saved as a temporary variable
      - The memory keyword informs the compiler that memory has been changed, blocking compiler actions that would reorder the sequence of control

    LOCK OPERATIONS AND LOAD MANAGEMENT

Setting a lock while disabling interrupt bottom halves
   o Critical for many network drivers
   o This permits hardware interrupts to proceed
      - However, the computational demand of bottom halves, which would delay interrupt service routines, is not introduced
      - For example, timer interrupts and other critical events
   o _spin_lock_bh(spinlock_t *lock)
   o This disables local bottom halves at the time the lock is set
   o Bottom halves are re-enabled at the time the lock is released

    #define _spin_lock_bh(lock) \

    do { \

    local_bh_disable(); \

    preempt_disable(); \

    _raw_spin_lock(lock); \

    } while (0)

   o This takes us to interrupts.c
      - Now, softirqs (for networking, for example) will only be allowed if the preemption counter is less than SOFTIRQ_OFFSET
      - If many processes have incremented the preemption counter, the policy is to not add yet another task, but rather allow these to complete
   o Convenient approach: just add SOFTIRQ_OFFSET


      - Now, with the increase in the preempt count, BHs are disabled, since the number of allowed softirqs is pushed above the limit by simply adding the max value to the current value
      - This creates a convenient method for returning and restoring the preemption count while gating BH operations
   o Note the while (0) construct and barrier

#define local_bh_disable() \
        do { add_preempt_count(SOFTIRQ_OFFSET); barrier(); } while (0)

   o And in sched.c (removing debug options)

void fastcall add_preempt_count(int val)

    {

    preempt_count() += val;

    }
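
A usage sketch of the bottom-half variant, assuming a statistic updated from a receive softirq and read from process context (the names are assumptions):

static spinlock_t stats_lock = SPIN_LOCK_UNLOCKED;      /* protects rx_packets */
static unsigned long rx_packets;                        /* incremented by the RX softirq */

static unsigned long read_rx_packets(void)
{
        unsigned long val;

        spin_lock_bh(&stats_lock);      /* block local bottom halves: the softirq cannot
                                         * run on this CPU and deadlock against us */
        val = rx_packets;
        spin_unlock_bh(&stats_lock);    /* re-enable BHs; pending softirqs may now run */
        return val;
}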

    IMPLEMENTATION OF SPINLOCKS: RELEASING LOCKS

Design goals
   o Release the lock resource
   o Enable preemption
   o Evaluate if rescheduling should occur

    Release spin_lock

#define _spin_unlock(lock)              \
        do {                            \
                _raw_spin_unlock(lock); \
                preempt_enable();       \
                __release(lock);        \
        } while (0)

In preempt.h
   o Enable

#define preempt_enable()                        \
        do {                                    \
                preempt_enable_no_resched();    \


                preempt_check_resched();        \
        } while (0)

    Decrement preempt count

#define preempt_enable_no_resched()     \
        do {                            \
                barrier();              \
                dec_preempt_count();    \
        } while (0)

Call reschedule if current is flagged; this return from the spinlock represents an important opportunity to exploit the resched option

#define preempt_check_resched()                                         \
        do {                                                            \
                if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))       \
                        preempt_schedule();                             \
        } while (0)

Release spin_lock_irq
   o Here is the unconditional restore

    #define _spin_unlock_irq(lock) \

    do { \

                _raw_spin_unlock(lock); \
                local_irq_enable();     \

    preempt_enable(); \

    } while (0)

Unlock (the lock argument applies only if debug is enabled; this is removed below)

static inline void _raw_spin_unlock(spinlock_t *lock)

    {

    __asm__ __volatile__(

    spin_unlock_string

    );

    }


Sets the spin lock byte; note the memory barrier

#define spin_unlock_string      \
        "movb $1,%0"            \
        : "=m" (lock->slock) : : "memory"

Release spin_unlock_irqrestore
   o Here is the conditional restore

#define spin_unlock_irqrestore(lock, flags)     \
        do {                                    \
                _raw_spin_unlock(lock);         \
                local_irq_restore(flags);       \
                preempt_enable();               \
        } while (0)

#define local_irq_restore(x)                            \
        do {                                            \
                if ((x & 0x000000f0) != 0x000000f0)     \
                        local_irq_enable();             \
        } while (0)

#define local_irq_enable()              \
        __asm__ __volatile__("sti" : : : "memory")

Finally, releasing with spin_unlock_bh

#define spin_unlock_bh(lock) _spin_unlock_bh(lock)

   o Here is the conditional restore

void __lockfunc _spin_unlock_bh(spinlock_t *lock)

    {

    _raw_spin_unlock(lock);

    preempt_enable();

        local_bh_enable();
}

In softirq.c, local_bh_enable is found
   o Note the recovery using the (SOFTIRQ_OFFSET - 1) subtraction
   o This removes the SOFTIRQ_OFFSET and enables SOFTIRQ threads to be executed by softirqd
   o However, preemption remains disabled (due to the -1 above) if it was disabled previous to this action


   o Note that a check is made that we are not in interrupt context and that there is a pending softirq
      - Then the softirq is actually performed immediately, before any other process that the scheduler may have selected
   o Note that the preemption counter is decremented; preemption will be enabled when this reaches zero
   o Note that resched is called

void local_bh_enable(void)

    {

    sub_preempt_count(SOFTIRQ_OFFSET - 1);

    if (unlikely(!in_interrupt() && local_softirq_pending()))

    do_softirq();

    dec_preempt_count();

    preempt_check_resched();

    }

Lock state testing
   o Lock state can be tested without spinning, to enable flow control
      - For example, spin_trylock(), spin_trylock_bh()
      - Implemented with the atomic xchgl instruction on x86

static inline int __raw_spin_trylock(raw_spinlock_t *lock)
{

    int oldval;

        __asm__ __volatile__(
                "xchgl %0,%1"
                :"=q" (oldval), "=m" (lock->slock)
                :"0" (0) : "memory");   /* "0": input shares operand 0, initialized to 0 */

    return oldval > 0;

    }

   o Implemented in the ARM architecture:
      - Loads the lock value
      - Stores exclusive if equal (if the lock value is 0)
      - Otherwise, exits with the lock value in tmp


    static inline int __raw_spin_trylock(raw_spinlock_t *lock)

    {

    unsigned long tmp;

    __asm__ __volatile__(" ldrex %0, [%1]\n"

    " teq %0, #0\n"

    " strexeq %0, %2, [%1]"

    : "=&r" (tmp)

    : "r" (&lock->lock), "r" (1)

    : "cc");

    if (tmp == 0) {

    smp_mb();

    return 1;

    } else {

    return 0;

        }
}
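
A short flow-control sketch using spin_trylock(), with illustrative names: the update is simply skipped when the lock is busy, rather than spinning for it.

static spinlock_t stat_lock = SPIN_LOCK_UNLOCKED;

static void maybe_update_stats(void)
{
        if (!spin_trylock(&stat_lock))  /* nonzero return means the lock was acquired */
                return;                 /* lock busy: skip this update instead of spinning */
        /* ... update statistics protected by stat_lock ... */
        spin_unlock(&stat_lock);
}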


12. SPINLOCK SYNCHRONIZED KERNEL THREAD EXAMPLE

/*

    * kthread_mod_coord.c

    *

 * Demonstration of multiple kernel thread
 * creation and binding on a multicore system

    *

    * This system includes spinlock synchronization

    *

    */

/* Header names were stripped in the source scan; the following set is a
 * likely assumption for a module using kthreads, spinlocks, and printk. */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

    /* array of pointers to thread task structures */

    #define MAX_CPU 16

    #define LOOP_MAX 10

    #define BASE_PERIOD 200

    #define INCREMENTAL_PERIOD 30

    #define WAKE_UP_DELAY 0

    static struct task_struct *kthread_cycle_[MAX_CPU];

    static int kthread_cycle_state = 0;

    static int num_threads;

    static int cycle_count = 0;

    static spinlock_t kt_lock = SPIN_LOCK_UNLOCKED;

    static int cycle(void *thread_data)

    {

    int delay, residual_delay;

    int this_cpu;

    int ret;

    int loops;

    delay = BASE_PERIOD;

    for (loops = 0; loops < LOOP_MAX; loops++) {

    this_cpu = get_cpu();

    delay = delay + this_cpu*INCREMENTAL_PERIOD;

    ret = spin_is_locked(&kt_lock);

    if (ret != 0) {

    printk("kthread_mod: cpu %i start spin cycle\n", this_cpu);


    }

    spin_lock(&kt_lock);

printk("kthread_mod: lock pid %i cpu %i delay %i count %i \n",
       current->pid, this_cpu, delay, cycle_count);

    cycle_count++;

set_current_state(TASK_UNINTERRUPTIBLE);
residual_delay = schedule_timeout(delay);

    cycle_count--;

printk("kthread_mod: unlock pid %i cpu %i delay %i count %i\n",
       current->pid, this_cpu, delay, cycle_count);

    spin_unlock(&kt_lock);

    }

    kthread_cycle_state--;

    /*

    * exit loop

    */

    while (!kthread_should_stop()) {

    delay = 1 * HZ;

    set_current_state(TASK_UNINTERRUPTIBLE);

    residual_delay = schedule_timeout(delay);

printk("kthread_mod: wait for stop pid %i cpu %i \n",
       current->pid, this_cpu);

    }

printk("kthread_mod: cycle function: stop state detected for cpu %i\n",
       this_cpu);

    return 0;

    }

    int init_module(void)

    {

    int cpu = 0;

    int count;

    int this_cpu;

    int num_cpu;

    int delay_val;

    int *kthread_arg = 0;

    int residual_delay;

const char thread_name[] = "cycle_th";
const char name_format[] = "%s/%d"; /* format name and cpu id */

    num_threads = 0;

    num_cpu = num_online_cpus();

    this_cpu = get_cpu();

printk("kthread_mod: init task %i cpu %i of total CPU %i \n",
       current->pid, this_cpu, num_cpu);



    for (count = 0; count < num_cpu; count++) {

    cpu = count;

num_threads++;
kthread_cycle_state++;

    delay_val = WAKE_UP_DELAY;

    set_current_state(TASK_UNINTERRUPTIBLE);

    residual_delay = schedule_timeout(delay_val);

    kthread_cycle_[count] =

    kthread_create(cycle, (void *) kthread_arg,

    thread_name, name_format, cpu);

    if (kthread_cycle_[count] == NULL) {

    printk("kthread_mod: thread creation error\n");

}
kthread_bind(kthread_cycle_[count], cpu);

    wake_up_process(kthread_cycle_[count]);

    this_cpu = get_cpu();

printk("kthread_mod: current task %i cpu %i create/wake next thread\n",
       current->pid, this_cpu);

    }

    return 0;

    }

    void cleanup_module(void)

    {

    int ret;

    int count;

    int this_cpu;

    /*

    * determine if module removal terminated thread creation cycle early

    *

    * also must determine if cpu is suspended

    */

    printk("kthread_mod: number of threads to stop %i and active %i\n",

    num_threads, kthread_cycle_state);

this_cpu = get_cpu();
printk("kthread_mod: kthread_stop requests being applied by task %i "
       "on cpu %i \n", current->pid, this_cpu);

    for (count = 0; count < num_threads; count++) {

ret = kthread_stop(kthread_cycle_[count]); /* sets done state */
printk("kthread_mod: kthread_stop request for cpu count returned "
       "with value %i \n", ret);



    }

    }

    MODULE_LICENSE("GPL");

Start up
   o Note the coordination
   o Note that locking occurs
      - However, locking only occurs when the relationship between delays leads to resource contention

[61888.295386] kthread_mod: init task 17348 cpu 2 of total CPU 4
[61888.297709] kthread_mod: current task 17348 cpu 2 create/wake next thread
[61888.297805] kthread_mod: lock pid 17349 cpu 0 delay 200 count 0
[61888.301142] kthread_mod: current task 17348 cpu 2 create/wake next thread
[61888.301158] kthread_mod: cpu 1 start spin cycle
[61888.309106] kthread_mod: current task 17348 cpu 2 create/wake next thread
[61888.309146] kthread_mod: cpu 2 start spin cycle
[61889.004148] kthread_mod: current task 17348 cpu 0 create/wake next thread
[61889.004161] kthread_mod: cpu 3 start spin cycle
[61889.100033] kthread_mod: unlock pid 17349 cpu 0 delay 200 count 0
[61889.100073] kthread_mod: cpu 0 start spin cycle
[61889.100080] kthread_mod: lock pid 17350 cpu 1 delay 230 count 0
[61890.020530] kthread_mod: unlock pid 17350 cpu 1 delay 230 count 0
[61890.020581] kthread_mod: cpu 1 start spin cycle
[61890.020588] kthread_mod: lock pid 17351 cpu 2 delay 260 count 0
[61891.061032] kthread_mod: unlock pid 17351 cpu 2 delay 260 count 0
[61891.061074] kthread_mod: cpu 2 start spin cycle
[61891.061080] kthread_mod: lock pid 17352 cpu 3 delay 290 count 0
[61892.217531] kthread_mod: unlock pid 17352 cpu 3 delay 290 count 0
[61892.217572] kthread_mod: cpu 3 start spin cycle
[61892.217582] kthread_mod: lock pid 17349 cpu 0 delay 200 count 0
[61893.312070] kthread_mod: unlock pid 17349 cpu 0 delay 200 count 0
[61893.312131] kthread_mod: cpu 0 start spin cycle
[61893.312138] kthread_mod: lock pid 17350 cpu 1 delay 260 count 0
[61895.332564] kthread_mod: unlock pid 17350 cpu 1 delay 260 count 0
[61895.332609] kthread_mod: cpu 1 start spin cycle
[61895.332617] kthread_mod: lock pid 17351 cpu 2 delay 320 count 0
[61897.921049] kthread_mod: unlock pid 17351 cpu 2 delay 320 count 0
[61897.921094] kthread_mod: cpu 2 start spin cycle
[61897.921101] kthread_mod: lock pid 17352 cpu 3 delay 380 count 0
[61899.889564] kthread_mod: unlock pid 17352 cpu 3 delay 380 count 0
[61899.889609] kthread_mod: cpu 3 start spin cycle
[61899.889619] kthread_mod: lock pid 17349 cpu 0 delay 200 count 0
[61902.088074] kthread_mod: unlock pid 17349 cpu 0 delay 200 count 0
[61902.088118] kthread_mod: cpu 0 start spin cycle
[61902.088124] kthread_mod: lock pid 17350 cpu 1 delay 290 count 0
[61903.248533] kthread_mod: unlock pid 17350 cpu 1 delay 290 count 0
[61903.248577] kthread_mod: cpu 1 start spin cycle
[61903.248586] kthread_mod: lock pid 17351 cpu 2 delay 380 count 0


    Cpu0 :  0.0%us, 100.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu1 :  0.0%us, 100.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu2 :  0.0%us, 100.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu3 :  3.7%us,   5.6%sy,  0.0%ni, 90.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

    The top utility, operating in batch mode, may record per-CPU load
    o This load is a direct result of the resource cost associated with the spinlock
    o Note the processor usage at 100% system and 0% user
    o Entries indicate the percentage of time the processor was executing a task other than the idle task during the time since the last screen update

    Note: three processors are running threads spinning at full load
    o One processor, CPU3, runs the thread that currently holds the lock; that thread is sleeping, so CPU3 spends the balance of its time in the idle thread

    Note the behavior above:
    o CPU 0 wins the race to acquire the spinlock and its thread then sleeps, requiring no CPU load
    o CPUs 1, 2, and 3 operate at 100 percent load, waiting for the spinlock resource to become available
    o Again, at t = 1 second, a race to acquire the lock occurs and CPU 1 wins


    Note the behavior above over an extended period
    o CPU 0, 1, and 3 acquire the spinlock
    o CPU 2 does not acquire the lock


    13. SPIN LOCK ARM

    Energy-aware lock operation, new in ARM Linux 2.6.15
    Applied directly to new multiprocessor embedded cores
    o ARM11 MPCore example: embedded control, networking, graphics

    Conventional multiprocessor systems suffer from energy inefficiency because processors wait for, and expend energy polling, spinlocks
    o A significant fraction of processor time may be lost in synchronization
    o Problems include:
       Priority inversion
       Deadlocks
       Convoy behavior
          o Groups of processors executing control sequences in parallel and stalling in synchrony, waiting for the same lock

    Now, energy saving is ensured by placing the processor in a temporary stall state, with the ability to wake the processor within one cycle of the lock being freed
    o Notification via the SCU (Snoop Control Unit) in the multiprocessor core


    Ensures that all caches are coherent
    o The signal propagates to all CPUs (see unlock)
    o wfene instruction: receive (wait for event)
    o sev instruction: notify (send event)

    static inline void __raw_spin_lock(raw_spinlock_t *lock)

    {

    unsigned long tmp;

    __asm__ __volatile__(

    "1: ldrex %0, [%1]\n" ; load lock member of &lock into r

    " teq %0, #0\n" ; test lock value

    " wfene\n" ; wait for notification

    " strexeq %0, %2, [%1]\n" ; attempt to store 1 in r

    " teqeq %0, #0\n" ; test

    " bne 1b" ; loop if unsuccessful

    : "=&r" (tmp)

    : "r" (&lock->lock), "r" (1)

    : "cc");

    smp_mb();

    }

    wmb() and rmb() are both defined as mb() for ARM
    o #define mb() __asm__ __volatile__ ("" : : : "memory")
    o This ensures that any writes or reads of variables being protected by the spinlock are scheduled prior to releasing the lock

    static inline void __raw_spin_unlock(raw_spinlock_t *lock)

    {

    smp_mb();

    __asm__ __volatile__(

    " str %1, [%0]\n" ; release, store 0 in lock member

    " mcr p15, 0, %1, c7, c10, 4\n" ; drain storage buffer

    " sev" ; send signal to waiting CPUs

    :

    : "r" (&lock->lock), "r" (0)

    : "cc"); ; CPSR updated

    }

    Drain Storage Buffer operation
    o Forces synchronization of this stored data into the D-cache of each processor
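    For completeness, here is a minimal sketch of how kernel code would use these primitives through the generic spinlock API rather than calling __raw_spin_lock directly; the names stats_lock, packet_count, and record_packet are illustrative and are not part of the lecture example.

    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(stats_lock);     /* maps onto the raw lock shown above on SMP */
    static unsigned long packet_count;      /* shared data protected by the lock */

    void record_packet(void)
    {
        unsigned long flags;

        spin_lock_irqsave(&stats_lock, flags);      /* acquire; spins (wfe) while contended */
        packet_count++;                             /* short critical section */
        spin_unlock_irqrestore(&stats_lock, flags); /* release; sev wakes waiting CPUs */
    }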


    14. RW SPINLOCKS

    A spinlock allows only one sequence of control to enter a sequence of instructions
    An alternative exists for the spinlock
    o The reader/writer lock admits many readers
       The lock prevents access by readers if a writer has taken the lock
    o It permits only one writer
    o A writer may not acquire the lock while any reader or other writer holds it



    RW spinlocks are based on an rwlock_t structure
    o This contains a counter variable equal to the number of readers that currently hold the rwlock

    Without debugging options, this appears as

    typedef struct {
        volatile unsigned int lock;
    } rwlock_t;

    To initialize a lock x of type rwlock_t, assign x = RW_LOCK_UNLOCKED, where
    #define RW_LOCK_UNLOCKED (rwlock_t) { 0 }

    Implementation and usage
    For a sequence of control that the designer intends to use to read a shared memory resource, read_lock(rwlock_t *lock) is used (a usage sketch follows the list of lock variants below)



    o All of the other variants of spin_lock are included:

       read_lock                 write_lock
       read_lock_irq             write_lock_irq
       read_lock_irqsave         write_lock_irqsave
       read_lock_bh              write_lock_bh
       read_unlock               write_unlock
       read_unlock_irq           write_unlock_irq
       read_unlock_irqrestore    write_unlock_irqrestore
       read_unlock_bh            write_unlock_bh
       read_trylock (added in the 2.6 kernel)    write_trylock
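    As referenced above, a minimal usage sketch of the reader and writer paths; the routing-table names (table_lock, route_table, lookup_route, update_route) are illustrative, and the 2.6-era static initializer is assumed.

    #include <linux/spinlock.h>     /* rwlock_t and the read_/write_lock API */

    static rwlock_t table_lock = RW_LOCK_UNLOCKED;  /* 2.6-era static initializer */
    static int route_table[16];

    int lookup_route(int idx)
    {
        int val;

        read_lock(&table_lock);     /* many readers may hold the lock concurrently */
        val = route_table[idx];
        read_unlock(&table_lock);
        return val;
    }

    void update_route(int idx, int val)
    {
        unsigned long flags;

        write_lock_irqsave(&table_lock, flags);     /* exclusive: waits for all readers */
        route_table[idx] = val;
        write_unlock_irqrestore(&table_lock, flags);
    }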

    WRITERS (IN ARM ARCHITECTURE)

    Consider write lock acquisition (called by writers)
    Recall that strex r1, r2, [r3] stores the contents of r2 into memory at the address contained in r3, and places a zero result in r1 if no other writes have occurred to [r3] since the previous ldrex
    Note this can also execute conditionally

    static inline void _raw_write_lock(rwlock_t *rw)

    {

    unsigned long tmp;

    __asm__ __volatile__(

    "1: ldrex %0, [%1]\n" ; load exclusive and monitor lock

    " teq %0, #0\n" ; test if lock zero" strexeq %0, %2, [%1]\n" ; attempt to write LOCK_BIAS if zero

    " teq %0, #0\n" ; - note above is conditional execution

    " bne 1b" ; spin until lock acquired

    : "=&r" (tmp)

    : "r" (&rw->lock), "r" (0x80000000)

    : "cc", "memory");

    Note that this stores the value 2^31 (0x80000000), which is interpreted as a negative value
    Write unlock merely involves clearing the lock (called by writers)


    static inline void _raw_write_unlock(rwlock_t *rw)

    {

    __asm__ __volatile__(

    "str %1, [%0]" ; store zero at address: &rw->lock

    :: "r" (&rw->lock), "r" (0)

    : "cc", "memory");

    }


    READERS (IN ARM ARCHITECTURE)

    This must admit many readers
    It must track the number of readers and prevent any writer from entering a code section until all readers have exited
    o Each reader increments the lock on entry and decrements it on release
    o Writers are permitted to enter only if the lock value is zero

    Here is the operation for read_lock; this is called by a reader attempting to enter a critical section
    o Note: if a writer has taken the lock, its value will be -2^31 (LOCK_BIAS), so the increment result below remains negative for up to 2^31 readers

    If no reader or writer is present, the initial value of the lock variable is zero
    o The lock is incremented by one for each reader acquiring the lock
    o This implementation tests for the presence of a writer and spins in that event
    o Otherwise, readers are admitted: the code atomically increments the lock value by loading it exclusively (setting the monitor) and incrementing the value in a register initially equal to the lock value
    o The incremented value is stored back only if the result is zero or positive (no writer present)
    o If the result is negative, the exclusive store is skipped and the code remains in the busy-wait loop until the writer exits and releases the lock
    o Otherwise the store has occurred and the reader exits the loop

    Note strexpl is Store Exclusive executing on the PL condition (positive or zero result)
    o The result of adding 1 to the register is non-negative only when no writer holds the lock
    o The S suffix on rsbpls updates the condition flags; rsbpls %0, %1, #0 computes zero minus the strexpl result, which is negative if the exclusive store failed
    o In that case the code remains in the loop until the lock can be acquired (a conceptual C sketch follows, before the actual ARM implementation)
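    As noted above, a conceptual C model of this reader path follows; it is illustrative only (not the kernel's code) and uses the GCC built-in __sync_val_compare_and_swap in place of the ldrex/strex pair. The function name model_read_lock is hypothetical.

    /* Conceptual model of the read-lock loop: admit a reader only while no
     * writer (negative LOCK_BIAS value) holds the lock. */
    static void model_read_lock(volatile int *lock)
    {
        for (;;) {
            int old = *lock;
            if (old < 0)            /* a writer holds the lock: keep spinning */
                continue;
            /* atomically add one reader; retry if the value changed underneath us */
            if (__sync_val_compare_and_swap(lock, old, old + 1) == old)
                return;             /* reader admitted */
        }
    }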


    static inline void _raw_read_lock(rwlock_t *rw)

    {

    unsigned long tmp, tmp2; /* will be stored in registers */

    __asm__ __volatile__(

    "1: ldrex %0, [%2]\n" ; load exclusive (&rw->lock) into reg

    " adds %0, %0, #1\n" ; increment lock value (blindly)" strexpl %1, %0, [%2]\n" ; store reg exclusive setting reg (tmp2)

    ; if result is positive indicating no

    ; writers present

    ; But, a value of 1 will appear in

    ; (tmp2) if lock has been modified

    ; Thus, must now decrement to return lock

    ; to its initial value in next

    ; instruction note %1 contains value 1

    ; as a result of this event so,

    ; not required to load 1 immediate

    " rsbpls %0, %1, #0\n" ; decrement lock if lower or same

    " bmi 1b" ; branch if negative since lock value is

    : "=&r" (tmp), "=&r" (tmp2) ; negative and writers are present: "r" (&rw->lock)

    : "cc", "memory");

    }

    Now, unlocking proceeds as follows
    o Readers decrement the lock value on exiting
    o This is performed exclusively (atomically)
    o The lock value is positive while readers are present and decrements to 0 when all readers have exited

    Here is the operation for read_unlock, called by a reader exiting a critical section

    static inline void _raw_read_unlock(rwlock_t *rw)

    {

    unsigned long tmp, tmp2;

    __asm__ __volatile__(

    "1: ldrex %0, [%2]\n" ; load exclusive lock into reg

    " sub %0, %0, #1\n" ; decrement lock value

    " strex %1, %0, [%2]\n" ; store lock value

    " teq %1, #0\n" ; test if successful exclusive operation

    " bne 1b" ; branch if not exclusive

    : "=&r" (tmp), "=&r" (tmp2)

    : "r" (&rw->lock)

    : "cc", "memory");



    WRITERS AND TRYLOCK

    A writer may attempt to set the lock to LOCK_BIAS, or exit immediately if another thread (any reader or writer) already holds the lock

    static inline int _raw_write_trylock(rwlock_t *rw)

    {

    unsigned long tmp;

    __asm__ __volatile__(

    "1: ldrex %0, [%1]\n"

    " teq %0, #0\n"

    " strexeq %0, %2, [%1]" ; store exclusive if equal (successful)

    : "=&r" (tmp)

    : "r" (&rw->lock), "r" (0x80000000)

    : "cc", "memory");

    return tmp == 0; /* 1 if the lock was acquired, 0 otherwise */

    }
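    A typical usage pattern for the trylock variant, attempting an update without blocking; the function name try_update_entry and the -EBUSY fallback are illustrative.

    #include <linux/spinlock.h>
    #include <linux/errno.h>

    /* Attempt an exclusive update; give up immediately if readers or another
     * writer currently hold the lock. */
    static int try_update_entry(rwlock_t *lock, int *entry, int val)
    {
        if (!write_trylock(lock))   /* non-zero return means the lock was acquired */
            return -EBUSY;

        *entry = val;               /* exclusive critical section */
        write_unlock(lock);
        return 0;
    }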


    15. KERNEL SEMAPHORES BACKGROUND

    The semaphore is a unique variable with the following characteristics:
    o The semaphore value can be used to determine whether a process will execute or wait
    o The semaphore may be operated on by wait or post
    o wait
       The wait function causes the semaphore value to be decremented by 1 if the semaphore is non-zero
          o The process calling wait on the semaphore is allowed to continue
          o This operation is atomic in that it completes without interruption by other processes. Thus, if two processes attempt to decrement the semaphore, each decrement takes effect and none is lost. If the semaphore value is 1, only one process will be allowed to continue; the other will block.
       If the semaphore is zero
          o The process calling wait on the semaphore is blocked
          o The process remains blocked until the action of decrementing the semaphore would return zero (as opposed to a negative value)
    o post
       The post function increments the semaphore. This is again atomic in that if two processes both attempt to increment a semaphore of value 0, it is incremented by two. For example, without the atomic semaphore operation, both processes might conclude that the proper value for the semaphore is 1.

    A process may use the semaphore to protect a critical section of code such that its access to shared resources is protected (as if it were the only process operating) during the code sequence. This holds true even if the process is interrupted or taken from running to ready by the operating system.
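    To make the wait/post discipline concrete, the following minimal sketch uses the user-space POSIX semaphore API (sem_wait/sem_post); the kernel-internal down()/up() pair discussed in the rest of this section follows the same pattern. The names worker and shared_counter are illustrative. Compile with -pthread.

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    static sem_t sem;              /* initialized to 1: mutual exclusion */
    static long shared_counter;    /* shared resource protected by the semaphore */

    static void *worker(void *arg)
    {
        for (int i = 0; i < 100000; i++) {
            sem_wait(&sem);        /* "wait": decrement, blocking while the value is 0 */
            shared_counter++;      /* critical section */
            sem_post(&sem);        /* "post": increment, waking one blocked waiter */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        sem_init(&sem, 0, 1);
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", shared_counter);  /* always 200000 */
        sem_destroy(&sem);
        return 0;
    }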

    IMPLEMENTATION

    The next step in the locking hierarchy is the semaphore
    This prevents a process from passing a point in the sequence of control defined by the semaphore


    o However, unlike spinlocks, semaphores cause the process that reaches a taken semaphore to sleep
    o Formally, this means that the process (kernel thread) is dequeued, and a user space process or new kernel thread operates as a result of a context switch
    o This is clearly efficient for designs where the sleep time is long
    o However, scheduler latency must be accounted for

    The semaphore design is considerably more complex
    o Task wait queue management
    o Management of many waiting tasks that may be admitted when the semaphore becomes available

    Again, a data structure is the design foundation
    o struct semaphore, with data members:
       count: an atomic variable with these states
          o Positive: the semaphore is free
          o Zero: the semaphore is acquired, one thread is executing, and no other threads are sleeping while waiting for the semaphore
          o Negative: a number of threads equal to the absolute value of count are waiting for the semaphore
       wait: the wait queue (linked list) of waiting tasks
       sleepers: a flag indicating the presence of queued processes; zero if there are no sleeping processes, 1 otherwise

    Functions
    o An atomic down operation decrements the count variable
       If the semaphore is taken (busy), the task is placed on the wait queue until the semaphore state changes


    OPERATIONS

    Initialization (see /include/asm-i386/semaphore.h)
    o void sema_init(struct semaphore *sem, int val)

    struct semaphore {

    atomic_t count;

    int sleepers;

    wait_queue_head_t wait;

    };

    static inline void sema_init (struct semaphore *sem, int val)

    {

    atomic_set(&sem->count, val);

    sem->sleepers = 0;

    init_waitqueue_head(&sem->wait);

    }

    Mutex
    o Initializing a semaphore to 1 produces a mutex variable
    o This implies that only one lock holder is enabled: one thread at a time can occupy a code sequence

    static inline void init_MUTEX (struct semaphore *sem)

    {

    sema_init(sem, 1);

    }
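    A minimal usage sketch of a semaphore initialized as a mutex protecting a shared buffer in a kernel module; the names buf_sem, shared_buf, and buf_write are illustrative, and the 2.6-era header path is assumed.

    #include <linux/module.h>
    #include <linux/init.h>
    #include <linux/string.h>
    #include <linux/errno.h>
    #include <asm/semaphore.h>      /* struct semaphore, down(), up(), init_MUTEX() */

    static struct semaphore buf_sem;
    static char shared_buf[64];

    int buf_write(const char *msg, size_t len)
    {
        if (len >= sizeof(shared_buf))
            return -EINVAL;
        down(&buf_sem);                 /* may sleep until the mutex is free */
        memcpy(shared_buf, msg, len);   /* critical section */
        shared_buf[len] = '\0';
        up(&buf_sem);                   /* release; wakes one sleeper if present */
        return 0;
    }

    static int __init buf_mod_init(void)
    {
        init_MUTEX(&buf_sem);           /* semaphore count = 1 */
        return 0;
    }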

    Requesting a semaphore: down(struct semaphore *sem)
    o This will place a task that fails to receive the semaphore on the wait queue in the TASK_UNINTERRUPTIBLE state


    Implementation of down()
    o First, note the code structure
       Begins with an atomic decrement of sem->count
       If the decrement yields a negative result (the semaphore was already taken), jump to the slow path at LOCK_SECTION_START
       Otherwise, exit the down() function with the semaphore acquired
    o Optimization: note how access to the semaphore slow path is arranged
       This code is inlined in compilation with other code
       It is emitted as a volatile asm block, so the compiler does not reorder it
       LOCK_SECTION_START is defined to create a subsection for this code, separate from the surrounding text section
       Thus, as this code sequence is included in the inline function, only the decl and js instructions appear in the inline sequence
    o This prevents code in the lock section from being imported into the instruction cache, evicting other instructions more likely to be used

    static inline void down(struct semaphore * sem)

    {

    __asm__ __volatile__(

    LOCK "decl %0\n\t" ; decrement sem->count

    "js 2f\n" ; jump on sign; otherwise exit this

    ; function since next

    ; instructions not included

    "1:\n"

    LOCK_SECTION_START("")

    "2:\t lea %0,%%eax\n\t" ; load addr of sem in eax

    "call __down_failed\n\t" ;

    "jmp 1b\n" ; loop back to

    ; LOCK_SECTION_START label

    LOCK_SECTION_END

    :"=m" (sem->count)

    :

    :"memory","ax");

    }


    o __down_failed prepares the call to __down
    o __down places the current task on a waitqueue
    o Note that this code will be included in the text section (code section) occupied by sched.c

    asm(".section .sched.text\n" ; include in the sched.c text section
        ".align 4\n"
        ".globl __down_failed\n"
        "__down_failed:\n\t"
        "pushl %edx\n\t" ; save state
        "pushl %ecx\n\t"
        "call __down\n\t" ; __down places the task on a waitqueue
        "popl %ecx\n\t" ; restore state
        "popl %edx\n\t"
        "ret"
    );

    Examine __down()
    First, obtain a pointer to the task_struct of the current task
    Create a waitqueue entry for the current task
    The current task was TASK_RUNNING; now set its state member to TASK_UNINTERRUPTIBLE
    Acquire a spinlock with interrupts disabled and with the ability to restore interrupts
    Add the current task to the wait queue associated with this semaphore
    o Mark it as WQ_EXCLUSIVE; this will control the waking process
    o Tasks are added to the tail of the waitqueue
    Increment the sleepers member
    Now, enter a loop
    o First, get the number of sleepers
    o Add (sleepers - 1) to the semaphore count
    o If the result is not negative, set sleepers to zero and break: the semaphore has been acquired

    Consider an example: the semaphore is acquired and no task is waiting on it


    o Upon entry to down(), the semaphore is held (count = 0) and no other task is sleeping on the queue (the only other task of interest is the one that currently holds the semaphore)

    First, the semaphore count is decremented to -1 by down()
    o Then, sleepers is incremented by 1 (in __down)
    o Then, the count value is modified by adding (sleepers - 1), using the original sleeper count
       This yields count + (sleepers - 1) = -1 + 0 = -1 in this case with no other sleepers
       o Note the definition of atomic_add_negative: the result is true if the result of the addition is negative, otherwise false
       This negative result causes the conditional break not to be taken
    o Then, sleepers is set to 1; this indicates the presence of the task requesting the semaphore
    o Call schedule()
       The scheduler observes the TASK_UNINTERRUPTIBLE status and dequeues this task; the task remains on the waitqueue, waiting for an event
    o After the return from schedule()
       The spinlock is retaken
       The task is again marked TASK_UNINTERRUPTIBLE and another check is performed on the semaphore status, with sleepers = 1
       If the semaphore is not available, control remains in the loop: sleepers is set to 1 and schedule() is called again
    o If the semaphore count has been incremented (released) by another task's up()
       Then the break is taken; the task removes itself from the wait queue and wakes the next exclusive waiter

    As the semaphore becomes available
       A call is made to release the spinlock and restore interrupts
       Any process sleeping on the waitqueue will be activated
       o With a set of rules to be seen below
       The task is set to TASK_RUNNING
       o The next time the scheduler function runs, this task is eligible for selection


    fastcall void __sched __down(struct semaphore * sem)

    {

    struct task_struct *tsk = current;

    DECLARE_WAITQUEUE(wait, tsk);

    unsigned long flags;

    tsk->state = TASK_UNINTERRUPTIBLE;
    spin_lock_irqsave(&sem->wait.lock, flags);

    add_wait_queue_exclusive_locked(&sem->wait, &wait);

    sem->sleepers++;

    for (;;) { /* loop will not exit until */

    /* all sleepers exit */

    int sleepers = sem->sleepers;

    if (!atomic_add_negative(sleepers - 1, &sem->count)) {

    sem->sleepers = 0;

    break;

    }

    sem->sleepers = 1; /* this task - see -1 above */

    spin_unlock_irqrestore(&sem->wait.lock, flags);
    schedule(); /* will lead to sleep */

    spin_lock_irqsave(&sem->wait.lock, flags);

    tsk->state = TASK_UNINTERRUPTIBLE;

    }

    remove_wait_queue_locked(&sem->wait, &wait);

    wake_up_locked(&sem->wait);

    spin_unlock_irqrestore(&sem->wait.lock, flags);

    tsk->state = TASK_RUNNING;

    }


    It is important to consider how a list of tasks sleeping on the waitqueue may be activated (set to TASK_RUNNING)
    o Consider the modified example where N tasks occupy the waitqueue
       These all entered the waitqueue through this function
       So, upon being woken, each will execute this loop, entering the control flow immediately after schedule()
       Each task will then execute remove_wait_queue_locked and wake_up_locked; the wake_up_locked call activates the next task in turn

    void fastcall add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait)

    {

    unsigned long flags;

    wait->flags |= WQ_FLAG_EXCLUSIVE;

    spin_lock_irqsave(&q->lock, flags);

    __add_wait_queue_tail(q, wait);

    spin_unlock_irqrestore(&q->lock, flags);

    }

    Consider wake_up_locked: this will call __wake_up_common()
    o It will wake up one exclusive task: a process that initially called __down
    o It will place the task on the runqueue marked as TASK_RUNNING

    Semaphore functions

    static inline int down_interruptible(struct semaphore * sem)

    static inline int down_trylock(struct semaphore * sem)

    static inline void up(struct semaphore * sem)

    o An atomic up operation increments the count variable
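    A brief sketch of the common pattern for checking the down_interruptible() return value, so that a signal can abort the wait; the function name guarded_op is illustrative.

    #include <linux/errno.h>
    #include <asm/semaphore.h>

    /* Acquire the semaphore, but allow a signal to interrupt the sleep. */
    static int guarded_op(struct semaphore *sem)
    {
        if (down_interruptible(sem))
            return -ERESTARTSYS;    /* interrupted by a signal; semaphore not held */

        /* ... critical section ... */

        up(sem);                    /* release; wakes one sleeper if present */
        return 0;
    }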


    16. SEMAPHORE SYNCHRONIZED KERNEL THREAD EXAMPLE

    /* kthread_mod_coord_semaphore.c

    *

    * Demonstration of multiple kernel thread

    * creation and binding on multicore system

    * with semaphore synchronization

    *

    */

    /* NOTE: the header names were lost in extraction from the original slides;
     * the following are the likely includes for this 2.6-era module. */
    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/init.h>
    #include <linux/sched.h>
    #include <linux/kthread.h>
    #include <asm/semaphore.h>

    #define MAX_CPU 16

    #define LOOP_MAX 20

    #define BASE_PERIOD 200

    #define INCREMENTAL_PERIOD 30

    #define WAKE_UP_DELAY 0

    /* array of pointers to thread task structures */

    static struct task_struct *kthread_cycle_[MAX_CPU];

    static int kthread_cycle_state = 0;

    static int num_threads;

    static int cycle_count = 0;

    static struct semaphore kthread_mod_sem;

    static int cycle(void *thread_data)

    {

    int delay, residual_delay;

    int this_cpu;

    int ret_sem;

    int loops;

    delay = BASE_PERIOD;

    for (loops = 0; loops < LOOP_MAX; loops++) {

    this_cpu = get_cpu();

    delay = delay + this_cpu*INCREMENTAL_PERIOD;

    printk("kthread_mod: cpu %i executing down on kthread_mod_semaphore \n",

    this_cpu);

    down(&kthread_mod_sem);


    printk

    ("kthread_mod: Thread pid %i acquired semaphore executing on cpu

    %i delay %i count %i\n", current->pid, this_cpu, delay,

    cycle_count);

    cycle_count++;

    set_current_state(TASK_UNINTERRUPTIBLE);

    residual_delay = schedule_timeout(delay);
    cycle_count--;

    printk

    ("kthread_mod: Thread pid %i releasing semaphore executing on cpu

    %i delay %i count %i\n", current->pid, this_cpu, delay,

    cycle_count);

    up(&kthread_mod_sem);

    }

    kthread_cycle_state--;

    /*

    * exit loop

    */

    while (!kthread_should_stop()) {

    delay = 1 * HZ;

    set_current_state(TASK_UNINTERRUPTIBLE);

    residual_delay = schedule_timeout(delay);

    printk

    ("kthread_mod: wait for stop pid %i cpu %i \n",

    current->pid, this_cpu);

    }

    printk

    ("kthread_mod: cycle function: stop state detected for cpu %i\n",

    this_cpu);

    return 0;

    }

    int init_module(void)

    {

    int cpu = 0;

    int count;

    int this_cpu;

    int num_cpu;

    int delay_val;

    int *kthread_arg = 0;

    int residual_delay;

    const char thread_name[] = "cycle_th";
    const char name_format[] = "%s/%d"; /* format for name and cpu id */

    num_threads = 0;

    num_cpu = num_online_cpus();

    this_cpu = get_cpu();

    printk

    ("kthread_mod: init task %i cpu %i of total CPU %i \n",

    current->pid, this_cpu, num_cpu);


    init_MUTEX(&kthread_mod_sem);

    for (count = 0; count < num_cpu; count++) {

    cpu = count;

    num_threads++;
    kthread_cycle_state++;

    delay_val = WAKE_UP_DELAY;

    set_current_state(TASK_UNINTERRUPTIBLE);

    residual_delay = schedule_timeout(delay_val);

    kthread_cycle_[count] =

    kthread_create(cycle, (void *) kthread_arg,

    thread_name, name_format, cpu);

    if (IS_ERR(kthread_cycle_[count])) { /* kthread_create returns ERR_PTR on failure */

    printk("kthread_mod: thread creation error\n");

    }

    kthread_bind(kthread_cycle_[count], cpu); /* sets cpu in task struct */

    wake_up_process(kthread_cycle_[count]);

    this_cpu = get_cpu();

    printk

    ("kthread_mod: current task %i cpu %i create/wake next thread \n",

    current->pid, this_cpu);

    }

    return 0;

    }

    void cleanup_module(void)

    {

    int ret;

    int count;

    int this_cpu;

    /*

    * determine if module removal terminated the thread creation cycle early

    *

    * also must determine if cpu is suspended

    */

    printk("kthread_mod: number of threads to stop %i and active %i\n",

    num_threads, kthread_cycle_state);

    this_cpu = get_cpu();
    printk("kthread_mod: kthread_stop requests being applied by task %i on cpu %i\n",
           current->pid, this_cpu);

    for (count = 0; count < num_threads; count++) {

    ret = kthread_stop(kthread_cycle_[count]); /* sets done state */

    printk

    ("kthread_mod: kthread_stop request for cpu count returned with value

    %i \n", ret);


    }

    }

    MODULE_LICENSE("GPL");

    Note the behavior where the same thread on cpu0 reacquires the semaphore
    o As the thread executes up(), it returns and executes down() again
    o Unlike the spinlock example, other competing threads do not observe the availability of the semaphore since their test of the s