Threads: either under- or over-utilised

• Underutilised: limited by creation speed of work

– Cannot exploit all the CPUs even though there is more work

• Overutilised: losing performance due to context switches

– There is overhead when switching between OS threads

– Each thread needs to warm up cache again

– Increases memory pressure

• Worst case: continual slow-down

– The cost of creating threads is partially borne by the kernel

– User code may slow down more than kernel code under load

– Number of workers slowly goes up; completion rate goes down

HPCE / dt10/ 2015 / 7.1

Solving under-utilisation

#include <thread>

// Assumes begin < end: an empty range would recurse forever
template<class TI, class TF>
void parallel_for(const TI &begin, const TI &end, const TF &f)
{
  if(begin+1 == end){
    f(begin);
  }else{
    TI mid=(begin+end)/2;
    // Spawn the left segment on a new thread
    std::thread left(
      [&](){ parallel_for(begin, mid, f); }
    );
    // Perform the right segment on our thread
    parallel_for(mid, end, f);
    // Wait for the left to finish
    left.join();
  }
}

Creation of work using trees

• Tree starts on one thread

• Create thread to branch

• Execute function at leaves

• Join back up to the root

[Diagram: fork/join tree for parallel_for over [0..4)]

[0..4)
  spawn
[0..2) [2..4)
  spawn spawn
[0..1) [1..2) [2..3) [3..4)
f(0) f(1) f(2) f(3)
  join join
[0..2) [2..4)
  join
[0..4)

Properties of fork / join trees

• Recursively creating trees of work is very efficient

– We are not limited to one thread creating all tasks

– Exponential rather than linear growth of threads with time

• Problem solved?

• Growth of threads is exponential with time

• Can put significant pressure on the OS thread scheduler

– Context switching 1000s of threads is very inefficient

• Each thread requires significant resources

– Need kernel handles, stack, thread-info block, ...

– Can’t allocate more than a few thousand threads per process
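The cost of the naive scheme can be made concrete by counting threads. The sketch below repeats the recursive std::thread splitting from the earlier slide with an added counter; the names counting_parallel_for, threads_needed and g_threads_created are hypothetical and exist only for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical counter, added only for this illustration.
std::atomic<int> g_threads_created{0};

// Same recursive splitting as the std::thread parallel_for,
// instrumented to count every thread it creates.
template<class TF>
void counting_parallel_for(int begin, int end, const TF &f)
{
    assert(begin <= end);
    if(begin == end) return;            // empty range: nothing to do
    if(begin+1 == end){
        f(begin);
    }else{
        int mid = begin + (end-begin)/2;
        g_threads_created++;            // one new thread per split
        std::thread left([&](){ counting_parallel_for(begin, mid, f); });
        counting_parallel_for(mid, end, f);
        left.join();
    }
}

// Runs n empty iterations and reports how many threads were created.
int threads_needed(int n)
{
    g_threads_created = 0;
    counting_parallel_for(0, n, [](int){});
    return g_threads_created.load();
}
```

Every split creates one thread, so n iterations need n-1 threads: a loop of a few thousand iterations already approaches the per-process limits mentioned above.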

Re-examining the goals

• What we want is parallel_for:

“Iterations may execute in parallel”

• std::thread gives us something different:

“The new thread will execute in parallel”

• Our thread based strategy is too eager to go parallel

• We want to go just parallel enough, then stay serial
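One way to "go just parallel enough" while still using raw threads is to give the recursion a depth budget. This is a minimal sketch under that assumption, not the approach TBB takes; bounded_parallel_for, squares and the depth parameter are hypothetical names.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Splits [begin,end) in two. While depth > 0 the left half gets a new
// thread; once the budget is spent the range is processed serially.
template<class TF>
void bounded_parallel_for(int begin, int end, const TF &f, int depth)
{
    assert(depth >= 0);
    if(begin >= end) return;                 // empty range
    if(depth == 0){
        for(int i=begin; i<end; i++) f(i);   // stay serial
        return;
    }
    if(begin+1 == end){ f(begin); return; }
    int mid = begin + (end-begin)/2;
    std::thread left([&](){ bounded_parallel_for(begin, mid, f, depth-1); });
    bounded_parallel_for(mid, end, f, depth-1);
    left.join();
}

// Hypothetical helper: fills out[i] = i*i using the bounded loop.
std::vector<int> squares(int n, int depth)
{
    std::vector<int> out(n);
    bounded_parallel_for(0, n, [&](int i){ out[i] = i*i; }, depth);
    return out;
}
```

With a budget of d the loop creates at most 2^d - 1 threads regardless of n, so squares(1000, 3) uses at most 7 extra threads. The weakness is that the budget is fixed up front and cannot adapt to load, which is the gap tasks are meant to fill.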

Tasks versus threads

• A task is a chunk of work that can be executed

– A task may execute in parallel with other tasks

– A task will eventually be executed, but no guarantee on when

• Tasks are scheduled and executed by a run-time (TBB)

– Maintain a list of tasks which are ready to run

– Have one thread per CPU for running tasks

– If a thread is idle, assign a task from the ready queue to it

– No limit on number of tasks which are ready to run

– (OS is still responsible for mapping threads to CPUs)

• TBB has a number of high-level ways to use tasks

– But there is a single low-level underlying task primitive
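The run-time described above (a ready list, a fixed set of worker threads, idle workers picking up ready tasks) can be sketched with standard C++ only. This is a toy model and not how TBB is implemented; ToyTaskPool and count_with_pool are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Toy run-time: a fixed set of workers pulling from one ready queue.
class ToyTaskPool {
    std::mutex m;
    std::condition_variable work_cv, idle_cv;
    std::queue<std::function<void()>> ready;   // tasks ready to run
    std::vector<std::thread> workers;
    std::size_t pending = 0;                   // queued or running tasks
    bool stopping = false;

    void worker_loop() {
        for(;;){
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m);
                work_cv.wait(lock, [&]{ return stopping || !ready.empty(); });
                if(ready.empty()) return;      // stopping and queue drained
                task = std::move(ready.front());
                ready.pop();
            }
            task();                            // run outside the lock
            std::lock_guard<std::mutex> lock(m);
            if(--pending == 0) idle_cv.notify_all();
        }
    }
public:
    explicit ToyTaskPool(unsigned n) {
        if(n == 0) n = 1;
        for(unsigned i=0; i<n; i++)
            workers.emplace_back([this]{ worker_loop(); });
    }
    ~ToyTaskPool() {
        { std::lock_guard<std::mutex> lock(m); stopping = true; }
        work_cv.notify_all();
        for(auto &w : workers) w.join();
    }
    // Add a task to the ready queue; no limit on how many are queued.
    void run(std::function<void()> task) {
        assert(task);
        { std::lock_guard<std::mutex> lock(m); ready.push(std::move(task)); ++pending; }
        work_cv.notify_one();
    }
    // Block the caller until every submitted task has finished.
    void wait() {
        std::unique_lock<std::mutex> lock(m);
        idle_cv.wait(lock, [&]{ return pending == 0; });
    }
};

// Hypothetical usage: run 'tasks' increments on 'workers' threads.
int count_with_pool(int tasks, unsigned workers)
{
    std::atomic<int> counter{0};
    ToyTaskPool pool(workers);
    for(int i=0; i<tasks; i++) pool.run([&counter]{ counter++; });
    pool.wait();
    return counter.load();
}
```

The number of threads is fixed by the constructor, while the number of queued tasks is unbounded. One simplification to be aware of: calling wait() from inside a task would deadlock this toy pool, so it only supports waiting from outside.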

Overview of task groups

• A task group collects together a number of child tasks

– The task creating the group is called the parent

– One or more child tasks are created and run() by the parent

– Child tasks may execute in parallel

– Parent task must wait() for all child tasks before returning

parallel_for using tbb::task_group

#include "tbb/task_group.h"

template<class TI, class TF>
void parallel_for(const TI &begin, const TI &end, const TF &f)
{
  if(begin+1 == end){
    f(begin);
  }else{
    TI mid=(begin+end)/2;
    auto left=[&](){ parallel_for(begin, mid, f); };
    auto right=[&](){ parallel_for(mid, end, f); };
    // Spawn the two tasks in a group
    tbb::task_group group;
    group.run(left);
    group.run(right);
    group.wait(); // Wait for both to finish
  }
}

Overview of task groups

• Some important differences between tasks and threads

– Threads must execute in parallel

– A thread may continue after its creator exits

– Threads must be joined individually

More patterns: tbb::parallel_invoke

template<typename Func0, typename Func1>

void parallel_invoke(const Func0& f0, const Func1& f1);

template<typename Func0, typename Func1, typename Func2>

void parallel_invoke(const Func0& f0, const Func1& f1, const Func2& f2);

• Takes two or more functions and may run them in parallel

– Overloaded for different numbers of arguments

– No overload for 1 argument for obvious reasons

• Interface is very clean, but also quite simple

– Decision about number of tasks is completely static

– You can’t add more tasks once some have started

– No choice about when to synchronise with tasks

parallel_invoke using task_group

• parallel_invoke can be implemented using task_group

– task_group supports a super-set of the functionality

template<typename Fc0, typename Fc1, typename Fc2>
void parallel_invoke(const Fc0& f0, const Fc1& f1, const Fc2& f2)
{
  tbb::task_group group;
  group.run(f0);
  group.run(f1);
  group.run(f2);
  group.wait();
}

Can’t do task_group using parallel_invoke

• task_group is intrinsically dynamic

– Decide how much work to add at run-time

– Can add work even while tasks are running in the group

// f and g stand for whatever work each element needs
void my_function(int n, float *x)
{
  tbb::task_group group;
  for(int i=0;i<n;i++){
    if(x[i]==0){
      group.run([=](){ f(i); });
    }else{
      group.run([=](){ g(x[i]); });
    }
  }
  group.wait();
}

The underlying primitive: tbb::task

• TBB has a basic primitive called tbb::task

• This is the raw unit of scheduling understood by the library

– Other high-level wrappers create tasks internally

– The TBB run-time takes tasks and schedules them to a CPU

• Tasks are very flexible, with a lot of power

– Can express complicated dependency graphs

– Build non-local synchronisation and barriers

• With power comes responsibility

– They allow you to make mistakes

– Possible (though not likely) to mess up the TBB run-time

• Better to create wrappers on top that hide tasks

– parallel_for, parallel_invoke, task_group, parallel_reduce, ...

// Version 1: a raw tbb::task

class MyTask
  : public tbb::task
{
  int start, end;
public:
  MyTask(int _start, int _end)
  { start=_start; end=_end; }

  tbb::task * execute()
  {
    if(cond())
      return 0;
    // Two children, plus one for the wait_for_all() below
    set_ref_count(3);
    MyTask &t1=*new(allocate_child()) MyTask(start,(start+end)/2);
    spawn(t1);
    MyTask &t2=*new(allocate_child()) MyTask((start+end)/2, end);
    spawn(t2);
    DoSomethingFirst();
    wait_for_all();
    DoSomethingElse();
    return 0;
  }
};

void CreateTasks(int start, int end)
{
  MyTask &root=*new(tbb::task::allocate_root()) MyTask(start,end);
  tbb::task::spawn_root_and_wait(root);
}

// Version 2: the same structure using tbb::task_group

void MyTask(int start, int end)
{
  if(cond())
    return;
  tbb::task_group group;
  group.run([=](){MyTask(start,(start+end)/2); });
  group.run([=](){MyTask((start+end)/2,end); });
  DoSomethingFirst();
  group.wait();
  DoSomethingElse();
}

Life-cycle of a task

• Life-cycle of a task is due to interaction between the task and the run-time

– Individual task calls spawn, wait_for_all (sync), return

– TBB run-time will keep track of a task’s children (dependencies)

Scheduling through reference counts

• Each task has a reference count and a successor task

• The reference count identifies whether a task is blocked

– If the reference count is zero then the task could be run

– But only if it has been given to the task scheduler

– Legal to create a task and not give it to the scheduler

– Note the difference: “reference count” vs “C++ reference”

• Successor task identifies the task blocked by this task

– Generally the successor task is the creator, or parent

– When a task completes it decrements the count of its successor
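The two fields can be modelled in a few lines. This is a simplified sketch of the idea, not TBB's data structure: ModelTask, finish and two_children are hypothetical names, and the extra +1 that set_ref_count needs when wait_for_all is used is left out to keep the model small.

```cpp
#include <atomic>
#include <cassert>

// Minimal model of a task: a count of unfinished dependencies and a
// pointer to the task that is blocked on this one.
struct ModelTask {
    std::atomic<int> refcount{0};
    ModelTask *successor = nullptr;
    bool ready = false;              // set once the task may be scheduled
};

// Called when t completes: decrement the successor's count and release
// the successor if that was its last outstanding dependency.
inline void finish(ModelTask &t)
{
    ModelTask *s = t.successor;
    if(s && s->refcount.fetch_sub(1) == 1) s->ready = true;
}

// Hypothetical scenario: one parent blocked on two children.
struct Trace { bool after_first; bool after_second; };

inline Trace two_children()
{
    ModelTask parent, a, b;
    parent.refcount = 2;             // blocked by two children
    a.successor = &parent;
    b.successor = &parent;
    finish(a);
    bool first = parent.ready;       // one child still outstanding
    finish(b);
    bool second = parent.ready;      // count has reached zero
    assert(parent.refcount.load() == 0);
    return {first, second};
}
```

The parent stays blocked after the first child finishes and becomes ready only when the second decrement takes the count to zero. No list of children is needed: the count and the successor pointer carry all the scheduling information.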

[Animation: life-cycle of a MyTask and its two children while MyTask::execute() runs. Each box shows the task's refcount, start, end, successor and state.]

• Root task is Created with refcount 0 and no successor, then starts Running

• set_ref_count(3): root refcount becomes 3

• Two children are allocated: each is Created with refcount 0 and has the root as its successor

• spawn(t1), spawn(t2): both children become Ready

• One child starts Running while the other is still Ready

• wait_for_all(): root becomes Blocked, shown with refcount 2

• Both children Running: root still Blocked, shown with refcount 1

• Both children Finished: root refcount is 0 and the root is Running again

• Root returns from execute() and is Finished

Managing reference counts

• What happens if we get the reference count wrong?

• Finishing task calls decrement_ref_count on successor

– Automatically returns task to scheduler if count becomes zero

[Animation: two ways to get the reference count wrong.]

• Count too low: set_ref_count(1) with two children

– Root is Running again with refcount 0 while one child is Finished and the other is still only Ready

• Count set too late: set_ref_count(3) called after spawn(t1) and spawn(t2)

– A child can finish before the count is set, taking the root's refcount from 0 to -1
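The effect of a wrong initial count can be worked through with a deterministic toy model. It assumes, following the diagrams, that wait_for_all accounts for one unit of the count, that each finishing child accounts for one more, and that the parent resumes when the count reaches zero. It is not TBB's scheduler; resume_point is a hypothetical name.

```cpp
#include <cassert>

// Returns how many children are still unfinished when the parent
// resumes from its wait, or -1 if the parent never resumes.
// initial  : value passed to set_ref_count
// children : number of child tasks spawned before the wait
inline int resume_point(int initial, int children)
{
    assert(children >= 0);
    int count = initial;
    count--;                          // the wait itself takes one unit
    if(count == 0) return children;   // resumes at once
    for(int done = 1; done <= children; done++){
        count--;                      // one child finishes
        if(count == 0) return children - done;
    }
    return -1;                        // count never reached zero
}
```

Under this model resume_point(3, 2) is 0 (the correct count: the parent resumes after both children), resume_point(1, 2) is 2 (count too low: the parent carries on while both children are outstanding), and resume_point(4, 2) is -1 (count too high: the parent waits forever).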

Some help is available

• TBB library comes in two forms: debug and release

– release library does no error checking – all about speed

– debug library will check reference counts at many points

• Choose library version at compilation and link stages

– Debug: define TBB_USE_DEBUG=1 when compiling (e.g. -DTBB_USE_DEBUG=1)

• On Microsoft compilers it will automatically link the correct library

• On other compilers use “-ltbb” vs “-ltbb_debug”

– Usually maintain different release and debug settings

• Debug: /DTBB_USE_DEBUG=1 /MDd

• Release: /DNDEBUG=1 /O2

– Can set up in Visual Studio or in a makefile

• Generally: try to avoid raw tasks if possible

Many design patterns are built on tasks

• Iteration in various forms

– parallel_for, parallel_for_each

• Reduction and accumulation

– parallel_reduce

• Data-dependent looping and queue processing

– parallel_do

• Support for heterogeneous tasks

– parallel_invoke, task_group

• Heterogeneous tasks and token-based data-flow

– parallel_pipeline

• Goal: turn design patterns into concrete functions
