An Adaptive Task Creation Strategy for Work-Stealing Scheduling

Post on 24-Feb-2016

INSTITUTE OF COMPUTING TECHNOLOGY

An Adaptive Task Creation Strategy for Work-Stealing Scheduling

Lei Wang, Huimin Cui, Yuelu Duan, Fang Lu, Xiaobing Feng, Pen-Chung Yew

ICT, Chinese Academy of Sciences, China
University of Minnesota, U.S.A.

Forecast

Adaptive task granularity

fine-grained parallelism

tasks

Multi-cores

An adaptive task creation strategy

Work-stealing

Outline

An adaptive task creation strategy

A new data attribute -- taskprivate

Evaluations

Conclusions

Background

Cilk, Cilk++, X10, OpenMP 3.0, TBB, TPL …

Parallel programming languages and libraries that support task-level parallelism

Programmer: divides work into tasks instead of threads

Runtime system: maps and schedules tasks onto physical threads

Key technique: work-stealing scheduling

Granularity

Too fine: scheduling overhead dominates

Too coarse: potential parallelism is lost, causing starvation

[Figure labels: cut-off = 3, cut-off = 1]

An unbalanced computation tree

[Figure: P0 – red, P1 – blue, P2 – green, P3 – yellow]

A cut-off strategy

[Figure: P0 – red, P1 – blue, P2 – green, P3 – yellow]

Load imbalance
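The cut-off idea can be sketched in plain C. This is a minimal illustration and not code from the slides: `fib_cutoff`, `count_tasks`, and the `tasks_created` counter are hypothetical names, and incrementing the counter stands in for the task creation that a runtime such as Cilk would perform at a spawn. Above the cut-off depth each recursive call is counted as a task; at or below it the call is an ordinary function call.

```c
#include <assert.h>

/* Hypothetical counter standing in for a runtime's task creation. */
static long tasks_created = 0;

/* Recursive Fibonacci with a depth-based cut-off.
 * While depth < cutoff, the two recursive calls are counted as tasks
 * (a runtime would spawn them). From the cut-off depth downward they
 * are plain function calls with no scheduling overhead. */
long fib_cutoff(int n, int depth, int cutoff)
{
    assert(n >= 0 && depth >= 0);
    if (n < 2)
        return n;
    if (depth < cutoff)
        tasks_created += 2;   /* two children would be spawned here */
    return fib_cutoff(n - 1, depth + 1, cutoff)
         + fib_cutoff(n - 2, depth + 1, cutoff);
}

/* Runs fib_cutoff(n) from depth 0 and reports how many tasks it counted. */
long count_tasks(int n, int cutoff)
{
    tasks_created = 0;
    (void)fib_cutoff(n, 0, cutoff);
    return tasks_created;
}
```

For n = 10, a cut-off of 1 counts only 2 tasks, and those two subtrees (fib(9) and fib(8)) differ in size, which is the kind of imbalance the slide points at. A cut-off of 3 counts 14 tasks for the same input.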

An adaptive task creation strategy -- AdaptiveTC

[Figure: P0 – red, P1 – blue, P2 – green, P3 – yellow; one node is labeled "A special task"]

AdaptiveTC

When executing a spawn statement, AdaptiveTC creates one of:
    a task
    a function call (a fake task)
    a special task

It adaptively switches between tasks and fake tasks to get better performance.

[Diagram: transitions from a task to a fake task and from a fake task to a task. Labels: cut-off, a special task, improving performance, keeping idle threads busy, good load balancing]
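The three outcomes of a spawn can be pictured as a small decision function. The slide names the outcomes but does not give the conditions that trigger each one, so everything about the rule below is an assumption made for illustration: the name `choose_spawn_kind`, the depth-based cut-off, and the `idle_thread_waiting` flag are all hypothetical and should not be read as the authors' algorithm.

```c
#include <assert.h>
#include <stdbool.h>

/* The three outcomes of a spawn statement named on the slide. */
typedef enum { TASK, FAKE_TASK, SPECIAL_TASK } spawn_kind;

/* Hypothetical decision rule (assumed, not taken from the slides):
 *  - while running as a real task, keep creating tasks until the
 *    cut-off depth, then switch to fake tasks (plain function calls);
 *  - while running as a fake task, create a special task only when an
 *    idle thread is waiting for work, otherwise keep making plain calls. */
spawn_kind choose_spawn_kind(bool in_fake_task, int depth, int cutoff,
                             bool idle_thread_waiting)
{
    assert(depth >= 0 && cutoff >= 0);
    if (!in_fake_task)
        return depth < cutoff ? TASK : FAKE_TASK;
    return idle_thread_waiting ? SPECIAL_TASK : FAKE_TASK;
}
```

Under these assumptions the cut-off moves execution from tasks to fake tasks, and the special task is the way back when other threads run out of work.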

N-queen problem

Which Cilk programs are correct?

(1)
    cilk int nqueens(int depth, int n, char x [ ])
    {
        ...
        tmpx = (char *)malloc(n * sizeof(char));
        memcpy(tmpx, x, n * sizeof(char));
        sn += spawn nqueens(depth + 1, n, tmpx);
        free(tmpx);
        ...
        sync;
        return sn;
    }

(2)
    cilk int nqueens(int depth, int n, char x [ ])
    {
        …
        tmpx = (char *)malloc(n * sizeof(char));
        memcpy(tmpx, x, n * sizeof(char));
        sn += spawn nqueens(depth + 1, n, tmpx);
        ...
        sync;
        free(x);
        return sn;
    }

(3)
    cilk int nqueens(int depth, int n, char x [ ])
    {
        …
        tmpx = Cilk_alloca(n * sizeof(char));
        memcpy(tmpx, x, n * sizeof(char));
        sn += spawn nqueens(depth + 1, n, tmpx);
        …
        sync;
        return sn;
    }

A new data attribute -- taskprivate

Workspace copying
    Not easy to program
    Overhead is high

taskprivate
    Introduced for workspace variables

An AdaptiveTC program for nqueens

    cilk int nqueens(int depth, int n, char x [ ])
        taskprivate: (x[]) (n * sizeof(char));
    {
        int sn = 0;
        if(depth >= n){
            sn++;
            return sn;
        }
        for(j = 0; j < n; j++){
            if(place(depth, j, x)){
                x[depth] = j;
                sn += spawn nqueens(depth + 1, n, x);
            }
        }
        sync;
        return sn;
    }

In a fake task (a function call)

    x[depth] = j;
    sn += nqueens(depth + 1, n, x);

In a task

    x[depth] = j;
    tmpx = Cilk_alloca(n * sizeof(char));
    memcpy(tmpx, x, n * sizeof(char));
    sn += nqueens(depth + 1, n, tmpx);
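The difference between the two expansions can be tried out in serial C. The sketch below is not the AdaptiveTC implementation: `nqueens_shared`, `nqueens_copy`, and this version of `place` are hypothetical stand-ins, `malloc`/`free` replace `Cilk_alloca`, and there is no real spawn. It only shows that reusing the workspace in place and handing each child a private copy count the same solutions, while the copying variant pays for an allocation and a `memcpy` at every call.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if a queen can go in row `depth`, column `j`,
 * given the columns x[0..depth-1] already chosen for earlier rows. */
static int place(int depth, int j, const char x[])
{
    for (int i = 0; i < depth; i++) {
        int d = x[i] - j;
        if (d == 0 || d == depth - i || d == i - depth)
            return 0;   /* same column or same diagonal */
    }
    return 1;
}

/* "Fake task" style: the child reuses the parent's workspace x. */
int nqueens_shared(int depth, int n, char x[])
{
    int sn = 0;
    if (depth >= n)
        return 1;
    for (int j = 0; j < n; j++) {
        if (place(depth, j, x)) {
            x[depth] = (char)j;
            sn += nqueens_shared(depth + 1, n, x);
        }
    }
    return sn;
}

/* "Task" style: each child receives a private copy of the workspace. */
int nqueens_copy(int depth, int n, const char x[])
{
    int sn = 0;
    if (depth >= n)
        return 1;
    for (int j = 0; j < n; j++) {
        if (place(depth, j, x)) {
            char *tmpx = malloc((size_t)n);
            assert(tmpx != NULL);
            memcpy(tmpx, x, (size_t)n);
            tmpx[depth] = (char)j;
            sn += nqueens_copy(depth + 1, n, tmpx);
            free(tmpx);   /* safe only because this serial call has returned */
        }
    }
    return sn;
}
```

Calling either function with depth 0, n = 8, and a zero-initialized `char x[8]` yields 92, the known number of 8-queens solutions.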

Test system, test cases

Test system: 8 cores
    2-processor quad-core Intel Xeon E5520 (2.26 GHz, 8 GB memory)

Test cases: 8
    6 are backtracking search programs.
    2 are divide-and-conquer programs.

Compared systems
    Cilk-5.4.6, Tascell (PPoPP’09), AdaptiveTC
    gcc -O3

Test case 1 -- performance: Nqueen-array(16)

[Figure: speedup versus number of threads (1 to 8) for Cilk, Cilk-SYNCHED, Tascell, and AdaptiveTC]

Execution time (seconds)

System          1 thread    8 threads
C               61          61
Cilk            198         24.57
Cilk-SYNCHED    184         22.41
Tascell         85          14.24
AdaptiveTC      66          8.27
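The timings for Nqueen-array(16) can be turned into speedups with a one-line ratio. The slide does not say which baseline its plot uses, so the helper below (a hypothetical `speedup` function, not something from the slides) leaves the baseline to the caller.

```c
#include <assert.h>

/* Speedup of a run taking `time_s` seconds relative to a
 * baseline taking `baseline_s` seconds. */
double speedup(double baseline_s, double time_s)
{
    assert(baseline_s > 0.0 && time_s > 0.0);
    return baseline_s / time_s;
}
```

With the numbers reported for AdaptiveTC on 8 threads (8.27 s), the ratio is about 7.38 against the serial C time (61 s), about 7.98 against AdaptiveTC's own 1-thread time (66 s), and about 2.97 against Cilk on 8 threads (24.57 s).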

Test case 1 -- analysis

Breakdown of overhead

[Figure: stacked percentage bars per system; legend: working, taskprivate variable]

System          Overhead
Tascell         28.7%
Cilk            69.2%
Cilk-SYNCHED    67%
AdaptiveTC      7.9%

Load balanced

The usage of cores with 8 threads

System          Busy      Idle
Tascell         83.3%     16.7%
Cilk            99.9%     0.1%
AdaptiveTC      99.0%     1.0%
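The slides do not define how the overhead percentage is computed. One reading that comes close to the reported values is the share of a system's 1-thread time that exceeds the serial C time; this is an assumption, and `overhead_fraction` is a hypothetical helper written only to check it against the Nqueen-array(16) timings.

```c
#include <assert.h>

/* Fraction of a parallel system's 1-thread time that is not
 * accounted for by the serial C time (assumed definition). */
double overhead_fraction(double one_thread_s, double serial_c_s)
{
    assert(one_thread_s > 0.0);
    assert(serial_c_s >= 0.0 && serial_c_s <= one_thread_s);
    return (one_thread_s - serial_c_s) / one_thread_s;
}
```

Against the serial C time of 61 s, this gives about 69.2% for Cilk (198 s), 66.8% for Cilk-SYNCHED (184 s), 28.2% for Tascell (85 s), and 7.6% for AdaptiveTC (66 s). These are close to, though not identical with, the 69.2%, 67%, 28.7%, and 7.9% on the slide, so the authors' measurement may differ in detail.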

Test case 2 -- performance: Nqueen-compute(16)

[Figure: speedup versus number of threads (1 to 8) for Cilk, Cilk-SYNCHED, Tascell, and AdaptiveTC]

Execution time (seconds)

System          1 thread    8 threads
C               554         554
Cilk            669         85
Cilk-SYNCHED    661         88
Tascell         627         114
AdaptiveTC      612         77

Test case 2 -- analysis

Breakdown of overhead

[Figure: stacked percentage bars per system; legend: working, taskprivate variable, deque/nested function]

System          Overhead
Tascell         11.7%
Cilk            17.2%
Cilk-SYNCHED    16.2%
AdaptiveTC      9.5%

Load balanced

The usage of cores with 8 threads

System          Busy      Idle
Tascell         79.2%     20.8%
Cilk            99.9%     0.1%
AdaptiveTC      99.1%     0.9%

Experimental results

[Figures: speedup versus number of threads (1 to 8) for Cilk, Cilk-SYNCHED, Tascell, and AdaptiveTC on four benchmarks: Sudoku (input_balance tree), Knight's tour(6*6), Strimko, Pentomino(13)]

Experimental results (cont’d)

[Figures: speedup versus number of threads (1 to 8) for Cilk, Tascell, and AdaptiveTC on two benchmarks: Comp(60000), Fib(45)]

Figure: Speedup with 8 threads, baseline is Cilk’s execution time

[Bar chart: Cilk, Cilk-SYNCHED, Tascell, and AdaptiveTC on Nqueen_array(16), Nqueen_compute(16), Strimko, Knight's Tour(6*6), Sudoku (balance_tree), Pentomino(13), Fib(45), Comp(60000), and Average]

System          Speedup
Cilk            1
Cilk-SYNCHED    1.07
Tascell         1.5
AdaptiveTC      2.24

Conclusions -- AdaptiveTC

An adaptive task creation strategy controls the task granularity.
    Reducing the system overhead
    Achieving good load balancing

A new data attribute, taskprivate, is introduced for workspace variables.
    Improving programmability
    Reducing the cost of workspace copying with an adaptive task creation strategy

Thanks!