Near Optimal Work-Stealing Tree for Highly Irregular Data-Parallel Workloads
Aleksandar Prokopec, Martin Odersky
Feb 24, 2016
5
Uniform workload
(0 until 10000000) reduce (_ + _)
sum = sum + x
[Figure: cycles per element, elements 0 … N]
6
Baseline workload
for (_ <- 0 until 10000000) {}
[Figure: cycles per element, elements 0 … N]
9
Irregular workload
for {
  x <- 0 until width
  y <- 0 until height
} image(x, y) = compute(x, y)
[Figure: cycles per element, elements 0 … N]
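A minimal sketch of why such a loop is irregular. The slides do not say what `compute` does, so this assumes a hypothetical escape-time iteration (Mandelbrot-style) as a stand-in; the iteration count plays the role of the cycles spent on one pixel. The names `iterations` and `rowWorkload` are illustrative only.

```scala
object IrregularWorkloadSketch {
  // Hypothetical stand-in for compute(x, y): number of escape-time
  // iterations for the point (cx, cy). Not taken from the slides.
  def iterations(cx: Double, cy: Double, maxIter: Int): Int = {
    var zx = 0.0
    var zy = 0.0
    var i = 0
    while (i < maxIter && zx * zx + zy * zy <= 4.0) {
      val t = zx * zx - zy * zy + cx
      zy = 2.0 * zx * zy + cy
      zx = t
      i += 1
    }
    i
  }

  // Work per pixel along one image row, mapped onto the interval [-2, 1).
  def rowWorkload(width: Int, maxIter: Int): Vector[Int] =
    Vector.tabulate(width) { x =>
      iterations(-2.0 + 3.0 * x / width, 0.0, maxIter)
    }
}
```

Some pixels hit `maxIter` while neighbouring ones stop after a few iterations, so the cost of an element cannot be predicted from its index alone.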
11
Workload function
workload(n) – work spent on element n after the data-parallel operation completed
12
Workload function
Could be…
Runtime-value dependent
for {
  x <- 0 until width
  y <- 0 until height
} img(x, y) = compute(x, y)
13
Workload function
Could be…
Execution-schedule dependent
for (n <- nodes) n.neighbours += new Node
14
Workload function
Could be…
Totally random
for ((x, y) <- img.indices) img(x, y) = sample(x + random(), y + random())
17
Data-parallel scheduler
Assign loop elements to workers without knowledge about the workload function.
1. Linear speedup for the baseline workload
2. Optimal speedup for irregular workloads
19
Static batching
Decides on the worker-element assignment before the data-parallel operation begins.
No knowledge → divide uniformly.
Not optimal for even mildly irregular workloads.
[Figure: cycles per element, elements 0 … N]
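A small sketch of uniform static batching and the imbalance it can produce. The object and method names are illustrative, and the workload used in the usage note is made up for the example.

```scala
object StaticBatchingSketch {
  // Split [0, n) into p contiguous batches of (nearly) equal length,
  // decided before any element is processed.
  def batches(n: Int, p: Int): Vector[(Int, Int)] =
    Vector.tabulate(p) { i =>
      ((i.toLong * n / p).toInt, ((i + 1).toLong * n / p).toInt)
    }

  // Total cost each worker ends up with for a given workload function.
  def costPerWorker(n: Int, p: Int)(workload: Int => Long): Vector[Long] =
    batches(n, p).map { case (lo, hi) =>
      (lo until hi).map(workload).sum
    }
}
```

With 8 elements, 2 workers, and a workload of 1 for the first half and 10 for the second, `costPerWorker(8, 2)(i => if (i < 4) 1L else 10L)` gives one worker a cost of 4 and the other a cost of 40, so the operation takes 40 units where an even split of the work would take 22.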
20
Fixed-size batching
Workload-driven – decides during execution.
[Figure: cycles per element, elements 0 … N, with a shared progress counter]
Workers advance the progress counter with CAS to claim the next batch:
progress = 0
progress = 2 (T0: CAS)
progress = 4 (T1: CAS)
progress = 6 (T0: CAS)
progress = 8 (T0: CAS)
progress = 10 (T0: CAS)
progress = 12 (T0: CAS)
Pros: lightweight
Cons: minimum batch size, contention
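A sketch of fixed-size batching with a single progress counter, assuming plain JVM threads and `java.util.concurrent.atomic.AtomicInteger`. The slides show only the counter and the CAS steps; the surrounding `run` function, its parameters, and the summing of results are assumptions made for illustration.

```scala
import java.util.concurrent.atomic.{AtomicInteger, AtomicLong}

object FixedSizeBatchingSketch {
  // Sums workload(i) for i in [0, n). Each worker repeatedly claims the
  // next `batch` elements by CAS-ing the shared progress counter.
  def run(n: Int, batch: Int, workers: Int)(workload: Int => Long): Long = {
    require(batch > 0 && workers > 0)
    val progress = new AtomicInteger(0) // index of the next unclaimed element
    val total = new AtomicLong(0L)

    @annotation.tailrec
    def claim(): Option[(Int, Int)] = {
      val p = progress.get()
      if (p >= n) None
      else {
        val next = math.min(p + batch, n)
        // A failed CAS means another worker claimed this batch: retry.
        if (progress.compareAndSet(p, next)) Some((p, next)) else claim()
      }
    }

    val threads = Vector.fill(workers) {
      new Thread(new Runnable {
        def run(): Unit = {
          var local = 0L
          var b = claim()
          while (b.isDefined) {
            val (lo, hi) = b.get
            var i = lo
            while (i < hi) { local += workload(i); i += 1 }
            b = claim()
          }
          total.addAndGet(local)
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    total.get()
  }
}
```

Every batch costs one successful CAS on the same memory location, which is where both cons come from: a small `batch` means more CAS operations per element and more failed retries under contention.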
29
Fixed-size batching - contention
30
Factoring, GSS, TS
Batch size varies.
[Figure: cycles per element, elements 0 … N, with a progress counter]
Pros: lightweight
Cons: contention
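The slide names these schemes without detail. As one illustration of a varying batch size, here is a sketch that assumes GSS refers to guided self-scheduling, where each claimed batch is the remaining element count divided by the number of workers, rounded up. The function name is hypothetical.

```scala
object GuidedSelfSchedulingSketch {
  // Batch sizes handed out for n elements and p workers:
  // each batch is ceil(remaining / p), so batches shrink over time.
  def batchSizes(n: Int, p: Int): List[Int] = {
    require(p > 0)
    @annotation.tailrec
    def loop(remaining: Int, acc: List[Int]): List[Int] =
      if (remaining <= 0) acc.reverse
      else {
        val size = (remaining + p - 1) / p
        loop(remaining - size, size :: acc)
      }
    loop(n, Nil)
  }
}
```

For 16 elements and 2 workers this yields batches of 8, 4, 2, 1, 1. The large early batches reduce the number of updates to the counter, while the small late ones limit imbalance at the end.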
33
Task-based work-stealing
[Figure: cycles per element, elements 0 … N, split into tasks 0..2, 2..4, 4..8, 8..16. Workers T0 and T1: T0 holds 0..2, with 2..4, 4..8, 8..16 queued.]
steal – a rare event
35
Task-based work-stealing
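[Figure: cycles per element, elements 0 … N, split into tasks 0..2, 2..4, 4..8, 8..16, with queue entries 2..4, 4..8, 8..16. T0 holds 0..2; T1 holds 8..10, with 10..12 and 12..16 queued.]
Pros: can be adaptive - uses stealing information
Cons: heavyweight - minimum batch size much larger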
36
Task-based work-stealing
[Figure: cycles per element, elements 0 … N, split into tasks 0..2, 2..4, 4..8, 8..16]
A task cannot be stolen after T0 starts processing it.
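A sketch of task-based work-stealing, assuming Java's `java.util.concurrent.ForkJoinPool` as the scheduler (the slides do not name a library). The range is split recursively into tasks; a task below the `threshold` size runs sequentially. `SumTask`, `parSum`, and `threshold` are illustrative names.

```scala
import java.util.concurrent.{ForkJoinPool, RecursiveTask}

object TaskStealingSketch {
  // Sums f(i) for i in [lo, hi) by recursive splitting.
  final class SumTask(lo: Int, hi: Int, threshold: Int, f: Int => Long)
      extends RecursiveTask[Long] {
    override def compute(): Long =
      if (hi - lo <= threshold) {
        // Leaf task: processed by one worker from start to finish.
        var i = lo
        var s = 0L
        while (i < hi) { s += f(i); i += 1 }
        s
      } else {
        val mid = lo + (hi - lo) / 2
        val right = new SumTask(mid, hi, threshold, f)
        right.fork() // pushed to this worker's deque; idle workers may steal it
        val left = new SumTask(lo, mid, threshold, f).compute()
        left + right.join()
      }
  }

  def parSum(n: Int, threshold: Int)(f: Int => Long): Long =
    ForkJoinPool.commonPool().invoke(new SumTask(0, n, threshold, f))
}
```

A leaf task that has started is invisible to other workers, so `threshold` bounds the granularity at which load can be rebalanced, and each task object adds allocation and queueing overhead.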
37
Work-stealing tree
Each node is drawn as [start, progress, owner, end]; every state change is a CAS.
[0, 0, T0, N] owned
[0, 50, T0, N] owned (T0: CAS advances progress)
[0, N, T0, N] completed (T0: CAS, repeated until progress reaches N)
What about stealing?
[0, -51, T0, N] stolen (T1: CAS on the owned node at progress 50)
[0, -51, T0, N] expanded (T0 or T1: CAS), with children [50, 50, T0, M] and [M, M, T1, N]
M = (50 + N) / 2
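A sketch of a single tree node and its transitions, under stated assumptions: the class and method names are hypothetical, owners are plain strings, and the stolen state is encoded as `-progress - 1`, which matches the 50 to -51 step in the diagram. It is an illustration of the state machine, not the authors' implementation.

```scala
import java.util.concurrent.atomic.{AtomicInteger, AtomicReference}

object WorkStealingNodeSketch {
  final class Node(val start: Int, val end: Int, val owner: String) {
    // progress >= 0: next unprocessed element (owned, or completed at end).
    // progress < 0: stolen, encoded as -progress - 1.
    val progress = new AtomicInteger(start)
    val children = new AtomicReference[(Node, Node)](null)

    // Owner claims up to `step` elements; None once completed or stolen.
    @annotation.tailrec
    def advance(step: Int): Option[(Int, Int)] = {
      val p = progress.get()
      if (p < 0 || p >= end) None
      else {
        val next = math.min(p + step, end)
        if (progress.compareAndSet(p, next)) Some((p, next)) else advance(step)
      }
    }

    // Thief marks the node stolen; false if completed or already stolen.
    @annotation.tailrec
    def steal(): Boolean = {
      val p = progress.get()
      if (p < 0 || p >= end) false
      else if (progress.compareAndSet(p, -p - 1)) true
      else steal()
    }

    // Either thread may expand a stolen node; only the first CAS installs
    // the children, and everyone sees the same pair afterwards.
    def expand(thief: String): Option[(Node, Node)] = {
      val p = progress.get()
      if (p >= 0) None
      else {
        val reached = -p - 1
        val m = (reached + end) / 2
        children.compareAndSet(
          null, (new Node(reached, m, owner), new Node(m, end, thief)))
        Some(children.get())
      }
    }
  }
}
```

With `N = 100`, advancing to 50, stealing, and expanding produces children covering 50 until 75 for T0 and 75 until 100 for T1, so the owner keeps the half adjacent to where it stopped.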
45
Work-stealing tree - contention
50
Work-stealing tree scheduling
1) find a non-expanded, non-completed node
2) if not found, terminate
3) if not owned, steal and/or expand, and descend
4) advance until node is completed or stolen
5) go to 1)
56
Choosing the node to steal
Find first, in-order traversal: catastrophic – a lot of stealing, huge trees.
Find first, random order traversal: works reasonably well.
Find most elements: generates the fewest nodes. Seems to be best.
[Figure: the same tree, with node labels 2, 9, 5, 3, shown once per strategy]
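A sketch contrasting two of the strategies on a simplified model. It assumes a binary tree in which only leaves carry remaining elements, which is a simplification of the actual tree, and the example tree in the usage note is hypothetical (the slide's tree shape is not recoverable here).

```scala
object StealChoiceSketch {
  sealed trait Tree
  final case class Leaf(remaining: Int) extends Tree
  final case class Inner(left: Tree, right: Tree) extends Tree

  // "Find first": the first leaf in left-to-right order that has work left.
  def findFirst(t: Tree): Option[Leaf] = t match {
    case l @ Leaf(r) => if (r > 0) Some(l) else None
    case Inner(a, b) => findFirst(a).orElse(findFirst(b))
  }

  // "Find most elements": the leaf with the largest remaining count.
  def findMost(t: Tree): Option[Leaf] = t match {
    case l @ Leaf(r) => if (r > 0) Some(l) else None
    case Inner(a, b) =>
      (findMost(a).toList ++ findMost(b).toList)
        .reduceOption((x, y) => if (y.remaining > x.remaining) y else x)
  }
}
```

On `Inner(Inner(Leaf(2), Leaf(9)), Inner(Leaf(5), Leaf(3)))`, `findFirst` picks the leaf with 2 elements while `findMost` picks the one with 9. Stealing the largest range leaves the thief with more work per steal, which is consistent with fewer steals and fewer nodes overall.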
57
Comparison with fixed-size batching
58
Comparison with fixed-size batching
59
Comparison with task work-stealing
60
Thank you!
Questions?
61
Finding work
62
Other workloads