Transcript
Page 1: CSE211: Compiler Design

CSE211: Compiler Design Nov. 19, 2020

• Topic: SMP parallelism
• Compiler implementations!

• Discussion questions:
  • Do modern compilers automatically parallelize your code?
  • Have you ever used an auto-parallelizing compiler?

[Diagram: worklist 0 holds tasks 0 and 1 (thread 0); worklist 1 holds tasks 3 and 4 (thread 1)]

Page 2: CSE211: Compiler Design

Announcements

• Midterm is due today. Clarification questions are posted as discussions on Canvas. Resubmit by emailing me if you’d like

• HW3 is released. Due Dec. 4

• Paper/project proposals due Nov. 24

• Guest speaker next lecture

Page 3: CSE211: Compiler Design

CSE211: Compiler Design Nov. 19, 2020

• Topic: SMP parallelism
• Compiler implementations!

• Discussion questions:
  • Do modern compilers automatically parallelize your code?
  • Have you ever used an auto-parallelizing compiler?

[Diagram: worklist 0 holds tasks 0 and 1 (thread 0); worklist 1 holds tasks 3 and 4 (thread 1)]

Page 4: CSE211: Compiler Design

Implementing SMP parallelism in a compiler

• Where to do it?
  • High-level: DSL, Python, etc.
  • Mid-level: C/C++
  • Low-level: LLVM-IR
  • ISA: e.g. x86

Page 5: CSE211: Compiler Design

Implementing SMP parallelism in a compiler

• Where to do it?
  • High-level: DSL, Python, etc.
  • Mid-level: C/C++
  • Low-level: LLVM-IR
  • ISA: e.g. x86

Tradeoffs at all levels

Page 6: CSE211: Compiler Design

Implementing SMP parallelism in a compiler

• Where to do it?
  • High-level: DSL, Python, etc.
  • Mid-level: C/C++
  • Low-level: LLVM-IR
  • ISA: e.g. x86

Here you’ve lost information about for loops, but SSA provides a nice foundation for analysis.

Page 7: CSE211: Compiler Design

Implementing SMP parallelism in a compiler

• Where to do it?
  • High-level: DSL, Python, etc.
  • Mid-level: C/C++
  • Low-level: LLVM-IR
  • ISA: e.g. x86

Good frameworks available for managing threads (C++, OpenMP). Good tooling for analysis and codegen: clang visitors, pycparser, etc.

Page 8: CSE211: Compiler Design

Implementing SMP parallelism in a compiler

• Where to do it?
  • High-level: DSL, Python, etc.
  • Mid-level: C/C++
  • Low-level: LLVM-IR
  • ISA: e.g. x86

In many cases, DSLs compile down to, or link to, C/C++: DNN libraries, graph analytics DSLs, NumPy.

Some DSLs compile to LLVM: Numba

Page 9: CSE211: Compiler Design

Implementing SMP parallelism in a compiler

• Where to do it?
  • High-level: DSL, Python, etc.
  • Mid-level: C/C++
  • Low-level: LLVM-IR
  • ISA: e.g. x86

We will assume this level for the lecture

Page 10: CSE211: Compiler Design

Regular Parallel Loops

void foo() {
  ...
  for (int x = 0; x < SIZE; x++) {
    // Each iteration takes roughly
    // equal time
  }
  ...
}

• How to implement in a compiler:

Page 11: CSE211: Compiler Design

Regular Parallel Loops

void foo() {
  ...
  for (int x = 0; x < SIZE; x++) {
    // Each iteration takes roughly
    // equal time
  }
  ...
}

• How to implement in a compiler:

[Diagram: iterations 0 1 2 3 4 5 6 7 ... SIZE-1]

Page 12: CSE211: Compiler Design

Regular Parallel Loops

void foo() {
  ...
  for (int x = 0; x < SIZE; x++) {
    // Each iteration takes roughly
    // equal time
  }
  ...
}

• How to implement in a compiler:

[Diagram: iterations 0 1 2 3 4 5 6 7 ... SIZE-1]

say SIZE / NUM_THREADS = 4

Page 13: CSE211: Compiler Design

Regular Parallel Loops

void foo() {
  ...
  for (int x = 0; x < SIZE; x++) {
    // Each iteration takes roughly
    // equal time
  }
  ...
}

• How to implement in a compiler:

say SIZE / NUM_THREADS = 4

[Diagram: iterations 0 1 2 3 4 5 6 7 ... SIZE-1, split into chunks of 4 across Thread 0, Thread 1, ..., Thread N]

Page 14: CSE211: Compiler Design

Regular Parallel Loops

void foo() {
  ...
  for (int x = 0; x < SIZE; x++) {
    // Each iteration takes roughly
    // equal time
  }
  ...
}

• How to implement in a compiler:

make a new function with the for loop inside. Pass all needed variables as arguments. Take an extra argument for a thread id

Page 15: CSE211: Compiler Design

Regular Parallel Loops

void foo() {
  ...
  for (int x = 0; x < SIZE; x++) {
    // Each iteration takes roughly
    // equal time
  }
  ...
}

• How to implement in a compiler:

void parallel_loop(..., int tid) {
  for (x = 0; x < SIZE; x++) {
    // work based on x
  }
}

make a new function with the for loop inside. Pass all needed variables as arguments. Take an extra argument for a thread id

Page 16: CSE211: Compiler Design

Regular Parallel Loops

void foo() {
  ...
  for (int x = 0; x < SIZE; x++) {
    // Each iteration takes roughly
    // equal time
  }
  ...
}

• How to implement in a compiler:

void parallel_loop(..., int tid) {
  int chunk_size = SIZE / NUM_THREADS;
  for (x = 0; x < SIZE; x++) {
    // work based on x
  }
}

determine chunk size in new function

Page 17: CSE211: Compiler Design

Regular Parallel Loops

void foo() {
  ...
  for (int x = 0; x < SIZE; x++) {
    // Each iteration takes roughly
    // equal time
  }
  ...
}

• How to implement in a compiler:

void parallel_loop(..., int tid) {
  int chunk_size = SIZE / NUM_THREADS;
  int start = chunk_size * tid;
  int end = start + chunk_size;
  for (x = start; x < end; x++) {
    // work based on x
  }
}

Set new loop bounds

Page 18: CSE211: Compiler Design

Regular Parallel Loops

void foo() {
  ...
  for (int t = 0; t < NUM_THREADS; t++) {
    spawn(parallel_loop(..., t));
  }
  join();
  ...
}

• How to implement in a compiler:

void parallel_loop(..., int tid) {
  int chunk_size = SIZE / NUM_THREADS;
  int start = chunk_size * tid;
  int end = start + chunk_size;
  for (x = start; x < end; x++) {
    // work based on x
  }
}

Spawn threads
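A minimal runnable sketch of the whole transformation in C++ (an assumption for illustration: std::thread stands in for the slides' spawn/join pseudocode, and doubling a global array stands in for the loop body):

#include <thread>
#include <vector>

const int SIZE = 8;
const int NUM_THREADS = 2;
int a[SIZE];

// Outlined loop: each thread id gets one contiguous chunk.
void parallel_loop(int tid) {
  int chunk_size = SIZE / NUM_THREADS;
  int start = chunk_size * tid;
  int end = start + chunk_size;
  for (int x = start; x < end; x++) {
    a[x] *= 2; // stand-in for "work based on x"
  }
}

void foo() {
  std::vector<std::thread> threads;
  for (int t = 0; t < NUM_THREADS; t++) {
    threads.emplace_back(parallel_loop, t); // spawn
  }
  for (auto &th : threads) {
    th.join(); // join
  }
}

int main() {
  foo();
  return 0;
}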

Page 19: CSE211: Compiler Design

Regular Parallel Loops

• Example, 2 threads/cores, array of size 8

void parallel_loop(..., int tid) {
  int chunk_size = SIZE / NUM_THREADS;
  int start = chunk_size * tid;
  int end = start + chunk_size;
  for (x = start; x < end; x++) {
    // work based on x
  }
}

[Diagram: indices 0-7; thread 0 covers 0-3, thread 1 covers 4-7]
chunk_size = 4
thread 0: start = 0, end = 4
thread 1: start = 4, end = 8

Page 20: CSE211: Compiler Design

Regular Parallel Loops

• Example, 2 threads/cores, array of size 8

void parallel_loop(..., int tid) {
  int chunk_size = SIZE / NUM_THREADS;
  int start = chunk_size * tid;
  int end = start + chunk_size;
  for (x = start; x < end; x++) {
    // work based on x
  }
}

[Diagram: indices 0-7; thread 0 covers 0-3, thread 1 covers 4-7]
chunk_size = 4
thread 0: start = 0, end = 4
thread 1: start = 4, end = 8

Page 21: CSE211: Compiler Design

End example

Page 22: CSE211: Compiler Design

Regular Parallel Loops

• Example, 2 threads/cores, array of size 9

void parallel_loop(..., int tid) {
  int chunk_size = SIZE / NUM_THREADS;
  int start = chunk_size * tid;
  int end = start + chunk_size;
  for (x = start; x < end; x++) {
    // work based on x
  }
}

[Diagram: indices 0-8; thread 0 and thread 1]
chunk_size = ?
thread 0: start = ?, end = ?
thread 1: start = ?, end = ?

Page 23: CSE211: Compiler Design

Regular Parallel Loops

[Diagram: indices 0-8; thread 0 and thread 1]
chunk_size = 4
thread 0: start = 0, end = 4
thread 1: start = 4, end = 8 (index 8 is left uncovered)

void parallel_loop(..., int tid) {
  int chunk_size = SIZE / NUM_THREADS;
  int start = chunk_size * tid;
  int end = start + chunk_size;
  for (x = start; x < end; x++) {
    // work based on x
  }
}

• Example, 2 threads/cores, array of size 9

Page 24: CSE211: Compiler Design

Regular Parallel Loops

[Diagram: indices 0-8; thread 0 and thread 1]
chunk_size = 4
thread 0: start = 0, end = 4
thread 1: start = 4, end = 8

void parallel_loop(..., int tid) {
  int chunk_size = SIZE / NUM_THREADS;
  int start = chunk_size * tid;
  int end = start + chunk_size;
  if (tid == NUM_THREADS - 1) {
    end += (SIZE - end); // last thread covers the remainder
  }
  for (x = start; x < end; x++) {
    // work based on x
  }
}

• Example, 2 threads/cores, array of size 9

Page 25: CSE211: Compiler Design

Regular Parallel Loops

[Diagram: indices 0-8; thread 0 and thread 1]
chunk_size = 4
thread 0: start = 0, end = 4
thread 1: start = 4, end = 9

void parallel_loop(..., int tid) {
  int chunk_size = SIZE / NUM_THREADS;
  int start = chunk_size * tid;
  int end = start + chunk_size;
  if (tid == NUM_THREADS - 1) {
    end += (SIZE - end); // last thread covers the remainder
  }
  for (x = start; x < end; x++) {
    // work based on x
  }
}

• Example, 2 threads/cores, array of size 9

last thread gets more work

Page 26: CSE211: Compiler Design

End example

Page 27: CSE211: Compiler Design

Regular Parallel Loops

[Diagram: indices 0-8; thread 0 and thread 1]
chunk_size = 4
thread 0: start = 0, end = 4
thread 1: start = 4, end = 8

void parallel_loop(..., int tid) {
  int chunk_size = (SIZE + (NUM_THREADS - 1)) / NUM_THREADS;
  int start = chunk_size * tid;
  int end = start + chunk_size;
  for (x = start; x < end; x++) {
    // work based on x
  }
}

• Example, 2 threads/cores, array of size 9

ceiling division

Page 28: CSE211: Compiler Design

Regular Parallel Loops

[Diagram: indices 0-8, plus out-of-bounds index 9; thread 0 and thread 1]
chunk_size = 5
thread 0: start = 0, end = 5
thread 1: start = 5, end = 10 (out of bounds!)

void parallel_loop(..., int tid) {
  int chunk_size = (SIZE + (NUM_THREADS - 1)) / NUM_THREADS;
  int start = chunk_size * tid;
  int end = start + chunk_size;
  for (x = start; x < end; x++) {
    // work based on x
  }
}

• Example, 2 threads/cores, array of size 9

Page 29: CSE211: Compiler Design

Regular Parallel Loops

[Diagram: indices 0-8, plus out-of-bounds index 9; thread 0 and thread 1]
chunk_size = 5
thread 0: start = 0, end = 5
thread 1: start = 5, end = 10 (out of bounds)

void parallel_loop(..., int tid) {
  int chunk_size = (SIZE + (NUM_THREADS - 1)) / NUM_THREADS;
  int start = chunk_size * tid;
  int end = min(start + chunk_size, SIZE);
  for (x = start; x < end; x++) {
    // work based on x
  }
}

• Example, 2 threads/cores, array of size 9

Page 30: CSE211: Compiler Design

Regular Parallel Loops

[Diagram: indices 0-8; thread 0 and thread 1]
chunk_size = 5
thread 0: start = 0, end = 5
thread 1: start = 5, end = 9

void parallel_loop(..., int tid) {
  int chunk_size = (SIZE + (NUM_THREADS - 1)) / NUM_THREADS;
  int start = chunk_size * tid;
  int end = min(start + chunk_size, SIZE);
  for (x = start; x < end; x++) {
    // work based on x
  }
}

• Example, 2 threads/cores, array of size 9

most threads do equal amounts of work; the last thread may do less.
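A short self-contained check of these bounds in C++ (SIZE = 9 and NUM_THREADS = 2 are taken from the running example):

#include <algorithm>
#include <cstdio>

const int SIZE = 9;
const int NUM_THREADS = 2;

int main() {
  // Ceiling division, then clamp the last chunk to SIZE.
  int chunk_size = (SIZE + (NUM_THREADS - 1)) / NUM_THREADS;
  for (int tid = 0; tid < NUM_THREADS; tid++) {
    int start = chunk_size * tid;
    int end = std::min(start + chunk_size, SIZE);
    std::printf("thread %d: start = %d, end = %d\n", tid, start, end);
  }
  return 0; // prints ranges 0..5 and 5..9
}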

Page 31: CSE211: Compiler Design

End example

Page 32: CSE211: Compiler Design

Good for SMP parallelism

[Diagram: SMP parallelism. Cores C0 and C1 each have a private L1 cache and share an L2 cache and DRAM. Thread 0's chunk (indices 0-3) stays in C0's L1 cache; thread 1's chunk (indices 4-7) stays in C1's L1 cache]

Page 33: CSE211: Compiler Design

What about streaming multiprocessors (GPUs)?

One streaming multiprocessor contains many small Compute Elements (CEs). CEs can load adjacent memory locations simultaneously. CEs execute iterations synchronously.

[Diagram: a streaming multiprocessor with CE0 and CE1 (thread 0, thread 1), a shared load/store unit, an L1 cache, and DRAM holding indices 0-7]

Is this partition good for GPUs?

Page 34: CSE211: Compiler Design

What about streaming multiprocessors (GPUs)?

One streaming multiprocessor contains many small Compute Elements (CEs). CEs can load adjacent memory locations simultaneously. CEs execute iterations synchronously.

Is this partition good for GPUs?

[Diagram: ITER 0 under the chunked partition; CE0 (thread 0) loads index 0 while CE1 (thread 1) loads index 4]

Page 35: CSE211: Compiler Design

What about streaming multiprocessors (GPUs)?

One streaming multiprocessor contains many small Compute Elements (CEs). CEs can load adjacent memory locations simultaneously. CEs execute iterations synchronously.

Is this partition good for GPUs?

[Diagram: ITER 0 under the chunked partition; CE0 (thread 0) loads index 0 while CE1 (thread 1) loads index 4]

not adjacent, so the loads have to be serialized

Page 36: CSE211: Compiler Design

What about streaming multiprocessors (GPUs)?

One streaming multiprocessor contains many small Compute Elements (CEs). CEs can load adjacent memory locations simultaneously.

What about a striped pattern?

[Diagram: ITER 0; indices 0-7 with iterations interleaved across the threads]

Page 37: CSE211: Compiler Design

What about streaming multiprocessors (GPUs)?

One streaming multiprocessor contains many small Compute Elements (CEs). CEs can load adjacent memory locations simultaneously.

What about a striped pattern?

[Diagram: ITER 0; indices 0-7 with iterations interleaved across the threads]

Page 38: CSE211: Compiler Design

What about streaming multiprocessors (GPUs)?

One streaming multiprocessor contains many small Compute Elements (CEs). CEs can load adjacent memory locations simultaneously.

What about a striped pattern?

[Diagram: ITER 0 under the striped partition; CE0 (thread 0) loads index 0 while CE1 (thread 1) loads index 1]

adjacent memory locations can be loaded at the same time!

Page 39: CSE211: Compiler Design

End example

Page 40: CSE211: Compiler Design

Kepler architecture

From: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf

Page 41: CSE211: Compiler Design

How to compile for GPUs?

• Example, 2 threads/cores, array of size 8. Change code for a GPU

void parallel_loop(..., int tid) {
  int chunk_size = SIZE / NUM_THREADS;
  int start = chunk_size * tid;
  int end = start + chunk_size;
  for (x = start; x < end; x += NUM_THREADS) {
    // work based on x
  }
}

[Diagram: indices 0-7; thread 0 and thread 1]

Page 42: CSE211: Compiler Design

How to compile for GPUs?

• Example, 2 threads/cores, array of size 8

[Diagram: indices 0-7; ITER 0: thread 0 x = ?, thread 1 x = ?]

void parallel_loop(..., int tid) {
  for (x = tid; x < SIZE; x += NUM_THREADS) {
    // work based on x
  }
}


Page 43: CSE211: Compiler Design

How to compile for GPUs?

• Example, 2 threads/cores, array of size 8

[Diagram: indices 0-7; ITER 0: thread 0 x = 0, thread 1 x = 1]

void parallel_loop(..., int tid) {
  for (x = tid; x < SIZE; x += NUM_THREADS) {
    // work based on x
  }
}

Page 44: CSE211: Compiler Design

How to compile for GPUs?

• Example, 2 threads/cores, array of size 8

void parallel_loop(..., int tid) {
  for (x = tid; x < SIZE; x += NUM_THREADS) {
    // work based on x
  }
}

[Diagram: indices 0-7; ITER 1: thread 0 x = 2, thread 1 x = 3]
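The same striped partition as a runnable C++ sketch (CPU threads are used here as a stand-in; on an actual GPU this shape appears as a grid-stride loop where tid comes from the thread and block indices):

#include <thread>
#include <vector>

const int SIZE = 8;
const int NUM_THREADS = 2;
int a[SIZE];

// Striped partition: thread tid touches tid, tid + NUM_THREADS, ...
// so in each round the threads access adjacent elements.
void parallel_loop(int tid) {
  for (int x = tid; x < SIZE; x += NUM_THREADS) {
    a[x] *= 2; // stand-in for "work based on x"
  }
}

int main() {
  std::vector<std::thread> threads;
  for (int t = 0; t < NUM_THREADS; t++) {
    threads.emplace_back(parallel_loop, t);
  }
  for (auto &th : threads) th.join();
  return 0;
}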

Page 45: CSE211: Compiler Design

End example

Page 46: CSE211: Compiler Design

Takeaways:

• Chunk data for SMP parallelism. Cores have disjoint L1 caches.

• Stride data for SM (GPU) parallelism; adjacent threads can more efficiently access adjacent memory.

• Easily compute bounds using runtime variables: SIZE, NUM_THREADS, THREAD_ID

• Create one function parameterized by thread id (SPMD parallelism)

Page 47: CSE211: Compiler Design

Irregular parallelism in loops

• Tasks are not balanced

• Appears in lots of emerging workloads

Page 48: CSE211: Compiler Design

Irregular parallelism in loops

• Tasks are not balanced

• Appears in lots of emerging workloads

social network analytics where threads are parallel across users

Page 49: CSE211: Compiler Design

Irregular parallelism in loops

• Tasks are not balanced

• Appears in lots of emerging workloads

sparse DNNs where a large percentage of weights are dropped

Page 50: CSE211: Compiler Design

Irregular parallelism in loops

• Tasks are not balanced

for (x = 0; x < SIZE; x++) {
  for (y = x; y < SIZE; y++) {
    a[x][y] = b[x][y] + c[x][y];
  }
}

Simple parallelism across the x loop only gives a 1.3x speedup with 2 threads (see the check below)

This can be improved using load-balancing strategies, like work stealing

Given the end of Dennard scaling, we should aim to get these applications to scale better!
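A back-of-the-envelope check of that 1.3x figure (a sketch assuming iteration x costs S - x units, with S = SIZE, and thread 0 takes the first half of the x range):

\[ \text{total work} = \sum_{x=0}^{S-1}(S-x) \approx \frac{S^2}{2}, \qquad \text{thread 0's chunk} = \sum_{x=0}^{S/2-1}(S-x) \approx \frac{3S^2}{8} \]

Thread 0 carries about 3/4 of the work, so the two-thread speedup is bounded by \( (S^2/2)/(3S^2/8) = 4/3 \approx 1.33\times \).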

Page 51: CSE211: Compiler Design

Work stealing

• Tasks are dynamically assigned to threads.

Page 52: CSE211: Compiler Design

Work stealing - global implicit worklist

• Pros:
  • Simple to implement
• Cons:
  • High contention on global counter
  • Potentially bad memory locality

Page 53: CSE211: Compiler Design

Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

[Diagram: tasks 0 1 2 3 4 5 6 7 ... SIZE-1; thread 0 and thread 1 cannot be assigned (colored) initially!]

Page 54: CSE211: Compiler Design

Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

[Diagram: tasks 0 1 2 3 4 5 6 7 ... SIZE-1; thread 0 and thread 1]

Page 55: CSE211: Compiler Design

Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

[Diagram: thread 0 takes task 0, thread 1 takes task 1; tasks 2 3 4 5 6 7 ... SIZE-1 remain]

Page 56: CSE211: Compiler Design

Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

[Diagram: thread 0 runs task 0; finished tasks: 1; tasks 2 3 4 5 6 7 ... SIZE-1 remain]

Page 57: CSE211: Compiler Design

Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

Dynamically take the next iteration

[Diagram: thread 0 runs task 0; finished tasks: 1; tasks 2 3 4 5 6 7 ... SIZE-1 remain]

Page 58: CSE211: Compiler Design

Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

[Diagram: thread 0 runs task 0, thread 1 runs task 2; finished tasks: 1; tasks 3 4 5 6 7 ... SIZE-1 remain]

Page 59: CSE211: Compiler Design

Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

[Diagram: thread 0 runs task 0; finished tasks: 1 2; tasks 3 4 5 6 7 ... SIZE-1 remain]

Page 60: CSE211: Compiler Design

Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

[Diagram: thread 0 runs task 0; finished tasks: 1 2; tasks 3 4 5 6 7 ... SIZE-1 remain]

Page 61: CSE211: Compiler Design

Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

[Diagram: thread 0 runs task 0, thread 1 runs task 3; finished tasks: 1 2; tasks 4 5 6 7 ... SIZE-1 remain]

Page 62: CSE211: Compiler Design

Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

[Diagram: thread 1 runs task 3; finished tasks: 0 1 2; tasks 4 5 6 7 ... SIZE-1 remain]

Page 63: CSE211: Compiler Design

Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

[Diagram: thread 1 runs task 3; finished tasks: 0 1 2; tasks 4 5 6 7 ... SIZE-1 remain]

Page 64: CSE211: Compiler Design

Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

[Diagram: thread 0 runs task 4, thread 1 runs task 3; finished tasks: 0 1 2; tasks 5 6 7 ... SIZE-1 remain]

Page 65: CSE211: Compiler Design

End example

Page 66: CSE211: Compiler Design

Work stealing - global implicit worklist

• How to implement in a compiler:

void foo() {
  ...
  for (x = 0; x < SIZE; x++) {
    // dynamic work based on x
  }
  ...
}

Page 67: CSE211: Compiler Design

Work stealing - global implicit worklist

• How to implement in a compiler:

void parallel_loop(...) {
  for (x = 0; x < SIZE; x++) {
    // dynamic work based on x
  }
}

void foo() {
  ...
  for (x = 0; x < SIZE; x++) {
    // dynamic work based on x
  }
  ...
}

Replicate code in a new function. Pass all needed variables as arguments. This creates SPMD parallelism.

Page 68: CSE211: Compiler Design

Work stealing - global implicit worklist

• How to implement in a compiler:

move loop variable to be a global atomic variable

atomic_int x = 0;

void parallel_loop(...) {
  for (x = 0; x < SIZE; x++) {
    // dynamic work based on x
  }
}

void foo() {
  ...
  for (x = 0; x < SIZE; x++) {
    // dynamic work based on x
  }
  ...
}

Page 69: CSE211: Compiler Design

Work stealing - global implicit worklist

• How to implement in a compiler:

change the loop bounds in the new function to use a local variable updated from the global variable.

atomic_int x = 0;

void parallel_loop(...) {
  for (int local_x = x++; local_x < SIZE; local_x = x++) {
    // dynamic work based on local_x
  }
}

void foo() {
  ...
  for (x = 0; x < SIZE; x++) {
    // dynamic work based on x
  }
  ...
}

Page 70: CSE211: Compiler Design

Work stealing - global implicit worklist

• How to implement in a compiler:

atomic_int x = 0;

void parallel_loop(...) {
  for (int local_x = x++; local_x < SIZE; local_x = x++) {
    // dynamic work based on local_x
  }
}

void foo() {
  ...
  for (x = 0; x < SIZE; x++) {
    // dynamic work based on x
  }
  ...
}

These must be atomic updates!

change the loop bounds in the new function to use a local variable updated from the global variable.

Page 71: CSE211: Compiler Design

Work stealing - global implicit worklist

• How to implement in a compiler:

Spawn threads in original function and join them afterwards

atomic_int x = 0;

void parallel_loop(...) {
  for (int local_x = x++; local_x < SIZE; local_x = x++) {
    // dynamic work based on local_x
  }
}

void foo() {
  ...
  for (t = 0; t < NUM_THREADS; t++) {
    spawn(parallel_loop);
  }
  join();
  ...
}

Page 72: CSE211: Compiler Design

Work stealing - global implicit worklist

• How to implement in a compiler:

Are we finished?

atomic_int x = 0;

void parallel_loop(...) {
  for (int local_x = x++; local_x < SIZE; local_x = x++) {
    // dynamic work based on local_x
  }
}

void foo() {
  ...
  for (t = 0; t < NUM_THREADS; t++) {
    spawn(parallel_loop);
  }
  join();
  ...
}

Page 73: CSE211: Compiler Design

Work stealing - global implicit worklist

• How to implement in a compiler:

Are we finished?

atomic_int x = 0;

void parallel_loop(...) {
  for (int local_x = x++; local_x < SIZE; local_x = x++) {
    // dynamic work based on local_x
  }
}

void foo() {
  ...
  for (t = 0; t < NUM_THREADS; t++) {
    spawn(parallel_loop);
  }
  join();
  x = 0; // reset the shared counter for the next use
  ...
}
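A minimal runnable C++ version of the global implicit worklist (an assumption: std::atomic<int> replaces the slides' atomic_int pseudocode, and a printf stands in for the dynamic work):

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

const int SIZE = 8;
const int NUM_THREADS = 2;

std::atomic<int> x{0};

void parallel_loop() {
  // fetch_add returns the pre-increment value, so each call
  // hands the calling thread a fresh, unique iteration index.
  for (int local_x = x.fetch_add(1); local_x < SIZE; local_x = x.fetch_add(1)) {
    std::printf("task %d\n", local_x); // dynamic work based on local_x
  }
}

void foo() {
  std::vector<std::thread> threads;
  for (int t = 0; t < NUM_THREADS; t++) {
    threads.emplace_back(parallel_loop);
  }
  for (auto &th : threads) th.join();
  x = 0; // reset so the loop can be run again
}

int main() {
  foo();
  return 0;
}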

Page 74: CSE211: Compiler Design

Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

[Diagram: tasks 0 1 2 3 4 5 6 7 ... SIZE-1; thread 0 and thread 1]
x: 0 | thread 0: local_x = UNDEF | thread 1: local_x = UNDEF

atomic_int x = 0;

void parallel_loop(...) {
  for (int local_x = x++; local_x < SIZE; local_x = x++) {
    // dynamic work based on local_x
  }
}

Page 75: CSE211: Compiler Design

Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

[Diagram: tasks 0 1 2 3 4 5 6 7 ... SIZE-1; thread 0 takes task 0, thread 1 takes task 1]
x: 2 | thread 0: local_x = 0 | thread 1: local_x = 1

atomic_int x = 0;

void parallel_loop(...) {
  for (int local_x = x++; local_x < SIZE; local_x = x++) {
    // dynamic work based on local_x
  }
}

Page 76: CSE211: Compiler Design

Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

[Diagram: thread 0 runs task 0, thread 1 runs task 1; tasks 2 3 4 5 6 7 ... SIZE-1 remain]
x: 2 | thread 0: local_x = 0 | thread 1: local_x = 1

atomic_int x = 0;

void parallel_loop(...) {
  for (int local_x = x++; local_x < SIZE; local_x = x++) {
    // dynamic work based on local_x
  }
}

Page 77: CSE211: Compiler Design

Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

[Diagram: thread 0 runs task 0; task 1 finished; tasks 2 3 4 5 6 7 ... SIZE-1 remain]
x: 2 | thread 0: local_x = 0 | thread 1: local_x = 1

atomic_int x = 0;

void parallel_loop(...) {
  for (int local_x = x++; local_x < SIZE; local_x = x++) {
    // dynamic work based on local_x
  }
}

Page 78: CSE211: Compiler Design

Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

[Diagram: thread 0 runs task 0, thread 1 takes task 2; tasks 3 4 5 6 7 ... SIZE-1 remain]
x: 3 | thread 0: local_x = 0 | thread 1: local_x = 2

atomic_int x = 0;

void parallel_loop(...) {
  for (int local_x = x++; local_x < SIZE; local_x = x++) {
    // dynamic work based on local_x
  }
}

Page 79: CSE211: Compiler Design

Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

[Diagram: thread 0 runs task 0, thread 1 runs task 2; tasks 3 4 5 6 7 ... SIZE-1 remain]
x: 3 | thread 0: local_x = 0 | thread 1: local_x = 2

atomic_int x = 0;

void parallel_loop(...) {
  for (int local_x = x++; local_x < SIZE; local_x = x++) {
    // dynamic work based on local_x
  }
}

Page 80: CSE211: Compiler Design

Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

[Diagram: thread 0 runs task 0; task 2 finished; tasks 3 4 5 6 7 ... SIZE-1 remain]
x: 3 | thread 0: local_x = 0 | thread 1: local_x = 2

atomic_int x = 0;

void parallel_loop(...) {
  for (int local_x = x++; local_x < SIZE; local_x = x++) {
    // dynamic work based on local_x
  }
}

Page 81: CSE211: Compiler Design

Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

[Diagram: thread 0 runs task 0, thread 1 takes task 3; tasks 4 5 6 7 ... SIZE-1 remain]
x: 4 | thread 0: local_x = 0 | thread 1: local_x = 3

atomic_int x = 0;

void parallel_loop(...) {
  for (int local_x = x++; local_x < SIZE; local_x = x++) {
    // dynamic work based on local_x
  }
}

Page 82: CSE211: Compiler Design

Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

[Diagram: thread 0 runs task 0, thread 1 runs task 3; tasks 4 5 6 7 ... SIZE-1 remain]
x: 4 | thread 0: local_x = 0 | thread 1: local_x = 3

atomic_int x = 0;

void parallel_loop(...) {
  for (int local_x = x++; local_x < SIZE; local_x = x++) {
    // dynamic work based on local_x
  }
}

Page 83: CSE211: Compiler Design

Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

[Diagram: task 0 finished; thread 1 runs task 3; tasks 4 5 6 7 ... SIZE-1 remain]
x: 4 | thread 0: local_x = 0 | thread 1: local_x = 3

atomic_int x = 0;

void parallel_loop(...) {
  for (int local_x = x++; local_x < SIZE; local_x = x++) {
    // dynamic work based on local_x
  }
}

Page 84: CSE211: Compiler Design

Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

[Diagram: thread 0 takes task 4, thread 1 runs task 3; tasks 5 6 7 ... SIZE-1 remain]
x: 5 | thread 0: local_x = 4 | thread 1: local_x = 3

atomic_int x = 0;

void parallel_loop(...) {
  for (int local_x = x++; local_x < SIZE; local_x = x++) {
    // dynamic work based on local_x
  }
}

Page 85: CSE211: Compiler Design

Work stealing - global implicit worklist

• Global worklist: threads take tasks (iterations) dynamically

[Diagram: thread 0 runs task 4, thread 1 runs task 3; tasks 5 6 7 ... SIZE-1 remain]
x: 5 | thread 0: local_x = 4 | thread 1: local_x = 3

atomic_int x = 0;

void parallel_loop(...) {
  for (int local_x = x++; local_x < SIZE; local_x = x++) {
    // dynamic work based on local_x
  }
}

Page 86: CSE211: Compiler Design

End example

Page 87: CSE211: Compiler Design

Next implementation

Page 88: CSE211: Compiler Design

Work stealing - local worklists

• More difficult to implement: typically requires concurrent data structures

• Low contention on local data structures

• Potentially better cache locality

Page 89: CSE211: Compiler Design

• local worklists: divide tasks into different worklists for each thread

[Diagram: tasks 0 1 2 3; thread 0 and thread 1]

Work stealing - local worklists

Page 90: CSE211: Compiler Design

• local worklists: divide tasks into different worklists for each thread

[Diagram: worklist 0 holds tasks 0 and 1 (thread 0); worklist 1 holds tasks 3 and 4 (thread 1)]

Work stealing - local worklists

Page 91: CSE211: Compiler Design

• local worklists: divide tasks into different worklists for each thread

[Diagram: thread 0 takes task 0 from worklist 0 (task 1 remains); thread 1 takes task 3 from worklist 1 (task 4 remains)]

Work stealing - local worklists

Page 92: CSE211: Compiler Design

• local worklists: divide tasks into different worklists for each thread

[Diagram: thread 0 runs task 0; worklist 0 holds task 1; thread 1 finished task 3; worklist 1 holds task 4]

Work stealing - local worklists

Page 93: CSE211: Compiler Design

• local worklists: divide tasks into different worklists for each thread

[Diagram: thread 0 runs task 0; worklist 0 holds task 1; thread 1 finished task 3; worklist 1 holds task 4]

Work stealing - local worklists

Page 94: CSE211: Compiler Design

• local worklists: divide tasks into different worklists for each thread

[Diagram: thread 0 runs task 0; worklist 0 holds task 1; thread 1 takes task 4; worklist 1 now empty]

Work stealing - local worklists

Page 95: CSE211: Compiler Design

• local worklists: divide tasks into different worklists for each thread

[Diagram: thread 0 runs task 0; worklist 0 holds task 1; worklist 1 empty]

Work stealing - local worklists

Page 96: CSE211: Compiler Design

• local worklists: divide tasks into different worklists for each thread

steal!

[Diagram: thread 1's worklist is empty, so it steals task 1 from worklist 0 while thread 0 runs task 0]

Work stealing - local worklists

Page 97: CSE211: Compiler Design

• local worklists: divide tasks into different worklists for each thread

[Diagram: thread 0 runs task 0; thread 1 runs the stolen task 1; both worklists empty]

Work stealing - local worklists

Page 98: CSE211: Compiler Design

• How to implement in a compiler:

Work stealing - local worklists

void foo() {
  ...
  for (x = 0; x < SIZE; x++) {
    // dynamic work based on x
  }
  ...
}

Page 99: CSE211: Compiler Design

• How to implement in a compiler:

Work stealing - local worklists

void foo() {
  ...
  for (x = 0; x < SIZE; x++) {
    // dynamic work based on x
  }
  ...
}

Make a new function, taking any variables used in loop body as args. Additionally take in a thread id

void parallel_loop(..., int tid) {
  for (x = 0; x < SIZE; x++) {
    // dynamic work based on x
  }
}

Page 100: CSE211: Compiler Design

• How to implement in a compiler:

Work stealing - local worklists

concurrent_queues cq[NUM_THREADS];

void foo() {
  ...
  for (x = 0; x < SIZE; x++) {
    // dynamic work based on x
  }
  ...
}

Make a global array of concurrent queues

void parallel_loop(..., int tid) {
  for (x = 0; x < SIZE; x++) {
    // dynamic work based on x
  }
}

Page 101: CSE211: Compiler Design

• How to implement in a compiler:

Work stealing - local worklists

concurrent_queues cq[NUM_THREADS];

void foo() {
  ...
  int chunk = SIZE / NUM_THREADS;
  for (x = 0; x < SIZE; x++) {
    int tid = x / chunk;
    cq[tid].enqueue(x);
  }
  ...
}

initialize queues in main thread

void parallel_loop(..., int tid) {
  for (x = 0; x < SIZE; x++) {
    // dynamic work based on x
  }
}

Page 102: CSE211: Compiler Design

• How to implement in a compiler:

Work stealing - local worklists

concurrent_queues cq[NUM_THREADS];

void foo() {
  ...
  int chunk = SIZE / NUM_THREADS;
  for (x = 0; x < SIZE; x++) {
    int tid = x / chunk;
    cq[tid].enqueue(x);
  }
  ...
}

initialize queues in main thread

x:   0 1 2 3
tid: 0 0 1 1

NUM_THREADS = 2; SIZE = 4; CHUNK = 2

Page 103: CSE211: Compiler Design

• How to implement in a compiler:

Work stealing - local worklists

concurrent_queues cq[NUM_THREADS];

void foo() {
  ...
  int chunk = (SIZE + NUM_THREADS - 1) / NUM_THREADS; // ceiling division
  for (x = 0; x < SIZE; x++) {
    int tid = x / chunk;
    cq[tid].enqueue(x);
  }
  ...
}

initialize queues in main thread

x:   0 1 2 3
tid: 0 0 1 1

NUM_THREADS = 2; SIZE = 4; CHUNK = 2

Page 104: CSE211: Compiler Design

• How to implement in a compiler:

Work stealing - local worklists

concurrent_queues cq[NUM_THREADS];

void foo() {
  ...
  int chunk = (SIZE + NUM_THREADS - 1) / NUM_THREADS; // ceiling division
  for (x = 0; x < SIZE; x++) {
    int tid = x / chunk;
    cq[tid].enqueue(x);
  }
  ...
}

initialize queues in main thread

x:   0 1 2 3
tid: 0 0 1 1

NUM_THREADS = 2; SIZE = 4; CHUNK = 2

use ceiling division to make sure all work gets assigned to a valid thread
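A quick check of why ceiling division is needed, reusing SIZE = 9 and NUM_THREADS = 2 from the earlier chunking example (values chosen only for illustration):

const int SIZE = 9;
const int NUM_THREADS = 2;

int chunk_floor = SIZE / NUM_THREADS;                    // 4
int chunk_ceil = (SIZE + NUM_THREADS - 1) / NUM_THREADS; // 5

int tid_floor = (SIZE - 1) / chunk_floor; // 8 / 4 = 2: not a valid thread id!
int tid_ceil = (SIZE - 1) / chunk_ceil;   // 8 / 5 = 1: valid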

Page 105: CSE211: Compiler Design

• How to implement in a compiler:

Work stealing - local worklists

concurrent_queues cq[NUM_THREADS];

void foo() {
  ...
  int chunk = (SIZE + NUM_THREADS - 1) / NUM_THREADS; // ceiling division
  for (x = 0; x < SIZE; x++) {
    int tid = x / chunk;
    cq[tid].enqueue(x);
  }
  ...
}

loop bounds in parallel function

void parallel_loop(..., int tid) {
  for (x = 0; x < SIZE; x++) {
    // dynamic work based on x
  }
}

Page 106: CSE211: Compiler Design

• How to implement in a compiler:

Work stealing - local worklists

concurrent_queues cq[NUM_THREADS];

void foo() {
  ...
  int chunk = (SIZE + NUM_THREADS - 1) / NUM_THREADS; // ceiling division
  for (x = 0; x < SIZE; x++) {
    int tid = x / chunk;
    cq[tid].enqueue(x);
  }
  ...
}

loop bounds in parallel function: dequeue stores the result in its argument and returns false if the queue is empty.

void parallel_loop(..., int tid) {
  int task = 0;
  while (cq[tid].dequeue(&task)) {
    // dynamic work based on task
  }
}
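The slides treat the concurrent queue as given; a minimal mutex-based C++ sketch that pins down the assumed interface (a production work-stealing runtime would more likely use a lock-free deque):

#include <mutex>
#include <queue>

struct concurrent_queue {
  std::queue<int> q;
  std::mutex m;

  void enqueue(int v) {
    std::lock_guard<std::mutex> lock(m);
    q.push(v);
  }

  // Stores the result through out; returns false if the queue is empty.
  bool dequeue(int *out) {
    std::lock_guard<std::mutex> lock(m);
    if (q.empty()) return false;
    *out = q.front();
    q.pop();
    return true;
  }
};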

Page 107: CSE211: Compiler Design

• How to implement in a compiler:

Work stealing - local worklists

concurrent_queues cq[NUM_THREADS];

void foo() {
  ...
  int chunk = (SIZE + NUM_THREADS - 1) / NUM_THREADS; // ceiling division
  for (x = 0; x < SIZE; x++) {
    int tid = x / chunk;
    cq[tid].enqueue(x);
  }
  ...
}

new global variable to track the number of threads that are finished

atomic_int finished_threads = 0;

void parallel_loop(..., int tid) {
  int task = 0;
  while (cq[tid].dequeue(&task)) {
    // dynamic work based on task
  }
  finished_threads++;
}

Page 108: CSE211: Compiler Design

• How to implement in a compiler:

Work stealing - local worklists

concurrent_queues cq[NUM_THREADS];

void foo() {
  ...
  int chunk = (SIZE + NUM_THREADS - 1) / NUM_THREADS; // ceiling division
  for (x = 0; x < SIZE; x++) {
    int tid = x / chunk;
    cq[tid].enqueue(x);
  }
  ...
}

Steal values from threads that are not finished

atomic_int finished_threads = 0;

void parallel_loop(..., int tid) {
  int task = 0;
  while (cq[tid].dequeue(&task)) {
    // dynamic work based on task
  }
  finished_threads++;
  while (finished_threads != NUM_THREADS) {
    target = ...; // select a random thread
    if (cq[target].dequeue(&task)) {
      // dynamic work based on task
    }
  }
}

Page 109: CSE211: Compiler Design

• How to implement in a compiler:

Work stealing - local worklists

concurrent_queues cq[NUM_THREADS];

void foo() {
  ...
  int chunk = (SIZE + NUM_THREADS - 1) / NUM_THREADS; // ceiling division
  for (x = 0; x < SIZE; x++) {
    int tid = x / chunk;
    cq[tid].enqueue(x);
  }
  for (t = 0; t < NUM_THREADS; t++) {
    spawn(parallel_loop(..., t));
  }
  join();
  finished_threads = 0;
  ...
}

launch threads, join, reinitialize

atomic_int finished_threads = 0;

void parallel_loop(..., int tid) {
  int task = 0;
  while (cq[tid].dequeue(&task)) {
    // dynamic work based on task
  }
  finished_threads++;
  while (finished_threads != NUM_THREADS) {
    target = ...; // select a random thread
    if (cq[target].dequeue(&task)) {
      // dynamic work based on task
    }
  }
}
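Putting the pieces together, an end-to-end runnable C++ sketch (assumptions: the mutex-based queue above stands in for a real concurrent queue, printf stands in for the work, and the victim is simply the other thread rather than the slides' random choice):

#include <atomic>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

const int SIZE = 8;
const int NUM_THREADS = 2;

struct concurrent_queue {
  std::queue<int> q;
  std::mutex m;
  void enqueue(int v) { std::lock_guard<std::mutex> l(m); q.push(v); }
  bool dequeue(int *out) {
    std::lock_guard<std::mutex> l(m);
    if (q.empty()) return false;
    *out = q.front(); q.pop(); return true;
  }
};

concurrent_queue cq[NUM_THREADS];
std::atomic<int> finished_threads{0};

void parallel_loop(int tid) {
  int task = 0;
  // Drain the local worklist first.
  while (cq[tid].dequeue(&task)) {
    std::printf("thread %d: task %d\n", tid, task);
  }
  finished_threads++;
  // Steal until every thread has drained its own worklist.
  while (finished_threads.load() != NUM_THREADS) {
    int target = (tid + 1) % NUM_THREADS; // with 2 threads, the only victim
    if (cq[target].dequeue(&task)) {
      std::printf("thread %d: stole task %d\n", tid, task);
    }
  }
}

int main() {
  int chunk = (SIZE + NUM_THREADS - 1) / NUM_THREADS; // ceiling division
  for (int x = 0; x < SIZE; x++) cq[x / chunk].enqueue(x);
  std::vector<std::thread> threads;
  for (int t = 0; t < NUM_THREADS; t++) threads.emplace_back(parallel_loop, t);
  for (auto &th : threads) th.join();
  return 0;
}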

Page 110: CSE211: Compiler Design

Work stealing - local worklists

[Diagram: worklist 0 holds tasks 0 and 1 (thread 0); worklist 1 holds tasks 3 and 4 (thread 1)]

atomic_int finished_threads = 0;

void parallel_loop(..., int tid) {
  int task = 0;
  while (cq[tid].dequeue(&task)) {
    // dynamic work based on task
  }
  finished_threads++;
  while (finished_threads != NUM_THREADS) {
    target = ...; // select a random thread
    if (cq[target].dequeue(&task)) {
      // dynamic work based on task
    }
  }
}

Page 111: CSE211: Compiler Design

Work stealing - local worklists

[Diagram: worklist 0 holds tasks 0 and 1 (thread 0); worklist 1 holds tasks 3 and 4 (thread 1)]

atomic_int finished_threads = 0;

void parallel_loop(..., int tid) {
  int task = 0;
  while (cq[tid].dequeue(&task)) {
    // dynamic work based on task
  }
  finished_threads++;
  while (finished_threads != NUM_THREADS) {
    target = ...; // select a random thread
    if (cq[target].dequeue(&task)) {
      // dynamic work based on task
    }
  }
}

Page 112: CSE211: Compiler Design

Work stealing - local worklists

[Diagram: thread 0 takes task 0 (worklist 0 holds task 1); thread 1 takes task 3 (worklist 1 holds task 4)]

atomic_int finished_threads = 0;

void parallel_loop(..., int tid) {
  int task = 0;
  while (cq[tid].dequeue(&task)) {
    // dynamic work based on task
  }
  finished_threads++;
  while (finished_threads != NUM_THREADS) {
    target = ...; // select a random thread
    if (cq[target].dequeue(&task)) {
      // dynamic work based on task
    }
  }
}

Page 113: CSE211: Compiler Design

Work stealing - local worklists

[Diagram: thread 0 runs task 0; worklist 0 holds task 1; thread 1 finished task 3; worklist 1 holds task 4]

atomic_int finished_threads = 0;

void parallel_loop(..., int tid) {
  int task = 0;
  while (cq[tid].dequeue(&task)) {
    // dynamic work based on task
  }
  finished_threads++;
  while (finished_threads != NUM_THREADS) {
    target = ...; // select a random thread
    if (cq[target].dequeue(&task)) {
      // dynamic work based on task
    }
  }
}

Page 114: CSE211: Compiler Design

Work stealing - local worklists

[Diagram: thread 0 runs task 0; worklist 0 holds task 1; thread 1 finished task 3; worklist 1 holds task 4]

atomic_int finished_threads = 0;

void parallel_loop(..., int tid) {
  int task = 0;
  while (cq[tid].dequeue(&task)) {
    // dynamic work based on task
  }
  finished_threads++;
  while (finished_threads != NUM_THREADS) {
    target = ...; // select a random thread
    if (cq[target].dequeue(&task)) {
      // dynamic work based on task
    }
  }
}

Page 115: CSE211: Compiler Design

Work stealing - local worklists

[Diagram: thread 0 runs task 0; worklist 0 holds task 1; thread 1 takes task 4; worklist 1 now empty]

atomic_int finished_threads = 0;

void parallel_loop(..., int tid) {
  int task = 0;
  while (cq[tid].dequeue(&task)) {
    // dynamic work based on task
  }
  finished_threads++;
  while (finished_threads != NUM_THREADS) {
    target = ...; // select a random thread
    if (cq[target].dequeue(&task)) {
      // dynamic work based on task
    }
  }
}

Page 116: CSE211: Compiler Design

Work stealing - local worklists

[Diagram: thread 0 runs task 0; worklist 0 holds task 1; worklist 1 empty]

atomic_int finished_threads = 0;

void parallel_loop(..., int tid) {
  int task = 0;
  while (cq[tid].dequeue(&task)) {
    // dynamic work based on task
  }
  finished_threads++;
  while (finished_threads != NUM_THREADS) {
    target = ...; // select a random thread
    if (cq[target].dequeue(&task)) {
      // dynamic work based on task
    }
  }
}

Page 117: CSE211: Compiler Design

Work stealing - local worklists

[Diagram: thread 0 runs task 0; worklist 0 holds task 1; worklist 1 empty; finished_threads: 1]

atomic_int finished_threads = 0;

void parallel_loop(..., int tid) {
  int task = 0;
  while (cq[tid].dequeue(&task)) {
    // dynamic work based on task
  }
  finished_threads++;
  while (finished_threads != NUM_THREADS) {
    target = ...; // select a random thread
    if (cq[target].dequeue(&task)) {
      // dynamic work based on task
    }
  }
}

Page 118: CSE211: Compiler Design

Work stealing - local worklists

[Diagram: thread 0 runs task 0; worklist 0 holds task 1; worklist 1 empty; finished_threads: 1]

atomic_int finished_threads = 0;

void parallel_loop(..., int tid) {
  int task = 0;
  while (cq[tid].dequeue(&task)) {
    // dynamic work based on task
  }
  finished_threads++;
  while (finished_threads != NUM_THREADS) {
    target = ...; // select a random thread
    if (cq[target].dequeue(&task)) {
      // dynamic work based on task
    }
  }
}

Page 119: CSE211: Compiler Design

Work stealing - local worklists

[Diagram: thread 0 runs task 0; worklist 0 holds task 1; worklist 1 empty; finished_threads: 1]

atomic_int finished_threads = 0;

void parallel_loop(..., int tid) {
  int task = 0;
  while (cq[tid].dequeue(&task)) {
    // dynamic work based on task
  }
  finished_threads++;
  while (finished_threads != NUM_THREADS) {
    target = ...; // select a random thread
    if (cq[target].dequeue(&task)) {
      // dynamic work based on task
    }
  }
}

Page 120: CSE211: Compiler Design

Work stealing - local worklists

[Diagram: thread 0 runs task 0; worklist 0 holds task 1; worklist 1 empty; finished_threads: 1]

atomic_int finished_threads = 0;

void parallel_loop(..., int tid) {
  int task = 0;
  while (cq[tid].dequeue(&task)) {
    // dynamic work based on task
  }
  finished_threads++;
  while (finished_threads != NUM_THREADS) {
    target = ...; // select a random thread
    if (cq[target].dequeue(&task)) {
      // dynamic work based on task
    }
  }
}

Page 121: CSE211: Compiler Design

Work stealing - local worklists

[Diagram: thread 0 runs task 0; thread 1 steals task 1 from worklist 0; finished_threads: 1]

atomic_int finished_threads = 0;

void parallel_loop(..., int tid) {
  int task = 0;
  while (cq[tid].dequeue(&task)) {
    // dynamic work based on task
  }
  finished_threads++;
  while (finished_threads != NUM_THREADS) {
    target = ...; // select a random thread
    if (cq[target].dequeue(&task)) {
      // dynamic work based on task
    }
  }
}

Page 122: CSE211: Compiler Design

Work stealing - local worklists

[Diagram: thread 1 runs the stolen task 1; both worklists empty; finished_threads: 1]

atomic_int finished_threads = 0;

void parallel_loop(..., int tid) {
  int task = 0;
  while (cq[tid].dequeue(&task)) {
    // dynamic work based on task
  }
  finished_threads++;
  while (finished_threads != NUM_THREADS) {
    target = ...; // select a random thread
    if (cq[target].dequeue(&task)) {
      // dynamic work based on task
    }
  }
}

Page 123: CSE211: Compiler Design

Work stealing - local worklists

[Diagram: thread 1 runs the stolen task 1; both worklists empty; finished_threads: 1]

atomic_int finished_threads = 0;

void parallel_loop(..., int tid) {
  int task = 0;
  while (cq[tid].dequeue(&task)) {
    // dynamic work based on task
  }
  finished_threads++;
  while (finished_threads != NUM_THREADS) {
    target = ...; // select a random thread
    if (cq[target].dequeue(&task)) {
      // dynamic work based on task
    }
  }
}

Page 124: CSE211: Compiler Design

Work stealing - local worklists

[Diagram: thread 1 still runs the stolen task 1; both worklists empty; finished_threads: 2]

atomic_int finished_threads = 0;

void parallel_loop(..., int tid) {
  int task = 0;
  while (cq[tid].dequeue(&task)) {
    // dynamic work based on task
  }
  finished_threads++;
  while (finished_threads != NUM_THREADS) {
    target = ...; // select a random thread
    if (cq[target].dequeue(&task)) {
      // dynamic work based on task
    }
  }
}

Page 125: CSE211: Compiler Design

Work stealing - local worklists

[Diagram: thread 1 still runs the stolen task 1; both worklists empty; finished_threads: 2]

atomic_int finished_threads = 0;

void parallel_loop(..., int tid) {
  int task = 0;
  while (cq[tid].dequeue(&task)) {
    // dynamic work based on task
  }
  finished_threads++;
  while (finished_threads != NUM_THREADS) {
    target = ...; // select a random thread
    if (cq[target].dequeue(&task)) {
      // dynamic work based on task
    }
  }
}

Page 126: CSE211: Compiler Design

Work stealing - local worklists

[Diagram: thread 1 still runs the stolen task 1; both worklists empty; finished_threads: 2]

atomic_int finished_threads = 0;

void parallel_loop(..., int tid) {
  int task = 0;
  while (cq[tid].dequeue(&task)) {
    // dynamic work based on task
  }
  finished_threads++;
  while (finished_threads != NUM_THREADS) {
    target = ...; // select a random thread
    if (cq[target].dequeue(&task)) {
      // dynamic work based on task
    }
  }
}

Page 127: CSE211: Compiler Design

Work stealing - local worklists

[Diagram: both worklists empty; finished_threads: 2]

atomic_int finished_threads = 0;

void parallel_loop(..., int tid) {
  int task = 0;
  while (cq[tid].dequeue(&task)) {
    // dynamic work based on task
  }
  finished_threads++;
  while (finished_threads != NUM_THREADS) {
    target = ...; // select a random thread
    if (cq[target].dequeue(&task)) {
      // dynamic work based on task
    }
  }
}

Page 128: CSE211: Compiler Design

Work stealing - local worklists

[Diagram: both worklists empty; finished_threads: 2]

atomic_int finished_threads = 0;

void parallel_loop(..., int tid) {
  int task = 0;
  while (cq[tid].dequeue(&task)) {
    // dynamic work based on task
  }
  finished_threads++;
  while (finished_threads != NUM_THREADS) {
    target = ...; // select a random thread
    if (cq[target].dequeue(&task)) {
      // dynamic work based on task
    }
  }
}

Page 129: CSE211: Compiler Design

• How to implement in a compiler:

Work stealing - local worklists

concurrent_queues cq[NUM_THREADS];

void foo() {
  ...
  int chunk = (SIZE + NUM_THREADS - 1) / NUM_THREADS; // ceiling division
  for (x = 0; x < SIZE; x++) {
    int tid = x / chunk;
    cq[tid].enqueue(x);
  }
  for (t = 0; t < NUM_THREADS; t++) {
    spawn(parallel_loop(..., t));
  }
  join();
  finished_threads = 0;
  ...
}

Final note: initializing the worklists may become a bottleneck (Amdahl's law).

Page 130: CSE211: Compiler Design

• How to implement in a compiler:

Work stealing - local worklists

concurrent_queues cq[NUM_THREADS];

void foo() {
  ...
  int chunk = (SIZE + NUM_THREADS - 1) / NUM_THREADS; // ceiling division
  for (x = 0; x < SIZE; x++) {
    int tid = x / chunk;
    cq[tid].enqueue(x);
  }
  for (t = 0; t < NUM_THREADS; t++) {
    spawn(parallel_loop(..., t));
  }
  join();
  finished_threads = 0;
  ...
}

Final note: initializing the worklists may become a bottleneck (Amdahl's law).

Can be made parallel using regular parallelism constructs
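For reference, Amdahl's law makes the bottleneck precise: if a fraction p of the runtime is parallelizable across n threads,

\[ \text{speedup} = \frac{1}{(1-p) + p/n} \]

so a serial worklist-initialization phase caps the achievable speedup no matter how many threads run the loop.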

Page 131: CSE211: Compiler Design

Summary

• Many ways to parallelize DOALL loops
  • Independent iterations are key to giving us this freedom!

• Some are more complicated than others.
  • Local worklists require concurrent data structures
  • Global worklist requires read-modify-write

• Compiler implementation can enable rapid exploration and experimentation.

Page 132: CSE211: Compiler Design

Next week

• Guest lecture about types

  • This will take us through the Thanksgiving break
  • Paper and project proposals will be due!