Scalability comparison: Traditional fork-join-based parallelism vs. Goroutines Porting the Barcelona OpenMP Tasks Suite to Go Artjom Simon https://github.com/artjomsimon/go-bots Know Your Gophers 2015-05-12
Aug 08, 2015
Traditional approach in C
Cilk:
    cilk_spawn task();
    [...]
    cilk_sync;
OpenMP:
    #pragma omp parallel
    {
        #pragma omp task
        [...]
        #pragma omp taskwait
        [...]
    }
Go: Parallel For Loop Pattern¹
    queue := make(chan int)
    done := make(chan bool)
    NP := runtime.GOMAXPROCS(0)
    go func() {
        for i := 0; i < n; i++ { queue <- i }
        close(queue)
    }()
    for i := 0; i < NP; i++ {
        go func() {
            for i := range queue { work(i) }
            done <- true
        }()
    }
    for i := 0; i < NP; i++ { <-done }
¹ Benchmarking Usability and Performance of Multicore Languages, PDF: http://arxiv.org/pdf/1302.2837v2
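The pattern above can be wrapped into a small runnable program. This is a minimal sketch, not code from go-bots: the names parallelFor and sumSquares are hypothetical, and the workload (summing squares with an atomic counter) is chosen only so the result is easy to check.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// parallelFor calls work(i) for every i in [0, n), using one worker
// goroutine per available processor, as in the slide's pattern.
func parallelFor(n int, work func(int)) {
	queue := make(chan int)
	done := make(chan bool)
	np := runtime.GOMAXPROCS(0) // an argument of 0 only queries the setting

	// Producer: feed all indices into the queue, then close it.
	go func() {
		for i := 0; i < n; i++ {
			queue <- i
		}
		close(queue)
	}()

	// Workers: drain the queue until it is closed, then signal completion.
	for w := 0; w < np; w++ {
		go func() {
			for i := range queue {
				work(i)
			}
			done <- true
		}()
	}

	// Join: wait for every worker to finish.
	for w := 0; w < np; w++ {
		<-done
	}
}

// sumSquares is a hypothetical workload used only for illustration.
func sumSquares(n int) int64 {
	var total int64
	parallelFor(n, func(i int) {
		atomic.AddInt64(&total, int64(i*i))
	})
	return total
}

func main() {
	fmt.Println(sumSquares(10)) // prints 285
}
```

The unbuffered queue channel means the producer hands out one index at a time, so the per-item channel cost is paid for every iteration; this matters when work(i) is very short.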
...used in academic publications³
³ http://www.sarc-ip.org/files/null/Workshop/1234128788173__TSchedStrat-iwomp08.pdf
Micro benchmarks
[Plot: spc (opteron). x-axis: OMP_NUM_THREADS (1, 8, 16, 32, 48); y-axis: speedup rel. to seq. (1 to 48); series: n=1000µs, n=100µs, n=10µs]
Figure: Speedup spc (icc), 10 000 Tasks
Task pools: Variations
• notaskpool: Start goroutines as needed, without any limit; uses a WaitGroup for synchronization
• simple-queue: A buffered channel of func()s holds the task queue. n goroutines receive the func()s and execute them
• goroutines-dispatcher: A dispatcher function executes a task in a goroutine only if a global counter of running goroutines is < n
• const-goroutines: n goroutines remove tasks from a doubly linked list
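Three of these variations can be sketched in a few lines each. The code below is an illustrative assumption, not the go-bots implementation: all function and type names are hypothetical, and the dispatcher's fallback of running the task inline when the limit is reached is a guess, since the slide does not say what happens in that case. The const-goroutines variant is omitted.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runNoTaskPool starts one goroutine per task, without any limit,
// and waits for all of them with a WaitGroup.
func runNoTaskPool(tasks []func()) {
	var wg sync.WaitGroup
	for _, t := range tasks {
		wg.Add(1)
		go func(task func()) {
			defer wg.Done()
			task()
		}(t)
	}
	wg.Wait()
}

// runSimpleQueue holds the tasks in a buffered channel of func()s;
// n worker goroutines receive and execute them.
func runSimpleQueue(tasks []func(), n int) {
	queue := make(chan func(), len(tasks))
	for _, t := range tasks {
		queue <- t
	}
	close(queue)

	var wg sync.WaitGroup
	for w := 0; w < n; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for task := range queue {
				task()
			}
		}()
	}
	wg.Wait()
}

// runDispatcher runs a task in a new goroutine only while fewer than n
// goroutines are running. Assumption: otherwise the task runs inline.
func runDispatcher(tasks []func(), n int64) {
	var running int64
	var wg sync.WaitGroup
	for _, t := range tasks {
		if atomic.AddInt64(&running, 1) <= n {
			wg.Add(1)
			go func(task func()) {
				defer wg.Done()
				defer atomic.AddInt64(&running, -1)
				task()
			}(t)
			continue
		}
		atomic.AddInt64(&running, -1) // limit reached: undo the reservation
		t()
	}
	wg.Wait()
}

// makeTasks builds k hypothetical tasks that each increment counter once.
func makeTasks(k int, counter *int64) []func() {
	tasks := make([]func(), k)
	for i := range tasks {
		tasks[i] = func() { atomic.AddInt64(counter, 1) }
	}
	return tasks
}

func main() {
	var a, b, c int64
	runNoTaskPool(makeTasks(1000, &a))
	runSimpleQueue(makeTasks(1000, &b), 4)
	runDispatcher(makeTasks(1000, &c), 4)
	fmt.Println(a, b, c) // prints 1000 1000 1000
}
```

The three functions differ mainly in where the concurrency limit lives: nowhere (notaskpool), in the number of workers (simple-queue), or in a shared counter checked at spawn time (dispatcher).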
Micro benchmarks
[Plot: spc (opteron). x-axis: OMP_NUM_THREADS (1, 8, 16, 32, 48); y-axis: speedup rel. to sequential (1 to 48); series: gcc, icc, clang, go-notaskpool, go-simple-queue, go-const-goroutines, go-goroutine-dispatch]
Figure: Speedup spc, n=100µs, 10 000 Tasks
BOTS: nqueens
• N-Queens problem with n=12
• Recursive backtracking search
• No cut-off when creating tasks
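A recursive backtracking search without a task cut-off can be sketched in Go as follows. This is an assumption-laden illustration in the notaskpool style (one goroutine per branch, joined by a WaitGroup), not the BOTS or go-bots source; the names safe, solve and countQueens are hypothetical, and the example uses n=8 rather than n=12 to keep the run short.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// safe reports whether a queen may be placed in column col of the next
// row, given the columns of the queens already placed in earlier rows.
func safe(placed []int, col int) bool {
	row := len(placed)
	for r, c := range placed {
		if c == col || c-col == row-r || col-c == row-r {
			return false
		}
	}
	return true
}

// solve extends a partial placement by one row. Every valid extension
// becomes its own goroutine: there is no cut-off depth.
func solve(n int, placed []int, count *int64, wg *sync.WaitGroup) {
	defer wg.Done()
	if len(placed) == n {
		atomic.AddInt64(count, 1)
		return
	}
	for col := 0; col < n; col++ {
		if !safe(placed, col) {
			continue
		}
		// Copy the board so child tasks never share a backing array.
		next := make([]int, len(placed)+1)
		copy(next, placed)
		next[len(placed)] = col
		wg.Add(1)
		go solve(n, next, count, wg)
	}
}

// countQueens returns the number of solutions for an n x n board.
func countQueens(n int) int64 {
	var count int64
	var wg sync.WaitGroup
	wg.Add(1)
	go solve(n, nil, &count, &wg)
	wg.Wait()
	return count
}

func main() {
	fmt.Println(countQueens(8)) // prints 92
}
```

Because each branch spawns a task, the number of goroutines grows with the size of the search tree, which is what makes this benchmark a stress test for task creation overhead.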
Results: BOTS (nqueens)
[Plot: nqueens (opteron). x-axis: CPU cores (1, 8, 16, 32, 48); y-axis: speedup (0 to 10); series: gcc, icc, clang, go-const-goroutines, go-dispatch, go-notaskpool, gccgo-const-goroutines, gccgo-dispatch, gccgo-notaskpool]
Figure: Speedup for nqueens -n 12, parallel
BOTS: sparselu
• LU factorization of a sparse block matrix
• 50x50 matrix, 100x100 sub-block matrices
Results: BOTS (sparselu)
[Plot: sparselu (opteron). x-axis: CPU cores (1, 8, 16, 32, 48); y-axis: speedup (0 to 30); series: gcc, icc, clang, go-const-goroutines, go-dispatch, go-notaskpool, go-simplequeue, gccgo-const-goroutines, gccgo-dispatch, gccgo-notaskpool]
Figure: Speedup sparselu -n 50 -m 100, parallel
Memory
[Plot: spc-par 10000 1000, 4 Threads (opteron). y-axis: RSS [Kbytes], scale ·10^5 (0 to 1.5); series: gcc, icc, clang, go-notaskpool, go-const-goroutines, gccgo-notaskpool, gccgo-const-goroutines]
Figure: Memory comparison (Resident Set Size), spc parallel
Image credits
Icon N-Queens problem: Colin M.L. Burnett, Wikimedia Commons (GFDL & BSD & GPL), http://commons.wikimedia.org/wiki/File:Chess_d45.svg (2015-03-09)