Early Experiences on Accelerating Dijkstra's Algorithm Using Transactional Memory
Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris
Computing Systems Laboratory
School of Electrical and Computer Engineering
National Technical University of Athens
{anastop,knikas,goumas,nkoziris}@cslab.ece.ntua.gr
http://www.cslab.ece.ntua.gr
May 31, 2009
Outline
1 Dijkstra’s Basics
2 Straightforward Parallelization Scheme
3 Helper-Threading Scheme
4 Experimental Evaluation
5 Conclusions
Anastopoulos et al. (NTUA) MTAAP’09 May 31, 2009 2 / 19
The Basics of Dijkstra’s Algorithm
SSSP Problem
Directed graph G = (V, E), weight function w : E → R+, source vertex s
∀v ∈ V : compute δ(v) = min{w(p) : p is a path s ⇝ v}
Shortest path estimate d(v)
gradually converges to δ(v) through relaxations
relax(v, w): d(w) ← min{d(w), d(v) + w(v, w)}
- can we find a better path s ⇝ w by going through v?
Three partitions of vertices
Settled: d(v) = δ(v)
Queued: d(v) > δ(v) and d(v) ≠ ∞
Unreached: d(v) = ∞
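The relaxation step can be sketched in Python as follows; this is a minimal illustration, and the dict-based d and pred arrays and the tiny example values are assumptions, not taken from the talk:

```python
import math

def relax(d, pred, v, w, weight):
    """Relax edge (v, w): d[w] = min(d[w], d[v] + weight)."""
    if d[v] + weight < d[w]:
        d[w] = d[v] + weight   # a better path to w was found, going through v
        pred[w] = v
        return True            # estimate improved
    return False

# Tiny example: s -> v already costs 2; relaxing edge (v, w) of weight 3
# improves d[w] from infinity to 5.
d = {"s": 0, "v": 2, "w": math.inf}
pred = {"s": None, "v": "s", "w": None}
relax(d, pred, "v", "w", 3)
print(d["w"])  # 5
```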
The Basics of Dijkstra’s Algorithm
Serial algorithm
Input: G = (V, E), w : E → R+, source vertex s, min-priority queue Q
Output: shortest distance array d, predecessor array π

foreach v ∈ V do
    d[v] ← ∞;
    π[v] ← nil;
    Insert(Q, v);
end
d[s] ← 0;
while Q ≠ ∅ do
    u ← ExtractMin(Q);
    foreach v adjacent to u do
        sum ← d[u] + w(u, v);
        if d[v] > sum then
            DecreaseKey(Q, v, sum);
            d[v] ← sum;
            π[v] ← u;
        end
    end
end
[Figure: example graph with source S and vertices A-E; edge weights on the edges, settled distances on some vertices, ∞ for unreached ones]
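The pseudocode above can be sketched as runnable Python. The dict-based priority queue here is only a stand-in for the binary min-heap discussed next (ExtractMin becomes a min over the dict, DecreaseKey a dict update), and the example graph is illustrative, not the one from the slides:

```python
import math

def dijkstra(graph, s):
    """Serial Dijkstra following the slide's pseudocode.

    graph: {u: [(v, weight), ...]} adjacency lists.
    """
    d = {v: math.inf for v in graph}   # shortest distance array
    pi = {v: None for v in graph}      # predecessor array
    d[s] = 0
    Q = dict(d)                        # Insert(Q, v) for every vertex
    while Q:
        u = min(Q, key=Q.get)          # ExtractMin(Q)
        del Q[u]
        for v, w_uv in graph[u]:
            new = d[u] + w_uv
            if d[v] > new:             # relaxation test
                Q[v] = new             # DecreaseKey(Q, v, new)
                d[v] = new
                pi[v] = u
    return d, pi

# Illustrative graph: the best route to C is S -> B -> C with cost 9.
graph = {
    "S": [("A", 5), ("B", 2)],
    "A": [("C", 10)],
    "B": [("A", 2), ("C", 7)],
    "C": [],
}
d, pi = dijkstra(graph, "S")
print(d["C"])  # 9
```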
The Basics of Dijkstra’s Algorithm
[Figure: binary min-heap before and after DecreaseKey operations; a decreased key sifts up towards the root through a sequence of parent-child swaps]
Min-priority queue implemented as binary min-heap
maintains all but the settled vertices
min-heap property: ∀i : d(parent(i)) ≤ d(i)
amortizes the cost of multiple ExtractMin's and DecreaseKey's
- O((|E| + |V|) log |V|) time complexity
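A binary min-heap with DecreaseKey can be sketched as below; this is an illustrative implementation, and the pos map and method names are assumptions. Note that decrease_key is a sequence of parent-child swaps, which is exactly the granularity the fine-grain synchronization schemes later operate on:

```python
class MinHeap:
    """Binary min-heap keyed by distance estimate, with decrease-key."""

    def __init__(self):
        self.items = []   # list of (key, vertex)
        self.pos = {}     # vertex -> index in items, so DecreaseKey finds it

    def _swap(self, i, j):
        self.items[i], self.items[j] = self.items[j], self.items[i]
        self.pos[self.items[i][1]] = i
        self.pos[self.items[j][1]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self.items[parent][0] <= self.items[i][0]:
                break                      # min-heap property restored
            self._swap(i, parent)          # one parent-child swap
            i = parent

    def _sift_down(self, i):
        n = len(self.items)
        while True:
            smallest, l, r = i, 2 * i + 1, 2 * i + 2
            if l < n and self.items[l][0] < self.items[smallest][0]:
                smallest = l
            if r < n and self.items[r][0] < self.items[smallest][0]:
                smallest = r
            if smallest == i:
                return
            self._swap(i, smallest)
            i = smallest

    def insert(self, key, v):
        self.items.append((key, v))
        self.pos[v] = len(self.items) - 1
        self._sift_up(self.pos[v])

    def decrease_key(self, v, new_key):
        i = self.pos[v]
        self.items[i] = (new_key, v)
        self._sift_up(i)                   # swap sequence towards the root

    def extract_min(self):
        self._swap(0, len(self.items) - 1)
        _key, v = self.items.pop()
        del self.pos[v]
        if self.items:
            self._sift_down(0)
        return v

h = MinHeap()
for key, v in [(5, "a"), (7, "b"), (8, "c")]:
    h.insert(key, v)
h.decrease_key("c", 1)      # "c" sifts up to the root
print(h.extract_min())  # c
```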
Straightforward Parallelization
Fine-grain parallelization at the inner loop level
Fine-Grain Multi-Threaded
/* Initialization phase same as in the serial code */
while Q ≠ ∅ do
    Barrier
    if tid = 0 then
        u ← ExtractMin(Q);
    Barrier
    for v adjacent to u in parallel do
        sum ← d[u] + w(u, v);
        if d[v] > sum then
            Begin-Atomic
            DecreaseKey(Q, v, sum);
            End-Atomic
            d[v] ← sum;
            π[v] ← u;
        end
    end
end
[Figure: same example graph with source S and vertices A-E as before; the out-edges of the extracted vertex are relaxed in parallel]
Issues:
- speedup bounded by average out-degree
- concurrent heap updates due to DecreaseKey's
- barrier synchronization overhead
Concurrent Heap Updates with Locks
[Figure: concurrent DecreaseKey's on the heap; the swap sequences of vertices i, j, k may overlap, and with locks a conflicting acquisition blocks (marked x)]
Coarse-grain synchronization (cgs-lock)
- enforces atomicity at the level of a DecreaseKey operation
- one lock for the entire heap
- serializes DecreaseKey's
Fine-grain synchronization (fgs-lock)
- enforces atomicity at the level of a single swap
- allows multiple swap sequences to execute in parallel as long as they are temporally non-overlapping
- separate locks for each parent-child pair
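The coarse-grain variant can be sketched with a single Python lock guarding a stand-in heap. The dict-based "heap" and all names here are illustrative; a real cgs-lock would wrap the full DecreaseKey sifting, and fgs-lock would instead lock only the two slots touched by each swap:

```python
import threading

# Stand-in "heap": a dict of vertex -> key. Only the cgs-lock pattern is
# shown: one lock for the whole structure, taken around each DecreaseKey.
heap = {"a": 5, "b": 7}
heap_lock = threading.Lock()

def decrease_key_cgs(v, new_key):
    with heap_lock:                 # serializes all DecreaseKey's
        if new_key < heap[v]:
            heap[v] = new_key

# Three threads race to decrease "a"; the guard makes the outcome the
# minimum of the proposed keys regardless of interleaving.
threads = [threading.Thread(target=decrease_key_cgs, args=("a", k))
           for k in (4, 3, 6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(heap["a"])  # 3
```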
Performance of FGMT with Locks
[Figure: multithreaded speedup (0 to 1.3) versus number of threads (2 to 16) for cgs-lock, perfbar+cgs-lock, and perfbar+fgs-lock]
Software barriers dominate total execution time
- 72% with 2 threads, 88% with 8
- replace with idealized (simulated) zero-latency barriers
Fgs-lock scheme is more scalable, but still fails to outperform the serial version
- locking overhead (2 locks + 2 unlocks per swap)
Concurrent Heap Updates with TM
[Figure: same heap scenario, TM-based; the swap sequences of vertices i, j, k execute as transactions]
Coarse-grain synchronization (cgs-tm)
- enclose DecreaseKey within a transaction
- allows multiple swap sequences to execute in parallel as long as they are spatially (and temporally) non-overlapping
- conflicting transaction stalls and retries, or aborts
Fine-grain synchronization (fgs-tm)
- enclose each swap operation within a transaction
- atomicity as in fgs-lock
- shorter but more transactions
less overhead for cgs-tm, yet equally able to exploit available concurrency
Helper-Threading Scheme
Motivation
- expose more parallelism to each thread
- eliminate costly barrier synchronization
Rationale
- in serial, relaxations are performed only from the extracted (settled) vertex
- allow relaxations for out-edges of queued vertices, hoping that some of them might already be settled
  - main thread operates as in the serial algorithm
  - assign the next t vertices in the queue (x2 . . . xt+1) to t helper threads
  - helper thread k relaxes all out-edges of vertex xk
[Figure: same example graph with source S and vertices A-E; the main thread works on the minimum vertex while helpers work on the next vertices in the queue]
speculation on the status of d(xk)
- if already optimal, main thread will be offloaded
- if not optimal, any suboptimal relaxations will be corrected eventually by the main thread
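The correction argument can be illustrated with a single-threaded Python sketch (all values are illustrative): a helper's relaxation from a still-queued vertex may use a suboptimal distance, but the main thread's later relaxation from the settled vertex fixes the result:

```python
import math

d = {"x": 10, "y": math.inf}   # x is still queued with a suboptimal estimate

def relax(u, v, w):
    if d[u] + w < d[v]:
        d[v] = d[u] + w

relax("x", "y", 3)   # helper speculates while d[x] = 10, so d[y] becomes 13
d["x"] = 6           # main thread later settles x at its true distance 6
relax("x", "y", 3)   # main thread re-relaxes the edge: d[y] corrected to 9
print(d["y"])  # 9
```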
Execution Pattern
[Figure: execution patterns over steps k, k+1, k+2 for the serial scheme, FGMT with four threads, and the helper-threading scheme; the main thread performs extract-min and relax-edges, helper threads read the tid-th minimum and are killed at the end of each step]
the main thread stops all helpers at the end of each iteration
unfinished work will be corrected, as with mis-speculated distances
Helper-Threading Scheme
Main thread

while Q ≠ ∅ do
    u ← ExtractMin(Q);
    done ← 0;
    foreach v adjacent to u do
        sum ← d[u] + w(u, v);
        Begin-Xact
        if d[v] > sum then
            DecreaseKey(Q, v, sum);
            d[v] ← sum;
            π[v] ← u;
        End-Xact
    end
    Begin-Xact
    done ← 1;
    End-Xact
end
Helper thread

while Q ≠ ∅ do
    while done = 1 do ;
    x ← ReadMin(Q, tid)
    stop ← 0
    foreach y adjacent to x and while stop = 0 do
        Begin-Xact
        if done = 0 then
            sum ← d[x] + w(x, y)
            if d[y] > sum then
                DecreaseKey(Q, y, sum)
                d[y] ← sum
                π[y] ← x
            else
                stop ← 1
        End-Xact
    end
end
for a single neighbour, the check for relaxation, updates to the heap, and updates to the d, π arrays are enclosed within a transaction
- performed "all-or-none"
- on a conflict, only one thread commits
interruption of helper threads implemented through TM as well
Why with TM?
- composable: all dependent atomic sub-operations composed into one large atomic operation, without limiting concurrency
- optimistic
- easily programmable
Experimental Setup

Full-system simulation
- Simics 3.0.31 in conjunction with the GEMS toolset 2.1
- boots unmodified Solaris 10 (UltraSPARC III Cu)
LogTM ("Signature Edition")
- eager version management
- eager conflict detection
  - on a conflict, a transaction stalls and either retries or aborts