Toward an Efficient Algorithm for the Single Source Shortest Path Problem on GPUs

Thang M. Le
David R. Cheriton School of Computer Science, University of Waterloo
[email protected]
Abstract Accelerating graph algorithms on GPUs is a fairly new research area. The topic was first introduced in [3] by Harish and
Narayanan in 2007. Since then, there have been numerous studies on different aspects of graphs on GPUs. Some great work has
been done on graph traversal [4] [5]. Nevertheless, there has not been much focus on designing an efficient algorithm for the single
source shortest path (SSSP) problem on GPUs. This report summarizes our study which contains various algorithms and their
performance results for the SSSP problem on GPUs.
1. Introduction
Finding the SSSP is a classical problem in graph theory. Dijkstra's algorithm and the Bellman-Ford algorithm are the two well-known solutions for this problem. While Dijkstra's algorithm focuses on work efficiency, which forces it to make a greedy step at each iteration by choosing the vertex with the smallest distance, the Bellman-Ford algorithm explores all possible vertices in an effort to reduce the number of iterations at the cost of efficiency. The difference between the two approaches is similar to the difference between the single instruction single data (SISD) model and the single instruction multiple data (SIMD) model proposed in [1]. Since the GPU architecture embraces single instruction multiple thread (SIMT), a variant of SIMD, the Bellman-Ford algorithm is favorable on GPUs. Although the Bellman-Ford algorithm provides a great advantage by exploring multiple vertices at a time, our experiments reveal that its performance on GPUs is poor due to its inefficient approach. The main reason is that GPU global memory has high access latency. As a result, the advantage of parallelism does not compensate for the cost of memory accesses.
In our report, all of the experiments performed on the GPU are compared with our implementation of Dijkstra's algorithm using a priority queue on the CPU. We chose the binary heap data structure to efficiently maintain the priority queue. Based on this implementation, the runtime complexity of Dijkstra's algorithm is O((V + E) log V), where E is the number of edges and V is the number of vertices. On sparse graphs, this runtime complexity becomes O(V log V). This is a much faster version compared with the implementation of Dijkstra's algorithm using an ordinary array or linked list, which is O(V^2). For each design and implementation, we will discuss advantages and drawbacks. All of the performance results were produced on a Tesla server equipped with two Intel Xeon Quad Core E5620 2.4 GHz CPUs, 4 NVIDIA Tesla C2050 GPUs, 24GB of 1333MHz ECC DDR3 memory and two 600GB Seagate Cheetah 15000rpm 16MB-cache disks with a RAID controller.
2. Verification
Running on massively parallel hardware such as GPUs gave us a lot of challenges. Not only did our algorithms face a high risk of race conditions, they were also exposed to the threat of working on stale data due to the inconsistency between cache and memory in CUDA. In fact, cache coherency was the biggest challenge we experienced when working with the NVIDIA GPU processor. Because of all of these risks, we put a lot of effort into simplifying our designs and implementations as much as possible. This helped our correctness verification and troubleshooting effort. In terms of testing, we first made sure we had a correct Dijkstra implementation. We then relied on it to ensure the GPU algorithms produce the same results as the Dijkstra implementation. Not only did we compare the shortest distances in both results, we also used each shortest path to recalculate the corresponding shortest distance and compared it with the shortest distance computed by the algorithm. Although we can reason about the correctness of our algorithms, we cannot guarantee our implementations are free from bugs. After all, Dijkstra used to say "Testing shows the presence, not the absence of bugs". We welcome you to report any unexpected results to us or send us your comments for improvements.
3. Usage

Usage: dijkstra.exe [options] [graph_file]
Options:
  -m <mode>        : set the execution mode (default 0)
                     0 -- verification (runs GPU & CPU code and compares)
                     1 -- GPU only
                     2 -- CPU only
  -n <num_sources> : how many source nodes are generated (default 1)
  -s <seed>        : random number seed (default uses time)
  -K <k-constant>  : a constant K used in the queue-based-filter algorithm
                     (only applicable to the queue-based-filter algorithm)
  -e               : print sources which differ
  -b               : read the input graph as an undirected graph
  -g               : print graph statistics
4. 2-Kernel Algorithm

The SSSP problem requires calculating both the shortest distance and the shortest path for all vertices from a given source. Keeping these two values consistent on GPUs is difficult. The simplest approach is to design a 3-kernel algorithm, which was already done in our previous work. The question is whether we can achieve an SSSP algorithm with two kernel methods. In order to achieve this, we need a way to perform two assignment instructions atomically. At present, CUDA 4.2 only supports basic atomic functions which mostly comprise two instruction calls: one operational instruction and one assignment instruction. Moreover, running programs have no control over object locks. These constraints severely limit the ability to perform two assignment instructions in an atomic manner.
In order to accomplish what we need, one approach is to design a 'smart' data structure that holds both the shortest distance and the shortest path, and to apply atomic functions on this data structure instead of on the individual values. This work is credited to Aditya Tayal, who did a great job in defining this data structure.
Once the data structure was defined, designing a 2-kernel algorithm was effortless. Below is the 2-kernel algorithm:

Input:
Va: an array storing the vertices of the graph
Ea: an array storing the edges of the graph
Wa: an array storing the weights of the edges
Ca: an array storing the current shortest distance of each vertex
Output:
Ma: an array marking the vertices which have their shortest distances updated
Ua: an array storing the new shortest distance & new path of updated vertices
MaFlag: a flag indicating whether the Ma array is empty

 1:__global__ void kernel1(int *Va, int *Ea, int *Wa, int *Ma, int *Ca,
 2:                        costpath *Ua, int *MaFlag) {
 3:  const unsigned int tid = threadIdx.x + blockDim.x*blockIdx.x;
 4:  int i, n;
 5:  costpath newUa;
 6:  uint64 old, assumed, *address;
 7:  if (Ma[tid]) {
 8:    Ma[tid] = 0;
 9:    *MaFlag = 0;
10:    for (i = Va[tid]; i < Va[tid+1]; ++i) {
11:      n = Ea[i];
12:      address = &(Ua[n].raw);
13:      old = *address;
14:      newUa.val.cost = Ca[tid] + Wa[i];
15:      newUa.val.path = tid;
16:      do {
17:        assumed = old;
18:        if ( ((costpath *)&assumed)->val.cost > newUa.val.cost )
19:          old = atomicCAS(address, assumed, newUa.raw);
20:      } while (assumed != old);
21:    }
22:  }
23:}

Input:
Ma: an array marking the vertices which have their shortest distances updated
Ua: an array storing the new shortest distance & new path of updated vertices
Output:
Ca: an array storing the current shortest distance of each vertex from the source
Pa: an array storing the vertex paths
MaFlag: a flag indicating whether the Ma array is empty

 1:__global__ void kernel2(int *Ma, int *Ca, int *Pa,
 2:                        costpath *Ua, int *MaFlag) {
 3:  const unsigned int tid = threadIdx.x + blockDim.x*blockIdx.x;

At line 7, kernel1 checks for all vertices which have their Ma values set to 1. These vertices had their distance and path updated in kernel2 previously. From line 10 to line 20, kernel1 calculates and updates the new distance and path for all neighbors of these vertices. After kernel1 finishes, the new distance and path of a vertex are stored in the Ua array. Kernel2 then compares the values stored in Ca and Ua at line 9. If there is any difference, it updates the values in Ca, Pa and Ma. At the end of its logic, kernel2 sets MaFlag according to the values stored in Ma. MaFlag is a shortcut indicating whether the Ma array is empty. The algorithm continues if MaFlag is set to 1. Otherwise, the algorithm is complete and the values stored in Ca and Pa are the shortest distances and shortest paths of all vertices from the source vertex.
Verification

We ran the algorithm in verification mode for 100 source vertices. In this mode, the program checks the shortest distance results of the algorithm against the results of Dijkstra's algorithm. In addition, the program also traverses each shortest path returned by each algorithm and recalculates the corresponding distance of each path, which is then compared with the shortest distance computed by each algorithm. Below is the summary of running the algorithm in verification mode.
The summary indicates that all shortest distances computed by the algorithm and Dijkstra’s algorithm are the same. In addition,
the length of each shortest path also matches with the corresponding shortest distance returned by both algorithms.
Performance Results

The two tables below show the performance of the 2-kernel algorithm on different graphs. The results favor Dijkstra's algorithm, which outperforms the 2-kernel algorithm on GPUs by 20 - 30 times even though the 2-kernel algorithm finished with a smaller number of iterations. The second table reflects the work-inefficiency disadvantage of the 2-kernel algorithm. On the small graphs, kernel1 took 0.00957ms at the minimum with an average of 0.0165ms, and kernel2 took 0.00883ms at the minimum with an average of 0.00922ms. On the large graphs, kernel1 took 0.11328ms at the minimum with an average of 2.8012ms, and kernel2 took 0.39075ms at the minimum with an average of 0.5503ms. The minimum running time usually happens at the first few iterations where there are not many active vertices. Since the algorithm always starts with 1 active vertex, the source, we would expect the minimum time on different input graphs to be relatively the same regardless of the graph size. Yet, the 2-kernel algorithm's minimum kernel times grow with the graph size.

5. 1-Kernel Algorithm

At line 6, we declare the shared variable setMa for each thread block. In the while loop at line 22, we set the local variable
updated if the condition at line 25 is satisfied. This means that every time a vertex has its shortest distance and path updated, the local variable updated will be set by the thread performing this update. Once the while loop is complete, the threads whose local variable updated is set update the corresponding value of the updated vertex in the Ma array and the shared variable setMa accordingly. At line 38, the first thread in each thread block examines the value stored in the shared variable setMa and increments the global MaFlag if setMa is set. By doing that, kernel1 now takes over the responsibility of kernel2 in maintaining the Ma array and MaFlag; hence, kernel2 can be removed. On the host, we only need to keep track of the previous value of MaFlag and the current value of MaFlag. If they are the same, the algorithm is complete and the execution is terminated. If they are different, the algorithm continues to the next iteration. It is noteworthy that in the 1-kernel algorithm, we need to use the volatile keyword for the local variables v_Ma and v_Ua referencing Ma and Ua in global memory respectively. Every memory access to Ma and Ua should go through these two local variables to avoid reading stale data due to the lack of full cache coherence in CUDA.
Verification
We ran the same verification step for this algorithm. Below is the summary of running the algorithm in verification mode.
The summary indicates that all shortest distances computed by the algorithm and Dijkstra’s algorithm are the same. In addition,
the length of each shortest path also matches with the corresponding shortest distance returned by both algorithms.
Performance Results
By reducing to one kernel, the 1-kernel algorithm performs an order of magnitude faster than the 2-kernel algorithm. Since it is still based on the inefficient approach, its performance still does not match the performance of Dijkstra's algorithm.
We produced these results by running this command with different input graphs:
6. Warp-Based Algorithm

The warp-based methodology described in [4] can be incorporated into the 1-kernel algorithm to achieve better performance. The advantages of the warp-based approach are two-fold: better memory coalescing and avoiding thread divergence.

 1:__global__ void kernel1Warp(int *Va, int *Ea, int *Wa,
                               int *Ma, costpath *Ua, int *MaFlag) {
 2:  int warp_nums = blockDim.x / WARP_SZ, warp_id = threadIdx.x / WARP_SZ;

From line 2 – line 4, the algorithm computes the number of virtual warps within a thread block, the warp id of the current thread, the thread offset of the current thread within its virtual warp, and the warp offset with respect to the entire grid. Using the warp offset calculated at line 4, the algorithm moves the pointers of Va, Ma and Ua to the correct position for each virtual warp. After finishing line 7, every virtual warp accesses Va, Ma and Ua at different positions. The calculation done at line 4 makes sure virtual warps work on different chunks of data in global memory and there is no overlapping work among virtual warps. Lines 8 & 9 declare the shared memory for each virtual warp in a thread block. Each virtual warp has a slot in this shared memory with a size equal to CHUNK_SZ. From line 18 – line 23, each thread in a virtual warp copies data from Ma and Va to its shared memory slot. These instructions are executed in SIMD phase. At line 26, the for loop goes through all the data previously stored in the shared memory of each virtual warp. The logic in this for loop is the same as in the 1-kernel algorithm. From line 27 – line 34, the logic is executed in SISD phase. The entire for loop from line 35 – line 54 is executed in SIMD phase.
Verification
We ran the same verification step for this algorithm. Below is the summary of running the algorithm in verification mode.
The summary indicates that all shortest distances computed by the algorithm and Dijkstra’s algorithm are the same. In addition,
the length of each shortest path also matches with the corresponding shortest distance returned by both algorithms.
Performance Results
Since the average degree of all of the graphs is small, we ran the algorithm with the virtual warp size set to 4. The results did not meet our expectation. While Hong et al. achieved a 1.5x speedup on the Patents graph using virtual warps compared with Harish's algorithm, we experienced a performance degradation with the warp-centric approach. After reviewing the warp-based approach used in the algorithm, we can explain the difference. The main reason why we did not see the same performance speedup as described in [4] is that our algorithm solves the SSSP problem while the algorithm in [4] traverses a graph in breadth-first search (BFS) order. Although it makes sense to copy the level & node data from global memory to shared memory before traversing a graph in BFS, it is not as convincing to copy the data of both the Ma array and the Va array from global memory to shared memory in our algorithm. In the BFS traversal, a deeper level of traversal usually (but not always) accumulates more active nodes than a shallower level. In contrast, our algorithm starts with 1 active node, the source. The algorithm then accumulates more active nodes through each iteration until reaching the maximum number. At this point, the number of active nodes starts to decrease until there are no more active nodes. Hence, the pattern of the numbers of active nodes in the warp-based algorithm is much different from the pattern in the BFS traversal. Copying data to shared memory exploits the advantage of the GPU on-chip memory; however, it also introduces inefficiency. Depending on the graph shape, this inefficiency might not be obvious in BFS traversal. However, the cost of this inefficiency overshadows the benefits of the warp-based approach when applied to the 1-kernel algorithm for the SSSP problem. We implemented a slightly different version which only copies the data of the Ma array from global memory to shared memory without copying the data of the Va array. Below are the performance results of the original implementation and the later implementation.
Performance results of the warp-based algorithm copying data of Ma and Va to shared memory
Performance results of the warp-based algorithm copying data of Ma to shared memory
Effect of different chunk sizes (#Vertices: 161595, #Edges: 399036)

7. Queue-Based Algorithm

The queue-based algorithm was our first effort toward a work-efficient algorithm for the SSSP problem on GPUs. Thus far, none of the above algorithms is comparable with Dijkstra's algorithm. The queue-based algorithm runs with two queues. Each queue is allocated with a size equal to the number of vertices. The kernel logic works from the current queue; any vertex which has its shortest distance and path updated is added into the next queue. The rest of the logic is the same as in the warp-based algorithm. To better utilize shared memory, the algorithm keeps all newly updated vertices in shared memory and writes these vertices out to global memory as a whole in SIMD phase (line 64 – line 72). We found this to be a simpler way than the prefix sum for coordinating allocation described in [5]. The atomicAdd at line 66 suffers minimal overhead from lock contention because only the first thread of each virtual warp performs this atomic operation. The host logic needs to examine the value stored in nextSize. If the value is 0, it indicates there are no more vertices in the next queue. At this point, the algorithm is complete and the execution is terminated. Otherwise, the host logic launches a new kernel and swaps the current queue with the next queue.
Verification
We ran the same verification step for this algorithm. Below is the summary of running the algorithm in verification mode.

 1:__global__ void kernel1Queue(int *Va, int *Ea, int *Wa, int *Ma, costpath *Ua,
 2:                             int *curQueue, int *curSize, int *nextQueue, int *nextSize) {
 3:  int warp_nums = blockDim.x / WARP_SZ, warp_id = threadIdx.x / WARP_SZ;
The summary indicates that all shortest distances computed by the algorithm and Dijkstra’s algorithm are the same. In addition, the length of each shortest path also matches with the corresponding shortest distance returned by both algorithms.
Performance Results
The results we got for this algorithm did not meet our expectation. At first, we expected the algorithm to achieve better results than all of the previous results described above. In fact, the algorithm performs poorly on all graphs. The number of iterations goes up drastically and, as a result, the performance is degraded. However, once we looked at the kernel runtime, we started to realize the problem. As the second table shows, the minimum costs of the kernel runtime are relative to each other regardless of the size of the graphs. This is a good indication of work efficiency. However, the average costs are quite high compared with the minimum costs. The reason is that the current queue gets larger after each iteration. If we process all of the vertices stored in the current queue, the algorithm becomes inefficient. This issue is addressed in our next algorithm.
We produced these results by running this command with different input graphs:
8. Queue-Based-Filter Algorithm

This algorithm is similar to the queue-based algorithm. The difference is that this algorithm only selects appropriate vertices to add into the next queue. The selection is based on the filter value passed in the kernel call. Because of the existence of the filter, the host logic can no longer rely on nextSize to determine whether the algorithm should finish. The algorithm must set the corresponding value of an updated vertex in Ma regardless of whether this vertex will be added to the next queue or not. The values in the Ma array will be used to reconstruct the current queue when the next queue is empty. This means we need an extra kernel method to perform this type of logic. The terminating condition for this algorithm is a bit different from the queue-based algorithm. The next queue might be empty because the filter value is set too low. If this happens, the host logic will call reconstructQueue to reconstruct the current queue based on the values set in Ma. If the current queue is still empty after completing the reconstructQueue kernel, the algorithm is complete. Otherwise, the host will increase the current filter value by K.
Read results              d->h transfer   6.92224 ms
-------------
GPU KERNEL TIME with respect to CUDA   d     1861.56702 ms
-------------
GPU TOTAL TIME with respect to CPU     h+d   2580.41772 ms
100 out of 100 source node GPU PATH LENGTHS correct.
100 out of 100 source node CPU PATH LENGTHS correct.
0 out of 100 source node PATHS match.
Num of kernel invocations                    22318
Alloc & input graph       h->d input     1.02384 ms
Set up new source         h->d input     18.94548 ms
Kernel execution          d              1255.09021 ms
Min kernel execution      d              0.00992 ms
Max kernel execution      d              0.08352 ms
Average kernel execution  d              0.05624 ms
Total kernel execution    d              1255.09021 ms
Synchronize               d              194.85789 ms
Ma reads                  d->h transfer  383.28470 ms
Read results              d->h transfer  6.81235 ms
Performance Results
This algorithm truly demonstrates the work-efficiency advantage, which results in much better performance on large graphs compared with all of the above results. The GPU runtime of this algorithm is comparable with Dijkstra's algorithm. On larger graphs, this algorithm might outperform Dijkstra's algorithm.
We produced these results by running this command with different input graphs:
9. Conclusion

Our report shows step by step how to design an efficient algorithm for the SSSP problem on GPUs. Initially, we started with the basic algorithm and focused on correctness. We then made incremental improvements, with each new algorithm building on the current one. The 1-kernel algorithm was an important achievement in our study. It made designing a new algorithm much easier, with less troubleshooting effort. Our report emphasizes the importance of work efficiency in designing an algorithm on GPUs. As we have shown, the queue-based-filter algorithm is the only algorithm comparable with Dijkstra's algorithm in terms of performance. There is still room for improvement on this algorithm. We did not incorporate pinned/mapped memory into the implementation; using pinned/mapped memory might produce slightly better performance. Another improvement would be, instead of adding updated vertices into the next queue, to first add them into the current queue until the current queue is full, and only then start adding the next updated vertices to the next queue.
References
[1] M.J. Flynn. Some Computer Organizations and Their Effectiveness. IEEE Trans. Computers, C-21, No. 9, pp. 948-960, 1972.
[2] Fred Glover, Randy Glover and Darwin Klingman. Threshold Assignment Algorithm. Mathematical Programming Studies, Volume 26, pp. 12-37, 1986.
[3] Pawan Harish and P.J. Narayanan. Accelerating Large Graph Algorithms on the GPU Using CUDA. HiPC 2007, LNCS 4873, pp. 197-208, 2007.