ACCELERATION OF BELLMAN-FORD SINGLE SOURCE SHORTEST PATH
USING PARALLEL COMPUTING
A Project
Presented to the faculty of the Department of Electrical and Electronic Engineering
California State University, Sacramento
Submitted in partial satisfaction of
the requirements for the degree of
MASTER OF SCIENCE
in
Electrical and Electronic Engineering
by
Ankita Saxena
FALL
2018
© 2018
Ankita Saxena
ALL RIGHTS RESERVED
ACCELERATION OF BELLMAN-FORD SINGLE SOURCE SHORTEST PATH
USING PARALLEL COMPUTING
A Project
by
Ankita Saxena
Approved by:

_________________________________, Committee Chair
Dr. Suresh Vadhva

_________________________________, Second Reader
Prof. Russ Tatro

____________________________
Date
Student: Ankita Saxena
I certify that this student has met the requirements for format contained in the University
format manual, and that this project is suitable for shelving in the Library and credit is to
be awarded for the project.
__________________________, Graduate Coordinator
Dr. Preetham Kumar

___________________
Date
Department of Electrical and Electronic Engineering
Abstract
of
ACCELERATION OF BELLMAN-FORD SINGLE SOURCE SHORTEST PATH
USING PARALLEL COMPUTING
by
Ankita Saxena
With the advancement of technology, we need our work to be done in a
faster and more efficient way, especially in the areas of telecommunication and VLSI
technologies. We need our problems to be resolved rapidly. Hence, in the world of
computing, the concept of parallelism in the CPU was introduced. The computer hardware
is made to perform a task more resourcefully using either a CPU or a GPU. We can
achieve better performance on CPUs by parallelizing the workload across multiple
threads, utilizing several processors to execute different stages. This can be done through data
parallelism, by creating threads and distributing them to cores. Each processor executes various
threads or processes on the same or different data. For instance, on a 2-core processor, we can
ask CPU1 to perform Task 1 while CPU2 simultaneously performs Task 2. Multiple
tasks are completed at the same time by quickly switching between the various CPU cores. The
same code runs independently on different cores, with the data sharing embedded in the
code. One of the challenges faced here is to make sure that no data is lost in the process
of sharing data amongst the different CPU cores. This parallelization helps reduce
execution time, thus scaling up the performance of the hardware.
In this project, the single-source shortest-path Bellman-Ford algorithm is
implemented in C using the POSIX threads execution model. In a shortest-path
problem, we need to determine the path between 2 nodes, or vertices, in a graph such that
the sum of the weights of its edges is minimum. For example, if a salesman has to travel
from place A to place B through various intermediate points with different costs, we are dealing
with a shortest-path problem. Here, the weight need not only be the physical distance, but time and
monetary cost as well. For the project, this algorithm has been executed both in series (1
processor) and in parallel (multiple processors), and the performances were compared to
calculate the speedup.

Performance was also tested on bigger graphs to see how well larger workloads would
benefit from parallelism. A speedup of 18 was achieved using 16 processors. To ensure
correctness, the execution on 1 processor versus 16 processors was tracked and the
results were verified. As noted, there is scope for improvement by
employing node-based workload distribution, as groups of outgoing edges, and also
dynamic load balancing. As seen in the results, the solution is scalable with both the size of
the input graph and the number of processors.
_______________________, Committee Chair
Dr. Suresh Vadhva
_______________________
Date
ACKNOWLEDGEMENTS
I want to express my gratitude to Dr. Suresh Vadhva for providing me the
opportunity to work on this project. I am indebted to him for his guidance and continuous support
throughout the project. He not only showed great patience in dealing with the issues I
was stuck on for long, but also advised and motivated me to overcome the problems. I would
also like to thank Professor Russ Tatro for giving me his valuable feedback and taking the
time to review my project. I am honored to have had the opportunity to be mentored by
them. Their willingness to allow me to pursue a project of my own interest
was greatly appreciated.

I am also grateful to Dr. Preetham Kumar, graduate coordinator of the Electrical
and Electronic Engineering Department, for his unbounded cooperation and support.

Further, I am thankful to the Department of Electrical and Electronic Engineering
at California State University, Sacramento for providing me the skill set required to
complete the project.
TABLE OF CONTENTS

Acknowledgements
List of Tables
List of Figures

Chapter

1. INTRODUCTION
   1.1 Background
   1.2 Advantages of parallel computing
      1.2.1 Provide concurrency
      1.2.2 Solving large and complicated tasks
      1.2.3 Saving time, money and power
   1.3 Concept of parallel computing
   1.4 Parallel computer memory architectures
      1.4.1 Shared memory
      1.4.2 Distributed memory
      1.4.3 Hybrid distributed - shared memory
   1.5 Parallel programming models

2. POSIX THREAD PROGRAMMING
   2.1 Background
   2.2 Pthread
   2.3 Why pthreads?
   2.4 Motivation
   2.5 The Pthread API
      2.5.1 Thread management
      2.5.2 Mutex
      2.5.3 Condition variables
      2.5.4 Synchronization
   2.6 Pthread system calls
      2.6.1 Creating pthread
      2.6.2 Joining pthreads
      2.6.3 Pthread mutex
      2.6.4 Initializing and destroying pthread

3. INTRODUCTION AND IMPLEMENTATION OF THE ALGORITHM
   3.1 Background
   3.2 Algorithm selection
      3.2.1 Dijkstra's algorithm
      3.2.2 Bellman-Ford algorithm
   3.3 Project implementation
      3.3.1 Serial implementation
      3.3.2 Parallel implementation
   3.4 Input file
   3.5 Tools and platform used
   3.6 Key issues
      3.6.1 Load balancing
      3.6.2 Synchronization
   3.7 Optimization

4. RESULT

5. CONCLUSION

References
LIST OF TABLES

1. Bellman-Ford algorithm iteration 0
2. Bellman-Ford algorithm iteration 1
3. Bellman-Ford algorithm further iterations
4. Speedup of 2 cases vs number of processors
LIST OF FIGURES

1. Serial computation of different tasks on a processor
2. Parallel computation of different tasks on a processor
3. Shared memory architectures [3]
4. Distributed memory architecture
5. Hybrid distributed - shared memory
6. Different ways of task management
7. Mutual exclusion
8. Operation of different threads on shared data
9. Shortest-path problem
10. Graph for Bellman-Ford algorithm
11. Speedup (y-axis) vs Processors (x-axis) for 2 graphs
Chapter 1
INTRODUCTION
1.1 Background
As proved by Moore’s law, the performance of a chip will double every eighteen
months due to two main reasons, that is scaling of frequency and utilization of
instruction-level parallelism. Lately, it is very challenging to continue exploiting them as
both the sources are far from exhausted. Frequency scaling is less used these days due to
extreme heat dissipation in the desktop and server segments. And, instruction-level
parallelism is practically infeasible due to the “increasing wiring delays together with
exponentially growing reorder buffers” [1]. As a result, microprocessors have multi-core
structures which replicates doubling of transistors every eighteen months on die.
In today’s digitalized era, “computing” is a major widely used industrial concept. It’s
a process by which a particular task is managed amongst the CPU architecture and
integrated with the vital components of modern technologies. It manipulates a given data
set according to the set of instructions in accordance with algorithmic processes. The data
can be managed by designing and building hardware and software system for various
purposes. Parallel computing means solving a task of size n by partitioning its domain
into k>2 or k = 2 parts and resolving them with p number of physical processors,
simultaneously. [2] . This parallelism can be broken down into 2, data parallelism and
task parallelism. Data Task parallelism is the concurrent implementation on several cores
Page 14
2
of several different functions on either one or different datasets whereas Data
parallelism is the concurrent implementaion on several cores of the similar function on
different elements of a dataset. Data parallelism is ideal for GPU as its architecture is best
suited when the threads execute same instruction but on different data. Task parallelism
works better for CPU as its architecture grants access to different tasks to be performed
on each thread, thus portioning well for accomplishing a given task. Computational
problems are often classified as data parallelism, hence we replicate the same behavior
for the project.
This project exploits 2 types of computation: serial computation, and parallel
computation with different numbers of processors. The number of processors varies
from 1 to 16. As shown in Figure 1, in serial computing a consecutive and ordered
execution of processes is done. The processes run in successive progression: the task is
broken into a series of instructions and each instruction is executed successively on a
single processor. A single instruction is executed at a time.
Figure 1. Serial computation of different tasks on a processor
In parallel computing, by contrast, many tasks or processes are carried out
concurrently; multiple computing resources are used simultaneously to solve a given
problem. A task is broken down into various discrete subtasks, which are further broken
down into sequences of instructions and executed concurrently on different processors,
as demonstrated in Figure 2. Data sharing and coordination are set up between the
modules to maintain consistency and avoid any data loss. This method solves
computational tasks in less time by using the numerous resources available, rather than
only one compute resource. Here the compute resource is a single computer with
multiple processors or cores, or an arbitrary number of single-core computers connected
in a network.
Figure 2. Parallel computation of different tasks on a processor
Parallel computers are parallel from the hardware perspective: they consist of many
functional units (L1 and L2 caches, branch units, GPU, etc.), have various execution
units or cores, and use threads. Networks connect various standalone computers, known
as "nodes", to make a cluster of parallel computers, each node comprising a memory and
multiple cores of its own.
1.2 Advantages of parallel computing
Parallel programming has some advantages over serial programming, which make it a
competent solution for certain types of computing problems, namely those suited to
multiprocessor and multithreading techniques. Some of the reasons are listed below:
1.2.1 Provide concurrency
Parallel computing consists of multiple resources working simultaneously, hence
achieving a high speedup.
1.2.2 Solving large and complicated tasks
Some problems have a complicated computing process or are huge in terms of
data points. Serial computation is also limited by the memory size. In parallel computing,
by breaking the job into multiple separate instructions and utilizing more memory, the
problem can be solved easily. For example, in a matrix multiplication, the results for
different rows/columns can be computed simultaneously to form the final matrix, rather
than calculating one element at a time.
1.2.3 Saving time, money and power
Due to better resource allocation, the time taken to complete a certain task is
much shorter, hence saving the potential cost as well. Also, as this style of computing
spreads work across various cores and threads, it helps in reducing the wastage of
computing power.
1.3 Concept of parallel computing
Speedup refers to the ratio of the computing time of the sequential algorithm to that of
the parallel algorithm. If the speedup is 5-fold, then in layman's terms it means that if the
serial execution takes 5 minutes for a task or piece of code, the parallel execution will
take a minute for the same task. At best, we can obtain an n-fold speedup with n
processors.

In computer architecture, Amdahl's law provides the theoretical speedup to be
expected for a task with a particular workload when the resources of the system are
improved. It is formulated in the following way:

Speedup = 1 / ((1 - P) + P/S)

where P is the time proportion of the algorithm which can be parallelized, and S is the
speedup factor for that portion of the algorithm due to parallelization.

For example, suppose we are using 5 processors on a portion that takes 90% of the
running time; then P = 0.90 and S = 5, and the overall speedup is

Speedup = 1 / ((1 - 0.90) + 0.90/5) = 1 / 0.28 = 3.57,

roughly a 4-fold speedup.
1.4 Parallel computer memory architectures
The memory architectures that can be employed to obtain optimum performance on
parallel computers, or to evaluate VLSI complexity in terms of chip area, are the
following:
1.4.1 Shared memory
In shared memory, all the processors have access to a common memory as a
global address space. The processors operate independently but share a memory
resource. In this type of memory, a change to a specific memory location by one
processor is transparent, and hence visible, to all the other processors. Based upon the
memory access times, shared memory is split into Uniform Memory Access (UMA) and
Non-Uniform Memory Access (NUMA). As Figure 3 shows, in UMA different
processors have equal access, and equal access time, to a common memory, whereas
NUMA is formed by linking more than one shared memory. In NUMA, the access times
to the memories may vary, and memory access is slower.
Figure 3. Shared memory architectures [3]
1.4.2 Distributed memory

As Figure 4 illustrates, in this architecture each processor has its own local
memory, and the memory addresses of one processor do not map to those of another
processor; hence there is no global address space.
Figure 4. Distributed memory architecture
1.4.3 Hybrid distributed - shared memory
The largest and fastest architectures use both of the memory models discussed
above. The "shared memory machine" component comes from the shared memory
architecture, and the networking of multiple shared memory machines is inherited from
the distributed memory architecture, as shown in Figure 5 below.
Figure 5. Hybrid distributed - shared memory
1.5 Parallel programming models
There are various programming models used for parallel programming, such as
threads, message passing and data parallel. The implementation model used for this
project is the threads model. Threads are an approach to improving performance through
parallelism, and this programming model is a form of shared memory programming. A
threads implementation usually comprises a library of subroutines and a set of compiler
directives called from source code.

There are two standardized implementations of threads: POSIX Threads and
OpenMP. For this project, the parallelism has been achieved using the POSIX threads
implementation discussed in the next chapter.
Chapter 2
POSIX THREAD PROGRAMMING
2.1 Background
In shared memory multiprocessor architectures, threads are used for the implementation
of parallelism. A standardized C language threads programming interface was specified
by the IEEE POSIX 1003.1c standard for UNIX systems. Implementations that follow
this standard are referred to as POSIX threads, or Pthreads. The three types of routines in
the pthread API are thread management, mutex variables and condition variables. [4]
2.2 Pthread
A pthread is an independent flow of execution which allows a program to run multiple
overlapping workflows at a given time. It runs concurrently within what the operating
system schedules as a "process". A process holds information about the resources and the
execution of a program. The threads of a process share its resources, depending on the
data dependencies. When data is shared amongst two or more threads, changes made to
the shared data are transparent, and hence visible, to all of them. Threads not only have
access to global data, but also have their own private data. On UNIX-based platforms,
we require the Pthread libraries for such implementations.
2.3 Why Pthreads?
The operating system overhead for creating and managing a thread is much lower than
for a process. By overhead, we mean the extra time, memory or bandwidth consumed
while performing a computation; threads thus necessitate fewer system resources. When
one thread is waiting for a system call to complete its given task, another thread can
perform its own function without waiting. Hence threads are lightweight processes. The
tasks can be interleaved and interchanged in real time, as demonstrated in Figure 6.
Figure 6. Different ways of task management
The above figure shows various cases in which interleaving, interchanging or
overlapping can be managed by threads. In the first 2 cases, the tasks are achieved in a
sequential manner, by swapping them. In the third case, the two tasks are interleaved. As
for case 4, the tasks overlap at one point, and hence the final outcome is achieved earlier
than in the rest.
Also, threads give a high-performance computing environment for achieving optimum
performance. They are advantageous for efficient communication and data exchange
through shared memory. All the threads share the same address space within a single
process, and hence data can be exchanged by simply passing pointers.
2.4 Motivation
The language used for this project is C rather than a managed high-level language
(HLL). A major reason for writing kernels in C is that it supports low-level techniques
that help in enhancing performance, particularly pointer arithmetic, easy escape from
type enforcement, explicit memory allocation, and custom allocators. [5]

Garbage collection and safety checks in HLLs consume a lot of CPU time and cause
delays. An HLL also hides important mechanisms like memory allocation, and
implements abstractions which reduce development options.
2.5 The Pthread API
The Pthread API was defined in the ANSI/IEEE POSIX 1003.1-1995 standard. Since
then, the POSIX standard has evolved and undergone various revisions, including the
Pthreads specification. The API's subroutines are divided into 4 groups: thread
management, mutexes, condition variables and synchronization.
2.5.1 Thread management
It consists of various operations like thread creation, termination, join and yield. These
routines create an executable thread, return control to the calling routine when the task
has succeeded, join a child thread to a parent thread, and/or schedule the thread execution
at run time.
2.5.2 Mutex
Mutex is an abbreviation for mutual exclusion object. It is a program object
created so that several threads can share the same resource, such as access to the same
file. Hence, it deals with synchronization between various threads. Whenever a mutex is
created for a particular resource, the system creates a unique ID for it. If any thread
requires the resource, it must use the mutex to lock the resource away from other threads
while using it. If the mutex is already locked, another thread demanding the resource is
held in a queue by the system and then granted control when the mutex is unlocked, that
is, after the completion of the preceding thread's work on that specific resource. This
helps in removing race conditions, wherein multiple accesses to the same variable are
made from different threads, most dangerously while writing.
Figure 7. Mutual exclusion
The above figure shows a case of mutual exclusion, where thread B tries to gain access
from time T2 to T3 but, due to the lock already acquired by thread A, is blocked and
queued. As soon as thread A finishes its task, thread B starts executing, and no other
thread is allowed to interfere until it completes its work.
2.5.3 Condition variables
When a thread is waiting on a resource, it may block itself, as its request may not
complete instantly. This waiting for an event to occur is handled by condition variables,
which set up communication between threads that share a mutex. There are two
operations on a condition variable: wait and signal. If a thread makes a wait call, it is
suspended and queued, waiting for the event represented by the condition variable to
occur. When the event occurs, the signal operation is applied to the corresponding
condition variable. Wait blocks the caller thread and enqueues it on the variable's list of
waiting threads; signal dequeues the thread at the head of the waiting list, unblocking it
and placing it on the ready list. The above process is illustrated in Figure 8, wherein the
threads sharing the data are processed concurrently.
Figure 8. Operation of different threads on shared data
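As a hedged illustration of this wait/signal protocol (the names here are generic examples, not taken from the project code):

#include <pthread.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  event = PTHREAD_COND_INITIALIZER;
int event_occurred = 0;

/* Waiting thread: sleeps until another thread signals the event. */
void wait_for_event (void) {
    pthread_mutex_lock (&lock);
    while (!event_occurred)                 /* re-check: wakeups can be spurious */
        pthread_cond_wait (&event, &lock);  /* atomically unlocks, sleeps, relocks */
    pthread_mutex_unlock (&lock);
}

/* Signaling thread: records the event and wakes the head of the waiting list. */
void announce_event (void) {
    pthread_mutex_lock (&lock);
    event_occurred = 1;
    pthread_cond_signal (&event);
    pthread_mutex_unlock (&lock);
}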
2.5.4 Synchronization
It refers to the management of read/write locks while system resources are shared by
different processes. Its main focus is to maintain data consistency across concurrent
accesses to shared data. This is done to prevent race conditions by enforcing mutual
exclusion on critical sections, or at the time of conflict between two threads.
2.6 Pthread system calls
The following system calls have been used in the project for thread
implementation.
2.6.1 Creating pthread
Syntax:
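The standard prototype from <pthread.h> is:

int pthread_create (pthread_t *thread, const pthread_attr_t *attr,
                    void *(*start_routine)(void *), void *arg);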
It generates a new thread, which starts execution by invoking the given routine. All the
threads are created by the programmer from the main() function. pthread_create has 4
arguments, as shown above. "thread" is a unique identifier for each new thread created.
"attr" is used to control the attributes of the pthread; it is initialized by
pthread_attr_init, and its default value is NULL. The "start_routine" is the function the
thread executes as soon as it is created. The "arg" is an argument which is passed by
reference, as a pointer, to start_routine.
2.6.2 Joining pthreads
Syntax:
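The standard prototype is:

int pthread_join (pthread_t thread, void **retval);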
This function is used to join one thread with another, terminated, thread. The caller waits
for the child thread to terminate in order to join it to the parent. If multiple threads try to
join the same thread simultaneously, the behavior is undefined.
2.6.3 Pthread mutex
Syntax:
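The standard prototypes are:

int pthread_mutex_lock (pthread_mutex_t *mutex);
int pthread_mutex_trylock (pthread_mutex_t *mutex);
int pthread_mutex_unlock (pthread_mutex_t *mutex);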
A mutex is a mutual exclusion device used to protect shared data from concurrent
modification. If the mutex is locked, the calling thread is queued until the mutex becomes
available, as explained in section 2.5.2. pthread_mutex_trylock() is similar to
pthread_mutex_lock(), except that when referred to an already locked mutex it does not
block but returns immediately with the error EBUSY (it returns 0 only when it has
acquired the lock). The function of pthread_mutex_unlock is to release the lock on the
mutex object. The lock should be freed once a thread has completed its access, waking
up a waiting thread; otherwise that thread will keep sleeping forever. Context switching
might occur.
2.6.4 Initializing and destroying pthread
Syntax of destroying pthread:
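The standard prototypes are:

int pthread_mutex_destroy (pthread_mutex_t *mutex);
int pthread_cond_destroy (pthread_cond_t *cond);
int pthread_attr_destroy (pthread_attr_t *attr);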
pthread_mutex_destroy invalidates the given mutex object; the destroyed mutex
becomes undefined. pthread_cond_destroy destroys the condition variable specified by
cond. When thread attributes are no longer needed, they are destroyed using
pthread_attr_destroy.
Syntax of constructing pthread:
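The standard prototypes are:

int pthread_mutex_init (pthread_mutex_t *mutex, const pthread_mutexattr_t *attr);
int pthread_cond_init (pthread_cond_t *cond, const pthread_condattr_t *attr);
int pthread_attr_init (pthread_attr_t *attr);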
A destroyed mutex object can be reinitialized by the corresponding call above. These
calls initialize the mutex lock, the condition variable cond, or the attribute object,
respectively.
Chapter 3
INTRODUCTION AND IMPLEMENTATION OF THE ALGORITHM
3.1 Background
“The shortest-path problem is the problem of finding a path between two vertices
in a graph such that the sum of the weights of its constituent edges is minimized.” [6].
Single-source shortest path is a constrained version of the shortest-path problem that
requires solving for the minimum distance from one given source vertex to all other
vertices of the graph. The primary aim of this project is to implement a parallel
multithreaded solution to the single-source shortest-path problem that can scale well with
the number of processors as graphs become larger and denser. To this end, we
experimented with two alternative implementations, differing only in that one uses
fine-grained locking over the graph, while the other uses conservative coarse-grained
locking. Another goal is to thoroughly evaluate the implementation and analyze the
results obtained.
Figure 9. Shortest-path problem
Shortest-path problems are based on the concept of iteratively relaxing edges, explained
as follows with the help of Figure 9. While relaxing edge (u,v), if it is found that the
distance d(v) to a vertex v from the source vertex s violates the inequality

d(v) ≤ d(u) + w(u,v)

then d(v) is set to the newly found shorter distance d(u) + w(u,v). After V-1 iterations of
relaxing all E edges in each iteration, we know the shortest distance to every node in the
graph.
In the following sections, the reasons for choosing the Bellman-Ford algorithm for
our parallel implementation are discussed, followed by details of the parallel
implementation and the optimizations made to the original algorithm.
3.2 Algorithm Selection
For a scalable parallel implementation, selecting an efficient algorithm is of
supreme importance. Several algorithms have been proposed to solve shortest-path
problems. Keeping the literature in mind, the candidates were narrowed down to
Dijkstra's algorithm and the Bellman-Ford algorithm. Here, we briefly discuss why we
ultimately chose a modified Bellman-Ford algorithm and not the other.
3.2.1 Dijktra’s algorithm
Dijkstra’s algorithm does not readily lend itself to parallelization, because it relies
on a priority queue storing the vertices of the graph, keyed on the best-known distance to
each vertex [7]. The parallelization replies on known properties of the graph. With the
research done, I also came across another con of the algorithm, as it cannot handle
negative length cycles. Example, if we have 3 nodes (A, B and C) with an undirected
graph of edges AB = 1, AC = 2and BC = -3. If we have to go from A to C, the optimal
path should be through B ( AB + BC = 1 + -3 = -2) as it is much better than AC (which is
2). But Dijkstra’s will not be able to evaluate this result, as the algorithm cannot access
negative edge weights.
3.2.2 Bellman-Ford algorithm
It evaluates shortest paths in a bottom-up fashion. The Bellman-Ford algorithm is
highly parallel in nature: within each iteration, each edge can be relaxed in parallel
without loss of correctness. Therefore, while a sequential implementation can make use
of an intelligent ordering of the relaxation of edges, a parallel implementation can make
use of a completely random order of relaxation. The unique ability of this algorithm to
handle negative edges, and to detect negative cycles in graphs, further motivated us to
select it.

Also, Dijkstra's is a greedy algorithm, which takes all the data points in a specific
given problem and then provides a rule for adding an element to the solution at each step
of the algorithm. [8] Its time complexity is O((V + E) log V) with a binary heap.
Bellman-Ford is simpler and suits distributed systems well. Its complexity is O(VE),
which is more than Dijkstra's.
The steps to calculate the shortest distances using Bellman-Ford are as follows:

1. Create an array of size V, where V is the number of vertices, holding the distance to
every vertex. Set the distance from the source U to each vertex to infinity, and the
distance of the source to itself to 0.

2. Loop V-1 times over all edges (u,v):
   if dist[v] > dist[u] + weight(u,v), then
   set dist[v] = dist[u] + weight(u,v)

3. A final loop is used to check for negative-weight cycles:
   if dist[v] > dist[u] + weight(u,v) still holds for some edge, then the graph contains a
   negative-weight cycle.
   This step is an extra measure taken to check whether we still get a shorter path despite
   the repeated iterations in step 2. If we get a shorter path, the graph has a negative-weight
   cycle; otherwise it doesn't.

In some cases, calculation of the shortest paths may take fewer than V-1 iterations.
Let’s understand the algorithm using a graph.
Suppose there are 6 vertices S, A, B, C, D, E and we need to find the shortest distance of
every vertex from S, as shown in Figure 10 below. We assign the source vertex the
distance 0 and initialize the rest of the distances as infinite. As there are 6 vertices, we
need to loop 5 times.
S    A    B    C    D    E
0    ∞    ∞    ∞    ∞    ∞

Table 1. Bellman-Ford algorithm iteration 0
Figure 10. Graph for Bellman-Ford algorithm
Processing each edge starting from S, we get the following distances in the first
iteration. The first row indicates the initial distances from S. The second, third and fourth
rows show the distances when the outgoing edges are processed from A, C and E
respectively. We skip B and D, as we do not yet know how to reach them at that instant.
S    A    B    C    D    E
0    ∞    ∞    ∞    ∞    ∞
0    1    ∞    ∞    ∞    1
0    1    ∞    3    ∞    1
0    1    2    3    -1   1

Table 2. Bellman-Ford algorithm iteration 1
The first iteration, shown above in Table 2, gives a first set of shortest-path estimates.
From all the edges in the second iteration, we get the following values, presented in
Table 3. Looping through the iterations, the last row shows the final values.
S    A    B    C    D    E
0    1    2    3    -1   1
0    0    2    2    -1   1
0    0    1    2    -1   1

Table 3. Bellman-Ford algorithm further iterations
Here we get the shortest distances within the early iterations, and hence save time by not
looping through the 4th and 5th iterations.
3.3 Project Implementation
The Bellman-Ford algorithm works on a given graph G, whose vertices are
represented by V and edges by E. As Bellman-Ford is a single-source shortest-path
algorithm, one vertex in the graph acts as the source vertex, S. The algorithm calculates
the shortest distance to every vertex from the source vertex. The edges are relaxed V-1
times.

For this project, we have implemented the algorithm both serially and in parallel.
3.3.1 Serial implementation
The sequential version of the original Bellman-Ford was first implemented and tested
for correctness. A high-level pseudo-code description is below:

1. For each vertex in V, set the distance from the source to infinity. Also, a predecessor
pointer P is kept for each vertex and set to none.

2. The second loop goes through each edge, shortens the distance between vertices, and
compares that distance against the currently known distance. This process is repeated
V-1 times. The relaxation test is

   v.distance > u.distance + weight(u,v)

Every edge needs to obey this triangle inequality. If the condition holds true, then

   v.distance = u.distance + weight(u,v); and v.p = u;

3. After the above algorithm runs successfully, another short loop is implemented to
check for negative cycles. For each (u,v) in E, we again check the relaxation condition;
if it is true, a negative-weight cycle exists.
Code:
/*
 * Final Graduation Project
 * Fall 2018
 * Ankita Saxena
 * Parallel Bellman-Ford algorithm for Single Source Shortest Path
 */
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

/*
 * Each edge can be relaxed in parallel, so it is convenient to represent
 * the graph by an array of edges to iterate over. Static load balancing
 * becomes easier. However, when outgoing edges of only certain nodes
 * require relaxation, we need to iterate over the entire data structure.
 */
typedef struct {
    unsigned int origin;
    unsigned int target;
    int weight;
} edge_t;

/* Read "origin target weight" triples; %d for the weight so that
 * negative edge weights are parsed correctly. */
int LoadGraph (FILE *fp, edge_t *graph, unsigned int num_edges) {
    unsigned int temp_origin;
    unsigned int temp_target;
    int temp_weight;
    unsigned int idx;

    idx = 0;
    while (fscanf(fp, "%u %u %d", &temp_origin, &temp_target,
                  &temp_weight) == 3) {
        graph[idx].origin = temp_origin;
        graph[idx].target = temp_target;
        graph[idx].weight = temp_weight;
        idx++;
    }
    if (num_edges != idx) {
        printf("Error: Edges found in file = %u, Expected = %u\n",
               idx, num_edges);
        return -1;
    }
    return 0;
}

/*
 * Check for correctness of the results.
 * Limitation: For worst case, maximum weight < INT_MAX/num_edges
 */
int Verify (const int *distance, const edge_t *graph,
            const unsigned int num_edges) {
    unsigned int idx;
    int err;
    edge_t edge;

    err = 0;
    for (idx = 0; idx < num_edges; idx++) {
        edge = graph[idx];
        if (distance[edge.origin] != INT_MAX &&
            (distance[edge.target] > distance[edge.origin] + edge.weight)) {
            err = 1;
            printf ("Error: Incorrect Results!\n Edge %0u: Calculated "
                    "distance to vertex %0u (%0d) is greater than calculated "
                    "distance to vertex %0u (%0d) + weight (%0d)\n",
                    idx, edge.target, distance[edge.target],
                    edge.origin, distance[edge.origin], edge.weight);
        }
    }
    return err;
}

void ShortestPath (
    const unsigned int num_verts,
    const unsigned int num_edges,
    const edge_t *graph,
    const unsigned int source,
    const unsigned int destination,
    int *distance) {

    unsigned int e;        /* edge index */
    unsigned int v, vv;    /* vertex indices */
    edge_t edge;
    int *new_dist;         /* candidate distances found this iteration */
    int *lupdate;          /* per-vertex "updated this iteration" marks */
    int found_update;

    printf ("INFO: Calculating shortest path from %u to %u. "
            "Graph has %u vertices and %u edges\n",
            source, destination, num_verts, num_edges);

    new_dist = (int *) malloc (num_verts * sizeof(int));
    lupdate  = (int *) calloc (num_verts, sizeof(int));

    for (v = 0; v < num_verts; v++) {
        distance[v] = (v == source) ? 0 : INT_MAX;
    }

    for (v = 0; v < num_verts; v++) {
        /* Relax every edge once in this iteration */
        for (e = 0; e < num_edges; e++) {
            edge = graph[e];
            if (distance[edge.origin] != INT_MAX &&
                (distance[edge.origin] + edge.weight <
                 distance[edge.target])) {
                new_dist[edge.target] = distance[edge.origin] + edge.weight;
                lupdate[edge.target] = 1;
            }
        }
        /* Commit the updated distances; stop early if nothing changed */
        found_update = 0;
        for (vv = 0; vv < num_verts; vv++) {
            if (lupdate[vv] == 1) {
                found_update = 1;
                distance[vv] = new_dist[vv];
                lupdate[vv] = 0;
            }
        }
        if (found_update == 0) {
            break;
        }
    }
    free (new_dist);
    free (lupdate);
}

int main (int argc, char* argv[]) {
    unsigned int source;
    unsigned int destination;
    unsigned int num_verts;
    unsigned int num_edges;
    char *filename;
    FILE *file;      /* pointer to the graph file to read from */
    edge_t *graph;
    int *distance;
    unsigned int v;

    /* Process command-line arguments */
    if (argc != 4 && argc != 6) {
        printf ("%d\n", argc);
        printf ("Error: \n Usage: <spbf> <graph file> <source> "
                "<destination> ::optional:: <num_verts> <num_edges>\n");
        return -1;
    }
    else {
        filename = argv[1];
        source = atoi(argv[2]);
        destination = atoi(argv[3]);
        if (argc == 4) {
            num_verts = 50000;
            num_edges = 1000000;
        }
        else {
            num_verts = atoi(argv[4]);
            num_edges = atoi(argv[5]);
        }
        graph = (edge_t *) malloc (num_edges * sizeof(edge_t));
        distance = (int *) malloc (num_verts * sizeof(int));
    }

    file = fopen (filename, "r");
    if (file == NULL) {
        printf("Error: Unable to open graph file\n");
        return -1;
    }
    else {
        if (LoadGraph (file, graph, num_edges) != 0) {
            return -1;
        }
    }

    /* Arguments and file are good, now calculate shortest path */
    ShortestPath (num_verts, num_edges, graph, source, destination, distance);
    for (v = 0; v < num_verts; v++) {
        printf("Distance to node %u = %d\n", v, distance[v]);
    }
    Verify (distance, graph, num_edges);
    return 0;
}
3.3.2 Parallel Implementation
The sequential version was then parallelized to run on an arbitrary number of
processors. Our implementation is based on Papaefthymiou and Rodrigue's paper [9].
The pseudocode for the parallel implementation is given below, followed by a discussion
of our decisions on two key issues.

Below is the algorithm implemented for parallelization:
Repeat V times:
1. Compute start and last edge to obtain the set of edges to work on (load
balancing).
2. For each edge in set:
a. Acquire locks on origin and target vertices.
b. Relax edge and set local new distance.
c. If relaxation successful, mark vertex.
d. Release locks on origin and target vertices.
3. For all vertices marked in step 2c:
a. Acquire lock on vertex.
b. If local copy of new distance is lesser than global distance copy,
update the global distance with the local distance.
c. Set own global found-update flag
d. Release lock on vertex.
4. Wait at the barrier.

5. If the NONE_LEFT flag is set, break out of the loop.

The setting of the NONE_LEFT flag is explained in section 3.7 (Optimization).
Code:
The organization of the code, with explanations, is below:
1. The pthread libraries for UNIX were included. A structure was defined to hold an
edge's origin, its target and the distance between them; an array of these records depicts
the graph. An argument for the thread function is also set up, as sketched below.
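A minimal sketch of these declarations, reusing edge_t from the serial version; in this sketch the thread id is passed directly to the thread function rather than through a separate argument structure:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

#define MAX_PROCS 16        /* upper bound on worker threads, per the note below */

typedef struct {            /* same edge record as the serial version */
    unsigned int origin;
    unsigned int target;
    int weight;
} edge_t;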
Note: the maximum number of processors has been defined as 16, as we will be testing
the tasks on different numbers of processors.
2. As we have tested the parallel code on different numbers of processors, this part
defines the number of processors being used (NumProcs) and accordingly allocates and
initializes the pthread objects (the mutexes and condition variable explained in chapter 2).
3. Each thread is given an ID, tid. This section of the code also manages the mutex lock;
it keeps a count of arrivals in SyncCount and checks whether the last thread has been
reached. Barriers are used to coordinate a set of threads in real time by making all the
threads wait until every one of them has called the barrier function. This blocks all
threads in the barrier until the slowest, or last, thread participating in the task reaches the
barrier call.
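A counter-based barrier consistent with that description, as a hedged sketch; here the last arriving thread also evaluates the per-thread update flags and sets none_left, implementing the early-termination check described in section 3.7:

void Barrier (void) {
    static int generation = 0;   /* distinguishes successive barrier rounds */
    int my_generation;
    int updates;
    int t;

    pthread_mutex_lock (&SyncLock);
    my_generation = generation;
    SyncCount++;
    if (SyncCount == NumProcs) {
        /* Last arrival: did any thread update a vertex this iteration? */
        updates = 0;
        for (t = 0; t < NumProcs; t++)
            updates |= global_update[t];
        if (updates == 0)
            none_left = 1;       /* no updates anywhere: all threads may stop */
        SyncCount = 0;
        generation++;            /* release the waiters of this round */
        pthread_cond_broadcast (&SyncCV);
    }
    else {
        while (my_generation == generation)
            pthread_cond_wait (&SyncCV, &SyncLock);
    }
    pthread_mutex_unlock (&SyncLock);
}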
4. In this part, the graph is loaded by reading the input file. If there is any discrepancy in
the number of edges specified, it flashes an error and returns.
5. This loop verifies the accuracy of the results we obtain, thus making the code
self-checking. It also considers the worst case, in which the maximum weight is bounded
as noted in the comment.
/*
 * Check for correctness of the results.
 * Limitation: For worst case, maximum weight < INT_MAX/num_edges
 */
int Verify (const int *distance, const edge_t *graph,
            const unsigned int num_edges) {
    unsigned int idx;
    int err;
    edge_t edge;

    err = 0;
    for (idx = 0; idx < num_edges; idx++) {
        edge = graph[idx];
        if (distance[edge.origin] != INT_MAX &&
            (distance[edge.target] > distance[edge.origin] + edge.weight)) {
            err = 1;
            printf ("Error: Incorrect Results!\n Edge %0u: Calculated "
                    "distance to vertex %0u (%0d) is greater than calculated "
                    "distance to vertex %0u (%0d) + weight (%0d)\n",
                    idx, edge.target, distance[edge.target],
                    edge.origin, distance[edge.origin], edge.weight);
        }
    }
    return err;
}
6. This function implements the parallel Bellman-Ford algorithm according to the
pseudocode explained above. In every iteration, each thread acquires the locks, relaxes
its edges, releases the locks when the work is done, and records whether it made an
update.
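Only the tail of this routine is preserved below. The following hedged reconstruction of the loop body that precedes it follows the pseudocode above, reuses the serial version's new_dist and lupdate arrays, and acquires the two vertex locks in index order (an assumption made here to avoid deadlock):

void *ShortestPath (void *arg) {
    long int threadId = (long int) arg;
    /* Static load balancing: each thread owns a contiguous slice of edges. */
    unsigned int chunk = (num_edges + NumProcs - 1) / NumProcs;
    unsigned int start = (unsigned int) threadId * chunk;
    unsigned int last = (start + chunk < num_edges) ? start + chunk : num_edges;
    unsigned int e, v, vv, lo, hi;
    int found_update;
    edge_t edge;

    for (v = 0; v < num_verts; v++) {
        /* Step 2: relax this thread's edges under per-vertex locks. */
        for (e = start; e < last; e++) {
            edge = graph[e];
            lo = (edge.origin < edge.target) ? edge.origin : edge.target;
            hi = (edge.origin < edge.target) ? edge.target : edge.origin;
            pthread_mutex_lock (&VertexLock[lo]);
            if (hi != lo)
                pthread_mutex_lock (&VertexLock[hi]);
            if (distance[edge.origin] != INT_MAX &&
                distance[edge.origin] + edge.weight < distance[edge.target]) {
                new_dist[edge.target] = distance[edge.origin] + edge.weight;
                lupdate[edge.target] = 1;       /* step 2c: mark the vertex */
            }
            if (hi != lo)
                pthread_mutex_unlock (&VertexLock[hi]);
            pthread_mutex_unlock (&VertexLock[lo]);
        }
        /* Step 3: commit the marked vertices; the preserved fragment below
         * continues from inside this loop. */
        found_update = 0;
        for (vv = 0; vv < num_verts; vv++) {
            if (lupdate[vv] == 1) {
                pthread_mutex_lock (&VertexLock[vv]);
                if (new_dist[vv] < distance[vv]) {
                    distance[vv] = new_dist[vv];
                    found_update = 1;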
                }
                lupdate[vv] = 0;
                pthread_mutex_unlock (&VertexLock[vv]);
            }
        }
        if (found_update == 1) {
            global_update[threadId] = 1;
        }
        else {
            global_update[threadId] = 0;
        }
        /* Wait for all threads to relax all edges for this iteration */
        Barrier ();
        if (none_left == 1) {
            break;
        }
    }
    return NULL;
}
7. This is the main calling function, the entry point of the code. It manages and
communicates with all the calls mentioned above, binding the whole code together and
executing the calls in accordance with the requirements.
int main (int argc, char* argv[]) {
    char *filename;
    FILE *file;
    unsigned int e, v;
    pthread_t *threads;
    pthread_attr_t attr;
    int ret;
    long int threadId;

    /* Process command-line arguments */
    if (argc != 5 && argc != 7) {
        printf ("%d\n", argc);
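The remainder of main() follows the serial version's argument parsing and graph loading; the thread launch and join that follow are sketched here under the same assumptions (the thread id is passed as the argument to each worker):

    threads = (pthread_t *) malloc (NumProcs * sizeof(pthread_t));
    pthread_attr_init (&attr);
    for (threadId = 0; threadId < NumProcs; threadId++) {
        ret = pthread_create (&threads[threadId], &attr,
                              ShortestPath, (void *) threadId);
        if (ret != 0) {
            printf ("Error: pthread_create failed for thread %ld\n", threadId);
            return -1;
        }
    }
    for (threadId = 0; threadId < NumProcs; threadId++) {
        pthread_join (threads[threadId], NULL);
    }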
3.4 Input file
For initial testing of the code, the source vertex, destination vertex and the distance
between them were fed to the code through a file. This manually generated file defined
the graph. In the input file shown below, there are only 7 vertices, for basic testability of
the code. The same format was used to generate and replicate bigger graphs with
thousands or millions of vertices and edges. The format was broken into different
columns: source vertex, destination vertex and edge distance.
The file was opened and accessed by the function LoadGraph, as shown earlier in the
code.
The input file looked like below:
0 2 2
0 3 16
0 1 6
0 6 14
1 3 5
1 4 4
2 5 8
2 4 3
2 1 7
3 6 3
4 6 10
4 3 4
5 6 1
3.5 Tools and platform used
The simulations were run on the Linux (Ubuntu) operating system on an Intel Xeon
processor. The code was compiled using the GNU C compiler (GCC).
3.6 Key issues
During the parallel implementation, I faced 2 major issues, as explained below:
3.6.1 Load balancing
It refers to the efficient distribution of the incoming workload across multiple resources.
It increases reliability, makes optimum use of resources, maximizes throughput and helps
in avoiding overload on a single resource.
Since in each iteration a processor has to relax a fixed number of edges, the natural
choice is static load balancing. This reduces fine-grained inter-thread communication and
is straightforward to implement. I also had to decide whether to distribute edges in a
random order, as they appear in the (randomly generated) graph file, or to distribute
groups of outgoing edges of vertices. While the second option would have allowed some
useful optimizations, it would require some pre-processing unless the input data is
already in that format. Thus the former option was chosen.
3.6.2 Synchronization
I experimented with two different locking schemes. First, I implemented a
conservative locking scheme with a single mutex lock protecting shared access to the
distances of all the vertices in the graph. This produced poor results, which were
improved vastly by using a separate lock for each vertex in the graph. The drawback of
this is that the number of locks grows with the number of graph vertices.
3.7 Optimization
A key optimization over the original Bellman-Ford algorithm, proposed by Yen
[10], has been included in our implementation. If, at the end of an iteration, no processor
has made an update to any node of the graph, then the shortest paths to all nodes are
known and the algorithm can terminate early.
To achieve this, at the end of every iteration each processor updates its own flag
in shared memory if it made an update to any node in that iteration, and then waits at the
barrier. The last processor to reach the barrier checks the global flag of each processor,
including its own. If all flags are clear, it sets the global NONE_LEFT flag (which
initially starts out clear). Before the next iteration begins, each processor checks this flag
and terminates if it is set.
Chapter 4
RESULT
The requirement of the project was to test performance on graphs with 50,000
vertices and 1,000,000 edges. I pseudo-randomly generated 15 such graphs and measured
the average speedup on 1 to 16 processors, as compared to a single processor. The
observed speedups are noted in the table below.
No. of processors    Speedup (V = 50k, E = 1M)    Speedup (V = 5M, E = 100M)
 1                    0.8                          0.8
 2                    1.2                          2.17
 3                    1.8                          3.36
 4                    2.1                          4.3
 5                    2.5                          5.9
 6                    2.6                          6.6
 7                    3.9                          8.13
 8                    3.5                          9.12
 9                    4.3                          10.2
10                    5.4                          12.3
11                    3.9                          13.5
12                    4.5                          15.05
13                    6.4                          16.22
14                    6.7                          16.88
15                    6.9                          17.43
16                    7.9                          18
Table 4. Speedup of 2 cases vs number of processors
Based on the above table, the following graph was obtained.

Figure 11. Speedup (y-axis) vs Processors (x-axis) for 2 graphs

The graph above shows the speedup of the 2 test graphs using various numbers
of processors. The yellow line indicates the test performed with 50,000 vertices and
1 million edges. I achieved a speedup of 7.9 using 16 processors. A possible explanation
for the speedup not scaling as well as expected is multiprocessor cache-coherence
effects, together with the synchronization overhead. In shared memory multiprocessors,
each processor has its own cache memories, so the system is bound to have more than
one copy of a datum at different locations. If one copy is changed, it is necessary to
propagate the change to the other copies as well. This mechanism of maintaining data
consistency throughout the system is known as cache coherence. "In terms of computer
science, an overhead is defined as any combination of excess or indirect computation
time, memory, bandwidth, or other resources that are required to perform a specific
task." [11] This graph alone wasn't enough to show the increased efficiency of
parallelism.
Another graph was then plotted with an increased workload, to measure the
competence of parallelism against the number of processors. I tested performance on
bigger graphs to see how well larger workloads would benefit from the parallelism. On
the bigger graph, with 5 million vertices and 100 million edges, an almost linear speedup
was observed.
Note that a speedup of 18 using 16 processors was achieved! This may seem
incorrect at first. To ensure correctness, I first verified the results and then tracked the
execution on 1 processor versus 16 processors. In contrast to a single processor that
relaxes one edge at a time, the N processors in our implementation relax edges in
parallel. Therefore, the order of relaxation of the edges is highly likely to change, which
can cause the shortest paths to become known early, thereby causing the algorithm to
terminate early. This is known as super-linear speedup, and refers to a program which
executes faster than the theoretical speedup: despite having n cores, the recorded speedup
can be greater than n. As soon as the threads are done, the process is terminated.
The parallelism was implemented because the size of the inputs to be
investigated grew exponentially, and parallelism would benefit the speed of execution. It
scales up the performance of the task by dividing the problem among the processors.
Chapter 5
CONCLUSION
The main purpose of this project was to utilize the CPU resources fully and gain
the optimum achievable performance. The requirement of this project was to produce a
scalable parallel implementation of a single-source shortest-path algorithm. The
algorithm chosen was Bellman-Ford, over Dijkstra's, and it was implemented in the C
language using the pthread API.
The Bellman-Ford algorithm works by overestimating the length of the path from
the source vertex to all the other vertices, assigning them the value infinity. Then it
"repeatedly relaxes those estimates by finding new paths that are shorter than the
previously overestimated paths". Bellman-Ford was chosen over Dijkstra's as it also
considers negative-weight edges, hence increasing the accuracy of the solution to the
problem. Solutions to the shortest-path problem have several applications, such as
driving directions on web mapping websites, networking or telecommunication, and
VLSI design applications.
This report presents the methods used to implement a parallelized solution to the
problem using a modified Bellman-Ford algorithm, and discusses the results obtained.
The computations were done in serial and in parallel to see the performance enhancement
on the CPU. I used 15 randomly generated graphs with 50,000 vertices and 1,000,000
edges for the experiments and calculated the speedup on up to 16 processors, achieving
an average speedup of 7.9 using 16 processors. As the scaling was not very linear, I
increased the workload to 5 million vertices and 100 million edges. The performance on
it was much more linear and improved manifold. In conclusion, a speedup of almost 8
times over the execution time on a single processor is not bad using 16 processors. As
noted, there is scope for improvement by employing node-based workload distribution,
as groups of outgoing edges, and also dynamic load balancing. As seen in the results, the
solution is scalable with both the size of the input graph and the number of processors.
This result was obtained through multithreading, which executes a process or a
given task using threads, each with its own program counter keeping track of the next
instruction to be accomplished while holding the current working variables. It breaks the
work down into various small instruction streams and executes them simultaneously
across the cores. Each thread belongs to one process or task. Threads provide a suitable
foundation for parallel execution on shared memory multiprocessors. Threads help in
minimizing context switching and provide concurrency within a process. All the threads
need to be managed, efficient communication needs to be maintained, and proper steps
for synchronization need to be taken. Threads utilize the multiprocessor architecture to a
great extent. The execution was done with the help of the POSIX pthread library on
Linux, helping to control the flow of the processes by parallelizing them and overlapping
their execution in time.
Today, numerous parallel chip architectures include different numbers of cores,
different abilities in each core, and optimized power or energy consumption per core.
Nowadays, GPUs are the future of parallel computing. GPU computing allows
programmers to exploit parallelism using recently introduced advanced programming
languages like CUDA and OpenCL. This encourages interaction between different
sections of a system, and thus leverages the investment in programming. GPUs have
tremendous computing and memory power, leading to high-performance computing.
This expands the range of computational graphics and high-performance computing
applications. CUDA has helped in creating a lot of demand for NVIDIA GPUs.
REFERENCES
1. V. Agarwal, M. S. Hrishikesh, S. W. Keckler and D. Burger, "Clock Rate versus IPC:
The End of the Road for Conventional Microarchitectures", in Proceedings of the 27th
Annual International Symposium on Computer Architecture (ISCA-00), pages 248-259,
2000.

2. Cristóbal A. Navarro, Nancy Hitschfeld-Kahler and Luis Mateu, "A Survey on
Parallel Computing and its Applications in Data-Parallel Problems Using GPU
Architectures", Communications in Computational Physics, Vol. 15, No. 2, pp. 285-329.

3. Blaise Barney, "Introduction to Parallel Computing", Lawrence Livermore National
Laboratory, U.S. Department of Energy, Contract DE-AC52-07NA27344, 07-Mar-2017.
Available: https://computing.llnl.gov/tutorials/parallel_comp/#Whatis. [Accessed:
10-Oct-2018]

4. Blaise Barney, "POSIX Threads Programming", Lawrence Livermore National
Laboratory, U.S. Department of Energy, Contract DE-AC52-07NA27344, 07-Mar-2017.
Available: https://computing.llnl.gov/tutorials/pthreads/. [Accessed: 17-Oct-2018]

5. Cody Cutler, M. Frans Kaashoek and Robert T. Morris, MIT CSAIL, "The benefits
and costs of writing a POSIX kernel in a high-level language", ISBN 978-1-931971-47-8.

6. "Shortest Path Problem", Wikipedia, 13-Sept-2018. [Online]. Available:
https://en.wikipedia.org/wiki/Shortest_path_problem. [Accessed: 16-Oct-2018]

7. Kevin Kelley and Tao B., "Parallel Single Source Shortest Paths", MIT Computer
Science and Artificial Intelligence Laboratory.

8. Karleigh Moore, Jimin Khim and Eli Ross, "Greedy Algorithms", Brilliant,
19-Oct-2018. Available: https://brilliant.org/wiki/greedy-algorithm/. [Accessed:
10-Oct-2019]

9. Marios Papaefthymiou and Joseph Rodrigue, "Implementing Parallel Shortest-Paths
Algorithms".

10. Jin Y. Yen, "An algorithm for finding shortest routes from all source nodes to a given
destination in general networks", Quarterly of Applied Mathematics, 27: 526-530, 1970.
MR 0253822.

11. "Overhead (Computing)", Wikipedia, 13-Sept-2018. [Online]. Available:
https://en.wikipedia.org/wiki/Overhead_(computing). [Accessed: 16-Oct-2018]