ACCELERATION OF BELLMAN-FORD SINGLE SOURCE SHORTEST PATH
USING PARALLEL COMPUTING
A Project
Presented to the faculty of the Department of Electrical and Electronic Engineering
California State University, Sacramento
Submitted in partial satisfaction of
the requirements for the degree of
MASTER OF SCIENCE
in
Electrical and Electronic Engineering
by
Ankita Saxena
FALL
2018
© 2018
Ankita Saxena
ALL RIGHTS RESERVED
ACCELERATION OF BELLMAN-FORD SINGLE SOURCE SHORTEST PATH
USING PARALLEL COMPUTING
A Project
by
Ankita Saxena
Approved by:

_________________________________, Committee Chair
Dr. Suresh Vadhva

_________________________________, Second Reader
Prof. Russ Tatro

____________________________
Date
Student: Ankita Saxena
I certify that this student has met the requirements for format contained in the University
format manual, and that this project is suitable for shelving in the Library and credit is to
be awarded for the project.
__________________________, Graduate Coordinator
Dr. Preetham Kumar

___________________
Date
Department of Electrical and Electronic Engineering
Abstract
of
ACCELERATION OF BELLMAN-FORD SINGLE SOURCE SHORTEST PATH
USING PARALLEL COMPUTING
by
Ankita Saxena
With the advancement of technology, we need our work to be done in a
faster and more efficient way, especially in the areas of telecommunication and VLSI
technologies. We need our problems to be resolved rapidly. Hence, in the world of
computing, the concept of parallelism in the CPU was introduced. The computer hardware
is made to perform a task more resourcefully using either a CPU or a GPU. We can
achieve better performance on CPUs by parallelizing the workload across multiple
threads, utilizing several processors to execute different stages. This can be done through data
parallelism, by creating threads and distributing them to cores. Each processor executes various
threads or processes on the same or different data. For instance, on a 2-core processor, we can
ask CPU1 to perform Task 1 while CPU2 simultaneously performs Task 2. Multiple
tasks are completed at the same time by quickly switching between the various CPU cores. The
same code runs independently on different cores, with the data sharing embedded in the
code. One of the challenges faced here is to make sure that no data is lost in the process
of sharing data amongst the different CPU cores. This parallelization helps reduce
execution time, thus scaling up the performance of the hardware.
In this project, the single-source shortest-path Bellman-Ford algorithm is
implemented in C using the POSIX threads execution model. In a shortest-path
problem, we need to determine the path between 2 nodes, or vertices, in a graph such that
the sum of the weights of its edges is minimum. For example, if a salesman has to travel
from place A to place B through various intermediate points with different costs, we are dealing
with a shortest-path problem. Here, the weight need not only be the physical distance, but time and
monetary cost as well. For the project, this algorithm has been executed both in series (1
processor) and in parallel (multiple processors), and the performances were compared to
calculate the speedup.

Performance was also tested on bigger graphs to see how well larger workloads would
benefit from parallelism. A speedup of 18 was achieved using 16 processors. To ensure
correctness, the execution on 1 processor versus 16 processors was tracked and the
results were verified. As noted, there is scope for improvement by
employing node-based workload distribution, as groups of outgoing edges, and also
dynamic load balancing. As seen in the results, the solution is scalable with both the size of
the input graph and the number of processors.
_______________________, Committee Chair
Dr. Suresh Vadhva
_______________________
Date
ACKNOWLEDGEMENTS
I want to express my gratitude to Dr. Suresh Vadhva for providing me the
opportunity to work on this project. I am indebted to him for his guidance and continuous support
throughout the project. He not only showed great patience in dealing with the issues I
was stuck on for long, but also advised and motivated me to overcome the problems. I would
also like to thank Professor Russ Tatro for giving me his valuable feedback and taking the
time to review my project. I am honored to have had the opportunity to be mentored by
them. Their willingness to allow me to pursue a project of my own interest
was greatly appreciated.

I am also grateful to Dr. Preetham Kumar, graduate coordinator of the Electrical
and Electronic Engineering Department, for his unbounded cooperation and support.

Further, I am thankful to the Department of Electrical and Electronic Engineering
at California State University, Sacramento for providing me the skill set required to
complete the project.
TABLE OF CONTENTS

Acknowledgements
List of Tables
List of Figures

Chapter

1. INTRODUCTION
   1.1 Background
   1.2 Advantages of parallel computing
      1.2.1 Provide concurrency
      1.2.2 Solving large and complicated tasks
      1.2.3 Saving time, money and power
   1.3 Concept of parallel computing
   1.4 Parallel computer memory architectures
      1.4.1 Shared memory
      1.4.2 Distributed memory
      1.4.3 Hybrid distributed - shared memory
   1.5 Parallel programming models

2. POSIX THREAD PROGRAMMING
   2.1 Background
   2.2 Pthread
   2.3 Why pthreads?
   2.4 Motivation
   2.5 The Pthread API
      2.5.1 Thread management
      2.5.2 Mutex
      2.5.3 Condition variables
      2.5.4 Synchronization
   2.6 Pthread system calls
      2.6.1 Creating pthread
      2.6.2 Joining pthreads
      2.6.3 Pthread mutex
      2.6.4 Initializing and destroying pthread

3. INTRODUCTION AND IMPLEMENTATION OF THE ALGORITHM
   3.1 Background
   3.2 Algorithm selection
      3.2.1 Dijkstra's algorithm
      3.2.2 Bellman-Ford algorithm
   3.3 Project implementation
      3.3.1 Serial implementation
      3.3.2 Parallel implementation
   3.4 Input file
   3.5 Tools and platform used
   3.6 Key issues
      3.6.1 Load balancing
      3.6.2 Synchronization
   3.7 Optimization

4. RESULT

5. CONCLUSION

References
LIST OF TABLES

1. Bellman-Ford algorithm iteration 0
2. Bellman-Ford algorithm iteration 1
3. Bellman-Ford algorithm further iterations
4. Speedup of 2 cases vs number of processors
LIST OF FIGURES

1. Serial computation of different tasks on a processor
2. Parallel computation of different tasks on a processor
3. Shared memory architectures [3]
4. Distributed memory architecture
5. Hybrid distributed - shared memory
6. Different ways of task management
7. Mutual exclusion
8. Operation of different threads on shared data
9. Shortest-path problem
10. Graph for Bellman-Ford algorithm
11. Speedup (y-axis) vs Processors (x-axis) for 2 graphs
Chapter 1
INTRODUCTION
1.1 Background
As proved by Moore’s law, the performance of a chip will double every eighteen
months due to two main reasons, that is scaling of frequency and utilization of
instruction-level parallelism. Lately, it is very challenging to continue exploiting them as
both the sources are far from exhausted. Frequency scaling is less used these days due to
extreme heat dissipation in the desktop and server segments. And, instruction-level
parallelism is practically infeasible due to the “increasing wiring delays together with
exponentially growing reorder buffers” [1]. As a result, microprocessors have multi-core
structures which replicates doubling of transistors every eighteen months on die.
In today’s digitalized era, “computing” is a major widely used industrial concept. It’s
a process by which a particular task is managed amongst the CPU architecture and
integrated with the vital components of modern technologies. It manipulates a given data
set according to the set of instructions in accordance with algorithmic processes. The data
can be managed by designing and building hardware and software system for various
purposes. Parallel computing means solving a task of size n by partitioning its domain
into k>2 or k = 2 parts and resolving them with p number of physical processors,
simultaneously. [2] . This parallelism can be broken down into 2, data parallelism and
task parallelism. Data Task parallelism is the concurrent implementation on several cores
Page 14
2
of several different functions on either one or different datasets whereas Data
parallelism is the concurrent implementaion on several cores of the similar function on
different elements of a dataset. Data parallelism is ideal for GPU as its architecture is best
suited when the threads execute same instruction but on different data. Task parallelism
works better for CPU as its architecture grants access to different tasks to be performed
on each thread, thus portioning well for accomplishing a given task. Computational
problems are often classified as data parallelism, hence we replicate the same behavior
for the project.
This project exploits 2 types of computation: serial computation, and parallel
computation with different numbers of processors. The number of processors varies
from 1 to 16. As shown in Figure 1, in serial computing a consecutive and ordered
execution of processes is done. The processes run in successive progression: the task is
broken into a series of instructions and each instruction is executed successively on a
single processor. A single instruction is executed at a time.
Figure 1. Serial computation of different tasks on a processor
In parallel computing, by contrast, many tasks or processes are carried out
concurrently; multiple computing resources are used simultaneously to solve a given
problem. A task is broken down into various discrete subtasks, which are further broken
down into sequences of instructions and executed concurrently on different processors,
as demonstrated in Figure 2. Data sharing and coordination are set up between the
modules to maintain consistency and avoid any data loss. This method solves
computational tasks in less time by using the numerous resources available, rather than
only one compute resource. Here the compute resource is a single computer with
multiple processors or cores, or an arbitrary number of single-core computers connected
in a network.
Figure 2. Parallel computation of different tasks on a processor
Parallel computers are parallel from the hardware perspective: they consist of many
functional units (L1 and L2 caches, branch units, GPU, etc.), have various execution
units or cores, and use threads. Networks connect various standalone computers, known
as "nodes", to make a cluster of parallel computers, each node comprising a memory and
multiple cores of its own.
1.2 Advantages of parallel computing
Parallel programming has some advantages over serial programming, which make it a
competent solution for certain types of computing problems, namely those suited to
multiprocessor and multithreading techniques. Some of the reasons are listed below:
1.2.1 Provide concurrency
Parallel computing consists of multiple resources working simultaneously, hence
achieving a high speedup.
1.2.2 Solving large and complicated tasks
Some problems have a complicated computing process or are huge in terms of
data points. Serial computation is also limited by the memory size. In parallel computing,
by breaking the job into multiple separate instructions and utilizing more memory, the
problem can be solved easily. For example, in a matrix multiplication, the results for
different rows/columns can be computed simultaneously to form the final matrix, rather
than calculating one element at a time.
1.2.3 Saving time, money and power
Due to better resource allocation, the time taken to complete a certain task is
much shorter, hence saving the potential cost as well. Also, as this style of computing
spreads work across various cores and threads, it helps in reducing the wastage of
computing power.
1.3 Concept of parallel computing
Speedup refers to the ratio of the computing time of the sequential algorithm to that of
the parallel algorithm. If the speedup is 5-fold, then in layman's terms it means that if the
serial execution takes 5 minutes for a task or piece of code, the parallel execution will
take a minute for the same task. At best, we can obtain an n-fold speedup with n
processors.

In computer architecture, Amdahl's law provides the theoretical speedup to be
expected for a task with a particular workload when the resources of the system are
improved. It is formulated in the following way:

Speedup = 1 / ((1 - P) + P/S)

where P is the time proportion of the algorithm which can be parallelized, and S is the
speedup factor for that portion of the algorithm due to parallelization.

For example, suppose we are using 5 processors on a portion that takes 90% of the
running time; then P = 0.90 and S = 5, and the overall speedup is

Speedup = 1 / ((1 - 0.90) + 0.90/5) = 1 / 0.28 = 3.57,

roughly a 4-fold speedup.
1.4 Parallel computer memory architectures
The memory architectures that can be employed to obtain optimum performance on
parallel computers, or to evaluate VLSI complexity in terms of chip area, are the
following:
1.4.1 Shared memory
In shared memory, all the processors have access to a common memory as a
global address space. The processors operate independently but share a memory
resource. In this type of memory, a change to a specific memory location by one
processor is transparent, and hence visible, to all the other processors. Based upon the
memory access times, shared memory is split into Uniform Memory Access (UMA) and
Non-Uniform Memory Access (NUMA). As Figure 3 shows, in UMA different
processors have equal access, and equal access time, to a common memory, whereas
NUMA is formed by linking more than one shared memory. In NUMA, the access times
to the memories may vary, and memory access is slower.
Figure 3. Shared memory architectures [3]
1.4.2 Distributed memory

As Figure 4 illustrates, in this architecture each processor has its own local
memory, and the memory addresses of one processor do not map to those of another
processor; hence there is no global address space.
Figure 4. Distributed memory architecture
1.4.3 Hybrid distributed - shared memory
The largest and fastest architectures use both of the memory models discussed
above. The "shared memory machine" component comes from the shared memory
architecture, and the networking of multiple shared memory machines is inherited from
the distributed memory architecture, as shown in Figure 5 below.
Figure 5. Hybrid distributed - shared memory
1.5 Parallel programming models
There are various programming models used for parallel programming, such as
threads, message passing and data parallel. The implementation model used for this
project is the threads model. Threads are an approach to improving performance through
parallelism, and this programming model is a form of shared memory programming. A
threads implementation usually comprises a library of subroutines and a set of compiler
directives called from source code.

There are two standardized implementations of threads: POSIX Threads and
OpenMP. For this project, the parallelism has been achieved using the POSIX threads
implementation discussed in the next chapter.
Chapter 2
POSIX THREAD PROGRAMMING
2.1 Background
In shared memory multiprocessor architectures, threads are used for the implementation
of parallelism. A standardized C language threads programming interface was specified
by the IEEE POSIX 1003.1c standard for UNIX systems. Implementations that follow
this standard are referred to as POSIX threads, or Pthreads. The three types of routines in
the pthread API are thread management, mutex variables and condition variables. [4]
2.2 Pthread
A pthread is an independent flow of execution which allows a program to run multiple
overlapping workflows at a given time. It runs concurrently within what the operating
system schedules as a "process". A process holds information about the resources and the
execution of a program. The threads of a process share its resources, depending on the
data dependencies. When data is shared amongst two or more threads, changes made to
the shared data are transparent, and hence visible, to all of them. Threads not only have
access to global data, but also have their own private data. On UNIX-based platforms,
we require the Pthread libraries for such implementations.
2.3 Why Pthreads?
The operating system overhead for creating and managing a thread is much lower than
for a process. By overhead, we mean the extra time, memory or bandwidth consumed
while performing a computation; threads thus necessitate fewer system resources. When
one thread is waiting for a system call to complete its given task, another thread can
perform its own function without waiting. Hence threads are lightweight processes. The
tasks can be interleaved and interchanged in real time, as demonstrated in Figure 6.
Figure 6. Different ways of task management
The above figure shows various cases in which interleaving, interchanging or
overlapping can be managed by threads. In the first 2 cases, the tasks are achieved in a
sequential manner, by swapping them. In the third case, the two tasks are interleaved. As
for case 4, the tasks overlap at one point, and hence the final outcome is achieved earlier
than in the rest.
Also, threads give a high-performance computing environment for achieving optimum
performance. They are advantageous for efficient communication and data exchange
through shared memory. All the threads share the same address space within a single
process, and hence data can be exchanged by simply passing pointers.
2.4 Motivation
The language used for this project is C rather than a managed high-level language
(HLL). A major reason for writing kernels in C is that it supports low-level techniques
that help in enhancing performance, particularly pointer arithmetic, easy escape from
type enforcement, explicit memory allocation, and custom allocators. [5]

Garbage collection and safety checks in HLLs consume a lot of CPU time and cause
delays. An HLL also hides important mechanisms like memory allocation, and
implements abstractions which reduce development options.
2.5 The Pthread API
The Pthread API was defined in the ANSI/IEEE POSIX 1003.1-1995 standard. Since
then, the POSIX standard has evolved and undergone various revisions, including the
Pthreads specification. The API's subroutines are divided into 4 groups: thread
management, mutexes, condition variables and synchronization.
2.5.1 Thread management
It consists of various operations like thread creation, termination, join and yield. These
routines create an executable thread, return control to the calling routine when the task
has succeeded, join a child thread to a parent thread, and/or schedule the thread execution
at run time.
2.5.2 Mutex
Mutex is an abbreviation for mutual exclusion object. It is a program object
created so that several threads can share the same resource, such as access to the same
file. Hence, it deals with synchronization between various threads. Whenever a mutex is
created for a particular resource, the system creates a unique ID for it. If any thread
requires the resource, it must use the mutex to lock the resource away from other threads
while using it. If the mutex is already locked, another thread demanding the resource is
held in a queue by the system and then granted control when the mutex is unlocked, that
is, after the completion of the preceding thread's work on that specific resource. This
helps in removing race conditions, wherein multiple accesses to the same variable are
made from different threads, most dangerously while writing.
Figure 7. Mutual exclusion
The above figure shows a case of mutual exclusion, where thread B tries to gain access
from time T2 to T3 but, due to the lock already acquired by thread A, is blocked and
queued. As soon as thread A finishes its task, thread B starts executing, and no other
thread is allowed to interfere until it completes its work.
2.5.3 Condition variables
When a thread is waiting on a resource, it may block itself, as its request may not
complete instantly. This waiting for an event to occur is handled by condition variables,
which set up communication between threads that share a mutex. There are two
operations on a condition variable: wait and signal. If a thread makes a wait call, it is
suspended and queued, waiting for the event represented by the condition variable to
occur. When the event occurs, the signal operation is applied to the corresponding
condition variable. Wait blocks the caller thread and enqueues it on the variable's list of
waiting threads; signal dequeues the thread at the head of the waiting list, unblocking it
and placing it on the ready list. The above process is illustrated in Figure 8, wherein the
threads sharing the data are processed concurrently.
Figure 8. Operation of different threads on shared data
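As a hedged illustration of this wait/signal protocol (the names here are generic examples, not taken from the project code):

#include <pthread.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  event = PTHREAD_COND_INITIALIZER;
int event_occurred = 0;

/* Waiting thread: sleeps until another thread signals the event. */
void wait_for_event (void) {
    pthread_mutex_lock (&lock);
    while (!event_occurred)                 /* re-check: wakeups can be spurious */
        pthread_cond_wait (&event, &lock);  /* atomically unlocks, sleeps, relocks */
    pthread_mutex_unlock (&lock);
}

/* Signaling thread: records the event and wakes the head of the waiting list. */
void announce_event (void) {
    pthread_mutex_lock (&lock);
    event_occurred = 1;
    pthread_cond_signal (&event);
    pthread_mutex_unlock (&lock);
}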
2.5.4 Synchronization
It refers to the management of read/write locks while system resources are shared by
different processes. Its main focus is to maintain data consistency across concurrent
accesses to shared data. This is done to prevent race conditions by enforcing mutual
exclusion on critical sections, or at the time of conflict between two threads.
2.6 Pthread system calls
The following system calls have been used in the project for thread
implementation.
2.6.1 Creating pthread
Syntax:
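The standard prototype from <pthread.h> is:

int pthread_create (pthread_t *thread, const pthread_attr_t *attr,
                    void *(*start_routine)(void *), void *arg);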
It generates a new thread, which starts execution by invoking the given routine. All the
threads are created by the programmer from the main() function. pthread_create has 4
arguments, as shown above. "thread" is a unique identifier for each new thread created.
"attr" is used to control the attributes of the pthread; it is initialized by
pthread_attr_init, and its default value is NULL. The "start_routine" is the function the
thread executes as soon as it is created. The "arg" is an argument which is passed by
reference, as a pointer, to start_routine.
2.6.2 Joining pthreads
Syntax:
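The standard prototype is:

int pthread_join (pthread_t thread, void **retval);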
This function is used to join one thread with another, terminated, thread. The caller waits
for the child thread to terminate in order to join it to the parent. If multiple threads try to
join the same thread simultaneously, the behavior is undefined.
2.6.3 Pthread mutex
Syntax:
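The standard prototypes are:

int pthread_mutex_lock (pthread_mutex_t *mutex);
int pthread_mutex_trylock (pthread_mutex_t *mutex);
int pthread_mutex_unlock (pthread_mutex_t *mutex);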
A mutex is a mutual exclusion device used to protect shared data from concurrent
modification. If the mutex is locked, the calling thread is queued until the mutex becomes
available, as explained in section 2.5.2. pthread_mutex_trylock() is similar to
pthread_mutex_lock(), except that when referred to an already locked mutex it does not
block but returns immediately with the error EBUSY (it returns 0 only when it has
acquired the lock). The function of pthread_mutex_unlock is to release the lock on the
mutex object. The lock should be freed once a thread has completed its access, waking
up a waiting thread; otherwise that thread will keep sleeping forever. Context switching
might occur.
2.6.4 Initializing and destroying pthread
Syntax of destroying pthread:
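The standard prototypes are:

int pthread_mutex_destroy (pthread_mutex_t *mutex);
int pthread_cond_destroy (pthread_cond_t *cond);
int pthread_attr_destroy (pthread_attr_t *attr);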
pthread_mutex_destroy invalidates the given mutex object; the destroyed mutex
becomes undefined. pthread_cond_destroy destroys the condition variable specified by
cond. When thread attributes are no longer needed, they are destroyed using
pthread_attr_destroy.
Syntax of constructing pthread:
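The standard prototypes are:

int pthread_mutex_init (pthread_mutex_t *mutex, const pthread_mutexattr_t *attr);
int pthread_cond_init (pthread_cond_t *cond, const pthread_condattr_t *attr);
int pthread_attr_init (pthread_attr_t *attr);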
A destroyed mutex object can be reinitialized by the corresponding call above. These
calls initialize the mutex lock, the condition variable cond, or the attribute object,
respectively.
Chapter 3
INTRODUCTION AND IMPLEMENTATION OF THE ALGORITHM
3.1 Background
“The shortest-path problem is the problem of finding a path between two vertices
in a graph such that the sum of the weights of its constituent edges is minimized.” [6].
Single-source shortest path is a constrained version of the shortest-path problem that
requires solving for the minimum distance from one given source vertex to all other
vertices of the graph. The primary aim of this project is to implement a parallel
multithreaded solution to the single-source shortest-path problem that can scale well with
the number of processors as graphs become larger and denser. To this end, we
experimented with two alternative implementations, differing only in that one uses
fine-grained locking over the graph, while the other uses conservative coarse-grained
locking. Another goal is to thoroughly evaluate the implementation and analyze the
results obtained.
Figure 9. Shortest-path problem
Shortest-path problems are based on the concept of iteratively relaxing edges, explained
as follows with the help of Figure 9. While relaxing edge (u,v), if it is found that the
distance d(v) to a vertex v from the source vertex s violates the inequality

d(v) ≤ d(u) + w(u,v)

then d(v) is set to the newly found shorter distance d(u) + w(u,v). After V-1 iterations of
relaxing all E edges in each iteration, we know the shortest distance to every node in the
graph.
In the following sections, the reasons for choosing the Bellman-Ford algorithm for
our parallel implementation are discussed, followed by details of the parallel
implementation and the optimizations made to the original algorithm.
3.2 Algorithm Selection
For a scalable parallel implementation, selecting an efficient algorithm is of
supreme importance. Several algorithms have been proposed to solve shortest-path
problems. Keeping the literature in mind, the candidates were narrowed down to
Dijkstra's algorithm and the Bellman-Ford algorithm. Here, we briefly discuss why we
ultimately chose a modified Bellman-Ford algorithm and not the other.
3.2.1 Dijktra’s algorithm
Dijkstra’s algorithm does not readily lend itself to parallelization, because it relies
on a priority queue storing the vertices of the graph, keyed on the best-known distance to
each vertex [7]. The parallelization replies on known properties of the graph. With the
research done, I also came across another con of the algorithm, as it cannot handle
negative length cycles. Example, if we have 3 nodes (A, B and C) with an undirected
graph of edges AB = 1, AC = 2and BC = -3. If we have to go from A to C, the optimal
path should be through B ( AB + BC = 1 + -3 = -2) as it is much better than AC (which is
2). But Dijkstra’s will not be able to evaluate this result, as the algorithm cannot access
negative edge weights.
3.2.2 Bellman-Ford algorithm
It evaluates shortest paths in a bottom-up fashion. The Bellman-Ford algorithm is
highly parallel in nature: within each iteration, each edge can be relaxed in parallel
without loss of correctness. Therefore, while a sequential implementation can make use
of an intelligent ordering of the relaxation of edges, a parallel implementation can make
use of a completely random order of relaxation. The unique ability of this algorithm to
handle negative edges, and to detect negative cycles in graphs, further motivated us to
select it.

Also, Dijkstra's is a greedy algorithm, which takes all the data points in a specific
given problem and then provides a rule for adding an element to the solution at each step
of the algorithm. [8] Its time complexity is O((V + E) log V) with a binary heap.
Bellman-Ford is simpler and suits distributed systems well. Its complexity is O(VE),
which is more than Dijkstra's.
The steps to calculate the shortest distances using Bellman-Ford are as follows:

1. Create an array of size V, where V is the number of vertices, holding the distance to
every vertex. Set the distance from the source U to each vertex to infinity, and the
distance of the source to itself to 0.

2. Loop V-1 times over all edges (u,v):
   if dist[v] > dist[u] + weight(u,v), then
   set dist[v] = dist[u] + weight(u,v)

3. A final loop is used to check for negative-weight cycles:
   if dist[v] > dist[u] + weight(u,v) still holds for some edge, then the graph contains a
   negative-weight cycle.
   This step is an extra measure taken to check whether we still get a shorter path despite
   the repeated iterations in step 2. If we get a shorter path, the graph has a negative-weight
   cycle; otherwise it doesn't.

In some cases, calculation of the shortest paths may take fewer than V-1 iterations.
Let’s understand the algorithm using a graph.
Suppose there are 6 vertices S, A, B, C, D, E and we need to find the shortest distance of
every vertex from S, as shown in Figure 10 below. We assign the source vertex the
distance 0 and initialize the rest of the distances as infinite. As there are 6 vertices, we
need to loop 5 times.
S    A    B    C    D    E
0    ∞    ∞    ∞    ∞    ∞

Table 1. Bellman-Ford algorithm iteration 0
Figure 10. Graph for Bellman-Ford algorithm
Processing each edge starting from S, we get the following distances in the first
iteration. The first row indicates the initial distances from S. The second, third and fourth
rows show the distances when the outgoing edges are processed from A, C and E
respectively. We skip B and D, as we do not yet know how to reach them at that instant.
S    A    B    C    D    E
0    ∞    ∞    ∞    ∞    ∞
0    1    ∞    ∞    ∞    1
0    1    ∞    3    ∞    1
0    1    2    3    -1   1

Table 2. Bellman-Ford algorithm iteration 1
The first iteration, shown above in Table 2, gives a first set of shortest-path estimates.
From all the edges in the second iteration, we get the following values, presented in
Table 3. Looping through the iterations, the last row shows the final values.
S    A    B    C    D    E
0    1    2    3    -1   1
0    0    2    2    -1   1
0    0    1    2    -1   1

Table 3. Bellman-Ford algorithm further iterations
Here we get the shortest distances within the early iterations, and hence save time by not
looping through the 4th and 5th iterations.
3.3 Project Implementation
The Bellman-Ford algorithm works on a given graph G, whose vertices are
represented by V and edges by E. As Bellman-Ford is a single-source shortest-path
algorithm, one vertex in the graph acts as the source vertex, S. The algorithm calculates
the shortest distance to every vertex from the source vertex. The edges are relaxed V-1
times.

For this project, we have implemented the algorithm both serially and in parallel.
3.3.1 Serial implementation
The sequential version of the original Bellman-Ford was first implemented and tested
for correctness. A high-level pseudo-code description is below:

1. For each vertex in V, set the distance from the source to infinity. Also, a predecessor
pointer P is kept for each vertex and set to none.

2. The second loop goes through each edge, shortens the distance between vertices, and
compares that distance against the currently known distance. This process is repeated
V-1 times. The relaxation test is

   v.distance > u.distance + weight(u,v)

Every edge needs to obey this triangle inequality. If the condition holds true, then

   v.distance = u.distance + weight(u,v); and v.p = u;

3. After the above algorithm runs successfully, another short loop is implemented to
check for negative cycles. For each (u,v) in E, we again check the relaxation condition;
if it is true, a negative-weight cycle exists.
Code:
/*
 * Final Graduation Project
 * Fall 2018
 * Ankita Saxena
 * Parallel Bellman-Ford algorithm for Single Source Shortest Path
 */
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

/*
 * Each edge can be relaxed in parallel, so it is convenient to represent
 * the graph by an array of edges to iterate over. Static load balancing
 * becomes easier. However, when outgoing edges of only certain nodes
 * require relaxation, we need to iterate over the entire data structure.
 */
typedef struct {
    unsigned int origin;
    unsigned int target;
    int weight;
} edge_t;

/* Read "origin target weight" triples; %d for the weight so that
 * negative edge weights are parsed correctly. */
int LoadGraph (FILE *fp, edge_t *graph, unsigned int num_edges) {
    unsigned int temp_origin;
    unsigned int temp_target;
    int temp_weight;
    unsigned int idx;

    idx = 0;
    while (fscanf(fp, "%u %u %d", &temp_origin, &temp_target,
                  &temp_weight) == 3) {
        graph[idx].origin = temp_origin;
        graph[idx].target = temp_target;
        graph[idx].weight = temp_weight;
        idx++;
    }
    if (num_edges != idx) {
        printf("Error: Edges found in file = %u, Expected = %u\n",
               idx, num_edges);
        return -1;
    }
    return 0;
}

/*
 * Check for correctness of the results.
 * Limitation: For worst case, maximum weight < INT_MAX/num_edges
 */
int Verify (const int *distance, const edge_t *graph,
            const unsigned int num_edges) {
    unsigned int idx;
    int err;
    edge_t edge;

    err = 0;
    for (idx = 0; idx < num_edges; idx++) {
        edge = graph[idx];
        if (distance[edge.origin] != INT_MAX &&
            (distance[edge.target] > distance[edge.origin] + edge.weight)) {
            err = 1;
            printf ("Error: Incorrect Results!\n Edge %0u: Calculated "
                    "distance to vertex %0u (%0d) is greater than calculated "
                    "distance to vertex %0u (%0d) + weight (%0d)\n",
                    idx, edge.target, distance[edge.target],
                    edge.origin, distance[edge.origin], edge.weight);
        }
    }
    return err;
}

void ShortestPath (
    const unsigned int num_verts,
    const unsigned int num_edges,
    const edge_t *graph,
    const unsigned int source,
    const unsigned int destination,
    int *distance) {

    unsigned int e;        /* edge index */
    unsigned int v, vv;    /* vertex indices */
    edge_t edge;
    int *new_dist;         /* candidate distances found this iteration */
    int *lupdate;          /* per-vertex "updated this iteration" marks */
    int found_update;

    printf ("INFO: Calculating shortest path from %u to %u. "
            "Graph has %u vertices and %u edges\n",
            source, destination, num_verts, num_edges);

    new_dist = (int *) malloc (num_verts * sizeof(int));
    lupdate  = (int *) calloc (num_verts, sizeof(int));

    for (v = 0; v < num_verts; v++) {
        distance[v] = (v == source) ? 0 : INT_MAX;
    }

    for (v = 0; v < num_verts; v++) {
        /* Relax every edge once in this iteration */
        for (e = 0; e < num_edges; e++) {
            edge = graph[e];
            if (distance[edge.origin] != INT_MAX &&
                (distance[edge.origin] + edge.weight <
                 distance[edge.target])) {
                new_dist[edge.target] = distance[edge.origin] + edge.weight;
                lupdate[edge.target] = 1;
            }
        }
        /* Commit the updated distances; stop early if nothing changed */
        found_update = 0;
        for (vv = 0; vv < num_verts; vv++) {
            if (lupdate[vv] == 1) {
                found_update = 1;
                distance[vv] = new_dist[vv];
                lupdate[vv] = 0;
            }
        }
        if (found_update == 0) {
            break;
        }
    }
    free (new_dist);
    free (lupdate);
}

int main (int argc, char* argv[]) {
    unsigned int source;
    unsigned int destination;
    unsigned int num_verts;
    unsigned int num_edges;
    char *filename;
    FILE *file;      /* pointer to the graph file to read from */
    edge_t *graph;
    int *distance;
    unsigned int v;

    /* Process command-line arguments */
    if (argc != 4 && argc != 6) {
        printf ("%d\n", argc);
        printf ("Error: \n Usage: <spbf> <graph file> <source> "
                "<destination> ::optional:: <num_verts> <num_edges>\n");
        return -1;
    }
    else {
        filename = argv[1];
        source = atoi(argv[2]);
        destination = atoi(argv[3]);
        if (argc == 4) {
            num_verts = 50000;
            num_edges = 1000000;
        }
        else {
            num_verts = atoi(argv[4]);
            num_edges = atoi(argv[5]);
        }
        graph = (edge_t *) malloc (num_edges * sizeof(edge_t));
        distance = (int *) malloc (num_verts * sizeof(int));
    }

    file = fopen (filename, "r");
    if (file == NULL) {
        printf("Error: Unable to open graph file\n");
        return -1;
    }
    else {
        if (LoadGraph (file, graph, num_edges) != 0) {
            return -1;
        }
    }

    /* Arguments and file are good, now calculate shortest path */
    ShortestPath (num_verts, num_edges, graph, source, destination, distance);
    for (v = 0; v < num_verts; v++) {
        printf("Distance to node %u = %d\n", v, distance[v]);
    }
    Verify (distance, graph, num_edges);
    return 0;
}
3.3.2 Parallel Implementation
The sequential version was then parallelized to run on an arbitrary number of
processors. Our implementation is based on Papaefthymiou and Rodrigue's paper [9].
The pseudocode for the parallel implementation is given below, followed by a discussion
of our decisions on two key issues.

Below is the algorithm implemented for parallelization:
Repeat V times:
1. Compute start and last edge to obtain the set of edges to work on (load
balancing).
2. For each edge in set:
a. Acquire locks on origin and target vertices.
b. Relax edge and set local new distance.
c. If relaxation successful, mark vertex.
d. Release locks on origin and target vertices.
3. For all vertices marked in step 2c:
a. Acquire lock on vertex.
b. If local copy of new distance is lesser than global distance copy,
update the global distance with the local distance.
c. Set own global found-update flag
d. Release lock on vertex.
4. Wait at the barrier.

5. If the NONE_LEFT flag is set, break out of the loop.

The setting of the NONE_LEFT flag is explained in section 3.7 (Optimization).
Code:
The organization of the code, with explanations, is below:
1. The pthread libraries for UNIX were included. A structure was defined to hold an
edge's origin, its target and the distance between them; an array of these records depicts
the graph. An argument for the thread function is also set up, as sketched below.
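A minimal sketch of these declarations, reusing edge_t from the serial version; in this sketch the thread id is passed directly to the thread function rather than through a separate argument structure:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

#define MAX_PROCS 16        /* upper bound on worker threads, per the note below */

typedef struct {            /* same edge record as the serial version */
    unsigned int origin;
    unsigned int target;
    int weight;
} edge_t;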
Note: the maximum number of processors has been defined as 16, as we will be testing
the tasks on different numbers of processors.
2. As we have tested the parallel code on different numbers of processors, this part
defines the number of processors being used (NumProcs) and accordingly allocates and
initializes the pthread objects (the mutexes and condition variable explained in chapter 2).
3. Each thread is given an ID, tid. This section of the code also manages the mutex lock;
it keeps a count of arrivals in SyncCount and checks whether the last thread has been
reached. Barriers are used to coordinate a set of threads in real time by making all the
threads wait until every one of them has called the barrier function. This blocks all
threads in the barrier until the slowest, or last, thread participating in the task reaches the
barrier call.
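A counter-based barrier consistent with that description, as a hedged sketch; here the last arriving thread also evaluates the per-thread update flags and sets none_left, implementing the early-termination check described in section 3.7:

void Barrier (void) {
    static int generation = 0;   /* distinguishes successive barrier rounds */
    int my_generation;
    int updates;
    int t;

    pthread_mutex_lock (&SyncLock);
    my_generation = generation;
    SyncCount++;
    if (SyncCount == NumProcs) {
        /* Last arrival: did any thread update a vertex this iteration? */
        updates = 0;
        for (t = 0; t < NumProcs; t++)
            updates |= global_update[t];
        if (updates == 0)
            none_left = 1;       /* no updates anywhere: all threads may stop */
        SyncCount = 0;
        generation++;            /* release the waiters of this round */
        pthread_cond_broadcast (&SyncCV);
    }
    else {
        while (my_generation == generation)
            pthread_cond_wait (&SyncCV, &SyncLock);
    }
    pthread_mutex_unlock (&SyncLock);
}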
4. In this part, the graph is loaded by reading the input file. If there is any discrepancy in
the number of edges specified, it flashes an error and returns.
5. This loop verifies the accuracy of the results we obtain, thus making the code
self-checking. It also considers the worst case, in which the maximum weight is bounded
as noted in the comment.
/*
 * Check for correctness of the results.
 * Limitation: For worst case, maximum weight < INT_MAX/num_edges
 */
int Verify (const int *distance, const edge_t *graph,
            const unsigned int num_edges) {
    unsigned int idx;
    int err;
    edge_t edge;

    err = 0;
    for (idx = 0; idx < num_edges; idx++) {
        edge = graph[idx];
        if (distance[edge.origin] != INT_MAX &&
            (distance[edge.target] > distance[edge.origin] + edge.weight)) {
            err = 1;
            printf ("Error: Incorrect Results!\n Edge %0u: Calculated "
                    "distance to vertex %0u (%0d) is greater than calculated "
                    "distance to vertex %0u (%0d) + weight (%0d)\n",
                    idx, edge.target, distance[edge.target],
                    edge.origin, distance[edge.origin], edge.weight);
        }
    }
    return err;
}
6. This function implements the parallel Bellman-Ford algorithm according to the
pseudocode explained above. In every iteration, each thread acquires the locks, relaxes
its edges, releases the locks when the work is done, and records whether it made an
update.
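Only the tail of this routine is preserved below. The following hedged reconstruction of the loop body that precedes it follows the pseudocode above, reuses the serial version's new_dist and lupdate arrays, and acquires the two vertex locks in index order (an assumption made here to avoid deadlock):

void *ShortestPath (void *arg) {
    long int threadId = (long int) arg;
    /* Static load balancing: each thread owns a contiguous slice of edges. */
    unsigned int chunk = (num_edges + NumProcs - 1) / NumProcs;
    unsigned int start = (unsigned int) threadId * chunk;
    unsigned int last = (start + chunk < num_edges) ? start + chunk : num_edges;
    unsigned int e, v, vv, lo, hi;
    int found_update;
    edge_t edge;

    for (v = 0; v < num_verts; v++) {
        /* Step 2: relax this thread's edges under per-vertex locks. */
        for (e = start; e < last; e++) {
            edge = graph[e];
            lo = (edge.origin < edge.target) ? edge.origin : edge.target;
            hi = (edge.origin < edge.target) ? edge.target : edge.origin;
            pthread_mutex_lock (&VertexLock[lo]);
            if (hi != lo)
                pthread_mutex_lock (&VertexLock[hi]);
            if (distance[edge.origin] != INT_MAX &&
                distance[edge.origin] + edge.weight < distance[edge.target]) {
                new_dist[edge.target] = distance[edge.origin] + edge.weight;
                lupdate[edge.target] = 1;       /* step 2c: mark the vertex */
            }
            if (hi != lo)
                pthread_mutex_unlock (&VertexLock[hi]);
            pthread_mutex_unlock (&VertexLock[lo]);
        }
        /* Step 3: commit the marked vertices; the preserved fragment below
         * continues from inside this loop. */
        found_update = 0;
        for (vv = 0; vv < num_verts; vv++) {
            if (lupdate[vv] == 1) {
                pthread_mutex_lock (&VertexLock[vv]);
                if (new_dist[vv] < distance[vv]) {
                    distance[vv] = new_dist[vv];
                    found_update = 1;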
                }
                lupdate[vv] = 0;
                pthread_mutex_unlock (&VertexLock[vv]);
            }
        }
        if (found_update == 1) {
            global_update[threadId] = 1;
        }
        else {
            global_update[threadId] = 0;
        }
        /* Wait for all threads to relax all edges for this iteration */
        Barrier ();
        if (none_left == 1) {
            break;
        }
    }
    return NULL;
}
7. This is the main calling function, the entry point of the code. It manages and
communicates with all the calls mentioned above, binding the whole code together and
executing the calls in accordance with the requirements.
int main (int argc, char* argv[]) {
    char *filename;
    FILE *file;
    unsigned int e, v;
    pthread_t *threads;
    pthread_attr_t attr;
    int ret;
    long int threadId;

    /* Process command-line arguments */
    if (argc != 5 && argc != 7) {
        printf ("%d\n", argc);
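The remainder of main() follows the serial version's argument parsing and graph loading; the thread launch and join that follow are sketched here under the same assumptions (the thread id is passed as the argument to each worker):

    threads = (pthread_t *) malloc (NumProcs * sizeof(pthread_t));
    pthread_attr_init (&attr);
    for (threadId = 0; threadId < NumProcs; threadId++) {
        ret = pthread_create (&threads[threadId], &attr,
                              ShortestPath, (void *) threadId);
        if (ret != 0) {
            printf ("Error: pthread_create failed for thread %ld\n", threadId);
            return -1;
        }
    }
    for (threadId = 0; threadId < NumProcs; threadId++) {
        pthread_join (threads[threadId], NULL);
    }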
3.4 Input file
For initial testing of the code, the source vertex, destination vertex and the distance
between them were fed to the code through a file. This manually generated file defined
the graph. In the input file shown below, there are only 7 vertices, for basic testability of
the code. The same format was used to generate and replicate bigger graphs with
thousands or millions of vertices and edges. The format was broken into different
columns: source vertex, destination vertex and edge distance.
The file was opened and accessed by the function LoadGraph, as shown earlier in the
code.
The input file looked like below:
0 2 2
0 3 16
0 1 6
0 6 14
1 3 5
1 4 4
2 5 8
2 4 3
2 1 7
3 6 3
4 6 10
4 3 4
5 6 1
3.5 Tools and platform used
The simulations were run on the Linux (Ubuntu) operating system on an Intel Xeon
processor. The code was compiled using the GNU C compiler (GCC).
3.6 Key issues
During the parallel implementation, I faced 2 major issues, as explained below:
3.6.1 Load balancing
It refers to the efficient distribution of the incoming workload across multiple resources.
It increases reliability, makes optimum use of resources, maximizes throughput and helps
in avoiding overload on a single resource.
Since in each iteration a processor has to relax a fixed number of edges, the natural
choice is static load balancing. This reduces fine-grained inter-thread communication and
is straightforward to implement. I also had to decide whether to distribute edges in a
random order, as they appear in the (randomly generated) graph file, or to distribute
groups of outgoing edges of vertices. While the second option would have allowed some
useful optimizations, it would require some pre-processing unless the input data is
already in that format. Thus the former option was chosen.
3.6.2 Synchronization
I experimented with two different locking schemes. First, I implemented a
conservative locking scheme with a single mutex lock protecting shared access to the
distances of all the vertices in the graph. This produced poor results, which were
improved vastly by using a separate lock for each vertex in the graph. The drawback of
this is that the number of locks grows with the number of graph vertices.
3.7 Optimization
A key optimization over the original Bellman-Ford algorithm, proposed by Yen
[10], has been included in our implementation. If, at the end of an iteration, no processor
has made an update to any node of the graph, then the shortest paths to all nodes are
known and the algorithm can terminate early.
To achieve this, at the end of every iteration each processor updates its own flag
in shared memory if it made an update to any node in that iteration, and then waits at the
barrier. The last processor to reach the barrier checks the global flag of each processor,
including its own. If all flags are clear, it sets the global NONE_LEFT flag (which
initially starts out clear). Before the next iteration begins, each processor checks this flag
and terminates if it is set.
Chapter 4
RESULT
The requirement of the project was to test performance on graphs with 50,000
vertices and 1,000,000 edges. I pseudo-randomly generated 15 such graphs and measured
the average speedup on 1 to 16 processors, as compared to a single processor. The
observed speedups are noted in the table below.
No. of processors    Speedup (V = 50k, E = 1M)    Speedup (V = 5M, E = 100M)
 1                    0.8                          0.8
 2                    1.2                          2.17
 3                    1.8                          3.36
 4                    2.1                          4.3
 5                    2.5                          5.9
 6                    2.6                          6.6
 7                    3.9                          8.13
 8                    3.5                          9.12
 9                    4.3                          10.2
10                    5.4                          12.3
11                    3.9                          13.5
12                    4.5                          15.05
13                    6.4                          16.22
14                    6.7                          16.88
15                    6.9                          17.43
16                    7.9                          18
Table 4. Speedup of 2 cases vs number of processors
Based on the above table, the following graph was obtained.

Figure 11. Speedup (y-axis) vs Processors (x-axis) for 2 graphs

The graph above shows the speedup of the 2 test graphs using various numbers
of processors. The yellow line indicates the test performed with 50,000 vertices and
1 million edges. I achieved a speedup of 7.9 using 16 processors. A possible explanation
for the speedup not scaling as well as expected is multiprocessor cache-coherence
effects, together with the synchronization overhead. In shared memory multiprocessors,
each processor has its own cache memories, so the system is bound to have more than
one copy of a datum at different locations. If one copy is changed, it is necessary to
propagate the change to the other copies as well. This mechanism of maintaining data
consistency throughout the system is known as cache coherence. "In terms of computer
science, an overhead is defined as any combination of excess or indirect computation
time, memory, bandwidth, or other resources that are required to perform a specific
task." [11] This graph alone wasn't enough to show the increased efficiency of
parallelism.
Another graph was then plotted with an increased workload, to measure the
competence of parallelism against the number of processors. I tested performance on
bigger graphs to see how well larger workloads would benefit from the parallelism. On
the bigger graph, with 5 million vertices and 100 million edges, an almost linear speedup
was observed.
Note that a speedup of 18 using 16 processors was achieved! This may seem
incorrect at first. To ensure correctness, I first verified the results and then tracked the
execution on 1 processor versus 16 processors. In contrast to a single processor that
relaxes one edge at a time, the N processors in our implementation relax edges in
parallel. Therefore, the order of relaxation of the edges is highly likely to change, which
can cause the shortest paths to become known early, thereby causing the algorithm to
terminate early. This is known as super-linear speedup, and refers to a program which
executes faster than the theoretical speedup: despite having n cores, the recorded speedup
can be greater than n. As soon as the threads are done, the process is terminated.
The parallelism was implemented because the size of the inputs to be
investigated grew exponentially, and parallelism would benefit the speed of execution. It
scales up the performance of the task by dividing the problem among the processors.
Chapter 5
CONCLUSION
The main purpose of this project was to utilize the CPU resources fully and gain
the optimum achievable performance. The requirement of this project was to produce a
scalable parallel implementation of a single-source shortest-path algorithm. The
algorithm chosen was Bellman-Ford, over Dijkstra's, and it was implemented in the C
language using the pthread API.
The Bellman-Ford algorithm works by overestimating the length of the path from
the source vertex to all the other vertices, assigning them the value infinity. Then it
"repeatedly relaxes those estimates by finding new paths that are shorter than the
previously overestimated paths". Bellman-Ford was chosen over Dijkstra's as it also
considers negative-weight edges, hence increasing the accuracy of the solution to the
problem. Solutions to the shortest-path problem have several applications, such as
driving directions on web mapping websites, networking or telecommunication, and
VLSI design applications.
This report presents the methods used to implement a parallelized solution to the
problem using a modified Bellman-Ford algorithm, and discusses the results obtained.
The computations were done in serial and in parallel to see the performance enhancement
on the CPU. I used 15 randomly generated graphs with 50,000 vertices and 1,000,000
edges for the experiments and calculated the speedup on up to 16 processors, achieving
an average speedup of 7.9 using 16 processors. As the scaling was not very linear, I
increased the workload to 5 million vertices and 100 million edges. The performance on
it was much more linear and improved manifold. In conclusion, a speedup of almost 8
times over the execution time on a single processor is not bad using 16 processors. As
noted, there is scope for improvement by employing node-based workload distribution,
as groups of outgoing edges, and also dynamic load balancing. As seen in the results, the
solution is scalable with both the size of the input graph and the number of processors.
This result was obtained through multithreading, which executes a process or a
given task using threads, each with its own program counter keeping track of the next
instruction to be accomplished while holding the current working variables. It breaks the
work down into various small instruction streams and executes them simultaneously
across the cores. Each thread belongs to one process or task. Threads provide a suitable
foundation for parallel execution on shared memory multiprocessors. Threads help in
minimizing context switching and provide concurrency within a process. All the threads
need to be managed, efficient communication needs to be maintained, and proper steps
for synchronization need to be taken. Threads utilize the multiprocessor architecture to a
great extent. The execution was done with the help of the POSIX pthread library on
Linux, helping to control the flow of the processes by parallelizing them and overlapping
their execution in time.
Today, numerous parallel chip architectures include different numbers of cores,
different abilities in each core, and optimized power or energy consumption per core.
Nowadays, GPUs are the future of parallel computing. GPU computing allows
programmers to exploit parallelism using recently introduced advanced programming
languages like CUDA and OpenCL. This encourages interaction between different
sections of a system, and thus leverages the investment in programming. GPUs have
tremendous computing and memory power, leading to high-performance computing.
This expands the range of computational graphics and high-performance computing
applications. CUDA has helped in creating a lot of demand for NVIDIA GPUs.
REFERENCES
1. V. Agarwal, M. S. Hrishikesh, S. W. Keckler and D. Burger, "Clock Rate versus IPC:
The End of the Road for Conventional Microarchitectures", in Proceedings of the 27th
Annual International Symposium on Computer Architecture (ISCA-00), pages 248-259,
2000.

2. Cristóbal A. Navarro, Nancy Hitschfeld-Kahler and Luis Mateu, "A Survey on
Parallel Computing and its Applications in Data-Parallel Problems Using GPU
Architectures", Communications in Computational Physics, Vol. 15, No. 2, pp. 285-329.

3. Blaise Barney, "Introduction to Parallel Computing", Lawrence Livermore National
Laboratory, U.S. Department of Energy, Contract DE-AC52-07NA27344, 07-Mar-2017.
Available: https://computing.llnl.gov/tutorials/parallel_comp/#Whatis. [Accessed:
10-Oct-2018]

4. Blaise Barney, "POSIX Threads Programming", Lawrence Livermore National
Laboratory, U.S. Department of Energy, Contract DE-AC52-07NA27344, 07-Mar-2017.
Available: https://computing.llnl.gov/tutorials/pthreads/. [Accessed: 17-Oct-2018]

5. Cody Cutler, M. Frans Kaashoek and Robert T. Morris, MIT CSAIL, "The benefits
and costs of writing a POSIX kernel in a high-level language", ISBN 978-1-931971-47-8.

6. "Shortest Path Problem", Wikipedia, 13-Sept-2018. [Online]. Available:
https://en.wikipedia.org/wiki/Shortest_path_problem. [Accessed: 16-Oct-2018]

7. Kevin Kelley and Tao B., "Parallel Single Source Shortest Paths", MIT Computer
Science and Artificial Intelligence Laboratory.

8. Karleigh Moore, Jimin Khim and Eli Ross, "Greedy Algorithms", Brilliant,
19-Oct-2018. Available: https://brilliant.org/wiki/greedy-algorithm/. [Accessed:
10-Oct-2019]

9. Marios Papaefthymiou and Joseph Rodrigue, "Implementing Parallel Shortest-Paths
Algorithms".

10. Jin Y. Yen, "An algorithm for finding shortest routes from all source nodes to a given
destination in general networks", Quarterly of Applied Mathematics, 27: 526-530, 1970.
MR 0253822.

11. "Overhead (Computing)", Wikipedia, 13-Sept-2018. [Online]. Available:
https://en.wikipedia.org/wiki/Overhead_(computing). [Accessed: 16-Oct-2018]