COLORING FOR DISTRIBUTED-MEMORY-PARALLEL
GAUSS-SEIDEL ALGORITHM
a thesis submitted to
the graduate school of engineering and science
of bilkent university
in partial fulfillment of the requirements for
the degree of
master of science
in
computer engineering
By
Onur Koçak
September 2019
COLORING FOR DISTRIBUTED-MEMORY-PARALLEL GAUSS-SEIDEL ALGORITHM
By Onur Koçak
September 2019
We certify that we have read this thesis and that in our opinion it is fully adequate,
in scope and in quality, as a thesis for the degree of Master of Science.
Cevdet Aykanat (Advisor)
Özgür Ulusoy
Engin Demir
Approved for the Graduate School of Engineering and Science:
proaches including graph partitioning and balanced graph coloring in order to decrease the number of colors while maintaining a computational load balance among the color classes. We present a method that (i) first utilizes graph partitioning in order to decrease the number of colors required in the coloring stage, and (ii) then applies a balanced coloring algorithm in order to extract balanced parallelism by decoupling the data dependencies while maintaining the strict precedence rules of sequential Gauss-Seidel iterations. We have tested the validity of our model on various test matrices arising from different scientific applications. Experimental results show that the proposed model effectively reduces the number of colors required in the coloring stage of the parallel Gauss-Seidel algorithm, which in turn reduces the number of parallel sweeps and the synchronization overheads. In the experiments, it is also observed that the balanced coloring method used efficiently maintains the computational load balance among the color classes.
This thesis is organized as follows: Chapter 2 gives the necessary background and preliminary information. Chapter 3 reviews the existing work in the literature on the parallelization of Gauss-Seidel. Chapter 4 describes our methodology and its stages. The experimental results are presented in Chapter 5. We present the conclusion of this thesis in Chapter 6.
Chapter 2
Background
In this chapter, the preliminary information and definitions related to our methodology for the Gauss-Seidel algorithm are presented. The following sections cover the basic principles of the Gauss-Seidel method, sparse matrices and their storage models, the graph model, and the graph partitioning and coloring problems.
2.1 The Gauss-Seidel Method
The Gauss–Seidel method is an iterative method for solving a system of n linear equations with unknown vector x.
Consider the following linear system:
$$Ax = b \tag{2.1}$$
$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}, \quad x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} \tag{2.2}$$

where A is an (n × n) matrix with nonzero diagonal entries, and b and x are vectors of size n.
The Gauss-Seidel method is defined by the iteration

$$L_* x^{(k+1)} = b - U x^{(k)} \tag{2.3}$$

where $x^{(k)}$ is the k-th approximation (iterate) of x, and the coefficient matrix A is decomposed into a lower triangular component $L_*$ and a strictly upper triangular component $U$: $A = L_* + U$.
In detail, we can decompose A with its lower and upper triangular components
as follows:
$$A = L_* + U, \quad L_* = \begin{bmatrix} a_{11} & 0 & \cdots & 0 \\ a_{21} & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}, \quad U = \begin{bmatrix} 0 & a_{12} & \cdots & a_{1n} \\ 0 & 0 & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix} \tag{2.4}$$
The system of linear equations can be written as:
$$L_* x = b - U x \tag{2.5}$$
The Gauss–Seidel method then solves the left-hand side of this expression for x, using the previous value of x on the right-hand side. This can be shown as:

$$x^{(k+1)} = L_*^{-1}\left(b - U x^{(k)}\right) \tag{2.6}$$

Using the triangular structure of $L_*$, we can sequentially compute the entries of $x^{(k+1)}$ with forward substitution. Hence, the Gauss-Seidel method can be written as:
$$x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(k+1)} - \sum_{j=i+1}^{n} a_{ij} x_j^{(k)} \right), \quad i = 1, 2, \ldots, n \tag{2.7}$$

where:
$x_i^{(k)}$ is the ith unknown in x during the kth iteration, i = 1, ..., n,
$x_i^{(0)}$ is the initial guess for the ith unknown in x,
$a_{ij}$ is the coefficient of A in the ith row and jth column,
$b_i$ is the ith value in b.
In matrix form, the Gauss-Seidel method can be defined as:

$$x^{(k+1)} = (D + L)^{-1}\left[\, b - U x^{(k)} \right] \tag{2.8}$$

where:
$x^{(k)}$ is the kth approximation to x,
$x^{(0)}$ is the initial guess at x,
$D$ is the diagonal of A,
$L$ is the strictly lower triangular portion of A (so that $L_* = D + L$),
$U$ is the strictly upper triangular portion of A,
$b$ is the right-hand-side vector.
Convergence is evaluated using the absolute approximate error, calculated after each iteration to check whether the error is within an acceptable tolerance:

$$|E_a| = \left\lVert x^{(k+1)} - x^{(k)} \right\rVert \tag{2.9}$$
Characteristics of the Gauss-Seidel method are as follows:

• Convergence of the Gauss-Seidel solver depends on the characteristics of the coefficient matrix. Convergence is only guaranteed if the coefficient matrix A is either symmetric positive definite or strictly/irreducibly diagonally dominant [9].

• The Gauss-Seidel method has better convergence rates than the Jacobi method on model problems, as it uses the updated values during each iteration [5].

• The Gauss-Seidel method is widely used in multigrid applications. It works well on unstructured problems since there is no need to choose a damping parameter [5].
The Gauss-Seidel algorithm is inherently sequential, since computing $x_i^{(k+1)}$ depends on the values $x_j^{(k+1)}$, j < i, already computed in the same iteration. This is in contrast to the Jacobi method, where computing $x_i^{(k+1)}$ requires only the values $x_j^{(k)}$ from the previous iteration. Therefore, to efficiently parallelize the Gauss-Seidel method, it is necessary to identify tasks that can be computed simultaneously by resolving the data dependencies between tasks.
Algorithm 1 presents a sequential algorithm for solving a linear system Ax = b using the Gauss-Seidel method. The algorithm takes A and b as input and solves for the solution vector x. We assume an initial guess for x (i.e., x(0) = 0). The algorithm algebraically solves each linear equation for xi and repeats until convergence. The iteration terminates when either (i) the user-specified maximum number of iterations kmax has been reached, or (ii) the norm of the difference between successive iterates is less than a user-specified tolerance Eu.
Algorithm 1: Sequential Gauss-Seidel Algorithm

Input: A, b, kmax, Eu
Output: x

k ← 0
Choose an initial guess x(0) to solution x
/* Repeat until reaching a desired error tolerance or maximum iterations limit */
while k < kmax and Ea > Eu do
    for i = 1 to n do
        sum ← 0
        for j = 1 to i − 1 do
            sum ← sum + aij xj
        end
        for j = i + 1 to n do
            sum ← sum + aij xj
        end
        xi ← (bi − sum)/aii
    end
    /* Calculate new iteration error */
    if k > 0 then
        Ea ← ‖x − xold‖
    end
    /* Update the old solution */
    xold ← x
    /* Update the iteration counter */
    k ← k + 1
end
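For concreteness, the following is a minimal Python sketch of Algorithm 1 for a dense coefficient matrix (illustrative only; the function name and defaults are ours, not part of the thesis implementation):

import numpy as np

def gauss_seidel(A, b, k_max=1000, eps=1e-8):
    """Solve Ax = b with the Gauss-Seidel iteration of Eq. (2.7), dense storage."""
    n = len(b)
    x = np.zeros(n)                           # initial guess x(0) = 0
    for k in range(k_max):
        x_old = x.copy()
        for i in range(n):
            # x[:i] already holds the updated values x(k+1); x[i+1:] still holds x(k)
            s = A[i, :i] @ x[:i] + A[i, i+1:] @ x[i+1:]
            x[i] = (b[i] - s) / A[i, i]
        if np.linalg.norm(x - x_old) < eps:   # |Ea| < Eu, cf. Eq. (2.9)
            break
    return x

Since x is updated in place, the first dot product automatically uses the values computed earlier in the same sweep, which is exactly the Gauss-Seidel dependency pattern.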
2.2 Sparse Matrices and Compressed Sparse Row (CSR) Format

A matrix is called sparse if most of its elements are zero. Sparse matrices are distinct from matrices with mostly non-zero values, which are referred to as dense matrices. The sparsity of a matrix can be defined as the number of non-zero elements divided by the total number of elements (m × n). Many scientific problems require working with sparse matrices.

When working with sparse matrices, it is efficient and often necessary to use special data structures, since treating them as dense and storing the zero values as well may consume large amounts of memory and storage. Compressed Sparse Row (CSR) is an efficient data structure to store and process large sparse matrices; it allows fast row access in matrix-vector multiplications.

The CSR data structure represents a matrix by three arrays: row ptr, col ind and values. nnz denotes the total number of non-zero entries in the matrix.

• values is an array of size nnz; it stores the non-zeros of the matrix in contiguous memory locations, traversed in row-wise fashion.

• col ind is an array of size nnz; it stores the column indices of the respective nonzero values in the values array.

• row ptr is an array of size n + 1; it stores the row start pointers, i.e., the position in the values array where each row begins.
M =
⎡ 20   0   0  25   0 ⎤
⎢ 30  35   0  40   0 ⎥
⎢ 45   0  50  55  60 ⎥
⎢  0   0  65  70   0 ⎥
⎣  0   0   0   0  75 ⎦

row ptr = [ 1, 3, 6, 10, 12, 13 ]
col ind = [ 1, 4, 1, 2, 4, 1, 3, 4, 5, 3, 4, 5 ]
values  = [ 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75 ]

Figure 2.1: A simple matrix M and its CSR representation (1-based indexing)
In Figure 2.1, a simple matrix M and its CSR representation with the corresponding arrays are presented. One can observe that the lengths of the col ind and values arrays are equal to the number of nonzero items. The entries of the row ptr array point to the row start positions in the col ind and values arrays, and its length is one more than the number of rows.

Algorithm 2 presents the same Gauss-Seidel algorithm as Algorithm 1; the only difference is that the coefficient matrix A is supplied in the CSR format. Instead of a two-dimensional array storing the coefficient matrix A, the corresponding row ptr, col ind and values arrays are provided as input.
Algorithm 2: Sequential Gauss-Seidel Algorithm with CSR data format

Input: row ptr, col ind, values, b, kmax, Eu
Output: x

k ← 0
Choose an initial guess x(0) to solution x
/* Repeat until reaching a desired error tolerance or maximum iterations limit */
while k < kmax and Ea > Eu do
    for i = 1 to n do
        sum ← 0
        for j = row ptr[i] to row ptr[i + 1] − 1 do
            if col ind[j] ≠ i then
                sum ← sum + values[j] · x[col ind[j]]
            else
                d ← values[j]
            end
        end
        xi ← (bi − sum)/d
    end
    /* Calculate new iteration error */
    if k > 0 then
        Ea ← ‖x − xold‖
    end
    /* Update the old solution */
    xold ← x
    /* Update the iteration counter */
    k ← k + 1
end
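The same sweep can be sketched in Python over 0-based CSR arrays (again illustrative; the diagonal is assumed to be stored and nonzero, as the method requires):

def gs_sweep_csr(row_ptr, col_ind, values, b, x):
    """One forward Gauss-Seidel sweep over a CSR matrix (0-based indices)."""
    n = len(b)
    for i in range(n):
        s = 0.0
        diag = None
        for j in range(row_ptr[i], row_ptr[i + 1]):
            if col_ind[j] == i:
                diag = values[j]                # the (assumed nonzero) diagonal entry
            else:
                s += values[j] * x[col_ind[j]]  # columns j < i are already updated
        x[i] = (b[i] - s) / diag
    return x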
2.3 Graphs and Matrices
An undirected graph is defined as G = (V, E), where V denotes the set of vertices and E denotes the set of edges. Each edge eij ∈ E connects a pair of distinct vertices vi and vj. The degree di of a vertex vi is equal to the number of edges incident to vi. Each vertex vi ∈ V can be assigned a weight w(vi), and each edge eij ∈ E can be assigned a cost c(eij). The set of vertices adjacent to vertex vi is called the adjacency list of vi and is denoted as Adj(vi) ⊆ V.
An n × n square symmetric matrix A = (aij) can be represented as an undirected graph G(A) = (V, E) with n vertices, where the sparsity pattern of A corresponds to the adjacency matrix representation of graph G. The vertex set V represents the rows/columns and the edge set E represents the off-diagonal non-zeros of matrix A: V contains a vertex vi for each row/column i, and E contains an edge eij, connecting the vertices vi and vj, for each non-zero pair aij and aji in A.
Figure 2.2: A simple symmetric matrix and its undirected graph representation
Graph representation of matrices is often used in scientific computing applications to investigate the data dependencies of tasks and communication patterns.
2.4 Graph Partitioning
Graph partitioning is a widely used method in scientific applications, including matrix multiplications, to decompose computational tasks for parallelization [10, 11, 12].
Π = {P1,P2, . . . ,PK} is said to be a K-way partition of G if the following
conditions hold: (i) each part Pk is a non-empty subset of V , (ii) parts are
pairwise disjoint, i.e., Pk ∩ Pl = ∅ for all 1 ≤ k < l ≤ K, and (iii) union of
all K parts is equal to V . A partition Π of V is balanced if the following balance
criterion is satisfied by each part:
$$W_k \leq W_{avg}(1 + \varepsilon), \quad \text{for } k = 1, 2, \ldots, K \tag{2.10}$$
where $W_k$ is the sum of the weights of all vertices in part $\mathcal{P}_k$, i.e., $W_k = \sum_{v_i \in \mathcal{P}_k} w(v_i)$; $W_{avg}$ is the average part weight, i.e., $W_{avg} = \sum_{v_i \in \mathcal{V}} w(v_i)/K$; and ε is the maximum imbalance ratio allowed. In a partition Π of G, an edge eij is called a cut (external) edge if it connects two different parts, and an uncut (internal) edge if both vertices vi and vj are in the same part. EE denotes the set of external edges. The cost of a partition Π is determined in terms of the cut edges as follows:

$$cost(\Pi) = \sum_{e_{ij} \in \mathcal{E}_E} c(e_{ij}) \tag{2.11}$$
The graph partitioning problem is to partition the graph into two or more parts while minimizing the cutsize in (2.11) and satisfying the maximum load balance constraint in (2.10). The graph partitioning problem is known to be NP-hard. There exist well-known, successful graph partitioning tools in the literature, such as Chaco [13] and METIS [14, 15].
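As a concrete illustration of these two metrics, the following Python sketch (names are ours) evaluates the cutsize (2.11), with unit edge costs, and the imbalance of a given partition vector:

def cutsize_and_imbalance(edges, weights, part, K):
    """edges: iterable of (i, j) vertex pairs; weights[i]: w(vi); part[i]: part of vi."""
    cut = sum(1 for (i, j) in edges if part[i] != part[j])  # cut edges, c(eij) = 1
    W = [0.0] * K
    for v, w in enumerate(weights):
        W[part[v]] += w
    W_avg = sum(W) / K
    imbalance = max(W) / W_avg - 1.0    # must stay below epsilon to satisfy (2.10)
    return cut, imbalance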
Figure 2.3: An undirected graph and its 4-way partitioning.
Figure 2.3 presents an undirected graph and its 4-way partitioning. The total
cutsize is 9. All part weights are perfectly balanced and equal to 6.
2.5 Graph Coloring
A distance-1 coloring of a graph G is defined as a mapping clr : V → {1, 2, . . . , s} such that clr(vi) ≠ clr(vj) for all edges eij ∈ E, where clr(vi) is the color of vertex vi. This corresponds to an assignment of colors to vertices such that any two adjacent vertices have different colors. Coloring a graph with the minimum number of colors is known to be NP-hard [16].
Standard graph coloring aims to minimize the number of colors used without considering the sizes of the color classes relative to each other. A coloring of a graph is called equitable if the sizes of any two color classes differ by at most one. Another variant is balanced coloring: a coloring of a graph is called balanced if the ratio of the sizes of any two color classes is within a given threshold. Given a graph G(V, E), balanced coloring is defined as computing a distance-1 coloring such that each color class receives approximately |V|/C vertices, where C is the number of colors used [17].
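A minimal Python sketch of a greedy distance-1 coloring, with a least-used-color tie-break as a simple balancing heuristic (a simplification in the spirit of [17], not their algorithm), might be:

def greedy_balanced_coloring(adj):
    """adj: dict mapping each vertex to its neighbors. Returns dict vertex -> color."""
    color = {}
    class_size = []                                 # class_size[c] = |color class c|
    for v in adj:
        forbidden = {color[u] for u in adj[v] if u in color}
        allowed = [c for c in range(len(class_size)) if c not in forbidden]
        if allowed:
            c = min(allowed, key=lambda k: class_size[k])  # least-used permissible color
        else:
            c = len(class_size)                     # open a new color class
            class_size.append(0)
        color[v] = c
        class_size[c] += 1
    return color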
Figure 2.4: An undirected graph and the same graph with distance-1 coloring
Figure 2.4 shows a simple undirected graph and its coloring. The graph has 9 vertices and 14 edges, and the maximum degree of the graph is 5. Three colors are used, and the color classes have different sizes of 3, 2 and 4.
Graph coloring is frequently used in the parallel computing domain to resolve the dependencies in scientific problems by identifying independent tasks that can be computed in parallel. It is used for extracting parallelism in iterative sparse solvers including Gauss-Seidel [18, 7, 19], and also in many other scientific applications such as the calculation of sparse Jacobian matrices in partial differential equation solvers [20] and timetabling and scheduling problems [21, 22].
Chapter 3
Related Work
In this chapter, we investigate and discuss the existing work on parallel Gauss-
Seidel algorithms. There are several studies on parallelization of Gauss-Seidel in
the literature, which use graph decomposition, matrix ordering and graph coloring
models.
Koester et al. [6] presented a parallel Gauss-Seidel algorithm developed for irregular sparse matrices from power system applications. They suggest a two-step matrix ordering technique that first partitions the matrix into a block-diagonal bordered form using diakoptic techniques and then multi-colors the data in the last diagonal block using a graph coloring method. They order the matrix into block-diagonal-bordered form using a node-tearing technique, and then assign the block-diagonal matrix partitions Ai,i, Ai,m, Am,i to the same processor. Their work utilizes active remote procedure calls in order to minimize the communication overhead and obtain good speed-up. The drawbacks of their algorithm are that the number of color classes increases dramatically in unstructured problems, load imbalance is observed in many experiments, and the calculation of values in the last block shows poor parallel performance. Since they apply different orderings to the matrix structure in order to extract parallelism, the convergence rates are affected by the ordering.
Motivated by this observation, Adams [5] proposed a new distributed-memory Gauss-Seidel algorithm as a multigrid smoother for unstructured problems. Their approach takes advantage of processor partitions and uses domain decomposition with the distribution of the stiffness matrix. Nodes are partitioned to minimize communication, and then each processor partitions its nodes into groups of interior (Int1, Int2) and boundary nodes (Top, Bot, Mid). Interior nodes, which do not have any edges to nodes on different processors, are computed first to hide the communication latency of the boundary nodes. The boundary nodes are partitioned into nodes that only communicate with processors of higher and lower colors, referred to as Bot and Top nodes, respectively. By processing the processor partitions with special orderings, their algorithm can be effective by utilizing optimal partitioning of finite element problems. However, a weakness of their approach is that the algorithm requires immediate communication for some computations that occur in Mid nodes, and it also requires different orderings depending on the number of processors. The algorithm works well under certain assumptions, such as the load balancing of computations being perfectly satisfied and there being enough interior nodes to hide the communication requirement of the boundary nodes.
Huang and Ongsakul [7] propose a task allocation algorithm and apply it in a two-stage parallel Gauss-Seidel implementation on distributed memory systems. Their approach first uses a recursive clustering algorithm to reduce communication costs and then devises a coloring algorithm to coordinate information exchange among processors. The graph is partitioned into clusters depending on the number of processors by recursively using a balanced 2-way weakcut algorithm, which minimizes the number of boundary vertices instead of minimizing the number of inter-connection edges between two clusters. Each cluster is assigned to a single processor. Then, a coloring algorithm is proposed to color only the boundary vertices. They implemented their algorithm on an nCUBE2 machine, applying recursive partitioning k times depending on the number of processors. Their load balancing paradigm is based on distributing non-colored vertices among cluster groups. The weakness of their algorithm is that the partitioning cutsize may not be minimized while maintaining the balance of cluster sizes if vertices are not ideally connected, and load imbalance may be observed between the color classes.
Courtecuisse and Allard [23] propose several Gauss-Seidel parallelization schemes for many-core processors that can be used in fully coupled dense problems. Their approach aims to synchronize computations without CPU intervention by exploiting a small number of GPU processors. They suggest a block-column based parallelization, which first computes the blocks on the diagonal of the matrix and then processes the other columns. Their work also presents a different approach, which eliminates the need for global synchronization: a shared memory counter stores how many of the diagonal blocks have been processed, from which it can be deduced whether xi has been updated for newer iterations. The drawback of their algorithm is that it does not guarantee the order of invocation of thread blocks, and the scheduling is done manually.
Shang [8] proposes another distributed-memory parallel Gauss-Seidel algorithm. Their approach first divides the coefficient matrix and the right-hand side of the linear system into row blocks in the natural row-wise order, then distributes the row blocks among the local memories of all processors through special mapping techniques. The communication volume is decreased by conveying the solution iteration vector among processors in cycles at each iteration. The parallel efficiency of their algorithm depends on a row-block partitioning parameter g and an optimal number of processors; they suggest that the parameter g depends on the relative ratio of communication to computation speed. To increase parallel efficiency, their approach overlaps communication and computation. Their approach is a true parallel Gauss-Seidel algorithm that maintains the convergence rate of the serial Gauss-Seidel algorithm. Although communication time is decreased by the cyclically conveyed solution iteration vector, the load balancing issue among processors is not handled.
Since Gauss-Seidel is also used as a preconditioner and multigrid smoother in linear problems, there exists work in the literature that implements parallel Gauss-Seidel algorithms for these purposes. Heuveline et al. [24] use a multi-coloring technique for a block Gauss-Seidel preconditioner on GPUs. Their approach aims to resolve neighbor dependencies by introducing new neighborship classes. They evaluated scalability and performance results on hybrid multi-core CPU and GPU platforms.
Park et al. [19] implemented a shared-memory parallel Gauss-Seidel algorithm as a preconditioner in the High Performance Conjugate Gradient (HPCG) benchmark, as the Gauss-Seidel kernel is the most time-consuming operation in HPCG. Their work uses three parallelization approaches: task scheduling with point-to-point synchronization, block multi-color reordering, and running multiple MPI ranks within a node. Their multi-coloring approach assigns vertices the same color even when there are transitive dependencies among some of them. Although their study reports good results in terms of parallel efficiency, it is outside the scope of our work as it targets shared-memory architectures.
Chapter 4
Method
In this chapter, we present our novel two-stage scheme for efficient and load-balanced parallelization of the Gauss-Seidel method. In the first stage, we apply graph partitioning and build a coarse graph based on the partitioning of the vertices; we apply this coarsening in order to obtain a better coloring and decrease the number of synchronization points. In the second stage, we use a balanced graph coloring approach in order to decouple the dependencies among Gauss-Seidel iterations, which is required to extract parallelism, while keeping the sizes of the color classes balanced.
4.1 Coarsening Stage
In this stage, we propose a coarsening technique based on graph partitioning of the graph representation of the coefficient matrix A. This stage takes the original input graph G corresponding to the coefficient matrix A and outputs a coarse graph GC in which each coarse vertex contains approximately B vertices of the original graph, where B is the block size parameter. The coefficient matrix A is represented as an undirected graph G = (V, E), as described in Chapter 2. Given a block size parameter B, we K-way partition the graph G with N vertices into K parts, each part having approximately B vertices, such that K ≫ P, where P denotes the number of processors. Note that K = N/B. The block size parameter may depend on the size and sparsity structure of the coefficient matrix A, and it may affect the overall efficiency of the system. In our experiments, we have chosen block-size parameters ranging from 50 to 500 depending on the problem size. The effect of the block size parameter is evaluated in Chapter 5 on scientific problems of various sizes.
Given the number of parts K = N/B as a parameter, we partition the graph into K parts, each with approximately B vertices, obtaining a partitioning Π = {V1, V2, . . . , VK}, and we calculate the edge-cut of the partitioning solution. Then, we form a coarse graph GC = (VC, EC), where VC denotes the vertex set of GC, whose members are called super vertices, and EC denotes the edge set of GC, whose members are called super edges. The coarse graph GC is formed using the partitioning information Π; that is, each part Vk becomes a super vertex of VC. In detail, let p be a vector of size |V| such that p[i] stores the number of the part that vertex vi belongs to. By partitioning the graph, we obtain a list of vertex-part pairs 〈vi, p[i]〉, and we assign each group of vertices of G to a super vertex of GC based on the part numbers p[i]. If there exists at least one edge between two vertices belonging to different super vertices, we form a super edge between the respective super vertices. Multiple edges between the vertices of corresponding super vertices are coalesced into a single super edge.
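The coarse-graph construction just described amounts to a few lines of Python (a sketch with illustrative names; p is the vertex-to-part vector produced by the partitioner):

def build_coarse_graph(edges, p):
    """Collapse a partition into super edges; parallel edges are coalesced."""
    super_edges = set()
    for i, j in edges:
        if p[i] != p[j]:                  # edge crosses two super vertices
            super_edges.add((min(p[i], p[j]), max(p[i], p[j])))
    return super_edges                    # super vertices are the part numbers 0..K-1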
The aim of the graph partitioning problem is to minimize the cutsize among all parts. By minimizing the cutsize, we obtain highly intra-connected clusters of vertices and loosely inter-connected super vertices. Each super vertex denotes an atomic task, and the rows corresponding to the vertices in that super vertex will be processed by a single processor in the Gauss-Seidel algorithm. Note, however, that multiple super vertices are assigned to a single processor based on the coloring results. This enables us to reduce communication costs while processing vertices on multiple processors in parallel: no communication is required when processing a single super vertex and the fine-grain vertices that make it up. By coarsening the graph based on the partitioning information, the resulting coarse graph GC is less coupled than the initial graph. Therefore, the chance of obtaining a smaller number of colors in the coloring stage (described in Section 4.2) is increased. Additionally, coarsening reduces the coloring time; coloring is a necessary pre-processing step in a parallel Gauss-Seidel implementation.
In our experimental results presented in Chapter 5, we compare the characteristics of the initial graph G with those of the resulting coarse graph GC and observe that GC is less coupled in terms of vertex degrees and requires fewer colors than G.
Figure 4.1: Step by step coarsening stage
Figure 4.1 shows an undirected graph with 27 vertices and its step-by-step coarsening process. The initial graph is partitioned into 5 parts with part sizes of 5 or 6. Each part is considered a super vertex, and super edges are then formed among the super vertices. The resulting coarse graph has 5 super vertices, whose weights are provided as labels. One can observe that the vertices within a single part are densely connected.
4.2 Coloring Stage
In order to extract parallelism in the Gauss-Seidel iterations, we use a graph coloring technique. We apply distance-1 coloring on the coarsened graph GC that is obtained at the end of the coarsening stage. We first assign colors to the super vertices of GC, process the super vertices of GC belonging to the same color in parallel, and then proceed with the next color. We describe the details of the parallel execution steps later in this section.
Traditional graph coloring methods aim to minimize the number of colors. However, in the context of many parallel processing applications, it is also important to obtain a balanced distribution of the sizes of the color classes: if the color classes are imbalanced in size, the utilization of hardware resources becomes inefficient, especially for the smaller color classes. Therefore, our goal is to achieve a balanced coloring of the vertices of GC such that the number of vertices in each color class is approximately the same. Lu et al. [17] presented a graph coloring toolkit named Grappolo, which provides an algorithm for balanced coloring using the Vertex First Fit (VFF) method. Their work presents a set of coloring strategies based on greedy, shuffling and recoloring approaches; basically, a vertex is assigned the least used color among the set of permissible colors, bounded by a balancing constraint. Their work also includes guided balancing strategies in which scheduled and unscheduled shuffling and recoloring methods are introduced.
We have also considered using a traditional coloring method, due to the possibility that a balanced coloring method may increase the required number of colors. For this purpose, we evaluated a set of coloring algorithms proposed by Gebremedhin et al. [25]. Their work presents a graph coloring toolkit named ColPack for scientific computing. ColPack offers various sequential greedy coloring algorithms with degree-based orderings in the context of distance-1 coloring. Each algorithm progressively extends a partial coloring by processing a single vertex at a time in some order, permanently assigning the vertex the smallest allowable color in each step. By comparing the two coloring approaches in our experiments, we observed that a balanced coloring is generally possible with the same number of colors as traditional coloring. The related experimental results are discussed in more detail in Chapter 5.
In this stage, Grappolo is used to obtain a balanced coloring of the vertices, which is required for the kernel of the parallel Gauss-Seidel implementation. We color the vertices of the coarse graph GC, and upon completion, each super vertex vci ∈ VC is assigned a color number ci, where ci ∈ C. C denotes the set of all colors, and nc denotes the total number of colors required. Note that the vertices of G are not subjected to coloring at this stage; instead, they are assigned colors based on the color of the super vertex they are part of. The proposed coarsening and coloring methods together decrease the number of colors, which reduces the number of synchronization points contributing to the total communication volume. Fewer colors require fewer synchronization points between sub-iterations; hence, the communication overhead decreases. The coloring time also decreases with the proposed coarsening method.
We assign the rows of the iteration matrix A that correspond to the constituent vertices of each super vertex to a different processor. Since the number of super vertices of GC is much greater than the number of processors (K ≫ P), a single processor processes multiple super vertices of GC. All processors compute the corresponding xi entries of the same color. Once processing of a color class is complete, all processors exchange xi values; this is called a local synchronization point. All processors then proceed with the next color after receiving the updated xi values.
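The execution order described above can be sketched with mpi4py as follows (a schematic illustration; my_supervertices, rows_of and gs_row_update are hypothetical placeholders for the color-class layout and the per-row update of Eq. (2.7), not part of our implementation):

from mpi4py import MPI

def parallel_gs_iteration(comm, x, num_colors, my_supervertices, rows_of, gs_row_update):
    """One Gauss-Seidel iteration organized by color classes."""
    for c in range(num_colors):
        updates = []
        for sv in my_supervertices[c]:        # super vertices of color c on this rank
            for i in rows_of(sv):
                x[i] = gs_row_update(i, x)    # no dependencies within a color class
                updates.append((i, x[i]))
        # local synchronization point: exchange the xi values updated in this class
        for remote in comm.allgather(updates):
            for i, val in remote:
                x[i] = val
    return x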
Figure 4.2 presents a visual comparison of traditional parallel processing versus parallel processing with coarsening and balanced coloring. Since the number of colors decreases with the coarsening stage, fewer synchronization points are required. The sizes of the color classes are almost perfectly balanced, and all processors are assigned an almost equal number of super vertices, which are considered atomic tasks.
Figure 4.2: Traditional parallel processing vs. parallel processing with coarsening
and balanced coloring
Chapter 5
Experimental Results
In this chapter, we present the results of the experiments carried out to validate the proposed methodology. First, we give the details of the experimental platform and the data sets used. The following sections then evaluate the numeric results for the implementation of the two stages of our proposed methodology. In our experiments, we evaluate two important metrics: the number of colors and the partitioning cutsize. Due to the randomized nature of some of the algorithms used, the experiments were repeated 5 times with different seeds and the results were averaged. Timing results for each stage of our methodology are also presented in each section.
5.1 Experimental Platform
The platform used in our experiments is a server with 4 x 6-core AMD Opteron 8425 HE processors with 2.1 GHz clock frequency, 64 KB L1 cache, 512 KB L2 cache, 6 MB shared L3 cache, and 128 GB total memory. The operating system is Debian GNU/Linux version 7. gcc version 4.7.2 is used as the C/C++ compiler with the -O3 optimization flag. METIS version 5.1.10 with default parameters is used as the graph partitioning tool. In the coloring stage, the currently available Grappolo source code (no specific version tag) and ColPack version 1.0.10 are used.
5.2 Data Set
In order to validate and measure the performance of our methodology, we used real-life test data from a wide range of scientific applications. We performed our tests on 40 different matrices from the SuiteSparse Matrix Collection (formerly the University of Florida Sparse Matrix Collection) [26]. The SuiteSparse Matrix Collection is a widely used set of sparse matrix benchmarks collected from a wide range of applications. The chosen problems are symmetric positive definite matrices, as per the convergence requirement of the Gauss-Seidel method. The test matrices range from 16,146 to 4,588,484 vertices and from 1,015,156 to 48,538,952 nonzeros, drawn from a variety of scientific applications including electromagnetics, structural and quantum chemistry problems. The details of the test matrices are presented in Table 5.1.

The matrices in the SuiteSparse Matrix Collection are provided in the Matrix Market (MTX) format, which is a widely used storage format. However, the CSR format is used in our implementation due to the advantages discussed before; we convert the matrices to CSR format while processing them.
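For instance, with SciPy this conversion is a one-liner (a sketch assuming SciPy is available; the file name is illustrative):

from scipy.io import mmread

A = mmread("2cubes_sphere.mtx").tocsr()                 # MTX file -> SciPy CSR matrix
row_ptr, col_ind, values = A.indptr, A.indices, A.data  # 0-based CSR arrays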
Table 5.1: Properties of Test Matrices

Matrix Name     Problem Kind                   Rows      Columns    Nonzeros     Vertex Degrees (Min / Avg / Max)
2cubes_sphere   electromagnetics problem       101,492   101,492     1,647,264    5 / 16 /  31
af_0_k101       structural problem             503,625   503,625    17,550,675   15 / 35 /  35
af_shell1       structural problem sequence    504,855   504,855    17,588,875   20 / 35 /  40
bcsstk31        structural problem              35,588    35,588     1,181,416    2 / 33 / 189
bmw7st_1        structural problem             141,347   141,347     7,339,667    1 / 52 / 435
cfd2            fluid dynamics problem         123,440   123,440     3,087,898   12 / 20 /  33
5.3.1 Experiments on the Coarsening Stage

To measure the effectiveness of our coarsening model, we analyze and compare the structure of the coarse graph GC with the original graph G that corresponds to the coefficient matrix A. We partition G for different block sizes B ∈ {50, 100, 250, 500} and then form the coarse graph GC. In our experiments, we evaluate the structure of G and GC by comparing the vertex degrees. The minimum, average and maximum vertex degrees of G and GC are presented in Table 5.2 with respect to the block size parameter used for the partitioning step. Note that the original graph properties are presented in the lines where B = 1 for each test matrix, and the subsequent rows present the coarsening results for the respective block sizes.
The vertex degree properties of a sparse matrix reflect the dependencies of a nonzero item on other nonzero items in the off-diagonal blocks. Depending on the reordering due to the partitioning, the connectivity between items may cause computational dependencies in the Gauss-Seidel iterations; these dependencies contribute to the output of the coloring stage. The experimental results in Table 5.2 show that our coarsening model effectively reduces the average and maximum degrees of the test matrices. For instance, the test matrix named af_shell has an average degree of 34 and a maximum degree of 40; after coarsening is applied with B = 50, the resulting coarse graph GC has an average degree of 7 and a maximum vertex degree of 11. This observation holds true for almost all of the test matrices.
In the implementation of the graph partitioning stage, we used METIS [14] as the graph partitioning tool. METIS is a multilevel graph partitioning software package: it uses a multilevel approach with three phases and comes with several algorithms for each phase. It is well accepted and commonly used as a successful graph partitioning tool in various domains including finite element methods, linear programming, VLSI, and transportation. The authors' experiments show that METIS produces partitions that are consistently better than those produced by other widely used algorithms.
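As an illustration, the partitioning step can be driven from Python through, e.g., the pymetis bindings (a sketch under the assumption that pymetis is installed; xadj/adjncy are the 0-based CSR adjacency arrays of G with diagonal entries removed):

import pymetis

def partition_graph(xadj, adjncy, num_vertices, B):
    """K-way partition into K = N/B parts of roughly B vertices each."""
    n_parts = max(2, num_vertices // B)
    edge_cut, membership = pymetis.part_graph(n_parts, xadj=xadj, adjncy=adjncy)
    return edge_cut, membership           # membership[i] is the part p[i] of vertex i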
One important quality metric of the coarsening stage related to parallel performance is the partitioning cutsize, since it affects the quality of the output of the coarsening stage. In Table 5.2, the number of parts, denoted by K, and the partitioning cutsize are presented. The experimental results verify that the cutsize decreases with the coarsening process. For instance, for the test matrix af_shell1, the partitioning cutsize decreases from 3,227,892 to 1,005,261 in relation to the block size parameter.
Timing results for the partitioning with METIS (tpartitioning), coarsening (tcoarsening) and coloring (tcoloring) stages are also provided in Table 5.2, all in seconds. Our experiments show that the proposed coarsening model reduces the coloring time. Timing results for the coloring of the original graph G are presented in the lines where B = 1 for each test matrix, and the subsequent rows present the timing results for the larger block sizes. We observe that the most time-consuming operation is the initial graph partitioning done by METIS.
5.3.2 Experiments on the Coloring Stage
Graph coloring is a crucial task in any parallel Gauss-Seidel implementation due to the sequential nature of the algorithm. The number of colors obtained from the coloring stage is directly related to the parallel efficiency: when the computations within a color class are completed, synchronization points are required to exchange xi values, which results in synchronization overhead, and these synchronization overheads add to the total communication volume. Another important metric is the balance of the color classes; balancing the computational weights among the color classes increases the overall efficiency of the parallel system. In this section, we discuss the efficiency of our proposed model by evaluating these metrics. We evaluated and compared two graph coloring tools, ColPack [25] and Grappolo [17].
In Table 5.2, we evaluate the number of colors for each test matrix in relation to various block sizes. Examining the experimental results, we observe that our coarsening model effectively reduces the number of colors required to color the graph. For instance, 908 colors are required to color the Ga3As3H12 matrix, whereas with our coarsening model and B = 50 only 32 colors are required, which amounts to a 28-fold decrease in the synchronization overhead.
Balancing the computations across the color classes is important for a parallel Gauss-Seidel implementation. For this requirement, we suggested a balanced coloring approach, as discussed previously. Here, we compare the results of traditional coloring with balanced coloring in Figure 5.1 and Figure 5.2; ColPack and Grappolo are used for traditional coloring and balanced coloring, respectively. Figure 5.1 compares the two approaches by the number of vertices in each color class of the test matrix delaunay_n20, and Figure 5.2 compares the sums of the weights of the vertices having the same color for the same test matrix. The results show that the vertex counts and weights are almost equal with the balanced coloring method we suggested, whereas there is a huge imbalance among the colors with the traditional coloring method. Color IDs are sorted to highlight the imbalance. In the traditional coloring experiments, color classes with very small numbers of vertices are observed; this may cause some processors to stay idle in parallel systems with a large number of processors. Although the balancing results are provided only for delaunay_n20, the experiments were repeated for every test matrix and the results are similar.
We also present a comparison of the number of colors required to color the same problem with traditional coloring versus balanced coloring. In Figure 5.3, the total number of colors required by both coloring techniques is given for all test matrices. The experiments show that only in rare cases does the balanced coloring approach slightly increase the number of colors. In most cases, the balanced coloring method requires the same number of colors as the traditional coloring method, while producing quality color classes of almost equal sizes.
Timing results for the coloring stage are presented in the column titled tcoloring of Table 5.2. The results show that our coarsening model decreases the time required for the coloring stage, as expected.
Table 5.2: Coarsening and Coloring Results for B ∈ {1, 50, 100, 250, 500}