COLORING FOR DISTRIBUTED-MEMORY-PARALLEL
GAUSS-SEIDEL ALGORITHM
a thesis submitted to
the graduate school of engineering and science
of bilkent university
in partial fulfillment of the requirements for
the degree of
master of science
in
computer engineering
By
Onur Koçak
September 2019
COLORING FOR DISTRIBUTED-MEMORY-PARALLEL GAUSS-SEIDEL ALGORITHM
By Onur Koçak
September 2019
We certify that we have read this thesis and that in our opinion it is fully adequate,
in scope and in quality, as a thesis for the degree of Master of Science.
Cevdet Aykanat (Advisor)
Özgür Ulusoy
Engin Demir
Approved for the Graduate School of Engineering and Science:
proaches including graph partitioning and balanced graph coloring in order to decrease the number of colors while maintaining a computational load balance among the color classes. We present a method that (i) first utilizes graph partitioning in order to decrease the number of colors required in the coloring stage, and (ii) then applies a balanced coloring algorithm in order to extract balanced parallelism by decoupling the data dependencies while maintaining the strict precedence rules of sequential Gauss-Seidel iterations. We have tested the validity of our model on various test matrices arising from different scientific applications. Experimental results show that the proposed model effectively reduces the number of colors required in the coloring stage of the parallel Gauss-Seidel algorithm, which in turn reduces the number of parallel sweeps and the synchronization overheads. In the experiments, it is also observed that the balanced coloring method used efficiently maintains the computational load balance among the color classes.
This thesis is organized as follows: Chapter 2 gives the necessary background and preliminary information. Chapter 3 reviews the existing work in the literature on the parallelization of Gauss-Seidel. Chapter 4 describes our methodology and its stages. The experimental results are presented in Chapter 5. We present the conclusion of this thesis in Chapter 6.
Chapter 2
Background
In this chapter, the preliminary information and definitions related to our methodology for the Gauss-Seidel algorithm are presented. The following sections cover the basic principles of the Gauss-Seidel method, sparse matrices and their storage models, the graph model, and the graph partitioning and coloring problems.
2.1 The Gauss-Seidel Method
The Gauss–Seidel method is an iterative method for solving a system of n linear equations with unknown vector x.
Consider the following linear system:
$$Ax = b \tag{2.1}$$
$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}, \quad x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} \tag{2.2}$$

where A is an (n × n) matrix with nonzero diagonal entries, and b and x are vectors of size n.
The Gauss-Seidel method is defined by the iteration

$$L_* x^{(k+1)} = b - U x^{(k)} \tag{2.3}$$

where $x^{(k)}$ is the k-th approximation (iterate) of x, and the coefficient matrix A is decomposed into a lower triangular component $L_*$ and a strictly upper triangular component $U$: $A = L_* + U$.
In detail, we can decompose A with its lower and upper triangular components
as follows:
$$A = L_* + U, \quad L_* = \begin{bmatrix} a_{11} & 0 & \cdots & 0 \\ a_{21} & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}, \quad U = \begin{bmatrix} 0 & a_{12} & \cdots & a_{1n} \\ 0 & 0 & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix} \tag{2.4}$$
The system of linear equations can be written as:
$$L_* x = b - U x \tag{2.5}$$
The Gauss–Seidel method then solves the left-hand side of this expression for x, using the previous value of x on the right-hand side. This can be shown as:

$$x^{(k+1)} = L_*^{-1}\left(b - U x^{(k)}\right) \tag{2.6}$$

Using the triangular structure of $L_*$, we can sequentially compute the entries of $x^{(k+1)}$ with forward substitution. Hence, the Gauss-Seidel method can be written as:
$$x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(k+1)} - \sum_{j=i+1}^{n} a_{ij} x_j^{(k)} \right), \quad i = 1, 2, \ldots, n \tag{2.7}$$

where:
$x_i^{(k)}$ is the ith unknown in x during the kth iteration, i = 1, ..., n,
$x_i^{(0)}$ is the initial guess for the ith unknown in x,
$a_{ij}$ is the coefficient of A in the ith row and jth column,
$b_i$ is the ith value in b.
In matrix form, the Gauss-Seidel method can be defined as:

$$x^{(k+1)} = (D + L)^{-1}\left[\, b - U x^{(k)} \right] \tag{2.8}$$

where:
$x^{(k)}$ is the kth approximation to x,
$x^{(0)}$ is the initial guess at x,
$D$ is the diagonal of A,
$L$ is the strictly lower triangular portion of A (so that $L_* = D + L$),
$U$ is the strictly upper triangular portion of A,
$b$ is the right-hand-side vector.
Convergence is evaluated using the absolute approximate error, calculated after each iteration to check whether the error is within an acceptable tolerance:

$$|E_a| = \left\lVert x^{(k+1)} - x^{(k)} \right\rVert \tag{2.9}$$
Characteristics of the Gauss-Seidel method are as follows:

• Convergence of the Gauss-Seidel solver depends on the characteristics of the coefficient matrix. Convergence is only guaranteed if the coefficient matrix A is either symmetric positive definite or strictly/irreducibly diagonally dominant [9].

• The Gauss-Seidel method has better convergence rates than the Jacobi method on model problems, as it uses the updated values during each iteration [5].

• The Gauss-Seidel method is widely used in multigrid applications. It works well on unstructured problems since there is no need to choose a damping parameter [5].
The Gauss-Seidel algorithm is inherently sequential, since computing $x_i^{(k+1)}$ depends on the values $x_j^{(k+1)}$, j < i, already computed in the same iteration. This is in contrast to the Jacobi method, where computing $x_i^{(k+1)}$ requires only the values $x_j^{(k)}$ from the previous iteration. Therefore, to efficiently parallelize the Gauss-Seidel method, it is necessary to identify tasks that can be computed simultaneously by resolving the data dependencies between tasks.
Algorithm 1 presents a sequential algorithm for solving a linear system Ax = b using the Gauss-Seidel method. The algorithm takes A and b as input and solves for the solution vector x. We assume an initial guess for x (i.e., x(0) = 0). The algorithm algebraically solves each linear equation for xi and repeats until convergence. The iteration terminates when either (i) the user-specified maximum number of iterations kmax has been reached, or (ii) the norm of the difference between successive iterates is less than a user-specified tolerance Eu.
Algorithm 1: Sequential Gauss-Seidel Algorithm

Input: A, b, kmax, Eu
Output: x

k ← 0
Choose an initial guess x(0) to solution x
/* Repeat until reaching a desired error tolerance or maximum iterations limit */
while k < kmax and Ea > Eu do
    for i = 1 to n do
        sum ← 0
        for j = 1 to i − 1 do
            sum ← sum + aij xj
        end
        for j = i + 1 to n do
            sum ← sum + aij xj
        end
        xi ← (bi − sum)/aii
    end
    /* Calculate new iteration error */
    if k > 0 then
        Ea ← ‖x − xold‖
    end
    /* Update the old solution */
    xold ← x
    /* Update the iteration counter */
    k ← k + 1
end
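For concreteness, the following is a minimal Python sketch of Algorithm 1 for a dense coefficient matrix (illustrative only; the function name and defaults are ours, not part of the thesis implementation):

import numpy as np

def gauss_seidel(A, b, k_max=1000, eps=1e-8):
    """Solve Ax = b with the Gauss-Seidel iteration of Eq. (2.7), dense storage."""
    n = len(b)
    x = np.zeros(n)                           # initial guess x(0) = 0
    for k in range(k_max):
        x_old = x.copy()
        for i in range(n):
            # x[:i] already holds the updated values x(k+1); x[i+1:] still holds x(k)
            s = A[i, :i] @ x[:i] + A[i, i+1:] @ x[i+1:]
            x[i] = (b[i] - s) / A[i, i]
        if np.linalg.norm(x - x_old) < eps:   # |Ea| < Eu, cf. Eq. (2.9)
            break
    return x

Since x is updated in place, the first dot product automatically uses the values computed earlier in the same sweep, which is exactly the Gauss-Seidel dependency pattern.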
2.2 Sparse Matrices and Compressed Sparse Row (CSR) Format

A matrix is called sparse if most of its elements are zero. Sparse matrices are distinct from matrices with mostly non-zero values, which are referred to as dense matrices. The sparsity of a matrix can be defined as the number of non-zero elements divided by the total number of elements (m × n). Many scientific problems require working with sparse matrices.

When working with sparse matrices, it is efficient and often necessary to use special data structures, since treating them as dense and storing the zero values as well may consume large amounts of memory and storage. Compressed Sparse Row (CSR) is an efficient data structure to store and process large sparse matrices; it allows fast row access in matrix-vector multiplications.

The CSR data structure represents a matrix by three arrays: row ptr, col ind and values. nnz denotes the total number of non-zero entries in the matrix.

• values is an array of size nnz; it stores the non-zeros of the matrix in contiguous memory locations, traversed in row-wise fashion.

• col ind is an array of size nnz; it stores the column indices of the respective nonzero values in the values array.

• row ptr is an array of size n + 1; it stores the row start pointers, i.e., the position in the values array where each row begins.
M =
⎡ 20   0   0  25   0 ⎤
⎢ 30  35   0  40   0 ⎥
⎢ 45   0  50  55  60 ⎥
⎢  0   0  65  70   0 ⎥
⎣  0   0   0   0  75 ⎦

row ptr = [ 1, 3, 6, 10, 12, 13 ]
col ind = [ 1, 4, 1, 2, 4, 1, 3, 4, 5, 3, 4, 5 ]
values  = [ 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75 ]

Figure 2.1: A simple matrix M and its CSR representation (1-based indexing)
In Figure 2.1, a simple matrix M and its CSR representation with the corresponding arrays are presented. One can observe that the lengths of the col ind and values arrays are equal to the number of nonzero items. The entries of the row ptr array point to the row start positions in the col ind and values arrays, and its length is one more than the number of rows.

Algorithm 2 presents the same Gauss-Seidel algorithm as Algorithm 1; the only difference is that the coefficient matrix A is supplied in the CSR format. Instead of a two-dimensional array storing the coefficient matrix A, the corresponding row ptr, col ind and values arrays are provided as input.
Algorithm 2: Sequential Gauss-Seidel Algorithm with CSR data format

Input: row ptr, col ind, values, b, kmax, Eu
Output: x

k ← 0
Choose an initial guess x(0) to solution x
/* Repeat until reaching a desired error tolerance or maximum iterations limit */
while k < kmax and Ea > Eu do
    for i = 1 to n do
        sum ← 0
        for j = row ptr[i] to row ptr[i + 1] − 1 do
            if col ind[j] ≠ i then
                sum ← sum + values[j] · x[col ind[j]]
            else
                d ← values[j]
            end
        end
        xi ← (bi − sum)/d
    end
    /* Calculate new iteration error */
    if k > 0 then
        Ea ← ‖x − xold‖
    end
    /* Update the old solution */
    xold ← x
    /* Update the iteration counter */
    k ← k + 1
end
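The same sweep can be sketched in Python over 0-based CSR arrays (again illustrative; the diagonal is assumed to be stored and nonzero, as the method requires):

def gs_sweep_csr(row_ptr, col_ind, values, b, x):
    """One forward Gauss-Seidel sweep over a CSR matrix (0-based indices)."""
    n = len(b)
    for i in range(n):
        s = 0.0
        diag = None
        for j in range(row_ptr[i], row_ptr[i + 1]):
            if col_ind[j] == i:
                diag = values[j]                # the (assumed nonzero) diagonal entry
            else:
                s += values[j] * x[col_ind[j]]  # columns j < i are already updated
        x[i] = (b[i] - s) / diag
    return x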
2.3 Graphs and Matrices
An undirected graph is defined as G = (V, E), where V denotes the set of vertices and E denotes the set of edges. Each edge eij ∈ E connects a pair of distinct vertices vi and vj. The degree di of a vertex vi is equal to the number of edges incident to vi. Each vertex vi ∈ V can be assigned a weight w(vi), and each edge eij ∈ E can be assigned a cost c(eij). The set of vertices adjacent to vertex vi is called the adjacency list of vi and is denoted as Adj(vi) ⊆ V.
An n × n square symmetric matrix A = (aij) can be represented as an undirected graph G(A) = (V, E) with n vertices, where the sparsity pattern of A corresponds to the adjacency matrix representation of graph G. The vertex set V represents the rows/columns and the edge set E represents the off-diagonal non-zeros of matrix A: V contains a vertex vi for each row/column i, and E contains an edge eij, connecting the vertices vi and vj, for each non-zero pair aij and aji in A.
Figure 2.2: A simple symmetric matrix and its undirected graph representation
Graph representation of matrices is often used in scientific computing applications to investigate the data dependencies of tasks and communication patterns.
2.4 Graph Partitioning
Graph partitioning is a widely used method in scientific applications, including matrix multiplications, to decompose computational tasks for parallelization [10, 11, 12].
Π = {P1,P2, . . . ,PK} is said to be a K-way partition of G if the following
conditions hold: (i) each part Pk is a non-empty subset of V , (ii) parts are
pairwise disjoint, i.e., Pk ∩ Pl = ∅ for all 1 ≤ k < l ≤ K, and (iii) union of
all K parts is equal to V . A partition Π of V is balanced if the following balance
criterion is satisfied by each part:
$$W_k \leq W_{avg}(1 + \varepsilon), \quad \text{for } k = 1, 2, \ldots, K \tag{2.10}$$
where $W_k$ is the sum of the weights of all vertices in part $\mathcal{P}_k$, i.e., $W_k = \sum_{v_i \in \mathcal{P}_k} w(v_i)$; $W_{avg}$ is the average part weight, i.e., $W_{avg} = \sum_{v_i \in \mathcal{V}} w(v_i)/K$; and ε is the maximum imbalance ratio allowed. In a partition Π of G, an edge eij is called a cut (external) edge if it connects two different parts, and an uncut (internal) edge if both vertices vi and vj are in the same part. EE denotes the set of external edges. The cost of a partition Π is determined in terms of the cut edges as follows:

$$cost(\Pi) = \sum_{e_{ij} \in \mathcal{E}_E} c(e_{ij}) \tag{2.11}$$
The graph partitioning problem is to partition the graph into two or more parts while minimizing the cutsize in (2.11) and satisfying the maximum load balance constraint in (2.10). The graph partitioning problem is known to be NP-hard. There exist well-known, successful graph partitioning tools in the literature, such as Chaco [13] and METIS [14, 15].
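As a concrete illustration of these two metrics, the following Python sketch (names are ours) evaluates the cutsize (2.11), with unit edge costs, and the imbalance of a given partition vector:

def cutsize_and_imbalance(edges, weights, part, K):
    """edges: iterable of (i, j) vertex pairs; weights[i]: w(vi); part[i]: part of vi."""
    cut = sum(1 for (i, j) in edges if part[i] != part[j])  # cut edges, c(eij) = 1
    W = [0.0] * K
    for v, w in enumerate(weights):
        W[part[v]] += w
    W_avg = sum(W) / K
    imbalance = max(W) / W_avg - 1.0    # must stay below epsilon to satisfy (2.10)
    return cut, imbalance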
Figure 2.3: An undirected graph and its 4-way partitioning.
Figure 2.3 presents an undirected graph and its 4-way partitioning. The total
cutsize is 9. All part weights are perfectly balanced and equal to 6.
2.5 Graph Coloring
A distance-1 coloring of a graph G is defined as a mapping clr : V → {1, 2, . . . , s} such that clr(vi) ≠ clr(vj) for all edges eij ∈ E, where clr(vi) is the color of vertex vi. This corresponds to an assignment of colors to vertices such that any two adjacent vertices have different colors. Coloring a graph with the minimum number of colors is known to be NP-hard [16].
Standard graph coloring aims to minimize the number of colors used without considering the sizes of the color classes relative to each other. A coloring of a graph is called equitable if the sizes of any two color classes differ by at most one. Another variant is balanced coloring: a coloring of a graph is called balanced if the ratio of the sizes of any two color classes is within a given threshold. Given a graph G(V, E), balanced coloring is defined as computing a distance-1 coloring such that each color class receives approximately |V|/C vertices, where C is the number of colors used [17].
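A minimal Python sketch of a greedy distance-1 coloring, with a least-used-color tie-break as a simple balancing heuristic (a simplification in the spirit of [17], not their algorithm), might be:

def greedy_balanced_coloring(adj):
    """adj: dict mapping each vertex to its neighbors. Returns dict vertex -> color."""
    color = {}
    class_size = []                                 # class_size[c] = |color class c|
    for v in adj:
        forbidden = {color[u] for u in adj[v] if u in color}
        allowed = [c for c in range(len(class_size)) if c not in forbidden]
        if allowed:
            c = min(allowed, key=lambda k: class_size[k])  # least-used permissible color
        else:
            c = len(class_size)                     # open a new color class
            class_size.append(0)
        color[v] = c
        class_size[c] += 1
    return color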
Figure 2.4: An undirected graph and the same graph with distance-1 coloring
Figure 2.4 shows a simple undirected graph and its coloring. The graph has 9 vertices and 14 edges, and the maximum degree of the graph is 5. Three colors are used, and the color classes have different sizes of 3, 2 and 4.
Graph coloring is frequently used in the parallel computing domain to resolve the dependencies in scientific problems by identifying independent tasks that can be computed in parallel. It is used for extracting parallelism in iterative sparse solvers including Gauss-Seidel [18, 7, 19], and also in many other scientific applications such as the calculation of sparse Jacobian matrices in partial differential equation solvers [20] and timetabling and scheduling problems [21, 22].
Chapter 3
Related Work
In this chapter, we investigate and discuss the existing work on parallel Gauss-
Seidel algorithms. There are several studies on parallelization of Gauss-Seidel in
the literature, which use graph decomposition, matrix ordering and graph coloring
models.
Koester et al. [6] presented a parallel Gauss-Seidel algorithm developed for irregular sparse matrices from power system applications. They suggest a two-step matrix ordering technique that first partitions the matrix into a block-diagonal bordered form using diakoptic techniques and then multi-colors the data in the last diagonal block using a graph coloring method. They order the matrix into block-diagonal-bordered form using a node-tearing technique, and then assign the block-diagonal matrix partitions Ai,i, Ai,m, Am,i to the same processor. Their work utilizes active remote procedure calls in order to minimize the communication overhead and obtain good speed-up. The drawbacks of their algorithm are that the number of color classes increases dramatically in unstructured problems, load imbalance is observed in many experiments, and the calculation of values in the last block shows poor parallel performance. Since they apply different orderings to the matrix structure in order to extract parallelism, the convergence rates are affected by the ordering.
Motivated by this observation, Adams [5] proposed a new distributed-memory Gauss-Seidel algorithm as a multigrid smoother for unstructured problems. Their approach takes advantage of processor partitions and uses domain decomposition with the distribution of the stiffness matrix. Nodes are partitioned to minimize communication, and then each processor partitions its nodes into groups of interior (Int1, Int2) and boundary nodes (Top, Bot, Mid). Interior nodes, which do not have any edges to nodes on different processors, are computed first to hide the communication latency of the boundary nodes. The boundary nodes are partitioned into nodes that only communicate with processors of higher and lower colors, referred to as Bot and Top nodes, respectively. By processing the processor partitions with special orderings, their algorithm can be effective by utilizing optimal partitioning of finite element problems. However, a weakness of their approach is that the algorithm requires immediate communication for some computations that occur in Mid nodes, and it also requires different orderings depending on the number of processors. The algorithm works well under certain assumptions, such as the load balancing of computations being perfectly satisfied and there being enough interior nodes to hide the communication requirement of the boundary nodes.
Huang and Ongsakul [7] propose a task allocation algorithm and apply it in a two-stage parallel Gauss-Seidel implementation on distributed memory systems. Their approach first uses a recursive clustering algorithm to reduce communication costs and then devises a coloring algorithm to coordinate information exchange among processors. The graph is partitioned into clusters depending on the number of processors by recursively using a balanced 2-way weakcut algorithm, which minimizes the number of boundary vertices instead of minimizing the number of inter-connection edges between two clusters. Each cluster is assigned to a single processor. Then, a coloring algorithm is proposed to color only the boundary vertices. They implemented their algorithm on an nCUBE2 machine, applying recursive partitioning k times depending on the number of processors. Their load balancing paradigm is based on distributing non-colored vertices among cluster groups. The weakness of their algorithm is that the partitioning cutsize may not be minimized while maintaining the balance of cluster sizes if vertices are not ideally connected, and load imbalance may be observed between the color classes.
Courtecuisse and Allard [23] propose several Gauss-Seidel parallelization schemes for many-core processors that can be used in fully coupled dense problems. Their approach aims to synchronize computations without CPU intervention by exploiting a small number of GPU processors. They suggest a block-column based parallelization, which first computes the blocks on the diagonal of the matrix and then processes the other columns. Their work also presents a different approach, which eliminates the need for global synchronization: a shared memory counter stores how many of the diagonal blocks have been processed, from which it can be deduced whether xi has been updated for newer iterations. The drawback of their algorithm is that it does not guarantee the order of invocation of thread blocks, and the scheduling is done manually.
Shang [8] proposes another distributed-memory parallel Gauss-Seidel algorithm. Their approach first divides the coefficient matrix and the right-hand side of the linear system into row blocks in the natural row-wise order, then distributes the row blocks among the local memories of all processors through special mapping techniques. The communication volume is decreased by conveying the solution iteration vector among processors in cycles at each iteration. The parallel efficiency of their algorithm depends on a row-block partitioning parameter g and an optimal number of processors; they suggest that the parameter g depends on the relative ratio of communication to computation speed. To increase parallel efficiency, their approach overlaps communication and computation. Their approach is a true parallel Gauss-Seidel algorithm that maintains the convergence rate of the serial Gauss-Seidel algorithm. Although communication time is decreased by the cyclically conveyed solution iteration vector, the load balancing issue among processors is not handled.
Since Gauss-Seidel is also used as a preconditioner and multigrid smoother in linear problems, there exists work in the literature that implements parallel Gauss-Seidel algorithms for these purposes. Heuveline et al. [24] use a multi-coloring technique for a block Gauss-Seidel preconditioner on GPUs. Their approach aims to resolve neighbor dependencies by introducing new neighborship classes. They evaluated scalability and performance results on hybrid multi-core CPU and GPU platforms.
Park et al. [19] implemented a shared-memory parallel Gauss-Seidel algorithm as a preconditioner in the High Performance Conjugate Gradient (HPCG) benchmark, as the Gauss-Seidel kernel is the most time-consuming operation in HPCG. Their work uses three parallelization approaches: task scheduling with point-to-point synchronization, block multi-color reordering, and running multiple MPI ranks within a node. Their multi-coloring approach assigns vertices the same color even when there are transitive dependencies among some of them. Although their study reports good results in terms of parallel efficiency, it is outside the scope of our work as it targets shared-memory architectures.
Chapter 4
Method
In this chapter, we present our novel two-stage scheme for efficient and load-balanced parallelization of the Gauss-Seidel method. In the first stage, we apply graph partitioning and build a coarse graph based on the partitioning of the vertices; we apply this coarsening in order to obtain a better coloring and decrease the number of synchronization points. In the second stage, we use a balanced graph coloring approach in order to decouple the dependencies among Gauss-Seidel iterations, which is required to extract parallelism, while keeping the sizes of the color classes balanced.
4.1 Coarsening Stage
In this stage, we propose a coarsening technique based on graph partitioning of the graph representation of the coefficient matrix A. This stage takes the original input graph G corresponding to the coefficient matrix A and outputs a coarse graph GC in which each coarse vertex contains approximately B vertices of the original graph, where B is the block size parameter. The coefficient matrix A is represented as an undirected graph G = (V, E), as described in Chapter 2. Given a block size parameter B, we K-way partition the graph G with N vertices into K parts, each part having approximately B vertices, such that K ≫ P, where P denotes the number of processors. Note that K = N/B. The block size parameter may depend on the size and sparsity structure of the coefficient matrix A, and it may affect the overall efficiency of the system. In our experiments, we have chosen block-size parameters ranging from 50 to 500 depending on the problem size. The effect of the block size parameter is evaluated in Chapter 5 on scientific problems of various sizes.
Given the number of parts K = N/B as a parameter, we partition the graph into K parts, each with approximately B vertices, obtaining a partitioning Π = {V1, V2, . . . , VK}, and we calculate the edge-cut of the partitioning solution. Then, we form a coarse graph GC = (VC, EC), where VC denotes the vertex set of GC, whose members are called super vertices, and EC denotes the edge set of GC, whose members are called super edges. The coarse graph GC is formed using the partitioning information Π; that is, each part Vk becomes a super vertex of VC. In detail, let p be a vector of size |V| such that p[i] stores the number of the part that vertex vi belongs to. By partitioning the graph, we obtain a list of vertex-part pairs 〈vi, p[i]〉, and we assign each group of vertices of G to a super vertex of GC based on the part numbers p[i]. If there exists at least one edge between two vertices belonging to different super vertices, we form a super edge between the respective super vertices. Multiple edges between the vertices of corresponding super vertices are coalesced into a single super edge.
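The coarse-graph construction just described amounts to a few lines of Python (a sketch with illustrative names; p is the vertex-to-part vector produced by the partitioner):

def build_coarse_graph(edges, p):
    """Collapse a partition into super edges; parallel edges are coalesced."""
    super_edges = set()
    for i, j in edges:
        if p[i] != p[j]:                  # edge crosses two super vertices
            super_edges.add((min(p[i], p[j]), max(p[i], p[j])))
    return super_edges                    # super vertices are the part numbers 0..K-1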
The aim of the graph partitioning problem is to minimize the cutsize among all parts. By minimizing the cutsize, we obtain highly intra-connected clusters of vertices and loosely inter-connected super vertices. Each super vertex denotes an atomic task, and the rows corresponding to the vertices in that super vertex will be processed by a single processor in the Gauss-Seidel algorithm. Note, however, that multiple super vertices are assigned to a single processor based on the coloring results. This enables us to reduce communication costs while processing vertices on multiple processors in parallel: no communication is required when processing a single super vertex and the fine-grain vertices that make it up. By coarsening the graph based on the partitioning information, the resulting coarse graph GC is less coupled than the initial graph. Therefore, the chance of obtaining a smaller number of colors in the coloring stage (described in Section 4.2) is increased. Additionally, coarsening reduces the coloring time; coloring is a necessary pre-processing step in a parallel Gauss-Seidel implementation.
In our experimental results presented in Chapter 5, we compare the characteristics of the initial graph G with those of the resulting coarse graph GC and observe that GC is less coupled in terms of vertex degrees and requires fewer colors than G.
Figure 4.1: Step by step coarsening stage
Figure 4.1 shows an undirected graph with 27 vertices and its step-by-step coarsening process. The initial graph is partitioned into 5 parts with part sizes of 5 or 6. Each part is considered a super vertex, and super edges are then formed among the super vertices. The resulting coarse graph has 5 super vertices, whose weights are provided as labels. One can observe that the vertices within a single part are densely connected.
4.2 Coloring Stage
In order to extract parallelism in the Gauss-Seidel iterations, we use a graph coloring technique. We apply distance-1 coloring on the coarsened graph GC that is obtained at the end of the coarsening stage. We first assign colors to the super vertices of GC, process the super vertices of GC belonging to the same color in parallel, and then proceed with the next color. We describe the details of the parallel execution steps later in this section.
Traditional graph coloring methods aim to minimize the number of colors. However, in the context of many parallel processing applications, it is also important to obtain a balanced distribution of the sizes of the color classes: if the color classes are imbalanced in size, the utilization of hardware resources becomes inefficient, especially for the smaller color classes. Therefore, our goal is to achieve a balanced coloring of the vertices of GC such that the number of vertices in each color class is approximately the same. Lu et al. [17] presented a graph coloring toolkit named Grappolo, which provides an algorithm for balanced coloring using the Vertex First Fit (VFF) method. Their work presents a set of coloring strategies based on greedy, shuffling and recoloring approaches; basically, a vertex is assigned the least used color among the set of permissible colors, bounded by a balancing constraint. Their work also includes guided balancing strategies in which scheduled and unscheduled shuffling and recoloring methods are introduced.
We have also considered using a traditional coloring method, due to the possibility that a balanced coloring method may increase the required number of colors. For this purpose, we evaluated a set of coloring algorithms proposed by Gebremedhin et al. [25]. Their work presents a graph coloring toolkit named ColPack for scientific computing. ColPack offers various sequential greedy coloring algorithms with degree-based orderings in the context of distance-1 coloring. Each algorithm progressively extends a partial coloring by processing a single vertex at a time in some order, permanently assigning the vertex the smallest allowable color in each step. By comparing the two coloring approaches in our experiments, we observed that a balanced coloring is generally possible with the same number of colors as traditional coloring. The related experimental results are discussed in more detail in Chapter 5.
In this stage, Grappolo is used to obtain a balanced coloring of the vertices, which is required for the kernel of the parallel Gauss-Seidel implementation. We color the vertices of the coarse graph GC, and upon completion, each super vertex vci ∈ VC is assigned a color number ci, where ci ∈ C. C denotes the set of all colors, and nc denotes the total number of colors required. Note that the vertices of G are not subjected to coloring at this stage; instead, they are assigned colors based on the color of the super vertex they are part of. The proposed coarsening and coloring methods together decrease the number of colors, which reduces the number of synchronization points contributing to the total communication volume. Fewer colors require fewer synchronization points between sub-iterations; hence, the communication overhead decreases. The coloring time also decreases with the proposed coarsening method.
We assign the rows of the iteration matrix A that correspond to the constituent vertices of each super vertex to a different processor. Since the number of super vertices of GC is much greater than the number of processors (K ≫ P), a single processor processes multiple super vertices of GC. All processors compute the corresponding xi entries of the same color. Once processing of a color class is complete, all processors exchange xi values; this is called a local synchronization point. All processors then proceed with the next color after receiving the updated xi values.
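The execution order described above can be sketched with mpi4py as follows (a schematic illustration; my_supervertices, rows_of and gs_row_update are hypothetical placeholders for the color-class layout and the per-row update of Eq. (2.7), not part of our implementation):

from mpi4py import MPI

def parallel_gs_iteration(comm, x, num_colors, my_supervertices, rows_of, gs_row_update):
    """One Gauss-Seidel iteration organized by color classes."""
    for c in range(num_colors):
        updates = []
        for sv in my_supervertices[c]:        # super vertices of color c on this rank
            for i in rows_of(sv):
                x[i] = gs_row_update(i, x)    # no dependencies within a color class
                updates.append((i, x[i]))
        # local synchronization point: exchange the xi values updated in this class
        for remote in comm.allgather(updates):
            for i, val in remote:
                x[i] = val
    return x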
Figure 4.2 presents a visual comparison of traditional parallel processing versus parallel processing with coarsening and balanced coloring. Since the number of colors decreases with the coarsening stage, fewer synchronization points are required. The sizes of the color classes are almost perfectly balanced, and all processors are assigned an almost equal number of super vertices, which are considered atomic tasks.
Figure 4.2: Traditional parallel processing vs. parallel processing with coarsening
and balanced coloring
Chapter 5
Experimental Results
In this chapter, we present the results of the experiments carried out to validate the proposed methodology. First, we give the details of the experimental platform and the data sets used. The following sections then evaluate the numeric results for the implementation of the two stages of our proposed methodology. In our experiments, we evaluate two important metrics: the number of colors and the partitioning cutsize. Due to the randomized nature of some of the algorithms used, the experiments were repeated 5 times with different seeds and the results were averaged. Timing results for each stage of our methodology are also presented in each section.
5.1 Experimental Platform
The platform used in our experiments is a server with 4 x 6-core AMD Opteron 8425 HE processors with 2.1 GHz clock frequency, 64 KB L1 cache, 512 KB L2 cache, 6 MB shared L3 cache, and 128 GB total memory. The operating system is Debian GNU/Linux version 7. gcc version 4.7.2 is used as the C/C++ compiler with the -O3 optimization flag. METIS version 5.1.10 with default parameters is used as the graph partitioning tool. In the coloring stage, the currently available Grappolo source code (no specific version tag) and ColPack version 1.0.10 are used.
5.2 Data Set
In order to validate and measure the performance of our methodology, we used real-life test data from a wide range of scientific applications. We performed our tests on 40 different matrices from the SuiteSparse Matrix Collection (formerly the University of Florida Sparse Matrix Collection) [26]. The SuiteSparse Matrix Collection is a widely used set of sparse matrix benchmarks collected from a wide range of applications. The chosen problems are symmetric positive definite matrices, as per the convergence requirement of the Gauss-Seidel method. The test matrices range from 16,146 to 4,588,484 vertices and from 1,015,156 to 48,538,952 nonzeros, drawn from a variety of scientific applications including electromagnetics, structural and quantum chemistry problems. The details of the test matrices are presented in Table 5.1.

The matrices in the SuiteSparse Matrix Collection are provided in the Matrix Market (MTX) format, which is a widely used storage format. However, the CSR format is used in our implementation due to the advantages discussed before; we convert the matrices to CSR format while processing them.
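For instance, with SciPy this conversion is a one-liner (a sketch assuming SciPy is available; the file name is illustrative):

from scipy.io import mmread

A = mmread("2cubes_sphere.mtx").tocsr()                 # MTX file -> SciPy CSR matrix
row_ptr, col_ind, values = A.indptr, A.indices, A.data  # 0-based CSR arrays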
Table 5.1: Properties of Test Matrices

Matrix Name     Problem Kind                   Rows      Columns    Nonzeros     Vertex Degrees (Min / Avg / Max)
2cubes_sphere   electromagnetics problem       101,492   101,492     1,647,264    5 / 16 /  31
af_0_k101       structural problem             503,625   503,625    17,550,675   15 / 35 /  35
af_shell1       structural problem sequence    504,855   504,855    17,588,875   20 / 35 /  40
bcsstk31        structural problem              35,588    35,588     1,181,416    2 / 33 / 189
bmw7st_1        structural problem             141,347   141,347     7,339,667    1 / 52 / 435
cfd2            fluid dynamics problem         123,440   123,440     3,087,898   12 / 20 /  33
5.3.1 Experiments on the Coarsening Stage

To measure the effectiveness of our coarsening model, we analyze and compare the structure of the coarse graph GC with the original graph G that corresponds to the coefficient matrix A. We partition G for different block sizes B ∈ {50, 100, 250, 500} and then form the coarse graph GC. In our experiments, we evaluate the structure of G and GC by comparing the vertex degrees. The minimum, average and maximum vertex degrees of G and GC are presented in Table 5.2 with respect to the block size parameter used for the partitioning step. Note that the original graph properties are presented in the lines where B = 1 for each test matrix, and the subsequent rows present the coarsening results for the respective block sizes.
The vertex degree properties of a sparse matrix reflect the dependencies of a nonzero item on other nonzero items in the off-diagonal blocks. Depending on the reordering due to the partitioning, the connectivity between items may cause computational dependencies in the Gauss-Seidel iterations; these dependencies contribute to the output of the coloring stage. The experimental results in Table 5.2 show that our coarsening model effectively reduces the average and maximum degrees of the test matrices. For instance, the test matrix named af_shell has an average degree of 34 and a maximum degree of 40; after coarsening is applied with B = 50, the resulting coarse graph GC has an average degree of 7 and a maximum vertex degree of 11. This observation holds true for almost all of the test matrices.
In the implementation of the graph partitioning stage, we used METIS [14] as the graph partitioning tool. METIS is a multilevel graph partitioning software package: it uses a multilevel approach with three phases and comes with several algorithms for each phase. It is well accepted and commonly used as a successful graph partitioning tool in various domains including finite element methods, linear programming, VLSI, and transportation. The authors' experiments show that METIS produces partitions that are consistently better than those produced by other widely used algorithms.
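As an illustration, the partitioning step can be driven from Python through, e.g., the pymetis bindings (a sketch under the assumption that pymetis is installed; xadj/adjncy are the 0-based CSR adjacency arrays of G with diagonal entries removed):

import pymetis

def partition_graph(xadj, adjncy, num_vertices, B):
    """K-way partition into K = N/B parts of roughly B vertices each."""
    n_parts = max(2, num_vertices // B)
    edge_cut, membership = pymetis.part_graph(n_parts, xadj=xadj, adjncy=adjncy)
    return edge_cut, membership           # membership[i] is the part p[i] of vertex i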
One important quality metric of the coarsening stage related to parallel performance is the partitioning cutsize, since it affects the quality of the output of the coarsening stage. In Table 5.2, the number of parts, denoted by K, and the partitioning cutsize are presented. The experimental results verify that the cutsize decreases with the coarsening process. For instance, for the test matrix af_shell1, the partitioning cutsize decreases from 3,227,892 to 1,005,261 in relation to the block size parameter.
Timing results for the partitioning with METIS (tpartitioning), coarsening (tcoarsening) and coloring (tcoloring) stages are also provided in Table 5.2, all in seconds. Our experiments show that the proposed coarsening model reduces the coloring time. Timing results for the coloring of the original graph G are presented in the lines where B = 1 for each test matrix, and the subsequent rows present the timing results for the larger block sizes. We observe that the most time-consuming operation is the initial graph partitioning done by METIS.
5.3.2 Experiments on the Coloring Stage
Graph coloring is a crucial task in any parallel Gauss-Seidel implementation due to the sequential nature of the algorithm. The number of colors obtained from the coloring stage is directly related to the parallel efficiency: when the computations within a color class are completed, synchronization points are required to exchange xi values, which results in synchronization overhead, and these synchronization overheads add to the total communication volume. Another important metric is the balance of the color classes; balancing the computational weights among the color classes increases the overall efficiency of the parallel system. In this section, we discuss the efficiency of our proposed model by evaluating these metrics. We evaluated and compared two graph coloring tools, ColPack [25] and Grappolo [17].
In Table 5.2, we evaluate the number of colors for each test matrix in relation to various block sizes. Examining the experimental results, we observe that our coarsening model effectively reduces the number of colors required to color the graph. For instance, 908 colors are required to color the Ga3As3H12 matrix, whereas with our coarsening model and B = 50 only 32 colors are required, which amounts to a 28-fold decrease in the synchronization overhead.
Balancing the computations across the color classes is important for a parallel Gauss-Seidel implementation. For this requirement, we suggested a balanced coloring approach, as discussed previously. Here, we compare the results of traditional coloring with balanced coloring in Figure 5.1 and Figure 5.2; ColPack and Grappolo are used for traditional coloring and balanced coloring, respectively. Figure 5.1 compares the two approaches by the number of vertices in each color class of the test matrix delaunay_n20, and Figure 5.2 compares the sums of the weights of the vertices having the same color for the same test matrix. The results show that the vertex counts and weights are almost equal with the balanced coloring method we suggested, whereas there is a huge imbalance among the colors with the traditional coloring method. Color IDs are sorted to highlight the imbalance. In the traditional coloring experiments, color classes with very small numbers of vertices are observed; this may cause some processors to stay idle in parallel systems with a large number of processors. Although the balancing results are provided only for delaunay_n20, the experiments were repeated for every test matrix and the results are similar.
We also present a comparison of the number of colors required to color the same problem with traditional coloring versus balanced coloring. In Figure 5.3, the total number of colors required by both coloring techniques is given for all test matrices. The experiments show that only in rare cases does the balanced coloring approach slightly increase the number of colors. In most cases, the balanced coloring method requires the same number of colors as the traditional coloring method, while producing quality color classes of almost equal sizes.
Timing results for the coloring stage are presented in the column titled tcoloring of Table 5.2. The results show that our coarsening model decreases the time required for the coloring stage, as expected.
Table 5.2: Coarsening and Coloring Results for B ∈ {1, 50, 100, 250, 500}