Efficient GPU implementation of bioinformatics
applications
Nuno Miguel Trindade Marcos
Thesis to obtain the Master of Science Degree in
Information Systems and Computer Engineering
Supervisors: Prof. Nuno Filipe Valentim Roma, Prof. Pedro Filipe Zeferino Tomás
Examination Committee:
Chairperson: Prof. José Carlos Martins Delgado
Supervisor: Prof. Nuno Filipe Valentim Roma
Member of the Committee: Prof. David Manuel Martins de Matos
November 2014
Acknowledgements
First of all, I would like to thank Professors Nuno Roma and Pedro Tomás, my supervisors in this work, for their guidance and collaboration. Without their patience and constant tolerance it would not have been possible to finish this work. To them, a heartfelt thank you.
Next, I would like to thank the colleagues who accompanied me throughout the course, especially David Gaspar for his encouragement and for all the moments we shared, Jhonny Aldeia for all his help during the course, and Dionisio Sousa, Rui Mestre and Artur Ferreira, who helped me and gave me the motivation to keep going. In addition, a special thanks to my friend Pedro Monteiro for the help with his work and for the constant encouragement.
I would also like to thank my friends Daniela Coelho, Miguel Matos, David Dias, João Velez and Pedro Chagas for all their support during this work, and my godson Tiago Carreira for repeatedly insisting that I finish this work and for all his help throughout it.
Besides them, I would also like to thank my colleagues at Premium Minds, who were always available to help me and to take over things in my absence, especially Márcio Nóbrega, André Soares, Renil Lacmane and Afonso Vilela.
Last, but most importantly, I would like to thank my parents and my brother for all the strength and motivation that allowed me to reach the end of this work, and especially my girlfriend Ana Daniela for her motivation in the final stretch of this work.
Abstract
Biological sequence data is becoming more accessible to researchers around the world. In particular, rich databases of protein and DNA sequences are already available to biologists, and their size is increasing every day. However, all this information needs to be processed and classified. Several bioinformatics algorithms, such as the Needleman-Wunsch and the Smith-Waterman algorithms, have been proposed for this purpose. Both consist of dynamic programming schemes whose structure allows parallelism to be exploited for better performance. In this context, this thesis proposes the integration of two previously presented parallel implementations: an adaptation of the SWIPE implementation for multi-core CPUs, which exploits SIMD vector instructions, and an implementation of the Smith-Waterman algorithm for GPU platforms (CUDASW++ 2.0). Accordingly, the presented work offers a unified solution that tries to take advantage of all the computational resources made available in heterogeneous platforms composed of CPUs and GPUs, by integrating a dynamic load balancing layer. The obtained results show that the attained speedup can reach values as high as 6x when executing on a quad-core CPU and two distinct GPUs.
In the remaining sections, we describe the main parallel computing architectures in use nowadays: the Central Processing Unit (CPU), the Graphics Processing Unit (GPU), and architectures that combine both, such as the Accelerated Processing Unit (APU). Finally, we present the parallel programming model used by NVIDIA GPUs, the Compute Unified Device Architecture (CUDA).
2.1 Flynn’s Taxonomy
In 1966, Michael J. Flynn proposed a simple model that is still used to categorize computers, taking into account the parallelism in instruction execution and in memory data accesses. Flynn looked at the parallelism in the instruction and data streams¹ called for by the instructions at the most constrained component of the multiprocessor, and placed all existing computers in four distinct categories [19], as defined below and presented in Table 2.1.
Table 2.1: Flynn's Taxonomy [5].
              | Single Instruction | Multiple Instruction
Single Data   | SISD               | MISD
Multiple Data | SIMD               | MIMD
1. Single instruction stream, single data stream (SISD): This category corresponds to the uniprocessor model. One example is the conventional sequential computer based on the Von Neumann architecture, i.e., a uniprocessor computer which can only perform one single instruction at a time.
2. Single instruction stream, multiple data streams (SIMD): The same instruction is executed by multiple processors using different data streams. SIMD computers exploit data-level parallelism by applying the same operations to multiple items of data in parallel. Many current CPUs support this kind of architecture through instruction set extensions. Examples are MMX, established by Intel [20], and the SSEx family of Streaming SIMD Extensions, an evolution of the MMX architecture. The Advanced Vector Extensions (AVX) extension is another SIMD extension proposed by Intel. This category also covers the programming model used in Graphics Processing Units (CUDA and OpenCL), described in Section 2.6. A small host-side sketch of SIMD execution is shown after this list.
¹The concept of stream refers to the sequence of data or instructions as seen by the machine during the execution of a program.
3. Multiple instruction streams, single data stream (MISD): This category indicates the use of
multiple independently executing functional units operating on a single stream of data, forwarding
the results from one functional unit to the next [5].
4. Multiple instruction streams, multiple data streams (MIMD): Each processor fetches its own
instructions and operates on its own dataset. This model exploits thread-level parallelism, since
multiple threads operate in parallel. Examples of this architecture are the current processors with
multi-threading support. Other examples are distributed systems and computer clusters.
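As a concrete illustration of data-level parallelism, the following host-side sketch (a hypothetical example, not taken from the thesis) uses the SSE intrinsics mentioned in category 2 to apply a single addition instruction to four data items at a time:

#include <xmmintrin.h>  /* SSE intrinsics */
#include <stdio.h>

/* Adds n floats (n a multiple of 4) from a and b into out, four lanes per instruction. */
static void add_f32_sse(const float *a, const float *b, float *out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);             /* load 4 floats (unaligned) */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  /* one SIMD add covers 4 items */
    }
}

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8}, b[8] = {8, 7, 6, 5, 4, 3, 2, 1}, out[8];
    add_f32_sse(a, b, out, 8);
    printf("%.0f %.0f\n", out[0], out[7]);  /* prints: 9 9 */
    return 0;
}

A scalar (SISD) version would need one add instruction per element; here each _mm_add_ps applies the same operation to four data items, which is exactly the SIMD idea.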
2.2 CPU - Central Processing Unit
The central processing unit (CPU) is the computer hardware unit responsible for interpreting and executing the program instructions. One of the first commercial CPU microprocessors was the Intel 4004, presented by Intel in 1971.
A CPU is usually composed of the following components [21]:
• Arithmetic Logic Unit (ALU) – responsible for the execution of logical and arithmetic operations;
• Control Unit – decodes instructions, fetches operands and controls the execution point;
• Registers – memory cells of the CPU that store the data needed to execute the instructions;
• CPU interconnection – communication channels among the control unit, the ALU, and the registers.
Nowadays, in order to reduce power consumption and to process multiple tasks simultaneously and more efficiently, commercial CPUs are built with multi-core technology, typically having between 4 and 16 execution cores. This way, a multi-core CPU can process 4 or more instructions at a time, in a MIMD fashion. Some solutions that take advantage of this parallel processing on Intel CPUs are presented in Section 3.4.
2.3 GPU - Graphics Processing Unit
A Graphics Processing Unit (GPU) is the processing unit present in every computer's graphics card. This unit is designed specifically for performing the complex mathematical and geometric calculations that are necessary for graphics rendering. Although GPUs were originally developed to process and display computer graphics, they have also been used for general-purpose computation, leading to the General-Purpose Computation on Graphics Hardware (GPGPU) paradigm. There are several frameworks that adapt GPU programming to this paradigm; the best known are OpenCL and NVIDIA's CUDA, presented in Section 2.6. Early approaches to computing on GPUs cast computations into a graphics framework, allocating buffers (arrays) and writing shaders (kernel functions).
GPUs provide massive parallel execution resources and high memory bandwidth. Among the most popular GPU-accelerated application areas, we can mention the research field, specifically:
• Higher education and supercomputing (numerical analytics, physics, and weather and climate forecasting, for example);
• Defense and intelligence applications (such as geospatial visualization).
Sequence alignment is a fundamental procedure in Bioinformatics, specifically used for molecular sequence analysis, which attempts to identify the maximally homologous subsequences among sets of long sequences [14]. In the scope of this thesis, the processing of biological sequences consisting of a single, continuous molecule of nucleic acid or protein was considered [33]. While DNA sequences can be expressed with four symbols (corresponding to the four nucleotides A, C, T and G), the amino acids in proteins are expressed with 22 symbols: A, B, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, Y, Z.
When comparing sequences, one looks for patterns that diverged from a common ancestor by a process of mutation and selection. According to Dewey et al. [34], the main objectives of sequence alignment are to establish input data for phylogenetic analysis, to determine the evolutionary history of a set of sequences, to discover a common motif³ in a set of sequences, to characterize a set of sequences, and to build profiles for database sequence searching.
The mutational processes considered in alignments are residue substitutions, residue insertions, and residue deletions. Insertions and deletions are commonly referred to as gaps [35].
The basic idea in aligning two sequences (of possibly different sizes) is to write one on top of the other and break them into smaller pieces by inserting spaces in one or the other, so that identical subsequences end up aligned in a one-to-one correspondence. Naturally, spaces are not inserted at the same position in both sequences. Figure 3.1 illustrates an alignment between the sequences A="ACAAGACAGCGT" and B="AGAACAAGGCGT".
Figure 3.1: Pairwise Alignment Example
In order to understand all the steps involved in the algorithms that will be presented in Sections 3.2 and 3.3, we need to go through some of the concepts employed: the scoring model and the concept of gap penalties (Section 3.1). After explaining these concepts, this chapter provides a brief overview of optimal sequence alignment algorithms (Section 3.2) and heuristic sequence alignment algorithms (Section 3.3).
Finally, taking into account the parallel architectures presented in Chapter 2, we present some implementations of sequence alignment on parallel architectures, based either on the CPU (Section 3.4.1) or on the GPU (Section 3.4.2).
3.1 Alignment Scoring Model
Many sequence alignment algorithms are based on a scoring model, which classifies the several matching and mismatching patterns according to predefined score values. The simplest approaches assign a positive constant value to a match between two residues. Alternatively,
³Sequence motifs are short, recurring patterns in DNA that are presumed to have a biological function.
instead of using fixed score values for a match in the alignment, biologists frequently use scoring schemes that take into account physicochemical properties or evolutionary knowledge of the sequences being aligned. This is common when protein sequences are compared. The best-known schemes are the Point Accepted Mutation (PAM) and Blocks Substitution Matrix (BLOSUM) alphabet-weight scoring schemes, which are usually implemented by a substitution matrix.
The BLOSUM matrices were developed by Henikoff & Henikoff, in 1992, to detect more distant relationships. In particular, BLOSUM50 and BLOSUM62 are widely used for pairwise alignment and database searching.
Substitution matrices also allow a negative score to be given to a mismatch, which is sometimes called an approximate or partial match.
Just like the score values, the gap penalty can be given by a constant value or by one of the following models. The gap open/start score, d, represents the cost of starting a gap, while the gap extension score, e, represents the cost of extending a gap by one more space. The standard cost associated with a gap of length g is given either by the linear score [35]:

γ(g) = −g·d    (3.1)

or by the affine score:

γ(g) = −d − (g − 1)·e    (3.2)

The gap-extension penalty e is usually set to a value less than the gap-open penalty d, allowing long insertions and deletions to be penalized less than they would be by the linear gap cost. This is desirable when gaps of a few residues are expected almost as frequently as gaps of a single residue [35].
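As a small worked illustration of the two gap models (not from the thesis; the values d = 10 and e = 1 are arbitrary example parameters):

#include <stdio.h>

/* Equation 3.1: linear gap cost. */
static int linear_gap(int g, int d)        { return -g * d; }
/* Equation 3.2: affine gap cost (open once, extend g-1 times). */
static int affine_gap(int g, int d, int e) { return -d - (g - 1) * e; }

int main(void) {
    int d = 10, e = 1;
    for (int g = 1; g <= 4; g++)
        printf("g=%d  linear=%d  affine=%d\n", g, linear_gap(g, d), affine_gap(g, d, e));
    /* A gap of length 4 costs -40 under the linear model but only -13 under the
       affine model, so long insertions and deletions are penalized much less. */
    return 0;
}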
3.2 Optimal Alignment Algorithms
The optimal alignment of two DNA or protein sequences is the alignment that maximizes the sum of pair-scores minus any penalty for the introduced gaps [35].
Optimal alignment algorithms include:
• Global alignment algorithms, which align every residue in both sequences. One example is the Needleman-Wunsch algorithm, presented in Section 3.2.1.
• Local alignment algorithms, which consider only part of the sequences and obtain the best subsequence alignments, i.e., the identification of common molecular subsequences [14]. One example is the Smith-Waterman algorithm, presented in Section 3.2.2.
3.2.1 Needleman-Wunsch Algorithm
In 1970, Needleman & Wunsch [36] proposed the following algorithm. Given two molecular sequences, A = a1a2...an and B = b1b2...bm, the goal is to return an alignment matrix H that indicates the optimal global-alignment score between both sequences.
In order to understand this algorithm, consider the following definitions:
• H(i, j) represents the similarity score of the two sequences A and B, ending at positions i and j;
• s(ai, bj) is the score for each aligned pair of residues. This value can be defined by a constant, or can be obtained using scoring matrices such as PAM or BLOSUM for protein sequences;
• Wk and Wl represent the gap penalties, according to the considered gap model.
Each matrix cell is filled with the maximum value that results from Equation 3.3:

H(i, j) = max { H(i−1, j−1) + s(ai, bj),  if ai and bj are similar symbols
                H(i−k, j) − Wk,           if ai is at the end of a deletion of length k
                H(i, j−l) − Wl,           if bj is at the end of a deletion of length l }    (3.3)
This equation is repeatedly applied in order to fill the matrix with the H(i, j) values, calculating the value in the bottom right-hand corner of each square of four cells from the remaining three values [36]. By definition, the value in the bottom-right cell of the entire matrix, H(n, m), corresponds to the best score for an alignment between A and B. Figure 3.2 illustrates the algorithm with the alignment between sequences A="AACGTT" and B="ATGTT". The obtained score is 13, and the best global alignment is indicated by the green arrows in the figure.
Figure 3.2: Needleman-Wunsch alignment matrix example
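A minimal sketch of the matrix-filling step follows, assuming a linear gap penalty (so the length-k gap terms of Equation 3.3 reduce to single-step recurrences with step d) and a simple match/mismatch score instead of a substitution matrix. This is an illustration only, not the thesis code, and the scoring parameters are arbitrary (they need not reproduce the score of 13 in Figure 3.2).

#include <stdio.h>
#include <string.h>

#define MAXLEN 64
static int max3(int a, int b, int c) { return a > b ? (a > c ? a : c) : (b > c ? b : c); }

/* Returns H(n, m): the optimal global alignment score of a and b. */
static int needleman_wunsch(const char *a, const char *b, int match, int mismatch, int d) {
    int n = (int)strlen(a), m = (int)strlen(b);
    static int H[MAXLEN + 1][MAXLEN + 1];
    for (int i = 0; i <= n; i++) H[i][0] = -i * d;  /* leading gaps in b */
    for (int j = 0; j <= m; j++) H[0][j] = -j * d;  /* leading gaps in a */
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            int s = (a[i-1] == b[j-1]) ? match : mismatch;
            H[i][j] = max3(H[i-1][j-1] + s,  /* align a_i with b_j */
                           H[i-1][j] - d,    /* gap in b */
                           H[i][j-1] - d);   /* gap in a */
        }
    return H[n][m];
}

int main(void) {
    /* Sequences from Figure 3.2; match/mismatch/gap values here are illustrative only. */
    printf("score = %d\n", needleman_wunsch("AACGTT", "ATGTT", 3, -1, 2));
    return 0;
}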
3.2.2 Smith-Waterman Algorithm
In 1981, Smith and Waterman [14] proposed a dynamic programming algorithm⁴ that computes the similarity scores corresponding to the maximally homologous subsequences among sets of long sequences. Given two sequences A = a1a2...an and B = b1b2...bm, the goal of this algorithm is to return an alignment matrix H that indicates the optimal local alignments between both sequences. For each cell, this algorithm computes the similarity value between the current symbol of sequence A and the current symbol of sequence B. This algorithm has some data dependencies, since each cell of the alignment matrix depends on its left, upper and upper-left neighbors.
In this algorithm, we consider the same definitions of H(i, j), s(ai, bj), Wk and Wl used in the Needleman-Wunsch algorithm (Section 3.2.1).
⁴Dynamic programming is a programming method that solves problems by combining the solutions to their subproblems [37].
Receiving the sequences A and B as input, this algorithm begins with the initialization of the first
column and the first row, which is given by:
H(k, 0) = H(0, l) = 0, for 0 ≤ k ≤ n and 0 ≤ l ≤ m    (3.4)
Then the algorithm computes the similarity score H(i, j) by using the following equation:
H(i, j) = max { H(i−1, j−1) + s(ai, bj),  if ai and bj are similar symbols
                H(i−k, j) − Wk,           if ai is at the end of a deletion of length k
                H(i, j−l) − Wl,           if bj is at the end of a deletion of length l
                0,                        otherwise }    (3.5)
The output of the algorithm is the optimal local alignment of sequences A and B with maximum score. Unlike the Needleman-Wunsch algorithm, the Smith-Waterman algorithm always produces matrix scores greater than or equal to 0.
In order to obtain all the optimal local alignments between sequences A and B, a trace-back procedure starts from the highest score in the whole matrix and ends at a cell with score 0.
Figure 3.3 presents the optimal local alignments between sequence A = WPCIWWPC and sequence B = IIWPC. In this example, the BLOSUM50 scoring matrix is used to obtain the s(ai, bj) values, and the gap penalty is −5. The optimal local alignments between sequences A and B are represented by the cells with a green background; these alignments occur between the subsequences WPC of A and WPC of B.
Figure 3.3: Smith-Waterman alignment matrix example
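The sketch below highlights the two changes with respect to the Needleman-Wunsch fill shown earlier: scores are clamped at 0 (Equation 3.5) and the best local score is the maximum over the whole matrix, where the traceback would start. It is an illustration only; a simple match/mismatch score replaces the BLOSUM50 matrix of Figure 3.3, so the value differs from the figure.

#include <stdio.h>
#include <string.h>

#define MAXLEN 64
static int max2(int a, int b) { return a > b ? a : b; }

static int smith_waterman(const char *a, const char *b, int match, int mismatch, int d) {
    int n = (int)strlen(a), m = (int)strlen(b), best = 0;
    static int H[MAXLEN + 1][MAXLEN + 1];  /* first row/column zeroed (Equation 3.4) */
    memset(H, 0, sizeof H);
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            int s = (a[i-1] == b[j-1]) ? match : mismatch;
            int h = max2(H[i-1][j-1] + s, max2(H[i-1][j] - d, H[i][j-1] - d));
            H[i][j] = max2(h, 0);         /* never below zero */
            best = max2(best, H[i][j]);   /* traceback starts from this cell */
        }
    return best;
}

int main(void) {
    printf("best local score = %d\n", smith_waterman("WPCIWWPC", "IIWPC", 3, -1, 5));
    return 0;
}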
3.3 Heuristic Sub-Optimal Algorithms
Although providing optimal solutions, the described algorithms are characterized by a quadratic complexity O(mn), where m is the size of sequence A and n the size of sequence B. This becomes evident on large databases with a high number of residues. The current UniProt Swiss-Prot [12] protein database contains hundreds of millions of residues; for a query sequence of length one thousand, approximately 10^11 matrix cells must be evaluated to search the complete database. At ten million matrix cells per second, which is reasonable for a single workstation at the time of writing, this would take 10,000 seconds, i.e., around three hours [35].
Heuristic algorithms address this issue at the expense of not guaranteeing to find the optimal solution. Examples of such algorithms are FASTA and BLAST, presented in Sections 3.3.1 and 3.3.2.
3.3.1 FASTA
The FASTA algorithm (also known as "fast A", which stands for "FAST-All") was presented by Pearson & Lipman in 1985 [38] and further improved in 1988 [39]. This algorithm builds local high-scoring alignments with a multistep approach, starting from exact short word matches, through maximal scoring ungapped extensions, to finally identify gapped alignments.
This algorithm can be described in four steps [35]:
• Step 1 (Figure 3.4): locate all identically matching words of length ktup (the parameter that specifies the word size) between the two sequences. For proteins, ktup is typically 1 or 2; for DNA it may be 4 or 6. The algorithm then looks for diagonals with many mutually supporting word matches.
Figure 3.4: FASTA algorithm step 1.
• Step 2 (Figure 3.5): search for the best diagonals, extending the exact word matches to find
maximal scoring ungapped regions (and, in the process, possibly joining together several seed
matches).
• Step 3 (Figure 3.6): check if any of these ungapped regions can be joined by a gapped region,
allowing for gap costs.
• Step 4 (Figure 3.7): the highest scoring candidate matches in a database search are realigned using the full dynamic programming algorithm, but restricted to a subregion of the dynamic programming matrix forming a band around the candidate heuristic match. This step uses a standard dynamic programming algorithm, such as Needleman-Wunsch or Smith-Waterman, to get the final scores.
Figure 3.5: FASTA algorithm step 2.
Figure 3.6: FASTA algorithm step 3.
Figure 3.7: FASTA algorithm step 4.
There is a tradeoff between speed and sensitivity in the choice of the ktup parameter: higher values of ktup are faster, but more likely to miss true significant matches. To achieve sensitivities close to those of the optimal algorithms for protein sequences, ktup needs to be set to 1.
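As an illustration of Step 1, the sketch below (hypothetical data layout and naming, not the FASTA implementation) indexes all ktup-words of a DNA query with ktup = 2 and counts word hits per diagonal i − j of the comparison matrix, using the example sequences of Figure 3.1:

#include <stdio.h>
#include <string.h>

#define KTUP  2
#define ALPHA 4  /* A, C, G, T */

static int code(char c) { return c == 'A' ? 0 : c == 'C' ? 1 : c == 'G' ? 2 : 3; }

int main(void) {
    const char *query = "ACAAGACAGCGT", *db = "AGAACAAGGCGT";
    int nq = (int)strlen(query), nd = (int)strlen(db);

    /* Bucket the query positions of each of the 4^KTUP = 16 possible words. */
    int first[ALPHA * ALPHA], next[64];
    memset(first, -1, sizeof first);
    for (int i = 0; i + KTUP <= nq; i++) {
        int w = code(query[i]) * ALPHA + code(query[i + 1]);
        next[i] = first[w];
        first[w] = i;  /* push position i onto word w's list */
    }

    /* Count mutually supporting word matches per diagonal d = i - j. */
    int diag[128] = {0};
    for (int j = 0; j + KTUP <= nd; j++) {
        int w = code(db[j]) * ALPHA + code(db[j + 1]);
        for (int i = first[w]; i != -1; i = next[i])
            diag[i - j + nd]++;  /* offset by nd to keep the index non-negative */
    }
    for (int d = 0; d < nd + nq; d++)
        if (diag[d] > 2)
            printf("diagonal %d: %d word hits\n", d - nd, diag[d]);
    return 0;
}

The diagonals with many hits are the candidates that Step 2 extends into maximal scoring ungapped regions.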
3.3.2 BLAST - Basic Local Alignment Search Tool
The Basic Local Alignment Search Tool (BLAST) was presented by Altschul et al. in 1990 [40], and finds regions of local similarity between sequences. The program compares nucleotide or protein sequences against sequence databases and calculates the statistical significance of the matches. BLAST can be used to infer functional and evolutionary relationships between sequences, as well as to help identify members of gene families [9]. This algorithm is most effective with polypeptide⁵ sequences and uses a scoring matrix (BLOSUM, PAM, etc.) to find the maximal segment pair (MSP) for two sequences, defined as locally optimal if its score cannot be improved either by lengthening or by shortening the segment pair. This algorithm is the most widely used for protein-coding sequence alignment⁶.
⁵Short chains of amino acid monomers linked by peptide (amide) bonds.
⁶http://cmns.umd.edu/
The BLAST algorithm steps are [40]:
1. Compile a list of high-scoring words
• Given a length parameter w and a threshold parameter T, find all the w-length substrings (words) of the database sequences that align with words from the query with an alignment score higher than T. This is called a hit in BLAST.
• Discard the words that score below T (these are assumed to carry too little information to be useful starting seeds).
2. Scan the database for hits
• When T is high, the search will be rapid, but potentially informative matches will be missed.
3. Extend the hits
• Attempt to extend each match to see if it is part of a longer segment that scores above the MSP score S.
• Report only those hits that yield a score above S.
From the score S it is also possible to calculate an expectation score E, which is an estimate of how many local alignments of at least this score would be expected, given the characteristics of the query sequence and of the database.
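The thesis does not give the expression for E; for ungapped local alignments, the standard Karlin-Altschul estimate (stated here as background, not taken from the source) is

E = K·m·n·e^(−λS)

where m and n are the lengths of the query and of the database, and K and λ are parameters determined by the scoring system and the sequence composition. Doubling the database size doubles E, while increasing the score S decreases it exponentially.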
The original BLAST did not permit gaps, so it found relatively short regions of similarity, and it was often necessary to extend the alignment manually or with a second alignment tool.
3.4 Parallel Implementations
Smith-Waterman is the best-known algorithm in this context, and it has been explored in many software implementations, each improving the execution times and optimizing the parallelization method.
Regarding the parallelization method, the implementations presented in this section take somewhat different approaches to parallelism, and they can be grouped according to their level of parallelism [2]:
• Coarse-Grained Parallelism: an example of this kind of parallelism is the master/worker model adopted in our work, where a single processor, named master, sends work to the workers. In parallel sequence alignment, the sequence database is split into n parts, each processed by a different worker.
In Section 3.4, a set of parallel implementations of the Smith-Waterman algorithm proposed in recent years was presented. Considering two of those solutions, CUDASW++ 2.0 by Liu et al. [17] and Pedro Monteiro's SWIPE extension [2] presented in Section 3.4.1, our work proposes an efficient parallel implementation of the Smith-Waterman algorithm, named MultiSW (Section 3.4.2.B).
This implementation consists of the orchestration of both applications' execution modules in a single solution, exploiting multiple CPU cores and the NVIDIA GPUs that may be available on the running machine, in a heterogeneous approach, as presented in Figure 4.1. Each one of the modules is called a worker, so we have the CPU workers (Section 4.2.1) and the GPU workers (Section 4.2.2). The MultiSW application includes a load balancing abstraction layer, in order to efficiently split the database sequences during the execution; this layer is explained in Section 4.5. Another implemented optimization is a wrapper⁸ function for the CPU worker execution (Section 4.2.1.A), proposed in order to improve the CPU worker execution time. Besides these improvements on the CPU side, several optimizations were also implemented in the GPU worker (Section 4.2.2).
At startup, the proposed MultiSW application receives multiple command-line arguments specifying the running parameters. It then prepares all the execution structures (presented in Section 4.4.3) and coordinates the execution of all available work among the workers (specified at invocation time). This coordination process is referred to as the orchestration process.
Figure 4.1: Heterogeneous Architecture
This way, multiple parallelization techniques are considered in a single software solution, in a medium-grained parallelization approach where multiple database sequences are processed simultaneously, as will be explained in Section 4.2.
In this kind of application, the main objective is to process all data in the minimum execution time, leading to the maximum execution speedup (a concept explained in Section 5.2). Considering both
⁸A wrapper function is a subroutine in a software library or a computer program whose main purpose is to call a second subroutine or a system call with little or no additional computation.
execution workers, the execution time is directly related to the amount of data (database sequences) processed in each iteration. Due to the base implementations considered in the implementation of MultiSW, it was necessary to create several auxiliary processing structures (see Section 4.4.3). Section 4.2 presents the architecture of this solution and the adaptation of the existing solutions that enables it. To improve MultiSW, Section 4.5 presents a model that changes this block size between run-time iterations, in order to minimize the application's total execution time.
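The details of that model are given in Section 4.5. As a purely illustrative sketch (an assumption of this text, not the actual policy of Section 4.5), a feedback rule consistent with the behaviour reported in Chapter 5, where block sizes shrink during the run so that all workers' iteration times stay similar, could look like:

/* Hypothetical block-size feedback rule, NOT the model of Section 4.5: scale a
   worker's block so that its next iteration approaches a common target time. */
static long adjust_block_size(long block, double last_iter_time,
                              double target_time, long min_block) {
    long next = (long)((double)block * (target_time / last_iter_time));
    return next < min_block ? min_block : next;
}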
4.2 Architecture
The solution's architecture is presented in Figure 4.2. The orchestration can be considered the application's core: it invokes the CPU and GPU implementations to execute work that consists of processing alignments between the database sequences and the query sequence. Both workers are adapted to this thesis's solution from the considered applications, Pedro Monteiro's solution (Section 3.4.1.D) and CUDASW++ 2.0; this adaptation is explained below.
Figure 4.2: MultiSW block diagram (the orchestration connects the CPU module, with its CPU wrapper, the GPU module and the load balancing module, which get work from the database sequences).
In order to adapt both solutions to this work, the considered model was the master/worker model originally proposed by Pedro Monteiro in the SWIPE extension [2]. A possible representation of this execution model is shown in Figure 4.3. The split of the database into multiple chunks represents the inter-task parallelization model introduced by Pedro Monteiro in his solution.
During the execution, all running workers (the CPU worker and the GPU workers) repeatedly obtain new work to process by invoking the function get_fasta_sequences(). This function loads the next database sequences to process from the database sequences file specified at application run time. Access to this function is protected by a pthread_mutex_t, to ensure that only one worker can obtain sequences at a time. The worker then gets the respective processing block from the profile_seqs structure. GPU workers use the processing block structure presented in Section 4.4.3.
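The following sketch shows the mutex-protected work distribution pattern described above. It is illustrative only: the real get_fasta_sequences() reads FASTA records from the database file, which is not reproduced here.

#include <pthread.h>

static pthread_mutex_t db_lock = PTHREAD_MUTEX_INITIALIZER;
static long next_seq = 0;
static long total_seqs = 2712515;  /* total from the experimental dataset (Section 5.1.1) */

/* Every worker (CPU or GPU) calls this with its own block size; the mutex
   guarantees that only one worker reads from the database at a time. */
static long get_fasta_sequences(long block_size, long *first) {
    pthread_mutex_lock(&db_lock);
    long remaining = total_seqs - next_seq;
    long got = remaining < block_size ? remaining : block_size;
    *first = next_seq;   /* the real code loads 'got' sequences from the FASTA file here */
    next_seq += got;
    pthread_mutex_unlock(&db_lock);
    return got;          /* 0 means no work left and the worker terminates */
}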
Figure 4.3: Master Worker Model [2]
Sections 4.2.1 and 4.2.2 present the adaptations of each existing application, so that the original code can be used by the orchestration implementation. A CPU wrapper function is also presented, used to minimize the accesses by multiple threads to the global shared variables that need synchronization among all execution threads.
4.2.1 CPU Worker
The CPU worker of our work consists of the adaptation of Pedro Monteiro's solution [2], transforming the master of the original master/worker model into one of our workers, since in the original implementation the master thread controls the whole execution and creates new processing work for the workers. In the original implementation, the master thread creates processing blocks of 16 sequences, blocking the other workers' access to the get_fasta_sequences() function (explained above) in every execution iteration. This represents an efficiency problem in the final solution, because of the low parallelization level over the database sequences, so a CPU wrapper function was developed in our work (Section 4.2.1.A) to avoid this problem.
The architecture of the original solution was not changed, and the application still works as represented in Figure 4.4.
Figure 4.4: Master Worker Model [2]
So, in our implementation, the worker itself creates the processing blocks. It gets the database sequences from the CPU wrapper function, and then it creates the 16-sequence database blocks to be inserted in the processing queue. Besides the CPU wrapper implementation, some of the initialization functions were adapted, since the database file format originally considered was the BLAST sequence type [48], whereas our implementation works with the FASTA [49] database file format; the initialization functions were therefore changed to support this different file format.
4.2.1.A CPU Wrapper
When workers get a new execution block, it is necessary to guarantee that the method which obtains the database sequences does not block the access of the other execution workers. In the CPU implementation, this is implemented using one mutex that blocks every concurrent access to these variables. Pedro Monteiro's implementation [2] considers executable blocks of only 16 sequences, and the getwork() method (explained above) that fetches those sequences was blocking the access of the other workers while getting that information. So, to prevent the CPU worker from getting only 16 sequences at a time and blocking the other workers, this work introduces a CPU wrapper function that fetches a bigger block (the default value is 30,000 sequences), avoiding making the other workers wait across several accesses. After that, the CPU worker creates processing blocks from the block obtained by this wrapper, taking 16 sequences from it at a time (as shown in Figure 4.5).
Figure 4.5: CPU Wrapper function (the wrapper fetches 30,000 sequences from the database at once; the CPU worker then repeatedly gets 16-sequence blocks from the wrapper).
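A sketch of the wrapper policy follows. Names and structure are illustrative, reusing the hypothetical get_fasta_sequences() sketched in Section 4.2; this is not the MultiSW source.

#define WRAPPER_BLOCK 30000  /* default large fetch, as described above */
#define CPU_BLOCK     16     /* block size consumed by the CPU worker */

long get_fasta_sequences(long block_size, long *first);  /* see the earlier sketch */

typedef struct { long first, count; } seq_block_t;

static long buf_first = 0, buf_count = 0;  /* wrapper-local buffer state */

/* Hands a 16-sequence block to the CPU worker, refilling from the shared
   database (one mutex-protected access) only once per 30,000 sequences. */
static int cpu_wrapper_get(seq_block_t *out) {
    if (buf_count == 0)
        buf_count = get_fasta_sequences(WRAPPER_BLOCK, &buf_first);
    if (buf_count == 0)
        return 0;  /* database exhausted */
    out->first = buf_first;
    out->count = buf_count < CPU_BLOCK ? buf_count : CPU_BLOCK;
    buf_first += out->count;
    buf_count -= out->count;
    return 1;
}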
4.2.2 GPU Worker
The GPU module considers several GPU workers, each one assigned to a physical NVIDIA GPU device. The number of GPUs to use is specified on the command line at run time. The application creates a CPU pthread for each one of the considered GPUs. This thread runs a function named gpu_worker(), which gets the database sequences to process from the get_fasta_sequences() function and runs all the preparation and execution flows of the original CUDASW++ implementation [17]. Liu et al.'s solution works with FASTA sequences, so it was not necessary to change the sequence preparation functions.
To minimize the application execution time, some optimizations that reduce the execution time of each worker iteration are presented. A CUDA stream is "a sequence of operations that execute on the device in the order in which they are issued by the host code. While operations within a stream are guaranteed to execute in the prescribed order, operations in different streams can be interleaved and, when possible, they can even run concurrently" [50]. Using CUDA streams, the memory transfers between the host and the device can be made asynchronous (Section 4.2.2.A). With streams it is also possible to parallelize the execution of kernels (Section 4.2.2.B). Besides these, the loading of the next sequences to process is done in parallel with the execution of kernels on the device side (Section 4.2.2.C).
4.2.2.A Asynchronous Transfers
By creating CUDA streams, assigning them to data transfers, and changing the memory transfers to asynchronous (appending Async to the name of the transfer instruction), the data transfers between the host and the device can overlap with other work.
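The following sketch shows the asynchronous-transfer pattern under stated assumptions: the kernel, buffer sizes and function names are placeholders (not the CUDASW++ code), and the host buffer must be pinned with cudaMallocHost for the copy to be truly asynchronous.

#include <cuda_runtime.h>
#include <string.h>

__global__ void score_kernel(const char *seqs, int n, int *scores) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) scores[i] = 0;  /* placeholder for the real alignment kernel */
}

void process_block_async(const char *h_src, int n, cudaStream_t stream) {
    char *h_pinned, *d_seqs;
    int *d_scores;
    cudaMallocHost((void **)&h_pinned, n);           /* pinned host memory */
    cudaMalloc((void **)&d_seqs, n);
    cudaMalloc((void **)&d_scores, n * sizeof(int));
    memcpy(h_pinned, h_src, n);

    /* The copy and the kernel are queued on the same stream: they run in order
       with respect to each other, but may overlap with work in other streams. */
    cudaMemcpyAsync(d_seqs, h_pinned, n, cudaMemcpyHostToDevice, stream);
    score_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_seqs, n, d_scores);
    cudaStreamSynchronize(stream);                   /* wait for this stream only */

    cudaFree(d_scores);
    cudaFree(d_seqs);
    cudaFreeHost(h_pinned);
}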
The code was compiled for a 64-bit Linux operating system using the Intel C compiler version 13.1.3
and the NVIDIA Compiler release 6.5.
When comparing the used GPUs, it is easy to identify which one will obtain the best results. The GeForce GTX 780 Ti has more CUDA cores (2880) than the GeForce GTX 660 Ti (1344), so it can run the kernels with more parallelism. Another big difference is the memory bandwidth, which is 336 GB/s on the GTX 780 Ti, whereas on the GTX 660 Ti it is only 144.2 GB/s, less than half. The memory interface width is also different: 384 bits on the former and 192 bits on the latter. So it is expected that the first GPU runs the kernel functions faster and transfers the data more quickly than the second GPU.
5.1.1 Experimental Dataset
The query sequence used in the experimental scenarios was the IFNA6 interferon, alpha 6 [Homo sapiens (human)] [51], with 189 residues.
The considered database was release 2014_02 of the UniProtKB/Swiss-Prot [52] database in the FASTA format, repeated 5 times in the file. This database contains 542,503 sequences of various sizes, comprising 192,888,369 amino acids abstracted from 226,190 references. The total number of processed sequences is therefore 2,712,515.
5.2 Evaluating Metrics
In order to compare the considered scenarios, the speedup metric will be used. This metric measures how much faster an optimized implementation is than the base implementation. It is given by Equation 5.1:

speedup = t_sequential / t_parallel    (5.1)

For example, with the single-core time of 31.52 seconds (Scenario A) and the full heterogeneous time of 4.957 seconds (Scenario F), the speedup is 31.52 / 4.957 ≈ 6.36.
5.3 Results
This section presents multiple scenarios and their results when running the application with various execution parameter configurations. It starts with the simplest scenario, corresponding to a single CPU core execution, and finishes with the most complex configuration, an orchestration of workers based on a multicore CPU and multiple GPUs that processes all the available work. The execution block sizes for each kind of worker were pre-adjusted, by running with several block size configurations before obtaining the experimental results, in order to achieve the best overall execution times.
Each execution scenario was executed ten times, and the presented results correspond to the average of the times of these executions. An iteration execution represents the time the application spends executing the block size defined for the execution worker.
In each presented scenario, for the global orchestration, the CPU execution worker uses the CPU wrapper module presented in Section 4.2.1.A, regardless of whether the execution is done with one or four CPU cores. The iteration time may vary, because the processed sequences have different sizes.
The block sizes for the experimental results were adjusted for the CPU and the GPU by varying the block sizes and checking the best execution times using only one CPU core and a single GPU. The obtained CPU block size was 30,000 sequences and the GPU default block size was 65,000 sequences.
5.3.1 Scenario A - Single CPU core
Considering a single CPU core execution, the total execution time was about 31.52 seconds, as shown in Figure 5.1.
Figure 5.1: Processing times considering a single CPU core execution and a processing block with 30,000 sequences (total execution time: 31.52 seconds).
The multiple grey-colored blocks represent the execution of each CPU wrapper iteration, considering its size of 30,000 sequences. These iteration execution times vary between 0.0688 and 2.391 seconds and together make up the total execution time of about 31.52 seconds. Between iteration executions, the preparation time is about 0.0009 seconds and is not visible in the figure. The difference in the iteration execution times is explained by the different sequence sizes in each iteration: for bigger sequences, the iteration execution time is longer.
5.3.2 Scenario B - Four CPU cores
Considering a 4-core CPU execution, the total execution time was about 15.55 seconds, as shown in Figure 5.2.
Figure 5.2: Processing times for 4 CPU cores, considering a block size of 30,000 sequences (total execution time: 15.55 seconds).
The distinct grey-colored blocks represent the processing time of a block of 30,000 sequences (one CPU wrapper iteration) by the four CPU cores. These iteration values vary between 0.046 and 0.957 seconds.
The total execution time was about 15.55 seconds. The reason why the solution with four CPU cores is not four times faster than the single CPU core is the synchronization between multiple threads and the data partitioning and organization times; because of these, the obtained speedup was not linear, as might otherwise be expected.
5.3.3 Scenario C - Single GPU - GeForce GTX 780 Ti
Considering a single GeForce GTX 780 Ti GPU execution, the total execution time was 6.35 seconds, as shown in Figure 5.3.
Figure 5.3: Processing times for a single GPU in Machine A, considering a block size of 65,000 sequences. Total execution time about 6.35 seconds.
The figure presents several grey-colored execution blocks, each representing the time to process 65,000 database sequences against the query sequence. These iteration values vary between 0.118 and 0.266 seconds.
Considering the several optimizations mentioned in Section 4.2.2, especially the use of the CUDA streams provided by NVIDIA in its framework, it is possible to significantly reduce the preparation time between iterations and obtain the best overall execution times.
5.3.4 Scenario D - Single GPU - GeForce GTX 660 Ti
Considering a single GeForce GTX 660 Ti GPU execution, the total execution time was about 7.38 seconds, as shown in Figure 5.4.
Figure 5.4: Processing times for a single GPU in Machine B, considering a block size of 65,000 sequences. Total execution time about 7.38 seconds.
The figure presents several grey-colored execution blocks, each representing the time to process 65,000 database sequences against the query sequence. The total execution time was 7.38 seconds, with iteration values varying between 0.126 and 0.304 seconds.
5.3.5 Scenario E - Four CPU cores + Single GPU Execution
In this scenario, the considered workers are the four CPU cores and the GeForce GTX 780 Ti GPU. The execution time was about 6.112 seconds, as shown in Figure 5.5.
Figure 5.5: Processing times for 4 CPU cores and a GeForce GTX 780 Ti GPU, considering CPU blocks of 30,000 sequences and GPU blocks of 65,000 sequences. Total execution time was 6.112 seconds.
This time is better than would be obtained with the GeForce GTX 660 Ti in its place, because the considered GPU was the GeForce GTX 780 Ti, which executes faster than the GeForce GTX 660 Ti, as presented in Scenarios C and D.
Figure 5.6 presents the number of sequences processed by each kind of worker. The CPU worker processed 817,087 sequences, while the GPU worker processed 1,895,428 sequences.
Figure 5.6: Number of sequences processed by the CPU cores (817,087) and the GPU (1,895,428).
The orchestration represented in this scenario is better than the single GPU execution, but a linear speedup was not achieved, since the synchronization points increase with the number of workers in the orchestration.
Figure 5.5 also presents the dynamic block size along the time; these values are shown next to the corresponding execution blocks of the GPU worker and the CPU worker. The CPU worker starts with a block size of 30,000 and finishes with a size of 15,000. The GPU worker starts with a block size of 65,000 sequences and finishes with a size of 40,000 sequences. For both workers, the number of sequences to process next decreases along the execution time, which is how the load balancing module works. The iteration execution times for the GPU worker vary between 0.082 and 0.316 seconds; for the CPU worker, they range between 0.072 and 0.258 seconds.
5.3.6 Scenario F - Four CPU cores + Double GPUs Execution
The last scenario is composed of the 4-core CPU execution and both available GPUs: the GeForce GTX 780 Ti (GPU A) and the GeForce GTX 660 Ti (GPU B).
As expected, this execution was the fastest one, although not the most efficient, taking about 4.957 seconds, as shown in Figure 5.7.
Figure 5.7: Processing times for the 4-core CPU, GPU A and GPU B, considering an initial block size of 30,000 sequences for the CPU worker and of 65,000 for the GPU workers. Total execution time of 4.957 seconds. Next to some of the iteration blocks the newly adjusted block size is shown (GPU: 65,000, 58,633, 58,415, 56,279, 40,000; CPU: 30,000, 33,000, 25,387, 17,843, 16,372, 15,000).
Figure 5.7 shows the execution blocks for the three workers. The CPU worker starts with a block size of 30,000 and finishes with a block size of 15,000; its execution times range from 0.06 to 0.379 seconds. The execution times for the GPU A worker range between 0.067 and 0.394 seconds. Finally, for the GPU B worker, the execution times range between 0.067 and 0.520 seconds.
The number of sequences computed by each worker is presented in Figure 5.8. The CPU worker processed 411,817 sequences, the GPU A worker computed 1,241,496 sequences, and the GPU B worker processed 1,059,202 sequences. The lower quantity processed by the CPU worker is due to its block size being smaller than the GPU workers' block sizes, which improves overall performance.
Figure 5.8: Number of sequences processed by the CPU (411,817), GPU A (1,241,496), and GPU B (1,059,202) workers.
5.4 Summary
As shown in Table 5.1, considering the multiple scenarios, Scenario F, presented in Section 5.3.6, achieved a speedup of 6.36x when compared with the single-core CPU execution presented in Scenario A (Section 5.3.1).
Configuration                          | Execution Time (s) | Speedup
Single core                            | 31.52              | –
Four cores                             | 15.55              | 2.03
GeForce GTX 780 Ti                     | 6.350              | 4.96
GeForce GTX 660 Ti                     | 7.380              | 4.271
Four CPU cores + GeForce GTX 780 Ti    | 6.112              | 5.16
Four CPU cores + 2 GPUs                | 4.96               | 6.36

Table 5.1: Execution Speedups.
Increasing the number of workers in the orchestration of our work also increases the synchronization needed between the involved threads. This causes execution delays and makes the workers wait longer. This situation is minimized by the load balancing layer included in our solution, since the block sizes are adapted so that the workers' iteration times remain similar. However, there are some limitations in the load balancing module, since the total number of sequences to process is not known at the beginning of the application.
Despite these limitations, as can be verified in Table 5.1 for the different execution scenarios, the orchestration attained relatively good speedups as new workers were included in the execution.