Efficient GPU implementation of bioinformatics
applications
Nuno Miguel Trindade Marcos
Thesis to obtain the Master of Science Degree in
Information Systems and Computer Engineering
Supervisors: Prof. Nuno Filipe Valentim Roma, Prof. Pedro Filipe Zeferino Tomás
Examination Committee:
Chairperson: Prof. José Carlos Martins Delgado
Supervisor: Prof. Nuno Filipe Valentim Roma
Member of the Committee: Prof. David Manuel Martins de Matos
November 2014
Acknowledgements
First of all, I would like to thank Professors Nuno Roma and Pedro Tomás, my supervisors in this work, for their guidance and collaboration. Without their patience and constant tolerance it would not have been possible to finish this work. To them, a heartfelt thank you.
Next, I would like to thank the colleagues who accompanied me throughout the course, especially David Gaspar for his encouragement and for all the moments we shared, Jhonny Aldeia for all his help during the course, and Dionisio Sousa, Rui Mestre and Artur Ferreira, who helped me and gave me the motivation to keep going. In addition, a special thanks to my friend Pedro Monteiro for the help with his work and for the constant encouragement.
I would also like to thank my friends Daniela Coelho, Miguel Matos, David Dias, João Velez and Pedro Chagas for all their support during this work, and my godson Tiago Carreira for repeatedly insisting that I finish this work and for all his help throughout it.
Besides them, I would also like to thank my colleagues at Premium Minds, who were always available to help me and to take over things in my absence, especially Márcio Nóbrega, André Soares, Renil Lacmane and Afonso Vilela.
Last, but most importantly, I would like to thank my parents and my brother for all the strength and motivation that allowed me to reach the end of this work, and especially my girlfriend Ana Daniela for her motivation in the final stretch of this work.
Abstract
Biological sequence data is becoming more accessible to researchers around the world. In particular, rich databases of protein and DNA sequences are already available to biologists, and their size is increasing every day. However, all this information needs to be processed and classified. Several bioinformatics algorithms, such as the Needleman-Wunsch and the Smith-Waterman algorithms, have been proposed for this purpose. Both consist of dynamic programming schemes whose structure allows parallelism to be exploited for better performance. In this context, this thesis proposes the integration of two previously presented parallel implementations: an adaptation of the SWIPE implementation for multi-core CPUs, which exploits SIMD vector instructions, and an implementation of the Smith-Waterman algorithm for GPU platforms (CUDASW++ 2.0). Accordingly, the presented work offers a unified solution that tries to take advantage of all the computational resources made available in heterogeneous platforms composed of CPUs and GPUs, by integrating a dynamic load balancing layer. The obtained results show that the attained speedup can reach values as high as 6x when executing on a quad-core CPU and two distinct GPUs.
In the remaining sections, we describe the main parallel computing architectures in use nowadays: the Central Processing Unit (CPU), the Graphics Processing Unit (GPU), and architectures that combine both, such as the Accelerated Processing Unit (APU). Finally, we present the parallel programming model used by NVIDIA GPUs, the Compute Unified Device Architecture (CUDA).
2.1 Flynn’s Taxonomy
In 1966, Michael J. Flynn proposed a simple model that is still used to categorize computers, taking into account the parallelism in instruction execution and in memory data accesses. Flynn looked at the parallelism in the instruction and data streams¹ called for by the instructions at the most constrained component of the multiprocessor, and placed all existing computers in four distinct categories [19], as defined below and presented in Table 2.1.
Table 2.1: Flynn's Taxonomy [5].
              | Single Instruction | Multiple Instruction
Single Data   | SISD               | MISD
Multiple Data | SIMD               | MIMD
1. Single instruction stream, single data stream (SISD): This category corresponds to the uniprocessor model. One example is the conventional sequential computer based on the Von Neumann architecture, i.e., a uniprocessor computer which can only perform one single instruction at a time.
2. Single instruction stream, multiple data streams (SIMD): The same instruction is executed by multiple processors using different data streams. SIMD computers exploit data-level parallelism by applying the same operations to multiple items of data in parallel. Many current CPUs support this kind of architecture through instruction set extensions. Examples are MMX, established by Intel [20], and the SSEx family of Streaming SIMD Extensions, an evolution of the MMX architecture. The Advanced Vector Extensions (AVX) extension is another SIMD extension proposed by Intel. This category also covers the programming model used in Graphics Processing Units (CUDA and OpenCL), described in Section 2.6. A small host-side sketch of SIMD execution is shown after this list.
¹The concept of stream refers to the sequence of data or instructions as seen by the machine during the execution of a program.
3. Multiple instruction streams, single data stream (MISD): This category indicates the use of
multiple independently executing functional units operating on a single stream of data, forwarding
the results from one functional unit to the next [5].
4. Multiple instruction streams, multiple data streams (MIMD): Each processor fetches its own
instructions and operates on its own dataset. This model exploits thread-level parallelism, since
multiple threads operate in parallel. Examples of this architecture are the current processors with
multi-threading support. Other examples are distributed systems and computer clusters.
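As a concrete illustration of data-level parallelism, the following host-side sketch (a hypothetical example, not taken from the thesis) uses the SSE intrinsics mentioned in category 2 to apply a single addition instruction to four data items at a time:

#include <xmmintrin.h>  /* SSE intrinsics */
#include <stdio.h>

/* Adds n floats (n a multiple of 4) from a and b into out, four lanes per instruction. */
static void add_f32_sse(const float *a, const float *b, float *out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);             /* load 4 floats (unaligned) */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  /* one SIMD add covers 4 items */
    }
}

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8}, b[8] = {8, 7, 6, 5, 4, 3, 2, 1}, out[8];
    add_f32_sse(a, b, out, 8);
    printf("%.0f %.0f\n", out[0], out[7]);  /* prints: 9 9 */
    return 0;
}

A scalar (SISD) version would need one add instruction per element; here each _mm_add_ps applies the same operation to four data items, which is exactly the SIMD idea.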
2.2 CPU - Central Processing Unit
The central processing unit (CPU) is the computer hardware unit responsible for interpreting and executing the program instructions. One of the first commercial CPU microprocessors was the Intel 4004, presented by Intel in 1971.
A CPU is usually composed of the following components [21]:
• Arithmetic Logic Unit (ALU) – responsible for the execution of logical and arithmetic operations;
• Control Unit – decodes instructions, fetches operands and controls the execution point;
• Registers – memory cells of the CPU that store the data needed to execute the instructions;
• CPU interconnection – communication channels among the control unit, the ALU, and the registers.
Nowadays, in order to reduce power consumption and to process multiple tasks simultaneously and more efficiently, commercial CPUs are built with multi-core technology, typically having between 4 and 16 execution cores. This way, a multi-core CPU can process 4 or more instructions at a time, in a MIMD fashion. Some solutions that take advantage of this parallel processing on Intel CPUs are presented in Section 3.4.
2.3 GPU - Graphics Processing Unit
A Graphics Processing Unit (GPU) is the processing unit present in every computer's graphics card. This unit is designed specifically for performing the complex mathematical and geometric calculations that are necessary for graphics rendering. Although GPUs were originally developed to process and display computer graphics, they have also been used for general-purpose computation, leading to the General-Purpose Computation on Graphics Hardware (GPGPU) paradigm. There are several frameworks that adapt GPU programming to this paradigm; the best known are OpenCL and NVIDIA's CUDA, presented in Section 2.6. Early approaches to computing on GPUs cast computations into a graphics framework, allocating buffers (arrays) and writing shaders (kernel functions).
GPUs provide massive parallel execution resources and high memory bandwidth. Among the most popular GPU-accelerated application areas, we can mention the research field, specifically:
• Higher education and supercomputing (numerical analytics, physics, and weather and climate forecasting, for example);
• Defense and intelligence applications (such as geospatial visualization).
Sequence alignment is a fundamental procedure in Bioinformatics, specifically used for molecular sequence analysis, which attempts to identify the maximally homologous subsequences among sets of long sequences [14]. In the scope of this thesis, the processing of biological sequences consisting of a single, continuous molecule of nucleic acid or protein was considered [33]. While DNA sequences can be expressed with four symbols (corresponding to the four nucleotides A, C, T and G), the amino acids in proteins are expressed with 22 symbols: A, B, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, Y, Z.
When comparing sequences, one looks for patterns that diverged from a common ancestor by a process of mutation and selection. According to Dewey et al. [34], the main objectives of sequence alignment are to establish input data for phylogenetic analysis, to determine the evolutionary history of a set of sequences, to discover a common motif³ in a set of sequences, to characterize a set of sequences, and to build profiles for database sequence searching.
The mutational processes considered in alignments are residue substitutions, residue insertions, and residue deletions. Insertions and deletions are commonly referred to as gaps [35].
The basic idea in aligning two sequences (of possibly different sizes) is to write one on top of the other and break them into smaller pieces by inserting spaces in one or the other, so that identical subsequences end up aligned in a one-to-one correspondence. Naturally, spaces are not inserted at the same position in both sequences. Figure 3.1 illustrates an alignment between the sequences A="ACAAGACAGCGT" and B="AGAACAAGGCGT".
Figure 3.1: Pairwise Alignment Example
In order to understand all the steps involved in the algorithms that will be presented in Sections 3.2 and 3.3, we need to go through some of the concepts employed: the scoring model and the concept of gap penalties (Section 3.1). After explaining these concepts, this chapter provides a brief overview of optimal sequence alignment algorithms (Section 3.2) and heuristic sequence alignment algorithms (Section 3.3).
Finally, taking into account the parallel architectures presented in Chapter 2, we present some implementations of sequence alignment on parallel architectures, based either on the CPU (Section 3.4.1) or on the GPU (Section 3.4.2).
3.1 Alignment Scoring Model
Many sequence alignment algorithms are based on a scoring model, which classifies the several matching and mismatching patterns according to predefined score values. The simplest approaches assign a positive constant value to a match between two residues. Alternatively,
³Sequence motifs are short, recurring patterns in DNA that are presumed to have a biological function.
instead of using fixed score values for a match in the alignment, biologists frequently use scoring schemes that take into account physicochemical properties or evolutionary knowledge of the sequences being aligned. This is common when protein sequences are compared. The best-known schemes are the Point Accepted Mutation (PAM) and Blocks Substitution Matrix (BLOSUM) alphabet-weight scoring schemes, which are usually implemented by a substitution matrix.
The BLOSUM matrices were developed by Henikoff & Henikoff, in 1992, to detect more distant relationships. In particular, BLOSUM50 and BLOSUM62 are widely used for pairwise alignment and database searching.
Substitution matrices also allow a negative score to be given to a mismatch, which is sometimes called an approximate or partial match.
Just like the score values, the gap penalty can be given by a constant value or by one of the following models. The gap open/start score, d, represents the cost of starting a gap, while the gap extension score, e, represents the cost of extending a gap by one more space. The standard cost associated with a gap of length g is given either by the linear score [35]:

γ(g) = −g·d    (3.1)

or by the affine score:

γ(g) = −d − (g − 1)·e    (3.2)

The gap-extension penalty e is usually set to a value less than the gap-open penalty d, allowing long insertions and deletions to be penalized less than they would be by the linear gap cost. This is desirable when gaps of a few residues are expected almost as frequently as gaps of a single residue [35].
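As a small worked illustration of the two gap models (not from the thesis; the values d = 10 and e = 1 are arbitrary example parameters):

#include <stdio.h>

/* Equation 3.1: linear gap cost. */
static int linear_gap(int g, int d)        { return -g * d; }
/* Equation 3.2: affine gap cost (open once, extend g-1 times). */
static int affine_gap(int g, int d, int e) { return -d - (g - 1) * e; }

int main(void) {
    int d = 10, e = 1;
    for (int g = 1; g <= 4; g++)
        printf("g=%d  linear=%d  affine=%d\n", g, linear_gap(g, d), affine_gap(g, d, e));
    /* A gap of length 4 costs -40 under the linear model but only -13 under the
       affine model, so long insertions and deletions are penalized much less. */
    return 0;
}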
3.2 Optimal Alignment Algorithms
The optimal alignment of two DNA or protein sequences is the alignment that maximizes the sum of pair-scores minus any penalty for the introduced gaps [35].
Optimal alignment algorithms include:
• Global alignment algorithms, which align every residue in both sequences. One example is the Needleman-Wunsch algorithm, presented in Section 3.2.1.
• Local alignment algorithms, which consider only part of the sequences and obtain the best subsequence alignments, i.e., the identification of common molecular subsequences [14]. One example is the Smith-Waterman algorithm, presented in Section 3.2.2.
3.2.1 Needleman-Wunsch Algorithm
In 1970, Needleman & Wunsch [36] proposed the following algorithm. Given two molecular sequences, A = a1a2...an and B = b1b2...bm, the goal is to return an alignment matrix H that indicates the optimal global-alignment score between both sequences.
In order to understand this algorithm, consider the following definitions:
• H(i, j) represents the similarity score of the two sequences A and B, ending at positions i and j;
• s(ai, bj) is the score for each aligned pair of residues. This value can be defined by a constant, or can be obtained using scoring matrices such as PAM or BLOSUM for protein sequences;
• Wk and Wl represent the gap penalties, according to the considered gap model.
Each matrix cell is filled with the maximum value that results from Equation 3.3:

H(i, j) = max { H(i−1, j−1) + s(ai, bj),  if ai and bj are similar symbols
                H(i−k, j) − Wk,           if ai is at the end of a deletion of length k
                H(i, j−l) − Wl,           if bj is at the end of a deletion of length l }    (3.3)
This equation is repeatedly applied in order to fill the matrix with the H(i, j) values, calculating the value in the bottom right-hand corner of each square of four cells from the remaining three values [36]. By definition, the value in the bottom-right cell of the entire matrix, H(n, m), corresponds to the best score for an alignment between A and B. Figure 3.2 illustrates the algorithm with the alignment between sequences A="AACGTT" and B="ATGTT". The obtained score is 13, and the best global alignment is indicated by the green arrows in the figure.
Figure 3.2: Needleman-Wunsch alignment matrix example
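A minimal sketch of the matrix-filling step follows, assuming a linear gap penalty (so the length-k gap terms of Equation 3.3 reduce to single-step recurrences with step d) and a simple match/mismatch score instead of a substitution matrix. This is an illustration only, not the thesis code, and the scoring parameters are arbitrary (they need not reproduce the score of 13 in Figure 3.2).

#include <stdio.h>
#include <string.h>

#define MAXLEN 64
static int max3(int a, int b, int c) { return a > b ? (a > c ? a : c) : (b > c ? b : c); }

/* Returns H(n, m): the optimal global alignment score of a and b. */
static int needleman_wunsch(const char *a, const char *b, int match, int mismatch, int d) {
    int n = (int)strlen(a), m = (int)strlen(b);
    static int H[MAXLEN + 1][MAXLEN + 1];
    for (int i = 0; i <= n; i++) H[i][0] = -i * d;  /* leading gaps in b */
    for (int j = 0; j <= m; j++) H[0][j] = -j * d;  /* leading gaps in a */
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            int s = (a[i-1] == b[j-1]) ? match : mismatch;
            H[i][j] = max3(H[i-1][j-1] + s,  /* align a_i with b_j */
                           H[i-1][j] - d,    /* gap in b */
                           H[i][j-1] - d);   /* gap in a */
        }
    return H[n][m];
}

int main(void) {
    /* Sequences from Figure 3.2; match/mismatch/gap values here are illustrative only. */
    printf("score = %d\n", needleman_wunsch("AACGTT", "ATGTT", 3, -1, 2));
    return 0;
}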
3.2.2 Smith-Waterman Algorithm
In 1981, Smith and Waterman [14] proposed a dynamic programming algorithm⁴ that computes the similarity scores corresponding to the maximally homologous subsequences among sets of long sequences. Given two sequences A = a1a2...an and B = b1b2...bm, the goal of this algorithm is to return an alignment matrix H that indicates the optimal local alignments between both sequences. For each cell, this algorithm computes the similarity value between the current symbol of sequence A and the current symbol of sequence B. This algorithm has some data dependencies, since each cell of the alignment matrix depends on its left, upper and upper-left neighbors.
In this algorithm, we consider the same definitions of H(i, j), s(ai, bj), Wk and Wl used in the Needleman-Wunsch algorithm (Section 3.2.1).
⁴Dynamic programming is a programming method that solves problems by combining the solutions to their subproblems [37].
Receiving the sequences A and B as input, this algorithm begins with the initialization of the first
column and the first row, which is given by:
H(k, 0) = H(0, l) = 0, for 0 ≤ k ≤ n and 0 ≤ l ≤ m    (3.4)
Then the algorithm computes the similarity score H(i, j) by using the following equation:
H(i, j) = max { H(i−1, j−1) + s(ai, bj),  if ai and bj are similar symbols
                H(i−k, j) − Wk,           if ai is at the end of a deletion of length k
                H(i, j−l) − Wl,           if bj is at the end of a deletion of length l
                0,                        otherwise }    (3.5)
The output of the algorithm is the optimal local alignment of sequences A and B with maximum score. Unlike the Needleman-Wunsch algorithm, the Smith-Waterman algorithm always produces matrix scores greater than or equal to 0.
In order to obtain all the optimal local alignments between sequences A and B, a trace-back procedure starts from the highest score in the whole matrix and ends at a cell with score 0.
Figure 3.3 presents the optimal local alignments between sequence A = WPCIWWPC and sequence B = IIWPC. In this example, the BLOSUM50 scoring matrix is used to obtain the s(ai, bj) values, and the gap penalty is −5. The optimal local alignments between sequences A and B are represented by the cells with a green background; these alignments occur between the subsequences WPC of A and WPC of B.
Figure 3.3: Smith-Waterman alignment matrix example
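The sketch below highlights the two changes with respect to the Needleman-Wunsch fill shown earlier: scores are clamped at 0 (Equation 3.5) and the best local score is the maximum over the whole matrix, where the traceback would start. It is an illustration only; a simple match/mismatch score replaces the BLOSUM50 matrix of Figure 3.3, so the value differs from the figure.

#include <stdio.h>
#include <string.h>

#define MAXLEN 64
static int max2(int a, int b) { return a > b ? a : b; }

static int smith_waterman(const char *a, const char *b, int match, int mismatch, int d) {
    int n = (int)strlen(a), m = (int)strlen(b), best = 0;
    static int H[MAXLEN + 1][MAXLEN + 1];  /* first row/column zeroed (Equation 3.4) */
    memset(H, 0, sizeof H);
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            int s = (a[i-1] == b[j-1]) ? match : mismatch;
            int h = max2(H[i-1][j-1] + s, max2(H[i-1][j] - d, H[i][j-1] - d));
            H[i][j] = max2(h, 0);         /* never below zero */
            best = max2(best, H[i][j]);   /* traceback starts from this cell */
        }
    return best;
}

int main(void) {
    printf("best local score = %d\n", smith_waterman("WPCIWWPC", "IIWPC", 3, -1, 5));
    return 0;
}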
3.3 Heuristic Sub-Optimal Algorithms
Although providing optimal solutions, the described algorithms are characterized by a quadratic complexity O(mn), where m is the size of sequence A and n the size of sequence B. This becomes evident on large databases with a high number of residues. The current UniProt Swiss-Prot [12] protein database contains hundreds of millions of residues; for a query sequence of length one thousand, approximately 10^11 matrix cells must be evaluated to search the complete database. At ten million matrix cells per second, which is reasonable for a single workstation at the time of writing, this would take 10,000 seconds, i.e., around three hours [35].
Heuristic algorithms address this issue at the expense of not guaranteeing to find the optimal solution. Examples of such algorithms are FASTA and BLAST, presented in Sections 3.3.1 and 3.3.2.
3.3.1 FASTA
The FASTA algorithm (also known as "fast A", which stands for "FAST-All") was presented by Pearson & Lipman in 1985 [38] and further improved in 1988 [39]. This algorithm builds local high-scoring alignments with a multistep approach, starting from exact short word matches, through maximal scoring ungapped extensions, to finally identify gapped alignments.
This algorithm can be described in four steps [35]:
• Step 1 (Figure 3.4): locate all identically matching words of length ktup (the parameter that specifies the word size) between the two sequences. For proteins, ktup is typically 1 or 2; for DNA it may be 4 or 6. The algorithm then looks for diagonals with many mutually supporting word matches.
Figure 3.4: FASTA algorithm step 1.
• Step 2 (Figure 3.5): search for the best diagonals, extending the exact word matches to find
maximal scoring ungapped regions (and, in the process, possibly joining together several seed
matches).
• Step 3 (Figure 3.6): check if any of these ungapped regions can be joined by a gapped region,
allowing for gap costs.
• Step 4 (Figure 3.7): the highest scoring candidate matches in a database search are realigned using the full dynamic programming algorithm, but restricted to a subregion of the dynamic programming matrix forming a band around the candidate heuristic match. This step uses a standard dynamic programming algorithm, such as Needleman-Wunsch or Smith-Waterman, to get the final scores.
Figure 3.5: FASTA algorithm step 2.
Figure 3.6: FASTA algorithm step 3.
Figure 3.7: FASTA algorithm step 4.
There is a tradeoff between speed and sensitivity in the choice of the ktup parameter: higher values of ktup are faster, but more likely to miss true significant matches. To achieve sensitivities close to those of the optimal algorithms for protein sequences, ktup needs to be set to 1.
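As an illustration of Step 1, the sketch below (hypothetical data layout and naming, not the FASTA implementation) indexes all ktup-words of a DNA query with ktup = 2 and counts word hits per diagonal i − j of the comparison matrix, using the example sequences of Figure 3.1:

#include <stdio.h>
#include <string.h>

#define KTUP  2
#define ALPHA 4  /* A, C, G, T */

static int code(char c) { return c == 'A' ? 0 : c == 'C' ? 1 : c == 'G' ? 2 : 3; }

int main(void) {
    const char *query = "ACAAGACAGCGT", *db = "AGAACAAGGCGT";
    int nq = (int)strlen(query), nd = (int)strlen(db);

    /* Bucket the query positions of each of the 4^KTUP = 16 possible words. */
    int first[ALPHA * ALPHA], next[64];
    memset(first, -1, sizeof first);
    for (int i = 0; i + KTUP <= nq; i++) {
        int w = code(query[i]) * ALPHA + code(query[i + 1]);
        next[i] = first[w];
        first[w] = i;  /* push position i onto word w's list */
    }

    /* Count mutually supporting word matches per diagonal d = i - j. */
    int diag[128] = {0};
    for (int j = 0; j + KTUP <= nd; j++) {
        int w = code(db[j]) * ALPHA + code(db[j + 1]);
        for (int i = first[w]; i != -1; i = next[i])
            diag[i - j + nd]++;  /* offset by nd to keep the index non-negative */
    }
    for (int d = 0; d < nd + nq; d++)
        if (diag[d] > 2)
            printf("diagonal %d: %d word hits\n", d - nd, diag[d]);
    return 0;
}

The diagonals with many hits are the candidates that Step 2 extends into maximal scoring ungapped regions.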
3.3.2 BLAST - Basic Local Alignment Search Tool
The Basic Local Alignment Search Tool (BLAST) was presented by Altschul et al. in 1990 [40], and finds regions of local similarity between sequences. The program compares nucleotide or protein sequences against sequence databases and calculates the statistical significance of the matches. BLAST can be used to infer functional and evolutionary relationships between sequences, as well as to help identify members of gene families [9]. This algorithm is most effective with polypeptide⁵ sequences and uses a scoring matrix (BLOSUM, PAM, etc.) to find the maximal segment pair (MSP) for two sequences, defined as locally optimal if its score cannot be improved either by lengthening or by shortening the segment pair. This algorithm is the most widely used for protein-coding sequence alignment⁶.
⁵Short chains of amino acid monomers linked by peptide (amide) bonds.
⁶http://cmns.umd.edu/
The BLAST algorithm steps are [40]:
1. Compile a list of high-scoring words
• Given a length parameter w and a threshold parameter T, find all the w-length substrings (words) of the database sequences that align with words from the query with an alignment score higher than T. This is called a hit in BLAST.
• Discard the words that score below T (these are assumed to carry too little information to be useful starting seeds).
2. Scan the database for hits
• When T is high, the search will be rapid, but potentially informative matches will be missed.
3. Extend the hits
• Attempt to extend each match to see if it is part of a longer segment that scores above the MSP score S.
• Report only those hits that yield a score above S.
From the score S it is also possible to calculate an expectation score E, which is an estimate of how many local alignments of at least this score would be expected, given the characteristics of the query sequence and of the database.
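The thesis does not give the expression for E; for ungapped local alignments, the standard Karlin-Altschul estimate (stated here as background, not taken from the source) is

E = K·m·n·e^(−λS)

where m and n are the lengths of the query and of the database, and K and λ are parameters determined by the scoring system and the sequence composition. Doubling the database size doubles E, while increasing the score S decreases it exponentially.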
The original BLAST did not permit gaps, so it found relatively short regions of similarity, and it was often necessary to extend the alignment manually or with a second alignment tool.
3.4 Parallel Implementations
Smith-Waterman is the best-known algorithm in this context, and it has been explored in many software implementations, each improving the execution times and optimizing the parallelization method.
Regarding the parallelization method, the implementations presented in this section take somewhat different approaches to parallelism, and they can be grouped according to their level of parallelism [2]:
• Coarse-Grained Parallelism: an example of this kind of parallelism is the master/worker model adopted in our work, where a single processor, named master, sends work to the workers. In parallel sequence alignment, the sequence database is split into n parts, each processed by a different worker.
In Section 3.4, a set of parallel implementations of the Smith-Waterman algorithm proposed in recent years was presented. Considering two of those solutions, CUDASW++ 2.0 by Liu et al. [17] and Pedro Monteiro's SWIPE extension [2] presented in Section 3.4.1, our work proposes an efficient parallel implementation of the Smith-Waterman algorithm, named MultiSW (Section 3.4.2.B).
This implementation consists of the orchestration of both applications' execution modules in a single solution, exploiting multiple CPU cores and the NVIDIA GPUs that may be available on the running machine, in a heterogeneous approach, as presented in Figure 4.1. Each one of the modules is called a worker, so we have the CPU workers (Section 4.2.1) and the GPU workers (Section 4.2.2). The MultiSW application includes a load balancing abstraction layer, in order to efficiently split the database sequences during the execution; this layer is explained in Section 4.5. Another implemented optimization is a wrapper⁸ function for the CPU worker execution (Section 4.2.1.A), proposed in order to improve the CPU worker execution time. Besides these improvements on the CPU side, several optimizations were also implemented in the GPU worker (Section 4.2.2).
At startup, the proposed MultiSW application receives multiple command-line arguments specifying the running parameters. It then prepares all the execution structures (presented in Section 4.4.3) and coordinates the execution of all available work among the workers (specified at invocation time). This coordination process is referred to as the orchestration process.
Figure 4.1: Heterogeneous Architecture
This way, multiple parallelization techniques are considered in a single software solution, in a medium-grained parallelization approach where multiple database sequences are processed simultaneously, as will be explained in Section 4.2.
In this kind of application, the main objective is to process all data in the minimum execution time, leading to the maximum execution speedup (a concept explained in Section 5.2). Considering both
⁸A wrapper function is a subroutine in a software library or a computer program whose main purpose is to call a second subroutine or a system call with little or no additional computation.
execution workers, the execution time is directly related to the amount of data (database sequences) processed in each iteration. Due to the base implementations considered in the implementation of MultiSW, it was necessary to create several auxiliary processing structures (see Section 4.4.3). Section 4.2 presents the architecture of this solution and the adaptation of the existing solutions that enables it. To improve MultiSW, Section 4.5 presents a model that changes this block size between run-time iterations, in order to minimize the application's total execution time.
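The details of that model are given in Section 4.5. As a purely illustrative sketch (an assumption of this text, not the actual policy of Section 4.5), a feedback rule consistent with the behaviour reported in Chapter 5, where block sizes shrink during the run so that all workers' iteration times stay similar, could look like:

/* Hypothetical block-size feedback rule, NOT the model of Section 4.5: scale a
   worker's block so that its next iteration approaches a common target time. */
static long adjust_block_size(long block, double last_iter_time,
                              double target_time, long min_block) {
    long next = (long)((double)block * (target_time / last_iter_time));
    return next < min_block ? min_block : next;
}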
4.2 Architecture
The solution's architecture is presented in Figure 4.2. The orchestration can be considered the application's core: it invokes the CPU and GPU implementations to execute work that consists of processing alignments between the database sequences and the query sequence. Both workers are adapted to this thesis's solution from the considered applications, Pedro Monteiro's solution (Section 3.4.1.D) and CUDASW++ 2.0; this adaptation is explained below.
Figure 4.2: MultiSW block diagram (the orchestration connects the CPU module, with its CPU wrapper, the GPU module and the load balancing module, which get work from the database sequences).
In order to adapt both solutions to this work, the considered model was the master/worker model originally proposed by Pedro Monteiro in the SWIPE extension [2]. A possible representation of this execution model is shown in Figure 4.3. The split of the database into multiple chunks represents the inter-task parallelization model introduced by Pedro Monteiro in his solution.
During the execution, all running workers (the CPU worker and the GPU workers) repeatedly obtain new work to process by invoking the function get_fasta_sequences(). This function loads the next database sequences to process from the database sequences file specified at application run time. Access to this function is protected by a pthread_mutex_t, to ensure that only one worker can obtain sequences at a time. The worker then gets the respective processing block from the profile_seqs structure. GPU workers use the processing block structure presented in Section 4.4.3.
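The following sketch shows the mutex-protected work distribution pattern described above. It is illustrative only: the real get_fasta_sequences() reads FASTA records from the database file, which is not reproduced here.

#include <pthread.h>

static pthread_mutex_t db_lock = PTHREAD_MUTEX_INITIALIZER;
static long next_seq = 0;
static long total_seqs = 2712515;  /* total from the experimental dataset (Section 5.1.1) */

/* Every worker (CPU or GPU) calls this with its own block size; the mutex
   guarantees that only one worker reads from the database at a time. */
static long get_fasta_sequences(long block_size, long *first) {
    pthread_mutex_lock(&db_lock);
    long remaining = total_seqs - next_seq;
    long got = remaining < block_size ? remaining : block_size;
    *first = next_seq;   /* the real code loads 'got' sequences from the FASTA file here */
    next_seq += got;
    pthread_mutex_unlock(&db_lock);
    return got;          /* 0 means no work left and the worker terminates */
}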
Figure 4.3: Master Worker Model [2]
Sections 4.2.1 and 4.2.2 present the adaptations of each existing application, so that the original code can be used by the orchestration implementation. A CPU wrapper function is also presented, used to minimize the accesses by multiple threads to the global shared variables that need synchronization among all execution threads.
4.2.1 CPU Worker
The CPU worker of our work consists of the adaptation of Pedro Monteiro's solution [2], transforming the master of the original master/worker model into one of our workers, since in the original implementation the master thread controls the whole execution and creates new processing work for the workers. In the original implementation, the master thread creates processing blocks of 16 sequences, blocking the other workers' access to the get_fasta_sequences() function (explained above) in every execution iteration. This represents an efficiency problem in the final solution, because of the low parallelization level over the database sequences, so a CPU wrapper function was developed in our work (Section 4.2.1.A) to avoid this problem.
The architecture of the original solution was not changed, and the application still works as represented in Figure 4.4.
Figure 4.4: Master Worker Model [2]
So, in our implementation, the worker itself creates the processing blocks. It gets the database sequences from the CPU wrapper function, and then it creates the 16-sequence database blocks to be inserted in the processing queue. Besides the CPU wrapper implementation, some of the initialization functions were adapted, since the database file format originally considered was the BLAST sequence type [48], whereas our implementation works with the FASTA [49] database file format; the initialization functions were therefore changed to support this different file format.
4.2.1.A CPU Wrapper
When workers get a new execution block, it is necessary to guarantee that the method which obtains the database sequences does not block the access of the other execution workers. In the CPU implementation, this is implemented using one mutex that blocks every concurrent access to these variables. Pedro Monteiro's implementation [2] considers executable blocks of only 16 sequences, and the getwork() method (explained above) that fetches those sequences was blocking the access of the other workers while getting that information. So, to prevent the CPU worker from getting only 16 sequences at a time and blocking the other workers, this work introduces a CPU wrapper function that fetches a bigger block (the default value is 30,000 sequences), avoiding making the other workers wait across several accesses. After that, the CPU worker creates processing blocks from the block obtained by this wrapper, taking 16 sequences from it at a time (as shown in Figure 4.5).
Figure 4.5: CPU Wrapper function (the wrapper fetches 30,000 sequences from the database at once; the CPU worker then repeatedly gets 16-sequence blocks from the wrapper).
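A sketch of the wrapper policy follows. Names and structure are illustrative, reusing the hypothetical get_fasta_sequences() sketched in Section 4.2; this is not the MultiSW source.

#define WRAPPER_BLOCK 30000  /* default large fetch, as described above */
#define CPU_BLOCK     16     /* block size consumed by the CPU worker */

long get_fasta_sequences(long block_size, long *first);  /* see the earlier sketch */

typedef struct { long first, count; } seq_block_t;

static long buf_first = 0, buf_count = 0;  /* wrapper-local buffer state */

/* Hands a 16-sequence block to the CPU worker, refilling from the shared
   database (one mutex-protected access) only once per 30,000 sequences. */
static int cpu_wrapper_get(seq_block_t *out) {
    if (buf_count == 0)
        buf_count = get_fasta_sequences(WRAPPER_BLOCK, &buf_first);
    if (buf_count == 0)
        return 0;  /* database exhausted */
    out->first = buf_first;
    out->count = buf_count < CPU_BLOCK ? buf_count : CPU_BLOCK;
    buf_first += out->count;
    buf_count -= out->count;
    return 1;
}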
4.2.2 GPU Worker
The GPU module considers several GPU workers, each one assigned to a physical NVIDIA GPU device. The number of GPUs to use is specified on the command line at run time. The application creates a CPU pthread for each one of the considered GPUs. This thread runs a function named gpu_worker(), which gets the database sequences to process from the get_fasta_sequences() function and runs all the preparation and execution flows of the original CUDASW++ implementation [17]. Liu et al.'s solution works with FASTA sequences, so it was not necessary to change the sequence preparation functions.
To minimize the application execution time, some optimizations that reduce the execution time of each worker iteration are presented. A CUDA stream is "a sequence of operations that execute on the device in the order in which they are issued by the host code. While operations within a stream are guaranteed to execute in the prescribed order, operations in different streams can be interleaved and, when possible, they can even run concurrently" [50]. Using CUDA streams, the memory transfers between the host and the device can be made asynchronous (Section 4.2.2.A). With streams it is also possible to parallelize the execution of kernels (Section 4.2.2.B). Besides these, the loading of the next sequences to process is done in parallel with the execution of kernels on the device side (Section 4.2.2.C).
4.2.2.A Asynchronous Transfers
By creating CUDA streams, assigning them to data transfers, and changing the memory transfers to asynchronous (appending Async to the name of the transfer instruction), the data transfers between the host and the device can overlap with other work.
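The following sketch shows the asynchronous-transfer pattern under stated assumptions: the kernel, buffer sizes and function names are placeholders (not the CUDASW++ code), and the host buffer must be pinned with cudaMallocHost for the copy to be truly asynchronous.

#include <cuda_runtime.h>
#include <string.h>

__global__ void score_kernel(const char *seqs, int n, int *scores) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) scores[i] = 0;  /* placeholder for the real alignment kernel */
}

void process_block_async(const char *h_src, int n, cudaStream_t stream) {
    char *h_pinned, *d_seqs;
    int *d_scores;
    cudaMallocHost((void **)&h_pinned, n);           /* pinned host memory */
    cudaMalloc((void **)&d_seqs, n);
    cudaMalloc((void **)&d_scores, n * sizeof(int));
    memcpy(h_pinned, h_src, n);

    /* The copy and the kernel are queued on the same stream: they run in order
       with respect to each other, but may overlap with work in other streams. */
    cudaMemcpyAsync(d_seqs, h_pinned, n, cudaMemcpyHostToDevice, stream);
    score_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_seqs, n, d_scores);
    cudaStreamSynchronize(stream);                   /* wait for this stream only */

    cudaFree(d_scores);
    cudaFree(d_seqs);
    cudaFreeHost(h_pinned);
}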
The code was compiled for a 64-bit Linux operating system using the Intel C compiler version 13.1.3
and the NVIDIA Compiler release 6.5.
When comparing the used GPUs, it is easy to identify which one will obtain the best results. The GeForce GTX 780 Ti has more CUDA cores (2880) than the GeForce GTX 660 Ti (1344), so it can run the kernels with more parallelism. Another big difference is the memory bandwidth, which is 336 GB/s on the GTX 780 Ti, whereas on the GTX 660 Ti it is only 144.2 GB/s, less than half. The memory interface width is also different: 384 bits on the former and 192 bits on the latter. So it is expected that the first GPU runs the kernel functions faster and transfers the data more quickly than the second GPU.
5.1.1 Experimental Dataset
The query sequence used in the experimental scenarios was the IFNA6 interferon, alpha 6 [Homo sapiens (human)] [51], with 189 residues.
The considered database was release 2014_02 of the UniProtKB/Swiss-Prot [52] database in the FASTA format, repeated 5 times in the file. This database contains 542,503 sequences of various sizes, comprising 192,888,369 amino acids abstracted from 226,190 references. The total number of processed sequences is therefore 2,712,515.
5.2 Evaluating Metrics
In order to compare the considered scenarios, the speedup metric will be used. This metric measures how much faster an optimized implementation is than the base implementation. It is given by Equation 5.1:

speedup = t_sequential / t_parallel    (5.1)

For example, with the single-core time of 31.52 seconds (Scenario A) and the full heterogeneous time of 4.957 seconds (Scenario F), the speedup is 31.52 / 4.957 ≈ 6.36.
5.3 Results
This section presents multiple scenarios and their results when running the application with various execution parameter configurations. It starts with the simplest scenario, corresponding to a single CPU core execution, and finishes with the most complex configuration, an orchestration of workers based on a multicore CPU and multiple GPUs that processes all the available work. The execution block sizes for each kind of worker were pre-adjusted, by running with several block size configurations before obtaining the experimental results, in order to achieve the best overall execution times.
Each execution scenario was executed ten times, and the presented results correspond to the average of the times of these executions. An iteration execution represents the time the application spends executing the block size defined for the execution worker.
In each presented scenario, for the global orchestration, the CPU execution worker uses the CPU wrapper module presented in Section 4.2.1.A, regardless of whether the execution is done with one or four CPU cores. The iteration time may vary, because the processed sequences have different sizes.
The block sizes for the experimental results were adjusted for the CPU and the GPU by varying the block sizes and checking the best execution times using only one CPU core and a single GPU. The obtained CPU block size was 30,000 sequences and the GPU default block size was 65,000 sequences.
5.3.1 Scenario A - Single CPU core
Considering a single CPU core execution, the total execution time was about 31.52 seconds, as shown in Figure 5.1.
Figure 5.1: Processing times considering a single CPU core execution and a processing block with 30,000 sequences (total execution time: 31.52 seconds).
The multiple grey-colored blocks represent the execution of each CPU wrapper iteration, considering its size of 30,000 sequences. These iteration execution times vary between 0.0688 and 2.391 seconds and together make up the total execution time of about 31.52 seconds. Between iteration executions, the preparation time is about 0.0009 seconds and is not visible in the figure. The difference in the iteration execution times is explained by the different sequence sizes in each iteration: for bigger sequences, the iteration execution time is longer.
5.3.2 Scenario B - Four CPU cores
Considering a 4-core CPU execution, the total execution time was about 15.55 seconds, as shown in Figure 5.2.
Figure 5.2: Processing times for 4 CPU cores, considering a block size of 30,000 sequences (total execution time: 15.55 seconds).
The distinct grey-colored blocks represent the processing time of a block of 30,000 sequences (one CPU wrapper iteration) by the four CPU cores. These iteration values vary between 0.046 and 0.957 seconds.
The total execution time was about 15.55 seconds. The reason why the solution with four CPU cores is not four times faster than the single CPU core is the synchronization between multiple threads and the data partitioning and organization times; because of these, the obtained speedup was not linear, as might otherwise be expected.
5.3.3 Scenario C - Single GPU - GeForce GTX 780 Ti
Considering a single GeForce GTX 780 Ti GPU execution, the total execution time was 6.35 seconds, as shown in Figure 5.3.
Figure 5.3: Processing times for a single GPU in Machine A, considering a block size of 65,000 sequences. Total execution time about 6.35 seconds.
The figure presents several grey-colored execution blocks, each representing the time to process 65,000 database sequences against the query sequence. These iteration values vary between 0.118 and 0.266 seconds.
Considering the several optimizations mentioned in Section 4.2.2, especially the use of the CUDA streams provided by NVIDIA in its framework, it is possible to significantly reduce the preparation time between iterations and obtain the best overall execution times.
5.3.4 Scenario D - Single GPU - GeForce GTX 660 Ti
Considering a single GeForce GTX 660 Ti GPU execution, the total execution time was about 7.38 seconds, as shown in Figure 5.4.
Figure 5.4: Processing times for a single GPU in Machine B, considering a block size of 65,000 sequences. Total execution time about 7.38 seconds.
The figure presents several grey-colored execution blocks, each representing the time to process 65,000 database sequences against the query sequence. The total execution time was 7.38 seconds, with iteration values varying between 0.126 and 0.304 seconds.
5.3.5 Scenario E - Four CPU cores + Single GPU Execution
In this scenario, the considered workers are the four CPU cores and the GeForce GTX 780 Ti GPU. The execution time was about 6.112 seconds, as shown in Figure 5.5.
Figure 5.5: Processing times for 4 CPU cores and a GeForce GTX 780 Ti GPU, considering CPU blocks of 30,000 sequences and GPU blocks of 65,000 sequences. Total execution time was 6.112 seconds.
This time is better than would be obtained with the GeForce GTX 660 Ti in its place, because the considered GPU was the GeForce GTX 780 Ti, which executes faster than the GeForce GTX 660 Ti, as presented in Scenarios C and D.
Figure 5.6 presents the number of sequences processed by each kind of worker. The CPU worker processed 817,087 sequences, while the GPU worker processed 1,895,428 sequences.
Figure 5.6: Number of sequences processed by the CPU cores (817,087) and the GPU (1,895,428).
The orchestration represented in this scenario is better than the single GPU execution, but a linear speedup was not achieved, since the synchronization points increase with the number of workers in the orchestration.
Figure 5.5 also presents the dynamic block size along the time; these values are shown next to the corresponding execution blocks of the GPU worker and the CPU worker. The CPU worker starts with a block size of 30,000 and finishes with a size of 15,000. The GPU worker starts with a block size of 65,000 sequences and finishes with a size of 40,000 sequences. For both workers, the number of sequences to process next decreases along the execution time, which is how the load balancing module works. The iteration execution times for the GPU worker vary between 0.082 and 0.316 seconds; for the CPU worker, they range between 0.072 and 0.258 seconds.
5.3.6 Scenario F - Four CPU cores + Double GPUs Execution
The last scenario is composed of the 4-core CPU execution and both available GPUs: the GeForce GTX 780 Ti (GPU A) and the GeForce GTX 660 Ti (GPU B).
As expected, this execution was the fastest one, although not the most efficient, taking about 4.957 seconds, as shown in Figure 5.7.
Figure 5.7: Processing times for the 4-core CPU, GPU A and GPU B, considering an initial block size of 30,000 sequences for the CPU worker and of 65,000 for the GPU workers. Total execution time of 4.957 seconds. Next to some of the iteration blocks the newly adjusted block size is shown (GPU: 65,000, 58,633, 58,415, 56,279, 40,000; CPU: 30,000, 33,000, 25,387, 17,843, 16,372, 15,000).
Figure 5.7 shows the execution blocks for the three workers. The CPU worker starts with a block size of 30,000 and finishes with a block size of 15,000; its execution times range from 0.06 to 0.379 seconds. The execution times for the GPU A worker range between 0.067 and 0.394 seconds. Finally, for the GPU B worker, the execution times range between 0.067 and 0.520 seconds.
The number of sequences computed by each worker is presented in Figure 5.8. The CPU worker processed 411,817 sequences, the GPU A worker computed 1,241,496 sequences, and the GPU B worker processed 1,059,202 sequences. The lower quantity processed by the CPU worker is due to its block size being smaller than the GPU workers' block sizes, which improves overall performance.
Figure 5.8: Number of sequences processed by the CPU (411,817), GPU A (1,241,496), and GPU B (1,059,202) workers.
5.4 Summary
As shown in Table 5.1, considering the multiple scenarios, Scenario F, presented in Section 5.3.6, achieved a speedup of 6.36x when compared with the single-core CPU execution presented in Scenario A (Section 5.3.1).
Configuration                          | Execution Time (s) | Speedup
Single core                            | 31.52              | –
Four cores                             | 15.55              | 2.03
GeForce GTX 780 Ti                     | 6.350              | 4.96
GeForce GTX 660 Ti                     | 7.380              | 4.271
Four CPU cores + GeForce GTX 780 Ti    | 6.112              | 5.16
Four CPU cores + 2 GPUs                | 4.96               | 6.36

Table 5.1: Execution Speedups.
Increasing the number of workers in the orchestration of our work also increases the synchronization needed between the involved threads. This causes execution delays and makes the workers wait longer. This situation is minimized by the load balancing layer included in our solution, since the block sizes are adapted so that the workers' iteration times remain similar. However, there are some limitations in the load balancing module, since the total number of sequences to process is not known at the beginning of the application.
Despite these limitations, as can be verified in Table 5.1 for the different execution scenarios, the orchestration attained relatively good speedups as new workers were included in the execution.