Top Banner
Chun-Yuan Lin, Assistant Processor Department of Computer Science and Information Engineering Chang Gung University GPU-REMuSiC on Graphics Processing Units 2011/5/20 1 PPCB
47

GPU-REMuSiC on Graphics Processing Units

Feb 03, 2022

Download

Documents

dariahiddleston
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: GPU-REMuSiC on Graphics Processing Units

Chun-Yuan Lin, Assistant Processor

Department of Computer Science and Information Engineering

Chang Gung University

GPU-REMuSiC on Graphics

Processing Units

2011/5/20 1 PPCB

Page 2: GPU-REMuSiC on Graphics Processing Units

Outline

Sequence alignment problem

Pair-wise sequence alignment

Multiple sequence alignment

Previous studies

Parallel computing for dynamic programming

Related works on CUDA

Constrain multiple sequence alignment

Developments of a series of tools

RE-MuSiC

GPU-REMuSiC implementation

2011/5/20 2 PPCB

Page 3: GPU-REMuSiC on Graphics Processing Units

Sequence alignment problem

2011/5/20 3 PPCB

Page 4: GPU-REMuSiC on Graphics Processing Units

2011/5/20 4 PPCB

Page 5: GPU-REMuSiC on Graphics Processing Units

2011/5/20 5 PPCB

Page 6: GPU-REMuSiC on Graphics Processing Units

2011/5/20 6 PPCB

Page 7: GPU-REMuSiC on Graphics Processing Units

2011/5/20 7 PPCB

Page 8: GPU-REMuSiC on Graphics Processing Units

2011/5/20 8 PPCB

Page 9: GPU-REMuSiC on Graphics Processing Units

2011/5/20 9 PPCB

Page 10: GPU-REMuSiC on Graphics Processing Units

2011/5/20 10 PPCB

Page 11: GPU-REMuSiC on Graphics Processing Units

PAM100

2011/5/20 11 PPCB

Page 12: GPU-REMuSiC on Graphics Processing Units

2011/5/20 12 PPCB

Page 13: GPU-REMuSiC on Graphics Processing Units

2011/5/20 13 PPCB

Page 14: GPU-REMuSiC on Graphics Processing Units

2011/5/20 14 PPCB

Page 15: GPU-REMuSiC on Graphics Processing Units

2011/5/20 15 PPCB

Page 16: GPU-REMuSiC on Graphics Processing Units

2011/5/20 16 PPCB

Page 17: GPU-REMuSiC on Graphics Processing Units

2011/5/20 17 PPCB

Page 18: GPU-REMuSiC on Graphics Processing Units

We can not use dynamic programming for k sequences.

2011/5/20 18 PPCB

Page 19: GPU-REMuSiC on Graphics Processing Units

2011/5/20 19 PPCB

Page 20: GPU-REMuSiC on Graphics Processing Units

2011/5/20 20 PPCB

Page 21: GPU-REMuSiC on Graphics Processing Units

Previous studies

2011/5/20 21 PPCB

Page 22: GPU-REMuSiC on Graphics Processing Units

2011/5/20 22 PPCB

Page 23: GPU-REMuSiC on Graphics Processing Units

2011/5/20 23 PPCB

Page 24: GPU-REMuSiC on Graphics Processing Units

Some bioinformatics applications have been successfully

ported to GPGPU in the past.

Liu et al. (IPDPS 2006) implemented the Smith-Waterman algorithm to

run on the nVidia GeForce 6800 GTO and GeForce 7800 GTX, and

reported an approximate 16× speedup by computing the alignment

score of multiple cells simultaneously.

Charalambous et al. (LNCS 2005) ported an expensive loop from

RAxML, an application for phylogenetic tree construction, and achieved a

1.2× speedup on the nVidia GeForce 5700 LE.

2011/5/20 24 PPCB

Page 25: GPU-REMuSiC on Graphics Processing Units

Liu et al. (IEEE TPDS 2007) presented a GPGPU approach to

high-performance biological sequence alignment based on

commodity PC graphics hardware. (C++ and OpenGL Shading

Language (GLSL))

Pairwise Sequence Alignment (Smith-Waterman algorithm, scan database)

Multiple sequence alignment (MSA)

2011/5/20 25 PPCB

Page 26: GPU-REMuSiC on Graphics Processing Units

Some bioinformatics applications have been successfully

ported to CUDA now.

Smith-Waterman algorithm (goal: scan database)

Manavski and Valle (BMC Bioinformatics 2008),

Striemer and Akoglu (IPDPS 2009),

Liu et al. (BMC Research Notes 2009)

Khajeh-Saeed et al. (Journal of Computational Physics, 2010)

Liu et al. (BMC Research Notes 2010)

Multiple sequence alignment (ClustalW)

Liu et al. (IPDPS 2009) for Neighbor-Joining Trees construction

Liu et al. (ASAP 2009)

2011/5/20 26 PPCB

Page 27: GPU-REMuSiC on Graphics Processing Units

Manavski and Valle present the first solution based on commodity hardware that efficiently computes the exact Smith-Waterman alignment. It runs from 2 to 30 times faster than any previous implementation on general-purpose hardware.

Pre-compute a query profile parallel to the query sequence for each possible residue.

The implementation in CUDA was to make each GPU thread compute the whole alignment of the query sequence with one database sequence. (pre-order the sequences of the database in function of their length)

The ordered database is stored in the global memory, while the query-profile is saved into the texture memory.

For each alignment the matrix is computed column by column in order parallel to the query sequence. (store them in the local memory of the thread)

2011/5/20 27 PPCB

Page 28: GPU-REMuSiC on Graphics Processing Units

Striemer and Akoglu further study the effect of memory organization and

the instruction set architecture on GPU performance.

They pointed out that query profile in Manavski’s method has a major drawback

in utilizing the texture memory of the GPU that leads to unnecessary caches

misses. (larger than 8KB)

Long sequence problem.

They placed the substitution matrix in the constant memory to exploit the

constant cache, and created an efficient cost function to access it. (modulo

operator is extremely inefficient on CUDA, not use hash function)

They mapped query sequence as well as the substitution matrix to the constant

memory.

2011/5/20 28 PPCB

Page 29: GPU-REMuSiC on Graphics Processing Units

Liu et al. proposed Two versions of CUDASW++ are implemented: a

single-GPU version and a multi-GPU version.

The alignment can be computed in minor-diagonal order from the top-left

corner to the bottom-right corner in the alignment matrix.

Considering the optimal local alignment of a query sequence and a subject

sequence as a task.

Inter-task parallelization: Each task is assigned to exactly one thread and dimBlock

tasks are performed in parallel by different threads in a thread block.

Intra-task parallelization: Each task is assigned to one thread block and all dimBlock

threads in the thread block cooperate to perform the task in parallel.

2011/5/20 29 PPCB

Page 30: GPU-REMuSiC on Graphics Processing Units

Khajeh-Saeed et al. showed that effective use of the GPU requires a novel

reformulation of the Smith–Waterman algorithm.

2011/5/20 30 PPCB

Page 31: GPU-REMuSiC on Graphics Processing Units

Liu et al. described the latest release of the CUDASW++ software,

CUDASW++ 2.0, which makes new contributions to Smith-Waterman

protein database searches using compute unified device architecture

(CUDA).

2011/5/20 31 PPCB

Page 32: GPU-REMuSiC on Graphics Processing Units

Liu et al. presents MSA-CUDA, a parallel MSA program, which

parallelizes all three stages of the ClustalW processing pipeline using

CUDA.

Pairwise distance computation:

As the work in Liu et al. (BMC Research Notes 2009)

Neighbor-Joining Trees: as the work in Liu et al. (IPDPS 2009)

Progressive alignment:

conducted iteratively in a

multi-pass way.

2011/5/20 32 PPCB

Page 33: GPU-REMuSiC on Graphics Processing Units

Constrain multiple sequence alignment

2011/5/20 33 PPCB

Page 34: GPU-REMuSiC on Graphics Processing Units

(Tang, CSB 2002) P = HKH

Constrain Multiple Sequence Alignment (CMSA)

2011/5/20 34 PPCB

Page 35: GPU-REMuSiC on Graphics Processing Units

G. Myers, S. Selznick, Z. Zhang and W. Miller. “Progressive multiple alignment with constraints,” Proceedings of the 1st ACM Conference on Computational Molecular Biology, 1997.

(Position constraint, is jt)

C.Y. Tang, C.L. Lu, M.D.T. Chang, Y.T. Tsai, etc, “Constrained Multiple Sequence Alignment Tool Development and Its Application to RNase Family Alignment,” Journal of Bioinformatics and Computational Biology, vol. 1, 2003. (CSB 2002) (CMSA)

(Character constraint, P: a set of single residue/nucleotide)

Y.T. Tsai, Y.P. Huang, C.T. Yu and C.L. Lu, “MuSiC: a tool for multiple sequence alignment with constraints,” Bioinformatics, vol. 20, 2004. (MuSiC)

(Segment constraint, P: a set of short segment) C.L. Lu and Y.P. Huang, “A memory-efficient algorithm for multiple

sequence alignment with constraints,” Bioinformatics, vol. 21, 2005. (MuSiC-ME)

2011/5/20 35 PPCB

Page 36: GPU-REMuSiC on Graphics Processing Units

Y.-T. Tsai, “The constrained common sequence problem,” Information Processing Letters, vol. 88, 2003.

F. Y.L. Chin, etc,

“A simple algorithm for the constrained sequence problems,” IPL, vol. 90, 2004.

“ Efficient Constrained Multiple Sequence Alignment with Performance Guarantee,” JBCB, vol. 3, 2005. (CSB, 2003)

A. N. Arslan, etc.

“Algorithms for the constrained common sequence problem,” Proc. Prague Stringology Conference, 2004.

“A fast Algorithm for the Constrained Multiple Sequence Alignment problem,” Acta Cybernetica (accepted)

“A parallel algorithm for the constrained multiple sequence alignment problem,” BIBE 2005.

“A space-efficient algorithm for the constrained pairwise sequence alignment problem,” GIW 2005.

“FastPCMSA: An improved parallel algorithm for the constrained multiple sequence alignment problem,” FCS 2006.

“A* Algorithms for the Constrained Multiple Sequence Alignment Problem,” ICAI 2006.

“Space-efficient Parallel Algorithms for the Constrained Multiple Sequence Alignment Problem,” BIOCOMP 2006.

2011/5/20 36 PPCB

Page 37: GPU-REMuSiC on Graphics Processing Units

2011/5/20 37 PPCB

Page 38: GPU-REMuSiC on Graphics Processing Units

RE-MuSiC steps Step1. Compute the score of the global sequence alignment without

any constraint using the Needleman-Wunsch algorithm between all pairs of the K sequences and build the distant matrix.

Step2. Follow the results of Step1 to create a complete graph G of K sequences.

Step3. Use the complete graph G to construct a Kruskal merging order tree .

Step4. Use the created in Step3 as a guide tree. Progressively align the sequences according to the branching order of the Two closest pre-aligned groups of sequences are joined by CPSA algorithm to represented sequences of two groups. (ClustalW)

2011/5/20 38 PPCB

Page 39: GPU-REMuSiC on Graphics Processing Units

GPU-REMuSiC implementation

2011/5/20 39 PPCB

Page 40: GPU-REMuSiC on Graphics Processing Units

In the past, we have designed an efficient parallel algorithm for the

constrained multiple sequence alignment based on the memory-efficient

algorithm (MuSiC-ME). (PCMSA, MPI)

3k

0

5

10

15

20

25

30

0 5 10 15 20 25 30 35

total processers

spee

d up

32 sequences 64 sequences 128 sequences

256 sequences 512 sequences

4k

0

5

10

15

20

25

30

0 5 10 15 20 25 30 35

total processers

spee

d up

32 sequences 64 sequences 128 sequences

256 sequences 512 sequences

5k

0

5

10

15

20

25

30

0 5 10 15 20 25 30 35

total processers

spee

d u

p

32 sequences 64 sequences 128 sequences

256 sequences 512 sequences

2011/5/20 40 PPCB

Page 41: GPU-REMuSiC on Graphics Processing Units

Define eight possible

methods for computing

dynamic programming

by CUDA

(a)SRST (b) SRMT

Figure 3: SRST and SRMT.

(a) ARST (b) ARMT

Figure 4: ARST and ARMT.

(a) SDST (b) SDMT

Figure 5: SDST and SDMT.

(a) ADST (b) ADMT

Figure 6: ADST and ADMT.

2011/5/20 41 PPCB

Page 42: GPU-REMuSiC on Graphics Processing Units

Figure 7. Distance matrix and the thread blocks allocation.

Figure 8. The load balancing cutting technology by two GPUs

2011/5/20 42 PPCB

Page 43: GPU-REMuSiC on Graphics Processing Units

(Device) Grid

Constant

Memory

Texture

Memory

Global

Memory

Block (0, 0)

Shared Memory

Local

Memory

Thread (0, 0)

Registers

Local

Memory

Thread (1, 0)

Registers

Block (1, 0)

Shared Memory

Local

Memory

Thread (0, 0)

Registers

Local

Memory

Thread (1, 0)

Registers

Host

Intra-task parallelization

•Sequences (global memory)

•Scoring matrix (constant memory)

•alphabetical order

•ASCII indexing

•Si and Sj stored in shared memory

•Result stored in shared memory

•Memory efficient

•Only score

2011/5/20 43 PPCB

Page 44: GPU-REMuSiC on Graphics Processing Units

Sp

eedu

ps

Sequences

90.4 85.75 92.39 87.63

179.24 177.27 187.38 181.17

229.58

268.1 255.41273.1272.04

374.86342.78

383.14

0

50

100

150

200

250

300

350

400

450

400(856) 2000(266) 1000(858) 2000(858)

1 GPU 2 GPU 3 GPU 4 GPU

(DP computation time)

2011/5/20 44 PPCB

Page 45: GPU-REMuSiC on Graphics Processing Units

Sp

eedu

ps

Sequences

Figure 10. The speedup ratios by comparing GPU-REMuSiC and RE-MuSiC.

12.51

22.78

12.59 13.5113.3

26.36

13.5

21.1

13.5

28.05

13.68

21.9

13.52

29.18

13.93

22.36

0

5

10

15

20

25

30

35

400(856) 2000(266) 1000(858) 2000(858)

1 GPU 2 GPU 3 GPU 4 GPU

2011/5/20 45 PPCB

Page 46: GPU-REMuSiC on Graphics Processing Units

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

CPU 1 GPU 2 GPU 3 GPU 4 GPU

Other Distant Matrix

2011/5/20 46 PPCB

Page 47: GPU-REMuSiC on Graphics Processing Units

Sp

eedu

ps

Length

0

50

100

150

200

250

300

350

0 100 200 300 400 500 600 700

1 GPU 2 GPU 3 GPU 4 GPU

2011/5/20 47 PPCB