InCoB2007 - August 30, 2007 - HKUST “Speedup Bioinformatics Applications on Multicore- based Processor using Vectorizing & Multithreading Strategies” King.

InCoB2007 - August 30, 2007 - HKUST

“Speedup Bioinformatics Applications on Multicore-based Processor using Vectorizing & Multithreading Strategies”

King Mongkut’s Institute of Technology, Ladkrabang, Thailand

National Center for Genetic Engineering and Biotechnology, Thailand

Dr. Surin Kittitornkun

Dr. Sissades Tongsima

Kridsadakorn [email protected]


Outline

Introduction
Case Study
Existing works
Speedup of our approach
Comparison
Discussion
Our strategies
Limitation
Conclusion


Motivation

New processors keep being launched.

How can we make use of the new technologies?

Dual-core CPU Quad-core CPU


Motivation [2]

What is the difference between old and new CPUs?


Dual-core: max. speedup ~2x

Quad-core: max. speedup ~4x

Problems

Is old sequential software still used?
Yes, especially scientific and bioinformatics tools.

Why do scientists still use it?
Mostly they care about novel algorithms and knowledge. Speed is not their main concern.

Why don't we use a PC cluster?
It is very expensive and consumes much more electric power. You don't need a PC cluster to run a small program for searching, matching, or grouping data.


Our Contribution

The hardware has changed, so old sequential software should change too. To harness the power of the new multicore architecture, certain compiler techniques must be considered.

Using the popular ClustalW application as our case study, we applied optimization and multithreading techniques to speed up ClustalW.


Case Study: ClustalW

ClustalW is a general-purpose multiple alignment program for DNA or proteins.


ClustalW example

Input sequences:

S1 ALSK
S2 TNSD
S3 NASK
S4 NTSD

All pairwise alignments produce the distance matrix:

      S1  S2  S3  S4
S1     0   9   4   7
S2         0   8   3
S3             0   7
S4                 0

Neighbor joining builds the guide tree from the distances: S1 joins S3, and S2 joins S4.

Multiple alignment steps:

1. Align S1 with S3

   S1 -ALSK
   S3 NA-SK

2. Align S2 with S4

   S2 -TNSD
   S4 NT-SD

3. Align (S1, S3) with (S2, S4)

   S1 -ALSK
   S2 -TNSD
   S3 NA-SK
   S4 NT-SD
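The pair selection behind the guide tree can be sketched in C. This is a minimal illustration of one neighbor-joining selection step on the 4x4 example matrix, not ClustalW's own code; the names `example_dist` and `nj_pick_pair` are hypothetical, and the criterion used is the standard neighbor-joining one, Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r(i) is the sum of row i.

```c
#include <assert.h>
#include <stddef.h>

#define NSEQ 4 /* number of sequences in the example */

/* Distance matrix from the example, written out symmetrically. */
static const int example_dist[NSEQ][NSEQ] = {
    {0, 9, 4, 7}, /* S1 */
    {9, 0, 8, 3}, /* S2 */
    {4, 8, 0, 7}, /* S3 */
    {7, 3, 7, 0}, /* S4 */
};

/*
 * One selection step of neighbor joining: find the pair (i, j), i < j,
 * with the smallest Q(i, j) = (n - 2) * d(i, j) - r(i) - r(j).
 * On a tie, the first pair found in row-major order is kept.
 */
void nj_pick_pair(const int d[NSEQ][NSEQ], size_t *best_i, size_t *best_j)
{
    int row_sum[NSEQ] = {0};
    int best_q = 0;
    int have_best = 0;

    assert(best_i != NULL && best_j != NULL);

    /* r(i): total distance from sequence i to all others. */
    for (size_t i = 0; i < NSEQ; i++)
        for (size_t j = 0; j < NSEQ; j++)
            row_sum[i] += d[i][j];

    for (size_t i = 0; i < NSEQ; i++) {
        for (size_t j = i + 1; j < NSEQ; j++) {
            int q = (NSEQ - 2) * d[i][j] - row_sum[i] - row_sum[j];
            if (!have_best || q < best_q) {
                best_q = q;
                *best_i = i;
                *best_j = j;
                have_best = 1;
            }
        }
    }
}
```

For this matrix the pairs (S1, S3) and (S2, S4) tie for the smallest Q, which agrees with the two joins in the guide tree; the function returns indices 0 and 2, that is S1 and S3, because it keeps the first minimum it meets.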


Existing works

ClustalW-MPI: ClustalW analysis using distributed and parallel computing
K.B. Li, Bioinformatics 19, 2003

Parallel MSA: Parallel Multiple Sequence Alignment with Dynamic Scheduling
J. Luo, I. Ahmad, M. Ahmed and R. Paul, ITCC’05

SGI: Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal, and MULTICLUSTAL
D. Mikhailov, Haruna C., and R. Gomperts, SGI ChemBio


Speedup of our approach

Test data: 800 sequences, 1000 amino acids

Elapsed times (ms)

Running mode*   Distance Matrix   Neighbor Joining   Progressive Alignment   Overall speedup
I               11,918,672        932,718            333,110                 -
II              10,387,046        881,125            338,016                 1.14
III              9,656,750        880,969            327,985                 1.21
IV               7,009,875        511,047            252,984                 1.70
V                5,900,891        473,359            253,188                 1.98
VI               5,472,407        474,109            244,672                 2.12

*Note: Running modes are defined as follows: (I) ClustalW without optimization, (II) ClustalW with optimization, (III) ClustalW with optimization and our assist, (IV) MT-ClustalW without optimization, (V) MT-ClustalW with optimization, (VI) MT-ClustalW with optimization and our assist.
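The overall speedup figures can be checked with a few lines of C. This sketch assumes the overall speedup is the ratio of the summed stage times of mode I to those of the other mode; the slides do not state the formula, and the struct and function names are hypothetical.

```c
#include <assert.h>

/* Elapsed time of the three ClustalW stages, in milliseconds. */
struct stage_times {
    long long distance_matrix_ms;
    long long neighbor_joining_ms;
    long long progressive_alignment_ms;
};

/* Total elapsed time of one run. */
long long total_ms(struct stage_times t)
{
    return t.distance_matrix_ms + t.neighbor_joining_ms
         + t.progressive_alignment_ms;
}

/* Assumed definition: baseline total divided by improved total. */
double overall_speedup(struct stage_times baseline, struct stage_times improved)
{
    long long improved_total = total_ms(improved);

    assert(improved_total > 0);
    return (double)total_ms(baseline) / (double)improved_total;
}
```

With the measured times, mode I sums to 13,184,500 ms (about 3 h 40 min) and mode VI to 6,191,188 ms (about 1 h 43 min), a ratio of roughly 2.13, close to the 2.12 reported; the small difference may come from how the reported figures were rounded.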


Data set: protein sequences from NCBI

Run time: from 3 h 40 min down to 1 h 43 min

ClustalW

Speedup of the optimized versions of ClustalW as a function of number of sequences. The sequence lengths are fixed at 800 and 1000 amino acids.

[Chart: Speedup (%), axis from 10% to 26%, versus number of sequences, 200 to 800. Legend: len800, only compiler optimization; len800, optimization with our assist.]



Multithreaded ClustalW

Speedup of the optimized versions of MT-ClustalW as a function of number of sequences. The sequence lengths are fixed at 800 and 1000 amino acids.

[Chart: Speedup (%), axis from 95% to 115%, versus number of sequences, 200 to 800.]




Comparison

                      ClustalW-MPI   Parallel MSA   SGI                        ClustalW-MTV
Number of sequences   500            80             600                        600
Sequence length       1100           289-399        390                        400
Machine               PC cluster     PC cluster     Single PC, shared memory   Single PC, shared memory
Processors            2              2              2                          2
Speedup               1.75x          1.8x           1.8x                       2.25x

Why is the speedup over 2x?
Because of the special unit in the new CPU.

Does the special unit normally work with common software?
No, we have to activate it.

Speedup > 2x for dual-CPU? [1]

Amdahl’s Law

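$$\text{Speedup} = \frac{1}{(1 - f) + \dfrac{f}{k}}$$

Original program: a fraction 1-f is left unchanged and a fraction f is improved.

Modified program: the fraction f runs k times faster.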
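Amdahl's Law is easy to evaluate numerically. The function below is a minimal sketch, with the hypothetical name `amdahl_speedup`, where `f` is the improved fraction of the original run time and `k` is the factor by which that fraction is sped up.

```c
#include <assert.h>

/*
 * Amdahl's Law: overall speedup when a fraction f of the run time
 * is made k times faster and the remaining 1 - f is unchanged.
 */
double amdahl_speedup(double f, double k)
{
    assert(f >= 0.0 && f <= 1.0);
    assert(k > 0.0);
    return 1.0 / ((1.0 - f) + f / k);
}
```

With k = 2 for two cores, the result stays at or below 2 for any f, so multithreading alone cannot explain a speedup above 2x on a dual-CPU machine. For example, `amdahl_speedup(0.5, 2.0)` is about 1.33.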

Speedup > 2x for dual-CPU? [2]

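$$\text{Speedup}_{total} = \text{Speedup}_{opt} \times \text{Speedup}_{mt}$$

$$\text{Speedup}_{total} = 1.21 \times 1.70 \approx 2.06$$

Optimization alone: speedup 1.21

Multithreading alone: speedup 1.70

Data set: 800 sequences, 1000 amino acids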

Our strategies

Step 1: Analyzing and profiling
To find the software structure and where the bottleneck is.

Step 2: Applying the methodologies
Multithreading and vectorizing (one of the optimization methods).

Step 3: Validating
To compare the result with the original one and make sure the result has not changed.


Strategy: Multithreading

The Proposed Multithreading Strategy
To remove the bottleneck of the software, which is the non-threaded part
To raise the throughput of the program by applying a multithreading strategy
To reduce the overhead of thread creation


Profile the software

Profiled by Intel Thread Profiler

Distance matrix

Neighbor joining

Progressive alignment


Implementation

Apply the thread library to this loop

Trick

Reduce Thread Creation Overhead

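4 threads:

T1 T2 T3 T4

Parameters:

P1 P2 P3 P4
P5 P6 P7 P8
P9 P10 P11 P12

Each thread is created once and processes a whole column of parameter sets (T1 takes P1, P5, and P9), instead of one thread being created per parameter set.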
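The idea of creating only 4 threads for 12 parameter sets can be sketched with POSIX threads. This is an illustration under stated assumptions, not the MT-ClustalW code: the slides do not name the thread library, `process_param` is a hypothetical stand-in for the real work on one parameter set, and the cyclic assignment (thread t takes items t, t + 4, t + 8) follows the layout in the diagram.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4
#define NUM_PARAMS 12

struct worker_arg {
    int thread_id;      /* 0 .. NUM_THREADS - 1 */
    const int *params;  /* shared, read-only input */
    long *results;      /* each index is written by exactly one thread */
    int num_params;
};

/* Hypothetical stand-in for the real work done on one parameter set. */
static long process_param(int p)
{
    return (long)p * p;
}

/* Each worker handles every NUM_THREADS-th parameter set. */
static void *worker(void *arg_ptr)
{
    struct worker_arg *arg = arg_ptr;

    for (int i = arg->thread_id; i < arg->num_params; i += NUM_THREADS)
        arg->results[i] = process_param(arg->params[i]);
    return NULL;
}

/* Creates NUM_THREADS threads once; returns 0 on success, -1 on error. */
int run_all(const int *params, long *results, int num_params)
{
    pthread_t threads[NUM_THREADS];
    struct worker_arg args[NUM_THREADS];
    int created = 0;
    int status = 0;

    assert(params != NULL && results != NULL);

    for (int t = 0; t < NUM_THREADS; t++) {
        args[t] = (struct worker_arg){ .thread_id = t, .params = params,
                                       .results = results,
                                       .num_params = num_params };
        if (pthread_create(&threads[t], NULL, worker, &args[t]) != 0) {
            status = -1;
            break;
        }
        created++;
    }
    /* Wait for every thread that was actually started. */
    for (int t = 0; t < created; t++)
        if (pthread_join(threads[t], NULL) != 0)
            status = -1;
    return status;
}
```

Because no two threads write the same `results` index, no locking is needed. The thread-creation cost is paid 4 times rather than 12, and the saving grows with the number of parameter sets.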


Strategy: Vectorizing

Proposed Optimizing and Vectorizing Methodology
Find the frequently used functions in the program
Apply the loop optimizing methodologies
Use the Intel C++ Compiler to optimize the code, with the vectorizing option enabled


Frequently used functions


Function Clockticks (%) Methodology*

diff 33.36 A,B

prfscore 15.93 C

forward_pass 14.91 -

calc_score 12.93 D

reverse_pass 11.45 A

pdiff 5.85 -

*Note: A is Loop reversal, B is Loop fission, C is Type Casting, and D is Procedure call reduction

Profiled by Intel VTune

Loop Reversal

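That is, to run a loop backward. Reversing a for loop is legal when its iterations are independent of one another, so that the result does not depend on the order in which the index set is visited. In the loop below, each iteration writes only its own HH[i] and DD[i].

Forward loop:

    for (i = 1; i <= se2; i++) {
        HH[i] = -1;
        DD[i] = -1;
    }

Reversed loop:

    for (i = se2; i > 0; i--) {
        HH[i] = -1;
        DD[i] = -1;
    }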
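When reversal preserves the result can be checked with a small C sketch. The first two functions wrap the loop shown above in both directions; the running-sum pair is a counter-example added here for illustration, not taken from ClustalW, in which each iteration reads the value written by the previous one. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Forward form: fills HH[1..se2] and DD[1..se2] with -1. */
void init_forward(int *HH, int *DD, int se2)
{
    assert(HH != NULL && DD != NULL);
    for (int i = 1; i <= se2; i++) {
        HH[i] = -1;
        DD[i] = -1;
    }
}

/* Reversed form: same stores, visited from se2 down to 1. */
void init_reversed(int *HH, int *DD, int se2)
{
    assert(HH != NULL && DD != NULL);
    for (int i = se2; i > 0; i--) {
        HH[i] = -1;
        DD[i] = -1;
    }
}

/* Counter-example: a[i] depends on a[i - 1] from the previous iteration. */
void running_sum_forward(int *a, int n)
{
    for (int i = 1; i < n; i++)
        a[i] += a[i - 1];
}

/* Reversing this loop changes the result, so reversal is not legal here. */
void running_sum_reversed(int *a, int n)
{
    for (int i = n - 1; i > 0; i--)
        a[i] += a[i - 1];
}
```

The two initialization functions leave the arrays in the same state. For the running sum over {1, 1, 1, 1}, the forward loop gives {1, 2, 3, 4} and the reversed loop gives {1, 2, 2, 2}, which is why a loop-carried dependence rules out reversal.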


Loop Fission

A single loop can be broken into two or more smaller loops. Loop fission can break up the block of conditionally executed statements.

Single loop:

    for (j = 0; j <= N; j++) {
        hh = HH[j] + RR[j];
        if (hh >= midh)
            if (HH[j] != DD[j] && RR[j] == SS[j]) {
                midh = hh;
                midj = j;
            }
    }

After fission:

    for (j = 0; j <= N; j++) {
        temp[j] = HH[j] + RR[j];
    }

    for (j = 0; j <= N; j++) {
        if (temp[j] >= midh)
            if (HH[j] != DD[j] && RR[j] == SS[j]) {
                midh = temp[j];
                midj = j;
            }
    }
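Whether fission preserves the result can be checked by wrapping both versions in functions and comparing their outputs. This is a sketch under assumptions: the function names, the `struct midpoint` return type, and the caller-provided scratch buffer `temp` are hypothetical, and the array contents in the note below are made-up test values.

```c
#include <assert.h>
#include <stddef.h>

struct midpoint {
    int midh;
    int midj;
};

/* Single loop: the addition and the conditional update share one body. */
struct midpoint find_mid_single(const int *HH, const int *DD, const int *RR,
                                const int *SS, int N, struct midpoint start)
{
    struct midpoint m = start;

    for (int j = 0; j <= N; j++) {
        int hh = HH[j] + RR[j];
        if (hh >= m.midh)
            if (HH[j] != DD[j] && RR[j] == SS[j]) {
                m.midh = hh;
                m.midj = j;
            }
    }
    return m;
}

/*
 * After fission: the first loop is a branch-free element-wise addition,
 * which is the kind of loop a vectorizing compiler can handle; the second
 * loop keeps the conditional update. temp must hold N + 1 ints.
 */
struct midpoint find_mid_fission(const int *HH, const int *DD, const int *RR,
                                 const int *SS, int N, struct midpoint start,
                                 int *temp)
{
    struct midpoint m = start;

    assert(temp != NULL);
    for (int j = 0; j <= N; j++)
        temp[j] = HH[j] + RR[j];

    for (int j = 0; j <= N; j++) {
        if (temp[j] >= m.midh)
            if (HH[j] != DD[j] && RR[j] == SS[j]) {
                m.midh = temp[j];
                m.midj = j;
            }
    }
    return m;
}
```

For HH = {1, 5, 3, 7}, DD = {1, 2, 3, 7}, RR = {2, 4, 6, 1}, SS = {2, 4, 0, 1}, N = 3, and a start of midh = 0, midj = -1, both functions return midh = 9 and midj = 1. The price of fission is the extra `temp` array and a second pass over the data.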

Limitation

Available compilers and programming languages
C/C++: Intel C++ compiler (Windows, Linux, Mac)
Fortran: Intel Fortran compiler (Windows, Linux, Mac)

Available processors
CPU with Hyper-Threading technology or above (Intel, AMD)


Conclusion

Generic compiling strategy to assist the compiler in improving the performance of bioinformatics applications written in C/C++

Proposed framework: multithreading and vectorizing strategies

Higher speedup by taking advantage of multicore architecture technology

Proposed optimization could be more appropriate than making use of parallelization on a small cluster computer


Questions?

Thank you
