InCoB2007 - August 30, 2007 - HKUST
“Speedup Bioinformatics Applications on Multicore-based Processor using
Vectorizing & Multithreading Strategies”
King Mongkut’s Institute of Technology, Ladkrabang,
Thailand
National Center for Genetic Engineering and Biotechnology, Thailand
Dr. Surin Kittitornkun, Dr. Sissades Tongsima
Kridsadakorn [email protected]
Outline
Introduction
Case Study
Existing works
Speedup of our approach
Comparison
Discussion
Our strategies
Limitation
Conclusion
Motivation
New multicore processors keep being launched: dual-core CPUs, quad-core CPUs.
How can we make use of these new technologies?
Motivation [2]
What is the difference between old and new CPUs?
Dual-core: maximum speedup ~2x
Quad-core: maximum speedup ~4x
Problems
Is old sequential software still used?
  Yes, especially scientific and bioinformatics tools.
Why do scientists still use it?
  They mostly care about novel algorithms and knowledge, not about speed.
Why don't we use a PC cluster?
  A cluster is very expensive and consumes much more electric power. You don't need a PC cluster to run a small program that searches, matches, or groups data.
Our Contribution
The hardware has changed, so old sequential software should change too. To harness the power of the new multicore architecture, certain compiler techniques must be considered.
Using the popular ClustalW application as our case study, we applied optimization and multithreading techniques to speed up ClustalW.
Case Study: ClustalW
ClustalW is a general-purpose multiple sequence alignment program for DNA or proteins.
ClustalW example

Input sequences:
  S1  ALSK
  S2  TNSD
  S3  NASK
  S4  NTSD

All pairwise alignments give the distance matrix:

        S1  S2  S3  S4
  S1     0   9   4   7
  S2         0   8   3
  S3             0   7
  S4                 0

Neighbor joining gives the guide tree (horizontal axis: distance), which groups S1 with S3 and S2 with S4.

Multiple alignment steps:
1. Align S1 with S3
     -ALSK
     NA-SK
2. Align S2 with S4
     -TNSD
     NT-SD
3. Align (S1, S3) with (S2, S4)
     -ALSK
     -TNSD
     NA-SK
     NT-SD
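The step from the distance matrix to the first joined pair can be sketched as follows. This is a minimal sketch of the pair-selection criterion of standard neighbor joining, Q(i,j) = (n-2)d(i,j) - sum_k d(i,k) - sum_k d(j,k), applied to the example matrix; it is not ClustalW's own code, and the names `dist`, `row_sum`, `q_value`, and `pick_pair` are illustrative assumptions.

```c
#include <assert.h>
#include <stdio.h>

#define NSEQ 4 /* S1..S4 from the example */

/* Symmetric form of the example distance matrix. */
static const int dist[NSEQ][NSEQ] = {
    {0, 9, 4, 7},
    {9, 0, 8, 3},
    {4, 8, 0, 7},
    {7, 3, 7, 0},
};

/* Sum of the distances from sequence i to all other sequences. */
int row_sum(int i) {
    int sum = 0;
    for (int k = 0; k < NSEQ; k++)
        sum += dist[i][k];
    return sum;
}

/* Neighbor-joining criterion: the pair with the smallest Q is joined. */
int q_value(int i, int j) {
    return (NSEQ - 2) * dist[i][j] - row_sum(i) - row_sum(j);
}

/* Scans all pairs i < j and keeps the first pair with the minimum Q. */
void pick_pair(int *best_i, int *best_j) {
    assert(best_i != NULL && best_j != NULL);
    int best_q = q_value(0, 1);
    *best_i = 0;
    *best_j = 1;
    for (int i = 0; i < NSEQ; i++) {
        for (int j = i + 1; j < NSEQ; j++) {
            int q = q_value(i, j);
            if (q < best_q) { /* strict "<": ties keep the earlier pair */
                best_q = q;
                *best_i = i;
                *best_j = j;
            }
        }
    }
}

/* Prints the selected pair using the S1..S4 names of the example. */
void print_first_join(void) {
    int i, j;
    pick_pair(&i, &j);
    printf("join S%d with S%d (Q = %d)\n", i + 1, j + 1, q_value(i, j));
}
```

With four sequences, the pairs (S1, S3) and (S2, S4) tie on Q, so the tie-breaking rule decides which one is reported first; both pairs appear in the guide tree of the example.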
Existing works
ClustalW-MPI: ClustalW analysis using distributed and parallel computing
  K.B. Li, Bioinformatics 19, 2003
Parallel MSA: Parallel Multiple Sequence Alignment with Dynamic Scheduling
  J. Luo, I. Ahmad, M. Ahmed and R. Paul, ITCC’05
SGI: Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal, and MULTICLUSTAL
  D. Mikhailov, Haruna C., and R. Gomperts, SGI ChemBio
Speedup of our approach
Test data: 800 sequences, 1000 amino acids

                  Elapsed times (ms)
Running mode*   Distance Matrix   Neighbor Joining   Progressive Alignment   Overall speedup
I               11,918,672        932,718            333,110                 -
II              10,387,046        881,125            338,016                 1.14
III              9,656,750        880,969            327,985                 1.21
IV               7,009,875        511,047            252,984                 1.70
V                5,900,891        473,359            253,188                 1.98
VI               5,472,407        474,109            244,672                 2.12

*Note: The running modes are defined as follows:
(I) ClustalW without optimization
(II) ClustalW with optimization
(III) ClustalW with optimization and our assist
(IV) MT-ClustalW without optimization
(V) MT-ClustalW with optimization
(VI) MT-ClustalW with optimization and our assist

Data set: protein sequences from NCBI
Run time: from 3 h 40 min (mode I) down to 1 h 43 min (mode VI)
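The overall speedup column can be cross-checked against the elapsed times. The sketch below assumes the overall speedup is the total time of mode I divided by the total time of each mode, and that the three elapsed-time columns map to the stages in the order distance matrix, neighbor joining, progressive alignment; the struct and function names are illustrative.

```c
#include <assert.h>
#include <stdio.h>

#define NUM_MODES 6

/* Elapsed times in ms per running mode (800 sequences, 1000 amino acids). */
typedef struct {
    const char *mode;
    long distance_matrix_ms;
    long neighbor_joining_ms;
    long progressive_alignment_ms;
} Timing;

static const Timing timings[NUM_MODES] = {
    {"I",   11918672L, 932718L, 333110L},
    {"II",  10387046L, 881125L, 338016L},
    {"III",  9656750L, 880969L, 327985L},
    {"IV",   7009875L, 511047L, 252984L},
    {"V",    5900891L, 473359L, 253188L},
    {"VI",   5472407L, 474109L, 244672L},
};

/* Total elapsed time of one running mode, summed over the three stages. */
long total_ms(int m) {
    assert(m >= 0 && m < NUM_MODES);
    const Timing *t = &timings[m];
    return t->distance_matrix_ms + t->neighbor_joining_ms +
           t->progressive_alignment_ms;
}

/* Speedup of mode m relative to mode I (index 0). */
double overall_speedup(int m) {
    return (double)total_ms(0) / (double)total_ms(m);
}

/* Prints total time and speedup for every running mode. */
void print_speedups(void) {
    for (int m = 0; m < NUM_MODES; m++)
        printf("%-3s %10ld ms  %.2fx\n", timings[m].mode, total_ms(m),
               overall_speedup(m));
}
```

Under these assumptions the recomputed ratios agree with the speedup column to within 0.01, and the totals for modes I and VI correspond to the quoted run times of 3 h 40 min and 1 h 43 min.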
ClustalW
Speedup of the optimized versions of ClustalW as a function of number of sequences. The sequence lengths are fixed at 800 and 1000 amino acids.
[Figure: speedup (%) versus number of sequences (200, 400, 600, 800); vertical axis from 10% to 26%. Series: len800 and len1000, each with only compiler optimization and with optimization with our assist.]
Multithreaded ClustalW
Speedup of the optimized versions of MT-ClustalW as a function of number of sequences. The sequence lengths are fixed at 800 and 1000 amino acids.
[Figure: speedup (%) versus number of sequences (200, 400, 600, 800); vertical axis from 95% to 115%. Series: len800 and len1000, each with only compiler optimization and with optimization with our assist.]
Comparison
                      ClustalW-MPI   Parallel MSA   SGI                         ClustalW-MTV
Number of sequences   500            80             600                         600
Sequence length       1100           289-399        390                         400
Machine               PC cluster     PC cluster     Single PC (shared memory)   Single PC (shared memory)
Processors            2              2              2                           2
Speedup               1.75x          1.8x           1.8x                        2.25x

Why is the speedup over 2x?
  Because of the special unit in the new CPU.
Does the special unit normally work with common software?
  No, we have to activate it.
Speedup > 2x for dual-CPU? [1]
Amdahl’s Law
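$$S = \frac{1}{(1 - f) + \frac{f}{k}}$$

where S is the speedup, f is the fraction of the original program that is improved, and k is the speedup of that fraction.

Original program: run time split into 1 - f and f
Modified program: run time split into 1 - f and f/k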
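Amdahl's law can be evaluated numerically to see why two processors alone cannot exceed 2x. The following is a minimal sketch; the function names and the sample fractions are illustrative assumptions, not measurements from the talk.

```c
#include <assert.h>
#include <stdio.h>

/* Amdahl's law: f is the fraction of the run time that is improved,
   k is the speedup of that fraction. */
double amdahl_speedup(double f, double k) {
    assert(f >= 0.0 && f <= 1.0 && k > 0.0);
    return 1.0 / ((1.0 - f) + f / k);
}

/* Prints the speedup for a few illustrative fractions at a given k. */
void print_amdahl_table(double k) {
    const double fractions[] = {0.50, 0.90, 1.00}; /* assumed sample values */
    const int count = (int)(sizeof fractions / sizeof fractions[0]);
    for (int i = 0; i < count; i++)
        printf("f = %.2f, k = %.0f: speedup = %.2f\n", fractions[i], k,
               amdahl_speedup(fractions[i], k));
}
```

With k = 2, the speedup reaches 2 only when f = 1, that is, when the whole program is parallelized. A measured speedup above 2x on two processors therefore has to come from an additional source of improvement.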
Speedup > 2x for dual-CPU? [2]
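$$\text{Speedup}_{total} = \text{Speedup}_{opt} \times \text{Speedup}_{mt}$$

$$\text{Speedup}_{total} = 1.21 \times 1.70 \approx 2.06$$

Speedup_opt = 1.21 (optimization, running mode III)
Speedup_mt = 1.70 (multithreading, running mode IV)
Data set: 800 sequences, 1000 amino acids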
Our strategies
Step 1: Analyzing and profiling
  Find the software structure and where the bottleneck is.
Step 2: Applying the methodologies
  Multithreading and vectorizing (one of the optimization methods).
Step 3: Validating
  Compare the result with the original one to make sure the result is unchanged.
Strategy: Multithreading
The proposed multithreading strategy
  Remove the bottleneck of the software, which is the non-threaded part.
  Raise the throughput of the program by applying the multithreading strategy.
  Reduce the overhead of thread creation.
Profile the software
[Figure: profile from Intel Thread Profiler showing the three stages: distance matrix, neighbor joining, and progressive alignment.]
Implementation
Apply the thread library to this loop
Trick
Reduce Thread Creation Overhead
4 threads

Threads      T1   T2   T3   T4
Parameters   P1   P2   P3   P4
             P5   P6   P7   P8
             P9   P10  P11  P12
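The 4-thread layout above can be sketched with POSIX threads. This is a sketch under stated assumptions: the slide does not name the thread library, so pthreads is a stand-in; the column layout is read as T1 handling P1, P5, P9, T2 handling P2, P6, P10, and so on; `process_param` is a hypothetical placeholder for the real work on one parameter set.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4  /* T1..T4 */
#define NUM_PARAMS 12  /* P1..P12 */

/* One result slot per parameter; each slot is written by exactly one thread. */
static int results[NUM_PARAMS];

/* Hypothetical placeholder for the real work on one parameter set. */
void process_param(int p) {
    results[p] = (p + 1) * (p + 1);
}

typedef struct {
    int first; /* index of the first parameter this thread handles */
} WorkerArg;

/* Each thread walks its own column: first, first + 4, first + 8. */
void *worker(void *arg) {
    const WorkerArg *w = arg;
    for (int p = w->first; p < NUM_PARAMS; p += NUM_THREADS)
        process_param(p);
    return NULL;
}

/* Creates NUM_THREADS threads once, instead of one thread per parameter.
   Returns the number of threads that were created. */
int run_all(void) {
    pthread_t tid[NUM_THREADS];
    WorkerArg args[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; t++) {
        args[t].first = t;
        int rc = pthread_create(&tid[t], NULL, worker, &args[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);
    return NUM_THREADS;
}
```

With this layout, 12 parameter sets cost 4 thread creations rather than 12. On most systems the code is compiled with the `-pthread` option.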
Strategy: Vectorizing
Proposed optimizing and vectorizing methodology
  Find the frequently used functions in the program.
  Apply the loop optimization methodologies.
  Use the Intel C++ Compiler to optimize the code, with the vectorizing option enabled.
Frequently used functions
Function Clockticks (%) Methodology*
diff 33.36 A,B
prfscore 15.93 C
forward_pass 14.91 -
calc_score 12.93 D
reverse_pass 11.45 A
pdiff 5.85 -
*Note: A is Loop reversal, B is Loop fission, C is Type Casting, and D is Procedure call reduction
Profiled by Intel VTune
Loop Reversal
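Loop reversal runs a loop backward. Reversing a for loop is legal when no iteration depends on the result of another, so the order in which the index set is traversed does not matter. That holds here, since each iteration only writes its own elements HH[i] and DD[i].

/* original loop: counts down */
for (i = se2; i > 0; i--) {
    HH[i] = -1;
    DD[i] = -1;
}

/* reversed loop: counts up */
for (i = 1; i <= se2; i++) {
    HH[i] = -1;
    DD[i] = -1;
}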
Loop Fission
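Loop fission breaks a single loop into two or more smaller loops. It can separate the conditionally executed statements from the rest of the loop body, which leaves a simple arithmetic loop that is easier for the compiler to vectorize.

/* original loop */
for (j = 0; j <= N; j++) {
    hh = HH[j] + RR[j];
    if (hh >= midh)
        if (HH[j] != DD[j] && RR[j] == SS[j]) {
            midh = hh;
            midj = j;
        }
}

/* after fission: the sums go into temp[], the conditional part runs separately */
for (j = 0; j <= N; j++) {
    temp[j] = HH[j] + RR[j];
}
for (j = 0; j <= N; j++) {
    if (temp[j] >= midh)
        if (HH[j] != DD[j] && RR[j] == SS[j]) {
            midh = temp[j];
            midj = j;
        }
}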
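The validation step of the strategy (checking that the result is unchanged) can be illustrated on this transformation. The sketch below wraps both loop versions in functions and runs them on the same input; the array contents, the bound `N`, and the function names are made-up test values for illustration, not ClustalW data.

```c
#include <assert.h>
#include <stdio.h>

#define N 7 /* assumed loop bound; the arrays hold N + 1 elements */

/* Made-up test input. */
static const int HH[N + 1] = {3, 5, 2, 8, 8, 1, 9, 4};
static const int DD[N + 1] = {3, 4, 2, 7, 8, 0, 9, 4};
static const int RR[N + 1] = {1, 2, 6, 1, 1, 7, 0, 5};
static const int SS[N + 1] = {1, 2, 5, 1, 0, 7, 0, 5};

/* Original single loop. midh and midj are updated in place. */
void find_mid_fused(int *midh, int *midj) {
    for (int j = 0; j <= N; j++) {
        int hh = HH[j] + RR[j];
        if (hh >= *midh)
            if (HH[j] != DD[j] && RR[j] == SS[j]) {
                *midh = hh;
                *midj = j;
            }
    }
}

/* Version after loop fission: sums first, conditional update second. */
void find_mid_fissioned(int *midh, int *midj) {
    int temp[N + 1];
    for (int j = 0; j <= N; j++)
        temp[j] = HH[j] + RR[j];
    for (int j = 0; j <= N; j++) {
        if (temp[j] >= *midh)
            if (HH[j] != DD[j] && RR[j] == SS[j]) {
                *midh = temp[j];
                *midj = j;
            }
    }
}

/* Runs both versions from the same starting state; returns 1 if they agree. */
int fission_matches(void) {
    int h1 = 0, j1 = -1, h2 = 0, j2 = -1;
    find_mid_fused(&h1, &j1);
    find_mid_fissioned(&h2, &j2);
    assert(h1 == h2 && j1 == j2);
    printf("midh = %d, midj = %d\n", h2, j2);
    return h1 == h2 && j1 == j2;
}
```

A check on one input does not prove equivalence in general; here equivalence follows because `temp[j]` holds exactly the value that `hh` had in iteration j and the second loop visits j in the same order.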
Limitation
Available compilers and programming languages
  C/C++: Intel C++ compiler (Windows, Linux, Mac)
  Fortran: Intel Fortran compiler (Windows, Linux, Mac)
Available processors
  CPU with Hyper-Threading technology or above (Intel, AMD)
Conclusion
Generic compiling strategy to assist the compiler in improving the performance of bioinformatics applications written in C/C++
Proposed framework: multithreading and vectorizing strategies
Higher speedup by taking advantage of multicore architecture technology
Proposed optimization could be more appropriate than making use of parallelization on a small cluster computer
Questions?
Thank you