Parallelized Multiple Sequence Alignment on the Public Cloud
Presented by: Dr. G. Sudha Sadasivam, Professor, Dept of CSE, PSG College of Technology, Coimbatore
Co-authors: Mr B. Vijayan, Mr S. Arul Prakash, Mr K.V. Hari Babu, Students, BE(CSE), Dept of CSE, PSG College of Technology, Coimbatore
Agenda
Sequence alignment
Introduction to Clouds
Approaches for MSA
Problem statement
System Architecture
Illustration of working of the system
Analysis
Experimental results
Conclusion
What is Sequence Alignment?
The procedure of comparing two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences.
Uses:
Sequence similarity
Phylogenetic tree analysis
Multiple Sequence Alignment
A multiple sequence alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA.
The input is a set of query sequences that are assumed to have an evolutionary relationship by which they share a lineage and are descended from a common ancestor.
From the resulting multiple sequence alignment, phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins.
MSA Approaches
Dynamic programming
Progressive alignment
Iterative approach
Dynamic Programming
Direct method for MSA that identifies the globally optimal alignment.
An n-dimensional equivalent of the pairwise alignment matrix is formed.
The search space increases exponentially with increasing n (number of sequences) and is strongly dependent on sequence length (N).
Computational complexity: O(N^n)
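The pairwise case (n = 2) of this matrix can be sketched in a few lines of Python. This is a minimal Needleman-Wunsch style fill, not the presenters' code; the scoring values (match +1, mismatch -1, gap -1) and the function name `nw_matrix` are assumptions chosen for illustration.

```python
def nw_matrix(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the global-alignment score matrix for a (rows) and b (columns).

    The scoring values are assumed for illustration only.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against gaps only.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each cell takes the best of a diagonal step (match or mismatch)
    # and the two gap steps.
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            diag = score[i - 1][j - 1] + step
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(diag, up, left)
    return score


matrix = nw_matrix("ATA", "AGTA")
for row in matrix:
    print(row)
```

For two sequences the fill touches roughly N * N cells; extending the same idea to n sequences needs an n-dimensional matrix, which is where the O(N^n) cost comes from.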
Progressive Alignment
Heuristic search: builds up a final MSA by combining pairwise alignments, beginning with the most similar pair and progressing to the most distantly related.
Stages:
1) The relationships between the sequences are represented as a tree, called a guide tree, built from pairwise alignment scores.
2) The MSA is built by adding the sequences sequentially to the growing MSA according to the guide tree.
Example with seq 1, seq 2, seq 3 and seq 4. According to the guide tree:
1) Align seq 1 and seq 2
2) Align seq 3 with respect to seq 1 and 2
3) Align seq 4 to the alignment of seq 1, 2 and 3
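The ordering step can be sketched as follows. This is a simplified greedy stand-in for a real guide tree, under assumed scoring (match +1, mismatch -1, gap -1); the helper names `pair_score` and `progressive_order` are hypothetical and not taken from the presentation.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b, keeping only two matrix rows."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + step, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def progressive_order(seqs):
    """Return sequence indices in the order they would join the alignment."""
    scores = {
        (i, j): pair_score(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }
    # Start with the most similar pair.
    order = list(max(scores, key=scores.get))
    remaining = set(range(len(seqs))) - set(order)
    while remaining:
        # Next, add the sequence closest to anything already aligned.
        best = max(
            remaining,
            key=lambda r: max(scores[tuple(sorted((r, o)))] for o in order),
        )
        order.append(best)
        remaining.remove(best)
    return order


order = progressive_order(["AGTA", "ATA", "GAT"])
print(order)
```

A full implementation would build the tree itself (for example by clustering the pairwise scores) and align profiles rather than single sequences; the sketch only shows how pairwise scores drive the joining order.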
The primary problem is that errors made at any stage in growing the MSA are propagated through to the final result. Randomized or iterative approaches are used to address this.
Performance is also particularly poor when all of the sequences in the set are distantly related.
Map reduce Architecture
[Figure: keyed inputs (K1,I to K6,I) are distributed to Map Tasks 1 to 3, whose outputs are passed to Reduce Tasks 1 and 2]
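The flow in the figure (keyed inputs, map tasks, grouping by key, reduce tasks) can be imitated in a single process. The sketch below is a generic illustration, not Hadoop code and not the presenters' job; `run_mapreduce`, `count_bases` and `total` are hypothetical names, and the base-counting job is only a placeholder workload.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Apply mapper to each (key, value), group outputs by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # shuffle: group by output key
    return {k: reducer(k, values) for k, values in groups.items()}


def count_bases(key, fragment):
    """Hypothetical map task: emit (base, 1) for every character."""
    return [(base, 1) for base in fragment]


def total(key, values):
    """Hypothetical reduce task: sum the counts for one base."""
    return sum(values)


records = [("K1", "AGTA"), ("K2", "ATA"), ("K3", "GAT")]
counts = run_mapreduce(records, count_bases, total)
print(counts)
```

In a real cluster the map calls run on different nodes and the grouping step moves data across the network; the control flow is the same.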
A single Combination: An illustration
S1 = "AGTA"; S2 = "ATA"; S3 = "GAT"

1. ALIGNMENT OF S1 & S2
        0    1    2    3    4
             A    G    T    A
0       0   -1   -2   -3   -4
1  A   -1    1    0   -1   -2
2  T   -2    0    0    1    0
3  A   -3   -1   -1    0    2
SCORE: 4
A1S1: "AGTA"; A1S2: "A_TA"

2. ALIGNMENT OF A1S1 & S3
        0    1    2    3    4
             A    G    T    A
0       0   -1   -2   -3   -4
1  G   -1   -1    0   -1   -2
2  A   -2    0   -1    1    0
3  T   -3   -1   -1    0   -1
SCORE: -5
A2S1: "AG_TA"; A1S3: "_GAT_"

3. ALIGNMENT OF A1S2 & A1S3
        0    1    2    3    4    5
             A    _    T    A    _
0       0   -1   -2   -3   -4   -5
1  _   -1    0    0   -1   -2   -3
2  G   -2   -1   -1   -1   -2   -2
3  A   -3   -1   -1   -2    0   -1
4  T   -4   -2   -1    0   -1    0
5  _   -5   -3   -1   -1    0    0
SCORE: -3
A2S2: "A__TA_"; A2S3: "_GAT__"
Analysis
Complexity Measure     Proposed Method        Conventional Method
Score calculation      O(N)                   O(n*N)
Pairwise alignment     O(K^2)                 O(N^2)
MSA                    O[K^2 * n(n-1)/2]      O(N^n)
'n': Number of sequences
'N': Average length of a sequence
'k': Average number of blocks in a sequence
'K': Size of 1 block
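To get a feel for the gap between the two MSA bounds, the expressions can be evaluated for sample values. The numbers below (N = 1000, K = 100, n = 4) are hypothetical and chosen only for illustration; constants hidden by the O-notation are ignored.

```python
def conventional_cost(N, n):
    """Cell count suggested by the conventional bound O(N^n)."""
    return N ** n


def proposed_cost(K, n):
    """Cell count suggested by the proposed bound O[K^2 * n(n-1)/2]."""
    return K ** 2 * n * (n - 1) // 2


N, K, n = 1000, 100, 4  # hypothetical sizes, not from the experiments
conventional = conventional_cost(N, n)
proposed = proposed_cost(K, n)
print(conventional, proposed)
```

The n(n-1)/2 factor is the number of sequence pairs, so the proposed bound grows quadratically in the number of sequences instead of exponentially.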
2. Parallelised data transfer
'T': Time for sequence transfer serially; 'k': block size
T/k: Time for sequence transfer in parallel
3. Dynamic cluster creation
Advantage: Computation power of the remote cluster is used optimally and not wasted
Disadvantage: Time to set up the cluster
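The split, parallel send and merge steps behind the parallelised transfer can be sketched with a thread pool. This is a local simulation under stated assumptions: `send_block` is a hypothetical stand-in for a real network upload, and the block size and worker count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def split_blocks(data: bytes, block_size: int):
    """Cut the payload into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def send_block(indexed_block):
    """Hypothetical stand-in for uploading one block; returns it unchanged."""
    index, block = indexed_block
    return index, block


def parallel_transfer(data: bytes, block_size: int, workers: int = 4) -> bytes:
    """Send all blocks concurrently, then merge them back in index order."""
    blocks = split_blocks(data, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        received = list(pool.map(send_block, enumerate(blocks)))
    return b"".join(block for _, block in sorted(received))


payload = b"AGTA" * 1000
result = parallel_transfer(payload, 512)
print(len(result))
```

Carrying the block index with each block lets the receiver merge correctly even if blocks arrive out of order, which is the extra split and merge cost paid for the shorter communication time.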
Experimental Setup
Machines: Core 2 Duo processors, 2.8 GHz, 160 GB HD, 2 GB RAM
LAN: 100 Mbps
OS: RHEL v5
Client virtual environment: 4 VMs
Server cluster: 5 machines
Hadoop DFS in fully distributed mode
OpenVZ was used for virtualization
Effect of parallel file transfer
File Size (MB)   File Transfer (sec)   Split Time (sec)   Merge Time (sec)   C1 (sec)   T1 (sec)   C2 (sec)   T2 (sec)
100              6.23                  0.02               0.03               2.13       2.18       0.73       0.78
200              9.32                  0.23               0.43               2.96       3.62       1.23       1.89
300              11.43                 0.85               1.64               3.84       6.33       1.16       3.65
C1: Communication time from 3 client VMs to the server without multithreading
C2: Communication time from 3 client VMs to the server with multithreading
T1: Total time for file transfer from client to server without multithreading
T2: Total time for file transfer from client to server with multithreading
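The totals in the table are consistent with T = split time + merge time + communication time. The short check below recomputes T1 and T2 from the table's own values and derives the ratio T1/T2; the variable and function names are illustrative.

```python
# (file size MB, split, merge, C1, C2), copied from the table above
rows = [
    (100, 0.02, 0.03, 2.13, 0.73),
    (200, 0.23, 0.43, 2.96, 1.23),
    (300, 0.85, 1.64, 3.84, 1.16),
]


def totals(split, merge, c1, c2):
    """Total transfer time without (T1) and with (T2) multithreading."""
    t1 = round(split + merge + c1, 2)
    t2 = round(split + merge + c2, 2)
    return t1, t2


results = {size: totals(s, m, c1, c2) for size, s, m, c1, c2 in rows}
for size, (t1, t2) in results.items():
    print(f"{size} MB: T1={t1} s, T2={t2} s, T1/T2={t1 / t2:.2f}")
```

Because split and merge time grow with file size and are paid in both cases, the ratio T1/T2 shrinks as the files get larger.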
Time to start virtual machines
[Figure: time in sec (y-axis, 0 to 120) against number of VMs (x-axis, 1 to 4)]
Parallelised starting of VMs can be done to reduce time
Cluster performance with respect to the number of VMs
30 KB sequences with 2 KB splits, up to 5 sequences
[Figure: time in sec (y-axis, 0 to 350) against number of sequences (x-axis, 1 to 10), with one series for 4 slave VMs and one for 6 slave VMs]
When the number of sequences is less than 6, a five-node Hadoop cluster is sufficient.
Dynamic scaling up/down of clusters
Block size: 10 MB
Static: VM creation based on predicted application load (maps + reduces)
Dynamic: VM creation based on actual application load (maps + reduces)
File Size (GB)   Static Time (min-sec)   Dynamic VMs   Dynamic Time (min-sec)   New VMs added
1                5-36                    2             3-16                     1
2                5-52                    3             5-40                     1
3                8-27                    4             5-48                     2
5                12-13                   5             6-39                     9
VMs were instantiated based on the number of Map-Reduce tasks.
The number of tasks was checked dynamically; new VMs were started and tasks were reallocated.
Old VMs were destroyed if not used.
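The scaling rule described above can be sketched as a small decision function. This is an assumed policy for illustration, not the presenters' implementation: the `slots_per_vm` parameter (how many tasks one VM runs at a time) and the function names are hypothetical.

```python
import math


def vms_needed(map_tasks, reduce_tasks, slots_per_vm):
    """Smallest number of VMs that can hold all tasks (at least one)."""
    return max(1, math.ceil((map_tasks + reduce_tasks) / slots_per_vm))


def scaling_action(current_vms, map_tasks, reduce_tasks, slots_per_vm=4):
    """Decide whether to start VMs, destroy unused ones, or do nothing."""
    target = vms_needed(map_tasks, reduce_tasks, slots_per_vm)
    if target > current_vms:
        return ("start", target - current_vms)
    if target < current_vms:
        return ("destroy", current_vms - target)
    return ("keep", 0)


print(scaling_action(current_vms=2, map_tasks=10, reduce_tasks=2))
print(scaling_action(current_vms=5, map_tasks=6, reduce_tasks=2))
```

A controller would call such a function periodically with the live task counts and then reallocate tasks onto the resulting set of VMs.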
Conclusion
1) The proposed MSA improves computation time while maintaining accuracy.
Sequence alignment is parallelised at three levels.
Hadoop data grids: data and compute parallelism, and scalability
Dynamic programming: accuracy
2) Complexity is reduced from O(N^n) to O[K^2 * n(n-1)/2]
Combining progressive and dynamic approaches
Blocking in Hadoop
3) Enhancements (using clouds for MSA)
Automatic configuration of the cloud environment based on the computational needs
Efficient upload of data into the HDFS by parallel transfer of sequence fragments over the Internet
Acknowledgements
The research has been carried out as a result of the PSG-Yahoo Research programme on Grid and Cloud computing.
Sincere thanks to
1) Dr R Rudramoorthy, Principal,
PSG College of Technology, Coimbatore.
2) Mr K V Chidambaran,
Director, Grid and Cloud Systems Group,
Yahoo, Bangalore
THANK YOU
QUESTIONS?