Enabling Large Scale Scientific Computations for Expressed Sequence Tag Sequencing over Grid and Cloud Computing Clusters Sangmi Lee Pallickara, Marlon Pierce, Qunfeng Dong, Chin Hua Kong Indiana University, Bloomington IN, USA *Presented by Marlon Pierce
IU to lead New US NSF Track 2d $10M Award
The EST Pipeline
• The goal is to cluster mRNA sequences
– Overlapping sequences are grouped together into different clusters and then
– A consensus sequence is derived from each cluster.
– CAP3 is one program to assemble contiguous sequences.
• Data sources: NCBI GenBank, short read gene sequencers in the lab, etc.
– Too large to do with serial codes like CAP3
• We use PaCE (S. Aluru) to do a pre-clustering step for large sequences (parallel problem).
– 1 large data set --> many smaller clusters
– Each individual cluster can be fed into CAP3.
– We replaced the memory problem with the many-task problem.
– This is data-file parallel.
• Next step: do the CAP3 consensus sequences match any known sequences?
– BLAST is also data-file parallel, good for Clouds
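The "data-file parallel" idea above can be sketched roughly as follows: once a pre-clustering step has assigned each sequence to a cluster, every cluster is written to its own FASTA file and the files are processed independently. This is only an illustrative sketch, not the Swarm implementation. The function names, the file naming scheme, and the `count_sequences` stand-in (used here in place of a real assembler such as CAP3) are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def write_cluster_files(
    records: List[Tuple[str, str]],
    cluster_of: Dict[str, int],
    out_dir: Path,
) -> List[Path]:
    """Write one FASTA file per cluster (hypothetical naming: cluster_<id>.fasta)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    by_cluster: Dict[int, List[Tuple[str, str]]] = {}
    for seq_id, seq in records:
        by_cluster.setdefault(cluster_of[seq_id], []).append((seq_id, seq))
    paths = []
    for cluster_id, members in sorted(by_cluster.items()):
        path = out_dir / f"cluster_{cluster_id}.fasta"
        path.write_text("".join(f">{sid}\n{seq}\n" for sid, seq in members))
        paths.append(path)
    return paths


def run_per_cluster(
    paths: List[Path],
    assemble: Callable[[Path], object],
    workers: int = 4,
) -> Dict[str, object]:
    """Apply `assemble` to every cluster file independently (many small tasks)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(assemble, paths))
    return {path.name: result for path, result in zip(paths, results)}


def count_sequences(path: Path) -> int:
    """Stand-in for a real assembler call: counts FASTA records in the file."""
    return sum(1 for line in path.read_text().splitlines() if line.startswith(">"))
```

In a real deployment the `assemble` callable would launch the assembler on a remote Grid or Cloud node rather than run locally; because no task depends on another, the same pattern applies to the BLAST step.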
http://swarm.cgb.indiana.edu
• Our goal is to provide a Web service-based science portal that can handle the largest mRNA clustering problems.
• Computation is outsourced to Grids (TeraGrid) and Clouds (Amazon)
– Not provided by in-house clusters.
• This is an open service, open architecture approach.
Some TeraGrid Resources
Data obtained from NIH NCBI. 4.7 GB of raw data processed using PaCE on Big Red. Clusters shown to be processed with CAP3.
Swarm: Large scale job submission infrastructure over distributed clusters
• Web Service to submit and monitor 10,000s (or more) of serial or parallel jobs.
• Capabilities:
– Scheduling large numbers of jobs over distributed HPC clusters (Grid clusters, Cloud clusters and MS Windows HPC clusters)
– Monitoring framework for large scale jobs
– Standard Web service interface for web applications
– Extensible design for domain-specific software logic
– Brokers both Grid and Cloud submissions
• Other applications:
– Calculate properties of all drug-like molecules in PubChem (Gaussian)
– Docking problems in drug discovery (Amber, Autodock)
(Revised) Architecture of Swarm Service
[Figure: architecture diagram. Labels: Standard Web Service Interface; Large Task Load Optimizer; Swarm-Analysis; Local RDBMS; Swarm-Grid Connector, Swarm-Dryad Connector, Swarm-Hadoop Connector; Swarm-Grid, Swarm-Dryad, Swarm-Hadoop; Grid HPC/Condor Cluster; Windows Server Cluster; Cloud Comp. Cluster.]
Swarm-Grid
• Swarm considers traditional Grid HPC clusters suitable for high-throughput jobs.
– Parallel jobs (e.g. MPI jobs)
– Long running jobs
• Resource Ranking Manager
– Prioritizes the resources with QBETS, INCA
• Fault Manager
– Fatal faults
– Recoverable faults
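A rough sketch of how resource ranking and fault handling might fit together is given below. It is not Swarm's actual code. It assumes that each resource carries a predicted queue wait (the kind of estimate QBETS provides) and a health flag (the kind of status a monitoring service such as INCA reports). The class names, field names, and the failover policy (try the next-ranked resource after a recoverable fault) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Resource:
    name: str
    predicted_wait_s: float  # assumed input, e.g. a queue-wait estimate
    healthy: bool            # assumed input, e.g. from resource monitoring


class RecoverableFault(Exception):
    """A failure worth retrying elsewhere (e.g. a transient submission error)."""


class FatalFault(Exception):
    """A failure that retrying cannot fix (e.g. no usable resource remains)."""


def rank_resources(resources: List[Resource]) -> List[Resource]:
    """Drop unhealthy resources, then order by shortest predicted wait."""
    usable = [r for r in resources if r.healthy]
    return sorted(usable, key=lambda r: r.predicted_wait_s)


def submit_with_failover(
    job: str,
    resources: List[Resource],
    submit: Callable[[str, Resource], str],
    max_attempts: int = 3,
) -> str:
    """Try ranked resources in order; move on after a recoverable fault."""
    for attempt, resource in enumerate(rank_resources(resources), start=1):
        if attempt > max_attempts:
            break
        try:
            return submit(job, resource)
        except RecoverableFault:
            continue  # fall through to the next-best resource
    raise FatalFault(f"could not place job {job!r}")
```

A `FatalFault` raised inside `submit` is deliberately not caught, so it reaches the caller immediately instead of triggering another attempt.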
[Figure: Job Execution Time in Swarm-Dryad with Windows HPC, 16 nodes]
[Figure: Job Execution Time in Swarm-Dryad with various numbers of nodes]
EST Sequencing Pipeline
• EST (Expressed Sequence Tag): A fragment of a messenger RNA (mRNA), which is transcribed from a gene residing on a chromosome.
• EST Sequencing: Reconstructing the full length of the mRNA sequence for each expressed gene by assembling EST fragments.
• EST sequencing is a standard practice for gene discovery, especially for organisms whose genomes may be too complex for whole-genome sequencing (e.g. wheat).
• EST contigs are important data for accurate gene annotation.
• A pipeline of computational steps is required:
– E.g. repeat masking, PaCE, CAP3 or other assembler on clustered data set
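To make "assembling EST fragments" concrete, here is a toy greedy merge of exactly overlapping fragments. It is a deliberately simplified illustration and not the algorithm used by CAP3 or PaCE: real assemblers tolerate sequencing errors, handle reverse complements, and compute a true consensus. The function names and the `min_len` threshold are assumptions made for the example.

```python
from typing import List


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if no exact overlap of at least `min_len` characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: List[str], min_len: int = 3) -> List[str]:
    """Repeatedly merge the pair of fragments with the largest exact overlap."""
    frags = list(dict.fromkeys(fragments))  # drop duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i == j:
                    continue
                k = overlap(a, b, min_len)
                if k > best_k:
                    best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return frags  # nothing left to merge: these are the contigs
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)


contigs = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTAG"])
print(contigs)
```

Fragments that share no sufficiently long overlap with any other fragment are returned unmerged, as separate contigs.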
Computing resources for compute-intensive Biological Research
• Biological research requires a substantial amount of computing resources.
• Much current computing is based on limited local computing infrastructure.
• Available computing resources include:
– US national cyberinfrastructure (e.g. TeraGrid) good fit for