Enabling Large Scale Scientific Computations for Expressed Sequence Tag Sequencing over Grid and Cloud Computing Clusters Sangmi Lee Pallickara, Marlon.

Enabling Large Scale Scientific Computations for Expressed

Sequence Tag Sequencing over Grid and Cloud Computing Clusters

Sangmi Lee Pallickara, Marlon Pierce, Qunfeng Dong, Chin Hua Kong

Indiana University, Bloomington IN, USA

*Presented by Marlon Pierce

IU to lead New US NSF Track 2d $10M Award

The EST Pipeline• The goal is to cluster mRNA sequences

– Overlapping sequences are grouped together into different clusters and then

– A consensus sequence is derived from each cluster.– CAP3 is one program to assemble contiguous sequences.

• Data sources: NCBI GenBank, short read gene sequencers in the lab, etc.– Too large to do with serial codes like CAP3

• We use PaCE (S. Aluru) to do a pre-clustering step for large sequences (parallel problem).– 1 large data set --> many smaller clusters – Each individual cluster can be fed into CAP3. – We replaced the memory problem with the many-task problem.– This is data-file parallel.

• Next step: do the CAP3 consensus sequences match any known sequences?– BLAST also data-file parallel, good for Clouds

http://swarm.cgb.indiana.edu

• Our goal is to provide a Web service-based science portal that can handle the largest mRNA clustering problems.

• Computation is outsourced to Grids (TeraGrid) and Clouds (Amazon) – Not provided by in-

house clusters. • This is an open service,

open architecture approach.

Some TeraGrid Resources

Data obtained from NIH NCBI. 4.7 GB raw data processed using PACE on Big Red. Clusters shown to be processed with CAP3.

Swarm: Large scale job submission infrastructure over the distributed clusters• Web Service to submit and monitor 10,000’s (or more)

serial or parallel jobs.• Capabilities: – Scheduling large number of jobs over distributed HPC clusters

(Grid clusters, Cloud cluster and MS Windows HPC cluster)– Monitoring framework for the large scale jobs– Standard Web service interface for web application– Extensible design for the domain specific software logics– Brokers both Grid and Cloud submissions

• Other applications: – Calculate properties of all drug-like molecules in PubChem

(Gaussian)– Docking problems in drug discovery (Amber, Autodock)

(Revised) Architecture of Swarm Service

Windows Server Cluster

Windows Server Cluster

Swarm-Grid

Swarm-Dryad

Local RDMBSLocal RDMBS

Swarm-AnalysisStandard Web Service Interface

Large Task Load Optimizer

Swarm-Grid Connector

Swarm-Dryad Connector

Swarm-Hadoop Connector

Cloud Comp. Cluster

Cloud Comp. Cluster

Grid HPC/Condor Cluster

Grid HPC/Condor Cluster

Swarm-Hadoop

Swarm-Grid• Swarm considers

traditional Grid HPC cluster are suitable for the high-throughput jobs.– Parallel jobs (e.g. MPI

jobs)– Long running jobs

• Resource Ranking Manager– Prioritizes the resources

with QBETS, INCA• Fault Manager– Fatal faults– Recoverable faults

Resource Ranking Manager

Grid HPC/Condor pool Resource Connector

Condor(Grid/Vanilla) with Birdbath

Grid HPC ClustersGrid HPC ClustersGrid HPC ClustersGrid HPC ClustersGrid HPC ClustersGrid HPC ClustersGrid HPC ClustersGrid HPC Clusters

Condor ClusterCondor Cluster

Standard Web Service Interface

Swarm-Grid

QBETS Web

Service

QBETS Web

Service

Local RDMBS

Local RDMBS

MyProxyServer

MyProxyServer

Hosted by TeraGrid Project

Hosted by UCSB

Request Manager

Job Distributor

Job Queue

Data Model Manager Fault Manager

User A’s Job Board

Swarm-Hadoop• Suitable for short running

serial job collections• Submit jobs to the cloud

computing clusters: Amazon’s EC2 or Eucalyptus

• Uses Hadoop map-reduce engine.

• Each job processed as a single Map function:

• Input/output location is determined by the Data Model Manager– Easy to modify for the

domain specific requirements.

Swarm-Hadoop

Hadoop Map Reduce Programming interface

Cloud Computing Cluster

Cloud Computing Cluster

Hadoop Resource Connector

Job Producer

DataModel Manager Fault Manager

Local RDMBSLocal RDMBS

Standard WebService Interface

Job Buffer

Request Manager


Performance Evaluation• Java JDK 1.6 or higher, Apache Axis2• Server: 3.40 GHz Inter Pentium 4 CPU, 1GB RAM

• Swarm Grid:– Backend TeraGrid machines: Big Red (Indiana University), Ranger (Texas Advanced

Computing Center), and NSTG (Oak Ridge National Lab)

• Swarm-Hadoop:– Computing Nodes: Amazon Web Service EC2 cluster with m1.small instance (2.5 GHz

Dual-core AMD Opteron with 1.7GB RAM)

• Swarm-Windows HPC:– Microsoft Windows HPC cluster, 2.39GHz CPUs, 49.15GB RAM, 24 cores, 4 sockets

• Dataset: partial set of the human EST fragments (published by NCBI GenBank)

– 4.6 GB total– Groupings: Very small job(less than 1 minute), small job(1~3 minutes), Medium

job(3~10 minutes), large job(longer than 10 minutes)

Total Execution time of CAP3 execution for the various numbers of jobs (~1 minute) with Swarm-

Grid, Swarm-Hadoop, and Swarm-Dryad

Job Execution time in Swarm-Hadoop

Conclusions• Bioinformatics needs both computing Grids and

scientific Clouds– Problem sizes range over many orders of magnitude

• Swarm is designed to bridge the gap between the two, while supporting 10,000’s or more jobs per user per problem.

• Smart scheduling is an issue in data-parallel computing– Small Jobs(~1min) were processed more efficiently by

Swarm-Hadoop and Swarm-Dryad. – Grid style HPC clusters adds minutes (or even longer) of

overhead to each of jobs.– Grid style HPC clusters still provide stable environment

for large scale parallel jobs.

More Information

• Email: leesangm AT cs.indiana.edu mpierce AT cs.indiana.edu

• Swarm Web Site: http://www.collab-ogce.org/ogce/index.php/Swarm

• Swarm on SourceForge: http://ogce.svn.sourceforge.net/viewvc/ogce/ogce-services-incubator/swarm/

mailto:[email protected]

mailto:[email protected]

http://www.collab-ogce.org/ogce/index.php/Swarm

http://www.collab-ogce.org/ogce/index.php/Swarm

http://ogce.svn.sourceforge.net/viewvc/ogce/ogce-services-incubator/swarm/

http://ogce.svn.sourceforge.net/viewvc/ogce/ogce-services-incubator/swarm/

Computational Challenges in the EST Sequencing

• Challenge 1: Executing tens of thousands of jobs.– More than 100 plant species have at least 10,000 EST

sequences; tens of thousand assembly jobs are processed.– Standard queuing systems used by Grid based clusters do

NOT allow users to submit 1000s of jobs concurrently to batch queue systems.

• Challenge 2: Requirement of job processing is various– To complete EST assembly process, various types of

computation jobs must be processed. E.g. large scale parallel processing, serial processing, and embarrassingly parallel jobs.

– Suitable computing resource will optimize the performance of the computation.

Tools for EST Sequence Assembly

Cleaning Cleaning sequence readssequence reads

RepeatMaskerSEG, XNU,

RepeatRunner, PILER

Clustering Clustering sequence readssequence reads

PaCE

Cap3 Clustering, BAG, Clusterer,

CluSTr, UI Cluster and many more

Assemble readsAssemble reads Cap3FAKII, GAP4, PHRAP, TIGR Assembler

Similarity searchSimilarity search Blast

FASTA, Smith-Waterman, Needleman-

Wunsch

http://en.wikipedia.org/wiki/Sequence_clustering

Swarm-Grid: Submitting High-throughput jobs-2

• User(personal account, community account) based job management: policies in the Gird clusters are based on the user.

• Job Distributor: matchmaking available resources and submitted jobs.

• Job Execution Manager: submit jobs through CondorG using birdbath WS interface

• Condor resource connector manages to job to be submitted to the Grid HPC clusters or traditional Condor cluster.

Resource Ranking Manager

Grid HPC/Condor pool Resource Connector

Condor(Grid/Vanilla) with Birdbath

Grid HPC ClustersGrid HPC ClustersGrid HPC ClustersGrid HPC ClustersGrid HPC ClustersGrid HPC ClustersGrid HPC ClustersGrid HPC Clusters

Condor ClusterCondor Cluster

Standard WebService InterfaceSwarm-Grid

QBET Web

Service

QBET Web

Service

Local RDMBS

Local RDMBS

MyProxyServer

MyProxyServer

Hosted by TeraGrid Project

Hosted by UCSB

Request Manager

Job Distributor

Job Queue

DataModel Manager Fault Manager


Job Execution Time in Swarm-DryAd with Windows HPC 16 nodes

Job Execution Time in Swarm-DryAd various number of nodes

EST Sequencing Pipeline• EST (Expressed Sequence Tag): A fragment of Messenger

RNAs (mRNAs) which is transcribed from the genes residing on chromosomes.

• EST Sequencing: Re-constructing full length of mRNA sequences for each expressed gene by means of assembling EST fragments.

• EST sequencing is a standard practice for gene discovery, especially for the genomes of many organisms which may be too complex for whole-genome sequencing. (e.g. wheat)

• EST contigs are important data for accurate gene annotation.

• A pipeline of computational steps is required:– E.g. repeat masking, PaCE, CAP3 or other assembler on

clustered data set

Computing resources for computing intensive Biological Research

• Biologically based researches require substantial amount of computing resources.

• Many of current computing is based on the limited local computing infrastructure.

• Available computing resources include:– US national cyberinfrastructure (e.g. TeraGrid) good fit for

closely coupled parallel application– Cloud computing clusters (e.g. Amazon EC2, Eucalyptus) :

good for on-demand jobs that individually last a few seconds or minutes

– Microsoft Windows based HPC cluster(DryAd) : Job submission environment without conventional overhead of batch queue systems

Enabling Large Scale Scientific Computations for Expressed Sequence Tag Sequencing over Grid and Cloud Computing Clusters Sangmi Lee Pallickara, Marlon.

Documents

swarmgrid swarm

cloud cluster

cluster cloud

traditional grid hpc

individual cluster

swarmhadoop suitable

parallel jobs

autodock slide