Page 1: MetaPathways v2.0: A master-worker model for environmental Pathway/Genome Database construction on grids and clouds

MetaPathways v2.0: A master-worker model for environmental Pathway/Genome Database construction on grids and clouds

Niels W. Hanson
Graduate Program in Bioinformatics, University of British Columbia, Canada
Email: [email protected]

Kishori M. Konwar, Shang-Ju Wu, Steven J. Hallam
Department of Microbiology & Immunology, University of British Columbia, Canada
Email: [email protected]

Abstract—The development of high-throughput sequencing technologies over the past decade has generated a tidal wave of environmental sequence information from a variety of natural and human-engineered ecosystems. The resulting flood of information into public databases and archived sequencing projects has exponentially expanded computational resource requirements, rendering most local homology-based search methods inefficient. We recently introduced MetaPathways v1.0, a modular annotation and analysis pipeline for constructing environmental Pathway/Genome Databases (ePGDBs) from environmental sequence information, capable of using the Sun Grid Engine for external resource partitioning. However, a command-line interface and facile task management introduced user activation barriers with a concomitant decrease in fault tolerance.

Here we present MetaPathways v2.0, incorporating a graphical user interface (GUI) and refined task management methods. The MetaPathways GUI provides an intuitive display for setup and process monitoring and supports interactive data visualization and sub-setting via a custom Knowledge Engine data structure. A master-worker model is adopted for task management, allowing users to scavenge computational results from a number of worker grids in an ad hoc, asynchronous, distributed network that dramatically increases fault tolerance. This model facilitates the use of EC2 instances, extending ePGDB construction to the Amazon Elastic Compute Cloud.

I. INTRODUCTION

The development of high-throughput sequencing technologies over the past decade has generated a tidal wave of environmental sequence information from a variety of natural and human-engineered ecosystems, creating new computational challenges and opportunities [1], [2]. For example, one of the primary modes of inferring microbial community structure and function from environmental sequence information involves database searches using local alignment algorithms such as BLAST [3]. Unfortunately, BLAST-based queries have become increasingly inefficient on stand-alone machines as reference databases and sequence inputs increase in size and complexity. The advent of adaptive seed approaches such as LAST [4] for homology searches shows promise in overcoming runtime bottlenecks when implemented on grid or cloud computing resources [5]. However, many academic researchers simply do not have the technical expertise or infrastructure needed to perform these calculations, and must turn to online information processing services.

The use of online services, such as Metagenome Rapid Annotation using Subsystem Technology (MG-RAST), increases user access to information storage, gene prediction, and annotation services [6], [7]. Although democratizing infrastructure access, the use of online services can insulate users from parameter optimization, impose formatting and data transfer restrictions, and create barriers to downstream analytic or visualization methods. Large grid computing resources offer an alternative by providing access to high-performance compute infrastructure on local or regional scales. Such computing environments are often composed of multiple grids implementing different batch scheduling and queuing systems, e.g., Portable Batch System (PBS), TORQUE, Sun Grid Engine (SGE), or SLURM. Because most federated systems limit the number of jobs a user can submit into a batch-processing queue at one time, scheduling irregularities and job failure are not uncommon when attempting to process large datasets across multiple such environments.

Task management can be improved with algorithms that split and load-balance across multiple grid environments, monitor job status to resubmit failed jobs or submit new jobs in succession, and consolidate batch results upon completion. These improvements can increase fault tolerance and reduce manual administration requirements. The development of such a method requires a number of considerations: (1) job splitting and merging with appropriate checks on completion and correctness, (2) automated grid installation of search tools such as BLAST/LAST and required databases in a non-redundant manner, (3) bandwidth optimization, (4) minimization of redundant job processing on more than one grid, (5) tolerance to job failure or intermittent availability of computing resources, (6) load-balancing to deal with slow, high-volume, or small clusters, and (7) an efficient client tool to integrate processes running on the local user's machine.
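Consideration (1), job splitting and merging with completeness checks, can be sketched in a few lines of Python. This is an illustrative sketch rather than the pipeline's actual code; the record representation and batch-id bookkeeping are assumptions.

```python
def split_into_batches(records, batch_size=500):
    """Split query records into fixed-size batches for independent jobs.

    Yields (batch_id, batch) pairs; batch ids make it possible to verify
    later that every job came back before merging.
    """
    for i in range(0, len(records), batch_size):
        yield i // batch_size, records[i:i + batch_size]


def merge_batches(results_by_id, n_batches):
    """Concatenate per-batch results in order, failing on missing batches."""
    missing = [b for b in range(n_batches) if b not in results_by_id]
    if missing:
        raise RuntimeError("incomplete run; missing batches: %s" % missing)
    merged = []
    for b in range(n_batches):
        merged.extend(results_by_id[b])
    return merged
```

Splitting 106,500 query sequences into batches of 500, as in the evaluation later in this paper, would yield 213 such jobs.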

Practically speaking, searching a large set of query sequences against reference databases by splitting them into


smaller query files can be approximately viewed as an instance of the well-studied Do-All problem: p processors must cooperatively perform n tasks in the presence of an Adversary [8]. One of the most popular approaches is the master-worker model [9], [10], where the master process sends tasks across the Internet to worker processes, and workers execute them and report back the results. Although the literature is replete with such models [8], multiple-grid computing imposes new management challenges due to indirect communication between master and worker via head nodes. The head nodes themselves have limited communication and compute capacity, and are restricted to performing simple calculations and job submissions. An optimal client tool should automatically minimize redundant or unproductive communications, e.g., repeatedly trying to submit jobs to or transfer results from a slow server, and adaptively increase a server's involvement should its performance improve.

Astronomers and mathematicians have encountered similar data processing challenges, and software has been developed for task management across distributed compute networks. For example, SETI@HOME takes advantage of idle CPU hours from volunteers willing to install custom software to search the cosmos for intelligent life [11], and PrimeNet has a similar implementation for calculating the largest Mersenne prime numbers (http://www.mersenne.org/primenet/). Both projects are tightly controlled to accomplish monolithic tasks under the auspices of major organizations, and the software is not readily transferable to academic users interested in alternative applications such as sequence homology searches. The recent development of MetaPathways [12], an analytical pipeline for sequence annotation and pathway inference using the PathoLogic algorithm and Pathway Tools to construct environmental Pathway/Genome Databases (ePGDBs) [13], [14], [15], provides a facile method for distributing homology searches through the integration of an individual grid. However, this software does not take advantage of all available compute resources should more grids become available, nor does it address the previously discussed challenges associated with ad hoc distributed compute networks.

Here we describe MetaPathways v2.0, representing a series of improvements to the existing pipeline that attempt to address the aforementioned computational and data integration issues. First, automated multiple-grid management allows computationally intensive sections of the pipeline to be performed by multiple compute grids simultaneously in an ad hoc distributed system, accommodating dynamic availability, addition, and removal of compute clusters. Because many potential users do not have dedicated access to compute clusters, this implementation includes a module for use of the Amazon Elastic Compute Cloud (EC2) (http://aws.amazon.com/ec2/) through integration with the StarCluster library (http://star.mit.edu/cluster/). Finally, we improve the usability of MetaPathways through the development of a graphical user interface (GUI) for parameter setup, run monitoring, and result management. Further, the integration and efficient query of results is empowered via a custom

Knowledge Engine data structure. Use of this structure is integrated into customized data summary tables, visualization modules, and data export features for downstream analysis (e.g., the Metagenome Analyzer (MEGAN) [16] and the R programming environment). The GUI is written in C++ with the Qt 5.0 library under the LGPL license, and is fully compatible with Mac OS X and Linux-based operating systems.

II. IMPLEMENTATION

A. Multi-grid brokering

MetaPathways v2.0 coordinates and manages the computation of sequence homology searches on compute grids implementing the Sun Grid Engine/Open Grid Engine [17] or TORQUE (http://www.adaptivecomputing.com/) batch-job queuing systems. Expanding from an individual compute grid to many grids in an ad hoc asynchronous distributed network incurs a number of additional algorithmic challenges in terms of job coordination, worker monitoring, fault tolerance, and efficient job migration. The previous implementation of MetaPathways controlled an individual grid. In the current version, a variant of the master-worker model (perhaps a 'master-cluster' model) has been implemented, with the local machine taking on the role of a master Broker, coordinating a set of worker grids that compute tasks on the Broker's behalf. This setup is analogous to the way operating systems and Internet supercomputers queue requests and return results.

The Broker, operating on the local machine, commissions worker grids to set up as "BlastServices" that compute individual sequence homology search jobs much in the same way as the individual grid did in MetaPathways v1.0. However, the availability of many worker grids with varying levels of throughput in an ad hoc distributed network greatly increases the complexity and asynchronicity of job distribution and result harvesting. The Broker not only has to monitor jobs in the context of worker grids and specific samples, but also has to initiate job migration from overly slow or ineffective workers (Figure 1). Our model assumes an Adversary can sabotage the distributed setup in three ways: (1) a grid core can fail or become ineffective, causing a loss of jobs currently being computed on that core, (2) an entire grid can fail or become ineffective, requiring all jobs to be migrated to other grids, and (3) a sporadic or failed Internet connection increases the asynchronicity of the system, affecting the reliable submission and harvesting of results. The Broker handles these three situations through a combination of job resubmission, job migration to other worker grids, and decreased job submission to and harvesting from problematic grids.

Here we discuss the implementation of our distribution algorithm as executed by the Broker. First, smaller independent jobs are created from a much larger input file. The Broker then submits jobs in a modified round-robin fashion up to the queue limit allowed by each cluster. The round-robin submission is modified by a delay counter (one for each cluster), which increases exponentially with each unsuccessful


Fig. 1. A master-worker model for sequence homology searches. The master Broker breaks a BLAST job into equal-sized sub-tasks (a). The Broker then sets up BLAST services (with all the required executables and databases) on each of the available worker grids, ready to handle incoming tasks. Jobs (squares) are submitted in a round-robin manner to each of the grids (b). The Broker then intermittently harvests results (circles) from each of the services as they become available, de-multiplexing if there are multiple samples being run (c). An Adversary can cause nodes and whole grids to fail at random; the Broker handles this by migrating lost jobs (dashed lines) to alternative grids (d). An Adversary can also cause an intermittent or failing Internet connection (red line), which the Broker handles through an exponential back-off, eventually migrating jobs to other worker grids if latency becomes excessive (e).

connection, submission, or harvesting attempt. Once all the jobs have been submitted but not all results are available, incomplete jobs are resubmitted to other clusters in counter rank order. Finally, once all the jobs are finished and results are retrieved, the Broker consolidates results on the client's machine.
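The submission loop just described can be sketched as follows; per-cluster delay counters gate a round-robin cycle, doubling on each failure so that unresponsive grids are tried progressively less often. The class and function names are illustrative assumptions, and harvesting and resubmission are omitted for brevity.

```python
import itertools
import time


class GridState:
    """Bookkeeping the Broker might keep for each worker grid (illustrative)."""

    def __init__(self, name, queue_limit):
        self.name = name
        self.queue_limit = queue_limit  # max jobs this grid's queue accepts
        self.pending = 0                # jobs submitted but not yet harvested
        self.delay = 0.0                # current back-off delay in seconds
        self.next_try = 0.0             # earliest time for the next attempt


def submit_round_robin(grids, jobs, submit, base_delay=1.0, max_delay=3600.0):
    """Submit jobs round-robin, skipping grids that are backing off.

    `submit(grid, job)` returns True on success; on failure the grid's
    delay doubles up to `max_delay` (60 minutes, matching the default
    back-off cap described in the text).
    """
    cycle = itertools.cycle(grids)
    while jobs:
        grid = next(cycle)
        if time.time() < grid.next_try or grid.pending >= grid.queue_limit:
            continue  # grid is backing off or its queue limit is reached
        job = jobs.pop()
        if submit(grid, job):
            grid.pending += 1
            grid.delay = 0.0  # a success resets the back-off
        else:
            jobs.append(job)  # keep the job so another grid can take it
            grid.delay = min(max_delay, max(base_delay, grid.delay * 2))
            grid.next_try = time.time() + grid.delay
```

In the real Broker the loop would also harvest results (freeing queue slots) and migrate jobs away from persistently failing grids; this sketch assumes at least one grid stays responsive.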

Connectivity or response issues are handled by the exponential back-off of job distribution, while throughput issues are handled by the migration of outstanding or pending jobs to more available workers. On each unsuccessful attempt to submit or harvest results, the Broker waits a specified period of time before retrying. This time increases exponentially on each successive failure until an upper maximum is reached (by default set to 60 minutes). Exponential back-off is a well-accepted method in queueing theory, limiting the number of connections attempted to slower grids. Job migration is handled by a heuristic that migrates a job when its pending time at a particular grid is six standard deviations greater than the average job completion time over the previous hour. Such tasks are then migrated to the grid with the lowest expected completion time, proportional to the number of outstanding jobs. This ensures that jobs are not readily passed to grids with a higher load.
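The two heuristics above, the six-standard-deviation straggler rule and the choice of migration target, can be sketched directly; the function names and arguments are illustrative assumptions, not the Broker's actual API.

```python
from statistics import mean, pstdev


def find_stragglers(pending_times, recent_completions, k=6.0):
    """Jobs whose pending time exceeds the mean completion time over the
    recent window (the previous hour, in the text) by k standard deviations."""
    if len(recent_completions) < 2:
        return []  # not enough history to estimate a spread
    threshold = mean(recent_completions) + k * pstdev(recent_completions)
    return [job for job, t in pending_times.items() if t > threshold]


def pick_migration_target(outstanding_jobs, mean_job_time):
    """Grid with the lowest expected completion time, estimated as the
    number of outstanding jobs times the grid's mean time per job."""
    return min(outstanding_jobs,
               key=lambda g: outstanding_jobs[g] * mean_job_time[g])
```

Weighting by outstanding jobs is what keeps migrated work away from grids that are already heavily loaded.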


B. Amazon Elastic Cloud Integration

MetaPathways v2.0 enables researchers to take advantage of existing compute grids using the TORQUE or Sun Grid Engine queue submission systems, requiring only login credentials and basic information about the queue submission system. However, there are many instances where convenient access to a compatible compute cluster is simply unavailable or difficult. Here, we have adopted StarCluster (http://star.mit.edu/cluster/), an open-source cluster-computing toolkit for Amazon's Elastic Compute Cloud (EC2), in order to automate and simplify the setup of a compatible compute grid. EC2 grids are set up with a custom MetaPathways Amazon Machine Image (AMI) containing MetaPathways code, system-appropriate executables, and compiled databases, reducing setup bandwidth and latency. EC2 grids are specified similarly to other compute grids; however, the user must provide Amazon Web Services (AWS) credentials: an access key, a "secret" access key, and a user ID, obtained by registering for an Amazon EC2 account.

C. Graphical User Interface & Data Integration

The newest iteration of MetaPathways expands upon its initial release with the addition of a minimalistic, simple-to-use GUI aimed at allowing users to process, monitor, integrate, and analyze large environmental datasets more effectively. A MetaPathways run is set up via a configuration window that allows specification of input and output files, quality control parameters, target databases, executable stages, and grid computation (Figure 2). Available grid systems and credentials are added and stored in an "Available Grids" window, allowing the user to add additional grids when credentials become available. Currently, grid credentials using the Sun Grid Engine or TORQUE job distribution systems are supported, though expansion to include other systems such as SLURM and PBS is anticipated. Once a run has started, an execution summary is displayed showing run progress through completed and projected pending stages, along with a number of logs showing the exact commands associated with each processing step (Figure 3).

Environmental datasets generated on next-generation sequencing platforms require the annotation of millions of open reading frames (ORFs). Not only is this annotation computationally intensive, the interpretation, integration, and analysis of gigabytes of output creates analytic challenges as well. Here, MetaPathways v2.0 has implemented a custom Knowledge Engine data structure and file indexing scheme that connects data primitives, such as reads, ORFs, and metadata, projecting them onto a specified classification or hierarchy (e.g., KEGG [18], COG [19], MetaCyc [13], and the NCBI Taxonomy database [20]) (Figure 4). The benefit of this custom abstraction is flexibility and performance; data primitives are extensible across multiple levels of biological information (DNA, RNA, and protein), and related knowledge transformations like coverage statistics or numerical normalizations can be customized for them if necessary. Such a custom data structure enables queries for specific results faster than typical database setups like MySQL or Oracle. Through

Fig. 2. Configuring a MetaPathways run. Specific pipeline execution stages can be run, skipped, or redone by setting the appropriate radio button. A check box at the bottom indicates whether the currently configured set of external worker grids should be used for this run. Additional grids can be configured with their credentials by clicking the "Setup Grids" button.

Fig. 3. Once a MetaPathways run is configured and started, progress is monitored using two coordinated windows. Execution progress and processed results for a specific sample are displayed as a series of tabs containing tables, graphs, or other visualizations, and can easily be expanded should additional pipeline stages or modifications occur. Different environmental datasets can be selected using the drop-down combo-box above.

integration with the Knowledge Engine data structure, data subsets can be easily selected to drive inquiry through custom look-up tables and visualization modules, or exported in a variety of formats for downstream analysis (e.g., custom .csv and .tsv tables, nucleotide and protein FASTA files, MEGAN-compatible .csv, etc.).
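As an illustration of the Knowledge Engine idea, a toy index might connect primitive ids to nodes in named classification schemes with forward and reverse pointer maps, so that selecting a node or intersecting selections ("Knowledge objects") becomes a set operation rather than a database scan. The class and method names here are assumptions for illustration, not the actual implementation.

```python
from collections import defaultdict


class KnowledgeEngine:
    """Toy index connecting data primitives (reads, ORFs) to nodes in
    named classification schemes (e.g. KEGG, COG, NCBI Taxonomy)."""

    def __init__(self):
        self.by_node = defaultdict(set)       # (scheme, node) -> primitive ids
        self.by_primitive = defaultdict(set)  # primitive id -> (scheme, node)s

    def annotate(self, primitive_id, scheme, node):
        """Record that a primitive projects onto a classification node."""
        self.by_node[(scheme, node)].add(primitive_id)
        self.by_primitive[primitive_id].add((scheme, node))

    def select(self, scheme, node):
        """All primitives projected onto a node: a single pointer lookup."""
        return set(self.by_node[(scheme, node)])

    @staticmethod
    def intersect(*selections):
        """Combine selections into a new, narrower 'Knowledge object'."""
        result = set(selections[0])
        for s in selections[1:]:
            result &= s
        return result
```

For example, intersecting a KEGG selection with an NCBI Taxonomy selection picks out only the ORFs carrying both annotations, without touching the rest of the table.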



Fig. 4. The MetaPathways 'Knowledge Engine' data structure. One of the recent developments to enable the efficient exploration of large environmental sequence datasets is the Knowledge Engine data structure. This data structure connects data primitives, such as reads and ORFs, and projects them onto a specified classification or hierarchy. Pointer following is a computationally efficient operation, so the identification and enumeration of data primitives is robust to large samples consisting of millions of annotations. We have described the connection between primitives and classification schemes as Knowledge objects (a). Through the exploration and selection of data projected on classification schemes, new Knowledge objects can be created (dashed lines) (b). Once defined, Knowledge objects can be projected onto custom look-up tables and visualization modules, and these, in turn, can be used to create new Knowledge objects, enabling efficient and interactive querying of millions of annotations.

Here MetaPathways v2.0 has implemented three particular modules: a Large Table implementation exploiting an order-statistic algorithm to allow the manual browsing and search of tables with millions of rows (Figure 5a), a Tree-taxonomy Browser for the display of taxonomic annotations (Figure 5b), and a Contig Viewer for the inquiry of annotated ORFs on individual reads or assembled contigs (Figure 5c). Each module enables the selection of subsets and projections of reads, ORFs, or contigs onto the above classification schemes for intuitive data exploration and analysis. Additionally, export functions for sequences, reads, or ORF annotations facilitate downstream analyses such as tree building or further work in R or MEGAN.
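The order-statistic idea behind the Large Table module can be illustrated with quickselect: the row of a given rank is found in expected linear time, so a visible page of a million-row table can be produced without sorting the whole table. This is a minimal sketch of the technique, not the module's actual code.

```python
import random


def quickselect(rows, k, key):
    """Return the row of rank k (0-based) by `key` in expected O(n) time."""
    rows = list(rows)
    while True:
        if len(rows) == 1:
            return rows[0]
        pivot = key(random.choice(rows))
        less = [r for r in rows if key(r) < pivot]
        equal = [r for r in rows if key(r) == pivot]
        if k < len(less):
            rows = less                      # answer lies below the pivot
        elif k < len(less) + len(equal):
            return equal[0]                  # pivot itself has rank k
        else:
            k -= len(less) + len(equal)      # answer lies above the pivot
            rows = [r for r in rows if key(r) > pivot]


def table_page(rows, start, size, key):
    """Rows ranked [start, start + size) by `key`, as for table browsing."""
    cutoff = key(quickselect(rows, start, key))
    n_below = sum(1 for r in rows if key(r) < cutoff)
    # only the candidate tail is sorted, not the whole table
    tail = sorted((r for r in rows if key(r) >= cutoff), key=key)
    return tail[start - n_below:start - n_below + size]
```

Only the rows at or above the selected cutoff are ever sorted, which is what makes paging through very large tables responsive.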

III. RESULTS

A. Heterogeneous Compute Grid Performance

To illustrate the ability of the task distribution algorithm to migrate tasks between a number of different heterogeneous compute clusters, we performed a demonstration run on three different compute grids simultaneously (Table I): two systems (Bugaboo and Jasper) from WestGrid, one of Compute/Calcul Canada's high-performance computing consortia, and HallamGrid, an internal grid of 10 heterogeneous desktop computers. Connection to the WestGrid systems was through the Internet, while HallamGrid was connected through a local area network. The Bugaboo and Jasper clusters have 4,584 and 4,160 cores, respectively, running the TORQUE or Sun Grid Engine queuing systems with at least 2GB of RAM available per

Fig. 5. MetaPathways v2.0 has integrated its Knowledge Engine data structure into three data summary and visualization modules: a Large Table module that allows for the efficient query, look-up, and sub-setting of reads, ORFs, statistics, annotations, and their hierarchical classification (e.g., KEGG, COG, MetaCyc, SEED, NCBI Taxonomy) (a); a Tree-taxonomy Browser of annotated ORFs on the NCBI Taxonomy classification using the LCA and MEGAN declinations (b); a Contig Viewer enabling the browsing of contigs or long reads for ORF positions, and functional and taxonomic annotations in tool-tips (c).

core. HallamGrid ran the Sun Grid Engine and consisted of 10 nodes with 4-8 cores each, with a variety of processor types. Locally, MetaPathways v2.0 ran on a Mac Pro desktop computer with Mac OS X 10.6.8, two 2.4 GHz quad-core Intel Xeon processors, and 16 GB of 1,066 MHz DDR3 RAM.

We ran MetaPathways v2.0, configured to use the above grids,


on an Illumina metagenome, predicting 106,500 ORFs from 127,821 assembled contigs using standard MetaPathways parameters [12]. These ORFs were annotated against the COG database using the BLAST algorithm. Split into batches of 500 sequences each, this resulted in 213 sequence homology jobs. MetaPathways distributed these jobs to the network, completing in 0.5 real-world hours. A transfer matrix describes the job distribution behavior of the Broker (Table II); diagonal values represent jobs completed at a particular grid, while off-diagonal values represent the transfer of a job from one grid to another by the Broker. This small experiment indicates that, hardware specifications aside, grid performance is substantially affected by the administrative and queuing policies of the compute grids, and that there are large performance advantages to having a dedicated server available, i.e., HallamGrid or a private EC2 instance. In addition, job transfer matrices like Table II could be used as an online tool to assess grid behavior; a high number of off-diagonal values would indicate a high number of transfers and therefore low performance.
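That online assessment could be as simple as summarizing the off-diagonal mass of the matrix. A sketch using the values from Table II (the function and variable names are illustrative):

```python
def transfer_rate(matrix):
    """Fraction of matrix entries that are job transfers (off-diagonal)
    rather than completions at the submitting grid (diagonal)."""
    total = sum(sum(row) for row in matrix)
    on_diagonal = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - on_diagonal) / total


# Rows/columns ordered Jasper, Bugaboo, HallamGrid, as in Table II.
table_ii = [
    [49, 0, 0],
    [2, 10, 8],
    [0, 0, 164],
]
```

For Table II this gives 10/233: roughly 4% of entries are transfers, all of them originating from Bugaboo's row.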

TABLE I
OVERVIEW OF GRID HARDWARE SPECIFICATIONS.

System      Cores  Nodes  Memory        Connectivity
Bugaboo     4584   446    16-24GB/Node  Internet
Jasper      4160   400    16-24GB/Node  Internet
HallamGrid  80     10     32-64GB/Node  LAN

TABLE II
MATRIX OF COMPLETED JOBS (DIAGONAL VALUES) AND JOB TRANSFERS (OFF-DIAGONAL).

            Jasper  Bugaboo  HallamGrid
Jasper      49      0        0
Bugaboo     2       10       8
HallamGrid  0       0        164

IV. DISCUSSION

MetaPathways enables streamlined functional, taxonomic, and pathway-centric analysis of environmental sequence information when compared to existing methods based on KEGG pathways and SEED subsystems. This version of the pipeline focused on improving scalability through the use of a robust master-worker model to control compute grid submission. Our solution has the potential to allow collaborating research labs to take advantage of under-used idle cycles on their heterogeneous in-house networks, and to facilitate the use of external grids when additional dedicated CPU cycles are necessary. By adopting the StarCluster software, we enable the setup and use of Amazon EC2 instances for those who do not have convenient access to large federated compute grids. Further, we improved the usability and monitoring capacity of MetaPathways through the integration of a GUI to monitor pipeline progress, while also providing an efficient interactive framework to bring users closer to their data. The

custom Knowledge Engine data structure drives the query and sub-setting of the processed data, enabling efficient analysis of samples containing millions of annotations. We can think of no other stand-alone software tool that allows for comparable analysis on such large datasets.

We have demonstrated that the Broker's distribution algorithm has good empirical performance on a small number of grids. However, the current implementation uses a very general queueing-theory heuristic. The interactions between the client and clusters with varying loads, user behaviors, and administrative activities could be viewed as a non-cooperative game in which the various players compete for resources. From a game-theory perspective, there could be interesting equilibrium conditions that arise within this dynamic network. Not only would these results be theoretically interesting, they could have critical implications for the performance of the algorithm when large grid networks are being managed.

Aside from run-time improvements, additional data transformation and visual analysis modules are needed that incorporate coverage statistics indicating the numerical abundance and taxonomic distribution of enzymatic steps. Additionally, the application of self-organizing maps or other machine learning methods is needed to more accurately place single-cell or population genome assemblies onto the Tree of Life. Further, as transcriptomic and proteomic datasets become increasingly available, they represent new data primitives that need to be extended into the Knowledge Engine data structure. More generally, reference databases for 5S, 7S, and 23S RNA genes and updates to the current MetaCyc database that include more biogeochemically relevant pathways are needed to improve BLAST and cluster-based annotation efforts. Finally, more operational insight is needed to identify hazards in pathway prediction and to improve ePGDB integration within the Pathway Tools browser.

V. CONCLUSION

MetaPathways v2.0 provides users with improved performance and usability in high-performance computing through a master-worker grid submission algorithm. The addition of a GUI significantly lowers user activation barriers and provides better control over pipeline operation. MetaPathways v2.0 is extensible to the ever-increasing data volumes produced on next-generation sequencing platforms, and generates useful data products for microbial community structure and functional analysis, including phylogenetic trees, taxonomic bins, and tabular annotation files. The MetaPathways v2.0 software, installation instructions, tutorials, and example data can be obtained from http://hallam.microbiology.ubc.ca/MetaPathways.

COMPETING INTERESTS

The authors are not aware of any competing interests.

AUTHORS CONTRIBUTIONS

NWH, KMK, and SW conceived, developed, and implemented the master-worker model and the GUI and visualization modules. NWH and SW drafted and edited the manuscript. SJH assisted in writing the manuscript and oversaw the project.

ACKNOWLEDGMENTS

This work was carried out under the auspices of Genome Canada, Genome British Columbia, the Natural Sciences and Engineering Research Council (NSERC) of Canada, the Canada Foundation for Innovation (CFI), and the Canadian Institute for Advanced Research (CIFAR). The Western Canadian Research Grid (WestGrid) provided access to high-performance computing resources. KMK was supported by the Tula Foundation-funded Centre for Microbial Diversity and Evolution (CMDE) at UBC. We would like to thank Peter Karp, Tomer Altman, and the SRI International staff for invaluable comments related to the design ethos and implementation of MetaPathways.

REFERENCES

[1] M. B. Scholz, C.-C. Lo, and P. S. G. Chain, "Next generation sequencing and bioinformatic bottlenecks: the current state of metagenomic data analysis," Current Opinion in Biotechnology, vol. 23, no. 1, pp. 9–15, Feb. 2012.

[2] N. Desai, D. Antonopoulos, J. A. Gilbert, E. M. Glass, and F. Meyer, "From genomics to metagenomics," Current Opinion in Biotechnology, vol. 23, no. 1, pp. 72–76, Feb. 2012.

[3] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs," Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402, Sep. 1997.

[4] S. M. Kiełbasa, R. Wan, K. Sato, P. Horton, and M. C. Frith, "Adaptive seeds tame genomic sequence comparison," Genome Res, vol. 21, no. 3, pp. 487–493, Mar. 2011.

[5] G. Bell and T. Hey, "Beyond the data deluge," Science, vol. 323, no. 5919, pp. 1296–1297, Mar. 2009.

[6] F. Meyer, D. Paarmann, M. D'Souza, R. Olson, E. M. Glass, M. Kubal, T. Paczian, A. Rodriguez, R. Stevens, A. Wilke, J. Wilkening, and R. A. Edwards, "The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes," BMC Bioinformatics, vol. 9, p. 386, 2008.

[7] R. K. Aziz, D. Bartels, A. A. Best, M. DeJongh, T. Disz, R. A. Edwards, K. Formsma, S. Gerdes, E. M. Glass, M. Kubal, F. Meyer, G. J. Olsen, R. Olson, A. L. Osterman, R. A. Overbeek, L. K. McNeil, D. Paarmann, T. Paczian, B. Parrello, G. D. Pusch, C. Reich, R. Stevens, O. Vassieva, V. Vonstein, A. Wilke, and O. Zagnitko, "The RAST Server: rapid annotations using subsystems technology," BMC Genomics, vol. 9, p. 75, 2008.

[8] C. Georgiou and A. A. Shvartsman, "Cooperative Task-Oriented Computing: Algorithms and Complexity," Synthesis Lectures on Distributed Computing Theory, vol. 2, no. 2, pp. 1–167, Jul. 2011.

[9] E. Christoforou, A. F. Anta, C. Georgiou, M. A. Mosteiro, and A. Sanchez, "Applying the dynamics of evolution to achieve reliability in master–worker computing," Concurrency and Computation: Practice and Experience, vol. 25, pp. 2363–2380, 2013.

[10] K. M. Konwar, S. Rajasekaran, and A. A. Shvartsman, "Robust network supercomputing with malicious processes," in Proc. of 17th Int-l Symp. on Distributed Computing (DISC), 2006, pp. 474–488.

[11] D. P. Anderson, J. Cobb, E. Korpela, M. Lebofsky, and D. Werthimer, "SETI@home: an experiment in public-resource computing," Communications of the ACM, vol. 45, no. 11, pp. 56–61, 2002.

[12] K. M. Konwar, N. W. Hanson, A. P. Page, and S. J. Hallam, "MetaPathways: a modular pipeline for constructing pathway/genome databases from environmental sequence information," BMC Bioinformatics, vol. 14, no. 1, p. 202, 2013.

[13] R. Caspi, H. Foerster, C. A. Fulcher, R. Hopkinson, J. Ingraham, P. Kaipa, M. Krummenacker, S. Paley, J. Pick, S. Y. Rhee, C. Tissier, P. Zhang, and P. D. Karp, "MetaCyc: a multiorganism database of metabolic pathways and enzymes," Nucleic Acids Research, vol. 34, no. Database issue, pp. D511–6, Jan. 2006.

[14] P. D. Karp, S. M. Paley, M. Krummenacker, M. Latendresse, J. M. Dale, T. J. Lee, P. Kaipa, F. Gilham, A. Spaulding, L. Popescu, T. Altman, I. Paulsen, I. M. Keseler, and R. Caspi, "Pathway Tools version 13.0: integrated software for pathway/genome informatics and systems biology," Brief. Bioinformatics, vol. 11, no. 1, pp. 40–79, Jan. 2010.

[15] P. D. Karp, M. Latendresse, and R. Caspi, "The pathway tools pathway prediction algorithm," Stand Genomic Sci, vol. 5, no. 3, pp. 424–429, Dec. 2011.

[16] D. H. Huson, A. F. Auch, J. Qi, and S. C. Schuster, "MEGAN analysis of metagenomic data," Genome Res, vol. 17, no. 3, pp. 377–386, Mar. 2007.

[17] W. Gentzsch, "Sun Grid Engine: towards creating a compute power grid," in CCGRID-01. IEEE Comput. Soc, 2001, pp. 35–36.

[18] M. Kanehisa and S. Goto, "KEGG: kyoto encyclopedia of genes and genomes," Nucleic Acids Research, vol. 28, no. 1, pp. 27–30, Jan. 2000.

[19] R. L. Tatusov, D. A. Natale, I. V. Garkavtsev, T. A. Tatusova, U. T. Shankavaram, B. S. Rao, B. Kiryutin, M. Y. Galperin, N. D. Fedorova, and E. V. Koonin, "The COG database: new developments in phylogenetic classification of proteins from complete genomes," Nucleic Acids Research, vol. 29, no. 1, pp. 22–28, Jan. 2001.

[20] S. Federhen, "The NCBI Taxonomy database," Nucleic Acids Research, vol. 40, no. Database issue, pp. D136–43, Jan. 2012.