
MetaPathways v1.0 Installation

Niels W. Hanson, Kishori M. Konwar, Antoine P. Pagé, and Steven J. Hallam

This document explains the basic installation and setup of the MetaPathways v1.0 pipeline on a typical Unix-based machine, including downloading and installing required software, Pathway Tools installation, reference sequence database installation, basic configuration, and Sun Grid Engine setup. Further details of the MetaPathways v1.0 pipeline are described in the companion June 2013 BMC Bioinformatics article.

1 Downloading MetaPathways

Download the zip file MetaPathways_v1.0.zip from http://hallam.microbiology.ubc.ca/MetaPathways/ or the GitHub releases page. After you have downloaded the file, unzip it and inspect the contents of the MetaPathways/ folder (Figure 1).

A Tour of the MetaPathways/ folder:

blastDB/ place where BLAST databases are stored along with name-mapping and taxonomic support files for specific databases like KEGG and COG

daemon.py a script that carries out external operations on super-computing grids using the Sun Grid Engine

executables/ contains various analytical and data handling programs that process the inputs and outputs of different steps of the pipeline, e.g. BLAST, Prodigal, trna-scan, etc.

libs/ the code library folder; contains different Perl and Python functions and code that coordinate different steps of the pipeline

MetaPathways.py the starter script/program that runs the pipeline with specific configuration and parameter settings for each of the steps


Figure 1: An example of the MetaPathways/ folder from the MetaPathways_v1.0.zip file. Notice that the folder has a number of different files and folders inside it. The template configuration (template_config.txt) and parameter configuration (template_param.txt) files are used to configure and set parameter settings of each of the analytical steps of the pipeline. Additionally, the Python script, MetaPathways.py, is used to start the pipeline.

MetaPathwaysrc a source file that must be run to ensure that the computer system knows where the MetaPathways/ folder is; it also sets the local Python and Perl paths and compiles some executable code

template_config.txt a configuration file that tells the pipeline where to find the resources it needs to run, e.g. the Perl, Python, and Pathway Tools executables and the reference databases

template_header.txt a template header for output GenBank (.gbk) files

template_param.txt a parameter file that specifies the analytical settings for all pipeline steps, e.g. BLAST cut-offs, steps to include in a run of the analysis, what order to annotate databases in, etc.

testdata/ contains some simple .fasta files to do a dry-run to ensure that everything in the pipeline is working properly

For simplicity we are going to perform this installation out of the user home folder, /Users/[username]/ by default. In Unix commands the tilde character (~) is equivalent to your home directory. On OSX systems the home folder can be found through any of the following:

• Double-click the “Macintosh HD” on the Desktop

• Right-click (control-click) the “Finder” icon in the Dock and select “New Finder Window”

• Left-click the “Finder” icon and press (command + n)

• Go to home from any finder folder by pressing (shift + command + h)

Drag-and-drop the newly extracted MetaPathways/ folder (from MetaPathways_v1.0.zip) into the home directory. It should sit as ~/MetaPathways/ when accessing it through the terminal.

MetaPathways requires the use of the Unix command-line terminal to run. On OSX systems this is done through the “Terminal” program located in:

• Applications > Utilities > Terminal

You may want to place this program on your OSX Dock for future convenience.

2 Installing programming languages Python, Perl, and GCC.

Install the required Python 2.x, Perl 5.x, and GCC compiler. For OSX users, these are all contained within the current release of Xcode 4, which can be obtained for free from https://developer.apple.com/xcode/ or from the Apple App Store within modern releases of OSX. Alternatively, Perl and Python installation files and documentation can be obtained from their respective websites:

Python 2.x http://docs.python.org/using/unix.html

Perl 5.x http://www.perl.org/get.html

GCC http://gcc.gnu.org

These can also be obtained through a package management system like Synaptic. Many Unix distributions, like the popular Ubuntu, include versions of Python, Perl, and GCC by default, but you will want to ensure that they are the proper versions.

In many instances, installing new programming languages is quite low-level from an OS perspective, and may require some discussion with your local system administrator. A restart of the computer might also be required. It is also a good idea to open the terminal after installation and check that these installations made it into your system’s $PATH variable using the which command:

    # tests to see if perl, python, and gcc are included in your $PATH variable
    $ which perl
    /usr/bin/perl
    $ which python
    /usr/bin/python
    $ which gcc
    /Developer/usr/bin/gcc
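The same check can be scripted so that a missing tool is reported explicitly instead of silently printing nothing. The following is only a minimal sketch and is not part of MetaPathways: the function name check_tool is hypothetical, and it uses the POSIX `command -v` builtin in place of `which`.

```shell
#!/bin/sh
# Hypothetical helper (not part of MetaPathways): report where a tool
# lives on $PATH, or say that it is missing.
check_tool() {
    if tool_path=$(command -v "$1" 2>/dev/null); then
        printf '%s found at %s\n' "$1" "$tool_path"
    else
        printf '%s NOT found in $PATH\n' "$1"
        return 1
    fi
}

# Check the three tools this section asks for; keep going if one is missing.
for tool in perl python gcc; do
    check_tool "$tool" || true
done
```

A tool that is reported as missing usually means either that it is not installed or that its install location has not been added to $PATH yet.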

3 Install Pathway Tools

One of the final steps of the MetaPathways pipeline uses the software Pathway Tools to build a Pathway/Genome Database (PGDB) from environmental nucleotide sequences. The Pathway Tools software can be obtained directly from SRI International and requires an academic licence (http://biocyc.org/download.shtml). This is free for academic users and usually takes approximately 1-2 business days to approve. Problems with licensing can be emailed to [email protected]. SRI International provides installation instructions for OSX and Unix, and the software is extensively documented at its homepage: http://bioinformatics.ai.sri.com/ptools/. Once your licence is approved, you will receive an email from the Pathway Tools group that will allow you to download the Pathway Tools software (Figure 2).

In short, you will obtain an install file like pathway-tools-17.0-macosx-tier1-install.dmg, and upon mounting this disk image on the desktop you will find a folder with a file that starts an installation wizard (Figure 3).

For ease of instruction we encourage the use of the default installation locations of the Pathway Tools directories in the standard home folder locations: ~/pathway-tools and ~/ptools-local.

pathway-tools/ contains the actual Pathway Tools software

ptools-local/ contains the PGDBs once they have been built via the MetaPathways pipeline


Figure 2: Pathway Tools software download table outlining the different versions. For most metagenomic purposes the basic configuration with just the EcoCyc and MetaCyc databases will be fine (outlined in red).

Figure 3: The Pathway Tools install wizard for OSX. We recommend following the installation defaults, which place the pathway-tools/ and ptools-local/ directories in the user’s home directory. OSX installations may prompt the installation of XQuartz; please allow the installation of XQuartz to finish before continuing with the Pathway Tools installation.


On OSX systems, a window during the Pathway Tools installation will prompt installation of XQuartz. This will download an additional .dmg file to install XQuartz. Allow the installation of XQuartz to finish before continuing with the Pathway Tools installation. On some systems, installation of XQuartz may require a manual restart. Please restart your system prior to running Pathway Tools for the first time.

After installing Pathway Tools you can launch it from the terminal by executing the following from the command line:

    $ cd ~
    $ ./pathway-tools/pathway-tools

Alternatively, launch it from the shortcut icons that were placed on your desktop during installation.

4 BLAST Databases

The Basic Local Alignment Search Tool (BLAST) is used for a number of pipeline steps, specifically the Open Reading Frame (ORF) functional annotation and the taxonomic identification of sequences through RNA homology. In order to perform these steps locally you need a copy of some sequence reference databases to search. We have provided a few databases to get started:

MetaCyc (metacyc-v5-2011-10-21) a subset of UniProt corresponding to the sequences in the MetaCyc database. This is included with the Pathway Tools software (uniprot-seq-ids.seq), just reformatted into the common .fasta format

Cluster of Orthologous Groups of proteins (COG_2013-02-05) a protein database containing taxonomically specific clusters of functional proteins

Silva LSU (LSURef_111_tax_silva) LSU rRNA nucleotide sequences for taxonomic identification

However, the choice of database often depends on the specific scientific question you are asking. Many databases are freely available for download from public ftp servers, but they are large and grow in size every day. Downloads add up to many gigabytes (GBs), so a high-speed internet connection will be required. Because many of these databases are hosted on file transfer protocol (ftp) servers, we recommend Cyberduck (http://cyberduck.ch) as a free, simple, and user-friendly ftp client.

By default, MetaPathways is configured to detect databases in the blastDB/ folder. Below we outline some basic instructions for obtaining other popular databases for metagenomic analysis.


Protein Databases

RefSeq

RefSeq is a major protein reference database maintained by the National Center for Biotechnology Information (NCBI) http://www.ncbi.nlm.nih.gov/RefSeq/. RefSeq provides formatted BLAST databases on its ftp server:

• connect to the BLAST database ftp server ftp://ftp.ncbi.nlm.nih.gov/blast/db

• download the set of files named refseq_protein.XX.tar.gz, where XX are numbers

• extract the .tar.gz archives (usually by simply double-clicking on them)

• MetaPathways actually requires the original fasta sequences of the RefSeq database to start. Extract the sequences from the refseq_protein BLAST database using blastdbcmd or the older fastacmd:
    $ blastdbcmd -db refseq_protein -dbtype prot -entry all -outfmt %f -out Refseq_2013
    $ fastacmd -D 1 -d refseq_protein -o Refseq_2013

• Both blastdbcmd and the legacy fastacmd can be obtained from the BLAST Software and Databases website provided by the NCBI.
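For those working in the terminal rather than double-clicking, the archive extraction step above can be sketched as a small shell function. This is an illustration only: the function name extract_refseq_archives and the directories in the usage note are hypothetical, and only the refseq_protein.XX.tar.gz naming comes from the instructions above.

```shell
#!/bin/sh
# Hypothetical helper: unpack every refseq_protein.XX.tar.gz volume found
# in a download directory ($1) into a destination directory ($2).
extract_refseq_archives() {
    src_dir=$1
    dest_dir=$2
    mkdir -p "$dest_dir" || return 1
    for archive in "$src_dir"/refseq_protein.*.tar.gz; do
        # If the glob matched nothing, the pattern itself is returned.
        [ -e "$archive" ] || { echo "no archives found in $src_dir" >&2; return 1; }
        tar -xzf "$archive" -C "$dest_dir" || return 1
    done
}
```

It might be called as `extract_refseq_archives ~/Downloads ~/refseq_blast`, after which the sequence extraction command would be run from inside the destination directory.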

KEGG

KEGG, the Kyoto Encyclopedia of Genes and Genomes, is available at http://www.genome.jp/kegg/ and http://www.bioinformatics.jp/en/keggftp.html. MetaPathways is configured to handle KEGG annotations and provide summary tables. Unfortunately, KEGG now requires a subscription fee to access its databases. However, once the sequences are obtained they can simply be placed in the blastDB/ folder.

Nucleotide Taxonomic Databases

Silva

Silva is a comprehensive ribosomal database project.

• Visit the Silva website http://www.arb-silva.de/download/

• navigate links: Download → Archive → Current → Exports

• download the current SSU database (SSURef_111_NR_tax_silva.fasta.tgz) and the current LSU database (LSURef_111_tax_silva.fasta.tgz)


Greengenes

Greengenes is a 16S rRNA gene database and workbench compatible with ARB.

• Visit the Greengenes website http://greengenes.lbl.gov/cgi-bin/nph-index.cgi

• navigate links: Download → Sequence Data → Fasta data files

• download the current GREENGENES_gg16S_unaligned.fasta.gz

Once again, one need only download the databases in .fasta format and place them in the blastDB/ folder. MetaPathways is programmed to format them automatically on-the-fly.
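Downloads such as the Greengenes file arrive gzip-compressed, so they need to be decompressed on their way into blastDB/. A minimal sketch of that step follows; the function name install_fasta_db is hypothetical and not part of MetaPathways.

```shell
#!/bin/sh
# Hypothetical helper: decompress a gzipped FASTA download ($1) into the
# database folder ($2), keeping the original file name minus ".gz".
# The downloaded archive itself is left untouched.
install_fasta_db() {
    archive=$1
    db_dir=$2
    name=$(basename "$archive" .gz)
    mkdir -p "$db_dir" || return 1
    gunzip -c "$archive" > "$db_dir/$name"
}
```

For example, `install_fasta_db ~/Downloads/GREENGENES_gg16S_unaligned.fasta.gz ~/MetaPathways/blastDB` would leave the decompressed .fasta file in the database folder. Downloads with a .tgz extension are tar archives and would need `tar -xzf` instead.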

5 Configuring the template_config.txt

The template_config.txt file configures the pipeline to find the resources it needs to run. Paths will have to be set for the PERL_EXECUTABLE, PYTHON_EXECUTABLE, PATHOLOGIC_EXECUTABLE, REFDBS, and METAPATHWAYS_PATH keywords.

Direct the Terminal to the MetaPathways/ folder and source the MetaPathwaysrc file. This compiles the Perl and Python code and reports the locations of Perl, Python, and the MetaPathways directory for the config file:

    $ cd MetaPathways/
    $ source MetaPathwaysrc
    Checking for Python and Perl:
    Python found in /usr/bin/python
    Please set variable PYTHON_EXECUTABLE in file template_config.txt as:
    PYTHON_EXECUTABLE /usr/bin/python

    Perl found in /usr/bin/perl
    Please set variable PERL_EXECUTABLE in file template_config.txt as:
    PERL_EXECUTABLE /usr/bin/perl

    Adding installation folder of MetaPathways to PYTHONPATH
    Your MetaPathways is installed in :
    Please set variable METAPATHWAYS_PATH in file template_config.txt as:
    METAPATHWAYS_PATH /Users/username/MetaPathways


Follow the printed instructions and update the PYTHON_EXECUTABLE, PERL_EXECUTABLE, METAPATHWAYS_PATH, PATHOLOGIC_EXECUTABLE, and SYSTEM keywords in template_config.txt (Figure 4). The METAPATHWAYS_PATH and PATHOLOGIC_EXECUTABLE values are the absolute paths to MetaPathways and Pathway Tools, respectively.

Figure 4: An example of how to edit the template_config.txt file for MetaPathways setup. In most cases, one only needs to edit the PYTHON_EXECUTABLE, PERL_EXECUTABLE, METAPATHWAYS_PATH, and PATHOLOGIC_EXECUTABLE values, and then replace the SYSTEM keyword with either mac, linux, or win depending on the operating system. These are highlighted in the red boxes on the left, and in blue boxes on the right during an example setup for a Mac OSX operating system.
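Since a mistyped path in the config file is an easy error to make, it can help to check the edited values before the first run. The sketch below is not part of MetaPathways and rests on two assumptions: that the file consists of "KEYWORD value" lines, as the MetaPathwaysrc output suggests, and that every keyword ending in _EXECUTABLE points at an executable file. The function name check_config is hypothetical.

```shell
#!/bin/sh
# Hypothetical sanity check for a config file ($1) made of
# "KEYWORD value" lines (an assumed layout): every *_EXECUTABLE value
# should be an executable file and METAPATHWAYS_PATH should be an
# existing directory. Prints one line per problem found.
check_config() {
    status=0
    while read -r key value || [ -n "$key" ]; do
        case $key in
            *_EXECUTABLE)
                { [ -f "$value" ] && [ -x "$value" ]; } ||
                    { echo "$key: $value is not an executable file"; status=1; } ;;
            METAPATHWAYS_PATH)
                [ -d "$value" ] ||
                    { echo "$key: $value is not a directory"; status=1; } ;;
        esac
    done < "$1"
    return $status
}
```

Running `check_config template_config.txt` from the MetaPathways/ folder prints nothing and returns success when all checked paths exist. Note that a literal ~ inside the file is not expanded by this check, which is one more reason to use absolute paths.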

6 Configuring the template_param.txt

The template_param.txt file defines the parameter settings of all the analytical steps in a MetaPathways run. It needs to be updated with the exact names of your protein and nucleotide databases in the blastDB/ folder (Figure 5).

Figure 5: The template_param.txt file. The exact names of the BLAST databases need to be listed in the highlighted lines. These must be the exact names of the database sequence files in the blastDB/ folder.

7 Connecting with the Grid (optional)

MetaPathways has the capability to externalize computationally heavy tasks like protein BLAST searches to supercomputing facilities, provided they use the Sun Grid Engine. This is an optional but highly recommended step. However, it requires having ssh access and sufficient user permissions to set up password-less access on a compute server. This might be a good time to check with your local system administrator and ask if this kind of setup is permissible. We’ve outlined some basic steps of this process:

1. Test to see if you can connect to your account via ssh:
    $ ssh [email protected]

2. You should be asked for your password.

3. Check to see whether there is a .ssh/ folder in your remote home directory:
    $ ls ~/.ssh/
    authorized_keys known_hosts

4. If not, create it:
    $ mkdir ~/.ssh/

5. Return to your local computer (control + d).

6. Navigate to the local ~/.ssh/ directory:
    $ cd ~/.ssh/

7. Run ssh-keygen to create an RSA public and private key pair:
    $ ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/Users/username/.ssh/id_rsa):
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in id_rsa.
    Your public key has been saved in id_rsa.pub.

8. Copy your public key to your grid .ssh/ folder with scp:
    $ scp id_rsa.pub [email protected]:~/.ssh/

9. Log back in to your external server account using ssh:
    $ ssh [email protected]

10. Navigate to the ~/.ssh/ directory again:
    $ cd ~/.ssh/

11. Append the public key to a file called authorized_keys:
    $ cat id_rsa.pub >> authorized_keys

12. Change the permissions of the authorized_keys file and .ssh/ directory such that only your username can read and write them:
    $ chmod 600 ~/.ssh/authorized_keys
    $ chmod 700 ~/.ssh/

13. Log out to your local computer by pressing (control + d).

14. Try again to log in using ssh; you should not need to type in your password this time:
    $ ssh [email protected]

If the above procedure did not work, then you likely have a more complicated setup on your hands. At this point it would be good to ask a local system administrator to help you set up keyless login. If this is not possible, a useful search term is “ssh keyless login”.
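The server-side part of the procedure (creating .ssh/, appending the key, and fixing permissions) is where most mistakes happen, so it can be worth wrapping in one function. This is a minimal sketch rather than an official part of the setup; the function name authorize_key is hypothetical, and it additionally avoids appending the same key twice.

```shell
#!/bin/sh
# Hypothetical helper for the server-side steps (4, 11 and 12): add a
# public key file ($1) to authorized_keys inside an .ssh directory ($2)
# and tighten the permissions.
authorize_key() {
    pub_key=$1
    ssh_dir=$2
    mkdir -p "$ssh_dir" || return 1
    touch "$ssh_dir/authorized_keys" || return 1
    # Append only if this exact key line is not already present.
    if ! grep -qxF -f "$pub_key" "$ssh_dir/authorized_keys"; then
        cat "$pub_key" >> "$ssh_dir/authorized_keys"
    fi
    chmod 700 "$ssh_dir" && chmod 600 "$ssh_dir/authorized_keys"
}
```

On the server, after the public key has been copied over, this would be run as `authorize_key ~/.ssh/id_rsa.pub ~/.ssh`.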

Congratulations! You have completed what is in some cases a convoluted and unintuitive setup, but with some luck the MetaPathways pipeline is ready for action. Now that you have come so far you will likely want to use it. You can now proceed to obtain some .fasta files full of sample sequences and let the analysis commence. Its use is simple if you are familiar with the Unix command line; nevertheless, we have provided some basic examples and use cases.

8 MetaPathways Use and Setup

Running MetaPathways

1. Setting Parameters — Preparing for your MetaPathways run

Before we start our first run of the pipeline we will again take a look at the parameters contained in template_param.txt. This file gives all the instructions and settings to be run for each step of the pipeline. Many of the default settings found in template_param.txt are general and should be adequate for many metagenomic analyses. However, one will often have to change these to reflect the questions and goals one has for a specific dataset.

Settings in this file are in the form of parameter/value pairs separated by spaces; multiple values are separated by commas:

    parameter value
    parameter value1,value2,...
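Because the layout is this simple, a setting can be read back from the command line to confirm what a run will actually use. The sketch below assumes exactly the space-separated layout described above, with no spaces inside a comma-separated value list; the function name get_param is hypothetical and not part of MetaPathways.

```shell
#!/bin/sh
# Hypothetical helper: print the value(s) of one parameter ($2) from a
# parameter file ($1), one value per line. Assumes the
# "parameter value1,value2,..." layout with no spaces inside the list.
get_param() {
    awk -v key="$2" '
        $1 == key {
            n = split($2, values, ",")
            for (i = 1; i <= n; i++) print values[i]
        }' "$1"
}
```

For instance, `get_param template_param.txt annotation:dbs` would list the configured annotation databases one per line, which makes it easy to compare them against the file names in blastDB/.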

INPUT:format - specifies the type of input file. Possible values include: fasta, gbk-annotated, and gbk-unannotated. Annotated and unannotated correspond to the existing gene annotations contained within the GenBank (gbk) input files.

QC parameters

quality_control:min_length - specifies the minimum number of nucleotides a sequence must have during the QC phase

quality_control:delete_replicates - removes duplicate sequences from the input

ORF prediction parameters

orf_prediction:algorithm - specifies the ORF prediction algorithm that is used. Currently only Prodigal is available

orf_prediction:min_length - specifies the minimum number of amino acids in a predicted ORF

Annotation parameters


annotation:algorithm - specifies which homology search algorithm to use for ORF annotation. Current options are blast and last, the latter being a more efficient implementation of the seed-and-extend approximation algorithm

annotation:dbs - specifies which protein databases will be used for annotation and in what order. Database names are separated by commas, and the names must exactly match the naming convention in the database folder blastDB/

annotation:min_bsr - specifies the minimum blast-score ratio threshold. Only hits with a ratio greater than this threshold will be kept.

annotation:max_evalue - specifies the maximum e-value threshold. Only hits with e-values smaller (more statistically significant) than this threshold will be kept.

annotation:min_score - specifies the minimum bit-score threshold. Only hits with a score greater than this threshold will be kept.

annotation:min_length - specifies the minimum length threshold. Only annotations with a greater length will be kept.

annotation:max_length - specifies the maximum number of annotations to be kept for each search. Usually the top-5 or top-10 homology hits are sufficient for most purposes.
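To make the interplay of the three score thresholds concrete, here is an illustrative filter. It is not MetaPathways code: the five-column input layout is made up for the example, and the blast-score ratio is computed in the commonly used way, as the hit's bit-score divided by the bit-score of the query aligned to itself, which is an assumption about how the pipeline defines it. The threshold values in the usage note are likewise only examples.

```shell
#!/bin/sh
# Illustrative filter, not MetaPathways code. Reads whitespace-separated
# rows from stdin in a made-up layout:
#   query  target  e-value  bit-score  self-score
# where self-score is the bit-score of the query aligned to itself.
# Arguments: $1 = min BSR, $2 = max e-value, $3 = min bit-score.
# A hit is printed only if it passes all three thresholds.
filter_hits() {
    awk -v min_bsr="$1" -v max_evalue="$2" -v min_score="$3" '
        $5 + 0 > 0 {
            # blast-score ratio: hit score relative to the self-hit score
            bsr = $4 / $5
            if (bsr > min_bsr + 0 && $3 + 0 < max_evalue + 0 && $4 + 0 > min_score + 0)
                print $1, $2, bsr
        }'
}
```

Given a file of rows in that layout, `filter_hits 0.4 1e-6 20 < hits.txt` would print the query, target, and ratio of each surviving hit.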

RNA parameters

Analogous to the protein homology search settings above:

rRNA:refdbs - specifies the databases to be searched against. These database names must match the names of the nucleotide BLAST databases found in the blastDB/ folder specified in the pipeline configuration file

rRNA:max_evalue - sets the 16S rRNA maximum expect value threshold. Only hits with e-values less (more statistically significant) than this threshold will be kept

rRNA:min_identity - sets the minimum percent identity threshold. Only annotations with a greater percent identity with the query sequence will be kept

rRNA:min_bitscore - only annotations with bit-scores greater than this minimum threshold will be kept.

Grid Settings

Settings associated with running protein homology searches on the grid:

grid_engine:batch_size - specifies the number of sequences to be included in each grid job. This should be set to respect the memory and CPU time requirements of the grid you are using

grid_engine:max_concurrent_batches - sets the maximum number of jobs to be submitted to a grid at one time. MetaPathways will maintain a job queue of this size waiting to be scheduled

grid_engine:walltime - sets the maximum amount of time an individual job can take. Setting this value too high affects your scheduling by the Sun Grid scheduler. Setting it too low allows you to be scheduled, but your job will be stopped before completion.

grid_engine:RAM - the maximum RAM usage for the job. This can also affect the scheduling of your jobs, and becomes an issue for larger databases such as RefSeq

grid_engine:user - username used to access the grid via ssh

grid_engine:server - the address of the compute grid accessed via ssh
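When choosing a batch size it helps to know how many grid jobs it implies for a given set of sequences. The following sketch estimates that number by counting FASTA header lines; it is only an illustration, and the function name count_batches is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count the sequences in a FASTA file ($1) and report
# how many grid jobs a given batch size ($2) would produce.
count_batches() {
    n_seqs=$(grep -c '^>' "$1")
    # Integer ceiling division: the last batch may be only partially filled.
    echo $(( (n_seqs + $2 - 1) / $2 ))
}
```

Comparing the result with the grid_engine:max_concurrent_batches setting gives a rough idea of whether all batches can be queued at once.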

Pathway Tools parameters

ptools_settings:taxonomic_pruning - specifies whether the ePGDB in Pathway Tools should be built with taxonomic pruning enabled (yes) or disabled (no). Disabled is recommended for metagenomic samples. Single-cell analyses may want to consider enabling it.

2. Pipeline Execution Flags — yes, skip, stop, redo, grid

For each step of the pipeline one must specify one of the following actions:

yes perform the operation with the above settings

skip do not perform this operation (note that this could cause later dependent steps in the pipeline to fail)

stop stop the pipeline run after completing the previous step

redo recompute a specific step of the pipeline (after incomplete execution or an error may have corrupted the output)

grid compute this step on the grid. Currently only available for the BLAST/LAST homology search step

3. Starting a Run: The MetaPathways pipeline is run using the MetaPathways.py script from the command line:

    $ ./MetaPathways.py -i [input file/folder] -o [output directory] \
        -c [config file] -p [parameter file] -r [overwrite/overlay]

For example,

    $ ./MetaPathways.py -i testdata/ \
        -o ~/MetaPathways/output \
        -c ~/MetaPathways/template_config.txt \
        -p ~/MetaPathways/template_param.txt \
        -r overlay \
        -v


where,

-i specifies the input file directory or specific .fasta file

-o specifies the output directory

-c the configuration file to be used for this run

-p the parameter file to be used for this run

-r the run-style to be used for this run:

overlay checks for an existing run in place and uses existing files as it finds them, unless the pipeline step is set to redo

overwrite overwrites existing output

-v verbose output; displays the exact commands being run for each step

The script testMetaPathways.sh will do a simple run on sequences in the testdata/ folder:

$ testMetaPathways.sh
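For runs on your own data, a thin wrapper around the MetaPathways.py call can catch a mistyped path before the pipeline starts. The sketch below is not shipped with MetaPathways: the function names preflight and run_pipeline are hypothetical, and it assumes it is executed from inside the MetaPathways/ folder with the template files edited in place. Only the -i, -o, -c, -p, -r, and -v flags are taken from the description above.

```shell
#!/bin/sh
# Hypothetical wrapper: confirm the input, config and parameter files
# exist before starting a run.
preflight() {
    # $1 = input file or folder, $2 = config file, $3 = parameter file
    [ -e "$1" ] || { echo "input not found: $1" >&2; return 1; }
    [ -f "$2" ] || { echo "config file not found: $2" >&2; return 1; }
    [ -f "$3" ] || { echo "parameter file not found: $3" >&2; return 1; }
}

run_pipeline() {
    # $1 = input, $2 = output directory; run from inside MetaPathways/.
    preflight "$1" template_config.txt template_param.txt || return 1
    ./MetaPathways.py -i "$1" -o "$2" \
        -c template_config.txt -p template_param.txt -r overlay -v
}
```

A run on the bundled test data would then be started with `run_pipeline testdata/ ~/MetaPathways/output`.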
