CAMERA metagenomic annotation pipeline
Post on 04-Aug-2015
CAMERA Annotation Pipelines (and related infrastructure)
Brett Whitty
12/20/2007
Overview
Compute Infrastructure
GOS/CAMERA ncRNA/ORF calling pipeline
  rRNA finding pipeline
  ORF calling
GOS (incremental) protein clustering
CAMERA Annotation Pipeline
  Specifications
  Implementation
Compute Infrastructure
CALIT2 Compute Grid
48 dual-core, dual-CPU, 64-bit machines
192 SGE slots
Red Hat-based ‘Rocks Clusters’ Linux distribution (see http://rocksclusters.org)
‘Rocks Rolls’ Bio-roll (/opt/Bio)
  Used to image/install each node separately, including local Perl module installs (patches)
sos.camera.calit2.net
Head node of sos cluster
SSH into here
Is not an SGE submit host
SOS Cluster Global Mounts
/share/apps
  Applications (and related files) are installed here; analysis data should not be stored here
/home/thumper6
  A global mount point: 18T(!!!) storage volume on which all analysis data/results should be stored
/opt/Bio
  Tools such as clustalw, EMBOSS, hmmer, and NCBI BLAST are installed under here
SOS Local Mounts (on each grid node)
/state/partition1
  Local storage device on each grid node, available for local scratch space (438G)
/tmp
  System tmp partition (7G)
pg0-0.camera.calit2.net
SSH accessible only through the head node
Is an SGE submit host
Running apache and postgres servers
pg0-0.camera.calit2.net
http://web1.camera.calit2.net/ergatis/
/var/www/cgi-bin/ergatis
  /home/bwhitty/temp/subversion-1.4.5/subversion/svn/svn export --force https://ergatis.svn.sf.net/svnroot/ergatis/tags/ergatis-v2r6b1/htdocs ergatis
/var/www/html/ergatis
  /home/bwhitty/temp/subversion-1.4.5/subversion/svn/svn export --force https://ergatis.svn.sf.net/svnroot/ergatis/tags/ergatis-v2r6b1/cgi-bin ergatis
pg0-0.camera.calit2.net
CGI scripts run as the user 'apache' on pg0-0, but 'apache' has sudo permissions for user 'ergatis'
The two CGI scripts in the install which run RunWorkflow and KillWorkflow (ergatis/kill_wf.cgi, ergatis/Ergatis/Pipeline.pm) have been modified, and 'sudo -u ergatis ' has been prepended to their normal execution strings
IdGenerator.pm has been modified to use JCVIIdGenerator.pm
Many of the settings in ergatis.ini have been changed from defaults, including disabling a number of the components
When updating the Ergatis CGI directory from the SVN repository, a backup copy should be set aside in advance
SGE/Workflow Notes
Two SGE queues have been configured for ergatis:
  ergatis.q (192 slots)
  ergatis-fast.q (144 slots)
ergatis.q is a subordinate queue of ergatis-fast.q
ergatis.q is set as the default queue for user ‘ergatis’ by specifying ‘-q ergatis.q’ in /home/ergatis/.sge_request
Workflow version 3.0 is installed under /share/apps/workflow
Workflow requires that the SGE queue's prolog and epilog scripts be set to the following:
  prolog=/share/apps/workflow/bin/prolog $host $job_owner $job_id $job_name $queue
  epilog=/share/apps/workflow/bin/epilog $host $job_owner $job_id $job_name $queue
The queue configuration can be checked using the command 'qconf -sq ergatis.q'
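The prolog/epilog check above can also be scripted. The following is a minimal sketch in Python (not part of the original install, which is Perl-based). It assumes that `qconf -sq` prints one `attribute value` pair per line and ignores backslash-continued lines; the function names are hypothetical.

```python
import subprocess

# Hook values that Workflow 3.0 expects, as listed on the slide above.
ARGS = "$host $job_owner $job_id $job_name $queue"
EXPECTED = {
    "prolog": "/share/apps/workflow/bin/prolog " + ARGS,
    "epilog": "/share/apps/workflow/bin/epilog " + ARGS,
}


def parse_queue_config(text):
    """Parse 'attribute value' lines into a dict (assumed qconf -sq layout)."""
    config = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            # Collapse runs of whitespace so column padding does not matter.
            config[parts[0]] = " ".join(parts[1].split())
    return config


def check_hooks(config):
    """Return the names of hooks whose configured value differs from EXPECTED."""
    return [name for name, value in EXPECTED.items() if config.get(name) != value]


def check_queue(queue="ergatis.q"):
    """Run qconf for one queue and report misconfigured hooks (needs SGE installed)."""
    result = subprocess.run(
        ["qconf", "-sq", queue], capture_output=True, text=True, check=True
    )
    return check_hooks(parse_queue_config(result.stdout))
```

On a submit host, `check_queue("ergatis.q")` would return an empty list when both hooks match the values above.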
Ergatis Application Install
The main ergatis application install directory is under /share/apps/ergatis
The chado-v1r12b1 release is the current version installed
  Direct copy of the install located at /usr/local/devel/ANNOTATION/ard/ at JCVI
  Perl wrappers were modified via sed to the correct local directory structures
  Proper install wasn't done because no working installer script was available at the time
/share/apps/ergatis/chado-v1r12b1 is symlinked to /share/apps/ergatis/current
Executables which some ergatis components use, but which are not installed with Ergatis (e.g. JCVI internal scripts), are located under /share/apps/ergatis/bin
External tools which are not globally installed on sos are installed under /share/apps/ergatis/external_apps
Ergatis global directories (global_id_repository, global_saved_templates) are located under /share/apps/ergatis/ergatis_global
Ergatis Data Locations
All ergatis data should be put under /home/thumper6/ergatis
Project repositories are located under /home/thumper6/ergatis/projects or symlink /share/apps/ergatis/projects
CAMERA project repository is /home/thumper6/ergatis/projects/camera
Databases are located under /home/thumper6/ergatis/db or symlink /share/apps/ergatis/db
Global scratch space is under /home/thumper6/ergatis/scratch or symlink /share/apps/ergatis/scratch
ikelite.rocksclusters.org
Fewer machines than sos cluster (~20 slots?)
Initial test ergatis install was done here (similar directory structure to sos)
Completely distinct from sos cluster
Sandbox
Shibu, Weizhong Li and others run computes here (e.g. clustering pipeline)
Pipelines
GOS/CAMERA Pipelines Overview
[Diagram: Metagenomic Reads; ncRNA/ORF Finding Pipeline; ORFs/peptides; Annotation Pipeline; Incremental Clustering Pipeline; Cluster Memberships]
Challenges
All computes in the pipeline must be performed on multi-sequence input/output files, as the filesystem cannot physically support 12M+ individual FASTA input/output files
  Other partitioning solutions could work(?), but most tools support multiple sequence inputs anyway
Overall total space consumption was an issue when computes were running on the TIGR grid, but this is not as much of an issue (currently) on the CALIT2 grid
  Solution here was to keep all inputs/outputs gzipped during pipeline execution, at the cost of some performance loss (using things like zcat -f | with NCBI BLAST, etc.)
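The gzipped multi-sequence approach can be illustrated with a small streaming reader. This is a sketch in Python rather than the pipeline's own Perl, and the function names are hypothetical. It reads records one at a time, so a file with millions of sequences never has to be unpacked on disk or held in memory.

```python
import gzip


def read_fasta(path):
    """Yield (header, sequence) pairs from a plain or gzipped multi-FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    header, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_fasta_gz(path, records):
    """Write (header, sequence) pairs to a gzipped multi-FASTA file."""
    with gzip.open(path, "wt") as out:
        for header, sequence in records:
            out.write(">%s\n%s\n" % (header, sequence))
```

The trade-off is the one noted above: every read pays a decompression cost in exchange for a much smaller footprint on the shared volume.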
GOS/CAMERA ncRNA and ORF Finding Pipeline
GOS/CAMERA ncRNA and ORF Finding Pipeline Overview
[Diagram steps: Reads; Find tRNAs; Soft-Mask tRNAs; Extract tRNAs; Find rRNAs; Soft-Mask rRNAs; Extract rRNAs; GOS ORF calling; ORF stats; ORF overlaps]
[Diagram outputs: tRNAs FASTA; rRNAs FASTA; ORFs FASTA; Peptides FASTA; Metagene ORFs FASTA; Peptides FASTA]
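Soft-masking here means lowercasing the bases covered by a tRNA or rRNA hit so that later steps can tell masked from unmasked sequence. A minimal Python sketch follows (the pipeline components themselves are Perl). The function names are hypothetical, and the 0-based, end-exclusive coordinates are an assumption suggested by the /begin and /end values in the FASTA headers.

```python
def soft_mask(sequence, intervals):
    """Lowercase the bases inside each (begin, end) interval; uppercase the rest.

    Coordinates are assumed to be 0-based and end-exclusive.
    """
    bases = list(sequence.upper())
    for begin, end in intervals:
        for i in range(max(begin, 0), min(end, len(bases))):
            bases[i] = bases[i].lower()
    return "".join(bases)


def unmasked_count(sequence):
    """Count the bases that are not soft-masked (that is, still uppercase)."""
    return sum(1 for base in sequence if base.isupper())
```

A count like `unmasked_count` is the kind of measure an ORF filter could apply when deciding whether an ORF overlapping a masked region still has enough unmasked bases to keep.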
GOS/CAMERA ncRNA and ORF Finding Pipeline
CAMERA-specific Ergatis components
camera_extract_trna
CAMERA rRNA Finder Overview
BLAST vs. a database of coded, pooled rRNA subunit sequences
BLAST prefilter step with loose parameters
  blastall -p blastn -i reads.fsa -d rrna_db.fsa -e 0.1 -F 'T' -b 1 -v 1 -z 3000000000 -W 9
Reads with prefilter hits are searched using strict parameters
  blastall -p blastn -i aligned.fsa -d rrna_db.fsa -e 1e-4 -F 'm L' -b 1500 -v 1500 -q -5 -r 4 -X 1500 -z 3000000000 -W 9 -U T
Collapse aligned intervals of the same rRNA type and extract the highest scoring alignments from each region
camera_filter_blast
camera_rrna_finder
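The collapse step can be sketched as an interval merge per rRNA type. This Python sketch is an illustration only; the slides do not show the actual logic of camera_rrna_finder. It assumes each hit on a read is a hypothetical `(rrna_type, begin, end, score)` tuple and that overlapping hits of the same type belong to one region.

```python
from collections import defaultdict


def collapse_hits(hits):
    """Merge overlapping hits of the same rRNA type into regions.

    hits: iterable of (rrna_type, begin, end, score) tuples for one read.
    Returns a list of (rrna_type, region_begin, region_end, best_hit), where
    best_hit is the highest scoring hit inside the region.
    """
    by_type = defaultdict(list)
    for hit in hits:
        by_type[hit[0]].append(hit)

    regions = []
    for rrna_type in sorted(by_type):
        current = None
        for hit in sorted(by_type[rrna_type], key=lambda h: (h[1], h[2])):
            _, begin, end, score = hit
            if current is not None and begin <= current["end"]:
                # Overlaps the open region: extend it and keep the best hit.
                current["end"] = max(current["end"], end)
                if score > current["best"][3]:
                    current["best"] = hit
            else:
                if current is not None:
                    regions.append(
                        (rrna_type, current["begin"], current["end"], current["best"])
                    )
                current = {"begin": begin, "end": end, "best": hit}
        if current is not None:
            regions.append(
                (rrna_type, current["begin"], current["end"], current["best"])
            )
    return regions
```

Hits of different types are never merged, so a read carrying both a 16S and a 23S fragment yields separate regions.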
Custom DB
rRNA Finder DB
  /usr/local/annotation/CAMERA/CustomDB/camera_rRNA_finder.all_rRNA.coded.cdhit_80.fsa
5S
  Sequences from Archaea, Bacteria and Eukaryota were obtained from the 5S Ribosomal RNA Database http://biobases.ibch.poznan.pl/5SData/
16S
  Sequences for Archaea and Bacteria were obtained from the Green Genes 16S db http://greengenes.lbl.gov/
18S
  Source was Doug Rusch's 18S database prepared for the GOS paper
23S
  Source was Doug Rusch's 23S database prepared for the GOS paper
rRNA Finder DB
Fasta headers were coded as follows:
>#S [D] ...original.header...
where # is one of (5, 16, 18, 23) and D is one of (A, B, E). The camera_rrna_finder component expects this format.
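The header coding can be made concrete with a small encoder/decoder. This is a Python sketch, not the pipeline's Perl. It assumes the square brackets around D are literal characters and that A, B and E stand for Archaea, Bacteria and Eukaryota; the function names are hypothetical.

```python
import re

# Assumed literal layout: ">16S [B] original header text"
CODED_HEADER = re.compile(r"^>(5|16|18|23)S \[([ABE])\] (.*)$")
DOMAINS = {"A": "Archaea", "B": "Bacteria", "E": "Eukaryota"}  # assumed meaning


def encode_header(subunit, domain, original):
    """Build a coded FASTA header from a subunit size, a domain code and a header."""
    if subunit not in (5, 16, 18, 23):
        raise ValueError("unknown rRNA subunit: %r" % (subunit,))
    if domain not in DOMAINS:
        raise ValueError("unknown domain code: %r" % (domain,))
    return ">%dS [%s] %s" % (subunit, domain, original.lstrip(">"))


def decode_header(line):
    """Return (rrna_type, domain_name, original_header) for a coded header."""
    match = CODED_HEADER.match(line.rstrip("\n"))
    if match is None:
        raise ValueError("not a coded rRNA header: %r" % (line,))
    return match.group(1) + "S", DOMAINS[match.group(2)], match.group(3)
```

Rejecting malformed headers when the database is built is cheaper than discovering them when the finder component parses BLAST hits.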
rRNA Finder DB
CD-HIT was run on the entire database to cluster highly similar sequences, reducing the database size while maintaining a range of diverse sequences
Command line:
  /usr/local/devel/ANNOTATION/bwhitty/cdhit/cd-hit/cd-hit-est -i input_database.fsa -o output_database.fsa -c 0.8 -n 4
Consistency of clustering was checked with a Perl script to ensure no heterogeneous clustering (e.g. 18S and 16S clustering together)
Clusters were consistent
Database size was reduced from 65,591 sequences to 1,329
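The consistency check was done with a Perl script that is not shown in the slides. The following Python sketch shows one way such a check could work. It assumes the CD-HIT `.clstr` layout (a `>Cluster N` line followed by member lines that contain `>name...`) and that each member name begins with the coded subunit type such as `16S`.

```python
import re

# Member lines are assumed to look like: "0\t1500nt, >16S... *"
MEMBER_TYPE = re.compile(r">(5|16|18|23)S\b")


def heterogeneous_clusters(clstr_lines):
    """Return {cluster_name: set_of_rrna_types} for clusters that mix types."""
    clusters = {}
    current = None
    for line in clstr_lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = line[1:]
            clusters[current] = set()
        elif line and current is not None:
            match = MEMBER_TYPE.search(line)
            if match:
                clusters[current].add(match.group(1) + "S")
    return {name: types for name, types in clusters.items() if len(types) > 1}
```

An empty result corresponds to the "clusters were consistent" outcome reported above.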
rRNA Finder
open_reading_frames
ORF Overlaps/ORF Stats
FASTA Headers
>HOT_READ_85779353 /accession=DU765170.1 /sample_id=JGI_SMPL_HF4000_12-21-03 /template_id=JGI_TMPL_ANIW12796 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=1088 /length=1088
>JCVI_ORF_1108836626524 /pep_id=JCVI_PEP_1108836626525 /read_id=HOT_READ_85760722 /begin=0 /end=234 /orientation=1 /5_prime_stop=0 /3_prime_stop=TAG /ttable=11 /length=234 /read_defline="/accession=DU750886.1 /sample_id=JGI_SMPL_HF130_10-06-02 /template_id=JGI_TMPL_ASNF1709 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=841 /length=841"
>JCVI_PEP_1108836626525 /orf_id=JCVI_ORF_1108836626524 /read_id=HOT_READ_85760722 /begin=0 /end=234 /orientation=1 /5_prime_stop=0 /3_prime_stop=TAG /ttable=11 /length=234 /read_defline="/accession=DU750886.1 /sample_id=JGI_SMPL_HF130_10-06-02 /template_id=JGI_TMPL_ASNF1709 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=841 /length=841"
>JCVI_NT_1108826205795 /read_id=HOT_READ_85801707 /begin=785 /end=858 /orientation=1 /type=Asn_tRNA /ergatis_id=1108826197895 /defline="HOT_READ_85801707 /accession=DU787412.1 /sample_id=JGI_SMPL_HF770_12-21-03 /template_id=JGI_TMPL_APKH2110 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=902 /length=902"
>JCVI_NT_1108806998652 /read_id=HOT_READ_85760731 /begin=55 /end=847 /orientation=0 /type=23S_rRNA /ergatis_id=1108826197895 /read_defline="/accession=DU750895.1 /sample_id=JGI_SMPL_HF130_10-06-02 /template_id=JGI_TMPL_ASNF1714 /sequencing_direction=forward /site_id=HAWAII_SITE_HOT /clr_range_begin=0 /clr_range_end=847 /length=847"
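These deflines are a sequence ID followed by `/key=value` fields, where a value may be a double-quoted string that itself holds the parent read's fields. A minimal Python sketch of a parser for this layout follows; the function names are hypothetical and the pipeline's own parsing code is Perl.

```python
import re

# A field is /key=value; the value is either a double-quoted string or a bare token.
FIELD = re.compile(r'/(\w+)=(?:"([^"]*)"|(\S+))')


def parse_fields(text):
    """Return a dict of the /key=value fields found in text."""
    fields = {}
    for key, quoted, bare in FIELD.findall(text):
        fields[key] = bare if bare else quoted
    return fields


def parse_defline(line):
    """Split '>ID /key=value ...' into (ID, fields)."""
    line = line.strip()
    if line.startswith(">"):
        line = line[1:]
    seq_id, _, rest = line.partition(" ")
    return seq_id, parse_fields(rest)
```

Because a quoted value is consumed as a single match, the nested read fields do not leak into the outer dictionary; they can be recovered by calling `parse_fields` again on the `read_defline` (or `defline`) value.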
The absence of called ORFs in this region of the read is due to the soft-masked rRNA sequence
RNAmmer didn’t identify the 23S sequence, though it is capable of finding 23S
Again, RNAmmer failed to identify rRNA sequence
BLAST-based approach does a pretty good job of finding correct boundaries
These ORFs have >150 unmasked bases
BLAST-based rRNA finding appears to outperform RNAmmer for 23S sequences, and some 16S
GOS (Incremental) Clustering Pipeline
http://camera.venterinstitute.org/wiki/display/VISWCAMERA/Incremental+clustering%2C+work+flow+details
Clustering Overview
[Diagram: All Public Proteins + GOS ORFs; Non-Redundant 90% Identity CD-HIT Sequence Representatives; Longest Sequence Representatives; Core Clusters]
GOS v1.2
Historical Artifacts (with respect to annotation)
CAMERA Polypeptide Annotation Pipeline
Thoughts on Specifications
Annotation rules should not be literally codified as Perl code (and only Perl code)!!! (especially when the “decision makers” never look at the code)
What tools do we trust?
What cutoffs do we use?
What evidence/data types do we consider?
These will (and in some cases should) change over time
More Thoughts
Specifications are easier to change than code, so code should be written to support change
But unless they’re defined first, the specifications will be a moving target
(My) Design Objectives
Must be able to add/remove annotation data sources as the annotation SOP changes
Must be able to easily change the ways in which these annotation data types are applied/combined to produce final annotation
Must be able to change/expand the types of final annotation data we are producing
Object-Oriented Design Approach
OOP in Perl == *, but lesser of two evils (don’t ask me what the other evil is, but it must be pretty evil)
Encapsulates possible sources of change and prevents them from affecting downstream components (like HACCP)
Polymorphism of $parser->parse($infile) producing annotation objects is nice
Re-use was not really a motive here
*Damian Conway in his OOP Perl book says using OOP in Perl yields 5X performance hit
Annotation Pipeline Overview
[Diagram: Annotation Tool(s); Annotation Source Data; Parser(s); Annotation Data Object(s); Annotation Rules; Final Annotation Data]
We can make changes to the annotation rules without necessarily having to re-run or re-parse the data
Design Objectives for Parsers
A parser must:
Produce polypeptides with associated AnnotationData objects of a defined type
Produce AnnotationData objects with attributes specified in a consistent way
  E.g.: All parsers should produce EC number attributes in the full four-field form (‘1.1.1.1’ through ‘1.-.-.-’), not sometimes an abbreviated form such as ‘1.-’. Multiple values should be split. Any clean-up or verification should be done before the AnnotationData object is created; if the data is invalid, the attribute should not be populated, or the object should not be created.
Produce annotation data objects that are independent of the source annotation data they were parsed from
  E.g.: They have already been canonized as a type of ‘trusted annotation evidence type’ when they are created as AnnotationData objects. These trusted types are defined in the annotation SOP.
These features create a separation between how trusted evidence is defined (input data), and how the evidence is used to produce annotation (annotation rules)
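The EC number requirement can be sketched as a normalization step that runs before any object is created. This is a Python illustration only; the real parsers are Perl classes, and `AnnotationData` here is a simplified stand-in with hypothetical function names.

```python
import re


def normalize_ec(raw):
    """Split a raw EC string and return valid numbers in four-field form.

    '1.-' becomes '1.-.-.-'; values that cannot be valid EC numbers are dropped.
    """
    results = []
    for value in re.split(r"[\s,;]+", raw.strip()):
        fields = value.split(".")
        if not value or len(fields) > 4:
            continue
        fields += ["-"] * (4 - len(fields))  # pad abbreviated forms
        if not fields[0].isdigit():
            continue
        valid, seen_dash = True, False
        for field in fields[1:]:
            if field == "-":
                seen_dash = True
            elif not field.isdigit() or seen_dash:
                # Non-numeric field, or a number after a '-' placeholder.
                valid = False
        if valid:
            results.append(".".join(fields))
    return results


class AnnotationData:
    """Simplified stand-in: a trusted evidence type plus cleaned attributes."""

    def __init__(self, data_type, attributes):
        self.type = data_type
        self.attributes = attributes


def make_ec_annotation(data_type, raw_ec):
    """Create an AnnotationData object, or None if no valid EC number remains."""
    ec_numbers = normalize_ec(raw_ec)
    if not ec_numbers:
        return None  # invalid data: the object is not created
    return AnnotationData(data_type, {"EC": ec_numbers})
```

With clean-up done at this boundary, the annotation rules never need to know which tool produced a value or how that tool formats it.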
AnnotationData Objects
[Diagram: a Polypeptide object with associated AnnotationData object(s)]
AnnotationData
AnnotationData::Polypeptide
  type: [some string]
  attributes: common_name, gene_symbol, EC, GO, TIGR_role, …
AnnotationRules
AnnotationRules object implements the rules from the annotation SOP document
AnnotationRules::PredictedProtein takes a Polypeptide object with associated AnnotationData objects of varying type and applies the annotation rules to create a final AnnotationData object
AnnotationRules
Rules are encoded as an array in the following format:
  ANNOTATION_TYPE|OPERATOR|ATTRIBUTE1 ATTRIBUTE2
Where OPERATOR is one of:
  = for assign attribute (if unassigned)
  + for append attribute
  - for overwrite attribute
Operators are applied through a hash of handler subroutines, so additional operators can be defined as needed
AnnotationRules::PredictedProtein
my @annotation_order = (
    ## equivalog level tigrfam hits
    'TIGRFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FullLength::Exception|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FullLength::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FRAG::Equivalog|=|GO',
    'TIGRFAM::FRAG::Exception|=|GO',
    'TIGRFAM::FRAG::HypotheticalEquivalog|=|GO',
    'TIGRFAM::FullLength::Domain|=|GO',
    'PandaBLASTP::Characterized|=|GO',
    'PRIAM|=|GO EC',
    ## equivalog level hits vs tigrfam frag
    'TIGRFAM::FRAG::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FRAG::Exception|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FRAG::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',
    ## characterized high confidence blast hit
    'PandaBLASTP::Characterized|=|common_name gene_symbol',
    ## pfam and non-equivalog tigrfams
    'PFAM::FullLength::Equivalog|=|common_name gene_symbol GO EC TIGR_role',
    'PFAM::FullLength::HypotheticalEquivalog|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FullLength::Subfamily|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FullLength::Superfamily|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FullLength::EquivalogDomain|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FullLength::HypotheticalEquivalogDomain|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FullLength::SubfamilyDomain|=|common_name gene_symbol GO EC TIGR_role',
    'TIGRFAM::FullLength::Domain|=|common_name gene_symbol GO EC TIGR_role',
    ## …
);
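How an ordered rule list like this could be applied is sketched below in Python. The actual AnnotationRules classes are Perl and their internals are not shown in the slides, so the data layout here (a dict of evidence per annotation type) and all function names are assumptions. The operator semantics follow the `=`, `+` and `-` definitions given for the rule format.

```python
def assign(final, attribute, values):
    """'=': assign the attribute only if it is still unassigned."""
    if not final.get(attribute):
        final[attribute] = list(values)


def append(final, attribute, values):
    """'+': append values that are not already present."""
    current = final.setdefault(attribute, [])
    current.extend(v for v in values if v not in current)


def overwrite(final, attribute, values):
    """'-': replace whatever was assigned before."""
    final[attribute] = list(values)


# Operators are looked up in a table of handlers, so new ones are easy to add.
HANDLERS = {"=": assign, "+": append, "-": overwrite}


def apply_rules(rules, evidence):
    """Apply 'TYPE|OPERATOR|ATTR1 ATTR2' rules in order.

    evidence: {annotation_type: {attribute: [values]}} for one polypeptide.
    Returns the final {attribute: [values]} annotation.
    """
    final = {}
    for rule in rules:
        annotation_type, operator, attributes = rule.split("|")
        source = evidence.get(annotation_type)
        if source is None:
            continue  # no evidence of this type for this polypeptide
        for attribute in attributes.split():
            values = source.get(attribute)
            if values:
                HANDLERS[operator](final, attribute, values)
    return final
```

Because every rule in the list above uses `=`, the list order acts as a precedence order: the first evidence type that supplies an attribute wins, and later types only fill attributes that are still empty.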
CAMERA Annotation Pipeline
CAMERA-specific Ergatis components
camera_annotation_parser
camera_annotation_rules
camera_annotation_rules
CAMERA-specific Code in SVN
http://iwebsvn.tigr.org/listing.php?repname=ANNOTATION&path=%2FCAMERA%2F&rev=0&sc=1
Future Development (My 2 cents)
Pipeline development must be driven by annotation SOP development work
  Feedback on pipeline bugs must be vigilantly kept separate from feedback on annotation SOP bugs
  First discuss and update the SOP, then modify the code
Cluster summary annotation
  Shortest path here seems to be a combination of GO Slim and EC assignments?
  GO consortium makes some scripts available for summarizing sets of GO assignments
  If using the current code, a PolypeptideSet container class exists already. Cluster members can be added to a PolypeptideSet and that can be used as input to an AnnotationRules::FinalCluster object that is similar to the one for PredictedProtein, but with a different set of handler routines.
Incremental clustering pipeline
  Good luck