A generic and modular platform for automated
sequence processing and annotation
Arthur Gruber
Instituto de Ciências Biomédicas Universidade de São Paulo
AG-ICB-USP
Sequence processing and annotation
• Analyzing and processing sequencing reads is a tedious and error-prone job
• Multistep process
  • All sequences are submitted to the same processing steps
  • Sequences processed by a given step are the input for the next one
• Requires different programs
• Integrated system: PIPELINE
Problem: how to build pipelines
• Creating scripts for new pipelines involves good programming knowledge
• Once created, most pipelines are difficult to change and customize
• Many programs must be used
  • Phred, Cross_match, Phrap, CAP3, Blast, HMMer, InterproScan, TMHMM, etc.
Problem: how to build pipelines
• Each program needs a specific environment to work (e.g. directories with specific names)
• Each program produces output in different ways and formats
• Integrating programs is a hard task
Solution: creating an environment to build pipelines
Requirements:
• Abstract the environment of each program
• Abstract the output format
• Easily specify the “coupling” of different programs
• Document how the pipeline was built
  • Easy to inspect and monitor
  • Easy to store (e.g. in a database)
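The requirements above can be illustrated with a minimal Python sketch. This is not EGene's code; the names `Component`, `run_pipeline` and the two toy steps are hypothetical and exist only for illustration. Each component hides its own logic behind one common interface, the pipeline itself is a plain list that documents how the steps are coupled, and a log records what each step did.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A sequence record is a plain dict here, e.g. {"id": "r1", "seq": "ACGT"}.
Record = Dict[str, str]


@dataclass
class Component:
    """Hypothetical wrapper: one named step with a uniform input and output."""
    name: str
    run: Callable[[List[Record]], List[Record]]


def run_pipeline(components: List[Component],
                 records: List[Record]) -> Tuple[List[Record], List[str]]:
    """Feed the output of each step into the next one and log record counts."""
    log = []
    for component in components:
        before = len(records)
        records = component.run(records)
        log.append(f"{component.name}: {before} -> {len(records)} records")
    return records, log


def uppercase(records: List[Record]) -> List[Record]:
    """Toy step: normalize bases to upper case."""
    return [{**r, "seq": r["seq"].upper()} for r in records]


def drop_short(records: List[Record], min_len: int = 4) -> List[Record]:
    """Toy step: discard sequences shorter than min_len."""
    return [r for r in records if len(r["seq"]) >= min_len]


# The "coupling" is just the order of this list, so it is easy to inspect or store.
PIPELINE = [Component("uppercase", uppercase), Component("drop_short", drop_short)]

reads = [{"id": "r1", "seq": "acgtac"}, {"id": "r2", "seq": "ac"}]
result, log = run_pipeline(PIPELINE, reads)
```

Because every step takes and returns the same record list, a step can be added, removed or reordered by editing `PIPELINE` alone.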
EGene
Aims and characteristics:
• To develop a platform for pipeline construction that is easy to use and configure
  • Big sequencing centers already have sophisticated pipelines, but many are not published and/or publicly available
  • They are too complex for small and mid-sized labs
• Platform should be generic
  • Useful for any sequencing project
• Platform should provide components for the most common tasks
• New components should be easy to develop
EGene: a generic platform for pipeline construction
Characteristics:
• Written in Perl
• Modular
• Easy to build specific components to interact with third-party programs
• EGene components can be integrated to fulfill user-specific needs
• CoEd: a graphical configuration editor written in Java, with a user-friendly interface
Sequence processing pipeline: the Eimeria ORESTES project
• Input: chromatogram files
• Base calling and quality assignment: Phred
• Primer screening and masking: Cross_Match
• Vector masking and screening: Cross_Match
• Quality filtering: Filter-quality.pl
• End trimming: Trim-ends.pl
• Size filtering: Filter-size
• Mitochondrial sequence filtering: Cross_Match
• Plastid sequence filtering: Cross_Match
• Ribosomal sequence filtering: Cross_Match
• Repetitive sequence filtering: Cross_Match
• Bacterial sequence filtering: Blast
• Chicken sequence filtering: Blast
• Human sequence filtering: Blast
• Assembly: CAP3
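Three of the steps listed above (quality filtering, end trimming and size filtering) can be sketched in a few lines of Python. This is an illustration only, not the logic of Filter-quality.pl, Trim-ends.pl or Filter-size, which the slides do not describe. The function names, the thresholds and the order in which the steps are applied are all assumptions.

```python
from typing import List, Tuple

# A read is (bases, Phred-style quality scores), one score per base.
Read = Tuple[str, List[int]]


def passes_quality(read: Read, min_q: int = 20, min_percent: int = 80) -> bool:
    """Keep a read if at least min_percent of its bases reach min_q."""
    _, quals = read
    if not quals:
        return False
    good = sum(1 for q in quals if q >= min_q)
    return good * 100 >= min_percent * len(quals)


def trim_ends(read: Read, min_q: int = 20) -> Read:
    """Remove bases scoring below min_q from both ends of the read."""
    seq, quals = read
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


def passes_size(read: Read, min_len: int = 5) -> bool:
    """Keep a read only if it still has at least min_len bases."""
    return len(read[0]) >= min_len


def preprocess(reads: List[Read]) -> List[Read]:
    """Apply quality filtering, then end trimming, then size filtering."""
    kept = []
    for read in reads:
        if not passes_quality(read):
            continue
        trimmed = trim_ends(read)
        if passes_size(trimmed):
            kept.append(trimmed)
    return kept


reads = [
    ("NACGTACGTN", [5] + [30] * 8 + [5]),   # passes; trimmed to 8 bases
    ("ACGTAC", [10, 10, 10, 30, 10, 10]),   # fails the quality filter
    ("ACGTN", [30, 30, 30, 30, 5]),         # too short after trimming
]
clean = preprocess(reads)
```

Only the first read survives all three steps, which shows why the order of the filters matters: a read can pass the quality filter and still be discarded once its low-quality ends are removed.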
Sequence processing and graphical report
How to get EGene
Internet site: http://www.coccidia.icb.usp.br/egene
- EGene is distributed under the GNU General Public License
- EGene is Open Source
Recent developments
• Incorporation of forks
• Enhancement of the data model: incorporation of annotation evidence
• Development of annotation components
• Evidence-based annotation
Genome annotation
• Annotation is the process of adding information to a DNA sequence.
• The information usually has a DNA coordinate.
• Features can be repeats, genes, promoters, protein domains, etc.
• Features can be cross-referenced to other databases (e.g. Pfam, PubMed).
Annotation file
A typical annotation file contains:
A header with:
• Information about the sequence
• Organism
• Authors
• References
• Comments
A feature table containing:
• Sequence features and coordinates
Feature table format
• Flatfile format
• Format definition available at http://www.ncbi.nlm.nih.gov/projects/collab/FT/
• Covers DDBJ/EMBL/GenBank
• Defines all accepted annotation terms and hierarchy
Incorporating annotation
• EGene’s data model was enriched to incorporate annotation information into the representation of the sequences
• All collected data is converted into a proprietary XML format
• The XML can be easily converted into different annotation formats: Feature Table, GFF3, etc.
• We provide some converters and new ones can be easily implemented
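As an illustration of what such a converter does, here is a minimal Python sketch that writes one feature as a GFF3 line. EGene's XML schema is not described in these slides, so the `Feature` class, its field names and the example values (including `EGene` as the source column) are hypothetical. The nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes) and the percent-escaping of reserved characters follow the GFF3 specification.

```python
from dataclasses import dataclass, field
from typing import Dict

# Characters that must be percent-encoded inside GFF3 attribute values.
_RESERVED = {"%": "%25", ";": "%3B", "=": "%3D", "&": "%26",
             ",": "%2C", "\t": "%09", "\n": "%0A"}


def escape(value: str) -> str:
    """Percent-encode the reserved characters of a GFF3 attribute value."""
    return "".join(_RESERVED.get(ch, ch) for ch in value)


@dataclass
class Feature:
    """Hypothetical in-memory feature; coordinates are 1-based and inclusive."""
    seqid: str
    source: str
    type: str
    start: int
    end: int
    strand: str = "."
    score: str = "."
    phase: str = "."
    attributes: Dict[str, str] = field(default_factory=dict)


def to_gff3(feature: Feature) -> str:
    """Render one feature as a single nine-column GFF3 line."""
    attrs = ";".join(f"{key}={escape(value)}"
                     for key, value in feature.attributes.items()) or "."
    columns = [feature.seqid, feature.source, feature.type,
               str(feature.start), str(feature.end),
               feature.score, feature.strand, feature.phase, attrs]
    return "\t".join(columns)


gene = Feature("contig1", "EGene", "gene", 100, 900, strand="+",
               attributes={"ID": "gene1", "Name": "kinase; putative"})
line = to_gff3(gene)
```

A converter for another format, such as the feature table, would reuse the same `Feature` objects and change only the rendering function.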
Annotation components
• A comprehensive set of annotation components has been implemented:
  • ORF finding and translation
  • Tandem repeats finding: TRF, String, mREPS
  • tRNA finding: tRNAscan-SE
  • Gene prediction: Genscan, GlimmerM, GlimmerHMM, Twinscan, Phat, ESTscan, SNAP
  • Motif finding: HMMer x Pfam, RPS-BLAST, InterproScan
  • Similarity search: BLAST
  • EST mapping: Sim4, Exonerate
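To make the first component in the list above concrete, here is a minimal Python sketch of ORF finding and translation. It is not EGene's component, whose behavior is not described in these slides. It assumes the standard genetic code, scans only the three forward-strand frames, and reports ORFs that run from an ATG to the next in-frame stop codon; `find_orfs`, `translate` and the `min_codons` parameter are hypothetical names.

```python
from itertools import product
from typing import List, Tuple

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}
STOPS = {codon for codon, aa in CODON_TABLE.items() if aa == "*"}


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def find_orfs(seq: str, min_codons: int = 2) -> List[Tuple[int, int, str]]:
    """Return (start, end, protein) for forward-strand ORFs.

    start and end are 0-based and half-open; end includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                protein = translate(seq[start:i])
                if len(protein) >= min_codons:
                    orfs.append((start, i + 3, protein))
                start = None
    return orfs


orfs = find_orfs("CCATGGCTTAAGG")
```

A production component would also scan the reverse complement and handle alternative genetic codes; both are left out here to keep the sketch short.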
Annotation components
• A comprehensive set of annotation components has been implemented:
  • Transmembrane domain finding: TMHMM, Phobius
  • Signal peptide: SignalP, Phobius
  • GPI anchor: DGPI
  • GO mapping and quantification
  • Orthology assignment and quantification: COG/KOG
  • Pathway mapping: KEGG
  • Annotation visualization with GBrowse: web inspection
  • Annotation report generation: feature table, GFF3
  • Web site generation: HTML/PHP
EGene generates annotation files that can be inspected using regular editors
(Artemis, Apollo, etc.)
EGene’s annotation
• EGene can generate annotation in different formats:
  • XML: local use, easy to feed a database management system
  • Feature table
    • Convenient for manual curation on Artemis
    • Ready for submission to public databases
  • GFF3
    • Current annotation interchange format
    • Manual curation/visualization on Artemis, Apollo and GMOD Genome Browser
    • Compliant with Sequence Ontology terms
EGene performs GO term mapping and constructs web pages for inspection
EGene performs an integrated and quantitative orthology analysis
(COG/KOG) and constructs web pages
EGene automatically constructs a full web site for evidence inspection
Current developments
• Full integration with a database management system
• Automated task distribution management across multiple processing nodes
• Development of a graphical interface for evidence inspection and manual curation
• “Intelligent” annotation: use of probabilistic methods to evaluate evidence and assign protein functions
Why use EGene2?
• Ideal for small- and mid-sized laboratories
  • Genome and EST sequencing projects
• Conceived for biologists
  • Does not require programming skills
• Generic tool for any sequencing/annotation project, customized for specific user requirements
• Very easy to implement new components
• Multiplatform: MacOS, UNIX, Linux, etc.
• Well documented: HOWTOs, tutorials, example datasets available
• Easy configuration
  • CoEd: application with a GUI for pipeline construction
  • Generic pipeline templates provided
Research team
Prof. Alan M. Durham - IME-USP
Annotation
• Milene Ferro - ICB-USP
• Ricardo Yamamoto Abe - IME-USP
• Luiz Thiberio Rangel - ICB-USP
Sequence pre-processing
• André Yoshiaki Kashiwabara - IME-USP
• Fernando Tadashi G. Matsunaga - ICB-USP
• Paulo Henrique Ahagon - ICB-USP
• Leonardo Varuzza - ICB-USP
Financial Support
• FAPESP - São Paulo State Science Foundation
• CNPq - National Research Council
Thanks for your attention