
Jun 03, 2020

A generic and modular platform for automated sequence processing and annotation

Arthur Gruber

Instituto de Ciências Biomédicas Universidade de São Paulo

AG-ICB-USP


Sequence processing and annotation

• Analyzing and processing sequencing reads is a tedious and error-prone job

• Multistep process
  • All sequences are submitted to the same processing steps
  • Sequences processed by a given step are the input for the next one

• Requires different programs

• Integrated system – PIPELINE
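
The multistep idea, in which every read passes through the same ordered steps and each step consumes the output of the previous one, can be sketched in a few lines of Python. This is an illustration only, not EGene code (EGene itself is written in Perl); the step functions `drop_short` and `to_upper` and the helper `run_pipeline` are hypothetical names chosen for this sketch.

```python
from functools import reduce

# A "step" takes a dict of {read_name: sequence} and returns a new dict.

def drop_short(reads, min_len=5):
    """Hypothetical step: discard reads shorter than min_len bases."""
    return {name: seq for name, seq in reads.items() if len(seq) >= min_len}

def to_upper(reads):
    """Hypothetical step: normalise sequences to upper case."""
    return {name: seq.upper() for name, seq in reads.items()}

def run_pipeline(reads, steps):
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda data, step: step(data), steps, reads)

reads = {"r1": "acgtacgt", "r2": "acg"}
result = run_pipeline(reads, [drop_short, to_upper])
print(result)
```

Changing the pipeline then amounts to editing the list of steps rather than rewriting a script.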


Problem: how to build pipelines

• Creating scripts for new pipelines requires good programming knowledge

• Once created, most pipelines are difficult to change and customize

• Many programs must be used
  • Phred, Cross_match, Phrap, CAP3, BLAST, HMMer, InterProScan, TMHMM, etc.


Problem: how to build pipelines

• Each program needs a specific environment to work (e.g. directories with specific names)

• Each program produces output in different ways and formats

• Integrating programs is a hard task


Solution: creating an environment to build pipelines

Requirements:

• Abstract the environment of each program

• Abstract the output format

• Easily specify “coupling” of different programs

• Document how the pipeline was built
  • Easy to inspect and monitor
  • Easy to store (e.g. in a database)
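
One way to meet these requirements is to describe a pipeline as plain data that names each component and its parameters. The sketch below is a hedged Python illustration of that idea and does not reproduce EGene's actual configuration format; the registry `COMPONENTS`, the decorator `component`, the two example components and the config keys are all assumptions made for this example.

```python
import json

# Registry mapping a component name to the function that implements it.
COMPONENTS = {}

def component(name):
    """Decorator that registers a function under a component name."""
    def register(func):
        COMPONENTS[name] = func
        return func
    return register

@component("filter_size")
def filter_size(seqs, min_length=100):
    """Hypothetical component: keep sequences of at least min_length bases."""
    return [s for s in seqs if len(s) >= min_length]

@component("mask_lowercase")
def mask_lowercase(seqs, mask_char="X"):
    """Hypothetical component: replace lower-case bases with mask_char."""
    return ["".join(mask_char if c.islower() else c for c in s) for s in seqs]

def run(config, seqs):
    """Run the components named in config, in the order they are listed."""
    for entry in config:
        func = COMPONENTS[entry["component"]]
        seqs = func(seqs, **entry.get("params", {}))
    return seqs

# The pipeline description is plain data: it documents how the pipeline
# was built and can be stored as text, for example in a database.
config = [
    {"component": "mask_lowercase", "params": {"mask_char": "N"}},
    {"component": "filter_size", "params": {"min_length": 6}},
]
stored = json.dumps(config)
output = run(json.loads(stored), ["ACGTacgt", "ACG"])
print(output)
```

Because the coupling of components lives in `config` rather than in code, the same description can be inspected, monitored, or reloaded later.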


EGene

Aims and characteristics:

• To develop a platform for pipeline construction that is simple to use and configure
  • Big sequencing centers already have sophisticated pipelines, but many are not published and/or publicly available
  • They are too complex for small- and mid-sized labs

• Platform should be generic
  • Useful for any sequencing project

• Platform should provide components for the most common tasks

• New components should be easy to develop


EGene: a generic platform for pipeline construction

Characteristics:

• Written in Perl language

• Modular

• Easy to build specific components to interact with third-party programs

• EGene components can be integrated to fulfill user-specific needs

• CoEd – a graphical configuration editor written in Java – user-friendly interface


Sequence processing pipeline: the Eimeria ORESTES project

• Input: chromatogram files
• Base calling and quality assignment: Phred
• Primer screening and masking: Cross_Match
• Vector masking and screening: Cross_Match
• Quality filtering: Filter-quality.pl
• End trimming: Trim-ends.pl
• Size filtering: Filter-size
• Mitochondrial sequence filtering: Cross_Match
• Plastid sequence filtering: Cross_Match
• Ribosomal sequence filtering: Cross_Match
• Repetitive sequence filtering: Cross_Match
• Bacterial sequence filtering: Blast
• Chicken sequence filtering: Blast
• Human sequence filtering: Blast
• Assembly: CAP3
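
To give a feel for what the quality filtering, end trimming and size filtering steps do, here is a minimal Python sketch operating on one read and its Phred-style quality values. It is not the logic of Trim-ends.pl, Filter-quality.pl or Filter-size; the function names, thresholds and example read are assumptions, and the order in which the checks are applied is illustrative.

```python
def trim_ends(seq, quals, min_q=20):
    """Trim bases from both ends until a base with quality >= min_q is met."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]

def passes_quality(quals, min_q=20, min_fraction=0.8):
    """Accept a read if enough of its bases reach the quality threshold."""
    if not quals:
        return False
    good = sum(1 for q in quals if q >= min_q)
    return good / len(quals) >= min_fraction

def passes_size(seq, min_length=4):
    """Accept a read only if it is still long enough to be useful."""
    return len(seq) >= min_length

# Invented example read: low-quality bases at both ends, one weak base inside.
seq = "NNACGTACNN"
quals = [5, 8, 30, 35, 40, 12, 38, 33, 9, 4]

trimmed_seq, trimmed_quals = trim_ends(seq, quals)
keep = passes_quality(trimmed_quals) and passes_size(trimmed_seq)
print(trimmed_seq, keep)
```

Each function has the same shape (read in, read or verdict out), which is what lets such steps be chained in any order a project requires.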


Sequence processing and graphical report


How to get EGene

Internet site: http://www.coccidia.icb.usp.br/egene

- EGene is distributed under the GNU General Public License
- EGene is Open Source


Recent developments

• Incorporation of forks

• Enhancement of the data model – incorporation of annotation evidence

• Development of annotation components
  • Evidence-based annotation


Genome annotation

• Annotation is the process of adding information to a DNA sequence.

• The information usually has a DNA coordinate.

• Features could be repeats, genes, promoters, protein domains, etc.

• Features can be cross-referenced to other databases (e.g. Pfam/Pubmed)
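
The points above suggest a simple data structure: a feature has a type, coordinates on a sequence, and optional cross-references. The following Python sketch shows one possible representation; the class `Feature`, its field names, the helper `features_overlapping` and all example values (coordinates and the Pfam cross-reference) are made up for illustration and are not EGene's data model.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Feature:
    seq_id: str                 # sequence the feature is located on
    feature_type: str           # e.g. "gene", "repeat", "promoter"
    start: int                  # 1-based, inclusive
    end: int                    # 1-based, inclusive
    strand: str = "+"
    xrefs: Dict[str, str] = field(default_factory=dict)  # database -> accession

    def length(self):
        return self.end - self.start + 1

def features_overlapping(features, start, end):
    """Return the features that overlap the interval [start, end]."""
    return [f for f in features if f.start <= end and f.end >= start]

features = [
    Feature("contig1", "gene", 100, 900),
    Feature("contig1", "protein_domain", 150, 450, xrefs={"Pfam": "PF00069"}),
    Feature("contig1", "repeat", 1200, 1260),
]
hits = features_overlapping(features, 400, 1000)
print([f.feature_type for f in hits])
```

Keeping coordinates on every feature is what makes queries such as "what lies in this region?" straightforward.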


Annotation file

A typical annotation file contains:

A header with:
• Information about the sequence
• Organism
• Authors
• References
• Comments

A feature table containing:
• Sequence features and coordinates


Feature table format

• Flatfile format

• Format definition available at http://www.ncbi.nlm.nih.gov/projects/collab/FT/

• Covers DDBJ/EMBL/GenBank

• Defines all accepted annotation terms and hierarchy


Incorporating annotation

• EGene’s data model was enriched to incorporate annotation information into the representation of the sequences

• All collected data is converted into a proprietary XML format

• The XML can be easily converted into different annotation formats: Feature Table, GFF3, etc.

• We provide some converters and new ones can be easily implemented
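
As a rough illustration of what such a converter involves, the Python sketch below turns a small XML document into GFF3 lines. EGene's XML schema is not shown in these slides, so the element and attribute names used here (`sequence`, `feature`, `type`, `start`, `end`, `strand`, `source`, `name`) are invented for the example, as are the feature values; only the nine tab-separated GFF3 columns follow the published format.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout, invented for this sketch (not EGene's schema).
XML = """
<sequence id="contig1">
  <feature type="gene" start="100" end="900" strand="+" source="Genscan" name="gene0001"/>
  <feature type="tRNA" start="1500" end="1572" strand="-" source="tRNAscan-SE" name="trna0001"/>
</sequence>
"""

def xml_to_gff3(xml_text):
    """Convert the hypothetical XML above into GFF3 text."""
    root = ET.fromstring(xml_text)
    lines = ["##gff-version 3"]
    for feat in root.iter("feature"):
        columns = [
            root.get("id"),             # 1. seqid
            feat.get("source"),         # 2. source (program that made the call)
            feat.get("type"),           # 3. feature type
            feat.get("start"),          # 4. start (1-based)
            feat.get("end"),            # 5. end
            ".",                        # 6. score (not available here)
            feat.get("strand"),         # 7. strand
            ".",                        # 8. phase (not set in this sketch)
            "ID=" + feat.get("name"),   # 9. attributes
        ]
        lines.append("\t".join(columns))
    return "\n".join(lines)

gff3 = xml_to_gff3(XML)
print(gff3)
```

A converter for another target format would reuse the same parsing loop and change only how each feature is written out.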


Annotation components

• A comprehensive set of annotation components has been implemented:
  • ORF finding and translation
  • Tandem repeats finding: TRF, String, mREPS
  • tRNA finding: tRNAscan-SE
  • Gene prediction: Genscan, GlimmerM, GlimmerHMM, Twinscan, Phat, ESTscan, SNAP
  • Motif finding: HMMer against Pfam, RPS-BLAST, InterProScan
  • Similarity search: BLAST
  • EST mapping: Sim4, Exonerate
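
The first item in this list, ORF finding and translation, is simple enough to sketch directly. The Python example below is a minimal illustration and not EGene's component: it scans only the forward strand, uses the standard genetic code, requires an ATG start, and does not handle ambiguous bases such as N. The names `translate`, `find_orfs` and `min_codons` are assumptions made for this sketch.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
STOP_CODONS = {"TAA", "TAG", "TGA"}

def translate(dna):
    """Translate a DNA string codon by codon ('*' marks a stop codon)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def find_orfs(dna, min_codons=3):
    """Return (start, end, protein) for each ATG...stop ORF on the forward strand.

    Coordinates are 0-based and end-exclusive; the end includes the stop codon.
    min_codons counts the coding codons, excluding the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, translate(dna[start:i])))
                start = None
    return orfs

dna = "CCATGGCTAAATGATT"
print(find_orfs(dna))
```

A fuller version would also scan the reverse complement and allow alternative genetic codes.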


Annotation components

• A comprehensive set of annotation components has been implemented:
  • Transmembrane domain finding: TMHMM, Phobius
  • Signal peptide: SignalP, Phobius
  • GPI anchor: DGPI
  • GO mapping and quantification
  • Orthology assignment and quantification: COG/KOG
  • Pathway mapping: KEGG
  • Annotation visualization with GBrowse: web inspection
  • Annotation report generation: feature table, GFF3
  • Web site generation: HTML/PHP


EGene generates annotation files that can be inspected using standard annotation editors (Artemis, Apollo, etc.)


EGene’s annotation

• EGene can generate annotation in different formats:

• XML – local use, easy to feed a database management system

• Feature table
  • Convenient for manual curation on Artemis
  • Ready for submission to public databases

• GFF3
  • Current annotation interchange format
  • Manual curation/visualization on Artemis, Apollo and GMOD Genome Browser
  • Compliant with Sequence Ontology terms


EGene performs GO term mapping and constructs web pages for inspection

Page 61: A generic and modular 2 platform for automated sequence ... · Incorporating annotation •EGene’s data model was enriched to incorporate annotation information into the representation

EGene performs an integrated and quantitative orthology analysis (COG/KOG) and constructs web pages


EGene automatically constructs a full web site for evidence inspection


Current developments

• Full integration with a database management system

• Automated task distribution management across multiple processing nodes

• Development of a graphical interface for evidence inspection and manual curation

• “Intelligent” annotation – use of probabilistic methods to evaluate evidence and designate protein functions


Why use EGene2?

• Ideal for small- and mid-sized laboratories
  • Genome and EST sequencing projects

• Conceived for biologists
  • Does not require programming skills

• Generic tool for any sequencing/annotation project – customized for specific user’s requirements

• Very easy to implement new components

• Multiplatform – MacOS, UNIX, Linux, etc.

• Well documented – HOWTOs, tutorials, example datasets available

• Easy configuration
  • CoEd – application with a GUI for pipeline construction
  • Generic pipeline templates provided


Research team

Prof. Alan M. Durham – IME-USP

Annotation
• Milene Ferro – ICB-USP
• Ricardo Yamamoto Abe – IME-USP
• Luiz Thiberio Rangel – ICB-USP

Sequence pre-processing
• André Yoshiaki Kashiwabara – IME-USP
• Fernando Tadashi G. Matsunaga – ICB-USP
• Paulo Henrique Ahagon – ICB-USP
• Leonardo Varuzza – ICB-USP


Financial Support

• FAPESP - São Paulo State Science Foundation

• CNPq - National Research Council


Thanks for your attention
