MAKER Annotation Process
Example of Glossina

VectorBase
http://www.vectorbase.org

Karyn Mégy, Dan Hughes

Hinxton Developer Meeting February 2012

Annotation: aims and means

• Aims
  – Preliminary
  – Locus rather than exact position
• Means
  – Automatic annotation
    • By similarity
    • Ab initio
  – Manual annotation
    • By regions
    • By gene families

Annotation: similarity vs. ab initio

• Similarity
  – Similarity to known sequences
    -> only known genes
    -> based on available data (quantity, quality)
• Ab initio
  – Follow a gene “recipe”
    -> potentially identify new genes
    -> over-predictions

Ensembl annotation

Raw genome -> masked genome
Masking: RepeatModeler repeats + known repeats/transposons

1. Community annotation
2. Protein, species specific
3. Transcriptome, species specific
4. Protein, ‘close’ specific
5. Ab initio

Ensembl annotation

• Similarity-focused
• Data-rich organisms
• Fiddly, time-consuming
• Rhodnius prolixus experience
• In the meantime: Heliconius annotation using MAKER

MAKER

http://www.yandell-lab.org/software/maker.html
Cantarel et al. Genome Res. 2008. PMID 18025269

Raw genome + DATA DATA DATA -> Annotated genome

Intermediate gene sets

Raw genome -> masked genome
Masking: RepeatModeler repeats + known repeats/transposons

Raw data

- ESTs
  - from GenBank
  - cleaned and clustered/assembled with CAP3
  - 71,700 contigs
- Insecta/metazoa proteins
  - from UniProt
  - aligned to the genome with BLAST
  - 690,000 sequences (Insecta)
  - 2,200,000 sequences (Metazoa)
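
The masking step on this slide combines two repeat sources (RepeatModeler repeats and known repeats/transposons) and applies them to the raw genome. As a rough illustration of the idea only, the sketch below masks a toy sequence in Python. The function name `mask_sequence`, the 0-based half-open interval format, and the choice between soft masking (lowercase) and hard masking (`N`) are assumptions made for the example; the slides do not say how masking was actually run.

```python
def mask_sequence(seq, repeats, soft=True):
    """Return seq with every repeat interval masked.

    repeats: iterable of (start, end) pairs, 0-based and half-open (assumed format).
    soft=True lowercases masked bases; soft=False replaces them with 'N'.
    """
    chars = list(seq)
    for start, end in repeats:
        # Clamp each interval to the sequence bounds before masking.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)


# Hypothetical repeat calls from the two sources named on the slide.
repeatmodeler_repeats = [(2, 5)]
known_repeats = [(4, 7)]

raw = "ACGTACGTAC"
print(mask_sequence(raw, repeatmodeler_repeats + known_repeats))
print(mask_sequence(raw, repeatmodeler_repeats + known_repeats, soft=False))
```

Overlapping intervals are harmless here because masking a base twice gives the same result as masking it once.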

Intermediate gene sets

Gene models

- RNA-seq (Illumina, Yale)
  - cleaned
  - aligned to the genome using TopHat/Bowtie
  - ‘transfrags’ built with Cufflinks
  - 78,000 ‘transfrags’ (on 4 sets -> overlaps)
- Augustus
  - generated by Martin Swain
  - trained with SOLiD data
  - 16,963 models, high quality

Intermediate gene sets

Raw data

- ESTs, aligned to the genome
  - from GenBank, clustered with CAP3
  - 71,700 clusters
- Insecta/metazoa proteins (UniProt)
  - 690,000 sequences (Insecta)
  - 2,200,000 sequences (Metazoa)

Gene models

- RNA-seq (Illumina, Yale), using TopHat/Cufflinks
  - 78,000 ‘transfrags’ (on 4 sets -> overlaps)
- Augustus, trained with SOLiD data
  - 16,963 models, high QC

Ab initio

- SNAP, trained for Glossina (MAKER)
- Augustus, trained for Glossina (Martin Swain)
- GenScan

MAKER

[Diagram: raw genome -> masked genome (masking: RepeatModeler repeats + known repeats/transposons); evidence boxes: raw data (ESTs, proteins), ab initio, gene models; legend: “Provided as input” and “Run software within MAKER”.]

MAKER – iterative process

• Round-1:
  – Align ESTs and Insecta proteins to the genome
  – Train SNAP (1): Drosophila HMM; ESTs and protein alignments, RNA-seq Illumina Yale, Augustus (SOLiD)
• Round-2:
  – Re-train SNAP (2): same as above but HMM = output of SNAP-1
• Round-3:
  – Re-train SNAP (3): same as above but HMM = output of SNAP-2
  – Align Metazoa proteins to the genome
  – Combine final gene set
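
The three rounds chain together: each SNAP training starts from the HMM produced by the previous round, with the Drosophila HMM as the seed. The Python sketch below only models that bookkeeping and does not run MAKER or SNAP. The names `MakerRound` and `plan_rounds` are hypothetical, and the assumption that the Metazoa alignment and the final merge both belong to the last round is read directly off the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MakerRound:
    number: int
    snap_hmm: str              # HMM that SNAP starts from in this round
    new_alignments: List[str]  # evidence aligned for the first time in this round
    combine_final: bool        # whether the final gene set is combined here


def plan_rounds(seed_hmm: str = "Drosophila HMM", n_rounds: int = 3) -> List[MakerRound]:
    """Lay out the iterative schedule: round n uses the SNAP output of round n-1."""
    rounds = []
    hmm = seed_hmm
    for number in range(1, n_rounds + 1):
        is_first = number == 1
        is_last = number == n_rounds
        new_alignments = []
        if is_first:
            new_alignments += ["ESTs", "Insecta proteins"]
        if is_last:
            new_alignments += ["Metazoa proteins"]
        rounds.append(MakerRound(number, hmm, new_alignments, is_last))
        # The HMM trained in this round seeds the next one.
        hmm = f"output of SNAP-{number}"
    return rounds


for r in plan_rounds():
    print(r)
```

Keeping the schedule as data makes it easy to see which inputs change between rounds (only the SNAP HMM, until the last round adds the Metazoa proteins).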

Using MAKER for…

Heliconius

Tsetse fly

Salmon louse

Centipede

Annex…

Augustus (SOLiD)

Martin Swain’s stats, July 22nd, 2011

• Glossina trained:
  -> ESTs only: 14,739 predictions, 9.8% with similarity to Gl. proteins (1,455 seq., 95% seq. identity)
  -> ESTs + SOLiD: 14,739 predictions, 9.9% with similarity to Gl. proteins (1,465 seq., 95% seq. identity)
  -> Glossina GenBank proteins: 2,754 protein sequences, 53% matching Augustus models
• Glossina un-trained:
  -> 8,581 predictions, 15% with similarity to Gl. proteins (1,299 seq., exact matches)
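
The percentages on this slide can be checked from the counts quoted next to them. The short Python check below uses only the numbers on the slide; the helper name `percent` and the dictionary labels are illustrative. The first ratio comes out at roughly 9.87%, so the 9.8% on the slide looks truncated rather than rounded.

```python
def percent(part, whole):
    """Share of `part` in `whole`, as a percentage."""
    return 100.0 * part / whole


# (matching predictions, total predictions), copied from the slide.
runs = {
    "trained, ESTs only": (1455, 14739),
    "trained, ESTs + SOLiD": (1465, 14739),
    "un-trained": (1299, 8581),
}

for label, (hits, predictions) in runs.items():
    print(f"{label}: {percent(hits, predictions):.2f}%")
```

The un-trained run has the highest matching share but far fewer predictions (8,581 against 14,739), so the absolute number of matching models is still lower.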

ESTs

• Total: 79,292 ESTs

• [1] Adult midgut expressed sequence tags from the tsetse fly Glossina morsitans morsitans and expression analysis of putative immune response genes. Genome Biol. 2003. Lehane et al.

• [2] Differential expression of fat body genes in Glossina morsitans morsitans following infection with Trypanosoma brucei brucei. Int. J. Parasitol. 2008. Lehane et al.

• [3] Analysis of fat body transcriptome from the adult tsetse fly, Glossina morsitans morsitans. Insect Mol. Biol. 2006. Attardo et al.

• [4] Functional Characterisations of odorant binding proteins and chemosensory proteins in tsetse fly Glossina morsitans morsitans. Unpublished. 2009. …, Lehane, M., Hertz-Fowler, C., Berriman, M., …

• [5] Comprehensive analysis of the transcriptome of the Tsetse fly Glossina morsitans morsitans. Unpublished. 2009. Hertz-Fowler, C., Aslett, M.A. and Berriman, M. ESTs submitted under GenomeProject:9563

MAKER – final gene set

• Genes:
  – Final genes: 12,220
  – Raw data:
    • EST-based genes: 23,469
    • Protein-based genes: 416,959 (about 417,000; redundancy)
  – Gene sets:
    • Illumina-Yale: 70,915 (redundancy)
    • Augustus (SOLiD): 16,155
  – Ab initio:
    • SNAP: 48,464
    • Augustus (MAKER): 14,413
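
To put the redundancy noted on this slide in perspective, the sketch below divides each evidence count by the final gene count. It is plain arithmetic on the slide's numbers; the protein-based figure uses the rounded 417,000 quoted on the slide, and the function name `ratio_to_final` is illustrative.

```python
FINAL_GENES = 12_220

# Evidence counts as quoted on the slide (protein-based count is the rounded figure).
evidence = {
    "EST-based genes": 23_469,
    "Protein-based genes": 417_000,
    "Illumina-Yale": 70_915,
    "Augustus (SOLiD)": 16_155,
    "SNAP": 48_464,
    "Augustus (MAKER)": 14_413,
}


def ratio_to_final(count, final=FINAL_GENES):
    """How many input models there are per gene in the final set."""
    return count / final


for name, count in evidence.items():
    print(f"{name}: {count:,} models, {ratio_to_final(count):.1f}x the final set")
```

The sets flagged as redundant on the slide (protein-based and Illumina-Yale) are also the ones with the largest ratios.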