Primer on Annotation of Drosophila Genes GEP Workshop – January 2016 Wilson Leung and Chris Shaffer

Jan 08, 2018

Page 1

Primer on Annotation of Drosophila Genes

GEP Workshop – January 2016

Wilson Leung and Chris Shaffer

Page 2

Outline

Overview of the GEP annotation projects
GEP annotation workflow
Practice applying the GEP annotation strategy

Page 3

AAACAACAATCATAAATAGAGGAAGTTTTCGGAATATACGATAAGTGAAATATCGTTCTTAAAAAAGAGCAAGAACAGTTTAACCATTGAAAACAAGATTATTCCAATAGCCGTAAGAGTTCATTTAATGACAATGACGATGGCGGCAAAGTCGATGAAGGACTAGTCGGAACTGGAAATAGGAATGCGCCAAAAGCTAGTGCAGCTAAACATCAATTGAAACAAGTTTGTACATCGATGCGCGGAGGCGCTTTTCTCTCAGGATGGCTGGGGATGCCAGCACGTTAATCAGGATACCAATTGAGGAGGTGCCCCAGCTCACCTAGAGCCGGCCAATAAGGACCCATCGGGGGGGCCGCTTATGTGGAAGCCAAACATTAAACCATAGGCAACCGATTTGTGGGAATCGAATTTAAGAAACGGCGGTCAGCCACCCGCTCAACAAGTGCCAAAGCCATCTTGGGGGCATACGCCTTCATCAAATTTGGGCGGAACTTGGGGCGAGGACGATGATGGCGCCGATAGCACCAGCGTTTGGACGGGTCAGTCATTCCACATATGCACAACGTCTGGTGTTGCAGTCGGTGCCATAGCGCCTGGCCGTTGGCGCCGCTGCTGGTCCCTAATGGGGACAGGCTGTTGCTGTTGGTGTTGGAGTCGGAGTTGCCTTAAACTCGACTGGAAATAACAATGCGCCGGCAACAGGAGCCCTGCCTGCCGTGGCTCGTCCGAAATGTGGGGACATCATCCTCAGATTGCTCACAATCATCGGCCGGAATGNTAANGAATTAATCAAATTTTGGCGGACATAATGNGCAGATTCAGAACGTATTAACAAAATGGTCGGCCCCGTTGTTAGTGCAACAGGGTCAAATATCGCAAGCTCAAATATTGGCCCAAGCGGTGTTGGTTCCGTATCCGGTAATGTCGGGGCACAATGGGGAGCCACACAGGCCGCGTTGGGGCCCCAAGGTATTTCCAAGCAAATCACTGGATGGGAGGAACCACAATCAGATTCAGAATATTAACAAAATGGTCGGCCCCGTTGTTATGGATAAAAAATTTGTGTCTTCGTACGGAGATTATGTTGTTAATCAATTTTATTAAGATATTTAAATAAATATGTGTACCTTTCACGAGAAATTTGCTTACCTTTTCGACACACACACTTATACAGACAGGTAATAATTACCTTTTGAGCAATTCGATTTTCATAAAATATACCTAAATCGCATCGTC

Start codon Coding region Stop codon

Splice donor Splice acceptor UTR

Page 4

GEP Drosophila annotation projects

D. melanogaster, D. simulans, D. sechellia, D. yakuba, D. erecta, D. ficusphila, D. eugracilis, D. biarmipes, D. takahashii, D. elegans, D. rhopaloa, D. kikkawai

D. ananassae, D. bipectinata

D. pseudoobscura, D. persimilis, D. willistoni, D. mojavensis, D. virilis, D. grimshawi

Reference

Published

Annotation projects for Fall 2015 / Spring 2016

Species in the Four Genomes Paper

New species sequenced by modENCODE

Phylogenetic tree produced by Thom Kaufman as part of the modENCODE project

Manuscript in progress

Page 5

Muller element nomenclature

Schaeffer SW et al, 2008. Polytene Chromosomal Maps of 11 Drosophila Species: The Order of Genomic Scaffolds Inferred From Genetic and Physical Maps. Genetics. 2008 Jul;179(3):1601-55

X 2L 2R 3L 3R 4

X 4 5 3 2 6

Page 6

Gene structure nomenclature

Gene span

Primary

mRNA

Protein

Exons (singular: exon)

UTRs (singular: UTR)

CDSs (singular: CDS)

Page 7

GEP annotation goals

Identify and annotate all genes in your project

For each gene, identify and precisely map (accurate to the base pair) all Coding DNA Sequences (CDS)
  Do this for ALL isoforms
  Annotate the initial transcribed exon and transcription start site (TSS)

Optional curriculum not submitted to GEP
  Clustal analysis (protein, promoter regions)
  Repeats analysis
  Synteny analysis
  Non-coding genes analysis

Page 8

Evidence for gene models (in general order of importance)

1. Conservation
  Sequence similarity to genes in D. melanogaster
  Sequence similarity to other Drosophila species (Multiz)

2. Expression data
  RNA-Seq, EST, cDNA

3. Computational predictions
  Gene and splice site predictions

4. Tie-breakers of last resort
  See the “Annotation Instruction Sheet”

Page 9

Basic annotation workflow

1. Identify the likely D. melanogaster ortholog
2. Observe the gene structure of the ortholog
3. Map each CDS to the project sequence
4. Determine the exact coordinates of each CDS
5. Verify the model using the Gene Model Checker
6. Repeat steps 2-5 for each additional isoform

Page 10

Four main web sites used by the GEP annotation strategy

1. GEP UCSC Genome Browser (http://gander.wustl.edu)

2. FlyBase (http://flybase.org)
  Tools -> Genomic/Map Tools -> BLAST
  Jump to Gene -> Genomic Location -> GBrowse

3. Gene Record Finder (http://gep.wustl.edu)
  Projects -> Annotation Resources

4. NCBI BLAST (http://blast.ncbi.nlm.nih.gov)
  BLASTX -> select the “Align two or more sequences” checkbox

Page 11

Annotation workflow: Step 1

1. Identify the likely D. melanogaster ortholog
2. Observe the gene structure of the ortholog
3. Map each CDS to the project sequence
4. Determine the exact coordinates of each CDS
5. Verify the model using the Gene Model Checker
6. Repeat steps 2-5 for each additional isoform

Page 12

Two different versions of the UCSC Genome Browser

Official UCSC Version (http://genome.ucsc.edu)
  Published data, lots of species, whole genomes; used for “Chimp Chunks”

GEP Version (http://gander.wustl.edu)
  GEP data, parts of genomes, used for annotation of Drosophila species

Page 13

GEP UCSC Genome Browser overview

Genomic sequence

Evidence tracks

Page 14

Control how evidence tracks are displayed on the Genome Browser

Five different display modes:
  Hide: track is hidden
  Dense: all features appear on a single line
  Squish: overlapping features appear on separate lines
    Features are half the height compared to full mode
  Pack: overlapping features appear on separate lines
    Features are the same height as full mode
  Full: each feature is displayed on its own line

Set “Base Position” track to “Full” to see the amino acid translations

Some evidence tracks (e.g., RepeatMasker) only have a subset of these display modes

Page 15

DEMO: GEP UCSC Genome Browser

Examine contig10 in the D. biarmipes Aug. 2013 (GEP/Dot) assembly

Page 16

GEP annotation strategy

Use D. melanogaster as reference
  D. melanogaster is very well annotated
  Use sequence similarity to infer homology

Minimize changes compared to the D. melanogaster gene model (parsimony)
  Coding sequences evolve slowly
  Exon structure changes very slowly

Page 17

FlyBase – Database for the Drosophila research community

Lots of ancillary data for each gene in D. melanogaster

Curation of literature for each gene

Reference for D. melanogaster annotations for all other databases
  Including NCBI, EBI, and DDBJ

Fast release cycle (6-8 releases per year)

Page 18

Overview of NCBI BLAST

Detect local regions of significant sequence similarity between two sequences

Decide which BLAST program to use based on the type of query and subject sequences:

Program   Query                   Database (Subject)
BLASTN    Nucleotide              Nucleotide
BLASTP    Protein                 Protein
BLASTX    Nucleotide -> Protein   Protein
TBLASTN   Protein                 Nucleotide -> Protein
TBLASTX   Nucleotide -> Protein   Nucleotide -> Protein
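
The program table can be read as a simple lookup from sequence types to program name. The sketch below is only an illustration of that lookup; the function name and the type labels ("translated nucleotide" for a nucleotide sequence that BLAST translates into protein before comparing) are assumptions made for this example, not part of any BLAST interface.

```python
# Lookup table transcribed from the slide: (query type, subject type) -> program.
# "translated nucleotide" means a nucleotide sequence that is translated
# into protein before the comparison is made.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "BLASTN",
    ("protein", "protein"): "BLASTP",
    ("translated nucleotide", "protein"): "BLASTX",
    ("protein", "translated nucleotide"): "TBLASTN",
    ("translated nucleotide", "translated nucleotide"): "TBLASTX",
}


def choose_blast_program(query_type, subject_type):
    """Return the BLAST program for a query/subject type pair (hypothetical helper)."""
    key = (query_type.lower(), subject_type.lower())
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"No BLAST program for query={query_type!r}, subject={subject_type!r}")
    return BLAST_PROGRAMS[key]


# A genomic contig (nucleotide, translated) searched against a protein sequence:
print(choose_blast_program("translated nucleotide", "protein"))
```

For example, comparing a nucleotide contig against a D. melanogaster protein sequence selects BLASTX, which matches the CDS mapping demo later in the deck.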

Page 19

Where can I run BLAST?

NCBI BLAST web service

http://blast.ncbi.nlm.nih.gov/Blast.cgi

EBI BLAST web service

http://www.ebi.ac.uk/Tools/sss/

FlyBase BLAST (Drosophila and other insects)

http://flybase.org/blast/

Page 20

DEMO: Ortholog assignment for the N-SCAN prediction contig10.001.1

Feature in contig10 of the D. biarmipes Aug. 2013 (GEP/Dot) assembly

Page 21

Annotation workflow: Step 2

1. Identify the likely D. melanogaster ortholog
2. Observe the gene structure of the ortholog
3. Map each CDS to the project sequence
4. Determine the exact coordinates of each CDS
5. Verify the model using the Gene Model Checker
6. Repeat steps 2-5 for each additional isoform

Page 22

Gene Record Finder – Observe the structure of D. melanogaster genes

Retrieves CDS and exon sequences for each gene in D. melanogaster

CDS and exon usage maps for each isoform

List of unique CDS

Designed for the exon-by-exon annotation strategy

Page 23

Nomenclature for Drosophila genes

Drosophila gene names are case-sensitive
  Lowercase initial letter = recessive mutant phenotype
  Uppercase initial letter = dominant mutant phenotype

Every D. melanogaster gene has an annotation symbol
  Begins with the prefix CG (Computed Gene)

Some genes have a different gene symbol (e.g., ey)

Suffix after the gene symbol denotes different isoforms
  mRNA = -R; protein = -P
  ey-RA = Transcript for the A isoform of ey
  ey-PA = Protein product for the A isoform of ey
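
The suffix convention can be made concrete with a small parser. This is a minimal sketch that assumes isoform names always have the form gene symbol, hyphen, R or P, then one or more uppercase isoform letters; the function name and the regular expression are illustrative and are not taken from FlyBase.

```python
import re

# Assumed pattern: <gene symbol>-<R or P><isoform letters>, e.g. ey-RA or ey-PA.
_ISOFORM_NAME = re.compile(r"^(?P<gene>.+)-(?P<kind>[RP])(?P<isoform>[A-Z]+)$")


def parse_isoform_name(name):
    """Split an isoform name into (gene symbol, product type, isoform letter)."""
    match = _ISOFORM_NAME.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not look like <gene>-R<isoform> or <gene>-P<isoform>")
    kind = "transcript" if match.group("kind") == "R" else "protein"
    return match.group("gene"), kind, match.group("isoform")


print(parse_isoform_name("ey-RA"))       # transcript for the A isoform of ey
print(parse_isoform_name("CG31997-PB"))  # protein product for the B isoform of CG31997
```

The match is case-sensitive on purpose, because the gene symbol itself carries information in its capitalization.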

Page 24

Be aware of different annotation releases

D. melanogaster Release 6 genome assembly
  First change of the assembly since late 2006
  Most modENCODE analysis used the Release 5 assembly

Gene annotations change much more frequently
  Use FlyBase as the canonical reference

GEP data freeze:
  GEP materials are updated before the start of semester
  Potential discrepancies in exercise screenshots corrected
  Minor differences in search results corrected
  Let us know about major errors or discrepancies

Page 25

DEMO: Determine the gene structure of the D. melanogaster gene CG31997

Page 26

Annotation workflow: Step 3

1. Identify the likely D. melanogaster ortholog
2. Observe the gene structure of the ortholog
3. Map each CDS to the project sequence
4. Determine the exact coordinates of each CDS
5. Verify the model using the Gene Model Checker
6. Repeat steps 2-5 for each additional isoform

Page 27

BLAST parameters for CDS mapping

Select the “Align two or more sequences” checkbox

Settings in the “Algorithm parameters” section

Verify the Word size is set to 3

Turn off compositional adjustments

Turn off the low complexity filter
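
For anyone running BLAST locally rather than through the web form, the settings above correspond roughly to options of the NCBI BLAST+ `blastx` command. The sketch below only builds the argument list and does not run anything. It assumes the contig is the nucleotide query and the D. melanogaster CDS is the protein subject, and the file names are hypothetical placeholders.

```python
def build_blastx_command(contig_fasta, cds_protein_fasta, evalue=10.0):
    """Build a BLAST+ blastx argument list mirroring the web settings on the slide."""
    return [
        "blastx",
        "-query", contig_fasta,         # nucleotide sequence (translated by blastx)
        "-subject", cds_protein_fasta,  # pairwise mode, like "Align two or more sequences"
        "-word_size", "3",              # word size of 3
        "-comp_based_stats", "0",       # no compositional adjustments
        "-seg", "no",                   # low complexity filter off
        "-evalue", str(evalue),         # Expect threshold
    ]


# Hypothetical file names, for illustration only.
command = build_blastx_command("contig10.fasta", "CG31997_cds.fasta")
print(" ".join(command))
```

The list can be passed to `subprocess.run` on a machine where BLAST+ is installed; the web interface remains the workflow described in this primer.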

Page 28

Strategies for finding small CDS

Examine RNA-Seq coverage and TopHat junctions
  Small CDS is typically part of a larger transcribed exon

Use Query subrange to restrict the search region

Increase the Expect threshold and try again
  Keep increasing the Expect threshold until you get matches
  Also try decreasing the word size

Use the Small Exon Finder
  Minimize changes in CDS size
  Available under Projects -> Annotation Resources

See the “Annotation Strategy Guide” for details
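
The "increase the Expect threshold and try again" strategy is a simple retry loop. The sketch below assumes a caller-supplied `run_search(evalue, word_size)` function that returns a list of hits; that callback, the function name, and the particular threshold values are all hypothetical and stand in for whatever search is being repeated by hand.

```python
def search_until_match(run_search, evalues=(10, 100, 1000, 10000), word_sizes=(3, 2)):
    """Relax the Expect threshold step by step, then the word size, until hits appear.

    run_search(evalue, word_size) is a hypothetical callback returning a list of hits.
    Returns (evalue, word_size, hits) for the first setting that finds a match,
    or None if no setting produces a hit.
    """
    for word_size in word_sizes:
        for evalue in evalues:
            hits = run_search(evalue, word_size)
            if hits:
                return evalue, word_size, hits
    return None


# Stand-in search that only reports a hit once the threshold reaches 1000.
def fake_search(evalue, word_size):
    return ["hit"] if evalue >= 1000 else []


print(search_until_match(fake_search))
```

Matches found at very permissive thresholds are weak evidence on their own, so they still need to be checked against the RNA-Seq data and the expected CDS size.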

Page 29

DEMO: Map CDS 3_10861_1 of CG31997 against contig10 with BLASTX

Page 30

EXERCISE: Map CDS 1_10861_0 and 2_10861_2 of CG31997 against contig10

Page 31

Annotation workflow: Step 4

1. Identify the likely D. melanogaster ortholog
2. Observe the gene structure of the ortholog
3. Map each CDS to the project sequence
4. Determine the exact coordinates of each CDS
5. Verify the model using the Gene Model Checker
6. Repeat steps 2-5 for each additional isoform

Page 32

Basic biological constraints (inviolate rules*)

Coding regions start with a methionine
Coding regions end with a stop codon
Gene should be on only one strand of DNA
Exons appear in order along the DNA (collinear)
Intron sequences should be at least 40 bp
Intron starts with a GT (or rarely GC)
Intron ends with an AG

* There are known exceptions to each rule
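
Most of these rules can be checked mechanically. The sketch below is an illustration, not the Gene Model Checker: it assumes a plus-strand gene, 1-based inclusive CDS coordinates, and that the last CDS includes the stop codon. The function name and messages are made up for this example, and because each rule has known exceptions the function returns warnings rather than failing.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_INTRON_LENGTH = 40


def check_gene_model(sequence, cds_coords):
    """Return a list of warnings for a plus-strand model given 1-based inclusive CDS coordinates."""
    seq = sequence.upper()
    cds = list(cds_coords)
    warnings = []
    if cds != sorted(cds):
        warnings.append("CDS are not in order along the sequence")
    # Concatenate the CDS to get the coding region (assumed to include the stop codon).
    coding = "".join(seq[start - 1:end] for start, end in cds)
    if not coding.startswith("ATG"):
        warnings.append("coding region does not start with ATG (methionine)")
    if coding[-3:] not in STOP_CODONS:
        warnings.append("coding region does not end with a stop codon")
    # Each intron lies between the end of one CDS and the start of the next.
    for (_, prev_end), (next_start, _) in zip(cds, cds[1:]):
        intron = seq[prev_end:next_start - 1]
        label = f"intron {prev_end + 1}-{next_start - 1}"
        if len(intron) < MIN_INTRON_LENGTH:
            warnings.append(f"{label} is shorter than {MIN_INTRON_LENGTH} bp")
        if intron[:2] not in ("GT", "GC"):
            warnings.append(f"{label} does not start with GT or GC")
        if intron[-2:] != "AG":
            warnings.append(f"{label} does not end with AG")
    return warnings


# Synthetic example: two CDS separated by a 44 bp intron (GT ... AG).
example = "ATGCTGAGAG" + "GT" + "A" * 40 + "AG" + "ATTTTCCGTAA"
print(check_gene_model(example, [(1, 10), (55, 65)]))
```

A minus-strand gene would need the sequence reverse complemented first, which this sketch leaves out.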

Page 33

modENCODE RNA-Seq data

RNA-Seq evidence tracks:
  RNA-Seq coverage (read depth)
  TopHat splice junction predictions
  Assembled transcripts (Cufflinks, Oases)

Positive results very helpful

Negative results less informative
  Lack of transcription ≠ no gene

GEP curriculum:
  RNA-Seq Primer
  Browser-Based Annotation and RNA-Seq Data

Page 34

Overview of RNA-Seq (Illumina)

[Figure: processed mRNA (5’ cap, poly-A tail) -> RNA fragments (~250 bp) -> library with adapters -> paired-end sequencing -> forward and reverse RNA-Seq reads (~125 bp each)]

Wang Z et al. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 10(1):57-63.

Page 35

DEMO: Use RNA-Seq coverage to support the placement of the start codon

Page 36

EXERCISE: Confirm the placement of the stop codon for CDS 3_10861_1

Page 37

Can use the TopHat splice junction predictions to identify splice sites

[Figure: RNA-Seq reads from a processed mRNA (labels: 5’ cap, M, *, poly-A tail) aligned to the contig; TopHat junctions span the introns]
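
Outside the browser, TopHat reports its junction predictions in a BED file in which each junction is drawn as two blocks flanking the intron. The sketch below converts one such line into 1-based intron coordinates. It assumes the standard 12-column BED layout with exactly two blocks, and the example line is invented for illustration rather than taken from a GEP project.

```python
def intron_from_tophat_junction(bed_line):
    """Return (chrom, intron_start, intron_end, strand, score) with 1-based inclusive coordinates."""
    fields = bed_line.rstrip("\n").split("\t")
    chrom = fields[0]
    chrom_start = int(fields[1])   # BED starts are 0-based
    score = int(fields[4])         # number of alignments spanning the junction
    strand = fields[5]
    block_sizes = [int(x) for x in fields[10].split(",") if x]
    block_starts = [int(x) for x in fields[11].split(",") if x]
    if len(block_sizes) != 2 or len(block_starts) != 2:
        raise ValueError("expected a junction with exactly two blocks")
    # The intron is the gap between the end of block 1 and the start of block 2.
    intron_start = chrom_start + block_sizes[0] + 1
    intron_end = chrom_start + block_starts[1]
    return chrom, intron_start, intron_end, strand, score


# Invented example line (tab-separated BED12).
line = "contig10\t100\t300\tJUNC00000001\t25\t+\t100\t300\t255,0,0\t2\t50,60\t0,140"
print(intron_from_tophat_junction(line))
```

On the plus strand, the first two bases of the reported intron should be the splice donor (GT or GC) and the last two the splice acceptor (AG), which gives a quick consistency check against the genomic sequence.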

Page 38

A genomic sequence has 6 different reading frames

Frame: Base to begin translation relative to the start of the sequence
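
The six reading frames can be enumerated directly: three offsets on the forward strand and three on the reverse complement. The sketch below uses the standard genetic code and names the frames +1 to +3 and -1 to -3. Numbering the minus-strand frames from the start of the reverse complement is an assumption of this example, since tools differ on that convention.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; codons containing unknown bases become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return a dict mapping frame name (+1, +2, +3, -1, -2, -3) to its translation."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frame_translations("CTGAGAGATTTTCCG").items():
    print(name, protein)
```

Stop codons appear as `*` in the output, so a frame with a long run free of `*` is a candidate coding frame.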


Page 39

Splice donor and acceptor phases

Phase: Number of bases between the complete codon and the splice site

Donor phase: Number of bases between the end of the last complete codon and the splice donor site (GT/GC)

Acceptor phase: Number of bases between the splice acceptor site (AG) and the start of the first complete codon

Phase is dependent on the reading frame of the CDS
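
Given these definitions, the donor phase of a CDS follows from its length and its acceptor phase, and the acceptor phase needed by the next CDS follows from the donor phase. The sketch below is a small arithmetic illustration; the function names are made up for this example.

```python
def donor_phase(cds_length, acceptor_phase=0):
    """Bases left over after the last complete codon at the 3' end of a CDS.

    acceptor_phase is the number of bases at the 5' end of the CDS that belong
    to a codon started in the previous CDS (0 for the first CDS).
    """
    return (cds_length - acceptor_phase) % 3


def required_acceptor_phase(donor):
    """Acceptor phase the next CDS needs so the split codon is completed."""
    return (3 - donor) % 3


# A 7-base first CDS (for example CTG AGA G) leaves one extra base at the donor.
phase = donor_phase(7)
print(phase, required_acceptor_phase(phase))
```

Because the phase is counted from the reading frame, shifting the frame by one base changes which bases count as "extra", which is why phase is always stated relative to a frame.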

Page 40

Phase depends on the reading frame

Phase of donor site:
  Phase 2 relative to frame +1
  Phase 1 relative to frame +2
  Phase 0 relative to frame +3

Splice donor

Page 41

Phase of the donor and acceptor sites must be compatible

Extra nucleotides from donor and acceptor phases form an additional codon

Donor phase + acceptor phase = 0 or 3

Example (phase 1 donor, phase 2 acceptor):

… CTG AGA G [GT … AG] AT TTT CCG …

Spliced: CTG AGA GAT TTT CCG

Translation: L R D F P
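
The compatibility rule and the extra codon formed across the junction can be written out in a few lines. This sketch uses the sequences from the slide example (upstream CDS ending in CTG AGA G, downstream CDS starting with AT TTT CCG); the function names are hypothetical.

```python
def phases_compatible(donor, acceptor):
    """Donor and acceptor phases are compatible when they sum to 0 or 3."""
    return (donor + acceptor) in (0, 3)


def junction_codon(upstream_cds, downstream_cds, donor, acceptor):
    """Return the codon assembled from the extra bases on each side of the intron.

    Returns an empty string when both phases are 0 (no codon is split).
    """
    if not phases_compatible(donor, acceptor):
        raise ValueError(f"donor phase {donor} is incompatible with acceptor phase {acceptor}")
    left = upstream_cds[len(upstream_cds) - donor:]  # last `donor` bases of the upstream CDS
    right = downstream_cds[:acceptor]                # first `acceptor` bases of the downstream CDS
    return left + right


print(junction_codon("CTGAGAG", "ATTTTCCG", donor=1, acceptor=2))
print(phases_compatible(0, 2))
```

In the example the single extra G from the donor side and the AT from the acceptor side form GAT, the codon translated as D in the spliced sequence.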

Page 42

Incompatible donor and acceptor phases result in a frame shift

Phase 0 donor is incompatible with phase 2 acceptor

… CTG AGA GGT [GT … AG] AT TTT CCG G …

Spliced: CTG AGA GGT ATT TTC CGG

Translation: L R G I F

Page 43

DEMO: Use RNA-Seq to annotate the intron between CDS 1_10861_0 and 2_10861_2 of the CG31997 ortholog

Page 44

EXERCISE: Determine the coordinates for CDS 2_10861_2 and 3_10861_1 of the CG31997 ortholog

Page 45

Annotation workflow: Step 5

1. Identify the likely D. melanogaster ortholog
2. Observe the gene structure of the ortholog
3. Map each CDS to the project sequence
4. Determine the exact coordinates of each CDS
5. Verify the model using the Gene Model Checker
6. Repeat steps 2-5 for each additional isoform

Page 46

Verify the final gene model using the Gene Model Checker

Gene model should satisfy biological constraints
  Explain errors or warnings in the GEP Annotation Report

Compare model against the D. melanogaster ortholog
  Dot plot and protein alignment
  See “How to do a quick check of student annotations”

View your gene model as a custom track in the genome browser

Generate files required for project submission

Page 47

DEMO: Verify the proposed gene model for the ortholog of CG31997

Page 48

Annotation workflow: Step 6

1. Identify the likely D. melanogaster ortholog
2. Observe the gene structure of the ortholog
3. Map each CDS to the project sequence
4. Determine the exact coordinates of each CDS
5. Verify the model using the Gene Model Checker
6. Repeat steps 2-5 for each additional isoform

Page 49

Next step: practice annotation

Annotation of a Drosophila Gene

onecut on contig35
ey on contig40
CG1909 on contig35
Arl4 and CG33978 on contig10

Difficulty

Page 50

Questions?

https://flic.kr/p/67maGa