Cédric Notredame (15/03/22) Finding Genes In a Genome Cédric Notredame

Cédric Notredame (22/04/2015) Finding Genes In a Genome Cédric Notredame.

Dec 14, 2015

Transcript
Page 1: Cédric Notredame (22/04/2015) Finding Genes In a Genome Cédric Notredame.

Cédric Notredame (18/04/23)

Finding Genes

In a Genome

Cédric Notredame

Page 2:

Naked Genome

Page 3:

All Dressed Up!

Page 6:

Naked Genomes are Useless

Useful Genome

Accurate Annotation

-Experimental Methods: ESTs, THS, DNA Chips…

-Computational Methods: Homology, Ab-Initio

Page 7:

ANNOTATION

-Where are the genes ?

-What do they do: Biochemistry ?

-When do they do it: Regulation ?

-Who do they do it for: Metabolic ?

Page 8:

Outline

Naked Genome => Fully Dressed Sequence

1. Cleaning the genome

2. Similarity methods

3. Experimental Methods

4. Ab-initio Methods

Eukaryotes

Prokaryotes

5. How Good Are The Methods?

Page 9:

Outline

Eukaryotes

Prokaryotes

Page 11:

Gene Fishing in Prokaryotic Genomes

Page 12:

What is a Prokaryotic Gene?

[Figure: gene structure. Gene: Promoter, RBS, ATG, ORF, STOP, Terminator. Products: mRNA, Protein]

Page 13:

What is a Prokaryotic Gene: Operon

Page 14:

1-Ab-initio:
-ORFing
-Codon Bias

2-Homology Based Methods

3-Regulatory Sequence Detection
-Non Coding
-Short Genes

[Figure labels: Promoter, RBS, STOP, Terminator, mRNA]

Page 15:

Prokaryotic Genomes

-High Gene Density: Haemophilus influenzae: 85%

-No Introns

-Operons

In a prokaryotic genome, any ORF longer than 300 nt can SAFELY be considered to be a gene

Page 16:

Prokaryotic Genomes

Clean-up

ORFing

Homology Search

Gene Prediction

Promoter Detection

Page 17:

Cleaning Your DNA Sequence

Page 18:

Cleaning a DNA Sequence

Is My Sequence Contaminated?

-Cloning may lead to the inclusion of vector sequences.

-These sequences must be removed.

Page 19:

Paste in your new sequence

Page 20:

Crop

Our sequence displays two vector contaminations

Page 21:

Contamination Matters

Contaminations Look Like Horizontal Transfers

BUT a genuine genome may contain similarity to the cloning vector (antibiotic resistance)

-Wrong Phylogeny

-Error Propagation in Secondary Databases

-Eukaryote Genomes can also be cleaned this way

Page 22:

ORFing Prokaryotic Genomes

Page 23:

Prokaryotic Genomes: ORFing

Where are the ORFs In my Sequence ?

Page 24:

Prokaryotic Genomes: ORFing

ATG (Start) Codons

STOP Codons
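
ORFing, as described on these slides, amounts to scanning each reading frame for an ATG followed in frame by a STOP codon. The following is a minimal sketch of that idea, not the algorithm used by GORF. The function names, the 0-based coordinates reported on the scanned strand, and the choice to ignore alternative start codons (GTG, TTG) are assumptions made for illustration. The default threshold of 300 nt comes from the earlier slide.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_len=300):
    """List ORFs (first ATG to next in-frame STOP) of at least min_len nt.

    Each hit is (strand, start, end, length); start and end are 0-based,
    end-exclusive, and measured on the strand that was scanned.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG of the current ORF
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start
                    if length >= min_len:
                        orfs.append((strand, start, i + 3, length))
                    start = None
    return orfs
```

Calling `find_orfs(genome)` with the default threshold mirrors the 300 nt rule of thumb; lowering `min_len` returns more candidates, many of which will be random ORFs.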

Page 25:

Prokaryotic Genomes: ORFing

Page 26:

Prokaryotic Genomes: GORF

www.ncbi.nih.gov/gorf/gorf.html

Page 27:

Prokaryotic Genomes: GORF

Page 28:

Prokaryotic Genomes: GORF

TO COG

TO BLAST

Page 29:

Prokaryotic Genomes: GORF

Page 30:

GORF: Can You Trust it ???

Random ORF: Random 3rd Position

Real ORF: Biased 3rd Position
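
One simple way to look at third-position bias is to compare base composition at the three codon positions of a candidate ORF. The sketch below computes the GC fraction per codon position; using GC content as the measure and the function name are assumptions for illustration, since the slides do not say which statistic the plot is based on.

```python
def gc_by_codon_position(orf):
    """Return [GC1, GC2, GC3]: the GC fraction at each codon position.

    Trailing bases that do not form a complete codon are ignored.
    """
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    fractions = []
    for pos in range(3):
        bases = orf[pos:usable:3]  # every third base, starting at this position
        gc = sum(base in "GC" for base in bases)
        fractions.append(gc / len(bases) if bases else 0.0)
    return fractions
```

In a random ORF the three values are expected to be similar; a third-position value that departs clearly from the other two is the kind of bias the slide refers to.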

Page 31:

GORF: Can You Trust it ???

Page 32:

Prokaryotic Genomes: GORFing cDNAs

Works with Bacterial Genomes

Good enough for ~85% of the proteome

Works with Eukaryotic cDNA

BUT…

-Will NOT detect SHORT genes

-Will NOT detect Non Coding Genes

Page 33:

Ab-Initio Gene Predictions in Prokaryotic Genomes

Page 34:

Predicting Genes

What are the sequences in my genome that LOOK LIKE genes?

Page 35:

Using The Codon Biases

Page 36:

Using The Codon Biases

Coding Regions Do NOT Look Like Random DNA:

-Codon Bias

Page 37:

Real Genes Use Mostly the Optimal Codons

Page 38:

Predicting Genes

ALL the characteristics of a Gene can be Built into a model

Hidden Markov Model

Page 39:

Hidden Markov Model

-Each Nucleotide has a STATE: Coding/Non Coding …

-This STATE is HIDDEN

-The HMM tries to UNCOVER the STATE of each Nucleotide.

Page 40:

Hidden Markov Model

Occasionally Dishonest Casino …

-This STATE is HIDDEN in the data

Observation: 122234455666125654151661661515566616166661

State : FFFFFFFFLLLLFFFFFFLLLLLLLLLLLLLLLLLLLLLLLLLLL
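
Recovering the hidden Fair/Loaded state from the rolls is usually done with the Viterbi algorithm. The sketch below is a minimal version for the casino example. The slide gives no parameters, so the start, transition and emission probabilities are assumptions (the values commonly used for this textbook example: a loaded die that shows 6 half of the time).

```python
import math

STATES = "FL"  # F = fair die, L = loaded die
# Assumed parameters, not taken from the slides.
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {roll: 1 / 6 for roll in "123456"},
    "L": {"1": 0.1, "2": 0.1, "3": 0.1, "4": 0.1, "5": 0.1, "6": 0.5},
}


def viterbi(rolls):
    """Return the most probable state path (string of F/L) for the rolls."""
    if not rolls:
        return ""
    # Log-probability of the best path ending in each state.
    score = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    backpointers = []
    for roll in rolls[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[best_prev] + math.log(TRANS[best_prev][s])
                            + math.log(EMIT[s][roll]))
            pointers[s] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

A gene-finding HMM works the same way, with states such as coding and non-coding in place of Fair and Loaded, and nucleotides in place of die rolls.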

Page 41:

GeneMark

Page 42:

Simplified HMM for Coding Regions

S (start)

State G (glycine), codon emission probabilities:
GGG 0.02
GGA 0.00
GGT 0.6
GGC 0.38

State W (tryptophan), codon emission probabilities:
TGG 1.00

… (64 Codons)

E (end)

Page 43:

Emission Proba

Transition Proba

Simplified HMM for Coding Regions

Page 44:

Proba of seq (GGG-TGG given Model):

$$P(\text{GGG-TGG} \mid \text{Model}) = P(\text{GGG}) \times P(\text{GGG} \rightarrow \text{TGG}) \times P(\text{TGG})$$

HMM order 5: 6th Nucleotide depends on the 5 previous

Takes into account Codon Bias AND dipeptide composition

Simplified HMM for Coding Regions
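
The product of emission and transition probabilities can be worked through numerically. In the sketch below the emission values are the ones shown on the "Simplified HMM for Coding Regions" slide; the transition probabilities between amino-acid states are hypothetical placeholders, because the slides do not give them.

```python
# Codon emission probabilities as shown on the slide (glycine and tryptophan states).
EMISSION = {"GGG": 0.02, "GGA": 0.00, "GGT": 0.60, "GGC": 0.38, "TGG": 1.00}
STATE_OF = {"GGG": "G", "GGA": "G", "GGT": "G", "GGC": "G", "TGG": "W"}
# Hypothetical transition probabilities between states (NOT from the slides).
TRANSITION = {("G", "G"): 0.08, ("G", "W"): 0.05, ("W", "G"): 0.10, ("W", "W"): 0.01}


def sequence_probability(codons):
    """P(codons | model) = emission of the first codon, then transition * emission."""
    prob = EMISSION[codons[0]]
    for prev, cur in zip(codons, codons[1:]):
        prob *= TRANSITION[(STATE_OF[prev], STATE_OF[cur])] * EMISSION[cur]
    return prob
```

With these numbers, `sequence_probability(["GGG", "TGG"])` is 0.02 x 0.05 x 1.00, that is about 0.001, while any sequence containing GGA scores 0 because that codon is never emitted in this toy model. Real implementations work with log-probabilities to avoid underflow on long sequences.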

Page 45:

Translate Predicted Genes into Proteins Text Output

http://opal.biology.gatech.edu/GeneMark/

Page 47:

Non Standard FASTA

Page 48:

GLIMMER: An alternative to GeneMark

Page 49:

Main Problems

Page 50:

GeneMark and HMM predictions

Works Very Well

Good enough for ~99% proteome

BUT…

-Will NOT detect Some SHORT genes

-Will NOT detect Non Coding Genes

Page 51:

Which Program ???

The established programs ALL work well

No point in fighting if your users have their mind set on a brand…

Page 52:

If Your Gene is NON-Coding…

The only existing models for NON-coding genes are those for tRNA

Page 53:

Homology Based Gene Prediction in Prokaryotic Genomes

Page 54:

BLASTx

What are the portions of my genome that look like a known gene/protein?

Page 55:

blastx: nucleotide (translated into protein) VS protein

Non Coding, but works only for highly similar sequences (>70%)
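
The translation step behind blastx can be illustrated with a six-frame translation of the nucleotide query. This is a minimal sketch assuming the standard genetic code; the function names are illustrative, and the BLAST programs themselves do considerably more (seeding, extension, scoring).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(
        GENETIC_CODE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return {'+1': ..., '+2': ..., '+3': ..., '-1': ..., '-2': ..., '-3': ...}."""
    seq = seq.upper()
    frames = {}
    for sign, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames
```

Each of the six protein strings would then be compared against the protein database, which is why a coding region can be found without knowing its strand or frame in advance.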

Page 56:

BlastX and HMM predictions

BUT…

Needs Homology

Depends on the databases

Very Reliable on Prokaryotes

Can Help in Eukaryotes

Page 57:

Finding Promoters in Prokaryotic Genomes

Page 58:

Promoter Hunting

Are There known promoters in my Sequence ?

Ideal for

-Finding Small Proteins

-Finding Non Coding Genes

Page 63:

prodoric.tu-bs.de/

Page 64:

prodoric.tu-bs.de/

Page 66:

rsat.ulb.ac.be/rsat/RSA_home.cgi

Page 68:

Fishing Genes in Eukaryotic Genomes

Page 70:

1-Transcript Based Methods

2-Homology Based Methods

3-Ab-initio:
-HMMs

4-Regulatory Sequence Detection

[Figure labels: Promoter; exon, exon, exon, exon, exon, exon; mRNA (form2); mRNA (form2)]

Page 71:

Eukaryote Genomes

Clean-up

Transcripts

Prediction

Homology

Promoter Detection

Page 72:

Know your Opponent …

Page 73:

Exons are longer in Vertebrates

Page 74:

Introns are longer in Vertebrates

-100 bp in Fungi

-1000 bp in Vertebrates

Page 75:

Genes contain more Introns in Mammals

Page 76:

Cleaning Eukaryotic Genomes

Page 77:

Avoiding Repeats

Repeats: Transposable elements, simple repeats

RepeatMasker: Smith and Waterman Clean-up.

Plus: -Remove lots of noise.

Minus: -Changes Sequence Statistics.

Page 78:

Homology Based Gene Prediction in Eukaryotic Genomes

Page 79:

Homology Based Predictions

What are the portions of my genome that look like a known protein?

Page 80:

Three Tools

GeneWise: Most Common

Procrustes: Most Sophisticated

BlastX/TBlastX: Simplest

Page 81:

BLASTX

blastx: Genome (translated into protein) VS protein

Page 82:

TBLASTX: Exon Fishing

tblastx: Genome (translated into protein) VS ESTs (translated into protein)

Page 83:

genomic sequence

Protein

Procrustes

Page 84:

40% id

Page 85:

www.ebi.ac.uk/Wise2/advanced.html

GeneWise

genomic sequence

Protein

Page 86:

Transcript Based Gene Prediction in Eukaryotic Genomes

Page 87:

Gene indices

Using Established ESTs Collections

Page 88:

[Figure: putative mRNA (5' UTR, exon 1, exon 2, 3' UTR, AAAAAA...) with aligned expressed sequence tags (ESTs)]

ESTs give us an Insight into this Complexity

1-Cluster the ESTs to reconstitute a gene

Page 89:

EMBL database

Quality clipping

BLAST search, clustering

EST Collection

Quality clipping

Assembly, Consensus sequences

Visualization

Gene indices Typical WorkFlow

Page 90:

Gene indices Alignment

consensus

Page 91:

Gene indices Alignment Software

Phrap (Phil Green)

CAP3 (X. Huang)

TIGR assembler

GAP4 (R. Staden)

Page 92:

Gene indices: Consensus sequences

Reduced error rate

Long Consensus

Efficient database search

exon/intron boundaries

Alternative Splicing

Page 93:

Gene indices

Goal: One cluster, One Gene

UniGene (NCBI)

TIGR Gene Indices

STACK (SANBI)

GeneNest (DKFZ,MPI)

Clustering of EST and mRNA sequences of an organism to reduce redundancy in sequence data.
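
The clustering idea can be sketched with a toy single-linkage scheme: two ESTs join the same cluster when they share at least one k-mer. This is only an illustration of the "one cluster, one gene" goal. It is not how UniGene, the TIGR Gene Indices, STACK or GeneNest are built, it ignores reverse-strand ESTs and sequencing errors, and the function names and the default `k` are assumptions.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=12, min_shared=1):
    """Group EST indices into clusters linked by shared k-mers (union-find)."""
    parent = list(range(len(ests)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    kmer_sets = [kmers(est.upper(), k) for est in ests]
    for i in range(len(ests)):
        for j in range(i + 1, len(ests)):
            if len(kmer_sets[i] & kmer_sets[j]) >= min_shared:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(ests)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

The all-against-all comparison is quadratic in the number of ESTs, so a sketch like this only suits small examples; production pipelines rely on indexed searches such as BLAST before assembly.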

Page 94:

GeneNest

genenest.molgen.mpg.de

Page 95:

TIGR Gene Indices

Alignment scheme

www.tigr.org

Page 96:

UniGene

www.ncbi.nih.nlm.gov/UniGene

Page 97:

UniGene

www.ncbi.nih.nlm.gov/UniGene

Page 98:

Gene indices

Page 99:

Gene indices: Applications

Detection of exon/intron boundaries

Detection of alternative splicing

Detection of Single Nucleotide Polymorphisms

Genome annotation

Analysis of gene expression

Design of DNA-chips/arrays

Page 100:

Mapping of EST consensus sequences on genomic DNA

genomic sequence

exons

consensus sequence( mRNA)

Page 101:

Comparing Your Genome with Transcripts

How to do It? How Long?

BLAST: 36 hours
-Popular and well described
-HSPs tend to mangle Introns

EST_GENOME: 80 hours
-Dynamic Program. post process
-Slow and sometimes hard to use

BLAT: 0.5 hours
-Next Generation
-Look for nearly identical seq.

SIM4 (pbil.univ-lyon1.fr/sim4.php)
-Similar to BLAT (slower)
-Allows Large Gaps

Page 103:

Mapping cDNA on genomic DNA

splicenest.molgen.mpg.de

Page 104:

Gene indices: Applications

Detection of exon/intron boundaries

Detection of alternative splicing

Detection of Single Nucleotide Polymorphisms

Genome annotation

Analysis of gene expression

Design of DNA-chips/arrays

Page 105:

Alternative Splicing

genomic sequence

exons

consensus sequence( mRNA)

splice variant

Page 107:

Alternative Splicing

Splice variants of APECED gene

[Figure labels: number of sequences, genomic sequence, alternative variants]

splicenest.molgen.mpg.de

Page 108:

Alternative Splicing (additional exon)

Splice variants of adenylosuccinate lyase

1-skipped exon

2-unspliced ?

3-gene prediction errors ?

splicenest.molgen.mpg.de

Page 109:

Alternative Splicing (alternative donor site)

Page 110:

Alternative Splicing (unknown gene Hs16936)

Page 111:

Ab-Initio Gene Prediction in Eukaryotic Genomes

Page 113:

Three Categories of Methods

•Rule Based
–Uses explicit set of rules to make decisions
–GeneFinder

•Neural Network
–Uses a data set to build rules.
–Grail

•HMM
–Finding the state of each Nucleotide (Coding…)
–Genscan

Page 114:

Rule Based - GeneFinder

• CodonBias -> score1

• Splice Site description -> score2

• ORFs -> score3

Proba (Gene)= F(score1, score2, score3..)

Page 115:

Neural Networks - Grail

• Train Neural Network on Known Genes: Discriminate

• GrailExp:
  • Measure several coding potentials
    • Blast
    • Coding Potential
    • …
  • Feed all the scores into the Neural Network

Page 117:

• Genscan models the genes with a Hidden Markov model, that models coding and non-coding regions.

HMM-Genscan

•Fifth order inhomogeneous HMM

–Fifth order : use 6-tuples (two codons)

–Inhomogeneous: each position is special (0,1,2)
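
To make "fifth order" and "inhomogeneous" concrete, the sketch below counts, for each of the three codon positions, how often a nucleotide follows each preceding 5-mer in a set of in-frame coding sequences. It is a simplified illustration and not Genscan's training procedure; the function names and the add-one pseudocount are assumptions.

```python
from collections import Counter, defaultdict


def train_periodic_markov(coding_seqs, order=5):
    """Count next-nucleotide occurrences per (codon phase, preceding k-mer).

    Sequences are assumed to start in frame, so phase = position % 3.
    Returns a list of three dicts: counts[phase][context][nucleotide].
    """
    counts = [defaultdict(Counter) for _ in range(3)]
    for seq in coding_seqs:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, nucleotide = seq[i - order:i], seq[i]
            counts[i % 3][context][nucleotide] += 1
    return counts


def conditional_prob(counts, phase, context, nucleotide, pseudocount=1.0):
    """P(nucleotide | context, phase), smoothed with a pseudocount per base."""
    observed = counts[phase][context]
    total = sum(observed.values()) + 4 * pseudocount
    return (observed[nucleotide] + pseudocount) / total
```

Keeping three separate tables is what "inhomogeneous" refers to: the same 5-mer context can predict a different next base depending on whether that base sits at codon position 0, 1 or 2.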

Page 118:

Burge, C. and S. Karlin, Prediction of complete gene structures in human genomic DNA. J Mol Biol, 1997. 268(1): p. 78-94

Page 119:

Your Genomic Sequence

A Collection of Proteins

Page 121:

Evaluating Eukaryote Gene Prediction

Page 122:

PMID:11042160

Page 123:

Nucleotide Accuracy

Sensitivity: $Sn = \frac{TP}{TP + FN}$

Specificity: $Sp = \frac{TP}{TP + FP}$

Approximate correlation:

$$AC = 0.5 \left( \frac{TP}{TP + FN} + \frac{TP}{TP + FP} + \frac{TN}{TN + FP} + \frac{TN}{TN + FN} \right) - 1$$

Correlation coefficient:

$$CC = \frac{(TP \times TN) - (FN \times FP)}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}}$$
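
The four nucleotide-level measures can be computed directly from the counts of true/false positives and negatives. The sketch below follows the usual gene-prediction definitions, in which "specificity" is TP/(TP+FP) and the correlation coefficient has (TP x TN) minus (FN x FP) in the numerator. The function name and the returned dictionary layout are illustrative choices, and the code does not guard against zero denominators.

```python
import math


def nucleotide_accuracy(tp, fp, tn, fn):
    """Return Sn, Sp, AC and CC from nucleotide-level confusion counts."""
    sn = tp / (tp + fn)  # sensitivity: fraction of coding nucleotides found
    sp = tp / (tp + fp)  # specificity: fraction of predicted coding that is real
    ac = 0.5 * (tp / (tp + fn) + tp / (tp + fp)
                + tn / (tn + fp) + tn / (tn + fn)) - 1
    cc = (tp * tn - fn * fp) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )
    return {"Sn": sn, "Sp": sp, "AC": ac, "CC": cc}
```

For example, with 80 true positives, 20 false positives, 880 true negatives and 20 false negatives, Sn and Sp are both 0.8 and AC and CC both come out near 0.78. AC is useful as a fallback because CC is undefined whenever one of the four sums in its denominator is zero.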

Page 124:

Exon Accuracy

Page 125:

•Blastx:

•Good Gene Hunter/Poor Modeler

•GeneWise

•Best Homology Gene Modeling

Page 126:

•GenScan:

•Distracted by Complete Genome

•Use GenomeScan instead

Page 127:

•GenScan:

•Ab-Initio Methods are more Robust

Page 128:

http://www.cs.ubc.ca/~rogic/evaluation.html

PMID: 8786136

Page 130:

•High and Low GC contents can confuse Predictions

Page 132:

Annnnnnd The

Winner is …

Page 133:

http://www.cbs.dtu.dk/services/HMMgene/

Page 134:

Working on a Genome

Page 135:

www.sanger.ac.uk/Software/Artemis/

Page 137:

Wrapping It Up

Page 138:

Predicting Genes

Using Homology: BlastX, Procrustes

ORFing: GORF

Cleaning Up Data

Page 139:

Predicting Genes

Promoter Prediction: PRODORIC

Transcript Based Predictions: GeneNest, UniGene, BLAT

Ab-Initio Predictions with HMMs: GenomeScan and HMMgene
