Center for Biologisk Sekvensanalyse Nikolaj Blom Center for Biological Sequence Analysis BioCentrum-DTU Technical University of Denmark [email protected] ”Gene Finding in Eukaryotic Genomes” PhD course #27803 Spring 2003
Jan 26, 2016
Nikolaj Blom
Center for Biological Sequence Analysis
BioCentrum-DTU
Technical University of Denmark
”Gene Finding in Eukaryotic Genomes”
PhD course #27803
Spring 2003
Human Genome Published
Public consortium (Human Genome Project): Nature, 15 Feb 2001
Celera: Science, 16 Feb 2001
We Have the Human Genome Sequence... now what?
So, what is the problem?
• Well...
• We don't know how many genes there are!
• We don't know where they are!
• We don't know what they do!
The cellular machinery recognizes genes without access to GenBank, SwissProt or computers. Can we?
Needles in Haystacks...
Only 2% of the human genome is coding regions
Intron-exon structure of genes
• Large introns (average 3365 bp)
• Small exons (average 145 bp)
• Long genes (average 27 kb)
AAGAGGTAATTAAAGCTAAATGAAGTTGTAAGAGTGGCCCTATCGCATAGGACTAGTGTCCCTATAAGAACACGAAGAAATCACCTTAGAAAGGCTGAGAAAGGGCTGCAGGGCAGTGGGAGTGCAGACTGAAAGATGCAGACCACTGGGCTTCTACTTCTGTTTCCATTTCTGATCCGGCCTGCATCTGCCTCCTTCCTGAACAGGCCAGAGAATTCATCTAAATAGCCTAAGCAGGCTGGGTGCTGTGGCTCACCTGTAATCCCAACACTTGGGAGGCCGAGGTGGGCAGATCACCTGAGGTCAGGAGTTCAAGGCTAGCCTAGCCAACATGACAAAACCCCATCTCTACTAAAAAAATACAAAAATTAGCCAGGCATAGTGGCGCCTATAGTTCCAGCTACTTGGGGGCTGAGGTAGGAAGATCGCTAGAGCCTGGGAGGTTAAGGCTGCGGTGAGCTGTGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTGCCTCAAAAATAAATAAATAAATAAATAAATAAAAATAAGAGTGCTTGGCAGCTTGATCAAGCTATGCCAGGAACCCATCTCTCAAGCAGCAGCTCTTCTCCTGTGCCATTGTCAGCTTTGTCCTGTCTGAGTCCATGGGACTCTTCTGTTTGATGGTGGTCTTCCTCATCCTCTTCATCATGTGAAGCTCCATGGAGATCACCTACCCATACCTGCTTCTGTGACCTCATGCCATTCCTGGTGTTGGAATGTGCCAAGGTTTGCCATTAAACACACATTTCTCATTTCATAATTTCATATATATTATATATATGTGTGTGTGTGTGTGTTTATATATGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTATATATATATATATATATATATATATATATATATATATATAAAATATATAGGAAGAGGCACCAGAGAGCTCTCTGCATAGTCACAGAGGAAAGGTCATGTGAGGACAGCCAGAAGGCAGATGTCACAAGCCTCACCAGCAACCTACCATACCCTGCTTGTACCTCCATCCTGGAAGTCCAGCTTCTAAAATTAGAAGAAAATAGTCGGGTGTGGTGGCTCGCACCTATAATCCCAGCACTTTGGGAGGCTGATGTGGGAGGATCATTTGAGGTCAAGAGTTTGAAACCAGCCTAGGCAACATAGGGAGACCCTGTCTTTAAAAAAAATTTTTTTTTGTTTTAATTAGCTGGGTGTGATGGTGCACACCTGAGTCCTAGCTACTTGGGAGGCTGAGGTAGGAGGATCCCCTGAGCCCAGGGAAGTGGAGGCTGCAGTGAGCCATGATCACACCACTGCAATACAGCCTGGGTGACAGAGCAAGACCTTATCTCAAAATAAACAAACAAACAAAAAAGATGACAAAATAAATGTCTGTCGTTTAAGTCACCCATTCTGTGATATCTTGTTACGGCAGCCTGAACTGACCAATACACTTCCTCACCCAGTTTAAATTCCATGCTCAATCATAATCAGCCATTGCAATTACCCTCAACTGTATTATCAACCCTCAATTTGTATTAGTTGCTTGGCAAAACCCAAACCCTTGTGAAATCCAGTTCTTCTATATCTACATCGATGCTGCCGAATATGGCTGAAGAAAAGCAACTGTGTTGACTGGACTGCTTTAAATTCATGACCACTTACCTCAAGTGGGCACTTAACTTCCTGGCAATTATTCTACATTTTTCTAGTCCATTAACTCTCCTCCTCTCTGAGTTAATTATTTCACAGCTTTTCCTCCCTCTTTATACATGTTCCATCCTAACTCTCTGCTGATGACCTTGTTTCTTATTTCACTAATGGAGGCCACCAGGAGAGAACTCCCACAGCCATCAAATTCACCAAGCCAACAGCATCCTTACACAAATCCTCTGCCTTCTCTCTGGGCTGGCTGTGCCCTCTCTTTGCTCCTGCAATTTCCCTAACTCTCCTATACTGTTGTTATTCACTCTCCAGTGGA
TAATCACCATCAGGATGCAAAGATGCTGTACTAGCTTCTGAACTCTCCAAAAACCCAGGAAACAAAAAGGCAAAGGCTAAGCTTTTTCTTATTCCCCCTTCCAGCTATTGTACTGTTTCTCTGCTTTTAATTTATTTTTATTTATTTATTTATTTATTTATTTATTTATTTTTGAGATGGAGCTTCACTCTTGTTGCCCAGGCTGGAGCGCAATGGCGCGATCTCAGCTCACCGCAACCTCTACTTCCCGAATTCAAGTGATTGTCCTGCCTCAGCCTCCCGAGTAGCCGGGATTACAGGCATGCGCCACCACGCCTGGCTAATTTTGTACTTTTAGTAGAGACGGGGTTTCTCCATGTTGCTCAGCCTGGTCACAAACTCCCGATCTCAGGTGATCTGCCTGCCTCGGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACCACGCCCCACCGTCTCTGTTCTCTTTTAAAGCACAATCCCTCAACACAAGTGTCTATACTCAGCGTCTCCACTTTCCCTCCATCTGGTCTTCCCAGTGCCCCCTTGTCAGGTTTTCACCCCATGCTCCTCCAGGGCTAGTCTGCTCTTGCTTCCCGTCTTACTGGAAGACCAGCAGCATTTGACAGAGTTGGTCACTCTCTCCTCCTTGGACACCTTTTCTTCACTTGGTTTCCAGAACAGCATTATCTCCTGCTTATTGTCTTCCTCAGTCTACCTCAGTGAAAAGCTTTACTGGTTCCTCCACATCTCCCAGACCTCCAGTAATAACAGGAATGTACCATGCCATTGCTCTCTCTCTCTCCTTTTTTTTTTTTTTTTTTTTTTTTTGTTGAGACAGAGTCTCAATTTTATCACCCAGACTGAAGCACAATGGCATGATCATAGCTCATTGCAGTCTCGAACTCGTGGGCTCAAGCAATCCTCCCACCTCAGCCTCCTGAATAGCTGGGACTACAAGCAACACCACCATGCCCAGCTAACTTTCTATTTTTTATTTTTATTTTTTGTAGAGATGAGGTTTTACTATGTTGCCTAGGCTAGTCTTGAACTCCTGGGCCCAAATGATCCTCCCACCTTGGTCTCCCAAAGTGCTGGGATTATAGGCGTGAGCCACCGTGTCCAACTTCTCTTTCTTAATGGAATTTAGGCAAAAGTTATTACTCATGGCCTTGGAATGCTCTTTCCTCAGATAGCCACATGGCTCACCATTACTTCCTTCCAGCTTTCTTCAAAGATCCACTTCTCAGTGAAGCTTTGTCCTGACCACCCAGCTGAAAATTGCAATCCTCTTCTGTCTACCATGTACATACTCTCTATTTGCTTTCCTTCCTTTATTTCTCTCTGTAGGTGTGACCTAACATAACATATAATTTACTTCTGTACCTTGTTTGCTTTCTGTCTTCCCCTTTAGAACATAAGCTCCATGAGGGAAGGCGTTTTTGCCTGCTTTAGTCACTTTATCTCCAGCAACTACAACTATATGTATATATACACACACATATATATACACACACATATATATACACACACATATATATATACATATATATATATAGTAGGCACTCAATAAACATTCACTGAATGAATGAACAGTAATGCTCACTTGCCCATAAATACAAGTACCTCATCTTTTACCACAAAGGGTATTTGTAAATATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTCCTAAAGTGCTGAGATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAGTGCAGT
GGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGCTAACTTTCAAAATTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAAACTCCTGAGCTCAAGCAATCCTCCCACCTTGGCTTCCCAAAGTGCTGGGATTATAGGCGTGAGCAACTGTACCTGGCAAAAACTTTTTAAGAGCTTCGCTTCCAGGATTAGGCAACTTTAACCTTCAACAGTGATCATAACCCTTAGTTTTCAGATCCGATTAAGGGAAATGTGTAATGTCTTACTGACACACTAATCCCATCACTGCTCACACCACCCACAATTAGCTGAG
Genes and Signals
Gene Features
Codon frequency/bias
• Organism dependent
• Hexamer statistics
Transcriptional
• Promoters/enhancers
Exons/introns
• Length distributions
• ORFs
Splicing
• Donor/acceptor sites
• Branchpoints
Translational
• Ribosome binding sites
Codon Bias
Gene finders are often organism-specific
Coding regions are often modelled by a 5th-order Markov chain (hexamers/di-codons)
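A minimal sketch of the hexamer idea, assuming a plain 5th-order Markov chain with add-one smoothing: each base is scored given the five bases before it, and a candidate region is ranked by the log-odds between a coding model and a background model. The function names and the toy training strings are invented for illustration; a real gene finder would train on known genes from the organism in question and would usually keep separate models per codon position.

```python
import math
from collections import defaultdict

def train_markov(seqs, order=5, alphabet="ACGT", pseudo=1.0):
    """Estimate P(base | previous `order` bases) with add-`pseudo` smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1.0

    def prob(context, base):
        c = counts.get(context, {})
        total = sum(c.values()) + pseudo * len(alphabet)
        return (c.get(base, 0.0) + pseudo) / total

    return prob

def log_odds(seq, coding, background, order=5):
    """Sum of log2(P_coding / P_background) over every hexamer in seq."""
    score = 0.0
    for i in range(order, len(seq)):
        ctx, base = seq[i - order:i], seq[i]
        score += math.log2(coding(ctx, base) / background(ctx, base))
    return score

# Toy training data, invented for illustration (not real genes).
coding = train_markov(["ATGGCTGCTGCTGCTGCTGCTGCTTAA" * 3])
background = train_markov(["ATATATATTTTAAATATATTTATATAAT" * 3])

# A coding-like string should score above zero, a background-like one below.
print(log_odds("GCTGCTGCTGCTGCT", coding, background) > 0)
print(log_odds("ATATATTTATATAAT", coding, background) < 0)
```

Because the statistics are learned from training genes, a model trained on one organism transfers poorly to another, which is one reason gene finders tend to be organism-specific.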
Exon Size
[Figure: bar chart of exon sizes in the bins 1-100, 100-200, 200-300, 300-500 and >500 for Fungi and Vertebrate; vertical axis 0 to 35]
Intron Size
[Figure: bar chart of intron sizes in the bins <100, <200, <1 kbp, 1 to 5 and >5 for Fungi and Vertebrate; vertical axis 0 to 70]
Intron Prevalence
[Figure: bar chart with the categories 0, 1 and >1 for Yeast, Fungi and Mammal; vertical axis 0 to 100]
Gene Finding Challenges
Need the correct reading frame
• Introns can split a codon between two exons
There is no hard and fast rule for identifying donor and acceptor splice sites
• Signals are very weak
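The reading-frame problem can be made concrete with a short sketch. It computes the phase of each intron, i.e. how many bases of a codon lie before the intron, from the coding exon coordinates. The coordinates are the GenBank CDS segments for HoxA10 and Dub-2 given in the exercise table of this deck; the function name is an assumption made for illustration.

```python
def intron_phases(exons):
    """Phase of each intron, given coding exons as (start, end) pairs.

    Coordinates are 1-based and inclusive, in transcript order.
    Phase 0: the intron falls between two codons.
    Phase 1 or 2: the intron splits a codon after its 1st or 2nd base.
    """
    phases = []
    coding_len = 0
    for start, end in exons[:-1]:
        coding_len += end - start + 1
        phases.append(coding_len % 3)
    return phases

def cds_length(exons):
    """Total coding length in bp."""
    return sum(end - start + 1 for start, end in exons)

hoxa10 = [(320, 1226), (2401, 2675)]
dub2 = [(398, 425), (1208, 2817)]

print(intron_phases(hoxa10))   # [1]
print(intron_phases(dub2))     # [1]
print(cds_length(hoxa10) % 3)  # 0
```

In both genes the first coding exon ends one base into a codon, so a predictor that places the downstream exon correctly but in the wrong frame would still translate it incorrectly.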
Overpredicting Genes
Easy to predict all exons
Report all sequences flanked by ..AG and GT.. as exons
Sensitivity = 100%
Specificity ~ 0%
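The naive predictor above fits in a few lines. This sketch reports every interval that is preceded by AG and followed by GT, then scores the result against one "true" exon. The toy sequence and the position of the true exon are invented, and specificity is taken here in the gene-finding sense (fraction of predictions that are correct), which is an assumption about the slide's usage.

```python
def naive_exons(seq):
    """All 0-based half-open intervals [s, e) flanked by ..AG and GT.."""
    acceptors = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    return {(s, e) for s in acceptors for e in donors if s < e}

# Toy sequence (invented). The "true" exon is seq[4:18].
seq = "CCAGATGGCAGTTAGGCTGTAAGCGTCC"
true_exons = {(4, 18)}

predicted = naive_exons(seq)
correct = predicted & true_exons

sensitivity = len(correct) / len(true_exons)  # true exons that were found
specificity = len(correct) / len(predicted)   # predictions that are true

print(len(predicted))  # 8
print(sensitivity)     # 1.0
print(specificity)     # 0.125
```

Even in 28 bases the single real exon is outnumbered seven to one by false candidates, and the number of candidates grows roughly with the square of the sequence length.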
Sensor-based methods
Similarity searches miss some/many genes
cDNA/EST libraries are not perfect
Ab initio gene finders
• HMM-based
  • GenScan
  • HMMgene
• Neural network-based
  • GRAIL
  • NetGene2 (splice sites)
Gene Prediction
"Isolated" methods
• Predict individual features
  • E.g. splice sites, coding regions
  • NetGene (neural network)
    – http://www.cbs.dtu.dk/services/NetGene2/
"Integrated" methods
• Predict genes in context
  • "Grammar" of genes
  • Certain elements in specific order are required
    – HMMgene http://www.cbs.dtu.dk/services/HMMgene/
    – GenScan (HMM-based) http://genes.mit.edu/GENSCAN.html
Gene Grammar
HAPPYEUGENEAWASGUYFINDER
Isolated features
Intron 3'UTR Exon Promoter Exon RBS
Gene Grammar
EUGENEFINDERWASAHAPPYGUY
Integrated features
HAPPYEUGENEAWASGUYFINDER
Promoter RBS Exon Intron Exon 3'UTR
Gene Grammar
"Isolated" methods (e.g. NN):
HAPPYEUGENEAWASGUYFINDER
"Integrated" methods (e.g. HMM):
EUGENEFINDERWASAHAPPYGUY
HMMs for gene finding
GenScan principle
• E = exon
• I = intron
• F = 5' UTR
• T = 3' UTR
• P = promoter
• N = intergenic
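To show how an HMM enforces a gene grammar, here is a toy Viterbi decoder over three of the states listed above (N, E, I). All probabilities are invented for illustration, and the model is far simpler than GenScan, which also models state durations, reading frame, splice signals and both strands. The point is only that forbidden transitions (here N to I and I to N) can never appear in the decoded path.

```python
import math

STATES = ["N", "E", "I"]  # intergenic, exon, intron
START = {"N": 0.98, "E": 0.01, "I": 0.01}
# Invented transition probabilities; an intron can only be entered and
# left through an exon, which is the "grammar" constraint.
TRANS = {
    "N": {"N": 0.90, "E": 0.10, "I": 0.00},
    "E": {"N": 0.05, "E": 0.85, "I": 0.10},
    "I": {"N": 0.00, "E": 0.10, "I": 0.90},
}
# Invented emission probabilities: exons GC-rich, introns AT-rich.
EMIT = {
    "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "E": {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10},
    "I": {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40},
}

def log(p):
    return math.log(p) if p > 0 else float("-inf")

def viterbi(seq):
    """Most probable state path for seq, as a string over N/E/I."""
    v = [{s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for ch in seq[1:]:
        col, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: v[-1][r] + log(TRANS[r][s]))
            col[s] = v[-1][best] + log(TRANS[best][s]) + log(EMIT[s][ch])
            ptr[s] = best
        v.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: v[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

seq = "AT" * 5 + "GC" * 20 + "AT" * 5
print(seq)
print(viterbi(seq))
```

The decoder labels the whole sequence at once, so a weak signal in one place can be accepted or rejected depending on whether it fits a legal overall gene structure.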
Genscan http://genes.mit.edu/GENSCAN.html
HMMgene http://www.cbs.dtu.dk/services/HMMgene/
Columns
1. Sequence identifier
2. Program name
3. Prediction (see table below for the meaning)
4. Beginning
5. End
6. Score between 0 and 1
7. Strand: + for direct and - for complementary
8. Frame (for exons it is the position of the donor in the frame)
9. Group to which the prediction belongs. If several CDSs are found they will be called cds_1, cds_2, etc. `bestparse:' is there because alternative predictions will also be available (see below).

Name       Meaning
firstex    The coding part of the first coding exon, starting with the first base of the start codon.
exon_N     The N'th predicted internal coding exon.
lastex     The coding part of the last coding exon, ending with the last base of the stop codon.
singleex   The coding part of an exon in a gene with only one coding exon.
CDS        Coding region composed of the exon predictions prior to this line.
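The nine columns described above map naturally onto a small record type. The sketch below assumes the fields are separated by whitespace, which is not stated on the slide, and the two sample lines are made up for illustration (the coordinates and scores are borrowed from the HoxA10 row of the exercise table; the sequence identifier, program string and frame values are hypothetical).

```python
from typing import NamedTuple

class Prediction(NamedTuple):
    seq_id: str
    program: str
    feature: str   # firstex, exon_N, lastex, singleex or CDS
    begin: int
    end: int
    score: float   # between 0 and 1
    strand: str    # "+" or "-"
    frame: str     # kept as text in case a line carries no numeric frame
    group: str     # e.g. "bestparse:cds_1"

def parse_line(line):
    """Parse one prediction line; assumes whitespace-separated columns."""
    f = line.split(None, 8)
    if len(f) < 9:
        raise ValueError(f"expected 9 columns, got {len(f)}: {line!r}")
    return Prediction(f[0], f[1], f[2], int(f[3]), int(f[4]),
                      float(f[5]), f[6], f[7], f[8].strip())

# Hypothetical sample lines, not real program output.
sample = [
    "Seq1 HMMgene firstex 320 1226 0.744 + 1 bestparse:cds_1",
    "Seq1 HMMgene lastex 2401 2675 0.971 + 0 bestparse:cds_1",
]

for p in map(parse_line, sample):
    print(p.feature, p.begin, p.end, p.score, p.end - p.begin + 1)
```

Keeping the group field makes it easy to collect all exons that belong to the same predicted CDS.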
Defining the term 'exon'
Gene prediction programs often use exon = CDS (coding sequence)
Real exons may contain 5' or 3' UTRs (untranslated regions)
Gene Prediction – NetGene 2
NIX – Visualizing Gene Predictions
http://www.hgmp.mrc.ac.uk/NIX/
Gene Prediction – Performance of Genscan
Performance of Genscan – Exon Length
RepeatMasker
Repetitive sequences in human/eukaryotic genomes are a problem
Run gene predictions on large genomic regions before and after masking of repetitive sequence:
• http://ftp.genome.washington.edu/cgi-bin/RepeatMasker
Up to 45% of human genomic sequence is derived from transposable/repetitive elements
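What masking does to the sequence can be sketched as follows. The code does not call RepeatMasker; it assumes the repeat intervals have already been found by some tool and simply replaces them with N so that a gene finder cannot place exons there. The function names, the toy sequence and the interval are invented for illustration.

```python
def mask(seq, repeats, mask_char="N"):
    """Replace each repeat interval with mask_char.

    repeats: iterable of (start, end), 1-based and inclusive.
    """
    out = list(seq)
    for start, end in repeats:
        for i in range(max(start, 1) - 1, min(end, len(seq))):
            out[i] = mask_char
    return "".join(out)

def masked_fraction(masked, mask_char="N"):
    """Fraction of positions that were masked."""
    return masked.count(mask_char) / len(masked) if masked else 0.0

# Toy input (invented): a 10-bp poly-A run marked as a repeat.
seq = "ACGTACGT" + "A" * 10 + "CGTT"
masked = mask(seq, [(9, 18)])

print(masked)  # ACGTACGTNNNNNNNNNNCGTT
print(round(masked_fraction(masked), 3))
```

Running the gene finder on both `seq` and `masked` and comparing the predictions follows the before/after procedure suggested on the slide.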
Future Challenges
Bootstrapping: prediction improves as more genes become known
• 'Extreme' genes (long/short) still difficult
• Initial and terminal exons are predicted with lower confidence
Combine with sequence similarity matches
Non-coding RNAs
• Most gene prediction programs only predict protein-coding genes
• tRNA and rRNA genes are not predicted
Prokaryotic gene finding
• Much easier (no introns), but still not perfect
• Especially short genes (<300 bp) difficult
Gene Prediction
Take home messages
• Human genome sequence is known
• Number of human genes is unknown!
  • Before 2001: est. 30,000-140,000
  • As of 2003: 30,000-40,000
• Location, structure and function of many human genes are unknown!
• Genes may be discovered by different means and methods
• ...
Gene Prediction
Take home messages
• Genes may be predicted by computer programs
• Masking of repetitive sequences may be required for large genomic sequences
• 'Unusual' genes are difficult (high GC%, short or terminal exons)
• HMM-based gene prediction programs are suitable for "Gene Grammar"
Prediction methods are not perfect!
The End
Gene Prediction Exercises
I. Gene Finding in Prokaryotic Sequence
II. Gene Finding in Eukaryotic Sequence
Exercises at:
http://www.cbs.dtu.dk/phdcourse/programme.html
http://www.cbs.dtu.dk/phdcourse/cookbooks/genefinding/pro.html
http://www.cbs.dtu.dk/phdcourse/cookbooks/genefinding/euk.html
Gene Prediction Exercise
Sequence         GenBank      Genscan            HMMgene            NetGene2
Seq#1 (HoxA10)   320..1226    320 1226 0.871     320 1226 0.744     Donor 1227 0.95 H
                 2401..2675   2401 2675 0.988    2401 2675 0.971    Acc. 2400 1.00 H
Seq#2 (Dub-2)    398..425     -                  398 425 0.418      Donor 426 0.87
                 1208..2817   1208 2817 0.800    1208 2817 0.735    Acc. 1207 0.42
                                                                    Acc. 1210 0.71
http://www.cbs.dtu.dk/dtucourse/cookbooks/nikob/exercises/gf_exercise_solution.html
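The exercise results above can be summarised with exon-level accuracy figures. This sketch counts a predicted exon as correct only if both ends match the GenBank annotation exactly. Specificity is taken in the gene-finding sense (correct predictions divided by all predictions), and the "-" in the Genscan column is read as a missed exon; both are assumptions about how the table is meant to be interpreted.

```python
# Exon coordinates copied from the exercise table (1-based, inclusive).
genbank = {
    "HoxA10": {(320, 1226), (2401, 2675)},
    "Dub-2": {(398, 425), (1208, 2817)},
}
genscan = {
    "HoxA10": {(320, 1226), (2401, 2675)},
    "Dub-2": {(1208, 2817)},  # first exon not predicted
}
hmmgene = {
    "HoxA10": {(320, 1226), (2401, 2675)},
    "Dub-2": {(398, 425), (1208, 2817)},
}

def exon_accuracy(truth, predicted):
    """Exon-level (sensitivity, specificity) using exact-match exons."""
    n_true = sum(len(v) for v in truth.values())
    n_pred = sum(len(v) for v in predicted.values())
    n_ok = sum(len(truth[g] & predicted.get(g, set())) for g in truth)
    return n_ok / n_true, n_ok / n_pred

print("Genscan", exon_accuracy(genbank, genscan))  # (0.75, 1.0)
print("HMMgene", exon_accuracy(genbank, hmmgene))  # (1.0, 1.0)
```

With only four annotated exons these numbers say little about general performance, but they show why short initial exons such as the 28-bp first exon of Dub-2 are the hard cases: Genscan misses it and HMMgene reports it with a low score of 0.418.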
Genome Browsing - Exercise #1
How many exons are encoded by the HoxA10 gene?
• 2 exons
How long is the transcript in base pairs?
• 2542 bp
Genome Browsing - Exercise #1
On what chromosome is the HoxA10 gene?
• Human chr. 7
On which arm (short/p or long/q)?
• p
What gene is located ca. 500 kb downstream of HoxA10?
• Scap2
On what mouse chromosome is the ortholog/homolog of human HoxA10 located?
• Mouse chr. 6
In the overview panel, there is a gene located ca. 300 kb downstream of HoxA10. What is its name?
• Scap2