Jan 20, 2016
Whole-genome comparative genomics
Analyzing the human genome
6.095/6.895 - Computational Biology: Genomes, Networks, Evolution
Lecture 21 Dec 6, 2005
Challenges in Computational Biology
DNA
4 Genome Assembly
Gene Finding
Regulatory motif discovery
Database lookup
9 Gene expression analysis
RNA transcript
Sequence alignment
7 Evolutionary Theory
TCATGCTATTCGTGATAATGAGGATATTTATCATATTTATGATTT
10 Cluster discovery
Gibbs sampling
12 Protein network analysis
14 Emerging network properties
13 Regulatory network inference
Comparative Genomics
RNA folding
TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTCAGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAATTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGGATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAATCTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATGAACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATCATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAAAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCAGCATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAG...TTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCA
TAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACATTTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACAACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCAACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAATGCTGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAAT
Promoter motifs
3’ UTR motifs
Exons
Introns
Comparing genomes reveals functional elements
• Ultra-conserved elements
• Protein-coding genes
• Short regulatory motifs
Clades: Monotremata, Marsupialia, Afrotheria, Xenarthra, Euarchontoglires, Laurasiatheria
Species shown: platypus, opossum, sloth, anteater, armadillo, hedgehog, shrew, mole, phyllostomid microbat, microbat (brown bat), false vampire bat, flying fox, megabat (horseshoe bat), whale, dolphin, hippo, cow, pig, llama, horse, rhino, tapir, cat, dog, pangolin, squirrel, mouse, rat, hystricid, guinea pig, rabbit, pika, tree shrew, tree shrew urogale, flying lemur variegatus, flying lemur volans, lemur, mouse lemur, galago, bushbaby, tarsier bancanus, tarsier syrichta, spider monkey, goeldi monkey, marmoset, macaque, baboon, vervet, human, chimpanzee, gorilla, orangutan, gibbon, tenrec, golden mole, short eared elephant shrew, long eared elephant shrew, aardvark, sirenian, hyrax, elephant
Legend: Black - complete 8X; Red - 2x sequencing
Species labeled in the figure: elephant, armadillo, rabbit, bat, tenrec, shrew, cat, hedgehog
Average extra branch length: 0.2 subs/site
Extensive sequencing of mammalian tree
Hidden Markov Models for gene finding
Modeling biological sequences
• Ability to emit DNA sequences of a certain type
– Not exact alignment to previously known gene
– Preserving ‘properties’ of type, not identical sequence
• Ability to recognize DNA sequences of a certain type (state)
– What (hidden) state is most likely to have generated observations
– Find set of states and transitions that generated a long sequence
• Ability to learn distinguishing characteristics of each state
– Training our generative models on large datasets
– Learn to classify unlabelled data
States: Intergenic, CpG island, Promoter, First exon, Intron, Other exon, Intron
GGTTACAGGATTATGGGTTACAGGTAACCGTTGTACTCACCGGGTTACAGGATTATGGGTTACAGGTAACCGGTACTCACCGGGTTACAGGATTATGGTAACGGTACTCACCGGGTTACAGGATTGTTACAGG
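The "recognize" task above (which hidden state most likely generated each observed base) is usually solved with the Viterbi algorithm. The following is a minimal sketch, not any particular gene finder: the two states `B` (background) and `I` (GC-rich island), and all probabilities, are made-up illustration values rather than anything from the lecture.

```python
import math

def viterbi(seq, states, start, trans, emit):
    """Return the most probable state path for a non-empty sequence (log space)."""
    # scores[-1][s] = best log-probability of any path ending in state s
    scores = [{s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}]
    backptrs = []
    for ch in seq[1:]:
        col, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: scores[-1][p] + math.log(trans[p][s]))
            col[s] = (scores[-1][best_prev] + math.log(trans[best_prev][s])
                      + math.log(emit[s][ch]))
            ptr[s] = best_prev
        scores.append(col)
        backptrs.append(ptr)
    # Trace back from the best final state
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

# Hypothetical toy model: B = AT-rich background, I = GC-rich island
STATES = ("B", "I")
START = {"B": 0.5, "I": 0.5}
TRANS = {"B": {"B": 0.9, "I": 0.1}, "I": {"B": 0.1, "I": 0.9}}
EMIT = {
    "B": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "I": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}

if __name__ == "__main__":
    print(viterbi("AAAAAAAAGCGCGCGCGCAAAAAAAA", STATES, START, TRANS, EMIT))
```

Real gene finders use many more states (exons by phase, introns, splice sites) and, in the generalized HMMs described below, explicit length distributions; the dynamic program has the same shape.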
HMM-based Gene Finding
• GENSCAN (Burge 1997)
• FGENESH (Solovyev 1997)
• HMMgene (Krogh 1997)
• GENIE (Kulp 1996)
• GENMARK (Borodovsky & McIninch 1993)
• VEIL (Henderson, Salzberg, & Fasman 1997)
• TWINSCAN (Brent 2001)
• NSCAN (Brent 2005)
VEIL: Viterbi Exon-Intron Locator
• Contains 9 hidden states or features
• Each state is a complex internal Markovian model of the feature
• Features:
– Exons, introns, intergenic regions, splice sites, etc.
Exon HMM Model
Upstream
Start Codon
Exon
Stop Codon
Downstream
3’ Splice Site
Intron
5’ Poly-A Site
5’ Splice Site
• Enter: start codon or intron (3’ Splice Site)
• Exit: 5’ Splice site or three stop codons (taa, tag, tga)
VEIL Architecture
Genie
• Uses a generalized HMM (GHMM)
• Edges in model are complete HMMs
• States can be any arbitrary program
• States are actually neural networks specially designed for signal finding
• J5’ – 5’ UTR
• EI – Initial Exon
• E – Exon, Internal Exon
• I – Intron
• EF – Final Exon
• ES – Single Exon
• J3’ – 3’UTR
Begin Sequence
Start Translation
Donor splice site
Acceptor splice site
Stop Translation
End Sequence
Genscan Overview
• Developed by Chris Burge (Burge 1997)
• Characteristics:
– Designed to predict complete gene structures
• Introns and exons, promoter sites, polyadenylation signals
– Incorporates:
• Descriptions of transcriptional, translational, and splicing signals
• Length distributions (explicit state duration HMMs)
• Compositional features of exons, introns, intergenic, C+G regions
– Larger predictive scope
• Deals with partial and complete genes
• Multiple genes separated by intergenic DNA in a sequence
• Consistent sets of genes on either/both DNA strands
• Based on a general probabilistic model of genomic sequence composition and gene structure
Genscan Architecture
• It is based on Generalized HMM (GHMM)
• Model both strands at once
– Other models: Predict on one strand first, then on the other strand
– Avoids prediction of overlapping genes on the two strands (rare)
• Each state may output a string of symbols (according to some probability distribution).
• Explicit intron/exon length modeling
• Special sensors for Cap-site and TATA-box
• Advanced splice site sensors
Fig. 3, Burge and Karlin 1997
GenScan States
• N - intergenic region
• P - promoter
• F - 5’ untranslated region
• Esngl – single exon (intronless) (translation start -> stop codon)
• Einit – initial exon (translation start -> donor splice site)
• Ek – phase k internal exon (acceptor splice site -> donor splice site)
• Eterm – terminal exon (acceptor splice site -> stop codon)
• Ik – phase k intron: 0 – between codons; 1 – after the first base of a codon; 2 – after the second base of a codon
Classification-based Gene finding
Mike Lin
Gene identification
TTACGGTACCGCTATACCCGAACGTCTAATAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
M T K S H S E E V I V P E F K
• Intuition
– Genes are translated in units of 3 nucleotides (codons)
• Every DNA strand can be translated in 3 reading frames
• Insertions and deletions may cause frame-shifts
– Selective pressure on the amino-acid translation
• Silent substitutions tolerated
• Codons for similar amino-acids frequently exchanged
• Method
– Observe patterns of nucleotide change in genes / intergenic regions
– Develop signatures / tests to discriminate between the two
– Validate tests with known genes / intergenic regions
– Use them to revisit the yeast and human genomes
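The statement that every strand can be translated in three reading frames can be made concrete with a short sketch. It builds the standard genetic code from the usual TCAG ordering; the function names `translate` and `three_frames` are illustrative choices, not part of the lecture.

```python
# Standard genetic code, laid out in TCAG order for each codon position
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna, frame=0):
    """Translate one strand starting at offset `frame` (0, 1, or 2).

    Stop codons are shown as '*'; codons with unknown bases as 'X'.
    """
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(frame, len(dna) - 2, 3)
    )

def three_frames(dna):
    """Translations of the given strand in all three reading frames."""
    return [translate(dna, f) for f in range(3)]

if __name__ == "__main__":
    for frame, protein in enumerate(three_frames("ATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTC")):
        print(frame, protein)
```

A single-base insertion or deletion shifts every downstream codon into a different frame, which is why frame-preserving gaps are such a strong signal of coding sequence.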
Gene identification
Study known genes
Derive conservation rules
Discover new genes
Overall conservation vs. signatures of divergence
• Not a gene
– Region of perfect/near-perfect non-coding conservation
– Scores very well with HMM approaches, ExoniPhy, N-Scan, which measure general levels of local nucleotide conservation
• Real gene
– Mutations do occur, consistent with constraints under which genes evolve
– Insertions preserve reading frame. Mutations preserve amino-acid function
Quantify and capture these constraints computationally

human TGC---CCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
mouse TGCCAGCCACGTGACGTGGCTG---TGGCAGCGGCAGCTAAAAAAGAGCTTAAGTAT
rat   TGCCAGCCACGCGACGTGGCCG---TGGCAGCAGCCGCTAAAAAGGAACTTAAGTAC
dog   TGCCAGCCACGCGAGGTGGCGG---------CTGCGGCCAAGAAAGAGCTCAAGTAC
      ***   ** ** ** ***** *         * ** ** ** ** ** ** *****

human TGCCAGCCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
mouse TGCCAGCCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
rat   TGCCAGCCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
dog   TGCCAGCCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
      *********************************************************
Signature 1: Reading frame conservation
              Genes    Intergenic   Separation
Mutations     30%      58%          2-fold
Gaps          1.3%     14%          10-fold
Frameshifts   0.14%    10.2%        75-fold
Signature 2: Distinct patterns of codon substitution
[Figure: codon substitution matrices (codon observed in species 1 vs. codon observed in species 2), one panel for genes and one for intergenic regions]
• Codon substitution patterns specific to genes
– Genetic code dictates substitution patterns
– Amino acid properties dictate substitution patterns
Evaluating reading frame conservation (RFC)
Scer    CTTCTAGATTTTCATCTT-GTCGATGTTCAAACAACGTGTTA-----TCAGAGAAACAGCTCTATGAGAAATCAGCTGATG
Scer_f1 123123123123123123-12312312312312312312312-----3123123123123123123123123123123123
Spar    TATTCATA-TCTCATCTTCATCAATGTTCAAACAGCGTGTTACAGACACAGAGAAACAGCTTC-TGAGAAGTCAGCCGGTG
Spar_f1 12312312-312312312312312312312312312312312312312312312312312312-31231231231231231   RFC 43%
Spar_f2 23123123-123123123123123123123123123123123123123123123123123123-12312312312312312   RFC 34%
Spar_f3 31231231-231231231231231231231231231231231231231231231231231231-23123123123123123   RFC 23%
F1 F2 F1 F2 F3
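The frame-numbering idea shown in the alignment can be sketched in code. This is a simplified illustration, assuming a pairwise alignment with `-` for gaps: the reference is numbered in one frame, the target in each of three offsets, and the score for an offset is the fraction of aligned (ungapped) columns where the codon positions agree. The function names are hypothetical, and details of the actual method (for example windowing) are not covered here.

```python
def frame_labels(aligned, offset=0):
    """Codon position (0, 1, 2) of each non-gap base; None for gap columns."""
    labels, count = [], offset
    for ch in aligned:
        if ch == "-":
            labels.append(None)
        else:
            labels.append(count % 3)
            count += 1
    return labels

def rfc_by_offset(ref, tgt):
    """Fraction of aligned columns kept in frame, for each of 3 target offsets."""
    if len(ref) != len(tgt):
        raise ValueError("aligned sequences must have equal length")
    ref_labels = frame_labels(ref)
    scores = []
    for offset in range(3):
        tgt_labels = frame_labels(tgt, offset)
        pairs = [(a, b) for a, b in zip(ref_labels, tgt_labels)
                 if a is not None and b is not None]
        in_frame = sum(1 for a, b in pairs if a == b)
        scores.append(in_frame / len(pairs) if pairs else 0.0)
    return scores

if __name__ == "__main__":
    # A one-base gap shifts the frame for everything downstream of it
    print(rfc_by_offset("ATGAAATTTCCC", "ATG-AATTTCCC"))
```

A gap whose length is a multiple of three leaves the downstream labels in agreement, so such an alignment keeps a high score in a single offset; frame-shifting gaps split the score across offsets, as in the 43% / 34% / 23% example.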
Evaluating the codon substitution score (CSM)
p_{X/Y} = P(human codon X aligns to mouse codon Y in genes)
q_{X/Y} = P(human codon X aligns to mouse codon Y outside genes)
Codon substitution matrix (values ×10^-5); Mouse (columns), Human (rows):
        AAA/K  AAG/K  AAC/N  AAT/N  AGA/R  AGG/R  ...  TAA/X
AAA/K   1552   608    12     8      74     26          0
AAG/K   423    2531   11     9      23     73          0
AAC/N   8      13     1368   331    1      1           0
AAT/N   8      12     444    1007   2      1           0
AGA/R   44     22     1      1      664    178         0
AGG/R   15     72     1      1      148    594         0
human CTGTTTTTCCCCTTTTGTAGGAAGTCAC
mouse CTGTTTTTCCTCTTTTGTAGTAAGTCAC
p_{CCC/CTC}, q_{CCC/CTC}
p_{AGG/AGT}, q_{AGG/AGT}
• Filling in the CSM
• Scoring an aligned region
Coding
Score=
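One common way to turn the p and q tables into a score for an aligned region is a sum of log-odds over aligned codon pairs. The scoring formula itself did not survive in these notes, so the sketch below is an assumption about its form, and the probability values in the example are invented for illustration only.

```python
import math

def csm_score(human, mouse, p, q, pseudo=1e-6):
    """Sum of log(p/q) over aligned codon pairs (assumed log-odds form).

    `p` and `q` map (human_codon, mouse_codon) to probabilities inside and
    outside genes. Pairs missing from a table fall back to `pseudo`.
    The sequences are assumed to be aligned without gaps and in frame.
    """
    score = 0.0
    for i in range(0, min(len(human), len(mouse)) - 2, 3):
        pair = (human[i:i + 3], mouse[i:i + 3])
        score += math.log(p.get(pair, pseudo) / q.get(pair, pseudo))
    return score

if __name__ == "__main__":
    # Hypothetical probabilities for the two substitutions in the example
    p = {("CCC", "CTC"): 0.0002, ("AGG", "AGT"): 0.00001}
    q = {("CCC", "CTC"): 0.0001, ("AGG", "AGT"): 0.00010}
    human = "CTGTTTTTCCCCTTTTGTAGGAAGTCAC"
    mouse = "CTGTTTTTCCTCTTTTGTAGTAAGTCAC"
    print(csm_score(human, mouse, p, q))
```

A positive total means the observed substitutions look more like those seen in genes than outside genes; a negative total means the opposite.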
Multiple levels of selection
[Figure: codon substitution matrices (codon observed in species 1 vs. codon observed in species 2), one panel for genes and one for intergenic regions]
• Multi-level information
– All positions: overall conservation
– Exclude conserved triplets: amino-acid sequence
– Exclude conserved amino-acids: amino-acid properties
Effect of using only off-diagonal CSM positions
Using full CSM matrix: “Is it conserved like a coding gene?” (false positives)
Using only off-diagonal positions: “Has it diverged like a coding gene?” (no false positives)
CSM coding score for human/mouse (x-axis) and human/dog (y-axis) in CFTR region
Putting it all together: ExoClass gene finder
• Train Support Vector Machine (SVM) classifier
– Reading Frame Conservation (RFC) score
– Codon Substitution Matrix (CSM) coding score
– Splice signal conservation, ESEs, ESIs
– Exon length, conservation boundaries
• Apply it systematically to all candidate intervals
• Use full gene model constraints for post-processing
Results in yeast
                          Accept   Reject
~4000 named genes         99.9%    0.1%
~300 intergenic regions   1%       99%
2000 Hypothetical ORFs    1500     500
High sensitivity and specificity
Spar
Smik
Sbay
Scer
528 deleted
43 novel
280 boundary changes
34 merged
6235 annotated genes; 5695 ‘real’ genes
Results in human ENCODE regions (Human/Mouse)
• High nucleotide sensitivity and specificity
– Increases with additional species (with some caveats)
• ‘Missed’ exons due to:
– Sequencing / assembly / alignment problems
– Rapidly evolving genes: Immunity and olfactory families
• ‘Wrong’ exons due to:
– Novel exons
– Existing evidence: human / non-human spliced mRNAs
– New evidence: validated using specific RT-PCR (with MGC)

Method     Nucl Sn  Nucl Sp  Exon Sn  Exon Sp  Missed  ‘Wrong’  ‘Wrong’ w/ evidence
GENSCAN    85       62       67       49       17      39       17
TWINSCAN   77       88       66       79       26      11       25
SGP2       84       84       72       69       18      20       24
Exoniphy   73       88       57       67       26      10       53
ExoClass   86       87       73       75       17      14       37
Examples in the human
• Example 1: New gene
• Example 2: Deleted gene
• Example 3: Changed exons
• Fully rejected genes typically have only weak evidence
• New exons often supported by existing experimental evidence
• RT-PCR validation of 90 fully novel genes: 50 confirmed
Initial results for the whole human genome
Species: Dog, Mouse, Rat, Human
• 9862 fully confirmed
• 7,717 refined
• 1,919 not aligned
• 1065 fully rejected
• 454 novel (2591 exons)
Experimental validation
• Select novel predictions with highest specificity
– Unique in the genome
– No pseudogenes
– Absolutely no previous experimental evidence
• Results
– June 2005: 454 genes, 90 entirely novel
– RT-PCR validation for specific exon splicing
– 50 fully validated using pooled tissues
• New validation set
– Top of the list: 354 genes, 1162 exons
– … and many more (gene families, lower scores)
Gene Identification: Summary
• Exon-centric approach
– Identify discriminating variables
– Observed distinct patterns of nucleotide change
– Systematically identify all exons in the genome
– Use gene structure constraints to link them
• Application
– High sensitivity and specificity (~90%)
– More powerful than experimental methods
– Largest reannotation of the yeast genome
– Reannotation of the human gene set
Regulatory Motif Discovery
Xiaohui Xie
ATGACTAAATCTCATTCAGAAGAAGTGA
Regulatory Motif Discovery
GAL1
CCCCWCGG CCG
Gal4 Mig1
CGG CCG
Gal4
• Gene regulation
– Genes are turned on / off in response to changing environments
– No direct addressing: subroutines (genes) contain sequence tags (motifs)
– Specialized proteins (transcription factors) recognize these tags
• What makes motif discovery hard?
– Motifs are short (6-8 bp), sometimes degenerate
– Can contain any set of nucleotides (no ATG or other rules)
– Act at variable distances upstream (or downstream) of target gene
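Degenerate motifs such as CCCCWCGG are written with IUPAC nucleotide codes, where W stands for A or T and N for any base. A small sketch of scanning a sequence for such a motif follows; the helper names are illustrative, and only the forward strand is searched.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def motif_to_regex(motif):
    """Build a regex string for an IUPAC motif (case-insensitive input)."""
    return "".join(IUPAC[ch] for ch in motif.upper())

def find_motif(motif, seq):
    """Return (start, matched_text) for every occurrence, overlaps included."""
    # The lookahead lets matches overlap; group 1 captures the matched bases
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq.upper())]

if __name__ == "__main__":
    print(find_motif("CCCCWCGG", "AACCCCACGGTTCCCCGCGG"))
    # Gapped motif written out with 11 N's between the CGG and CCG halves
    print(find_motif("CGG" + "N" * 11 + "CCG", "TTCGGAAAAAAAAAAACCGTT"))
```

Because a short degenerate pattern matches by chance many times in a genome, occurrence counts alone say little; the sections that follow use conservation across species to separate functional instances from chance matches.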
Regulatory Motif Discovery
Study known motifs
Derive conservation rules
Discover novel motifs
human CTCTTAATGGTACACGTTCTGCCT----AAGTAGCCTAGACGCTCCCGTGCGCCC-GGGG
dog   CTCTTA-CGGGGCACATTCTGCTTTCAACAGTGGGGCAGACGGTCCCGCGCGCCCCAAGG
mouse GTCTTAGGAGGCT-CGATCGCC---------------------GCCTGCATTATT-----
rat   GTCTTAGTTGGCCACGACCTGC---------------------TCATGCATAATT-----
       *****   *    *   *  *                      *  *

human CGGGTAGGCCTGGCCGAAAATCTCTCCCGCGCGCCTGACCTTGGGTTGCCCCAGCCAGGC
dog   CAGGC---CCGGGCTGCAGACCTGCCCTGAGGGAATGACCTTGGGCGGCCGCAGCGGGGC
mouse --------------CACAAGCCTGTGGCGCGC-CGTGACCTTGGGCTGCCCCAGGCGGGC
rat   --------------CACAAGTTTCTC---TGC-CCTGACCTTGGGTTGCCCCAGGCGAG-
                       *    *       *    **********  *** ***    *

human TGCGGGCCCGAGACCCCCG-------------------GGCCTCCCTGCCCCCCGCGCCG
dog   CGCGGGCCCAGGCCCCCCTCCCTCCCTCCCTCCCTCCCTCCCTCCCTGCCCCCCGGACCG
mouse TGCAGGCTCACCACCCCGTCTTTTCT---------------------GCTTTTCGAGTCG
rat   -GCATACACCCCGCCTTTTTTTTTTTTTT---------TTTTTTTTTGCCGTTCAAG-AG
       **   * *    **                                **    *     *
Known motifs are preferentially conserved
Gabpa
Err
Is this enough to discover motifs? No.
Known motifs are frequently conserved
• Across the human promoter regions, the Err motif:
– appears 434 times
– is conserved 162 times
Human
Dog
Mouse
Rat
Err Err Err
Conservation rate: 37%
• Compare to random control motifs
– Conservation rate of control motifs: 6.8%
– Err enrichment: 5.4-fold
– Err p-value < 10^-50 (25 standard deviations under binomial)
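The Err numbers can be checked with a few lines of arithmetic. The sketch below treats each of the 434 occurrences as an independent trial with the control conservation rate as the success probability, and measures how many binomial standard deviations the observed 162 conserved occurrences lie above the expectation. The function name is illustrative.

```python
import math

def conservation_z(k, n, r):
    """Standard deviations by which k conserved out of n exceeds the
    binomial expectation at basal conservation rate r."""
    mean = n * r
    sd = math.sqrt(n * r * (1 - r))
    return (k - mean) / sd

if __name__ == "__main__":
    k, n, r = 162, 434, 0.068   # values quoted on the slide
    rate = k / n
    print(f"conservation rate: {rate:.1%}")
    print(f"enrichment over controls: {rate / r:.1f}-fold")
    print(f"z-score: {conservation_z(k, n, r):.1f}")
```

The same calculation applied to every candidate pattern is what yields a Motif Conservation Score expressed in standard deviations.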
Motif Conservation Score (MCS)
MCS distribution of all 6-mers shows excess conservation
– High scoring patterns include known motifs
– Excess specific to promoters and 3’-UTRs (not introns)
– For MCS > 6, estimate 97% specificity
[Figure: motif density vs. Motif Conservation Score (MCS), two panels]
Use MCS to discover new motifs
Select motifs with MCS > 6.0, cluster
Hill-climbing in sequence space
• Seed selection
– Three mini-motif conservation criteria (CC1, CC2, CC3)
• Motif extension
– Non-random conservation of neighbors
• Motif collapsing
– Merge neighbors using hierarchical clustering, avg-max-linkage
• Re-scoring complex motifs
– Motif conservation score for full motifs (MCS)
Test 1: Intergenic conservation
[Figure: conserved count vs. total count for each mini-motif; labeled point: CGG-11-CCG]
Test 1: Selecting mini-motifs
• Estimate basal rate of conservation
– Expected conservation rate at the evolutionary distances observed
– Average conservation rate of non-outlier mini-motifs
• Score conservation of mini-motif
– k: conserved motif occurrences
– n: total motif occurrences
– r: basal conservation rate
– Evaluate binomial probability of observing k successes out of n trials
• Assign z-score to each mini-motif
– Bulk of distribution is symmetric
– Estimate specificity as (R-L)/R
– Select cutoff: 5.0 sigma
– 1190 mini-motifs, 97.5% non-random
Binomial probability: $P(k) = \binom{n}{k}\, r^{k} (1-r)^{n-k}$
[Figure labels: conservation rate r, N, binomial score, left tail (L), right tail (R), specificity, cutoff]
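The binomial probability and the (R-L)/R specificity estimate can be sketched as follows. The idea behind the estimate is that chance alone produces a symmetric distribution of z-scores, so the count in the left tail approximates how many right-tail patterns are there by chance. The z-score list in the example is invented for illustration.

```python
import math

def binom_pmf(k, n, r):
    """Probability of exactly k conserved occurrences out of n at rate r."""
    return math.comb(n, k) * r**k * (1 - r) ** (n - k)

def binom_upper_tail(k, n, r):
    """Probability of observing k or more conserved occurrences."""
    return sum(binom_pmf(i, n, r) for i in range(k, n + 1))

def specificity(zscores, cutoff):
    """Estimate (R - L) / R, where R and L count patterns beyond +cutoff
    and below -cutoff respectively."""
    right = sum(1 for z in zscores if z >= cutoff)
    left = sum(1 for z in zscores if z <= -cutoff)
    return (right - left) / right if right else 0.0

if __name__ == "__main__":
    print(binom_upper_tail(8, 10, 0.3))
    # Hypothetical z-scores: four patterns above +5 sigma, one below -5 sigma
    print(specificity([6.0, 7.0, 8.0, 5.5, -5.2, 0.0, 1.0], 5.0))
```

With the slide's cutoff of 5.0 sigma, the reported 97.5% non-random fraction corresponds to a left tail about 2.5% the size of the right tail.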
Test 2: Intergenic vs. Coding
[Figure: intergenic conservation vs. coding conservation for mini-motifs; labels: CGG-11-CCG, Higher Conservation in Genes]
Test 3: Upstream vs. Downstream
[Figure: upstream conservation vs. downstream conservation for mini-motifs; labels: CGG-11-CCG, Most Patterns, Downstream motifs?]
Constructing full motifs
[Diagram: 2,000 mini-motifs selected by Test 1, Test 2, and Test 3 are each extended and collapsed (Extend, Collapse), then merged (Merge) into 72 full motifs]
Extending mini-motifs
• Separate conserved and non-conserved instances
CT A C GA6
CT x x GA6
Causal set
Random set
CT A C GAR G W
CT x x GAY H S
• Find maximally discriminating neighborhood
N1
N2
M1
M2
• Evaluate non-randomness of neighborhood
– chi-square contingency test on [N1,M1], [N2,M2]
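A chi-square test on a 2x2 contingency table has a closed form that is easy to compute directly. In the sketch below the table rows are taken to be [N1, M1] and [N2, M2]; reading N as counts from conserved instances and M as counts from non-conserved instances, with and without a candidate neighboring base, is an assumption about the slide's notation.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom, no continuity
    correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the table sums to zero")
    return n * (a * d - b * c) ** 2 / denom

if __name__ == "__main__":
    # Hypothetical counts: a neighbor base present in 40 of 50 conserved
    # instances but only 30 of 100 non-conserved instances
    n1, m1 = 40, 30      # instances with the neighbor base
    n2, m2 = 10, 70      # instances without it
    print(chi_square_2x2(n1, m1, n2, m2))
```

With one degree of freedom, a statistic above 3.84 corresponds to p < 0.05; a motif-extension step would keep the neighboring position that gives the largest statistic.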
Collapsing similar motifs
• Motif similarity: sequence and genomic positions
– Motifs share similar sequences, count bits in common
– Motifs appear conserved in similar sets of regions
[Diagram: regions with motif 1, regions with motif 2, regions containing both motifs]
• Collapsing: Hierarchical clustering
– Sort the order of joins by decreasing similarity
– Average max-linkage cluster similarity score
Systematically test candidate patterns
All potential motifs
Evaluate MCS
Cluster similar motifs
GT C A GTR RY gapS W
174 motifs in promoters
106 motifs in 3’ UTRs
• Enumerate
– Length between 6 and 15 nt, allow central gap
– 11 letter alphabet (A C G T, 2-fold codes, N)
• Score
– Compute binomial score (conserved vs. total)
– Select MCS > 6.0 (specificity 97%)
• Cluster
– Sequence similarity
– Overlapping occurrences
Are these real ?
Functions of discovered motifs
Evidence of motif function
• Promoter motifs: (1) Comparison to known motifs
(2) Distance from TSS
(3) Expression enrichment
[Diagram: gene model with Promoter, ATG, Stop, and 3’-UTR; 174 motifs in promoters, 106 motifs in 3’-UTRs]
(1) Promoter motifs match known TF binding sites
• Compare discovered motifs to TRANSFAC database of 125 known motifs
MCS    Discovered motif   Factor    Known motif
46.8   GGGCGGR            SP-1      GGGCGGG
34.7   GCCATnTTg          YY1       GCCATnTT
32.7   CACGTG             MYC       SCACGTG
31.2   GATTGGY            NF-Y      YSATTGGYY
30.8   TGAnTCA            AP-1      CTGASTCA
29.7   GGGAGGRR           MAZ       GGGGAGGG
29.5   TGACGTMR           CREB      TGACGTMA
26.0   CGGCCATYK          NF-MUE1   CGGCCATCT
25.0   TGACCTTG           ERR?      TGACCTTG
22.6   CCGGAARY           ELK-1     CCGGAART
19.8   SCGGAAGY           GABP      VCCGGAAG
17.9   CATTTCCK           STAT1     CAnTTCCS
14.9   TTGTTT             SRY       KTWGTTT
14.6   TATAAA             TBP       TATAAATW
14.2   RTAAACA            FOXO1     RWAAACAA
13.9   SMGGAAGT           PEA3      MGGAWGT
12.6   YYATTGTT           SOX-5     ATTGTT
12.5   TCACGTG            SREBP-1   ATCACGTGAY
12.4   YATGYAAAT          OCTAMER   ATGCAAATnA
12.2   GGGnnTTTCC         P65       GGGRATTTCC
11.9   TGACGTGK           ATF6      TGACGTGG
11.7   TTAYRTAA           E4BP4     RTTACRTAAY
11.0   CCAWWnAAGG         SRF       CCAWATAWGGM
10.7   TAAWWATAG          MEF-2     YTAAAWATAGCY

55% of TRANSFAC motifs match discovered motifs
45% of discovered motifs match TRANSFAC motifs (only 2% of control sequences match TRANSFAC motifs)
(2) Promoter motifs show preferred distance to TSS
32% of discovered motifs show strong positional bias
[Figure: distance from TSS for each of the 174 discovered motifs, two panels: conserved motif sites in all four species, and motif instances in human; labeled examples: Motif 8, Motif 4; labeled positions: -81, -63]
Discovered motifs occur preferentially within 200 bp of the Transcription Start Site
Individual motifs show strong peaks regardless of conservation
(3) Promoter motifs enriched in specific tissues
70% of motifs show significant enrichment in at least one tissue
[Figure: tissue enrichment for new motifs and known TFs]
Summary for promoter motifs
Rank   Discovered motif   Known TF motif   Tissue enrichment   Distance bias
1 RCGCAnGCGY NRF-1 Yes Yes
2 CACGTG MYC Yes Yes
3 SCGGAAGY ELK-1 Yes Yes
4 ACTAYRnnnCCCR Yes Yes
5 GATTGGY NF-Y Yes Yes
6 GGGCGGR SP1 Yes Yes
7 TGAnTCA AP-1 Yes
8 TMTCGCGAnR Yes Yes
9 TGAYRTCA ATF3 Yes Yes
10 GCCATnTTG YY1 Yes
11 MGGAAGTG GABP Yes Yes
12 CAGGTG E12 Yes
13 CTTTGT LEF1 Yes
14 TGACGTCA ATF3 Yes Yes
15 CAGCTG AP-4 Yes
16 RYTTCCTG C-ETS-2 Yes Yes
17 AACTTT IRF1(*) Yes
18 TCAnnTGAY SREBP-1 Yes Yes
19 GKCGCn(7)TGAYG Yes Yes
20 GTGACGY E4F1 Yes Yes
21 GGAAnCGGAAnY Yes Yes
22 TGCGCAnK Yes Yes
23 TAATTA CHX10 Yes
24 GGGAGGRR MAZ Yes
25 TGACCTY ERRA Yes
• 174 promoter motifs
– 70 match known TF motifs
– 115 expression enrichment
– 60 show positional bias
– 75% have evidence
• Control sequences
– < 2% match known TF motifs
– < 5% expression enrichment
– < 3% show positional bias
– < 7% false positives
Most discovered motifs are likely to be functional
What about 3’-UTR motifs ?
• Sequence properties of 3’-UTR motifs
• Regulatory roles of 3’-UTR motifs
[Diagram: gene model with TSS, ATG, Stop, and 3’-UTR; 174 promoter motifs, 106 3’-UTR motifs]
Directionality of 3’-UTR motifs
Promoter motifs: also conserved on reverse strand
3’-UTR motifs: NOT conserved on reverse strand
3’-UTR motifs likely to act post-transcriptionally
[Diagram: motif positions relative to ATG and Stop]
DNA level – both strands are available
RNA level – only coding strand is available
[Figure: reverse strand conservation vs. forward strand conservation, for promoter motifs and 3’-UTR motifs]
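The strand argument can be made concrete: a motif read on the reverse strand appears in the forward sequence as its reverse complement. A DNA-binding factor can use either orientation, while an element recognized in the transcribed RNA exists in one orientation only. The sketch below counts both orientations in a sequence; the function names and the toy sequence are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Sequence of the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def count_occurrences(motif, seq):
    """Number of (possibly overlapping) exact occurrences of motif in seq."""
    return sum(1 for i in range(len(seq) - len(motif) + 1) if seq.startswith(motif, i))

def strand_counts(motif, seq):
    """(forward, reverse) occurrence counts of an exact motif in seq."""
    motif, seq = motif.upper(), seq.upper()
    return (count_occurrences(motif, seq),
            count_occurrences(reverse_complement(motif), seq))

if __name__ == "__main__":
    print(reverse_complement("TGACCTTG"))
    print(strand_counts("TGACCTTG", "AATGACCTTGAACAAGGTCAAA"))
```

Comparing conservation of a motif with conservation of its reverse complement, region by region, is one way to obtain the forward-versus-reverse contrast that separates promoter motifs from 3’-UTR motifs.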
What are microRNAs (miRNAs)?
• Endogenous small non-coding RNA
• ~22nt in length
• Located in genomic loci that can produce fold-back structures
• Often conserved (but conservation may not be required)
miRNA and siRNA
miRNA gene/miRNA host geneDouble stranded RNA formation
P OH
5’ 3’RISC Complex
miRNA & siRNA as Negative Regulators of Gene Expression
miRNAsiRNA
lin-14 mRNA
lin-4 RNA, 22 nt
mRNA
Near perfect match: degradation of target
Partial match: inhibition of translation; degradation of target
Chromosomal silencing
Off-target effect
Properties of microRNA genes (miRNAs)
Properties similar to the motifs we have discovered
DNA
~100 nt precursor
~50 nt stem loop structure
~22 nt miRNA gene
cleaved
Protein-coding gene 3’-UTR
miRNA
Repress target genes via loose sequence complementarity
• Small non-coding RNA genes involved in post-transcriptional regulation
• Properties of 3’-UTR motifs
Enriched in motifs of length 8
75% end with nucleotide ‘A’
• Sequence properties of miRNAs
Near-perfect complement to 7-mer seed
Many miRNAs start with ‘U’
3’-UTR motif properties
(2) Length distribution
Have we in fact discovered targets of microRNA genes?
• Enriched in motifs of length 8
(3) Sequence composition
• 75% end with nucleotide A
Compare 8-mer sequence to known miRNAs
• Compare 8-mer motifs against all 207 known miRNAs
72 discovered 8-mers match 44% of known miRNA genes
(72 control sequences only match 5%)
miRNA            mature miRNA sequence
hsa-miR-98       UGAGGUAGuaaguuguauuguu
hsa-let-7i       UGAGGUAGuaguuugugcu
hsa-let-7g       UGAGGUAGuaguuuguacagu
hsa-let-7f       UGAGGUAGuagauuguauaguu
hsa-let-7e       UGAGGUAGgagguuguauagu
hsa-let-7c       UGAGGUAGuagguuguaugguu
hsa-let-7b       UGAGGUAGuagguugugugguu
hsa-let-7a       UGAGGUAGuagguuguauaguu
hsa-miR-124a     uUAAGGCACgcggugaaugcca
hsa-miR-92       UAUUGCACuugucccggccugu
hsa-miR-32       UAUUGCACauuacuaaguugc
hsa-miR-30e      UGUAAACAuccuugacugga
hsa-miR-30d      UGUAAACAuccccgacuggaag
hsa-miR-30c      UGUAAACAuccuacacucucagc
hsa-miR-30b      UGUAAACAuccuacacucagc
hsa-miR-30a-5p   UGUAAACAuccucgacuggaagc
hsa-miR-20       UAAAGUGCuuauagugcaggua
hsa-miR-106b     UAAAGUGCugacagugcagau
hsa-miR-9        UCUUUGGUuaucuagcuguauga
hsa-miR-29c      UAGCACCAuuugaaaucgguua
hsa-miR-29b      UAGCACCAuuugaaaucagu
hsa-miR-29a      cUAGCACCAucugaaaucgguu
hsa-let-7d       aGAGGUAGUagguugcauagu
hsa-miR-99a      AACCCGUAgauccgaucuugug
hsa-miR-100      AACCCGUAgauccgaacuugug
hsa-miR-26b      UUCAAGUAauucaggauaggu
hsa-miR-26a      UUCAAGUAauccaggauaggcu
hsa-miR-153      UUGCAUAGucacaaaaguga
hsa-miR-96       UUUGGCACuagcacauuuuugc
• Specifically, 8-mers match 5’-end of miRNA in 95% of cases
Position in miRNA where 8-mers match
8-mer motifs are likely miRNA targets
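The comparison between 8-mer motifs and miRNA 5’ ends can be sketched in a few lines. This is only an illustration, not the procedure used for the slides: it assumes a target site is the DNA reverse complement of miRNA nucleotides 1-8, which is consistent with sites ending in ‘A’ when the miRNA starts with ‘U’. The function name `target_site_for` is hypothetical; the three sequences are taken from the table of mature miRNAs.

```python
def target_site_for(mirna: str, length: int = 8) -> str:
    """Return the DNA site complementary to the first `length` nt of a miRNA.

    Assumption: the site is the reverse complement of the miRNA 5' end.
    """
    complement = {"A": "T", "C": "G", "G": "C", "U": "A"}
    seed = mirna.upper()[:length]
    return "".join(complement[base] for base in reversed(seed))


# Mature miRNA sequences copied from the table of known miRNAs.
mirnas = {
    "hsa-let-7a": "UGAGGUAGuagguuguauaguu",
    "hsa-miR-30a-5p": "UGUAAACAuccucgacuggaagc",
    "hsa-miR-29b": "UAGCACCAuuugaaaucagu",
}

sites = {name: target_site_for(seq) for name, seq in mirnas.items()}
for name, site in sites.items():
    print(name, site)
```

Each of these three miRNAs starts with ‘U’, so each predicted site ends with ‘A’, in line with the sequence-composition observation for the 3’-UTR 8-mers.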
Novel miRNA genes show deep evolutionary conservation
• Using 8-mers to discover novel miRNA genes
• Conserved much further than mammalian lineage
inferred miRNA
Can we use 8-mers to discover miRNA genes ?
8-mer motif: ATATGCAA
miRNA complement: TTGCATAT
Conserved stem loop (5’ end to 3’ end):
ACGGGGAGGTTGAACATCCTGCATAGTGCTGCCAGGAAATCCCTACTTCATACTAAGAGGGGGCTGGCTGGTTGCATATGTAGGATGTCCCATCTCCCGGCC
ACGAGGAGGTTGAACATCCTGCATAGTGCTGCCAGGAAATCCCTACTTCATACTAAGAGGGGGCTGGCTGGTTGCATATGTAGGATGTCCCATCTCCTGGCC
GCAGGGAGGTTGAACATCCTGCATAGTGCTGCCAGGAAATCCCTATTTTATACTA--AGGGGGCTGGCTGGTTGCATATGTAGGATGTCCCATCTCCCCGCC
GCCGGGAGGTTGAACATCCTGCATAGTGCTGCCAGGAAATCCCTATTTCATA-TAAGAGGGGGCTGGCTGGTTGCATATGTAGGATGTCCCATCTCCCAGCC
* ***************************************** ** *** ** **************************************** ***
258 stem loops discovered
• 258 candidate miRNA genes discovered
– 114 correspond to known miRNA genes (of 222)
– 144 novel candidate miRNA genes
• Experimentally tested 12 representative novel miRNAs
– Specifically tested for expression of inferred 22mer using RT-PCR
– Pooled small RNAs from 10 adult human tissues
– 6 of 12 found to be expressed with predicted structure in adults
(developmental tissues may contain additional miRNA genes)
Many of the discovered miRNA genes are likely to be real
Properties of discovered miRNA genes
ATATGCAA
8-mer motif; discovered miRNA gene
Two classes of miRNA genes
222 known miRNA genes: 114 re-discovered, 108 missed
114 re-discovered:
• Many targets (~150 targets)
• Conserved 8-mers
• Slowly evolving
108 missed:
• Few targets (~5 targets)
• No 8-mers
• Rapidly evolving (5-fold higher mutation rate)
Many targets: evolutionary constraint
Co-evolution of miRNA genes and their targets ?
[Plot: number of mutations vs. number of targets]
• What fraction of conserved 8-mers are true miRNA targets ?
– 40% of genes contain at least one discovered 8-mer
– (vs. 25% for appropriate control 8-mers)
Extraordinary importance of miRNA regulation
How many targets do miRNA genes regulate ?
ATATGCAA
8-mer motif; miRNA gene
Inferred 3’-UTR targets
P(conserved) = P(conserved|real) * P(real) + P(conserved|not real) * P(not real)
40% = 1 * p + ¼ * (1-p)
p = 20%
~20% of genes are targeted by miRNAs
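The arithmetic behind the 20% estimate can be checked by solving the mixture equation for p. The sketch below assumes the values stated on the slide (40% observed, P(conserved|real) = 1, P(conserved|not real) = ¼); the function name `fraction_real` is illustrative only.

```python
def fraction_real(p_observed: float, p_given_real: float, p_given_not_real: float) -> float:
    """Solve p_observed = p_given_real * p + p_given_not_real * (1 - p) for p."""
    return (p_observed - p_given_not_real) / (p_given_real - p_given_not_real)


# Values from the slide: 40% of genes contain a discovered 8-mer,
# real targets are always conserved, and controls appear in 25% of genes.
p = fraction_real(p_observed=0.40, p_given_real=1.0, p_given_not_real=0.25)
print(f"estimated fraction of genes targeted: {p:.0%}")
```

Rearranged by hand, 0.40 = p + 0.25 (1 - p) gives 0.15 = 0.75 p, so p = 0.20.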
3’ UTR motifs and post-transcriptional regulation
• Several noteworthy examples
– AATAAA: Poly-A signal
– 6 AT-rich elements: mRNA stability and degradation
– 24 TGTA-rich elements: mRNA localization (PUF-family)
– 29 other, potential targets of RNA-binding proteins
[Histogram: motif length for 8-mer associated vs. other 3’-UTR motifs]
May help systematic study of post-transcriptional regulation
46 motifs are 8-mer associated
Targets of microRNAs
60 motifs left
Targets of RNA-binding proteins
Summary: Regulatory motif discovery
ATATGCAA
Discovered 8-mers: 114 known + 144 new miRNA genes
Target ~20% of human 3’-UTRs
miRNA regulation
• 174 promoter motifs
70 match known TF motifs
115 expression enrichment
60 show positional bias
• 106 motifs in 3’-UTR
Strand specific
8-mers are miRNA-associated
mRNA localization and stability
[Gene diagram: TSS, ATG, Stop, 3’-UTR]
Systematic discovery of regulatory motifs in the human genome
• Frequently occurring, strongly conserved short regulatory signals
Towards human regulatory networks
Global motif co-occurrence map
Reveal co-operating regulators
Initial network of master regulators
Reveal hubs, cascades, network motifs
From sequence-based discovery to dynamic models
[Network figure nodes: Ste12, Tec1, CBF1, Met31, Gcn4, Leu3, rESR1, Abf1, rESR2, Gcr1, Msn2]
Motifs outside promoters and 3’-UTRs
Extract conserved regions in the human genome
Procedure for generating conserved regions:
1. Extract top 5% most conserved regions in the human genome based on PhyloHMM score (142M bp).
2. Remove protein-coding regions.
3. Extract regions with conservation rate above 80% in sliding windows of 20 bp in human/mouse/rat/dog alignment.
4. Remove alignments not in syntenic blocks.
5. Remove alignments not in one-to-one mapping.
6. Mask repeat sequences.
=> 70M bp sequences (2.5% of the human genome)
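The sliding-window step of the procedure (conservation rate above 80% in 20 bp windows) can be sketched as follows. This is a minimal illustration, not the pipeline used for the slides: it assumes that "conservation rate" means the fraction of alignment columns in which all species carry the same non-gap base, and the function name `conserved_windows` is hypothetical.

```python
def conserved_windows(aligned, window=20, min_rate=0.8):
    """Return (start, rate) for windows whose conservation rate exceeds min_rate.

    `aligned` is a list of equal-length aligned rows (e.g. human, mouse, rat, dog).
    Assumption: a column is conserved if every row has the same non-gap base.
    """
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("aligned rows must have equal length")
    identical = [len(set(col)) == 1 and col[0] != "-" for col in zip(*aligned)]
    hits = []
    running = sum(identical[:window])  # conserved columns in the first window
    for start in range(length - window + 1):
        if start > 0:
            # Slide by one column: add the entering column, drop the leaving one.
            running += identical[start + window - 1] - identical[start - 1]
        rate = running / window
        if rate > min_rate:
            hits.append((start, rate))
    return hits


rows = ["ACGTACGTACGTACGTACGTACGTA"] * 4  # four identical 25 bp rows
print(conserved_windows(rows))
```

Keeping a running count avoids recounting all 20 columns at every position, which matters when the input is hundreds of megabases.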
Random chance of occurrence of K-mers of different sizes in conserved regions
Mean number of occurrences in the 70M bp region by chance:
Number of mismatches: 0 1 2
Size 12: 4
Size 15: 1
Size 18: 0.15
Size 20: 0.01
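A rough chance expectation can be computed under a simple background model. The sketch below assumes uniform, independent bases and a single strand, so its values need not reproduce the numbers on the slide, which may rest on a different background model. The names `ball_size` and `expected_hits` are illustrative.

```python
from math import comb


def ball_size(k: int, r: int) -> int:
    """Number of k-mers within r mismatches of a given k-mer."""
    return sum(comb(k, i) * 3**i for i in range(r + 1))


def expected_hits(k: int, r: int, region_bp: int = 70_000_000) -> float:
    """Expected chance occurrences of a k-mer (up to r mismatches) in region_bp.

    Assumptions: uniform i.i.d. bases, one strand, edge effects ignored.
    """
    return region_bp * ball_size(k, r) / 4**k


for k in (12, 15, 18, 20):
    print(k, [round(expected_hits(k, r), 4) for r in (0, 1, 2)])
```

Under these assumptions an exact 12-mer is expected roughly four times in 70M bp, while exact 18-mers and 20-mers are expected far less than once.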
An example K-mer
TTCAGCACCATGGACAGC (18-mer)
Number of occurrences: appears 199 times in the conserved regions --> 1300-fold enrichment.
Enrichment in the conserved regions: in the whole human genome, the 18-mer occurs 446 times (45% of the sites are in conserved regions) --> an enrichment of 18-fold, compared with 2.5%.
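The two enrichment figures for the example 18-mer follow from simple ratios. The sketch below only re-derives them from the numbers on the slide (199 conserved-region sites, 446 genome-wide sites, conserved regions covering 2.5% of the genome); the variable names are illustrative.

```python
conserved_sites = 199        # occurrences inside the conserved regions
genome_sites = 446           # occurrences in the whole human genome
conserved_fraction = 0.025   # conserved regions as a fraction of the genome
fold_vs_chance = 1300        # fold enrichment over chance, as stated on the slide

# Chance expectation implied by the stated 1300-fold enrichment.
implied_expected = conserved_sites / fold_vs_chance

# Share of all genomic sites that fall in conserved regions, and how much
# larger that share is than the 2.5% expected from region size alone.
site_share = conserved_sites / genome_sites
fold_vs_genome = site_share / conserved_fraction

print(f"implied chance expectation: {implied_expected:.2f}")
print(f"share of sites in conserved regions: {site_share:.0%}")
print(f"enrichment relative to genome fraction: {fold_vs_genome:.0f}-fold")
```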
Model motifs by consensus with mismatch
GGCGCTGTCCGTGGTGCTGA
TGCGCTGTCCGTGGTGCTGA
GGAGCTGTCCGTGGTACTGA
GGCACTGGCCGTGGTGCTGA
...
Given a k-mer word w, we consider the ball B(w, r) of radius r around w, where r bounds the distance between w and any other word in the ball.
Example: k=20 w= ‘GGCGCTGTCCGTGGTGCTGA’ r=2
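Membership in the ball B(w, r) can be checked directly. The sketch below assumes the distance is the Hamming distance (number of mismatching positions, no indels); the helper names `hamming` and `in_ball` are illustrative. The instances are the four 20-mers listed on the slide.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length words."""
    if len(a) != len(b):
        raise ValueError("words must have equal length")
    return sum(x != y for x, y in zip(a, b))


def in_ball(word: str, center: str, radius: int) -> bool:
    """True if `word` lies in B(center, radius) under Hamming distance."""
    return len(word) == len(center) and hamming(word, center) <= radius


w = "GGCGCTGTCCGTGGTGCTGA"
instances = [
    "GGCGCTGTCCGTGGTGCTGA",
    "TGCGCTGTCCGTGGTGCTGA",
    "GGAGCTGTCCGTGGTACTGA",
    "GGCACTGGCCGTGGTGCTGA",
]
distances = [hamming(x, w) for x in instances]
print(distances)
print(all(in_ball(x, w, 2) for x in instances))
```

All four instances fall within r = 2 of w, so they would be counted as occurrences of the same consensus.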
Algorithms for searching overrepresented sequences
Word-search based method:
Ver1: Build a suffix tree first, and then enumerate motifs with mismatches (no indels allowed, but the motif search is exhaustive; slow).
Ver2: Hash k-mers first, and extend shared k-mer sites to screen out sites that are similar to each other (allows indels, but with lower sensitivity; fast).
Alignment based method (for long sequences > 30 bp):
1. Blastz human vs human sequences.
2. Extract sequences with multiple hits.
3. Generate consensus sequence for each multiple alignment.
4. Smith-Waterman alignment on the whole genome to identify all hits for each consensus.
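The hashing step of the word-search approach can be illustrated with a plain dictionary index. This sketch covers only the first stage (indexing exact k-mers and keeping the frequent ones); the extension and screening stages are not shown, and the function names `kmer_index` and `repeated_kmers` are hypothetical.

```python
from collections import defaultdict


def kmer_index(sequences, k):
    """Map every exact k-mer to the list of (sequence id, position) sites."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def repeated_kmers(sequences, k, min_count):
    """Keep only k-mers that occur at least min_count times."""
    return {
        kmer: sites
        for kmer, sites in kmer_index(sequences, k).items()
        if len(sites) >= min_count
    }


regions = ["ACGTACGT", "TTACGT"]
print(repeated_kmers(regions, k=4, min_count=3))
```

Sites that share a frequent k-mer would then be the seeds for extension, which is where tolerance to indels would come from in such a scheme.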
Discovered sequences
67 consensus sequences with average size 80 bp, enrichment rate > 0.6, and number > 20.
30 20-mers, enrichment rate > 20%, and number > 20.
46 18-mers, enrichment rate > 30%, number > 30.
A few examples
Sequence              Enrichment  Total  in_gene  in_promot  UTR
TGGAAATGCTGACACAACCT  0.789       21     7        2          0
TTCATTTACACTTAACTCAT  0.739       90     28       5          0
AAAGGCCCTTTTCAGAGCCA  0.729       46     46       0          43
AAATGCTGACAGACCCTTAA  0.700       25     13       4          0
GTCTGTCAGCATTTCCATTA  0.698       35     14       1          0
GGTTCCCATGGCAACAGCCT  0.686       22     10       3          0
AACTCCCATTAATGCTAATG  0.680       21     7        0          0
CAGCATCTGGCTCCTTGGCA  0.667       21     7        0          0
GTTGCCATGGCAACAGCAGC  0.640       32     14       5          2
TTTTATGGCTGAGTTATAAA  0.640       23     11       1          1
CTGTTGCCATGGCAACCAGG  0.630       39     22       11         1
GGTCTCCATGGCAACCAGCC  0.621       15     7        3          0
AGTGGCCTGAAAGAGTTAAT  0.615       22     12       1          0
TTATAATGGAAATGCTGACA  0.604       52     23       2          0
GTCTGTTAGCATTTCCATTA  0.595       23     10       2          0
AATAGGGGTTTATAATGGAA  0.594       27     11       2          1
TCCCATTAATGTTAATGGGA  0.591       23     10       2          0
GCTTTGGTTTCCATGGAAAC  0.583       25     7        2          0
CTGTCAGCATTTCCATTATA  0.556       49     22       4          1
CAGCATTTCCATTACAAACC  0.550       22     10       1          0
CCACAAGAGGGCAGCAGAGG  0.521       32     15       11         0
GTGCTATATAAATGCTAAAT  0.500       21     4        1          0
GACTACAACTCCCAGCAGGC  0.474       42     40       37         1
TCAGCACCACGGACAGCGCC  0.381       44     35       24         0
Context of K-mers: conservation island
Conservation island
Context of K-mers: extended conservation
TGCTGTTCCATGGCAAC: palindromic sequence
Context of K-mers: connected conservation
Histone 3’UTR motif: TGGCTCTGAAAAGAGCCTTT
Identify long sequences based on alignment
Interesting RNA structure of the sequence
GGAAGAAGGGAAGAAATGGCTCACTTTTCAGAGGTGCATTTACTCTTTGACCCACTAGGGTACTATTTAGTGTTCTAGAAGAGGTAATTTAGTAAATTGTACCCCAGTGGCCTGAAAAAGTTAATGCAACTCTGAAAAGTGAGCCATTCAATCGATTTTCCCTATTGCTTTTAAAAAAT
.(((((.(((((((((((((((((((((((((((.((((((.(((.(((.(.(((((.((((((.(((((.((.(((.....))).....)).))))).)))))).)))))..).))).))).)))))).))))))))))))))))))).......))))))))....)))))....... (-74.51)
Conserved instance in the intron of ADCY5
TGCTGTTCCATGGCAAC
Conclusion
• Goldmines of conservation in the human genome
– Short motifs, very frequently occurring
– Longer motifs, many occurrences
– Extremely long elements, near-perfect conservation
• Regulatory role?
– microRNA genes / other non-coding RNAs
– Early development, body-plan formation
– Repeat elements hijacked for regulatory roles?
• Contain strong enhancer regions, scattered across genome
– A lot of un-translated transcription
Regulatory motif evolution
Erez Lieberman
• Genes
• Regulation
• Evolution
Motif disappears, and reappears about 100 bp downstream in S. mikatae
CGTNNNNNRYGAY
Scer GGCTCCATCAATTCGTATCAAGTGATAATT-AT------CACATAAATTATATAATTGTA
Spar AACCCTATTAATTCGTAAGCAGTGATATAA-AT-AGAATAACCTAACTTATACAACTGTA
Smik AACCCTATGAATTCCTAGTAAGCCACCTATTATAGAGATAACCTAAGTAGTATAGTAGTA
Sbay AGCCCTATACATTCGTACCAAGTGATAAAT-ATTATTAAGACCTAACATTTAAAACAGTT
 * * ** **** ** ** * ** ** *** ** * **
CGTNNNNNRYGAY
Scer AACCT------ATTAATAACCCTAAT-ATCATCCTCATGCCCTA-AGAAATATTCAATAT
Spar TCCCTTTTAAACCCCCTAATATTACC-ATCTAAGACCTAACTAATATCAA----GGGAAA
Smik A-CCTATTAAAATTAAAAACGTTAACCATGATGCCCTAACAATATAATGA-----AGGAA
Sbay ACCCT-----ACCCTAAAATGGGAAC-ATAAAACACAAACCCTATATAAACGTAGAGAAA
 *** ** * ** * * * * *
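Whether the consensus CGTNNNNNRYGAY is present in each aligned row can be checked by translating the IUPAC codes into a regular expression. This is a minimal sketch using Python's standard `re` module; the helper names `motif_regex` and `find_motif` are illustrative, and gaps are simply removed before matching, which is an assumption about how instances should be scored.

```python
import re

# IUPAC nucleotide codes used in the slides, as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}


def motif_regex(consensus: str) -> re.Pattern:
    """Compile a consensus into a regex; the lookahead allows overlapping hits."""
    body = "".join(IUPAC[c] for c in consensus.upper())
    return re.compile(f"(?={body})")


def find_motif(consensus: str, aligned_row: str) -> list:
    """Return start positions of the motif in the ungapped row."""
    seq = aligned_row.replace("-", "").upper()
    return [m.start() for m in motif_regex(consensus).finditer(seq)]


ABF1 = "CGTNNNNNRYGAY"
scer_upstream = "GGCTCCATCAATTCGTATCAAGTGATAATT-AT------CACATAAATTATATAATTGTA"
smik_upstream = "AACCCTATGAATTCCTAGTAAGCCACCTATTATAGAGATAACCTAAGTAGTATAGTAGTA"
smik_downstream = "A-CCTATTAAAATTAAAAACGTTAACCATGATGCCCTAACAATATAATGA-----AGGAA"

print(find_motif(ABF1, scer_upstream))    # instance present in S. cerevisiae
print(find_motif(ABF1, smik_upstream))    # no instance at this position in S. mikatae
print(find_motif(ABF1, smik_downstream))  # instance in the second S. mikatae block
```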
[Figure: position of the ABF1 motif upstream of YHR078W in S. cerevisiae, S. paradoxus, S. mikatae, and S. bayanus]
Evidence of motif movement by neutral evolution
Evidence of strand crossing for near-palindromic motifs
ABF1 Crosses the Strand in YHL012W
CGTNNNNNRYGAY
RTCRYNNNNNACG
Scer ---TAAAATAGCATATCGTTAAAAACGACAAACGCGT
Spar ---TAATATAACATCTCGTTAAAAACGACAAACGCGT
Smik TAATGAAATAA-ATCTCGTAAAAAACGACAAACGCGT
Sbay ---TGATCTGCCCTTCCGTATATAATGACAAACGCGT
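The second consensus, RTCRYNNNNNACG, is the reverse complement of CGTNNNNNRYGAY, so strand crossing can be tested by scanning a sequence with both orientations. The sketch below is illustrative only: it handles just the IUPAC codes that appear in this motif, and the names `reverse_complement` and `strands_hit` are hypothetical.

```python
import re

# Complements and regex classes for the IUPAC codes used in this motif.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "R": "Y", "Y": "R", "N": "N"}
CLASSES = {"A": "A", "C": "C", "G": "G", "T": "T",
           "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}


def reverse_complement(consensus: str) -> str:
    """Reverse complement of an IUPAC consensus (R and Y swap, N stays N)."""
    return "".join(COMPLEMENT[c] for c in reversed(consensus))


def strands_hit(consensus: str, aligned_row: str) -> set:
    """Return {'+'}, {'-'}, both, or the empty set for an ungapped row."""
    seq = aligned_row.replace("-", "")
    hits = set()
    for strand, motif in (("+", consensus), ("-", reverse_complement(consensus))):
        pattern = "".join(CLASSES[c] for c in motif)
        if re.search(pattern, seq):
            hits.add(strand)
    return hits


ABF1 = "CGTNNNNNRYGAY"
scer = "---TAAAATAGCATATCGTTAAAAACGACAAACGCGT"
sbay = "---TGATCTGCCCTTCCGTATATAATGACAAACGCGT"

print(reverse_complement(ABF1))
print(sorted(strands_hit(ABF1, scer)))
print(sorted(strands_hit(ABF1, sbay)))
```

In this alignment the S. cerevisiae row matches in both orientations while the S. bayanus row matches only the forward consensus, which is the kind of pattern expected for a near-palindromic site.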
[Figure: ABF1 motif instances upstream of YHL012W in S. cerevisiae, S. paradoxus, S. mikatae, and S. bayanus]
The birth-death process of regulatory motifs
Motif birth
Motif movement
Motif death
Hap4 Hap4
Abf1 Abf1
Msn2
- Footprinti - Informationi
Rate2
Motif birth governed by random process ?
AANNCGGTNNTG
GNNNT
2X
1X
Wider = Faster
ACCT
GT
4X
1X
More Bits: Slow movement
Motif birth governed by random process !
Movement Rate Scales with Theory Given a Promoter Length of 70 (Expressed-C0)
In Control Regions, Substantial Deviation from Expected Rates is Observed (All-C0)
[Plot: observed tunneling rate vs. theory model, Expressed-C0; linear fit y = 70.286x, R² = 0.9994]
Motif birth can be modeled as a largely random process
[Plot: observed motif birth rate vs. motif information content. Red: all regions. Green: bound regions]
Motif aging
[Plots: number of instances vs. information content at Age 0, Age 1, Age 2, Age 3, and Age 4]
What is responsible for shift in distribution ?
3. Death rates governed by selective landscape
Green: Death rate in bound regions
Red: Death rate in unbound regions
Motif death rates drastically different in functional / non-functional regions
Intensity of selection determines motif death rate
[Bar chart: rate of motif death by category: Bound & Cooperative; Bound; Not bound; Cooperative; Neither]
Each level of selective pressure shows distinct death rate
Birth and death events for chromosome arm (16R)
Birth-death process governed by selection landscape
[Plot: strength of selective pressure vs. chromosomal position on chromosome 16 (right arm), with Yap 1 sites marked. Green = motif birth; Red = motif death; Blue = motif aging]
Motif evolution governed by three processes
• Motif birth
– Short motifs can appear by neutral evolution
– Rate of motif birth ~ information content
• Motif aging
– Motif abundance shifts towards bound regions
– Distribution changes gradually over time
• Motif death
– Governed by functional selection landscape
– Predicted by partner motifs + factor binding
Modeling motif evolution can lead to better discovery
Network evolution by duplication
Aviva Presser
• Motif discovery
• Motif evolution
• Network evolution
Networks are dynamic in time and in evolution
Global motif co-occurrence map
Reveal co-operating regulators
Initial network of master regulators
Reveal hubs, cascades, network motifs
How do networks change in the face of gene duplication ?
[Network figure nodes: Ste12, Tec1, CBF1, Met31, Gcn4, Leu3, rESR1, Abf1, rESR2, Gcr1, Msn2]
Evidence of Whole Genome Duplication
Whole Genome Duplications in diverse lineages
Yeast Duplication
Kellis et al. Nature, Apr 8, 2004
Vertebrate Duplication in Fish
Jaillon et al. Nature, Oct 21, 2004
Two rounds of WGD in human!
Dehal et al. PLoS Biology, Oct 2005
The return to haploidy
[Plot: number of genes over time: about 5,000 before WGD, 10,000 after WGD (~100 Myrs ago), then gene loss down to 5,500 today (~500 gained)]
Advantage of WGD may lie in 500 gained genes
Functions of duplicated genes
S. cerevisiae copy 1
S. cerevisiae copy 2
K. waltii
Evidence of accelerated protein divergence ?
• As a group
– Biased towards environment adaptation
– Sugar metabolism, fermentation, regulation
• Individual pairs
– Are new gene functions gained by WGD ?
– How are new gene functions emerging ?
WGD Rate 1
Rate 2
Scenarios for rapid gene evolution
One copy faster (Ohno, 1970)
Both copies faster (Force, 1999)
[Trees: Scer - copy1, Scer - copy2, Kwal under each scenario]
20% of duplicated genes show acceleration
95% of cases: Only one copy faster
Emerging gene functions after duplication
Asymmetric divergence: recognize ancestral / derived
Scer - Sir3 (silencing)
Scer - Orc1 (origin of replication)
Kwal - Orc1
4-fold acceleration
Scer - Ski7 (anti-viral defense)
Scer - Hbs1 (translation initiation)
Kwal - Hbs1
3-fold acceleration
• Origin of replication → silencing
• Translation initiation → anti-viral defense
Asymmetric divergence: distinct functional properties
Gain new function and lose ancestral function
               Ancestral function   Derived function
Gene deletion  Lethal (20%)         Never lethal
Expression     Abundant             Specific (stress, starvation)
Localization   General              Specific (mitochondrion, spores)
Duplication
Asymmetric
Divergence
Asymmetry also found in network connectivity
Interaction loss more likely than gain. One protein maintains ancestral function?
Study network in context of duplication
Duplicated gene
Interaction partners
Network evolution by duplication
[Figure: pre-WGD network to modern network over time, showing duplicate and lost genes]
[Figure: ancestral network motifs to modern network motif. Scenario 1: duplication, then loss (--). Scenario 2: duplication, then gain (++)]
Mechanisms of network motif emergence
Duplication: creation probability p·(1-q), q
Divergence: transition probability (1-Pplus)·(1-Pminus)^3·Pminus^2
Lost interactions; kept interactions; gained interactions
• Pre-Duplication Probabilities
– p = probability of interaction
– q = probability of self-interaction
• Post-Duplication Probabilities
– Pplus = probability of adding an interaction
– Pminus = probability of eliminating an interaction
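The transition probability on the slide can be read as a product over independent edge events. The sketch below is an interpretation, not the authors' stated model: it assumes each possible new edge is gained with probability Pplus, and each duplicated edge is lost with probability Pminus or kept otherwise. The function name and the numeric values of Pplus and Pminus are made up for illustration.

```python
def transition_probability(p_plus, p_minus, gained, not_gained, kept, lost):
    """Probability of one divergence outcome under independent edge events.

    Assumption: exponents count edges gained, not gained, kept, and lost.
    """
    return (
        p_plus ** gained
        * (1 - p_plus) ** not_gained
        * (1 - p_minus) ** kept
        * p_minus ** lost
    )


# The slide's example term (1-Pplus)·(1-Pminus)^3·Pminus^2 corresponds to
# no edge gained, one edge not gained, three edges kept, and two edges lost.
example = transition_probability(
    p_plus=0.1, p_minus=0.5,  # illustrative values, not estimates from the lecture
    gained=0, not_gained=1, kept=3, lost=2,
)
print(example)
```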
All have either 4 or 0 edges across the pairs(4-across or 0-across)
Emergence of post-duplication network motifs
Modeling network evolution
– Parameters:
• Fraction Duplicated vs Spontaneous Generation
• Fraction Edges Deleted
• Number of Edges for Spontaneous Genes
– 90% of timesteps: duplication
• Pick a gene at random
• Duplicate with all its connections
• Delete on average 35% of new connections
– 10% of timesteps: creation
• “Create” a new gene
• Randomly connect it to the existing network with 0 – 20 connections
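The growth model described above can be simulated in a few lines. This is a sketch under stated assumptions rather than the simulation used for the lecture: the starting network (two connected genes), the independent per-edge deletion, the uniform choice of 0 to 20 connections, and the omission of self-interactions are all choices made here for illustration. The function name `simulate` is hypothetical.

```python
import random


def simulate(steps, seed=0, p_duplicate=0.9, p_delete=0.35, max_new_edges=20):
    """Grow an undirected network by duplication (90%) or creation (10%).

    Returns a dict mapping each gene id to the set of its interaction partners.
    """
    rng = random.Random(seed)
    neighbors = {0: {1}, 1: {0}}  # assumed starting network: two connected genes
    for _ in range(steps):
        new = len(neighbors)
        if rng.random() < p_duplicate:
            # Duplication: copy a random gene's edges, deleting ~35% of the copies.
            parent = rng.randrange(new)
            kept = {n for n in neighbors[parent] if rng.random() >= p_delete}
        else:
            # Creation: connect a new gene to 0-20 randomly chosen existing genes.
            n_edges = rng.randint(0, min(max_new_edges, new))
            kept = set(rng.sample(range(new), n_edges))
        neighbors[new] = kept
        for n in kept:
            neighbors[n].add(new)
    return neighbors


network = simulate(steps=200, seed=1)
degrees = sorted((len(v) for v in network.values()), reverse=True)
print(len(network), degrees[:5])
```

Counting small subgraphs (network motifs) or plotting the degree distribution of such simulated networks is one way to compare the model against an observed network.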
Study emergence of network motifs
[Plot: number of appearances (log scale, 10^0 to 10^5) of network motifs, x-axis 0 to 20]
Abundance of network motifs predicted by duplication
Lessons learned
1. Asymmetry in network connectivity
Interaction loss more likely than gain. One protein maintains ancestral network function?
2. High frequency of ohnolog pair interaction
Ancestral self-interaction or gain of ohnolog interaction
– Abundance of ancestral self-interactions
– Gain of ohnolog interaction by proximity due to common interactions
– Selection for ohnologs with interaction, both kept since neither can mutate. Faulty A’ would disrupt polymerization of A-A-A-A, reducing fitness.
3. Abundance of global properties and network hubs
Duplication + asymmetric divergence model vs. traditional preferential attachment model
Model matches local and global network properties
Network evolution: Conclusions
• Asymmetric evolution of network connectivity
– One pair preserves connections
– One pair keeps subset (rarely gains)
• WGD preserves network connectivity
– Duplicates highly interconnected
• Simple model of network evolution
– Estimate rates of interaction gain and loss
– Very good fit to simulated and actual yeast network
• Infer connectivity patterns of ancestral network
– Ancestral network shows increased number of self-interactions
– Self-interacting proteins favored in duplicated network?
Comparative genomics and regulatory networks
• Regulatory motif discovery
– Genome-wide conservation score
– Validated using expression, positional bias, multiplicity
– Pre- and post-transcriptional regulation
• microRNA regulation
– Motif-centric discovery of new microRNA genes
– Many new microRNAs, experimentally validated
– Role of microRNA regulation: 20% of the genome
• Regulatory motif evolution
– Underlying birth-death process, random birth process
– Aging shifts distribution, death governed by selection
– Ability to model motifs for discovery in many species
• Protein network evolution
– Simple duplication-based model
– Motif abundance, degree distribution can be predicted
– Asymmetric divergence, cross-interactions
Acknowledgements
• Human motifs
– Xiaohui Xie
– Eric Lander
– Vamsi Mootha
– Kerstin Lindblad-Toh
– Jun Lu
– E.J. Kulbokas
– Todd R. Golub
• Fungal comparisons
– Bruce Birren
– Christina Cuomo
– James Galagan
– Li-Jun Ma
– Joshua Grochow
• Gene identification
– Mike Lin
– Michael Brent
• Network evolution
– Aviva Presser
– Michael Elovitz
– Roy Kishony
• Motif Evolution
– Erez Lieberman
– Martin Nowak
• Genome-wide phylogeny
– Matt Rasmussen
– Marcia Lara
Who’s actually doing the work
Matt Rasmussen: Whole-genome phylogeny
Xiaohui Xie: Motif finding
Josh Grochow: Protein motifs
Erez Lieberman: Motif evolution
Aviva Presser: Network evolution
Mike Lin: Gene identification
Alex Stark: Fly regulatory networks
Pouya Kheradpour: Human motifs