Whole-genome comparative genomics

Analyzing the human genome

6.095/6.895 - Computational Biology: Genomes, Networks, Evolution

Lecture 21, Dec 6, 2005

Challenges in Computational Biology

[Figure: overview of course topics, with labels "DNA" and "RNA transcript". Topics shown: genome assembly (4), gene finding, regulatory motif discovery, database lookup, gene expression analysis (9), sequence alignment, evolutionary theory (7), cluster discovery (10), Gibbs sampling, protein network analysis (12), regulatory network inference (13), emerging network properties (14), comparative genomics, RNA folding. Example sequence: TCATGCTATTCGTGATAATGAGGATATTTATCATATTTATGATTT]

TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTCAGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAATTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGGATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAATCTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATGAACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATCATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAAAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCAGCATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAG...TTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCA
TAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACATTTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACAACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCAACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAATGCTGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAAT


Promoter motifs

3’ UTR motifs

Exons

Introns

Comparing genomes reveals functional elements

• Ultra-conserved elements

• Protein-coding genes

• Short regulatory motifs

Extensive sequencing of mammalian tree

[Figure: mammalian phylogeny. Clades: Monotremata, Marsupialia, Afrotheria, Xenarthra, Euarchontoglires, Laurasiatheria. Species on the tree: platypus, opossum, sloth, anteater, armadillo, hedgehog, shrew, mole, phyllostomid microbat, microbat (brown bat), false vampire bat, flying fox, megabat (horseshoe bat), whale, dolphin, hippo, cow, pig, llama, horse, rhino, tapir, cat, dog, pangolin, squirrel, mouse, rat, hystricid, guinea pig, rabbit, pika, tree shrew, tree shrew urogale, flying lemur variegatus, flying lemur volans, lemur, mouse lemur, galago, bushbaby, tarsier bancanus, tarsier syrichta, spider monkey, goeldi monkey, marmoset, macaque, baboon, vervet, human, chimpanzee, gorilla, orangutan, gibbon, tenrec, golden mole, short eared elephant shrew, long eared elephant shrew, aardvark, sirenian, hyrax, elephant]

Legend: Black - complete 8X; Red - 2x sequencing

Species labeled separately on the figure: elephant, armadillo, rabbit, bat, tenrec, shrew, cat, hedgehog

Average extra branch length: 0.2 subs/site

Hidden Markov Models for gene finding

Modeling biological sequences

• Ability to emit DNA sequences of a certain type
  – Not exact alignment to previously known gene
  – Preserving ‘properties’ of type, not identical sequence

• Ability to recognize DNA sequences of a certain type (state)
  – What (hidden) state is most likely to have generated observations
  – Find set of states and transitions that generated a long sequence

• Ability to learn distinguishing characteristics of each state
  – Training our generative models on large datasets
  – Learn to classify unlabelled data
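The "recognize" ability above is usually implemented with Viterbi decoding. The sketch below is a minimal illustration on a toy two-state model; the state names (`bg`, `island`) and all probabilities are made up for the example and are not taken from any gene finder discussed in this lecture.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for the observed sequence."""
    # Work in log space so long sequences do not underflow.
    score = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: score[t - 1][p] + math.log(trans_p[p][s]))
            score[t][s] = (score[t - 1][best_prev] + math.log(trans_p[best_prev][s])
                           + math.log(emit_p[s][obs[t]]))
            back[t][s] = best_prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Hypothetical model: AT-rich background ("bg") versus GC-rich region ("island").
states = ["bg", "island"]
start_p = {"bg": 0.5, "island": 0.5}
trans_p = {"bg": {"bg": 0.9, "island": 0.1}, "island": {"bg": 0.1, "island": 0.9}}
emit_p = {
    "bg": {"A": 0.35, "T": 0.35, "C": 0.15, "G": 0.15},
    "island": {"A": 0.15, "T": 0.15, "C": 0.35, "G": 0.35},
}
path = viterbi("ATATATGCGCGCGC", states, start_p, trans_p, emit_p)
print(path)
```

On this toy input the decoded path switches from `bg` to `island` where the sequence composition changes, which is the behavior a gene finder relies on when it segments a genome into exons, introns, and intergenic regions.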

[Figure: a DNA sequence annotated with hidden states: Intergenic, CpG island, Promoter, First exon, Intron, Other exon, Intron]

GGTTACAGGATTATGGGTTACAGGTAACCGTTGTACTCACCGGGTTACAGGATTATGGGTTACAGGTAACCGGTACTCACCGGGTTACAGGATTATGGTAACGGTACTCACCGGGTTACAGGATTGTTACAGG

HMM-based Gene Finding

• GENSCAN (Burge 1997)

• FGENESH (Solovyev 1997)

• HMMgene (Krogh 1997)

• GENIE (Kulp 1996)

• GENMARK (Borodovsky & McIninch 1993)

• VEIL (Henderson, Salzberg, & Fasman 1997)

• TWINSCAN (Brent 2001)

• NSCAN (Brent 2005)

VEIL: Viterbi Exon-Intron Locator

• Contains 9 hidden states or features

• Each state is a complex internal Markovian model of the feature

• Features:
  – Exons, introns, intergenic regions, splice sites, etc.

Exon HMM Model

• Enter: start codon or intron (3’ Splice Site)

• Exit: 5’ Splice Site or one of three stop codons (taa, tag, tga)

VEIL Architecture

[Figure: VEIL state diagram. State labels: Upstream, Start Codon, Exon, Stop Codon, Downstream, 3’ Splice Site, Intron, 5’ Poly-A Site, 5’ Splice Site]

Genie

• Uses a generalized HMM (GHMM)

• Edges in model are complete HMMs

• States can be any arbitrary program

• States are actually neural networks specially designed for signal finding

• J5’ – 5’ UTR

• EI – Initial Exon

• E – Exon, Internal Exon

• I – Intron

• EF – Final Exon

• ES – Single Exon

• J3’ – 3’UTR

[Figure: Genie state diagram. Signal labels: Begin Sequence, Start Translation, Donor splice site, Acceptor splice site, Stop Translation, End Sequence]

Genscan Overview

• Developed by Chris Burge (Burge 1997)

• Characteristics:
  – Designed to predict complete gene structures
    • Introns and exons, promoter sites, polyadenylation signals
  – Incorporates:
    • Descriptions of transcriptional, translational and splicing signals
    • Length distributions (explicit state duration HMMs)
    • Compositional features of exons, introns, intergenic, C+G regions
  – Larger predictive scope
    • Deals with partial and complete genes
    • Multiple genes separated by intergenic DNA in a sequence
    • Consistent sets of genes on either/both DNA strands

• Based on a general probabilistic model of genomic sequence composition and gene structure

Genscan Architecture

• It is based on a generalized HMM (GHMM)

• Models both strands at once
  – Other models: predict on one strand first, then on the other strand
  – Avoids prediction of overlapping genes on the two strands (rare)

• Each state may output a string of symbols (according to some probability distribution)

• Explicit intron/exon length modeling

• Special sensors for cap site and TATA box

• Advanced splice site sensors

Fig. 3, Burge and Karlin 1997

GenScan States

• N - intergenic region

• P - promoter

• F - 5’ untranslated region

• Esngl – single exon (intronless) (translation start -> stop codon)

• Einit – initial exon (translation start -> donor splice site)

• Ek – phase k internal exon (acceptor splice site -> donor splice site)

• Eterm – terminal exon (acceptor splice site -> stop codon)

• Ik – phase k intron: 0 – between codons; 1 – after the first base of a codon; 2 – after the second base of a codon
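The phase definition above reduces to arithmetic modulo 3 on the number of coding bases that precede the intron. A minimal sketch, using hypothetical exon lengths (the helper name `intron_phases` is introduced here for illustration only):

```python
def intron_phases(exon_lengths):
    """Phase of the intron following each exon except the last.

    exon_lengths: coding lengths of consecutive exons, counted from the
    first base of the start codon.
    """
    phases = []
    coding_so_far = 0
    for length in exon_lengths[:-1]:
        coding_so_far += length
        # 0: intron falls between codons; 1: after first base; 2: after second base.
        phases.append(coding_so_far % 3)
    return phases

# Hypothetical gene with four exons.
phases = intron_phases([100, 50, 33, 60])
print(phases)
```

Tracking phase this way is what lets a model with states Ek and Ik keep the reading frame consistent across an intron.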

Classification-based Gene finding

Mike Lin

Gene identification

TTACGGTACCGCTATACCCGAACGTCTAATAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA

M T K S H S E E V I V P E F K

• Intuition
  – Genes are translated in units of 3 nucleotides (codons)
    • Every DNA strand can be translated in 3 reading frames
    • Insertions and deletions may cause frame-shifts
  – Selective pressure on the amino-acid translation
    • Silent substitutions tolerated
    • Codons for similar amino-acids frequently exchanged

• Method
  – Observe patterns of nucleotide change in genes / intergenic regions
  – Develop signatures / tests to discriminate between the two
  – Validate tests with known genes / intergenic regions
  – Use them to revisit the yeast and human genomes
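To make the three reading frames concrete, the sketch below translates the coding portion of the example sequence shown above in each frame, using the standard genetic code. The helper name `translate` is introduced here for illustration.

```python
# Standard genetic code, laid out in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate(dna, frame=0):
    """Translate complete codons starting at offset `frame` (0, 1 or 2)."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)

# Tail of the example sequence, starting at its ATG.
seq = "ATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA"
for frame in range(3):
    print(frame, translate(seq, frame))
```

Frame 0 reproduces the start of the peptide shown on the slide; the other two frames give unrelated translations, which is why a single-base insertion or deletion is so disruptive inside a gene.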

Gene identification

Study known genes

Derive conservation rules

Discover new genes

Overall conservation vs. signatures of divergence

• Not a gene
  – Region of perfect/near-perfect non-coding conservation
  – Scores very well with HMM approaches, ExoniPhy, N-Scan, which measure general levels of local nucleotide conservation

• Real gene
  – Mutations do occur, consistent with constraints under which genes evolve
  – Insertions preserve reading frame. Mutations preserve amino-acid function

Quantify and capture these constraints computationally

human TGC---CCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
mouse TGCCAGCCACGTGACGTGGCTG---TGGCAGCGGCAGCTAAAAAAGAGCTTAAGTAT
rat   TGCCAGCCACGCGACGTGGCCG---TGGCAGCAGCCGCTAAAAAGGAACTTAAGTAC
dog   TGCCAGCCACGCGAGGTGGCGG---------CTGCGGCCAAGAAAGAGCTCAAGTAC
      ***   ** ** ** ***** *         * ** ** ** ** ** ** *****

human TGCCAGCCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
mouse TGCCAGCCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
rat   TGCCAGCCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
dog   TGCCAGCCGCGCGAGGTGGCCGCCTCGGCAGCCGCAGCTAAGAAGGAGCTCAAGTAC
      *********************************************************

Signature 1: Reading frame conservation

             Mutations   Gaps      Frameshifts
Genes        30%         1.3%      0.14%
Intergenic   58%         14%       10.2%
Separation   2-fold      10-fold   75-fold

Signature 2: Distinct patterns of codon substitution

[Figure: two codon substitution matrices, one for genes and one for intergenic regions. Axes: codon observed in species 1 versus codon observed in species 2]

• Codon substitution patterns specific to genes
  – Genetic code dictates substitution patterns
  – Amino acid properties dictate substitution patterns


Evaluating reading frame conservation (RFC)

Scer     CTTCTAGATTTTCATCTT-GTCGATGTTCAAACAACGTGTTA-----TCAGAGAAACAGCTCTATGAGAAATCAGCTGATG
Scer_f1  123123123123123123-12312312312312312312312-----3123123123123123123123123123123123
Spar     TATTCATA-TCTCATCTTCATCAATGTTCAAACAGCGTGTTACAGACACAGAGAAACAGCTTC-TGAGAAGTCAGCCGGTG   RFC
Spar_f1  12312312-312312312312312312312312312312312312312312312312312312-31231231231231231   43%
Spar_f2  23123123-123123123123123123123123123123123123123123123123123123-12312312312312312   34%
Spar_f3  31231231-231231231231231231231231231231231231231231231231231231-23123123123123123   23%
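One way to read the frame-numbering scheme above: number the bases of each sequence by codon position, skipping gaps, and for each of the three possible frame offsets of the second species report the percentage of aligned bases whose codon position agrees with the reference. The sketch below follows that reading on a toy alignment; it is an assumption about the procedure, not the lecture's implementation, and the toy sequences are not from the slides.

```python
def frame_labels(aligned, offset=0):
    """Label each non-gap base with its codon position (0, 1, 2); gaps get None."""
    labels, pos = [], offset
    for ch in aligned:
        if ch == "-":
            labels.append(None)
        else:
            labels.append(pos % 3)
            pos += 1
    return labels

def rfc(ref, other):
    """Percent of aligned bases in frame with ref, for each of 3 offsets of other."""
    ref_labels = frame_labels(ref)
    scores = []
    for offset in range(3):
        other_labels = frame_labels(other, offset)
        pairs = [(a, b) for a, b in zip(ref_labels, other_labels)
                 if a is not None and b is not None]
        in_frame = sum(1 for a, b in pairs if a == b)
        scores.append(100.0 * in_frame / len(pairs))
    return scores

# Toy alignment: a 1-base gap in ref shifts the frame,
# and a later 1-base gap in other restores it.
ref   = "ATGGCA-TTTGACCCAGGT"
other = "ATGGCAGTTTGA-CCAGGT"
scores = rfc(ref, other)
print([round(s, 1) for s in scores])
```

The three percentages always sum to 100, as they do in the Spar_f1/f2/f3 column above. A coding region tends to concentrate its score in one frame, while an intergenic region with arbitrary indels spreads it across all three.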

Evaluating the codon substitution score (CSM)

$p_{X/Y}$ = P(human codon X aligns to mouse codon Y in genes)

$q_{X/Y}$ = P(human codon X aligns to mouse codon Y outside genes)

• Filling in the CSM (axes: Mouse, Human; entries ×10^-5)

         AAA/K   AAG/K   AAC/N   AAT/N   AGA/R   AGG/R   ...   TAA/X
AAA/K     1552     608      12       8      74      26             0
AAG/K      423    2531      11       9      23      73             0
AAC/N        8      13    1368     331       1       1             0
AAT/N        8      12     444    1007       2       1             0
AGA/R       44      22       1       1     664     178             0
AGG/R       15      72       1       1     148     594             0

• Scoring an aligned region

human CTGTTTTTCCCCTTTTGTAGGAAGTCAC
mouse CTGTTTTTCCTCTTTTGTAGTAAGTCAC

Terms for the two substituted codons: $p_{CCC/CTC}$, $q_{CCC/CTC}$, $p_{AGG/AGT}$, $q_{AGG/AGT}$

Coding Score = [formula image not recovered]
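The scoring formula itself did not survive in the slide text. A common way to combine such p and q terms is a log-odds sum over aligned codon pairs, and the sketch below assumes that form. The function name `csm_score`, the `pseudo` fallback, and the probability values are all hypothetical; real values would come from the trained CSM tables.

```python
import math

def csm_score(human, mouse, p, q, pseudo=1e-6):
    """Sum of log(p/q) over aligned codon pairs (assumed scoring rule).

    p, q: dicts mapping (human_codon, mouse_codon) to probabilities.
    Pairs missing from both tables contribute log(1) = 0.
    """
    score = 0.0
    for i in range(0, min(len(human), len(mouse)) - 2, 3):
        pair = (human[i:i + 3], mouse[i:i + 3])
        score += math.log(p.get(pair, pseudo) / q.get(pair, pseudo))
    return score

# Made-up probabilities for the two substituted codon pairs in the example.
p = {("CCC", "CTC"): 2e-4, ("AGG", "AGT"): 1e-5}
q = {("CCC", "CTC"): 1e-4, ("AGG", "AGT"): 1e-4}

human = "CTGTTTTTCCCCTTTTGTAGGAAGTCAC"
mouse = "CTGTTTTTCCTCTTTTGTAGTAAGTCAC"
score = csm_score(human, mouse, p, q)
print(round(score, 4))
```

Because only the two substituted codon pairs are present in these toy tables, the conserved codons contribute nothing, so the score depends only on how the region has diverged.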

Multiple levels of selection

[Figure: two codon substitution matrices, one for genes and one for intergenic regions. Axes: codon observed in species 1 versus codon observed in species 2]

• Multi-level information
  – All positions: overall conservation
  – Exclude conserved triplets: amino-acid sequence
  – Exclude conserved amino-acids: amino-acid properties

Effect of using only off-diagonal CSM positions

• Using full CSM matrix: “Is it conserved like a coding gene?” (false positives)

• Using only off-diagonal positions: “Has it diverged like a coding gene?” (no false positives)

CSM coding score for human/mouse (x-axis) and human/dog (y-axis) in CFTR region

Putting it all together: ExoClass gene finder

• Train Support Vector Machine (SVM) classifier
  – Reading Frame Conservation (RFC) score
  – Codon Substitution Matrix (CSM) coding score
  – Splice signal conservation, ESEs, ESIs
  – Exon length, conservation boundaries

• Apply it systematically to all candidate intervals

• Use full gene model constraints for post-processing

Results in yeast

                          Accept   Reject
~4000 named genes         99.9%    0.1%
~300 intergenic regions   1%       99%
2000 Hypothetical ORFs    1500     500

High sensitivity and specificity

[Figure: comparison of four yeast species: Scer, Spar, Smik, Sbay]

6235 annotated genes; 5695 ‘real’ genes

528 deleted, 43 novel, 280 boundary changes, 34 merged

Results in human ENCODE regions (Human/Mouse)

• High nucleotide sensitivity and specificity
  – Increases with additional species (with some caveats)

• ‘Missed’ exons due to:
  – Sequencing / assembly / alignment problems
  – Rapidly evolving genes: immunity and olfactory families

• ‘Wrong’ exons due to:
  – Novel exons
  – Existing evidence: human / non-human spliced mRNAs
  – New evidence: validated using specific RT-PCR (with MGC)

           Nucl Sn  Nucl Sp  Exon Sn  Exon Sp  Missed  ‘Wrong’  ‘Wrong’ w/ evidence
GENSCAN       85       62       67       49      17       39        17
TWINSCAN      77       88       66       79      26       11        25
SGP2          84       84       72       69      18       20        24
Exoniphy      73       88       57       67      26       10        53
ExoClass      86       87       73       75      17       14        37

Examples in the human

• Example 1: New gene

• Example 2: Deleted gene

• Example 3: Changed exons

• Fully rejected genes typically have only weak evidence

• New exons often supported by existing experimental evidence

• RT-PCR validation of 90 fully novel genes: 50 confirmed

Initial results for the whole human genome

[Figure: comparison of Human, Mouse, Rat and Dog]

• 9862 fully confirmed

• 7,717 refined

• 1,919 not aligned

• 1065 fully rejected

• 454 novel (2591 exons)

Experimental validation

• Select novel predictions with highest specificity
  – Unique in the genome
  – No pseudogenes
  – Absolutely no previous experimental evidence

• Results
  – June 2005: 454 genes, 90 entirely novel
  – RT-PCR validation for specific exon splicing
  – 50 fully validated using pooled tissues

• New validation set
  – Top of the list: 354 genes, 1162 exons
  – … and many more (gene families, lower scores)

Gene Identification: Summary

• Exon-centric approach
  – Identify discriminating variables
  – Observed distinct patterns of nucleotide change
  – Systematically identify all exons in the genome
  – Use gene structure constraints to link them

• Application
  – High sensitivity and specificity (~90%)
  – More powerful than experimental methods
  – Largest reannotation of the yeast genome
  – Reannotation of the human gene set

Regulatory Motif Discovery

Xiaohui Xie

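[Figure: upstream region of GAL1 with binding sites labeled Gal4 (CGG ... CCG, shown twice) and Mig1 (CCCCW), followed by the start of the coding sequence ATGACTAAATCTCATTCAGAAGAAGTGA]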

• Gene regulation
  – Genes are turned on / off in response to changing environments
  – No direct addressing: subroutines (genes) contain sequence tags (motifs)
  – Specialized proteins (transcription factors) recognize these tags

• What makes motif discovery hard?
  – Motifs are short (6-8 bp), sometimes degenerate
  – Can contain any set of nucleotides (no ATG or other rules)
  – Act at variable distances upstream (or downstream) of target gene


Study known motifs

Derive conservation rules

Discover novel motifs

human CTCTTAATGGTACACGTTCTGCCT----AAGTAGCCTAGACGCTCCCGTGCGCCC-GGGG
dog   CTCTTA-CGGGGCACATTCTGCTTTCAACAGTGGGGCAGACGGTCCCGCGCGCCCCAAGG
mouse GTCTTAGGAGGCT-CGATCGCC---------------------GCCTGCATTATT-----
rat   GTCTTAGTTGGCCACGACCTGC---------------------TCATGCATAATT-----
       *****   *    *   *  *                      *  *

human CGGGTAGGCCTGGCCGAAAATCTCTCCCGCGCGCCTGACCTTGGGTTGCCCCAGCCAGGC
dog   CAGGC---CCGGGCTGCAGACCTGCCCTGAGGGAATGACCTTGGGCGGCCGCAGCGGGGC
mouse --------------CACAAGCCTGTGGCGCGC-CGTGACCTTGGGCTGCCCCAGGCGGGC
rat   --------------CACAAGTTTCTC---TGC-CCTGACCTTGGGTTGCCCCAGGCGAG-
                       *    *       *    **********  *** ***    *

human TGCGGGCCCGAGACCCCCG-------------------GGCCTCCCTGCCCCCCGCGCCG
dog   CGCGGGCCCAGGCCCCCCTCCCTCCCTCCCTCCCTCCCTCCCTCCCTGCCCCCCGGACCG
mouse TGCAGGCTCACCACCCCGTCTTTTCT---------------------GCTTTTCGAGTCG
rat   -GCATACACCCCGCCTTTTTTTTTTTTTT---------TTTTTTTTTGCCGTTCAAG-AG
       **   * *    **                                **    *     *

Known motifs are preferentially conserved

[Motif instances highlighted in the alignment: Gabpa, Err]

Is this enough to discover motifs? No.

Known motifs are frequently conserved

• Across the human promoter regions, the Err motif:
  – appears 434 times
  – is conserved 162 times (in Human, Dog, Mouse and Rat)

Conservation rate: 37%

• Compare to random control motifs
  – Conservation rate of control motifs: 6.8%
  – Err enrichment: 5.4-fold
  – Err p-value < 10^-50 (25 standard deviations under binomial)
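The numbers on this slide can be checked with a few lines of arithmetic. The sketch below uses the counts and control rate quoted above and a normal approximation to the binomial for the z-score; the variable names are illustrative.

```python
import math

n, k = 434, 162   # Err motif: total and conserved occurrences (from the slide)
r = 0.068         # conservation rate of random control motifs (from the slide)

rate = k / n                       # observed conservation rate
enrichment = rate / r              # fold enrichment over controls
expected = n * r                   # conserved count expected by chance
sd = math.sqrt(n * r * (1 - r))    # binomial standard deviation
z = (k - expected) / sd            # standard deviations above expectation

print(f"rate={rate:.3f} enrichment={enrichment:.2f} z={z:.1f}")
```

The observed rate rounds to 37% and the z-score comes out at about 25 standard deviations, matching the slide. The fold enrichment lands near 5.4 to 5.5 depending on whether the rounded 37% or the raw ratio is used.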

Motif Conservation Score (MCS)

MCS distribution of all 6-mers shows excess conservation

– High scoring patterns include known motifs
– Excess specific to promoters and 3’-UTRs (not introns)
– For MCS > 6, estimate 97% specificity

[Figure: two panels plotting motif density (y-axis) against Motif Conservation Score (MCS, x-axis)]

Use MCS to discover new motifs

Select motifs with MCS > 6.0, cluster

Hill-climbing in sequence space

• Seed selection
  – Three mini-motif conservation criteria (CC1, CC2, CC3)

• Motif extension
  – Non-random conservation of neighbors

• Motif collapsing
  – Merge neighbors using hierarchical clustering, avg-max-linkage

• Re-scoring complex motifs
  – Motif conservation score for full motifs (MCS)

Test 1: Intergenic conservation

[Figure: scatter plot of conserved count (y-axis) against total count (x-axis) for mini-motifs; CGG-11-CCG is highlighted]

Test 1: Selecting mini-motifs

• Estimate basal rate of conservation
  – Expected conservation rate at the evolutionary distances observed
  – Average conservation rate of non-outlier mini-motifs

• Score conservation of mini-motif
  – k: conserved motif occurrences
  – n: total motif occurrences
  – r: basal conservation rate
  – Evaluate binomial probability of observing k successes out of n trials

• Assign z-score to each mini-motif
  – Bulk of distribution is symmetric
  – Estimate specificity as (R-L)/R
  – Select cutoff: 5.0 sigma
  – 1190 mini-motifs, 97.5% non-random

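Binomial score: $P(k) = \binom{n}{k} p^k (1-p)^{n-k}$, where p is the basal conservation rate r

[Figure: distribution of binomial scores across mini-motifs. Labels: conservation rate r, N, left tail, right tail, cutoff, specificity]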
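The specificity estimate (R-L)/R can be read as follows: if the null distribution of z-scores is symmetric, the left tail beyond the cutoff estimates how many right-tail hits are noise. A minimal sketch under that reading, with made-up z-scores and a hypothetical helper name:

```python
def specificity_at(z_scores, cutoff):
    """Estimate specificity as (R - L) / R.

    R: number of scores at or above +cutoff (right tail).
    L: number of scores at or below -cutoff (left tail), taken as an
       estimate of how many right-tail scores arise by chance.
    """
    right = sum(1 for z in z_scores if z >= cutoff)
    left = sum(1 for z in z_scores if z <= -cutoff)
    if right == 0:
        return None  # nothing selected, specificity undefined
    return (right - left) / right

# Made-up z-scores for illustration only.
z_scores = [-5.5, -3.0, -1.0, 0.0, 1.0, 5.2, 6.1, 7.4, 9.0]
spec = specificity_at(z_scores, 5.0)
print(spec)
```

With four scores in the right tail and one in the left, the estimate is 0.75. The same calculation on the real distribution is what yields the 97.5% figure at the 5.0 sigma cutoff.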

Test 2: Intergenic vs. Coding

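[Figure: scatter plot of intergenic conservation (y-axis) against coding conservation (x-axis) for mini-motifs; CGG-11-CCG is highlighted, and one region is labeled “Higher conservation in genes”]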

Test 3: Upstream vs. Downstream

[Figure: scatter plot of upstream conservation (y-axis) against downstream conservation (x-axis) for mini-motifs. Labels: “Most patterns”, CGG-11-CCG, “Downstream motifs?”]

Constructing full motifs

[Figure: pipeline from mini-motifs to full motifs. Mini-motifs selected by Test 1, Test 2 and Test 3 are each extended and collapsed, then merged: 2,000 mini-motifs become 72 full motifs. Example motif logos are not recoverable from the extracted text]

Extending mini-motifs

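• Separate conserved and non-conserved instances

[Figure: mini-motif instances (shown as “CT A C GA6” and “CT x x GA6”) split into a causal set and a random set; flanking positions are annotated with degenerate codes (R G W versus Y H S)]

• Find maximally discriminating neighborhood

[Figure: counts N1, N2 and M1, M2 for the two sets]

• Evaluate non-randomness of neighborhood
  – chi-square contingency test on [N1,M1], [N2,M2]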
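The contingency test named above can be written out directly. The sketch below computes the Pearson chi-square statistic for a 2x2 table; treating N1, N2 as counts from the conserved set and M1, M2 as the matching counts from the other set is an assumption about the slide's notation, and the example counts are made up.

```python
def chi_square_2x2(n1, m1, n2, m2):
    """Pearson chi-square statistic for the table [[n1, m1], [n2, m2]] (1 d.o.f.)."""
    total = n1 + m1 + n2 + m2
    rows = (n1 + m1, n2 + m2)
    cols = (n1 + n2, m1 + m2)
    observed = ((n1, m1), (n2, m2))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of rows and columns.
            expected = rows[i] * cols[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

# Made-up counts: the neighbor base is present in 30 of 40 conserved
# instances but only 10 of 40 non-conserved instances.
stat = chi_square_2x2(30, 10, 10, 30)
print(stat)
```

A large statistic (here 20, far above the 3.84 threshold for p = 0.05 at one degree of freedom) indicates that the neighboring position discriminates between the two sets and is a candidate for extending the motif.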

Collapsing similar motifs

• Motif similarity: sequence and genomic positions
  – Motifs share similar sequences, count bits in common
  – Motifs appear conserved in similar sets of regions

[Figure: overlap of regions with motif 1 and regions with motif 2; the intersection is the set of regions containing both motifs]

• Collapsing: hierarchical clustering
  – Sort the order of joins by decreasing similarity
  – Average max-linkage cluster similarity score

Systematically test candidate patterns

[Figure: pipeline: all potential motifs, evaluate MCS, cluster similar motifs. Result: 174 motifs in promoters, 106 motifs in 3’ UTRs. An example degenerate motif with a central gap is shown]

• Enumerate
  – Length between 6 and 15 nt, allow central gap
  – 11 letter alphabet (A C G T, 2-fold codes, N)

• Score
  – Compute binomial score (conserved vs. total)
  – Select MCS > 6.0 (specificity 97%)

• Cluster
  – Sequence similarity
  – Overlapping occurrences
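The 11-letter alphabet above corresponds to the four bases, the six standard IUPAC 2-fold codes, and N. A minimal sketch of matching a degenerate motif against a sequence with that alphabet follows; the function name and the test sequence are made up for illustration, while the motif TGAnTCA is one of the discovered motifs listed in this lecture.

```python
# Four bases, six IUPAC 2-fold degenerate codes, and N: 11 letters in total.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "N": "ACGT",
}

def find_motif(motif, seq):
    """Return start positions where the degenerate motif matches seq."""
    motif, seq = motif.upper(), seq.upper()
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        if all(seq[i + j] in IUPAC[m] for j, m in enumerate(motif)):
            hits.append(i)
    return hits

# Made-up sequence containing two instances of TGAnTCA.
hits = find_motif("TGAnTCA", "GGTGACTCATTTGAGTCAGG")
print(hits)
```

Counting such hits in one genome gives the "total" count for a candidate motif; repeating the match in the aligned positions of the other species gives the "conserved" count that feeds the binomial score.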

Are these real ?

Functions of discovered motifs

Evidence of motif function

• Promoter motifs:
  (1) Comparison to known motifs
  (2) Distance from TSS
  (3) Expression enrichment

[Figure: gene diagram with promoter, ATG, Stop and 3’-UTR; 174 motifs in promoters, 106 motifs in 3’-UTRs]


(1) Promoter motifs match known TF binding sites

• Compare discovered motifs to TRANSFAC database of 125 known motifs

MCS    Discovered motif   Factor    Known motif
46.8   GGGCGGR            SP-1      GGGCGGG
34.7   GCCATnTTg          YY1       GCCATnTT
32.7   CACGTG             MYC       SCACGTG
31.2   GATTGGY            NF-Y      YSATTGGYY
30.8   TGAnTCA            AP-1      CTGASTCA
29.7   GGGAGGRR           MAZ       GGGGAGGG
29.5   TGACGTMR           CREB      TGACGTMA
26.0   CGGCCATYK          NF-MUE1   CGGCCATCT
25.0   TGACCTTG           ERR?      TGACCTTG
22.6   CCGGAARY           ELK-1     CCGGAART
19.8   SCGGAAGY           GABP      VCCGGAAG
17.9   CATTTCCK           STAT1     CAnTTCCS
14.9   TTGTTT             SRY       KTWGTTT
14.6   TATAAA             TBP       TATAAATW
14.2   RTAAACA            FOXO1     RWAAACAA
13.9   SMGGAAGT           PEA3      MGGAWGT
12.6   YYATTGTT           SOX-5     ATTGTT
12.5   TCACGTG            SREBP-1   ATCACGTGAY
12.4   YATGYAAAT          OCTAMER   ATGCAAATnA
12.2   GGGnnTTTCC         P65       GGGRATTTCC
11.9   TGACGTGK           ATF6      TGACGTGG
11.7   TTAYRTAA           E4BP4     RTTACRTAAY
11.0   CCAWWnAAGG         SRF       CCAWATAWGGM
10.7   TAAWWATAG          MEF-2     YTAAAWATAGCY

• 55% of TRANSFAC motifs match discovered motifs

• 45% of discovered motifs match TRANSFAC motifs (only 2% of control sequences match TRANSFAC motifs)


(2) Promoter motifs show preferred distance to TSS

32% of discovered motifs show strong positional bias

[Figure: for each of the 174 discovered motifs (rows), site positions plotted by distance from TSS. Panels: conserved motif sites in all four species; motif instances in human. Example peaks are labeled Motif 8 and Motif 4, with positions -81 and -63]

• Discovered motifs occur preferentially within 200 bp of the transcription start site

• Individual motifs show strong peaks, regardless of conservation

(3) Promoter motifs enriched in specific tissues

70% of motifs show significant enrichment in at least one tissue

[Figure: tissue enrichment of promoter motifs, with row groups labeled “New motifs” and “Known TFs”]

Summary for promoter motifs

Rank | Discovered motif | Known TF motif | Tissue enrichment | Distance bias

1 RCGCAnGCGY NRF-1 Yes Yes

2 CACGTG MYC Yes Yes

3 SCGGAAGY ELK-1 Yes Yes

4 ACTAYRnnnCCCR Yes Yes

5 GATTGGY NF-Y Yes Yes

6 GGGCGGR SP1 Yes Yes

7 TGAnTCA AP-1 Yes

8 TMTCGCGAnR Yes Yes

9 TGAYRTCA ATF3 Yes Yes

10 GCCATnTTG YY1 Yes

11 MGGAAGTG GABP Yes Yes

12 CAGGTG E12 Yes

13 CTTTGT LEF1 Yes

14 TGACGTCA ATF3 Yes Yes

15 CAGCTG AP-4 Yes

16 RYTTCCTG C-ETS-2 Yes Yes

17 AACTTT IRF1(*) Yes

18 TCAnnTGAY SREBP-1 Yes Yes

19 GKCGCn(7)TGAYG Yes Yes

20 GTGACGY E4F1 Yes Yes

21 GGAAnCGGAAnY Yes Yes

22 TGCGCAnK Yes Yes

23 TAATTA CHX10 Yes

24 GGGAGGRR MAZ Yes

25 TGACCTY ERRA Yes

• 174 promoter motifs
  – 70 match known TF motifs
  – 115 show expression enrichment
  – 60 show positional bias
  – 75% have evidence

• Control sequences
  – < 2% match known TF motifs
  – < 5% expression enrichment
  – < 3% show positional bias
  – < 7% false positives

Most discovered motifs are likely to be functional

[Five motifs in the table are marked “New”]

What about 3’-UTR motifs ?

• Sequence properties of 3’-UTR motifs

• Regulatory roles of 3’-UTR motifs

[Figure: gene diagram with TSS, ATG, Stop and 3’-UTR; 174 motifs in promoters, 106 motifs in 3’-UTRs]

Directionality of 3’-UTR motifs

[Figure: gene diagrams (ATG to Stop) showing motif instances on both strands]

• Promoter motifs: also conserved on reverse strand

• 3’-UTR motifs: NOT conserved on reverse strand

3’-UTR motifs likely to act post-transcriptionally

[Figure: gene diagrams (ATG to Stop) with motif instances on one or both strands]

• DNA level – both strands are available

• RNA level – only coding strand is available

[Figure: reverse strand conservation (y-axis) against forward strand conservation (x-axis), shown for promoter motifs and for 3’-UTR motifs]

What are microRNAs (miRNAs)?

• Endogenous small non-coding RNA

• ~22nt in length

• Located in genomic loci that can produce fold-back structures

• Often conserved (but conservation may not be required)

miRNA and siRNA

[Figure: pathway from a miRNA gene / miRNA host gene, or from double stranded RNA formation, to a small RNA (5’ P, 3’ OH) loaded into the RISC complex]

miRNA & siRNA as Negative Regulators of Gene Expression

[Figure: miRNA and siRNA pairing with a target mRNA; example: lin-4 RNA (22 nt) pairing with lin-14 mRNA]

• Near perfect match: degradation of target

• Partial match: inhibition of translation, degradation of target

• Other effects shown: chromosomal silencing, off-target effect

Properties of microRNA genes (miRNAs)

Properties similar to the motifs we have discovered

[Figure: from DNA to a ~100 nt precursor, a ~50 nt stem loop structure, and the cleaved ~22 nt miRNA gene product; the miRNA represses target genes via loose sequence complementarity to the 3’-UTR of a protein-coding gene]

• Small non-coding RNA genes involved in post-transcriptional regulation

• Properties of 3’-UTR motifs
  – Enriched in motifs of length 8
  – 75% end with nucleotide ‘A’

• Sequence properties of miRNAs
  – Near-perfect complement to 7-mer seed
  – Many miRNAs start with ‘U’

3’-UTR motif properties

(2) Length distribution

Have we in fact discovered targets of microRNA genes?

• Enriched in motifs of length 8

(3) Sequence composition

• 75% end with nucleotide A

Compare 8-mer sequence to known miRNAs

• Compare 8-mer motifs against all 207 known miRNAs
  – 72 discovered 8-mers match 44% of known miRNA genes
  – (72 control sequences only match 5%)

miRNA            mature miRNA sequence
hsa-miR-98       UGAGGUAGuaaguuguauuguu
hsa-let-7i       UGAGGUAGuaguuugugcu
hsa-let-7g       UGAGGUAGuaguuuguacagu
hsa-let-7f       UGAGGUAGuagauuguauaguu
hsa-let-7e       UGAGGUAGgagguuguauagu
hsa-let-7c       UGAGGUAGuagguuguaugguu
hsa-let-7b       UGAGGUAGuagguugugugguu
hsa-let-7a       UGAGGUAGuagguuguauaguu
hsa-miR-124a     uUAAGGCACgcggugaaugcca
hsa-miR-92       UAUUGCACuugucccggccugu
hsa-miR-32       UAUUGCACauuacuaaguugc
hsa-miR-30e      UGUAAACAuccuugacugga
hsa-miR-30d      UGUAAACAuccccgacuggaag
hsa-miR-30c      UGUAAACAuccuacacucucagc
hsa-miR-30b      UGUAAACAuccuacacucagc
hsa-miR-30a-5p   UGUAAACAuccucgacuggaagc
hsa-miR-20       UAAAGUGCuuauagugcaggua
hsa-miR-106b     UAAAGUGCugacagugcagau
hsa-miR-9        UCUUUGGUuaucuagcuguauga
hsa-miR-29c      UAGCACCAuuugaaaucgguua
hsa-miR-29b      UAGCACCAuuugaaaucagu
hsa-miR-29a      cUAGCACCAucugaaaucgguu
hsa-let-7d       aGAGGUAGUagguugcauagu
hsa-miR-99a      AACCCGUAgauccgaucuugug
hsa-miR-100      AACCCGUAgauccgaacuugug
hsa-miR-26b      UUCAAGUAauucaggauaggu
hsa-miR-26a      UUCAAGUAauccaggauaggcu
hsa-miR-153      UUGCAUAGucacaaaaguga
hsa-miR-96       UUUGGCACuagcacauuuuugc

• Specifically, 8-mers match 5’-end of miRNA in 95% of cases

Position in miRNA where 8-mers match

8-mer motifs are likely miRNA targets

Novel miRNA genes show deep evolutionary conservation

• Using 8-mers to discover novel miRNA genes

• Conserved much further than mammalian lineage

inferred miRNA

… ………

Can we use 8-mers to discover miRNA genes ?

TTGCATAT ATATGCAA

8-mer motif, miRNA complement

Conserved stem loop

3’ end

5’ end

5’ end 3’ end

ACGGGGAGGTTGAACATCCTGCATAGTGCTGCCAGGAAATCCCTACTTCATACTAAGAGGGGGCTGGCTGGTTGCATATGTAGGATGTCCCATCTCCCGGCC
ACGAGGAGGTTGAACATCCTGCATAGTGCTGCCAGGAAATCCCTACTTCATACTAAGAGGGGGCTGGCTGGTTGCATATGTAGGATGTCCCATCTCCTGGCC
GCAGGGAGGTTGAACATCCTGCATAGTGCTGCCAGGAAATCCCTATTTTATACTA--AGGGGGCTGGCTGGTTGCATATGTAGGATGTCCCATCTCCCCGCC
GCCGGGAGGTTGAACATCCTGCATAGTGCTGCCAGGAAATCCCTATTTCATA-TAAGAGGGGGCTGGCTGGTTGCATATGTAGGATGTCCCATCTCCCAGCC
* ***************************************** ** *** ** **************************************** ***

258 stem loops discovered

• 258 candidate miRNA genes discovered
– 114 correspond to known miRNA genes (of 222)
– 144 novel candidate miRNA genes

• Experimentally tested 12 representative novel miRNAs
– Specifically tested for expression of inferred 22-mer using RT-PCR
– Pooled small RNAs from 10 adult human tissues
– 6 of 12 found to be expressed with predicted structure in adults

(developmental tissues may contain additional miRNA genes)

Many of the discovered miRNA genes are likely to be real

Properties of discovered miRNA genes

ATATGCAA

8-mer motif

Discovered miRNA gene

Two classes of miRNA genes

222 known miRNA genes

114 re-discovered
• Many targets (~150 targets)
• Conserved 8-mers
• Slowly evolving

108 missed
• Few targets (~5 targets)
• No 8-mers
• Rapidly evolving (5-fold higher mutation rate)

[Plot: number of mutations vs. number of targets]

Many targets: evolutionary constraint

Co-evolution of miRNA genes and their targets ?

• What fraction of conserved 8-mers are true miRNA targets ?
– 40% of genes contain at least one discovered 8-mer
– (vs. 25% for appropriate control 8-mers)

Extraordinary importance of miRNA regulation

How many targets do miRNA genes regulate ?

ATATGCAA

8-mer motif

miRNA gene

Inferred 3’-UTR targets

P(conserved) = P(conserved|real) * P(real) + P(conserved|not real) * P(not real)

40% = 1 * p + ¼ * (1-p)

p = 20%

~20% of genes are targeted by miRNAs
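The 20% figure follows from solving the mixture equation for p. A minimal sketch of that arithmetic, assuming (as the slide does) that every real target gene carries a conserved 8-mer and that the 25% control rate is the background; the function and parameter names are illustrative, not from the lecture:

```python
def fraction_real(p_observed, p_given_real=1.0, p_given_not_real=0.25):
    """Solve p_observed = p_given_real * p + p_given_not_real * (1 - p) for p."""
    return (p_observed - p_given_not_real) / (p_given_real - p_given_not_real)


# 40% of genes contain a discovered 8-mer, vs. 25% for control 8-mers.
p = fraction_real(0.40)
print(f"Estimated fraction of genes targeted by miRNAs: {p:.0%}")
```

Because the estimate is a ratio of small differences, it is sensitive to the assumed background rate: a control rate of 30% instead of 25% would give about 14%.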

3’ UTR motifs and post-transcriptional regulation

• Several noteworthy examples
– AATAAA: Poly-A signal
– 6 AT-rich elements: mRNA stability and degradation
– 24 TGTA-rich elements: mRNA localization (PUF-family)
– 29 other, potential target of RNA-binding proteins

8-mer associated

Other 3’-UTR motifs

Motif length

May help systematic study of post-transcriptional regulation

46 motifs are 8-mer associated

Targets of microRNAs

60 motifs left

Targets of RNA-binding proteins

Summary: Regulatory motif discovery

ATATGCAA

discovered 8-mers

114 known + 144 new miRNA genes

Target ~20% of human 3’-UTRs

miRNA regulation

• 174 promoter motifs

70 match known TF motifs

115 expression enrichment

60 show positional bias

• 106 motifs in 3’-UTR

Strand specific

8-mers are miRNA-associated

mRNA localization and stability

TSS 3’-UTR ATG Stop

Systematic discovery of regulatory motifs in the human

• Frequently occurring, strongly conserved short regulatory signals

Towards human regulatory networks

Global motif co-occurrence map

Reveal co-operating regulators

Initial network of master regulators

Reveal hubs, cascades, network motifs

From sequence-based discovery to dynamic models

[Figure: network of regulators: Ste12, Tec1, CBF1, Met31, Gcn4, Leu3, rESR1, Abf1, rESR2, Gcr1, Msn2]

Motifs outside promoters and 3’-UTRs

Extract conserved regions in the human genome

Procedure for generating conserved regions:

1. Extract top 5% most conserved regions in the human genome based on PhyloHMM score (142M bp).
2. Remove protein-coding regions.
3. Extract regions with conservation rate above 80% in sliding windows of 20 bp in human/mouse/rat/dog alignment.
4. Remove alignments not in syntenic blocks.
5. Remove alignments not in one-to-one mapping.
6. Mask repeat sequences.

=> 70M bp sequences (2.5% of the human genome)
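The sliding-window filter in step 3 of the procedure can be sketched as follows. This is only an illustration: it assumes "conservation rate" means the fraction of alignment columns in which all species carry the same non-gap base, which the slide does not spell out, and the function name and arguments are hypothetical.

```python
def conserved_windows(aligned, window=20, min_rate=0.8):
    """Return start offsets of windows whose conservation rate is >= min_rate.

    `aligned` is a list of equal-length aligned sequences (one per species).
    A column counts as conserved when every sequence has the same non-gap base
    (an assumption made for this sketch).
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")

    identical = [
        1 if col[0] != "-" and all(base == col[0] for base in col) else 0
        for col in zip(*aligned)
    ]

    starts = []
    count = sum(identical[:window])  # conserved columns in the first window
    for start in range(length - window + 1):
        if start > 0:
            # Slide by one column: add the entering column, drop the leaving one.
            count += identical[start + window - 1] - identical[start - 1]
        if count / window >= min_rate:
            starts.append(start)
    return starts


# Toy four-species alignment: the last sequence differs in its first 5 columns.
toy = ["A" * 24, "A" * 24, "A" * 24, "C" * 5 + "A" * 19]
print(conserved_windows(toy))
```

The running count keeps the scan linear in the alignment length, which matters at genome scale.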

Random chance of occurrence of K-mers with different size in conserved regions

Mean number of occurrences in the 70M bp region by chance (number of mismatches: 0, 1, 2):

Size   Mean occurrences
12     4
15     1
18     0.15
20     0.01

An example K-mer

TTCAGCACCATGGACAGC (18-mer)

Number of occurrences: appears 199 times in the conserved regions --> 1300-fold enrichment.

Enrichment in the conserved regions: in the whole human genome, the 18-mer occurs 446 times (45% of the sites are in conserved regions) --> an enrichment of 18-fold, compared with 2.5%.
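The two enrichment figures for this 18-mer can be checked with a few lines of arithmetic. The sketch below assumes that the 1300-fold value compares the 199 observed sites with the chance expectation of 0.15 listed for size 18, which is an inference from the numbers rather than something the slide states; the variable names are illustrative.

```python
conserved_hits = 199            # occurrences in the conserved regions
expected_by_chance = 0.15       # assumed chance expectation for an 18-mer
genome_hits = 446               # occurrences in the whole human genome
conserved_genome_share = 0.025  # conserved regions are 2.5% of the genome

# Enrichment over chance within the conserved regions.
fold_vs_chance = conserved_hits / expected_by_chance

# Share of all genomic sites that fall in conserved regions, and how that
# compares with the 2.5% expected if sites were spread uniformly.
share_in_conserved = conserved_hits / genome_hits
fold_vs_genome = share_in_conserved / conserved_genome_share

print(f"{fold_vs_chance:.0f}-fold over chance")
print(f"{share_in_conserved:.0%} of sites in conserved regions")
print(f"{fold_vs_genome:.0f}-fold over the uniform expectation")
```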

Model motifs by consensus with mismatch

GGCGCTGTCCGTGGTGCTGA
TGCGCTGTCCGTGGTGCTGA
GGAGCTGTCCGTGGTACTGA
GGCACTGGCCGTGGTGCTGA
...

Given a k-mer word w, we consider the ball B(w, r) of radius r around w, where r bounds the distance (number of mismatches) between w and any other word in the ball.

Example: k=20 w= ‘GGCGCTGTCCGTGGTGCTGA’ r=2
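A small sketch of the ball B(w, r) for the example above, assuming the distance is the Hamming distance (mismatches only, no indels). The helper names are illustrative and not from the lecture.

```python
from math import comb


def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(x != y for x, y in zip(a, b))


def in_ball(w, candidate, r):
    """True if candidate lies in B(w, r)."""
    return len(candidate) == len(w) and hamming(w, candidate) <= r


def ball_size(k, r, alphabet=4):
    """Number of distinct k-mers within Hamming distance r of a given k-mer."""
    return sum(comb(k, i) * (alphabet - 1) ** i for i in range(r + 1))


w = "GGCGCTGTCCGTGGTGCTGA"
sites = [
    "TGCGCTGTCCGTGGTGCTGA",
    "GGAGCTGTCCGTGGTACTGA",
    "GGCACTGGCCGTGGTGCTGA",
]
distances = [hamming(w, s) for s in sites]
print(distances, all(in_ball(w, s, 2) for s in sites))
print(ball_size(20, 2))
```

All three listed sites fall inside the radius-2 ball, so they count as instances of the same motif under this model.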

Algorithms for searching overrepresented sequences

Word-search based method:

Ver1: Build suffix tree first, and then enumerate motifs with mismatches (does not allow indels, but motif search is exhaustive; slow).
Ver2: Hash k-mers first, and extend shared k-mer sites to screen out sites that are similar to each other (allows indels, but with lower sensitivity; fast).

Alignment based method (for long sequences > 30 bp):

1. Blastz human vs human sequences.
2. Extract sequences with multiple hits.
3. Generate consensus sequence for each multiple alignment.
4. Smith-Waterman alignment on the whole genome to identify all hits for each consensus.
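The first step of the hash-based word search (Ver2) can be sketched as a k-mer index. This covers only the hashing step; the extension of shared sites and the similarity screening are not shown, and the function names and the `min_sites` threshold are assumptions for illustration.

```python
from collections import defaultdict


def index_kmers(sequences, k):
    """Map each k-mer to the (sequence index, offset) sites where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((seq_id, offset))
    return index


def shared_kmers(sequences, k, min_sites=2):
    """Keep only k-mers that occur at min_sites or more sites."""
    return {
        kmer: sites
        for kmer, sites in index_kmers(sequences, k).items()
        if len(sites) >= min_sites
    }


toy = ["ACGTACGT", "TTACGTAA"]
shared = shared_kmers(toy, k=4)
print(sorted(shared))
```

Sites that share a k-mer are the seeds; an extension step would then compare the flanking sequence around each pair of seeds.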

Discovered sequences

67 consensus sequences with average size 80 bp, enrichment rate > 0.6, and number > 20.

30 20-mers enrichment rate > 20%, and number > 20.

46 18-mers, enrichment rate > 30%, number > 30.

A few examples

Sequence               Enrichment  Total  in_gene  in_promot  UTR
TGGAAATGCTGACACAACCT   0.789       21     7        2          0
TTCATTTACACTTAACTCAT   0.739       90     28       5          0
AAAGGCCCTTTTCAGAGCCA   0.729       46     46       0          43
AAATGCTGACAGACCCTTAA   0.700       25     13       4          0
GTCTGTCAGCATTTCCATTA   0.698       35     14       1          0
GGTTCCCATGGCAACAGCCT   0.686       22     10       3          0
AACTCCCATTAATGCTAATG   0.680       21     7        0          0
CAGCATCTGGCTCCTTGGCA   0.667       21     7        0          0
GTTGCCATGGCAACAGCAGC   0.640       32     14       5          2
TTTTATGGCTGAGTTATAAA   0.640       23     11       1          1
CTGTTGCCATGGCAACCAGG   0.630       39     22       11         1
GGTCTCCATGGCAACCAGCC   0.621       15     7        3          0
AGTGGCCTGAAAGAGTTAAT   0.615       22     12       1          0
TTATAATGGAAATGCTGACA   0.604       52     23       2          0
GTCTGTTAGCATTTCCATTA   0.595       23     10       2          0
AATAGGGGTTTATAATGGAA   0.594       27     11       2          1
TCCCATTAATGTTAATGGGA   0.591       23     10       2          0
GCTTTGGTTTCCATGGAAAC   0.583       25     7        2          0
CTGTCAGCATTTCCATTATA   0.556       49     22       4          1
CAGCATTTCCATTACAAACC   0.550       22     10       1          0
CCACAAGAGGGCAGCAGAGG   0.521       32     15       11         0
GTGCTATATAAATGCTAAAT   0.500       21     4        1          0
GACTACAACTCCCAGCAGGC   0.474       42     40       37         1
TCAGCACCACGGACAGCGCC   0.381       44     35       24         0

Context of K-mers: conservation island

Conservation island

Context of K-mers: extended conservation

TGCTGTTCCATGGCAAC

Palindromic sequence

Context of K-mers: connected conservation

Histone 3’UTR motif

TGGCTCTGAAAAGAGCCTTT



Identify long sequences based on alignment

Interesting RNA structure of the sequence

GGAAGAAGGGAAGAAATGGCTCACTTTTCAGAGGTGCATTTACTCTTTGACCCACTAGGGTACTATTTAGTGTTCTAGAAGAGGTAATTTAGTAAATTGTACCCCAGTGGCCTGAAAAAGTTAATGCAACTCTGAAAAGTGAGCCATTCAATCGATTTTCCCTATTGCTTTTAAAAAAT
.(((((.(((((((((((((((((((((((((((.((((((.(((.(((.(.(((((.((((((.(((((.((.(((.....))).....)).))))).)))))).)))))..).))).))).)))))).))))))))))))))))))).......))))))))....)))))....... (-74.51)

Conserved instance in the intron of ADCY5

TGCTGTTCCATGGCAAC

Conclusion

• Goldmines of conservation in the human genome
– Short motifs, very frequently occurring
– Longer motifs, many occurrences
– Extremely long elements, near-perfect conservation

• Regulatory role?
– microRNA genes / other non-coding RNAs
– Early development, body-plan formation
– Repeat elements hijacked for regulatory roles?

• Contain strong enhancer regions, scattered across genome
– A lot of un-translated transcription

Regulatory motif evolution

Erez Lieberman

• Genes

• Regulation

• Evolution

Motif disappears, and reappears about 100 bp downstream in S. mikatae

ABF1 motif CGTNNNNNRYGAY upstream of YHR078W (S. cerevisiae, S. paradoxus, S. mikatae, S. bayanus)

CGTNNNNNRYGAY
Scer GGCTCCATCAATTCGTATCAAGTGATAATT-AT------CACATAAATTATATAATTGTA
Spar AACCCTATTAATTCGTAAGCAGTGATATAA-AT-AGAATAACCTAACTTATACAACTGTA
Smik AACCCTATGAATTCCTAGTAAGCCACCTATTATAGAGATAACCTAAGTAGTATAGTAGTA
Sbay AGCCCTATACATTCGTACCAAGTGATAAAT-ATTATTAAGACCTAACATTTAAAACAGTT
 * * ** **** ** ** * ** ** *** ** * **

CGTNNNNNRYGAY
Scer AACCT------ATTAATAACCCTAAT-ATCATCCTCATGCCCTA-AGAAATATTCAATAT
Spar TCCCTTTTAAACCCCCTAATATTACC-ATCTAAGACCTAACTAATATCAA----GGGAAA
Smik A-CCTATTAAAATTAAAAACGTTAACCATGATGCCCTAACAATATAATGA-----AGGAA
Sbay ACCCT-----ACCCTAAAATGGGAAC-ATAAAACACAAACCCTATATAAACGTAGAGAAA
 *** ** * ** * * * * *

Evidence of motif movement by neutral evolution

Evidence of strand crossing for near-palindromic motifs

ABF1 Crosses the Strand in YHL012W

CGTNNNNNRYGAY
RTCRYNNNNNACG
Scer ---TAAAATAGCATATCGTTAAAAACGACAAACGCGT
Spar ---TAATATAACATCTCGTTAAAAACGACAAACGCGT
Smik TAATGAAATAA-ATCTCGTAAAAAACGACAAACGCGT
Sbay ---TGATCTGCCCTTCCGTATATAATGACAAACGCGT

[Figure: ABF1 sites upstream of YHL012W in S. cerevisiae, S. paradoxus, S. mikatae, S. bayanus]
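The two consensus strings on this slide, CGTNNNNNRYGAY and RTCRYNNNNNACG, are reverse complements of each other, which is what allows a near-palindromic site to be read on either strand. A minimal sketch that derives one from the other using the standard IUPAC nucleotide codes and searches the S. cerevisiae row for the forward motif; the helper names are illustrative.

```python
import re

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
BASE_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
# Invert the table so that a set of bases maps back to its IUPAC code.
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}


def reverse_complement(motif):
    """Reverse-complement a motif written in IUPAC codes."""
    out = []
    for code in reversed(motif):
        complemented = frozenset(BASE_COMPLEMENT[b] for b in IUPAC[code])
        out.append(CODE_FOR[complemented])
    return "".join(out)


def motif_regex(motif):
    """Compile an IUPAC motif into a regular expression over A/C/G/T."""
    return re.compile("".join(f"[{IUPAC[code]}]" for code in motif))


forward = "CGTNNNNNRYGAY"
reverse = reverse_complement(forward)
scer = "---TAAAATAGCATATCGTTAAAAACGACAAACGCGT".replace("-", "")
hit = motif_regex(forward).search(scer)
print(reverse, hit.group())
```

Scanning each species with both `motif_regex(forward)` and `motif_regex(reverse)` is one simple way to see on which strand a site is readable.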

The birth-death process of regulatory motifs

Motif birth

Motif movement

Motif death

Hap4 Hap4

Abf1 Abf1

Msn2

- Footprinti - Informationi

Rate2

Motif birth governed by random process ?

AANNCGGTNNTG

GNNNT

2X

1X

Wider = Faster

ACCT

GT

4X

1X

More bits: slow movement

Motif birth governed by random process !

Movement Rate Scales with Theory Given a Promoter Length of 70 (Expressed-C0)

In Control Regions, Substantial Deviation from Expected Rates is Observed (All-C0)

[Plot: observed tunneling rate (y-axis, 0 to 0.2) vs. theory model (x-axis, 0 to 0.0025) for Expressed-C0; linear fit y = 70.286x, R² = 0.9994]

Motif birth can be modeled as a largely random process

[Plot: observed motif birth rate vs. motif information content. Red: all regions. Green: bound regions.]

Motif aging

[Figure: number of instances vs. information content, one panel per age: Age 0, Age 1, Age 2, Age 3, Age 4]

What is responsible for shift in distribution ?

3. Death rates governed by selective landscape

Green: Death rate in bound regions

Red: Death rate in unbound regions

Motif death rates drastically different in functional / non-functional regions

Intensity of selection determines motif death rate

[Plot: rate of motif death by region class. Labels: Bound & Cooperative; Bound; Not bound; Cooperative; Neither]

Each level of selective pressure shows distinct death rate

Birth and death events for chromosome arm (16R)

Birth-death process governed by selection landscape

[Plot: strength of selective pressure vs. chromosomal position on chromosome 16 (right arm), with Yap 1 sites labeled. Green = motif birth, Red = motif death, Blue = motif aging]

Motif evolution governed by three processes

• Motif birth
– Short motifs can appear by neutral evolution
– Rate of motif birth ~ information content

• Motif aging
– Motif abundance shifts towards bound regions
– Distribution changes gradually over time

• Motif death
– Governed by functional selection landscape
– Predicted by partner motifs + factor binding

Modeling motif evolution can lead to better discovery

Network evolution by duplication

Aviva Presser

• Motif discovery

• Motif evolution

• Network evolution

Networks are dynamic in time and in evolution

Global motif co-occurrence map

Reveal co-operating regulators

Initial network of master regulators

Reveal hubs, cascades, network motifs

How do networks change in the face of gene duplication ?

[Figure: network of regulators: Ste12, Tec1, CBF1, Met31, Gcn4, Leu3, rESR1, Abf1, rESR2, Gcr1, Msn2]

Evidence of Whole Genome Duplication

Whole Genome Duplications in diverse lineages

Yeast Duplication

Kellis et al. Nature, Apr 8, 2004

Vertebrate Duplication in Fish

Jaillon et al. Nature, Oct 21, 2004

Two rounds of WGD in human!

Dehal et al. PLoS Biology, Oct 2005

The return to haploidy

[Figure: number of genes over time. WGD (100 Myrs ago) doubles 5,000 genes to 10,000; gene loss leaves 5,500 today (~500 gained)]

Advantage of WGD may lie in 500 gained genes

Functions of duplicated genes

S. cerevisiae copy 1

S. cerevisiae copy 2

K. waltii

Evidence of accelerated protein divergence ?

• As a group– Biased towards environment adaptation– Sugar metabolism, fermentation, regulation

• Individual pairs– Are new gene functions gained by WGD ?– How are new gene functions emerging ?

WGD Rate 1

Rate 2

Scenarios for rapid gene evolution

One copy faster (Ohno, 1970)

Both copies faster (Force, 1999)

[Trees: Scer - copy1, Scer - copy2, Kwal]

20% of duplicated genes show acceleration

95% of cases: Only one copy faster

Emerging gene functions after duplication

Asymmetric divergence recognize ancestral / derived

Scer - Sir3 (silencing)

Scer - Orc1 (origin of replication)

Kwal - Orc1

4-fold acceleration

Scer - Ski7 (anti-viral defense)

Scer - Hbs1 (translation initiation)

Kwal - Hbs1

3-fold acceleration

• Origin of replication → silencing

• Translation initiation → anti-viral defense

Asymmetric divergence distinct functional properties

Gain new function and lose ancestral function

                 Ancestral function    Derived function
Gene deletion    Lethal (20%)          Never lethal
Expression       Abundant              Specific (stress, starvation)
Localization     General               Specific (mitochondrion, spores)

Duplication

Asymmetric

Divergence

Asymmetry also found in network connectivity

Interaction loss more likely than gain. One protein maintains ancestral function?

Study network in context of duplication

Duplicated gene

Interaction partners

Network evolution by duplication

Lost

Duplicate

Time

Pre-WGD

Modern Network

Network motif

Duplication

++

--

Loss

Duplication, Gain, Modern network motif

Ancestral network motifs

Scenario 1

Scenario 2

Mechanisms of network motif emergence

Duplication

Creation Probability

p·(1-q) q

Divergence

Transition Probability

(1-Pplus)·(1-Pminus)^3·Pminus^2

Lost Interactions

Kept Interactions

Gained Interactions

• Pre-Duplication Probabilities
– p = probability of interaction
– q = probability of self-interaction

• Post-Duplication Probabilities
– Pplus = probability of adding an interaction
– Pminus = probability of eliminating an interaction
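The transition probability shown on this slide has the form (1-Pplus)·(1-Pminus)^3·Pminus^2, which reads as one possible interaction not gained, three interactions kept and two lost. A hedged sketch of that product, assuming each interaction is gained or lost independently; the function, its count arguments, the generalization to gained edges, and the numeric values are illustrative assumptions, not parameters from the lecture.

```python
def transition_probability(p_plus, p_minus, kept, lost, gained=0, not_gained=1):
    """Probability of one divergence outcome under independent edge changes.

    kept / lost count pre-existing interactions that survive or disappear;
    gained / not_gained count absent interactions that do or do not appear.
    """
    return (
        p_plus ** gained
        * (1 - p_plus) ** not_gained
        * (1 - p_minus) ** kept
        * p_minus ** lost
    )


# Hypothetical rates, chosen only to show the calculation.
p_plus, p_minus = 0.1, 0.35
prob = transition_probability(p_plus, p_minus, kept=3, lost=2)
print(f"{prob:.4f}")
```

With the default `gained=0, not_gained=1` the function reduces to the expression on the slide.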

All have either 4 or 0 edges across the pairs (4-across or 0-across)

Emergence of post-duplication network motifs

Modeling network evolution

– Parameters:
  • Fraction Duplicated vs Spontaneous Generation
  • Fraction Edges Deleted
  • Number of Edges for Spontaneous Genes

– 90% of timesteps: duplication
  • Pick a gene at random
  • Duplicate with all its connections
  • Delete on average 35% of new connections

– 10% of timesteps: creation
  • “Create” a new gene
  • Randomly connect it to the existing network with 0 – 20 connections
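The duplication and creation steps listed above can be sketched as a small simulation. This is an illustration under stated assumptions, not the model code used in the lecture: the two-gene starting network, the independent per-edge deletion (which gives 35% deletion on average), and all names are choices made for this sketch.

```python
import random


def simulate(steps, p_duplicate=0.9, p_delete=0.35, max_new_edges=20, seed=0):
    """Grow an undirected network by gene duplication and gene creation.

    Returns a dict mapping each gene (an int) to the set of its partners.
    """
    rng = random.Random(seed)
    network = {0: {1}, 1: {0}}  # assumed seed network: two interacting genes
    for _ in range(steps):
        new_gene = len(network)
        genes = list(network)
        if rng.random() < p_duplicate:
            # Duplication: copy a random gene's connections, then drop each
            # copied connection independently with probability p_delete.
            parent = rng.choice(genes)
            partners = [g for g in sorted(network[parent]) if rng.random() >= p_delete]
        else:
            # Creation: attach a new gene to 0..max_new_edges random genes.
            n_edges = rng.randint(0, min(max_new_edges, len(genes)))
            partners = rng.sample(genes, n_edges)
        network[new_gene] = set(partners)
        for g in partners:
            network[g].add(new_gene)
    return network


net = simulate(200)
degrees = sorted((len(partners) for partners in net.values()), reverse=True)
print(len(net), degrees[:5])
```

Counting small subgraphs in the resulting network and comparing them with the observed network would be the next step in studying motif emergence; that part is not shown here.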

Study emergence of network motifs

Abundance of network motifs predicted by duplication

[Plot: number of appearances (log scale, 10^0 to 10^5); x-axis 0 to 20]

2. High frequency of ohnolog pair interaction

Ancestral self-interaction or gain of ohnolog interaction

Duplication

Duplication

Divergence

1. Asymmetry in network connectivity

Interaction loss more likely than gain. One protein maintains ancestral network function?

Lessons Learned

1. Abundance of ancestral self-interactions

2. Gain of ohnolog interaction by proximity due to common interactions

3. Selection for ohnologs with interaction, both kept since neither can mutate. Faulty A’ would disrupt polymerization of A-A-A-A, reduced fitness.

3. Abundance of global properties and network hubs

Duplication + asymmetric divergence model Traditional preferential attachment model

Model matches local and global network properties

Network evolution: Conclusions

• Asymmetric evolution of network connectivity
– One pair preserves connections
– One pair keeps subset (rarely gains)

• WGD preserves network connectivity
– Duplicates highly interconnected

• Simple model of network evolution
– Estimate rates of interaction gain and loss
– Very good fit to simulated and actual yeast network

• Infer connectivity patterns of ancestral network
– Ancestral network shows increased number of self-interactions
– Self-interacting proteins favored in duplicated network?

Comparative genomics and regulatory networks

• Regulatory motif discovery
– Genome-wide conservation score
– Validated using expression, positional bias, multiplicity
– Pre- and post-transcriptional regulation

• microRNA regulation
– Motif-centric discovery of new microRNA genes
– Many new microRNAs, experimentally validated
– Role of microRNA regulation: 20% of the genome

• Regulatory motif evolution
– Underlying birth-death process, random birth process
– Aging shifts distribution, death governed by selection
– Ability to model motifs for discovery in many species

• Protein network evolution
– Simple duplication-based model
– Motif abundance, degree distribution can be predicted
– Asymmetric divergence, cross-interactions

Acknowledgements

• Human motifs
– Xiaohui Xie
– Eric Lander
– Vamsi Mootha
– Kerstin Lindblad-Toh
– Jun Lu
– E.J. Kulbokas
– Todd R. Golub

• Fungal comparisons
– Bruce Birren
– Christina Cuomo
– James Galagan
– Li-Jun Ma
– Joshua Grochow

• Gene identification
– Mike Lin
– Michael Brent

• Network evolution
– Aviva Presser
– Michael Elovitz
– Roy Kishony

• Motif Evolution
– Erez Lieberman
– Martin Nowak

• Genome-wide phylogeny
– Matt Rasmussen
– Marcia Lara

Who’s actually doing the work

Matt Rasmussen: Whole-genome phylogeny

Xiaohui Xie: Motif finding

Josh Grochow: Protein motifs

Erez Lieberman: Motif evolution

Aviva Presser: Network evolution

Mike Lin: Gene identification

Alex Stark: Fly regulatory networks

Pouya Kheradpour: Human motifs
