Page 1
Information Theory in Biology
Thomas D. Schneider, Ph.D.
Molecular Information Theory Group
Center for Cancer Research
Gene Regulation and Chromosome Biology Laboratory
National Cancer Institute
Frederick, MD 21702-1201
Page 2
Information Theory: One-Minute Lesson

number of symbols (M)   number of bits (B)   example
2                       1                    H T
4                       2                    00 01 10 11 (A T G C)
8                       3

M = 2^B
B = log2 M
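The relation B = log2 M in the table can be checked numerically. The following is a minimal Python sketch; the function name `bits_needed` is chosen here for illustration and does not come from the talk.

```python
import math


def bits_needed(num_symbols: int) -> float:
    """Bits B needed to pick one of M equally likely symbols: B = log2(M)."""
    if num_symbols < 1:
        raise ValueError("need at least one symbol")
    return math.log2(num_symbols)


# The three rows of the table: a coin (H, T), two-bit codes or DNA bases, eight symbols.
for m in (2, 4, 8):
    print(f"M = {m}: B = {bits_needed(m):.0f} bits")
```

With M = 4 the result is 2 bits, which is why a single DNA base can carry at most 2 bits.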
Page 7
El Duomo, Florence, Italy
Page 8
T7 RNA polymerase + DNA
http://www.ebi.ac.uk/pdbe/entry/pdb/1qln/portfolio/?view=entry_index#ad-image-0
Page 9
Sequence Logo and Sequence Walker and Rsequence
Schneider & Stephens, Nucl. Acids Res. 18: 6097-6100, 1990

17 Bacteriophage T7 RNA polymerase binding sites (6 of 17 sites shown)

site  sequence                      individual information (bits)
1     ttattaatacaactcactataaggagag  33.3
2     aaatcaatacgactcactatagagggac  37.4
3     cggttaatacgactcactataggagaac  34.4
4     gaagtaatacgactcagtatagggacaa  33.1
5     taattaattgaactcactaaagggagac  30.1
6     cgcttaatacgactcactaaaggagaca  29.1

[Figure: sequence logo of the 17 sites, 5′ to 3′, positions -21 to +6; vertical axis 0 to 2 bits. Stack heights marked on the logo: 2 bits/base, 1 bit/base, 0 bits/base.]
[Figure: sequence walker, 29.1 bits. Sequence Walker: Patent 5,867,402.]

Rsequence is the average: 35.0 ± 0.6 bits = "area under the logo"
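The stack heights in a logo can be illustrated with the six sequences listed on the slide. The sketch below computes 2 − H(l) at each aligned position, where H(l) is the entropy of the bases seen there. It is only an illustration: it uses 6 of the 17 sites and applies no small-sample correction, so it will not reproduce the 35.0 ± 0.6 bits quoted for the full set. The names `SITES` and `information_per_position` are assumptions of this sketch.

```python
import math
from collections import Counter

# The six T7 RNA polymerase binding sites shown on the slide (positions -21 to +6).
SITES = [
    "ttattaatacaactcactataaggagag",
    "aaatcaatacgactcactatagagggac",
    "cggttaatacgactcactataggagaac",
    "gaagtaatacgactcagtatagggacaa",
    "taattaattgaactcactaaagggagac",
    "cgcttaatacgactcactaaaggagaca",
]


def information_per_position(sites):
    """Return 2 - H(l) in bits for each aligned position (uncorrected)."""
    info = []
    for column in zip(*sites):
        n = len(column)
        counts = Counter(column)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        info.append(2.0 - entropy)
    return info


info = information_per_position(SITES)
for position, bits in zip(range(-21, 7), info):
    print(f"{position:+3d}  {bits:.2f} bits")
```

A position where all six sequences agree comes out at 2 bits, and a position where two bases are equally frequent would come out at 1 bit, matching the heights marked on the logo.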
Page 17
Rfrequency
Information required to find a set of binding sites

G = # of potential binding sites = genome size in some cases
γ = number of binding sites on genome

Rfrequency = Hbefore − Hafter
           = log2 G − log2 γ
           = −log2(γ/G)
Page 23
Rfrequency
Information required to find a set of binding sites in a genome

16 positions, 1 site:  log2(16/1) = 4 bits
16 positions, 2 sites: log2(16/2) = 3 bits
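The two cases on this slide follow directly from Rfrequency = log2(G/γ). A minimal Python sketch, with the function name `r_frequency` chosen here for illustration:

```python
import math


def r_frequency(potential_sites: int, actual_sites: int) -> float:
    """Rfrequency = log2(G / gamma), in bits per site."""
    if actual_sites <= 0 or potential_sites < actual_sites:
        raise ValueError("need 0 < gamma <= G")
    return math.log2(potential_sites / actual_sites)


print(r_frequency(16, 1))    # 16 positions, 1 site
print(r_frequency(16, 2))    # 16 positions, 2 sites
print(r_frequency(256, 16))  # the Ev genome used later in the talk
```

Doubling the number of sites in a fixed genome lowers the information needed per site by exactly one bit.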
Page 24
RNA Splicing
[Figure: DNA containing exons and introns is copied (transcription) into RNA; RNA splicing then removes the introns, each bounded by a donor site and an acceptor site, and joins the exons into spliced RNA.]
Page 25
Donor and acceptor logos
[Figure: sequence logos of the donor and acceptor splice sites, drawn 5′ to 3′ on an exon, intron, exon diagram, with the donor at the start of the intron and the acceptor at its end.]
Page 26
Rsequence and Rfrequency for Splice Acceptors
Rsequence
[Figure: sequence logo of the splice acceptor sites]
• Information at binding site sequences (area under sequence logo)
• from: binding site sequences
• 9.4 bits per site

Rfrequency
[Figure: intron and exon with the donor and acceptor sites marked]
• Information needed to locate the sites
• from: size of genome and number of sites (length of intron+exon)
• 9.7 bits per site

Rsequence/Rfrequency = 0.97
Page 28
Rsequence = Rfrequency Hypothesis
Hypothesis:
The information in binding site patterns is just sufficient for the sites to be found in the genome
Page 29
Rsequence versus Rfrequency
Rsequence = total pattern information at the binding sites (bits)
Rfrequency = information needed to locate a site in the genome (bits)
Rsequence/Rfrequency = pattern info / location info

Binding site recognizer [1]   Rsequence      Rfrequency   Rsequence/Rfrequency
Spliceosome acceptor [2]      9.35 ± 0.12    9.66         0.97 ± 0.01
Spliceosome donor             7.92 ± 0.09    9.66         0.82 ± 0.01
Ribosome                      11.0           10.6         1.0
λ cI/cro                      17.7 ± 1.6     19.3         0.9 ± 0.1
LexA                          21.5 ± 1.7     18.4         1.2 ± 0.1
TrpR                          23.4 ± 1.9     20.3         1.2 ± 0.1
LacI                          19.2 ± 2.8     21.9         0.9 ± 0.1
ArgR                          16.4           18.4         0.9
O (λ Origin)                  20.9           19.9         1.0
Ara C                         19.3           19.3         1.0
Transcription at TATA [3]     3.3            ∼3           ∼1
T7 Promoter                   35.4           16.5         2.1

[1] T. D. Schneider, G. D. Stormo, L. Gold, and A. Ehrenfeucht. J. Mol. Biol., 188:415-431, 1986.
[2] R. M. Stephens and T. D. Schneider. J. Mol. Biol., 228:1124-1136, 1992.
[3] F. E. Penotti. J. Mol. Biol., 213:37-52, 1990.
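The last column of the table is simply the ratio of the two preceding ones. As a small check, the sketch below recomputes it for a few rows using the rounded values printed on the slide; the dictionary name `TABLE` and the choice of rows are assumptions of this example, and ratios recomputed from rounded inputs can differ from the published column in the last digit.

```python
# (Rsequence, Rfrequency) in bits, copied from the slide for a few recognizers.
TABLE = {
    "Spliceosome acceptor": (9.35, 9.66),
    "Spliceosome donor": (7.92, 9.66),
    "LexA": (21.5, 18.4),
    "T7 Promoter": (35.4, 16.5),
}


def ratio(r_sequence: float, r_frequency: float) -> float:
    """Pattern information divided by location information."""
    return r_sequence / r_frequency


for name, (rs, rf) in TABLE.items():
    print(f"{name:22s} {ratio(rs, rf):.2f}")
```

Most recognizers sit close to a ratio of 1, while the T7 promoter stands out at roughly 2.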
Page 30
Rsequence versus Rfrequency - meaning
The information in the binding site pattern (Rsequence)
is close to
the information needed to find the binding sites (Rfrequency)

But for a species in a stable environment:
• size of genome (G) is fixed (e.g. E. coli has 4.7 × 10^6 bp)
• number of binding sites (γ) is fixed (e.g. there are ∼50 E. coli LexA sites)
so Rfrequency = log2(G/γ) is fixed

Rsequence must evolve towards Rfrequency!
Page 33
Evolution of Binding Sites
• Rfrequency is fixed relative to Rsequence
• Does Rsequence evolve toward Rfrequency?

Set up a computer model, ‘Ev’:
A population of “creatures” with
• genomes containing 4 bases (A, C, G, T)
• a defined genome size (G)
• predetermined binding site locations (γ) (to fix the frequency of sites)
  (the defined G and γ together mean that Rfrequency is fixed)
• a recognizer gene encoded in the sequence: use a weight matrix
Page 41
How A Weight Matrix Works
Sequence matrix, s(b, l, j) for sequence j

                     position l
base b    −3   −2   −1    0    1    2    3    4    5    6
           C    A    G    G    T    C    T    G    C    A
A          0    1    0    0    0    0    0    0    0    1
C          1    0    0    0    0    1    0    0    1    0
G          0    0    1    1    0    0    0    1    0    0
T          0    0    0    0    1    0    1    0    0    0

Individual information weight matrix, Riw(b, l)

                     position l
base b    −3    −2    −1     0     1     2     3     4     5     6
A       +0.4  +1.3  −1.4  −8.8  −5.8  +1.1  +1.5  −1.8  −0.7  +0.0
C       +0.6  −0.8  −2.4  −7.8  −5.5  −3.7  −1.6  −2.2  −0.5  −0.2
G       −0.6  −1.0  +1.6  +2.0  −6.2  +0.7  −1.1  +1.7  −0.3  +0.4
T       −1.0  −0.9  −1.7  −5.8  +2.0  −3.4  −1.6  −2.2  +0.9  −0.5

[Figure: Sequence Walker for 5′ c a g g t c t g c a 3′]
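The two matrices combine by picking, at each position, the weight of the base that is actually present and adding the picks up. The following Python sketch does this for the example sequence CAGGTCTGCA with the Riw(b, l) values from the slide; the names `RIW` and `individual_information` are chosen here for illustration.

```python
# Riw(b, l) from the slide, positions -3 .. 6 (ten columns per base).
RIW = {
    "a": [+0.4, +1.3, -1.4, -8.8, -5.8, +1.1, +1.5, -1.8, -0.7, +0.0],
    "c": [+0.6, -0.8, -2.4, -7.8, -5.5, -3.7, -1.6, -2.2, -0.5, -0.2],
    "g": [-0.6, -1.0, +1.6, +2.0, -6.2, +0.7, -1.1, +1.7, -0.3, +0.4],
    "t": [-1.0, -0.9, -1.7, -5.8, +2.0, -3.4, -1.6, -2.2, +0.9, -0.5],
}
WIDTH = 10


def individual_information(sequence: str) -> float:
    """Sum Riw(b, l) over the bases of one aligned sequence (in bits)."""
    seq = sequence.lower()
    if len(seq) != WIDTH:
        raise ValueError(f"expected {WIDTH} bases, got {len(seq)}")
    # Equivalent to the sum over b and l of s(b, l, j) * Riw(b, l),
    # because s is 1 only for the base present at each position.
    return sum(RIW[base][l] for l, base in enumerate(seq))


total = individual_information("CAGGTCTGCA")
print(f"{total:.1f} bits")
```

For this sequence the positive contributions at positions −3 to 1 and 4 outweigh the negative ones at positions 2, 3 and 5, giving a total of about 3.4 bits.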
Page 43
Unevolved Ev Creature
[Figure: genome map of an unevolved Ev creature, annotated as follows]
• “blue” gene weight matrix: 6 bp wide
• Genome positions available: G = 256 bases
• γ = 16 binding sites
• Rfrequency = log2(256/16) = 4 bits
• Marks on the map: found real site, missed real site, found wrong site
Page 50
Evolution Cycle
[Figure: cycle diagram labelled evaluate, sort, selection, kill, replicate, mutate]

• EVALUATE each creature
  • translate the recognizer gene into a weight matrix
  • scan the weight matrix across the genome
  • count the number of mistakes:
      missing a site at a right place
      finding a site at a wrong place
• SORT the creatures by their mistakes
• REPLICATE: the best creatures are duplicated and replace the worst ones
• MUTATE all genomes randomly
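The cycle above can be sketched in a few dozen lines of Python. This is a toy illustration under stated assumptions, not the Ev program itself: the genome size (256), the matrix width (6) and the 16 sites come from the slides, while the integer encoding (5 bases per weight), the even spacing of the sites, the population size, the single point mutation per genome and the handling of ties are all choices made here for brevity.

```python
import random

WIDTH = 6          # weight-matrix width in bases (from the slides)
GENOME_SIZE = 256  # G (from the slides)
DIGITS = 5         # bases per encoded integer: an assumption of this sketch
N_WEIGHTS = 4 * WIDTH
GENE_LENGTH = (N_WEIGHTS + 1) * DIGITS            # 24 weights plus one threshold
SITES = {GENE_LENGTH + 8 * i for i in range(16)}  # gamma = 16 fixed locations


def decode_int(digits):
    """Read base-4 digits as an integer centred on zero."""
    value = 0
    for d in digits:
        value = 4 * value + d
    return value - 4 ** len(digits) // 2


def decode(genome):
    """Translate the recognizer gene into a weight matrix and a threshold."""
    ints = [decode_int(genome[i * DIGITS:(i + 1) * DIGITS])
            for i in range(N_WEIGHTS + 1)]
    matrix = [ints[4 * l:4 * l + 4] for l in range(WIDTH)]
    return matrix, ints[-1]


def mistakes(genome):
    """Count missed real sites plus positions wrongly recognized as sites."""
    matrix, threshold = decode(genome)
    errors = 0
    for pos in range(GENOME_SIZE - WIDTH + 1):
        score = sum(matrix[l][genome[pos + l]] for l in range(WIDTH))
        if (score >= threshold) != (pos in SITES):
            errors += 1
    return errors


def generation(population, rng):
    """One cycle: evaluate and sort, replicate best half over worst, mutate."""
    ranked = sorted(population, key=mistakes)
    best = ranked[:len(ranked) // 2]  # assumes an even population size
    next_pop = [g[:] for g in best] + [g[:] for g in best]
    for genome in next_pop:
        # One random base change per genome (it may redraw the same base).
        genome[rng.randrange(GENOME_SIZE)] = rng.randrange(4)
    return next_pop


rng = random.Random(0)
population = [[rng.randrange(4) for _ in range(GENOME_SIZE)] for _ in range(16)]
for _ in range(20):
    population = generation(population, rng)
print("fewest mistakes after 20 generations:", min(mistakes(g) for g in population))
```

Because the weights and the threshold live in the same genome as the sites, a mutation can change either the recognizer or a binding site, which is the coupling the model relies on.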
Page 59
Evolved Ev Creature
Page 60
Evolution of Binding Sites
[Figure: sequence logos of the 16 evolving binding sites, one every 100 generations, 5′ to 3′, positions -5 to 10; vertical axis 0 to 2 bits]

Generation   Rs (bits)
100          -0.1 ± 0.5
200          -0.0 ± 0.5
300          1.6 ± 0.5
400          2.6 ± 0.5
500          2.7 ± 0.5
600          2.7 ± 0.5
700          3.7 ± 0.5
800          3.2 ± 0.5
900          4.5 ± 0.5
1000         4.9 ± 0.5
1100         5.0 ± 0.5
1200         4.5 ± 0.5
1300         4.2 ± 0.5
1400         4.4 ± 0.5
1500         3.9 ± 0.5
1600         4.9 ± 0.5
1700         4.7 ± 0.5
1800         3.7 ± 0.5
1900         4.7 ± 0.5
2000         5.2 ± 0.5

[Figure: Mistakes of Best Organism (0 to 20) versus Generation (0 to 2000), labelled selection and no selection]
[Figure: Information (bits per site, -1.0 to 6.0) versus Generation (0 to 2000), showing Rfrequency and Rsequence, labelled selection and no selection]
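The Rs values printed with the logos can be compared with Rfrequency = log2(256/16) = 4 bits for this model. The sketch below averages the values from generation 1100 onward; the cutoff of 1100 is an arbitrary choice made for this illustration, and the numbers are the snapshot values from the slide, not a rerun of Ev.

```python
import math
import statistics

# Rs (bits) at each displayed generation, copied from the slide.
RS_BY_GENERATION = {
    100: -0.1, 200: -0.0, 300: 1.6, 400: 2.6, 500: 2.7,
    600: 2.7, 700: 3.7, 800: 3.2, 900: 4.5, 1000: 4.9,
    1100: 5.0, 1200: 4.5, 1300: 4.2, 1400: 4.4, 1500: 3.9,
    1600: 4.9, 1700: 4.7, 1800: 3.7, 1900: 4.7, 2000: 5.2,
}

r_frequency = math.log2(256 / 16)  # G = 256 positions, gamma = 16 sites
late = [rs for gen, rs in RS_BY_GENERATION.items() if gen >= 1100]
late_mean = statistics.mean(late)

print(f"Rfrequency           = {r_frequency:.1f} bits")
print(f"mean Rs, gen >= 1100 = {late_mean:.2f} bits")
```

Starting from about 0 bits, the pattern information rises and then fluctuates within roughly a bit of Rfrequency, each snapshot carrying an uncertainty of ± 0.5 bits.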
Page 63
Donor and acceptor logos
[Figure: sequence logos of the donor and acceptor splice sites, drawn 5′ to 3′ on an exon, intron, exon diagram, with the donor at the start of the intron and the acceptor at its end.]
Page 64
Human Splice Junction Information Curves
[Figure: information curves for the donor and acceptor sites; vertical axis Rs(l) = Rsequence(l), information in bits (sequence conservation in bits per base, 0.0 to 2.0); horizontal axis position l (in bases)]

Donor
l a c g t Rs(l)
-9 481 444 497 371 0.01
-8 461 461 453 422 -0.00
-7 434 457 530 378 0.01
-6 498 433 452 416 0.00
-5 474 489 398 438 0.00
-4 492 527 446 334 0.02
-3 603 673 301 222 0.14
-2 1071 263 219 246 0.39
-1 169 85 1404 141 0.90
0 0 2 1789 8 1.95
1 8 10 6 1775 1.88
2 974 35 738 43 0.75
3 1274 152 215 149 0.68
4 128 95 1468 97 1.04
5 270 304 348 812 0.16
6 428 369 564 296 0.04
7 319 492 408 424 0.02
8 332 494 413 384 0.01
9 302 437 463 408 0.02
10 281 419 424 342 0.02
11 274 382 436 369 0.02
12 273 414 400 355 0.02
Acceptor
l a c g t Rs(l)
-25 309 392 208 386 0.04
-24 282 420 216 388 0.04
-23 273 392 220 424 0.05
-22 290 411 214 397 0.04
-21 309 400 194 416 0.05
-20 248 412 203 461 0.08
-19 256 388 225 482 0.07
-18 213 442 246 455 0.08
-17 201 453 183 535 0.15
-16 193 471 199 517 0.14
-15 182 459 212 533 0.14
-14 158 476 193 619 0.21
-13 135 486 197 656 0.25
-12 122 511 174 673 0.29
-11 121 500 152 732 0.34
-10 110 483 149 775 0.37
-9 131 534 175 800 0.33
-8 156 583 192 727 0.27
-7 172 656 172 674 0.27
-6 165 653 158 726 0.30
-5 123 691 110 788 0.43
-4 127 590 103 914 0.46
-3 394 499 432 411 0.00
-2 67 1253 20 397 0.92
-1 1717 10 6 10 1.86
0 9 11 1720 4 1.87
1 449 231 871 187 0.26
2 426 308 413 590 0.04
3 424 426 419 468 0.00
4 400 487 440 410 0.00
5 424 502 347 464 0.01
6 394 527 424 392 0.01
7 441 431 485 380 0.00
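Each Rs(l) value in these tables can be recomputed from the four base counts on the same row. The sketch below uses Rs(l) = 2 − H(l) − e(n), where H(l) is the entropy of the observed base frequencies and e(n) is a small-sample correction. The approximation e(n) = 3 / (2 ln(2) n) used here is an assumption of this sketch; the exact correction used for the slide is not stated in this section.

```python
import math


def rs(counts, correct=True):
    """Information in bits at one position from base counts (a, c, g, t)."""
    n = sum(counts)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts if c > 0)
    # Approximate small-sample correction for 4 symbols (an assumption here).
    e_n = 3 / (2 * math.log(2) * n) if correct else 0.0
    return 2.0 - entropy - e_n


# A few rows copied from the tables: position -> (a, c, g, t).
DONOR = {0: (0, 2, 1789, 8), 1: (8, 10, 6, 1775)}
ACCEPTOR = {-1: (1717, 10, 6, 10), 0: (9, 11, 1720, 4)}

for name, rows in (("donor", DONOR), ("acceptor", ACCEPTOR)):
    for position, counts in rows.items():
        print(f"{name:8s} l = {position:+d}  Rs(l) = {rs(counts):.2f} bits")
```

With nearly 1800 sequences per position the correction is only about 0.001 bits, so the almost invariant G and T at donor positions 0 and 1 come out just under the 2-bit maximum.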
https://alum.mit.edu/www/toms/papers/splice/
• The consensus sequences match . . .
C A G
Page 68
Human Splice Junction Information Curves
C A G —
https://alum.mit.edu/www/toms/papers/splice/
• The consensus sequences match . . .
Page 69
Human Splice Junction Information Curves
C A G — G
https://alum.mit.edu/www/toms/papers/splice/
• The consensus sequences match . . .
Page 70
Human Splice Junction Information Curves
C A G — G T
https://alum.mit.edu/www/toms/papers/splice/
• The consensus sequences match . . .
Page 71
Human Splice Junction Information Curves
C A G — G T
https://alum.mit.edu/www/toms/papers/splice/
• The consensus sequences match . . .
• BUT the information curves (sequence conservation) differ!
Page 72
Human Splice Junction Information Curves
[Figure: donor and acceptor information curves, with the letters a, c, g, t stacked at each position]
C A G — G T
https://alum.mit.edu/www/toms/papers/splice/
• The consensus sequences match . . .
• BUT the information curves (sequence conservation) differ!
• Put letters into the graph proportional to their frequency!
Page 73
Human Splice Junction Information Curves
[Figure: sequence logos for 1799 human donor sites (positions -9 to 12) and 1744 human acceptor sites (positions -25 to 7), 5′ to 3′; vertical axis 0 to 2 bits]
https://alum.mit.edu/www/toms/papers/logo/
• That’s how and why we invented sequence logos!
Page 74
Splice Junction Sequence Logos
[Figure: exon, intron, exon diagram (5′ to 3′) with the donor and acceptor sequence logos]
• 90% of the splice junction information is on the intron side
https://alum.mit.edu/www/toms/papers/splice/
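The 90% figure can be checked roughly against the Rs(l) values in the Human Splice Junction Information Curves tables. This is a sketch under one stated assumption about coordinates: the donor intron is taken to start at l = 0 (the G of GT) and the acceptor intron to end at l = 0 (the G of AG). The variable names are illustrative.

```python
# Rs(l) values in bits, copied from the donor and acceptor tables, keyed by position l.
donor = dict(zip(range(-9, 13), [
    0.01, -0.00, 0.01, 0.00, 0.00, 0.02, 0.14, 0.39, 0.90,      # l = -9 .. -1
    1.95, 1.88, 0.75, 0.68, 1.04, 0.16, 0.04, 0.02, 0.01,       # l = 0 .. 8
    0.02, 0.02, 0.02, 0.02,                                     # l = 9 .. 12
]))
acceptor = dict(zip(range(-25, 8), [
    0.04, 0.04, 0.05, 0.04, 0.05, 0.08, 0.07, 0.08, 0.15, 0.14,  # l = -25 .. -16
    0.14, 0.21, 0.25, 0.29, 0.34, 0.37, 0.33, 0.27, 0.27, 0.30,  # l = -15 .. -6
    0.43, 0.46, 0.00, 0.92, 1.86, 1.87,                          # l = -5 .. 0
    0.26, 0.04, 0.00, 0.00, 0.01, 0.01, 0.00,                    # l = 1 .. 7
]))

# Assumption: intron positions are l >= 0 for the donor and l <= 0 for the acceptor.
intron_bits = (sum(v for l, v in donor.items() if l >= 0)
               + sum(v for l, v in acceptor.items() if l <= 0))
total_bits = sum(donor.values()) + sum(acceptor.values())
intron_fraction = intron_bits / total_bits

print(f"donor total    = {sum(donor.values()):.2f} bits")
print(f"acceptor total = {sum(acceptor.values()):.2f} bits")
print(f"intron side    = {intron_fraction:.0%} of the information")
```

Summing the Rs(l) column is also how the total information of each site is obtained from its logo.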
Page 75
Splice Junction Sequence Logos
[Figure: exon, intron, exon diagram (5′ to 3′) with the donor and acceptor sequence logos; below, donor, proto, and acceptor logos]
• 90% of the splice junction information is on the intron side
• Hypothesis: donor and acceptor sites had a common ancestor that duplicated
https://alum.mit.edu/www/toms/papers/splice/
Page 76
Splice Junction Sequence Logos
[Figure: exon, intron, exon diagram (5′ to 3′) with the donor and acceptor sequence logos; below, donor, proto, and acceptor logos]
• 90% of the splice junction information is on the intron side
• Hypothesis: donor and acceptor sites had a common ancestor that duplicated
• They evolved to put the information into the intron. This avoids affecting the proteins.
https://alum.mit.edu/www/toms/papers/splice/
Page 77
3D Sequence Logos for tRNA Correlations
• tRNA reads RNA to make protein
https://alum.mit.edu/www/toms/papers/correlogo/ Nucleic Acids Res. 2006 34:W405-11
Page 78
3D Sequence Logos for tRNA Correlations
• tRNA reads RNA to make protein
• Correlations can be measured in bits!
https://alum.mit.edu/www/toms/papers/correlogo/ Nucleic Acids Res. 2006 34:W405-11
Page 79
3D Sequence Logos for tRNA Correlations
• tRNA reads RNA to make protein
• Correlations can be measured in bits!
• 3D Sequence logo
https://alum.mit.edu/www/toms/papers/correlogo/ Nucleic Acids Res. 2006 34:W405-11
Page 80
3D Sequence Logos for tRNA Correlations
• tRNA reads RNA to make protein
• Correlations can be measured in bits!
• 3D Sequence logo
• OBSERVED: tRNA stems
https://alum.mit.edu/www/toms/papers/correlogo/ Nucleic Acids Res. 2006 34:W405-11
Page 81
Sequence Walker example: rrnB P1
. *4164240 . *4164250 . *4164260 . *4164270 . 5’ g g a g c t g a a c a a t t a t t g c c c g t t t t a c a g c g t t a c g g c t t c g a 3’ 3’ c c t c g a c t t g t t a a t a a c g g g c a a a a t g t c g c a a t g c c g a a g c t 5’
Fis 12.0 bits
*4164280 . *4164290 . *4164300 . *4164310 . *4164320 . *4164330 5’ a a c g c t c g a a a a a c t g g c a g t t t t a g g c t g a t t t g g t t g a a t g t t g c g c g g t c a 3’ 3’ t t g c g a g c t t t t t g a c c g t c a a a a t c c g a c t a a a c c a a c t t a c a a c g c g c c a g t 5’
Fis 5.3 bits Fis 10.4 bits
. *4164340 . *4164350 . *4164360 . *4164370 . *4164380 5’ g a a a a t t a t t t t a a a t t t c c t c t t g t c a g g c c g g a a t a a c t c c c t a t a a t g 3’ 3’ c t t t t a a t a a a a t t t a a a g g a g a a c a g t c c g g c c t t a t t g a g g g a t a t t a c 5’
distalUP 6.6 bits
proximalUP 4.4 bits
p35 5.5 bits
p10 8.4 bits
p35-(22)-p10 4164377: Gap 2.3 bits
proximalUP-(27)-p10 4164377: Gap 3.4 bits
distalUP-(45)-p10 4164377: Gap 5.4 bits
distalUP-proximalUP-p35-p10 4164377: total 13.7 bits
rrnB P1
Page 82
Complex Sequence Walker Example
• σ70 promoters have a −35 and a −10
Page 83
Complex Sequence Walker Example
• σ70 promoters have a −35 and a −10
• Using information theory we discovered that stress-response σ38 promoters do not have a −35
Page 84
Complex Sequence Walker Example
• σ70 promoters have a −35 and a −10
• Using information theory we discovered that stress-response σ38 promoters do not have a −35
• Instead, they have a −10 and two UP elements
Page 85
Complex Sequence Walker Example
• σ70 promoters have a −35 and a −10
• Using information theory we discovered that stress-response σ38 promoters do not have a −35
• Instead, they have a −10 and two UP elements
• σ38 promoter talA P1 is complex!
Page 86
Important Discovery
[Figure: sequence logos for 1744 human acceptor sites (positions -20 to 2) and 1799 human donor sites (positions -3 to 5); vertical axis 0 to 2 bits]
• Area under a sequence logo is the total information.
Page 87
Important Discovery
[Figure: sequence logos for 1744 human acceptor sites (positions -20 to 2) and 1799 human donor sites (positions -3 to 5); vertical axis 0 to 2 bits]
• Area under a sequence logo is the total information.
• How is that related to the binding energy?
Page 88
Important Discovery
[Figure: sequence logos for 1744 human acceptor sites (positions -20 to 2) and 1799 human donor sites (positions -3 to 5); vertical axis 0 to 2 bits]
• Area under a sequence logo is the total information.
• How is that related to the binding energy?
• Information gained for energy dissipated
Page 89
Important Discovery
[Figure: sequence logos for 1744 human acceptor sites (positions -20 to 2) and 1799 human donor sites (positions -3 to 5); vertical axis 0 to 2 bits]
• Area under a sequence logo is the total information.
• How is that related to the binding energy?
• Information gained for energy dissipated
• Isothermal efficiency
Page 90
Important Discovery
[Figure: sequence logos for 1744 human acceptor sites (positions -20 to 2) and 1799 human donor sites (positions -3 to 5); vertical axis 0 to 2 bits]
• Area under a sequence logo is the total information.
• How is that related to the binding energy?
• Information gained for energy dissipated
• Isothermal efficiency
• My most important discovery:
Molecules are often 70% efficient
Page 91
Information of EcoRI DNA Binding
• EcoRI - restriction enzyme
Page 92
Information of EcoRI DNA Binding
[Figure: sequence logo of EcoRI sites, 5′ GAATTC 3′ at positions 0 to 5; vertical axis 0 to 2 bits]
• EcoRI - restriction enzyme
• EcoRI binds DNA at 5′ GAATTC 3′
Page 93
Information of EcoRI DNA Binding
[Figure: sequence logo of EcoRI sites, 5′ GAATTC 3′ at positions 0 to 5; vertical axis 0 to 2 bits]
• EcoRI - restriction enzyme
• EcoRI binds DNA at 5′ GAATTC 3′
• information required:
6 bases × 2 bits per base = 12 bits
Page 94
Energy Dissipation by EcoRI
• Measured specific binding constant:
$K_{\mathrm{spec}} = 1.6 \times 10^{5}$
Page 95
Energy Dissipation by EcoRI
• Measured specific binding constant:
$K_{\mathrm{spec}} = 1.6 \times 10^{5}$
• Average energy dissipated by one molecule as it binds:
$\Delta G^{\circ}_{\mathrm{spec}} = -k_B T \ln K_{\mathrm{spec}}$ (joules per binding)
Page 96
Energy Dissipation by EcoRI
• Measured specific binding constant:
$K_{\mathrm{spec}} = 1.6 \times 10^{5}$
• Average energy dissipated by one molecule as it binds:
$\Delta G^{\circ}_{\mathrm{spec}} = -k_B T \ln K_{\mathrm{spec}}$ (joules per binding)
• The Second Law of Thermodynamics as a conversion factor:
$E_{\min} = k_B T \ln 2$ (joules per bit)
Page 97
Energy Dissipation by EcoRI
• Measured specific binding constant:
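$K_{\mathrm{spec}} = 1.6 \times 10^{5}$
• Average energy dissipated by one molecule as it binds:
$\Delta G^{\circ}_{\mathrm{spec}} = -k_B T \ln K_{\mathrm{spec}}$ (joules per binding)
• The Second Law of Thermodynamics as a conversion factor:
$E_{\min} = k_B T \ln 2$ (joules per bit)
• Number of bits that could have been selected:
$$\begin{aligned}
R_{\mathrm{energy}} &= -\Delta G^{\circ}_{\mathrm{spec}} / E_{\min} \\
&= k_B T \ln K_{\mathrm{spec}} / (k_B T \ln 2) \\
&= \log_2 K_{\mathrm{spec}} \quad \Leftarrow \text{SO SIMPLE!} \\
&= 17.3 \text{ bits per binding}
\end{aligned}$$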
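The derivation can be followed numerically. This is a small sketch: the temperature of 298.15 K is an assumption (the slide gives none), and it only affects the joule values, because k_B T cancels in the ratio.

```python
import math

K_SPEC = 1.6e5        # measured specific binding constant, from the slide
K_B = 1.380649e-23    # Boltzmann constant in joules per kelvin
T = 298.15            # assumed temperature in kelvin (not stated in the talk)

delta_g = -K_B * T * math.log(K_SPEC)  # joules per binding
e_min = K_B * T * math.log(2)          # joules per bit
r_energy = -delta_g / e_min            # bits per binding

print(f"E_min    = {e_min:.3e} J per bit")
print(f"R_energy = {r_energy:.1f} bits per binding")
```

Changing `T` changes `delta_g` and `e_min` together, so `r_energy` stays equal to log2 of the binding constant.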
Page 98
Information/Energy = Efficiency of EcoRI
EcoRI could have made 17.3 binary choices
Page 99
Information/Energy = Efficiency of EcoRI
[Figure: sequence logo of EcoRI sites, 5′ GAATTC 3′ at positions 0 to 5; vertical axis 0 to 2 bits]
EcoRI could have made 17.3 binary choices ... but it only made 12 choices.
Page 100
Information/Energy = Efficiency of EcoRI
[Figure: sequence logo of EcoRI sites, 5′ GAATTC 3′ at positions 0 to 5; vertical axis 0 to 2 bits]
EcoRI could have made 17.3 binary choices ... but it only made 12 choices.
Efficiency is ‘WORK’ DONE / ENERGY DISSIPATED
Page 101
Information/Energy = Efficiency of EcoRI
[Figure: sequence logo of EcoRI sites, 5′ GAATTC 3′ at positions 0 to 5; vertical axis 0 to 2 bits]
EcoRI could have made 17.3 binary choices ... but it only made 12 choices.
Efficiency is ‘WORK’ DONE / ENERGY DISSIPATED
$$\frac{12 \text{ bits per binding}}{17.3 \text{ bits per binding}} = 0.7$$
Page 102
Information/Energy = Efficiency of EcoRI = 70%
[Figure: sequence logo of EcoRI sites, 5′ GAATTC 3′ at positions 0 to 5; vertical axis 0 to 2 bits]
EcoRI could have made 17.3 binary choices ... but it only made 12 choices.
Efficiency is ‘WORK’ DONE / ENERGY DISSIPATED
$$\frac{12 \text{ bits per binding}}{17.3 \text{ bits per binding}} = 0.7$$
The efficiency is 70%.
Page 103
Information/Energy = Efficiency of EcoRI = 70%
[Figure: sequence logo of EcoRI sites, 5′ GAATTC 3′ at positions 0 to 5; vertical axis 0 to 2 bits]
EcoRI could have made 17.3 binary choices ... but it only made 12 choices.
Efficiency is ‘WORK’ DONE / ENERGY DISSIPATED
$$\frac{12 \text{ bits per binding}}{17.3 \text{ bits per binding}} = 0.7$$
The efficiency is 70%.
18 out of 19 DNA binding proteins give ∼70% efficiency.
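The EcoRI efficiency is a two-line calculation from numbers already on the slides. In this sketch the variable names are illustrative, and `r_energy` is recomputed from the binding constant, not taken from the rounded 17.3.

```python
import math

r_sequence = 6 * 2            # 6 bases x 2 bits per base = 12 bits selected
r_energy = math.log2(1.6e5)   # bits that the dissipated energy could have selected
efficiency = r_sequence / r_energy

print(f"efficiency = {efficiency:.2f}")
```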
Page 104
Rhodopsin Shape Change
Dark State
P. Scheerer et al. Nature 455, 497-502(2008) doi:10.1038/nature07330
Page 105
Rhodopsin Shape Change
Dark State
hν
P. Scheerer et al. Nature 455, 497-502(2008) doi:10.1038/nature07330
Page 106
Rhodopsin Shape Change
Dark State After Photon - Light State
hν
P. Scheerer et al. Nature 455, 497-502(2008) doi:10.1038/nature07330
Page 108
Rhodopsin Shape Change
Dark State After Photon - Light State
hν
70%
P. Scheerer et al. Nature 455, 497-502(2008) doi:10.1038/nature07330
Page 109
Rhodopsin Shape Change
Dark State After Photon - Light State
hν
70%
30%
P. Scheerer et al. Nature 455, 497-502(2008) doi:10.1038/nature07330
Page 110
Muscle Structure
https://en.wikipedia.org/wiki/Muscle
Page 116
Tom’s Model of Muscle Mechanism
[Figure: muscle mechanism diagram. Labels: before state, ATP, prime, operation, "power stroke", after state (forward), after state (degenerate), 70% success, 30% failure, heat + ADP + Pi]
Page 117
Efficiency of Muscle
• Experiments by Kushmerick’s lab since (at least) 1969
http://dx.doi.org/10.1113/jphysiol.2007.146829 http://dx.doi.org/10.1242/jeb.052985
Page 118
Efficiency of Muscle
• Experiments by Kushmerick’s lab since (at least) 1969
• new work: 2008, 2011
http://dx.doi.org/10.1113/jphysiol.2007.146829 http://dx.doi.org/10.1242/jeb.052985
Page 119
Efficiency of Muscle
• Experiments by Kushmerick’s lab since (at least) 1969
• new work: 2008, 2011
• Weight lifting gives work done
http://dx.doi.org/10.1113/jphysiol.2007.146829 http://dx.doi.org/10.1242/jeb.052985
Page 120
Efficiency of Muscle
• Experiments by Kushmerick’s lab since (at least) 1969
• new work: 2008, 2011
• Weight lifting gives work done
• NMR coil gives ATP = energy used
http://dx.doi.org/10.1113/jphysiol.2007.146829 http://dx.doi.org/10.1242/jeb.052985
Page 121
Efficiency of Muscle
• Experiments by Kushmerick’s lab since (at least) 1969
• new work: 2008, 2011
• Weight lifting gives work done
• NMR coil gives ATP = energy used
• Efficiency: 0.68 ± 0.09
http://dx.doi.org/10.1113/jphysiol.2007.146829 http://dx.doi.org/10.1242/jeb.052985
Page 122
Why are molecular machines 70% efficient?
[Figure: sequence logo of EcoRI sites, 5′ GAATTC 3′ at positions 0 to 5; vertical axis 0 to 2 bits]
70% efficiency appears widely in biology:
Page 123
Why are molecular machines 70% efficient?
[Figure: sequence logo of EcoRI sites, 5′ GAATTC 3′ at positions 0 to 5; vertical axis 0 to 2 bits]
70% efficiency appears widely in biology:
• DNA - protein binding
Page 124
Why are molecular machines 70% efficient?
[Figure: sequence logo of EcoRI sites, 5′ GAATTC 3′ at positions 0 to 5; vertical axis 0 to 2 bits]
70% efficiency appears widely in biology:
• DNA - protein binding
• rhodopsin
Page 125
Why are molecular machines 70% efficient?
[Figure: sequence logo of EcoRI sites, 5′ GAATTC 3′ at positions 0 to 5; vertical axis 0 to 2 bits]
70% efficiency appears widely in biology:
• DNA - protein binding
• rhodopsin
• muscle
Page 126
Why are molecular machines 70% efficient?
[Figure: sequence logo of EcoRI sites, 5′ GAATTC 3′ at positions 0 to 5; vertical axis 0 to 2 bits]
70% efficiency appears widely in biology:
• DNA - protein binding
• rhodopsin
• muscle
• other systems
Page 127
Why are molecular machines 70% efficient?
[Figure: sequence logo of EcoRI sites, 5′ GAATTC 3′ at positions 0 to 5; vertical axis 0 to 2 bits]
70% efficiency appears widely in biology:
• DNA - protein binding
• rhodopsin
• muscle
• other systems
Why 70% efficiency?
Page 128
Why are molecular machines 70% efficient?
[Figure: sequence logo of EcoRI sites, 5′ GAATTC 3′ at positions 0 to 5; vertical axis 0 to 2 bits]
70% efficiency appears widely in biology:
• DNA - protein binding
• rhodopsin
• muscle
• other systems
Why 70% efficiency?
Information theory explanation
Page 129
Lock and Key
Like a key in a lock which has many independent pins, it takes many numbers to describe the vibrational state of a molecular machine
G A A T T C
Page 130
Gaussians
[Figure: bell curve of $p(x) \propto e^{-x^2}$ versus $x$]
• Pin motion x has a Gaussian distribution:
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
µ = mean, σ = standard deviation
Page 131
Gaussians
[Figure: bell curve of $p(x) \propto e^{-x^2}$ versus $x$]
• Pin motion x has a Gaussian distribution:
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
µ = mean, σ = standard deviation
• Gaussian distributions are generated by the sum of many small random variables
Page 132
Gaussians
[Figure: bell curve of $p(x) \propto e^{-x^2}$ versus $x$]
• Pin motion x has a Gaussian distribution:
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
µ = mean, σ = standard deviation
• Gaussian distributions are generated by the sum of many small random variables
• Drunkard’s walk: Galton’s quincunx device!
http://www.youtube.com/watch?v=xDIyAOBa_yU
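The drunkard's walk can be simulated in a few lines. This is a minimal sketch, assuming each step is an independent move of -1 or +1 with equal probability, as a ball does at each pin of a quincunx. The step count, number of walks, and seed are arbitrary choices for illustration.

```python
import random
import statistics

def drunkards_walk(n_steps, rng):
    """Final position after n_steps independent steps of -1 or +1."""
    return sum(rng.choice((-1, 1)) for _ in range(n_steps))

rng = random.Random(0)  # fixed seed so the run is repeatable
n_steps, n_walks = 100, 5000
finals = [drunkards_walk(n_steps, rng) for _ in range(n_walks)]

mean = statistics.fmean(finals)
variance = statistics.pvariance(finals)
print(f"mean = {mean:.2f}, variance = {variance:.1f} (theory: 0 and {n_steps})")
```

A histogram of `finals` approaches the bell curve, with mean near 0 and variance near the number of steps.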
Page 133
Two Gaussians
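$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{1}$$
$$p(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y-\mu)^2}{2\sigma^2}} \tag{2}$$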
Page 134
Two Gaussians
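$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \propto e^{-x^2} \tag{1}$$
$$p(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y-\mu)^2}{2\sigma^2}} \propto e^{-y^2} \tag{2}$$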
Page 135
Two Gaussians
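$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \propto e^{-x^2} \tag{1}$$
$$p(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y-\mu)^2}{2\sigma^2}} \propto e^{-y^2} \tag{2}$$
[Figure: joint density p(x, y) over the x-y plane, with radius r]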
Page 136
Two Gaussians
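$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \propto e^{-x^2} \tag{1}$$
$$p(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y-\mu)^2}{2\sigma^2}} \propto e^{-y^2} \tag{2}$$
[Figure: joint density p(x, y) over the x-y plane, with radius r]
$$p(x, y) = p(x) \times p(y) \tag{3}$$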
Page 137
Two Gaussians
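$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \propto e^{-x^2} \tag{1}$$
$$p(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y-\mu)^2}{2\sigma^2}} \propto e^{-y^2} \tag{2}$$
[Figure: joint density p(x, y) over the x-y plane, with radius r]
\begin{align}
p(x, y) &= p(x) \times p(y) \tag{3} \\
&\propto e^{-x^2} \times e^{-y^2} \tag{4}
\end{align}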
Page 138
Two Gaussians
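$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \propto e^{-x^2} \tag{1}$$
$$p(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y-\mu)^2}{2\sigma^2}} \propto e^{-y^2} \tag{2}$$
[Figure: joint density p(x, y) over the x-y plane, with radius r]
\begin{align}
p(x, y) &= p(x) \times p(y) \tag{3} \\
&\propto e^{-x^2} \times e^{-y^2} \tag{4} \\
&\propto e^{-(x^2 + y^2)} \tag{5}
\end{align}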
Page 139
Two Gaussians
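$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \propto e^{-x^2} \tag{1}$$
$$p(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y-\mu)^2}{2\sigma^2}} \propto e^{-y^2} \tag{2}$$
[Figure: joint density p(x, y) over the x-y plane, with radius r]
credit: http://en.wikipedia.org/wiki/Pythagoras
\begin{align}
p(x, y) &= p(x) \times p(y) \tag{3} \\
&\propto e^{-x^2} \times e^{-y^2} \tag{4} \\
&\propto e^{-(x^2 + y^2)} \tag{5} \\
&\propto e^{-r^2} \tag{6}
\end{align}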
Page 140
Two Gaussians
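$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \propto e^{-x^2} \tag{1}$$
$$p(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y-\mu)^2}{2\sigma^2}} \propto e^{-y^2} \tag{2}$$
[Figure: joint density p(x, y) over the x-y plane, with radius r]
credit: http://en.wikipedia.org/wiki/Pythagoras
\begin{align}
p(x, y) &= p(x) \times p(y) \tag{3} \\
&\propto e^{-x^2} \times e^{-y^2} \tag{4} \\
&\propto e^{-(x^2 + y^2)} \tag{5} \\
&\propto e^{-r^2} \tag{6}
\end{align}
If p(x, y) is a constant, then r is a constant.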
Page 141
Two Gaussians
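$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \propto e^{-x^2} \tag{1}$$
$$p(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y-\mu)^2}{2\sigma^2}} \propto e^{-y^2} \tag{2}$$
[Figure: joint density p(x, y) over the x-y plane, with radius r]
credit: http://en.wikipedia.org/wiki/Pythagoras
\begin{align}
p(x, y) &= p(x) \times p(y) \tag{3} \\
&\propto e^{-x^2} \times e^{-y^2} \tag{4} \\
&\propto e^{-(x^2 + y^2)} \tag{5} \\
&\propto e^{-r^2} \tag{6}
\end{align}
If p(x, y) is a constant, then r is a constant.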
Circular distribution!
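The circular symmetry in equations (3) to (6) can be checked directly. This sketch assumes both pins have mean 0 and the same standard deviation (here 1), and evaluates the joint density at twelve points on a circle of radius r. The function names are illustrative.

```python
import math

def gaussian(x, mu=0.0, sigma=1.0):
    """One-dimensional Gaussian density, as in equations (1) and (2)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def joint(x, y):
    """p(x, y) = p(x) * p(y) for two independent pins, equation (3)."""
    return gaussian(x) * gaussian(y)

r = 1.5
angles = [k * math.pi / 6 for k in range(12)]
densities = [joint(r * math.cos(a), r * math.sin(a)) for a in angles]

# Spread of the density around the circle; any difference is floating-point rounding.
spread = max(densities) - min(densities)
print(f"density on the circle = {densities[0]:.6f}, spread = {spread:.1e}")
```

If the two standard deviations were different, the contours of constant density would be ellipses and this check would fail.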
Page 142
1 Dimension
[Figure labels: Energy, States]
1 dimension is too simple!
Page 143
Bowls in 2 Dimensions
Page 144
Spheres in 3 Dimensions
Page 145
N Dimensional Sphere
Page 146
Spheres tighten in high dimensions
[Figure: Normalized Density (0.00 to 1.00) versus Normalized Radius (0.00 to 3.00), with curves for D=1, D=2, D=4, D=8]
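The tightening can be quantified without plotting. This sketch assumes each of the D coordinates is an independent unit Gaussian, so the radius follows a chi distribution. How the slide normalizes its axes is not stated, so the ratio of standard deviation to mean radius is used here as one reasonable measure of relative shell thickness.

```python
import math

def radius_mean_and_std(d):
    """Mean and standard deviation of the radius of d independent unit Gaussians."""
    mean = math.sqrt(2) * math.gamma((d + 1) / 2) / math.gamma(d / 2)
    std = math.sqrt(d - mean ** 2)  # E[r^2] = d for unit Gaussians
    return mean, std

relative_widths = []
for d in (1, 2, 4, 8, 100):
    mean, std = radius_mean_and_std(d)
    relative_widths.append(std / mean)
    print(f"D={d:3d}  mean radius={mean:6.3f}  std={std:.3f}  std/mean={std / mean:.3f}")
```

The standard deviation stays below 1 while the mean radius grows roughly as the square root of D, so the points concentrate in an increasingly thin shell relative to its radius.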
Page 147
Good Sphere Packing
Good packing of spheres gives a molecule the capacity to make selections efficiently
Page 148
N Dimensional Sphere Separation
Degenerate Sphere
Page 149
N Dimensional Sphere Separation
Degenerate Sphere Forward Sphere
Page 150
N Dimensional Sphere Separation
Degenerate Sphere Forward Sphere
√Noise
Page 151
N Dimensional Sphere Separation
Degenerate Sphere Forward Sphere
√Noise
√Power
Page 152
N Dimensional Sphere Separation
Degenerate Sphere Forward Sphere
√Noise
√Power
Energy dissipated to escape the Degenerate Sphere must exceed the Noise
Page 153
N Dimensional Sphere Separation
Degenerate Sphere Forward Sphere
√Noise
√Power
Energy dissipated to escape the Degenerate Sphere must exceed the Noise
$$\sqrt{\text{Power}} > \sqrt{\text{Noise}}$$
Page 154
Theoretical Isothermal Efficiency
T. D. Schneider, Nucleic Acids Research (2010) 38: 5995-6006
• For molecular states of molecules with $d_{\mathrm{space}}$ ‘parts’, $P_y$ energy is dissipated for noise $N_y$ and
$$C_y = d_{\mathrm{space}} \log_2(P_y/N_y + 1) \quad \leftarrow \text{machine capacity}$$
[Figure axes: $\epsilon_t$ versus $P_y/N_y$]
Page 155
Theoretical Isothermal Efficiency
T. D. Schneider, Nucleic Acids Research (2010) 38: 5995-6006
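• For molecular states of molecules with $d_{\mathrm{space}}$ ‘parts’, $P_y$ energy is dissipated for noise $N_y$ and
$$C_y = d_{\mathrm{space}} \log_2(P_y/N_y + 1) \quad \leftarrow \text{machine capacity}$$
$$\epsilon_t = \frac{\ln\left(\frac{P_y}{N_y} + 1\right)}{\frac{P_y}{N_y}} \quad \leftarrow \text{isothermal efficiency}$$
[Figure axes: $\epsilon_t$ versus $P_y/N_y$]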
Page 156
Theoretical Isothermal Efficiency
T. D. Schneider, Nucleic Acids Research (2010) 38: 5995-6006
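• For molecular states of molecules with $d_{\mathrm{space}}$ ‘parts’, $P_y$ energy is dissipated for noise $N_y$ and
$$C_y = d_{\mathrm{space}} \log_2(P_y/N_y + 1) \quad \leftarrow \text{machine capacity}$$
$$\epsilon_t = \frac{\ln\left(\frac{P_y}{N_y} + 1\right)}{\frac{P_y}{N_y}} \quad \leftarrow \text{isothermal efficiency}$$
[Figure: $\epsilon_t$ versus $P_y/N_y$ from 0 to 4; vertical axis values marked at 0.00, 0.40, 0.46, 0.55, 0.69, 1.00; curves labeled "Second Law upper bound" and "Isothermal Efficiency upper bound"]
The curve is an upper bound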
Page 157
Theoretical Isothermal Efficiency
T. D. Schneider, Nucleic Acids Research (2010) 38: 5995-6006
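• For molecular states of molecules with $d_{\mathrm{space}}$ ‘parts’, $P_y$ energy is dissipated for noise $N_y$ and
$$C_y = d_{\mathrm{space}} \log_2(P_y/N_y + 1) \quad \leftarrow \text{machine capacity}$$
$$\epsilon_t = \frac{\ln\left(\frac{P_y}{N_y} + 1\right)}{\frac{P_y}{N_y}} \quad \leftarrow \text{isothermal efficiency}$$
[Figure: $\epsilon_t$ versus $P_y/N_y$ from 0 to 4; vertical axis values marked at 0.00, 0.40, 0.46, 0.55, 0.69, 1.00; curves labeled "Second Law upper bound" and "Isothermal Efficiency upper bound"]
The curve is an upper bound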
• If Py/Ny = 1 the efficiency is 70%!
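The two formulas on this slide are easy to evaluate. In this sketch the function names are illustrative, and the ratio P_y/N_y is passed as a single number. At P_y/N_y = 1 the bound equals ln 2, which is the 70% figure.

```python
import math

def isothermal_efficiency(ratio):
    """Upper bound epsilon_t = ln(Py/Ny + 1) / (Py/Ny), for ratio = Py/Ny > 0."""
    return math.log(ratio + 1) / ratio

def machine_capacity(d_space, ratio):
    """C_y = d_space * log2(Py/Ny + 1), in bits."""
    return d_space * math.log2(ratio + 1)

for ratio in (1, 2, 3, 4):
    print(f"Py/Ny = {ratio}: efficiency bound = {isothermal_efficiency(ratio):.2f}")
```

The values at ratios 1 to 4 correspond to the tick labels 0.69, 0.55, 0.46, and 0.40 on the slide's vertical axis.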
Page 158
Acknowledgements
• Mentors:
• Larry Gold (graduate school mentor)
• Andrej Ehrenfeucht (information theory)
• Students:
• Mike Stephens (logos, splicing)
• John Spouge (mathematics)
• Paul C. Anagnostopoulos (Evolution model)
• Bruce Shapiro and Eckart Bindewald (3D logo)
• Kevin Franco (σ38)
• Ding Jin (σ38)
• Useful discussions with
• Jeff Strathern
• Amar Klar
• Kemi Abolude
• Susan Lauffer
• Cedric Cagliero
• Amar Klar
• Zhi-Ming Zheng
• Mark Lewandoski
• This research was supported by the Intramural Research Program of the NIH, National Cancer Institute, Center for Cancer Research.
Page 159
Information theory: the mathematics of biology
[Figure: sequence logos for 1744 human acceptor sites, 1799 human donor sites, and EcoRI sites; vertical axis 0 to 2 bits]
https://alum.mit.edu/www/toms/
Page 161
Version
version = 1.89 of kanpurtalk.tex 2016 Oct 16