Detecting signals in biological sequences
Leelavati Narlikar
August 2010
DNA: code for life
sugar-phosphate backbone
hydrogen-bonded base pairs
Cells store their hereditary information on DNA molecules
Complete DNA content = Genome (~3 billion base-pairs)
Four types of nucleotides:
A: adenine   C: cytosine   G: guanine   T: thymine
Expressing information on the DNA
DNA
RNA
Transcription (RNA synthesis)
PROTEIN
Translation (protein synthesis)
Gene
Can explain differences across species...
A region along the human genome...
Also explain differences within species...
But what about the same organism?
Neuron Lymphocyte
25 μm
[adapted from Molecular Biology of the Cell, Alberts et al.]
Variability in expression levels
[Figure: transcription and translation yield many protein copies of Gene A, only one of Gene B, and none of Gene C (×)]
Transcriptional regulatory code
Transcriptional regulation
Basal transcription complex
Transcription start site
Transcription factor (TF)
Transcription factor binding site (TFBS)
TFs act as activators (+) or repressors (−)
What we know about TF-DNA binding
A TF binds a short DNA site (around 5 to 10 bp long)
Binding is specific: a TF will not bind an arbitrary nucleotide string; for example, a TF may bind only CAGTGT
If we knew its “preference”, we could just scan the DNA and look for matches
There are over 500 TFs in the human genome; we probably know the preferences of only 50 or so
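When the preference really is a single fixed word, the scan is just a substring search over both strands; a minimal Python sketch, reusing the hypothetical site CAGTGT from above (the toy sequence is made up):

```python
def revcomp(seq):
    """Reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))

def scan(seq, site):
    """Return (0-based start, strand) for every occurrence of `site`;
    positions on the '-' strand are coordinates along the reverse complement."""
    hits = []
    for strand in (seq, revcomp(seq)):
        start = strand.find(site)
        while start != -1:
            hits.append((start, "+" if strand is seq else "-"))
            start = strand.find(site, start + 1)
    return hits

dna = "TTTTCAGTGTAAACACTGTT"  # toy sequence with the site on both strands
print(scan(dna, "CAGTGT"))    # [(4, '+'), (2, '-')]
```

Real preferences are degenerate rather than a single word, which is why the probabilistic models on the following slides are needed.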
3000bp region near gene, bound by TF ZYX
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGCAGTGGGCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
3000bp region near TSS of a real mouse gene
AGGAATATTCTGCTGTTTGGGATCTTGCCACAGCCACTTCCAGCCTGGGAAAAGGCATTTACTGTAAACAGCGGGAGAAGGGGCTCCTTCCCCAACAGCTGACAGCTCATTTTAACCAGAGGAACTGAGATTTGATTTTGGAGTTCATCTCCCTGGGCAGTGAAGGATCAAAAAACAAACAACAACCTGAGAGAAGGGGTGGAGGTTTCCATAGAGGAAAGTACTGGGGCGGGGATGGAGGCTTGGTGGTGGGGGCTTGGGGGTGGGGGCTTGGTGGTGGGGGCTTGGTGGTGGGGGCTTGGTGGTGGGGGCTTGGGGGTGGGGAGGGAAGCATCTAACTTCCTGGGTTCTAACCTGGCTTTCTCTCAGGTTCTGGCTGACACTGAGCAGGACACTTTATTGCATTGCAGCCTTGTGGGTTTGCCTATCTGTAAAATTAGGAATAAAAAGAGGCCCAGTCAGGATATTTGAAGATGGAAGACAGAATAAATCAAGTGATCTCTCAGAACTGGGTCTCACAGAGCCAAGAATTGGGGGTAAGAAGTACAGGTGGCGGCTACACTTGCTCTGGTAATTCCAACAGAATAGCCATGCACAGTTAGGGAGGTAAAAAGTGGATACGTAAACGGCCCCATTTCTCACTGATGAAAAAACCCTTCCTGCTTCCTGTAACAAAAGCACTGTACGGCAAAGCAAGGAGAAGCTTAAGTACTTGGGACCTCCTCGTCAAAGAAGGCATCCATGGCCATCTTAAGGTGTGGAGGGGCACAAGAGGTCACTACAAAGGTACTAGGTCCCTCTGATACATATGTCAGGTGGGCAAAAGAATTCCTCCAGGAAAGGGGGGCAGTGAAGGGGACGGAAACCCTTGTCTGGATGAAGTTCTGGGGTGAAGAGTCTCTTCTGTCCAAAGCCTTTTGGGAAGATGAGGTGCCTGCATTTACTCTCTTTGCTTCTGTCAAGTGCCTGAGGGTGGCCAGATTTCGACCTGCTGGGGAGAAGGATTTTGGTCATGGTTTAGCAGGAGGTGGGGGTTTTGCAGTGGGCATGTGAAGGAAGGGATGGTGGCGGAGTGACGAAGACCACTCTGTCTGTTGTACGAAGGTCCCCAACCTGGGATAGAGGCCTCTTCCAGAGACTGTGAGAGTGCTTGAGGTGAAGGGGGGTGACTAGGGGTAGCCCGTCTTGTCCTGGCAGCTCCTACTTGCTGGTCAAAGCCCTCAGGCCGCCCTACTTGTGCACTGACCTGAGCTAATCTAAAAAATACCGCAGGGGAGGTAGAGACTGGGGTTCCCAGTGAGAGAGAGTCTCCAGACGGGAGAAAAGGAGCGGGAGTCCTTGTCATTTCTGTCAGCTTTCTTAGGCTCAGTGACAACAATGCTTCTCCTTCATCTAGGCTGGGTCCCATCTCGTGGTTTGCTGCTTAGGAGTTTGAAAGAGAACCCAGCTGGGGACGTAGACAGGGACCCACAGAAAGCAGCCGTAGCTGACCCATGCCTCATGAAGACTACAAAGGGGCTCACGCCAGCACGAACGCAAGGCAACTCCTTTCAGAAGCGCCAGCTCGGCAATGAAACTCGGCTGCGCAGCAAACCACACACGGAATACGCACGGTTACCAAAGCTGCCGCTCAGAGTTCACACAGCGCCAACCCACAGCTGTATCTAATGCGATGTCTTTGTCTCTGGATCTCTTTCGTCTTCGTGCCCGCGCGCACTCGCATGACACTCAACAGAAACATCCAAGCTCTCTCAGTCCGGGGGCGGTGATCCTAGCCTGGCCGAGCGTACCCATGTTTCTCTCAGTCCGGGGGCGGTGATCCTAGCCTGGCCGAGCGTACCCATGTTTCTGAGCTCCGGTCCGCAAGGCTGTCAGCTCGCCTTGCCTTTCGTCTATCCTGACCTTCTCAGATAAGCATTTGCTTACCGAGGGGGCGAGGGGGCGTCCTCAGAATCCCTCCGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
GAGAGAGAGAGAGAGACAGAGACAGAGACAGAGACAGAGACACAGAGACAGAGACAGACAGAGACAGACAGAGACAGAGAGAGCGCCCAAAGGCTAGCCTTTCCCTTCCACTGCGCGCAGTTGATGGTGAGGCACCAGCTCCTACCACGGCATCCCTGGACGACAGAAACAGCTCAGATGGTCCAACCCAGGGCTGACTTTCTTCAAAAGTAATCCAAGACAGTCACTTCTGTCATCAGGATGGACTTGCAGAACAAGTGATAGATGGCAGAGACACAAACAGACGACCGATCGGCCGGCCTAGCTCTAGAGACTCTCACCTGTCTGTCCTGTTGGTTATCCGGACGGTTAGCCAGAGGATCCGGGGCCCCCGCACAGCTCCGGGACTCTGGAAGATAGTTCCGAGGGTGGGGACCTTCGAGAACCAGCCCACACTGAACTCCTCCCTCCTTGTGGCAGCAGCAAGGTGGGACGGGCCAGGACGTCTGCTTAGCACCTCCTCCAGAAATGCAGCACTTGGGGGGCCCCCACCCTTTCGCGCGCTCCTTCCCACCGACCTCCCAGGGGTGCACCTCTCCAGCCTCGGTCGCGCTTCCGAAACCTTTGGTGCCCCCTTTTCCTGGTCCCGACCCCCCACCTCACGCCCCCTGGTCTGGACAGCATCTCCCCCTCGCCGCCCTCCGCCCACGCACCGCCTGACTCCGAGGGGTGCGAGCGCATTGGGCTGCGCCCGCGTGGGGGCGCCGCGCCAGCCTCGCGTAGCTGTTCTGACGCTGCCGTCGCCGCCGCCCTCCGCAGCCCAGCCGGCACCCGCACCAGCTCTGCAGTGCACTCGTCGCCTCTCGGGCCGGTCCCACCAAGAGCCAGACTGTCGTGACCGGGGCCAGCCTCGAACGTCAGGCGCGAGGGTCATGAGCCAGAGCGCCCTGGGGCGCCGCGCGGAGACCCAGCGGAGATAGCAGTCCTCGCTGCCTTGACGCGCGCCCGCCGCGTCCCCAGA
Markov models for background
Probabilistic Model for the Background
Markov Model of Order 0: Aggregate Probabilities of Occurrence

A: 0.2   C: 0.3   G: 0.3   T: 0.2
• Long-range correlations are known to exist in DNA sequences.
Markov Model of Order 1: Conditional Probabilities Representing One-Step Correlations

Current base ↓ / Next base →    A      C      G      T
A                               0.3    0.2    0.4    0.1
C                               0.5    0.5    0      0
G                               0      1      0      0
T                               0      0.75   0      0.25

• Each row adds up to 1.
Mihir Arjunwadkar Probabilistic Pattern Discovery in Genomic Sequences
Markov models for background
Markov Model of Order 2: Two-Step Correlations

Current word ↓ / Next base →      A            C            G            T
AA    0.14468887   0.34738369   0.3647338    0.14319361
AC    0.28652560   0.04944015   0.2475591    0.41647510
AG    0.25500737   0.24605241   0.2790940    0.21984627
AT    0.24553913   0.32125812   0.2272264    0.20597631
CA    0.29077682   0.44232378   0.2193125    0.04758690
CC    0.18156346   0.30470655   0.2991050    0.21462501
CG    0.51227826   0.15130835   0.2561048    0.08030855
CT    0.33931374   0.30861055   0.1403038    0.21177192
GA    0.03849488   0.59090986   0.2929306    0.07766468
GC    0.29668576   0.01695954   0.3368234    0.34953134
GG    0.60471652   0.01173188   0.1304406    0.25311101
GT    0.27622836   0.42019980   0.1779573    0.12561454
TA    0.31275145   0.26811120   0.1299927    0.28914466
TC    0.05453213   0.28767978   0.2924408    0.36534726
TG    0.22519927   0.47783775   0.1882527    0.10871030
TT    0.11225098   0.17351505   0.3864218    0.32781218
• Background model parameters can be estimated from the same sequence data in which motifs are searched; generally, this is not recommended.
• The appropriate/optimal model order needs to be determined.
3000bp region near TSS of a real mouse gene
AGGATTATTCTGCTGTTTGGGATCTTGCCACAGCCACTTCCAGCCTGGGAAAAGGCATTTACTGTAAACAGCGGGAGAAGGGGCTCCTTCCCCAACAGCTGACAGCTCATTTTAACCAGAGGAACTGAGATTTGATTTTGGAGTTCATCTCCCTGGGCAGTGAAGGATCAAAAAACAAACAACAACCTGAGAGAAGGGGTGGAGGTTTCCATAGAGGAAAGTACTGGGGCGGGGATGGAGGCTTGGTGGTGGGGGCTTGGGGGTGGGGGCTTGGTGGTGGGGGCTTGGTGGTGGGGGCTTGGTGGTGGGGGCTTGGGGGTGGGGAGGGAAGCATCTAACTTCCTGGGTTCTAACCTGGCTTTCTCTCAGGTTCTGGCTGACACTGAGCAGGACACTTTATTGCATTGCAGCCTTGTGGGTTTGCCTATCTGTAAAATTAGGAATAAAAAGAGGCCCAGTCAGGATATTTGAAGATGGAAGACAGAATAAATCAAGTGATCTCTCAGAACTGGGTCTCACAGAGCCAAGAATTGGGGGTAAGAAGTACAGGTGGCGGCTACACTTGCTCTGGTAATTCCAACAGAATAGCCATGCACAGTTAGGGAGGTAAAAAGTGGATACGTAAACGGCCCCATTTCTCACTGATGAAAAAACCCTTCCTGCTTCCTGTAACAAAAGCACTGTACGGCAAAGCAAGGAGAAGCTTAAGTACTTGGGACCTCCTCGTCAAAGAAGGCATCCATGGCCATCTTAAGGTGTGGAGGGGCACAAGAGGTCACTACAAAGGTACTAGGTCCCTCTGATACATATGTCAGGTGGGCAAAAGAATTCCTCCAGGAAAGGGGGGCAGTGAAGGGGACGGAAACCCTTGTCTGGATGAAGTTCTGGGGTGAAGAGTCTCTTCTGTCCAAAGCCTTTTGGGAAGATGAGGTGCCTGCATTTACTCTCTTTGCTTCTGTCAAGTGCCTGAGGGTGGCCAGATTTCGACCTGCTGGGGAGAAGGATTTTGGTCATGGTTTAGCAGGAGGTGGGGGTTTTGCAGTGGGCATGTGAAGGAAGGGATGGTGGCGGAGTGACGAAGACCACTCTGTCTGTTGTACGAAGGTCCCCAACCTGGGATAGAGGCCTCTTCCAGAGACTGTGAGAGTGCTTGAGGTGAAGGGGGGTGACTAGGGGTAGCCCGTCTTGTCCTGGCAGCTCCTACTTGCTGGTCAAAGCCCTCAGGCCGCCCTACTTGTGCACTGACCTGAGCTAATCTAAAAAATACCGCAGGGGAGGTAGAGACTGGGGTTCCCAGTGAGAGAGAGTCTCCAGACGGGAGAAAAGGAGCGGGAGTCCTTGTCATTTCTGTCAGCTTTCTTAGGCTCAGTGACAACAATGCTTCTCCTTCATCTAGGCTGGGTCCCATCTCGTGGTTTGCTGCTTAGGAGTTTGAAAGAGAACCCAGCTGGGGACGTAGACAGGGACCCACAGAAAGCAGCCGTAGCTGACCCATGCCTCATGAAGACTACAAAGGGGCTCACGCCAGCACGAACGCAAGGCAACTCCTTTCAGAAGCGCCAGCTCGGCAATGAAACTCGGCTGCGCAGCAAACCACACACGGAATACGCACGGTTACCAAAGCTGCCGCTCAGAGTTCACACAGCGCCAACCCACAGCTGTATCTAATGCGATGTCTTTGTCTCTGGATCTCTTTCGTCTTCGTGCCCGCGCGCACTCGCATGACACTCAACAGAAACATCCAAGCTCTCTCAGTCCGGGGGCGGTGATCCTAGCCTGGCCGAGCGTACCCATGTTTCTCTCAGTCCGGGGGCGGTGATCCTAGCCTGGCCGAGCGTACCCATGTTTCTGAGCTCCGGTCCGCAAGGCTGTCAGCTCGCCTTGCCTTTCGTCTATCCTGACCTTCTCAGATAAGCATTTGCTTACCGAGGGGGCGAGGGGGCGTCCTCAGAATCCCTCCGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
GAGAGAGAGAGAGAGACAGAGACAGAGACAGAGACAGAGACACAGAGACAGAGACAGACAGAGACAGACAGAGACAGAGAGAGCGCCCAAAGGCTAGCCTTTCCCTTCCACTGCGCGCAGTTGATGGTGAGGCACCAGCTCCTACCACGGCATCCCTGGACGACAGAAACAGCTCAGATGGTCCAACCCAGGGCTGACTTTCTTCAAAAGTAATCCAAGACAGTCACTTCTGTCATCAGGATGGACTTGCAGAACAAGTGATAGATGGCAGAGACACAAACAGACGACCGATCGGCCGGCCTAGCTCTAGAGACTCTCACCTGTCTGTCCTGTTGGTTATCCGGACGGTTAGCCAGAGGATCCGGGGCCCCCGCACAGCTCCGGGACTCTGGAAGATAGTTCCGAGGGTGGGGACCTTCGAGAACCAGCCCACACTGAACTCCTCCCTCCTTGTGGCAGCAGCAAGGTGGGACGGGCCAGGACGTCTGCTTAGCACCTCCTCCAGAAATGCAGCACTTGGGGGGCCCCCACCCTTTCGCGCGCTCCTTCCCACCGACCTCCCAGGGGTGCACCTCTCCAGCCTCGGTCGCGCTTCCGAAACCTTTGGTGCCCCCTTTTCCTGGTCCCGACCCCCCACCTCACGCCCCCTGGTCTGGACAGCATCTCCCCCTCGCCGCCCTCCGCCCACGCACCGCCTGACTCCGAGGGGTGCGAGCGCATTGGGCTGCGCCCGCGTGGGGGCGCCGCGCCAGCCTCGCGTAGCTGTTCTGACGCTGCCGTCGCCGCCGCCCTCCGCAGCCCAGCCGGCACCCGCACCAGCTCTGCAGTGCACTCGTCGCCTCTCGGGCCGGTCCCACCAAGAGCCAGACTGTCGTGACCGGGGCCAGCCTCGAACGTCAGGCGCGAGGGTCATGAGCCAGAGCGCCCTGGGGCGCCGCGCGGAGACCCAGCGGAGATAGCAGTCCTCGCTGCCTTGACGCGCGCCCGCCGCGTCCCCAGA
ChIP-chip experiments
An overview of the ChIP-chip experimental procedure

Ranging from yeast to cultured mammalian cells, there is surprisingly little variation in published ChIP-chip protocols. Generally, cells are grown under the desired experimental condition and then fixed with formaldehyde (Fig. 1A). Formaldehyde crosslinks proteins to each other, primarily between the ε-amino group of lysine residues and an adjacent peptide bond. Formaldehyde can also form DNA-protein crosslinks, but only if the DNA is partially …

Fig. 1. (A) A summary of the ChIP-chip procedure. See the text for details. (B) Comparison of the controls used for single-locus, PCR-based ChIP experiments and microarray-based experiments. Single-locus experiments use a single internal control in each sample. The intensity of the target band is compared across the IP, mock IP (or control IP), and input DNA. In microarray experiments, ratios obtained for enriched elements (boxed in white) are compared to those obtained for all other elements, which are termed non-enriched. (C) Global array normalization will slide the raw distribution (red) along the x-axis so that the median log2 ratio is equal to 0 for the normalized distribution (blue). (D) The effect of default normalization on a simulated ChIP-chip experiment in which 20% of arrayed elements detect five-fold enrichment (log2 STDev = 0.5). The simulated experiment was repeated three times, and the distribution of the average ratios is plotted. The distribution is skewed such that the median log2 ratio of the non-enriched population is at −0.25 (black). The ideal normalization would center the non-enriched population at 0 (green).

M.J. Buck, J.D. Lieb / Genomics 83 (2004) 349–360
Given: a set of DNA sequences bound by a TF
X1, X2, X3, X4, · · ·, Xn

Problem of motif discovery

Each Xi is believed to contain a binding site of that TF

Goal is to find:
locations of these binding sites in the sequences, and
a description of the word (or motif)
E.g. Pho4, a yeast TF, has 12 binding sites listed in SCPD
Common sequence “pattern” in the binding sites: motif
How can we model a motif?
Modeling binding specificities
Genes: YBR093C (×6), YDR481C, YGR233C, YML123C (×4)

ACACGTGG ACACGTGG GCACGTTT GCACGTTT GCACGTTT ACACGTGG CCACGCGC GCACGTGC GCACGTGG CCACGTGG GCACGTTT TCACGTTA
Consensus: GCACGTGG
Expand alphabet: gCACGTgg
Regular expr.: [GACT]CACG[TC][GT][GTCA]
1 2 3 4 5 6 7 8
A 0.25 0.00 1.00 0.00 0.00 0.00 0.00 0.08
C 0.17 1.00 0.00 1.00 0.00 0.08 0.00 0.17
G 0.50 0.00 0.00 0.00 1.00 0.00 0.58 0.42
T 0.08 0.00 0.00 0.00 0.00 0.92 0.42 0.33
1 2 3 4 5 6 7 8
A 3 0 12 0 0 0 0 1
C 2 12 0 12 0 1 0 2
G 6 0 0 0 12 0 7 5
T 1 0 0 0 0 11 5 4
Position specific scoring matrix
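The count and frequency matrices above can be computed directly from the 12 aligned Pho4 sites listed on this slide; a short Python sketch:

```python
sites = [
    "ACACGTGG", "ACACGTGG", "GCACGTTT", "GCACGTTT",
    "GCACGTTT", "ACACGTGG", "CCACGCGC", "GCACGTGC",
    "GCACGTGG", "CCACGTGG", "GCACGTTT", "TCACGTTA",
]
W = len(sites[0])

# Count matrix: counts[a][b] = number of sites with nucleotide b at position a.
counts = [{b: 0 for b in "ACGT"} for _ in range(W)]
for site in sites:
    for a, b in enumerate(site):
        counts[a][b] += 1

# Frequency matrix (the PSSM probabilities): divide by the number of sites.
freqs = [{b: c / len(sites) for b, c in col.items()} for col in counts]

print(counts[0])  # position 1: {'A': 3, 'C': 2, 'G': 6, 'T': 1}
```

The first column reproduces the count row A=3, C=2, G=6, T=1 shown above; the invariant CACG core shows up as columns whose frequency is 1.00 for a single base.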
Basics of probability theory
Probability distribution
Introduction to probability models¹

Background

• We define probability in terms of a sample space S, a set whose elements are elementary events.
• Each elementary event can be viewed as the outcome of an experiment.
• For example, consider flipping a coin twice. The sample space consists of all possible outcomes, that is, S = {HH, TT, HT, TH}. Let A denote the event that we obtain a head. Then the probability that event A occurs is given by Pr(A) = 3/4.

P(event A) = (number of elements in the sample space favorable to event A) / (size of sample space)

• Random variable: a function used to assign unique numerical values to all possible outcomes of a random experiment. E.g., we can use the random variable X to be the number of heads in the coin tosses. So X can be 0, 1, or 2, and we need to find P(X ≥ 1) in our example.

Probability distribution

• A probability distribution P assigns a probability to each value of a particular random variable such that the following axioms are satisfied:
1. P(X = u) ≥ 0 ∀u
2. Σ_u P(X = u) = 1
• Say you flip a coin n times, and you want to find the probability of finding k heads in it.
• Let’s say the coin is biased, so you have probability p of getting a head and 1 − p of getting a tail. What then?

Continuous probability distribution / probability density

• Associated with continuous random variables. E.g., the weight of a newborn.
• f(x) ≥ 0
• ∫_{−∞}^{+∞} f(x) dx = 1
• P(a ≤ x ≤ b) = ∫_a^b f(x) dx ≥ 0

¹These notes are based on lecture notes from a Fall 2002 course at Duke University taught by Prof. Hartemink.
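The biased-coin question has the binomial answer P(k heads in n flips) = C(n, k) p^k (1 − p)^(n − k); a quick check in Python:

```python
from math import comb

def binom_pmf(n, k, p):
    """Probability of exactly k heads in n flips of a coin with P(head) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Fair coin, two flips: P(X >= 1) = P(X = 1) + P(X = 2) = 3/4, matching the
# sample-space count {HH, HT, TH} out of {HH, HT, TH, TT}.
p_at_least_one = sum(binom_pmf(2, k, 0.5) for k in (1, 2))
print(p_at_least_one)  # 0.75
```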
Basics of probability theory
Probability distribution
Conditional distribution
Bayes’ theorem
Marginalization
P(A | B) = P(A ∩ B) / P(B)

P(A | B) = P(A) P(B | A) / P(B)

P(A) = Σ_B P(A, B)
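As a toy numerical check of these identities, using the two-coin-flip sample space from the previous slide (the event B below is introduced here for illustration):

```python
from fractions import Fraction as F

# Two fair coin flips: S = {HH, HT, TH, TT}. Let A = "at least one head",
# B = "first flip is a head". Conditional: P(A | B) = P(A ∩ B) / P(B).
p_B = F(2, 4)          # {HH, HT}
p_A_and_B = F(2, 4)    # {HH, HT} (both contain a head)
p_A_given_B = p_A_and_B / p_B   # = 1: given a first head, A is certain

# Bayes' theorem: P(B | A) = P(B) * P(A | B) / P(A)
p_A = F(3, 4)          # {HH, HT, TH}
p_B_given_A = p_B * p_A_given_B / p_A
print(p_B_given_A)  # 2/3
```

Exact fractions avoid any floating-point noise, so Bayes’ theorem can be verified directly against the sample-space count |{HH, HT}| / |{HH, HT, TH}| = 2/3.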
Given: a set of DNA sequences bound by a TF
X1, X2, X3, X4, · · ·, Xn

Problem of de novo motif discovery

Each Xi is believed to contain a binding site of that TF

Goal is to find:
locations of these binding sites in the sequences, and
parameters of the motif model describing these binding sites
Some more notation
Z : vector indicating the starting position of the binding site in each sequence (Z1, . . . , Zn for sequences X1, . . . , Xn)
W : length of the binding site
φ : parameters of the motif model (we assume a position specific scoring matrix, or PSSM); φ_{a,b} is the probability of finding nucleotide b at location a within the binding site
φ0 : background model parameters (a k-order Markov model is used)
Likelihood of a sequence
Figure 3.1: We are given n DNA sequences X1, . . . , Xn. Each Xi is believed to contain one binding site, depicted in red, at an unknown position denoted by Zi. The goal is to infer the value of Z as well as the motif parameters φ that best describe the variabilities in the binding sites.

3.2.2 Sequence model

For simplicity, as depicted in Figure 3.1, each Xi is assumed to contain exactly one binding site of that TF. Let Z be a vector of length n denoting the starting location of the binding site in each sequence: Zi = j if there is a binding site at position j in sequence Xi. The nucleotides not belonging to the binding sites are assumed to be drawn from some background model parameterized by φ0.

Thus if the sequence Xi is of length li, and it contains a binding site at location Zi, we can compute the likelihood of the sequence as:

P(Xi | φ, Zi, φ0) = P(Xi,1, . . . , Xi,Zi−1 | φ0) × [ ∏_{a=1}^{W} φ_{a,Xi,Zi+a−1} ] × P(Xi,Zi+W, . . . , Xi,li | φ0)    (3.8)

Each sequence Xi can thus be partitioned into two regions based on the value of Zi and W: one that contains the nucleotides in the binding site, and another that contains the nucleotides that are not part of the binding site. For simplicity, let us use P_M(Xi | φ, Zi) to denote the part that is explained by the motif model φ, and P_M̄(Xi | Zi, φ0) to denote the part that is explained by the background model φ0.
In other words,

P_M(Xi | φ, Zi) = ∏_{a=1}^{W} φ_{a,Xi,Zi+a−1}    (3.9)

and

P_M̄(Xi | Zi, φ0) = P(Xi,1, . . . , Xi,Zi−1 | φ0) × P(Xi,Zi+W, . . . , Xi,li | φ0)    (3.10)

Thus equation (3.8) can be written as:

P(Xi | φ, Zi, φ0) = P_M̄(Xi | Zi, φ0) × P_M(Xi | φ, Zi)    (3.11)

3.2.3 Objective function

We wish to find φ and Z that maximize the joint posterior distribution of the unknowns conditional on the data. Therefore, our objective function is:

argmax_{φ,Z} P(φ, Z | X, φ0) = argmax_{φ,Z} P(X | φ, Z, φ0) P(φ) P(Z)    (3.12)

assuming independent priors P(φ) and P(Z) over φ and Z, respectively.

3.2.4 Collapsed Gibbs sampling

If we applied traditional Gibbs sampling to the optimization problem described in equation (3.12), we would have to sample each Zi and the high-dimensional φ. However, collapsing φ, as proposed by Liu [1994], results in a more efficient algorithm with n components.

We note that

P(Z | X, φ0) ∝ P(Z, X | φ0)
 = ∫_φ P(φ, X | Z, φ0) P(Z) dφ
 = P(Z) ∫_φ P(X | φ, Z, φ0) P(φ) dφ
 = P(Z) P_M̄(X | Z, φ0) ∫_φ P_M(X | φ, Z) P(φ) dφ    (3.13)
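The sequence likelihood of equation (3.8), with a motif region scored by the PSSM and everything else by the background, can be sketched in a few lines of Python; the order-0 background and the W = 3 PSSM below are made-up values for illustration:

```python
# Toy models (made up): order-0 uniform background and a W=3 PSSM phi[a][b].
phi0 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
phi = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
W = len(phi)

def likelihood(x, z):
    """P(x | phi, z, phi0): motif model inside the site starting at 0-based z,
    background model outside (the two factors of eqs. 3.9 and 3.10)."""
    p = 1.0
    for pos, b in enumerate(x):
        if z <= pos < z + W:
            p *= phi[pos - z][b]   # motif region
        else:
            p *= phi0[b]           # background region
    return p

x = "TTACGTT"
print(likelihood(x, 2))  # site "ACG" at z = 2 scores 0.7^3; the rest 0.25^4
```

Evaluating this over every candidate z gives exactly the per-position scores that the sampler below needs.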
Objective function
Need to find optimal values for Z and φ which will maximize the posterior distribution:

argmax_{φ,Z} P(φ, Z | X, φ0) = argmax_{φ,Z} P(X | φ, Z, φ0) × P(φ) × P(Z)
Traditional Gibbs sampling
Goal is to generate samples from the joint distribution P(θ1, θ2, . . . , θk | D)

Gibbs sampling is used when the joint distribution is not known explicitly, but the conditional of each θi can be computed

θ1, . . . , θk : random variables
D : data or known parameters
N : number of iterations

1. Initialize θ^(0)_{1:k}
2. For t = 1 to N:
   Sample θ^(t)_1 ∼ P(θ1 | θ^(t−1)_2, θ^(t−1)_3, . . . , θ^(t−1)_k, D)
   Sample θ^(t)_2 ∼ P(θ2 | θ^(t)_1, θ^(t−1)_3, . . . , θ^(t−1)_k, D)
   ...
   Sample θ^(t)_k ∼ P(θk | θ^(t)_1, θ^(t)_2, . . . , θ^(t)_{k−1}, D)
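As a concrete instance of this loop, consider a Gibbs sampler for a toy bivariate normal target; the conditional distributions used are the standard ones for that target (a textbook example, not from these slides):

```python
import random

def gibbs_bivariate_normal(rho, n_iter, seed=0):
    """Gibbs sampler for (x, y) ~ bivariate normal with zero means, unit
    variances, and correlation rho. Each full conditional is univariate:
    x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y | x."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0                      # step 1: initialize
    sd = (1 - rho ** 2) ** 0.5
    samples = []
    for _ in range(n_iter):              # step 2: sweep the components in turn
        x = rng.gauss(rho * y, sd)       # sample x conditional on current y
        y = rng.gauss(rho * x, sd)       # sample y conditional on new x
        samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.9, n_iter=20000)
mean_x = sum(x for x, _ in samples) / len(samples)
print(round(mean_x, 2))  # close to the true mean of 0
```

With rho near 1, successive samples move in small steps, so the chain mixes slowly; that autocorrelation is exactly the issue motivating the collapsed sampler on the next slide.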
Collapsed Gibbs sampling
Ideal qualities of a Gibbs sampler:
sampling one component conditional on the others must be fast
sample autocorrelation must be low, to promote better exploration of the sample space

Instead of sampling from P(θi | θ1, . . . , θi−1, θi+1, . . . , θk), reduce the number of components by integrating out one or more of them:

P(θi | θ1, . . . , θi−1, θi+1, . . . , θk−1) = ∫ P(θi, θk | θ1, . . . , θi−1, θi+1, . . . , θk−1) dθk
Gibbs sampler for motif discovery
Objective: argmax_{φ,Z} P(X | φ, Z, φ0) × P(φ) × P(Z)

Collapsed Gibbs sampling (Liu 1994): sample only Z, by integrating out the parameters φ
Gibbs sampler for motif discovery
If we assume a Dirichlet distribution, parameterized by α, for the prior over φ:
We also note that, since φ is a multinomial distribution, if we assume a conjugate product Dirichlet prior over φ, we can get a closed-form solution for the integral in equation (3.13). In particular, let αa for 1 ≤ a ≤ W be the parameters for the Dirichlet prior, where αa = (αa1, . . . , αa4) for the four nucleotides. Thus the integral in equation (3.13) can be simplified as:

∫_φ P_M(X | φ, Z) P(φ) dφ ∝ ∏_{a=1}^{W} ∏_{b=1}^{4} Γ(c_ab(X) + α_ab)    (3.14)

where c_ab(X) denotes the count of nucleotide b obtained from the data X at the a-th position in the binding site, where the positions of the binding sites are determined based on the corresponding Z.

For the Gibbs sampler, we need to be able to draw Zi ∼ P(Zi | Z[−i], X, φ0). Using equations (3.13) and (3.14) we get:

P(Zi | Z[−i], X, φ0) = P(Z | X, φ0) / P(Z[−i] | X, φ0)

 ∝ [ P(Z) P_M̄(X | Z, φ0) ∏_{a=1}^{W} ∏_{b=1}^{4} Γ(c_ab(X) + α_ab) ] / [ P(Z[−i]) P_M̄(X[−i] | Z[−i], φ0) ∏_{a=1}^{W} ∏_{b=1}^{4} Γ(c_ab(X[−i]) + α_ab) ]

 = P_M̄(Xi | Zi, φ0) × (P(Z) / P(Z[−i])) × ∏_{a=1}^{W} ∏_{b=1}^{4} Γ(c_ab(X) + α_ab) / Γ(c_ab(X[−i]) + α_ab)    (3.15)

where c_ab(X[−i]) denotes the counts obtained from the data X without the sequence Xi, based on positions Z[−i]. Also, c_ab(X) = c_ab(X[−i]) + c_ab(Xi), where c_ab(Xi) denotes the counts in sequence Xi. We now use the fact that if c1 ≫ c2,

Γ(c1 + c2) / Γ(c1) ≈ c1^{c2}    (3.16)

Note that c_ab(Xi) is “1” for exactly one value of b for each a, depending on the nucleotide present in position Zi + a − 1 in sequence Xi, and is “0” otherwise. Therefore, as long as the total number of sequences n is large, we can assume
Gibbs sampler for motif discovery
We also note that, since φ is a multinomial distribution, if we assume a conjugate
product Dirichlet prior over φ, we can get a closed form solution for the integral
in equation (3.13). In particular, let αa for 1 ≤ a ≤ W be the parameters for the
Dirichlet prior where αa = (αa1, . . . ,αa4) for the four nucleotides. Thus the integral
in equation (3.13) can be simplified as:
�
φ
PM(X | φ, Z)P (φ)dφ ∝W�
a=1
4�
b=1
Γ(cab(X) + αab) (3.14)
where cab(X) denotes the counts of nucleotide b obtained from the data X at the ath
position in the binding site, where the positions of the binding sites are determined
based on corresponding Z.
For the Gibbs sampler, we need to be able to draw Zi ∼ P (Zi | Z[−i], X, φ0).
Using equations (3.13) and (3.14) we get:
$$
\begin{aligned}
P(Z_i \mid Z_{[-i]}, X, \phi_0) &= \frac{P(Z \mid X, \phi_0)}{P(Z_{[-i]} \mid X, \phi_0)} \\
&\propto \frac{P(Z)\, P_M(X \mid Z, \phi_0) \prod_{a=1}^{W} \prod_{b=1}^{4} \Gamma\big(c_{ab}(X) + \alpha_{ab}\big)}{P(Z_{[-i]})\, P_M(X_{[-i]} \mid Z_{[-i]}, \phi_0) \prod_{a=1}^{W} \prod_{b=1}^{4} \Gamma\big(c_{ab}(X_{[-i]}) + \alpha_{ab}\big)} \\
&= P_M(X_i \mid Z_i, \phi_0)\, \frac{P(Z)}{P(Z_{[-i]})} \prod_{a=1}^{W} \prod_{b=1}^{4} \frac{\Gamma\big(c_{ab}(X) + \alpha_{ab}\big)}{\Gamma\big(c_{ab}(X_{[-i]}) + \alpha_{ab}\big)} \qquad (3.15)
\end{aligned}
$$
where cab(X[−i]) denotes the counts obtained from the data X without the sequence
Xi, based on positions Z[−i]. Also, cab(X) = cab(X[−i]) + cab(Xi), where cab(Xi)
denotes the counts in sequence Xi. We now use the fact that if $c_1 \gg c_2$,
$$\frac{\Gamma(c_1 + c_2)}{\Gamma(c_1)} \approx c_1^{c_2} \qquad (3.16)$$
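To see why equation (3.16) is reasonable, note that Γ(c1 + c2)/Γ(c1) is the rising factorial c1(c1 + 1)···(c1 + c2 − 1), each of whose c2 factors is close to c1 when c1 ≫ c2. A quick numerical check (the values are chosen arbitrarily for illustration):

```python
from math import lgamma, exp

c1, c2 = 1000.0, 2.0  # c1 >> c2
ratio = exp(lgamma(c1 + c2) - lgamma(c1))  # Gamma(c1 + c2) / Gamma(c1)
approx = c1 ** c2                          # the approximation c1^c2
rel_error = abs(ratio - approx) / ratio
# ratio = 1000 * 1001 = 1001000 while approx = 1000000: about 0.1% off
```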
Note that cab(Xi) is “1” for exactly one value of b for each a, depending on the
nucleotide present in position Zi + a − 1 in sequence Xi. It is “0” otherwise.
Therefore, as long as the total number of sequences n is large, we can assume
that $c_{ab}(X_{[-i]}) + \alpha_{ab} \gg c_{ab}(X_i)$. Using the approximation in equation (3.16), equation (3.15) can be simplified as:
$$P(Z_i \mid Z_{[-i]}, X, \phi_0) \;\propto\; P_M(X_i \mid Z_i, \phi_0)\, \frac{P(Z)}{P(Z_{[-i]})} \prod_{a=1}^{W} \prod_{b=1}^{4} \big(c_{ab}(X_{[-i]}) + \alpha_{ab}\big)^{c_{ab}(X_i)} \qquad (3.17)$$
Note that $\forall a$, $\sum_{b=1}^{4} c_{ab}(X_{[-i]}) = n - 1$. Thus if we divide each $\prod_{b=1}^{4} \big(c_{ab}(X_{[-i]}) + \alpha_{ab}\big)^{c_{ab}(X_i)}$ in equation (3.17) by $(n - 1) + \sum_{b=1}^{4} \alpha_{ab}$, we get:
$$
\begin{aligned}
P(Z_i \mid Z_{[-i]}, X, \phi_0) &\propto P_M(X_i \mid Z_i, \phi_0)\, \frac{P(Z)}{P(Z_{[-i]})} \prod_{a=1}^{W} \prod_{b=1}^{4} \big(\hat{\phi}_{a,b}\big)^{c_{ab}(X_i)} \\
&= P_M(X_i \mid Z_i, \phi_0)\, \frac{P(Z)}{P(Z_{[-i]})} \prod_{a=1}^{W} \hat{\phi}_{a, X_{i,Z_i+a-1}} \qquad (3.18)
\end{aligned}
$$
where φ̂ is the posterior mean of φ conditioned on the sequences X[−i] and positions Z[−i].
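In an implementation the Gamma ratios never need to be formed explicitly: the held-out counts give the posterior-mean matrix φ̂ directly, and, under a uniform prior on Z and an i.i.d. background, the conditional in equation (3.18) reduces to the motif-to-background odds over the W site positions (the background factor for the rest of the sequence is constant in Zi). A sketch of this step, with illustrative names:

```python
import random

BASES = "ACGT"

def posterior_mean(counts, alpha):
    """phi_hat[a][b] = (c_ab + alpha_ab) / sum_b (c_ab + alpha_ab)."""
    phi_hat = []
    for c_row, al_row in zip(counts, alpha):
        total = sum(c + al for c, al in zip(c_row, al_row))
        phi_hat.append([(c + al) / total for c, al in zip(c_row, al_row)])
    return phi_hat

def sample_site(seq, phi_hat, bg, rng=random):
    """Sample Zi with probability proportional to equation (3.18),
    assuming a uniform prior on Z and an i.i.d. background bg."""
    W = len(phi_hat)
    weights = []
    for z in range(len(seq) - W + 1):
        odds = 1.0
        for a in range(W):
            b = BASES.index(seq[z + a])
            odds *= phi_hat[a][b] / bg[b]  # motif vs. background odds
        weights.append(odds)
    return rng.choices(range(len(weights)), weights=weights)[0]
```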
3.3 Framework of PRIORITY
We have developed PRIORITY, a program that performs motif discovery using a
collapsed Gibbs sampling approach, similar to the one described in the previous
section. In the following subsections, we describe the various properties of PRIORITY
that make it different from traditional motif discovery programs.
3.3.1 Relaxing the assumption of exactly one binding site
PRIORITY allows for the possibility of some sequences not possessing a binding site.
This is particularly important when dealing with noisy data. The idea is similar
to the ZOOPS (Zero or One Occurrence Per Sequence) model used in
Figure 3.1: We are given n DNA sequences X1, . . . , Xn. Each Xi is believed to contain one binding site, depicted in red, at an unknown position denoted by Zi. The goal is to infer the value of Z as well as the motif parameters φ that best describe the variability in the binding sites.
3.2.2 Sequence model
For simplicity, as depicted in Figure 3.1, each Xi is assumed to contain exactly one
binding site of that TF. Let Z be a vector of length n denoting the starting location
of the binding site in each sequence: Zi = j if there is a binding site at position j in
sequence Xi. The nucleotides not belonging to the binding sites are assumed to be
drawn from some background model parameterized by φ0.
Thus if the sequence Xi is of length li, and it contains a binding site at location
Zi, we can compute the likelihood of the sequence as:
$$P(X_i \mid \phi, Z_i, \phi_0) = P(X_{i,1}, \ldots, X_{i,Z_i-1} \mid \phi_0) \times \left(\prod_{a=1}^{W} \phi_{a, X_{i,Z_i+a-1}}\right) \times P(X_{i,Z_i+W}, \ldots, X_{i,l_i} \mid \phi_0) \qquad (3.8)$$
Each sequence Xi can thus be partitioned into two regions based on the value of Zi and W: one containing the nucleotides in the binding site and the other containing the nucleotides outside it. For simplicity, let us use PM(Xi | φ, Zi) to denote the part of the likelihood that is explained by the motif model φ, and PM(Xi | Zi, φ0) to denote the part that is explained by the background model φ0.
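Equation (3.8) translates directly into code once a background model is fixed; the sketch below assumes an i.i.d. background over the four nucleotides (a common simplification; all names are illustrative):

```python
from math import log

BASES = "ACGT"

def sequence_log_likelihood(seq, z, phi, bg):
    """log P(Xi | phi, Zi=z, phi0) as in equation (3.8).

    Positions z..z+W-1 (0-based) are scored under the motif model phi;
    every other position is scored under the background distribution bg.
    """
    W = len(phi)
    ll = 0.0
    for pos, base in enumerate(seq):
        b = BASES.index(base)
        if z <= pos < z + W:
            ll += log(phi[pos - z][b])  # inside the binding site
        else:
            ll += log(bg[b])            # flanking background
    return ll
```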
The algorithm
Initialize the vector Z randomly.
Two-step iterative procedure:
1. Hold out one of the n sequences Xi at random (or in some specified order). Compute φ̂ based on the alignment of all sequences except the held-out sequence.
2. Calculate P(Zi | Z[−i], X, φ0) for each value of Zi using the new PSSM and sample a position from it.
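Putting the two steps together, a bare-bones collapsed Gibbs sampler might look as follows. This is a sketch in the spirit of the procedure above, not the PRIORITY implementation; it assumes a uniform i.i.d. background and a symmetric pseudocount α:

```python
import random

BASES = "ACGT"

def gibbs_motif(seqs, W, iters=1000, alpha=0.5, rng=random):
    """Resample each start position Zi given the sites in all other sequences."""
    bg = [0.25] * 4  # uniform background, for simplicity
    Z = [rng.randrange(len(s) - W + 1) for s in seqs]
    for _ in range(iters):
        for i, seq in enumerate(seqs):
            # Step 1: count nucleotides in the aligned sites, holding out Xi.
            counts = [[alpha] * 4 for _ in range(W)]
            for j, (other, z) in enumerate(zip(seqs, Z)):
                if j != i:
                    for a in range(W):
                        counts[a][BASES.index(other[z + a])] += 1
            denom = len(seqs) - 1 + 4 * alpha
            phi_hat = [[c / denom for c in row] for row in counts]
            # Step 2: score every candidate start and sample a new Zi.
            weights = []
            for z in range(len(seq) - W + 1):
                odds = 1.0
                for a in range(W):
                    b = BASES.index(seq[z + a])
                    odds *= phi_hat[a][b] / bg[b]
                weights.append(odds)
            Z[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return Z
```

On real data one would also track the best-scoring alignment across iterations and use multiple random restarts, since the sampler can linger in local maxima.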
Iterations in a typical Gibbs sampler
Figure 4.2: Motif scores for two Gibbs samplers: one with and the other without
the informative prior, over 5,000 iterations. Both programs were run five times from
different starting locations. The two black plots are the best and worst runs for the
program with the uniform prior. The two grey plots are the best and worst runs for
the program with the informative prior. Although the absolute values of the scores
are not comparable (due to an arbitrary constant value assigned to the uniform prior),
it is clear that the algorithm with the informative prior takes almost half as many
iterations to converge. Also, each of the five runs converges to a similar
final motif in the case of the algorithm incorporating the informative prior. On the
other hand, during the worst of the five runs for the other program with the uniform
prior, the sampler gets stuck in a local maximum that corresponds to a suboptimal
motif.