Transcript
Slide 1
Hidden Markov models in Computational Biology
DTC, Gerton Lunter, WTCHG, February 2011
Includes material from: Dirk Husmeier, Heng Li
Slide 2
Overview
First part: mathematical context: Bayesian networks, Markov models, hidden Markov models.
Second part: worked example: the occasionally crooked casino; applications in biology.
Third part: Practical 0 (more theory on HMMs); Practicals I-V (theory, implementation, biology). Pick & choose.
Slide 3
Part I HMMs in (mathematical) context
Slide 4
Probabilistic models
A mathematical model describing a joint distribution over many variables. Three types of variables are distinguished:
- Observed variables
- Latent (hidden) variables
- Parameters
Latent variables are often the quantities of interest, to be inferred from observations using the model. Sometimes they represent nuisance variables that are necessary to correctly describe the relationships in the data.
Example: P(clouds, sprinkler_used, rain, wet_grass)
Slide 5
Some notation / terminology
P(X,Y,Z): probability of (X,Y,Z) occurring simultaneously
P(X,Y): probability of (X,Y) occurring in combination with any Z (marginalized over Z)
P(X,Y|Z): probability of (X,Y) occurring, provided that it is known that Z occurs (conditional on Z, or given Z)
P(X,Y) = Σ_Z P(X,Y,Z)
P(Z) = Σ_{X,Y} P(X,Y,Z)
P(X,Y|Z) = P(X,Y,Z) / P(Z)
Σ_{X,Y,Z} P(X,Y,Z) = 1
P(Y|X) = P(X|Y) P(Y) / P(X)   (Bayes' rule)
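To make these identities concrete, here is a small numeric sketch in Python; the joint distribution over three binary variables is invented for illustration.

```python
# A minimal numeric illustration of the identities above, using an
# invented joint distribution over three binary variables X, Y, Z.
from itertools import product

# Hypothetical joint probabilities P(X,Y,Z); they sum to 1.
joint = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.15, (0, 1, 1): 0.10,
    (1, 0, 0): 0.05, (1, 0, 1): 0.20, (1, 1, 0): 0.05, (1, 1, 1): 0.30,
}
assert abs(sum(joint.values()) - 1.0) < 1e-12   # sum_{X,Y,Z} P(X,Y,Z) = 1

# Marginalization: P(X,Y) = sum_Z P(X,Y,Z)
p_xy = {(x, y): sum(joint[(x, y, z)] for z in (0, 1))
        for x, y in product((0, 1), repeat=2)}

# P(Z) = sum_{X,Y} P(X,Y,Z)
p_z = {z: sum(joint[(x, y, z)] for x, y in product((0, 1), repeat=2))
       for z in (0, 1)}

# Conditioning: P(X,Y | Z=1) = P(X,Y,Z=1) / P(Z=1)
p_xy_given_z1 = {(x, y): joint[(x, y, 1)] / p_z[1]
                 for x, y in product((0, 1), repeat=2)}

# Bayes' rule on X and Z: P(Z=1 | X=1) = P(X=1 | Z=1) P(Z=1) / P(X=1)
p_x = {x: sum(p_xy[(x, y)] for y in (0, 1)) for x in (0, 1)}
p_x1_given_z1 = sum(joint[(1, y, 1)] for y in (0, 1)) / p_z[1]
bayes = p_x1_given_z1 * p_z[1] / p_x[1]
direct = sum(joint[(1, y, 1)] for y in (0, 1)) / p_x[1]   # from the joint
assert abs(bayes - direct) < 1e-12
```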
Slide 6
Independence
Two variables X, Y are independent if P(X,Y) = P(X) P(Y).
Knowing or assuming that two variables are independent reduces the model complexity. Suppose X and Y each take N possible values: specifying P(X,Y) requires N² − 1 numbers, while specifying P(X) and P(Y) requires only 2N − 2 numbers (e.g. 99 versus 18 numbers for N = 10).
Two variables X, Y are conditionally independent (given Z) if P(X,Y|Z) = P(X|Z) P(Y|Z).
Slide 7
Probabilistic model: example
P(Clouds, Sprinkler, Rain, WetGrass) = P(Clouds) P(Sprinkler|Clouds) P(Rain|Clouds) P(WetGrass|Sprinkler, Rain)
This specification of the model determines which variables are deemed to be (conditionally) independent (e.g. Sprinkler and Rain given Clouds; WetGrass and Clouds given Sprinkler and Rain). These independence assumptions simplify the model.
Using formulas as above to describe the independence relationships is not very intuitive, particularly for large models. Graphical models (in particular, Bayesian networks) are a more intuitive way to do the same.
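As an illustration of how this factorization cuts down the number of parameters, here is a minimal Python sketch of the sprinkler model; all probability values are made up.

```python
# A sketch of the factorized sprinkler model; all conditional probability
# values below are invented for illustration.
from itertools import product

P_c = {True: 0.5, False: 0.5}                    # P(Clouds)
P_s = {True: 0.1, False: 0.5}                    # P(Sprinkler=on | Clouds)
P_r = {True: 0.8, False: 0.2}                    # P(Rain | Clouds)
P_w = {(True, True): 0.99, (True, False): 0.90,  # P(WetGrass | Sprinkler, Rain)
       (False, True): 0.90, (False, False): 0.01}

def joint(c, s, r, w):
    """P(c,s,r,w) = P(c) P(s|c) P(r|c) P(w|s,r): only 1+2+2+4 = 9 numbers
    are needed, instead of 2**4 - 1 = 15 for an unconstrained joint."""
    pc = P_c[c]
    ps = P_s[c] if s else 1 - P_s[c]
    pr = P_r[c] if r else 1 - P_r[c]
    pw = P_w[(s, r)] if w else 1 - P_w[(s, r)]
    return pc * ps * pr * pw

# Sanity check: the factorized joint sums to 1 over all 16 configurations.
assert abs(sum(joint(*v) for v in product((True, False), repeat=4)) - 1) < 1e-12
```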
Slide 8
Bayesian network: example
[Figure: directed graph with nodes Cloudy, Sprinkler, Rain, WetGrass]
P(Clouds) P(Sprinkler|Clouds) P(Rain|Clouds) P(WetGrass|Sprinkler, Rain)
Rule: two nodes of the graph are conditionally independent given the state of their parents. E.g. Sprinkler and Rain are independent given Cloudy.
Slide 9
Bayesian network: example
[Figure: the same network; latent nodes drawn open, observed nodes shaded]
Convention: latent variables are open; observed variables are shaded.
P(Clouds) P(Sprinkler|Clouds) P(Rain|Clouds) P(WetGrass|Sprinkler, Rain)
Slide 10
Bayesian network: example
Combat Air Identification algorithm (www.wagner.com)
Slide 11
Bayesian networks
- Intuitive formalism to develop models
- Algorithms to learn parameters from training data (maximum likelihood; EM)
- General and efficient algorithms to infer latent variables from observations (message-passing algorithm)
- Allow dealing with missing data in a robust and coherent way (make the relevant node a latent variable)
- Simulate data
Slide 12
Markov model
A particular kind of Bayesian network in which all variables are observed. Suitable for modeling dependencies within sequences.
P(S_n | S_1, S_2, ..., S_{n-1}) = P(S_n | S_{n-1})   (Markov property)
P(S_1, S_2, S_3, ..., S_n) = P(S_1) P(S_2|S_1) ... P(S_n|S_{n-1})
[Figure: chain of nodes S_1 → S_2 → ... → S_8]
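A minimal sketch of the factorization above: under the Markov property, the probability of a whole sequence is the initial probability times a product of transition probabilities. The two-state chain and all numbers below are invented.

```python
# Sequence probability under a Markov chain:
# P(S_1,...,S_n) = P(S_1) * prod_n P(S_n | S_{n-1}).
start = {"A": 0.6, "B": 0.4}                 # P(S_1)
trans = {("A", "A"): 0.7, ("A", "B"): 0.3,   # P(S_n | S_{n-1})
         ("B", "A"): 0.4, ("B", "B"): 0.6}

def sequence_probability(seq):
    p = start[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[(prev, cur)]              # one factor per transition
    return p

print(sequence_probability(["A", "A", "B", "A"]))  # 0.6*0.7*0.3*0.4 = 0.0504
```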
Slide 13
Markov model
States: letters in English words. Transitions: which letter follows which.
Training text: MR SHERLOCK HOLMES WHO WAS USUALLY VERY LATE IN THE MORNINGS SAVE UPON THOSE NOT INFREQUENT OCCASIONS WHEN HE WAS UP ALL ...
S_1 = M, S_2 = R, S_3 = ' ', S_4 = S, S_5 = H, ...
The transition probabilities P(S_n = y | S_{n-1} = x) are the parameters; the maximum-likelihood estimate is P(S_{n-1}S_n = xy) / P(S_{n-1} = x) ≈ (frequency of xy) / (frequency of x).
Generated text: UNOWANGED HE RULID THAND TROPONE AS ORTIUTORVE OD T HASOUT TIVE IS MSHO CE BURKES HEST MASO TELEM TS OME SSTALE MISSTISE S TEWHERO
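The estimation and generation described above can be sketched in a few lines: transition probabilities are estimated as bigram frequencies over letter frequencies (the maximum-likelihood estimates), and new text is sampled from them. The short training string below stands in for the full Conan Doyle text.

```python
# Estimate a letter-bigram Markov model by maximum likelihood and
# generate gibberish with English-like letter statistics.
import random
from collections import Counter, defaultdict

text = "MR SHERLOCK HOLMES WHO WAS USUALLY VERY LATE IN THE MORNINGS"

# Maximum likelihood: P(S_n = y | S_{n-1} = x) ~ freq(xy) / freq(x).
bigrams = Counter(zip(text, text[1:]))
successors = defaultdict(list)
for (x, y), n in bigrams.items():
    successors[x].extend([y] * n)   # y occurs n times after x: choosing
                                    # uniformly from this list realizes
                                    # freq(xy) / freq(x)

def generate(length, seed="M"):
    out = [seed]
    for _ in range(length - 1):
        out.append(random.choice(successors[out[-1]]))
    return "".join(out)

print(generate(60))
```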
Slide 14
Markov model
States: triplets of letters. Transitions: which (overlapping) triplet follows which.
Training text: MR SHERLOCK HOLMES WHO WAS USUALLY VERY LATE IN THE MORNINGS SAVE UPON THOSE NOT INFREQUENT OCCASIONS WHEN HE WAS UP ALL ...
S_1 = 'MR ', S_2 = 'R S', S_3 = ' SH', S_4 = 'SHE', S_5 = 'HER', ...
P(S_n = xyz | S_{n-1} = wxy) = P(wxyz) / P(wxy) ≈ (frequency of wxyz) / (frequency of wxy)
Generated text: THERE THE YOU SOME OF FEELING WILL PREOCCUPATIENCE CREASON LITTLED MASTIFF HENRY MALIGNATIVE LL HAVE MAY UPON IMPRESENT WARNESTLY
Slide 15
Markov model
States: word pairs. Text from: http://www.gutenberg.org/etext/1105
Generated text:
Then churls their thoughts (although their eyes were kind) To thy
fair appearance lies To side this title is impanelled A quest of
thoughts all tenants to the sober west As those gold candles fixed
in heaven's air Let them say more that like of hearsay well I will
drink Potions of eisel 'gainst my strong infection No bitterness
that I was false of heart Though absence seemed my flame to qualify
As easy might I not free When thou thy sins enclose! That tongue
that tells the story of thy love Ay fill it full with feasting on
your sight Book both my wilfulness and errors down And on just
proof surmise accumulate Bring me within the level of your eyes And
in mine own when I of you beauteous and lovely youth When that
churl death my bones with dust shall cover And shalt by fortune
once more re-survey These poor rude lines of life thou art forced
to break a twofold truth Hers by thy deeds
Slide 16
Hidden Markov model
HMM = probabilistic observation of a Markov chain; another special kind of Bayesian network.
The S_i form a Markov chain as before, but the states are unobserved. Instead, the y_i (each dependent on S_i) are observed. Generative viewpoint: state S_i emits symbol y_i.
The y_i do not form a Markov chain (i.e. they do not satisfy the Markov property); they exhibit more complex (long-range) dependencies.
[Figure: hidden chain S_1 → ... → S_8, each state S_i emitting an observed symbol y_i]
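A minimal sketch of the generative viewpoint: sample a hidden state path from the Markov chain, and let each state emit one observed symbol. The two-state model and all numbers are invented.

```python
# Generate (states, symbols) from a toy HMM: the hidden chain advances
# by the transition probabilities, and each state emits one symbol.
import random

start = {"H": 0.5, "L": 0.5}                  # P(S_1)
trans = {"H": {"H": 0.9, "L": 0.1},           # P(S_i | S_{i-1})
         "L": {"H": 0.2, "L": 0.8}}
emit  = {"H": {"x": 0.7, "y": 0.3},           # P(y_i | S_i)
         "L": {"x": 0.2, "y": 0.8}}

def draw(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

def sample(n):
    states = [draw(start)]
    symbols = [draw(emit[states[-1]])]
    for _ in range(n - 1):
        states.append(draw(trans[states[-1]]))   # advance the hidden chain
        symbols.append(draw(emit[states[-1]]))   # state S_i emits symbol y_i
    return states, symbols

print(sample(10))  # the symbols alone are not Markov: runs reflect the
                   # hidden state, a long-range dependency
```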
Slide 17
Hidden Markov model
The representation above emphasizes the relation to Bayesian networks. A different graph representation emphasizes the transition probabilities P(S_i | S_{i-1}), e.g. in the case S_i ∈ {A, B, C, D}.
Notes:
- Emission probabilities P(y_i | S_i) are not explicitly represented
- The advance from i to i+1 is also implicit
- Not all arrows need to be present (probability = 0)
[Figure: transition diagram over states A, B, C, D]
Slide 18
Pair Hidden Markov model
[Figure: 2D grid of states S_ij over two observed sequences y_1...y_5 and z_1...z_3]
Slide 19
Pair Hidden Markov model
[Figure: the same 2D state grid, with a path through it]
Normalization: Σ_{paths p; y_1...y_A; z_1...z_B} P(s_{p(1)}, ..., s_{p(N)}, y_1...y_A, z_1...z_B) = 1, where N = N(p) is the length of the path.
States may emit a symbol in sequence y, or in z, or both, or neither (a silent state). If a symbol is emitted, the associated coordinate subscript increases by one; e.g. diagonal transitions are associated with simultaneous emissions in both sequences (illustrated in the sketch below).
A realization of the pair HMM consists of a state sequence, with each symbol emitted by exactly one state, and the associated path through the 2D table. (A slightly more general viewpoint decouples the states and the path; the hidden variables are then the sequence of states S and a path through the table. In this viewpoint the transitions, not the states, emit symbols. The technical term in finite-state machine theory is a Mealy machine; the standard viewpoint is also known as a Moore machine.)
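A minimal sketch of the coordinate bookkeeping described above: a match state advances both subscripts, an insert state only one. The three state labels M/X/Y follow the classic alignment pair HMM; probabilities are omitted and only coordinates are tracked.

```python
# How pair-HMM states advance the two coordinates (i, j) of the 2D table.
moves = {
    "M": (1, 1),   # emit y_i and z_j simultaneously (diagonal step)
    "X": (1, 0),   # emit in y only (gap in z)
    "Y": (0, 1),   # emit in z only (gap in y)
}

def path_coordinates(state_sequence):
    """Trace the path through the 2D table implied by a state sequence."""
    i, j, path = 0, 0, [(0, 0)]
    for s in state_sequence:
        di, dj = moves[s]
        i, j = i + di, j + dj
        path.append((i, j))
    return path

# 'MMXMY' aligns y_1..y_4 against z_1..z_4 with one gap in each sequence.
print(path_coordinates("MMXMY"))  # [(0,0), (1,1), (2,2), (3,2), (4,3), (4,4)]
```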
Slide 20
Inference in HMMs
So HMMs can describe complex (temporal, spatial) relationships in data. But how can we use the model? A number of efficient inference algorithms exist for HMMs:
- Viterbi algorithm: most likely state sequence, given observables
- Forward algorithm: likelihood of the model given observables
- Backward algorithm: together with Forward, allows computation of posterior probabilities
- Baum-Welch algorithm: parameter estimation given observables
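For concreteness, here is a compact sketch of the Viterbi algorithm for a generic discrete HMM, computed in log space for numerical stability (it assumes all probabilities are nonzero). The model layout follows the toy two-state example above.

```python
# Viterbi: most likely state sequence given the observations, in log space.
import math

def viterbi(obs, states, start, trans, emit):
    # Dynamic-programming table: V[t][s] = best log prob of any state
    # sequence ending in s after observing obs[:t+1].
    V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            # Best predecessor for state s at this position.
            prev = max(states, key=lambda r: V[-1][r] + math.log(trans[r][s]))
            col[s] = (V[-1][prev] + math.log(trans[prev][s])
                      + math.log(emit[s][o]))
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path)), max(V[-1].values())

# Example with the two-state toy model from the earlier sketch:
states = ("H", "L")
start = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.2, "L": 0.8}}
emit  = {"H": {"x": 0.7, "y": 0.3}, "L": {"x": 0.2, "y": 0.8}}
print(viterbi("xxyyyyyx", states, start, trans, emit))
```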
Slide 21
Summary of part I
- Probabilistic models: observed variables; latent variables (of interest for inference, or nuisance variables); parameters (obtained from training data, or prior knowledge)
- Bayesian networks: independence structure of the model represented as a graph
- Markov models: linear Bayesian network; all nodes observed
- Hidden Markov models: observed layer, and hidden (latent) layer of nodes; efficient inference algorithms (e.g. the Viterbi algorithm)
- Pair hidden Markov models: two observed sequences with interdependencies, determined by an unobserved Markov sequence
Slide 22
Part II Examples of HMMs
Slide 23
Example: The Occasionally Corrupt Casino
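The slides of this worked example are figures only. As a stand-in, the textbook parameterization of this model, the occasionally dishonest casino of Durbin et al. (1998, Biological Sequence Analysis), is sketched below: a fair die, a loaded die that shows six half the time, and occasional switches between them.

```python
# Standard parameterization of the occasionally dishonest casino
# (Durbin et al. 1998); roll outcomes are the characters '1'..'6'.
states = ("F", "L")                                  # fair, loaded
start  = {"F": 0.5, "L": 0.5}
trans  = {"F": {"F": 0.95, "L": 0.05},               # P(switch F -> L) = 0.05
          "L": {"F": 0.10, "L": 0.90}}               # P(switch L -> F) = 0.10
emit   = {"F": {r: 1 / 6 for r in "123456"},         # fair die: uniform
          "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}  # loaded die

# Decoding a roll sequence with the Viterbi sketch above recovers runs of
# the loaded die from stretches rich in sixes, e.g.:
#   viterbi("315116246446644245311321631164", states, start, trans, emit)
```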
Slides 24-41: [no transcript text; figures only]
Slide 42
Application: Sequence alignment
Slides 43-49: [no transcript text; figures only]
Slide 50
Application: Profile HMMs
Slides 51-56: [no transcript text; figures only]
Slide 57
Application: Ab-initio Gene Finding
Slide 58
GenScan HMM
- Semi-Markov model: explicit state durations (see the sketch below)
- Parameters depend on isochore
- Detailed modeling of local signals: poly-A signal (AATAAA); translation initiation (12 bp PSWM); promoter model (TATA box, cap signal); splice acceptor and donor models (maximal dependence decomposition model)
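In a plain HMM, the time spent in a state is implicitly geometric; a semi-Markov model instead draws an explicit duration whenever a state is entered. A minimal sketch of that idea follows; the two-state model and duration distributions are invented and are not GenScan's.

```python
# Semi-Markov generation: on entering a state, draw an explicit duration
# from that state's length distribution, then transition.
import random

trans = {"exon": {"intron": 1.0}, "intron": {"exon": 1.0}}
duration = {                       # explicit length distributions P(d)
    "exon":   {100: 0.2, 150: 0.5, 300: 0.3},
    "intron": {80: 0.4, 500: 0.4, 2000: 0.2},
}

def sample_segments(n_segments, state="exon"):
    segments = []
    for _ in range(n_segments):
        d = random.choices(list(duration[state]),
                           weights=list(duration[state].values()))[0]
        segments.append((state, d))               # stay d steps, then move
        state = random.choices(list(trans[state]),
                               weights=list(trans[state].values()))[0]
    return segments

print(sample_segments(6))
```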
Slide 59
GenScan prediction
Slide 60
Application: Demographic inference
Slide 61
Kingman's coalescent model
[Figure: Wright-Fisher model and the coalescent, viewed backwards in time]
Slide 62
Coalescent as a sequential process
- Recombination as a point process along sequences: Wiuf, Hein 1999. Whole-genome. Constructs the ARG. Exact, but not practical for whole genomes. Sampling only.
- Approximating the coalescent with recombination: SMC. McVean, Cardin 2005. Whole-genome. Fast and accurate.
- Fast coalescent simulations: SMC'. Marjoram, Wall 2006. Small improvement over the original SMC.
- SMC for coalescent algorithms with demographic structure: Eriksson, Mahjani, Mehlig 2009. SMC + migration. Simulation only.
Slide 63
Sequentially Markovian coalescent McVean, Cardin
Slide 64
SMC model for inference
Simplest situation: a diploid individual, i.e. 2 leaves. Discretize the SMC model as an HMM: state = genealogy = T_MRCA.
Li and Durbin (unpublished)
Slide 65
Part III Practicals
Slide 66
Practical 0: HMMs
What is the interpretation of the probability computed by the Forward (FW) algorithm? The Viterbi algorithm also computes a probability; how does that relate to the one computed by the FW algorithm? How do the probabilities computed by the FW and Backward algorithms compare? Explain what a posterior is, either in the context of alignment using an HMM, or of profile HMMs. Why is the logarithm trick useful for the Viterbi algorithm? Does the same trick work for the FW algorithm? (A sketch of the trick follows below.)
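As a pointer for the last two questions, here is a small sketch of the logarithm trick: in log space, products of probabilities become sums (and Viterbi's max is unaffected), while summing probabilities, as the FW algorithm does, requires the log-sum-exp identity.

```python
# Products of many small probabilities underflow; working in log space
# avoids this. Summing in log space uses log-sum-exp.
import math

def log_sum_exp(log_values):
    """log(sum_i exp(v_i)), computed stably by factoring out the maximum."""
    m = max(log_values)
    return m + math.log(sum(math.exp(v - m) for v in log_values))

probs = [1e-320, 3e-321]            # would underflow if multiplied often
logs = [math.log(p) for p in probs]
print(log_sum_exp(logs), math.log(sum(probs)))   # agree where both work
```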
Slide 67
Practical I: Profile HMMs in context
Slide 68
Look up the protein sequence of PRDM9 in the UCSC genome browser. Search InterPro for the protein sequence. Look at the ProSite profile and sequence logo. Work out the syntax of the profile (HMMer syntax), and relate the logo and profile. Which residues are highly conserved? What structural role do they play? Which are not very conserved? Can you infer that these are less important biologically?
Read PMID: 19997497 (PubMed). What is the meaning of the changed number of zinc finger motifs across species? Relate the conserved and changeable positions in the zinc fingers to the InterPro motif. Do these match the predicted pattern?
Read PMID: 19008249 and PMID: 20044541. Explain the relationship between the recombination motif and the zinc fingers. What do you think is the cellular function of PRDM9? Relate the fact that recombination hotspots in chimpanzee do not coincide with those in human to PRDM9. What do you predict about recombination hotspots in other mammalian species? Why do you think PRDM9 evolves so fast?
Background information on motif finding:
www.bx.psu.edu/courses/bx-fall04/phmm.ppt
http://compbio.soe.ucsc.edu/SAM_T08/T08-query.html
Slide 69
Practical II: HMMs and population genetics
Slide 70
Read PMID: 17319744 and PMID: 19581452. What is the difference
between phylogeny and genealogy? What is incomplete lineage
sorting? The model operates on multiple sequences. Is it a linear
HMM, a pair HMM, or something else? What do the states represent?
How could the model be improved? Which patterns in the data is the
model looking for? Would it be possible to analyze these patterns
without a probabilistic model? (Estimate how frequently (per
nucleotide) mutations occur between the species considered. What is
the average distance between recombinations?) How does the method
scale to more species?
Slide 71
Practical III: HMMs and alignment
Slide 72
PMID: 18073381 What are the causes of inaccuracies in
alignments? Would a more accurate model of sequence evolution
improve alignments? Would this be a large improvement? What is the
practical limit (in terms of evolutionary distance, in
mutations/site) on pairwise alignment? Would multiple alignment
allow more divergent species to be aligned? How does the complexity
scale for multiple alignment using HMMs, in a naïve implementation?
What could you do to improve this? What is posterior decoding and
how does it work? In what way does it improve alignments, compared
to Viterbi? Why is this?
Slide 73
Practical IV: HMMs and conservation: phastCons
Slide 74
Read PMID: 16024819 What is the difference between a phyloHMM
and a standard HMM? How does the model identify conserved regions?
How is the model helped by the use of multiple species? How is the
model parameterized? The paper uses the model to estimate the
fraction of the human genome that is conserved. How can this
estimate be criticized? Look at a few protein-coding genes, and
their conservation across mammalian species, using the UCSC genome
browser. Is it always true that (protein-coding) exons are well
conserved? Can you see regions of conservation outside of
protein-coding exons? Do these observations suggest that the model
is inaccurate? Read PMID: 19858363. Summarize the differences in approach between the new methods and the old phyloHMM.
Slide 75
Practical V: Automatic code generation for HMMs
Slide 76
http://www.well.ox.ac.uk/~gerton/Gulbenkian/HMMs and alignments.doc (skip sections 1-3)
Implementing the various algorithms for HMMs can be hard work, particularly when reasonable efficiency is required, yet library implementations are often neither fast nor flexible enough. This practical demonstrates a code generator that takes the pain out of working with HMMs. It takes you through an existing alignment HMM, and modifies it to identify conserved regions (à la phastCons).
Requirements: a Linux system with Java and GCC installed. Experience with C and/or C++ is helpful for this tutorial.