Transcript
Slide 1
Albert Gatt. Corpora and Statistical Methods, Lecture 8
Slide 2
Markov and Hidden Markov Models: Conceptual Introduction, Part 2
Slide 3
In this lecture
We focus on (Hidden) Markov Models:
- conceptual intro to Markov Models
- relevance to NLP
- Hidden Markov Models
- algorithms
Slide 4
Acknowledgement
Some of the examples in this lecture are taken from a tutorial on HMMs by Wolfgang Maass.
Slide 5
Talking about the weather
Suppose we want to predict tomorrow's weather. The possible predictions are: sunny, foggy, rainy. We might decide to predict tomorrow's outcome based on earlier weather: if it's been sunny all week, it's likelier to be sunny tomorrow than if it had been rainy all week. But how far back do we want to go to predict tomorrow's weather?
Slide 6
Statistical weather model
Notation:
- S: the state space, a set of possible values for the weather: {sunny, foggy, rainy} (each state is identifiable by an integer i)
- X: a sequence of random variables, each taking a value from S; these model the weather over a sequence of days
- t: an integer standing for time
- (X_1, X_2, X_3, ..., X_T) models the values of a series of random variables; each takes a value from S with a certain probability P(X_t = s_i)
- the entire sequence tells us the weather over T days
Slide 7
Statistical weather model
If we want to predict the weather for day t+1, our model might look like this:
P(X_{t+1} = s_k | X_1, ..., X_t)
E.g. P(weather tomorrow = sunny), conditional on the weather in the past t days. Problem: the larger t gets, the more calculations we have to make.
Slide 8
Markov Properties I: Limited horizon
The probability that we're in state s_i at time t+1 only depends on where we were at time t:
P(X_{t+1} = s_i | X_1, ..., X_t) = P(X_{t+1} = s_i | X_t)
Given this assumption, the probability of any sequence is just:
P(X_1, ..., X_T) = P(X_1) P(X_2 | X_1) P(X_3 | X_2) ... P(X_T | X_{T-1})
Slide 9
Markov Properties II: Time invariance
The probability of being in state s_j given the previous state s_i does not change over time:
P(X_{t+1} = s_j | X_t = s_i) = a_{ij}, for all t
Slide 10
Concrete instantiation

Day t    Day t+1: sunny   rainy   foggy
sunny             0.8     0.05    0.15
rainy             0.2     0.6     0.2
foggy             0.2     0.3     0.5

This is essentially a transition matrix, which gives us the probabilities of going from one state to the other. We can denote the state transition probabilities as a_{ij} (the probability of going from state i to state j).
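As a minimal illustration (the state names and probabilities come from the table above; the dictionary layout is my own choice), the transition matrix can be written down directly in Python:

```python
# Transition probabilities a_ij from the table above:
# A[i][j] = P(weather tomorrow = j | weather today = i)
A = {
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5},
}

# The distribution over tomorrow's weather given today's is just a row.
print(A["sunny"])  # {'sunny': 0.8, 'rainy': 0.05, 'foggy': 0.15}
```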
Slide 11
Graphical view
Components of the model:
1. states (s)
2. transitions
3. transition probabilities
4. initial probability distribution for states
Essentially, a non-deterministic finite state automaton.
Slide 12
Example continued
If the weather today (X_t) is sunny, what's the probability that tomorrow (X_{t+1}) is sunny and the day after (X_{t+2}) is rainy? By the Markov assumption, this factorises into one-step transitions:
P(X_{t+1} = sunny, X_{t+2} = rainy | X_t = sunny) = P(X_{t+1} = sunny | X_t = sunny) * P(X_{t+2} = rainy | X_{t+1} = sunny)
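Working this out with the transition table (a sketch reusing the dictionary A from the snippet above):

```python
# Markov assumption: the joint probability factors into one-step transitions.
# P(X_{t+1} = sunny, X_{t+2} = rainy | X_t = sunny)
#   = P(sunny | sunny) * P(rainy | sunny)
p = A["sunny"]["sunny"] * A["sunny"]["rainy"]
print(round(p, 4))  # 0.8 * 0.05 = 0.04
```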
Slide 13
Formal definition
A Markov Model is a triple (S, Π, A) where:
- S is the set of states
- Π are the probabilities of being initially in some state
- A are the transition probabilities
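One way to make the triple concrete is to sample weather sequences from it. A minimal sketch, reusing A from the earlier snippet; the initial distribution PI is an illustrative assumption of mine, not given on the slide:

```python
import random

PI = {"sunny": 0.5, "rainy": 0.25, "foggy": 0.25}  # assumed initial probabilities

def sample(dist):
    # Draw one state from a {state: probability} dictionary.
    states, weights = zip(*dist.items())
    return random.choices(states, weights=weights)[0]

def generate(T):
    # Sample a T-day weather sequence from the Markov model (S, PI, A).
    seq = [sample(PI)]
    for _ in range(T - 1):
        seq.append(sample(A[seq[-1]]))
    return seq

print(generate(7))  # e.g. ['sunny', 'sunny', 'foggy', 'rainy', ...]
```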
Slide 14
Hidden Markov Models
Slide 15
A slight variation on the example
You're locked in a room with no windows. You can't observe the weather directly; you only observe whether the guy who brings you food is carrying an umbrella or not. You need a model telling you the probability of seeing the umbrella given the weather: a distinction between observations and their underlying emitting state. Define:
- O_t as an observation at time t
- K = {+umbrella, -umbrella} as the possible outputs
We're interested in P(O_t = k | X_t = s_i), i.e. the probability of a given observation at time t, given that the underlying weather state at t is s_i.
Slide 16
Symbol emission probabilities

weather   probability of umbrella
sunny     0.1
rainy     0.8
foggy     0.3

This is the hidden model, telling us the probability that O_t = k given that X_t = s_i. We assume that each underlying state X_t = s_i emits an observation with a given probability.
Slide 17
Using the hidden model
The model gives: P(O_t = k | X_t = s_i). Then, by Bayes' Rule, we can compute: P(X_t = s_i | O_t = k). This generalises easily to an entire sequence.
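A sketch of that Bayesian inversion for a single observation, using the emission table above; the uniform prior over weather states is assumed purely for illustration (any prior, e.g. the chain's stationary distribution, could be substituted):

```python
# Emission probabilities from the table above: P(umbrella | weather)
B = {"sunny": 0.1, "rainy": 0.8, "foggy": 0.3}

# Bayes' rule: P(X_t = s | O_t = umbrella)
#   = P(umbrella | s) * P(s) / sum over s' of (P(umbrella | s') * P(s'))
prior = {s: 1 / 3 for s in B}  # assumed uniform prior, for illustration only
evidence = sum(B[s] * prior[s] for s in B)
posterior = {s: B[s] * prior[s] / evidence for s in B}
print(posterior)  # rainy comes out likeliest given an umbrella (~0.67)
```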
Slide 18
HMM in graphics
Circles indicate states; arrows indicate probabilistic dependencies between states.
Slide 19
HMM in graphics
Green nodes are hidden states. Each hidden state depends only on the previous state (Markov assumption).
Slide 20
Why HMMs?
HMMs are a way of thinking of underlying events probabilistically generating surface events. Example: parts of speech. A POS is a class or set of words. We can think of language as an underlying Markov Chain of parts of speech from which actual words are generated (emitted). So what are our hidden states here, and what are the observations?
Slide 21
HMMs in POS Tagging
[Diagram: a hidden layer of POS states (ADJ, N, V, DET) with transitions between them.]
The hidden layer (constructed through training) models the sequence of POSs in the training corpus.
Slide 22
HMMs in POS Tagging
[Diagram: each hidden state emits a word: ADJ -> tall, N -> lady, V -> is, DET -> the.]
Observations are words. They are emitted by their corresponding hidden state. The state depends on its previous state.
Slide 23
Why HMMs
There are efficient algorithms to train HMMs using Expectation Maximisation. The general idea: the training data is assumed to have been generated by some HMM (with unknown parameters); we try to learn the unknown parameters from the data. A similar idea is used in finding the parameters of some n-gram models, especially those that use interpolation.
Slide 24
Formalisation of a Hidden Markov model
Slide 25
Crucial ingredients (familiar)
- Underlying states: S = {s_1, ..., s_N}
- Output alphabet (observations): K = {k_1, ..., k_M}
- State transition probabilities: A = {a_ij}, i, j ∈ S
- State sequence: X = (X_1, ..., X_{T+1}), plus a function mapping each X_t to a state s
- Output sequence: O = (O_1, ..., O_T), where each o_t ∈ K
Slide 26
Crucial ingredients (additional)
- Initial state probabilities: Π = {π_i}, i ∈ S (tell us the initial probability of each state)
- Symbol emission probabilities: B = {b_ijk}, i, j ∈ S, k ∈ K (tell us the probability b of seeing observation O_t = k, given that X_t = s_i and X_{t+1} = s_j)
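Putting the ingredients together as data, as a sketch: the slides define emissions b_ijk on transitions (i, j), but the snippet below uses the common simplification in which an observation depends only on the emitting state (b_ik). The weather numbers are the ones used throughout, and the initial probabilities are again my illustrative assumption:

```python
from dataclasses import dataclass

@dataclass
class HMM:
    states: list    # S = {s_1, ..., s_N}
    alphabet: list  # K = {k_1, ..., k_M}
    pi: dict        # initial state probabilities pi[s_i]
    A: dict         # transition probabilities A[s_i][s_j]
    B: dict         # emission probabilities B[s_i][k] (state-based simplification)

weather_hmm = HMM(
    states=["sunny", "rainy", "foggy"],
    alphabet=["+umbrella", "-umbrella"],
    pi={"sunny": 0.5, "rainy": 0.25, "foggy": 0.25},  # assumed, as before
    A={"sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
       "rainy": {"sunny": 0.2, "rainy": 0.6, "foggy": 0.2},
       "foggy": {"sunny": 0.2, "rainy": 0.3, "foggy": 0.5}},
    B={s: {"+umbrella": p, "-umbrella": 1 - p}
       for s, p in {"sunny": 0.1, "rainy": 0.8, "foggy": 0.3}.items()},
)
```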
Slide 27
Trellis diagram of an HMM
[Diagram: states s_1, s_2, s_3 with transition probabilities a_{1,1}, a_{1,2}, a_{1,3}.]
Slide 28
Trellis diagram of an HMM
[Diagram: states s_1, s_2, s_3 with transition probabilities a_{1,1}, a_{1,2}, a_{1,3}; observation sequence o_1, o_2, o_3 at times t_1, t_2, t_3.]
Slide 29
Trellis diagram of an HMM
[Diagram: as above, now with symbol emission probabilities b_{1,1,k}, b_{1,2,k}, b_{1,3,k} linking states to observations.]
Slide 30
The fundamental questions for HMMs
1. Given a model μ = (A, B, Π), how do we compute the likelihood of an observation, P(O|μ)? (A naive sketch follows below.)
2. Given an observation sequence O and a model μ, which is the state sequence (X_1, ..., X_{T+1}) that best explains the observations? This is the decoding problem.
3. Given an observation sequence O and a space of possible models μ = (A, B, Π), which model best explains the observed data?
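To make question 1 concrete, here is a deliberately naive sketch that computes P(O|μ) by summing over every possible state sequence; it enumerates N^T sequences, which is exactly why the efficient algorithms of the next lecture are needed (reuses the weather_hmm object from the previous snippet):

```python
from itertools import product

def likelihood(hmm, obs):
    # Question 1: P(O | mu) = sum over all state sequences X of P(X, O | mu).
    # Brute force over the N^T sequences; usable only for tiny T.
    total = 0.0
    for seq in product(hmm.states, repeat=len(obs)):
        p = hmm.pi[seq[0]] * hmm.B[seq[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= hmm.A[seq[t - 1]][seq[t]] * hmm.B[seq[t]][obs[t]]
        total += p
    return total

print(likelihood(weather_hmm, ["+umbrella", "+umbrella", "-umbrella"]))
```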
Slide 31
Application of question 1 (ASR)
Given a model μ = (A, B, Π), how do we compute the likelihood of an observation, P(O|μ)? The input to an ASR system is a continuous stream of sound waves, which is ambiguous. We need to decode it into a sequence of phones: is the input the sequence [n iy d] or [n iy]? Which sequence is the most probable?
Slide 32
Application of question 2 (POS Tagging)
Given an observation sequence O and a model μ, which is the state sequence (X_1, ..., X_{T+1}) that best explains the observations? This is the decoding problem. Consider a POS tagger. Input observation sequence: "I can read". We need to find the most likely sequence of underlying POS tags: e.g. is "can" a modal verb, or the noun? How likely is it that "can" is a noun, given that the previous word is a pronoun? (A brute-force decoding sketch follows below.)
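Since the slide does not give the tagger's actual parameters, this decoding sketch uses the weather HMM instead; the same code would recover POS tags given a tagging model's parameters (reuses weather_hmm and itertools.product from the snippets above):

```python
def decode(hmm, obs):
    # Question 2: argmax over state sequences of P(X, O | mu).
    # A brute-force stand-in for the efficient decoding algorithm to come.
    best_seq, best_p = None, 0.0
    for seq in product(hmm.states, repeat=len(obs)):
        p = hmm.pi[seq[0]] * hmm.B[seq[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= hmm.A[seq[t - 1]][seq[t]] * hmm.B[seq[t]][obs[t]]
        if p > best_p:
            best_seq, best_p = seq, p
    return best_seq

print(decode(weather_hmm, ["+umbrella", "-umbrella", "-umbrella"]))
```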
Slide 33
Summary
HMMs are a way of representing sequences of observations arising from sequences of states; the states are the variables of interest, giving rise to the observations. Next up: algorithms for answering the fundamental questions about HMMs.