Hidden Markov Models in NLP
Leah Spontaneo
Jan 03, 2016
Overview
Introduction
Deterministic Model
Statistical Model
Speech Recognition
Discrete Markov Processes
Hidden Markov Model
Elements of HMMs
Output Sequence
Basic HMM Problems
HMM Operations
Forward-Backward
Viterbi Algorithm
Baum-Welch
Different Types of HMMs
Speech Recognition using HMMs
Isolated Word Recognition
Limitations of HMMs
Introduction
Real-world processes generally produce observable outputs characterized as signals
Signals can be discrete or continuous
The signal source can be stationary or nonstationary
Signals can be pure or corrupted
Introduction
Signals are characterized by signal models
Process the signal to provide a desired output
Learn about the signal source without the source being available
Signal models work well in practice
Signal models can be deterministic or statistical
Deterministic Model
Exploits known specific properties of the signal:
Sine wave
Sum of exponentials
Chaos theory
Specification of the signal is generally straightforward
Determine/estimate the values of the parameters of the signal model
Statistical Model
Statistical models try to characterize the statistical properties of the signal:
Gaussian processes
Markov processes
Hidden Markov processes
The signal is characterized as a parametric random process
The parameters of the stochastic process can be determined/estimated in a precise, well-defined manner
Speech Recognition
The basic theory of hidden Markov models in speech recognition was originally published in the 1960s by Baum and colleagues
Implemented in speech processing applications in the 1970s by Baker at CMU and by Jelinek at IBM
Discrete Markov Processes
Contains a set of N distinct states: S_1, S_2, ..., S_N
At discrete time intervals the state changes, and the state at time t is q_t
For a first-order Markov chain, the probabilistic description is
P[q_t = S_j | q_t-1 = S_i, q_t-2 = S_k, ...] = P[q_t = S_j | q_t-1 = S_i]
Discrete Markov Processes
The processes considered are independent of time, leading to fixed transition probabilities
a_ij = P[q_t = S_j | q_t-1 = S_i], 1 ≤ i, j ≤ N
a_ij ≥ 0
Σ_j a_ij = 1
The previous stochastic process is considered an observable Markov model
The output process is the set of states at each time interval, where each state corresponds to an observable event
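As a concrete check of these two constraints, here is a toy transition matrix; the states and values are illustrative, not from the slides:

```python
# A tiny 3-state transition matrix A = {a_ij}; values are illustrative.
A = [
    [0.7, 0.2, 0.1],  # P(next state | current = S1)
    [0.3, 0.5, 0.2],  # P(next state | current = S2)
    [0.2, 0.3, 0.5],  # P(next state | current = S3)
]

# Every a_ij must be non-negative...
assert all(a >= 0 for row in A for a in row)
# ...and each row must sum to 1 (the chain must go somewhere).
assert all(abs(sum(row) - 1.0) < 1e-9 for row in A)
```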
Hidden Markov Model
A Markov process decides future probabilities based on recent values
A hidden Markov model is a Markov process with an unobservable state
HMMs must have 3 sets of probabilities:
Initial probabilities
Transition probabilities
Emission probabilities
Hidden Markov Model
Includes the case where the observation is a probabilistic function of the state
A doubly embedded stochastic process with an underlying unobservable stochastic process
The unobservable process can only be observed through a set of stochastic processes producing the observations
Elements of HMMs
N, the number of states in the model
Although hidden, there is physical significance attached to the states of the model
Individual states are denoted S = {S_1, S_2, ..., S_N}
The state at time t is denoted q_t
M, the number of distinct observation symbols per state
Elements of HMMs
Observation symbols correspond to the output of the system modeled
Individual symbols are denoted V = {v_1, v_2, ..., v_M}
The state transition probability distribution A = {a_ij}
a_ij = P[q_t = S_j | q_t-1 = S_i], 1 ≤ i, j ≤ N
In the special case where any state can reach any other state, a_ij > 0 for all i, j
Elements of HMMs
The observation probability distribution in state j, B = {b_j(k)}
b_j(k) = P[v_k at t | q_t = S_j], 1 ≤ j ≤ N, 1 ≤ k ≤ M
The initial state distribution π = {π_i}, where π_i = P[q_1 = S_i], 1 ≤ i ≤ N
With appropriate values for N, M, A, B, and π, the HMM can generate an output sequence
O = O_1 O_2 ... O_T, where each O_t is an observation in V and T is the total number of observations in the sequence
Output Sequence
1) Choose an initial state q_1 = S_i according to π
2) Set t = 1
3) Get O_t = v_k based on the emission probability for S_i, b_i(k)
4) Transition to a new state q_t+1 = S_j based on the transition probability for S_i, a_ij
5) Set t = t + 1 and go back to step 3 if t < T; otherwise end the sequence
Successfully used for acoustic modeling in speech recognition
Also applied to language modeling and POS tagging
Output Sequence
The procedure can be used to generate a sequence of observations and to model how an observation sequence was produced by an HMM
The cost of determining the probability that the system is in state S_i at time t is O(tN^2)
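The five-step generation procedure can be sketched as follows; the two-state model parameters are illustrative toy values, with states and symbols indexed from 0:

```python
import random

def generate(pi, A, B, T, seed=0):
    """Generate T observations from an HMM (pi, A, B) by ancestral sampling."""
    rng = random.Random(seed)
    states, obs = [], []
    # Step 1: choose the initial state according to pi.
    q = rng.choices(range(len(pi)), weights=pi)[0]
    for _ in range(T):
        states.append(q)
        # Step 3: emit a symbol according to row q of the emission matrix B.
        obs.append(rng.choices(range(len(B[q])), weights=B[q])[0])
        # Step 4: move to the next state according to row q of A.
        q = rng.choices(range(len(A[q])), weights=A[q])[0]
    return states, obs

# Toy two-state model with two output symbols (illustrative values).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
states, obs = generate(pi, A, B, T=5)
```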
Basic HMM Problems
HMMs can find the state sequence that most likely produces a given output
The sequence of states is most efficiently computed using the Viterbi algorithm
Maximum likelihood estimates of the probability sets are determined using the Baum-Welch algorithm
HMM Operations
Calculating P(q_t = S_i | O_1 O_2 ... O_t) uses the forward-backward algorithm
Computing Q* = argmax_Q P(Q|O) requires the Viterbi algorithm
Learning λ* = argmax_λ P(O|λ) uses the Baum-Welch algorithm
The complexity of all three algorithms is O(TN^2), where T is the length of the observation sequence and N is the number of states
Evaluation
Scores how well a given model matches an observation sequence
Extremely useful for deciding which model, among many, best represents the set of observations
Forward-Backward
Given observations O_1 O_2 ... O_T
α_t(i) = P(O_1 O_2 ... O_t ∧ q_t = S_i | λ) is the probability of observing the first t observations and being in state S_i at time t
α_1(i) = π_i b_i(O_1)
α_t+1(j) = [Σ_i α_t(i) a_ij] b_j(O_t+1)
We can now cheaply compute α_t(i) = P(O_1 O_2 ... O_t ∧ q_t = S_i)
P(O_1 O_2 ... O_t) = Σ_i α_t(i)
P(q_t = S_i | O_1 O_2 ... O_t) = α_t(i) / Σ_j α_t(j)
Forward-Backward
The key is that since there are only N states, all possible state sequences merge into the N nodes
At t = 1, only the values of α_1(i), 1 ≤ i ≤ N, require calculation
For t = 2, 3, ..., T we calculate α_t(j), 1 ≤ j ≤ N, where each calculation involves only the N previous values α_t-1(i)
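A minimal sketch of the forward recursion; the toy parameters are illustrative, and states and symbols are indexed from 0:

```python
def forward(pi, A, B, obs):
    """Forward pass: alpha[t][i] = P(O_1..O_t and q_t = S_i | lambda)."""
    N = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, len(obs)):
        # Induction: alpha_t+1(j) = [sum_i alpha_t(i) a_ij] * b_j(O_t+1)
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    return alpha

# Toy two-state, two-symbol model (illustrative values).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0]
alpha = forward(pi, A, B, obs)
p_obs = sum(alpha[-1])  # P(O) = sum_i alpha_T(i)
```

Each of the T steps touches N previous alpha values for each of N states, giving the O(TN^2) cost stated above.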
State Sequence
There is no 'correct' state sequence except for degenerate models
An optimality criterion is used instead to determine the best possible outcome
There are several reasonable optimality criteria, so the chosen criterion depends on the intended use of the uncovered sequence
Used for continuous speech recognition
Viterbi Algorithm
Finds the most likely sequence of hidden states given the observations, using dynamic programming
Makes three assumptions about the model:
The model must be a state machine
Transitions between states are marked by a metric
Events must be cumulative over the path
The path history must be kept in memory to find the most probable path at the end
Viterbi Algorithm
Used for speech recognition, where the hidden state is part of word formation
Given a specific signal, it deduces the most probable word based on the model
To find the best state sequence Q = {q_1, q_2, ..., q_T} for the observation sequence O = {O_1 O_2 ... O_T}, we define
Viterbi Algorithm
δ_t(i) is the best score along a single path at time t accounting for the first t observations and ending in state S_i
The inductive step is δ_t+1(j) = [max_i δ_t(i) a_ij] b_j(O_t+1)
Viterbi Algorithm
Similar to the forward calculation of the forward-backward algorithm
The major difference is a maximization over the previous states instead of a summation of the probabilities
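Replacing the sum in the forward recursion with a max (and keeping backpointers) gives a compact Viterbi sketch; the toy parameters below are illustrative:

```python
def viterbi(pi, A, B, obs):
    """Most likely state sequence via dynamic programming with backpointers."""
    N = len(pi)
    # Initialization: delta_1(i) = pi_i * b_i(O_1)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]
    back = []  # backpointers: best predecessor for each state at each step
    for t in range(1, len(obs)):
        psi, new_delta = [], []
        for j in range(N):
            # Induction: delta_t+1(j) = [max_i delta_t(i) a_ij] * b_j(O_t+1)
            best = max(range(N), key=lambda i: delta[i] * A[i][j])
            psi.append(best)
            new_delta.append(delta[best] * A[best][j] * B[j][obs[t]])
        back.append(psi)
        delta = new_delta
    # Termination and backtracking from the best final state.
    q = max(range(N), key=lambda i: delta[i])
    path = [q]
    for psi in reversed(back):
        q = psi[q]
        path.append(q)
    return path[::-1]

# Toy two-state, two-symbol model (illustrative values).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
path = viterbi(pi, A, B, [0, 0, 1])  # -> [0, 0, 1]
```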
Optimizing Parameters
A training sequence is used to train the HMM and adjust the model parameters
The training problem is crucial for most HMM applications
It allows us to optimally adapt model parameters to observed training data, creating good models for real data
Baum-Welch
A special case of expectation maximization (EM)
EM has two main steps:
Compute the expectation of the log-likelihood using current estimates of the latent variables
Maximize the log-likelihood using the values from the first step
Baum-Welch is a form of generalized EM, which allows the algorithm to converge to a local optimum
Baum-Welch
Also known as the forward-backward algorithm
Baum-Welch uses two main steps:
The forward and backward probabilities are calculated for each state in the HMM
The transition and emission probabilities are determined and divided by the probability of the whole model, based on the previous step
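These two steps can be sketched as a single re-estimation pass over one observation sequence. The parameters below are illustrative toy values, and a real implementation would also rescale the forward/backward values to avoid numerical underflow on long sequences:

```python
def forward(pi, A, B, obs):
    """alpha[t][i] = P(O_1..O_t and q_t = S_i)."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, len(obs)):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    return alpha

def backward(A, B, obs, N):
    """beta[t][i] = P(O_t+1..O_T | q_t = S_i)."""
    beta = [[1.0] * N for _ in obs]
    for t in range(len(obs) - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    return beta

def baum_welch_step(pi, A, B, obs):
    """One EM re-estimation step on a single observation sequence."""
    N, T = len(pi), len(obs)
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs, N)
    p_obs = sum(alpha[-1])  # P(O | lambda), the "probability of the whole model"
    # gamma[t][i] = P(q_t = S_i | O, lambda)
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(N)]
             for t in range(T)]
    new_pi = gamma[0][:]
    # Expected number of i -> j transitions over expected times in state i.
    new_A = [[sum(alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                  for t in range(T - 1)) / p_obs
              / sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    # Expected emissions of symbol k from state j over expected times in state j.
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k)
              / sum(gamma[t][j] for t in range(T))
              for k in range(len(B[0]))] for j in range(N)]
    return new_pi, new_A, new_B

# Toy two-state, two-symbol model (illustrative values).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
new_pi, new_A, new_B = baum_welch_step(pi, A, B, [0, 1, 0, 0, 1])
```

By construction the re-estimated parameters remain valid probability distributions, and each such step cannot decrease P(O | λ).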
Baum-Welch
Cons: many local optima
Pros: the local optima are often adequate models of the data
EM requires the number of states to be given
Sometimes HMMs require some links to be absent; for these, set a_ij = 0 in the initial estimate λ(0)
Different Types of HMMs
Left-right model:
As time increases, the state index increases or stays the same
No transitions are allowed to states whose indices are lower than the current state
Cross-coupled parallel left-right model:
Obeys the left-right constraints on transition probabilities but provides more flexibility
Speech Recognition using HMMs
Feature Analysis: a spectral/temporal analysis of speech signals is performed to provide the observation vectors used to train HMMs
Unit Matching: each unit is characterized by an HMM with parameters estimated from speech data
Provides the likelihoods of matches of all sequences of speech-recognition units to the unknown input
Isolated Word Recognition
A vocabulary of V words to be recognized, each modeled by a distinct HMM
For each word, a training set of K occurrences of the spoken word, where each occurrence is an observation sequence
For each word v, build an HMM (estimating the model parameters that optimize the likelihood of the training-set observations for that word)
Isolated Word Recognition
For each unknown word to be recognized, measure the observation sequence O = {O_1 O_2 ... O_T} via feature analysis of the speech corresponding to the word
Calculate the model likelihoods for all possible models, then select the word with the highest model likelihood
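The selection step can be sketched as an argmax of forward likelihoods over the per-word models; the two toy "word" HMMs below are purely illustrative, standing in for models trained on real speech features:

```python
def likelihood(model, obs):
    """P(O | model) via the forward algorithm; model = (pi, A, B)."""
    pi, A, B = model
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    for t in range(1, len(obs)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                 for j in range(N)]
    return sum(alpha)

# Two toy left-right "word" HMMs over a shared symbol alphabet {0, 1}.
# "yes" mostly emits 0s then 1s; "no" the reverse (illustrative parameters).
models = {
    "yes": ([1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]),
    "no":  ([1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]]),
}
obs = [0, 0, 1, 1]  # feature-analysis output for an unknown utterance
# Recognize: pick the word whose model gives the utterance the highest P(O | model).
best = max(models, key=lambda w: likelihood(models[w], obs))  # -> "yes"
```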
Isolated Word Recognition
Limitations of HMMs
Assumes successive observations are independent, so that the probability of an observation sequence can be written as the product of the probabilities of the individual observations
Assumes the distributions of individual observation parameters are well represented as a mixture of autoregressive or Gaussian densities
The assumption of being in a given state at each time interval is inappropriate for speech sounds, which can extend through several states
Questions?
References
Vogel, S., et al. "HMM-based Word Alignment in Statistical Translation." Proceedings of the 16th Conference on Computational Linguistics (1996), pp. 836-841.
Rabiner, L. R. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition." Proceedings of the IEEE (1989), pp. 257-286.
Moore, A. W. "Hidden Markov Models." Carnegie Mellon University. https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/hmm14.pdf
http://en.wikipedia.org/wiki/Hidden_Markov_model