Hidden Markov Models in NLP
Leah Spontaneo
Jan 03, 2016
Overview
Introduction
Deterministic Model
Statistical Model
Speech Recognition
Discrete Markov Processes
Hidden Markov Model
Elements of HMMs
Output Sequence
Basic HMM Problems
HMM Operations
Forward-Backward
Viterbi Algorithm
Baum-Welch
Different Types of HMMs
Speech Recognition using HMMs
Isolated Word Recognition
Limitations of HMMs
Introduction
Real-world processes generally produce observable outputs characterized as signals
Signals can be discrete or continuous
The signal source can be stationary or nonstationary
Signals can be pure or corrupted
Introduction
Signals are characterized by signal models
Process the signal to provide a desired output
Learn about the signal source without the source being available
Signal models work well in practice
Signal models can be deterministic or statistical
Deterministic Model
Exploits known specific properties of the signal:
Sine wave
Sum of exponentials
Chaos theory
Specification of the signal is generally straightforward
Determine/estimate the values of the parameters of the signal model
Statistical Model
Statistical models try to characterize the statistical properties of the signal:
Gaussian processes
Markov processes
Hidden Markov processes
The signal is characterized as a parametric random process
The parameters of the stochastic process can be determined/estimated in a precise, well-defined manner
Speech Recognition
The basic theory of hidden Markov models in speech recognition was originally published in the 1960s by Baum and colleagues
Implemented in speech processing applications in the 1970s by Baker at CMU and by Jelinek at IBM
Discrete Markov Processes
Contains a set of N distinct states: S_1, S_2, ..., S_N
At discrete time intervals the state changes, and the state at time t is q_t
For a first-order Markov chain, the probabilistic description is
P[q_t = S_j | q_t-1 = S_i, q_t-2 = S_k, ...] = P[q_t = S_j | q_t-1 = S_i]
Discrete Markov Processes
The processes considered are independent of time, leading to fixed transition probabilities
a_ij = P[q_t = S_j | q_t-1 = S_i], 1 ≤ i, j ≤ N
a_ij ≥ 0
Σ_j a_ij = 1
The previous stochastic process is considered an observable Markov model
The output process is the set of states at each time interval, where each state corresponds to an observable event
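As a concrete check of these two constraints, here is a toy transition matrix; the states and values are illustrative, not from the slides:

```python
# A tiny 3-state transition matrix A = {a_ij}; values are illustrative.
A = [
    [0.7, 0.2, 0.1],  # P(next state | current = S1)
    [0.3, 0.5, 0.2],  # P(next state | current = S2)
    [0.2, 0.3, 0.5],  # P(next state | current = S3)
]

# Every a_ij must be non-negative...
assert all(a >= 0 for row in A for a in row)
# ...and each row must sum to 1 (the chain must go somewhere).
assert all(abs(sum(row) - 1.0) < 1e-9 for row in A)
```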
Hidden Markov Model
A Markov process decides future probabilities based on recent values
A hidden Markov model is a Markov process with an unobservable state
HMMs must have 3 sets of probabilities:
Initial probabilities
Transition probabilities
Emission probabilities
Hidden Markov Model
Includes the case where the observation is a probabilistic function of the state
A doubly embedded stochastic process with an underlying unobservable stochastic process
The unobservable process can only be observed through a set of stochastic processes producing the observations
Elements of HMMs
N, the number of states in the model
Although hidden, there is physical significance attached to the states of the model
Individual states are denoted S = {S_1, S_2, ..., S_N}
The state at time t is denoted q_t
M, the number of distinct observation symbols per state
Elements of HMMs
Observation symbols correspond to the output of the system modeled
Individual symbols are denoted V = {v_1, v_2, ..., v_M}
The state transition probability distribution A = {a_ij}
a_ij = P[q_t = S_j | q_t-1 = S_i], 1 ≤ i, j ≤ N
In the special case where any state can reach any other state, a_ij > 0 for all i, j
Elements of HMMs
The observation probability distribution in state j, B = {b_j(k)}
b_j(k) = P[v_k at t | q_t = S_j], 1 ≤ j ≤ N, 1 ≤ k ≤ M
The initial state distribution π = {π_i}, where π_i = P[q_1 = S_i], 1 ≤ i ≤ N
With appropriate values for N, M, A, B, and π, the HMM can generate an output sequence
O = O_1 O_2 ... O_T, where each O_t is an observation in V and T is the total number of observations in the sequence
Output Sequence
1) Choose an initial state q_1 = S_i according to π
2) Set t = 1
3) Get O_t = v_k based on the emission probability for S_i, b_i(k)
4) Transition to a new state q_t+1 = S_j based on the transition probability for S_i, a_ij
5) Set t = t + 1 and go back to step 3 if t < T; otherwise end the sequence
Successfully used for acoustic modeling in speech recognition
Also applied to language modeling and POS tagging
Output Sequence
The procedure can be used to generate a sequence of observations and to model how an observation sequence was produced by an HMM
The cost of determining the probability that the system is in state S_i at time t is O(tN^2)
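The five-step generation procedure can be sketched as follows; the two-state model parameters are illustrative toy values, with states and symbols indexed from 0:

```python
import random

def generate(pi, A, B, T, seed=0):
    """Generate T observations from an HMM (pi, A, B) by ancestral sampling."""
    rng = random.Random(seed)
    states, obs = [], []
    # Step 1: choose the initial state according to pi.
    q = rng.choices(range(len(pi)), weights=pi)[0]
    for _ in range(T):
        states.append(q)
        # Step 3: emit a symbol according to row q of the emission matrix B.
        obs.append(rng.choices(range(len(B[q])), weights=B[q])[0])
        # Step 4: move to the next state according to row q of A.
        q = rng.choices(range(len(A[q])), weights=A[q])[0]
    return states, obs

# Toy two-state model with two output symbols (illustrative values).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
states, obs = generate(pi, A, B, T=5)
```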
Basic HMM Problems
HMMs can find the state sequence that most likely produces a given output
The sequence of states is most efficiently computed using the Viterbi algorithm
Maximum likelihood estimates of the probability sets are determined using the Baum-Welch algorithm
HMM Operations
Calculating P(q_t = S_i | O_1 O_2 ... O_t) uses the forward-backward algorithm
Computing Q* = argmax_Q P(Q|O) requires the Viterbi algorithm
Learning λ* = argmax_λ P(O|λ) uses the Baum-Welch algorithm
The complexity of all three algorithms is O(TN^2), where T is the length of the observation sequence and N is the number of states
Evaluation
Scores how well a given model matches an observation sequence
Extremely useful for deciding which model, among many, best represents the set of observations
Forward-Backward
Given observations O_1 O_2 ... O_T
α_t(i) = P(O_1 O_2 ... O_t ∧ q_t = S_i | λ) is the probability of observing the first t observations and being in state S_i at time t
α_1(i) = π_i b_i(O_1)
α_t+1(j) = [Σ_i α_t(i) a_ij] b_j(O_t+1)
We can now cheaply compute α_t(i) = P(O_1 O_2 ... O_t ∧ q_t = S_i)
P(O_1 O_2 ... O_t) = Σ_i α_t(i)
P(q_t = S_i | O_1 O_2 ... O_t) = α_t(i) / Σ_j α_t(j)
Forward-Backward
The key is that since there are only N states, all possible state sequences merge into the N nodes
At t = 1, only the values of α_1(i), 1 ≤ i ≤ N, require calculation
For t = 2, 3, ..., T we calculate α_t(j), 1 ≤ j ≤ N, where each calculation involves only the N previous values α_t-1(i)
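A minimal sketch of the forward recursion; the toy parameters are illustrative, and states and symbols are indexed from 0:

```python
def forward(pi, A, B, obs):
    """Forward pass: alpha[t][i] = P(O_1..O_t and q_t = S_i | lambda)."""
    N = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, len(obs)):
        # Induction: alpha_t+1(j) = [sum_i alpha_t(i) a_ij] * b_j(O_t+1)
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    return alpha

# Toy two-state, two-symbol model (illustrative values).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0]
alpha = forward(pi, A, B, obs)
p_obs = sum(alpha[-1])  # P(O) = sum_i alpha_T(i)
```

Each of the T steps touches N previous alpha values for each of N states, giving the O(TN^2) cost stated above.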
State Sequence
There is no 'correct' state sequence except for degenerate models
An optimality criterion is used instead to determine the best possible outcome
There are several reasonable optimality criteria, so the chosen criterion depends on the intended use of the uncovered sequence
Used for continuous speech recognition
Viterbi Algorithm
Finds the most likely sequence of hidden states given the observations, using dynamic programming
Makes three assumptions about the model:
The model must be a state machine
Transitions between states are marked by a metric
Events must be cumulative over the path
The path history must be kept in memory to find the most probable path at the end
Viterbi Algorithm
Used for speech recognition, where the hidden state is part of word formation
Given a specific signal, it deduces the most probable word based on the model
To find the best state sequence Q = {q_1, q_2, ..., q_T} for the observation sequence O = {O_1 O_2 ... O_T}, we define
Viterbi Algorithm
δ_t(i) is the best score along a single path at time t accounting for the first t observations and ending in state S_i
The inductive step is δ_t+1(j) = [max_i δ_t(i) a_ij] b_j(O_t+1)
Viterbi Algorithm
Similar to the forward calculation of the forward-backward algorithm
The major difference is a maximization over the previous states instead of a summation of the probabilities
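Replacing the sum in the forward recursion with a max (and keeping backpointers) gives a compact Viterbi sketch; the toy parameters below are illustrative:

```python
def viterbi(pi, A, B, obs):
    """Most likely state sequence via dynamic programming with backpointers."""
    N = len(pi)
    # Initialization: delta_1(i) = pi_i * b_i(O_1)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]
    back = []  # backpointers: best predecessor for each state at each step
    for t in range(1, len(obs)):
        psi, new_delta = [], []
        for j in range(N):
            # Induction: delta_t+1(j) = [max_i delta_t(i) a_ij] * b_j(O_t+1)
            best = max(range(N), key=lambda i: delta[i] * A[i][j])
            psi.append(best)
            new_delta.append(delta[best] * A[best][j] * B[j][obs[t]])
        back.append(psi)
        delta = new_delta
    # Termination and backtracking from the best final state.
    q = max(range(N), key=lambda i: delta[i])
    path = [q]
    for psi in reversed(back):
        q = psi[q]
        path.append(q)
    return path[::-1]

# Toy two-state, two-symbol model (illustrative values).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
path = viterbi(pi, A, B, [0, 0, 1])  # -> [0, 0, 1]
```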
Optimizing Parameters
A training sequence is used to train the HMM and adjust the model parameters
The training problem is crucial for most HMM applications
It allows us to optimally adapt model parameters to observed training data, creating good models for real data
Baum-Welch
A special case of expectation maximization (EM)
EM has two main steps:
Compute the expectation of the log-likelihood using current estimates of the latent variables
Maximize the log-likelihood using the values from the first step
Baum-Welch is a form of generalized EM, which allows the algorithm to converge to a local optimum
Baum-Welch
Also known as the forward-backward algorithm
Baum-Welch uses two main steps:
The forward and backward probabilities are calculated for each state in the HMM
The transition and emission probabilities are determined and divided by the probability of the whole model, based on the previous step
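These two steps can be sketched as a single re-estimation pass over one observation sequence. The parameters below are illustrative toy values, and a real implementation would also rescale the forward/backward values to avoid numerical underflow on long sequences:

```python
def forward(pi, A, B, obs):
    """alpha[t][i] = P(O_1..O_t and q_t = S_i)."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, len(obs)):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    return alpha

def backward(A, B, obs, N):
    """beta[t][i] = P(O_t+1..O_T | q_t = S_i)."""
    beta = [[1.0] * N for _ in obs]
    for t in range(len(obs) - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    return beta

def baum_welch_step(pi, A, B, obs):
    """One EM re-estimation step on a single observation sequence."""
    N, T = len(pi), len(obs)
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs, N)
    p_obs = sum(alpha[-1])  # P(O | lambda), the "probability of the whole model"
    # gamma[t][i] = P(q_t = S_i | O, lambda)
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(N)]
             for t in range(T)]
    new_pi = gamma[0][:]
    # Expected number of i -> j transitions over expected times in state i.
    new_A = [[sum(alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                  for t in range(T - 1)) / p_obs
              / sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    # Expected emissions of symbol k from state j over expected times in state j.
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k)
              / sum(gamma[t][j] for t in range(T))
              for k in range(len(B[0]))] for j in range(N)]
    return new_pi, new_A, new_B

# Toy two-state, two-symbol model (illustrative values).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
new_pi, new_A, new_B = baum_welch_step(pi, A, B, [0, 1, 0, 0, 1])
```

By construction the re-estimated parameters remain valid probability distributions, and each such step cannot decrease P(O | λ).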
Baum-Welch
Cons: many local optima
Pros: the local optima are often adequate models of the data
EM requires the number of states to be given
Sometimes HMMs require some links to be absent; for these, set a_ij = 0 in the initial estimate λ(0)
Different Types of HMMs
Left-right model:
As time increases, the state index increases or stays the same
No transitions are allowed to states whose indices are lower than the current state
Cross-coupled parallel left-right model:
Obeys the left-right constraints on transition probabilities but provides more flexibility
Speech Recognition using HMMs
Feature Analysis: a spectral/temporal analysis of speech signals is performed to provide the observation vectors used to train HMMs
Unit Matching: each unit is characterized by an HMM with parameters estimated from speech data
Provides the likelihoods of matches of all sequences of speech-recognition units to the unknown input
Isolated Word Recognition
A vocabulary of V words to be recognized, each modeled by a distinct HMM
For each word, a training set of K occurrences of the spoken word, where each occurrence is an observation sequence
For each word v, build an HMM (estimating the model parameters that optimize the likelihood of the training-set observations for that word)
Isolated Word Recognition
For each unknown word to be recognized, measure the observation sequence O = {O_1 O_2 ... O_T} via feature analysis of the speech corresponding to the word
Calculate the model likelihoods for all possible models, then select the word with the highest model likelihood
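The selection step can be sketched as an argmax of forward likelihoods over the per-word models; the two toy "word" HMMs below are purely illustrative, standing in for models trained on real speech features:

```python
def likelihood(model, obs):
    """P(O | model) via the forward algorithm; model = (pi, A, B)."""
    pi, A, B = model
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    for t in range(1, len(obs)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                 for j in range(N)]
    return sum(alpha)

# Two toy left-right "word" HMMs over a shared symbol alphabet {0, 1}.
# "yes" mostly emits 0s then 1s; "no" the reverse (illustrative parameters).
models = {
    "yes": ([1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]),
    "no":  ([1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]]),
}
obs = [0, 0, 1, 1]  # feature-analysis output for an unknown utterance
# Recognize: pick the word whose model gives the utterance the highest P(O | model).
best = max(models, key=lambda w: likelihood(models[w], obs))  # -> "yes"
```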
Isolated Word Recognition
Limitations of HMMs
Assumes successive observations are independent, so that the probability of an observation sequence can be written as the product of the probabilities of the individual observations
Assumes the distributions of individual observation parameters are well represented as a mixture of autoregressive or Gaussian densities
The assumption of being in a given state at each time interval is inappropriate for speech sounds, which can extend through several states
Questions?
References
Vogel, S., et al. "HMM-based Word Alignment in Statistical Translation." Proceedings of the 16th Conference on Computational Linguistics (1996), pp. 836-841.
Rabiner, L. R. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition." Proceedings of the IEEE (1989), pp. 257-286.
Moore, A. W. "Hidden Markov Models." Carnegie Mellon University. https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/hmm14.pdf
http://en.wikipedia.org/wiki/Hidden_Markov_model