Hidden Markov Models
1
10-601 Introduction to Machine Learning
Matt Gormley
Lecture 20
Mar. 30, 2020
Machine Learning Department
School of Computer Science
Carnegie Mellon University
Reminders
• Practice Problems for Exam 2
  – Out: Fri, Mar 20
• Midterm Exam 2
  – Thu, Apr 2 – evening exam, details announced on Piazza
• Homework 7: HMMs
  – Out: Thu, Apr 02
  – Due: Fri, Apr 10 at 11:59pm
• Today's In-Class Poll
  – http://poll.mlcourse.org
2
HMMs: History
• Markov chains (Andrey Markov)
  – Random walks and Brownian motion
• Used in Shannon's work on information theory (1948)
• Baum-Welch learning algorithm (late 60's, early 70's)
  – Used mainly for speech in 70s–80s
• Late 80's and 90's: David Haussler (major player in learning theory in 80's) began to use HMMs for modeling biological sequences
• Mid-late 1990's: Dayne Freitag / Andrew McCallum
  – Freitag thesis with Tom Mitchell on IE from Web using logic programs, grammar induction, etc.
  – McCallum: multinomial Naïve Bayes for text
  – With McCallum, IE using HMMs on CORA
• …
Slide from William Cohen
Higher-order HMMs
• 1st-order HMM (i.e. bigram HMM)
• 2nd-order HMM (i.e. trigram HMM)
• 3rd-order HMM
33
[Figure: graphical models for the 1st-, 2nd-, and 3rd-order HMMs, each over hidden states Y1–Y5 and observations X1–X5 starting from <START>; higher-order models add transition edges from earlier hidden states.]
Higher-order HMMs
34
[Slide 34 repeats the figure above, labeling the top row "Hidden States, y" and the bottom row "Observations, x".]
BACKGROUND: MESSAGE PASSING
35
Great Ideas in ML: Message Passing
Count the soldiers
[Figure: soldiers standing in a line pass messages along the line: each tells the soldier in front "N behind you" (1 through 5 behind you) and the soldier behind "N before you" (1 through 5 before you), plus "there's 1 of me".]
36
Adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
Count the soldiers
[Figure: one soldier only sees her incoming messages, "2 before you" and "3 behind you", plus "there's 1 of me".]
Belief: Must be 2 + 1 + 3 = 6 of us
37
Adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
Count the soldiers
[Figure: the next soldier only sees her incoming messages, "1 before you" and "4 behind you", plus "there's 1 of me".]
Belief: Must be 1 + 1 + 4 = 6 of us, the same total as the previous soldier's 2 + 1 + 3 = 6.
38
Adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of the tree
[Figure (slides 39–41): soldiers now form a tree; each node adds the counts reported by its neighbors on one side, plus 1 for itself, and passes the total to the other side, e.g. "7 here" + "3 here" + "1 of me" → "11 here (= 7 + 3 + 1)", and "3 here" + "3 here" + 1 → "7 here (= 3 + 3 + 1)".]
Adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of the tree
[Figure (slides 42–43): a soldier receives "7 here", "3 here", and "3 here" from its three branches.]
Belief: Must be 14 of us (= 7 + 3 + 3 + 1 of me)
Adapted from MacKay (2003) textbook
THE FORWARD-BACKWARD ALGORITHM
44
Inference
Question: True or False: The joint probability of the observations and the hidden states in an HMM is given by: [formula shown on slide]
45
Recall: [HMM definition from the earlier slides]
Inference
Question: True or False: The probability of the observations in an HMM is given by: [formula shown on slide]
46
Recall: [HMM definition from the earlier slides]
Inference
Question: True or False: Suppose each hidden state takes K values. The marginal probability of a hidden state y_t given the observations x is given by: [formula shown on slide]
47
Recall: [HMM definition from the earlier slides]
Inference for HMMs
Whiteboard
– Three Inference Problems for an HMM
1. Evaluation: Compute the probability of a given sequence of observations
2. Viterbi Decoding: Find the most-likely sequence of hidden states, given a sequence of observations
3. Marginals: Compute the marginal distribution for a hidden state, given a sequence of observations
48
Dataset for Supervised Part-of-Speech (POS) Tagging
49
Data: pairs of tag sequences y(i) and word sequences x(i)

Sample 1:  y(1) = n v p d n
           x(1) = time flies like an arrow
Sample 2:  y(2) = n n v d n
           x(2) = time flies like an arrow
Sample 3:  y(3) = n v p n n
           x(3) = flies fly with their wings
Sample 4:  y(4) = p n n v v
           x(4) = with time you will see
Inference for HMMs
Whiteboard
– Forward-backward search space
50
[Figure: forward-backward search space (lattice) for the sentence "time flies like an arrow" with candidate tags n, v, p, d at each position, starting from <START>]
Hidden Markov Model
52
A Hidden Markov Model (HMM) provides a joint distribution over the sentence/tags with an assumption of dependence between adjacent tags.
Transition probabilities (rows: previous tag, columns: next tag):
        v    n    p    d
   v   .1   .4   .2   .3
   n   .8   .1   .1    0
   p   .2   .3   .2   .3
   d   .2   .8    0    0

Emission probabilities (columns: time, flies, like, …):
        time  flies  like  …
   v     .2    .5     .2
   n     .3    .4     .2
   p     .1    .1     .3
   d     .1    .2     .1
p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .2 * .5 * …)
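The product above just multiplies one transition and one emission probability per position: under the HMM assumption, p(x, y) = prod_t p(y_t | y_{t-1}) · p(x_t | y_t), with y_0 = <START>. Below is a minimal Python sketch of that computation. The v/n/p/d transition values and the time/flies/like emission values come from the tables on the slide; the <START> row and the emissions for "an" and "arrow" are not shown on the slide, so those numbers are made-up placeholders.

# Hedged sketch: HMM joint probability p(x, y) = prod_t p(y_t | y_{t-1}) * p(x_t | y_t).
trans = {  # p(next tag | previous tag); <START> row is a placeholder
    "<START>": {"v": 0.1, "n": 0.6, "p": 0.1, "d": 0.2},
    "v": {"v": 0.1, "n": 0.4, "p": 0.2, "d": 0.3},
    "n": {"v": 0.8, "n": 0.1, "p": 0.1, "d": 0.0},
    "p": {"v": 0.2, "n": 0.3, "p": 0.2, "d": 0.3},
    "d": {"v": 0.2, "n": 0.8, "p": 0.0, "d": 0.0},
}
emit = {  # p(word | tag); "an" and "arrow" columns are placeholders
    "v": {"time": 0.2, "flies": 0.5, "like": 0.2, "an": 0.01, "arrow": 0.01},
    "n": {"time": 0.3, "flies": 0.4, "like": 0.2, "an": 0.01, "arrow": 0.1},
    "p": {"time": 0.1, "flies": 0.1, "like": 0.3, "an": 0.01, "arrow": 0.01},
    "d": {"time": 0.1, "flies": 0.2, "like": 0.1, "an": 0.5, "arrow": 0.01},
}

def joint_prob(tags, words):
    """p(x, y) for one tagged sentence."""
    p, prev = 1.0, "<START>"
    for tag, word in zip(tags, words):
        p *= trans[prev][tag] * emit[tag][word]
        prev = tag
    return p

print(joint_prob(["n", "v", "p", "d", "n"],
                 ["time", "flies", "like", "an", "arrow"]))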
53
[Figure: HMM with hidden tags Y1, Y2, Y3 over the observed words "find preferred tags" (X1, X2, X3). "find" could be a verb or noun, "preferred" could be an adjective or verb, "tags" could be a noun or verb.]
Forward-Backward Algorithm
54
[Figure: the same HMM over "find preferred tags", with hidden variables Y1, Y2, Y3 and observations X1, X2, X3]
Forward-Backward Algorithm
55
• Let's show the possible values for each variable
• One possible assignment
• And what the 7 factors think of it …
[Figure: trellis over "find preferred tags": each position Y1, Y2, Y3 can take tag v, n, or a, bracketed by START and END nodes]
[Slides 56–58 repeat the trellis and bullets above, highlighting in turn the possible values for each variable, one possible assignment (a single path from START to END), and what the 7 transition / emission factors think of it.]
Forward-Backward Algorithm
59
• Let's show the possible values for each variable
• One possible assignment
• And what the 7 transition / emission factors think of it …
[Figure: trellis over "find preferred tags" with the factor tables attached]

Emission factors (columns: find, pref., tags, …):
   v    3    5    3
   n    4    5    2
   a   0.1  0.2  0.1

Transition factors (columns: v, n, a):
   v    1    6    4
   n    8    4   0.1
   a   0.1   8    0
Viterbi Algorithm: Most Probable Assignment
60
• So p(v a n) = (1/Z) · product of 7 numbers
• Numbers associated with edges and nodes of path
• Most probable assignment = path with highest product
[Figure: the path v → a → n is highlighted in the trellis over "find preferred tags"; callouts point to the factor values along its edges and nodes, e.g. the transition weight from a into END and the emission weight of "tags" given n.]
Viterbi Algorithm: Most Probable Assignment
61
• So p(v a n) = (1/Z) · product weight of one path
[Figure: same trellis; the product of the 7 factor values along the path v → a → n is the weight of that single path.]
Forward-Backward Algorithm: Finds Marginals
62
• So p(v a n) = (1/Z) · product weight of one path
• Marginal probability p(Y2 = a) = (1/Z) · total weight of all paths through a
[Figure: trellis with all paths passing through the node a at position 2 highlighted]
Forward-Backward Algorithm: Finds Marginals
63
• So p(v a n) = (1/Z) · product weight of one path
• Marginal probability p(Y2 = n) = (1/Z) · total weight of all paths through n
[Figure: trellis with all paths passing through the node n at position 2 highlighted]
Forward-Backward Algorithm: Finds Marginals
64
• So p(v a n) = (1/Z) · product weight of one path
• Marginal probability p(Y2 = v) = (1/Z) · total weight of all paths through v
[Figure: trellis with all paths passing through the node v at position 2 highlighted]
Forward-Backward Algorithm: Finds Marginals
65
• So p(v a n) = (1/Z) · product weight of one path
• Marginal probability p(Y2 = n) = (1/Z) · total weight of all paths through n
α2(n) = total weight of these path prefixes (ending at the node n at position 2)
(found by dynamic programming: matrix-vector products)
[Figure: the prefix portion of the trellis, from START to the node n at position 2, is highlighted]
Forward-Backward Algorithm: Finds Marginals
66
β2(n) = total weight of these path suffixes (leaving the node n at position 2)
(found by dynamic programming: matrix-vector products)
[Figure: the suffix portion of the trellis, from the node n at position 2 to END, is highlighted]
Forward-Backward Algorithm: Finds Marginals
67
α2(n) = total weight of these path prefixes
β2(n) = total weight of these path suffixes
[Figure: the trellis shown twice, once with the prefix paths into n at position 2 highlighted (α2(n)), once with the suffix paths out of n highlighted (β2(n))]
Forward-Backward Algorithm: Finds Marginals
68
α2(n) = (a + b + c), the total weight of the path prefixes; β2(n) = (x + y + z), the total weight of the path suffixes.
Product gives ax + ay + az + bx + by + bz + cx + cy + cz = total weight of paths
Forward-Backward Algorithm: Finds Marginals
69
total weight of all paths through n = α2(n) · A(pref., n) · β2(n)    ("belief that Y2 = n")
Oops! The weight of a path through a state also includes a weight at that state, so α2(n) · β2(n) isn't enough.
The extra weight, A(pref., n), is the opinion of the emission probability at this variable.
[Figure: trellis highlighting the node n at position 2 together with its prefix paths, suffix paths, and the emission factor at that node]
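In standard probabilistic notation (a restatement of the same idea, not the slide's exact whiteboard derivation), the belief reads:

  α_t(k) = p(x_1, …, x_t, Y_t = k)          (forward probability; includes the emission at t)
  β_t(k) = p(x_{t+1}, …, x_T | Y_t = k)      (backward probability)
  p(Y_t = k | x) = α_t(k) · β_t(k) / p(x)

In the path-weight picture on these slides, α2(n) was defined over prefixes excluding the node's own weight, which is why the emission factor A(pref., n) has to be multiplied in separately; with the definitions above it is already folded into α_t(k).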
Forward-Backward Algorithm: Finds Marginals
70
total weight of all paths through v = α2(v) · A(pref., v) · β2(v)    ("belief that Y2 = v", computed alongside the belief that Y2 = n)
[Figure: trellis highlighting the node v at position 2 with its prefix paths, suffix paths, and emission factor]
Forward-Backward Algorithm: Finds Marginals
71
total weight of all paths through a = α2(a) · A(pref., a) · β2(a)    ("belief that Y2 = a", alongside the beliefs that Y2 = n and Y2 = v)

Unnormalized beliefs at position 2:       v 0.1,  n 0,  a 0.4
sum = Z (total weight of all paths) = 0.5
divide by Z = 0.5 to get marginal probs:  v 0.2,  n 0,  a 0.8
[Figure: trellis highlighting the node a at position 2]
Forward-Backward Algorithm
72
[Figure: back to the original chain over "find preferred tags" (tags Y1, Y2, Y3 over words X1, X2, X3), where "find" could be a verb or noun, "preferred" an adjective or verb, and "tags" a noun or verb]
Inference for HMMs
Whiteboard
– Derivation of Forward algorithm
– Forward-backward algorithm
– Viterbi algorithm
73
Forward-Backward Algorithm
74
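The derivation follows on the next slides; as a concrete reference point, here is a minimal NumPy sketch of forward-backward for computing the marginals p(y_t = k | x). This is my own illustrative code (the variable names and the K×K / K×V parameter layout are assumptions, not the course's starter code), and it omits the log-space or scaling tricks a real implementation needs for long sequences.

import numpy as np

def forward_backward(pi, A, B, x):
    """Marginals p(y_t = k | x) for an HMM.
    pi: (K,) initial distribution p(y_1 = k)
    A:  (K, K) transitions, A[j, k] = p(y_t = k | y_{t-1} = j)
    B:  (K, V) emissions,   B[k, v] = p(x_t = v | y_t = k)
    x:  list of observation indices, length T
    """
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))

    # Forward pass: alpha[t, k] = p(x_1..x_t, y_t = k)
    alpha[0] = pi * B[:, x[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]

    # Backward pass: beta[t, k] = p(x_{t+1}..x_T | y_t = k)
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])

    evidence = alpha[T - 1].sum()          # p(x), the evaluation problem
    marginals = alpha * beta / evidence    # p(y_t = k | x)
    return marginals, evidence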
Derivation of Forward Algorithm
75
Derivation:
Definition:
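The whiteboard content is not in the extracted slides; a standard statement of the definition and recursion (my reconstruction, consistent with the notation above but not the lecture's exact derivation) is:

  Definition:  α_t(k) = p(x_1, …, x_t, y_t = k)
  Base case:   α_1(k) = p(y_1 = k | <START>) · p(x_1 | y_1 = k)
  Recursion:   α_t(k) = p(x_t | y_t = k) · Σ_j α_{t-1}(j) · p(y_t = k | y_{t-1} = j)
  Evaluation:  p(x) = Σ_k α_T(k)

The backward probabilities β_t(k) = p(x_{t+1}, …, x_T | y_t = k) satisfy the mirror-image recursion β_t(k) = Σ_j p(y_{t+1} = j | y_t = k) · p(x_{t+1} | y_{t+1} = j) · β_{t+1}(j), with β_T(k) = 1.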
Viterbi Algorithm
76
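This slide was also worked on the whiteboard; for concreteness, a minimal sketch of Viterbi decoding in the same NumPy setup as the forward-backward sketch above (again my own illustrative code, not the course's reference implementation):

import numpy as np

def viterbi(pi, A, B, x):
    """Most probable state sequence: argmax_y p(y | x) = argmax_y p(x, y)."""
    T, K = len(x), len(pi)
    omega = np.zeros((T, K))            # omega[t, k] = best prefix score ending in state k
    backptr = np.zeros((T, K), dtype=int)

    omega[0] = pi * B[:, x[0]]
    for t in range(1, T):
        scores = omega[t - 1][:, None] * A      # scores[j, k]: come from state j into state k
        backptr[t] = scores.argmax(axis=0)
        omega[t] = scores.max(axis=0) * B[:, x[t]]

    # Trace back the best path.
    y = np.zeros(T, dtype=int)
    y[T - 1] = omega[T - 1].argmax()
    for t in range(T - 2, -1, -1):
        y[t] = backptr[t + 1, y[t + 1]]
    return y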
Inference in HMMs
What is the computational complexity of inference for HMMs?
• The naïve (brute force) computations for Evaluation, Decoding, and Marginals take exponential time, O(K^T)
• The forward-backward algorithm and Viterbi algorithm run in polynomial time, O(T·K²)
  – Thanks to dynamic programming!
77
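To make the gap concrete (illustrative numbers, not from the slide): with K = 45 tags and a sentence of T = 20 words, brute force enumerates K^T = 45^20 ≈ 10^33 tag sequences, while the dynamic programs touch only T·K² = 20 · 45² = 40,500 transition scores.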
Shortcomings of Hidden Markov Models
• HMM models capture dependences between each state and only its corresponding observation
  – NLP example: In a sentence segmentation task, each segmental state may depend not just on a single word (and the adjacent segmental states), but also on the (non-local) features of the whole line such as line length, indentation, amount of white space, etc.
• Mismatch between learning objective function and prediction objective function
  – HMM learns a joint distribution of states and observations P(Y, X), but in a prediction task, we need the conditional probability P(Y|X)
© Eric Xing @ CMU, 2005-2015
78
[Figure: HMM graphical model — START followed by hidden states Y1, Y2, …, Yn emitting observations X1, X2, …, Xn]
MBR DECODING
79
Inference for HMMs
– Four Inference Problems for an HMM
1. Evaluation: Compute the probability of a given sequence of observations
2. Viterbi Decoding: Find the most-likely sequence of hidden states, given a sequence of observations
3. Marginals: Compute the marginal distribution for a hidden state, given a sequence of observations
4. MBR Decoding: Find the lowest loss sequence of hidden states, given a sequence of observations (Viterbi decoding is a special case)
80
Minimum Bayes Risk Decoding
• Suppose we are given a loss function l(y', y) and are asked for a single tagging
• How should we choose just one from our probability distribution p(y|x)?
• A minimum Bayes risk (MBR) decoder h(x) returns the variable assignment with minimum expected loss under the model's distribution
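The defining equation (the slide's formula is not in the extracted text; this is the standard form it describes):

  h(x) = argmin_{y'} E_{y ~ p(·|x)} [ l(y', y) ] = argmin_{y'} Σ_y p(y | x) · l(y', y)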
81
Minimum Bayes Risk Decoding
Consider some example loss functions:
The 0-1 loss function returns 1 only if the two assignments are identical and 0 otherwise:
The MBR decoder is:
which is exactly the Viterbi decoding problem!
82
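Written out (standard forms matching the description above; the slide's equations are not in the extracted text):

  l(y', y) = 1 - I(y' = y)

so the expected loss of y' is 1 - p(y' | x), and minimizing it gives

  h(x) = argmax_{y'} p(y' | x),

the single most probable sequence, i.e. exactly Viterbi decoding.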
Minimum Bayes Risk Decoding
Consider some example loss functions:
The Hamming loss corresponds to accuracy and returns the number of incorrect variable assignments:
The MBR decoder is:
This decomposes across variables and requires the variable marginals.
83
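Again writing out the standard forms the slide describes:

  l(y', y) = Σ_{t=1}^{T} I(y'_t ≠ y_t)

The expected Hamming loss decomposes over positions, so it is minimized one position at a time:

  y'_t = argmax_k p(Y_t = k | x)   for each t = 1, …, T,

i.e. pick the tag with the highest marginal at each position, which are exactly the marginals forward-backward computes (in the sketch above, y_hat = marginals.argmax(axis=1)).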
Learning Objectives
Hidden Markov Models
You should be able to…
1. Show that structured prediction problems yield high-computation inference problems
2. Define the first order Markov assumption
3. Draw a Finite State Machine depicting a first order Markov assumption
4. Derive the MLE parameters of an HMM
5. Define the three key problems for an HMM: evaluation, decoding, and marginal computation
6. Derive a dynamic programming algorithm for computing the marginal probabilities of an HMM
7. Interpret the forward-backward algorithm as a message passing algorithm
8. Implement supervised learning for an HMM
9. Implement the forward-backward algorithm for an HMM
10. Implement the Viterbi algorithm for an HMM
11. Implement a minimum Bayes risk decoder with Hamming loss for an HMM
84