Hidden Markov Models
1
10-601 Introduction to Machine Learning
Matt Gormley
Lecture 20
Mar. 30, 2020
Machine Learning Department
School of Computer Science
Carnegie Mellon University
Reminders
• Practice Problems for Exam 2
  – Out: Fri, Mar 20
• Midterm Exam 2
  – Thu, Apr 2 – evening exam, details announced on Piazza
• Homework 7: HMMs
  – Out: Thu, Apr 02
  – Due: Fri, Apr 10 at 11:59pm
• Today's In-Class Poll
  – http://poll.mlcourse.org
2
HMMs: History
• Markov chains (Andrey Markov)
  – Random walks and Brownian motion
• Used in Shannon's work on information theory (1948)
• Baum-Welch learning algorithm (late 60's, early 70's)
  – Used mainly for speech in 70s–80s
• Late 80's and 90's: David Haussler (major player in learning theory in 80's) began to use HMMs for modeling biological sequences
• Mid-late 1990's: Dayne Freitag / Andrew McCallum
  – Freitag thesis with Tom Mitchell on IE from Web using logic programs, grammar induction, etc.
  – McCallum: multinomial Naïve Bayes for text
  – With McCallum, IE using HMMs on CORA
• …
Slide from William Cohen
Higher-order HMMs
• 1st-order HMM (i.e. bigram HMM)
• 2nd-order HMM (i.e. trigram HMM)
• 3rd-order HMM
33
[Figure: graphical models for the 1st-, 2nd-, and 3rd-order HMMs, each over hidden states Y1–Y5 and observations X1–X5 starting from <START>; higher-order models add transition edges from earlier hidden states.]
Higher-order HMMs
34
[Slide 34 repeats the figure above, labeling the top row "Hidden States, y" and the bottom row "Observations, x".]
BACKGROUND: MESSAGE PASSING
35
Great Ideas in ML: Message Passing
Count the soldiers
[Figure: soldiers standing in a line pass messages along the line: each tells the soldier in front "N behind you" (1 through 5 behind you) and the soldier behind "N before you" (1 through 5 before you), plus "there's 1 of me".]
36
Adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
Count the soldiers
[Figure: one soldier only sees her incoming messages, "2 before you" and "3 behind you", plus "there's 1 of me".]
Belief: Must be 2 + 1 + 3 = 6 of us
37
Adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
Count the soldiers
[Figure: the next soldier only sees her incoming messages, "1 before you" and "4 behind you", plus "there's 1 of me".]
Belief: Must be 1 + 1 + 4 = 6 of us, the same total as the previous soldier's 2 + 1 + 3 = 6.
38
Adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of the tree
[Figure (slides 39–41): soldiers now form a tree; each node adds the counts reported by its neighbors on one side, plus 1 for itself, and passes the total to the other side, e.g. "7 here" + "3 here" + "1 of me" → "11 here (= 7 + 3 + 1)", and "3 here" + "3 here" + 1 → "7 here (= 3 + 3 + 1)".]
Adapted from MacKay (2003) textbook
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of the tree
[Figure (slides 42–43): a soldier receives "7 here", "3 here", and "3 here" from its three branches.]
Belief: Must be 14 of us (= 7 + 3 + 3 + 1 of me)
Adapted from MacKay (2003) textbook
THE FORWARD-BACKWARD ALGORITHM
44
Inference
Question: True or False: The joint probability of the observations and the hidden states in an HMM is given by: [formula shown on slide]
45
Recall: [HMM definition from the earlier slides]
Inference
Question: True or False: The probability of the observations in an HMM is given by: [formula shown on slide]
46
Recall: [HMM definition from the earlier slides]
Inference
Question: True or False: Suppose each hidden state takes K values. The marginal probability of a hidden state y_t given the observations x is given by: [formula shown on slide]
47
Recall: [HMM definition from the earlier slides]
Inference for HMMs
Whiteboard
– Three Inference Problems for an HMM
1. Evaluation: Compute the probability of a given sequence of observations
2. Viterbi Decoding: Find the most-likely sequence of hidden states, given a sequence of observations
3. Marginals: Compute the marginal distribution for a hidden state, given a sequence of observations
48
Dataset for Supervised Part-of-Speech (POS) Tagging
49
Data: pairs of tag sequences y(i) and word sequences x(i)

Sample 1:  y(1) = n v p d n
           x(1) = time flies like an arrow
Sample 2:  y(2) = n n v d n
           x(2) = time flies like an arrow
Sample 3:  y(3) = n v p n n
           x(3) = flies fly with their wings
Sample 4:  y(4) = p n n v v
           x(4) = with time you will see
Inference for HMMs
Whiteboard
– Forward-backward search space
50
[Figure: forward-backward search space (lattice) for the sentence "time flies like an arrow" with candidate tags n, v, p, d at each position, starting from <START>]
Hidden Markov Model
52
A Hidden Markov Model (HMM) provides a joint distribution over the sentence/tags with an assumption of dependence between adjacent tags.
Transition probabilities (rows: previous tag, columns: next tag):
        v    n    p    d
   v   .1   .4   .2   .3
   n   .8   .1   .1    0
   p   .2   .3   .2   .3
   d   .2   .8    0    0

Emission probabilities (columns: time, flies, like, …):
        time  flies  like  …
   v     .2    .5     .2
   n     .3    .4     .2
   p     .1    .1     .3
   d     .1    .2     .1
p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .2 * .5 * …)
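The product above just multiplies one transition and one emission probability per position: under the HMM assumption, p(x, y) = prod_t p(y_t | y_{t-1}) · p(x_t | y_t), with y_0 = <START>. Below is a minimal Python sketch of that computation. The v/n/p/d transition values and the time/flies/like emission values come from the tables on the slide; the <START> row and the emissions for "an" and "arrow" are not shown on the slide, so those numbers are made-up placeholders.

# Hedged sketch: HMM joint probability p(x, y) = prod_t p(y_t | y_{t-1}) * p(x_t | y_t).
trans = {  # p(next tag | previous tag); <START> row is a placeholder
    "<START>": {"v": 0.1, "n": 0.6, "p": 0.1, "d": 0.2},
    "v": {"v": 0.1, "n": 0.4, "p": 0.2, "d": 0.3},
    "n": {"v": 0.8, "n": 0.1, "p": 0.1, "d": 0.0},
    "p": {"v": 0.2, "n": 0.3, "p": 0.2, "d": 0.3},
    "d": {"v": 0.2, "n": 0.8, "p": 0.0, "d": 0.0},
}
emit = {  # p(word | tag); "an" and "arrow" columns are placeholders
    "v": {"time": 0.2, "flies": 0.5, "like": 0.2, "an": 0.01, "arrow": 0.01},
    "n": {"time": 0.3, "flies": 0.4, "like": 0.2, "an": 0.01, "arrow": 0.1},
    "p": {"time": 0.1, "flies": 0.1, "like": 0.3, "an": 0.01, "arrow": 0.01},
    "d": {"time": 0.1, "flies": 0.2, "like": 0.1, "an": 0.5, "arrow": 0.01},
}

def joint_prob(tags, words):
    """p(x, y) for one tagged sentence."""
    p, prev = 1.0, "<START>"
    for tag, word in zip(tags, words):
        p *= trans[prev][tag] * emit[tag][word]
        prev = tag
    return p

print(joint_prob(["n", "v", "p", "d", "n"],
                 ["time", "flies", "like", "an", "arrow"]))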
53
[Figure: HMM with hidden tags Y1, Y2, Y3 over the observed words "find preferred tags" (X1, X2, X3). "find" could be a verb or noun, "preferred" could be an adjective or verb, "tags" could be a noun or verb.]
Forward-Backward Algorithm
54
[Figure: the same HMM over "find preferred tags", with hidden variables Y1, Y2, Y3 and observations X1, X2, X3]
Forward-Backward Algorithm
55
• Let's show the possible values for each variable
• One possible assignment
• And what the 7 factors think of it …
[Figure: trellis over "find preferred tags": each position Y1, Y2, Y3 can take tag v, n, or a, bracketed by START and END nodes]
[Slides 56–58 repeat the trellis and bullets above, highlighting in turn the possible values for each variable, one possible assignment (a single path from START to END), and what the 7 transition / emission factors think of it.]
Forward-Backward Algorithm
59
• Let's show the possible values for each variable
• One possible assignment
• And what the 7 transition / emission factors think of it …
[Figure: trellis over "find preferred tags" with the factor tables attached]

Emission factors (columns: find, pref., tags, …):
   v    3    5    3
   n    4    5    2
   a   0.1  0.2  0.1

Transition factors (columns: v, n, a):
   v    1    6    4
   n    8    4   0.1
   a   0.1   8    0
Viterbi Algorithm: Most Probable Assignment
60
• So p(v a n) = (1/Z) · product of 7 numbers
• Numbers associated with edges and nodes of path
• Most probable assignment = path with highest product
[Figure: the path v → a → n is highlighted in the trellis over "find preferred tags"; callouts point to the factor values along its edges and nodes, e.g. the transition weight from a into END and the emission weight of "tags" given n.]
Viterbi Algorithm: Most Probable Assignment
61
• So p(v a n) = (1/Z) · product weight of one path
[Figure: same trellis; the product of the 7 factor values along the path v → a → n is the weight of that single path.]
Forward-Backward Algorithm: Finds Marginals
62
• So p(v a n) = (1/Z) · product weight of one path
• Marginal probability p(Y2 = a) = (1/Z) · total weight of all paths through a
[Figure: trellis with all paths passing through the node a at position 2 highlighted]
Forward-Backward Algorithm: Finds Marginals
63
• So p(v a n) = (1/Z) · product weight of one path
• Marginal probability p(Y2 = n) = (1/Z) · total weight of all paths through n
[Figure: trellis with all paths passing through the node n at position 2 highlighted]
Forward-Backward Algorithm: Finds Marginals
64
• So p(v a n) = (1/Z) · product weight of one path
• Marginal probability p(Y2 = v) = (1/Z) · total weight of all paths through v
[Figure: trellis with all paths passing through the node v at position 2 highlighted]
Forward-Backward Algorithm: Finds Marginals
65
• So p(v a n) = (1/Z) · product weight of one path
• Marginal probability p(Y2 = n) = (1/Z) · total weight of all paths through n
α2(n) = total weight of these path prefixes (ending at the node n at position 2)
(found by dynamic programming: matrix-vector products)
[Figure: the prefix portion of the trellis, from START to the node n at position 2, is highlighted]
Forward-Backward Algorithm: Finds Marginals
66
β2(n) = total weight of these path suffixes (leaving the node n at position 2)
(found by dynamic programming: matrix-vector products)
[Figure: the suffix portion of the trellis, from the node n at position 2 to END, is highlighted]
Forward-Backward Algorithm: Finds Marginals
67
α2(n) = total weight of these path prefixes
β2(n) = total weight of these path suffixes
[Figure: the trellis shown twice, once with the prefix paths into n at position 2 highlighted (α2(n)), once with the suffix paths out of n highlighted (β2(n))]
Forward-Backward Algorithm: Finds Marginals
68
α2(n) = (a + b + c), the total weight of the path prefixes; β2(n) = (x + y + z), the total weight of the path suffixes.
Product gives ax + ay + az + bx + by + bz + cx + cy + cz = total weight of paths
Forward-Backward Algorithm: Finds Marginals
69
total weight of all paths through n = α2(n) · A(pref., n) · β2(n)    ("belief that Y2 = n")
Oops! The weight of a path through a state also includes a weight at that state, so α2(n) · β2(n) isn't enough.
The extra weight, A(pref., n), is the opinion of the emission probability at this variable.
[Figure: trellis highlighting the node n at position 2 together with its prefix paths, suffix paths, and the emission factor at that node]
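In standard probabilistic notation (a restatement of the same idea, not the slide's exact whiteboard derivation), the belief reads:

  α_t(k) = p(x_1, …, x_t, Y_t = k)          (forward probability; includes the emission at t)
  β_t(k) = p(x_{t+1}, …, x_T | Y_t = k)      (backward probability)
  p(Y_t = k | x) = α_t(k) · β_t(k) / p(x)

In the path-weight picture on these slides, α2(n) was defined over prefixes excluding the node's own weight, which is why the emission factor A(pref., n) has to be multiplied in separately; with the definitions above it is already folded into α_t(k).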
Forward-Backward Algorithm: Finds Marginals
70
total weight of all paths through v = α2(v) · A(pref., v) · β2(v)    ("belief that Y2 = v", computed alongside the belief that Y2 = n)
[Figure: trellis highlighting the node v at position 2 with its prefix paths, suffix paths, and emission factor]
Forward-Backward Algorithm: Finds Marginals
71
total weight of all paths through a = α2(a) · A(pref., a) · β2(a)    ("belief that Y2 = a", alongside the beliefs that Y2 = n and Y2 = v)

Unnormalized beliefs at position 2:       v 0.1,  n 0,  a 0.4
sum = Z (total weight of all paths) = 0.5
divide by Z = 0.5 to get marginal probs:  v 0.2,  n 0,  a 0.8
[Figure: trellis highlighting the node a at position 2]
Forward-Backward Algorithm
72
[Figure: back to the original chain over "find preferred tags" (tags Y1, Y2, Y3 over words X1, X2, X3), where "find" could be a verb or noun, "preferred" an adjective or verb, and "tags" a noun or verb]
Inference for HMMs
Whiteboard
– Derivation of Forward algorithm
– Forward-backward algorithm
– Viterbi algorithm
73
Forward-Backward Algorithm
74
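The derivation follows on the next slides; as a concrete reference point, here is a minimal NumPy sketch of forward-backward for computing the marginals p(y_t = k | x). This is my own illustrative code (the variable names and the K×K / K×V parameter layout are assumptions, not the course's starter code), and it omits the log-space or scaling tricks a real implementation needs for long sequences.

import numpy as np

def forward_backward(pi, A, B, x):
    """Marginals p(y_t = k | x) for an HMM.
    pi: (K,) initial distribution p(y_1 = k)
    A:  (K, K) transitions, A[j, k] = p(y_t = k | y_{t-1} = j)
    B:  (K, V) emissions,   B[k, v] = p(x_t = v | y_t = k)
    x:  list of observation indices, length T
    """
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))

    # Forward pass: alpha[t, k] = p(x_1..x_t, y_t = k)
    alpha[0] = pi * B[:, x[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]

    # Backward pass: beta[t, k] = p(x_{t+1}..x_T | y_t = k)
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])

    evidence = alpha[T - 1].sum()          # p(x), the evaluation problem
    marginals = alpha * beta / evidence    # p(y_t = k | x)
    return marginals, evidence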
Derivation of Forward Algorithm
75
Derivation:
Definition:
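The whiteboard content is not in the extracted slides; a standard statement of the definition and recursion (my reconstruction, consistent with the notation above but not the lecture's exact derivation) is:

  Definition:  α_t(k) = p(x_1, …, x_t, y_t = k)
  Base case:   α_1(k) = p(y_1 = k | <START>) · p(x_1 | y_1 = k)
  Recursion:   α_t(k) = p(x_t | y_t = k) · Σ_j α_{t-1}(j) · p(y_t = k | y_{t-1} = j)
  Evaluation:  p(x) = Σ_k α_T(k)

The backward probabilities β_t(k) = p(x_{t+1}, …, x_T | y_t = k) satisfy the mirror-image recursion β_t(k) = Σ_j p(y_{t+1} = j | y_t = k) · p(x_{t+1} | y_{t+1} = j) · β_{t+1}(j), with β_T(k) = 1.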
Viterbi Algorithm
76
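This slide was also worked on the whiteboard; for concreteness, a minimal sketch of Viterbi decoding in the same NumPy setup as the forward-backward sketch above (again my own illustrative code, not the course's reference implementation):

import numpy as np

def viterbi(pi, A, B, x):
    """Most probable state sequence: argmax_y p(y | x) = argmax_y p(x, y)."""
    T, K = len(x), len(pi)
    omega = np.zeros((T, K))            # omega[t, k] = best prefix score ending in state k
    backptr = np.zeros((T, K), dtype=int)

    omega[0] = pi * B[:, x[0]]
    for t in range(1, T):
        scores = omega[t - 1][:, None] * A      # scores[j, k]: come from state j into state k
        backptr[t] = scores.argmax(axis=0)
        omega[t] = scores.max(axis=0) * B[:, x[t]]

    # Trace back the best path.
    y = np.zeros(T, dtype=int)
    y[T - 1] = omega[T - 1].argmax()
    for t in range(T - 2, -1, -1):
        y[t] = backptr[t + 1, y[t + 1]]
    return y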
Inference in HMMs
What is the computational complexity of inference for HMMs?
• The naïve (brute force) computations for Evaluation, Decoding, and Marginals take exponential time, O(K^T)
• The forward-backward algorithm and Viterbi algorithm run in polynomial time, O(T·K²)
  – Thanks to dynamic programming!
77
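To make the gap concrete (illustrative numbers, not from the slide): with K = 45 tags and a sentence of T = 20 words, brute force enumerates K^T = 45^20 ≈ 10^33 tag sequences, while the dynamic programs touch only T·K² = 20 · 45² = 40,500 transition scores.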
Shortcomings of Hidden Markov Models
• HMM models capture dependences between each state and only its corresponding observation
  – NLP example: In a sentence segmentation task, each segmental state may depend not just on a single word (and the adjacent segmental states), but also on the (non-local) features of the whole line such as line length, indentation, amount of white space, etc.
• Mismatch between learning objective function and prediction objective function
  – HMM learns a joint distribution of states and observations P(Y, X), but in a prediction task, we need the conditional probability P(Y|X)
© Eric Xing @ CMU, 2005-2015
78
[Figure: HMM graphical model — START followed by hidden states Y1, Y2, …, Yn emitting observations X1, X2, …, Xn]
MBR DECODING
79
Inference for HMMs
– Four Inference Problems for an HMM
1. Evaluation: Compute the probability of a given sequence of observations
2. Viterbi Decoding: Find the most-likely sequence of hidden states, given a sequence of observations
3. Marginals: Compute the marginal distribution for a hidden state, given a sequence of observations
4. MBR Decoding: Find the lowest loss sequence of hidden states, given a sequence of observations (Viterbi decoding is a special case)
80
Minimum Bayes Risk Decoding
• Suppose we are given a loss function l(y', y) and are asked for a single tagging
• How should we choose just one from our probability distribution p(y|x)?
• A minimum Bayes risk (MBR) decoder h(x) returns the variable assignment with minimum expected loss under the model's distribution
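The defining equation (the slide's formula is not in the extracted text; this is the standard form it describes):

  h(x) = argmin_{y'} E_{y ~ p(·|x)} [ l(y', y) ] = argmin_{y'} Σ_y p(y | x) · l(y', y)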
81
Minimum Bayes Risk Decoding
Consider some example loss functions:
The 0-1 loss function returns 1 only if the two assignments are identical and 0 otherwise:
The MBR decoder is:
which is exactly the Viterbi decoding problem!
82
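Written out (standard forms matching the description above; the slide's equations are not in the extracted text):

  l(y', y) = 1 - I(y' = y)

so the expected loss of y' is 1 - p(y' | x), and minimizing it gives

  h(x) = argmax_{y'} p(y' | x),

the single most probable sequence, i.e. exactly Viterbi decoding.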
Minimum Bayes Risk Decoding
Consider some example loss functions:
The Hamming loss corresponds to accuracy and returns the number of incorrect variable assignments:
The MBR decoder is:
This decomposes across variables and requires the variable marginals.
83
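Again writing out the standard forms the slide describes:

  l(y', y) = Σ_{t=1}^{T} I(y'_t ≠ y_t)

The expected Hamming loss decomposes over positions, so it is minimized one position at a time:

  y'_t = argmax_k p(Y_t = k | x)   for each t = 1, …, T,

i.e. pick the tag with the highest marginal at each position, which are exactly the marginals forward-backward computes (in the sketch above, y_hat = marginals.argmax(axis=1)).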
Learning Objectives
Hidden Markov Models
You should be able to…
1. Show that structured prediction problems yield high-computation inference problems
2. Define the first order Markov assumption
3. Draw a Finite State Machine depicting a first order Markov assumption
4. Derive the MLE parameters of an HMM
5. Define the three key problems for an HMM: evaluation, decoding, and marginal computation
6. Derive a dynamic programming algorithm for computing the marginal probabilities of an HMM
7. Interpret the forward-backward algorithm as a message passing algorithm
8. Implement supervised learning for an HMM
9. Implement the forward-backward algorithm for an HMM
10. Implement the Viterbi algorithm for an HMM
11. Implement a minimum Bayes risk decoder with Hamming loss for an HMM
84