Page 1: 10-601 Introduction to Machine Learning

Hidden Markov Models

10-601 Introduction to Machine Learning
Matt Gormley
Lecture 20
Mar. 30, 2020

Machine Learning Department
School of Computer Science
Carnegie Mellon University

Page 2: 10-601 Introduction to Machine Learning

Reminders

• Practice Problems for Exam 2
  – Out: Fri, Mar 20
• Midterm Exam 2
  – Thu, Apr 2 – evening exam, details announced on Piazza
• Homework 7: HMMs
  – Out: Thu, Apr 02
  – Due: Fri, Apr 10 at 11:59pm
• Today's In-Class Poll
  – http://poll.mlcourse.org

Page 3: 10-601 Introduction to Machine Learning

HMMs: History

• Markov chains (Andrey Markov, 1906)
  – Random walks and Brownian motion
• Used in Shannon's work on information theory (1948)
• Baum-Welch learning algorithm: late 60's, early 70's
  – Used mainly for speech in 70s-80s
• Late 80's and 90's: David Haussler (major player in learning theory in 80's) began to use HMMs for modeling biological sequences
• Mid-late 1990's: Dayne Freitag/Andrew McCallum
  – Freitag thesis with Tom Mitchell on IE from Web using logic programs, grammar induction, etc.
  – McCallum: multinomial Naïve Bayes for text
  – With McCallum, IE using HMMs on CORA
• …

Slide from William Cohen

Page 4: 10-601 Introduction to Machine Learning

Higher-order HMMs

• 1st-order HMM (i.e. bigram HMM)
• 2nd-order HMM (i.e. trigram HMM)
• 3rd-order HMM

[Figure: three chain-structured graphical models, one per order, each with hidden states Y1–Y5, observations X1–X5, and a <START> state feeding into Y1.]

Page 5: 10-601 Introduction to Machine Learning

Higher-order HMMs

• 1st-order HMM (i.e. bigram HMM)
• 2nd-order HMM (i.e. trigram HMM)
• 3rd-order HMM

[Figure: the same three chain diagrams, with the Y row labeled "Hidden States, y" and the X row labeled "Observations, x".]
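The factorizations behind the three diagrams, sketched in LaTeX (padding the tag sequence with START states as needed; the 2nd- and 3rd-order forms are the standard generalizations, not spelled out on the slide):

% 1st-order (bigram): each tag depends on the previous tag
p(\mathbf{x}, \mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1}) \, p(x_t \mid y_t)

% 2nd-order (trigram): each tag depends on the previous two tags
p(\mathbf{x}, \mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1}, y_{t-2}) \, p(x_t \mid y_t)

% 3rd-order: each tag depends on the previous three tags
p(\mathbf{x}, \mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1}, y_{t-2}, y_{t-3}) \, p(x_t \mid y_t)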

Page 6: 10-601 Introduction to Machine Learning

BACKGROUND: MESSAGE PASSING

Page 7: 10-601 Introduction to Machine Learning

Great Ideas in ML: Message Passing

Count the soldiers

[Figure: soldiers standing in a line. Messages passed down the line in one direction: "1 behind you", "2 behind you", … "5 behind you"; in the other direction: "1 before you", "2 before you", … "5 before you"; each soldier contributes "there's 1 of me".]

adapted from MacKay (2003) textbook

Page 8: 10-601 Introduction to Machine Learning

Great Ideas in ML: Message Passing

Count the soldiers

[Figure: one soldier, who can only see his incoming messages: "2 before you", "3 behind you", and "there's 1 of me".]

Belief: Must be 2 + 1 + 3 = 6 of us

adapted from MacKay (2003) textbook

Page 9: 10-601 Introduction to Machine Learning

Great Ideas in ML: Message Passing

Count the soldiers

[Figure: the next soldier in line, who only sees his own incoming messages: "1 before you" and "4 behind you".]

Belief at the previous soldier: Must be 2 + 1 + 3 = 6 of us
Belief at this soldier: Must be 1 + 1 + 4 = 6 of us

adapted from MacKay (2003) textbook

Page 10: 10-601 Introduction to Machine Learning

Great Ideas in ML: Message Passing

Each soldier receives reports from all branches of tree

[Figure: a tree of soldiers; reports "7 here" and "3 here" arrive from two branches, and with "1 of me" the outgoing report is "11 here (= 7 + 3 + 1)".]

adapted from MacKay (2003) textbook

Page 11: 10-601 Introduction to Machine Learning

Great Ideas in ML: Message Passing

Each soldier receives reports from all branches of tree

[Figure: reports "3 here" and "3 here" combine into the outgoing report "7 here (= 3 + 3 + 1)".]

adapted from MacKay (2003) textbook

Page 12: 10-601 Introduction to Machine Learning

Great Ideas in ML: Message Passing

Each soldier receives reports from all branches of tree

[Figure: reports "7 here" and "3 here" combine into the outgoing report "11 here (= 7 + 3 + 1)".]

adapted from MacKay (2003) textbook

Page 13: 10-601 Introduction to Machine Learning

Great Ideas in ML: Message Passing

Each soldier receives reports from all branches of tree

[Figure: one soldier receives reports "7 here", "3 here", and "3 here" from his three branches.]

Belief: Must be 14 of us

adapted from MacKay (2003) textbook

Page 14: 10-601 Introduction to Machine Learning

Great Ideas in ML: Message Passing

Each soldier receives reports from all branches of tree

[Figure: the same tree; incoming reports "7 here", "3 here", "3 here".]

Belief: Must be 14 of us

adapted from MacKay (2003) textbook
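The counting trick on these slides is exactly a message-passing update: each message summarizes everything on one side of an edge, and a belief combines all incoming messages. A minimal sketch on a toy tree (the node names and edges are illustrative, not from the slides):

# Soldier counting as message passing on a tree.
# message(u, v): how many nodes are on u's side of the edge (u, v);
# it equals 1 (for u itself) plus the messages u gets from its other neighbors.
neighbors = {
    'a': ['b'], 'b': ['a', 'c', 'd'], 'c': ['b'],
    'd': ['b', 'e', 'f'], 'e': ['d'], 'f': ['d'],
}

def message(u, v):
    return 1 + sum(message(w, u) for w in neighbors[u] if w != v)

def belief(v):
    # Each node computes the total count from its incoming messages alone.
    return 1 + sum(message(u, v) for u in neighbors[v])

print([belief(v) for v in neighbors])  # every node agrees: [6, 6, 6, 6, 6, 6]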

Page 15: 10-601 Introduction to Machine Learning

THE FORWARD-BACKWARD ALGORITHM

Page 16: 10-601 Introduction to Machine Learning

Inference

Question: True or False: The joint probability of the observations and the hidden states in an HMM is given by: [candidate equation shown in the poll]

Recall: the HMM joint distribution defined earlier.

Page 17: 10-601 Introduction to Machine Learning

Inference

Question: True or False: The probability of the observations in an HMM is given by: [candidate equation shown in the poll]

Recall: the HMM joint distribution defined earlier.

Page 18: 10-601 Introduction to Machine Learning

Inference

Question: True or False: Suppose each hidden state takes K values. The marginal probability of a hidden state y_t given the observations x is given by: [candidate equation shown in the poll]

Recall: the HMM joint distribution defined earlier (the standard forms are summarized below).
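For reference, the standard forms of the three quantities these polls ask about, sketched in LaTeX (assuming a length-T sequence, K-valued hidden states, and y_0 = START; these are the textbook definitions, not necessarily the exact expressions shown in the polls):

% Joint probability of observations and hidden states
p(\mathbf{x}, \mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1}) \, p(x_t \mid y_t)

% Probability of the observations (Evaluation): marginalize out the hidden states
p(\mathbf{x}) = \sum_{y_1=1}^{K} \cdots \sum_{y_T=1}^{K} \prod_{t=1}^{T} p(y_t \mid y_{t-1}) \, p(x_t \mid y_t)

% Marginal of one hidden state given the observations
p(y_t = k \mid \mathbf{x}) = \frac{\sum_{\mathbf{y} : y_t = k} p(\mathbf{x}, \mathbf{y})}{p(\mathbf{x})}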

Page 19: 10-601 Introduction to Machine Learning

Inference for HMMs

Whiteboard
– Three Inference Problems for an HMM
  1. Evaluation: Compute the probability of a given sequence of observations
  2. Viterbi Decoding: Find the most-likely sequence of hidden states, given a sequence of observations
  3. Marginals: Compute the marginal distribution for a hidden state, given a sequence of observations

Page 20: 10-601 Introduction to Machine Learning

Dataset for Supervised Part-of-Speech (POS) Tagging

Data:

Sample 1:   y(1): n v p d n      x(1): time flies like an arrow
Sample 2:   y(2): n n v d n      x(2): time flies like an arrow
Sample 3:   y(3): n v p n n      x(3): flies fly with their wings
Sample 4:   y(4): p n n v v      x(4): with time you will see

Page 21: 10-601 Introduction to Machine Learning

Inference for HMMs

Whiteboard
– Forward-backward search space

Page 22: 10-601 Introduction to Machine Learning

Hidden Markov Model

[Figure: the tagged sentence, with <START> preceding the tags:
  <START>  n    v     p    d  n
           time flies like an arrow]

A Hidden Markov Model (HMM) provides a joint distribution over the sentence/tags with an assumption of dependence between adjacent tags.

Transition probabilities p(y_t | y_{t-1}) (row = previous tag, column = next tag):

        v    n    p    d
   v   .1   .4   .2   .3
   n   .8   .1   .1   0
   p   .2   .3   .2   .3
   d   .2   .8   0    0

Emission probabilities p(x_t | y_t) (row = tag; only the columns for "time", "flies", "like" are shown):

        time  flies  like
   v    .2    .5     .2
   n    .3    .4     .2
   p    .1    .1     .3
   d    .1    .2     .1

p(n, v, p, d, n, time, flies, like, an, arrow) = (.3 * .8 * .2 * .5 * …)
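To make the product above concrete, here is a minimal sketch that scores one tagging under these tables. The <START> transition row is an assumption (it is not part of the tables above), and only the first three emission columns survive here, so the numbers are illustrative:

# Minimal sketch: scoring one tagging under an HMM (values illustrative).
transition = {  # p(y_t | y_{t-1}); row = previous tag
    '<START>': {'n': .3},  # assumed start probability, for illustration only
    'v': {'v': .1, 'n': .4, 'p': .2, 'd': .3},
    'n': {'v': .8, 'n': .1, 'p': .1, 'd': 0},
    'p': {'v': .2, 'n': .3, 'p': .2, 'd': .3},
    'd': {'v': .2, 'n': .8, 'p': 0, 'd': 0},
}
emission = {  # p(x_t | y_t); only the first three words of the table
    'v': {'time': .2, 'flies': .5, 'like': .2},
    'n': {'time': .3, 'flies': .4, 'like': .2},
    'p': {'time': .1, 'flies': .1, 'like': .3},
    'd': {'time': .1, 'flies': .2, 'like': .1},
}

def joint_prob(tags, words):
    """p(y, x) = prod_t p(y_t | y_{t-1}) * p(x_t | y_t), with y_0 = <START>."""
    p = 1.0
    prev = '<START>'
    for y, x in zip(tags, words):
        p *= transition[prev][y] * emission[y][x]
        prev = y
    return p

print(joint_prob(['n', 'v', 'p'], ['time', 'flies', 'like']))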

Page 23: 10-601 Introduction to Machine Learning

Forward-Backward Algorithm

[Figure: chain Y1–Y2–Y3 over observations X1, X2, X3 = "find preferred tags". "find" could be verb or noun; "preferred" could be adjective or verb; "tags" could be noun or verb.]

Page 24: 10-601 Introduction to Machine Learning

Forward-Backward Algorithm

[Figure: the same chain Y1–Y3 over "find preferred tags".]

Page 25: 10-601 Introduction to Machine Learning

Forward-Backward Algorithm

[Trellis: possible values {v, n, a} for each of Y1, Y2, Y3 over "find preferred tags", with START and END nodes.]

• Let's show the possible values for each variable
• One possible assignment
• And what the 7 factors think of it …

Page 26: 10-601 Introduction to Machine Learning

Forward-Backward Algorithm

[Trellis: states {v, n, a} at each position, with START and END.]

• Let's show the possible values for each variable
• One possible assignment
• And what the 7 factors think of it …

Page 27: 10-601 Introduction to Machine Learning

Forward-Backward Algorithm

[Trellis: states {v, n, a} at each position, with one assignment highlighted.]

• Let's show the possible values for each variable
• One possible assignment
• And what the 7 factors think of it …

Page 28: 10-601 Introduction to Machine Learning

Forward-Backward Algorithm

[Trellis: states {v, n, a} at each position, with one assignment highlighted.]

• Let's show the possible values for each variable
• One possible assignment
• And what the 7 transition / emission factors think of it …

Page 29: 10-601 Introduction to Machine Learning

Forward-Backward Algorithm

[Trellis: states {v, n, a} at each position, with one assignment highlighted.]

• Let's show the possible values for each variable
• One possible assignment
• And what the 7 transition / emission factors think of it …

Emission potentials (row = tag, column = word):

        find  pref.  tags
   v     3     5      3
   n     4     5      2
   a     0.1   0.2    0.1

Transition potentials (row = previous tag, column = next tag):

        v     n     a
   v    1     6     4
   n    8     4     0.1
   a    0.1   8     0

Page 30: 10-601 Introduction to Machine Learning

Viterbi Algorithm: Most Probable Assignment

[Trellis: one path START → v → a → n → END highlighted, with weights on its nodes and edges; the original figure labels include the emission weight A(tags, n) and the transition weight B(a, END).]

• So p(v a n) = (1/Z) · product of 7 numbers
• Numbers associated with edges and nodes of path
• Most probable assignment = path with highest product
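To make "product of 7 numbers" concrete with the Page 29 potentials (and assuming START and END transition weights of 1, which are not given here; the ψ(START, v) and ψ(n, END) names are hypothetical labels for those assumed factors), the path v a n would score:

p(v a n) ∝ ψ(START, v) · A(find, v) · B(v, a) · A(pref., a) · B(a, n) · A(tags, n) · ψ(n, END)
         = 1 · 3 · 4 · 0.2 · 8 · 2 · 1 = 38.4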

Page 31: 10-601 Introduction to Machine Learning

Viterbi Algorithm: Most Probable Assignment

[Trellis: the same path START → v → a → n → END highlighted.]

• So p(v a n) = (1/Z) · product weight of one path

Page 32: 10-601 Introduction to Machine Learning

Forward-Backward Algorithm: Finds Marginals

[Trellis: all paths passing through state a at position 2 highlighted.]

• So p(v a n) = (1/Z) · product weight of one path
• Marginal probability p(Y2 = a) = (1/Z) · total weight of all paths through a

Page 33: 10-601 Introduction to Machine Learning

Forward-Backward Algorithm: Finds Marginals

[Trellis: all paths passing through state n at position 2 highlighted.]

• So p(v a n) = (1/Z) · product weight of one path
• Marginal probability p(Y2 = n) = (1/Z) · total weight of all paths through n

Page 34: 10-601 Introduction to Machine Learning

Forward-Backward Algorithm: Finds Marginals

[Trellis: all paths passing through state v at position 2 highlighted.]

• So p(v a n) = (1/Z) · product weight of one path
• Marginal probability p(Y2 = v) = (1/Z) · total weight of all paths through v

Page 35: 10-601 Introduction to Machine Learning

Forward-Backward Algorithm: Finds Marginals

[Trellis: all paths passing through state n at position 2 highlighted.]

• So p(v a n) = (1/Z) · product weight of one path
• Marginal probability p(Y2 = n) = (1/Z) · total weight of all paths through n

Page 36: 10-601 Introduction to Machine Learning

Forward-Backward Algorithm: Finds Marginals

α2(n) = total weight of these path prefixes
(found by dynamic programming: matrix-vector products)

[Trellis: all path prefixes from START to state n at position 2 highlighted.]

Page 37: 10-601 Introduction to Machine Learning

Forward-Backward Algorithm: Finds Marginals

β2(n) = total weight of these path suffixes
(found by dynamic programming: matrix-vector products)

[Trellis: all path suffixes from state n at position 2 to END highlighted.]

Page 38: 10-601 Introduction to Machine Learning

Forward-Backward Algorithm: Finds Marginals

α2(n) = total weight of these path prefixes
β2(n) = total weight of these path suffixes

[Trellis: the prefixes and suffixes meet at state n at position 2.]

(a + b + c)(x + y + z): the product gives ax + ay + az + bx + by + bz + cx + cy + cz = total weight of paths

Page 39: 10-601 Introduction to Machine Learning

Forward-Backward Algorithm: Finds Marginals

total weight of all paths through n = α2(n) · A(pref., n) · β2(n)

"belief that Y2 = n"

Oops! The weight of a path through a state also includes a weight at that state. So α2(n) · β2(n) isn't enough. The extra weight A(pref., n) is the opinion of the emission probability at this variable.

Page 40: 10-601 Introduction to Machine Learning

Forward-Backward Algorithm: Finds Marginals

total weight of all paths through v = α2(v) · A(pref., v) · β2(v)

"belief that Y2 = n": α2(n) · A(pref., n) · β2(n)
"belief that Y2 = v": α2(v) · A(pref., v) · β2(v)

Page 41: 10-601 Introduction to Machine Learning

Forward-Backward Algorithm: Finds Marginals

total weight of all paths through a = α2(a) · A(pref., a) · β2(a)

"belief that Y2 = n", "belief that Y2 = v", "belief that Y2 = a" are computed the same way; their sum = Z (total weight of all paths).

Unnormalized beliefs at Y2:   v 0.1   n 0   a 0.4
Divide by Z = 0.5 to get marginal probs:   v 0.2   n 0   a 0.8

Page 42: 10-601 Introduction to Machine Learning

Forward-Backward Algorithm

[Figure: the chain Y1–Y3 over "find preferred tags" again, with the word ambiguities noted: "find" could be verb or noun, "preferred" adjective or verb, "tags" noun or verb.]

Page 43: 10-601 Introduction to Machine Learning

Inference for HMMs

Whiteboard
– Derivation of Forward algorithm
– Forward-backward algorithm
– Viterbi algorithm

Page 44: 10-601 Introduction to Machine Learning

Forward-Backward Algorithm
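A minimal runnable sketch of the algorithm, using the α2(n)-style convention from the trellis slides (the emission weight at the current position is kept out of α and β) and the example potentials from Page 29; the uniform START/END weights are an assumption:

# Forward-backward on the "find preferred tags" trellis: O(T*K^2) time.
TAGS = ['v', 'n', 'a']
WORDS = ['find', 'pref.', 'tags']
A = {  # emission potentials A(word, tag), from the Page 29 table
    'v': {'find': 3, 'pref.': 5, 'tags': 3},
    'n': {'find': 4, 'pref.': 5, 'tags': 2},
    'a': {'find': 0.1, 'pref.': 0.2, 'tags': 0.1},
}
B = {  # transition potentials B(previous, next), from the Page 29 table
    'v': {'v': 1, 'n': 6, 'a': 4},
    'n': {'v': 8, 'n': 4, 'a': 0.1},
    'a': {'v': 0.1, 'n': 8, 'a': 0},
}

def forward_backward(words):
    T = len(words)
    # alpha[i][t]: total weight of path prefixes reaching tag t at position i;
    # beta[i][t]: total weight of path suffixes leaving tag t at position i;
    # both exclude the emission weight at position i itself, as on the slides.
    alpha = [dict.fromkeys(TAGS, 0.0) for _ in range(T)]
    beta = [dict.fromkeys(TAGS, 0.0) for _ in range(T)]
    for t in TAGS:
        alpha[0][t] = 1.0   # assumed START->t weight of 1 (not on the slides)
        beta[T-1][t] = 1.0  # assumed t->END weight of 1 (not on the slides)
    for i in range(1, T):
        for t in TAGS:
            alpha[i][t] = sum(alpha[i-1][s] * A[s][words[i-1]] * B[s][t]
                              for s in TAGS)
    for i in range(T - 2, -1, -1):
        for t in TAGS:
            beta[i][t] = sum(B[t][s] * A[s][words[i+1]] * beta[i+1][s]
                             for s in TAGS)
    # Belief at position i: alpha * emission * beta, then normalize by Z.
    marginals = []
    for i in range(T):
        belief = {t: alpha[i][t] * A[t][words[i]] * beta[i][t] for t in TAGS}
        Z = sum(belief.values())
        marginals.append({t: w / Z for t, w in belief.items()})
    return marginals

print(forward_backward(WORDS)[1])  # marginal distribution of Y2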

Page 45: 10-601 Introduction to Machine Learning

Derivation of Forward Algorithm
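A sketch of the standard definition and recursion in LaTeX (assuming y_0 = START; note that this textbook α folds the emission at step t into α_t, whereas the trellis slides kept the emission weight A(pref., n) separate):

% Definition
\alpha_t(k) \triangleq p(x_1, \ldots, x_t, y_t = k)

% Base case and recursion (the derivation conditions on y_{t-1} and
% applies the HMM's Markov and emission-independence assumptions)
\alpha_1(k) = p(y_1 = k \mid \text{START}) \, p(x_1 \mid y_1 = k)
\alpha_t(k) = p(x_t \mid y_t = k) \sum_{j=1}^{K} p(y_t = k \mid y_{t-1} = j) \, \alpha_{t-1}(j)

% Evaluation falls out at the end of the chain
p(\mathbf{x}) = \sum_{k=1}^{K} \alpha_T(k)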

Page 46: 10-601 Introduction to Machine Learning

Viterbi Algorithm
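A minimal sketch of the algorithm in Python: replace the forward algorithm's sum with a max and keep backpointers. The function is generic; the usage line assumes the illustrative probability tables from the Page 22 sketch:

# Viterbi decoding: most-probable tag sequence in O(T*K^2) time.
def viterbi(words, tags, transition, emission, start='<START>'):
    # delta[i][y] = (best score of any tag prefix ending in y at position i,
    #                backpointer to the best previous tag)
    delta = [{y: (transition[start].get(y, 0.0) * emission[y][words[0]], None)
              for y in tags}]
    for i in range(1, len(words)):
        step = {}
        for y in tags:
            prev = max(tags,
                       key=lambda p: delta[i-1][p][0] * transition[p].get(y, 0.0))
            score = delta[i-1][prev][0] * transition[prev].get(y, 0.0)
            step[y] = (score * emission[y][words[i]], prev)
        delta.append(step)
    # Trace back from the highest-scoring final tag.
    y = max(tags, key=lambda t: delta[-1][t][0])
    path = [y]
    for i in range(len(words) - 1, 0, -1):
        y = delta[i][y][1]
        path.append(y)
    return list(reversed(path))

# e.g., with the illustrative tables from the Page 22 sketch:
# viterbi(['time', 'flies', 'like'], ['v', 'n', 'p', 'd'], transition, emission)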

Page 47: 10-601 Introduction to Machine Learning

Inference in HMMs

What is the computational complexity of inference for HMMs?

• The naïve (brute force) computations for Evaluation, Decoding, and Marginals take exponential time, O(K^T)
• The forward-backward algorithm and Viterbi algorithm run in polynomial time, O(T·K^2)
  – Thanks to dynamic programming!
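As a concrete, illustrative instance (the numbers are assumptions, not from the slide): with K = 45 tags and a T = 20 word sentence, brute force would enumerate K^T = 45^20 ≈ 10^33 tag sequences, while forward-backward does about T·K^2 = 20 · 45^2 = 40,500 inner-loop operations.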

Page 48: 10-601 Introduction to Machine Learning

Shortcomings of Hidden Markov Models

[Figure: chain Y1, Y2, …, Yn over observations X1, X2, …, Xn, with a START state.]

• HMMs capture dependencies between each state and only its corresponding observation.
  – NLP example: In a sentence segmentation task, each segmental state may depend not just on a single word (and the adjacent segmental states), but also on (non-local) features of the whole line such as line length, indentation, amount of white space, etc.
• Mismatch between learning objective function and prediction objective function
  – An HMM learns a joint distribution of states and observations P(Y, X), but in a prediction task we need the conditional probability P(Y|X)

© Eric Xing @ CMU, 2005-2015

Page 49: 10-601 Introduction to Machine Learning

MBR DECODING

Page 50: 10-601 Introduction to Machine Learning

Inference for HMMs

– Four Inference Problems for an HMM
  1. Evaluation: Compute the probability of a given sequence of observations
  2. Viterbi Decoding: Find the most-likely sequence of hidden states, given a sequence of observations
  3. Marginals: Compute the marginal distribution for a hidden state, given a sequence of observations
  4. MBR Decoding: Find the lowest-loss sequence of hidden states, given a sequence of observations (Viterbi decoding is a special case)

Page 51: 10-601 Introduction to Machine Learning

Minimum Bayes Risk Decoding

• Suppose we are given a loss function ℓ(y', y) and are asked for a single tagging
• How should we choose just one from our probability distribution p(y|x)?
• A minimum Bayes risk (MBR) decoder h(x) returns the variable assignment with minimum expected loss under the model's distribution:

  h(x) = argmin_{y'} E_{y ~ p(·|x)}[ ℓ(y', y) ]

Page 52: 10-601 Introduction to Machine Learning

Minimum Bayes Risk Decoding

Consider some example loss functions:

The 0-1 loss function returns 0 only if the two assignments are identical and 1 otherwise:

  ℓ(y', y) = 1 − I(y' = y)

The MBR decoder is:

  h(x) = argmin_{y'} E_{y ~ p(·|x)}[ 1 − I(y' = y) ] = argmax_{y'} p(y' | x)

which is exactly the Viterbi decoding problem!

Page 53: 10-601 Introduction to Machine Learning

Minimum Bayes Risk Decoding

Consider some example loss functions:

The Hamming loss corresponds to accuracy and returns the number of incorrect variable assignments:

  ℓ(y', y) = Σ_{t=1..T} ( 1 − I(y'_t = y_t) )

The MBR decoder is:

  ŷ_t = h(x)_t = argmax_{y'_t} p(y'_t | x)   for each t

This decomposes across variables and requires the variable marginals.
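Since the Hamming-loss MBR decoder only needs per-position marginals, it is a short addition on top of the forward_backward sketch given earlier (assuming that function and its WORDS example are in scope):

# MBR decoding with Hamming loss: pick the highest-marginal tag at each position.
def mbr_hamming(words):
    marginals = forward_backward(words)  # per-position marginals, from the sketch above
    return [max(m, key=m.get) for m in marginals]

print(mbr_hamming(WORDS))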

Page 54: 10-601 Introduction to Machine Learning

Learning Objectives

Hidden Markov Models

You should be able to…
1. Show that structured prediction problems yield high-computation inference problems
2. Define the first order Markov assumption
3. Draw a Finite State Machine depicting a first order Markov assumption
4. Derive the MLE parameters of an HMM
5. Define the three key problems for an HMM: evaluation, decoding, and marginal computation
6. Derive a dynamic programming algorithm for computing the marginal probabilities of an HMM
7. Interpret the forward-backward algorithm as a message passing algorithm
8. Implement supervised learning for an HMM
9. Implement the forward-backward algorithm for an HMM
10. Implement the Viterbi algorithm for an HMM
11. Implement a minimum Bayes risk decoder with Hamming loss for an HMM