Hidden Markov Models (HMMs) for Information Extraction
Daniel S. Weld, CSE 454

Transcript
Page 1:

Hidden Markov Models (HMMs) for Information Extraction

Daniel S. Weld
CSE 454

Page 2:

Extraction with Finite State Machines, e.g. Hidden Markov Models (HMMs)

The standard sequence model in genomics, speech, NLP, …

Page 3:

What's an HMM?

• Set of states
  – Initial probabilities
  – Transition probabilities
• Set of potential observations
  – Emission probabilities

The HMM generates an observation sequence: o1 o2 o3 o4 o5

Page 4:

Hidden Markov Models (HMMs) — Finite state machine

[Figure: a hidden state sequence generates the observation sequence o1 o2 o3 o4 o5 o6 o7 o8]

Adapted from Cohen & McCallum

Page 5:

Hidden Markov Models (HMMs)

Finite state machine view: a hidden state sequence generates the observation sequence o1 o2 o3 o4 o5 o6 o7 o8.

Graphical model view: hidden states yt-2, yt-1, yt, … over observations xt-2, xt-1, xt, …

Random variable yt takes values from {s1, s2, s3, s4}
Random variable xt takes values from {o1, o2, o3, o4, o5, …}

Adapted from Cohen & McCallum

Page 6:

(Repeat of Page 5: the HMM shown both as a finite state machine and as a graphical model.)

Page 7:

HMM as a graphical model: hidden states yt-2, yt-1, yt, …; observations xt-2, xt-1, xt, …

Random variable yt takes values from {s1, s2, s3, s4}
Random variable xt takes values from {o1, o2, o3, o4, o5, …}

Need parameters:
• Start state probabilities: P(y1 = sk)
• Transition probabilities: P(yt = si | yt-1 = sk)
• Observation (emission) probabilities: P(xt = oj | yt = sk), usually a multinomial over an atomic, fixed alphabet

Training: maximize the probability of the training observations.

Page 8:

Example: The Dishonest Casino

A casino has two dice:
• Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
• Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2

The dealer switches back and forth between the fair and loaded die about once every 20 turns.

Game:
1. You bet $1
2. You roll (always with a fair die)
3. Casino player rolls (maybe with the fair die, maybe with the loaded die)
4. Highest number wins $2

Slides from Serafim Batzoglou

Page 9:

The dishonest casino HMM

States: FAIR, LOADED
Transitions: P(LOADED | FAIR) = 0.05, P(FAIR | LOADED) = 0.05, self-transitions P(FAIR | FAIR) = P(LOADED | LOADED) = 0.95

Emissions:
P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2

Slides from Serafim Batzoglou
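To make this concrete, here is a minimal Python sketch (mine, not from the slides) of the casino HMM's parameters, plus a tiny sampler to show how an HMM "generates" an observation sequence. The start-state distribution is an assumption, since the slide does not give one, and the variable names are my own.

```python
import numpy as np

# States of the dishonest-casino HMM.
states = ["FAIR", "LOADED"]

# Start-state probabilities (assumed: either die is equally likely at the start).
start_p = np.array([0.5, 0.5])

# Transition probabilities: switch about once every 20 turns.
trans_p = np.array([
    [0.95, 0.05],   # FAIR   -> FAIR, LOADED
    [0.05, 0.95],   # LOADED -> FAIR, LOADED
])

# Emission probabilities for die faces 1..6 (rows: fair, loaded).
emit_p = np.array([
    [1/6] * 6,                             # fair die
    [1/10, 1/10, 1/10, 1/10, 1/10, 1/2],   # loaded die
])

def sample_sequence(T, seed=0):
    """Generate T (state, face) pairs from the HMM."""
    rng = np.random.default_rng(seed)
    y = rng.choice(2, p=start_p)
    out = []
    for _ in range(T):
        face = rng.choice(6, p=emit_p[y]) + 1   # die face 1..6
        out.append((states[y], face))
        y = rng.choice(2, p=trans_p[y])
    return out
```

These arrays are reused by the later sketches for the forward, backward, and Viterbi algorithms.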

Page 10:

Question #1 – Evaluation

GIVEN: A sequence of rolls by the casino player
124552646214614613613666166466163661636616361…

QUESTION: How likely is this sequence, given our model of how the casino works?

This is the EVALUATION problem in HMMs.

Slides from Serafim Batzoglou

Page 11:

Question #2 – Decoding

GIVEN: A sequence of rolls by the casino player
1245526462146146136136661664661636616366163…

QUESTION: What portion of the sequence was generated with the fair die, and what portion with the loaded die?

This is the DECODING question in HMMs.

Slides from Serafim Batzoglou

Page 12:

Question #3 – Learning

GIVEN: A sequence of rolls by the casino player
124552646214614613613666166466163661636616361651…

QUESTION: How "loaded" is the loaded die? How "fair" is the fair die? How often does the casino player change from fair to loaded, and back?

This is the LEARNING question in HMMs.

Slides from Serafim Batzoglou

Page 13:

What's this have to do with Info Extraction?

(The dishonest casino HMM again: FAIR and LOADED states, 0.95 self-transitions, 0.05 switches, and the fair/loaded emission probabilities from Page 9.)

Page 14:

What's this have to do with Info Extraction?

Same structure, but now the states are TEXT and NAME (0.95 self-transitions, 0.05 switches) and the emissions are word probabilities:
P(the | T) = 0.003, P(from | T) = 0.002, …
P(Dan | N) = 0.005, P(Sue | N) = 0.003, …

Page 15:

IE with Hidden Markov Models

Given a sequence of observations:
Yesterday Pedro Domingos spoke this example sentence.

and a trained HMM (states: person name, location name, background),

find the most likely state sequence (Viterbi):
s* = argmax_s P(s, o)

Any words said to be generated by the designated "person name" state are extracted as a person name:
Person name: Pedro Domingos

Slide by Cohen & McCallum
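The last step above, turning a predicted state sequence into extractions, is simple span collection. A small illustrative helper (mine, not from the slides; the label strings are assumptions):

```python
def extract_person_names(words, states, target="person name"):
    """Collect contiguous runs of words whose predicted state is the target label."""
    names, current = [], []
    for word, state in zip(words, states):
        if state == target:
            current.append(word)
        elif current:
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names

# extract_person_names(
#     "Yesterday Pedro Domingos spoke this example sentence .".split(),
#     ["background", "person name", "person name"] + ["background"] * 5)
# -> ["Pedro Domingos"]
```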

Page 16:

IE with Hidden Markov Models

For sparse extraction tasks:
• Separate HMM for each type of target
• Each HMM should
  – Model the entire document
  – Consist of target and non-target states
  – Not necessarily be fully connected

Slide by Okan Basegmez

Page 17:

Or … Combined HMM
• Example – Research Paper Headers

Slide by Okan Basegmez

Page 18:

HMM Example: "Nymble"

Task: Named Entity Extraction  [Bikel et al. 1998], [BBN "IdentiFinder"]

Train on ~500k words of newswire text.

States: Person, Org, (five other name classes), Other, plus start-of-sentence and end-of-sentence.

Results:
Case    Language   F1
Mixed   English    93%
Upper   English    91%
Mixed   Spanish    90%

Slide adapted from Cohen & McCallum

Page 19:

Finite State Model vs. Path

Model: states Person, Org, (five other name classes), Other, with start-of-sentence and end-of-sentence.

Path: a state sequence y1 y2 y3 y4 y5 y6 … emitting the observation sequence x1 x2 x3 x4 x5 x6 …

Page 20:

Question #1 – Evaluation

GIVEN:
A sequence of observations x1 x2 x3 x4 … xN
A trained HMM θ (start, transition, and emission probabilities)

QUESTION: How likely is this sequence given our HMM, i.e. P(x, θ)?

Why do we care? We need it for learning, to choose among competing models!

Page 21:

A parse of a sequence

Given a sequence x = x1 … xN, a parse of x is a sequence of states y = y1, …, yN.

[Trellis figure: states 1, 2, …, K (e.g. person, other, location) at each position over observations x1 x2 x3 … xN; a parse is one path through the trellis]

Slide by Serafim Batzoglou

Page 22:

Question #2 – Decoding

GIVEN:
A sequence of observations x1 x2 x3 x4 … xN
A trained HMM θ (start, transition, and emission probabilities)

QUESTION: How do we choose the corresponding parse (state sequence) y1 y2 y3 y4 … yN which "best" explains x1 x2 x3 x4 … xN?

There are several reasonable optimality criteria: single optimal sequence, average statistics for individual states, …

Page 23:

Question #3 – Learning

GIVEN: A sequence of observations x1 x2 x3 x4 … xN

QUESTION: How do we learn the model parameters θ (start, transition, and emission probabilities) which maximize P(x, θ)?

Page 24:

Three Questions
• Evaluation – Forward algorithm
• Decoding – Viterbi algorithm
• Learning – Baum-Welch algorithm (aka "forward-backward"), a kind of EM (expectation maximization)

Page 25:

Naive Solution to #1: Evaluation

Given observations x = x1 … xN and HMM θ, what is P(x)?

Enumerate every possible state sequence y = y1 … yN:
• Probability of x given a particular y: P(x | y, θ)
• Probability of a particular y: P(y | θ)

Summing over all possible state sequences, we get
P(x | θ) = Σy P(x | y, θ) P(y | θ)

But with N states and sequences of length T there are N^T state sequences, and ~2T multiplications per sequence. Even for small HMMs (T = 10, N = 10) there are 10 billion sequences!
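A brute-force sketch of this naive enumeration (my own code, reusing the casino parameters defined earlier; observations are 0-based indices, so face 6 is index 5). It is exponential in T, which is exactly the problem the forward algorithm fixes.

```python
from itertools import product

def naive_evaluation(x, start_p, trans_p, emit_p):
    """P(x | theta) by summing over every possible state sequence (exponential!)."""
    N, T = len(start_p), len(x)
    total = 0.0
    for y in product(range(N), repeat=T):        # N**T state sequences
        p = start_p[y[0]] * emit_p[y[0], x[0]]
        for t in range(1, T):
            p *= trans_p[y[t-1], y[t]] * emit_p[y[t], x[t]]
        total += p
    return total

# e.g. rolls 6, 6, 6, 1 with the casino HMM:
# naive_evaluation([5, 5, 5, 0], start_p, trans_p, emit_p)
```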

Page 26:

Many Calculations Repeated

• Use dynamic programming
• Cache and reuse inner sums ("forward variables")

Page 27:

Solution to #1: Evaluation

Use dynamic programming: define the forward variable
αt(i) = P(x1 … xt, yt = Si | θ)
the probability that at time t
– the state is Si
– the partial observation sequence x = x1 … xt has been emitted

Page 28:

Base Case: Forward Variable αt(i)

The probability that
– the state at time t has value Si, and
– the partial observation sequence x = x1 … xt has been seen

[Trellis figure: states 1, 2, …, K (person, other, location) over the first observation x1]

Base case (t = 1): α1(i) = P(y1 = Si) P(x1 = o1 | y1 = Si)

Page 29:

Inductive Case: Forward Variable αt(i)

The probability that
– the state at time t has value Si, and
– the partial observation sequence x = x1 … xt has been seen

[Trellis figure: every state at time t-1 feeds into state Si at time t, which emits xt]

Inductive case: αt(i) = [ Σj αt-1(j) P(yt = Si | yt-1 = Sj) ] P(xt = ot | yt = Si)

Page 30:

The Forward Algorithm

[Figure: the forward variables αt-1(1), αt-1(2), αt-1(3), …, αt-1(K) at time t-1 are combined, weighted by the transition probabilities, to produce αt(3) at time t]

Page 31:

The Forward Algorithm

INITIALIZATION: α1(i) = P(y1 = Si) P(x1 = o1 | y1 = Si)

INDUCTION: αt+1(j) = [ Σi αt(i) P(yt+1 = Sj | yt = Si) ] P(xt+1 = ot+1 | yt+1 = Sj)

TERMINATION: P(x | θ) = Σi αN(i)

Time: O(K²N)   Space: O(KN)
K = |S| is the number of states, N the length of the sequence
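A compact sketch of this recurrence (my own code, reusing the casino parameters from earlier; in practice one rescales α or works in log space to avoid underflow on long sequences):

```python
import numpy as np

def forward(x, start_p, trans_p, emit_p):
    """Forward algorithm: returns (alpha, P(x)).
    x is a list of 0-based observation indices; parameters are NumPy arrays."""
    K, T = len(start_p), len(x)
    alpha = np.zeros((T, K))
    alpha[0] = start_p * emit_p[:, x[0]]                  # initialization
    for t in range(1, T):                                 # induction
        alpha[t] = (alpha[t-1] @ trans_p) * emit_p[:, x[t]]
    return alpha, alpha[-1].sum()                         # termination

# With the casino HMM: forward([5, 5, 5, 0], start_p, trans_p, emit_p)
```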

Page 32:

The Backward Algorithm

Analogously, the backward variable βt(i) = P(xt+1 … xN | yt = Si, θ) can be computed from t = N down to t = 1; it is used together with α when learning the parameters (Baum-Welch).
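A matching sketch of the backward pass (mine, same conventions and caveats as the forward sketch above):

```python
import numpy as np

def backward(x, trans_p, emit_p):
    """Backward algorithm: beta[t, i] = P(x_{t+1..T} | y_t = S_i)."""
    K, T = trans_p.shape[0], len(x)
    beta = np.zeros((T, K))
    beta[-1] = 1.0                                        # termination convention
    for t in range(T - 2, -1, -1):
        beta[t] = trans_p @ (emit_p[:, x[t+1]] * beta[t+1])
    return beta
```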

Page 33:

Three Questions
• Evaluation – Forward algorithm (also Backward algorithm)
• Decoding – Viterbi algorithm
• Learning – Baum-Welch algorithm (aka "forward-backward"), a kind of EM (expectation maximization)

Page 34:

#2 – Decoding Problem

Given x = x1 … xN and HMM θ, what is the "best" parse y1 … yN?

Several possible meanings of 'solution':
1. States which are individually most likely
2. Single best state sequence

We want the sequence y1 … yN such that P(x, y) is maximized:
y* = argmaxy P(x, y)

Again, we can use dynamic programming!

[Trellis figure: states 1, 2, …, K at each position over observations o1 o2 o3 … oT, with the best path still unknown]

Page 35:

δt(i)

• Like αt(i) = the probability that the state at time t has value Si and the partial observation sequence x = x1 … xt has been seen
• Define δt(i) = the probability of the most likely state sequence ending with state Si, given observations x1, …, xt:
  δt(i) = max over y1, …, yt-1 of P(y1, …, yt-1, yt = Si | o1, …, ot, θ)

Page 36:

δt(i) = the probability of the most likely state sequence ending with state Si, given observations x1, …, xt:
max over y1, …, yt-1 of P(y1, …, yt-1, yt = Si | o1, …, ot, θ)

Base case (t = 1): δ1(i) = P(y1 = Si) P(x1 = o1 | y1 = Si)

Page 37:

Inductive Step

[Figure: δt-1(1), δt-1(2), δt-1(3), …, δt-1(K) feed into state S3 at time t; take the max rather than the sum]

δt(j) = [ maxi δt-1(i) P(yt = Sj | yt-1 = Si) ] P(xt = ot | yt = Sj)

Page 38:

The Viterbi Algorithm

DEFINE: δt(i) = max over y1, …, yt-1 of P(y1, …, yt-1, yt = Si | o1, …, ot, θ)

INITIALIZATION: δ1(i) = P(y1 = Si) P(x1 = o1 | y1 = Si)

INDUCTION: δt(j) = [ maxi δt-1(i) P(yt = Sj | yt-1 = Si) ] P(xt = ot | yt = Sj), recording the argmax i as a backpointer

TERMINATION: take maxi δT(i), then backtrack through the backpointers to get the state sequence y*
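A minimal sketch of the full procedure (mine, reusing the casino parameters; as with the forward pass, real implementations work in log space):

```python
import numpy as np

def viterbi(x, start_p, trans_p, emit_p):
    """Most likely state sequence for 0-based observation indices x."""
    K, T = len(start_p), len(x)
    delta = np.zeros((T, K))
    ptr = np.zeros((T, K), dtype=int)
    delta[0] = start_p * emit_p[:, x[0]]                  # initialization
    for t in range(1, T):                                 # induction
        scores = delta[t-1][:, None] * trans_p            # delta_{t-1}(i) * P(S_j | S_i)
        ptr[t] = scores.argmax(axis=0)                    # best previous state for each j
        delta[t] = scores.max(axis=0) * emit_p[:, x[t]]
    # termination + backtracking
    y = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        y.append(int(ptr[t][y[-1]]))
    return list(reversed(y))

# With the casino HMM: viterbi([5]*20 + [2, 3, 1], start_p, trans_p, emit_p)
```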

Page 39: Hidden Markov Models (HMMs) for Information Extraction Daniel S. Weld CSE 454.

Slides from Serafim Batzoglou

The Viterbi Algorithmx1 x2 ……xj-1 xj……………………………..xT

State 1

2

K

i δj(i)

Maxi δj-1(i) * Ptrans* Pobs

Remember: δt(i) = probability of most likely state seq ending with yt = state St

Page 40:

Terminating Viterbi

[Trellis figure: in the last column (observation xT), choose the maximum over the final δ values]

Page 41:

Terminating Viterbi

Time: O(K²T)   Space: O(KT) — linear in the length of the sequence

δ* = maxi δT-1(i) * Ptrans * Pobs

How did we compute δ*? Now backchain — follow the stored argmax pointers — to find the final state sequence.

Page 42:

The Viterbi Algorithm

(Worked example on the slide; the extracted output is "Pedro Domingos".)

Page 43:

Three Questions
• Evaluation – Forward algorithm (could also go the other direction)
• Decoding – Viterbi algorithm
• Learning – Baum-Welch algorithm (aka "forward-backward"), a kind of EM (expectation maximization)

Page 44:

Solution to #3 – Learning

If we have labeled training data!

• Input:
  – The HMM topology: states (person name, location name, background) & edges, but no probabilities
  – Many labeled sentences, e.g. "Yesterday Pedro Domingos spoke this example sentence."
• Output:
  – Initial state & transition probabilities: p(y1), p(yt | yt-1)
  – Emission probabilities: p(xt | yt)

Page 45:

Supervised Learning

• Input: the HMM topology (person name, location name, background states & edges, but no probabilities) and labeled sentences:
  Yesterday Pedro Domingos spoke this example sentence.
  Daniel Weld gave his talk in Mueller 153.
  Sieg 128 is a nasty lecture hall, don't you think?
  The next distinguished lecture is by Oren Etzioni on Thursday.

• Output – Initial state probabilities p(y1):
  – P(y1 = name) = 1/4
  – P(y1 = location) = 1/4
  – P(y1 = background) = 2/4

Page 46:

Supervised Learning

• Input: the same HMM topology and labeled sentences:
  Yesterday Pedro Domingos spoke this example sentence.
  Daniel Weld gave his talk in Mueller 153.
  Sieg 128 is a nasty lecture hall, don't you think?
  The next distinguished lecture is by Oren Etzioni on Thursday.

• Output – State transition probabilities p(yt | yt-1):
  – P(yt = name | yt-1 = name) = 3/6
  – P(yt = name | yt-1 = background) = 2/22
  – Etc…
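The estimates above are just normalized counts over the labeled corpus. A minimal sketch of that counting (the function name is mine, and there is no smoothing here, though in practice emission counts are smoothed):

```python
from collections import Counter, defaultdict

def supervised_mle(labeled_sentences):
    """Estimate start, transition, and emission probabilities from labeled data.
    labeled_sentences: lists of (word, state) pairs."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sent in labeled_sentences:
        start[sent[0][1]] += 1                       # state of the first token
        for word, state in sent:
            emit[state][word] += 1
        for (_, prev), (_, cur) in zip(sent, sent[1:]):
            trans[prev][cur] += 1

    def normalize(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    return (normalize(start),
            {s: normalize(c) for s, c in trans.items()},
            {s: normalize(c) for s, c in emit.items()})

# e.g. supervised_mle([[("Yesterday", "background"), ("Pedro", "name"),
#                       ("Domingos", "name"), ("spoke", "background"), ...]])
```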

Page 47:

Supervised Learning

• Input: states & edges (person name, location name, background), but no probabilities; many labeled sentences such as "Yesterday Pedro Domingos spoke this example sentence."
• Output:
  – Initial state & transition probabilities: p(y1), p(yt | yt-1)
  – Emission probabilities: p(xt | yt)

Page 48:

(Repeat of Page 47.)

Page 49:

Solution to #3 – Learning

Given x1 … xN, how do we learn θ (start, transition, and emission probabilities) to maximize P(x)?

• Unfortunately, there is no known way to analytically find a global maximum θ*, i.e.
  θ* = argmaxθ P(x | θ)
• But it is possible to find a local maximum: given an initial model θ, we can always find a model θ' such that
  P(x | θ') ≥ P(x | θ)

Page 50:

Chicken & Egg Problem

• If we knew the actual sequence of states
  – It would be easy to learn the transition and emission probabilities
  – But we can't observe states, so we don't!
• If we knew the transition & emission probabilities
  – Then it'd be easy to estimate the sequence of states (Viterbi)
  – But we don't know them!

Slide by Daniel S. Weld

Page 51:

Simplest Version

• Mixture of two distributions
• Know: form of distribution & variance
• Just need the mean of each distribution

[Figure: data points on a number line from .01 to .09]

Slide by Daniel S. Weld

Page 52:

Input Looks Like

[Figure: unlabeled data points on the number line from .01 to .09]

Slide by Daniel S. Weld

Page 53:

We Want to Predict

[Figure: the same data points, with a "?" asking which distribution each point came from]

Slide by Daniel S. Weld

Page 54:

Chicken & Egg

[Figure: the data points again]

Note that coloring the instances would be easy if we knew the Gaussians…

Slide by Daniel S. Weld

Page 55:

Chicken & Egg

[Figure: the data points again]

And finding the Gaussians would be easy if we knew the coloring.

Slide by Daniel S. Weld

Page 56:

Expectation Maximization (EM)

• Pretend we do know the parameters
  – Initialize randomly: set μ1 = ?, μ2 = ?

[Figure: two randomly placed Gaussians over the data points]

Slide by Daniel S. Weld

Page 57:

Expectation Maximization (EM)

• Pretend we do know the parameters
  – Initialize randomly
• [E step] Compute the probability of each instance having each possible value of the hidden variable

[Figure: data points fractionally colored by the two Gaussians]

Slide by Daniel S. Weld

Page 58:

(Repeat of Page 57: the E step assigns each instance a probability under each Gaussian.)

Page 59:

Expectation Maximization (EM)

• Pretend we do know the parameters
  – Initialize randomly
• [E step] Compute the probability of each instance having each possible value of the hidden variable
• [M step] Treating each instance as fractionally having both values, compute the new parameter values (a small numeric sketch follows)

Slide by Daniel S. Weld
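A toy sketch of this E/M loop for the two-Gaussian setup on these slides (my own code; the shared variance, equal mixture weights, and the particular sigma value are assumptions, since only "known variance" is stated):

```python
import numpy as np

def em_two_gaussians(data, sigma=0.01, iters=20, seed=0):
    """EM for a mixture of two equal-weight Gaussians with known, shared variance;
    only the two means are learned."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(data, size=2, replace=False)          # random initialization
    for _ in range(iters):
        # E step: responsibility of each component for each point
        dens = np.exp(-(data[:, None] - mu[None, :])**2 / (2 * sigma**2))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M step: new means = responsibility-weighted averages
        mu = (resp * data[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu

# e.g. em_two_gaussians(np.array([.01, .02, .03, .07, .08, .09]))
```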

Page 60:

ML Mean of a Single Gaussian

μML = argminμ Σi (xi – μ)² — i.e., the sample average of the xi

[Figure: the data points with their mean]

Slide by Daniel S. Weld

Page 61:

Expectation Maximization (EM)

• [E step] Compute the probability of each instance having each possible value of the hidden variable
• [M step] Treating each instance as fractionally having both values, compute the new parameter values

Slide by Daniel S. Weld

Page 62:

(Repeat: another E step, now with the updated Gaussians.)

Page 63:

(Repeat: another E step and M step; the estimates keep improving.)

Page 64:

(Repeat: a further E step and M step, until the estimates stop changing.)

Page 65:

EM for HMMs

• [E step] Compute the probability of each instance having each possible value of the hidden variable
  – Compute the forward and backward probabilities for the given model parameters and our observations
• [M step] Treating each instance as fractionally having every value, compute the new parameter values
  – Re-estimate the model parameters
  – Simple counting

Page 66:

Summary – Learning

• Use hill-climbing
  – Called the Baum-Welch algorithm
  – Also the "forward-backward algorithm"
• Idea
  – Use an initial parameter instantiation
  – Loop:
    • Compute the forward and backward probabilities for the given model parameters and our observations
    • Re-estimate the parameters
  – Until the estimates don't change much (a rough sketch of one such iteration follows below)

Page 67:

The Problem with HMMs

• We want more than an atomic view of words
• We want many arbitrary, overlapping features of words:
  – identity of word
  – ends in "-ski"
  – is capitalized
  – is part of a noun phrase
  – is in a list of city names
  – is under node X in WordNet
  – is in bold font
  – is indented
  – is in hyperlink anchor
  – last person name was female
  – next two words are "and Associates"

[Figure: the chain yt-1, yt, yt+1 over observations xt-1, xt, xt+1, where one word contributes several features: is "Wisniewski", ends in "-ski", part of a noun phrase]

Slide by Cohen & McCallum

Page 68:

Problems with the Joint Model

These arbitrary features are not independent:
– Multiple levels of granularity (chars, words, phrases)
– Multiple dependent modalities (words, formatting, layout)
– Past & future

Two choices:
1. Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!
2. Ignore the dependencies. This causes "over-counting" of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!

[Figures: two state-observation chains St-1, St, St+1 over Ot-1, Ot, Ot+1, one per choice]

Slide by Cohen & McCallum

Page 69:

Discriminative vs. Generative Models

• So far: all models generative
• Generative models … model P(y, x)
• Discriminative models … model P(y | x)

P(y | x) does not include a model of P(x), so it does not need to model the dependencies between features!

Page 70:

Discriminative Models Are Often Better

• Eventually, what we care about is p(y | x)!
  – A Bayes Net describes a family of joint distributions whose conditionals take a certain form
  – But there are many other joint models whose conditionals also have that form
• We want to make independence assumptions among y, but not among x.

Page 71:

Conditional Sequence Models

• We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(y | x) instead of P(y, x)
  – Can examine features, but is not responsible for generating them
  – Don't have to explicitly model their dependencies
  – Don't "waste modeling effort" trying to generate what we are given at test time anyway

Slide by Cohen & McCallum

Page 72:

Finite State Models

[Figure: Naïve Bayes, HMMs (the sequence version), and generative directed models (general graphs) are generative; their conditional counterparts are logistic regression, linear-chain CRFs, and general CRFs, respectively.]

Page 73:

Linear-Chain Conditional Random Fields

• The conditional p(y | x) that follows from the joint p(y, x) of an HMM is a linear-chain CRF with particular feature functions!

Page 74:

Linear-Chain Conditional Random Fields

• Definition: a linear-chain CRF is a distribution that takes the form

  p(y | x) = (1 / Z(x)) ∏t exp( Σk λk fk(yt, yt-1, xt) )

  where Z(x) is a normalization function, the λk are parameters, and the fk are feature functions.
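A tiny sketch of the unnormalized part of that definition (mine; the feature-function signature f(y_t, y_prev, x_t) is an assumption, and f must accept y_prev = None at the first position):

```python
import math

def crf_score(y, x, feature_fns, weights):
    """Unnormalized linear-chain CRF score: exp(sum_t sum_k w_k f_k(y_t, y_{t-1}, x_t))."""
    total = 0.0
    y_prev = None                       # no previous label at t = 0
    for y_t, x_t in zip(y, x):
        total += sum(w * f(y_t, y_prev, x_t) for f, w in zip(feature_fns, weights))
        y_prev = y_t
    return math.exp(total)

# p(y | x) divides this by Z(x), the sum of crf_score over all label sequences,
# which is computed efficiently with a forward-style dynamic program.
```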

Page 75:

Linear-Chain Conditional Random Fields

• HMM-like linear-chain CRF
• Linear-chain CRF in which the transition score depends on the current observation

[Figure: two factor-graph chains over labels y and observations x, one HMM-like and one with observation-dependent transitions]