Sequence Labeling - UMass Amherst
(mccallum/courses/inlp2007/lect15-memm-crf.ppt.pdf)

Transcript
Page 1

Sequence Labeling

• Inputs: x = (x1, …, xn)
• Labels: y = (y1, …, yn)
• Typical goal: Given x, predict y

• Example sequence labeling tasks
– Part-of-speech tagging
– Named-entity recognition (NER)
• Label people, places, organizations

Page 2

NER Example:
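(The slide's example figure is not reproduced in this transcript. As a rough stand-in, here is a hypothetical NER labeling written as parallel token and label sequences; the sentence and the PER/ORG/LOC/O tag names are invented for illustration, not taken from the slides.)

```python
# Hypothetical NER example: one token sequence x and its label sequence y.
x = ["Jane", "Smith", "flew", "from", "Boston", "to", "Google", "headquarters"]
y = ["PER",  "PER",   "O",    "O",    "LOC",    "O",  "ORG",    "O"]

for token, label in zip(x, y):
    print(f"{token:12s} {label}")
```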

Page 3

First Solution: Maximum Entropy Classifier

• Conditional model p(y|x).
– Do not waste effort modeling p(x), since x is given at test time anyway.
– Allows more complicated input features, since we do not need to model dependencies between them.

• Feature functions f(x,y):
– f1(x,y) = { word is Boston & y=Location }
– f2(x,y) = { first letter capitalized & y=Name }
– f3(x,y) = { x is an HTML link & y=Location }
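To make the feature-function idea concrete, here is a minimal sketch of a maxent classifier built from indicator features like f1–f3. The feature implementations, weights, and label set are illustrative assumptions, not the lecture's actual model.

```python
import math

# Indicator features in the spirit of f1-f3 above (illustrative only).
def f1(x, y): return 1.0 if x["word"] == "Boston" and y == "Location" else 0.0
def f2(x, y): return 1.0 if x["word"][:1].isupper() and y == "Name" else 0.0
def f3(x, y): return 1.0 if x["is_html_link"] and y == "Location" else 0.0

FEATURES = [f1, f2, f3]
THETA = [1.5, 0.8, 0.4]                 # made-up weights
LABELS = ["Location", "Name", "Other"]

def p_y_given_x(x):
    """Conditional maxent model: p(y|x) = exp(theta . f(x, y)) / Z(x)."""
    scores = {y: math.exp(sum(w * f(x, y) for w, f in zip(THETA, FEATURES)))
              for y in LABELS}
    z = sum(scores.values())            # normalizer over labels for this x
    return {y: s / z for y, s in scores.items()}

print(p_y_given_x({"word": "Boston", "is_html_link": False}))
```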

Page 4

First Solution: MaxEnt Classifier

• How should we choose a classifier?

• Principle of maximum entropy
– We want a classifier that:
• Matches feature constraints from training data.
• Predictions maximize entropy.

• There is a unique, exponential family distribution that meets these criteria.

Page 5

First Solution: MaxEnt Classifier

• p(y|x;θ), inference, learning, and gradient.

• (ON BOARD)
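Since the derivation was done on the board, here is a hedged sketch of the standard result: the gradient of the conditional log-likelihood for one example is the empirical feature vector minus the model's expected feature vector. The features, labels, and weights below are made up for illustration.

```python
import math

LABELS = ["Location", "Name", "Other"]

# Two illustrative indicator features (not the lecture's exact ones).
def f_boston_loc(x, y): return 1.0 if x == "Boston" and y == "Location" else 0.0
def f_cap_name(x, y):   return 1.0 if x[:1].isupper() and y == "Name" else 0.0
FEATS = [f_boston_loc, f_cap_name]

def grad_log_likelihood(theta, x, y_true):
    """d/d theta_k log p(y_true|x) = f_k(x, y_true) - E_{p(y|x)}[ f_k(x, y) ]."""
    scores = {y: math.exp(sum(t * f(x, y) for t, f in zip(theta, FEATS)))
              for y in LABELS}
    z = sum(scores.values())
    p = {y: s / z for y, s in scores.items()}
    # Empirical count minus expected count, one entry per feature.
    return [f(x, y_true) - sum(p[y] * f(x, y) for y in LABELS) for f in FEATS]

print(grad_log_likelihood([0.0, 0.0], "Boston", "Location"))
```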

Page 6

First Solution: MaxEnt Classifier

• Problem with using a maximum entropy classifier for sequence labeling:

• It makes decisions at each position independently!

Page 7

Second Solution: HMM

• Defines a generative process.
• Can be viewed as a weighted finite state machine.

P(y, x) = ∏_t P(y_t | y_{t−1}) P(x_t | y_t)
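A minimal sketch of evaluating this joint probability, assuming made-up transition and emission tables and a START symbol for the initial transition:

```python
# Hedged sketch of the HMM joint probability on the slide:
#   P(y, x) = prod_t  P(y_t | y_{t-1}) * P(x_t | y_t)
# The tiny transition/emission tables are invented for illustration.
START = "<s>"
trans = {(START, "LOC"): 0.3, (START, "O"): 0.7,
         ("LOC", "LOC"): 0.4, ("LOC", "O"): 0.6,
         ("O", "LOC"): 0.2,   ("O", "O"): 0.8}
emit = {("LOC", "Boston"): 0.5, ("LOC", "in"): 0.05,
        ("O", "Boston"): 0.01,  ("O", "in"): 0.3}

def hmm_joint(x, y):
    prob, prev = 1.0, START
    for xt, yt in zip(x, y):
        prob *= trans[(prev, yt)] * emit[(yt, xt)]
        prev = yt
    return prob

print(hmm_joint(["in", "Boston"], ["O", "LOC"]))
```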

Page 8

Second Solution: HMM

• HMM problems: (ON BOARD)
– Probability of an input sequence.
– Most likely label sequence given an input sequence.
– Learning with known label sequences.
– Learning with unknown label sequences?
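As a sketch of the first problem, the probability of an input sequence P(x) = Σ_y P(y, x) can be computed with the forward algorithm instead of enumerating every label sequence; replacing the sum with a max gives the Viterbi recursion for the second problem. The tables below are the same made-up ones as above.

```python
# Forward algorithm for P(x) = sum_y P(y, x), with illustrative tables.
START = "<s>"
STATES = ["LOC", "O"]
trans = {(START, "LOC"): 0.3, (START, "O"): 0.7,
         ("LOC", "LOC"): 0.4, ("LOC", "O"): 0.6,
         ("O", "LOC"): 0.2,   ("O", "O"): 0.8}
emit = {("LOC", "Boston"): 0.5, ("LOC", "in"): 0.05,
        ("O", "Boston"): 0.01,  ("O", "in"): 0.3}

def forward(x):
    # alpha[s] = P(x_1..x_t, y_t = s)
    alpha = {s: trans[(START, s)] * emit[(s, x[0])] for s in STATES}
    for xt in x[1:]:
        alpha = {s: sum(alpha[r] * trans[(r, s)] for r in STATES) * emit[(s, xt)]
                 for s in STATES}
    return sum(alpha.values())

print(forward(["in", "Boston"]))
```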

Page 9

Second Solution: HMM

• How can we represent multiple features in an HMM?
– Treat them as conditionally independent given the class label?
• The example features we talked about are not independent.
– Try to model a more complex generative process of the input features?
• We may lose tractability (i.e., lose the dynamic program for exact inference).

Page 10

Second Solution: HMM

• Let’s use a conditional model instead.

Page 11

Third Solution: MEMM

• Use a series of maximum entropy classifiers that know the previous label.
• Define a Viterbi algorithm for inference.

P(y | x) = ∏_t P_{y_{t−1}}(y_t | x)
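A minimal sketch of one MEMM factor P_{y_{t−1}}(y_t | x): a locally normalized maxent distribution whose features can look at the previous label and the whole input. The features, weights, and label set are assumptions for illustration.

```python
import math

LABELS = ["LOC", "O"]

def features(x, t, prev_y, y):
    # Illustrative features that inspect the input and the previous label.
    return [1.0 if x[t][:1].isupper() and y == "LOC" else 0.0,
            1.0 if prev_y == "LOC" and y == "LOC" else 0.0,
            1.0 if y == "O" else 0.0]

THETA = [1.2, 0.7, 0.3]  # made-up weights

def local_prob(x, t, prev_y):
    """P_{prev_y}(y_t | x) for every candidate label y_t (normalized locally)."""
    scores = {y: math.exp(sum(w * f for w, f in zip(THETA, features(x, t, prev_y, y))))
              for y in LABELS}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

print(local_prob(["in", "Boston"], 1, "O"))
```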

Page 12

Third Solution: MEMM

• Finding the most likely label sequence given an input sequence, and learning.

• (ON BOARD)
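Since this was worked on the board, here is a hedged sketch of Viterbi decoding for an MEMM: because each factor P_{y_{t−1}}(y_t | x) is already normalized, the best path maximizes their product. The toy locally normalized factor below stands in for the real maxent classifiers.

```python
LABELS = ["LOC", "O"]
START = "<s>"

def local_prob(x, t, prev_y, y):
    # Toy stand-in for a maxent factor: prefer LOC on capitalized tokens,
    # and prefer to stay in the previous label (illustrative only).
    raw = lambda lab: (2.0 if (x[t][:1].isupper() and lab == "LOC") else 1.0) * \
                      (1.5 if lab == prev_y else 1.0)
    return raw(y) / sum(raw(lab) for lab in LABELS)

def viterbi(x):
    # delta[y] = best probability of a label prefix ending in y; back stores pointers.
    delta = {y: local_prob(x, 0, START, y) for y in LABELS}
    back = []
    for t in range(1, len(x)):
        prev, delta, ptr = delta, {}, {}
        for y in LABELS:
            best = max(LABELS, key=lambda p: prev[p] * local_prob(x, t, p, y))
            delta[y] = prev[best] * local_prob(x, t, best, y)
            ptr[y] = best
        back.append(ptr)
    # Trace back from the best final label through the stored pointers.
    path = [max(LABELS, key=lambda y: delta[y])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["in", "Boston", "today"]))
```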

Page 13

Third Solution: MEMM

• Combines the advantages of maximum entropy and HMM!

• But there is a problem…

Page 14

Problem with MEMMs: Label Bias

• In some state space configurations, MEMMs essentially completely ignore the inputs.

• Example (ON BOARD).

• This is not a problem for HMMs, because the input sequence is generated by the model.
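A toy numerical illustration of label bias (all numbers invented): in a locally normalized model, a state with a single allowed successor must give that successor probability 1 at every step, so the observations along that segment cannot change the path's score.

```python
def local_prob(prev, y, obs):
    # Allowed successors per state; state "A" can only go back to "A".
    allowed = {"A": ["A"], "B": ["B", "C"], "C": ["B", "C"]}[prev]
    if y not in allowed:
        return 0.0
    # Observation-dependent raw scores, then local normalization over `allowed`.
    raw = {lab: (3.0 if lab == obs.upper() else 1.0) for lab in allowed}
    return raw[y] / sum(raw.values())

# A path that stays in the low-entropy state "A" keeps probability 1 per step,
# even when every observation points elsewhere.
p_A = 1.0
for o in ["b", "c", "b"]:
    p_A *= local_prob("A", "A", o)   # always 1.0: "A" has a single successor
print(p_A)                           # 1.0 -- the observations were ignored
```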

Page 15

Fourth Solution: Conditional Random Field

• Conditionally-trained, undirected graphical model.

• For a standard linear-chain structure:

P(y | x) = (1 / Z(x)) ∏_t Φ(y_t, y_{t−1}, x)

Φ(y_t, y_{t−1}, x) = exp( Σ_k λ_k f_k(y_t, y_{t−1}, x) )
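A minimal sketch of this linear-chain CRF, with illustrative features and weights. The global normalizer Z(x) is computed by brute-force enumeration here for clarity; real implementations compute it with forward dynamic programming.

```python
import math
from itertools import product

LABELS = ["LOC", "O"]
START = "<s>"

def feats(y, y_prev, x, t):
    # Illustrative feature functions f_k(y_t, y_{t-1}, x), indexed by position t.
    return [1.0 if x[t][:1].isupper() and y == "LOC" else 0.0,
            1.0 if y_prev == y else 0.0]

LAMBDA = [1.0, 0.5]  # made-up weights

def phi(y, y_prev, x, t):
    """Potential Phi(y_t, y_{t-1}, x) = exp( sum_k lambda_k f_k(...) )."""
    return math.exp(sum(l * f for l, f in zip(LAMBDA, feats(y, y_prev, x, t))))

def score(y_seq, x):
    s, prev = 1.0, START
    for t, y in enumerate(y_seq):
        s *= phi(y, prev, x, t)
        prev = y
    return s

def p_y_given_x(y_seq, x):
    # Global normalization: Z(x) sums the unnormalized score of every label sequence.
    z = sum(score(list(cand), x) for cand in product(LABELS, repeat=len(x)))
    return score(y_seq, x) / z

print(p_y_given_x(["O", "LOC"], ["in", "Boston"]))
```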

Page 16

Fourth Solution: CRF

• Finding the most likely label sequence given an input sequence, and learning. (ON BOARD)

Page 17

Fourth Solution: CRF

• Have the advantages of MEMMs, but avoid the label bias problem.

• CRFs are globally normalized, whereas MEMMs are locally normalized.

• Widely used and applied. CRFs give state-of-the-art results in many domains.

Page 18

Example Applications

• CRFs have been applied to:
– Part-of-speech tagging
– Named-entity recognition
– Table extraction
– Gene prediction
– Chinese word segmentation
– Extracting information from research papers
– Many more…