Top Banner
Part-of-Speech Tagging Assign grammatical tags to words Basic task in the analysis of natural language data Phrase identification, entity extraction, etc. Ambiguity: “tag” could be a noun or a verb “a tag is a part-of-speech label” – context resolves the ambiguity
38

Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

Mar 29, 2018

Download

Documents

lenhi
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

Part-of-Speech Tagging

• Assign grammatical tags to words• Basic task in the analysis of natural language data• Phrase identification, entity extraction, etc.• Ambiguity: “tag” could be a noun or a verb• “a tag is a part-of-speech label” – context resolves the

ambiguity

Page 2: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

The Penn Treebank POS Tag Set

Page 3: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

POS Tagging Process

Berlin Chen

Page 4: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

POS Tagging Algorithms

• Rule-based taggers: large numbers of hand-craftedrules

• Probabilistic tagger: used a tagged corpus to trainsome sort of model, e.g. HMM.

tag1

word1

tag2

word2

tag3

word3

Page 5: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

The Brown Corpus

• Comprises about 1 million English words• HMM’s first used for tagging on the Brown Corpus• 1967. Somewhat dated now.• British National Corpus has 100 million words

Page 6: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

Simple Charniak Model

w1

t1

w2

t2

w3

t3

•What about words that have never been seen before? •Clever tricks for smoothing the number of parameters (aka priors)

Page 7: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

some details…

number of times word j appears with tag i

number of times word j appears

number of times a word that had never beenseen with tag i gets tag i

number of such occurrences in total

Test data accuracy on Brown Corpus = 91.51%

Page 8: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

HMM

t1

w1

t2

w2

t3

w3

•Brown test set accuracy = 95.97%

Page 9: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

Morphological Features• Knowledge that “quickly” ends in “ly” should

help identify the word as an adverb• “randomizing” -> “ing”• Split each word into a root (“quick”) and a

suffix (“ly”)

t1

r1

t2 t3

s1 r2 s2

Page 10: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

Morphological Features• Typical morphological analyzers produce

multiple possible splits• “Gastroenteritis” ???

• Achieves 96.45% on the Brown Corpus

Page 11: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

Inference in an HMM

• Compute the probability of a given observationsequence

• Given an observation sequence, compute the mostlikely hidden state sequence

• Given an observation sequence and set of possiblemodels, which model most closely fits the data?

David Blei

Page 12: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

oTo1 otot-1 ot+1

Viterbi Algorithm

),,...,...(max)( 1111... 11

ttttxx

j ojxooxxPtt

==!!

!

"

The state sequence which maximizes theprobability of seeing the observations to timet-1, landing in state j, and seeing theobservation at time t

x1 xt-1 j

Page 13: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

oTo1 otot-1 ot+1

Viterbi Algorithm

),,...,...(max)( 1111... 11

ttttxx

j ojxooxxPtt

==!!

!

"

1)(max)1(

+

=+tjoiji

ij batt !!

1)(maxarg)1(

+

=+tjoiji

ij batt !"

RecursiveComputation

x1 xt-1 xt xt+1

State TransitionProbability

“Emission”Probability

Page 14: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

oTo1 otot-1 ot+1

Viterbi Algorithm

ˆ X T = argmaxj

! j (T)

ˆ X t

=!X

^

t +1

(t + 1)

P( ˆ X ) = maxi

!i(T)

Compute the mostlikely state sequenceby workingbackwards

x1 xt-1 xt xt+1 xT

Page 15: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

Viterbi Small Example

o2o1

x1 x2Pr(x1=T) = 0.2Pr(x2=T|x1=T) = 0.7Pr(x2=T|x1=F) = 0.1Pr(o=T|x=T) = 0.4Pr(o=T|x=F) = 0.9o1=T; o2=F

Brute ForcePr(x1=T,x2=T, o1=T,o2=F) = 0.2 x 0.4 x 0.7 x 0.6 = 0.0336Pr(x1=T,x2=F, o1=T,o2=F) = 0.2 x 0.4 x 0.3 x 0.1 = 0.0024Pr(x1=F,x2=T, o1=T,o2=F) = 0.8 x 0.9 x 0.1 x 0.6 = 0.0432Pr(x1=F,x2=F, o1=T,o2=F) = 0.8 x 0.9 x 0.9 x 0.1 = 0.0648

Pr(X1,X2 | o1=T,o2=F) ∝ Pr(X1,X2 , o1=T,o2=F)

Page 16: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

Viterbi Small Example

o2o1

x1 x2

ˆ X 2 = argmaxj

! j (2)

! j (2) =maxi!i(1)aijb jo2

!T(1) = Pr(x1 = T)Pr(o1 = T | x1 = T) = 0.2 " 0.4 = 0.08

!F(1) = Pr(x1 = F)Pr(o1 = T | x1 = F) = 0.8 " 0.9 = 0.72

!T(2) =max !

F(1) "Pr(x2 = T | x1 = F)Pr(o2 = F | x2 = T),!

T(1) "Pr(x2 = T | x1 = T)Pr(o2 = F | x2 = T)( )

=max(0.72 " 0.1" 0.6,0.08 " 0.7 " 0.6) = 0.0432

!F(2) =max !

F(1) "Pr(x2 = F | x1 = F)Pr(o2 = F | x2 = F),!

T(1) "Pr(x2 = F | x1 = T)Pr(o2 = F | x2 = F)( )

=max(0.72 " 0.9 " 0.1,0.0.08 " 0.3" 0.1) = 0.0648

Page 17: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

Bayesian Analysis of a Markov Chain

y1

y2

y3 y

4y5

0 1π π π

iid

Page 18: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

Bayesian Analysis of a HMM

• Widely used in speech recognition, finance,bioinformatics, etc.

• The y’s are observed but the z’s (discrete) arenot

• Combines a first-order dependence structurewith a mixture model.

z1

z2

z3

y1

y2

y3

Page 19: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

Bayesian HMM (continued)

this is generally intractable but the conditionals are OK:

depends on the priorsdepends on the priors……

yi~ N(µ

zi,1)

Page 20: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data
Page 21: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

Serfling's method

Page 22: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data
Page 23: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data
Page 24: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

Bayesian HMM for High-Frequency Data

• Observations may not arrive regularly

• Elapsed time between observations may be

related to state

• Finance: tick-level stock data

• Molecular biology: single molecule experiments

• Assume now that z’s follow a continuous-time

first-order Markov chain

Page 25: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

HF-HMM

z1

z2

z3

y1

y2

y3

!2 !3

For K=2, suppose the time z stays in state i is exp(λi). Then:

For K>2, from state i, z transitions to state i+1 withprobability βi and state i-1 with probability 1- βi (i.e. birth &death, reflecting boundaries)

eigendecomposition…

Page 26: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

HF-HMM Priors

z1 z2 z3

y1

y2

y3

!2

!3

"0 "1

#0

#1

p0

p1

Posteriors not available inclosed-form…

Page 27: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

HF-HMM Gibbs Sampler

Use a Metropolis step for this one

Page 28: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

Metropolis Within Gibbs

Let

Generate a candidate from:

Accept with probability:

Page 29: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

HMMHMM

MEMMMEMMmaximum entropy markov model

Page 30: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

MEMM

•MEMM learns a single multiclass Logistic regression modelfor yi | yi-1, xi

•Predict y1 from x1, then y2 from y1 and x2, etc.

•No reason for the features not to include xi-1, xi+1, etc.

ti ti-1 f1 f2 f3 … fdPERLOC T 1 2.7 0……

Page 31: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

Dependency Network

• Toutanova et al., 2003,use a “dependencynetwork” and richerfeature set

•Idea: using the “next” tag as well as the “previous” tag shouldimprove tagging performance•Need modified Viterbi to find most likely sequence

ti ti-1 ti+1 f1 f2 … fdPERLOCPER 1 2.7 0……

Page 32: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

Conditional Random Fields

•Dependency network does consider the tag sequence in itsentirety•CRF’s optimize model parameters with respect to the entiresequence•More expensive optimization; increased flexibility and accuracy

Page 33: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

From Logistic Regression to CRF

•Logistic regression:

•Or

•Linear chain CRF:

vector

scalar

p(y | x) =1

Z(x)exp !k fk (yt , yt"1, x)

k=1

K

#$%&

'()t

*

Page 34: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

CRF Parameter Estimation•Conditional log likelihood:

•Regularized log likelihood:

•Conjugate gradient, BFGS, etc.

•POS Tagging, 45 tags, 10^6 words = 1 week

Page 35: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

Sutton and McCallum (2006)

Page 36: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

Skip-Chain CRF’s

Page 37: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

Bock’s Results: POS Tagging

Penn Treebank

Page 38: Part-of-Speech Tagging - Columbia Universitymadigan/DM08/hmm.pdf · Part-of-Speech Tagging •Assign grammatical tags to words •Basic task in the analysis of natural language data

Bock’s Results: Named Entity

CoNLL-03 task