Ch 9 Part of Speech Tagging (slides adapted from Dan Jurafsky, Jim Martin, Dekang Lin, Rada Mihalcea, and Bonnie Dorr and Mitch Marcus.)


Parts of Speech

8 (ish) traditional parts of speech

• Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.

• This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.)

• Called: parts-of-speech, lexical category, word classes, morphological classes, lexical tags, POS

• We’ll use POS most frequently

POS examples for English

Tag   Class        Examples
N     noun         chair, bandwidth, pacing
V     verb         study, debate, munch
ADJ   adjective    purple, tall, ridiculous
ADV   adverb       unfortunately, slowly
P     preposition  of, by, to
PRO   pronoun      I, me, mine
DET   determiner   the, a, that, those

Open Class Words

Every known human language has nouns and verbs

Nouns: people, places, things
• Classes of nouns
  - proper vs. common
  - count vs. mass

Verbs: actions and processes

Adjectives: properties, qualities

Adverbs: hodgepodge!
• Unfortunately, John walked home extremely slowly yesterday

Definition:

An adverb is a part of speech. It is any word that modifies any other part of language: verbs, adjectives (including numbers), clauses, sentences and other adverbs, except for nouns; modifiers of nouns are primarily determiners and adjectives.


Closed Class Words

Differ more from language to language than open class words

Examples:
• prepositions: on, under, over, …
• particles: up, down, on, off, …
• determiners: a, an, the, …
• pronouns: she, who, I, …
• conjunctions: and, but, or, …
• auxiliary verbs: can, may, should, …
• numerals: one, two, three, third, …

Prepositions from CELEX

Pronouns in CELEX

Conjunctions

Auxiliaries

NLP Task I – Determining Part of Speech Tags

The Problem:

Word    POS listing in Brown Corpus
heat    noun, verb
oil     noun
in      prep, noun, adv
a       det, noun, noun-proper
large   adj, noun, adv
pot     noun

POS Tagging: Definition

The process of assigning a part-of-speech or lexical class marker to each word in a corpus:

WORDS: the koala put the keys on the table

TAGS: N, V, P, DET

POS Tagging example

WORD    tag
the     DET
koala   N
put     V
the     DET
keys    N
on      P
the     DET
table   N

What is POS tagging good for?

Speech synthesis:
• How to pronounce “lead”?
• INsult inSULT
• OBject obJECT
• OVERflow overFLOW
• DIScount disCOUNT
• CONtent conTENT

Stemming for information retrieval
• Knowing a word is a N tells you it gets plurals
• Can search for “aardvarks” and get “aardvark”

Parsing, speech recognition, etc.
• Possessive pronouns (my, your, her) are likely to be followed by nouns
• Personal pronouns (I, you, he) are likely to be followed by verbs

Related Problem in Bioinformatics

Durbin et al. Biological Sequence Analysis, Cambridge University Press.

Several applications, e.g. proteins

From primary structure ATCPLELLLD

Infer secondary structure HHHBBBBBC..

History: From Yair Halevi (Bar-Ilan U.)

Timeline figure (1960 to 2000); events shown:

• Brown Corpus created (EN-US), 1 million words
• Brown Corpus tagged
• Greene and Rubin: rule based, 70%
• LOB Corpus created (EN-UK), 1 million words
• LOB Corpus tagged
• HMM tagging (CLAWS): 93%-95%
• DeRose/Church: efficient HMM, sparse data, 95%+
• POS tagging separated from other NLP
• Penn Treebank Corpus (WSJ, 4.5M)
• British National Corpus (tagged by CLAWS)
• Transformation Based Tagging (Eric Brill): rule based, 95%+
• Tree-Based Statistics (Helmut Schmid): rule based, 96%+
• Neural network: 96%+
• Trigram tagger (Kempe): 96%+
• Combined methods: 98%+

British National Corpus

What is it used for?

Ultimately, its use is limited only by our imagination; if you have any need for up to 100 million words of modern British English, you can make use of the British National Corpus.

The main uses of the corpus are as follows:

Reference Book Publishing
• Dictionaries, grammar books, teaching materials, usage guides, thesauri. Increasingly, publishers are referring to the use they make of corpus facilities: it's important to know how well their corpora are planned and constructed.

Linguistic Research
• Raw data for studying lexis, syntax, morphology, semantics, discourse analysis, stylistics, sociolinguistics...

Artificial Intelligence
• Extensive data test bed for program development.

Natural language processing
• Taggers, parsers, natural language understanding programs, spell checking word lists...

English Language Teaching
• Syllabus and materials design, classroom reference, independent learner research.

Penn Treebank Tagset

A Simplified Tagset for English

Tagsets for English have grown progressively larger since the Brown Corpus until the Penn Treebank project.

Brown Corpus:            87 tags
LOB Corpus:              135 tags
Lancaster UCREL group:   166 tags
London-Lund Corpus:      197 tags
UPenn Treebank:          34 tags + punctuation

Rationale behind British & European tag sets

To provide “distinct codings for all classes of words having distinct grammatical behaviour” – Garside et al. 1987

The Lund tagset for adverb distinguishes between

• Adjunct – Process, Space, Time
• Wh-type – Manner, Reason, Space, Time, Wh-type + ‘S
• Conjunct – Appositional, Contrastive, Inferential, Listing, …
• Disjunct – Content, Style
• Postmodifier – “else”
• Negative – “not”
• Discourse Item – Appositional, Expletive, Greeting, Hesitator, …

Reasons for a Smaller Tagset

Many tags are unique to particular lexical items, and can be recovered automatically if desired.

Brown Tags For Verbs

sing/VB       have/HV       be/BE
sings/VBZ     has/HVZ       is/BEZ
sang/VBD      had/HVD       was/BED
singing/VBG   having/HVG    being/BEG
sung/VBN      had/HVN       been/BEN

Penn Treebank Tags For Verbs

sing/VB       have/VB       be/VB
sings/VBZ     has/VBZ       is/VBZ
sang/VBD      had/VBD       was/VBD
singing/VBG   having/VBG    being/VBG
sung/VBN      had/VBN       been/VBN
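As a rough illustration of "can be recovered automatically", the sketch below maps a word and its Penn Treebank verb tag back to the Brown-style tag used on this slide. The lookup table and the function name `recover_brown_tag` are hypothetical, and the mapping covers only the forms of "have" and "be" listed above.

```python
# Hypothetical lookup: lexical items whose Brown verb tag on this slide
# uses a word-specific prefix (HV for "have", BE for "be").
BROWN_LEXICAL_PREFIX = {
    "have": "HV", "has": "HV", "had": "HV", "having": "HV",
    "be": "BE", "is": "BE", "was": "BE", "been": "BE", "being": "BE",
}


def recover_brown_tag(word: str, penn_tag: str) -> str:
    """Map a (word, Penn verb tag) pair to the Brown-style tag shown on the slide."""
    prefix = BROWN_LEXICAL_PREFIX.get(word.lower())
    if prefix is None or not penn_tag.startswith("VB"):
        # Ordinary verbs carry the same tag in both tables.
        return penn_tag
    # Swap the generic "VB" prefix for the lexical one: VBN -> HVN, VB -> BE, ...
    return prefix + penn_tag[2:]


brown_tags = [recover_brown_tag(w, t) for w, t in
              [("sung", "VBN"), ("had", "VBN"), ("been", "VBN"), ("is", "VBZ")]]
```

Because the distinction depends only on the word itself, the smaller tagset loses no information for these items.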

Task I – Determining Part of Speech Tags

The Problem:

The Old Solution: Combinatorial search.
• If each of n words has k tags on average, try the k^n combinations until one works.

Word    POS listing in Brown
heat    noun, verb
oil     noun
in      prep, noun, adv
a       det, noun, noun-proper
large   adj, noun, adv
pot     noun
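To make the size of the combinatorial search concrete, the following sketch enumerates every candidate tagging of "heat oil in a large pot" using the Brown listing on this slide. The names `LEXICON` and `candidate_taggings` are illustrative only, and the sketch does not attempt to decide which combination "works".

```python
from itertools import product
from math import prod

# Possible tags per word, copied from the Brown listing on this slide.
LEXICON = {
    "heat": ["noun", "verb"],
    "oil": ["noun"],
    "in": ["prep", "noun", "adv"],
    "a": ["det", "noun", "noun-proper"],
    "large": ["adj", "noun", "adv"],
    "pot": ["noun"],
}


def candidate_taggings(words):
    """Return an iterator over every tag sequence the lexicon allows."""
    return product(*(LEXICON[w] for w in words))


words = "heat oil in a large pot".split()

# The number of candidates is the product of the per-word tag counts.
n_candidates = prod(len(LEXICON[w]) for w in words)
all_taggings = list(candidate_taggings(words))
```

Even for this six-word phrase the search space is 2 × 1 × 3 × 3 × 3 × 1 = 54 taggings, and it grows exponentially with sentence length.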

NLP Task I – Determining Part of Speech Tags

Machine Learning Solutions: Automatically learn Part of Speech (POS) assignment.

• The best techniques achieve 96-97% accuracy per word on new materials, given large training corpora.

Simple Statistical Approaches: Idea 1

Simple Statistical Approaches: Idea 2

For a string of words

W = w_1 w_2 w_3 … w_n

find the string of POS tags

T = t_1 t_2 t_3 … t_n

which maximizes P(T|W)
• i.e., the probability of tag string T given that the word string was W
• i.e., that W was tagged T

Again, The Sparse Data Problem …

A Simple, Impossible Approach to Compute P(T|W):

Count up instances of the string "heat oil in a large pot" in the training corpus, and pick the most common tag assignment to the string. Almost any given sentence occurs too rarely in a corpus, if at all, for such counts to be usable.

A Practical Statistical Tagger

A Practical Statistical Tagger II

But we can't accurately estimate more than tag bigrams or so…

We change to a model that we CAN estimate:
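The formula for this slide did not survive extraction. Assuming the standard bigram hidden Markov model formulation (an assumption, not something the surviving text confirms), Bayes' rule followed by two independence approximations would give:

```latex
\arg\max_{T} P(T \mid W)
  = \arg\max_{T} \frac{P(T)\,P(W \mid T)}{P(W)}
  = \arg\max_{T} P(T)\,P(W \mid T)

P(T) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}), \qquad
P(W \mid T) \approx \prod_{i=1}^{n} P(w_i \mid t_i)
```

Here t_0 is taken to be a sentence-start marker, which is also an assumption of this sketch.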

A Practical Statistical Tagger III

So, for a given string W = w_1 w_2 w_3 … w_n, the tagger needs to find the string of tags T which maximizes
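The quantity to maximize is missing from the extracted slide. Assuming a tag-bigram model, it would be the product over positions i of P(t_i | t_{i-1}) · P(w_i | t_i). One common way to find the maximizing tag string without enumerating every combination is the Viterbi dynamic-programming algorithm, sketched below. All probabilities are invented toy values, and `START`, `TRANS`, `EMIT`, and `viterbi` are hypothetical names.

```python
import math

START = "<s>"  # hypothetical sentence-start marker

# Toy probabilities, invented for illustration only.
TRANS = {  # P(tag | previous tag)
    (START, "DET"): 0.6, (START, "N"): 0.3, (START, "V"): 0.1,
    ("DET", "N"): 0.9, ("DET", "V"): 0.05, ("DET", "DET"): 0.05,
    ("N", "V"): 0.6, ("N", "N"): 0.3, ("N", "DET"): 0.1,
    ("V", "DET"): 0.6, ("V", "N"): 0.3, ("V", "V"): 0.1,
}
EMIT = {  # P(word | tag)
    ("the", "DET"): 0.7,
    ("koala", "N"): 0.1,
    ("put", "V"): 0.2, ("put", "N"): 0.01,
    ("keys", "N"): 0.1, ("keys", "V"): 0.01,
}
TAGS = ["DET", "N", "V"]


def logp(table, key):
    """Log probability of an event; unseen events get -inf."""
    p = table.get(key, 0.0)
    return math.log(p) if p > 0 else float("-inf")


def viterbi(words):
    """Return the most probable tag sequence under the toy bigram model."""
    # Each column maps tag -> (best log score ending in that tag, backpointer).
    best = [{t: (logp(TRANS, (START, t)) + logp(EMIT, (words[0], t)), None)
             for t in TAGS}]
    for w in words[1:]:
        column = {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p][0] + logp(TRANS, (p, t)))
            score = best[-1][prev][0] + logp(TRANS, (prev, t)) + logp(EMIT, (w, t))
            column[t] = (score, prev)
        best.append(column)
    # Follow backpointers from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


tags = viterbi("the koala put the keys".split())
```

The work grows linearly with sentence length and quadratically with the number of tags, in contrast to the exponential combinatorial search.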

Training and Performance

To estimate the parameters of this model, given an annotated training corpus:
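The estimation formulas are missing from the extracted slide. Assuming the bigram model and plain relative-frequency (maximum likelihood) estimates, which is an assumption here, the parameters would be computed from corpus counts c(·) as:

```latex
\hat{P}(t_i \mid t_{i-1}) = \frac{c(t_{i-1},\, t_i)}{c(t_{i-1})}, \qquad
\hat{P}(w_i \mid t_i) = \frac{c(w_i,\, t_i)}{c(t_i)}
```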

Because many of these counts are small, smoothing is necessary for best results…
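The slide does not say which smoothing method is meant. As one simple possibility, the sketch below collects the counts from a tagged corpus and applies add-one (Laplace) smoothing; the function names and the one-sentence toy corpus are illustrative assumptions, and practical taggers generally use more refined smoothing.

```python
from collections import Counter

START = "<s>"  # hypothetical sentence-start marker


def train(tagged_sentences):
    """Collect tag-bigram, context, word-tag, and tag counts."""
    trans, context = Counter(), Counter()
    emit, tag_count = Counter(), Counter()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            trans[(prev, tag)] += 1   # c(t_{i-1}, t_i)
            context[prev] += 1        # times prev was followed by some tag
            emit[(word, tag)] += 1    # c(w_i, t_i)
            tag_count[tag] += 1       # c(t_i)
            prev = tag
    return trans, context, emit, tag_count


def p_trans(prev, tag, trans, context, n_tags):
    """Add-one smoothed estimate of P(tag | prev)."""
    return (trans[(prev, tag)] + 1) / (context[prev] + n_tags)


def p_emit(word, tag, emit, tag_count, vocab_size):
    """Add-one smoothed estimate of P(word | tag)."""
    return (emit[(word, tag)] + 1) / (tag_count[tag] + vocab_size)


# Toy corpus: the single tagged sentence from the earlier example slide.
corpus = [[("the", "DET"), ("koala", "N"), ("put", "V"), ("the", "DET"),
           ("keys", "N"), ("on", "P"), ("the", "DET"), ("table", "N")]]
trans, context, emit, tag_count = train(corpus)
tagset = sorted(tag_count)
vocab = {word for sentence in corpus for word, _ in sentence}

seen = p_trans("DET", "N", trans, context, len(tagset))    # observed bigram
unseen = p_trans("DET", "V", trans, context, len(tagset))  # never observed
```

With smoothing, the unseen bigram DET followed by V receives a small nonzero probability instead of zero, so a single unseen event no longer rules out an entire tag sequence.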

Such taggers typically achieve about 95-96% correct tagging, for tag sets of 40-80 tags.
