Word classes and the distribution of words, and Part of Speech tagging Computational linguistics.

Post on 15-Jan-2016

Transcript

Word classes and the distribution of words, and Part of Speech tagging

Computational linguistics

Eats shoots and leaves

• We move on to finding higher level structure of natural language. Most of this is expressed in terms of categories – categories of words, and categories of sequences of words called phrases.

Classes or categories of words

• Roughly: words whose distributions are very similar. Two words are in the same category if and only if we can substitute one for the other in a sentence and preserve grammaticality.

• We will return to this question.

Open and closed classes

• Open: Noun, verb, adjective.

• Closed: preposition, adverbs, conjunctions.

• Open: large classes, and more words can be added to them.

• Closed: small classes, and they are resistant to adding new members. A new preposition?

Two points of view

What’s real and central in grammar are notions like Noun and Verb (and Noun Phrase and Verb Phrase). Then we find real nouns, like dog and John and Monday. Many of them are good nouns, but some of them are defective; they don’t “do” all the things that they “should do”.

2nd point of view

What’s real are sentences (or corpora):

John is leaving Wednesday with his dog.

When we look at a language, we find an enormous range of “places” where a given word can appear. (“Places” meaning environments, perhaps meanings). No two words are quite alike, but words do form clusters with regard to their grammatical behavior. For example, ...

The days of the week (Monday…Sunday) share a lot in common. We can simplify our description by generalizing over that set of words.

John left __. John left last __. John leaves next __. He leaves on __. You must do it before__. Do it by __. Your horoscope for __. __’s weather forecast. The __ after Christmas.

* at __. * to __. * saw __. * We __. * I __.
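The frame test above can be approximated mechanically: collect, for each word, the set of (left neighbor, right neighbor) environments it occurs in, then compare those sets. What follows is a minimal sketch, not a method from the slides. The six-sentence corpus is invented, and Jaccard overlap is just one assumed choice of similarity measure.

```python
from collections import defaultdict

# Toy corpus invented for illustration; real work would use a large corpus.
CORPUS = [
    "john left last monday",
    "john left last friday",
    "he leaves on monday",
    "he leaves on friday",
    "john saw mary",
    "he saw john",
]

def environments(sentences):
    """Map each word to the set of (left, right) neighbor pairs it occurs in."""
    envs = defaultdict(set)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            envs[tokens[i]].add((tokens[i - 1], tokens[i + 1]))
    return envs

def jaccard(a, b):
    """Overlap of two environment sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

envs = environments(CORPUS)
print(jaccard(envs["monday"], envs["friday"]))  # same environments in this corpus
print(jaccard(envs["monday"], envs["saw"]))     # no shared environments
```

In this toy corpus the two day names share every environment, while a day name and a verb share none. On real text the overlap would be partial, which is the "no two words are quite alike" point.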

Proper given names

Likewise, proper given names (John, Jerry, …).

As we form larger and larger classes, there are fewer things that they have in common.

How do these J-words (!) differ from other “nouns”?

They rarely take articles (the Jim), relative clauses (Mary who bought a book), or adjectives, but they certainly can: the Jim I went to elementary school with, the Bush who made those campaign promises, a fresh and smiling Ralph Nader.

Back to first view

• Grammar consists of a set of non-terminal nodes, terminal nodes, a set of context-free expansion rules, and a lexicon, at the least.

• Depending on your analysis, also a set of transformations.

• Syntax is responsible for the generation of phrase-structures, whose terminal nodes are lexical categories.

• Lexical categories are expanded to words of the appropriate category.

Syntax

• Non-terminal categories: two correspond to semantic primitives (proposition and term); these are Sentence (S) and Noun Phrase (NP).

• Terminals: the categories into which words are put. Perhaps these are universal, perhaps they aren’t. (Some) Linguists tend to think they are; computational linguists tend to think they aren’t.

• Non-terminals based on terminal categories. Noun begets Noun Phrase, Adjective begets Adjective Phrase, etc.

• Context-free phrase structure rules: Non-terminal node expands to both non-terminals and terminal nodes.

• Terminals are expanded to words (“lexical elements”, in the parlance).

[Tree diagram, in bracket notation:]

[S [NP [N John]] [INFL might] [VP [V be] [VP [V sleeping]]]]

Syntactic rules

• S → NP + INFL + VP

• INFL → { can, could, may, might, will, should, do }

• VP → ( Adv_not ) VP

• VP → V NP NP PP*

• VP → VP AdvP[hrase]

• VP → V (NP) S: allows for recursive structure: sentences within sentences, of unbounded length.
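To see how a rule like VP → V (NP) S yields sentences within sentences, here is a minimal sketch that enumerates the strings of a toy fragment of these rules up to a depth bound. The word lists, the category name `Vs` (for verbs that take a sentence), and the depth bound are assumptions made for illustration, not part of the grammar above.

```python
import itertools

# Toy fragment of the rules above; the word lists are invented for illustration.
GRAMMAR = {
    "S": [["NP", "INFL", "VP"]],
    "NP": [["John"], ["Kim"]],
    "INFL": [["might"], ["will"]],
    "VP": [["V"], ["Vs", "S"]],   # VP -> Vs S is the recursive rule
    "V": [["sleep"], ["leave"]],
    "Vs": [["say"]],              # hypothetical label for sentence-taking verbs
}

def expand(symbol, depth):
    """Yield every word sequence derivable from `symbol` with a tree no taller than `depth`."""
    if symbol not in GRAMMAR:      # a word: nothing left to expand
        yield [symbol]
        return
    if depth == 0:                 # depth budget used up
        return
    for rhs in GRAMMAR[symbol]:
        # Expand each right-hand-side symbol, then combine the alternatives.
        parts = [list(expand(s, depth - 1)) for s in rhs]
        for combo in itertools.product(*parts):
            yield [word for part in combo for word in part]

sentences = {" ".join(words) for words in expand("S", 5)}
print(len(sentences))
print("John might say Kim will leave" in sentences)
```

Raising the depth bound admits deeper embedding (John might say Kim will say ...), so without the bound the set of sentences has no longest member.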

S → NP + INFL + VP

S has other expansions in English, such as in infinitives; there, an INFL with to is found, but no tense, no auxiliary verbs, no dummy do.

S → NP + [INFL to ] + VP

It is important for John to leave, but not …*for John to should leave, …*for John should to leave, etc.

• NP → det AdjP N’

• N’ → N PrepP

[Tree diagram, in bracket notation:]

[NP [det The] [AP [A former]] [N’ [N king] [PP [P of] [NP [N’ [N England]]]]]]

Head of NP

N is head of NP

• The semantically central word:

A big book is a book.

And the one whose form is determined by the governing verb in a case-marking language, and the one that determines the number and gender of any words that agree with the NP.

Categories

We have 4 things in mind when we make them:

1. (Lexical categories): Morphological structure

2. Meaning (semantics)

3. External distribution

4. (Phrasal categories): internal distribution

...

Morphology

• What suffixes may appear with a given stem:

• Nouns: ’s, NULL, -s

• Verbs: -ed, -s, -ing, -ed (past tense and past participle)

• Adjectives: -er, -est, -ness
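The idea that a stem's suffix set points to its category can be sketched by grouping stems according to the suffixes they are attested with. This is only an illustration under simplifying assumptions: the word list is invented, a stem counts only if its bare form also occurs, and spelling changes (big/bigger) are ignored.

```python
from collections import defaultdict

# Invented word list; the bare stem (NULL suffix) is written as "".
WORDS = ["walk", "walks", "walked", "walking",
         "jump", "jumps", "jumped", "jumping",
         "tall", "taller", "tallest",
         "small", "smaller", "smallest"]
SUFFIXES = ["", "s", "ed", "ing", "er", "est"]

def signatures(words, suffixes):
    """Group stems by the set of suffixes they are attested with."""
    vocab = set(words)
    by_stem = defaultdict(set)
    for word in vocab:
        for suffix in suffixes:
            stem = word[: len(word) - len(suffix)] if suffix else word
            # Only count a split when the bare stem is itself an attested word.
            if word.endswith(suffix) and stem in vocab:
                by_stem[stem].add(suffix or "NULL")
    groups = defaultdict(set)
    for stem, found in by_stem.items():
        if len(found) > 1:         # ignore stems seen in only one form
            groups[frozenset(found)].add(stem)
    return groups

for signature, stems in signatures(WORDS, SUFFIXES).items():
    print(sorted(signature), sorted(stems))
```

On this list the verb-like stems (walk, jump) and the adjective-like stems (tall, small) fall into two separate groups purely on the basis of their suffixes.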

Meaning

• Reference to objects in the world

• Reference to n-ary predicates:

• unary: tall, sleep

• binary: eat (human, food), saw (human, object)

• ternary: give (human, human, object)

External distribution

Roughly speaking, this means what this word (or phrase) can appear next to (before, after).

Nouns appear after articles (= noun determiners, nominal determiners), after adjectives, and before Prepositional Phrase complements.

the dog, my dog, the taste of champagne, the war of the worlds

Internal distribution (phrases)

• A “noun phrase” has three parts: a determiner, followed by an adjective, followed by a noun.

• Some of these are “optional”: that is, we may still call something a noun phrase even if not all 3 are present.

Back to categories for words

Noun properties (?English):

• Takes articles

• Takes preceding adjectives

• May appear as subject of a sentence

• May appear as object of a preposition

• Has singular and plural form; plural is realized as /s/

• Refers to an object or set of objects

• May take possessive ’s

• May serve as antecedent to a pronoun

Verb

• Has present-tense form (-s in 3rd singular)

• Has past-tense form (-ed)

• Agrees with its subject noun phrase

• Refers to a predicate (1 or more arguments)

• Follows the subject immediately

• Appears at the beginning of a verb-phrase

Lexical categories in language

• One view is that there is a small number of categories, and they can be identified across languages. (I think most people believe that.)

• The core criterion for membership is semantic, and the only effective way of identifying across languages is semantic.

• All languages have a category of phrases that refer to things (NP), and one that expresses propositions (S).

Nouns and pronouns

• Nouns in many languages are inflected for number and case.

• Case: Nominative, accusative, genitive, dative, and often others.

• Pronouns, but not nouns, in English are inflected for case: nominative, genitive, and accusative (or other).

Pronouns

Tag             Nom    Acc    Possessive   Genitive head   Reflexive
1st sg          I      me     my           mine            myself
2nd sg          you    you    your         yours           yourself
3rd sg m        he     him    his          his             himself
3rd sg f        she    her    her          hers            herself
3rd sg neuter   it     it     its          its             itself
1st plural      we     us     our          ours            ourselves
2nd plural      you    you    your         yours           yourselves
3rd plural      they   them   their        theirs          themselves

Penn Treebank noun categories

NN: noun, common, singular or mass. common-carrier cabbage knuckle-duster Casino afghan shed thermostat investment slide humour falloff slick wind hyena override subhumanity machinist ...

NNP: noun, proper, singular. Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA Shannon A.K.C. Meltex Liverpool ...

NNPS: noun, proper, plural. Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques Apache Apaches Apocrypha ...

NNS: noun, common, plural. undergraduates scotches bric-a-brac products bodyguards facets coasts divestitures storehouses designs clubs fragrances averages subjectivists apprehensions muses factory-jobs ...

Along with nouns…

• Determiners:

– Articles (a, an, the): definite, indefinite

– Possessive pronouns (my, your, his…)

– Demonstrative determiners: this, that…

• Adjectives

– In many languages, agree with the noun that they modify for case, number, and gender (but not in English). Spanish: l-a-s mes-a-s pequeñ-a-s ‘the tables small-fem-plural’

Adjectives

• Absolute (or positive) form: big

• Comparative: bigger

Your car is bigger than theirs.

• Superlative: biggest

Of these cars, John’s car is the biggest.

Quantifiers

• Often appear in pre-noun positions, inside the Noun Phrase

• Express notions of “some, all, none”

• May be pre-noun modifiers, or a full NP (like pronouns): something, anyone, etc. (Are these really two words stuck together?)

• Question and relative clause words: who, what, where, when, why, whose, which.

Relative clauses in English: that-Comp, gap in clause

[NP [NP The thing] [S’ [Comp (that)] [S I saw [e]]]]

that is optional if the gap is not in subject position. [e] marks the “gap”.

Relative clauses in English: wh-phrase

[NP [NP The ideas] [S’ [Comp which] [S I disagree with [e]]]]

Relative clauses in English: wh-phrase w/ pied-piping of P

[NP [NP The ideas] [S’ [Comp with which] [S I disagree [e]]]]

Relative clause formation can rip out of embedded clauses

[NP [NP The ideas] [S’ [Comp with which] [S Your manager said [S You disagree [e]]]]]

Verbs

• Verbs are words that refer to actions, and which are the essential component of most sentences.

• There are non-verbal sentences, but they are relatively infrequent. Most frequent of these: Linking a noun (NP) with an adjective or a location. English uses the copula (to be) for this function.

Verbs

• Have an argument structure: typically 1, 2, or 3 nominal arguments.

• 1 argument: typically the subject NP. Intransitive verb: John slept/arrived/left/yawned. The door opened. The phone rang.

• 0 arguments?

Verb arguments

• 2 arguments (transitive): Subject and direct object, usually:

Kim shut the door, helped the students, wrote a book.

• 3 arguments (ditransitive): Subject, indirect object, direct object:

Kim gave Terry a book/a hand/a hard time.
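Argument structure can be recorded as a lexical property of each verb. The following is a minimal sketch with a hypothetical mini-lexicon; the entries and the function name `licensed` are illustrative assumptions, and the subject is counted as one of the arguments.

```python
# Hypothetical mini-lexicon: each verb lists the argument counts it allows
# (subject included). The entries are illustrative, not exhaustive.
VALENCY = {
    "slept": {1},
    "opened": {1, 2},   # The door opened / Kim opened the door
    "shut": {2},
    "helped": {2},
    "gave": {3},
}

def licensed(verb, arguments):
    """True if the verb's lexical entry allows this many nominal arguments."""
    return len(arguments) in VALENCY.get(verb, set())

print(licensed("slept", ["John"]))
print(licensed("gave", ["Kim", "Terry", "a book"]))
print(licensed("slept", ["John", "the door"]))
```

A verb such as opened shows why each entry holds a set of counts rather than a single number.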

Syntactic/semantic ambiguities

• I saw the man with the telescope.

• Time flies like an arrow.

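[Tree diagrams, in bracket notation:]

PP attached inside the object NP:

[S [NP I] [VP [V saw] [NP [Det the] [N’ [N man] [PP [P with] [NP [det the] [N’ [N telescope]]]]]]]]

PP attached to the VP:

[S [NP I] [VP [V saw] [NP [Det the] [N’ [N man]]] [PP [P with] [NP [det the] [N’ [N telescope]]]]]]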
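The two attachments can be counted mechanically with a chart parser. The sketch below is a CKY-style chart that stores, for every span and category, how many distinct trees cover it. The grammar is an assumed binarized toy: VP → VP PP stands in for the flat VP with a sister PP, N’ is used directly as the noun category, and the pronoun I is entered in the lexicon as an NP.

```python
from collections import defaultdict

# Binary rules (parent, left child, right child) and a toy lexicon.
BINARY = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),    # PP modifies the verb phrase
    ("NP", "Det", "N'"),
    ("N'", "N'", "PP"),    # PP modifies the noun
    ("PP", "P", "NP"),
]
LEXICON = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N'"], "telescope": ["N'"], "with": ["P"],
}

def count_parses(words, root="S"):
    """Return the number of distinct trees rooted in `root` that cover all the words."""
    n = len(words)
    chart = defaultdict(int)   # (start, end, category) -> number of trees
    for i, word in enumerate(words):
        for category in LEXICON[word]:
            chart[i, i + 1, category] += 1
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for parent, left, right in BINARY:
                    # Every pairing of a left tree with a right tree is a new tree.
                    chart[start, end, parent] += (
                        chart[start, mid, left] * chart[mid, end, right]
                    )
    return chart[0, n, root]

print(count_parses("I saw the man with the telescope".split()))
print(count_parses("I saw the man".split()))
```

Each additional PP multiplies the attachment options, which is one reason a whole parse is much harder to get right than a tag sequence.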

Part of Speech tagging

An attempt to assign categories to words without doing a whole syntactic parse:

Getting a whole parse is extremely difficult;

Much of the difficulty is the constituency, not the part of speech tagging.

High frequency words are the most ambiguous regarding PoS

• table

• like

– I like ice cream

– I like things like ice cream

– I’ve been there like 100 times.

– People like him.

– People like him are obnoxious.

Taggers

• Start with a lexicon with ranges of PoSs

– each word is marked with its range of permitted PoS

– an OOV (out-of-vocabulary) word is given a PoS based on its morphology, if we’re lucky

– A mechanism finds the best combination of PoS, given the order of the words.

Word       Permitted PoS
The        Det
design     N, Vpres, Vinf, Vimperative
of         Prep
taggers    Nplural
is         Vpres
often      Adv
based      VpastParticiple, VpastTense
on         Adverb, Preposition
what       WhPronoun, WhDeterminer
is         Vpresent
known      Vpast tense
about      Adverb, Prep
the        Det
lexicon    Noun
.          punctuation
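The first two steps of such a tagger, lexicon lookup and a morphology-based guess for OOV words, can be sketched as follows. The lexicon entries reuse tag names from the example above. The suffix rules, the tag name `Vparticiple`, and the noun fallback are invented guesses for illustration, not taken from any real tagger.

```python
# Toy lexicon: each known word maps to its range of permitted tags.
LEXICON = {
    "the": {"Det"},
    "design": {"N", "Vpres", "Vinf", "Vimperative"},
    "of": {"Prep"},
    "is": {"Vpres"},
    "often": {"Adv"},
}

# Invented suffix heuristics, checked in order; the first match wins.
SUFFIX_GUESSES = [
    ("ing", {"Vparticiple", "N"}),
    ("ly", {"Adv"}),
    ("ed", {"VpastTense", "VpastParticiple"}),
    ("s", {"Nplural", "Vpres"}),
]

def candidate_tags(word):
    """Return the permitted tags for a word, guessing from its suffix if it is OOV."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    for suffix, tags in SUFFIX_GUESSES:
        if word.endswith(suffix):
            return tags
    return {"N"}   # fallback assumption: treat any other unknown word as a noun

for token in ["The", "design", "of", "taggers", "is", "often", "based"]:
    print(token, sorted(candidate_tags(token)))
```

This only produces the range of candidates for each word; choosing among them in context is the job of the disambiguation mechanism.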

History of PoS tagging

• First large-scale system in 1971: TAGGIT (Greene and Rubin), with 71 items in its tag set and 3,300 hand-written rules that could use a window of up to 5 words around the word being disambiguated. Almost all of the rules, however, looked only at immediate neighbors.

CLAWS1

• Part of the annotation of the Lancaster-Oslo/Bergen corpus; produced at the University of Lancaster.

• Used largely statistical techniques rather than hand-crafted rules, trained on 200K tagged words of the Brown corpus.

• 96-97% accuracy of top PoS guess.

• Used an open (not hidden) Markov model

Markov model

I’m not sure that this is exactly the model that CLAWS used, but it’s in the spirit:

$$p(W[1..n] \;\&\; PoS[1..n]) = \prod_{i=1}^{n} \mathrm{prob}(PoS[i] \mid PoS[i-1]) \cdot \mathrm{prob}(W[i] \mid PoS[i])$$
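Under a model of this shape, the score of a tag sequence is a product of tag-to-tag probabilities and word-given-tag probabilities, and the best sequence can be found with the Viterbi algorithm. The sketch below is not a reconstruction of CLAWS. All probabilities are invented, and the start marker `<s>` standing in for PoS[0] is an assumption.

```python
import math

def viterbi(words, tags, trans, emit, start="<s>"):
    """Most probable tag sequence under prod_i P(tag_i | tag_{i-1}) * P(word_i | tag_i)."""
    # best[t] = (log-probability, tag path) of the best sequence ending in tag t
    best = {start: (0.0, [])}
    for word in words:
        new_best = {}
        for tag in tags:
            e = emit.get((word, tag), 0.0)
            if e == 0.0:
                continue           # this tag is not permitted for this word
            scored = [
                (score + math.log(trans[prev, tag]) + math.log(e), path + [tag])
                for prev, (score, path) in best.items()
                if trans.get((prev, tag), 0.0) > 0.0
            ]
            if scored:
                new_best[tag] = max(scored, key=lambda item: item[0])
        best = new_best
    if not best:
        return None                # no tag sequence has non-zero probability
    return max(best.values(), key=lambda item: item[0])[1]

# Invented toy probabilities, for illustration only.
TAGS = ["Det", "N", "V"]
TRANS = {("<s>", "Det"): 0.6, ("<s>", "N"): 0.3, ("<s>", "V"): 0.1,
         ("Det", "N"): 0.9, ("Det", "V"): 0.1,
         ("N", "V"): 0.6, ("N", "N"): 0.3, ("N", "Det"): 0.1,
         ("V", "Det"): 0.5, ("V", "N"): 0.4, ("V", "V"): 0.1}
EMIT = {("the", "Det"): 0.5, ("design", "N"): 0.01,
        ("design", "V"): 0.005, ("is", "V"): 0.2}

print(viterbi(["the", "design", "is"], TAGS, TRANS, EMIT))
```

Working in log-probabilities keeps long products from underflowing. The ambiguous word design is resolved by its neighbors: a determiner before it and a verb after it.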
