School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Tokenization and Morphology COMP3310 Natural Language Processing Eric Atwell, Language Research Group (with thanks to Katja Markert, Marti Hearst, and other contributors)
• Thanks to Katja Markert and Marti Hearst for much of the material
• Katja Markert, Lecturer, School of Computing, Leeds University http://www.comp.leeds.ac.uk/markert http://www.comp.leeds.ac.uk/lng
• Marti Hearst, Associate Professor, School of Information, University of California at Berkeley http://www.ischool.berkeley.edu/people/faculty/martihearst http://courses.ischool.berkeley.edu/i256/f06/sched.html
Rationalism: language models based on expert introspection
Empiricism: models via machine-learning from a corpus
Corpus: text selected by language, genre, domain, …
Brown, LOB, BNC, Penn Treebank, MapTask, CCA, …
Corpus Annotation: text headers, PoS, parses, …
Corpus size is no. of words – depends on tokenisation
We can count word tokens, word types, type-token distribution
Lexeme/lemma is the “root form”, vs. its inflections (be vs. am/is/was…)
What’s a word?
How many words do you find in the following short text?
What is the biggest/smallest plausible answer to this question?
What problems do you encounter?
It’s a shame that our data-base is not up-to-date. It is a shame that um, data base A costs $2300.50 and that database B costs $5000. All databases cost far too much.
Time: 3 minutes
Counting words: tokenization
Tokenisation is a processing step in which the input text is automatically divided into units called tokens, each of which is a word, a number, or a punctuation mark…
So a word count can choose to ignore the number and punctuation tokens (?)
Word: a contiguous run of alphanumeric characters delimited by whitespace.
Whitespace: space, tab, newline.
BUT dividing at spaces is too simple: It’s, data base
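A quick illustration of why dividing at whitespace is too simple, using the example text from the exercise above:

```python
# str.split() with no argument splits on any run of whitespace
# (space, tab, newline).
text = "It's a shame that our data-base is not up-to-date."
print(text.split())
# → ["It's", 'a', 'shame', 'that', 'our', 'data-base', 'is', 'not', 'up-to-date.']
# The clitic in "It's" stays attached to the pronoun, and the final full
# stop is glued onto "up-to-date."
```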
Another approach is to use regular expressions to specify which substrings are valid words.
Regular expressions for tokenization
• wordr = r'(\w+)'
• hyphen = r'(\w+\-\s?\w+)'
• E.g. data-base; allows for a space after the hyphen
• apostrophe = r'(\w+\'\w+)'
• E.g. isn't
• numbers = r'((\$|#)?\d+(\.)?\d+%?)'
• Needs extending to handle large numbers with commas
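The slide's patterns can be combined into a working tokenizer sketch. The comma-grouping extension to the numbers pattern and the ordering of the alternatives are our additions, not from the slide:

```python
import re

# Patterns adapted from the slide. The slide's numbers pattern
# \d+(\.)?\d+ requires at least two digits and has no commas; the
# version below allows comma grouping, which is our extension.
NUMBERS    = r'(?:\$|#)?\d+(?:,\d{3})*(?:\.\d+)?%?'
HYPHEN     = r'\w+-\s?\w+'      # e.g. data-base; allows a space after the hyphen
APOSTROPHE = r"\w+'\w+"         # e.g. isn't, It's
WORD       = r'\w+'

# Order matters: the more specific patterns must come first in the
# alternation, or \w+ would grab "isn" out of "isn't".
TOKEN_RE = re.compile('|'.join([NUMBERS, HYPHEN, APOSTROPHE, WORD]))

def tokenize(text):
    return [m.group(0) for m in TOKEN_RE.finditer(text)]

print(tokenize("It's a shame that data base A costs $2,300.50 and database B costs $5000."))
```

Punctuation is simply dropped here; a fuller tokenizer would add one more alternative for punctuation marks so the word count can decide whether to include them.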
Some Tokenization Issues
Sentence Boundaries
• Punctuation, eg quotation marks around sentences?
• Periods – end of a sentence or not (e.g. in abbreviations)?
Proper Names
• What to do about
• “New York-New Jersey train”?
• “California Governor Arnold Schwarzenegger”?
Contractions
• That’s Fred’s jacket’s pocket.
• I’m doing what you’re saying “Don’t do!”.
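The contraction examples can be probed with the apostrophe pattern from the earlier slide; the comparison with clitic-splitting output is our illustration:

```python
import re

# The apostrophe pattern \w+'\w+ keeps contractions in one token, but it
# cannot tell a contraction (Don't = Do + n't) from a possessive
# (Fred's = Fred + 's): both match and come out as single tokens.
token_re = re.compile(r"\w+'\w+|\w+")

print(token_re.findall("That's Fred's jacket's pocket."))
# → ["That's", "Fred's", "jacket's", "pocket"]
# A Penn Treebank-style tokenizer would instead split off the clitics:
#   That 's Fred 's jacket 's pocket .
```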
Jabberwocky Analysis
This is nonsense … or is it?
This is not English … but it’s much more like English than it is like French or German or Chinese or …
Why do we pretty much understand the words?
Jabberwocky Analysis
Why do we pretty much understand the words?
We recognize combinations of morphemes.
• Chortled - Laugh in a breathy, gleeful way; (Definition from Oxford American Dictionary) A combination of "chuckle" and "snort."
• Galumphing - Moving in a clumsy, ponderous, or noisy manner. Perhaps a blend of "gallop" and "triumph." (Definition from Oxford American Dictionary)
Activity:
• Make up a word whose meaning can be inferred from the morphemes that you used.
Jabberwocky Analysis
Why do we pretty much understand the words?
• Surrounding English words strongly indicate the parts-of-speech of the nonsense words.
PC-KIMMO: two-level morphology
• After Kimmo Koskenniemi, based in part on work by Lauri Karttunen in 1983
• Uses:
• A rules file which specifies the alphabet and the phonological (or spelling) rules,
• A lexicon file which lists lexical items and encodes morphotactic constraints.
• http://www.sil.org/pckimmo/
Commercial versions are available
• Inxight’s LinguistX, based on technology developed by Kaplan and others at Xerox PARC (or at least it used to be)
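The rules-file/lexicon-file division can be caricatured in a few lines. The lexicon, suffix classes, and the single e-deletion spelling rule below are illustrative inventions, not PC-KIMMO's actual two-level notation:

```python
# Toy sketch of the two components: a lexicon encoding morphotactic
# constraints (which suffix classes may follow which stems), plus one
# spelling rule. Illustrative only, not PC-KIMMO's file format.

LEXICON = {
    # stem: suffix classes allowed to follow it
    "move":  {"VERB_SUFFIX"},
    "quick": {"ADJ_SUFFIX"},
}
SUFFIXES = {
    "ing": "VERB_SUFFIX",
    "ed":  "VERB_SUFFIX",
    "ly":  "ADJ_SUFFIX",
}

def spell(stem, suffix):
    """One spelling rule: drop a stem-final 'e' before a vowel-initial suffix."""
    if stem.endswith("e") and suffix[0] in "aeiou":
        return stem[:-1] + suffix
    return stem + suffix

def generate(stem, suffix):
    """Combine stem + suffix only if the lexicon's morphotactics allow it."""
    if SUFFIXES[suffix] not in LEXICON[stem]:
        raise ValueError(f"{stem}+{suffix} violates morphotactics")
    return spell(stem, suffix)

print(generate("move", "ing"))   # → moving  (the e-deletion rule fires)
print(generate("quick", "ly"))   # → quickly
```

Real two-level rules are finite-state constraints applied in parallel over lexical/surface character pairs; this sketch only shows the division of labour between the two files.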
Morphological Analysis Tools
“cheat”: store all variants in a dictionary database, eg
CatVar:
• Categorial Variation Database
• “A database of clusters of uninflected words (lexemes) and their categorial (i.e. part-of-speech) variants.”
• Example: the develop cluster: (develop(V), developer(N), developed(AJ), developing(N), developing(AJ), development(N)).
http://clipdemos.umiacs.umd.edu/catvar
based on published dictionaries: LDOCE, CELEX, OALD++, PROPOSEL ...
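The “cheat” can be sketched as a reverse index over CatVar-style clusters. The cluster is the slide's develop example; the reverse index and the shortest-form lexeme heuristic are our assumptions:

```python
# Dictionary-lookup morphology: store whole clusters, then build a reverse
# index from every surface form back to its lexeme and part of speech.
# (A dict keeps one part of speech per form, so the slide's
# developing(N)/developing(AJ) collapse to a single entry here.)

CLUSTERS = [
    {"develop": "V", "developer": "N", "developed": "AJ",
     "developing": "AJ", "development": "N"},
]

INDEX = {}
for cluster in CLUSTERS:
    lexeme = min(cluster, key=len)   # crude heuristic: shortest form is the lexeme
    for form, pos in cluster.items():
        INDEX[form] = (lexeme, pos)

print(INDEX["development"])   # → ('develop', 'N')
print(INDEX["developing"])    # → ('develop', 'AJ')
```

Lookup is then O(1) per word, with no rules at all – the cost is moved into building and maintaining the dictionary.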
MorphoChallenge
One problem with rule-based systems (PC-KIMMO) or dictionary-lookup systems: porting them to new languages
In principle, unsupervised machine learning could learn from any language dataset, by finding recurring patterns which correspond to roots, prefixes and suffixes
MorphoChallenge is a contest to find the best unsupervised machine-learnt morphological analyser
http://www.cis.hut.fi/morphochallenge2005/
http://www.cis.hut.fi/morphochallenge2007/
http://www.cis.hut.fi/morphochallenge2008/
Atwell, Roberts: Combinatory Hybrid Elementary Analysis of Text http://www.cis.hut.fi/morphochallenge2005/P07_Atwell.pdf
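The recurring-pattern idea can be sketched by counting candidate suffixes across a toy vocabulary. Real entrants such as Morfessor use probabilistic models over a whole corpus; this only shows the intuition:

```python
from collections import Counter

# Unsupervised intuition: consider every split point of every word as a
# candidate stem+suffix, count how often each candidate suffix recurs
# across the vocabulary, and segment each word at the suffix that recurs
# most often.
words = ["walks", "walked", "walking",
         "talks", "talked", "talking",
         "jumps", "jumped", "jumping"]

suffix_counts = Counter()
for w in words:
    for i in range(1, len(w)):       # split point: w[:i] + w[i:]
        suffix_counts[w[i:]] += 1

def segment(word):
    # pick the split whose suffix recurs most often (earliest split wins ties,
    # so the longest such suffix is preferred)
    best = max(range(1, len(word)), key=lambda i: suffix_counts[word[i:]])
    return word[:best], word[best:]

print(segment("walking"))   # → ('walk', 'ing')
print(segment("jumped"))    # → ('jump', 'ed')
```

With only nine words this already recovers -s/-ed/-ing; the hard part, which the MorphoChallenge systems address, is doing this robustly on real vocabularies with irregular forms, prefixes and infixes.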
Arabic morphological analysis
Arabic is particularly challenging - different script, infixes, vowels may be left out in written Arabic …
Sawalha, Majdi; Atwell, Eric (2010). Fine-Grain Morphological Analyzer and Part-of-Speech Tagger for Arabic Text. In: Proceedings of the Language Resources and Evaluation Conference LREC 2010, 17-23 May 2010, Valletta, Malta.