School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Tokenization and Morphology COMP3310 Natural Language Processing Eric Atwell, Language Research Group (with thanks to Katja Markert, Marti Hearst, and other contributors)
• Thanks to Katja Markert and Marti Hearst for much of the material
• Katja Markert, Lecturer, School of Computing, Leeds University http://www.comp.leeds.ac.uk/markert http://www.comp.leeds.ac.uk/lng
• Marti Hearst, Associate Professor, School of Information, University of California at Berkeley http://www.ischool.berkeley.edu/people/faculty/martihearst http://courses.ischool.berkeley.edu/i256/f06/sched.html
Rationalism: language models based on expert introspection
Empiricism: models via machine-learning from a corpus
Corpus: text selected by language, genre, domain, …
Brown, LOB, BNC, Penn Treebank, MapTask, CCA, …
Corpus Annotation: text headers, PoS, parses, …
Corpus size is no. of words – depends on tokenisation
We can count word tokens, word types, type-token distribution
Lexeme/lemma is the “root form”, vs. its inflections (be vs. am/is/was…)
What’s a word?
How many words do you find in the following short text?
What is the biggest/smallest plausible answer to this question?
What problems do you encounter?
It’s a shame that our data-base is not up-to-date. It is a shame that um, data base A costs $2300.50 and that database B costs $5000. All databases cost far too much.
Time: 3 minutes
Counting words: tokenization
Tokenisation is a processing step in which the input text is automatically divided into units called tokens, each of which is a word, a number, or a punctuation mark…
So a word count can choose to ignore the number and punctuation tokens (?)
Word: a contiguous run of alphanumeric characters delimited by whitespace.
Whitespace: space, tab, newline.
BUT dividing at spaces is too simple: It’s, data base
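A quick illustration of why dividing at whitespace is too simple, using the example text from the exercise above:

```python
# str.split() with no argument splits on any run of whitespace
# (space, tab, newline).
text = "It's a shame that our data-base is not up-to-date."
print(text.split())
# → ["It's", 'a', 'shame', 'that', 'our', 'data-base', 'is', 'not', 'up-to-date.']
# The clitic in "It's" stays attached to the pronoun, and the final full
# stop is glued onto "up-to-date."
```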
Another approach is to use regular expressions to specify which substrings are valid words.
Regular expressions for tokenization
• wordr = r'(\w+)'
• hyphen = r'(\w+\-\s?\w+)'
• E.g. data-base; allows for a space after the hyphen
• apostrophe = r'(\w+\'\w+)'
• E.g. isn't
• numbers = r'((\$|#)?\d+(\.)?\d+%?)'
• Needs extending to handle large numbers with commas
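The slide's patterns can be combined into a working tokenizer sketch. The comma-grouping extension to the numbers pattern and the ordering of the alternatives are our additions, not from the slide:

```python
import re

# Patterns adapted from the slide. The slide's numbers pattern
# \d+(\.)?\d+ requires at least two digits and has no commas; the
# version below allows comma grouping, which is our extension.
NUMBERS    = r'(?:\$|#)?\d+(?:,\d{3})*(?:\.\d+)?%?'
HYPHEN     = r'\w+-\s?\w+'      # e.g. data-base; allows a space after the hyphen
APOSTROPHE = r"\w+'\w+"         # e.g. isn't, It's
WORD       = r'\w+'

# Order matters: the more specific patterns must come first in the
# alternation, or \w+ would grab "isn" out of "isn't".
TOKEN_RE = re.compile('|'.join([NUMBERS, HYPHEN, APOSTROPHE, WORD]))

def tokenize(text):
    return [m.group(0) for m in TOKEN_RE.finditer(text)]

print(tokenize("It's a shame that data base A costs $2,300.50 and database B costs $5000."))
```

Punctuation is simply dropped here; a fuller tokenizer would add one more alternative for punctuation marks so the word count can decide whether to include them.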
Some Tokenization Issues
Sentence Boundaries
• Punctuation, eg quotation marks around sentences?
• Periods – end of a sentence or not (e.g. in abbreviations)?
Proper Names
• What to do about
• “New York-New Jersey train”?
• “California Governor Arnold Schwarzenegger”?
Contractions
• That’s Fred’s jacket’s pocket.
• I’m doing what you’re saying “Don’t do!”.
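The contraction examples can be probed with the apostrophe pattern from the earlier slide; the comparison with clitic-splitting output is our illustration:

```python
import re

# The apostrophe pattern \w+'\w+ keeps contractions in one token, but it
# cannot tell a contraction (Don't = Do + n't) from a possessive
# (Fred's = Fred + 's): both match and come out as single tokens.
token_re = re.compile(r"\w+'\w+|\w+")

print(token_re.findall("That's Fred's jacket's pocket."))
# → ["That's", "Fred's", "jacket's", "pocket"]
# A Penn Treebank-style tokenizer would instead split off the clitics:
#   That 's Fred 's jacket 's pocket .
```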
Jabberwocky Analysis
This is nonsense … or is it?
This is not English … but it’s much more like English than it is like French or German or Chinese or …
Why do we pretty much understand the words?
Jabberwocky Analysis
Why do we pretty much understand the words?
We recognize combinations of morphemes.
• Chortled - Laugh in a breathy, gleeful way; (Definition from Oxford American Dictionary) A combination of "chuckle" and "snort."
• Galumphing - Moving in a clumsy, ponderous, or noisy manner. Perhaps a blend of "gallop" and "triumph." (Definition from Oxford American Dictionary)
Activity:
• Make up a word whose meaning can be inferred from the morphemes that you used.
Jabberwocky Analysis
Why do we pretty much understand the words?
• Surrounding English words strongly indicate the parts-of-speech of the nonsense words.
PC-KIMMO: two-level morphology
• After Kimmo Koskenniemi, based in part on work by Lauri Karttunen in 1983
• Uses:
• A rules file which specifies the alphabet and the phonological (or spelling) rules,
• A lexicon file which lists lexical items and encodes morphotactic constraints.
• http://www.sil.org/pckimmo/
Commercial versions are available
• Inxight’s LinguistX, based on technology developed by Kaplan and others at Xerox PARC (or at least it used to be)
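The rules-file/lexicon-file division can be caricatured in a few lines. The lexicon, suffix classes, and the single e-deletion spelling rule below are illustrative inventions, not PC-KIMMO's actual two-level notation:

```python
# Toy sketch of the two components: a lexicon encoding morphotactic
# constraints (which suffix classes may follow which stems), plus one
# spelling rule. Illustrative only, not PC-KIMMO's file format.

LEXICON = {
    # stem: suffix classes allowed to follow it
    "move":  {"VERB_SUFFIX"},
    "quick": {"ADJ_SUFFIX"},
}
SUFFIXES = {
    "ing": "VERB_SUFFIX",
    "ed":  "VERB_SUFFIX",
    "ly":  "ADJ_SUFFIX",
}

def spell(stem, suffix):
    """One spelling rule: drop a stem-final 'e' before a vowel-initial suffix."""
    if stem.endswith("e") and suffix[0] in "aeiou":
        return stem[:-1] + suffix
    return stem + suffix

def generate(stem, suffix):
    """Combine stem + suffix only if the lexicon's morphotactics allow it."""
    if SUFFIXES[suffix] not in LEXICON[stem]:
        raise ValueError(f"{stem}+{suffix} violates morphotactics")
    return spell(stem, suffix)

print(generate("move", "ing"))   # → moving  (the e-deletion rule fires)
print(generate("quick", "ly"))   # → quickly
```

Real two-level rules are finite-state constraints applied in parallel over lexical/surface character pairs; this sketch only shows the division of labour between the two files.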
Morphological Analysis Tools
“cheat”: store all variants in a dictionary database, eg
CatVar:
• Categorial Variation Database
• “A database of clusters of uninflected words (lexemes) and their categorial (i.e. part-of-speech) variants.”
• Example: the develop cluster: (develop(V), developer(N), developed(AJ), developing(N), developing(AJ), development(N)).
http://clipdemos.umiacs.umd.edu/catvar
based on published dictionaries: LDOCE, CELEX, OALD++, PROPOSEL ...
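The “cheat” can be sketched as a reverse index over CatVar-style clusters. The cluster is the slide's develop example; the reverse index and the shortest-form lexeme heuristic are our assumptions:

```python
# Dictionary-lookup morphology: store whole clusters, then build a reverse
# index from every surface form back to its lexeme and part of speech.
# (A dict keeps one part of speech per form, so the slide's
# developing(N)/developing(AJ) collapse to a single entry here.)

CLUSTERS = [
    {"develop": "V", "developer": "N", "developed": "AJ",
     "developing": "AJ", "development": "N"},
]

INDEX = {}
for cluster in CLUSTERS:
    lexeme = min(cluster, key=len)   # crude heuristic: shortest form is the lexeme
    for form, pos in cluster.items():
        INDEX[form] = (lexeme, pos)

print(INDEX["development"])   # → ('develop', 'N')
print(INDEX["developing"])    # → ('develop', 'AJ')
```

Lookup is then O(1) per word, with no rules at all – the cost is moved into building and maintaining the dictionary.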
MorphoChallenge
One problem with rule-based systems (PC-KIMMO) or dictionary-lookup systems: porting them to new languages
In principle, unsupervised machine learning could learn from any language dataset, by finding recurring patterns which correspond to roots, prefixes and suffixes
MorphoChallenge is a contest to find the best unsupervised machine-learnt morphological analyser
http://www.cis.hut.fi/morphochallenge2005/
http://www.cis.hut.fi/morphochallenge2007/
http://www.cis.hut.fi/morphochallenge2008/
Atwell, Roberts: Combinatory Hybrid Elementary Analysis of Text http://www.cis.hut.fi/morphochallenge2005/P07_Atwell.pdf
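The recurring-pattern idea can be sketched by counting candidate suffixes across a toy vocabulary. Real entrants such as Morfessor use probabilistic models over a whole corpus; this only shows the intuition:

```python
from collections import Counter

# Unsupervised intuition: consider every split point of every word as a
# candidate stem+suffix, count how often each candidate suffix recurs
# across the vocabulary, and segment each word at the suffix that recurs
# most often.
words = ["walks", "walked", "walking",
         "talks", "talked", "talking",
         "jumps", "jumped", "jumping"]

suffix_counts = Counter()
for w in words:
    for i in range(1, len(w)):       # split point: w[:i] + w[i:]
        suffix_counts[w[i:]] += 1

def segment(word):
    # pick the split whose suffix recurs most often (earliest split wins ties,
    # so the longest such suffix is preferred)
    best = max(range(1, len(word)), key=lambda i: suffix_counts[word[i:]])
    return word[:best], word[best:]

print(segment("walking"))   # → ('walk', 'ing')
print(segment("jumped"))    # → ('jump', 'ed')
```

With only nine words this already recovers -s/-ed/-ing; the hard part, which the MorphoChallenge systems address, is doing this robustly on real vocabularies with irregular forms, prefixes and infixes.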
Arabic morphological analysis
Arabic is particularly challenging - different script, infixes, vowels may be left out in written Arabic …
Sawalha, Majdi; Atwell, Eric (2010). Fine-Grain Morphological Analyzer and Part-of-Speech Tagger for Arabic Text. In: Proceedings of the Language Resources and Evaluation Conference LREC 2010, 17-23 May 2010, Valletta, Malta.