School of Computing, FACULTY OF ENGINEERING
PoS-Tagging theory and terminology
COMP3310 Natural Language Processing
Eric Atwell, Language Research Group
(with thanks to Katja Markert, Marti Hearst, and other contributors)
Reminder: PoS-tagging programs
Models behind some example PoS-tagging methods in NLTK:
Hand-coded
Statistical taggers
Brill (transformation-based) tagger
NB: you don’t have to use NLTK; it is simply useful for illustrating the methods
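The three kinds of tagger listed above can be sketched in plain Python, without NLTK. This is a minimal illustration under stated assumptions, not NLTK's implementation: the rule patterns, the coarse tag names (DET, NOUN, VERB, ...), the toy training sentences and the function names are all invented for the example.

```python
import re
from collections import Counter, defaultdict

# 1. Hand-coded tagger: ordered (pattern, tag) rules, first match wins.
#    Patterns and tag names are illustrative assumptions.
HAND_RULES = [
    (re.compile(r"^(the|a|an)$", re.I), "DET"),
    (re.compile(r"^to$", re.I), "TO"),
    (re.compile(r".*ing$"), "VERB"),
    (re.compile(r".*ly$"), "ADV"),
    (re.compile(r".*"), "NOUN"),  # default: guess NOUN
]

def hand_tag(word):
    """Return the tag of the first rule whose pattern matches the word."""
    for pattern, tag in HAND_RULES:
        if pattern.match(word):
            return tag

# 2. Statistical (unigram) tagger: each known word gets the tag it
#    carried most often in the training data.
def train_unigram(tagged_sents):
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def unigram_tag(words, model):
    """Tag known words from the model; back off to the hand-coded rules."""
    return [(w, model.get(w.lower()) or hand_tag(w)) for w in words]

# 3. Brill-style transformation: "change from_tag to to_tag when the
#    previous tag is prev_tag". Context is read from the input tagging.
def apply_transformation(tagged, from_tag, to_tag, prev_tag):
    out = list(tagged)
    for i in range(1, len(tagged)):
        if tagged[i][1] == from_tag and tagged[i - 1][1] == prev_tag:
            out[i] = (tagged[i][0], to_tag)
    return out

# Toy training data (invented for illustration).
TRAIN = [
    [("the", "DET"), ("race", "NOUN"), ("was", "VERB"), ("long", "ADJ")],
    [("they", "PRON"), ("want", "VERB"), ("to", "TO"), ("win", "VERB")],
]

model = train_unigram(TRAIN)
initial = unigram_tag(["They", "want", "to", "race", "quickly"], model)
final = apply_transformation(initial, "NOUN", "VERB", "TO")
print(initial)
print(final)
```

Here the unigram model tags "race" as NOUN because that is all it saw in training, and the single transformation then retags it as VERB after "to". A real Brill tagger learns an ordered list of such transformations from data rather than having one written by hand.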
Training and Testing of Machine Learning Algorithms
Algorithms that “learn” from data see a set of examples and try to generalize from them.
Training set:
• The examples the algorithm is trained on
Test set:
• Also called held-out data and unseen data
• Use this for evaluating your algorithm
• Must be separate from the training set
• Otherwise, you cheated!
“Gold standard” evaluation set
• A test set that a community has agreed on and uses as a common benchmark. DO NOT USE IT FOR TRAINING OR FOR DEVELOPMENT TESTING
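The separation of training and test data can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the function names, the 20% test fraction, the most-frequent-tag baseline and the toy sentences are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def split_corpus(tagged_sents, test_fraction=0.2, seed=0):
    """Shuffle the sentences, then hold out a fraction as unseen test data."""
    sents = list(tagged_sents)
    random.Random(seed).shuffle(sents)
    n_test = max(1, int(len(sents) * test_fraction))
    return sents[n_test:], sents[:n_test]  # (training set, test set)

def train_most_frequent_tag(train_sents):
    """Learn each word's most frequent tag, plus an overall default tag."""
    counts = defaultdict(Counter)
    tag_totals = Counter()
    for sent in train_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
            tag_totals[tag] += 1
    model = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    default_tag = tag_totals.most_common(1)[0][0]
    return model, default_tag

def accuracy(model, default_tag, test_sents):
    """Fraction of test tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in test_sents:
        for word, gold in sent:
            predicted = model.get(word.lower(), default_tag)
            correct += predicted == gold
            total += 1
    return correct / total if total else 0.0

# Toy data (invented): the test sentence is never seen during training.
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("dogs", "NOUN"), ("chase", "VERB"), ("cats", "NOUN")],
]
TEST = [[("the", "DET"), ("cats", "NOUN"), ("sleep", "VERB")]]

model, default_tag = train_most_frequent_tag(TRAIN)
acc = accuracy(model, default_tag, TEST)
print(f"accuracy on held-out data = {acc:.2f}")
```

Evaluating on TRAIN instead of TEST would score every token as correct here, which is exactly the "cheating" the slide warns about: the unseen word "sleep" only shows up as an error when the test data is kept separate.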
PoS word classes in English
Word classes are also called syntactic categories, grammatical categories, or Parts of Speech
closed class type: classes with few, fixed members; function words, e.g. prepositions
open class type: classes with many members and frequent new additions; content words, e.g. nouns
8 major word classes: nouns, verbs, adjectives, adverbs, prepositions, determiners, conjunctions, pronouns
These classes are found in English, and also in most (?all) natural languages
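One practical consequence of the closed/open distinction is that closed classes can, in principle, be listed in a lexicon, while open classes cannot. The sketch below assumes a tiny, deliberately incomplete lexicon; the word lists and the function name are illustrative only.

```python
# Partial closed-class lexicon (illustrative; real lists are longer
# but still finite, which is what makes the classes "closed").
CLOSED_CLASS = {
    "preposition": {"in", "on", "at", "of", "with", "by"},
    "determiner": {"the", "a", "an", "this", "that"},
    "conjunction": {"and", "or", "but"},
    "pronoun": {"i", "you", "he", "she", "it", "we", "they"},
}

def closed_class_of(word):
    """Return the closed class a word is listed under, or None.

    None means the word is not in the lexicon, so it is most likely an
    open-class word (noun, verb, adjective, adverb), possibly a new one.
    """
    w = word.lower()
    for word_class, members in CLOSED_CLASS.items():
        if w in members:
            return word_class
    return None

print(closed_class_of("The"))     # listed: a determiner
print(closed_class_of("selfie"))  # not listed: treated as open class
```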
What properties define “noun”?
Semantic properties: refer to people, places and things
Distributional properties: ability to occur next to determiners, possessives, adjectives (specific locations)
Morphological properties: most occur in singular and plural
These are properties of a word TYPE, e.g. “man” is a noun (usually)
Sometimes a given TOKEN may not meet all these criteria …
The men are happy … the man is happy …
They man the lifeboat (?)
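The TYPE versus TOKEN distinction can be made concrete by counting, for each word type, the tags carried by its individual tokens. The sketch below uses the three example sentences from this slide; the tags attached to them and the function name are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# The slide's example sentences as (token, tag) pairs; tags are assumed.
TAGGED_TOKENS = [
    ("The", "DET"), ("men", "NOUN"), ("are", "VERB"), ("happy", "ADJ"),
    ("the", "DET"), ("man", "NOUN"), ("is", "VERB"), ("happy", "ADJ"),
    ("They", "PRON"), ("man", "VERB"), ("the", "DET"), ("lifeboat", "NOUN"),
]

def tag_distribution(tagged_tokens):
    """Map each word type (lower-cased form) to a count of its token tags."""
    dist = defaultdict(Counter)
    for token, tag in tagged_tokens:
        dist[token.lower()][tag] += 1
    return dist

dist = tag_distribution(TAGGED_TOKENS)
n_tokens = len(TAGGED_TOKENS)
n_types = len(dist)

print(n_tokens, "tokens,", n_types, "types")
print("man:", dict(dist["man"]))
# Types whose tokens carry more than one tag are ambiguous for a tagger.
print("ambiguous types:", [w for w, c in dist.items() if len(c) > 1])
```

In this tiny sample the type "man" has one NOUN token and one VERB token, so a tagger has to decide per token, using context, rather than per type.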
Subcategories
Noun
Proper noun v common noun
Singular v plural
Count noun v mass noun (often not covered in PoS-tagsets)
Some tag-sets may have other subcategories, e.g. NNP = common noun with Word Initial Capital (e.g. Englishman)
A PoS-tagset often also encodes morphological categories such as person, number, gender, tense, case . . .
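As a small illustration of how a tag can bundle the word class with morphological categories, the sketch below decodes a handful of tags into feature dictionaries. It assumes Penn Treebank tag names for this subset; other tagsets use different names (and sometimes the same name with a different meaning), so the table and the function name are examples only.

```python
# Subset of Penn Treebank tags (assumed tagset), decoded into features.
TAG_FEATURES = {
    "NN":  {"pos": "noun", "number": "singular"},
    "NNS": {"pos": "noun", "number": "plural"},
    "VB":  {"pos": "verb", "form": "base"},
    "VBD": {"pos": "verb", "tense": "past"},
    "VBZ": {"pos": "verb", "tense": "present",
            "person": "3", "number": "singular"},
}

def decode(tag):
    """Return the features encoded in a tag ({} if the tag is unknown)."""
    return TAG_FEATURES.get(tag, {})

for word, tag in [("lifeboat", "NN"), ("men", "NNS"), ("mans", "VBZ")]:
    print(word, tag, decode(tag))
```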