7 November 2006 University of Tehran
Persian POS Tagging
Hadi Amiri
Database Research Group (DBRG)
ECE Department, University of Tehran
Outline
• What is POS tagging
• How is data tagged for POS?
• Tagged Corpora
• POS Tagging Approaches
• Corpus Training
• How to Evaluate a tagger?
• Bijankhan Corpus
• Memory Based POS
• MLE Based POS
• Neural Network POS Tagger
What is POS tagging
Annotating each word in a given sentence with its part of speech (grammatical type).
e.g. I/PRP would/MD prefer/VB to/TO study/VB at/IN a/DT traditional/JJ school/NN
Properties:
• It helps parsing
• It resolves pronunciation ambiguities
  As the water grew colder, their hands grew number. (number = ADJ, not N)
• It resolves semantic ambiguities
  Patients can bear pain.
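The word/TAG notation used in the example sentence is easy to read programmatically. The following is a minimal sketch, assuming tokens are separated by whitespace and the tag follows the last slash; the function name `parse_tagged` is hypothetical.

```python
def parse_tagged(sentence):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        # rpartition splits on the LAST '/', so a word that itself
        # contains a slash keeps everything before the final one.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged = parse_tagged(
    "I/PRP would/MD prefer/VB to/TO study/VB at/IN a/DT traditional/JJ school/NN"
)
print(tagged[:3])
```

Keeping the result as a list of pairs preserves word order, which later context-based taggers depend on.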
POS Application
Part-of-speech (POS) tagging is important for many applications:
• Word sense disambiguation
• Parsing
• Language modeling
• Q&A and information extraction
• Text-to-speech
Tagging techniques can be used for a variety of tasks:
• Semantic tagging
• Dialogue tagging
• Information retrieval, …
POS Tags
Tag       Category        Examples
N         noun            baby, toy
V         verb            see, kiss
ADJ       adjective       tall, grateful, alleged
ADV       adverb          quickly, frankly, ...
P         preposition     in, on, near
DET       determiner      the, a, that
WhPron    wh-pronoun      who, what, which, …
COORD     coordinator     and, or
Open class: N, V, ADJ, ADV
POS Tags
• There is no standard set of POS tags
  Some use coarse classes: e.g., N, V, A, Aux, …
  Others prefer finer distinctions (e.g., Penn Treebank):
• PRP: personal pronouns (you, me, she, he, them, him, …)
• PRP$: possessive pronouns (my, our, her, his, …)
• NN: singular common nouns (sky, door, theorem, …)
• NNS: plural common nouns (doors, theorems, women, …)
• NNP: singular proper names (Fifi, IBM, Canada, …)
• NNPS: plural proper names (Americas, Carolinas, …)
How is data tagged for POS?
• We are trying to model human performance.
• So we have humans tag a corpus and try to match their performance.
To create a model:
• A corpus is hand-tagged for POS by more than one annotator
• The annotations are then checked for reliability
History
[Timeline figure, 1960 to 2000. Milestones recoverable from the slide:]
• Brown Corpus created (EN-US), 1 million words
• Brown Corpus tagged
• Greene and Rubin: rule-based tagging, 70%
• LOB Corpus created (EN-UK), 1 million words
• LOB Corpus tagged
• HMM tagging (CLAWS): 93%-95%
• DeRose/Church: efficient HMM, sparse data, 95%+
• British National Corpus (tagged by CLAWS)
• Penn Treebank Corpus (WSJ, 4.5M)
• POS tagging separated from other NLP
• Transformation-based tagging (Eric Brill): rule-based, 95%+
• Tree-based statistics (Helmut Schmid): rule-based, 96%+
• Neural network: 96%+
• Trigram tagger (Kempe): 96%+
• Combined methods: 98%+
Tagged Corpora
Corpus               # Tags   # Tokens
Brown                87       1 million
British Natl         61       100 million
Penn Treebank        45       4.8 million
Original Bijankhan   550      ?
Bijankhan            40       2.6 million
POS Tagging Approaches
POS Tagging
• Supervised: Rule-Based, Stochastic, Neural
• Unsupervised: Rule-Based, Stochastic, Neural
Rule-Based POS Tagger
Lexicon with tags identified for each word:
that    ADV
        PRON DEM SG
        DET CENTRAL DEM SG
        CS
Constraints to eliminate tags:
If the next word is adj, adv, or quant
And the following is a sentence boundary
And the previous word is not a consider-type verb
Then eliminate non-ADV tags
e.g. He was that drunk.
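The constraint for "that" can be written as a small filter over candidate tags. This is a minimal sketch of that single rule, not a full constraint-grammar engine; the function name, argument names, and the simplified tag labels ADJ/ADV/QUANT are assumptions made for illustration.

```python
ADVERB_TRIGGERS = {"ADJ", "ADV", "QUANT"}


def apply_adverbial_that_rule(candidates, next_tags, following_is_boundary,
                              prev_is_consider_verb):
    """Return the candidate tags for "that" that survive the constraint.

    candidates            -- tags listed for "that" in the lexicon
    next_tags             -- set of possible tags of the next word
    following_is_boundary -- True if a sentence boundary follows the next word
    prev_is_consider_verb -- True if the previous word is a consider-type verb
    """
    if (next_tags & ADVERB_TRIGGERS) and following_is_boundary \
            and not prev_is_consider_verb:
        kept = [tag for tag in candidates if tag == "ADV"]
        # Never eliminate every reading; fall back to the full list.
        return kept or list(candidates)
    return list(candidates)


# "He was that drunk.": next word "drunk" can be ADJ, then a sentence boundary.
that_tags = ["ADV", "PRON DEM SG", "DET CENTRAL DEM SG", "CS"]
print(apply_adverbial_that_rule(that_tags, {"ADJ"}, True, False))
```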
Probabilistic POS Tagging
• Provides the possibility of automatic training rather than painstaking rule revision.
• Automatic training means that a tagger can be easily adapted to new text domains.
E.g.
A moving/VBG house
A moving/JJ ceremony
Probabilistic POS Tagging
• Needs large tagged corpus for training
• Unigram statistics (most common part-of-speech for each word) get us to about 90% accuracy
• For greater accuracy, we need some information on adjacent words
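One common way to bring in information about adjacent words is to estimate how likely each tag is given the tag before it. The sketch below counts tag bigrams by relative frequency; it illustrates the general idea only, and the start marker `<s>` and the toy sentences are assumptions.

```python
from collections import Counter, defaultdict


def transition_probs(tagged_sentences):
    """Estimate P(tag | previous tag) by relative frequency."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = "<s>"  # hypothetical sentence-start marker
        for _, tag in sentence:
            counts[prev][tag] += 1
            prev = tag
    return {
        prev: {tag: n / sum(nexts.values()) for tag, n in nexts.items()}
        for prev, nexts in counts.items()
    }


# Toy data, invented for illustration.
toy = [
    [("a", "DT"), ("moving", "VBG"), ("house", "NN")],
    [("a", "DT"), ("moving", "JJ"), ("ceremony", "NN")],
]
probs = transition_probs(toy)
print(probs["DT"])
```

A stochastic tagger would combine such transition estimates with per-word tag statistics when scoring a tag sequence.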
Corpus Training
• The probabilities in a statistical model come from the corpus it is trained on.
• If the corpus is too domain-specific, the model may not be portable to other domains.
• If the corpus is too general, it will not capitalize on the advantages of domain-specific probabilities
Tagger Evaluation
• Once a tagging model has been built, how is it tested?
  Typically, a corpus is split into a training set (usually ~90% of the data) and a test set (10%).
  The test set is held out from the training.
  The tagger learns the tag sequences that maximize the probabilities for that model.
  The tagger is tested on the test set.
• The tagger is not trained on test data.
• But test data is highly similar to training data.
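The hold-out procedure and the accuracy measure can be sketched as follows. This is an illustrative outline, assuming the corpus is a list of sentences and that shuffling before the split is acceptable; the function names and the fixed seed are hypothetical choices.

```python
import random


def split_corpus(sentences, test_fraction=0.1, seed=0):
    """Shuffle sentences and hold out a fraction of them as the test set."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


train, test = split_corpus(list(range(100)))
print(len(train), len(test))
```

Splitting by sentence rather than by token keeps each sentence's context intact in one of the two sets.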
Current Performance
• How many tags are correct?
  About 98% currently
  But baseline is already 90%
  Baseline algorithm:
  • Tag every word with its most frequent tag
  • Tag unknown words as nouns
• How well do people do?
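The baseline algorithm is short enough to sketch directly. The following assumes the training data is a list of (word, tag) pairs and that "N" is the noun tag; the function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_tokens):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_baseline(words, model, unknown_tag="N"):
    """Tag known words with their most frequent tag, unknown words as nouns."""
    return [model.get(word, unknown_tag) for word in words]


# Toy training data, invented for illustration.
training = [("can", "MD"), ("can", "MD"), ("can", "NN"), ("bear", "VB")]
model = train_baseline(training)
print(tag_baseline(["can", "bear", "pain"], model))
```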
Memory Based Part Of Speech Tagging Experiments With Persian Text
Corpus Study
• At first the corpus had 550 tags.
• The content is gathered from daily news and common texts.
• Each document is assigned a subject such as political, cultural and so on.
  In total, there are 4300 different subjects.
  This subject categorization provides an ideal experimental environment for clustering, filtering, and categorization research.
• In this research, we simply ignored the subject categories of the documents and concentrated on POS tags.
Selecting Suitable Tags
• First, the frequency of each tag was gathered.
• Then many of the tags were grouped together and a smaller tag set was produced.
• Each tag in the tag set is placed in a hierarchical structure. As an example, consider the tag “N_PL_LOC”:
  N stands for a noun
  PL marks the noun as plural
  LOC marks the noun as referring to a location
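Because the levels of a tag are joined with underscores, the hierarchy can be read off by splitting the tag name. The sketch below assumes every underscore separates two levels; the function names are hypothetical.

```python
def tag_levels(tag):
    """Split a hierarchical tag such as 'N_PL_LOC' into its levels."""
    return tag.split("_")


def coarsen(tag, depth):
    """Keep only the first `depth` levels of a hierarchical tag."""
    return "_".join(tag_levels(tag)[:depth])


print(tag_levels("N_PL_LOC"))
print(coarsen("N_PL_LOC", 2))
```

Truncating to a fixed depth is one simple way to map a fine-grained tag onto a coarser tag set.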
The Tags Distribution
Max, Min, AVG, Total # of Tags in The Training Set
Number of Different Tags
For instance, the word “آسمان”, which means “the sky” in English, is always tagged "N_SING" in the whole corpus, but a word like “بالا”, which means “high” or “above”, has been tagged with several tags ("ADJ_SIM", "ADV", "ADV_NI", "N_SING", "P", and "PRO").
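The number of different tags per word can be counted directly from the tagged tokens. This is a minimal sketch; the function name and the toy token list are assumptions, and the toy list shows only three of the tags reported for the ambiguous word.

```python
from collections import defaultdict


def tags_per_word(tagged_tokens):
    """Count how many distinct tags each word receives in the corpus."""
    seen = defaultdict(set)
    for word, tag in tagged_tokens:
        seen[word].add(tag)
    return {word: len(tags) for word, tags in seen.items()}


# Toy tokens, invented for illustration.
toy_tokens = [
    ("آسمان", "N_SING"),
    ("آسمان", "N_SING"),
    ("بالا", "ADV"),
    ("بالا", "P"),
    ("بالا", "ADJ_SIM"),
]
ambiguity = tags_per_word(toy_tokens)
print(sorted(ambiguity.values()))
```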
Classifying the Rare Words
Tags that occur fewer than 5000 times in the corpus are gathered into the “ETC” group.
Tag       Share
N_SING    38%
P         12%
ETC       12%
DELM      10%
ADJ_SIM   9%
CON       8%
N_PL      6%
V_PA      3%
PRO       2%
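Grouping rare tags under one label is a frequency filter. The sketch below uses the 5000-occurrence threshold from the slide as the default; the function name and the toy tag list are hypothetical.

```python
from collections import Counter


def group_rare_tags(tags, min_count=5000, rare_label="ETC"):
    """Replace every tag that occurs fewer than min_count times with rare_label."""
    freq = Counter(tags)
    return [tag if freq[tag] >= min_count else rare_label for tag in tags]


# Toy example with a small threshold so the effect is visible.
toy_tags = ["N_SING", "N_SING", "P", "PRO", "P"]
print(group_rare_tags(toy_tags, min_count=2))
```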
Bijankhan Corpus
Implemented Methods
• MLE Based POS Tagger
• Neural Network POS Tagger
• Memory Based POS Tagger
Memory-Based POS Tagging
• Memory-based learning is also called lazy learning, example-based learning, or case-based learning.
• MBT uses properties of each word, such as its possible tags, and a fixed-width context as features.
• We used MBT, a tool for memory-based tagger generation and tagging (available at: http://ilk.uvt.nl/mbt/).
Memory-Based POS Tagging
The MBT tool generates a tagger by working through the annotated corpus and creating three data structures:
• a lexicon, associating words with tags as evident in the training corpus
• a case base for known words (words occurring in the lexicon)
• a case base for unknown words
Selecting appropriate feature sets for known and unknown words has an important impact on the accuracy of the results.
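To make the three data structures concrete, here is a rough stand-in written in plain Python. It does not use MBT or reproduce its file formats. In particular, treating rare training words as proxies for unknown words is an assumption of this sketch, as are the function name and the `rare_threshold` parameter.

```python
from collections import Counter, defaultdict


def build_structures(tagged_tokens, rare_threshold=1):
    """Build a lexicon plus separate case lists for known and unknown words."""
    lexicon = defaultdict(Counter)
    for word, tag in tagged_tokens:
        lexicon[word][tag] += 1

    known_cases, unknown_cases = [], []
    for word, tag in tagged_tokens:
        # Assumption: words seen at most rare_threshold times stand in
        # for unknown words when collecting training cases.
        if sum(lexicon[word].values()) <= rare_threshold:
            unknown_cases.append((word, tag))
        else:
            known_cases.append((word, tag))
    return dict(lexicon), known_cases, unknown_cases


# Toy tokens, invented for illustration.
tokens = [("a", "DT"), ("dog", "NN"), ("a", "DT"), ("cat", "NN")]
lexicon, known_cases, unknown_cases = build_structures(tokens)
print(lexicon["a"], len(known_cases), len(unknown_cases))
```

In a real memory-based tagger each case would also carry context features rather than just the word and its tag.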
Memory-Based POS Tagging
After different experiments, we chose “ddfa” as the feature set for known words:
• d stands for a disambiguated tag (the tag of the word two positions before the current word)
• d stands for a disambiguated tag (the tag of the word immediately before the current word)
• f is the focus (current) word
• a is the ambiguous word after the current word (its possible tags)
So “ddfa” chooses the appropriate tag for each known word based on the tags of the two words before it and the possible tags of the word after it.
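A “ddfa” feature tuple for one position might be assembled as below. This is an illustrative sketch and not MBT's own feature encoding: the function name, the padding symbol, and the way possible tags are joined into one string are all assumptions.

```python
def ddfa_features(words, assigned_tags, lexicon, i, pad="_"):
    """Build the (d, d, f, a) tuple for the known word at position i.

    assigned_tags -- tags already chosen for positions before i
    lexicon       -- maps a word to the set of tags it can take
    """
    d2 = assigned_tags[i - 2] if i >= 2 else pad
    d1 = assigned_tags[i - 1] if i >= 1 else pad
    f = words[i]
    if i + 1 < len(words):
        # Ambiguous tag of the next word: all of its possible tags, joined.
        a = "-".join(sorted(lexicon.get(words[i + 1], []))) or pad
    else:
        a = pad
    return (d2, d1, f, a)


# Toy sentence and lexicon, invented for illustration.
words = ["he", "can", "bear", "pain"]
lexicon = {"pain": {"NN", "VB"}}
print(ddfa_features(words, ["PRP", "MD"], lexicon, 2))
```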
Memory-Based POS Tagging
The feature set chosen for unknown words is “dFass”:
• d is the disambiguated tag of the word before the current word
• F marks the position of the focus (current) word; it is not included in the actual feature set for tagging
• a stands for the ambiguous tags of the word after the current word
• ss are the last two letters (suffix) of the current word
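The “dFass” features can be sketched in the same style. Again this is not MBT's encoding; the function name, the padding symbol, and the padding of words shorter than two letters are assumptions.

```python
def dfass_features(words, assigned_tags, lexicon, i, pad="_"):
    """Build the (d, a, s, s) tuple for the unknown word at position i.

    The focus word itself is not a feature; only its last two letters are.
    """
    d = assigned_tags[i - 1] if i >= 1 else pad
    if i + 1 < len(words):
        a = "-".join(sorted(lexicon.get(words[i + 1], []))) or pad
    else:
        a = pad
    padded = pad * 2 + words[i]  # guarantees at least two characters
    return (d, a, padded[-2], padded[-1])


# Toy sentence and lexicon, invented for illustration.
sentence = ["the", "runners", "stop"]
toy_lexicon = {"stop": {"NN", "VB"}}
print(dfass_features(sentence, ["DT"], toy_lexicon, 1))
```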
MBT Results- Known Words
“ddfa”
MBT Results- Unknown Words
“dFass”
MBT Results- Overall
Maximum Likelihood Estimation
As a benchmark of POS tagging accuracy, we chose the Maximum Likelihood Estimation (MLE) approach:
• Calculate the maximum likelihood probability of each tag assigned to each word in the training set.
• For each word, choose the tag with the greatest maximum likelihood probability (the designated tag) and make it the only tag assignable to that word.
• To evaluate this method, we assign the designated tags to the words in the test set.
Maximum Likelihood Estimation
Occurrence   Word          Tag       MLE
1            پدرانه        ADV_NI    0.1667
5            پدرانه        ADJ_SIM   0.8333
4            پديدار        ADJ_SIM   0.1538
22           پديدار        N_SING    0.8462
1            پذيرفته       N_SING    0.0096
3            پذيرفته       ADJ_SIM   0.0288
6            پذيرفته       V_PA      0.0577
94           پذيرفته       ADJ_INO   0.9038
2            پراكنده اند   V_PRE     0.5000
2            پراكنده اند   V_PA      0.5000
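The MLE column is each tag's occurrence count divided by the total count for that word. The sketch below recomputes two of the table's rows from their occurrence counts and picks the designated tag; the function names are hypothetical.

```python
def mle(tag_counts):
    """Turn per-tag occurrence counts for one word into relative frequencies."""
    total = sum(tag_counts.values())
    return {tag: n / total for tag, n in tag_counts.items()}


def designated_tag(tag_counts):
    """Return the tag with the highest count (and so the highest MLE value)."""
    return max(tag_counts, key=tag_counts.get)


# Occurrence counts copied from the table.
two_tag_word = {"ADV_NI": 1, "ADJ_SIM": 5}
four_tag_word = {"N_SING": 1, "ADJ_SIM": 3, "V_PA": 6, "ADJ_INO": 94}

print({tag: round(p, 4) for tag, p in mle(two_tag_word).items()})
print(designated_tag(four_tag_word))
```

For the last word in the table both tags have the same count, so a tie-breaking rule is needed; `max` here simply keeps the first tag it encounters.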
MLE Results-Known Words
MLE Results- Unknown Words, “DEFAULT”
For each unknown word we assign the “DEFAULT” tag.
MLE Results- Overall, “DEFAULT”
For each unknown word we assign the “DEFAULT” tag.
MLE Results- Unknown Words, “N_SING”
For each unknown word we assign the “N_SING” tag.
MLE Results- Overall, “N_SING”
For each unknown word we assign the “N_SING” tag, the most frequently assigned tag.
Comparison With Other Languages
Neural Network
Each unit corresponds to one of the tags in the tag set.
[Figure: network diagram with input groups labeled “Preceding Words” and “Following Words”]
Neural Network
• For each POS tag $pos_j$ and each of the $p+1+f$ words in the context, there is an input unit whose activation $in_{ij}$ represents the probability that $word_i$ has part of speech $pos_j$.
Input representation for the currently tagged word and the following words:
The activation value for the preceding words:
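For the current word and the following words, the input activations described above are per-word tag probabilities laid out over the tag set. The sketch below builds only that part of the input vector. The activation for preceding words is defined by a formula that is not recoverable from this slide, so it is not modeled here; the function name and the toy probabilities are assumptions.

```python
def input_vector(context_words, lexical_probs, tagset):
    """Concatenate one probability per (word, tag) pair into a flat input vector.

    lexical_probs -- maps a word to {tag: probability that the word has that tag}
    tagset        -- fixed, ordered list of tags; one input unit per tag per word
    """
    vector = []
    for word in context_words:
        probs = lexical_probs.get(word, {})
        vector.extend(probs.get(tag, 0.0) for tag in tagset)
    return vector


# Toy probabilities, invented for illustration.
tagset = ["N", "V"]
lexical_probs = {"bear": {"N": 0.25, "V": 0.75}, "pain": {"N": 1.0}}
print(input_vector(["bear", "pain"], lexical_probs, tagset))
```

The vector length is the number of context words times the size of the tag set, matching one input unit per tag per context position.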
Neural Network Results on Bijankhan Corpus
Training Algorithm         No. of Hidden Layers   No. of Inputs for Training   Training Duration (Hour)   No. of Inputs for Test   Accuracy
MLP                        2                      1 million                    120:00:87                  1000                     Too Low
MLP                        3                      1 million                    ?                          1000                     Too Low
Generalized Feed Forward   1                      1 million                    95:30:57                   1000                     Too Low
Generalized Feed Forward   2                      1 million                    ?                          1000                     Too Low
Generalized Feed Forward   2                      20000                        1:53:35                    1000                     58%
Neural Network on Other Languages
English
Neural Network on Other Languages
Chinese
Future Work
• Using more than one level of POS tags.
• Unsupervised POS tagging using Hamshahri Collection
• Investigation of other methods for Persian POS tagging such as Support Vector Machine (SVM) based tagging
• KASRE YE EZAFE in Persian!
Thank You
Questions?