7 November 2006 University of Tehran Persian POS Tagging Hadi Amiri Database Research Group (DBRG) ECE Department, University of Tehran.


Outline

• What is POS tagging
• How is data tagged for POS?
• Tagged Corpora
• POS Tagging Approaches
• Corpus Training
• How to Evaluate a tagger?
• Bijankhan Corpus
• Memory Based POS
• MLE Based POS
• Neural Network POS Tagger


What is POS tagging

Annotating each word for its part of speech (grammatical type) in a given sentence.

e.g. I/PRP would/MD prefer/VB to/TO study/VB at/IN a/DT traditional/JJ school/NN

Properties:
• It helps parsing
• It resolves pronunciation ambiguities

As the water grew colder, their hands grew number. (number=ADJ, not N)

• It resolves semantic ambiguities

Patients can bear pain.


POS Application

Part-of-speech (POS) tagging is important for many applications:
• Word sense disambiguation
• Parsing
• Language modeling
• Q&A and information extraction
• Text-to-speech

Tagging techniques can be used for a variety of tasks:
• Semantic tagging
• Dialogue tagging
• Information retrieval


POS Tags

N noun baby, toy

V verb see, kiss

ADJ adjective tall, grateful, alleged

ADV adverb quickly, frankly, ...

P preposition in, on, near

DET determiner the, a, that

WhPron wh-pronoun who, what, which, …

COORD coordinator and, or

Open class: N, V, ADJ, ADV


POS Tags

• There is no standard set of POS tags.
Some use coarse classes: e.g., N, V, A, Aux, …
Others prefer finer distinctions (e.g., Penn Treebank):

• PRP: personal pronouns (you, me, she, he, them, him, …)

• PRP$: possessive pronouns (my, our, her, his, …)

• NN: singular common nouns (sky, door, theorem, …)

• NNS: plural common nouns (doors, theorems, women, …)

• NNP: singular proper names (Fifi, IBM, Canada, …)

• NNPS: plural proper names (Americas, Carolinas, …)


How is data tagged for POS?

• We are trying to model human performance.

• So we have humans tag a corpus and try to match their performance.

To create a model, corpora are hand-tagged for POS by more than one annotator, then checked for reliability.


History

Timeline of corpora and tagging methods, 1960 to 2000, with reported accuracies:

• Brown Corpus created (EN-US), 1 million words
• Greene and Rubin, rule based: 70%
• Brown Corpus tagged
• LOB Corpus created (EN-UK), 1 million words
• LOB Corpus tagged
• HMM tagging (CLAWS): 93%-95%
• POS tagging separated from other NLP
• DeRose/Church, efficient HMM, sparse data: 95%+
• British National Corpus (tagged by CLAWS)
• Penn Treebank Corpus (WSJ, 4.5M)
• Transformation-based tagging (Eric Brill), rule based: 95%+
• Tree-based statistics (Helmut Schmid), rule based: 96%+
• Neural network: 96%+
• Trigram tagger (Kempe): 96%+
• Combined methods: 98%+


Tagged Corpora

Corpus               # Tags   # Tokens
Brown                87       1 million
British Natl         61       100 million
Penn Treebank        45       4.8 million
Original Bijankhan   550      ?
Bijankhan            40       2.6 million


POS Tagging Approaches

POS tagging approaches are either supervised or unsupervised.

Within each group, a tagger can be:
• Rule-based
• Stochastic
• Neural


Rule-Based POS Tagger

Lexicon with tags identified for each word:

that    ADV
        PRON DEM SG
        DET CENTRAL DEM SG
        CS

Constraints to eliminate tags:

If next word is adj, adv, or quant
And following is S bdry (sentence boundary)
And previous word is not consider-type V
Then eliminate non-ADV tags

Example: He was that drunk.
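
The constraint above can be sketched as a small function. This is only an illustration: the token representation (a list of dicts with a `word` and a set of candidate `tags`), the coarse tag names, and the list of consider-type verbs are all assumptions made for the example, not the rule engine of any particular tagger.

```python
# Hypothetical list of "consider-type" verbs, for illustration only.
CONSIDER_TYPE_VERBS = {"consider", "considers", "considered", "believe", "believes"}

def apply_adverbial_that_rule(tokens, i):
    """Apply the constraint to tokens[i] if it is the word 'that'.

    tokens: list of dicts like {"word": str, "tags": set of candidate tags}.
    Returns the (possibly reduced) set of candidate tags for tokens[i].
    """
    tok = tokens[i]
    if tok["word"].lower() != "that":
        return tok["tags"]
    has_next = i + 1 < len(tokens)
    # Condition 1: the next word can be an adjective, adverb or quantifier.
    next_ok = has_next and bool(tokens[i + 1]["tags"] & {"ADJ", "ADV", "QUANT"})
    # Condition 2: a sentence boundary follows (here: the end of the token list).
    boundary_follows = i + 2 >= len(tokens)
    # Condition 3: the previous word is not a consider-type verb.
    prev_ok = i == 0 or tokens[i - 1]["word"].lower() not in CONSIDER_TYPE_VERBS
    if next_ok and boundary_follows and prev_ok:
        return {"ADV"}  # eliminate every non-ADV reading
    return tok["tags"]

sentence = [
    {"word": "He", "tags": {"PRON"}},
    {"word": "was", "tags": {"V"}},
    {"word": "that", "tags": {"ADV", "PRON", "DET", "CS"}},
    {"word": "drunk", "tags": {"ADJ"}},
]
that_tags = apply_adverbial_that_rule(sentence, 2)
```

For "He was that drunk", all three conditions hold, so only the ADV reading of "that" survives.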


Probabilistic POS Tagging

• Provides the possibility of automatic training rather than painstaking rule revision.

• Automatic training means that a tagger can be easily adapted to new text domains.

E.g.

A moving/VBG house

A moving/JJ ceremony


Probabilistic POS Tagging

• Needs large tagged corpus for training

• Unigram statistics (most common part-of-speech for each word) get us to about 90% accuracy

• For greater accuracy, we need some information on adjacent words


Corpus Training

• The probabilities in a statistical model come from the corpus it is trained on.

• If the corpus is too domain-specific, the model may not be portable to other domains.

• If the corpus is too general, it will not capitalize on the advantages of domain-specific probabilities


Tagger Evaluation

• Once a tagging model has been built, how is it tested?
Typically, a corpus is split into a training set (usually ~90% of the data) and a test set (10%).
The test set is held out from the training.
The tagger learns the tag sequences that maximize the probabilities for that model.
The tagger is tested on the test set.

• Tagger is not trained on test data.
• But test data is highly similar to training data.
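
The held-out evaluation described above might look like the following sketch. The function names, the sentence representation (lists of `(word, tag)` pairs) and the toy data are assumptions for illustration; accuracy is reported separately for known words (seen in training) and unknown words, matching the way results are grouped later in this talk.

```python
import random

def split_corpus(sentences, test_fraction=0.1, seed=0):
    """Shuffle tagged sentences and hold out a fraction as the test set."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def evaluate(tag_word, train, test):
    """Return accuracy on known words, unknown words and overall.

    tag_word: function mapping a word to a predicted tag.
    train, test: lists of sentences; each sentence is a list of (word, tag).
    """
    known_vocab = {word for sent in train for word, _ in sent}
    counts = {"known": [0, 0], "unknown": [0, 0]}  # [correct, total]
    for sent in test:
        for word, gold in sent:
            key = "known" if word in known_vocab else "unknown"
            counts[key][1] += 1
            counts[key][0] += int(tag_word(word) == gold)

    def ratio(correct, total):
        return correct / total if total else 0.0

    all_correct = counts["known"][0] + counts["unknown"][0]
    all_total = counts["known"][1] + counts["unknown"][1]
    return {
        "known": ratio(*counts["known"]),
        "unknown": ratio(*counts["unknown"]),
        "overall": ratio(all_correct, all_total),
    }

# Toy data for illustration only (not taken from any corpus).
train = [[("the", "DET"), ("dog", "N")]]
test = [[("the", "DET"), ("cat", "N")]]
scores = evaluate(lambda w: "DET" if w == "the" else "V", train, test)
```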


Current Performance

• How many tags are correct?
About 98% currently
But baseline is already 90%
Baseline algorithm:

• Tag every word with its most frequent tag

• Tag unknown words as nouns

• How well do people do?


Memory Based Part Of Speech Tagging Experiments With Persian Text


Corpus Study

• At first the corpus had 550 tags.
• The content is gathered from daily news and common texts.
• Each document is assigned a subject such as political, cultural and so on.
In total, there are 4300 different subjects.
This subject categorization provides an ideal experimental environment for clustering, filtering, and categorization research.

• In this research, we simply ignored the subject categories of the documents and concentrated on POS tags.


Selecting Suitable Tags

• At first, the frequency of each tag was gathered.
• Then many of the tags were grouped together and a smaller tag set was produced.
• Each tag in the tag set is placed in a hierarchical structure.

As an example, consider the tag “N_PL_LOC”:

N stands for a noun

PL marks the noun as plural

LOC marks the noun as referring to a location
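
A hierarchical tag of this form can be taken apart level by level. The sketch below assumes that levels are separated by underscores, as in the example; the helper names are hypothetical, and the actual grouping used to build the smaller tag set is not specified here.

```python
def split_tag(tag):
    """Split a hierarchical tag such as 'N_PL_LOC' into its levels."""
    return tag.split("_")

def coarse_tag(tag, levels=1):
    """Keep only the first `levels` levels of a hierarchical tag."""
    return "_".join(split_tag(tag)[:levels])

parts = split_tag("N_PL_LOC")                  # ['N', 'PL', 'LOC']
top_level = coarse_tag("N_PL_LOC")             # 'N'
two_levels = coarse_tag("N_PL_LOC", levels=2)  # 'N_PL'
```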


The Tags Distribution


Max, Min, AVG, Total # of Tags in The Training Set


Number of Different Tags

For instance, the word “آسمان”, which means “the sky” in English, is always tagged "N_SING" in the whole corpus; but a word like “بالا”, which means “high” or “above”, has been tagged with several tags ("ADJ_SIM", "ADV", "ADV_NI", "N_SING", "P", and "PRO").
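
Counting how many different tags each word receives is a simple pass over the tagged tokens. A minimal sketch, assuming the corpus is available as `(word, tag)` pairs; the token list below only reproduces the two words mentioned above, not real corpus frequencies.

```python
from collections import defaultdict

def tags_per_word(tagged_tokens):
    """Map each word to the set of distinct tags it was observed with."""
    observed = defaultdict(set)
    for word, tag in tagged_tokens:
        observed[word].add(tag)
    return observed

# Illustrative tokens built from the example in the text.
tokens = [("آسمان", "N_SING"), ("آسمان", "N_SING")]
tokens += [("بالا", t) for t in ["ADJ_SIM", "ADV", "ADV_NI", "N_SING", "P", "PRO"]]

ambiguity = tags_per_word(tokens)
n_sky = len(ambiguity["آسمان"])    # unambiguous word
n_above = len(ambiguity["بالا"])   # ambiguous word
```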


Classifying the Rare Words

Tags that occur fewer than 5000 times in the corpus are gathered into the “ETC” group.

Tag distribution after grouping:

N_SING    38%
P         12%
ETC       12%
DELM      10%
ADJ_SIM   9%
CON       8%
N_PL      6%
V_PA      3%
PRO       2%
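
The grouping of rare tags can be sketched as follows. The function name and the `(word, tag)` representation are assumptions for illustration; the threshold of 5000 occurrences and the “ETC” label come from the text.

```python
from collections import Counter

def group_rare_tags(tagged_tokens, min_count=5000, rare_label="ETC"):
    """Replace every tag seen fewer than `min_count` times with `rare_label`."""
    tag_counts = Counter(tag for _, tag in tagged_tokens)
    return [
        (word, tag if tag_counts[tag] >= min_count else rare_label)
        for word, tag in tagged_tokens
    ]

# Tiny illustrative input with a lowered threshold.
sample = [("a", "N_SING"), ("b", "N_SING"), ("c", "INT")]
grouped = group_rare_tags(sample, min_count=2)
```

With the lowered threshold, the tag that occurs only once is relabelled while the frequent tag is kept.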


Bijankhan Corpus


Implemented Methods

• MLE Based POS Tagger

• Neural Network POS Tagger

• Memory Based POS Tagger







Memory-Based POS Tagging

• Memory-based POS tagging is also called Lazy Learning, Example-Based Learning or Case-Based Learning.

• MBT uses some specifications of each word such as its possible tags, and a fixed width context as features.

• We used MBT, a tool for memory based tagger generation and tagging. (available at: http://ilk.uvt.nl/mbt/)





The MBT tool generates a tagger by working through the annotated corpus and creating three data structures:
• a lexicon, associating words to tags as evident in the training corpus
• a case base for known words (words occurring in the lexicon)
• a case base for unknown words


Selecting appropriate feature sets for known and unknown words has an important impact on the accuracy of the results.


After different experiments, we chose “ddfa” as the feature set for known words.

With “ddfa”, the tag for each known word is chosen based on the disambiguated tags of the two words before it and the possible tags of the word after it.

d d f a

• d stands for the disambiguated tag of a preceding word (one d for each of the two words before the focus word)
• f means the focus (current) word
• a is the ambiguous word after the current word (represented by its possible tags)


The feature set chosen for unknown words is “dFass”.

d F a s s

• d is the disambiguated tag of the word before the current word
• F is the current word
• a stands for the ambiguous tags of the word after the current word
• ss are two suffix letters of the current word

The F in the unknown-word features indicates the position of the focus word; it is not included in the actual feature set for tagging.
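
To make the two patterns concrete, the sketch below builds the feature tuples by hand. It is not MBT's implementation: the function names, the padding symbol, the way an ambiguity class is written (possible tags joined with "-"), the choice to represent the focus word by its ambiguity class, and the English toy lexicon are all assumptions for illustration.

```python
def ambiguity_class(word, lexicon):
    """Possible tags of a word joined into one symbol; 'UNKNOWN' if unseen."""
    return "-".join(sorted(lexicon.get(word, {"UNKNOWN"})))

def known_word_features(words, predicted_tags, lexicon, i, pad="_"):
    """'ddfa': two left tags, the focus word's class, the right word's class."""
    d2 = predicted_tags[i - 2] if i >= 2 else pad
    d1 = predicted_tags[i - 1] if i >= 1 else pad
    focus = ambiguity_class(words[i], lexicon)
    after = ambiguity_class(words[i + 1], lexicon) if i + 1 < len(words) else pad
    return (d2, d1, focus, after)

def unknown_word_features(words, predicted_tags, lexicon, i, pad="_"):
    """'dFass': one left tag, the right word's class, two suffix letters.

    The focus word itself (F) only marks the position and is not a feature.
    """
    d1 = predicted_tags[i - 1] if i >= 1 else pad
    after = ambiguity_class(words[i + 1], lexicon) if i + 1 < len(words) else pad
    suffix = words[i][-2:].rjust(2, pad)
    return (d1, after, suffix[0], suffix[1])

# Toy example; the lexicon and tags are hypothetical.
lexicon = {"the": {"DET"}, "old": {"ADJ", "N"}, "man": {"N", "V"}}
words = ["the", "old", "man", "sleeps"]
tags_so_far = ["DET", "ADJ", "N"]  # tags already chosen, left to right

known = known_word_features(words, tags_so_far, lexicon, 2)
unknown = unknown_word_features(words, tags_so_far, lexicon, 3)
```

Each tuple would then be looked up in the corresponding case base (known or unknown words) to find the most similar stored cases.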


MBT Results- Known Words

“ddfa”


MBT Results- Unknown Words

“dFass”


MBT Results- Overall







Maximum Likelihood Estimation

As a benchmark of POS tagging accuracy, we chose the Maximum Likelihood Estimation (MLE) approach.

• Calculate the maximum likelihood probability of each tag assigned to each word in the training set.

• For each word, choose the tag with the greatest maximum likelihood probability (the designated tag) and make it the only tag assignable to that word.

• To evaluate this method, we assign the designated tags to the words in the test set.


Maximum Likelihood Estimation

Occurrence   Word          Tag       MLE
1            پدرانه        ADV_NI    0.1667
5            پدرانه        ADJ_SIM   0.8333
4            پديدار        ADJ_SIM   0.1538
22           پديدار        N_SING    0.8462
1            پذيرفته       N_SING    0.0096
3            پذيرفته       ADJ_SIM   0.0288
6            پذيرفته       V_PA      0.0577
94           پذيرفته       ADJ_INO   0.9038
2            پراكنده اند   V_PRE     0.5000
2            پراكنده اند   V_PA      0.5000
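
The MLE column is the count of a word with a given tag divided by the total count of that word. The sketch below recomputes the values for the first three words of the table and picks each word's designated tag. The function names and the `unknown_tag` fallback parameter are assumptions for illustration; the tied fourth word is left out because the tie-breaking rule is not stated.

```python
from collections import defaultdict

# (occurrences, word, tag) rows copied from the table.
ROWS = [
    (1, "پدرانه", "ADV_NI"), (5, "پدرانه", "ADJ_SIM"),
    (4, "پديدار", "ADJ_SIM"), (22, "پديدار", "N_SING"),
    (1, "پذيرفته", "N_SING"), (3, "پذيرفته", "ADJ_SIM"),
    (6, "پذيرفته", "V_PA"), (94, "پذيرفته", "ADJ_INO"),
]

def mle_table(rows):
    """Return {word: {tag: count(word, tag) / count(word)}}."""
    totals = defaultdict(int)
    for count, word, _ in rows:
        totals[word] += count
    probs = defaultdict(dict)
    for count, word, tag in rows:
        probs[word][tag] = count / totals[word]
    return probs

def designated_tag(probs, word, unknown_tag="N_SING"):
    """Most probable tag for a known word, or a fallback for unknown words."""
    if word not in probs:
        return unknown_tag
    return max(probs[word], key=probs[word].get)

PROBS = mle_table(ROWS)
first_word_prob = round(PROBS["پدرانه"]["ADJ_SIM"], 4)  # 5 / 6
third_word_tag = designated_tag(PROBS, "پذيرفته")
```

For example, 5 / (1 + 5) = 0.8333 and 94 / (1 + 3 + 6 + 94) = 0.9038, which agree with the table.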


MLE Results-Known Words


MLE Results- Unknown Words, “DEFAULT”

For each unknown word we assign the “DEFAULT” tag.


MLE Results- Overall, “DEFAULT”

For each unknown word we assign the “DEFAULT” tag.


MLE Results- Unknown Words, “N_SING”

For each unknown word we assign the “N_SING” tag.


MLE Results- Overall, “N_SING”

For each unknown word we assign the “N_SING” tag, the most frequently assigned tag.


Comparison With Other Languages







Neural Network

Each unit corresponds to one of the tags in the tag set.

Input layer: preceding words, current word, following words.


Neural Network

• For each POS tag $pos_j$ and each of the $p+1+f$ words in the context, there is an input unit whose activation $in_{i,j}$ represents the probability that $word_i$ has POS $pos_j$.

Input representation for the currently tagged word and the following words:

The activation value for the preceding words:
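
A minimal sketch of one way to assemble such an input layer. It assumes that the current and following words use lexical probabilities p(tag | word) estimated from the training data, and that preceding (already tagged) words reuse the network's earlier output activations. Both choices, along with the function and parameter names, are assumptions for illustration rather than the implementation used in these experiments.

```python
def build_input(lexical_probs, previous_outputs, context_words, tagset, p, f):
    """Concatenate one block of len(tagset) activations per context position.

    lexical_probs: {word: {tag: p(tag | word)}} estimated from training data.
    previous_outputs: output vectors of the network for the p preceding words
        (assumption: already tagged words reuse earlier outputs).
    context_words: the current word followed by the f following words.
    """
    assert len(previous_outputs) == p and len(context_words) == 1 + f
    vector = []
    for output in previous_outputs:      # p preceding words
        vector.extend(output)
    for word in context_words:           # current word + f following words
        probs = lexical_probs.get(word, {})
        vector.extend(probs.get(tag, 0.0) for tag in tagset)
    return vector

# Toy values for illustration only.
tagset = ["N", "V"]
lexical_probs = {"walk": {"N": 0.25, "V": 0.75}, "home": {"N": 1.0}}
x = build_input(lexical_probs, [[0.9, 0.1]], ["walk", "home"], tagset, p=1, f=1)
```

The resulting vector has (p + 1 + f) × (number of tags) entries, one per input unit.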


Neural Network Results on Bijankhan Corpus

Training Algorithm         No. of Hidden Layer   No. of Input for Train   Training Duration (Hour)   No. of Input for Test   Accuracy
MLP                        2                     1mil                     120:00:87                  1000                    Too Low
MLP                        3                     1mil                     ?                          1000                    Too Low
Generalized Feed Forward   1                     1mil                     95:30:57                   1000                    Too Low
Generalized Feed Forward   2                     1mil                     ?                          1000                    Too Low
Generalized Feed Forward   2                     20000                    1:53:35                    1000                    58%


Neural Network on Other Languages

English


Neural Network on Other Languages

Chinese


Future Work

• Using more than 1 level POS tags.

• Unsupervised POS tagging using Hamshahri Collection

• Investigation of other methods for Persian POS tagging such as Support Vector Machine (SVM) based tagging

• KASRE YE EZAFE in Persian!


Thank You

Space for Question?
