Jan 18, 2016

Transcript
Page 1: Persian Part Of Speech Tagging

1

Persian Part Of Speech Tagging

Mostafa Keikha

Database Research Group (DBRG)

ECE Department, University of Tehran

Page 2: Persian Part Of Speech Tagging

2

Decision Trees

Decision Tree (DT): a tree where the root and each internal node is labeled with a question. The arcs represent the possible answers to the associated question. Each leaf node represents a prediction of a solution to the problem. Decision trees are a popular technique for classification; the leaf node indicates the class to which the corresponding tuple belongs.

Page 3: Persian Part Of Speech Tagging

3

Decision Tree Example

Page 4: Persian Part Of Speech Tagging

4

Decision Trees

A decision tree model is a computational model consisting of three parts: the tree itself, an algorithm to create the tree, and an algorithm that applies the tree to data.

Creation of the tree is the most difficult part. Processing is basically a search similar to that in a binary search tree (although a DT need not be binary).
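The search-like application of a tree can be sketched in Python; the tree, its questions, and the tag labels below are hypothetical examples, not taken from the slides:

```python
# Minimal sketch of applying a decision tree: walk from the root,
# answering each node's question, until a leaf (class label) is reached.
# The tree below is invented for illustration.

tree = {
    "question": "ends_with_ha",           # question asked at this node
    "branches": {                          # one arc per possible answer
        True:  "N",                        # leaf: predicted tag
        False: {
            "question": "follows_verb",
            "branches": {True: "ADV", False: "N"},
        },
    },
}

def classify(node, features):
    """Search down the tree, like a (not necessarily binary) BST lookup."""
    while isinstance(node, dict):          # internal node: ask its question
        answer = features[node["question"]]
        node = node["branches"][answer]    # follow the matching arc
    return node                            # leaf = predicted class

print(classify(tree, {"ends_with_ha": False, "follows_verb": True}))  # ADV
```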

Page 5: Persian Part Of Speech Tagging

5

Decision Tree Algorithm

Page 6: Persian Part Of Speech Tagging

6

Using DT in POS Tagging

Compute ambiguity classes:
- Each term may have different tags.
- The ambiguity class of a term is the set of all its possible tags.
- Compute the number of occurrences of each tag in each ambiguity class.

Ambiguity Class | # of occurrences
a b c d         | 10 20 25 40
b c d           | 40 39 50
b d             | 60 55
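The two steps above can be sketched on a toy tagged corpus; the words and single-letter tags below are invented for illustration (the slides' real data is a Persian corpus):

```python
from collections import Counter, defaultdict

# Toy tagged corpus of (word, tag) pairs, invented for illustration.
corpus = [("mi", "b"), ("mi", "d"), ("mi", "b"), ("ke", "a"),
          ("ke", "c"), ("ke", "b"), ("ke", "d"), ("ra", "b")]

# Ambiguity class of a term = the set of all tags it was seen with.
tags_per_word = defaultdict(set)
for word, tag in corpus:
    tags_per_word[word].add(tag)

# Count occurrences of each tag within each ambiguity class.
class_counts = defaultdict(Counter)
for word, tag in corpus:
    amb_class = tuple(sorted(tags_per_word[word]))
    class_counts[amb_class][tag] += 1

print(dict(class_counts[("b", "d")]))   # {'b': 2, 'd': 1}
```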

Page 7: Persian Part Of Speech Tagging

7

Using DT in POS Tagging

Create a decision tree on the ambiguity classes: at each level, delete the tag with the minimum number of occurrences.

a b c d (10 20 25 40) → b c d (40 39 50) → b d (60 55) → b
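The level-by-level deletion can be sketched as follows; the per-level counts are copied from the slide's example (after each deletion, the counts are re-tallied over the reduced class):

```python
# At each level of the tree, delete the tag with the minimum
# occurrence count. Counts per level are the slide's example values,
# recomputed after each deletion.
levels = [
    {"a": 10, "b": 20, "c": 25, "d": 40},   # a b c d
    {"b": 40, "c": 39, "d": 50},            # b c d
    {"b": 60, "d": 55},                     # b d
]

path = []                                    # tags deleted, in order
for counts in levels:
    weakest = min(counts, key=counts.get)    # tag with minimum occurrence
    path.append(weakest)

survivor = (set(levels[-1]) - set(path)).pop()
print(path, survivor)   # ['a', 'c', 'd'] b
```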

Page 8: Persian Part Of Speech Tagging

8

Using DT in POS Tagging

Advantages:
- Easy to understand
- Easy to implement

Disadvantage:
- Context independent

Page 9: Persian Part Of Speech Tagging

9

Using DT in POS Tagging

Known Tokens Results

Run     | % of tokens | Tokens   | Correct  | Accuracy
1       | 97.97       | 393923   | 363764   | 92.34%
2       | 98.06       | 355630   | 328965   | 92.50%
3       | 97.96       | 397528   | 367789   | 92.51%
4       | 97.92       | 410561   | 381578   | 92.94%
5       | 97.97       | 403079   | 372305   | 92.36%
Average | 97.976      | 392144.2 | 362880.2 | 92.474%

Page 10: Persian Part Of Speech Tagging

11

POS tagging using HMMs

Let W be a sequence of words W = w1 , w2 , … , wn

Let T be the corresponding tag sequence T = t1 , t2 , … , tn

Task: find the T that maximizes P(T | W):

T' = argmax_T P(T | W)

Page 11: Persian Part Of Speech Tagging

12

POS tagging using HMMs

By Bayes' rule,

P(T | W) = P(W | T) * P(T) / P(W)

Since P(W) does not depend on T, it can be dropped from the argmax:

T' = argmax_T P(W | T) * P(T)

Transition Probability,

P ( T ) = P ( t1 ) * P ( t2 | t1 ) * P ( t3 | t1 t2 ) …… * P ( tn | t1 … tn-1 )

Applying Tri-gram approximation,

P ( T ) = P ( t1 ) * P ( t2 | t1 ) * P ( t3 | t1 t2 ) …… * P ( tn | tn-2 tn-1 )

Introducing a dummy tag, $, to represent the beginning of a sentence,

P ( T ) = P ( t1 | $ ) * P ( t2 | $ t1 ) * P ( t3 | t1 t2 ) …… * P ( tn | tn-2 tn-1 )
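The trigram factorization with $ padding can be sketched directly; the transition probabilities below are toy values invented for illustration (a real model estimates them from a corpus):

```python
import math

# Toy trigram transition model P(t3 | t1, t2), invented for illustration.
trigram = {
    ("$", "$", "N"): 0.5,
    ("$", "N", "V"): 0.4,
    ("N", "V", "N"): 0.3,
}

def tag_sequence_prob(tags, trigram):
    """P(T) = prod_i P(t_i | t_{i-2}, t_{i-1}), with '$' padding the start."""
    padded = ["$", "$"] + list(tags)
    logp = 0.0                                   # sum logs to avoid underflow
    for i in range(2, len(padded)):
        logp += math.log(trigram[(padded[i-2], padded[i-1], padded[i])])
    return math.exp(logp)

p = tag_sequence_prob(["N", "V", "N"], trigram)
print(round(p, 3))   # 0.5 * 0.4 * 0.3 = 0.06
```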

Page 12: Persian Part Of Speech Tagging

13

POS tagging using HMMs

Smoothing Transition Probabilities

Sparse data problem

Linear interpolation method

P'(ti | ti-2 , ti-1) = λ1 P(ti) + λ2 P(ti | ti-1) + λ3 P(ti | ti-2 , ti-1)

such that the λs sum to 1
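The interpolation formula can be sketched directly; the λ values below are illustrative only (the next slide concerns how they are estimated):

```python
# Linear interpolation of unigram, bigram, and trigram tag probabilities.
# The lambdas must sum to 1; the values here are invented for illustration.
LAMBDAS = (0.1, 0.3, 0.6)   # λ1, λ2, λ3

def smoothed_transition(p_uni, p_bi, p_tri, lambdas=LAMBDAS):
    """P'(ti | ti-2, ti-1) = λ1*P(ti) + λ2*P(ti | ti-1) + λ3*P(ti | ti-2, ti-1)."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9, "lambdas must sum to 1"
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

print(round(smoothed_transition(0.2, 0.5, 0.0), 2))   # 0.1*0.2 + 0.3*0.5 = 0.17
```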

Page 13: Persian Part Of Speech Tagging

14

POS tagging using HMMs

Calculation of λs
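The slide's λ-estimation algorithm is not preserved in this transcript. A common method is deleted interpolation (used, e.g., in Brants' TnT tagger); the sketch below follows that approach as an assumption, not necessarily the slides' exact method:

```python
from collections import Counter

def deleted_interpolation(tags):
    """Estimate (λ1, λ2, λ3) from a tag sequence by deleted interpolation."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for i, t in enumerate(tags):
        uni[t] += 1
        if i >= 1:
            bi[(tags[i-1], t)] += 1
        if i >= 2:
            tri[(tags[i-2], tags[i-1], t)] += 1
    n = len(tags)
    lambdas = [0.0, 0.0, 0.0]
    for (t1, t2, t3), c in tri.items():
        # Delete the current trigram once and compare the maximum-likelihood
        # estimates of each order; credit the best-scoring order.
        cases = [
            (uni[t3] - 1) / (n - 1) if n > 1 else 0.0,                   # unigram
            (bi[(t2, t3)] - 1) / (uni[t2] - 1) if uni[t2] > 1 else 0.0,  # bigram
            (c - 1) / (bi[(t1, t2)] - 1) if bi[(t1, t2)] > 1 else 0.0,   # trigram
        ]
        lambdas[cases.index(max(cases))] += c
    total = sum(lambdas)
    return tuple(l / total for l in lambdas)   # normalize so the λs sum to 1

lams = deleted_interpolation(["N", "V", "N", "V", "N", "ADJ", "N"])
print(lams, sum(lams))
```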

Page 14: Persian Part Of Speech Tagging

15

POS tagging using HMMs

Emission Probability,

P(W | T ) ≈ P(w1 | t1) * P(w2 | t2) * . . . * P(wn | tn)

Context Dependency

To make the emission probability more dependent on the context, it is calculated as:

P(W | T ) ≈ P(w1 | $ t1) * P(w2 | t1 t2) ...* P(wn | tn-1 tn)

Page 15: Persian Part Of Speech Tagging

16

POS tagging using HMMs

A smoothing technique is applied:

P'(wi | ti-1 ti) = θ1 P(wi | ti) + θ2 P(wi | ti-1 ti), where the θs sum to 1.

The θs are different for different words.
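As a sketch of this emission smoothing (the θ values below are invented for illustration):

```python
# Smoothed emission probability:
# P'(wi | ti-1, ti) = θ1*P(wi | ti) + θ2*P(wi | ti-1, ti), with θ1 + θ2 = 1.
# The θs are chosen per word; the values here are made up.
def smoothed_emission(p_word_given_tag, p_word_given_tagpair, thetas):
    t1, t2 = thetas
    assert abs(t1 + t2 - 1.0) < 1e-9, "thetas must sum to 1"
    return t1 * p_word_given_tag + t2 * p_word_given_tagpair

# A rarer word might lean more on the context-free estimate:
print(round(smoothed_emission(0.01, 0.04, thetas=(0.7, 0.3)), 4))  # 0.019
```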

Page 16: Persian Part Of Speech Tagging

17

POS tagging using HMMs


Page 17: Persian Part Of Speech Tagging

18

POS tagging using HMMs

Page 18: Persian Part Of Speech Tagging

19

POS tagging using HMMs

Page 19: Persian Part Of Speech Tagging

20

POS tagging using HMMs

Lexicon generation probability

Page 20: Persian Part Of Speech Tagging

21

POS tagging using HMMs

Page 21: Persian Part Of Speech Tagging

22

POS tagging using HMMs

P(N V ART N | flies like a flower) = 4.37 × 10^-6
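The worked figures from these slides are not preserved in the transcript; HMM taggers typically recover the argmax tag sequence with the Viterbi algorithm. A minimal bigram sketch, with all probabilities invented for illustration, that selects the sequence N V ART N:

```python
import math

# Toy bigram HMM. All probabilities are invented for illustration;
# a real tagger estimates them from a tagged corpus.
tags = ["N", "V", "ART"]
start = {"N": 0.5, "V": 0.2, "ART": 0.3}
trans = {("N", "V"): 0.4, ("N", "N"): 0.3, ("N", "ART"): 0.3,
         ("V", "ART"): 0.6, ("V", "N"): 0.3, ("V", "V"): 0.1,
         ("ART", "N"): 0.8, ("ART", "V"): 0.1, ("ART", "ART"): 0.1}
emit = {("flies", "N"): 0.02, ("flies", "V"): 0.01,
        ("like", "V"): 0.05, ("like", "N"): 0.001,
        ("a", "ART"): 0.3, ("flower", "N"): 0.01}

def viterbi(words):
    """Return the most probable tag sequence for `words` (log-space)."""
    # best[t] = (log-prob of best path ending in tag t, that path)
    best = {t: (math.log(start[t] * emit.get((words[0], t), 1e-12)), [t])
            for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            p, prev = max(
                (best[s][0] + math.log(trans.get((s, t), 1e-12)), s)
                for s in tags)                       # best predecessor tag
            new[t] = (p + math.log(emit.get((w, t), 1e-12)),
                      best[prev][1] + [t])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["flies", "like", "a", "flower"]))   # ['N', 'V', 'ART', 'N']
```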

Page 22: Persian Part Of Speech Tagging

23

POS tagging using HMMs

Known Tokens Results

Run     | % of tokens | Tokens   | Correct | Accuracy
1       | 98.07       | 394290   | 382211  | 96.94%
2       | 98.16       | 345913   | 345913  | 97.18%
3       | 98.04       | 397849   | 343894  | 96.96%
4       | 98.02       | 410970   | 398487  | 96.96%
5       | 98.07       | 403460   | 391475  | 97.03%
Average | 98.072      | 390496.4 | 372396  | 97.01%

Page 23: Persian Part Of Speech Tagging

24

Unknown Tokens Results

Run     | % of tokens | Tokens | Correct | Accuracy
1       | 1.93        | 7760   | 5829    | 75.12%
2       | 1.84        | 6689   | 5357    | 80.09%
3       | 1.96        | 7956   | 6153    | 77.34%
4       | 1.98        | 8283   | 6435    | 77.69%
5       | 1.93        | 7945   | 6246    | 78.62%
Average | 1.928       | 7726.6 | 6004    | 77.77%

Page 24: Persian Part Of Speech Tagging

25

Overall Results

Run     | Tokens   | Correct  | Accuracy
1       | 402050   | 388040   | 96.52%
2       | 362658   | 351270   | 96.86%
3       | 405805   | 391890   | 96.57%
4       | 419253   | 404922   | 96.58%
5       | 411405   | 397721   | 96.67%
Average | 400234.2 | 386768.6 | 96.64%