Part-of-Speech Tagging for Bengali with Hidden Markov Model

Dept. of Computer Science & Engg.

Indian Institute of Technology Kharagpur


Sandipan Dandapat, Sudeshna Sarkar





Machine Learning to Resolve POS Tagging

HMM, supervised (DeRose, 88; Meteer, 91; Brants, 2000; etc.)

HMM, semi-supervised (Cutting, 92; Merialdo, 94; Kupiec, 92; etc.)

Maximum Entropy (Ratnaparkhi,96; etc.)

TB(ED)L (Brill,92,94,95; etc.)

Decision Tree (Black,92; Marquez,97; etc.)



Our Approach: HMM-based

Advantages: simplicity of the model, language independence, reasonably good accuracy

Drawbacks: data intensive; sparseness problem when extending the order

We adopt a first-order HMM



POS Tagging Schema

[Diagram: Raw text → POS tagging (Language Model + Disambiguation Algorithm, with possible POS class restriction …) → Tagged text]



POS Tagging: Our Approach

[Diagram: Raw text → POS tagging (first-order HMM) → Tagged text]

First-order HMM: the current state depends only on the previous state

$$S = \arg\max_{t_1 \ldots t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$


Model Parameters of the First-order HMM

µ = (π, A, B)

$\pi = \{P_{start}(t_i)\}$

$A = \{P(t_i \mid t_{i-1})\}$

$B = \{P(w_i \mid t_i)\}$




Candidate tags for each word $w_i$ in the first-order HMM:

$t_i \in \{T\}$

or

$t_i \in T_{MA}(w_i)$

$\{T\}$ : Set of all tags

$T_{MA}(w_i)$ : Set of tags computed by the Morphological Analyzer




Viterbi Algorithm

[Diagram: Raw text → POS tagging (first-order HMM µ = (π, A, B), decoded with the Viterbi algorithm over $t_i \in \{T\}$ or $t_i \in T_{MA}(w_i)$) → Tagged text]




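$$S = \arg\max_{t_1 \ldots t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$

[Trellis: text $w_1\, w_2\, w_3 \ldots w_n$, with every tag as a candidate at each position]

where $t_i \in \{T\}$ for all $w_i$, and $\{T\}$ = set of tags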




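$$S = \arg\max_{t_1 \ldots t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$

[Trellis: text $w_1\, w_2\, w_3 \ldots w_n$, with a reduced set of candidate tags at each position]

where $t_i \in T_{MA}(w_i)$ for all $w_i$, and $\{T\}$ = set of tags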
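As an illustration of the decoding step, the following is a minimal sketch of Viterbi search for a first-order HMM, with an optional restriction of the candidate tags per word. It is not the authors' implementation: the function name `viterbi`, the dictionary layout of the parameters, the `allowed` argument standing in for $T_{MA}(w_i)$, and the toy words `w1`, `w2`, `w3` with their probabilities are all assumptions made for the example. It also assumes a non-empty sentence.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, allowed=None):
    """Most probable tag sequence for `words` under a first-order HMM.

    start_p[t], trans_p[prev][t] and emit_p[t][w] are probabilities;
    missing entries count as zero. `allowed` (hypothetical) maps a word to
    the tag set proposed by a morphological analyzer; words not listed
    there, or all words if `allowed` is None, may take any tag.
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    def candidates(w):
        if allowed is not None and w in allowed:
            return allowed[w]
        return tags

    # best[i][t] = (log-probability of the best path ending in t at i, backpointer)
    best = [{}]
    for t in candidates(words[0]):
        score = log(start_p.get(t, 0)) + log(emit_p.get(t, {}).get(words[0], 0))
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in candidates(words[i]):
            e = log(emit_p.get(t, {}).get(words[i], 0))
            best[i][t] = max(
                (best[i - 1][p][0] + log(trans_p.get(p, {}).get(t, 0)) + e, p)
                for p in best[i - 1]
            )
    # Follow the backpointers from the best final tag.
    last = max(best[-1], key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]

# Toy parameters (invented for illustration only).
tags = ["NN", "JJ", "VFM"]
start_p = {"NN": 0.6, "JJ": 0.3, "VFM": 0.1}
trans_p = {
    "NN": {"NN": 0.3, "JJ": 0.1, "VFM": 0.6},
    "JJ": {"NN": 0.8, "JJ": 0.1, "VFM": 0.1},
    "VFM": {"NN": 0.5, "JJ": 0.3, "VFM": 0.2},
}
emit_p = {
    "NN": {"w1": 0.2, "w2": 0.7, "w3": 0.1},
    "JJ": {"w1": 0.8, "w2": 0.1, "w3": 0.1},
    "VFM": {"w1": 0.05, "w2": 0.05, "w3": 0.9},
}
words = ["w1", "w2", "w3"]

print(viterbi(words, tags, start_p, trans_p, emit_p))  # ['JJ', 'NN', 'VFM']
print(viterbi(words, tags, start_p, trans_p, emit_p,
              allowed={"w1": {"NN"}}))                 # ['NN', 'NN', 'VFM']
```

The second call shows the effect of the restriction: once the analyzer is assumed to allow only NN for `w1`, the search space at that position shrinks and the best path changes.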



Learning HMM Parameters: Supervised Learning (HMM-S)

Estimates the three parameters directly from the tagged corpus

$$P_{start}(t_i) = \frac{\text{no. of sentences which begin with } t_i}{\text{no. of sentences}}$$

$$P(t_i \mid t_{i-1}) = \frac{count(t_{i-1}\, t_i)}{count(t_{i-1})}$$

$$P(w_i \mid t_i) = \frac{count(w_i \text{ with } t_i)}{count(t_i)}$$
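The three relative-frequency estimates can be sketched in a few lines. This is a hedged illustration rather than the system's code: `estimate_hmm`, the input layout (a list of sentences, each a list of (word, tag) pairs) and the toy corpus are assumptions. Following the formulas, the transition denominator is the total count of the previous tag.

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences):
    """Relative-frequency estimates of (pi, A, B) from (word, tag) sentences."""
    start, tag_count = Counter(), Counter()
    bigram, emit = Counter(), Counter()
    n_sentences = 0
    for sent in tagged_sentences:
        if not sent:
            continue  # skip empty sentences
        n_sentences += 1
        start[sent[0][1]] += 1
        prev = None
        for word, tag in sent:
            tag_count[tag] += 1
            emit[(tag, word)] += 1
            if prev is not None:
                bigram[(prev, tag)] += 1
            prev = tag
    # P_start(t) = sentences beginning with t / number of sentences
    pi = {t: c / n_sentences for t, c in start.items()}
    # P(t | prev) = count(prev t) / count(prev)
    A = defaultdict(dict)
    for (prev, tag), c in bigram.items():
        A[prev][tag] = c / tag_count[prev]
    # P(w | t) = count(w with t) / count(t)
    B = defaultdict(dict)
    for (tag, word), c in emit.items():
        B[tag][word] = c / tag_count[tag]
    return pi, dict(A), dict(B)

# Toy tagged corpus (invented for illustration only).
corpus = [
    [("w1", "JJ"), ("w2", "NN"), ("w3", "VFM")],
    [("w2", "NN"), ("w3", "VFM")],
    [("w1", "NN"), ("w3", "VFM")],
]
pi, A, B = estimate_hmm(corpus)
print(pi["NN"], A["NN"]["VFM"], B["NN"]["w1"])
```

Unseen tag bigrams and word/tag pairs simply do not appear in `A` and `B` here, which is the gap that smoothing has to fill.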



Learning HMM Parameters: Semi-supervised Learning (HMM-SS)

Untagged data (observations) are used to find the model that most likely produces the observation sequence

An initial model is created from the tagged training data

Based on the initial model and the untagged data, the model parameters are updated

$$\arg\max_{\mu} P(O_{untagged} \mid \mu)$$

New model parameters $\hat{\mu}$ are estimated using the Baum-Welch algorithm

$$P(O \mid \hat{\mu}) \geq P(O \mid \mu)$$
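To make the re-estimation step concrete, here is a small sketch of one Baum-Welch (EM) update for a discrete HMM, followed by a check that the likelihood of the observations does not decrease. It is a simplified illustration under stated assumptions: a single short observation sequence, states and symbols encoded as integer indices, no scaling of the forward and backward values, and strictly positive starting parameters. The function names and the toy numbers are invented for the example.

```python
def forward(obs, pi, A, B):
    """alpha[i][s] = P(o_1..o_i, state_i = s)."""
    n = len(pi)
    alpha = [[pi[s] * B[s][obs[0]] for s in range(n)]]
    for i in range(1, len(obs)):
        alpha.append([
            B[s][obs[i]] * sum(alpha[i - 1][r] * A[r][s] for r in range(n))
            for s in range(n)
        ])
    return alpha

def backward(obs, A, B):
    """beta[i][s] = P(o_{i+1}..o_T | state_i = s)."""
    n, T = len(A), len(obs)
    beta = [[1.0] * n for _ in range(T)]
    for i in range(T - 2, -1, -1):
        for s in range(n):
            beta[i][s] = sum(A[s][r] * B[r][obs[i + 1]] * beta[i + 1][r]
                             for r in range(n))
    return beta

def likelihood(obs, pi, A, B):
    return sum(forward(obs, pi, A, B)[-1])

def baum_welch_step(obs, pi, A, B):
    """One EM re-estimation of (pi, A, B) from a single observation sequence."""
    n, T, V = len(pi), len(obs), len(B[0])
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    p_obs = sum(alpha[-1])
    # gamma[i][s]: posterior probability of state s at position i
    gamma = [[alpha[i][s] * beta[i][s] / p_obs for s in range(n)]
             for i in range(T)]
    # xi[i][r][s]: posterior probability of the transition r -> s at position i
    xi = [[[alpha[i][r] * A[r][s] * B[s][obs[i + 1]] * beta[i + 1][s] / p_obs
            for s in range(n)] for r in range(n)] for i in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[i][r][s] for i in range(T - 1)) /
              sum(gamma[i][r] for i in range(T - 1))
              for s in range(n)] for r in range(n)]
    new_B = [[sum(gamma[i][s] for i in range(T) if obs[i] == v) /
              sum(gamma[i][s] for i in range(T))
              for v in range(V)] for s in range(n)]
    return new_pi, new_A, new_B

# Toy initial model and untagged observation sequence (invented).
pi0 = [0.6, 0.4]
A0 = [[0.7, 0.3], [0.4, 0.6]]
B0 = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 1, 2, 2, 1, 0, 0, 2]

pi1, A1, B1 = baum_welch_step(obs, pi0, A0, B0)
before = likelihood(obs, pi0, A0, B0)
after = likelihood(obs, pi1, A1, B1)
print(after >= before)
```

In the setting of these slides the initial model would come from the tagged data and the update would be accumulated over all untagged sentences; the single-sequence version above only shows the shape of the computation.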



Smoothing and Unknown Word Hypothesis

Not all emissions and transitions are observed in the training data

Add-one smoothing is used to estimate both emission and transition probabilities

Not all words are known to the Morphological Analyzer

Unknown words are assumed to belong to the open-class grammatical categories
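A minimal sketch of both ideas, assuming a fixed list of outcomes for each distribution. The helper names, the three-tag list and the counts are hypothetical, and which tags count as open classes is not stated on the slide, so `open_class_tags` is left as a parameter.

```python
def add_one_distribution(counts, outcomes):
    """Add-one (Laplace) smoothed distribution over a fixed list of outcomes."""
    total = sum(counts.get(o, 0) for o in outcomes)
    return {o: (counts.get(o, 0) + 1) / (total + len(outcomes)) for o in outcomes}

def candidate_tags(word, ma_tags, open_class_tags):
    """Tags proposed by the analyzer, or the open classes if the word is unknown to it."""
    return ma_tags.get(word) or set(open_class_tags)

# Toy counts of the tag that follows some fixed previous tag (invented).
tags = ["NN", "JJ", "VFM"]
next_tag_counts = {"NN": 3, "VFM": 1}
smoothed = add_one_distribution(next_tag_counts, tags)
# JJ was never observed in this context but still receives (0 + 1) / (4 + 3) = 1/7.
print(smoothed)

ma_tags = {"w1": {"NN"}}  # hypothetical analyzer output
print(candidate_tags("w1", ma_tags, ["NN", "JJ"]))       # {'NN'}
print(candidate_tags("unseen", ma_tags, ["NN", "JJ"]) == {"NN", "JJ"})  # True
```

The same `add_one_distribution` applies to emissions if `outcomes` is the vocabulary instead of the tag list.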



Experiments

Baseline Model

Supervised bigram HMM (HMM-S): HMM-S, HMM-S + IMA, HMM-S + CMA

Semi-supervised bigram HMM (HMM-SS): HMM-SS, HMM-SS + IMA, HMM-SS + CMA



Data Used

Tagged data: 3085 sentences (~41,000 words)

Includes both the data in non-privileged and privileged mode

Untagged corpus from CIIL: 11,000 sentences (100,000 words), unclean

Used to re-estimate the model parameters with the Baum-Welch algorithm



Tagset and Corpus Ambiguity

Tagset consists of 27 grammatical classes

Corpus ambiguity: mean number of possible tags for each word, measured in the tagged training data

Language    Dutch   Spanish   German   English   French   Bengali
Ambiguity   1.11    1.19      1.3      1.34      1.69     2.09

(Dermatas et al 1995)
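One way to compute such a figure from a tagged corpus is sketched below. The slide does not say whether the mean is taken over tokens or over distinct word forms; this sketch assumes a token-weighted mean, and the function name and toy data are invented.

```python
from collections import defaultdict

def corpus_ambiguity(tagged_tokens):
    """Mean number of distinct tags observed for the word form of each token."""
    tags_of = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_of[word].add(tag)
    return sum(len(tags_of[w]) for w, _ in tagged_tokens) / len(tagged_tokens)

# Toy corpus: w1 is seen with two tags, w2 and w3 with one each (invented).
tokens = [("w1", "JJ"), ("w2", "NN"), ("w3", "VFM"), ("w2", "NN"),
          ("w3", "VFM"), ("w1", "NN"), ("w3", "VFM")]
print(corpus_ambiguity(tokens))  # (2 + 1 + 1 + 1 + 1 + 2 + 1) / 7 = 9/7
```

Averaging over distinct word forms instead would only require iterating over `tags_of.values()`.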



Results on Development Set

[Figure: four plots of tagging accuracy (%, axis 30 to 100) against size of the training corpus (1000x words, 5 to 40). Panels: Baseline; HMM-S, HMM-S + IMA, HMM-S + CMA; ACOPOST; HMM-SS, HMM-SS + IMA, HMM-SS + CMA.]



Results on Development Set

Method          Accuracy (%)
Baseline        69.11
ACOPOST         83.45
HMM-S           74.53
HMM-S + IMA     78.65
HMM-S + CMA     88.83
HMM-SS          73.77
HMM-SS + IMA    77.98
HMM-SS + CMA    89.65

[Figure: bar chart of tagging accuracy (%, axis 85.5 to 90) for HMM-S + CMA and HMM-SS + CMA on known data, seen data and unknown data. Bar labels as extracted: 89.61, 89.03, 87.09, 87.4, 89.36, 88.92.]



Error Analysis

Actual Class   Predicted Class   % of total error   % of class error
NNC            NN                14.2               4.0
VRB            VFM               7.1                8.7
JJ             NN                5.9                1.7
QF             JJ                5.1                3.7
RB             JJ                5.0                3.6
NLOC           NN                4.5                1.3
VNN            VFM               3.7                4.5



Results on Test Set

Tested on 458 sentences (5127 words)

Precision: 84.32%   Recall: 84.36%   Fβ=1: 84.34%

Type    Precision (%)   Recall (%)   Fβ=1    Frequency
SYM     100             99.78        99.89   911
NEG     95.45           100          97.67   44
PRP     95.72           93.18        94.43   257
QFNUM   94.70           91.24        92.94   132

Top 4 classes in terms of F-measure



Results on Test Set (continued)

Type    Precision (%)   Recall (%)   Fβ=1    Frequency
VJJ     0               0            0       0
NVB     0               0            0       28
JVB     0               0            0       12
INF     100             12.5         22.22   1

Bottom 4 classes in terms of F-measure
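The per-class figures in these tables follow the usual definitions of precision, recall and their harmonic mean Fβ=1. A short sketch, with hypothetical function names and toy gold/predicted tag sequences, shows how they can be computed from aligned tag lists:

```python
from collections import Counter

def f1(precision, recall):
    """Harmonic mean of precision and recall (F with beta = 1)."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def per_class_prf(gold, pred):
    """Precision, recall and F1 per tag from aligned gold and predicted tags."""
    tp, gold_n, pred_n = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        gold_n[g] += 1
        pred_n[p] += 1
        if g == p:
            tp[g] += 1
    scores = {}
    for tag in set(gold_n) | set(pred_n):
        precision = tp[tag] / pred_n[tag] if pred_n[tag] else 0.0
        recall = tp[tag] / gold_n[tag] if gold_n[tag] else 0.0
        scores[tag] = (precision, recall, f1(precision, recall))
    return scores

# Toy evaluation (invented): one NN token is mis-tagged as JJ.
gold = ["NN", "NN", "JJ", "VFM"]
pred = ["NN", "JJ", "JJ", "VFM"]
scores = per_class_prf(gold, pred)
print(scores["NN"])  # precision 1.0, recall 0.5

# Consistency check against a reported row: NEG has P = 95.45, R = 100.
print(round(f1(95.45, 100), 2))  # 97.67
```

A class that is never predicted correctly, such as NVB or JVB above, gets precision, recall and F all equal to zero under these definitions.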



Further Improvement

Uses suffix information to handle unknown words

Calculates the probability of a tag given the last m letters (suffix) of a word

The emission probability of each unknown word is normalized

$$P(\text{Unknown\_word} \mid t_i) = \frac{P(t_i \mid \text{Unknown\_word})\, P(\text{Unknown\_word})}{P(t_i)} \approx \frac{P(t_i \mid l_{n-m+1}, \ldots, l_n)\, P(\text{Unknown\_word})}{P(t_i)}$$
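The suffix idea can be sketched as follows. This is one possible reading, not the authors' procedure: P(t | suffix) is estimated by relative frequency over training words sharing the last m letters, the constant P(Unknown_word) is dropped, and "normalized" is interpreted as rescaling the resulting scores to sum to one over tags. The function names and the English toy words are invented for illustration.

```python
from collections import Counter, defaultdict

def train_suffix_model(tagged_words, m):
    """Count tags per suffix of length m, and overall tag counts."""
    suffix_tag = defaultdict(Counter)
    tag_count = Counter()
    for word, tag in tagged_words:
        suffix_tag[word[-m:]][tag] += 1
        tag_count[tag] += 1
    return suffix_tag, tag_count

def unknown_emission(word, m, suffix_tag, tag_count):
    """Scores proportional to P(t | suffix) / P(t), rescaled to sum to 1."""
    counts = suffix_tag.get(word[-m:])
    if not counts:
        return {}  # suffix never seen; the caller needs another fallback
    total_tags = sum(tag_count.values())
    n_suffix = sum(counts.values())
    scores = {t: (c / n_suffix) / (tag_count[t] / total_tags)
              for t, c in counts.items()}
    z = sum(scores.values())
    return {t: s / z for t, s in scores.items()}

# Toy training data (invented): the suffix "ing" is mostly VFM.
train = [("walking", "VFM"), ("talking", "VFM"), ("king", "NN"),
         ("table", "NN"), ("chair", "NN")]
suffix_tag, tag_count = train_suffix_model(train, m=3)
print(unknown_emission("singing", 3, suffix_tag, tag_count))
# P(VFM|ing)/P(VFM) = (2/3)/(2/5) and P(NN|ing)/P(NN) = (1/3)/(3/5),
# which rescale to 0.75 and 0.25
```

Dividing by P(t) matters here: NN is the more frequent tag overall, so its raw suffix probability of 1/3 is discounted further relative to VFM.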



Further Improvement

[Figure: two bar charts of tagging accuracy (%, axis 70 to 100) with legend entries IMA and CMA. First chart, groups HMM-SS, HMM-SS+IMA, HMM-SS+CMA; bar labels as extracted: 73.77, 89.65, 77.98, 90.33, 84.61, 83.33. Second chart, groups HMM-S, HMM-S+IMA, HMM-S+CMA; bar labels as extracted: 90.17, 78.65, 88.83, 74.53, 85.04, 85.95.]

Accuracy reflected on development set



Conclusion and Future Scope

Morphological restriction on tags gives an efficient tagging model even when only a small amount of labeled text is available

Semi-supervised learning performs better than supervised learning when combined with CMA (89.65% vs. 88.83% on the development set)

Better adjustment of emission probabilities can be adopted for both unknown words and less frequent words

A higher-order Markov model can be adopted



Thank You
