Morphology 2 A case study of developing Bengali morph analyzer and generator Sudeshna Sarkar IIT Kharagpur.


Two level morphology

PC-KIMMO is a morphological parser based on Kimmo Koskenniemi's model of two-level morphology (Koskenniemi 1983).

Koskenniemi's model of two-level morphology was based on the traditional distinction that linguists make between morphotactics, which enumerates the inventory of morphemes and specifies in what order they can occur, and morphophonemics, which accounts for alternate forms or "spellings" of morphemes according to the phonological context in which they occur.

For example, the word chased is analyzed morphotactically as the stem chase followed by the suffix -ed.

However, the addition of the suffix -ed apparently causes the loss of the final e of chase; thus chase and chas are allomorphs or alternate forms of the same morpheme.

Koskenniemi's model is "two-level" in the sense that a word is represented as a direct, letter-for-letter correspondence between its lexical or underlying form and its surface form. For example, the word chased is given this two-level representation (where + is a morpheme boundary symbol and 0 is a null character):

Lexical form: c h a s e + e d

Surface form: c h a s 0 0 e d
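The letter-for-letter correspondence can be sketched in Python as follows. This is only an illustration: the function names pair_up and spell_out are hypothetical, and the only conventions taken from the slide are + as the morpheme boundary and 0 as the null character.

```python
NULL = "0"  # null character, as in the slide's notation

def pair_up(lexical: str, surface: str) -> list[tuple[str, str]]:
    """Pair each lexical symbol with the surface symbol in the same position."""
    lex, surf = lexical.split(), surface.split()
    if len(lex) != len(surf):
        raise ValueError("two-level forms must have the same length")
    return list(zip(lex, surf))

def spell_out(symbols) -> str:
    """Drop null symbols to obtain the string as it is actually written."""
    return "".join(s for s in symbols if s != NULL)

pairs = pair_up("c h a s e + e d", "c h a s 0 0 e d")
lexical_form = spell_out(l for l, _ in pairs)   # keeps the boundary symbol +
surface_form = spell_out(s for _, s in pairs)   # nulls disappear
print(lexical_form, surface_form)
```

Each pair such as ("e", "0") or ("+", "0") is one lexical:surface correspondence of the kind that two-level rules constrain.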

Main components of Karttunen's KIMMO parser

1. The rules component: two-level rules that account for regular phonological or orthographic alternations, such as chase versus chas.

2. The lexical component: lists all morphemes (stems and affixes) in their lexical form and specifies morphotactic constraints.

Englex: a two-level description of English morphology

Englex consists of a set of orthographic rules, a 20,000-entry lexicon of roots and affixes, and a word grammar. With Englex and PC-KIMMO, you can morphologically parse English words and text.

Generative rules and 2-level rules

Two-level rules are similar to the rules of standard generative phonology, but differ in several crucial ways. Rule R1 is an example of a generative rule.

R1: t ---> c / ___ i

Rule R2 is the analogous two-level rule.

R2: t:c => ___ i

Generative rules:
Transformational rules
Sequential application
Unidirectional

Two-level rules:
Declarative (they state correspondences)
Apply in parallel
Bidirectional

Hindi Morphology

Hindi noun analysis

A. Noun analysis

Nouns are categorised into 20 different paradigms based on the following criteria:

1. Vowel ending.

2. Valid suffix of a word.

3. Gender, Number, Person and Case information.

A snapshot of the analysis is shown in table 2.1.

There are 20,000 Nouns classified in 20 such paradigms.

Hindi verb analysis

B. Verb Analysis

The Verb Group represents the following grammatical properties:

1. Tense: Present, Past and Future.
2. Aspect: Durative, Stative, Infinitive, Habitual, Perfective, etc.
3. Modal: Abilitive, Deontic, Probabilitative, etc.
4. Gender: Male, Female, Dual.
5. Person: 1st, 2nd and 3rd.

These values formed the basis for listing Verb Groups (VGs) according to their TAM-GNP values. A TAM-GNP matrix containing all possible VGs was developed.

IITB morph analyzer: presently there are 622 unique paradigms in the TAM-GNP matrix.

Bengali Morphology

Morphology: Verb

Attribute 1: Root
Val 0: root word of the given surface form of the word

Attribute 2: Category
Val 0: verb (v)

Attribute 3: Person
Val 0: first, Val 1: second normal, Val 2: second familiar, Val 3: third normal, Val 4: formal (second/third)

Attribute 4: Tense
Val 0: Present, Val 1: Past, Val 2: Future

Attribute 5: Aspect
Val 0: simple, Val 1: continuous, Val 2: perfect

Attribute 6: Modality

Attribute 8: Specificity
Val 0: non-specific, Val 1: specific

Attribute 9: Emphasizer
Val 0: none, Val 1: only, Val 2: also

Attribute 10: Polarity
Val 0: positive, Val 1: negative

Attributes & Values (Verb) :

Person:

First Person-(1),Ami

Second Formal-(2),Apani

Second Normal-(3),tumi

Second Familiar-(4),tui

Third Normal-(5),se

Third Formal-(6),tini

Unspecified


Tense:

Present-(1),kari

Past-(2),karalAma

Future-(3),karaba

Overall-(4)


Aspect:

Simple-(1),karalAma

Habitual-(2),karatAma

Continuous-(3),karachhe

Perfect-(4),karechhi

Indefinite-(5),kari


Modality:

Indicative-(1),kara

Imperative-(2),kar

Subjunctive-(3),karale


Polarity:

Positive-(1),kari

Negative-(2),karini

INFORMATION:VERBS

Total number of categories (based on syllabic structure): 20

Rules: 214 per category

Total number of rules: 214 × 20 = 4280 (approx.)

Bengali Verb Paradigms

Bengali Verb morphology for one of the paradigms

Classification : Nouns

Morphological classification based on different types of nouns:
1. Animate (example: mAnuSha)
2. Inanimate (example: mATi)
3. Abstract/Qualitative (example: daYA)
4. Verbal (example: bhojana)
5. Collective (example: pAla)
6. The Singular (example: chandra)
7. Compounded (example: riksAoYAlA)

Sub Classification :Nouns

Sub-classification based on “root endings”:
1. a-ending root (animate “mAnusha”)
2. A-ending root (animate “bAlikA”)
3. i-ending root (animate “pAkhi”)
4. I-ending root (animate “khukI”)
5. e-ending root (animate “chhele”)
6. o-ending root (animate “myA;o”)
7. u-ending root (animate “shishu”)
8. U-ending root (animate “badhU”)

Classification :Pronouns

Morphological analysis based on different natures of pronouns:
1. Personal (Ami, Apani, -)
2. Inclusive (saba, sakala, ubhaYa, -)
3. Relative (ye, yAhA, -)
4. Interrogative (ke, ki, -)
5. Denoting Others (anya, para, -)
6. Near Demonstrative (e, ihA, -)
7. Far Demonstrative (o, uhA, -)
8. Reflexive (nija, nijenije, -)
9. Indefinite (keu, kichhu, -)

Morphology : Pronoun

Attributes:

Number
Val 0: singular, Val 1: plural, Val 2: honorary plural

Form
Val 0: direct, Val 1: oblique

Specificity
Val 0: non-specific, Val 1: specific

Case
Val 0: Nom., Val 1: Acc., Val 2: Genitive, Val 3: Locative

Emphatic Marker
Val 0: none, Val 1: only, Val 2: also

Ellipses
Val 0: false, Val 1: true

Nature Types

Bengali POS Categories (Noun)

A Bengali noun has the following attributes: Number, Specificity, Ellipses, Form, Case and Emphasizer.

Number has 2 values (Singular and Plural)
Specificity has 2 values (Specific and non_specific)
Ellipses has 2 values (Elliptic and non_elliptic)
Form has 2 values (Direct and Oblique)
Case has 5 values (Nominative, Accusative, Genitive, Locative, Instrumental)
Emphasizer has 3 values (None, Only, Also)

Adjective Morphology

Root Val 0: root word of the given surface form of the word

Specificity Val 0: non-specific, Val 1: specific

Emphasizer Val 0: none, Val 1: only, Val 2: also

Degree Val 0: normal, Val 1: superlative, Val 2: Comparative

Gender Val 0: masculine Val 1: feminine Val 2: neuter

Adverb Morphology



Degree Val 0: normal, Val 1: superlative, Val 2: Comparative

Postposition Morphology



Morphological Generator

Developed at IIT Kharagpur

Introduction

Morphological Generator uses certain linguistic resources and generates the surface form from a given input.

The following linguistic resources are required:

Root Dictionary
Morphological Rules
  Rule/Attribute Type Declaration (RATD)
  Morphotactics
  Paradigm Tables
  Orthographic Rewrite Rules
Exception List

Format of the root dictionary

<root_word>:<category, paradigm_no;>+

root_word: The root word in UTF-8
category: Part-of-speech category
paradigm_no: A specific non-negative number referring to the paradigm table to be used for generating the surface form of root_word when it is used as that particular POS category.
+: denotes one or more occurrences of <category, paradigm_no;>

Example for Hindi:
कर: NN,0; VM,1;
आम: NN,1; JJ, 0;
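A root-dictionary line in this format could be read as sketched below. The function name parse_root_entry is hypothetical, and the sketch assumes one root word per line, which the slide does not state explicitly.

```python
def parse_root_entry(line: str) -> tuple[str, list[tuple[str, int]]]:
    """Split '<root_word>:<category, paradigm_no;>+' into its parts."""
    root, _, rest = line.partition(":")
    pairs = []
    for chunk in rest.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # text after the final ';' is empty
        category, paradigm_no = (part.strip() for part in chunk.split(","))
        pairs.append((category, int(paradigm_no)))
    if not pairs:
        raise ValueError("at least one <category, paradigm_no;> is required")
    return root.strip(), pairs

kar = parse_root_entry("कर: NN,0; VM,1;")
aam = parse_root_entry("आम: NN,1; JJ, 0;")
print(kar, aam)
```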

RATD

The first line of the RATD is

<#categories> <cat_tag>+

#categories: The total number of distinct categories for which morphological generation is required.
cat_tag: The category tag, as used in the root dictionary, for which generation is required.

Example:

3 NN QC VM

RATD

This is followed by the declarations for each of the #categories categories. The declaration for a category consists of a meta declaration line followed by #morphotactics lines specifying the morphotactic rules. The meta declaration for a category is as follows:

<cat_tag> <file_name> <#paradigms> <#morphotactics> <#attributes> <#values_for_attribute>+

cat_tag: As defined above
file_name: The name of the file that contains the morphotactics, paradigm tables and rewrite rules of the particular category.
#paradigms: Total number of paradigms for the category
#morphotactics: Total number of linear morphotactic rules for the category
#attributes: Total number of attributes that govern the morphology
#values_for_attribute: The number of values for each of the attributes.

Example:

NN nn.txt 5 1 2 2 2
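The two kinds of RATD line described so far could be parsed along these lines. This is a sketch only: the class and function names (CategoryDecl, parse_header, parse_meta) are hypothetical, and it covers only the single-file form of the meta declaration shown in the example above.

```python
from dataclasses import dataclass

@dataclass
class CategoryDecl:
    cat_tag: str
    file_name: str
    n_paradigms: int
    n_morphotactics: int
    value_counts: list[int]  # one entry per attribute

def parse_header(line: str) -> list[str]:
    """Parse '<#categories> <cat_tag>+' and check the declared count."""
    count, *tags = line.split()
    if int(count) != len(tags):
        raise ValueError("#categories does not match the number of tags")
    return tags

def parse_meta(line: str) -> CategoryDecl:
    """Parse '<cat_tag> <file_name> <#paradigms> <#morphotactics> <#attributes> <#values>+'."""
    cat_tag, file_name, *numbers = line.split()
    n_paradigms, n_morphotactics, n_attributes, *value_counts = map(int, numbers)
    if n_attributes != len(value_counts):
        raise ValueError("#attributes does not match the number of value counts")
    return CategoryDecl(cat_tag, file_name, n_paradigms, n_morphotactics, value_counts)

tags = parse_header("3 NN QC VM")
decl = parse_meta("NN nn.txt 5 1 2 2 2")
print(tags, decl)
```

Read this way, the example declares 5 paradigms, 1 morphotactic rule and 2 attributes with 2 values each for NN.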

Morphotactics

The morphotactics are specified linearly in the following format:

{ ‘(’ { attribute_id, }+ ‘)’ }+

For example, the morphotactic rule (0, 2)(3)(1, 4) means that the suffix marking features 0 and 2 is followed by the suffix marking feature 3, and then by the suffix marking features 1 and 4.

We assume a linear morphology.
We assume that inflections are in the form of suffixes only (i.e. no prefixes or infixes).
In the above example, it is not possible to split the suffixes marking features 0 and 2, or 1 and 4. In other words, the suffixes are fusional as far as the (0,2) and (1,4) feature combinations are concerned, but the morphology is agglutinative in general.

There can be more than one morphotactic rule for a category in a language. In that case, the first rule is taken as the default, and the other rules are triggered only under special circumstances, which are specified in the rule by assigning a specific value to a feature. For example, (0, 2=5)(3)(1, 4) is triggered only when Attribute 2 has the value 5.

Morphotactics example

Bengali noun morphology

Attribute 0: Number
Val 0: singular, Val 1: plural

Attribute 1: Obliqueness
Val 0: direct, Val 1: oblique

Attribute 2: Specificity
Val 0: non-specific, Val 1: specific

Attribute 3: Case
Val 0: Nom., Val 1: Acc., Val 2: Genitive, Val 3: Locative

Attribute 4: Emphasizer
Val 0: none, Val 1: only, Val 2: also

Attribute 5: Ellipses
Val 0: false, Val 1: true

Bengali nouns follow one of the following two morphotactics:

(0,1,2)(3)(4)
(0,1,2)(5=1)(0,1,2)(3)(4)

The second rule is triggered only in the case of ellipses.

Paradigm Table

The category specific files (e.g. nn.txt in the earlier example) store the paradigm tables and orthographic rewrite rules.

There is a paradigm table for every paradigm number and for each feature or feature combination in the morphotactics. Thus, if there are #paradigms paradigms for Bengali nouns, there are 4 × #paradigms paradigm tables. The 4 tables per paradigm correspond to (0,1,2), (3), (4) and (5).

However, several paradigms might share some of the tables. Therefore, in the declaration, a particular table can stand for more than one paradigm.

A paradigm table contains the list of suffixes for a particular combination of attributes.

<ParadigmTable

<Attributes a1, a2>

<ParadigmNumber x1, x2, x3>

<Suffixes s11, s12, s13,…, s21, s22, s23,…>

The number of suffixes in a table is equal to the product of the numbers of values of the attributes in that combination.

Example: If the combination is (0,1), the 1st attribute has 10 values and the 2nd attribute has 3 values, then the table for the combination (0,1) will contain 10 × 3 = 30 suffixes (some of which may be NULL).

Orthographic Rules

Orthographic rules are specified as rewrite rules of the following form:

input → output / left_context, right_context

We also have provisions for two-layer rules, in which the top layer specifies the rule on strings and the bottom layer indicates the features.

Thus, a rule of the type

input → output / left_context, right_context
[att1]           [root],       [att2]

means that when the suffix corresponding to the attribute att1 has the pattern input, and it is immediately preceded by the pattern left_context, which belongs to the root and followed by the pattern right_context, which belongs to another suffix corresponding to some attribute att2, then input should be replaced by the pattern output.

RATD for Bengali

11 NN QC VM PN AV AJ PS OT UT QF QO
NN nn.txt nn_rule.txt mean_noun.txt 1 1 6 2 2 2 2 5 3
QC qc.txt qc_rule.txt mean_card.txt 1 1 4 4 2 2 3
VM vm.txt vm_rule.txt mean_verb.txt 1 2 5 6 10 3 2 2
PN pn.txt pn_rule.txt mean_pron.txt 1 2 7 2 2 2 2 2 5 3
AV av.txt av_rule.txt mean_adv.txt 1 1 2 3 3
AJ aj.txt aj_rule.txt mean_adj.txt 1 1 2 3 3
PS ps.txt ps_rule.txt mean_psp.txt 1 1 1 3
OT ot.txt ot_rule.txt mean_oth.txt 1 1 1 1
UT ut.txt ut_rule.txt mean_quot.txt 1 1 1 3
QF qf.txt qf_rule.txt mean_quan.txt 1 1 2 2 3
QO qo.txt qo_rule.txt mean_ord.txt 1 1 1 3
symbols: aAbcdDeghiIjklmn.;NoprsStTuUyY

Orthographic Rules

The format is similar to that of two-level morphological rules. Each rule has 4 parts:

input:output/left_context,right_context

Here input is changed to output provided that input is preceded by left_context and followed by right_context. The suffix is terminated by #.

Example: “give^ing# = giving” can be written as the rule

Rule 1: e^:NULL/giv,ing#

If we say that all “e-ending” words are inflected like “give”, then we can write the rule

Rule 2: e^:NULL/*,ing#

If we say that all “a-ending” and “o-ending” words are simply concatenated with “ing#”, we can write

Rule 3: ^:NULL/*~,ing# (where the symbol ~ means either ‘a’ or ‘o’)
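The effect of Rules 1 and 2 can be illustrated with the sketch below. It uses Python regular expressions in place of the generator's own matching machinery, and the names compile_rule, apply_rule and join are hypothetical. Only the wildcard * is handled; class symbols such as ~ in Rule 3 are not.

```python
import re

def compile_rule(rule: str):
    """Turn 'input:output/left_context,right_context' into (regex, output)."""
    change, _, context = rule.partition("/")
    source, _, target = change.partition(":")
    left, _, right = context.partition(",")

    def ctx(pattern: str) -> str:
        # '*' stands for any string; every other symbol is matched literally.
        return "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)

    regex = re.compile(f"^({ctx(left)}){re.escape(source)}({ctx(right)})$")
    return regex, ("" if target == "NULL" else target)

def apply_rule(rule: str, word: str):
    """Return the rewritten word, or None when the rule is not triggered."""
    regex, target = compile_rule(rule)
    match = regex.match(word)
    return None if match is None else match.group(1) + target + match.group(2)

def join(word: str, rules: list[str]) -> str:
    """Apply the first triggered rule; otherwise just concatenate."""
    for rule in rules:
        rewritten = apply_rule(rule, word)
        if rewritten is not None:
            word = rewritten
            break
    return word.replace("^", "").replace("#", "")

RULES = ["e^:NULL/*,ing#"]  # Rule 2 from the slide
print(join("give^ing#", RULES), join("go^ing#", RULES))
```

For "go^ing#" no rule is triggered, so the boundary markers are dropped and the suffix is simply concatenated.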

Orthographic Rules Contd..

Orthographic rules are best implemented as deterministic finite-state machines (FSMs).

The FSM decides whether a rule is satisfied by the input word. If it is, finding the portion to be replaced is straightforward.

If, following the FSM, the input word reaches the final state, we say that the rule is triggered.

If no orthographic rule is triggered, the suffix is simply concatenated.

Building FSM

Example FSM for Rule 2:

e^:NULL/*,ing#

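[Figure: state diagram of the deterministic FSM for Rule 2. States: S, A, B, C, D, E, F, G, H. Transition labels: e, ^, i, n, g, #, *, and the complement labels *-e, *-e-^, *-i, *-n, *-g, *-#.]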
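A deterministic machine for Rule 2 can be sketched in a few lines. The states here are simply numbered 0 to 6 and do not correspond to the letter labels in the slide's diagram. The fallback transition relies on the assumption that the six symbols of "e^ing#" are all distinct, so it would not be correct for an arbitrary pattern.

```python
PATTERN = "e^ing#"  # rule input "e^" followed by the right context "ing#"

def step(state: int, ch: str) -> int:
    """One transition: advance on the expected symbol, otherwise fall back."""
    if state < len(PATTERN) and ch == PATTERN[state]:
        return state + 1
    # All pattern symbols are distinct, so after a mismatch the only
    # partial match that can survive is a fresh 'e'.
    return 1 if ch == PATTERN[0] else 0

def rule2_triggered(word: str) -> bool:
    """True when the machine is in its final state after reading the word."""
    state = 0
    for ch in word:
        state = step(state, ch)
    return state == len(PATTERN)

print(rule2_triggered("give^ing#"), rule2_triggered("go^ing#"))
```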

Orthographic Rules for Bengali Verb

^y:;i/*,#
^a:;o/*,#
AWA^aie:eWe/*,#
A^aie:e/*,#
yAoYA^aie:giYe/*,*
AoYA^aie:eYe/X,#
eoYA^aie:iYe/*,#
Ano^aie:iYe/*,#
oYA^aie:uYe/*$,#
AWA^aie:eWe/*,#
A^:NULL/B,~*
A^:a/B,$*
no^:ch/*A,chh*
oYA^:ch/*A,chh*
Ano^:iY/*,echh*
no^:NULL/*A,E*
no^F:NULL/*A,G*
oYA^F:NULL/*A,G*
no^:NULL/*A,iK
oYA^:NULL/*A,iK
no^L:o/*A,*
oYA^L:o/*A,*
no^e:Ya/*A,K
oYA^e:Ya/*A,K
AoYA^:eY/X,echh*
yAoYA^:giY/*,echh*
AoYA^:e/X,M*
oYA^:NULL/*A,b*
AoYA^:e/y,t*
yAoYA^:ge/*,l*
eoYA^:iY/*,echh*
eoYA^:i;/*,iK
eoYA^:ich/*,chh*
eoYA^:i/*,P*
oYA^:NULL/*e,Q*
eoYA^u:i/*,*
eoYA^:NULL/*,R*
eoYA^:A/*,o*
eoYA^a:Ao/*,ni
eoYA^e:eYa/*,K
oYA^:uY/*$,echh*
oYA^:uch/*$,chh*
oYA^:u/*$,V*
oYA^i:u/*$,sa*
YA^:;/*$o,o#
YA^a:;o/*$o,ni*
YA^e:NULL/*$o,naK
YA^u:NULL/*$o,*
A^e:a/*$oY,K
^y:;i/*,#
^a:;o/*,#
AWA^aie:eWe/*,#
yAoYA^aie:giYe/*,*
AoYA^aie:eYe/X,#
eoYA^aie:iYe/*,#
Ano^aie:iYe/*,#
oYA^aie:uYe/*$,#

Input Format

Input to the Morphological Generator starts with the root of the word, followed by the POS category and the attribute names with their values.

Example:

karA VM Person 3 Tense 2 Emp 2

In Bengali, Person and Tense combine to give a suffix, which is added first; the Emphasizer gives another suffix, which is added next.

See Morphotactic for Bengali Verb.
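An input line of this shape could be split into its parts as follows. The function name parse_request is hypothetical, and the sketch assumes that tokens are separated by whitespace and that every attribute name is followed by an integer value.

```python
def parse_request(line: str) -> tuple[str, str, dict[str, int]]:
    """Split '<root> <category> (<attribute> <value>)*' into its parts."""
    root, category, *rest = line.split()
    if len(rest) % 2:
        raise ValueError("every attribute name needs a value")
    attributes = {name: int(value) for name, value in zip(rest[0::2], rest[1::2])}
    return root, category, attributes

request = parse_request("karA VM Person 3 Tense 2 Emp 2")
print(request)
```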

Input Format Contd.

In Bengali, Person can have 6 values and Tense (which is actually TAM) can have 10 values. The suffixes in the paradigm table are arranged in the following way:

First entry is Person 0 Tense 0
Second entry is Person 0 Tense 1
Third entry is Person 0 Tense 2
…
10th entry is Person 0 Tense 9
11th entry is Person 1 Tense 0

So Person 3 Tense 2 will be entry number

(Person input) × (number of TAM values) + (TAM input) + 1 = 3 × 10 + 2 + 1 = 33

Get 33rd entry from the Paradigm table for (0,1) and use the Orthographic rule to get the correct word.
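The entry-number arithmetic generalises to any attribute combination: the last attribute in the combination varies fastest. A small sketch, with the hypothetical helper names entry_number and table_size:

```python
import math

def table_size(sizes: list[int]) -> int:
    """Number of suffixes in a table: product of the attributes' value counts."""
    return math.prod(sizes)

def entry_number(values: list[int], sizes: list[int]) -> int:
    """1-based position of a value combination in a paradigm table."""
    index = 0
    for value, size in zip(values, sizes):
        if not 0 <= value < size:
            raise ValueError(f"value {value} is out of range for {size} values")
        index = index * size + value  # later attributes vary fastest
    return index + 1

# Person has 6 values and TAM has 10, as stated on the slide.
print(table_size([6, 10]), entry_number([3, 2], [6, 10]))
```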

Bengali Verb Paradigms and Morphotactics

<ParadigmTable
<Attributes 1 2 > /* 1 indicates Person and 2 indicates TAM */
<suffixes
i chhi echhi lAma chhilAma echhilAma ba tAma NULL ini
isa chhisa echhisa li chhili echhili bi tisa NULL isani
o chha echha le chhile echhile be te NULL ani
ena chhena echhena lena chhilena echhilena bena tena una enani
e chhe echhe la chhila echhila be ta uka eni
ena chhena echhena lena chhilena echhilena bena tena una enani
>>

<ParadigmTable
<Attributes 3 > /* Case */
<suffixes NULL i o>>

Morphotactic rule (0,1)(2)(3) (3=2)(2)

Bengali Noun Paradigms and Morphotactics

<ParadigmTable
<Attributes 0 1 2 > /* Number, Specificity, Ellipses: 2 × 2 × 2 = 8 entries */
<suffixes NULL eraTA TA NULL gulo guloraTA NULL NULL>>

<ParadigmTable
<Attributes 3 4 > /* Form, Case: 2 × 5 = 10 entries */
<suffixes NULL ke NULL ete ete NULL NULL era NULL NULL>>

<ParadigmTable
<Attributes 5 > /* Emphasizer: 3 entries */
<suffixes NULL i o>>

Morphotactic rule (0,1,2)(3,4)(5)

Example (Bengali Verb)

Example: the Input is

balA Verb Person 1 TAM 1 Case 0

The first morphotactic rule is triggered.

Person can have 6 values and TAM can have 10 values. So the number of the suffix to extract from the paradigm table (1,2) is

10 × (Person value) + (TAM value) + 1 = 10 × 1 + 1 + 1 = 12

i.e., chhisa is to be added first.

The suffix extracted from the paradigm table (3) is NULL, i.e., NULL is to be added next.

Example Contd.

Now balA^chhisa# is the input, which is matched against the orthographic rules.

Suppose there is an orthographic rule

A^:a/B,$* where B: *-Y and $: consonant

Then the FSM for this rule will bring the input to the final state, i.e., the rule is triggered. Now “A^” is replaced by “a” and the output is “balachhisa”.
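The whole worked example can be traced with the sketch below. The suffix table is the Person × TAM table from the slides. Everything else is an assumption made for illustration: the function name generate_verb, reading B (*-Y) as "any symbol except Y", treating every non-vowel symbol as a consonant, and using a regular expression where the slides use an FSM. Only this one orthographic rule is implemented.

```python
import re

# Person x TAM suffixes for Bengali verbs, copied from the slides (6 rows of 10).
VERB_SUFFIXES = """
i chhi echhi lAma chhilAma echhilAma ba tAma NULL ini
isa chhisa echhisa li chhili echhili bi tisa NULL isani
o chha echha le chhile echhile be te NULL ani
ena chhena echhena lena chhilena echhilena bena tena una enani
e chhe echhe la chhila echhila be ta uka eni
ena chhena echhena lena chhilena echhilena bena tena una enani
""".split()

N_TAM = 10
VOWELS = "aAeiIouU"  # assumption: every other symbol counts as a consonant

def generate_verb(root: str, person: int, tam: int) -> str:
    entry = N_TAM * person + tam + 1        # 1-based entry number, as on the slide
    suffix = VERB_SUFFIXES[entry - 1]
    if suffix == "NULL":
        suffix = ""
    form = f"{root}^{suffix}#"              # e.g. balA^chhisa#
    # Rule A^:a/B,$* : 'A^' becomes 'a' after a non-Y symbol and before a consonant.
    form = re.sub(rf"(?<=[^Y])A\^(?=[^{VOWELS}#])", "a", form)
    return form.replace("^", "").replace("#", "")

print(generate_verb("balA", 1, 1))
```

The NULL suffix from paradigm table (3) adds nothing, so it is left out of the sketch.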

Exception List:

Words that do not follow the orthographic changes of other words, or that change completely when inflected, are treated as exceptions.

Handling these words through orthographic rules would require a large number of highly complex rules.

We therefore list them in a separate file, which contains each exception word along with all of its inflected forms.

Morph Analyzer
