
Word normalization in Indian languages

by

Prasad Pingali, Vasudeva Varma

in

the proceedings of the 4th International Conference on Natural Language Processing (ICON 2005), December 2005.

Report No: IIIT/TR/2008/81

Centre for Search and Information Extraction Lab
International Institute of Information Technology
Hyderabad - 500 032, INDIA

June 2008


Word normalization in Indian languages

Prasad Pingali, Vasudeva Varma
Language Technologies Research Centre
IIIT, Hyderabad, India

[email protected] [email protected]

Abstract

Indian language words face spelling standardization issues, resulting in multiple spelling variants for the same word. The major reasons for this phenomenon are the phonetic nature of Indian languages and their multiple dialects, the transliteration of proper names, words borrowed from foreign languages, and the phonetic variety of the Indian language alphabets. Given such variations in spelling, it becomes difficult to build web Information Retrieval applications for Indian languages, since finding relevant documents requires more than performing an exact string match. In this paper we examine the characteristics of such word spelling variations and explore how to model them computationally. We compare a set of language-specific rules with several approximate string matching algorithms in our evaluation.

1 Problem statement

India is rich in languages, boasting not only the indigenous sprouting of Dravidian and Indo-Aryan tongues, but also the absorption of Middle-Eastern and European influences. This richness is also evident in the written form of the languages. A remarkable feature of the alphabets of India is the manner in which they are organised: they follow a phonetic principle, unlike the Roman alphabet, which has a random sequence of letters. This richness has also led to a set of problems over a period of time. The variety in the alphabet, different dialects and the influence of foreign languages have resulted in spelling variations of the same word. Some of these variations can be treated as errors in writing, while others are too widely used to be called errors. In this paper we consider all types of spelling variations of a word in the language.

This study on Indian language words is part of a web search engine project for Indian languages. When dealing with real web data, the data can be quite problematic. Information Retrieval and web search systems rarely mention explicitly the problems of real-world data on the web. When comparing strings from the real web, they assume the data to be homogeneous and comparable across different sources. In practice, however, real web data contains many string variations which need to be handled. In the case of Indian languages such variations occur far more often, for several reasons: the phonetic nature of Indian languages, the larger size of the alphabet, the lack of standardization in the use of that alphabet, words entering from foreign languages such as English and Persian, and, last but not least, variations in the transliteration of proper names. In order to quantify these issues, we randomly picked 10 Hindi and 10 Telugu news articles and manually counted the number of proper names and words borrowed from English. We found that on average 5.19% of words were proper names in the Hindi documents and 4.8% in the Telugu documents. We also found that on average 5.73% of words were borrowed from English in the Hindi documents, while this number was 6.9% for the Telugu documents. Therefore, apart from Indian language words proper, we should also be able to handle proper names and English words transliterated into Indian languages, since they form a substantial percentage of words. To give an idea of the data problem, the following words were found on various websites.

अगँरजेी, अगँरजेी, अगँेजी, अगँेजी, अगंरेजी, अगंरेजी, अगंेजी, अगंजेी
अनतरराषर ीय, अनतरराषर ीय, अनतरारिषर य, अनतरारषर ीय, अतंरराषर ीय, अतंरारषर ीय, अतंरराषर ीय, अनतरािषर य, अनतराषर ीय

We found empirically that there is a lot of disagreement among website authors with regard to the spellings of words. Out of 278,529 words, 65,774 had variations; these 65,774 variant forms belong to 28,038 distinct words. Therefore about 23.61% of Indian language words have at least one variant spelling, and a word with variants has about 2.34 of them on average. We also found that the more websites we studied, the greater the amount of disagreement. This phenomenon was observed in other Indian languages as well, such as Telugu, Tamil and Bengali. Given such a large percentage of words, it becomes important to study the characteristics of such spelling variations and see whether we can model them computationally.

We propose and compare two solutions to the above problem. One solution is to come up with a set of language-specific rules which can handle such variations; this can give more precise performance, but it does not scale to new languages, since a separate program needs to be written for each Indian language. The other solution is to use approximate string matching algorithms. Such algorithms are easily extensible to other languages, but may not perform as well as language-specific rules in terms of precision.

2 Rule-based algorithm

In this section we discuss an algorithm that uses a set of language-specific rules, taking Hindi as an example. In this algorithm we achieve normalization of words by mapping the alphabet of the given language L into another alphabet L', where L' ⊂ L. Before discussing the actual rules, we introduce the chandra-bindu, bindu, nukta, halanth, maatra and chandra in the Hindi alphabet, which are referred to in the rules. A chandra-bindu is a half-moon with a dot, whose function is vowel nasalization. A bindu (also called anusvar) is a dot written on top of consonants which achieves consonant nasalization. A nukta is a dot under a consonant which produces sounds mostly used in words of Persian and Arabic origin. A halanth is a consonant reducer. A maatra is a vowel character that occurs in combination with a consonant. A chandra is a special character which achieves the function of vowel rounding, such as the sound of 'o' in the word 'documentary'. The following rules are applied to words before two words are compared, in order to achieve normalization.

If found               Map to                             Examples
chandra-bindu          bindu                              अगँजे, अगंजे
consonant + nukta      corresponding consonant            अगंजे, अगंेज
consonant + halanth    corresponding consonant            अगँरेज, अगँजे
longer vowel maatra    equivalent shorter vowel maatra    अनतरारिषर य, अनतरारषर ीय
character + chandra    corresponding character            डॉकयमुटेर ी, डाकयमुटेर ी

Table 1: Rules applied to achieve normalization in Hindi.

While we employed these basic rules, we also tried using unaspirated consonants in place of their respective aspirated ones. We found that this operation did not gain much recall and deteriorated precision, so we dropped it from our algorithm.
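To make the mapping from L to L' concrete, here is a minimal Python sketch of how the rules in Table 1 could be applied over Unicode Devanagari codepoints. This is not the authors' implementation: the set of long/short maatra pairs and the replacement characters chosen for the candra signs go beyond the examples in Table 1 and should be read as assumptions.

    # Sketch only (assumption, not the authors' code): Table 1 rules
    # expressed over Unicode Devanagari codepoints.
    CANDRABINDU = "\u0901"   # chandra-bindu
    ANUSVARA    = "\u0902"   # bindu (anusvar)
    NUKTA       = "\u093C"
    HALANTH     = "\u094D"   # virama
    LONG_TO_SHORT_MAATRA = {      # assumed pairs; Table 1 only shows ii -> i
        "\u0940": "\u093F",       # ii maatra -> i maatra
        "\u0942": "\u0941",       # uu maatra -> u maatra
    }
    CANDRA_SIGNS = {              # one interpretation of "character + chandra"
        "\u0949": "\u093E",       # candra o sign -> aa maatra
        "\u0945": "\u0947",       # candra e sign -> e maatra
    }

    def normalize_hindi(word: str) -> str:
        """Map a Hindi word into the reduced alphabet L' before comparison."""
        out = []
        for ch in word:
            if ch == CANDRABINDU:
                out.append(ANUSVARA)                  # chandra-bindu -> bindu
            elif ch in (NUKTA, HALANTH):
                continue                              # consonant + nukta/halanth -> consonant
            elif ch in LONG_TO_SHORT_MAATRA:
                out.append(LONG_TO_SHORT_MAATRA[ch])  # longer -> shorter maatra
            elif ch in CANDRA_SIGNS:
                out.append(CANDRA_SIGNS[ch])          # character + chandra -> character
            else:
                out.append(ch)
        return "".join(out)

Under this reading, two surface forms are treated as the same word whenever their normalized forms are equal, i.e. normalize_hindi(w1) == normalize_hindi(w2).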

3 Approximate string matching algorithms

We used a set of approximate string matching algorithms from the secondstring project (found at http://secondstring.sourceforge.net) to evaluate to what extent they help solve the problem of normalizing Indian language words. We briefly discuss each of these algorithms in this section before proceeding to the experimental results. Approximate string matching algorithms decide whether two given strings are equal by using a distance function between them. Distance functions map a pair of strings s and t to a real number r, where a smaller value of r indicates greater similarity between s and t. Similarity functions are analogous, except that larger values indicate greater similarity; at some risk of confusion to the reader, we will use these terms interchangeably, depending on which interpretation is most natural. One important class of distance functions are edit distances, in which distance is the cost of the best sequence of edit operations that converts s to t. Typical edit operations are character insertion, deletion and substitution, and each operation must be assigned a cost. We will consider two edit-distance functions. The simple Levenstein distance assigns a unit cost to all edit operations. As an example of a more complex, well-tuned distance function, we also consider the Monge-Elkan distance function (Monge & Elkan 1996), which is an affine variant of the Smith-Waterman distance function (Durban et al. 1998) with particular cost parameters, scaled to the interval [0,1]. A broadly similar metric, which is not based on an edit-distance model, is the Jaro metric (Jaro 1995; 1989; Winkler 1999). In the record-linkage literature, good results have been obtained using variants of this method, which is based on the number and order of the common characters between two strings. Given strings s = a1 . . . aK and t = b1 . . . bL, define a character ai in s to be common with t if there is a bj = ai in t such that i - H <= j <= i + H, where H = min(|s|, |t|) / 2. Let s' = a'1 . . . a'K' be the characters in s which are common with t (in the same order they appear in s) and let t' = b'1 . . . b'L' be analogous; now define a transposition for s', t' to be a position i such that a'i does not equal b'i. Let T_{s',t'} be half the number of transpositions for s' and t'. The Jaro similarity metric for s and t is

Jaro(s, t) = 1/3 * ( |s'| / |s| + |t'| / |t| + (|s'| - T_{s',t'}) / |s'| )


A variant of this metric due to Winkler (1999) also uses the length P of the longest common prefix of s and t. Letting P' = min(P, 4), we define

Jaro-Winkler(s, t) = Jaro(s, t) + (P' / 10) * (1 - Jaro(s, t))
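As a sketch of how these definitions translate into code (an illustration based on the descriptions above, not the secondstring implementation), the following Python functions compute the unit-cost Levenstein distance and the Jaro and Jaro-Winkler similarities:

    def levenstein(s: str, t: str) -> int:
        # Unit-cost edit distance (insertion, deletion, substitution),
        # spelled "Levenstein" as in the text above.
        prev = list(range(len(t) + 1))
        for i, a in enumerate(s, 1):
            cur = [i]
            for j, b in enumerate(t, 1):
                cur.append(min(prev[j] + 1,               # deletion
                               cur[j - 1] + 1,            # insertion
                               prev[j - 1] + (a != b)))   # substitution
            prev = cur
        return prev[len(t)]

    def jaro(s: str, t: str) -> float:
        # Jaro similarity in [0, 1]: a character of s is "common" with t
        # if an equal character of t lies within H positions of it,
        # with H = min(|s|, |t|) / 2 as defined above.
        if not s or not t:
            return 0.0
        H = min(len(s), len(t)) // 2
        used = [False] * len(t)
        s_common = []
        for i, ch in enumerate(s):
            lo, hi = max(0, i - H), min(len(t), i + H + 1)
            for j in range(lo, hi):
                if not used[j] and t[j] == ch:
                    used[j] = True
                    s_common.append(ch)
                    break
        t_common = [t[j] for j in range(len(t)) if used[j]]
        c = len(s_common)
        if c == 0:
            return 0.0
        # Transpositions: half the number of positions where the two
        # common-character sequences disagree.
        T = sum(a != b for a, b in zip(s_common, t_common)) / 2.0
        return (c / len(s) + c / len(t) + (c - T) / c) / 3.0

    def jaro_winkler(s: str, t: str) -> float:
        # Winkler's variant boosts the score by the common prefix length,
        # capped at 4 and weighted by 0.1 (the P'/10 factor above).
        j = jaro(s, t)
        p = 0
        for a, b in zip(s, t):
            if a != b or p == 4:
                break
            p += 1
        return j + (p / 10.0) * (1.0 - j)

Two spelling variants that differ only in a single maatra or nasalization mark then score close to 1, which is what lets a similarity threshold separate variants of the same word from unrelated words.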

4 Experiments 

We picked 350 words from the total set of words in the web search engine index which have spelling variations. We selected these words such that the frequency of each variation is above a threshold value. We define the experimental task as identifying 'matching words' from the list of given words. A word pair is said to be a matching pair if both words semantically denote the same entity. With these words pre-classified into clusters, we ran the various approximate string matching algorithms from the secondstring project along with our own language-specific rules. Since most of the approximate string matching algorithms depend on a distance threshold, for an arbitrary distance threshold θ we predict "same entity" for all words A, B such that dist(A, B) < θ, where dist is the distance function; we predict the two words A, B to be "different" otherwise. We then create plots, as shown below, by varying θ from -∞ to +∞.
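The threshold sweep described above can be sketched as follows; the function and variable names here are ours (hypothetical), not the authors':

    from itertools import combinations

    def precision_recall_curve(words, cluster_of, dist, thresholds):
        # words: the pre-classified word list; cluster_of: word -> cluster id
        # dist: a distance function (smaller means more similar).
        # For each threshold theta, predict "same entity" when dist(a, b) < theta.
        pairs = list(combinations(words, 2))
        gold = [cluster_of[a] == cluster_of[b] for a, b in pairs]
        dists = [dist(a, b) for a, b in pairs]
        curve = []
        for theta in thresholds:
            pred = [d < theta for d in dists]
            tp = sum(p and g for p, g in zip(pred, gold))
            fp = sum(p and not g for p, g in zip(pred, gold))
            fn = sum(g and not p for p, g in zip(pred, gold))
            precision = tp / (tp + fp) if (tp + fp) else 1.0
            recall = tp / (tp + fn) if (tp + fn) else 0.0
            curve.append((theta, precision, recall))
        return curve

Plotting recall against precision for each θ produces curves like those in Figure 1.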

Figure 1: Comparative analysis of various approximate string-matching algorithms, with Recall on the x-axis and Precision on the y-axis.

As shown in Figure 1, we find that the Indian Language Normalizer algorithm, which is the set of language-specific rules, performs very well in terms of precision when compared to the other approximate string matching algorithms. Here we have compared the rules with the character-based Jaccard, Dirichlet mixture modeling, Jaro, Jaro-Winkler, Levenstein, Monge-Elkan, Needleman-Wunsch and Smith-Waterman algorithms.

References

[Cohen, W. W., Pradeep Ravikumar, Stephen E. Fienberg, 2003]. A Comparison of String Distance Metrics for Name-Matching Tasks. American Association of Artificial Intelligence 2003.

[Durban, R., Eddy, S. R., Krogh, A., Mitchison, G. 1998]. Biological sequence analysis - Probabilistic models of proteins and nucleic acids. Cambridge: Cambridge University Press.

[Jaro, M. A. 1989]. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association 84:414-420.

[Jaro, M. A. 1995]. Probabilistic linkage of large public health data files (disc: P687-689). Statistics in Medicine 14:491-498.

[Monge, A., and Elkan, C. 1996]. The field-matching problem: algorithm and applications. Second International Conference on KDD.

[Monge, A., and Elkan, C. 1997]. An efficient domain-independent algorithm for detecting approximately duplicate database records. SIGMOD 1997 workshop on data mining and knowledge discovery.

[Ristad, E. S., and Yianilos, P. N. 1998]. Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(5):522-532.

[Winkler, W. E. 1999]. The state of record linkage and current research problems. Statistics of Income Division, Internal Revenue Service Publication R99/04.