Improving Dictionary Accessibility by Maximizing Use of Available Knowledge
Slaven Bilac*, Timothy Baldwin** and Hozumi Tanaka*

* Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan
** CSLI, Stanford University, Stanford, CA, USA
ABSTRACT. The dictionary lookup of unknown words is particularly difficult in Japanese due to the requirement of knowing the correct word reading. We propose a system which supplements partial knowledge of word readings by allowing learners of Japanese to look up words according to their expected, but not necessarily correct, reading. This is an improvement over previous systems, which provide no handling of incorrect readings. In preprocessing, we calculate the possible readings each kanji character can take and the different types of phonological alternations and reading errors that can occur, and associate a probability with each. Using these probabilities and corpus-based frequencies we calculate a plausibility measure for each generated reading given a dictionary entry, based on the naive Bayes model. In response to a user-entered reading, the system displays a list of candidate dictionary entries for the user to choose from. The system is implemented in a web-based environment and available for general use. In the evaluation on Japanese Proficiency Test data and naturally occurring misreading data, the system significantly reduced the number of unsuccessful dictionary queries when queried with incorrect readings.
KEYWORDS: language learner, dictionary lookup, Japanese, kanji

MOTS-CLÉS: apprentissage des langues, consultation de dictionnaires, japonais, kanji
L'objet – 8/2002. LMO'2002, pages 1 à 57
1. Introduction
Learning a foreign language is a time consuming and painstaking process, made all the more daunting by the existence of unknown words [GRO 00]. Without a fast, low-cost way of looking unknown words up in a dictionary, the learning process is impeded [HUM 01]. The problem of dictionary lookup is particularly evident in non-alphabetic languages such as Japanese, where the learner can easily be overwhelmed by the sheer number of characters and the multitude of readings associated with each.

Educators have tried to lessen the unknown word problem by focusing on effective ways of expanding learner vocabulary [LAU 01]. However, unless the learner lives in a closed language world, s/he is always going to be exposed to unknown words, particularly in the earlier stages of learning. Our philosophy is to accept the inevitability of unknown words and focus instead on minimizing the dictionary lookup overhead.
Learners often possess only limited knowledge of the readings of characters and the phonological and conjugational processes governing word formation. This can make it difficult to identify the correct reading for a grapheme string, and the boolean match mechanism adopted by conventional dictionaries discourages the user from attempting to look up a word in the case that s/he is uncertain of the reading. We believe that if we can imitate the manner in which learners acquire and classify the different readings of characters and the rules governing overall reading formation, we should be able to decipher which dictionary entry the user was after even when queried with a (predictably) wrong reading. Thus, the purpose of this research is to develop a comprehensive and efficient dictionary interface allowing language learners to look up words in an error-resilient and intuitive manner. Furthermore, an important underlying motivation of this research is to remove the assumption of perfect reading knowledge made by conventional dictionary interfaces, and encourage the user to query the system with plausible but not necessarily correct readings.
The particular language we target in this paper is Japanese, and we choose to model readings by way of kana (see below). The problem of dictionary lookup for Japanese is particularly complex due to there being over 2,000 ideographic kanji characters, each with numerous phonemic realizations, frequent word conjugation and a lack of spaces between adjacent words. A learner trying to look up a word in a dictionary needs to cope with all these problems at once. The proposed system aims to help a user by allowing direct lookups based on the best guess the user is able to construct for a target word in written text based on available knowledge.
As an example of user–system interaction, consider that a user comes across the novel kanji compound 発表 (happyou “presentation”)¹ and wishes to determine its English translation. Lacking prescriptive knowledge of the pronunciation of the string, the user applies knowledge of alternate string contexts for the component kanji characters 発 hatsu and 表 hyou to postulate that the string is read as hatsuhyou. S/he inputs the kana for this string into the dictionary search interface, and gets back a list of Japanese words (in both kanji and correct kana-reading forms) with English translations for each. From among these, s/he is able to detect the original string in kanji form, ascertain the correct pronunciation of the string (happyou) and obtain the desired translation (“presentation”).

1. In this paper, we loosely follow the Hepburn system of romanization, with the exception that we romanize long vowels as separate characters, giving rise to hyou instead of hyoo or hyō for ひょう. The other notable divergence, taken from [BAC 94], is the use of the upper-case N for syllable-final nasals (corresponding to the kana ん) and the lower-case n for syllable-initial nasals (as found in na, for example).
Although we focus on Japanese in this paper, the basic method we propose is applicable to any language where the mapping from reading to orthography is not self-evident. That is, given some means of describing readings (whether through a phonetic representation or some other means) and the canonical orthographies of words, it is possible to apply the same procedure in predicting patterns of reading confusion. Japanese is of particular interest because of the wide range of factors which affect pronunciation prediction for an unknown word (see Section 2.4).
The remainder of this paper is organized as follows. Section 2 gives a short introduction to the Japanese writing system and dictionaries, and discusses reading errors common in learners of Japanese. Section 3 describes the basic system philosophy, and Section 4 the processing steps necessary for generating and scoring readings. The evaluation of the system is given in Section 5, and the discussion of the results and possible future research directions in Section 6. Finally, Section 7 gives concluding remarks.
2. The Japanese Language, Existing Japanese Dictionaries and Reading Errors

2.1. The peculiarities of the Japanese writing system
The Japanese writing system consists of the three orthographies of hiragana, katakana and kanji, which appear intermingled in modern-day texts. The hiragana and katakana syllabaries, collectively referred to as kana, are relatively small (46 basic characters each), and most characters take a unique and mutually exclusive reading which can easily be memorized.² Generally speaking, the function of these two scripts is distinct, although a wide range of variation occurs. Hiragana is mostly used for function words and conjugational endings of verbs and adjectives (e.g. する suru “(to) do”). Katakana, on the other hand, is mostly used for words of foreign (generally Western) origin, onomatopoeic and stressed expressions, and to some extent for plant and animal names (e.g. イチョウ ichou “gingko tree”). Katakana characters are also quite commonly used as pronunciation guides for words whose reading is not obvious (i.e. uncommon proper names written in kanji or foreign words written in the alphabet) [KNI 98]. The kana syllabaries are limited in size and there is a strict correspondence between individual characters and readings. As such, they do not present a major difficulty to the learner of Japanese.

2. Among the few exceptions to the unique reading rule are the kana characters づ and ず, which are realized as /zu/, and the characters ぢ and じ, which are realized as /ji/. Here づ and ず are voiced versions of つ tsu and す su, respectively. Accordingly, ぢ and じ are voiced versions of ち chi and し shi.
Kanji characters present a much bigger obstacle to the learner, most immediately through a combination of their sheer volume, ideographic nature and phonetic polymorphism. The Japanese government prescribes 1,945 kanji characters for daily use, and up to 3,000 appear in newspapers and formal publications [NLI 86]. Additionally, while the semantics of individual characters often have a bearing on the combined semantics of words in which they occur, they are not marked for phonetic content. That is, there is no way of predicting a priori the pronunciation of the kanji character 発, for example.³ Finally, each character can and often does take on several different and frequently unrelated readings. The kanji 発 “emit, depart”, for example, has readings including hatsu and ta(tsu), whereas 表 “table, exterior, show” has readings including hyou, omote and arawa(su).
The problem is further complicated by the existence of character combinations which do not take on compositional readings. For example, 風邪 kaze “common cold” is formed non-compositionally from 風 kaze, fuu “wind” and 邪 yokoshima, ja “evil”. Note that every kanji word has a kana equivalent (i.e. reading), which is commonly used in indexing Japanese dictionaries (see Section 2.2).
As mentioned above, when kanji characters are combined to form words, the readings frequently undergo phonological change to give rise to surface readings. The two phenomena that are prevalent in compound formation are sequential voicing (rendaku) and sound euphony (onbin). Sequential voicing is the process of voicing the first consonant of the trailing segment when segments are combined in a binary fashion to produce words. For example, 本 hoN “book” is combined with 棚 tana “shelf” to give rise to 本棚 hoNdana “bookshelf”. Sound euphony is the process of replacing the last mora (kana character) in the leading segment with a mora in phonetic harmony with the first mora of the trailing segment [FRE 95]. It has several forms, the most common of which is assimilatory gemination or sokuonbin. For example, 国 koku “country” combined with 境 kyou “boundary” gives rise to 国境 kokkyou “(national) border”. Notice that sequential voicing occurs in the presence of left lexical context, while assimilatory gemination occurs in the presence of right lexical context.
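As a rough illustration, the two alternations can be sketched as string rewrites over romanized readings. This is a simplified sketch only: the voicing map and the set of gemination-triggering morae below are our own illustrative assumptions, not the full rule set the system uses.

```python
# Illustrative sketch of sequential voicing (rendaku) and assimilatory
# gemination (sokuonbin) over romanized readings. The voicing map and
# the gemination triggers are simplified assumptions.

RENDAKU = {"h": "b", "t": "d", "k": "g", "s": "z"}  # partial voicing map

def sequential_voicing(trailing):
    """Voice the first consonant of the trailing segment."""
    return RENDAKU.get(trailing[0], trailing[0]) + trailing[1:]

def gemination(leading, trailing):
    """Replace the final mora of the leading segment with a geminate
    copy of the trailing segment's initial consonant."""
    for mora in ("tsu", "chi", "ku", "ki"):
        if leading.endswith(mora) and trailing[0] in "kstph":
            return leading[:-len(mora)] + trailing[0]
    return leading

print("hoN" + sequential_voicing("tana"))   # → hoNdana "bookshelf"
print(gemination("koku", "kyou") + "kyou")  # → kokkyou "(national) border"
```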
2.2. Japanese Dictionaries

Conventional Japanese dictionaries are indexed on the phonemic realization of words, expressed in the form of kana. For example, the kanji compound 発表 happyou “announcement” is listed according to its kana equivalent はっぴょう happyou. The phonemic ordering convention makes it easy to look up words in the case that the reading is known, due to kana having a natural alphabetic ordering, unlike kanji. However, in many cases it is not straightforward to extract the reading from the word representation as present in a target text. As mentioned above, the problem is mostly due to kanji characters, whose phoneme realization cannot be easily identified. Generally, each character's reading needs to be learned individually before a word can be looked up in a dictionary. For example, to look up 遷移 seNi “transition” the user must know that 遷 and 移 take on the readings seN and i, respectively. Frequently characters have several unrelated readings which occur in different word contexts (e.g. the readings seN and utsuru for 遷, and i and utsuru for 移), making it difficult to postulate the correct reading of the word even if a portion of the readings of each component character are known. When direct lookup fails, words need to be looked up using a different approach.

3. This is not strictly true, as structurally similar kanji characters (e.g. 増, 憎 and 贈) can share a single common reading (zou in this case). Even here, however, alternate pronunciations tend to exist and be highly divergent (e.g. 増 “increase” can also be read as ma(su) and fu(eru), 憎 “hate” as niku(mu) and 贈 “give” as oku(ru)).
Kanji dictionaries provide an alternative lookup method aimed at the individual kanji characters. A complicated system of kanji radicals (bushu) and stroke counts is used to locate a component kanji in the dictionary (e.g. 遷 could be looked up either via its radical 辶 or its stroke count of 15), and the target word is then located from a supplemental listing of words containing that kanji. If the word is not found in the listing, the process must be repeated for the other kanji characters present in the word (e.g. 移 could be looked up via its radical 禾 or its stroke count of 11). If the word cannot be found through any of the individual kanji,⁴ the learner must resort to postulating a compositional reading for the whole word and searching for this reading in a conventional kana dictionary.
To make things worse, the kanji radical and stroke count system leaves a lot of room for error on the part of the uninitiated learner.⁵ For example, 遷 also contains the radicals 西 and 大, whereas 移 also contains the radical 多, potentially leading to confusion as to which radical to look up the character under. Additionally, some component shapes consist of only a single stroke, which is not immediately obvious. Such confusion results in further burdening the lookup task. In some cases lexicographers have tried to expedite the process by devising additional forms of indexing kanji dictionaries (e.g. [HAL 98] boasts six different ways of looking up a character), but these indexing schemes are rarely standardized and in all cases need to be learned to be used.

From the above we can see that a user wanting to look up the translation of a word (e.g. “transition” as a translation of 遷移) potentially needs to consult at least two different dictionaries, and search in several passes and through different indexing schemas in order to obtain the translation. Clearly, a system allowing direct and straightforward kanji word lookup would greatly assist the learner by removing or at least ameliorating the difficulties associated with the process of learning new words.
4. The word seNi is not listed under either character index in [NAG 81] or the Sharp Electronic Dictionary PW-9100, while [HAL 98] lists it under only one of its component characters.

5. Some dictionaries get around this problem by listing some characters under several different radical indexes and stroke counts.
2.3. Existing electronic dictionaries and reading aids

Above, we painted a bleak picture of Japanese dictionary lookup. However, with the advent of computers and electronic dictionaries, dictionary lookup has become somewhat more efficient. Electronic Japanese dictionaries have become increasingly popular during the last decade, both in portable and server-based form, due to their superior usability over paper dictionaries. One reason for this is that several different dictionaries (e.g. kanji, monolingual Japanese and bilingual Japanese–English) can be accessed through a single interface, and navigated between easily.
More significant, however, has been the introduction of several new search methods enabling faster lookups. For example, it is possible to copy/paste strings and get the translation directly when the source text is available in electronic form [BRE 00]. Also, most dictionaries support regular expression-based searches allowing for the lookup of words from partial (correct) information (e.g. looking up 遷移 with the glob-style query seN*, or alternatively using kana–kanji conversion to input 遷 based on known readings for 遷). In another development, it has become possible to look up kanji characters via the readings of meaningful sub-units (other than radicals) contained in the character (using, e.g., the Sharp Electronic Dictionary PW-9100 or Canon Word Tank IDF4000).
2.3.1. Open domain systems

Also in the last decade, several interactive reading aids aimed at Japanese language learners have become available. A pioneer in this field was the DL system [TER 96], capable of performing morphological analysis of the input sentence and providing translations from the EDICT dictionary [EDI 01]. Similarly to DL, the Reading Tutor⁶ [KAW 00, KIT 00] system performs text segmentation and then provides word-level translation and semantic information. Asunaro⁷ [NIS 00, NIS 02], on the other hand, provides a multilingual English, Chinese and Thai interface capable of sentence segmentation and displaying parse trees as well as word-level translations. All of these systems aim to help the learner by removing the burden of segmenting sentences into words and converting them into a form suitable for dictionary searches. Syntax trees, semantic information, etc. are added to improve the sentence-level comprehension of the target text.
While these dictionaries and reading aids are a valuable addition to the learner's repertoire, they work best when the target text is available in electronic form and need not be re-entered into the interface. However, in the instance that the text is available only in hard copy, current systems offer very little or no user support. Here, current systems still require that the user has absolute knowledge of the full reading of the word in order to achieve direct lookup. While this is acceptable for proficient Japanese language users, it remains a major handicap for learners of the language.
6. http://…
7. http://…
2.4. Problems encountered by Japanese learners

There is a long history of research documenting the problems Japanese learners have in reading texts containing kanji [NLI 86, MEI 97]. Among the commonly-listed problems are:
1) Multiple readings for a given kanji. The learner is aware of the different readings a kanji character can take, but unable to decide on the proper reading in the given context. For example, 大 can be read as either tai, dai or oo(kii), so the string 大会 taikai “convention, congress” could feasibly be misread as ookai or daikai.

2) Insufficient knowledge of readings. The learner is only aware of a proper subset of the readings a given kanji can take, and thus cannot predict the correct reading when faced with new words drawing on a novel reading for that kanji. A user aware only of the oo(kii) reading for 大, e.g., would almost certainly try to read 大会 as ookai.

3) Incorrect application of the phonological and conjugational rules governing reading formation. For example, 発 hatsu and 表 hyou form the compound 発表 happyou “announcement”, but readings such as hatsuhyou or hahhyou could equally arise from the component character readings.

4) Confusion as to the length of vowels or consonants. For example, 主催 shusai “organization, sponsorship” can be mistakenly read as shuusai, or 最も mottomo “most, extremely” as motomo. This error type is common in speakers of languages which have no vowel/consonant length distinction.

5) Confusion due to the graphic similarity of different kanji. Learners with limited contact with kanji can easily confuse characters. For example, 墓 bo, haka “grave” and 基 ki, moto “base” are graphically very similar, resulting in possible reading substitutions (e.g. between 墓地 bochi “graveyard” and 基地 kichi “base”).

6) Confusion due to the semantic similarity of different kanji. Characters like 右 migi “right” and 左 hidari “left” have a similar meaning and as such are often confused, resulting in an erroneous reading. Semantic confusion sometimes occurs at the word level, too, such as between 火事 kaji “fire” and 火災 kasai “(disastrous) fire”.

7) Confusion due to word-level co-occurrence. When two characters commonly occur together, their readings can be substituted when they appear with other characters. For example, 訴訟 soshou “lawsuit” can give rise to the erroneous reading kishou for 起訴 kiso “indictment”. Also common is the superimposition of a known reading onto a word occurring with a common kana suffix, e.g. 慰める nagusameru “comfort, console” being read as osameru (due to knowledge of the string 修める osameru “study, cultivate”).

8) Random errors. These are errors that do not belong to any of the above groups and are very hard to classify and/or predict. As such, it is hard to imagine a system being able to reliably handle this type of error.
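Error type 4 lends itself to simple mechanical generation. As a hedged sketch (the enumeration below is our own illustration, not the system's actual error model), single vowel- or consonant-length confusions of a romanized reading can be produced as follows:

```python
VOWELS = "aeiou"

def length_confusions(reading):
    """Enumerate single length confusions (error type 4 above): double
    any vowel, or drop one half of any doubled character. An
    illustrative sketch, not the paper's generation procedure."""
    variants = set()
    for i, ch in enumerate(reading):
        if ch in VOWELS:
            variants.add(reading[:i + 1] + ch + reading[i + 1:])  # lengthen
        if i + 1 < len(reading) and reading[i + 1] == ch:
            variants.add(reading[:i] + reading[i + 1:])           # shorten
    variants.discard(reading)
    return variants

print("shuusai" in length_confusions("shusai"))   # → True
print("motomo" in length_confusions("mottomo"))   # → True
```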
Even though various error types are discussed in previous works [NLI 86, MEI 97], to our knowledge there exists no previous research that has presented a quantitative analysis of the different error types.
Note that problems 1, 2, 3 and 8 (that is, the effects of phonological alternation and phonetic polymorphism) also apply to spelling confusion in English, while all problems other than 4 apply in the case of Mandarin Chinese, for example. That is, English is similar to Japanese in that the same grapheme segment can be read differently in different contexts and phonology produces variable effects, but differs in that it lacks the vowel length and character-level semantic effects of Japanese. Mandarin Chinese is associated with the same basic scope for confusion as Japanese, although the bulk of characters are associated with a unique reading, and problems 1 and 2 are therefore considerably less pronounced. In this sense, the Japanese writing system can be seen to be particularly hard for language learners.
3. System Outline
The FOKS (Forgiving Online Kanji Search) system aims to aid the learner in coping with the complicated Japanese writing system, and provide direct, linguistically- and statistically-sound support for the types of problems outlined above. The system has a single web-based interface for both known and unknown readings, which allows the learner to look up words directly according to their expected, but not necessarily correct, readings. The system is intended to handle both strings in the form they appear in texts (i.e. in kanji) and readings expressed in kana. Given a reading as input, the system tries to establish a relationship between the reading and one or more dictionary entries, and rate the plausibility of each entry being realized with the entered reading.
In a sense, the problem of predicting which word a user seeks from a reading-based input is analogous to kana–kanji conversion (see, e.g., [TAK 96] and [ICH 00]). That is, we seek to determine a ranked listing of kanji strings that could correspond to the input kana string and provide access to the desired word as efficiently as possible. There is one major difference, however. Kana–kanji conversion systems are designed for native speakers of Japanese and as such expect accurate input.⁸ In cases when the correct or standardized reading is not available, kanji characters have to be converted one by one. This can be troublesome due to segmentation ambiguity and the large number of characters taking on identical readings, resulting in long lists of kanji characters for the user to choose from.
FOKS does not assume absolutely accurate knowledge of readings, but instead expects readings to be predictably derived from the source kanji. One assumption we unavoidably make is that the user will only try to look up words contained in the base dictionary.⁹ That is, we can only hope to direct users to words we have knowledge of, while keeping the number of candidate entries low enough so the user can quickly determine when the desired word is not contained in the dictionary. Assuming we can keep the number of word candidates low enough, users can use a single interface to search for words by either the correct or a derivable wrong reading. We return to this point in Section 5.

8. Several kana–kanji conversion systems handle a limited number of input errors (e.g. colloquial readings and substitution of phonologically-indistinguishable kana characters such as づ zu and ず zu, and ぢ ji and じ ji). However, as far as we are aware, there is no kana–kanji conversion system that tries to systematically handle a wide range of input errors.

9. The coverage provided by the interface depends solely on the underlying dictionary. The version of the FOKS interface publicly available at http://… provides access
4. From One Dictionary to Another: the Methodology
While kanji dictionaries list the most common readings each character can take, they give very little additional information that would be useful in our task. For example, most dictionaries provide no information on the relative frequencies of the different readings a character can take, simply listing the readings. Also, while various publications discuss the phonological phenomena affecting compound reading formation [TSU 96, NLI 84], they do not provide a quantitative analysis which could be used as a starting point for our system. Clearly, given the common readings of the characters, it is straightforward to generate compound readings based on the simple concatenation of unit readings. However, if we were to proceed in this manner we would fail to reflect the relationship between the pervasiveness of some readings over others, or the phonological effects of compound word reading formation. Hence, this simple approach fails to accomplish our initial goal of modeling the manner in which learners of Japanese are likely to form a candidate reading for a compound word they are not familiar with.
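The naive concatenative baseline dismissed above can be made concrete (the readings are from the 発/表 example earlier; the helper name is ours):

```python
from itertools import product

def naive_readings(unit_readings):
    """Cross-product concatenation of per-character readings: the
    baseline criticized above, with no alternation modeling and no
    reading frequencies."""
    return {"".join(combo) for combo in product(*unit_readings)}

readings = naive_readings([["hatsu", "ta"], ["hyou", "omote"]])
print(sorted(readings))
# note that the attested surface form happyou is never generated
```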
4.1. Modular approach

Instead of relying on the data provided in kanji dictionaries, we extract the data directly from the dictionary we are implementing the interface for. We employ a modular approach in dividing the overall problem into several smaller problems and solving each separately. Given the solution and the modularity of the system, each part of the system can be tested separately. The modular nature of our approach is depicted in Figure 1. The process is as follows:

1) Extract the complete set of readings associated with a given segment through a process of grapheme–phoneme alignment.

2) Reduce the obtained reading set by separating the genuine differences in readings from those which are phonological and/or conjugational derivations of underlying base readings, in the process of canonization.

3) Exhaustively generate new readings for each dictionary entry and calculate their overall probability based on the probabilities of segment readings and corpus frequencies.

Below we describe each of the modules in detail.
to over 100,000 entries in the EDICT general use dictionary and over 200,000 entries in the ENAMDICT proper noun dictionary.
[Figure 1: dictionary entry–reading pairs flow through the Alignment Unit to give segmented entry–reading pairs, through the Canonization Unit to give segment–reading sets with alternation probabilities, and through the Reading Generation Unit to give scored entry–reading pairs, which populate the Readings Database.]

Figure 1. The modular structure of the FOKS system
4.2. Grapheme–phoneme alignment

Given a dictionary entry and its reading given in hiragana, we want to extract the part of the hiragana reading resulting from each kanji character, that is, align the kanji (grapheme strings) with their readings (phoneme strings). For example, given the compound 解析 kaiseki “analysis”, we would like to identify 解 as having contributed a reading of kai and 析 a reading of seki, accounting for the word-level reading of kaiseki. We remind the reader that hiragana characters are not strictly phonemes, but phoneme clusters. Nonetheless, in our application the leap is permissible. In the alignment process, we attempt to extract the complete set of phoneme realizations (component readings) for each grapheme segment (kanji segment). The particular dictionary used here and throughout the research is the publicly-available EDICT dictionary [EDI 01]. Following the same alignment procedure for all dictionary entries containing a given kanji, we can extract a complete set of phonemic realizations of the kanji. [BAL 00] give a comparison of several machine-learning based methods as applied to unsupervised alignment. The method described below proved superior in accuracy when no alignment training data is available. It requires no supervision and could be applied to other languages in which the phonetic realization is not clearly derivable from the grapheme presentation. The alignment process proceeds as follows:
1) For each grapheme–phoneme string pair, generate a complete set of candidate alignment mappings. We constrain the alignment process by requiring that each grapheme character aligns to at least one character in the phonemic representation, that the alignment is strictly linear (and non-intersective) and that characters are indivisible.

2) Prune candidate alignments through the application of linguistic constraints, such as requiring segment boundaries at script boundaries,¹⁰ direct alignment of kana equivalents and indivisible syllables. When multiple candidates exist, we also prune the candidates with multiple voiced obstruents in a reading segment [BAL 99].

3) Score each alignment by a variant of the TF-IDF model [SAL 90], which was developed for term weighting in information retrieval.

4) Iteratively work through the data, selecting a single grapheme–phoneme string pair to align according to the highest-scoring candidate alignment at each iteration, and updating the statistical model accordingly (to filter out disallowed candidate alignments and score up the selected alignment mapping).
Examples of alignments extracted by our algorithm are:¹¹

発表 happyou “announcement” → 発|表 hap|pyou
割引 waribiki “discount” → 割|引 wari|biki
風邪薬 kazegusuri “cold medicine” → 風邪|薬 kaze|gusuri
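Step 1 of the procedure can be sketched as follows, enumerating every strictly linear alignment in which each kanji receives a non-empty, contiguous span of the kana reading (the pruning and TF-IDF scoring of steps 2-4 are omitted, and the function name is ours):

```python
from itertools import combinations

def candidate_alignments(grapheme, reading):
    """Enumerate all linear, non-intersecting alignments assigning each
    grapheme character a non-empty contiguous slice of the reading
    (step 1 of the alignment procedure; pruning/scoring omitted)."""
    n, m = len(grapheme), len(reading)
    alignments = []
    for cuts in combinations(range(1, m), n - 1):  # choose n-1 cut points
        bounds = (0,) + cuts + (m,)
        alignments.append([(grapheme[i], reading[bounds[i]:bounds[i + 1]])
                           for i in range(n)])
    return alignments

for a in candidate_alignments("解析", "かいせき"):
    print(a)
# among the three candidates is the correct [('解', 'かい'), ('析', 'せき')]
```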
4.3. Canonization

The alignment data contains all possible readings for a given grapheme segment as were available in the context of the dictionary used for alignment. It can include alternates due to sequential voicing, sound euphony and conjugation (e.g. the phonological variants hyou and byou for 表 “chart”, and the conjugational variants yomi and yomu for the verb 読 “read”), and possibly (but not necessarily) the base form of each reading. We canonize the readings to separate the base reading data from the alternation-derived data, thus minimizing the number of reading types and maximally extracting instances of alternation. This provides a means of overcoming data sparseness, and allows us to produce unobserved segment-level readings through novel alternation combinations over the base readings and thus increase the coverage of predicted readings.

We observed above that sequential voicing occurs only when the given segment has left lexical context, and that sound euphony occurs only in the presence of right lexical context. To detect the two phenomena, therefore, we can classify segments into four groups according to the presence of left and right lexical context [BAL 02]:
a) Level 0 (−left, −right context): no possibility of conjugation or phonological alternation

10. With the exception of kanji–hiragana boundaries, which are not enforced due to the conjugative suffixes of verbs and adjectives always being expressed in hiragana (i.e. okurigana) but forming a single lexical unit together with the head kanji character.

11. Notice that in some cases grapheme segments can be made up of more than one kanji character, as occurs for 風邪 kaze “common cold” above.
[Figure 2: occurrences at Level 1 (gemination possible), Level 2 (sequential voicing possible) and Level 3 (sequential voicing and gemination possible) are reduced to the canonical form established at Level 0.]

Figure 2. Canonization flowchart
b) Level 1 (−left, +right context): possibility of gemination or conjugation

c) Level 2 (+left, −right context): possibility of sequential voicing

d) Level 3 (+left, +right context): possibility of all of gemination or conjugation, and sequential voicing
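The four-level classification is a direct function of context, and the level of each segment in a compound follows from its position (function names below are ours, added for illustration):

```python
# Direct transcription of the four-level context scheme above.

def context_level(has_left, has_right):
    return {(False, False): 0,   # base reading, no alternation possible
            (False, True): 1,    # gemination/conjugation possible
            (True, False): 2,    # sequential voicing possible
            (True, True): 3,     # all alternations possible
            }[(has_left, has_right)]

def segment_levels(segments):
    """Level of each segment of a compound, by position."""
    last = len(segments) - 1
    return [context_level(i > 0, i < last) for i in range(len(segments))]

print(segment_levels(["発", "表"]))  # → [1, 2]
```

For example, in a two-kanji compound the first segment has only right context (Level 1, where hatsu can geminate) and the second only left context (Level 2, where hyou can voice).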
Level 0 singleton segments can be assumed to comprise the base readings from which readings at other levels are derived. Quite commonly, readings are derived through zero-derivation, whereby no phonetic/conjunctive alternation takes place. We work through the various levels in decreasing numeric order, and determine whether a unique base reading exists for each grapheme segment from which the observed reading has been derived. In the case that such an analysis is possible, we record the type of alternation and update its frequency, by incrementing the frequency of the alternation by the frequency of the string in which the alternation was found to occur and combining it with that of the base reading. In the case that multiple matches are found for variants of the original reading with identical kanji content, the frequency of the original kanji–reading string is distributed equally between all matching entries. The canonization process is depicted in Figure 2.
First, we perform conjugationalanalysis[BAL 98] at Levels 1 to 3
to establishwhethereachsegment hasanunderlying verbal or adjectival
form. At eachstep,wethenperform a matchover boththeoriginal form
andthebaseconjugational form(s)of thereading. This distribution of
frequency extendsto any phonological alternationor
conjugationassociatedwith eachmatch.
Next we attemptto merge Level 3 entrieswith Level 1 and2
entries,and thenLevel 1 and2 entrieswith Level 0
entries.Thereasonfor this particularordering ofthe canonization
processis that, wherepossible,we wish to isolatethe effects of
asinglephonological processat a time to
maintainanalyticalconsistency throughoutthecanonizationprocess.Many
segments do not occur at Level 0 (i.e. asstand-alone
-
FOKSDictionaryInterface 13
characters) but can be found in multiple instances at other levels. For example, 発 hatsu "emit" occurs at all of Levels 1 (e.g. 発表 happyou "presentation"), 2 (e.g. 原発 geNpatsu "nuclear power") and 3 (e.g. 未発行 mihakkou "unpublished"), but not Level 0. We thus have no immediate indication of its canonical form, but based on the alignment data we know that it takes readings haQ12 and patsu. In this example, the Level 3 reading haQ is not voiced but has undergone gemination, meaning it is not in canonical form. Since we have no instances of unvoiced, non-geminate candidates at Level 3, we postpone disambiguating the canonical form and merge haQ with the existing Level 1 reading. This leaves us with two readings: haQ at Level 1 and patsu at Level 2. The canonical form for haQ can be any one of hatsu, hachi, haku, haki, etc. On the other hand, patsu is semi-voiced, and is therefore either the canonical form in itself or derived from the voiced batsu or unvoiced hatsu. Through the interaction of Levels 1 and 2, we can determine that both readings are derived from the canonical form hatsu, so we record them as such and update the corresponding frequencies. In the case that no merging of readings is possible through the canonization process, each reading is promoted to Level 0 as a separate reading type.
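The canonization direction described above can be sketched as follows; this is a minimal illustration in Python, with an invented romaji-level devoicing table and a small, assumed set of gemination expansions (the actual system operates over kana with a fuller alternation inventory):

```python
# Sketch: undo sequential voicing (rendaku) and gemination to propose
# canonical base-reading candidates for a segment reading.
# DEVOICE is an illustrative table, not the system's full inventory.
DEVOICE = {"b": "h", "p": "h", "g": "k", "z": "s", "d": "t"}

def devoice(reading):
    """Undo sequential voicing on the initial consonant, if present."""
    if reading and reading[0] in DEVOICE:
        return DEVOICE[reading[0]] + reading[1:]
    return None  # no voicing to undo

def degeminate(reading, endings=("tsu", "chi", "ku", "ki")):
    """Expand a geminated reading (written here with a final 'Q') into
    candidate base readings; without further evidence the choice of
    ending is ambiguous, as noted for haQ in the text."""
    if reading.endswith("Q"):
        stem = reading[:-1]
        return [stem + e for e in endings]
    return []

print(devoice("gusuri"))   # kusuri (kaze-gusuri <= kaze + kusuri)
print(devoice("patsu"))    # hatsu  (semi-voiced patsu <= hatsu)
print(degeminate("haQ"))   # ['hatsu', 'hachi', 'haku', 'haki']
```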
After canonization, our data from above would look as follows:

発表 hap-pyou ⇐ hatsu + hyou (+gemination, +voicing)
割引 wari-biki ⇐ wari + hiki (+voicing)
風邪薬 kaze-gusuri ⇐ kaze + kusuri (+voicing)

While canonizing the readings, we keep track of cases where genuine alternation took place (cases where entries at different levels were successfully merged together based on a conjugation, gemination and/or sequential voicing analysis) so as to be able to reapply them as independent probabilities below. Also, we count the number of occurrences of each reading for a given kanji segment and convert this number into the probability P(r|k) of the given kanji segment k taking each reading r. Notice that this probability depends on the kanji character in question, unlike the probability of the voicing and gemination alternations, which depends on the reading realization of the segment in question. We further extend the set of alternations we consider with vowel shortening/lengthening, the probability of which is calculated as the percentage of short/long vowels in our dictionary set multiplied by a weight factor.13
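The conversion of per-kanji reading counts into P(r|k) is simple maximum likelihood estimation; a minimal sketch (the frequencies below are invented for illustration):

```python
from collections import defaultdict

# Hypothetical canonized counts: (kanji segment, base reading) -> frequency.
counts = {("発", "hatsu"): 90, ("発", "hotsu"): 10}

def reading_probs(counts):
    """Convert per-kanji reading frequencies into P(r|k) by MLE:
    each count is divided by the total count for that kanji."""
    totals = defaultdict(int)
    for (kanji, _reading), freq in counts.items():
        totals[kanji] += freq
    return {(k, r): f / totals[k] for (k, r), f in counts.items()}

probs = reading_probs(counts)
print(probs[("発", "hatsu")])  # 0.9
```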
4.4. Reading Generation and Scoring
After extracting the set of segment readings and calculating the various alternation probabilities, we proceed to generate and score plausible readings. The first step in this process is to segment the target string, so as to be able to look up readings for the individual segments and compose these into an overall reading. For the string
12. Where "Q" indicates that the final kana syllable has been geminated, i.e. haQ equates to the kana form はっ.
13. In all experiments described in Section 5 we use a weight factor of 0.05 for both vowel shortening and lengthening.
Function: SegmentReading()
Input: segment reading set R_s = {(r_1, p_1), (r_2, p_2), ..., (r_n, p_n)},
           where (r_i, p_i) is a reading–probability tuple
       A = {a_1, a_2, ..., a_k}, where a_j is an alternation with probability p_{a_j}
BEGIN
for i from 1 to n do
    (r_c, p_c) <- (r_i, p_i)
    for j from 1 to k do
        r_new <- a_j(r_c)
        p_new <- p_c x p_{a_j}
        if exists (r, p) s.t. (r, p) in R_s and r = r_new
            p <- p + p_new
        else
            R_s <- R_s + {(r_new, p_new)}
    end do
end do
normalize R_s s.t. sum_i p_i = 1 for (r_i, p_i) in R_s
return R_s
END

Figure 3. Pseudo-code for the SegmentReading function
発表する happyousuru "to present", for example, we would ideally partition it into the three segments 発, 表 and する; for the non-compositional 風邪 kaze "common cold", a single-segment analysis may be more appropriate. We test two segmentation methods, based on bigram probabilities and script boundaries.
The bigram-based method consists of taking each character bigram in the target string and using the grapheme–phoneme alignment data to rate the probability of a segment boundary occurring at that point. As noted above, katakana and hiragana strings take a unique kana-based reading, irrespective of how we segment them. We thus chunk all contiguous hiragana and katakana characters (and alpha-numeric strings) together into a unigram unit. The output of this method is a set of different string segmentations, each of which is associated with a probability based on the product of the bigram probabilities at each potential segment insertion point.
The script boundary segmentation method adopts a much simpler approach, inserting a segment boundary at each script demarcation point (e.g. between each kanji and kana character), and additionally inserting a boundary between each pair of kanji characters. This segmentation schema results in a considerable simplification of the generation process, and produces a unique segmentation of a given string. This comes at the cost of preventing generation of the correct reading for multi-kanji segments (e.g. 風邪 kaze "common cold" from above).
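The script boundary method can be sketched in a few lines; this is a minimal illustration where Unicode block checks stand in for whatever script test the implementation actually uses:

```python
def script(ch):
    """Coarse script class of a character, by Unicode block."""
    o = ord(ch)
    if 0x3040 <= o <= 0x309F:
        return "hiragana"
    if 0x30A0 <= o <= 0x30FF:
        return "katakana"
    if 0x4E00 <= o <= 0x9FFF:
        return "kanji"
    return "other"

def boundary_segment(s):
    """Insert a boundary at every script change and between each pair
    of kanji; contiguous kana (or alpha-numeric) runs stay together."""
    segments = []
    for ch in s:
        same_run = (segments and script(ch) != "kanji"
                    and script(ch) == script(segments[-1][-1]))
        if same_run:
            segments[-1] += ch
        else:
            segments.append(ch)
    return segments

print(boundary_segment("発表する"))  # ['発', '表', 'する']
print(boundary_segment("風邪薬"))    # ['風', '邪', '薬'] -- splits the
# non-compositional 風邪, illustrating the cost noted above
```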
Having segmented the strings, we next generate scored readings according to the following steps:

1) For each segment in word w, use the previously calculated set of readings R_s containing reading–probability tuples (r, P(r|k)) and expand it to include any additional readings resulting from application of the alternations under consideration. For each applicable alternation a, we calculate a new tuple (r_new, p_new), where p_new is calculated under the assumption of segment independence as in equation (1) and r_new is the resulting reading. If the reading was in the set originally, the probabilities are added; if not, the new tuple is inserted into the reading set. After the complete set of reading–probability tuples is obtained, we normalize the probabilities to sum to 1. Figure 3 gives the algorithm for generating a complete set of readings for a segment.
p_new = p_r × p_a [1]

2) Create an exhaustive listing of reading candidates r_w for each dictionary entry w by concatenating individual segment readings, and calculate the probability P(r_w|w) of each based on the evidence from step 1 and the naive Bayes model (assuming independence between all parameters) as given by equation (2). Figure 4 gives the simplified recursive version of the generation algorithm. The actual implementation is iterative and optimized to avoid unnecessary repetitive calculations.
P(r_w|w) = ∏_{j=1}^{m} P(r_j|s_j) [2]

While generating readings we apply a probability threshold, keeping only the readings with a higher probability. Then, we normalize the probabilities of the pruned set of readings to sum to 1.
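Steps 1 and 2 above can be sketched together as follows; a simplified Python illustration with reading sets as dicts from string to probability, and a toy voicing alternation standing in for the full alternation set:

```python
def expand_segment(readings, alternations):
    """Step 1: add alternated readings with p_new = p_r * p_a
    (equation (1)), merging duplicates by summing, then renormalize."""
    expanded = dict(readings)
    for r, p in readings.items():
        for alt, p_a in alternations:
            r_new = alt(r)
            if r_new is None:
                continue
            expanded[r_new] = expanded.get(r_new, 0.0) + p * p_a
    total = sum(expanded.values())
    return {r: p / total for r, p in expanded.items()}

def word_readings(segment_sets, threshold=0.0):
    """Step 2: concatenate segment readings, multiplying probabilities
    under the naive Bayes model (equation (2)), then prune and
    renormalize the surviving readings to sum to 1."""
    combined = {"": 1.0}
    for seg in segment_sets:
        combined = {rw + r: pw * p
                    for rw, pw in combined.items()
                    for r, p in seg.items()}
    pruned = {r: p for r, p in combined.items() if p > threshold}
    total = sum(pruned.values())
    return {r: p / total for r, p in pruned.items()}

# Toy voicing alternation h- -> b- with an assumed probability of 0.3.
voice_h = (lambda r: "b" + r[1:] if r.startswith("h") else None, 0.3)
wari = expand_segment({"wari": 1.0}, [])
hiki = expand_segment({"hiki": 1.0}, [voice_h])
print(word_readings([wari, hiki]))
# warihiki (zero-derivation) outscores the voiced waribiki
```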
P(w) = F(w) / Σ_j F(w_j) [3]

P(w|r) = P(w) P(r|w) / P(r) = (F(w) / Σ_j F(w_j)) × (P(r|w) / P(r)) [4]
3) Calculate the corpus-based frequency F(w) of each dictionary entry w in the corpus and then convert it into a string probability P(w), according to equation (3). Notice that the term Σ_j F(w_j) depends on the given corpus and is constant for all strings w in a same corpus. Use Bayes' rule to calculate the probability P(w|r) of each resulting reading according to equation (4). Here, as we are only interested in the relative score for each w given an input r, we can ignore P(r) and the constant Σ_j F(w_j). The final plausibility grade of a user searching for dictionary entry w by querying with reading r is thus estimated as in equation (5).
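The resulting ranking step can be sketched as below; the candidate entries, their reading probabilities and corpus frequencies are invented for illustration:

```python
def rank(query, entries):
    """Score each entry for a query reading by Grade(w|r) = P(r|w) * F(w)
    (equation (5)) and sort best-first."""
    scored = [(word, p_r_w[query] * freq)
              for word, p_r_w, freq in entries
              if query in p_r_w]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical candidates for the query reading "kazegusuri":
# (word, {reading: P(r|w)}, corpus frequency F(w))
entries = [
    ("風邪薬", {"kazegusuri": 0.4, "kazekusuri": 0.5}, 120),
    ("風薬",   {"kazegusuri": 0.7}, 15),
]
print(rank("kazegusuri", entries))
# 風邪薬 ranks first: 0.4 * 120 = 48 beats 0.7 * 15 = 10.5
```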
Function: WordReading()
Input: s[1..n], where s[i] is a segment
       R[1..n], where R[i] is the set R_{s_i} of readings of s[i]
           with associated probabilities
       A = {a_1, a_2, ..., a_k}, where a_j is an alternation with probability p_j
BEGIN
W <- SegmentReading(s[1], R[1], A)
if n > 1
    W <- W (x) WordReading(s[2..n], R[2..n], A)
    where W_1 (x) W_2 = {(r, p) | r = concat(r_1, r_2), p = p_1 x p_2,
                          (r_1, p_1) in W_1, (r_2, p_2) in W_2}
prune W s.t. W = {(r, p) | p > p_threshold}
normalize W s.t. sum_i p_i = 1 for (r_i, p_i) in W
return W
END

Figure 4. Pseudo-code for the WordReading function
Grade(w|r) = P(r|w) × F(w) [5]

4) To complete the reading set, we insert the correct readings for all dictionary entries w_kana that did not contain any kanji characters and for which no readings were generated above, with plausibility grade calculated by equation (6).14

Grade(w_kana|r) = F(w_kana) [6]

Furthermore, if the generation step failed to generate the correct reading for a dictionary entry containing kanji, we add it to the reading set, since we want to assure the ability to search for a dictionary entry by its correct reading.
4.4.1. Failure to generate the correct reading

Even though we start out with a correct dictionary reading as the input to our system, it can fail to generate a correct reading for a dictionary entry due to one of the following reasons:15

a) Incorrect segmentation. When the initial segmentation of multi-kanji units is incorrect, it can obstruct generation of the correct reading. For example, if the initial segmentation of お土産 omiyage "souvenir" is お-土-産, the system may be unable to generate the correct reading, since it is not composed of the individual character readings.
14. Here, P(r|s_kana) is assumed to be 1, as there is only one possible reading (i.e. r).
15. In the worst case of experimental generation, the system failed to generate correct readings for 6,277 readings.
b) Threshold probability. In some cases, the correct reading is generated with a very low probability and filtered out as part of the pruning. However, the pruning is necessary: during test runs of our generation algorithm, we ran into problems with very large numbers of readings being generated for each dictionary entry, resulting in our reading database growing beyond the available disk capacity.
c) Grapheme gapping. Gapping takes place when a certain part of the phoneme string is omitted from the grapheme string. For example, 山手 yamanote "uptown" is commonly written without the no segment, whereas the more complete representation would be 山の手. The correct reading cannot be created, since the system cannot account for the gapped segment.16
d) Alpha-numeric characters. When dictionary entries contain alpha-numeric characters in the grapheme string, the phoneme equivalent usually contains the transcribed kana equivalent (e.g. ABC順 eebiishiijuN "alphabetic order" and 110番 hyakutoobaN "emergency telephone number"17), but our system does not generate such transcriptions.
By default, we set the probability of such correct readings to equal the threshold probability applied in filtering readings during generation, and calculate the score of the reading according to equation (5) as before.
Note again that all three stages of the above processing are fully automated, a valuable quality when dealing with a volatile dictionary such as EDICT. With minor modifications it should be possible to apply our methodology to a different language where the phoneme representation is not clearly derivable from the grapheme representation.
5. Evaluation

Starting with the EDICT dictionary, we proceeded through the steps described in Section 4 to generate new sets of scored readings, with the corpus frequencies taken from the complete set of 200,000+ sentences in the EDR Japanese corpus [EDR 95]. Then we implemented a web-based interface with pregenerated reading sets accessible through a CGI interface.18 Consequently, we are able to provide real-time dictionary lookup without additional computational overhead. The currently available implementation covers the first four types of errors described in Section 2.4.
Here, we will provide an evaluation carried out with two basic goals in mind: (a) to evaluate the effectiveness of the proposed system in handling queries with erroneous
16. However, grapheme gapping is a relatively infrequent phenomenon, appearing in only 0.1% of the 5,000 randomly chosen dictionary entries used for alignment evaluation. As such, it does not significantly affect system performance.
17. Here eebiisii is the Japanese pronunciation of ABC and hyakutoo is an idiosyncratic pronunciation of 110.
18. The system is freely available on the web.
readings, and (b) to examine the effect additional search options and the size of the reading set have on the user's ability to find the desired entry.
5.1. Data sets

From the outset of our project, we were faced with the problem of finding a collection of naturally-occurring reading errors that could be used to evaluate the FOKS system. While there was a lot of information on the types of errors made by learners of Japanese (see Section 2.4), we were unable to locate a database of recorded naturally-occurring reading errors. Instead, we looked to two other sources for test data sets.
The first source is a set of practice problems for the Japanese Proficiency Test [SUZ 96, MAT 95]. The Japanese government has established a four-level certification program aimed at evaluating the ability of non-native learners of Japanese in reading comprehension, listening and vocabulary. We have collected a number of different books used in preparation for the proficiency exam and extracted 420 Level 2 word reading problems. Each problem consists of a word given in its normal kanji form, with four potential readings in kana, only one of which is correct. During the test, the examinee is requested to choose the correct reading from among the four candidates. Here are some example words with candidate readings:19
訴訟 soshou "lawsuit": sousho sousho u sosho
傾く katamuku "lean": muku kizuku uchiaku
肉親 nikushiN "blood relative": nikuoya nisshiN nikuya
煙 kemuri "smoke": honoo hi susu
The second set of data is a collection of 139 entries taken from a website displaying real-world reading errors, or godoku "wrong readings", made by native speakers of Japanese.20 Each entry consists of a word given in kanji–kana combination, with one incorrect and one correct reading each. These entries were compiled from various sources and as such should reflect the wide variety of possible reading errors.

For both data sets we changed all the verb and adjective forms to basic dictionary form, for both the word and all of its potential readings, to make them appropriate for dictionary querying.
5.2. Comparison with a conventional system

We first created four databases of readings: (a) two using the bigram segmentation model (labeled "Bi" in subsequent tables) trained on the extracted alignment data, and (b) the other two using the kanji–script boundary segmentation model (labeled "Ka" in
19. Here we give the correct reading in the gloss. In the actual test, the correct reading can be at any of the four positions.
20. A website collecting godoku examples from various sources.
                  Conv.    Bi (θ₁)    Bi (θ₂)    Ka (θ₁)     Ka (θ₂)
Total readings    97,927   3,449,866  8,864,800  4,549,152   13,812,273
Size (MB)         1.3      116        314        164         534
Unique readings   77,627   3,005,900  8,553,828  4,543,893   13,807,014
Ave. R/E          1.03     36.37      93.46      47.96       145.62
Ave. E/R          1.26     1.30       1.26       1.21        1.14
Max. R/E          6        821        5,394      317         2,223
Max. E/R          27       162        182        167         189

Table 1. Basic breakdown of the different sets of readings (R/E = readings per entry, E/R = entries per reading)
subsequent tables), putting each kanji character in a separate segment (see Section 4.4). For each model we used two different probability thresholds, θ₁ and θ₂, where θ₁ > θ₂. The basic breakdown of these sets is given in Table 1.
Given the two data sets and four reading sets, we ran the following experiment for each combination. For each entry we queried the system with the correct and then the incorrect readings. As a baseline we used direct matching over the base EDICT dictionary to mimic a conventional system. When executing the query we counted the number of results and whether the desired entry was among the candidates returned. Provided that the system successfully returned the desired word as a candidate, we also counted its rank. In some cases, the word was not contained in the dictionary, so we excluded it from the evaluation. The results of these experiments are given in Tables 2 and 3. In each table, we give the error reduction rate, calculated according to equation (7). This rate reflects the improvement over the conventional system.

ErrorReduction = (Successful_proposed − Successful_conv) / (#Queries − Successful_conv) × 100 [7]
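The error reduction rate can be checked against the reported results; a small sketch using figures from Tables 2 and 3:

```python
def error_reduction(successful_sys, successful_conv, queries):
    """Proportion of queries failing under the conventional system that
    the proposed system now handles, as a percentage."""
    return ((successful_sys - successful_conv)
            / (queries - successful_conv) * 100)

# Level 2 words, Ka (theta_1) model (Table 2): 547 successful vs. 18
# for the conventional baseline, over 1189 queries.
print(round(error_reduction(547, 18, 1189), 2))  # 45.18
# godoku words, Ka (theta_2) model (Table 3): 55 vs. 10, over 77 queries.
print(round(error_reduction(55, 10, 77), 2))     # 67.16
```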
                 Conv.   Bi (θ₁)   Bi (θ₂)   Ka (θ₁)   Ka (θ₂)
# Queries        1189    1189      1189      1189      1189
Ave. # Results   2.26    10.37     14.58     11.03     15.36
Successful       18      484       512       547       574
Error Red. (%)   0       39.80     42.19     45.18     47.48
Mean Rank        1.66    1.96      2.05      1.84      1.94
RNM Rank         0.22    0.08      0.07      0.07      0.06

Table 2. Results for Level 2 words
                 Conv.   Bi (θ₁)   Bi (θ₂)   Ka (θ₁)   Ka (θ₂)
# Queries        77      77        77        77        77
Ave. # Results   1.53    7.85      10.96     8.22      11.48
Successful       10      38        39        51        55
Error Red. (%)   0       41.79     43.28     61.19     67.16
Mean Rank        1.4     3.58      3.90      3.16      3.36
RNM Rank         0.18    0.20      0.24      0.18      0.19

Table 3. Results for godoku words
From Table 2, we can see that our system is able to handle a large number of erroneous readings compared to the conventional system. The error rate reduction ranges from 39.80% to 47.48%. The conventional system is able to handle 18 readings due to the fact that those readings might be appropriate in different contexts and as such are recorded in the dictionary. However, different readings usually coincide with different meanings (and hence translations). Due to the nature of the conventional search, the user would not be aware of alternate readings/translations not returned by the system. In our system, on the other hand, we offer a list of all potential readings and translations for the user to choose from, so the user can make the decision as to which translation is appropriate in the given context.
From the error reduction rates, we can see that the "Ka" segmentation method results in better coverage of erroneous readings even for lower threshold values and smaller reading sets. Furthermore, the mean rank of the desired entry among the candidates returned is also lower than for the "Bi" segmentation model. As expected, as we decrease the cut-off threshold, the number of successfully handled queries rises, as does the average number of candidates returned. Nonetheless, the Mean Rank and RNM Rank are both quite low, showing that the desired entry on average ranks high in the candidate list.
Looking to Table 3, we can see that the error rate reduction is even higher, with the maximum improvement reaching 67.16% for the largest generated set. We can also
                      Level 2   godoku
Queries               1189      77
Successful            587       55
Previous best         574       55
Coverage increase     13        0
Error reduction (%)   48.59     67.16

Table 4. Query results for reading sets generated with no threshold applied
see that the number of candidates returned is somewhat lower and that the average rank of the desired entry is higher. This can be explained by the fact that the average character length of erroneous readings in this set was 4.35 characters, as opposed to 2.49 characters for the Level 2 readings. Elsewhere, we have established that the number of results returned is smaller for longer queries [BIL 02].
Here are several examples of successfully-handled erroneous queries, with the numbers in brackets corresponding to the error types in Section 2.4:

ryuushu → 留守 rusu "absence" [1,2]
zeki → 世紀 seiki "century" [1,3]
koki → 後期 kouki "second half" [4]
As seen above, decreasing the probability threshold used to prune the generated readings increases the coverage of erroneous readings. Nonetheless, we wanted to see if there is an upper limit to the error coverage, so we ran an additional experiment in which we created an exhaustive reading set for all the words in our data sets without applying any threshold; we used the same queries on this set as we did in the previous experiments. The results are depicted in Table 4. We can see that the system can handle 13 more queries for the Level 2 words than the previous best result on sets generated with a threshold, but that the total remains the same for the godoku words. Note that this increase in coverage comes with an exponential increase in the reading set size (178MB for 764 entries), which prevents generation and storage of the complete reading set for the whole dictionary.
From the above data, it would appear that the "Ka" segmentation schema results in higher coverage of erroneous readings. Recall that this segmentation schema forces each kanji character into a separate segment and inserts segment boundaries at each script boundary.
5.2.1. Remaining problems

We analyzed the set of readings that none of the systems tested could handle, and found a number of systematic problems with our system, arising from the manner in which we generate readings without accounting for all aspects causing reading errors. In the Japanese Proficiency Test data, the incorrect readings are commonly readings of semantically similar words. The word 煙 kemuri "smoke" has reading candidates
such as honoo "flame" and hi "fire". Also, the readings are often borrowed from words with the same trailing kana content. For example, 定める sadameru "set" has a candidate reading of kimeru (derived from 決める kimeru "decide") due to the common suffix (meru). While we have discussed these phenomena in the context of common reading errors, they are presently not included in our generative model and consequently result in unsuccessful searches.
In the godoku reading data, two other common types of error presented a problem for our system. The majority of entries we could not handle were the result of confusion due to graphical similarity. For example, お札 ofuda "talisman" can be confused with お礼 orei "thanks, gratitude". Another common problem is kanji strings being interpreted as proper names, hence taking on unusual readings. Although we have implemented an analogous system for searching the EDICT proper noun dictionary with readings derived from common words, we currently offer no solution for the opposite problem of regular words being interpreted as proper names.
While we recognize that our system still has deficiencies at present, our experiments have shown that it significantly increases dictionary accessibility in the case that the prescriptive reading is not available, and as such should aid the learner of Japanese. Admittedly, this evaluation was over data sets of limited size, largely because of the difficulty in gaining access to naturally-occurring kanji–reading confusion data. The results are, however, promising as far as both coverage and the appropriateness of the scoring function are concerned.
5.3. Reading set analysis

Since we create a large number of plausible readings, a potential problem was that a large number of candidates would be returned for each reading, obscuring dictionary entries for which the input is the correct reading and penalizing competent users who mostly search the dictionary with correct readings. Therefore, we tried to establish how many candidates are likely to be returned for an arbitrary user query. Due to space constraints we only look at the smaller "Ka" set, with the θ₁ threshold.22
The distribution of the number of word entries returned for the full range of reading types generated by the proposed method is given in Figure 5. In this figure, Baseline represents the readings in the original dictionary, the distribution of which is calculated over the original dictionary. Existing is the subset of readings in the generated set that existed in the original dictionary, and All is all readings in the generated set. The distribution of the latter two sets is calculated over the generated set of readings. The x-axis represents the number of results returned for a given reading, and the y-axis represents the natural log of the number of readings returning that number of results. It can be seen that only a few readings return a high number of entries: 943 out of 4,543,893, or 0.02%, of the readings return over 30 results. Note that the average number of dictionary entries returned per reading is 1.21 for the complete
22. This is the set that is accessible through the web interface.
Figure 5. Distribution of the number of entries returned per reading, for the Baseline, Existing and All reading sets
of predictable reading errors, therefore justifying the soundness of our model. While the results vary according to the test data and the size of the generated reading set, our system outperforms the conventional system in all the experiments performed. Furthermore, the number of responses is on average low enough (less than 4) that it does not inhibit the usefulness of the improved searchability.
Nonetheless, the cognitive model can be improved further. We intend to modify it to incorporate further constraints in the generation process after observing the correlation between the inputs and the selected dictionary entries. To this end, we are collecting usage data from our servers and feedback from users. Furthermore, as briefly pointed out in Section 5.2.1, the current cognitive model still does not cover all types of reading errors, with graphic and semantic similarity being notable sources of error currently not handled. The problem with including these types of errors in our cognitive model is that it is not straightforward to quantify them. In the case of graphic similarity, limited research has been conducted on analyzing kanji similarity at the stroke level [MAE 02], but the coverage is still too limited for general-purpose dictionaries. On the other hand, semantic similarity has received a lot of attention in research on disambiguation and lexicography, but still remains one of the larger obstacles in NLP in general. Note that we have taken tentative steps towards handling reading confusion due to word-level co-occurrence, as detailed in [BIL 03].
Finally, all the work on this dictionary interface is conducted under the assumption that the target string is contained in the original dictionary, and thus we base all reading generation on the existing entries, assuming that the user will only attempt to look up words we have knowledge of. The system allows for motivated reading errors, but it provides no immediate solution for random reading errors, or for cases where the user has no intuition as to how to read the characters in the target string.
7. Conclusions and Future Work

In this paper we have described FOKS, a system designed to accommodate user reading errors and supplement partial knowledge of the readings of Japanese words. Our method takes dictionary entries containing kanji characters and generates readings for each, scoring them for plausibility in the process. These scores are used to rank the different word entries with generated readings corresponding to the system input. The proposed system is web-based and freely accessible. Initial evaluation indicates significant increases in robustness over erroneous inputs.
Acknowledgments

We would like to thank Emily Bender, Francis Bond, Mathieu Mangeot, Kikuko Nishina, Ryo Okumura and several anonymous reviewers for helping in the development of the FOKS system and the writing of this paper.
8. References
[BAC 94] BACKHOUSE A. E., The Japanese Language: An Introduction, Oxford University Press, 1994.
[BAL 98] BALDWIN T., "The Analysis of Japanese Relative Clauses", Master's Thesis, Tokyo Institute of Technology, 1998.
[BAL 99] BALDWIN T., TANAKA H., "The Applications of Unsupervised Learning to Japanese Grapheme-Phoneme Alignment", Proc. of the ACL Workshop on Unsupervised Learning in Natural Language Processing, College Park, USA, 1999, p. 9–16.
[BAL 00] BALDWIN T., TANAKA H., "A Comparative Study of Unsupervised Grapheme-Phoneme Alignment Methods", Proc. of the 22nd Annual Meeting of the Cognitive Science Society (CogSci 2000), Philadelphia, USA, 2000, p. 597–602.
[BAL 02] BALDWIN T., BILAC S., OKUMURA R., TOKUNAGA T., TANAKA H., "Enhanced Japanese Electronic Dictionary Look-up", Proc. of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), Las Palmas, Spain, 2002, p. 979–985.
[BIL 02] BILAC S., BALDWIN T., TANAKA H., "Bringing the Dictionary to the User: the FOKS system", Proc. of the 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan, 2002.
[BIL 03] BILAC S., BALDWIN T., TANAKA H., "Increasing the error coverage of the FOKS Japanese dictionary interface", Proc. of ASIALEX 2003, Tokyo, Japan, 2003, (to appear).
[BRE 00] BREEN J., "A WWW Japanese Dictionary", Japanese Studies, vol. 20, 2000, p. 313–317, Japanese Studies Association of Australia.
[EDI 01] EDICT, "EDICT Japanese-English Dictionary File", ftp://ftp.cc.monash.edu.au/pub/nihongo/, 2001.
[EDR 95] EDR, EDR Electronic Dictionary Technical Guide, Japan Electronic Dictionary Research Institute, Ltd., 1995, (in Japanese).
[FRE 95] FRELLESVIG B., A Case Study in Diachronic Phonology: The Japanese Onbin Sound Changes, Aarhus University Press, 1995.
[GRO 00] GROOT P. J. M., "Computer Assisted Second Language Vocabulary Acquisition", Language Learning & Technology, vol. 4, num. 1, 2000, p. 60–81.
[HAL 98] HALPERN J., Ed., New Japanese-English Character Dictionary, Kenkyusha Limited, 6th edition, 1998.
[HUM 01] HUMBLE P., Dictionaries and Language Learners, Haag + Herchen, 2001.
[ICH 00] ICHIMURA Y., SAITO Y., KIMURA K., HIRAKAWA H., "Kana-Kanji Conversion System with Input Support based on Prediction", Proc. of the 18th International Conference on Computational Linguistics (COLING 2000), Saarbrücken, Germany, 2000, p. 341–347.
[KAW 00] KAWAMURA Y., "The Role of the Dictionary Tools in a Japanese Language Reading Tutorial System", Ljubljana University International Seminar, Ljubljana, Slovenia, 2000, (in Japanese).
[KIT 00] KITAMURA T., KAWAMURA Y., "Improving the dictionary display in a reading support system", International Symposium on Japanese Language Education, Seoul, Korea, 2000, (in Japanese).
[KNI 98] KNIGHT K., GRAEHL J., "Machine Transliteration", Computational Linguistics, vol. 24, 1998, p. 599–612.
[LAU 01] LAUFER B., HULSTIJN J., "Incidental Vocabulary Acquisition in a Second Language: The Construct of Task-Induced Involvement", Applied Linguistics, vol. 22, 2001, p. 1–26.
[MAE 02] MAEDA K., TATSUOKA R., HOKADA K., OSHIKI H., "Development of a Kanji Learning System toward providing Optimal Learning Materials", Proc. of SNLP-Oriental COCOSDA 2002, Hua Hin, Thailand, 2002, p. 243–249.
[MAT 95] MATSUOKA T., Problems from the Japanese Proficiency Test, characters and vocabulary (Levels 1 and 2), Kokusyo Kankoukai, 1995.
[MEI 97] MEIJI M., Analysis of Misuse of the Japanese Language, Meiji Publishing, 1997, (in Japanese).
[NAG 81] NAGASAWA K., Ed., Shinmeikai Kanji-Japanese Character Dictionary, Sanseido Publishing, 2nd edition, 1981.
[NIS 00] NISHINA K., OKUMURA M., SUGIMOTO S., YAGI Y., ABEKAWA T., TOTSUGI N., RYANG F., "Development research on a multilingual Japanese reading aid for foreign students with scientific background", Research Report of the Telecommunications Advancement Foundation, vol. 15, 2000, p. 151–159, (in Japanese).
[NIS 02] NISHINA K., OKUMURA M., YAGI Y., TOTSUGI N., RYANG F., SUGIMOTO S., ABEKAWA T., "Development of a Japanese Reading aid with a multilingual interface and syntax tree analysis", Proc. of the Eighth Annual Meeting of The Association for Natural Language Processing (NLP 2002), Keihanna, Japan, 2002, p. 228–231, (in Japanese).
[NLI 84] NLI, Vocabulary, Research and Education, vol. 13 of Japanese Language Education Reference, National Language Institute, 1984, (in Japanese).
[NLI 86] NLI, Character and Writing System Education, vol. 14 of Japanese Language Education Reference, National Language Institute, 1986, (in Japanese).
[SAL 90] SALTON G., BUCKLEY C., "Improving retrieval performance by relevance feedback", Journal of the American Society for Information Science, vol. 44, 1990, p. 288–297.
[SUZ 96] SUZUKAWA K., KATORI F., Eds., Japanese Proficiency Test Preparation Measure, characters and vocabulary (Level 2), Kokusyo Kankoukai, 1996.
[TAK 96] TAKAHASHI M., SHICHU T., YOSHIMURA K., SHUDO K., "Processing Homonyms in the Kana-to-Kanji Conversion", Proc. of the 16th International Conference on Computational Linguistics (COLING 1996), Copenhagen, Denmark, 1996, p. 1135–1138.
[TER 96] TERA A., KITAMURA T., OCHIMIZU K., "Dictlinker, a Japanese reading support system", Proc. of the Conference on Japanese Education (Fall), Kyoto, Japan, 1996, p. 43–48, (in Japanese).
[TSU 96] TSUJIMURA N., An Introduction to Japanese Linguistics, Blackwell, first edition, 1996.
[VAN 87] VANCE T. J., Introduction to Japanese Phonology, SUNY Press, 1987.