Lexical association network maps for basic Japanese vocabulary.

ASIALEX 2005

Lexical association network maps for basic Japanese vocabulary Terry Joyce

Institute of Technology [email protected]

Abstract

Extending psycholinguistic research into the lexical representation of two-kanji compound words within the Japanese mental lexicon (Joyce, 2002, 2004), this paper reports on a large-scale word association survey for basic Japanese vocabulary. The database of word association norms, which is being compiled from various survey formats including a web-based version of the survey, supplements existing databases concerning the lexical features of Japanese vocabulary (Amano & Kondo, 1999; Yokoy ma a, Sasahara, Nozaki & Long, 1998), such as familiarity ratings and frequency counts, which are essential for cognitive science research. A particularly promising application of the word association norms data, however, is the creation of lexical association network maps that capture important proprieties of words and their interconnectivity. These maps complement other approaches that attempt to tap into aspects of lexical knowledge, such as WordNet, thesauri, ontologies, and collocation data, while avoiding some of their problems. There are also direct and interesting lexicographical and Japanese language learning applications of the

114

WORDS IN ASIAN CULTURAL CONTEXTS

database of word association norms and the lexical association maps. In addition to prov in

1968) and, on the other hand, notions of activation spreading through lexical networks (e.g., Collins & Loftus, 1975), much atten on has been devoted to investigating the rich networks of word associations that connect related words together. For instance, Nelson and McEvoy (2003) have recen uence of word associations on cognition, showing that differences in the associative structures of known words—in terms of associate set size, resonance (or backward association), and the connectivity within an associate set—effect performance on memory tasks.

This paper reports on a project to compile a large-scale database of word association norms for basic Japanese vocabulary and to utilize the data to create lexical association network maps that capture important properties of words and their connectivity. After briefly noting some psycholinguistic research into the representation of two-k 004), which highlights the need for comprehensive word association data for cognitive sciencesurvey

ersed order o

he importance of verbal information. In expe

id g more user-friendly means of consulting the lexical entries of electronic dictionaries, the inclusion of the word associative normative data within the lexical entry will also greatly enrich the variety of lexical information. Given the importance of contextual relevance for second language learning, the lexical associative networks also open up extremely effective learning strategies for Japanese language education.

1. Introduction Many areas of cognitive science, such as psychology, artificial intelligence, as

well as natural language processing and computational linguistics, are contributing to our understanding of the mental lexicon and the representation of lexical knowledge. Reflecting, on the one hand, the conviction that associative processes are basic mechanisms underlying cognition (e.g., Cramer,

ti

tly demonstrated the infl

anji compound words within the Japanese mental lexicon (Joyce, 2002, 2

experimentation, the paper outlines the compilation and coding of an initial corpus and data-collection work. Finally, the paper touches on utilizing the

normative data to create lexical association network maps for basic Japanese vocabulary,singling out two promising applications of the maps: namely, as an approach to modeling the semantic representations of connectionist models and in the areas of Japanese lexicography and language learning.

2. The Japanese lemma-unit model (Joyce 2002, 2004) Compounding is an extremely productive word-formation process in Japanese,

making the language particularly important for examining the extent of morphological involvement in the organization of the mental lexicon. In that context, Joyce (2002, 2004) has examined the lexical representation and retrieval of two-kanji compound words by conducting constituent-morpheme priming experiments—comparing the facilitation from component kanji in the lexical decision task—with various kinds of compound words, such as modifier + modified, verb + complement and the rev

f complement + verb, and synonymous pairs. The general results from the experiments are that both constituent prime

conditions facilitate lexical decisions and in the majority of cases at similar levels, indicating that morphology is important in the organization of the mental lexicon. The research has also provided findings pointing to t

riments that manipulated the positional frequency of the verbal constituents in verb + complement and complement + verb compound words, reversed patterns of

115

ASIALEX 2005

priming were observed in high positional frequency conditions where the reaction times for verbal constituent conditions were significantly faster than in the respective complement conditions. To account for the lexical representation and retrieval of two–kanji compound words, Joyce (2002, 2004) has proposed adapting for the Japanese mental lexicon a version of the multilevel interactive-activation framework. A feature of the model is the incorporation of lemma-unit representations that mediate the connections between access representations (both orthographic and phonological form representations) and semantic representations, which is especially appealing for handling the complex nature of the Japanese writing system.

While this visual word recognition research indicates a central role for morphological information in the organization of the Japanese mental lexicon, clearly possible confounding factors, such as association effects, need to be examined further. However, because comprehensive Japanese word association data is currently not available, a central objective of the present research is to construct a large-scale database of word association norms for basic Japanese vocabulary.

3. Word association norms for Japanese

After noting some word association databases for English and Japanese, this section outlines the creation of the initial survey corpus and data-collection preparations.

While Moss and Older (1996) have collected between 40-50 responses for some 2,400 words of British English, Nelson, McEvoy and Schreiber (1998) have compiled the largest database for American English covering some 5,000 words with approximately 150 responses per item. In terms of Japanese language surveys, although the early survey by Umemoto (1969) is well-known and has responses from 1,000 university students, the word corpus is very small with only 210 words. More recently, Ishizaki (2004) has collected word associations for 1,656 nouns for use in building an associative concept dictionary. A drawback with this data, however, is the fact that response category was specified, so it tells us little about free association norms.

In creating an initial corpus of basic Japanese vocabulary for the word association survey, three primary reference sources were used. The first was the survey of basic vocabulary for Japanese language education conducted by the National Language Research Institute (1984), with 6,800 words including a core set of about 2,200 words. The second source was a more recent list of about 4,000 words prepared by Tamamura (2003), while the third reference was a handbook of Japanese orthography (Sanseidō Henshūjo, 1991) listing sanctioned readings for all Jōyō kanji. Based on comparisons of these sources, an initial survey corpus of 5,000 kanji and words has been created, and this will be expanded through the inclusion of associate responses as data collection proceeds.

The discrete free word association task is relatively straightforward—the respondent is simply asked to provide the first meaningfully-related word that comes to mind when presented with a stimulus word. However, a major concern for the present project in seeking to efficiently construct a large-scale database has been to devise an automatic method of generating multiple individual respondent survey lists from the survey corpus, while minimizing as far as practically possible the effects of intra-list association. Accordingly, much of the preparatory efforts have been devoted to coding the survey corpus with information relating to the following criteria: (1) Pronunciation;

116


necessary due to the high incidence of homophones and to explore the influence of orthographic variants (i.e., kanji versus kana). (2) Orthographic type (i.e., single kanji, multi-kanji, and mixed kanji-kana words); to reduce respondent strategies, orthographic type is mixed in each survey list. (3) Component kanji codes; to ensure that a given

anji only appears once in a survey list. (4) Semantic category codes based on the NTT thesaurus (Ikehara, et al, 1999); to ensure that survey lists do not contain multiple

emantic category. (5) Unique ID code; assigned ID codes will

hich should be shortly). However, data collection is already underway employing traditional

nnaires. For the first block of data, a random sample of 2,000 items was

seco ering the

and

asso al elson

WordNet (Fellbaum Thus, the structure in the network emergesthan being theory-driven or constrained by s ecific relationships, which are concerns for associative concept dictionaries and ontologies where the hierarchical structure is pre-defined. The use of free associations also avoids domain biases that are an issue

k

words from a given sserve as additional measure to eliminate intra-list associations as data collection proceeds.

In order to obtain the large-scale quantities of responses required for the atabase, a web-based format of the survey is being developed (wd

availablepaper questiotaken from the corpus, and approximately 100,000 responses were collected from about 1,000 undergraduates (100 items per questionnaire). While the questionnaire format involves data inputting burdens, it is important for addressing reliability issues. A

nd block of data is currently being collected with questionnaires covremaining items in the survey corpus.

4. Association data applications

The large-scale database of free word association norms for basic Japanese vocabulary will be a valuable resource for cognitive science research, such as memory

visual word recognition experiments. The database of association norms also makes it possible to create lexical association network maps as a means of representing

ciate sets and their connectivity. Figure 1 illustrates the basic concept of lexicassociation network maps, with an example for the English word ‘planet’ based Nand McEvoy (2003). In addition to representing associate set size, the maps can highlight differences in association strengths, both forward and backwards (which are independent), as well as the association density of the associate set.

The lexical association network maps can complement other approaches to uring aspects of lexical knowledge, while avoiding some ocapt f their problems. For

example, unlike the lexicographer’s expertise required to define synonym sets in , 1998), the maps are based on the free responses from respondents.

from tapping into free associations, rather p

VENUS

UNIVERSE

STAR

SPACE

EARTH

SATURN

MOON

MARS

PLUTO

PLANET

association network maps (based on Nelson

Figure 1. Basic concept of lexical

& McEvoy, 2003)

117

ASIALEX 2005

118

44

寒い・さむい雪

15

6

6

4

2

夏

冬至

白・白い

氷

冬将軍

切ない

越冬冬眠

北

くま

休息

休み

かまくら

こたつ

春

2

2

2 2

2

2

2

2

2

2

2

冬

15

お金・

10

8

6

人

切手収集

捨てる

ゴミ

集合

集会趣味

コレクター

集まる

2 2

2

2

2

2

2

2

10

6

コレクション

6

4

4

4

4

収めるおち葉

ガラクタ

カン

コレクト

標本

密集

フィギュア

大人買い

2

2

集める

responses, followed by 夏 ‘summer’ and 冬至 ‘winter solstice’, both at 6 percent, and 白・白い ‘white’ at 4 percent. Thus, 冬 ‘winter’ has a relatively small set of core associates with one particularly strong associate.

Figure 2. Associate sets for 冬 ‘winter’ and 集める ‘gather, collect’

In contrast, the verb 集める ‘gather, collect’ has a larger set of core associates, naturally with weaker association strengths. The primary associate here is

for collocation data. Figur se words 冬

‘winter’ and 集める based on th ssociation responses collected so far. ures on th ions represent the percentage of re hows, ry strong primary associate with the word 寒い・さむい ‘cold’, which accounts for 44 percent of all

The second associate of 雪 ‘snow’ represents only 15 percent of the

お金・sponses. There are also two secondary

ses. Compa むい ‘cold’ and the nou

of associations that exist between words—an important aspect of lexical knowledge—there are promising applications for the lexical

ne such application is in modeling the semantic represe

e 2 resents two examples of associates sets, for the Japane ‘gather, collect’, e free word aThe enclosed fig e arrow connect

sponses. As the figure s 冬 ‘winter’ has a ve

responses.

but金 ‘money’ accounting for 15 percent of the reresponses at 10 percent; namely, 切手 ‘stamps’ and 収集 ‘collection’. Some of the remaining core associates are 人 ‘people’ (8%), 集合 ‘set’ (6%), ゴミ ‘rubbish, trash’ (6%), and コレクター ‘collector’ (6%). Consistent with their respective word classes of noun and verb, these two words exhibit different kinds of syntagmatic respon

red to the very strong association between the adjective 寒い・さn 冬 ‘winter’, more of the core responses for the verb 集める ‘gather, collect’

are nouns that could either occupy the direct object slot (i.e., 切手 ‘stamps’, 人 ‘people’, ゴミ ‘rubbish, trash’) or the subject slot (i.e., コレクター ‘collector’).

In capturing the networks

association network maps. Ontations within connectionist models of the Japanese mental lexicon, like the

Japanese lemma-unit model (Joyce, 2002, 2004). However, there are also direct and interesting lexicographical and Japanese

language learning applications of the word association database and the lexical


association network maps. For instance, the inclusion of word association norms within the lexical entry would greatly enrich the variety of lexical information presented to the dictionary user. So, in addition to listing information relating to the entry word’s pronunciation, its definitions, inflectional or derivational forms, and idiomatic express ns, the dictionary could also provide word association information, which together with response frequency data, would assist in identifying for a given entry word the relative importance of different kinds of associative relations, such as synonyms, antonyms, and hyponyms, as well as common modifiers and complements, and various other relations.

upplementing electronic dictionaries with the word association normative data and the lexical association network maps could also support more user-friendly search functions (Zock & Bilac, 2004). In daily life, people regularly consult dictionaries to check the spelling of a known word (or the strokes of a kanji character) or to find out the mea

the tongue’ phenomenon (Brown & McNeill, 1966). Typically in th desired word from mental lexicon, they are often able to provide information about some of the word’s form features, such as its initial and/or final letters, or part of its pronunciation. Unfortunately, these are not the kinds of inform t be ed effectively to search ctronic es. Ho , in addition t g me of the target word’s form char ics, the i idual will also be aware of rious kinds of semantic information related to th get word, such as the assoc s the word has with other w In contrast to partial inf ab features, which is too impr any retrieval, known semantic information is potentially much If ssociation normative data and the lexical

network maps were incorporated into electronic dictionaries, then inputting ted word (either single or multiple word entry) could provide access to the

relevan

g is to instill in the learner lexical knowle

system is an electronic study notebook that would build into a personalized dictionary

io

S

ning of a newly encountered word. However, electronic dictionaries, and even thesauri organized primarily according to synonyms, are of little help to the user experiencing the common ‘tip of

is situation, while the individual is unable to retrieve the their

ation hat can usele dictionari wever o knowin so

acterist ndiv vae tar iative relation

ords. ormation out formecise to be ofmore useful.

value for the word a

associationan associa

t lexical association network map, from which the user could follow appropriate association links until the target word is identified.

Given the importance of contextual relevance for second language learning, the lexical associative networks also open up extremely effective learning strategies for Japanese language education. Memory researchers have long demonstrated that the categorization and semantic organization of stimulus materials have dramatic effects on retrieval performance (e.g., Bower, Clark, Winzenz, & Lesgold, 1969). Accordingly, the lexical association network maps for basic Japanese vocabulary could be a very useful resource in the context of teaching Japanese as a foreign language. To the extent that the goal of second language teachin

dge for the target language that approaches that of the native speaker, the lexical association network maps, depicting sets of associatively-related words based on the free word association responses of native Japanese speakers, represent a kind of authentic study material and a value reference source for the second language learner in judging the naturalness of lexical combinations.

In pursuing these Japanese lexicographical and language learning applications of the database of word association norms and the lexical association network maps, this research project is also working to create a comprehensive kanji database and integrated kanji instruction system. The key concept behind the integrated kanji instruction

119

ASIALEX 2005

by drawing on the database reference source through various learning assignments, ranging from basic ‘look-up tasks’ to more advanced ‘expansion tasks’ that would help the lea

This research project is part of the 21st Century COE Program ‘Framework for Systematization and Application of Large-scale Knowledge Resources’ (program leader; Professor Sadaoki Furui) at the Tokyo Institute of Technology, Japan.

References Amano, Shigeaki, and Kondō, Tadahisa (Eds.), (1999), Nihongo no goitokusei [Lexical properties of

Japanese] Vols. 1-6, NTT database series. Tokyo: Sanseidō. Bower, G. H., Clark, M. C., Winzenz, D., and Lesgold, A. (1969), ‘Hierarchical retrieval schemes in recall

of categorized word lists’, Journal of Verbal Learning and Verbal Behavior, 8, 323-343. Brown, R., and McNeill, D., (1966), The ‘tip of the tongue’ phenomenon. Journal of Verbal Learning and

Verbal Behavior, 5, 325-337. Collins, Allan. M., and Loftus, Elizabeth. F. (1975), ‘A spreading-activation theory of semantic

processing’, Psychological Review, 82, 407-428. Cramer, Phebe, (1968), Word association. New York & London: Academic Press. Fellbaum, Christiane (Ed.), (1998), WordNet: An electronic lexical database, Cambridge: MIT Press. Ikehara, Satoru, Miyazaki, Masahiro, Shirai, Satoshi, Yokoo, Akio, Nakaiwa, Hiromi, Ogura, Kentaro,

Ōyama, Yoshifumi, and Hayashi, Yoshihiko, (1999), Nihongo goi taikei [Goi-taikei-A Japanese lexicon] (CD-Rom), Tokyo: Iwanami Shoten.

Ishizaki, Shun, (2004), Rensō gainen jisho Version 1.0 CD. Joyce, Terry, (2002), ‘Constituent-morpheme priming: Implications from the morphology of two-kanji

compound words’, Japanese Psychological Research, Blackwell: Japan, pp. 79-90. Joyce, Terry, (2004), ‘M cal, orthographic and

phonological conside ychological Research, Vol

me no kihon goi chōsa, Tokyo: Shuei n.

las L., and McEvoy, Cathy L., (2003), ‘Implicitly activated memories: The missing links of rem

anseidō. Tamamura, Fum -28. Umemoto, T., (1 okyo Daigaku

pankai. Yo ōich o aki, H d Long ), Sh shi media

un CD-R yoru kanj ō [Elec nic newspape edia kanji: Kanji y list d on Asahi aper CD-R okyo: Sanseidō.

Zo l, an aven, , Word lo he basi associations rom an idea to a OL G2004 Work on Enhancin using elec nic dictionarie August, Geneva.

rner develop a deeper understanding of kanji structure and the morphological principles of compound words.

Acknowledgements

odeling the Japanese mental lexicon: Morphologirations’, In Serge P. Shohov (Ed.), Advances in Ps

ume 31, (pp. 27-61). Hauppauge, NY: Nova Science. Moss, Helen, and Older, Lianne, (1996), Birkbeck word association norms, Hove: Psychological Press. National Language Research Institute, (1984), Nihongo kyōiku no ta

ShuppaNelson, Doug

embering’, Fourth Tsukuba International Conference of Memory (Human learning and memory: Advances in theory and application), 11-13 January, Epochal Congress Center, Tsukuba, Japan.

Nelson, Douglas L., McEvoy, Cathy L., and Schreiber, Thomas A., (1998), The University of South Florida word association, rhyme, and word fragment norms, http://www.usf.edu/FreeAssociation.

Sanseidō Henshūjo, (1991), Atarashii kokugo hyōki handobukku (Dai yonhan), Tokyo: Sio, (2003), ‘Chūkyūyō goi: Kihon 4000 go’, Nihongo Kyōiku, Tokyo, pp. 5969), Rensō kijunhyō: Daigakusei 1000 nin no jiyū rensō ni yoru, Tokyo: T

Shupkoyama, Sh

nji: Asai., Sasa

nbhara, Hir yuki, Noz ironari, an

y. Eric, (1998 inbun den

no kafrequ

hi Shis base

OM niNewsp

i hindohOM]. T

tro r menc

ck, Michae d Bilac, Sl (2004) okup on t s of : Froadmap, C IN shop g and tro s,

120

Lexical association network maps for basic Japanese vocabulary.

Documents