
Journal of Culture and Information Science, 2(1), 1-15. (March 2007)

Research Article

Using Simple Computational Linguistic Techniques for Teaching Collocations

Tomonori Nagano and Kenji Kitao

This article examines possible applications of collocation extraction techniques to second/foreign language (especially English) instruction. We will employ four simple collocation extraction measures – t-statistic, chi-square, Mutual Information, and log likelihood – and demonstrate how those collocation measures help language teachers identify important collocations in authentic L2 reading. We will also examine several typical collocation-related mistakes by Japanese-speaking English language learners. We suggest that some collocation errors can be explained by the influence of learners’ native language. With this background in mind, we developed two pilot programs (automatic collocation exercise generation and automatic collocation error detection) using the aforementioned four collocation measures.

1. Introduction

In this paper, we will discuss the notion of collocations and its possible contribution to language education (particularly focusing on English-language instruction in Japan).

In the field of second language acquisition and English as a Second/Foreign Language (ESL/EFL) education, little attention has been paid to collocations compared with other domains of language, such as vocabulary, grammar, and phonetics/phonology. It is, however, widely acknowledged that learning collocations is a challenge for ESL/EFL learners. There is often no apparent reason why one collocation is better than another, but still substituting a synonym for a component word (in this paper, we call each word that makes up a collocation a component word) in a collocation may result in an ill-formed phrase. Teaching collocations to non-native speakers is also a challenge because few teaching resources focus on teaching collocations.

On the other hand, collocations have received a considerable amount of attention in computational linguistics, especially since the early 1990s. A fair amount of research has been conducted in various domains of computational linguistics, some of which has taken advantage of the linguistically idiosyncratic nature of collocations, that is, their non-compositionality and non-substitutability (which will be explained in the following section).

The goal of this article is to address the gap between the two fields and to consider possible applications of computational approaches to the teaching of collocations for ESL/EFL learners. We are especially interested in the application of techniques developed in computational linguistics to second/foreign language pedagogy. In the second half of the article, we will demonstrate that very simple computational linguistic techniques can make a considerable contribution to second/foreign language education.

2. Definition of collocations

In spite of the familiarity of the term collocation, its definition is rarely discussed in the second/foreign language education and pedagogy literature. Language educators often have different definitions of collocations, and some use the term collocation as a synonym of idioms or phrasal verbs. Thus, the treatment of collocations varies among language teachers, and there has not been any agreed-upon definition of collocations in the literature.

There is, however, a clear intuitive distinction between types of collocations. Collocations are subject to two broad kinds of constraints – part-of-speech and lexical constraints. The part-of-speech constraint is a grammatical restriction on collocations. For instance, the combination of adjective + noun (e.g., powerful computer) or verb + preposition (e.g., pitch in) is extremely frequent, but there are few collocations consisting of adverb + verb (e.g., fiercely fight) or (underivational) noun + preposition (e.g., bag of).¹ Thus, it can be said that the well-formedness of collocations is to some extent restricted by specific part-of-speech sequences.

The other kind of constraint, the lexical constraint, is, on the other hand, independent of the grammatical constraint. A lexical collocation is lexically specific, and each individual word plays a significant role in forming a collocation. The lexical constraint is used to explain the contrast between two collocations that are comparable in terms of part of speech, such as powerful computer and *strong computer (both are in the adjective + noun sequence, but one phrase is far better than the other).

It is unfortunate that such characteristics of collocations have rarely gained attention in the language classroom. While most language teachers are aware of the importance of collocations and know some apparent properties of collocations, the importance of collocations is at best emphasized as part of vocabulary learning, and the collocation is rarely the topic of the language lesson.

In contrast with second/foreign language acquisition and language education, there is rich research on collocations in lexicography and computational linguistics. For example, the BBI Dictionary by Benson, Benson, and Ilson (1997) categorizes collocations into syntactic collocations (e.g., prepositional phrases, the verb + complement phrase combination, etc.; equivalent to the part-of-speech constraint discussed above) and lexical collocations (e.g., adjective + noun, verb + adverbial phrase, etc.; equivalent to the lexical constraint). Both categories are analyzed in depth, and several sub-categories are proposed for both kinds of collocations. Benson, Benson, and Ilson also suggest that second/foreign language speakers typically have problems with lexical collocations. Therefore, we will chiefly discuss lexical collocations in this study.

Benson (1989) proposes a functional definition of collocations and attempts to define collocations by their unique functional properties. We adopt Benson’s functional definition of collocations that, in effect, makes the term collocation an umbrella term that includes a wide variety of co-occurrence phrases known as idioms, fixed combinations, prepositional phrases, etc. We prefer Benson’s definition because it is independent of the traditional collocation terms and less prone to cause conceptual misunderstanding due to the biases of each individual language teacher.

Benson’s definition consists of the following three functional properties of collocations.

• Non-compositionality
  A collocation typically generates extra semantic information that is not available from the individual words that make up the collocation. For example, the exact meaning of the expression to follow the instructions to the letter (which means “follow the instructions exactly”) is not predictable from the meanings of the component words.²

• Non-substitutability
  It is not possible to substitute synonyms for component words in a collocation. Phrases like *high building (rather than tall building), *perpetrate suicide (rather than commit suicide) and *make an estimation (rather than make a guess) are awkward for this reason.

• Non-modifiability
  Collocations are not easily modified with additional lexical modifiers such as adjectives and adverbs. For example, ??I have annoying butterflies in my stomach sounds odd because the modifier annoying has been inserted into have butterflies in my stomach.

¹ We consider the derivational noun + preposition collocation to be a sub-type of the verb + preposition collocation. For instance, preparation for is structurally identical to prepare for in spite of the different parts of speech.

² Light verbs (e.g., make, get, have, etc.) are also characterized by non-compositionality, but we will put aside the distinction between collocations and light verb phrases.

In this paper, we will take the position that non-compositionality is the primary property of collocations. Since words that can collocate with each other are highly specific, extra semantic information (non-compositionality) can be generated only for a combination of limited kinds of lexical items. In this sense, the property of non-substitutability can be considered a by-product of non-compositionality. Although non-substitutability is thus subsumed under non-compositionality, it has enormous potential for second/foreign language education, as we will discuss below. Finally, we will assume that only strongly fixed collocations (e.g., idioms) have the property of non-modifiability. This is evident because some collocations are readily modifiable with an adverbial phrase or adjective. For example, it went without a major hitch (a modified collocation derived from without a hitch) is an acceptable phrase (in contrast with having annoying butterflies in my stomach).

3. Second/foreign language learners and collocation mistakes

As mentioned above, collocations have attracted very little attention in second/foreign language education, despite their crucial role in determining one’s fluency in second language (L2) production. For instance, (1) – (4) are typical sentences by beginner/intermediate L2 English speakers that show apparent characteristics of non-nativeness. (A better collocation is listed in parentheses after the sentence.)

(1) *There are many high buildings in Tokyo. (tall buildings)
(2) *I took a business journey to London. (business trip)
(3) *A stiff wind rustles the tree. (stiff breeze)
(4) *The wrestler faced a powerful challenge. (strong challenge)

Most native speakers of English and advanced learners of English as a second/foreign language will find the above sentences awkward. Such awkwardness is, however, rarely dealt with in English language education. We believe that the following factors help explain the underemphasis on collocation-related mistakes.

• Collocation-related mistakes are grammatical and meaningful
  The central problem in the teaching of collocations is the fact that collocation-related mistakes are often grammatical (with respect to traditional descriptive grammar), and they often do not obscure the intended meaning. In fact, sentences (1) – (4) are not ungrammatical, and their intended meanings are apparent. Although the better collocation is preferable from a communication perspective, it is hard for language teachers to justify why one phrase (e.g., a business trip) is better than the other (e.g., a business journey).

• No clear measure to compare collocations
  Another problem surrounding collocation instruction is that the selection of collocations is arbitrary and, in many cases, the correct choice depends on the speaker’s preference. For example, while (3) does not seem to be a correct collocation, it is hard to tell what the best alternative collocation is among phrases like stiff breeze, strong breeze, and strong wind. Even among native speakers, the judgments may not be consistent in such a case.

• Dictionaries are not helpful
  Crucially, traditional dictionaries do not help language learners learn collocations. The problem is that the number of possible collocations becomes so large that the traditional paper-based dictionary cannot include all useful collocations. Although dictionaries often list several sample sentences using the headword (which are very often good collocations), they by no means cover all collocations. It is easy to understand why traditional paper dictionaries are not suitable for collocations when we consider that the number of possible collocations grows combinatorially as the size of a learner’s vocabulary increases. For instance, it is possible to present a list of 8000 words to learn, but its possible collocations (which are derived from all possible combinations of 8000 words) are practically impossible to list.

• Frequency is sometimes not reliable
  The most common approach to detecting good collocations is to find frequent combinations of the target word. For instance, when we search for respectable person on Google, more than 120,000 hits are reported. On the other hand, respectable individual has only 8000 hits, which suggests that respectable person is a far more frequent word sequence than respectable individual. The question is, however, whether we can conclude that respectable person is a better collocation than respectable individual. Later in this article, we will argue that frequency is not as reliable as is often assumed. In fact, at least to us, respectable individual is as good a collocation as respectable person.

• Too many collocations to focus on in class
  In addition to the subtlety of collocation misuse, the volume of collocations makes it difficult to focus on them in the language classroom. There are so many collocations in reading materials that the instructor cannot cover them in the limited class time. In addition, very few study materials for collocations are available because it is extremely time-consuming (even for professional material developers and publishers) to detect collocations and collocation errors in language education materials.

In summary, the acquisition of collocations is a very important aspect of second/foreign language education. The collocation is, however, undervalued in classroom instruction due to its own nature as described above. In the following section, we will discuss the possible influence of the first language on the misuse of collocations.

4. Insights from Second Language Acquisition Research

Why is it so hard for non-native speakers to use collocations correctly? It is, of course, in part a matter of fluency – if ESL/EFL learners do not have sufficient vocabulary, they will have trouble using collocations – but the picture is not as simple as it may look.

First of all, collocations remain difficult for advanced ESL/EFL speakers. The misuse of collocations is still obvious in production by advanced ESL/EFL speakers and, in fact, bad collocations (along with accent) often appear as a subtle indication of the non-nativeness of near-native ESL/EFL speakers.

One of the obvious influences on the non-nativeness of L2 utterances is the influence of the speaker’s first language. The influence of L1 is termed language transfer and has been a major topic in second language acquisition (SLA) research. Generally speaking, language transfer research is concerned with what role the native language (L1) plays in the SLA process. While it is acknowledged that the L1 is not the sole factor in L2 learners’ errors, and some universal cognitive mechanism governs second/foreign language learning, it is generally accepted that the L1 plays a crucial role when considering whether an ESL/EFL learner will succeed in language learning. Many researchers argue that the lexical influence of L1 is far greater than the transfer of L1 grammar; thus, according to this view, L2 speakers have more difficulty with collocations due to the L1 influence (see Epstein, Flynn, and Martohardjono [1996] for a comprehensive review of language transfer issues). For instance, it is anecdotally supported (and probably true) that speakers of a Germanic language have an advantage over Japanese speakers in learning English as a second/foreign language.

One hypothesis of language transfer claims that only lexical items (vocabulary) transfer to L2, but not functional items (Vainikka & Young-Scholten, 1996). (A simple example is the case of Japanese speakers, who have no problem with the English word order SVO, in spite of the fact that Japanese has the SOV word order. See Flynn [1987] for the parameter resetting hypothesis concerning Japanese-speaking learners of ESL.) Odlin (1989) proposes that, in the process of lexical item transfer, L2 learners overextend the senses of L2 words due to the influence of L1. For instance, a Japanese learner of English may produce sentences such as:

• I’ve seen the tallest building in New York.
• ??I’ve seen the highest building in New York.


In Japanese, both tall and high are translated into the same word たかい (takai), and there is no sense distinction between tall and high as there is in English. Therefore, Japanese ESL/EFL learners often overextend the senses of tall and high and may produce an unconventional word sequence as above. It is important to note that the overextension of word senses can take place even if there is a one-to-one word correspondence. For example, the English word name has a direct translation in Japanese, なまえ (namae). なまえ in Japanese, however, lacks the sense of a well-known or notable person, which exists in English as in a big name. Thus, it is expected that Japanese ESL/EFL learners have difficulty in using phrases like his name is widely acknowledged.

The differences between two languages might look insignificant at the individual word level, but if we consider that our lexicon consists of a semantic network (as assumed in WordNet [Miller, Beckwith, Fellbaum, Gross, and Miller, 1993]), the lack or abundance of senses will result in a huge distortion of the whole semantic network for second/foreign language learners.

In the following section, we will present a brief survey of research on collocations in computational linguistics that sheds new light on the problems in the teaching of collocations in the language classroom.

5. Collocations in computational linguistics

The recent upsurge of collocation studies in computational linguistics has grown out of the proposal made by Church and Hanks (1989a; 1989b), who argued that semantic and syntactic word relationships are automatically computable from machine-readable corpora. Using an information-theoretic measure, Mutual Information (MI), Church and Hanks demonstrated that the association between words could be numerically computed with an electronic corpus of a reasonable size.

Following this tradition, several alternative measures have been proposed. Church and Hanks (1989a; 1989b) suggest the application of hypothesis testing (i.e., the t-test) to the extraction of collocations. Church and Mercer (1993) propose using non-parametric statistics, such as chi-square, instead of parametric measures. Dunning (1993) argues that these statistical methods unjustifiably violate the fundamental assumptions of statistical theory (e.g., independence in parametric statistics and skewed data in non-parametric statistics) and, instead, proposes the log-likelihood ratio as an alternative measure. In our study, we employed four basic association measures: t-test, chi-square, Mutual Information, and log likelihood. Further discussion of the statistical approach to collocation discovery can be found in the Appendix.

Next, we will briefly explain how these statistics apply to analyzing collocations.

When applied to collocation discovery, the t-test is taken to measure how unlikely it is that an observed word co-occurrence arose by chance. A high t-statistic is considered an indication of fixed placement of words and thus of a good collocation, whereas a low t-statistic suggests the words are scattered throughout the corpus (and therefore do not form a collocation).
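As a minimal sketch of this measure (not the implementation used in this study), the t-statistic for a bigram can be computed from corpus counts. The sketch assumes the formulation commonly given in the literature (e.g., Manning and Schütze, 1999), in which the observed bigram probability is compared with the probability expected under independence; the function name `t_score` and the example counts are hypothetical.

```python
import math


def t_score(c12: int, c1: int, c2: int, n: int) -> float:
    """t-statistic for a bigram (w1, w2).

    c12: count of the bigram; c1, c2: counts of w1 and w2;
    n: corpus size in tokens (assumed to approximate the number of bigrams).
    """
    if c12 <= 0:
        raise ValueError("the bigram must occur at least once")
    observed = c12 / n                      # sample mean: P(w1 w2)
    expected = (c1 / n) * (c2 / n)          # mean under independence: P(w1) * P(w2)
    variance = observed * (1 - observed)    # Bernoulli variance of the bigram indicator
    return (observed - expected) / math.sqrt(variance / n)


# Hypothetical counts: a bigram seen 8 times in a corpus of about 14.3 million tokens.
print(round(t_score(8, 15828, 4675, 14307668), 3))
```

Larger values indicate stronger evidence against the assumption that the two words co-occur only by chance.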

Chi-square is another statistical measure, but unlike the t-test, chi-square does not assume a normal distribution of the population. Since the distribution of words is highly constrained by grammar, the assumption of a normal distribution is undoubtedly violated.
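A minimal sketch of the chi-square measure for a bigram follows. It assumes the standard 2 x 2 contingency table of bigram counts (whether or not the first word is w1, whether or not the second word is w2) and the usual closed form of Pearson's statistic for such a table; the function name and counts are hypothetical, and the formulation in the Appendix may differ in detail.

```python
def chi_square(c12: int, c1: int, c2: int, n: int) -> float:
    """Pearson's chi-square for a bigram (w1, w2) from a 2 x 2 table.

    c12: count of the bigram; c1, c2: counts of w1 and w2; n: corpus size.
    """
    o11 = c12                  # w1 followed by w2
    o12 = c1 - c12             # w1 followed by a word other than w2
    o21 = c2 - c12             # a word other than w1 followed by w2
    o22 = n - c1 - c2 + c12    # neither w1 first nor w2 second
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denominator == 0:
        raise ValueError("a row or column of the table is empty")
    return n * (o11 * o22 - o12 * o21) ** 2 / denominator


# Hypothetical counts: a value of 0 means the words behave as if independent.
print(chi_square(10, 100, 100, 1000))
print(chi_square(50, 100, 100, 1000))
```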

Mutual Information (MI) is an information-theoretic measurement. The MI we adopted in our study is a very simple one; that is, the log of the ratio of a joint probability (actual frequency) to an independent probability (expected frequency). However, it has been pointed out that MI is not very reliable when the actual frequency of the collocation is lower than 10 (Manning and Schütze, 1999).
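The definition above translates almost directly into code. The following is a minimal sketch; the base-2 logarithm is an assumption (the text specifies only "log"), and the function name and counts are hypothetical.

```python
import math


def mutual_information(c12: int, c1: int, c2: int, n: int) -> float:
    """Log ratio of the joint probability to the probability under independence.

    c12: count of the bigram; c1, c2: counts of w1 and w2; n: corpus size.
    """
    if c12 <= 0:
        raise ValueError("MI is undefined for a bigram that never occurs")
    joint = c12 / n                     # actual (observed) probability
    independent = (c1 / n) * (c2 / n)   # expected probability under independence
    return math.log2(joint / independent)


# Hypothetical counts: the bigram occurs twice as often as chance predicts.
print(mutual_information(20, 100, 100, 1000))
```

A score near 0 means the bigram occurs about as often as chance predicts; given the reliability caveat above, scores for bigrams with frequencies below 10 should be read with caution.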

Finally, log likelihood is a measure to evaluate the degree of dependence between words in a collocation phrase. In computing log likelihood, two hypotheses of extreme cases are assumed. H1 assumes independence of word co-occurrence (thus, no chance of a collocation) and H2 assumes full dependence of words, which means that when one word appears the other word must appear in the context. Log likelihood is simply a degree of dependency between two words measured by the ratio between hypothesis 1 (independence) and hypothesis 2 (dependence).
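A minimal sketch of the log-likelihood ratio follows. It assumes the binomial formulation commonly attributed to Dunning (1993) and presented in Manning and Schütze (1999), in which the dependence hypothesis simply lets the probability of the second word differ depending on whether the first word precedes it; this may differ in detail from the formulation in the Appendix. All names and counts are hypothetical.

```python
import math


def _log_binomial(k: int, n: int, p: float) -> float:
    """Log of p**k * (1 - p)**(n - k), treating 0 * log(0) as 0."""
    result = 0.0
    if k > 0:
        result += k * math.log(p)
    if n - k > 0:
        result += (n - k) * math.log(1 - p)
    return result


def log_likelihood_ratio(c12: int, c1: int, c2: int, n: int) -> float:
    """-2 log(lambda) for a bigram (w1, w2); larger means stronger association."""
    consistent = (
        0 < c1 < n and 0 < c2 < n
        and 0 <= c12 <= min(c1, c2)
        and c2 - c12 <= n - c1
    )
    if not consistent:
        raise ValueError("counts are inconsistent")
    p = c2 / n                       # H1: P(w2) is the same after any word
    p1 = c12 / c1                    # H2: P(w2 | previous word is w1)
    p2 = (c2 - c12) / (n - c1)       # H2: P(w2 | previous word is not w1)
    log_lambda = (
        _log_binomial(c12, c1, p) + _log_binomial(c2 - c12, n - c1, p)
        - _log_binomial(c12, c1, p1) - _log_binomial(c2 - c12, n - c1, p2)
    )
    return -2 * log_lambda


# Hypothetical counts: the score grows as the bigram count moves away from chance.
print(log_likelihood_ratio(20, 100, 100, 1000))
print(log_likelihood_ratio(50, 100, 100, 1000))
```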

6. Collocation extraction for language instruction

In this section, we will demonstrate that collocation candidates are easily extracted from text by using raw frequencies and a large corpus. We will argue, however, that mere raw frequencies are not a reliable measure for determining the strength of collocations, and that the association measures discussed in the previous section are more reliable than raw frequency. To demonstrate how efficiently those collocation measures can extract collocations, we developed two pilot programs. For the pilot experiment, we used two corpora: the American National Corpus first release (ANC) (Ide, Reppen, and Suderman, 2002) and the Wall Street Journal (WSJ) collection from the ACL/DCI corpus. After deleting non-words (i.e., punctuation and non-ASCII symbols), the total number of tokens was 54 million words (10 million words from ANC and 44 million words from WSJ). The collocations are limited to 2-word sequences (bigrams) in this study.
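The counting step that underlies all four measures can be sketched as follows. This is a simplified illustration, not the preprocessing actually applied to the ANC and WSJ data: it assumes lowercasing and a tokenizer that keeps only runs of ASCII letters, and it counts bigrams across sentence boundaries. The function name is hypothetical.

```python
import re
from collections import Counter


def count_ngrams(text: str):
    """Return unigram counts, bigram counts, and the total number of tokens."""
    # Keep runs of ASCII letters only; punctuation and other symbols are dropped.
    tokens = re.findall(r"[a-z]+", text.lower())
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))   # adjacent word pairs
    return unigrams, bigrams, len(tokens)


unigrams, bigrams, total = count_ngrams("Strong coffee, black coffee; strong tea.")
print(total, unigrams["coffee"], bigrams[("black", "coffee")])
```

The three return values correspond to the word counts, the bigram count, and the corpus size that each association measure takes as input.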

6.1 Raw Frequencies and collocations

One of the most intuitive facts about collocations is that good collocations tend to appear more frequently than bad collocations or non-collocation phrases. This intuition is true to some extent – in a corpus, good collocations tend to have higher frequencies whereas non-collocation phrases do not appear at all or have very low frequencies.

There are several online tools that take advantage of this strong correlation between collocations and frequencies. For example, VIEW: Variation in English Words and Phrases by Mark Davies (Davies, 2006) lists frequencies of bigrams (two-word phrases) from the 100-million-word British National Corpus. VIEW has a powerful search function that enables the user to list phrases in a certain syntactic context (e.g., only adjective + noun phrases) and a keyword-in-context (KWIC) function that can show exactly in what contexts the collocation is used.

VIEW is a useful tool for discovering good collocations. For example, if the user wants to know what adjectives can form good collocations with the word coffee, he/she can get a list of all frequent phrases that match the “adjective + coffee” context. A sample output for this search condition on VIEW is listed below.

The results include many phrases that we intuitively judge as good collocations. For example, black coffee fulfills our definition of collocations – first, its semantic interpretation (coffee without sugar and milk) is different from its literal meaning (black-color coffee), meeting the non-compositionality condition. It also meets the non-substitutability condition because black cannot be replaced with its synonym; for instance, *inky coffee and *dusky coffee do not mean black coffee. We believe some other collocations in the list (e.g., strong coffee) also meet our definition of collocations.

Therefore, we think tools like VIEW are quite useful for detecting collocations. However, we also think they are not the optimal approach to collocation detection. While we find quite a few collocations in the results of VIEW, we also find a lot of non-collocation phrases (e.g., hot coffee, cold coffee, empty coffee, real coffee, good coffee, and milky coffee).³ (The bigram empty coffee only occurs in phrases like empty coffee cups.) In fact, a frequency-based collocation list often includes a lot of non-collocation phrases in its output. This is because frequency is not an absolute measure but a relative measure. Thus, the high frequency of black coffee (97) and the relatively low frequency of strong coffee (19) do not directly indicate the strength of the collocations; rather, they are mostly accounted for by the different frequencies of black and strong.

³ As mentioned above, collocations are defined as non-compositional phrases in our paper. By using such a semantic judgment, we intend to prevent the influence of individual preferences for collocations.
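The point about relative frequency can be illustrated by comparing each observed bigram frequency with the frequency expected if the two words were independent. In the sketch below, only the bigram frequencies (97 and 19) and the corpus size come from the text; the unigram counts for black, strong, and coffee are hypothetical values chosen for illustration, not British National Corpus figures.

```python
N = 100_000_000          # corpus size stated in the text (100 million words)
COFFEE = 6_000           # hypothetical count of "coffee"
ADJECTIVES = {
    # adjective: (hypothetical unigram count, bigram frequency from the text)
    "black": (20_000, 97),
    "strong": (4_000, 19),
}


def observed_to_expected(c12: int, c1: int, c2: int, n: int) -> float:
    """Ratio of the observed bigram count to the count expected by chance."""
    expected = c1 * c2 / n
    return c12 / expected


for adjective, (count, bigram_count) in ADJECTIVES.items():
    ratio = observed_to_expected(bigram_count, count, COFFEE, N)
    print(f"{adjective} coffee: frequency {bigram_count}, observed/expected {ratio:.1f}")
```

Under these assumed counts, the fivefold gap in raw frequency largely disappears once the frequency of each adjective is taken into account.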

Given that result, statistical association measures are considered a far better means of determining the strength of collocations (Manning & Schütze, 1999; and many others). To test this claim, we computed the association measures for each of the phrases in Table 1.

The results in Table 2 clearly show that the association measures produce a different order of collocation phrases, one that was not captured by a frequency-based model like VIEW. For instance, in Table 2, phrases like decaffeinated coffee and instant coffee are ranked higher than other high-frequency phrases.

Thus, we conclude that frequency-based collocation extraction is not the only approach; the association measures clearly offer an alternative. In the following sections, we will present further analyses of the collocation association measures.

6.2 Collocation candidates

The first set of collocation candidates is given in Table 1. Those whose native language is not English (or even native speakers of English) are encouraged to try to rank those collocations before reading the results.

The results are sorted in ascending order of t-statistic and MI. There are several interesting facts in the results.


First, it appears that different collocation measures rank collocations in different manners – within our data and examples, t-test and log likelihood seem to be sensitive to raw frequencies of collocations (although this does not mean that the ranking of collocations in those measures is solely determined by the raw frequency, since with other sample sets, those measures ranked less frequent words higher than more frequent words) whereas chi-square and MI are not as dependent on frequencies as t-test and log likelihood. Second, with the exception of strong computer, very few bad collocations appear in our corpus (i.e., most have a frequency of 0). Since our corpus is moderately large (54 million words), it might be the case that mere raw frequencies can eliminate the bad collocations. In other words, if the frequency of a collocation candidate is 0, it can be concluded that the collocation is likely to be a bad or misused one. However, it should be pointed out that all four measures successfully distinguished strong computer from other collocations. This suggests that bad collocations do appear sometimes (strong computer probably appeared in a context such as strong computer skills, in which computer is inserted into the collocation strong skills). Thus, frequency-based collocation detection may work in most cases, but it will fail to exclude bad collocations that appear in the corpus by chance.

The additional data (as given in Table 6) support our analyses.

The results for data set 2 also suggest that mere raw frequency may not be a good indicator for collocations. The results show that some low-frequency collocations (e.g., compulsive gambler and oppressive heat) are ranked high, indicating that these measures evaluate the well-formedness of collocations independently of the frequency of occurrence. It is important to point out that those collocations are as good as some high-frequency collocations such as personal relationships.

To summarize, the four collocation measures appear to be effective in detecting correct collocations. They are generally better indicators than raw frequency, which is otherwise often considered the sole determinant of a collocation.

Given the findings above, in the following section we will propose some possible applications of the collocation extraction methods.

7. Applications of collocation detection measures

7.1 Automatic generation of collocation exercises

One of the obvious applications of collocation extraction to second/foreign language education is to automatically generate exercises on collocations.

As suggested above, collocations will be ill-formed when a component word is replaced with its synonym (non-substitutability). For instance, *business journey is not a good collocation; business trip is preferred. Such ill-formed collocations are extremely difficult for non-native speakers of English to detect. In spite of the obvious need for exercises on collocations, very few instructional resources are available on the market. As discussed above, this is because making exercises on collocations is difficult due to the lack of clear-cut measures for collocations.

We argue that computational linguistic techniques for collocations may help solve this problem. As described above, collocation measures such as the t-test can assign numeric values to collocations and rank them in a certain order. While the ranking varies among collocation measures, it seems clear that most collocation measures can successfully distinguish ill-formed collocations from better ones. In addition, high-spec computers, which are ubiquitously available now, can compute collocation measures very rapidly. Thus a computational approach to collocation exercises is not only possible but also an optimal approach to developing materials on collocations.

Keeping this in mind, we developed a pilot program that automatically generates multiple-choice exercises on collocations. The program generates information to develop traditional 4- to 6-item multiple-choice questions in which all of the items (answer and distracters) share the same lexical item that collocates with other words. The distracters use synonyms of the correct collocation, since we assume that L2 speakers will have trouble with those synonym collocations due to L1 transfer.

We employed WordNet (Miller, Beckwith, Fellbaum, Gross, and Miller, 1993) to list synonyms of target words. WordNet is an electronic dictionary in which word meanings are hierarchically structured. Unlike traditional dictionaries, the headwords are word senses (meanings) rather than lexical forms. (Thus, for example, the lexically identical word bank has several entries, including a financial institution and sloping land, especially alongside a body of water.) Our pilot program extracts a word’s synonym set (called a synset in WordNet) and its immediate hyponym set (subordinate words) and hypernym set (higher-order words). Those synonyms replace the component words in collocations to make ill-formed collocations (which are used as distracters in collocation exercises).

The outline of this program is as follows:

• The program extracts the synonym set for each component word in a collocation. (Since the window of words is limited to 2 in this study, only two-word collocations are considered.) In the case of business trip, synonym sets for both business and trip are collected.

• A component word is replaced with its synonym, forming a new collocation. (Journey is in the synonym set of trip; therefore, journey replaces trip and forms a new phrase business journey. Note that this process repeats as many times as the number of synonyms.)

• The new collocation is evaluated with the collocation measures (that is, the association measures for business journey are computed). Depending on the value of the association measure, the new collocation is classified either as a “good collocation (correct answer)” or a “bad collocation (distracter)”.

• The list of good and bad collocations is produced.
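The outline above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pilot program itself: the corpus counts, the synonym table (a stand-in for WordNet synsets), the use of Mutual Information as the single measure, and the threshold of 3.0 are all hypothetical.

```python
import math

# Hypothetical toy data standing in for corpus counts and WordNet synonym sets.
N = 1_000_000
UNIGRAMS = {"business": 5000, "trip": 2000, "journey": 1500, "voyage": 300}
BIGRAMS = {("business", "trip"): 400, ("business", "journey"): 2}
SYNONYMS = {"business": [], "trip": ["journey", "voyage"]}


def pmi(w1: str, w2: str) -> float:
    """Mutual Information of a bigram; minus infinity if it never occurs."""
    c12 = BIGRAMS.get((w1, w2), 0)
    if c12 == 0:
        return float("-inf")
    return math.log2(c12 * N / (UNIGRAMS[w1] * UNIGRAMS[w2]))


def candidates(w1: str, w2: str):
    """The original bigram plus every single-word synonym substitution."""
    result = [(w1, w2)]
    result += [(s, w2) for s in SYNONYMS.get(w1, [])]
    result += [(w1, s) for s in SYNONYMS.get(w2, [])]
    return result


def classify(w1: str, w2: str, threshold: float = 3.0):
    """Split candidates into good collocations (answers) and bad ones (distracters)."""
    good, bad = [], []
    for a, b in candidates(w1, w2):
        (good if pmi(a, b) >= threshold else bad).append(f"{a} {b}")
    return good, bad


good, bad = classify("business", "trip")
print("answers:", good)
print("distracters:", bad)
```

In this sketch a zero-frequency candidate is always classified as a distracter, which is a deliberate simplification.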

A few sets of sample results are listed below. The ill-formed collocations are marked with an asterisk and the questionable collocation is marked with ??.

We believe the output is extremely useful in developing materials. If the program can automatically generate collocation exercises, language teachers can use collocation exercises that are extracted from the reading materials for their classes. Such exercises would be impossible (due to time and resource constraints) without the help of the computer program.

We have a few caveats, however. Teachers need to edit the results before using them in the classroom. First, not all the exercises exhibit the same level of difficulty. Some exercises contain extremely unlikely (or nonsense) items that need to be removed by manual check. In some cases, questions have only very unlikely distracters (e.g., job journey and business sail), resulting in an extremely easy question. On the other hand, some questions are very difficult because they have several good distracters (e.g., business journey). Second, when a distracter’s frequency in the corpus is zero, it may produce a false negative. Although our program can tell whether a collocation with non-zero frequency is bad (based on the value of collocation measures), it may rate a collocation with zero frequency as being bad when, in fact, it is not. For example, indisputable case, indisputable truth, indisputable reason, and indisputable point are all good collocations. In other words, there is a chance that a good collocation that just did not appear in our sample corpus could be judged as a bad collocation. Thus, the classroom instructor needs to check each distracter before using it in his/her classroom.

We hope that these problems will be solved in the future as we improve our program.

7.2 Automatic collocation error detection

Another possible application of the collocation detection technique is automatic collocation error detection. Using several collocation association measures and large corpora, it may be possible to detect bad collocations in the writing of learners of English as a second/foreign language.

In the simplest case, all bigrams (two-word sequences) that do not appear in the corpus data can be considered misused collocations. As mentioned in the previous section, however, zero-frequency collocations are not always bad collocations. Some good collocations may not appear in a particular corpus merely because of the size of the corpus. To handle such cases, we employed an extra assessment step that identifies bad collocations and tries to find replacements for them.
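The simplest case described above, flagging every bigram that never occurs in the reference corpus, might look like the following Python sketch. The one-sentence reference text is a hypothetical stand-in for a large corpus, and whitespace tokenization is an assumption made for brevity.

```python
from collections import Counter


def bigrams(tokens):
    """Return all adjacent two-word sequences in a token list."""
    return list(zip(tokens, tokens[1:]))


# Hypothetical toy reference corpus (a real run would use millions of tokens).
reference_tokens = ("he went on a business trip and "
                    "the business trip was long").split()
reference_counts = Counter(bigrams(reference_tokens))


def unseen_bigrams(text):
    """List the bigrams of a learner text that have zero corpus frequency."""
    tokens = text.lower().split()
    return [bg for bg in bigrams(tokens) if reference_counts[bg] == 0]


print(unseen_bigrams("a business journey"))
print(unseen_bigrams("on a business trip"))
```

The first call flags only (business, journey); the second flags nothing. Because a small corpus leaves many acceptable bigrams unseen, this check is only a first filter.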

As stated above, collocation errors by L2 learners are often due to L1 transfer, and most misused collocations are semantically equivalent to the correct collocation (e.g., business trip vs. business journey). Thus, we postulated that the existence of a good collocation that is semantically equivalent to the misused collocation is a very strong sign of a collocation mistake. In other words, if our program detects a good collocation candidate (e.g., business trip) for a bad collocation (e.g., business journey) in the L2 writing, the bad collocation is most likely a collocation mistake due to L1 transfer.

Based on this logic, we developed another program that extracts all bad bigrams (those that have either zero frequency or a low collocation measure) from the input (i.e., L2 writing) and searches for a better collocation candidate. The program replaces each word in a collocation with its synonyms and re-computes the collocation measures. The program identifies a bad collocation if the synonym collocation has a high collocation value (thus, it is most likely a misused collocation).

The outline of this program is as follows:

• Detecting misused collocations: The association measures of the input collocation are computed.

• If the association measure indicates that the collocation is ill-formed, synonyms of each component word in the collocation are extracted.

• Each component word in the ill-formed collocation is replaced with its synonyms.

• If any of the combinations bears a high association value, it is listed as a possible correct collocation.
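The outline above can be illustrated with a short Python sketch. It is not the Perl implementation behind the pilot program: the counts, the synonym table (standing in for WordNet), the thresholds, and the function names are hypothetical, and Mutual Information is the only association measure used.

```python
import math
from collections import Counter

# Hypothetical toy counts standing in for the reference corpora.
UNIGRAMS = Counter({"business": 500, "trip": 300, "journey": 200, "job": 600})
BIGRAMS = Counter({("business", "trip"): 120, ("business", "journey"): 1})
N_TOKENS = 1_000_000

# Hypothetical synonym sets standing in for WordNet.
SYNONYMS = {"journey": ["trip"], "trip": ["journey"]}

MIN_COUNT = 5   # hypothetical frequency cut-off
MIN_MI = 3.0    # hypothetical Mutual Information cut-off


def pmi(w1, w2):
    """Pointwise Mutual Information of a bigram (in bits)."""
    joint = BIGRAMS[(w1, w2)]
    if joint == 0:
        return float("-inf")
    p_indep = (UNIGRAMS[w1] / N_TOKENS) * (UNIGRAMS[w2] / N_TOKENS)
    return math.log2((joint / N_TOKENS) / p_indep)


def is_well_formed(w1, w2):
    """Step 1: decide whether the input bigram looks like a good collocation."""
    return BIGRAMS[(w1, w2)] >= MIN_COUNT and pmi(w1, w2) >= MIN_MI


def suggest_corrections(w1, w2):
    """Steps 2-4: substitute synonyms and keep high-scoring combinations."""
    if is_well_formed(w1, w2):
        return []  # nothing to correct
    candidates = [(s, w2) for s in SYNONYMS.get(w1, [])]
    candidates += [(w1, s) for s in SYNONYMS.get(w2, [])]
    return [c for c in candidates if is_well_formed(*c)]


print(suggest_corrections("business", "journey"))  # a replacement is found
print(suggest_corrections("job", "journey"))       # no replacement is found
```

With these toy values, business journey yields the suggestion business trip, whereas job journey yields no suggestion because no synonym combination is attested in the toy corpus.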

Although the program does not frequently provide good alternative collocations, when it does, its decision on ill-formed collocations seems somewhat reliable. In most cases, the program cannot find a better synonymous collocation. We assume that this is because of the limitations of our sample corpus and the limited number of synonyms in WordNet.

We believe that, like the collocation exercise generation program, the basic logic of this program is effective, and that better engineering will improve the usefulness of the program.

8. Demonstration Websites

For those who are interested in trying out our pilot programs, we have made them available at the URLs below. We also list online the Perl scripts that are used in the programs.

• Collocation (error) detection program http://www.slacorpus.com/programs/jpn.html

The collocations or collocation errors are listed when the original text is entered in the textbox and sent to our server. This program uses the same corpora as our pilot study (the 10-million-word ANC and the 44-million-word WSJ corpus).

• Collocation exercise generation program http://www.slacorpus.com/programs/jpn.html

When a collocation is sent to the program, distracters for an exercise are automatically generated.

Fig1. Collocation exercise generation program (http://www.slacorpus.com/programs/jpn.html)


• Introduction to Perl http://www.slacorpus.com/programs/introPerl.html

Basic Perl scripts are listed on this page. The list includes modules used for programming the two pilot programs in this study, such as modules to compute the t-statistic, chi-square, MI, and log-likelihood of bigrams.

9. Conclusion

In this paper, we argued that collocations pose a huge problem for ESL/EFL learners, but instructional materials are crucially lacking in this area. We discussed several underlying problems that make it difficult for language teachers to focus on collocations in the classroom. We proposed that language corpora could be a solution to those problems in the teaching of collocations. Raw frequency can successfully extract good collocations, but, better yet, we presented several collocation measures that have been developed in computational linguistics over the last two decades. Finally, we presented two pilot programs that are potentially useful for second/foreign language instruction.

Needless to say, our pilot programs are at too early a stage to draw definitive conclusions. The analyses of the outputs of our pilot programs, however, seem to be very promising, given that even very simple programs produced interesting results. We hope that engineering innovations will help improve on the concept of our pilot programs and will achieve results that can be used in the language classroom.

Acknowledgement

The authors would like to express their appreciation to Dr. S. Kathleen Kitao, who read this manuscript and made valuable comments.

References

Benson, M. (1989). The structure of the collocational dictionary. International Journal of Lexicography, 2, 1–14.

Fig2. Introduction to Perl (http://www.slacorpus.com/programs/introPerl.html)


Benson, M., Benson, E., and Ilson, R. (1986). The BBI combinatory dictionary of English: A guide to word combinations. Amsterdam, Netherlands: John Benjamins.

Church, K. W., and Hanks, P. (1989a). Word association norms, mutual information, and lexicography. In The 27th annual conference of the association for computational linguistics, 76–83.

Church, K. W., and Hanks, P. (1989b). Word association norms, mutual information and lexicography (rev). Computational Linguistics, 16(1), 22–29.

Church, K. W., and Mercer, R. L. (1993). Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics, 19, 1–24.

Davis, M. (2006). VIEW: Variation in English Words and Phrases. Visited on October 22, 2006. http://view.byu.edu/

Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19, 61–74.

Epstein, S. D., Flynn, S., and Martohardjono, G. (1996). Second language acquisition: Theoretical and experimental issues in contemporary research. Behavioral and Brain Science, 19(4), 677–758.

Flynn, S. (1987). A Parameter-Setting Model of L2 Acquisition. Studies in theoretical psycholinguistics. Norwell, MA: D. Reidel Publishing Company.

Ide, N., Reppen, R., and Suderman, K. (2002). The American national corpus: More than the web can provide. In The Third Language Resources and Evaluation Conference (LREC), 839–844.

Manning, C. D., and Schutze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.

Miller, G., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K. (1993). Introduction to WordNet: An on-line lexical database. Cambridge: MIT Press.

Odlin, T. (1989). Language transfer: Cross-linguistic influence in language learning. Cambridge; New York: Cambridge University Press.

Vainikka, V., and Young-Scholten, M. (1996). Gradual development of L2 phrase structure. Second Language Research, 12, 7–39.


Appendix

t-test

The t-test (a.k.a. Student’s t-test) is a robust statistic used for hypothesis testing. The t-test produces a statistic value called the t-statistic by looking at the mean x̄ and variance s² of a sample, and it evaluates the null hypothesis (H0) that the sample is drawn from a distribution with mean μ. The standard formula for the t-test is

$$t = \frac{\bar{x} - \mu}{\sqrt{s^{2}/N}} \qquad (1)$$

where N is the sample size.

When applied to collocation discovery, the t-test is assumed to measure how far word co-occurrences are above chance. A high t-statistic is considered an indication of fixed placement of words and thus a more likely good collocation, whereas a low t-statistic suggests that the words are scattered throughout the corpus (and therefore do not form collocations). In our study, we employed the computation of the t-statistic proposed by Manning and Schutze (1999), presented as Equation (2).

In Equation (2), Ofreq is the observed frequency of n-grams, Efreq is the expected frequency (the product of the probabilities of the individual words), s² is the binomial variance (i.e., p(1-p)), and N is the number of tokens in a corpus.
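As an illustration of the definitions above, the following Python sketch computes a bigram t-statistic. It assumes the formulation t = (x̄ - μ) / sqrt(s²/N), with x̄ the observed bigram probability, μ the product of the two word probabilities, and s² = x̄(1 - x̄); the function name and the example counts are hypothetical, and the exact form of Equation (2) may differ in detail.

```python
import math


def t_score(c1, c2, c12, n):
    """t-statistic of a bigram.

    c1, c2: corpus frequencies of the first and second word
    c12:    corpus frequency of the bigram
    n:      number of tokens in the corpus
    """
    if c12 == 0:
        # The variance estimate is zero for unseen bigrams; treating them
        # as the lowest possible score is an assumption of this sketch.
        return float("-inf")
    x_bar = c12 / n               # observed bigram probability
    mu = (c1 / n) * (c2 / n)      # expected probability under independence
    s2 = x_bar * (1 - x_bar)      # binomial variance p(1 - p)
    return (x_bar - mu) / math.sqrt(s2 / n)


# Illustrative counts: a bigram seen 8 times in about 14 million tokens,
# built from two frequent words.
print(round(t_score(15828, 4675, 8, 14307668), 4))
```

With these counts the score comes out just under 1, that is, the bigram occurs only slightly more often than chance would predict.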

chi-square

The application of the t-test to collocation discovery is common, but it is a theoretical nightmare because the underlying assumptions are indisputably incorrect (e.g., the distribution of words in a corpus is not random). Given this theoretical flaw of the t-test, some researchers propose using a non-parametric statistic such as chi-square. We will not go into the details of the application of chi-square in this paper, but interested readers may refer to Manning and Schutze (1999) and Dunning (1993).

The computation formula for the chi-square statistic that we employed in this study is given as Equation (3).

In Equation (3), Ow1w2 is the observed frequency of word1 and word2 in a collocation, Ew1w2 is the expected frequency of the collocation words, O¬w1¬w2 is the frequency of bigrams that do not include any words in the target collocation, and N is the number of tokens in a corpus.
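A minimal Python sketch of a chi-square computation for a bigram follows. It assumes the usual Pearson statistic over the 2×2 contingency table of observed and expected frequencies, which may differ in notation from Equation (3); the function name and the example counts are hypothetical, and both words are assumed to occur at least once but not in every position.

```python
def chi_square(c1, c2, c12, n):
    """Pearson chi-square for a bigram from its 2x2 contingency table.

    c1, c2: corpus frequencies of word1 and word2
    c12:    corpus frequency of the bigram word1 word2
    n:      number of tokens (taken as the number of bigrams)
    """
    observed = [
        [c12, c1 - c12],                # word1 + word2, word1 + other
        [c2 - c12, n - c1 - c2 + c12],  # other + word2, other + other
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Illustrative counts: a bigram seen 8 times in about 14 million tokens.
print(round(chi_square(15828, 4675, 8, 14307668), 2))
```

For these counts the statistic is about 1.55, below the 3.84 critical value for one degree of freedom at the 0.05 level, so independence would not be rejected.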

Log likelihood

Log likelihood is a measure to evaluate the degree of dependence between words in a collocation phrase. In computing log likelihood, two hypotheses of extreme cases are assumed. H1 assumes independence of word co-occurrence (thus, no chance of a collocation) and H2 assumes full dependence of words, which means that when one word appears the other word must appear in the context. Log likelihood is simply a degree of dependency between two words measured by the ratio between hypothesis 1 (independence) and hypothesis 2 (dependence). The employed formula is shown below as Equation (4) (the equation in the second line is a computational form).

Dunning (1993) argues that log likelihood is a theoretically sound approach that does not necessarily assume independence of each word in a corpus. It is also argued to be superior to other non-parametric measures (e.g., chi-square) because log likelihood produces reliable results even with small sample corpora.
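The likelihood ratio can be sketched in Python as follows. The sketch assumes the binomial formulation in which H1 sets P(w2|w1) = P(w2|¬w1) = p and H2 allows two different probabilities p1 and p2, and it returns -2 log λ; whether this matches Equation (4) term by term is an assumption. The function names are hypothetical, and the counts are assumed to satisfy 0 < c1 < n and 0 < c2 < n.

```python
import math


def _log_binom(k, n, x):
    """log of x^k * (1 - x)^(n - k), treating 0 * log(0) as 0."""
    result = 0.0
    if k > 0:
        result += k * math.log(x)
    if n - k > 0:
        result += (n - k) * math.log(1 - x)
    return result


def log_likelihood_ratio(c1, c2, c12, n):
    """-2 log(lambda) for a bigram with word counts c1, c2 and bigram count c12."""
    p = c2 / n                     # H1: same probability in both contexts
    p1 = c12 / c1                  # H2: P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)     # H2: P(w2 | not w1)
    log_lambda = (
        _log_binom(c12, c1, p)
        + _log_binom(c2 - c12, n - c1, p)
        - _log_binom(c12, c1, p1)
        - _log_binom(c2 - c12, n - c1, p2)
    )
    return -2.0 * log_lambda


# Hypothetical counts: two words that only ever occur together.
print(round(log_likelihood_ratio(10, 10, 10, 1000), 1))
```

When the two words are exactly independent in the counts, the statistic is zero; the more the two conditional probabilities diverge, the larger it becomes.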

Mutual Information (MI)

Finally, we employed an information-theoretic measure, (point-wise) Mutual Information (MI), in this study. The application of MI to collocation detection is due to Church and Hanks (1989a; 1989b), in which MI is simply defined as the log of the ratio of a joint probability to an independent probability.

In spite of its simple computation, MI produces interesting possible collocations when the MI value is high (MI > 10) and the frequency of the collocation is larger than 5.
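The definition and the two cut-offs mentioned above can be sketched in Python as follows. The sketch assumes a base-2 logarithm and maximum-likelihood probability estimates (count divided by corpus size); the function names and the example counts are hypothetical.

```python
import math


def pointwise_mi(c1, c2, c12, n):
    """log2 of the joint probability over the product of word probabilities."""
    if c12 == 0:
        return float("-inf")  # unseen bigram
    p_joint = c12 / n
    p_indep = (c1 / n) * (c2 / n)
    return math.log2(p_joint / p_indep)


def is_candidate(c1, c2, c12, n, min_mi=10.0, min_count=5):
    """Apply the cut-offs from the text: MI > 10 and frequency > 5."""
    return c12 > min_count and pointwise_mi(c1, c2, c12, n) > min_mi


# Hypothetical counts: two words that always occur together.
print(pointwise_mi(16, 16, 16, 1024))     # ratio is 64, so MI is 6 bits
print(is_candidate(16, 16, 16, 1024))     # MI too low for the cut-off
print(is_candidate(8, 8, 8, 2 ** 20))     # MI of 17 bits, frequency 8
```

The frequency cut-off matters because MI tends to give very high values to rare pairs.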
