
Language Resources and Evaluation manuscript No. (will be inserted by the editor)

Towards Advanced Collocation Error Correction in Spanish Learner Corpora

Gabriela Ferraro · Rogelio Nazar · Margarita Alonso Ramos · Leo Wanner

Received: date / Accepted: date

Abstract Collocations in the sense of idiosyncratic binary lexical co-occurrences are one of the biggest challenges for any language learner. Even advanced learners make collocation mistakes: they literally translate collocation elements from their native tongue, create new words as collocation elements, choose a wrong subcategorization for one of the elements, etc. Therefore, automatic collocation error detection and correction is increasingly in demand. However, while state-of-the-art models predict with reasonable accuracy whether a given co-occurrence is a valid collocation or not, only a few of them manage to suggest appropriate corrections with an acceptable hit rate. Most often, a ranked list of correction options is offered from which the learner then has to choose. This is clearly unsatisfactory. Our proposal focuses on this critical part of the problem in the context of the acquisition of Spanish as a second language. For collocation error detection, we use a frequency-based technique. To improve on collocation error correction, we discuss three different metrics with respect to their capability to select the most appropriate correction of the miscollocations found in our learner corpus.

Keywords Collocation · Collocation error · miscollocation · CALL · collocation error detection · collocation error correction

This work has been partially funded by the Spanish Ministry of Science and Innovation under the contract numbers FFI2008-06479-C02-01/02 and FFI2011-30219-CO2-01/02.

Gabriela Ferraro
Department of Information and Communication Technologies, Pompeu Fabra University

Rogelio Nazar
Institute for Applied Linguistics, Pompeu Fabra University

Margarita Alonso Ramos
Faculty of Philology, University of La Coruña

Leo Wanner
Catalan Institute for Research and Advanced Studies (ICREA) and Department of Information and Communication Technologies, Pompeu Fabra University, Barcelona, Spain. E-mail: [email protected]


1 Introduction

Collocations in the sense of idiosyncratic binary lexical co-occurrences such as take [a] leave, blow [a] kiss, give [a] talk, heavy storm, strong tea, etc., pose one of the biggest challenges for any learner of a second language (Granger, 1998; Howarth, 1998a; Lewis, 2000; Nesselhauf, 2003; Lesniewska, 2006); see also (Futagi et al., 2008) and (Chang et al., 2008) for further extensive references. This is because the choice of one of the two elements in a collocation is free, while the choice of the second depends on the first. Thus, while grammatical constructions—including those that are very different from constructions in L1—follow generalized patterns and can be applied by analogy once some representative samples have been learned, collocations are much less generalizable and must be learned nearly one by one (Hausmann, 1984; Nation, 2001; Futagi et al., 2008). Even advanced learners who master the grammar of L2 well make collocation mistakes: they often literally translate collocation elements from L1 or another foreign language, use non-existing words as collocation elements, get the subcategorization of one of the elements wrong, etc. (Alonso Ramos et al., 2010a). Automatic means for the detection and correction of collocation mistakes in L2 writings are thus in high demand. However, the results of the research in the area still lag behind the expectations. First, the overwhelming majority of the proposals are for English as L2—see, among others, (Pantel and Lin, 2000; Shei and Pain, 2000; Wible et al., 2003; Futagi et al., 2008; Chang et al., 2008; Park et al., 2008; Wu et al., 2010)—with no sufficient evidence that they work equally well for other L2s. Second, while miscollocation detection, which most often exploits frequency-based techniques as used in Natural Language Processing (NLP) for collocation extraction from corpora, achieves in some approaches a nearly operational accuracy (e.g., Chang et al. (2008) achieve 90.7%), the accuracy of miscollocation correction is considerably lower. Therefore, it is most common to offer a (ranked) list of potential corrections from which the learner must then choose the one she considers most appropriate. Chang et al. (2008) report a mean reciprocal rank (MRR), which orders the suggestions by the probability of their correctness, of 0.66 on their lists, and Wu et al. (2010) an MRR of 0.518. This is not sufficient for operational use.2 It is thus the stage of miscollocation correction that especially calls for advances. In what follows, we focus on this stage in the context of Spanish as L2. For collocation error detection, we use a simple frequency-based metric, which is, however, good enough for our experiments. In all our experiments, we use the Spanish learner corpus Corpus Escrito del Español L2 (CEDEL2) (Lozano, 2009).3

1 In accordance with the terminology in the Second Language Learning literature, we refer to the native tongue of the learner as 'L1' and to her second language as 'L2'.

2 Liu, Wible, and Tsao (2009) achieve a higher accuracy; however, they start from a manually compiled list of miscollocations rather than from an automatically retrieved list.

3 CEDEL2 is an L1 English–L2 Spanish learner corpus under construction by Cristóbal Lozano in the framework of a bigger corpus-oriented project directed by Amaya Medikoetxea at the Universidad Autónoma de Madrid. Currently, CEDEL2 contains about 730,000 words of essays in Spanish on a predefined range of topics by native speakers of English and (to a smaller extent, for contrastive studies) by native speakers of Spanish. The topics include, among others, How is the region where you live?, How do your plans for the future look like?, How did you spend your last holidays?, Analyze the major aspects of immigration, and so on. The level of Spanish of the authors of the essays ranges from "elementary" over "lower intermediate", "intermediate", and "advanced" to "very advanced". Further information on CEDEL2 can be obtained from http://www.uam.es/proyectoinv/woslac/cedel2.htm.


The remainder of the article is structured as follows. Section 2 contains a brief introduction to the phenomenon of collocations and the discussion of the challenges that collocations pose to L2 learners. Section 3 briefly reviews the related work in the area of collocation error recognition and correction. In Section 4, we present our model for advanced collocation error correction, which we evaluate in an experiment presented in Section 5. Section 6 concludes with a summary of our findings and an outline of future work.

2 Collocations: A Challenge for the Learners

For a long time, second language learning in general and CALL in particular focused on the difficulties of learners with grammatical constructions. The consequence of this grammar bias was that while for typical grammatical errors more or less detailed analyses have been performed and CALL techniques to address them have been developed,4 all errors related to the lexicon have been classified simply as "lexical errors"; see, for instance, (Gamon et al., 2009; Chen, 2009), without any further distinction. This is certainly a gross oversimplification: a misspelling (as, e.g., Sp. *dispacho 'office' instead of despacho) is different from the creation of a non-existent word (as Sp. *llamo instead of llamada 'call'), and the latter is different from getting a phraseme wrong (as, e.g., *más temprano o tarde, lit. 'sooner or later', instead of tarde o temprano, lit. 'late or soon') or a collocation (as, e.g., concluir un problema, lit. 'conclude a problem', instead of resolver un problema, lit. 'resolve a problem'). Especially collocations, where only one of the lexemes (the base) has the same meaning as when used in isolation, while the meaning of the other (the collocate) depends on the base,5 constitute a challenge to learners. Several studies on English as L2 see a direct correlation between the quality of a learner's writing and the degree to which this learner masters collocations (Granger, 1998; Howarth, 1998b; Nesselhauf, 2005; Gilquin, 2007) and identify collocation mistakes as the most recurrent mistakes detected in learners' writings (Wible et al., 2003). The prominence of collocations can also be observed in the case of Spanish as L2. According to Alonso Ramos et al. (2010b)'s analysis of a subcorpus of CEDEL2, about 39% of the collocations used by advanced learners of Spanish were incorrect.

4 See, e.g., (Atwell, 1987; Knight and Chander, 1994; Hermet, Desilets, and Szpakowicz, 2008; Meurers, in press).

5 Recall that we use the term collocation in the sense of idiosyncratic binary lexical co-occurrences, i.e., following the lexicographic tradition (Hausmann, 1989; Cowie, 1994; Mel'čuk, 1995).

Since the early 2000s, a considerable amount of work has been carried out on the development of programs (although focused mainly on English as L2) that judge a combination to be a valid or invalid collocation and, in the latter case, attempt to provide a list of correction suggestions. But, again, to consider all collocation errors to be of the same unique class is an oversimplification which does not do justice to the complexity of the problem and thus to the needs of learners. Alonso Ramos et al. (2010b)'s study also reveals that learners produce a considerable variety of collocation error types, each of them potentially requiring a different kind of exercise or a different type of sample material to be provided by the learning environment. Consider, for illustration, the following examples from CEDEL2:6

1. gastar todo el año estudiando español, lit. 'spend all the year studying Spanish'
2. hacer citas, lit. 'make appointments'
3. escribir [un] examen, lit. 'write [an] exam'
4. tomar puesto, lit. 'take post'
5. hablar un lenguaje, lit. 'speak a (formal) language'
6. derechos *mujeriles 'women's rights'
7. enseñanza *segundaria 'secondary education'
8. recibir un *llamo, lit. 'receive a call'
9. asistir la universidad, lit. 'assist a university'
10. Yo tengo el deseo personal de ser bilingüe, lit. 'I have the personal wish to be bilingual'

In (1)–(4), we can observe the most common type of miscollocation: gastar 'spend', hacer 'make', escribir 'write', and tomar 'take' are not correct as collocates of año 'year', cita 'appointment', examen 'exam', and puesto 'post', respectively—although they form correct collocations with other bases; compare, e.g., gastar dinero 'spend money', hacer [una] llamada 'make [a] call', escribir [una] carta 'write [a] letter', tomar medidas, lit. 'take measures'. In (5), the base lenguaje 'language' is not correct for the intended usage; the correct base would have been lengua 'language': hablar una lengua 'speak a language'. In (6) and (7), the learner created collocates that do not exist in Spanish—mujeriles and segundaria; the right collocates would have been de las mujeres 'of the women' and secundaria 'secondary', respectively. In (8), the same occurs with the base: llamo does not exist in Spanish; the right word here is llamada 'call'. In (9), the collocate requires a preposition: asistir a la universidad, lit. 'assist to a university' (attend [a] university). Finally, in (10), tener un deseo 'have a wish' is, in principle, a correct collocation; however, its use is inappropriate in the given context; the appropriate phrasing would have been quiero ser bilingüe '[I] want to be bilingual'.

Three questions arise in view of this amount and diversity of collocation errors:

(i) is it possible to classify collocation errors such that each class reflects a single consistent problem type of learners with collocations in L2?

(ii) how are collocations and collocation errors to be annotated in a learner corpus in order to best serve CALL?

(iii) how can we identify and correct collocation errors in the writings of learners and propose targeted exercises and sample material for the errors corrected?

6 The examples show that our error coverage goes beyond what can be considered a strict collocation (i.e., co-occurrence) error in that we cover all errors that affect the correctness of a collocation.

The first question has been addressed in (Alonso Ramos et al., 2010a), where a fine-grained multi-dimensional collocation error typology has been presented,7 and the second in (Alonso Ramos et al., 2010b), where an annotation schema for collocations and collocation errors in learner corpora has been proposed.8 In what follows, we address the third question. Its satisfactory intelligent CALL-driven solution naturally consists of two parts: first, to be able to identify and correct collocation errors in learner corpora, and, second, to be able to extract from reference corpora illustrative and supportive material: collocations of the type the learner seems to have difficulties with, collocations with which the learner seems to confuse the collocations she tends to get wrong, examples of contexts in which a specific collocation is used, etc. The second part is currently out of reach for CALL, since it presupposes that we are able not only to identify miscollocations, but also to classify them automatically (for instance, in accordance with Alonso Ramos et al. (2010a)'s typology). We focus on the first part. But before we embark on the outline of our proposal, let us briefly review the related work in this area.

3 Related Work

The late appearance of collocation-oriented CALL on the research map is certainly also due to the fact that collocation error detection and correction presupposes, on the one hand, advanced models for collocation recognition and, on the other hand, the capacity to rank the different miscollocation correction suggestions. Outside CALL, the identification of collocations in corpora has been actively worked on since the late eighties. Many authors explore purely statistical models (Choueka, 1988; Church and Hanks, 1989; Evert, 2007; Pecina, 2008). These models can be more or less complex, but all of them measure in one way or the other the distribution of words in combination and in isolation. Some of the works combine a statistical model with the use of syntactic features—for instance, submitting to the statistical model only word co-occurrences that form valid syntactic structures (Smadja, 1993; Kilgarriff, 2006; Evert and Kermes, 2003). The most recent statistical proposals take into account the context of the co-occurring words, which allows for the consideration of their distributional semantics (Bouma, 2010). Another strand uses the co-occurrence range of a given word, i.e., the relative frequencies of the tokens that co-occur with this word most often (Wible and Tsao, 2010). Opposed to token frequency-based models is the model that uses explicit semantic features from EuroWordNet (Vossen, 1998) to identify and semantically classify collocations (Wanner, Bohnet, and Giereth, 2006).

7 The dimensions of the typology capture: (1) the scope of the error (collocate, base, or collocation as a whole), (2) the kind of the error (choice of a wrong element, creation of a non-existing element, etc.), and (3) the source (or motivation) of the error. Automatic classification of the miscollocations encountered in essays with respect to this typology is another big challenge, which remains to be tackled.

8 This annotation schema is currently used to annotate a fragment of CEDEL2.


In CALL, most commonly, statistical models are applied to V(erb)+N(oun) co-occurrences; see, for instance, (Chang et al., 2008; Park et al., 2008; Yin, Gao, and Dolan, 2008; Wu et al., 2010; Dahlmeier and Ng, 2011). Since the pioneering work by Shei and Pain (2000), who still offer only precompiled correction suggestions, quite a few proposals have been made on how to improve the collocation competence of learners of English. Yin, Gao, and Dolan (2008) acknowledge that, compared to other lexical errors, collocation error detection and correction is more challenging, even when using sophisticated statistical models and large corpora, including the web. They divide a learner sentence into (collocation) chunks and individual words, which are then used as queries to a search engine. The frequency of the chunks in the web determines whether the chunks are valid in L2. If they are not, they are substituted by maximally overlapping chunks obtained when querying individual words. A precision of 37% at 30% recall is reported for this strategy.

Chang et al. (2008) and Dahlmeier and Ng (2011) focus on L1 interference in learners' writings. Chang et al. (2008) first extract V-N co-occurrences from a given writing. Then, they check the extracted co-occurrences against a collocation list obtained beforehand from a reference corpus. Co-occurrences not found in the collocation list are varied in that their verbal elements are substituted by all English translations of their L1 (Chinese, in this case) counterparts in an electronic dictionary. The variants are again matched against the collocation list. The matching variants that contain the noun of a non-matching co-occurrence are offered as correction suggestions. Chang et al. (2008) report a precision of 97.5% for the recognition of collocations and 90.7% for the recognition of miscollocations. The MRR of the correction list is reported to reach 0.66. Dahlmeier and Ng (2011) produce confusion sets of semantically similar words. Given an input text in L2, they generate L1 paraphrases, which are then looked up in a large parallel corpus to obtain the most likely L2 co-occurrences. For this strategy, they report a precision of 38%.

Futagi et al. (2008) target the detection of miscollocations in learner writings, leaving the correction aside. Unlike the above proposals, they are not restricted to V-N co-occurrences. But similarly to Chang et al. (2008), they extract the co-occurrences from a learner writing, vary them, and then look up the original co-occurrence and its variants in a reference list to decide on its status. To obtain the variants, they apply spell checking, vary articles and inflections, and use WordNet to retrieve synonyms of the collocate.

Wu et al. (2010) go somewhat further in that they work with 'subject'+'verb' and 'verb'+'object' tuples instead of PoS co-occurrences. The tuples are extracted from a reference corpus (RC) using a dependency parser (Klein and Manning, 2003) and filtered to get rid of free rare co-occurrences. A Maximum Entropy (ME) classifier is trained on the lexical context of each collocation in the RC list. For the correction of a miscollocation, the classifier provides a number of collocate corrections using the learner sentence as lexical context. The probability predicted by the classifier for each suggestion is used to rank the suggestions. According to the evaluation included in (Wu et al., 2010), an MRR of 0.518 has been achieved for the first five correction suggestions.9

Liu, Wible, and Tsao (2009)'s goal is to develop a model for the automatic suggestion of corrections for given miscollocations. To retrieve the suggestions from a reference corpus, they use three metrics: (i) mutual information (Church and Hanks, 1989), (ii) the semantic similarity of an incorrect collocate to other potential collocates based on their distance in WordNet, and (iii) the membership of the incorrect collocate with a potential correct collocate in the same "collocation cluster".10 A combination of (ii)+(iii) leads to the best precision achieved for the suggestion of a correction: 55.95%. A combination of (i)+(ii)+(iii) leads to the best precision of 85.71% when a list of 5 possible corrections is returned.

4 Towards Advanced Collocation Error Correction

As pointed out in Section 1, we focus on collocation error correction in Spanish learner essays. To judge whether a V+N, V+Adv, Adj+N or Adj+Adv co-occurrence C in a learner essay is a valid collocation or not, we use a simple frequency metric.

As far as miscollocation correction is concerned, the previous works show that a number of criteria need to be taken into account:

– the learner may misspell a collocate, such that the exploration of graphically similar collocates is worthwhile when looking for a correction;

– the learner often produces collocation mistakes by collocate calques from L1 (Granger, 1998)—which means that the correction may be found among the other equivalents of the corresponding collocate in L1 or among the synonyms of the erroneous collocate;

– as learned from collocation detection research in corpus-based NLP, the association strength between words is often a rather reliable means to judge whether a given co-occurrence is a collocation (or miscollocation);

– the context of a miscollocation in a learner essay is essential when searching for the best correction.

We attempt to capture these criteria in our miscollocation correction metrics.

Our reference corpus of Spanish consists of lists of PoS-tagged n-grams (2 ≤ n ≤ 5) extracted from about 730 million words of newspaper material (in total, 70 volumes of two major Spanish newspapers) or from a smaller sample of it. Further auxiliary resources of which we make use are: the Open Office thesaurus of Spanish, the Spanish EuroWordNet, an automatically compiled bilingual Spanish–English vocabulary, and the web (as an additional reference resource). These auxiliary resources shall help us take into account the phenomena of synonymy (the learner may choose a term which is (quasi-)synonymous with the correct element) and 1:n translation equivalence (the learner may choose a wrong (literal) translation of the collocation element in L1), and ensure that we have the widest possible evidence of collocation use in L2.

9 Note that the filtering stage is not explicitly discussed in (Wu et al., 2010). Neither do the authors mention the size of the correction list on which they calculate the reported MRR. We deduce both from experiments with the MUST Collocation checker (http://miscollocation.appspot.com), which is based on their proposal.

10 Roughly speaking, members of the same "collocation cluster" are values of the same lexical function in the sense of Mel'čuk (1995).

Given a V+N, V+Adv, Adj+N or Adj+Adv combination C extracted from the writing of the learner, our basic algorithm for collocation error detection and correction is as outlined in Figure 1.

1. Check whether the frequency fC of C := Co + B in the n-gram list is higher than an empirically determined threshold T
2. IF fC > T, C is considered a correct collocation of Spanish
3. ELSE do
4.   COLLECT the synonyms Cosyn of Co from the auxiliary resources
5.   IF any csyn ∈ Cosyn forms with B a valid collocation
6.     RETURN this collocation as correction suggestion
7.   ELSE do
8.     COLLECT collocations of B in the reference corpus as correction suggestions Cocan
9.     SELECT among the members ccan of Cocan the best correction, applying one of the correction selection metrics

Fig. 1 The collocation error detection and correction algorithm
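For concreteness, the control flow of Figure 1 can be sketched in Python as follows. This is only an illustration: the dictionary-based resource lookups (n-gram frequencies, synonym lists, collocates per base) and all names are hypothetical placeholders for the resources described above, and the correction selection metric of line 9 is passed in as a parameter (see Sections 4.1–4.3).

```python
# Illustrative sketch of the detection/correction flow of Figure 1.
# All resources are modelled as plain dictionaries; every name here is a
# hypothetical placeholder, not part of the actual implementation.

from typing import Callable, Dict, List, Optional, Set, Tuple

def check_and_correct(
    collocate: str,                                     # Co, e.g. "hacer"
    base: str,                                          # B,  e.g. "cita"
    ngram_freq: Dict[Tuple[str, str], int],             # f(Co + B) in the reference corpus
    synonyms: Dict[str, Set[str]],                      # Co -> synonyms from the auxiliary resources
    base_collocates: Dict[str, List[str]],              # B -> collocates attested in the RC (Cocan)
    select_best: Callable[[str, str, List[str]], str],  # one of the correction selection metrics
    threshold: int = 7,                                  # empirically determined threshold T
) -> Optional[str]:
    """Return None if (Co, B) is judged a correct collocation, else a corrected collocate."""
    # Steps 1-2: frequency test against the threshold T
    if ngram_freq.get((collocate, base), 0) > threshold:
        return None

    # Steps 4-6: does a synonym of Co form a valid collocation with B?
    for syn in synonyms.get(collocate, set()):
        if ngram_freq.get((syn, base), 0) > threshold:
            return syn

    # Steps 8-9: collect the collocates of B from the RC and select the best correction
    candidates = base_collocates.get(base, [])
    return select_best(collocate, base, candidates) if candidates else None
```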

In what follows, we explore three different metrics for judging which of the correction candidates is the best correction of the supposedly erroneous C (cf. line 9 of the algorithm).

4.1 Affinity Metric

The affinity metric is a local metric which takes into account: 1. the co-occurrence (or association) strength (as) of the collocate candidate ccan with the base B of the miscollocation in the RC; 2. the graphic similarity gs of ccan with the original collocate Co; and 3. the synonymy syn of ccan with Co. The affinity Af is calculated as

$Af = as \times gs + syn$   (1)

The association strength as is a standard parameter in collocation identification procedures; see also (Liu, Wible, and Tsao, 2009). It is measured, for instance, in terms of pointwise mutual information, log-likelihood, etc. In the experiment in Section 5, we used log-likelihood (which, somewhat simplified, assesses how likely it is that ccan and B form an idiosyncratic co-occurrence):

$as = \frac{f(c_{can} + B)}{\sqrt{f(c_{can})} \times \sqrt{f(B)}}$   (2)

The graphic similarity gs is intended to capture cases where the learner mistyped a collocate or erroneously chose a collocate due to its graphic similarity to the intended one (as, e.g., rise instead of raise in English);11 see (Futagi et al., 2008) for a similar idea.

We calculate graphic similarity as the Dice coefficient, using letter bigrams as features (such that it provides the relative overlap of the letter bigrams in Co and ccan):

$gs(C_o, c_{can}) = \frac{2 \times |C_{o_{Bi}} \cap c_{can_{Bi}}|}{|C_{o_{Bi}}| + |c_{can_{Bi}}|}$   (3)

(with CoBi as the set of letter bigrams of Co and ccanBi as the set of letter bigrams of ccan). For instance, in the case of the misspelled collocate *cojer (instead of coger 'catch') in *cojer un catarro 'catch a cold', CoBi will be {co, oj, je, er} and ccanBi {co, og, ge, er}, meaning that Co and ccan share two bigrams, while each of them is composed of four different bigrams.

The synonymy factor syn is '1' if ccan is among the synonyms of Co in the synonym list obtained from the auxiliary resources and '0' otherwise.
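A minimal sketch of how the three components combine under equations (1)–(3) is given below; the frequency tables are hypothetical stand-ins for the reference corpus counts, and all function names are ours, not part of our implementation.

```python
import math
from typing import Dict, Set, Tuple

def letter_bigrams(word: str) -> Set[str]:
    """Letter bigrams of a word, e.g. 'cojer' -> {'co', 'oj', 'je', 'er'}."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def graphic_similarity(co: str, c_can: str) -> float:
    """Dice coefficient over letter bigrams, cf. equation (3)."""
    b1, b2 = letter_bigrams(co), letter_bigrams(c_can)
    return 2 * len(b1 & b2) / (len(b1) + len(b2)) if b1 and b2 else 0.0

def association_strength(c_can: str, base: str,
                         pair_freq: Dict[Tuple[str, str], int],
                         word_freq: Dict[str, int]) -> float:
    """Association strength as in equation (2)."""
    denom = math.sqrt(word_freq.get(c_can, 0)) * math.sqrt(word_freq.get(base, 0))
    return pair_freq.get((c_can, base), 0) / denom if denom else 0.0

def affinity(co: str, c_can: str, base: str,
             pair_freq: Dict[Tuple[str, str], int],
             word_freq: Dict[str, int],
             synonyms_of_co: Set[str]) -> float:
    """Equation (1): Af = as x gs + syn."""
    syn = 1.0 if c_can in synonyms_of_co else 0.0
    return (association_strength(c_can, base, pair_freq, word_freq)
            * graphic_similarity(co, c_can) + syn)
```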

4.2 Lexical Context Metric

In contrast to the affinity metric, the lexical context metric takes into account the context in which the miscollocation occurs—as already suggested by Wu et al. (2010). However, unlike Wu et al., who distill the contextual features from the contexts of the correct equivalent of a miscollocation in the reference corpus (RC), we think that it is the essay which should provide the contextual features that are then matched against the contexts of the candidate correction searched for in the RC: a learner may use a (mis)collocation in contexts that deviate from the most common contexts in the RC; should this occur, a machine learning mechanism trained on the most common contexts of a correction will be inaccurate when confronted with features extracted from the learner's context (in the extreme case, the mechanism will have discarded all of the features in the learner's context or assigned very small weights to them during training). On the contrary, starting from the contextual features of a miscollocation, it suffices to collect sufficient evidence that a correction candidate is used in the context provided by the learner.12

11 As rightly pointed out by one of the reviewers, apart from graphic similarity, phonetic similarity should also be considered. A large number of phonetic distance measures is available; see (Kessler, 2005) for an in-depth discussion, starting with the implementation of Russell and Odell's Soundex for English.

12 Obviously, this strategy harbors the danger of contextual feature occurrence sparseness if the learner uses a (mis)collocation in very idiosyncratic contexts.

In other words, the lexical context metric is grounded in the assumption of distributional semantics, namely that the semantics of a collocation can be approximately deduced from the sentential context in which this collocation appears. Consider, for illustration, the following sentences (taken from the web), in which one of the words has been removed:

11. She * a conference on the situation of women rights . . .
12. Mr. White responded to the changing industry and * a conference of critical success
13. Eventcorp * a conference that met the Conference Committee's criteria

The reader can deduce with a certain probability that the missing word is [to] deliver or some other support verb that goes with conference and that is synonymous in this context with deliver. We argue that this is because the reader uses the distributional semantics of the context of conference, which allows her to come up with [to] deliver. In contrast, in

14. The mailman * apples, bananas, and coconuts.
15. Oo baby, here I am, signed, sealed, *, I'm yours, oh I'm yours . . . [Stevie Wonder song]
16. Flowers * by hand on your behalf by our expert florists.

this is not the case: we cannot reliably guess the missing verb. This gives us a hint that in (14–16) the missing verb does not participate in a collocation.

We can thus hypothesize that the context can be useful for the detection of collocations or, in our case, for the search for the most adequate correction suggestion. More precisely, we assume that, given the sentential context c1, c2, ..., ci, Co, ci+1, ..., cn of Co in the original sentence of the learner, the candidate ccan with the highest affinity to c1, c2, ..., ci, ci+1, ..., cn in the RC is the most adequate correction of Co—with "affinity" meaning here the highest co-occurrence frequency:

$\mathrm{argmax}_{c_{can}} \sum_{j=1}^{n} f(c_j - c_{can})$   (4)

(with 'cj − ccan' denoting the occurrence of cj with ccan at a distance of at most k tokens).

In our experiments, we have so far used n ≤ 8 with k = 2, always within the borders of a single sentence; duplicates are eliminated.13 Consider, for instance, the learner sentence (17):

17. Sp. Afortunadamente, su profesora estuvo dispuesta a venderlas y pudo comprar dos máscaras para extender nuestra colección
lit. 'Fortunately, his professor was willing to sell-them and could buy two masks to extend our collection'.

13 In contrast to information retrieval-oriented search, we do not eliminate the functional words from the context (which are otherwise considered to be "stop words" that do not contribute to the quality of the search), since they are essential for our task.


The context tokens during the search for the optimal correction suggestion for the miscollocation *extender [una] colección 'extend a collection' would be máscara 'mask', para 'for' and nuestro 'our'. In further experiments (see Section 6), we plan to use the entire sentential context, although some processing time restrictions may arise for long sentences.
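A sketch of the lexical context metric under equation (4) follows; the co-occurrence table is a hypothetical stand-in for the counts extracted from the RC within the k-token window.

```python
from typing import Dict, List, Tuple

def lexical_context_score(c_can: str, context_tokens: List[str],
                          cooc_freq: Dict[Tuple[str, str], int]) -> int:
    """Sum of co-occurrence frequencies f(cj, c_can) over the (de-duplicated)
    context tokens of the learner sentence, cf. equation (4)."""
    return sum(cooc_freq.get((tok, c_can), 0) for tok in set(context_tokens))

def best_by_lexical_context(candidates: List[str], context_tokens: List[str],
                            cooc_freq: Dict[Tuple[str, str], int]) -> str:
    """argmax of equation (4) over the correction candidates."""
    return max(candidates,
               key=lambda c: lexical_context_score(c, context_tokens, cooc_freq))

# For sentence (17), context_tokens would be ["máscara", "para", "nuestro"].
```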

4.3 Context Feature Metric

The context feature metric is similar to the lexical context metric in that it draws upon the context of the miscollocation in the original learner sentence. However, there are also two significant differences. First, it may take into account not only lexical tokens (although this is what we have tested it with so far; see below), but any kind of contextual feature (POS tags, grammatical functions, punctuation, etc.). Second, its interpretation of these features is very different: given the contextual features cf1, cf2, ..., cfn of Co in the original sentence of the learner and a list of candidates Ccan, the idea is to assess whether the majority of the contextual features cf of Co speak for the preference of one of the candidates ccan ∈ Ccan. For this purpose, we search, for each feature cf, the ccan that shows the highest affinity towards it:

$\mathrm{argmax}_{c_{can_i},\ i=1,\dots,n} \frac{N(c_{can_i}, B)}{\sum_{j=1,\dots,n} N(c_{can_j}, B)} \times \frac{N(c_f, (c_{can_i}, B))}{N(c_{can_i}, B)}$   (5)

where N(ccani, B) stands for the number of times the combination (ccani, B) occurs in the corpus, and N(cf, (ccani, B)) for the number of times the feature cf and the combination (ccani, B) co-occur in the corpus at a certain distance from each other.

What equation (5) does is to assess how common the feature cf is when a given candidate correction ccani co-occurs with the base B and how idiosyncratic the co-occurrence of ccani with B is. For instance, in the learner sentence (18), the collocation *sacarse [una] operación, lit. 'take off an operation', is not correct:

18. Es fácil, solo hay que sacarse una operación como Michael Jackson
'It is easy, you have only to take off an operation as Michael Jackson'.

To find the right correction, the affinity between the candidate collocations of operación 'operation' and each of the contextual features in the reference corpus is examined. In the experiments carried out so far, we used lexical tokens as contextual features (similar to the lexical context metric). The contextual features in (18) would thus be solo, hay, que, como, Michael and Jackson.14

For each of these features, its preference for concluir 'conclude', detener 'stop', drogar 'drug', herir 'harm', informar 'inform', producir 'produce', recibir 'receive', rescatar 'rescue' or ser 'be' (the correction candidates provided by our algorithm) is assessed according to equation (5).

14 As a matter of fact, proper names are poor features for our task. We plan to discard them in future experiments.
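The selection by the context feature metric can be sketched as follows; this is an illustrative reading of equation (5) in which each feature votes for its preferred candidate, with all table names being hypothetical stand-ins for the reference corpus counts.

```python
from collections import Counter
from typing import Dict, List, Tuple

def best_by_context_features(
    candidates: List[str],
    base: str,
    features: List[str],                               # contextual features, here tokens of the learner sentence
    pair_count: Dict[Tuple[str, str], int],            # N(c_can, B) in the reference corpus
    feat_pair_count: Dict[Tuple[str, str, str], int],  # N(c_f, (c_can, B)) in the reference corpus
) -> str:
    """For each feature, pick the candidate that maximizes equation (5); the
    candidate preferred by most features is returned (a vote over features)."""
    total = sum(pair_count.get((c, base), 0) for c in candidates) or 1
    votes: Counter = Counter()
    for feat in features:
        def score(c: str) -> float:
            n_cb = pair_count.get((c, base), 0)
            if n_cb == 0:
                return 0.0
            return (n_cb / total) * (feat_pair_count.get((feat, c, base), 0) / n_cb)
        votes[max(candidates, key=score)] += 1
    return votes.most_common(1)[0][0]
```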


5 Experiments

In order to test the performance of our collocation recognition algorithm and the quality of each of the miscollocation correction metrics, we ran the procedure in Figure 1 on a number of sentences from CEDEL2, also exploring the influence of the nature and size of the reference corpus on the task.

5.1 Set Up of the Experiments

5.1.1 Testing reference corpora

As mentioned above, our strategy for collocation error identification is purely frequency-based in that it draws upon the frequency of a given co-occurrence in a reference corpus to decide whether this co-occurrence is a correct collocation or not. However, the frequency of a co-occurrence depends on the nature and on the composition (and thus indirectly also on the size) of the reference corpus. For instance, in the web, the high number of non-native authors of English (and thus the potential use of co-occurrences not accepted by native speakers) may distort the relative frequencies of collocations. In the case of Spanish, we also need to bear in mind that the frequencies may vary significantly across the different Spanish-speaking countries.

The size of the reference corpus also matters, since a very large corpus bears the risk of containing too much noise and a small corpus may not contain a sufficient number and diversity of co-occurrences.

To assess these restrictions, at least partially, we evaluated the collocation error identification task against three different corpora: the web (consulted using the Yahoo! API), a large corpus of 5GB of newspaper material of peninsular Spanish (about 740 million words), and a smaller sample of approximately 2GB (about 185 million words) from the same corpus.

5.1.2 Testing error correction metrics

In a different experiment, we assessed the quality of the error correction metrics, letting them compete with each other. As reference corpus, the 740 million word corpus was used. In total, 61 runs on V+N combinations were carried out. In each run, the algorithm in Figure 1 was applied to a given V+N combination and a sentence from the learner corpus in which this combination occurred. For instance, one of the combinations was hacer citas, lit. 'make appointments', in the learner sentence:

19. En mi nueva posición, yo hice planes de viajar para los grupos, acudí el teléfono e hice citas para conferencias con otras compañías para Gary
lit. 'In my new position, I made plans to travel for groups, [I] attended the phone and made appointments for conferences with other companies for Gary'.


Due to the low (more precisely, zero) frequency of hacer citas in the reference corpus (its frequency of 0 lies below the threshold T = 7), it is judged to be a collocation error. The condition in Step 5 of the algorithm in Figure 1 returns 'false', such that Steps 8 and 9 are carried out. Step 8 returns the following list of candidates:15

realizar [una] cita 'realize an appointment', producir [una] cita 'produce [an] appointment', dar [una] cita 'give [an] appointment', tener [una] cita 'have [an] appointment', ir [a una] cita 'go [to an] appointment', acudir [a una] cita 'turn [to an] appointment', declarar [una] cita 'declare [an] appointment', haber [una] cita 'receive [an] appointment', concertar [una] cita 'arrange [an] appointment', ser [una] cita 'be [an] appointment', agenciar [una] cita 'mediate [an] appointment'.

Step 9 applies the individual metrics to select the best correction. The affinity metric suggests realizar [una] cita 'realize an appointment', while the lexical context and the context feature metrics suggest concertar [una] cita 'arrange [an] appointment', which is, in fact, the most appropriate correction of hacer citas.
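In terms of the illustrative sketch given after Figure 1, this trial would correspond to a call of roughly the following form (all resource contents are invented stand-ins, not actual corpus counts):

```python
# Hypothetical call reproducing the hacer citas example; the counts are invented.
suggestion = check_and_correct(
    collocate="hacer", base="cita",
    ngram_freq={("hacer", "cita"): 0, ("concertar", "cita"): 154},
    synonyms={"hacer": {"realizar", "efectuar"}},
    base_collocates={"cita": ["realizar", "producir", "dar", "tener", "concertar"]},
    select_best=lambda co, b, cands: "concertar",  # stands in for one of the metrics
    threshold=7,
)
print(suggestion)  # -> "concertar"
```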

Table 1 displays a number of further examples of correction suggestions by the three metrics. In the 10 displayed trials, the affinity metric failed 5 times, the context feature metric 4 times, and the lexical context metric 2 times. Note that some wrong correction suggestions may be valid collocations in Spanish. For example, dar [una] oportunidad '[to] give [an] opportunity' is a correct collocation in Spanish, but with a different meaning from the one required by the sentence of the learner in (20):

20. Como he dicho, me encantaba el español, y yo utilizo cada oportunidad que tengo para practicar y mejorar mis aptitudes
lit. 'As [I] have said, [I] loved Spanish, and [I] use every opportunity that [I] have to practice and improve my aptitudes'.

Another example is imponer [una] regla '[to] impose [a] rule', which is a correct collocation, but which, again, does not reflect the intention of the learner in their writing: Los sistemas religiosos no quieren que los sistemas gubernamentales interrumpen sus reglas y leyes., lit. 'The religious systems do not want that the governmental systems interrupt their rules and laws.'

In some cases, more than one collocation candidate can be considered the correct suggestion. Consider, for instance, the incorrect collocation concluir [un] problema, lit. 'conclude [a] problem', in the learner's sentence (21):

21. Quizás será la ciencia que descubra en el futuro algo que ayudará a concluir definitivamente el problema.
lit. 'Perhaps, [it] will be the science that will discover in the future something that would help to definitely resolve the problem'.

where all three metrics suggest a different but adequate correction: resolver [un] problema 'resolve [a] problem', solucionar [un] problema 'solve [a] problem', and acabar [con un] problema 'terminate [with a] problem'.

15 The suggestion *agenciar [una] cita as a possible candidate is due to the wrong PoS tagging of the bigram agencia cita 'agency cites', which is very common in a newspaper corpus such as ours.

#  | Collocation error                                 | Affinity                      | Lexical context    | Context feature
1  | realizar meta 'realize goal'                      | *hacer 'make'                 | alcanzar 'reach'   | alcanzar 'reach'
2  | cambiar al cristianismo 'change to Christianity'  | convertir 'convert'           | convertir 'convert'| convertir 'convert'
3  | comer café 'eat coffee'                           | tomar 'take'                  | tomar 'take'       | *estar 'be'
4  | quedar la tradición 'remain the tradition'        | seguir 'follow'               | seguir 'follow'    | *pasar 'pass'
5  | utilizar la oportunidad 'utilize the opportunity' | aprovechar 'take advantage of'| *ver 'see'         | *dar 'give'
6  | concluir un problema 'conclude a problem'         | resolver 'resolve'            | solucionar 'solve' | acabar 'finish'
7  | empezar una familia 'begin a family'              | *acomodar 'accommodate'       | formar 'form'      | formar 'form'
8  | interrumpir una regla 'interrupt a rule'          | *establecer 'establish'       | *imponer 'impose'  | violar 'violate'
9  | hacer citas 'make an appointment'                 | *realizar 'carry out'         | concertar 'arrange'| concertar 'arrange'
10 | hacer influencia 'make influence'                 | *tener 'have'                 | ejercer 'exert'    | *tener 'have'

Table 1 Examples of correction suggestions by the three different metrics of our program ('*' stands for a wrong correction suggestion)

5.2 Results and Evaluation of the Experiments

Although we did not make an effort to optimize our collocation (error) identification strategy, we take its output as input for our miscollocation correction procedure. Therefore, an evaluation of it is necessary.

5.2.1 Evaluation of the identification of miscollocations

Figure 2 displays the percentage of correctly identified collocations and miscollocations using the different reference corpora.

Fig. 2 Collocation and miscollocation recognition using reference corpora of different sizes

According to these figures, the 740 million word corpus is more suitable for the detection of correct collocations (with a recognition ratio of about 98%) than the 185 million word corpus or the web. This is likely due, on the one hand, to the small size or imbalance of the 185 million word corpus and, on the other hand, to the diversity and unreliability of the data sources in the web and thus its uncontrolled nature. However, with a recognition ratio of about 84%, the web is considerably better than the 185 million word corpus. A clear advantage of using the web as corpus is that even if it is not balanced, it contains a wide range of text topics and genres, which facilitates, for instance, the study of the use of collocations in different registers.

With a recognition ratio of about 98%, the web turned out to be more adequate as a reference corpus for the identification of miscollocations than the other two corpora. This is somewhat surprising, since one would expect that its uncontrolled nature mentioned above would introduce considerable noise (or, in other words, traces of the use of collocations judged by native speakers as incorrect). With a recognition ratio of about 73%, the 740 million word corpus is considerably less reliable, and so is the 185 million word corpus (with 49%). Further investigation is needed to clarify this outcome.

Overall, with the web as reference corpus we are able to judge whether a combination is a correct or incorrect collocation in Spanish with an accuracy of about 91%. Out of 61 samples, the algorithm fails in six cases. In five of these six cases, correct collocations have been judged to be incorrect. This is mainly due to our frequency-based criteria. For instance, apretar [los] dientes, lit. 'press [the] teeth together', contar cuentos, lit. 'tell stories', dar [la] bienvenida, lit. 'give [the] welcome', and preparar [la] comida, lit. 'prepare [the] food', are correct collocations in Spanish, but their frequencies in our reference corpus are too low to consider them valid.16 On the other hand, pasar [la] navidad, lit. 'pass [the] Christmas', is judged by the program to be a correct collocation due to its high frequency in the corpus, although it is questionable in peninsular Spanish.17

The figures also show that the 185 million word corpus is not adequate to serve as reference corpus for collocation (error) recognition: it simply does not contain enough evidence on the use of the individual co-occurrences.

16 This gives us a hint that a newspaper material corpus is not a well-balanced corpus for the purposes of collocation-oriented CALL.

17 However, it is a standard collocation in Argentinian Spanish.


5.2.2 Evaluation of collocation correction suggestions

For the second stage, i.e., the error correction stage, we have so far performed two evaluations. First, we evaluated the probability that the list of correction suggestions retrieved by our algorithm (Step 8 in Figure 1) contains the adequate correction of the given miscollocation. This probability amounts to 0.73 if we consider the first 15 suggestions (ranked by frequency): in 73% of the trials, the right correction is encountered in the list of the first 15 suggestions offered by our program. Working with a list of 10 suggestions, we achieve an accuracy of 0.66, and with a list of 5 suggestions, an accuracy of 0.6.

In order to assess the quality of the frequency-based ranking of the correction suggestions, we calculated the ratio of correct miscollocation corrections versus the size of the correction suggestion list for 30 miscollocations. The results of this test are illustrated in Figure 3. For instance, for 18 instances, the adequate correction is in the top five of the list; for 20 instances, the correction is in the top ten, and so on. This means that the provided correction suggestion list is reasonably good (although not perfect), and we can choose a single correction from it using our metrics.

Fig. 3 Correlation of the position of the correct correction with the size of the correction list

Second, we evaluated the capacity of the different metrics to pick the right correction from the correction suggestion list. The affinity metric achieved an accuracy of 17.24%, the lexical context metric 27.58%, and the context feature metric, with features being simply the words in the original sentence of the learner, 54.2%.18 The mean reciprocal rank (MRR) of the top five suggestions obtained using the contextual feature metric is 0.72.
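For reference, the MRR values reported in this article can be computed from the ranked suggestion lists as in the following sketch (illustrative only; the function and variable names are ours):

```python
from typing import List

def mean_reciprocal_rank(ranked_suggestions: List[List[str]],
                         gold_corrections: List[str], cutoff: int = 5) -> float:
    """MRR over the top `cutoff` suggestions: the reciprocal rank of the gold
    correction if it appears within the cutoff, 0 otherwise, averaged over items."""
    total = 0.0
    for suggestions, gold in zip(ranked_suggestions, gold_corrections):
        top = suggestions[:cutoff]
        total += 1.0 / (top.index(gold) + 1) if gold in top else 0.0
    return total / len(gold_corrections) if gold_corrections else 0.0

# e.g. mean_reciprocal_rank([["realizar", "concertar"]], ["concertar"]) == 0.5
```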

The poor performance of the affinity metric needs further examination. It is possible that the metric itself is not adequate; to check this, a reimplementation of the metric suggested by Liu, Wible, and Tsao (2009) will help. It is also possible that the poor performance is partially due to deficiencies of the auxiliary resources we used. For instance, the Spanish WordNet is less complete than its English counterpart. Furthermore, it was derived semi-automatically from the English WordNet, with the consequence that it is also considerably noisier.

The equally poor performance of the lexical context metric is likely due to its simplicity. The context feature metric, which is also based on the notion of context, proved to be more adequate. Its performance challenges state-of-the-art proposals such as (Chang et al., 2008) (with an MRR of 0.66) and (Wu et al., 2010) (with a highest MRR of 0.518). However, even an accuracy of 54.2% and an MRR of 0.72 are certainly still too low for practical CALL. On the other hand, it should be pointed out that the potential of the contextual features has not yet been fully explored: the use of concrete words alone is too restrictive. The experience from statistical NLP (e.g., parsing and generation) teaches us that combinations of morpho-syntactic categories, grammatical functions and words are more promising. We will carry out experiments in this respect in the near future.

6 Conclusions and Future Work

We presented a proposal for the automatic recognition and correction of miscollocations in written material of learners of Spanish. Unlike many other proposals in the field, our goal was to explore how well we can offer adequate corrections of miscollocations rather than ranked lists of possible corrections. Although we agree that a ranked list of possible corrections may suffice for advanced learners, it does not serve well the needs of beginners or intermediate level learners—especially if the suggestions show subtle semantic differences. Furthermore, unlike, e.g., Wu et al. (2010)—the proposal that is most similar to ours in that it also takes the (mis)collocation context into account—our experiments have been carried out with real learner essays, because it is only when working with real material that we can obtain evidence that the proposed approach may serve as the basis for advanced collocation error correction techniques.

The results we obtained with our current initial implementation compete with state-of-the-art proposals, although they are still far from being fully satisfactory. Thus, a collocation/miscollocation recognition rate of 90% means that every tenth co-occurrence is judged incorrectly, and a correction rate of 54.2% means that for nearly every second miscollocation the best correction is not suggested.

18 The context feature metric thus considered the same "features" as the lexical context metric, only that it interpreted them differently.


In order to have some evidence that our error correction proposal is not limited to Spanish, we carried out a small experiment on English learner material in that we:

(i) took ten English collocation errors made by Chinese learners of English in their sentential contexts from a study presented in (Li, 2005):

1. The interaction between us really made me grown up mind a lot and helped me . . .
2. These merits she owns make me to take more respect and honour to her . . .
3. I should not break her privacy.
4. When you said lies, you may never let others to believe what you say.
5. My hand flowed much blood and I was very sad.
6. Karen is a charming girl who always keeps a smile on her face.
7. . . . my dad told me if I could have a great grade then he would buy it for me.
8. We often changed our secrets with each other.
9. I learned a lot of knowledge from him, . . .
10. . . . I decided to change myself and to do something breakthrough.

(ii) introduced the ten collocation errors one by one into the MUST collocation checker interface19 and thus obtained ranked lists of corrections for each miscollocation;20

(iii) applied the lexical context metric to the lists of MUST's correction suggestions (interpreted as unordered bags of suggestions) to obtain re-ranked correction lists.21

The MRR of the MUST-ranked suggestions in (ii) was 0.4, while the MRR of the suggestions ranked by our lexical context metric in (iii) reached 0.53, i.e., was considerably higher. For 3 of the 10 miscollocations, the lexical context metric ranked the right correction first; for 4, the right correction was ranked second, for 1 third, for 1 fifth, and for 1 the right correction was not among the first five. It is to be expected that the context feature metric, which we did not use in the experiment due to its considerably higher set-up cost, would have performed even better (see Section 5.2.2). That is, we can assume that our collocation error correction metrics are indeed language-independent.

19 As already mentioned above, MUST is an implementation of (Wu et al., 2010).
20 Instead of grow up mind, we introduced grow mind since MUST does not process collocations with phrasal verbs.
21 Since MUST's corrections for grow mind did not include any right correction, we added cultivate mind (the correction suggested by Li) to the bag of suggestions for grow mind.

Despite this encouraging outcome, there is still great potential to improve the performance of both miscollocation identification and correction, which we target in our future work. To further improve (mis)collocation recognition, we plan to go beyond the assumption of context independence of co-occurrences, as has already been suggested by Bouma (2010). To improve the correction strategies, we plan to significantly broaden the types of features taken into account for the context feature metric. A further experiment will be dedicated to the exploration of a voting strategy between the different metrics: as illustrated by Table 1, already a simple majority vote would introduce somewhat more stability into the results.
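As an illustration of the envisaged voting strategy, the sketch below implements a simple majority vote over the top-ranked correction proposed by each metric; the candidate corrections shown are hypothetical, and the tie-breaking behaviour is one possible design choice rather than a fixed part of our proposal.

from collections import Counter

def majority_vote(top_corrections):
    # top_corrections: the top-ranked correction proposed by each metric,
    # listed in a fixed metric order. Counter.most_common() is stable for
    # ties, so a tie is resolved in favour of the metric listed first.
    return Counter(top_corrections).most_common(1)[0][0]

# Hypothetical example: three metrics propose corrections for "said lies".
print(majority_vote(["tell lies", "tell lies", "make lies"]))  # -> tell lies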

In parallel, we explore the use of the collocation (error type) annotated fragment of CEDEL2 (Alonso Ramos et al., 2010b) as training material for machine learning based recognition of miscollocations in learner corpora and the use of the Google n-gram list as reference corpus.

Acknowledgements Many thanks to Amaya Mendikoetxea and Cristóbal Lozano for making the CEDEL2 corpus available to us and to the two anonymous reviewers for their insightful comments, which considerably improved the final version of the paper. Our experiments have been partially run on the Argo cluster of the Department of Information and Communication Technologies, UPF. We are grateful for this service and would like to thank especially Silvina Re and Ivan Jimenez for their help.

References

Alonso Ramos, M., L. Wanner, N. Vázquez, O. Vincze, E. Mosqueira, and S. Prieto. 2010a. Tagging collocations for learners. In S. Granger and M. Paquot, editors, eLexicography in the 21st Century: New Challenges, New Applications. Proceedings of eLex 2009, Cahiers du Cental, volume 7, Louvain-la-Neuve.

Alonso Ramos, M., L. Wanner, O. Vincze, G. Casamayor, N. Vázquez, E. Mosqueira, and S. Prieto. 2010b. Towards a motivated annotation schema of collocation errors in learner corpora. In Proceedings of LREC 2010, Malta.

Atwell, E. 1987. How to detect grammatical errors in a text without parsing it. In Proceedings of the EACL Conference, pages 38–45, Copenhagen, Denmark.

Bouma, G. 2010. Collocation extraction beyond the independence assumption. In Proceedings of the ACL Conference, Short paper track, Uppsala.

Chang, Y.C., J.S. Chang, H.J. Chen, and H.C. Liou. 2008. An Automatic Collocation Writing Assistant for Taiwanese EFL Learners. A case of Corpus-Based NLP technology. Computer Assisted Language Learning, 21(3):283–299.

Chen, H. 2009. Microsoft ESL Assistant and NTNU Statistical Grammar Checker. Computational Linguistics and Chinese Language Processing, 14(2):161–180.

Choueka, Y. 1988. Looking for needles in a haystack or locating interesting collocational expressions in large textual databases. In Proceedings of the RIAO, pages 34–38.

Church, K.W. and P. Hanks. 1989. Word Association Norms, Mutual Information, and Lexicography. In Proceedings of the 27th Annual Meeting of the ACL, pages 76–83.

Cowie, A.P. 1994. Phraseology. In Asher and Simpson, editors, The Encyclopedia of Language and Linguistics, Vol. 6. Pergamon, Oxford, pages 3168–3171.

Dahlmeier, D. and H.T. Ng. 2011. Correcting semantic collocation errors with L1-induced paraphrases. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 107–117, Edinburgh, Scotland.

Evert, S. 2007. Corpora and collocations. In A. Lüdeling and M. Kytö, editors, Corpus Linguistics. An International Handbook. Mouton de Gruyter, Berlin.

Evert, S. and H. Kermes. 2003. Experiments on candidate data for collocation extraction. In Companion Volume to the Proceedings of the 10th Conference of the EACL, pages 83–86.

Futagi, Y., P. Deane, M. Chodorow, and J. Tetreault. 2008. A computational approach to detecting collocation errors in the writing of non-native speakers of English. Computer Assisted Language Learning, 21(1):353–367.

Gamon, M., C. Leacock, C. Brockett, W. Dolan, J. Gao, and D. Belenko. 2009. Using statistical techniques and web search to correct ESL errors. CALICO Journal, 26(3):491–511.

Gilquin, G. 2007. To err is not all. What corpus and elicitation can reveal about the use of collocations by learners. Zeitschrift für Anglistik und Amerikanistik, 55(3):273–291.

Granger, S. 1998. Prefabricated patterns in advanced EFL writing: Collocations and formulae. In A. Cowie, editor, Phraseology: Theory, Analysis and Applications. Oxford University Press, Oxford, pages 145–160.

Hausmann, F.-J. 1984. Wortschatzlernen ist Kollokationslernen. Zum Lehren und Lernen französischer Wortwendungen. Praxis des neusprachlichen Unterrichts, 31(4):395–406.

Hausmann, F.-J. 1989. Le dictionnaire de collocations. In F.-J. Hausmann, O. Reichmann, H.E. Wiegand, and L. Zgusta, editors, Wörterbücher, Dictionaries, Dictionnaires. Ein internationales Handbuch. De Gruyter, Berlin.

Hermet, M., A. Désilets, and S. Szpakowicz. 2008. Using the web as a linguistic resource to automatically correct lexico-syntactic errors. In Proceedings of the LREC 2008, pages 54–57, Marrakech.

Howarth, P. 1998a. Phraseology and second language acquisition. Applied Linguistics, 19(1):24–44.

Howarth, P. 1998b. The phraseology of learner's academic writing. In A.P. Cowie, editor, Phraseology: Theory, Analysis and Applications. Oxford University Press, Oxford, pages 161–186.

Kessler, B. 2005. Phonetic comparison algorithms. Transactions of the Philological Society, 103(2):243–260.

Kilgarriff, A. 2006. Collocationality (and how to measure it). In Proceedings of the 12th EURALEX International Congress, Torino.

Klein, D. and C.D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the ACL Conference, pages 423–430.

Knight, K. and I. Chander. 1994. Automated postediting of documents. In Proceedings of the AAAI Conference, pages 779–784, Seattle, WA.

Leśniewska, J. 2006. Collocations and second language use. Studia Linguistica Universitatis Iagellonicae Cracoviensis, 123:95–105.

Lewis, M. 2000. Teaching Collocation. Further Developments in the Lexical Approach. LTP, London.

Li, C.C. 2005. A Study of Collocational Error Types in ESL/EFL College Learners. Ph.D. thesis, Ming Chuan University College of Applied Languages, Department of Applied English.

Liu, A. Li-E., D. Wible, and N.-L. Tsao. 2009. Automated suggestions for miscollocations. In Proceedings of the NAACL HLT Workshop on Innovative Use of NLP for Building Educational Applications, pages 47–50, Boulder, CO.

Lozano, C. 2009. CEDEL2: Corpus escrito del español L2. In C.M. Bretones Callejas, editor, Applied Linguistics Now: Understanding Language and Mind. Universidad de Almería, Almería, pages 197–212.

Mel'čuk, I.A. 1995. Phrasemes in Language and Phraseology in Linguistics. In M. Everaert, E.-J. van der Linden, A. Schenk, and R. Schreuder, editors, Idioms: Structural and Psychological Perspectives. Lawrence Erlbaum Associates, Hillsdale, pages 167–232.

Meurers, D. in press. Natural language processing and language learning. In C.A. Chapelle, editor, Encyclopedia of Applied Linguistics. Blackwell, Hoboken, NJ.

Nation, I.S.P. 2001. Learning Vocabulary in Another Language. Cambridge University Press, Cambridge.

Nesselhauf, N. 2003. The use of collocations by advanced learners of English and some implications for teaching. Applied Linguistics, 24(2):223–242.

Nesselhauf, N. 2005. Collocations in a Learner Corpus. Benjamins Academic Publishers, Amsterdam.

Pantel, P. and D. Lin. 2000. Word-for-word glossing with contextually similar words. In Proceedings of the 4th NAACL Conference, pages 78–85, Seattle.

Park, T., E. Lank, P. Poupart, and M. Terry. 2008. Is the sky pure today? AwkChecker: An assistive tool for detecting and correcting errors. In Proceedings of the 21st Annual ACM Symposium on User Interface Software and Technology (UIST '08), New York.

Pecina, P. 2008. A machine learning approach to multiword expression extraction. In Proceedings of the LREC 2008 Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), pages 54–57, Marrakech.

Shei, C.C. and H. Pain. 2000. An ESL writer's collocation aid. Computer Assisted Language Learning, 13(2):167–182.

Smadja, F. 1993. Retrieving Collocations from Text: X-Tract. Computational Linguistics, 19(1):143–177.

Vossen, P. 1998. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Dordrecht.

Wanner, L., B. Bohnet, and M. Giereth. 2006. Making Sense of Collocations. Computer Speech and Language, 20(4):609–624.

Wible, D., C.-H. Kuo, N.-L. Tsao, A. L.-E. Liu, and H.-L. Lin. 2003. Bootstrapping in a language learning environment. Journal of Computer Assisted Learning, 19(4):90–102.

Wible, D. and N.-L. Tsao. 2010. StringNet as a computational resource for discovering and investigating linguistic constructions. In Proceedings of the NAACL-HLT Workshop on Extracting and Using Constructions in Computational Linguistics, Los Angeles.

Wu, J.-C., Y.-C. Chang, T. Mitamura, and J.S. Chang. 2010. Automatic collocation suggestion in academic writing. In Proceedings of the ACL Conference, Short paper track, Uppsala.

Yin, X., J. Gao, and W. Dolan. 2008. A web-based English proofing system for English as a second language users. In Proceedings of the 3rd International Joint Conference on Natural Language Processing, pages 619–624, Hyderabad, India.