Linguistic annotation of learner corpora

A. Díaz-Negrillo, D. Meurers & H. Wunsch. University of Jaén (Spain), University of Tübingen (Germany)

1. Introduction

A study on linguistic annotation of learner corpora, in particular Part-Of-Speech (POS) annotation, which aims to discuss where native POS tagsets fail to accurately describe learner language, by:

• Describing POS annotation practice in learner corpora, and

• Characterizing the areas where properties of learner language differ from those assumed by native POS annotation schemes.

• Learner corpora can play a role in identifying areas of relevance in, for example, FLT, SLA, materials design, etc.

• The terminology used to single out learner language aspects needs to be mapped to instances in the corpus, i.e. annotation.

• Linguistic annotation of learner corpora, in particular POS tagging, is becoming a common practice because:

– By using generally agreed linguistic categories, it allows units of interest to be identified objectively.

– Other annotations specific to learner corpora (error-tagging) mostly allow research into deviances; they are also costly and involve a degree of subjectivity.

– In SLA research there is an interest in the developmental stages of the acquisition process.

– POS tagging can be done automatically.

Recent initiatives:

• International Corpus of Learner English (ICLE)

• Cambridge Learner Corpus (CLC)

• Japanese EFL Learner Corpus (JEFLL)

• Polish Learner Corpus of English

Automatic POS-tagging consists of 2 parts:

– Tag look-up: all possible tags for the given token are determined based on lexical database reference or morphological analysis.

– Tag disambiguation: all possible tags are reduced to the correct tag based on distribution.

Fallback strategies: weaker versions of the 3 sources of evidence (lexical look-up, morphology and distribution) and, as a last resort, use of the most frequent tag.
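The two steps and the fallback described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy lexicon, the suffix rules standing in for morphological analysis, the bigram counts standing in for distributional evidence, and the choice of NN as the last-resort tag. It is not the design of any of the taggers discussed in this talk.

```python
from collections import Counter

# Hypothetical toy lexicon: token -> possible Penn Treebank-style tags.
LEXICON = {
    "the": {"DT"},
    "demonstrations": {"NNS"},
    "ended": {"VBD", "VBN"},
    "without": {"IN"},
    "confrontation": {"NN"},
}

# Hypothetical suffix rules, a weak stand-in for morphological analysis.
SUFFIX_RULES = [
    ("ing", {"VBG", "NN"}),
    ("ed", {"VBD", "VBN"}),
    ("s", {"NNS", "VBZ"}),
]

# Hypothetical (previous tag, tag) counts, a stand-in for distribution.
BIGRAMS = Counter({
    ("DT", "NNS"): 5,
    ("NNS", "VBD"): 4,
    ("NNS", "VBN"): 1,
    ("VBD", "IN"): 3,
    ("IN", "NN"): 4,
})

MOST_FREQUENT_TAG = "NN"  # assumed last-resort tag


def look_up(token):
    """Tag look-up: lexicon first, then suffix rules, then the last resort."""
    word = token.lower()
    if word in LEXICON:
        return set(LEXICON[word])
    for suffix, tags in SUFFIX_RULES:
        if word.endswith(suffix):
            return set(tags)
    return {MOST_FREQUENT_TAG}


def disambiguate(candidates, previous_tag):
    """Tag disambiguation: keep the candidate seen most often after previous_tag."""
    # Sorting makes ties deterministic (alphabetically first candidate wins).
    return max(sorted(candidates), key=lambda tag: BIGRAMS[(previous_tag, tag)])


def tag(tokens):
    """Tag a token list left to right, one tag per token."""
    tagged, previous = [], "<s>"
    for token in tokens:
        chosen = disambiguate(look_up(token), previous)
        tagged.append((token, chosen))
        previous = chosen
    return tagged


print(tag("The demonstrations ended without confrontation".split()))
```

A form that is not in the lexicon, such as the learner form "menospreciated", is only reached by the suffix rule for "-ed", so the sketch can propose verbal tags for it without knowing the stem.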

• POS-tagging learner language is essentially perceived as an instance of domain transfer (van Rooy & Schäfer 2003; Thouësny 2009):

– Automatic POS-taggers trained on native data are run on learner data.

– Due to differences in genre and data type, the annotations are less accurate.

– To make up for this degradation of performance, post-correction is often added.

• De Haan (2000) and Van Rooy & Schäfer (2002) investigated POS tagging error types. Spelling errors seem to be a major source of problems, but they can be handled rather straightforwardly, especially if they result in non-words.

• De Haan (2000) proposes a fine-grained classification of learner errors that become relevant to the POS tagging process. He suggests adapting the TOSCA-ICLE POS tagset to cater for these learner-specific features.

If native taggers

– Map linguistic categories of native language onto POS tags, based on the combinatory possibilities of stem-morphology-distribution,

The demonstrations ended without confrontation (demonstrations: NNS)

but learner language

– Does not always present the same POS categories because the combinatory possibilities of stem-morphology-distribution are different,

[…] If he want to know this […] (want: VB/VBP?)

Do native taggers always provide the categories needed to describe learner language?

2. Method

• This paper is based on a sample of the NOn-native Corpus of English (NOCE, Díaz Negrillo, 2007), containing around 40,000 words.

• The NOCE corpus is a written corpus of EFL:

– Over 300,000 words of written English by Spanish undergraduates.

– 1,054 samples of an average of 250 words each.

• The samples were collected:

– From 2003 to 2009, primarily among first-year students doing the English degree programme at the Universities of Granada and Jaén (Spain),

– At 3 stages in the academic year (beginning, mid-term and end),

– By the students’ lecturers, assisted by corpus compilers, in 1-hour teaching sessions,

– As a timed classroom task: essay writing, and

– On a voluntary basis and under appropriate conditions of anonymity.

• The corpus contains 3 types of annotation:

– Editorial annotation: the corpus is annotated for students’ edits to their own writing (e.g. strikeouts, late insertions, reordering of units and missing/unreadable text).

– Error annotation: a section of the corpus of around 40,000 words is error-tagged with the tagset EARS (Error-Annotation and Retrieval System, Díaz Negrillo, 2009).

– POS annotation: the corpus is annotated with 3 automatic POS taggers: TnT, Stanford and Treebank.

• General observations of the corpus’ POS annotations by the 3 POS taggers suggest:

– There are areas where the taggers do not provide the same tag for a given token,

– Certain cases are easy to disambiguate manually, but

– In other cases disambiguation is difficult because the tagsets do not fully map the categories present in the learner corpus.
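Finding the areas where taggers disagree can be sketched as a token-by-token comparison. The sketch below assumes that the outputs are aligned to the same tokenization and already mapped to a common tagset; the tagger names and the tags shown are hypothetical and are not actual output of the 3 taggers applied to the corpus.

```python
def find_disagreements(tokens, taggings):
    """Return (index, token, tags by tagger) for tokens the taggers tag differently.

    taggings maps a tagger name to a list of tags aligned with tokens.
    """
    disagreements = []
    for index, token in enumerate(tokens):
        tags_here = {name: tag_list[index] for name, tag_list in taggings.items()}
        if len(set(tags_here.values())) > 1:
            disagreements.append((index, token, tags_here))
    return disagreements


tokens = "If he want to know this".split()

# Hypothetical outputs, invented for illustration only.
taggings = {
    "tagger_a": ["IN", "PRP", "VB", "TO", "VB", "DT"],
    "tagger_b": ["IN", "PRP", "VBP", "TO", "VB", "DT"],
    "tagger_c": ["IN", "PRP", "VBP", "TO", "VB", "DT"],
}

for index, token, tags_here in find_disagreements(tokens, taggings):
    print(index, token, tags_here)
```

Tokens flagged this way are candidates for manual inspection; whether they can be disambiguated at all depends on whether the tagset provides a category that fits the learner form.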

• A preliminary examination of the mismatches between the native and learner POS categories suggests 4 main types of mismatches.

• The mismatches are discussed on the basis of the 3 sources of information handled by automatic POS taggers in the selection of tags for tokens:

– Lexical look-up: token’s stem,

– Morphology: token’s derivational and inflectional markings, and

– Distribution: token’s syntactic context.

3. Mismatches in POS classification variables

Case 1. Stem-Distribution mismatch

(1) You can find a big vary of beautiful beaches […]
Stem: Verb | Distribution: Noun

(2) They are very kind and friendship […]
Stem: Noun | Distribution: Adjective | Morphology: Noun

Case 2. Stem-Distribution and Stem-Morphology mismatch

(3) […] one of the favourite places to visit for foreigns.
Stem: Adjective | Distribution: Noun | Morphology: Noun

(4) […] to be choiced for a job […]
Stem: Noun | Distribution: Verb | Morphology: Verb

Case 3. Stem-Morphology mismatch

(5) […] this film is one of the bests ever.
Stem: Adjective | Distribution: Adjective | Morphology: Noun

(6) […] television, radio are very subjectives […]
Stem: Adjective | Distribution: Adjective | Morphology: Noun

Case 4. Distribution-Morphology mismatch

(7) […] for almost every jobs nowadays.
Stem: Noun | Distribution: Noun Sing | Morphology: Noun Pl

(8) […] it has grew up a lot especially since 1996 […]
Stem: Verb | Distribution: Verb PP | Morphology: Verb PT
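One way to make the three sources of evidence explicit is to record a category per source for each token and report which pairs conflict. The sketch below is an illustration under assumptions, not the annotation scheme proposed in this talk: the `Category` type, the `compatible` rule (a missing feature is compatible with any feature of the same word class) and the function names are all hypothetical. The pairs it reports need not coincide with the four case labels; for example (5) it also reports a Distribution-Morphology conflict.

```python
from typing import NamedTuple, Optional


class Category(NamedTuple):
    """A word class with an optional feature, e.g. Category("Noun", "Pl")."""
    word_class: str
    feature: Optional[str] = None


def compatible(a, b):
    """Two pieces of evidence agree if either is absent or neither contradicts the other."""
    if a is None or b is None:
        return True
    if a.word_class != b.word_class:
        return False
    return a.feature is None or b.feature is None or a.feature == b.feature


def conflicts(stem, distribution, morphology=None):
    """List the pairs of evidence sources that point to different categories."""
    sources = [("Stem", stem), ("Distribution", distribution), ("Morphology", morphology)]
    found = []
    for i in range(len(sources)):
        for j in range(i + 1, len(sources)):
            if not compatible(sources[i][1], sources[j][1]):
                found.append(f"{sources[i][0]}-{sources[j][0]}")
    return found


# (1) "vary": verbal stem in a nominal slot, no overt morphological marking.
print(conflicts(Category("Verb"), Category("Noun")))
# (3) "foreigns": adjectival stem, nominal slot, nominal (plural) marking.
print(conflicts(Category("Adjective"), Category("Noun"), Category("Noun")))
# (7) "jobs": singular slot after "every", plural marking.
print(conflicts(Category("Noun"), Category("Noun", "Sing"), Category("Noun", "Pl")))
```

Keeping the three values separate, rather than forcing a single tag, is what lets a token such as "foreigns" be retrieved both by its stem and by its distribution.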


4. POS tagging learner data and deviances

Not all learner errors demand special attention in POS-tagging:

(9) […] Internet can modificate […]

(10) He runned to by one […]

(11) […] The 11th March cames to out minds.

(12) Childrens spend so much time […]

(13) […] people shouldn’t be menospreciated […]

5. Conclusions

• Linguistic annotation of learner data is a powerful means to gain access to learner properties with a view to conducting theoretical and applied research.

• Application of native automatic POS-taggers is a sensible point of departure.

• However, for linguistic annotations to be fully relevant in learner corpus research, annotation should capture the properties of learner language systematically.

• Adaptation of existing native POS-tagsets to learner data specifications seems necessary.

References

de Haan, P. 2000. Tagging non-native English with the TOSCA-ICLE tagger. In C. Mair & M. Hundt (Eds.), Corpus Linguistics and Linguistic Theory (pp. 69-79). Amsterdam: Rodopi.

Díaz Negrillo, A. 2007. A Fine-Grained Error Tagger for Learner Corpora. Unpublished Ph.D. thesis, University of Jaén, Jaén.

Díaz Negrillo, A. 2009. EARS: A User’s Manual. Munich: LINCOM.

Thouësny, S. 2009. Increasing the reliability of a part-of-speech tagging tool for use with learner language. Paper presented at the Automatic Analysis of Learner Language (AALL’09) Workshop, Tempe, AZ.

van Rooy, B. & Schäfer, L. 2002. The effect of learner errors on POS tag errors during automatic POS tagging. Southern African Linguistics and Applied Language Studies, 20, 325-335.

van Rooy, B. & Schäfer, L. 2003. An evaluation of three POS taggers for the tagging of the Tswana Learner English Corpus. In D. Archer, P. Rayson, A. Wilson & T. McEnery (Eds.), Proceedings of the Corpus Linguistics 2003 Conference Lancaster University (UK), 28-31 March 2003. Vol. 16 (pp. 835-844). Lancaster: UCREL, Lancaster University.
