Linguistic annotation of learner corpora
A. Díaz-Negrillo, D. Meurers & H. Wunsch
University of Jaén (Spain), University of Tübingen (Germany)
Feb 11, 2016
1. Introduction
A study of linguistic annotation of learner corpora, in particular part-of-speech (POS) annotation. It discusses where native POS tagsets fail to describe learner language accurately, by:
• Describing POS annotation practice in learner corpora, and
• Characterizing the areas where properties of learner language differ from those assumed by native POS annotation schemes.
• Learner corpora can play a role in identifying areas of relevance in, for example, FLT, SLA, materials design, etc.
• The terminology used to single out learner language aspects needs to be mapped to instances in the corpus, i.e. annotation.
• Linguistic annotation of learner corpora, in particular POS tagging, is becoming a common practice because:
– By using generally agreed linguistic categories, it allows units of interest to be identified objectively.
– Other annotations specific to learner corpora (error tagging) mostly support research into deviances, are costly and involve a degree of subjectivity.
– In SLA research there is an interest in the developmental stages of the acquisition process.
– POS tagging can be done automatically.
Recent initiatives:
• International Corpus of Learner English (ICLE)
• Cambridge Learner Corpus (CLC)
• Japanese EFL Learner Corpus (JEFLL)
• Polish Learner Corpus of English
Automatic POS tagging consists of 2 parts:
– Tag look-up: all possible tags for the given token are determined from a lexical database or by morphological analysis.
– Tag disambiguation: the possible tags are reduced to the correct tag based on distribution.
Fallback strategies: weaker versions of the 3 sources of evidence (lexical look-up, morphology, distribution) and, as a last resort, use of the most frequent tags.
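The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm of any particular tagger: the lexicon, the tag-bigram counts and the function names are hypothetical toy values, and disambiguation is done greedily from left to right, whereas real statistical taggers use more elaborate models.

```python
from collections import Counter

# Hypothetical toy lexicon: token -> set of possible Penn Treebank-style tags.
LEXICON = {
    "the": {"DT"},
    "demonstrations": {"NNS"},
    "ended": {"VBD", "VBN"},
    "without": {"IN"},
    "confrontation": {"NN"},
}

# Hypothetical tag-bigram counts standing in for distributional evidence.
BIGRAM_COUNTS = Counter({
    ("<s>", "DT"): 10,
    ("DT", "NNS"): 8,
    ("NNS", "VBD"): 6,
    ("NNS", "VBN"): 1,
    ("VBD", "IN"): 5,
    ("IN", "NN"): 7,
})

# Last-resort fallback for tokens missing from the lexicon (assumed value).
MOST_FREQUENT_TAG = "NN"


def look_up(token):
    """Step 1 (tag look-up): return every tag the lexicon allows."""
    return LEXICON.get(token.lower(), {MOST_FREQUENT_TAG})


def disambiguate(tokens):
    """Step 2 (tag disambiguation): greedily keep, for each token, the
    candidate tag seen most often after the previously chosen tag."""
    tags = []
    previous = "<s>"  # sentence-start marker
    for token in tokens:
        candidates = sorted(look_up(token))
        best = max(candidates, key=lambda tag: BIGRAM_COUNTS[(previous, tag)])
        tags.append(best)
        previous = best
    return tags


sentence = "The demonstrations ended without confrontation".split()
tagged = list(zip(sentence, disambiguate(sentence)))
```

Here "ended" is ambiguous between VBD and VBN at look-up, and the toy distributional counts select VBD after a plural noun.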
• POS-tagging learner language is essentially perceived as an instance of domain transfer (van Rooy & Schäfer 2003; Thouësny 2009):
– Automatic POS-taggers trained on native data are run on learner data.
– Due to differences in genre and data type, the annotations are less accurate.
– To make up for this degradation of performance, post-correction is often added.
• De Haan (2000) and van Rooy & Schäfer (2002) investigated POS tagging error types. Spelling errors seem to be a major source of problems, but they can be handled rather straightforwardly, especially if they result in non-words.
• De Haan (2000) proposes a fine-grained classification of learner errors that become relevant to the POS tagging process. He suggests adapting the TOSCA-ICLE POS tagset to cater for these learner-specific features.
If native taggers
– Map the linguistic categories of the native language onto POS tags, based on the combinatory possibilities of stem, morphology and distribution,
  The demonstrations ended without confrontation. NNS
but learner language
– Does not always present the same POS categories, because the combinatory possibilities of stem, morphology and distribution are different,
  […] If he want to know this […] VB/VBP?
do native taggers always provide the categories needed to describe learner language?
2. Method
• This paper is based on a sample of the NOn-native Corpus of English (NOCE, Díaz Negrillo, 2007), containing around 40,000 words.
• The NOCE corpus is a written corpus of EFL:
– Over 300,000 words of written English by Spanish undergraduates.
– 1,054 samples of an average of 250 words each.
• The samples were collected:
– From 2003 to 2009, primarily among first-year students on the English degree programme at the Universities of Granada and Jaén (Spain),
– At 3 stages in the academic year (beginning, mid-term and end),
– By the students’ lecturers, assisted by corpus compilers, in 1-hour teaching sessions,
– As a timed classroom task: essay writing, and
– On a voluntary basis and under appropriate conditions of anonymity.
• The corpus contains 3 types of annotation:
– Editorial annotation: the corpus is annotated for students’ edits to their own writing (e.g. strikeouts, late insertions, reordering of units and missing/unreadable text).
– Error annotation: a section of the corpus of around 40,000 words is error-tagged with the tagset EARS (Error-Annotation and Retrieval System, Díaz Negrillo, 2009).
– POS annotation: the corpus is annotated with 3 automatic POS taggers: TnT, Stanford and Treebank.
• General observations of the corpus’ POS annotations by the 3 POS taggers suggest:
– There are areas where the taggers do not provide the same tag for a given token,
– Certain cases are easy to disambiguate manually, but
– In other cases disambiguation is difficult because the tagsets do not fully map the categories present in the learner corpus.
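Finding the tokens on which the taggers disagree can be sketched as below. The tagger names and tag sequences are hypothetical placeholders, not actual output from the taggers applied to NOCE, and the sketch assumes that all taggers share one tokenisation and one tagset.

```python
def find_disagreements(tokens, outputs):
    """Return (index, token, {tagger: tag}) for every token on which
    the taggers do not all assign the same tag.

    `outputs` maps a tagger name to its tag sequence, aligned with `tokens`.
    """
    disagreements = []
    for i, token in enumerate(tokens):
        tags = {name: sequence[i] for name, sequence in outputs.items()}
        if len(set(tags.values())) > 1:
            disagreements.append((i, token, tags))
    return disagreements


# Hypothetical outputs for a learner sentence from the slides.
tokens = ["If", "he", "want", "to", "know", "this"]
outputs = {
    "tagger_a": ["IN", "PRP", "VB", "TO", "VB", "DT"],
    "tagger_b": ["IN", "PRP", "VBP", "TO", "VB", "DT"],
    "tagger_c": ["IN", "PRP", "VBP", "TO", "VB", "DT"],
}

flagged = find_disagreements(tokens, outputs)
```

The flagged tokens are the candidates for manual inspection: some will be easy to resolve, while others may point to a category that the tagset does not provide.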
• A preliminary examination of the mismatches between the native and learner POS categories suggests 4 main types of mismatch.
• The mismatches are discussed on the basis of the 3 sources of information handled by automatic POS taggers in the selection of tags for tokens:
– Lexical look-up: token’s stem,
– Morphology: token’s derivational and inflectional markings, and
– Distribution: token’s syntactic context.
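One way to make the three sources of information explicit is to record them separately for each token instead of collapsing them into a single tag. The sketch below is an assumption made for illustration: the class name, field names and category labels are hypothetical and are not an annotation scheme defined in these slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TokenEvidence:
    """Word class suggested by each source of evidence for one token.

    A value of None means the source gives no evidence, e.g. a token
    that carries no derivational or inflectional marking.
    """
    token: str
    stem: Optional[str]
    morphology: Optional[str]
    distribution: Optional[str]

    def sources_agree(self):
        """True if all sources that give evidence point to the same class."""
        values = {v for v in (self.stem, self.morphology, self.distribution)
                  if v is not None}
        return len(values) <= 1


# Native example from the slides: the sources converge on one category.
native = TokenEvidence("demonstrations", stem="Noun",
                       morphology="Noun", distribution="Noun")

# Learner example (1) from Section 3: a verbal stem in a nominal slot.
learner = TokenEvidence("vary", stem="Verb",
                        morphology=None, distribution="Noun")
```

With this representation a single POS tag is adequate exactly when `sources_agree()` holds, and the remaining tokens are the ones where a native tagset may lack a suitable category.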
3. Mismatches in POS classification variables

Case 1. Stem-Distribution mismatch
(1) You can find a big vary of beautiful beaches […]
– Stem: Verb; Distribution: Noun
(2) They are very kind and friendship […]
– Stem: Noun; Distribution: Adjective; Morphology: Noun

Case 2. Stem-Distribution and Stem-Morphology mismatch
(3) […] one of the favourite places to visit for foreigns.
– Stem: Adjective; Distribution: Noun; Morphology: Noun
(4) […] to be choiced for a job […]
– Stem: Noun; Distribution: Verb; Morphology: Verb

Case 3. Stem-Morphology mismatch
(5) […] this film is one of the bests ever.
– Stem: Adjective; Distribution: Adjective; Morphology: Noun
(6) […] television, radio are very subjectives […]
– Stem: Adjective; Distribution: Adjective; Morphology: Noun

Case 4. Distribution-Morphology mismatch
(7) […] for almost every jobs nowadays.
– Stem: Noun; Distribution: Noun Sing; Morphology: Noun Pl
(8) […] it has grew up a lot especially since 1996 […]
– Stem: Verb; Distribution: Verb PP; Morphology: Verb PT
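The comparison behind these cases can be sketched as a pairwise check over the three sources. This is an illustrative assumption, not the authors' procedure: the function names are hypothetical, and labels are taken to have the form "Class" or "Class Feature" (e.g. "Noun Pl"). The function reports every pair of sources whose labels conflict, which is a finer-grained view than the four case labels and need not coincide with them for every example.

```python
from itertools import combinations


def parse_label(label):
    """Split 'Noun Pl' into ('Noun', 'Pl'); a bare 'Noun' gives ('Noun', None)."""
    if label is None:
        return None
    parts = label.split(maxsplit=1)
    return parts[0], (parts[1] if len(parts) > 1 else None)


def labels_conflict(first, second):
    """Two labels conflict if their word classes differ, or if both carry
    a feature (e.g. Sing/Pl) and the features differ."""
    a, b = parse_label(first), parse_label(second)
    if a is None or b is None:
        return False  # a source with no evidence cannot conflict
    if a[0] != b[0]:
        return True
    return a[1] is not None and b[1] is not None and a[1] != b[1]


def conflicting_pairs(stem, distribution, morphology):
    """Name every pair of evidence sources whose labels conflict."""
    sources = [("Stem", stem),
               ("Distribution", distribution),
               ("Morphology", morphology)]
    return [f"{name1}-{name2}"
            for (name1, label1), (name2, label2) in combinations(sources, 2)
            if labels_conflict(label1, label2)]


# Labels taken from examples (1), (3) and (7) above.
vary = conflicting_pairs("Verb", "Noun", None)
foreigns = conflicting_pairs("Adjective", "Noun", "Noun")
jobs = conflicting_pairs("Noun", "Noun Sing", "Noun Pl")
```

For these three examples the reported pairs match the case headings: Stem-Distribution for "vary", Stem-Distribution and Stem-Morphology for "foreigns", and Distribution-Morphology for "jobs".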
4. POS tagging learner data and deviances
Not all learner errors demand special attention in POS-tagging:
(9) […] Internet can modificate […]
(10) He runned to by one […]
(11) […] The 11th March cames to out minds.
(12) Childrens spend so much time […]
(13) […] people shouldn’t be menospreciated […]
5. Conclusions
• Linguistic annotation of learner data is a powerful means to gain access to learner properties with a view to conducting theoretical and applied research.
• Application of native automatic POS-taggers is a sensible point of departure.
• However, for linguistic annotations to be fully relevant in learner corpus research, annotation should capture the properties of learner language systematically.
• Adaptation of existing native POS-tagsets to learner data specifications seems necessary.
References
de Haan, P. 2000. Tagging non-native English with the TOSCA-ICLE tagger. In C. Mair & M. Hundt (Eds.), Corpus Linguistics and Linguistic Theory (pp. 69-79). Amsterdam: Rodopi.
Díaz Negrillo, A. 2007. A Fine-Grained Error Tagger for Learner Corpora. Unpublished Ph.D. thesis, University of Jaén, Jaén.
Díaz Negrillo, A. 2009. EARS: A User’s Manual. Munich: LINCOM.
Thouësny, S. 2009. Increasing the reliability of a part-of-speech tagging tool for use with learner language. Paper presented at the Automatic Analysis of Learner Language (AALL’09) Workshop, Tempe, AZ.
van Rooy, B. & Schäfer, L. 2002. The effect of learner errors on POS tag errors during automatic POS tagging. Southern African Linguistics and Applied Language Studies, 20, 325-335.
van Rooy, B. & Schäfer, L. 2003. An evaluation of three POS taggers for the tagging of the Tswana Learner English Corpus. In D. Archer, P. Rayson, A. Wilson & T. McEnery (Eds.), Proceedings of the Corpus Linguistics 2003 Conference Lancaster University (UK), 28-31 March 2003. Vol. 16 (pp. 835-844). Lancaster: UCREL, Lancaster University.