Elliphant:
A Machine Learning Method
for Identifying Subject Ellipsis
and Impersonal Constructions
in Spanish
Luz Rello
Main advisor: Ruslan Mitkov
Co-advisor: Xavier Blanco
A thesis submitted for the degree of Erasmus Mundus International Master in
Natural Language Processing and Human Language Technology
Research Group in Computational Linguistics, University of Wolverhampton
Laboratori fLexSem, Universitat Autonoma de Barcelona
June 2010
In memory of Juan Rello
“And then again,” Grandpa Joe went on speaking very slowly
now so that Charlie wouldn’t miss a word, “Mr Willy Wonka
can make marshmallows that taste of violets, and rich caramels
that change colour every ten seconds as you suck them, and little
feathery sweets that melt away deliciously the moment you put
them between your lips. He can make chewing-gum that never
loses its taste, and sugar balloons that you can blow up to enor-
mous sizes before you pop them with a pin and gobble them up.
And, by a most secret method, he can make lovely blue birds’
eggs with black spots on them, and when you put one of these in
your mouth, it gradually gets smaller and smaller until suddenly
there is nothing left except a tiny little pink sugary baby bird
sitting on the tip of your tongue.”
Charlie and the Chocolate Factory, Roald Dahl
Abstract
This thesis presents Elliphant, a machine learning system for classifying
Spanish subject ellipsis as either referential or non-referential. Linguisti-
cally motivated features are incorporated in a system which performs a
ternary classification: verbs with explicit subjects, verbs with omitted but
referential subjects (zero pronouns), and verbs with no subject (impersonal
constructions). To the best of our knowledge, this is the first attempt to
automatically identify non-referential ellipsis in Spanish. In order to en-
able a memory-based strategy, the eszic Corpus was created and manually
annotated. The corpus is composed of Spanish legal and health texts and
contains more than 6,800 annotated instances. A set of 14 features was
defined and a separate training file was created, containing the instances
represented as vectors of feature values. The training data was used with
the Weka package and a set of optimization experiments was carried out
to determine the best machine learning algorithm to use, the parameter op-
timization, the most effective combinations of features, the optimal number
of instances needed to train the classifier, and the optimal settings for clas-
sifying instances occurring in different genres. A comparative evaluation
of Elliphant with Connexor’s Machinese Syntax parser shows the superior-
ity of our system. The overall accuracy of the system is 86.9%. Given the
fairly frequent elision of subjects in Spanish, this system is useful: the
classification of elliptic subjects as referential or non-referential can improve
the accuracy of Natural Language Processing applications that require zero
anaphora resolution, inter alia information extraction, machine translation,
automatic summarization and text categorization.
Acknowledgements
First, my sincere acknowledgements to Prof. Ruslan Mitkov for providing
everything that can be asked of a supervisor: constant trust, support and
encouragement from the very beginning until the end of this thesis.
There are three other persons without whom this work would not have
been possible (alphabetically): Thank you, Ricardo Baeza-Yates, for your
brilliant ideas; thank you, Richard Evans, for your guidance; and thank
you, Pablo Suarez, for helping the project to become a reality.
I would like to acknowledge the Computational Linguistics Group at the
University of Wolverhampton, where my collaboration during the first year
brought its first results, especially Iustina Ilisei and Naveed Afzal.
Thank you for the assistance received at the Universitat Autonoma de Barcelona
from my co-advisor Xavier Blanco and from Jose María Brucart and Joaquim
Llisterri.
I am indebted to the Grupo de Investigacion en Tratamiento Automatico
del Lenguaje Natural of Universitat Pompeu Fabra for their support and
feedback during this last semester, particularly to Gabriela Ferraro and Leo
Wanner.
Finally, thank you to Igor Mel’cuk and Ignacio Bosque for resolving doubts
and to Sang Yoon Kim and Ana Suarez Fernandez for their help throughout
the annotation process.
These master’s studies were supported by a “La Caixa” grant (Becas de “La
Caixa” para estudios de master en Espana. Convocatoria 2008).
Contents
1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Results
1.4 Thesis Outline
2 Related Work
2.1 NLP Approaches
2.1.1 NLP Approaches to Zero Pronouns
2.1.2 NLP Approaches to Identifying Non-referential Constructions
2.2 Linguistic Approaches
2.2.1 Linguistic Approaches to Subject Ellipsis
2.2.2 Linguistic Approaches to Non-referential Ellipsis
3 Detecting Ellipsis in Spanish
3.1 Classification
3.1.1 Explicit Subjects: Non-elliptic and Referential
3.1.2 Zero Pronouns: Elliptic and Referential
3.1.3 Impersonal Constructions: Elliptic and Non-referential
3.2 Machine Learning Approach
3.2.1 Building the Training Data
3.2.2 Annotation Software and Annotation Guidelines
3.2.3 Features
3.2.4 Purpose Built Tools
3.2.5 The WEKA Package
4 Evaluation
4.1 Experiments
4.1.1 Method Selected: K* Algorithm
4.1.2 Learning Curve
4.1.3 Most Effective Features
4.1.4 Genre Analysis
4.2 Comparative Evaluation
5 Conclusions and Future Work
5.1 Main Observations
5.2 Future Research
References
List of Figures
2.1 Types of subject ellipsis (Brucart, 1999) and types of verbs (Real Academia Espanola, 2009).
3.1 An example of the output of Connexor’s Machinese Syntax parser for Spanish.
3.2 Screenshot of the annotation program interface.
3.3 An example of the Weka Explorer interface.
4.1 eszic training data learning curve for accuracy.
4.2 eszic training data learning curve for precision, recall and f-measure.
4.3 Learning curve for accuracy, recall and f-measure of the classes.
4.4 Learning curve for accuracy, recall and f-measure in relation to the number of instances of each class.
List of Tables
3.1 eszic Corpus: tokens, sentences and clauses.
3.2 eszic Corpus: number of instances per class.
3.3 eszic Corpus annotation tags.
3.4 Features: definitions and values.
4.1 Weka classifiers’ accuracy (20% of the eszic training set).
4.2 eszic training data evaluation with K* -B 40 -M a.
4.3 Leave-one-out and ten-fold cross-validation comparison.
4.4 Selected features by Weka Attribute Selection methods.
4.5 Classification using the selected feature groups: accuracy.
4.6 Extrinsic parser features classification results.
4.7 Intrinsic parser features classification results.
4.8 Single feature omission classifications: accuracy.
4.9 Legal and health genres comparative evaluation.
4.10 Cross-genre training and testing evaluation.
4.11 Elliphant eszic training data results.
4.12 Machinese eszic training data results.
4.13 Elliphant Legal eszic training results.
4.14 Machinese Legal eszic training results.
4.15 Elliphant Health eszic training data results.
4.16 Machinese Health eszic training data results.
Chapter 1
Introduction
This introduction explains the three primary motivations for this research
(Section 1.1) and its objectives (Section 1.2), and briefly describes its outcomes,
which include the results of an evaluation of the implemented system and the
publications produced over the course of the study (see Section 1.3). The overall
structure of the thesis is presented in Section 1.4.
1.1 Motivation
Three reasons motivated the decision to choose this research topic and to develop
a tool, Elliphant, that identifies zero pronouns (referential elliptic
subjects) and impersonal constructions (non-referential elliptic subjects)
in Spanish.
The three justifications for this work are: (1) the highly frequent occurrence of zero
pronouns in Spanish; (2) identification of zero pronouns is a prerequisite for anaphora
resolution in Spanish and also for other Natural Language Processing (nlp) applica-
tions; and (3) this challenge had not yet been fully addressed in the field. The system
presented in this dissertation represents the first attempt to automatically identify
non-referential ellipsis in Spanish.
Since Spanish is a pro-drop language (Chomsky, 1981), subject ellipsis is a recur-
ring phenomenon. It was noted that 26% of the 6,878 cases annotated in the corpus
exploited in this work have an elliptic subject, while only 3% of them occur in imper-
sonal constructions. The topic of subject ellipsis has been addressed in previous work
on other pro-drop languages such as Japanese (Okumura & Tamura, 1996), Chinese
(Zhao & Ng, 2007), Korean (Lee & Byron, 2004) and Russian (Kibrik, 2004). The
related topic of the identification of non-referential pronouns has been addressed in
non-pro-drop languages such as English (Evans, 2001) and French (Danlos, 2005).
The identification of zero pronouns and non-referential impersonal constructions is
necessary for anaphora resolution: zero pronouns (zero anaphora) must be identified
before they can be resolved, and their identification first requires that they be
distinguished from non-referential constructions (Mitkov, 2010).
Coreference and anaphora resolution, and in particular zero anaphora resolution,
have been found to be crucial in a number of nlp applications. These include, but
are not limited to, information extraction (Chinchor & Hirschman, 1997), machine
translation (Peral & Ferrandez, 2000), automatic summarisation (Steinberger et al.,
2007), text categorisation (Yeh & Chen, 2003a), topic recognition (Yeh & Chen, 2007),
salience identification (Iida et al., 2009) and word sense disambiguation (Kawahara &
Kurohashi, 2004). Moreover, there is additional research showing that zero pronoun
identification is useful in order to make further developments in centering theory (Matsui, 1999), for named entity recognition (Hirano et al., 2007), for the investigation of
convergence universals in translation (Corpas Pastor et al., 2008) and to discriminate
predicate-argument structure (Imamura et al., 2009).
Finally, the difficulty in detecting non-referential pronouns has been acknowledged
since computational resolution of anaphora was first attempted (Bergsma et al., 2008)
and this task is currently needed in nlp for Spanish. The need for automatic tools
able to detect ellipticals has been stated by Recasens & Hovy (2009), who note that
their application would improve existing methods for zero anaphora resolution in Span-
ish (Ferrandez & Peral, 2000). One particular contribution of the current research is
the recognition of Spanish impersonal constructions which, following from the litera-
ture review presented in Chapter 2, appears not to have been addressed before in the
literature.
1.2 Objectives
The goal of the fully automatic method presented in this dissertation (Elliphant) is
to identify zero pronouns (referential elliptic subjects) and impersonal constructions
(non-referential elliptic subjects) in Spanish. In order to accomplish this objective, it is
also necessary to identify the explicit subjects that occur in complementary
distribution with them in the subject position. For this reason, the identification
of explicit subjects was carried out using a learning-based method, leading to a
ternary classification that covers all the elements (elliptic and explicit,
referential and non-referential) of the subject position in the clause. These three
classes are explicit subjects, zero pronouns and impersonal constructions.
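To make the task concrete, the ternary classification can be pictured as assigning each clause's main verb one of three labels on the basis of a vector of feature values. The sketch below is illustrative only, not the thesis implementation; the feature names and values are invented for the example:

```python
# Illustrative sketch of the ternary classification task: each clause's
# main verb is one instance, represented as feature values plus a label.
from dataclasses import dataclass

CLASSES = ("explicit_subject", "zero_pronoun", "impersonal_construction")

@dataclass
class VerbInstance:
    features: dict  # hypothetical feature names -> values
    label: str      # one of CLASSES

# "Llueve." ((It) is raining) has no subject at all: an impersonal construction.
instance = VerbInstance(
    features={"lemma": "llover", "person": 3, "number": "sg",
              "clause_type": "main"},   # invented example features
    label="impersonal_construction",
)
print(instance.label)  # prints impersonal_construction
```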
1.3 Results
The results obtained by the Elliphant system and the level of performance that it
reaches are encouraging since this tool not only identifies zero pronouns and imper-
sonal constructions but also outperforms a dependency parser (Connexor’s Machinese
Syntax) in identifying explicit subjects as well as elliptic subjects. A series of ex-
periments undertaken with the algorithm has enabled discovery of the most effective
features for use in the classification tasks. The performance results obtained for the
identification of impersonal constructions are, according to the survey of previous work
carried out in Chapter 2, the first presented for this task in the literature.
The classification results obtained by the algorithm were presented in Rello et al.
(2010b); that paper, however, undertook no further investigation into the efficacy
of the features used, which is presented in Rello et al. (2010a).
Two previous studies contributed to the design of the Elliphant system: one
concerning the distribution of zero pronouns (Rello & Illisei, 2009a) and the other
presenting a rule-based method for their identification (Rello & Illisei, 2009b).
It should be noted, however, that despite their contribution, Elliphant differs
considerably from these initial studies in terms of methodology (corpus used,
linguistic criteria exploited, and the overall approach) and the classification
task itself (classes to be identified). Overall, the Elliphant system represents a
considerable advancement on those works.
1.4 Thesis Outline
The remainder of this thesis is structured in four chapters. Chapter 2 provides a lit-
erature review of nlp approaches (see Section 2.1) to zero pronouns (Section 2.1.1)
and identification of non-referential expressions (Section 2.1.2). The review also covers
work in the field of Linguistics, including approaches to referential and non-referential
subject ellipsis (Section 2.2.1 and 2.2.2). Chapter 3 describes the methodology embod-
ied by the Elliphant system. Firstly, the classification task (see Section 3.1) and an
explanation of each of the classes is presented: explicit subjects (Section 3.1.1), zero
pronouns (Section 3.1.2) and impersonal constructions (Section 3.1.3). Secondly, the
machine learning method (see Section 3.2) is described, beginning with the compilation
of the corpus (Section 3.2.1), the guidelines established and the software developed to
facilitate annotation of the corpus by human annotators (Section 3.2.2), a description
of the features (see Section 3.2.3) derived from the corpus and the purpose built tools
(Section 3.2.4) implemented to generate the training data exploited by the machine
learning package, Weka (Section 3.2.5). Elliphant is evaluated in Chapter 4. A set of experiments (Section 4.1) was carried out to determine the method and parameter values
which work best for these classification tasks (Section 4.1.1), its learning curves (Sec-
tion 4.1.2) and the most effective groups of features (Section 4.1.3). A comparative
evaluation of the Elliphant system with an existing parser is presented in Section 4.2.
Finally, in Chapter 5, conclusions are drawn and plans for future work are considered.
Chapter 2
Related Work
Both the nlp and linguistics literature address referential and non-referential subject
ellipsis. Although the nlp literature is directly related to this dissertation in terms of
objectives and methodology, more general literature in linguistics contributes various
means by which classes of subject ellipsis and annotation criteria can be established.
Related work in nlp (see Section 2.1) on this topic can be classified as (a) liter-
ature related to zero pronouns (Section 2.1.1), which is mainly concerned with their
identification, resolution and generation, and (b) literature related to the identification
of non-referential constructions (Section 2.1.2).
The literature in linguistics (Section 2.2) concerning different types of ellipsis, in
which both zero pronouns (see Section 2.2.1) and non-referential constructions (see
Section 2.2.2) are included, is focused on the definition, delimitation and description of
their use in language.
2.1 NLP Approaches
The nlp literature in this area broadly concerns two topics, namely zero pronouns
(Section 2.1.1) and non-referential constructions (Section 2.1.2). The number and
variety of studies in the first group are considerably larger than in the second.
Both topics are mainly related to coreference and anaphora resolution systems, as
the resolution of zero pronouns (zero anaphora) implies their prior identification.
That identification in turn requires that zero pronouns be distinguished from
non-referential constructions (Mitkov, 2010).
While undertaking this literature review, no specific studies on the identification
of non-referential constructions in Spanish were found, although this has been
identified as a necessary task (Ferrandez & Peral, 2000; Recasens & Hovy, 2009) in anaphora
and coreference resolution. For this reason it is expected that the method presented in
this dissertation will complement current Spanish pronoun resolution systems.
2.1.1 NLP Approaches to Zero Pronouns
A zero pronoun is the resultant “gap” (zero anaphor) where zero anaphora or ellipsis
occurs, when an anaphoric pronoun is omitted but is nevertheless understood (Mitkov,
2002). In linguistics, zero pronouns are also referred to as null subjects, empty subjects,
elliptic subjects, elided subjects, tacit subjects, understood subjects and non-explicit sub-
jects, among others. In the nlp literature such omitted subjects are broadly denoted as
zero pronouns. Some linguistic studies also make use of the term “zero pronoun” which
is not equivalent to the computational concept. The Meaning-Text Theory (mtt) con-
siders a zero pronoun in subject position to be a non-argumental impersonal subject
(Mel’cuk, 2006):
Llueve.
(It) is raining.
while in Generative Grammar, following the Zero Hypothesis (Kratzer, 1998), a zero
pronoun can have phonetic content (full pronoun) or not (null pronoun). In this theory,
the concept of zero pronoun has to do only with its lack of lexical content in contrast
to lexical pronouns (Alonso-Ovalle & D’Introno, 2000). In this work, a zero pronoun
(Mitkov, 2002) corresponds to an omitted subject (Real Academia Espanola, 2009)
in Spanish.
Zero pronouns become crucial when processing any pro-drop language (Chomsky,
1981) –also known as null subject languages– since zero anaphora is fairly frequent in
such languages. By way of example, of the 6,827 annotated cases in our corpus, 26%
of them have an omitted subject.
The current literature review indicates that the following are the pro-drop
languages on which related work on zero pronoun processing has been carried out:
– Japanese (Hirano et al., 2007; Iida et al., 2006, 2009; Imamura et al., 2009; Isozaki
& Hirao, 2003; Kawahara & Kurohashi, 2004; Matsui, 1999; Mori & Nakagawa,
1996; Murata et al., 1999; Nakagawa, 1992; Nakaiwa, 1997; Nakaiwa & Ikehara,
1992; Nakaiwa & Shirai, 1996; Nomoto & Yoshihiko, 1993; Okumura & Tamura,
1996; Sasano et al., 2008; Seki et al., 2002; Takada & Doi, 1994; Yoshimoto, 1988);
– Chinese (Hu, 2008; Peng & Araki, 2007a,b; Yeh & Chen, 2003a,b, 2007; Yeh &
Mellish, 1997; Zhao & Ng, 2007);
– Korean (Han, 2004; Lee & Byron, 2004; Lee et al., 2005);
– Spanish (Barreras, 1993; Corpas Pastor, 2008; Corpas Pastor et al., 2008; Ferrandez
& Peral, 2000; Peral, 2002; Peral & Ferrandez, 2000; Rello & Illisei, 2009a,b); and
– Russian (Kibrik, 2004).
These studies of zero pronouns address a variety of topics. Depending on their goal,
the literature on zero pronouns can be divided into the following classes:
– Zero pronoun classification or annotation: (Han, 2004; Kibrik, 2004; Lee &
Byron, 2004; Lee et al., 2005; Rello & Illisei, 2009a);
– Zero pronoun identification (Corpas Pastor, 2008; Corpas Pastor et al., 2008;
Nakaiwa, 1997; Rello & Illisei, 2009b; Yoshimoto, 1988);
– Resolution of zero pronouns, including their prior identification (Barreras,
1993; Ferrandez & Peral, 2000; Hu, 2008; Isozaki & Hirao, 2003; Kawahara &
Kurohashi, 2004; Murata et al., 1999; Nakaiwa & Shirai, 1996; Nomoto & Yoshi-
hiko, 1993; Okumura & Tamura, 1996; Peng & Araki, 2007b; Sasano et al., 2008;
Seki et al., 2002; Yeh & Chen, 2003b; Zhao & Ng, 2007); and
– Zero pronoun generation (Peral, 2002; Peral & Ferrandez, 2000; Theune et al.,
2006; Yeh & Mellish, 1997).
Other nlp applications where zero pronouns are taken into consideration are: ma-
chine translation (Nakaiwa & Ikehara, 1992; Nakaiwa & Shirai, 1996; Peng & Araki,
2007a; Peral, 2002; Peral & Ferrandez, 2000); named entity recognition (Hirano et al.,
2007); summarisation (Steinberger et al., 2007); text categorisation (Yeh & Chen,
2003a); topic identification (Yeh & Chen, 2007) and identifying salience in text (Iida
et al., 2009); and word sense disambiguation (Kawahara & Kurohashi, 2004).
Further research topics where zero pronoun identification is useful are: predicate-
argument structure discrimination (Imamura et al., 2009); for further developments
in centering theory (Matsui, 1999) such as improved interpretation of zero pronouns
(Takada & Doi, 1994); or for the investigation of convergence universals in translation
(Corpas Pastor, 2008; Corpas Pastor et al., 2008).
Other studies address specific cases of zero pronouns, such as those whose referents
take the semantic role of experiencer (Nakagawa, 1992), zero pronouns in relation to
conditional constructions (Mori & Nakagawa, 1996), or descriptions of the syntactic
patterns in which zero pronouns are used (Iida et al., 2006), among others.
In terms of methodology, rule-based, machine learning, and a variety of other ap-
proaches have been taken toward zero pronoun identification and resolution:
– Rule-based approaches (Barreras, 1993; Corpas Pastor et al., 2008; Ferrandez
& Peral, 2000; Hu, 2008; Kawahara & Kurohashi, 2004; Kibrik, 2004; Matsui,
1999; Mori & Nakagawa, 1996; Murata et al., 1999; Nakagawa, 1992; Nakaiwa &
Ikehara, 1992; Nakaiwa & Shirai, 1996; Nomoto & Yoshihiko, 1993; Peral, 2002;
Peral & Ferrandez, 2000; Rello & Illisei, 2009b; Yeh & Chen, 2003a,b, 2007; Yeh
& Mellish, 1997; Yoshimoto, 1988);
– Machine learning approaches (Hirano et al., 2007; Iida et al., 2006, 2009; Kawa-
hara & Kurohashi, 2004; Peng & Araki, 2007b; Zhao & Ng, 2007);
– Hybrid methods combining rules and learning algorithms (Isozaki & Hirao, 2003);
– Probabilistic models (Sasano et al., 2008; Seki et al., 2002); and
– other techniques such as the exploitation of parallel corpora (Nakaiwa, 1997).
Although it is clear that machine learning methods perform better than other ap-
proaches when identifying non-referential expressions (Boyd et al., 2005), there is some
debate about which approach brings optimal performance when applied in anaphora
resolution systems (Mitkov, 2002).
In Spanish, the most influential work on this topic is the Ferrandez and Peral
algorithm for zero pronoun resolution (Ferrandez & Peral, 2000) together with their
previous related work (Ferrandez et al., 1998, 1999). Their implementation of a zero
pronoun identification and resolution module forms part of a system known as the Slot
Unification Parser for Anaphora resolution (supar) (Ferrandez et al., 1999).
Although substantially related, the work described in this dissertation differs both
in form and in aim from this previous research on Spanish (Ferrandez & Peral, 2000).
Firstly, their definition of zero pronouns is broader since it is suited to a different
purpose: the zero class includes not only those zero signs whose referent lies in previous
clauses (anaphoric, according to their classification) and those that lie outside the text
(exophoric), but also those that occur after the verb (cataphoric). Here, it is considered
that those subjects that are within the clause, irrespective of whether they appear before
or after the verb, belong to the explicit subject class.
Secondly, Ferrandez & Peral (2000) take a rule-based approach while the system de-
scribed in this dissertation performs the classification using an instance-based learner.
Additionally, their rules are based on partial parsing, while some of the features ex-
ploited by the Elliphant system make use of information obtained from an analysis
of our corpus by a deep dependency parser. Ferrandez & Peral (2000) tested their
approach to zero pronoun identification and resolution using 1,599 cases, while the ma-
chine learning approach presented in this dissertation was tested on a corpus containing
6,827 classified verbal instances.
Finally, they do not provide a method for the identification of non-referential zero
pronouns. They also make no overt mention of automatic classification of zero pronouns
of the anaphoric or cataphoric kind (Ferrandez & Peral, 2000).
Despite the similarities of Ferrandez & Peral’s (2000) work to the approach described
in this dissertation, the fact that they adopt a different definition of zero pronouns
means that a comparison with the method described in the current work is not feasible
(Section 4.2).
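The instance-based classification described above can be made concrete with a toy sketch. The classifier below is a minimal memory-based learner in the spirit of K*: it stores every training instance and labels a new case by its nearest stored neighbour. The features and instances are invented for illustration, and the simple feature-overlap distance stands in for K*'s entropy-based distance, which this sketch does not attempt:

```python
# Minimal memory-based (instance-based) classifier sketch.
def overlap_distance(a, b):
    """Count the features on which two instances disagree."""
    return sum(1 for f in a if a[f] != b.get(f))

def classify(instance, training_data):
    """Return the label of the nearest stored training instance (1-NN)."""
    nearest = min(training_data, key=lambda ex: overlap_distance(instance, ex[0]))
    return nearest[1]

# Invented toy instances: (feature dict, class label) pairs.
training = [
    ({"explicit_np": "yes", "person": 3, "verb_type": "plain"},   "explicit"),
    ({"explicit_np": "no",  "person": 3, "verb_type": "plain"},   "zero"),
    ({"explicit_np": "no",  "person": 3, "verb_type": "weather"}, "impersonal"),
]
new_case = {"explicit_np": "no", "person": 3, "verb_type": "weather"}
print(classify(new_case, training))  # prints impersonal
```

The design point this illustrates is that a memory-based learner needs no hand-written rules: adding annotated instances is enough to change its behaviour, which is why corpus size and feature choice dominate the experiments in Chapter 4.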
This study improves on previous work by the current author (Rello & Illisei, 2009b),
differing from it in the design of the classification and in the methodology. In
Rello & Illisei (2009b), a binary classification as either elliptic-subject or
non-elliptic-subject was made by means of a rule-based method which applies only to
zero pronouns, whilst the present study offers a ternary classification which covers
all the possible instances of subject position in Spanish. Moreover, while zero
pronouns were annotated in Rello & Illisei (2009b), in the present
study the zero pronouns themselves were left unmarked. Instead, the main verb of
each clause is annotated and classified into one of three types. The baseline rule-based
algorithm described in Rello & Illisei (2009b) was based on the zero pronoun
identification methodology developed in Corpas Pastor et al. (2008), which treats
every clause that does not have an explicit subject as containing a zero pronoun.
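That baseline rule can be sketched in a few lines. The clause representation (a dict with a "subject" key) is invented here for illustration; the second example shows why impersonal constructions produce false positives under such a rule:

```python
# Sketch of the baseline rule: any clause lacking an explicit subject
# is assumed to contain a zero pronoun (i.e. is labelled elliptic).
def baseline_label(clause):
    """Binary rule-based classification: elliptic vs. non-elliptic subject."""
    return "non-elliptic" if clause.get("subject") else "elliptic"

clauses = [
    {"verb": "comió",  "subject": "Oscar"},  # explicit subject
    {"verb": "llueve", "subject": None},     # impersonal, yet labelled elliptic
]
print([baseline_label(c) for c in clauses])  # prints ['non-elliptic', 'elliptic']
```

The weather verb "llueve" has no subject at all, referential or otherwise, yet the rule labels it elliptic just like a genuine zero pronoun; this is exactly the class of error that motivates the ternary classification.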
2.1.2 NLP Approaches to Identifying Non-referential Constructions
The identification of non-referential pronouns1 is a crucial step in coreference (Boyd
et al., 2005; Mitkov, 2010) and anaphora resolution systems (Mitkov, 2001, 2002). In
comparison to the work addressing zero pronouns, previous research on this topic is
fairly limited, and, as implied by this survey of related work, the approach described in
this dissertation is the first attempt to automatically identify impersonal constructions
in Spanish.
The literature describing approaches to the identification of non-referential expres-
sions is focused on:
– Identification of pleonastic it in English (Denber, 1998; Lappin & Leass, 1994;
Paice & Husk, 1987). Work by Evans (2000, 2001) is exploited by an anaphora
resolution system in Mitkov et al. (2002). Further relevant studies include
Bergsma et al. (2008), Boyd et al. (2005), Clemente et al. (2004), Gundel et al.
(2005), Lambrecht (2001), Li et al. (2009), Muller (2006) and Ng & Cardie (2002); and
– Identification of expletive pronouns in French (Danlos, 2005).
Nevertheless, in those languages where approaches to the identification of non-
referential expressions have been implemented, there is actually an explicit word with
some grammatical information (a third person pronoun) in the text, which is non-
referential (Mitkov, 2010). By contrast, in Spanish, non-referential expressions are not
realised by expletive or pleonastic pronouns but by a certain kind of ellipsis. For this
reason, it is easy to wrongly identify them as zero pronouns, which are referential. For
example, pleonastic pronouns such as:
1In previous work these pronouns have also been referred to as pleonastic, expletive, non-anaphoric,
and non-referential pronouns.
(a.1) (It)1 must be stated that Oskar behaved impeccably.
(b.1) (It) rains, (Il) pleut, (Es) regnet.
(c.1) (It)’s three o’clock.
are all elided in Spanish, resulting in the following non-referential impersonal construc-
tions:
(a.2) Se dice que Oscar se comporto impecablemente.
(b.2) Llueve.
(c.2) Son las tres en punto.
A sizable proportion of the false positives obtained in previous work on identifying
zero pronouns were caused by such non-referential impersonal constructions (Rello &
Illisei, 2009b). Ferrandez & Peral (2000) noted that an inability to identify verbs used
in impersonal constructions has a negative effect on the performance of their anaphora
resolution algorithm2, while in Recasens & Hovy (2009, p. 41) the need for a tool to
identify ellipsis is observed:
“In contrast with previous work, many of the features relied on gold standard
annotations, pointing out the need for automatic tools for ellipticals detection and
deep parsing.”
Four approaches have been implemented to identify non-referential expressions and
described in the literature:
– Rule-based approaches (Danlos, 2005; Denber, 1998; Lappin & Leass, 1994; Paice
& Husk, 1987);
1 In this work, explicit subjects in the examples are presented in italics, zero
pronouns in the examples are represented by the symbol Ø, while in the English
translations the subjects which are elided in Spanish are marked with parentheses.
Impersonal constructions in the examples are not explicitly indicated using a
symbol (see Section 3.1).
2 The other two reasons given for the low success rate in the identification of
verbs with no subject are the lack of semantic information and the inaccuracy of
the grammar used (Ferrandez & Peral, 2000).
– Machine learning approaches (Bergsma et al., 2008; Boyd et al., 2005; Clemente
et al., 2004; Evans, 2000, 2001; Mitkov et al., 2002; Muller, 2006; Ng & Cardie,
2002);
– A Web-based approach (Li et al., 2009); and
– Descriptive studies from contextual (Lambrecht, 2001) and intonational points of
view (Gundel et al., 2005).
Paice & Husk (1987) introduce a rule-based method for identifying non-referential
it while Lappin & Leass (1994) and Denber (1998) describe rule-based components of
their pronoun resolution systems which detect non-referential uses of it. Mitkov's
first anaphora resolution algorithm did not incorporate an approach for detecting
pleonastic it (Mitkov, 1998), while more recent versions of mars (Mitkov's Anaphora
Resolution System) use Evans's (2001) machine learning system to detect pleonastic
it (Mitkov et al., 2002).
Instance-based learning approaches are used for identifying pleonastic it in English,
while the only approach for the identification of expletive pronouns in French employs
a rule-based methodology (Danlos, 2005).
Evans (2001)1 describes the first attempt to use a machine learning method to classify
pleonastic it into seven types, while Boyd et al. (2005) present a linguistically
motivated classification of non-referential it into four types.
A comparison replicating the approaches developed by Paice & Husk (1987) and
Evans (2001) with the system implemented by Boyd et al. (2005) corroborates the
finding that machine learning outperforms rule-based approaches (Boyd et al., 2005).
Further, it is pointed out that rule-based methods are limited due to their reliance on
lists of verbs and adjectives commonly used in the patterns that they exploit, which
can make them less portable and more difficult to adapt to new texts. Nevertheless, the
basic grammatical patterns are still reasonably consistent indicators of non-referential
occurrences of it (Boyd et al., 2005).
Certain aspects of the work described in this dissertation were inspired by the
methodology of the machine learning approaches for the identification of pleonastic it
specifically by Evans (2001) and Boyd et al. (2005).
1This method is currently incorporated as a component of mars (Mitkov et al., 2002).
Because the occurrence of non-referential zero pronouns is not very common1, the
size of our corpus was increased in order to achieve a sufficient number of instances
for each class. The training data exploited by the Elliphant system contains 6,827
instances, of which 179 are non-referential examples. In Evans (2001), 3,171 instances
of it were classified into seven classes, while in Boyd et al. (2005), 2,337 examples
were classified into four classes.
Our corpus was analysed, as in the approach described by Evans (2001), using a
functional dependency parser, Connexor's Machinese Syntax2 (Connexor Oy, 2006b;
Tapanainen & Jarvinen, 1997). Moreover, some of the features used in the Elliphant
system, such as the lemmas and the parts of speech (pos) of the preceding and
following material, were also implemented in Evans's (2001) approach.
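The use of the lemmas and pos tags of the surrounding tokens as features can be illustrated with a small sketch. The token representation, tag names and window size below are invented for illustration and are not the system's actual configuration:

```python
# Sketch: build a symbolic feature vector for a verb from the lemmas and
# POS tags of the tokens around it (window size and fields are illustrative).
def context_features(tokens, verb_index, window=2):
    """tokens: list of (form, lemma, pos) triples from a parsed sentence."""
    features = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # the verb itself is described by other features
        i = verb_index + offset
        if 0 <= i < len(tokens):
            _, lemma, pos = tokens[i]
        else:
            lemma, pos = "<pad>", "<pad>"  # position falls outside the sentence
        features[f"lemma[{offset:+d}]"] = lemma
        features[f"pos[{offset:+d}]"] = pos
    return features

sentence = [("Se", "se", "PRON"), ("dice", "decir", "V"),
            ("que", "que", "CS"), ("vendrá", "venir", "V")]
feats = context_features(sentence, 1)
# feats["lemma[-1]"] == "se"; feats["pos[+1]"] == "CS"
```

Such symbolic vectors, one per finite verb, are the kind of input an instance-based learner can compare directly.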
In contrast to previous work, the K* algorithm (Cleary & Trigg, 1995) was found
to provide the most accurate classification in the current study. Other approaches have
employed various classification algorithms, including K-nearest neighbors in TiMBL
(Boyd et al., 2005; Evans, 2001) and JRip in Weka (Muller, 2006).
2.2 Linguistic Approaches
Literature related to ellipsis in linguistic theory has served as one basis for establishing
the linguistically motivated classes and the annotation criteria in the current work. The
linguistically related work on this topic is focused on the definition and description of
the use of ellipsis in natural language and the limits of that use.
In Spanish, the use of ellipsis is very widespread. It is a phenomenon that occurs
in a wide range of contexts and is therefore much discussed in the field of linguistics.
To illustrate, some controversial topics in linguistics that pertain to instances of ellipsis
found in our corpus include: the establishment of different types of ellipsis, the identifi-
cation of impersonal sentences (non-referential expressions), the definition of particular
syntactic categories which can function as subjects, and the intricate differentiation of
reflex passive with elliptic subject from impersonal sentences in different varieties of
Spanish.
The concepts used in both types of literature (nlp and linguistic) to distinguish
different types of ellipsis and zero signs are extremely broad and are widely debated in
1 Only 3% of the verbs found in our corpus (see Section 3.2.1) have non-referential elliptic subjects.
2 http://www.connexor.eu/technology/machinese/demo/syntax/.
2. Related Work 2.2 Linguistic Approaches
the linguistic literature. Elements of the elliptic typology used in this work which were
derived from the literature are presented next, while the linguistic and formal criteria
used to identify the chosen classes, which served as the basis for the corpus annotation,
including a typology of the examples found, are explained in Sections 3.1.1, 3.1.2, 3.1.3
and 3.2.2.
2.2.1 Linguistic Approaches to Subject Ellipsis
The study of the omission of some element from the sentence or the discourse in natural
language has been a challenge not only in computing but also in Spanish linguistics
itself –from the Renaissance period through to the present day.
The first Western grammarian who treated ellipsis as a grammatical phenomenon
(Hernandez Terres, 1984) was Francisco Sanchez de las Brozas, El Brocense (1523-
1600) (Sanchez de las Brozas, [1562] 1976, p. 317), who took the concept of ellipsis
from Apolonio Díscolo (Díscolo, [2nd century] 1987) and defined it as:
“La elipsis es la falta de una palabra o de varias en una construcción correcta [...].”
“Ellipsis is the omission of one or more items from a correct construction [...].”
This conception, in which grammar serves as a basis for a rational explanation of the
surface form of the language:
“No hay, pues, ninguna duda de que se debe buscar la explicación racional de
las cosas, también de las palabras.” Sanchez de las Brozas ([1562] 1976) cited in
García Jurado (2007, p. 12)
“There is, then, no doubt that a rational explanation of things, and also of words,
must be pursued.”
later inspired the rational grammar of Port-Royal (Lancelot & Arnauld, [1660] 1980)
which was a precursor of Chomsky’s work (Chomsky, [1968] 2006, p. 5):
“One, particularly crucial in the present context, is the very great interest in the
potentialities and capacities of automata, a problem that intrigued the seventeenth-
century mind as fully as it does our own. [...] A similar realisation lies at the base
of Cartesian philosophy.”
In order to elide something, a meaning which is not expressed needs to be assumed. It
thus follows that ellipsis itself was one of the basic mechanisms to explain the transition
from D-structure to S-structure, becoming a central issue in generative grammar
(Brucart, 1987) from its original model, the Standard Theory (Chomsky, 1965), to its
latest revisions (Chomsky, 1995).
Different branches of linguistics have considered ellipsis from different points of
view:
– Semantic: traditionally, the criteria used to define ellipsis were semantic or logical
(Bello, [1847] 1981) and prescriptive (Real Academia Espanola, 2001);
– Descriptive and explicative (Brucart, 1999);
– Distributional: although structuralism rejected the study of units which were not
codified in the signifier or phonetic realization, some classifications of ellipsis were
presented (Francis, 1958; Fries, 1940);
– Pragmatic: in diverse pragmatic paradigms the role of ellipsis is crucial, as it
influences the interpretation of text. As a result, it has given rise to several lines
of investigation, such as implications conveyed through ellipsis (Grice, 1975), ellipsis
studied as a factor that activates textual coherence (Halliday & Hasan, 1976), and
indefinite ellipsis, in which a word can stand for one or more sentences in a restrictive
code (Shopen, 1973); and
– Cognitive: in terms of ellipsis processing by the brain (Streb et al., 2004, p. 175):
“Ellipses and pronouns/proper names are processed by distinct mechanisms
being implemented in distinct cortical cell assemblies.”
or as part of the explanation of the language faculty (Chomsky, 1965).
The terminology and linguistic explanations relevant to this work consider both zero
pronouns and non-referential expressions to be different types of ellipsis (Brucart, 1999).
Four kinds of Spanish subject ellipsis are distinguished (Brucart, 1999, p. 2851).
This classification is presented in correlation with a verb classification (Real Academia
Espanola, 2009), which is related to the omitted subject classification presented in
Bosque (1989).
The classification of Spanish omitted subjects presented in Bosque (1989) is: omitted
subjects of finite verbs, which can be referential or non-referential, and omitted
subjects of non-finite verbs, which can be argumental or non-argumental. Argumental
omitted subjects can in turn be referential or non-referential. In that study,
non-argumental omitted subjects are claimed not to exist (Bosque, 1989), although in
Brucart (1999) non-argumental omitted subjects are considered a type of ellipsis (Type
4 in Figure 2.1).
Types of subject ellipsis (Brucart, 1999):

(1) Omitted subject in a clause containing a finite verb:
Ø No vendrán.
(They) won't come.
Ø Dicen que vendrá.
(They) say he will come. / (It is) said he will come.

(2) Argumental impersonal subject:
En este estudio Ø se trabaja bien.
In this room (one) can work properly.

(3) Non-argumental impersonal subject:
Ø Nieva.
(It) is snowing.

(4) Omitted subject in a non-finite verb clause:
Juan intentaba (Ø decírselo a María).
John tried ((John) to tell Mary).

Types of verbs depending on their subject (Real Academia Española, 2009):
– Verb with no argumental subject;
– Verb with an argumental omitted subject which is represented by the pronoun se;
– Verb with an argumental omitted subject with an unspecific interpretation;
– Verb with an argumental omitted subject with a specific interpretation.

Figure 2.1: Types of subject ellipsis (Brucart, 1999) and types of verbs (Real Academia
Española, 2009).
The first type of ellipsis (see (1) in Figure 2.1) represents omitted subjects and
corresponds to zero pronouns in the nlp literature. An omitted subject is the result
of nominal ellipsis where a non-phonetically/orthographically realized lexical element –
omitted subject– which is needed for the interpretation of the meaning and the structure
of the sentence, is omitted since it can be retrieved from its context (Brucart, 1999).
Despite their lack of phonetic realization, omitted subjects are part of the clause (Real
Academia Espanola, 2009).
Two types of syntactic ellipsis or lexical-syntactic ellipsis can be distinguished:
verbal ellipsis and nominal ellipsis. These types of subject ellipsis can affect the whole
argument of the verb or be partial and just affect the head of the argument (Brucart,
1999). As detailed in Section 3.2.2, the annotation of our corpus includes both complete
noun phrase ellipsis and noun phrase head ellipsis. Note that nominal ellipsis affects
not only the subjects but also the other arguments of the verb –datives, direct
objects or infinitive objects– although their ellipsis is subject to more restrictive
conditions (Brucart, 1999). However, this fact is not acknowledged in some prior
approaches in nlp (Ferrandez & Peral, 2000, p. 166):
“While in other languages, zero-pronouns may appear in either the subject’s or
the object’s grammatical position, (e.g. Japanese), in Spanish texts, zero-pronouns
only appear in the position of the subject.”
The interpretation of Type 1 ellipsis can be definite and specific (Brucart, 1999)
or indefinite (Real Academia Espanola, 2009). Since omitted subjects are referential,
they can be lexically retrieved (Gomez Torrego, 1992). An example of an omitted
subject is:
(d) Las leyes no tendrán efecto retroactivo si Ø no dispusieren lo contrario.
The laws will not have retroactive effect unless (they) provide otherwise.
The nature of the omitted subject [Ø] itself has been discussed in the linguistic literature
(Real Academia Espanola, 2009). While recent approaches in linguistics agree that the
omitted subject has a pronominal nature (elided pronoun), others contend that the
subject is expressed in the morphology of the verb inflection.
In Generative Grammar subject ellipsis has been understood as a (1) pro-form
(Beavers & Sag, 2004; Chung et al., 1995; Fiengo & May, 1994; Wilder, 1997) or as (2)
a syntactic realization without a phonetic constituent (Merchant, 2001; Ross, 1967).
The Meaning-Text Theory (mtt) contends that ellipsis occurs in the SSyntS
(surface-syntactic structure) when the elliptic element is deleted during the transition
from SSyntS to DMorphS (deep-morphological structure) (or vice versa), and an empty
node stands in for the representation of the elliptic element. This procedure for
treating ellipses is also proposed in mtt for the description of all coordinate structures
(Mel'cuk, 2003).
The identification of omitted subjects is not problematic when the zero pronoun
belongs to the first or second person, but when it is a third person omitted subject, the
reference can be anaphoric or cataphoric (Type 1 ellipsis in Figure 2.1) or non-specific1.
A generic or non-specific interpretation can follow in some clauses with singular sec-
ond person and plural third person zero pronouns (Real Academia Espanola, 2009).
However, depending on discourse knowledge, the interpretation can alternate between
specific and non-specific in clauses which are formally identical, as the next example
shows:
(e) Ø Me han regalado un reloj. (In this example both interpretations, specific and non-
specific, are possible.)
(1) (They) gave me a watch. (When the agent referred to by “they” has been mentioned
previously in the discourse.)
(2) (I) was given a watch. (When no agent has been mentioned previously in the dis-
course.)
where the non-specific interpretation does not exclude a possible specific one (Real
Academia Espanola, 2009). Therefore, both groups of argumental subjects with specific
and non-specific interpretations are included in the same class.
2.2.2 Linguistic Approaches to Non-referential Ellipsis
On the other hand, Type 2 and Type 3 ellipsis listed in Figure 2.1 correspond to
non-referential expressions or impersonal sentences. Type 2 ellipsis is composed of
impersonal sentences containing the Spanish particle se, whose argumental omitted
subject always has an unspecific interpretation and is referred to using the pronoun se
(Mendikoetxea, 1994). Type 3 ellipsis corresponds to the set of sentences called
impersonal sentences. Although the types of impersonal constructions in Spanish are
heterogeneous, all of them share a lack of some properties of the subject (Fernandez
Soriano & Taboas Baylín, 1999). Some studies consider different kinds of Spanish
impersonality, e.g. semantic and syntactic impersonality (Gomez Torrego, 1992), while others
distinguish several semantic degrees of impersonality (Mendikoetxea, 1999).
1In journalistic headlines with an omitted subject, a non-specific interpretation can occur (Bosque,
1989) even in non-pro-drop languages such as English, French or German (Real Academia Espanola,
2009). Such non-specific interpretations can occur when the antecedent or referent was not previously
mentioned in the discourse.
Traditionally –from a semantic point of view– impersonal sentences have been con-
sidered to be those which cannot contain a subject, the agent of the action described
(Real Academia Espanola, 1977). This impersonality can be due either to the nature
of the verb,
(f) Llueve.
(It) rains.
or due to the speaker’s ignorance of the subject (Seco, 1988):
(g) Llaman a la puerta.
(Someone) is knocking at the door.
where the subject is unidentified and it is therefore impossible to assign a reference to
it (Bello, [1847] 1981).
The controversy of treating non-referential expressions as a type of ellipsis, given
that they cannot be lexically retrieved, has already been discussed (Gomez Torrego,
1992). While Brucart (1999) considers them a case of ellipsis, as do some Generative
Grammar approaches1, others (Bosque, 1989; Mel’cuk, 2006)2 consider that such elliptic
and non-referential subjects do not exist in language.
A descriptive point of view (Fernandez Soriano & Taboas Baylín, 1999) would
regard impersonal sentences as belonging to either of two main groups: (1) impersonal
sentences without a subject and (2) cases of impersonal verbs with the inherent feature
of not having a subject.
In the current dissertation, a prescriptive and descriptive approach (Real Academia
Espanola, 2009) to the consideration of impersonal sentences is taken (see Section
3.1.3).
Type 4 ellipsis (Brucart, 1999) in Figure 2.1 is ignored in our work. However, this
fourth type is much debated in the literature; for example, Head-Driven Phrase
Structure Grammar (Pollard & Sag, 1994) does not consider the infinitive subject to
be a null category (slash).
1 Generative Grammar explains these impersonal sentences by labeling the absence of the subject
with a pro-form which presents the same syntactic features as the subject although it has no
phonological realization. Following the Extended Projection Principle, this pro-form embodies all the
syntactic requirements of a subject except for its phonological realization (Chomsky, 1981).
2 mtt uses the concept of the zero sign to characterize elements whose signifier is empty and is by
no means realized as a perceptible phonetic pause (Mel'cuk, 2006).
Chapter 3
Detecting Ellipsis in Spanish
This chapter describes the methodology used in this study. The first step is to create
a linguistically motivated classification system (Section 3.1) for all instances of elliptic
and non-elliptic as well as referential and non-referential subjects. Since the machine
learning method requires training data, a corpus (the eszic Corpus) was compiled
(see Section 3.2.1) and a purpose-built tool for its annotation was developed, as were
guidelines (see Section 3.2.2). The third task consisted of implementing a method to
extract the features (Section 3.2.3) of instances from the corpus and create training
data (eszic training data; see Section 3.2.4). Finally, once the features of instances
are derived from a document they are exploited for classification by machine learning
using the Weka package (Section 3.2.5).
3.1 Classification
The first step is to create a classification system for all instances of subjects and
impersonal constructions. The subjects were divided according to two distinctions:
elliptic versus non-elliptic, and referential versus non-referential. These two distinctions
result in a ternary classification:
(1) Explicit subjects: non-elliptic and referential1;
(2) Zero pronouns: elliptic and referential2; and
1 Explicit subjects in the examples are presented in italics.
2 Zero pronouns in the examples are represented by the symbol Ø. In the English translations, the
subjects which are elided in Spanish are marked with parentheses.
3. Detecting Ellipsis in Spanish 3.1 Classification
(3) Impersonal constructions: elliptic and non-referential1.
A subject can be non-elliptic (explicit) or elliptic (omitted subject or zero pronoun).
A sign can be referential or non-referential. The distinction lies in the fact that, while
the former can be lexically retrieved, the latter cannot (impersonal construction).
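The way the two binary distinctions combine into the three classes can be made explicit with a minimal sketch (the function and the class labels are illustrative, not the system's internal representation):

```python
# Ternary classification from the two binary properties of a subject:
# elliptic (vs. explicit) and referential (vs. non-referential).
def classify_subject(elliptic: bool, referential: bool) -> str:
    """Map the two binary distinctions onto the three classes."""
    if not elliptic and referential:
        return "explicit subject"         # class 1: non-elliptic, referential
    if elliptic and referential:
        return "zero pronoun"             # class 2: elliptic, referential
    if elliptic and not referential:
        return "impersonal construction"  # class 3: elliptic, non-referential
    # The fourth combination does not occur in this scheme: Spanish realises
    # non-referential subjects only through ellipsis, never explicitly.
    raise ValueError("non-elliptic and non-referential: not a valid combination")
```

The missing fourth combination reflects the observation made in Chapter 2: unlike English expletive it, Spanish has no explicit non-referential subject pronoun.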
This treatment of the classification as ternary differs from previous work whose
division of subjects was binary: elliptic (zero pronoun) and non-elliptic, both referential
(Ferrandez & Peral, 2000; Rello & Illisei, 2009b) (see Section 2.1.1). In Evans (2001),
the sevenfold classification of pleonastic it is based on the type of referent, while in
Boyd et al. (2005), classification follows syntactic and semantic criteria (see Section
2.1.2).
In the following sections, each class is described. With regard to cases in which
classification can be controversial, different annotation criteria were applied (see Section
3.2.2).
3.1.1 Explicit Subjects: Non-elliptic and Referential
This class is the one to which explicit subjects belong. They are phonetically realised,
usually by a nominal group: a noun, a pronoun, a noun phrase (a), a free relative, a
semi-free relative or a substantival adjective (Real Academia Espanola, 2009).
(a) Las fuentes del ordenamiento jurídico español son la ley, la costumbre y los principios
generales del derecho.
The sources of the Spanish legal system are the law, the judicial custom and the general
principles of law2.
The syntactic positions of subjects can be pre-verbal or post-verbal. The occur-
rence of post-verbal subjects is restricted by some conditions (Real Academia Espanola,
2009).
(b) Carecerán de validez las disposiciones que contradigan otra de rango superior.
The dispositions which contradict the higher range ones will not be valid.
1 Impersonal constructions in the examples are not explicitly indicated using a symbol.
2 Unless otherwise specified, all the examples provided are taken from our corpus (Section 3.2.1).
Post-verbal subjects, as well as preverbal ones, are also found in passive construc-
tions and passive reflex constructions. As in active clauses, preverbal subjects without
a definite article are rare while post-verbal subjects without a definite article are more
frequent (Real Academia Espanola, 2009).
Projections of non-nominal categories such as clauses containing an infinitive or
a conjugated verb, interrogative indirect clauses, or indirect exclamative clauses, can
function as subjects (Real Academia Espanola, 2009).
(c) Corresponde a los poderes públicos promover las condiciones para que la libertad y la
igualdad del individuo y de los grupos en que se integra sean reales y efectivas.
It corresponds to the public authorities to promote the conditions so that individual and
group liberty and equality are real and effective.
3.1.2 Zero Pronouns: Elliptic and Referential
Class 2 is formed by elliptic but referential subjects called zero pronouns. An elliptic
subject is the result of a nominal ellipsis, where a non-phonetically realised lexical
element –elliptic subject– which is needed for the interpretation of the meaning and
the structure of the sentence, is omitted since it can be retrieved from its context (Brucart,
1999). Despite their lack of phonetic realisation, elliptic subjects are considered part
of the clause (Real Academia Espanola, 2009).
(d) La Constitución Españolai (title in text)
Øi Fue refrendada por el pueblo español el 6 de diciembre de 1978.
The Spanish Constitutioni (title in text)
(It)i was countersigned by the Spanish population on the 6th of December of 1978.
Elliptic subjects are considered to be a personal pronoun variant which is not pho-
netically realised (Real Academia Espanola, 2009). Where referential, they can be
lexically retrieved (Gomez Torrego, 1992). That is to say that they can be substituted
by explicit pronouns without changing or losing any of the meaning of the clauses in
which they occur.
The elision of the subject can affect not only the noun head, but also the entire
noun phrase (Brucart, 1999). The noun head can be omitted in Spanish when the
subject of which it is a part fulfills some structural requirements (Brucart, 1999). This
includes cases in which the subject is referential (Brucart, 1999). The processing of
these subjects has been addressed by the development of specific algorithms in previous
work (Ferrandez et al., 1997).
Ellipsis of the head of the noun phrase is only possible when a definite article occurs.
(e) El Ø que está obsesionado con que todo el mundo piensa mal es Javier.
The (one) who is obsessed with everyone thinking wrong is Javier.
The article possesses a referential value which could be either anaphoric or cataphoric
(Real Academia Espanola, 2009). Such examples of subjects with an elided head are
instances of semi-free relatives (Real Academia Espanola, 2009) and, as expected, they
are not as frequent in our corpus as elisions of the entire subject noun phrase.
3.1.3 Impersonal Constructions: Elliptic and Non-referential
In impersonal constructions, a subject that is both non-referential and elliptic is
claimed not to exist (Bosque, 1989)1.
The appearance of clauses containing zero pronouns and impersonal constructions
is similar. Class 3 is composed of impersonal constructions, which are formed by (1)
non-reflex impersonal clauses and (2) reflex impersonal clauses (impersonal clauses with se).
Impersonal clauses have no argumental subject. Since the subject does not exist,
it cannot be lexically retrieved by any means and no phonetic realisation of it can be
expected (Bosque, 1989). The following cases are considered to be impersonal sentences
(Real Academia Espanola, 2009):
– Non-reflex impersonal clauses denoting natural phenomena describing meteoro-
logical situations:
(f) Nieva.
(It) snows.
– Non-reflex impersonal clauses with verbs haber (to be), hacer (to do), ser (to
be), estar (to be)2, ir (to go) and dar (to give):
1 The existence of a non-phonetically realised element in subject position is postulated (see Section
2.2). While Generative Grammar defends their existence (pro-form), mtt does not (zero sign).
2 Depending on the verbal aspect, there are different Spanish verbs which correspond with the
English verb to be.
3. Detecting Ellipsis in Spanish 3.2 Machine Learning Approach
(g) En un kilogramo de gas hay tanta materia como en un kilogramo de sólido.
In a kilogram of gas (there) is the same amount of mass as in a kilogram of solid.
(Existential use of the verb haber).
– Non-reflex impersonal clauses with other verbs such as sobrar con (to be too
much), bastar con (to be enough) or faltar con (to lack), or the pronominal
unipersonal verb1 with a zero subject, such as tratarse de (to be about):
(h) Deberán adoptar las precauciones necesarias para su seguridad, especialmente
cuando se trate de niños.
Necessary measures should be taken, especially when (it) is a matter of children.
(i) Basta con tres sesiones.
(It) is enough with three sessions.
Verbs in such impersonal sentences (Gomez Torrego, 1992) are called lexical impersonal
verbs (Real Academia Espanola, 2009). Due to their lack of a subject, they are not easily
distinguished from verbs with omitted –but existing– subjects.
Secondly, reflex impersonal clauses have an omitted subject whose reference is non-
specific and cannot be lexically retrieved.
(j) Se estará a lo que establece el apartado siguiente.
(It) will be what is established in the next section.
These clauses are formed with the particle se. This particle also serves other syntactic
functions (reflexive pronoun, pronominal pronoun, reciprocal pronoun, etc.) in clauses
with an elided subject.
3.2 Machine Learning Approach
Our corpus was compiled and parsed in order to create training data (referred to as the
eszic training data) for use by a machine learning classification method as explained
in the next section.
A tool was developed for annotation of the corpus (see Section 3.2.2). Fourteen
features were proposed for the purpose of classifying instances of subjects (see Section
1A verb which is only conjugated in the third person.
3.2.3). The feature vectors, together with their manual classifications, were written to
a training file. A method for obtaining the values of those features for each instance
was implemented. The classification algorithm employed was the K* instance-based
learner available in the Weka package (Witten & Frank, 2005) (see Section 3.2.5).
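K* is an instance-based learner: a new instance is classified according to the classes of the training instances most similar to it. The following pure-Python sketch illustrates the instance-based idea only; it substitutes a simple feature-overlap similarity for K*'s entropy-based distance, and all feature names, values and training examples are invented for illustration:

```python
# Sketch of instance-based classification over symbolic feature vectors,
# in the spirit of the k-NN family (the actual system uses Weka's K*,
# whose distance measure is entropy-based; here plain overlap is used).
from collections import Counter

def similarity(a, b):
    """Number of feature positions on which two instances agree."""
    return sum(1 for x, y in zip(a, b) if x == y)

def knn_classify(instance, training_data, k=3):
    """Vote among the k training instances most similar to `instance`."""
    neighbours = sorted(training_data,
                        key=lambda ex: similarity(instance, ex[0]),
                        reverse=True)[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy instances: (feature vector, class). The four features are invented:
# (verb lemma, person/number, POS of the preceding token, clause has "se").
training = [
    (("decir",   "P3SG", "PREP",  "yes"), "impersonal construction"),
    (("llover",  "P3SG", "PUNCT", "no"),  "impersonal construction"),
    (("venir",   "P3PL", "PUNCT", "no"),  "zero pronoun"),
    (("ser",     "P3PL", "N",     "no"),  "explicit subject"),
    (("carecer", "P3PL", "N",     "no"),  "explicit subject"),
]
# e.g. knn_classify(("nevar", "P3SG", "PUNCT", "no"), training)
#      -> "impersonal construction"
```

The design choice behind instance-based learning is relevant here: no explicit rules about verbs or contexts need to be written, which is precisely the limitation of the rule-based approaches discussed in Section 2.1.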
3.2.1 Building the Training Data
The eszic training data used by the Elliphant system is obtained from the eszic corpus
created ad hoc. The corpus is named after its annotated content “Explicit Subjects,
Zero-pronouns and Impersonal Constructions”.
The corpus contains a total of 79,615 words (titles and sentences that do not contain
at least one finite verb are ignored), including 6,825 finite verbs. Of these verbs, 71%
have an explicit subject, 26% have a zero pronoun and 3% belong to an impersonal
construction. There is an average of 2.3 clauses per sentence with 11.7 words per clause
and 26.9 words per sentence.
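Assuming these percentages derive from the per-class counts reported in Table 3.2, they can be reproduced as follows:

```python
# Per-class instance counts from Table 3.2 and the resulting class
# distribution; note the strong imbalance towards explicit subjects.
counts = {
    "explicit subjects": 4855,
    "zero pronouns": 1793,
    "impersonal constructions": 179,
}
total = sum(counts.values())  # 6,827 instances
shares = {cls: round(100 * n / total) for cls, n in counts.items()}
# shares == {'explicit subjects': 71, 'zero pronouns': 26,
#            'impersonal constructions': 3}
```

The 3% figure for impersonal constructions is what motivated enlarging the corpus until each class had a sufficient number of instances.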
The corpus compiled to extract the training data is composed of seventeen docu-
ments, originally written in Spanish, and belonging to two genres: legal and health.
The legal texts1 are composed of laws taken from: (1) the Spanish Constitution
(whole text) (Constitucion Espanola, 1978), (2) the Law on Unfair Competition (whole
text) (Ley 3/1991, 1991), (3) Penal Code (first book) (Ley Organica 10/1995, 1995), (4)
Law for Administrative-contentious Jurisdiction (title 1, articles 1 to 17) (Ley 29/1998,
1998), (5) Civil Code (first book, until title V) (Codigo Civil, 1889), (6) Law for Univer-
sities (introduction) (Ley Organica 6/2001, 2001), (7) Law for Associations (chapter
1) (Ley Organica 1/2002, 2002) and (8) Law for Advertisements (whole text) (Ley
29/2005, 2005).
The nine health texts are taken from psychiatric papers compiled from a Spanish
digital journal of psychiatry, Psiquiatría.com2: (1) Cinema as a tool for teaching
personality disorders (Lopez Ortega, 2009), (2) Efficacy, functionality, and empow-
erment for phobic pathology treatment, in the context of specialised public Mental
Health Services (García Losa, 2008), (3) Emotions in Psychiatry (Sevillano Arroyo
& Ducret Rossier, 2008), (4) And what about siblings? How to help TLP3 siblings
1 All the legal texts are available online at: http://noticias.juridicas.com/base_datos/
2 The full-text articles from the Psiquiatría.com Journal are available online at:
http://www.psiquiatria.com/.
3 Trastorno límite de la personalidad (Borderline Personality Disorder).
eszic Corpus    Number of Tokens    Number of Sentences    Number of Clauses
Legal text 1 9,972 941 600
Legal text 2 1,147 47 56
Legal text 3 17,960 1,035 1,181
Legal text 4 3,578 189 191
Legal text 5 12,456 746 891
Legal text 6 3,962 130 219
Legal text 7 2,159 131 136
Legal text 8 5,219 291 282
Health text 1 2,753 110 270
Health text 2 11,339 658 1,028
Health text 3 1,854 47 140
Health text 4 1,937 84 124
Health text 5 2,183 93 148
Health text 6 1,568 63 210
Health text 7 1,296 69 89
Health text 8 1,687 53 127
Health text 9 12,441 525 1,394
Total 93,511 5,212 7,086
Table 3.1: eszic Corpus: tokens, sentences and clauses.
(Molina Lopez, 2008), (5) Factorial analysis of personal attitudes in secondary educa-
tion (Pintor García, 2007), (6) The influence of the concept of self and social competence
in children’s depression (Aldea Munoz, 2006), (7) Depression as a mental health prob-
lem in Mexican teenagers (Balcazar Nava et al., 2005), (8) Relationship difficulties in
couples (Dıaz Morfa, 2004), and (9) A case of psychological intervention for children’s
depression (Aldea Munoz, 2003).
Table 3.2 presents the number of instances found in the eszic corpus by class.
Two columns illustrate the number of instances by genre (legal and health) within the
corpus.
Number of instances per class    Legal eszic Corpus    Health eszic Corpus    eszic Corpus
Explicit subjects 2,739 2,116 4,855
Zero pronouns 619 1,174 1,793
Impersonal constructions 71 108 179
Total 3,429 3,398 6,827
Table 3.2: eszic Corpus: number of instances per class.
The text containing instances to be classified was analysed using Connexor's Machi-
nese Syntax (Järvinen & Tapanainen, 1998; Järvinen et al., 2004; Tapanainen & Järvi-
nen, 1997)1. This dependency parser returns information on the pos, morphology and
lemma of words in a text, as well as the dependency relations between those
words. The parsing system employed uses Functional Dependency Grammar (FDG)
(Järvinen & Tapanainen, 1998; Tapanainen & Järvinen, 1997) and combines (Järvinen
et al., 2004) a lexicon and a morphological disambiguator based on constraint grammar
(Tapanainen, 1996). When performing fully automatic parsing it is necessary to ad-
dress word-order phenomena. The formalism used in the parser is capable of referring
simultaneously both to the order in which syntactic dependencies apply and to linear
order. This feature is an extension of Tesnière's theory (Tesnière, 1959), which does
not formalise linearisation. In the parsed output the linear order is preserved while the
structural order requires that functional information is not coded in the canonical order
of the dependents. The functional information is represented explicitly using arcs with
labels of syntactic functions, as shown in Figure 3.1 (Järvinen et al., 2004).
1 A demo of Connexor's Machinese Syntax is available at: http://www.connexor.eu/technology/machinese/.
Figure 3.1: An example of the output of Connexor's Machinese Syntax parser for
Spanish.
The dependency information allows the identification of complex constituents in a
text. For example, complex noun phrases can be identified by transitively grouping
together all the words dependent on a noun head (Evans, 2001). Additional software
was implemented to perform this and allow identification of clauses and noun phrases
which are required for implementation of some of the features used in our classification
(see Section 3.2.4).
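The transitive grouping described above can be sketched as follows; the (id, word, head) token representation and the toy sentence are illustrative assumptions, not the thesis software itself.

```python
# A minimal sketch of grouping a noun head with all of its direct and
# transitive dependents, given a parse represented as (id, word, head_id)
# triples. Token ids and words are illustrative, not from the eszic Corpus.
def phrase_of(head_id, tokens):
    """Return head_id plus the ids of all tokens transitively dependent on it."""
    children = {}
    for tok_id, _word, head in tokens:
        children.setdefault(head, []).append(tok_id)
    group, stack = set(), [head_id]
    while stack:
        current = stack.pop()
        group.add(current)
        stack.extend(children.get(current, []))
    return sorted(group)

# "la depresion ocupa ..." -> the noun phrase headed by "depresion"
tokens = [(1, "la", 2), (2, "depresion", 3), (3, "ocupa", 0)]
print(phrase_of(2, tokens))  # [1, 2]
```

The same traversal, started from a finite verb instead of a noun head, yields the clause-like units mentioned above.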
The eszic training data makes use of the three types of information returned by
Connexor’s Machinese Syntax parser (Connexor Oy, 2006a,b):
1. morphological tags generated for verbs –singular (SG), third person (3P), indica-
tive (IND), among many others– including the pos tags –verb (V), noun (N),
preposition (PREP), etc.–;
2. syntactic tags –main element (@MAIN), nominal head (@NH), auxiliary verb (@AUX),
etc.–; and
3. syntactic relations –subject (subj), verb chained (v-ch), determiner (det)–. The
lexical information (LEMMA) given by the parser was also taken into consideration
in the set of features.
3.2.2 Annotation Software and Annotation Guidelines
A program was written in Python (see Figure 3.2) to extract all occurrences of finite
verbs from the eszic Corpus and to assign to each the vector of feature values described
in Section 3.1. Two annotators were presented with the clause in which each verb
appears and prompted to classify the verb into one of thirteen classes.
Figure 3.2: Screenshot of the annotation program interface.
Although the goal is to develop training data for a classifier making a ternary
classification of the subject position elements, an annotation scheme which gives more
detail about each instance was used. This annotation scheme served a dual
purpose: to get the most from the annotation task, since the instances occur in a
wide range of constructions, and to produce a more detailed annotation that could be
useful in future work. The thirteen classes are grouped into three types: (1) explicit
subjects, (2) zero pronouns or (3) impersonal constructions. In Table 3.3, the linguistic
motivation for each of the annotated classes is shown in correlation with the types to
which they belong. For each annotation class, in addition to the two criteria that
are crucial for this study –elliptic vs. non-elliptic and referential vs. non-referential– a
combination of syntactic, semantic and discourse knowledge can also be encoded during
the annotation. This knowledge includes information about whether the subject is
nominal or non-nominal, whether it is an active or a passive subject or whether the
subject refers to an active participant in the action, state or process denoted by the
verb.
The annotation program extracts from the parsed eszic Corpus the clause in which
each finite verb occurs. As Connexor’s Machinese Syntax parser does not explicitly
perform clause splitting but only sentence splitting, a method was developed to ac-
complish the clause identification task. The method identifies the finite verbs in the
corpus and transitively groups together the words directly and indirectly dependent
upon them1. The identified clauses are then presented to the annotators who are asked
to label the verb.
For each verb classified by an annotator, an xml tag (i.e. <subject>ZERO</subject>)
with its class is added in the token line of the parsed eszic Corpus where the verb oc-
curs. An example (k) of an annotated verb whose subject is a zero pronoun follows:
(k) <token id="w53"><text>entró </text><lemma>entrar </lemma>
<depend head="w51">mod </depend><tags><syntax>@MAIN
</syntax><morpho>V IND PRET SG P3 </morpho><subject>ZERO
</subject> </tags></token>
This manual classification, together with the features (see Section 3.2.3) are written to
the eszic training file.
1 A clause splitter module was implemented to extract the features from the eszic Corpus (see
Section 3.2.4).
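A minimal sketch of how such a `<subject>` tag might be inserted into a parsed token, using example (k)'s token; the `annotate` helper is hypothetical and the thesis program may differ in detail.

```python
# A hedged sketch: record an annotator's decision as a <subject> element
# inside the token's <tags> element. The token string mirrors example (k).
import xml.etree.ElementTree as ET

token_xml = ('<token id="w53"><text>entró</text><lemma>entrar</lemma>'
             '<depend head="w51">mod</depend><tags><syntax>@MAIN</syntax>'
             '<morpho>V IND PRET SG P3</morpho></tags></token>')

def annotate(token_xml, label):
    """Return the token XML with a <subject> tag holding the chosen class."""
    token = ET.fromstring(token_xml)
    subject = ET.SubElement(token.find("tags"), "subject")
    subject.text = label
    return ET.tostring(token, encoding="unicode")

annotated = annotate(token_xml, "ZERO")
print("<subject>ZERO</subject>" in annotated)  # True
```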
eszic Corpus Annotation Tags

Linguistic information: Phonetic Realization (Elliptic noun phrase, Elliptic noun
phrase head), Syntactic category (Nominal subject), Verbal Diathesis (Active),
Semantic interpretation (Active participant), Discourse (Referential subject).

Elliphant Classes   Linguistic characteristics             ENP  ENPH  Nom  Act  Part  Ref
Class 1             Explicit subject                        –    –    +    +    +    +
(Explicit           Reflex passive subject                  –    –    +    +    –    +
subject)            Passive subject                         –    –    +    –    –    +
Class 2             Omitted subject                         +    –    +    +    +    +
(Zero               Omitted subject head                    –    +    +    +    +    +
pronoun)            Non-nominal subject                     –    –    –    +    +    +
                    Reflex passive omitted subject          +    –    +    +    –    +
                    Reflex passive omitted subject head     –    +    +    +    –    +
                    Reflex passive non-nominal subject      –    –    –    +    –    +
                    Passive omitted subject                 +    –    +    –    –    +
                    Passive non-nominal subject             –    –    –    –    –    +
Class 3             Reflex impersonal clause (with se)      –    –    n/a  –    n/a  –
(Impersonal         Impersonal construction (without se)    –    –    n/a  +    n/a  –
construction)

ENP = Elliptic noun phrase; ENPH = Elliptic noun phrase head; Nom = Nominal
subject; Act = Active; Part = Active participant; Ref = Referential subject.

Table 3.3: eszic Corpus annotation tags.
Annotating explicit and elliptic subjects as well as impersonal constructions in Span-
ish is not a trivial task. Guidelines were established for the annotation of borderline
instances whose classification is a frequent source of disagreement between annotators.
The following text presents some of these borderline cases that belong to the three
types of finite verb classes, together with the criteria adopted for their annotation.
When distinguishing explicit subjects, in addition to nouns, there are other syntactic
categories which may arguably function as heads of subjects. In the case of adverbial
and prepositional categories, it was decided that they should be considered subjects if
they can be focalised (Real Academia Española, 2009).
(l) De acuerdo con la Organización Mundial de la Salud, la depresión ocupa el cuarto lugar
entre las enfermedades más incapacitantes y aproximadamente de 100 a 200 millones de
personas la padecen.
According to the World Health Organization, depression is ranked fourth among
the most disabling illnesses, and approximately 100 to 200 million people
suffer from it.
While conditional clauses could be considered subjects, in this work an alternative
analysis is followed. Under this approach, a sentence with a conditional clause func-
tioning as subject is considered to contain a zero pronoun, as its elliptic subject can
be retrieved from the preceding discourse (Real Academia Española, 2009). Neverthe-
less, no examples were found of conditional clauses functioning as subjects in the eszic
corpus used in this dissertation.
The correct classification of zero pronouns is also a source of disagreement between
annotators as it may be argued that some instances with postponed non-nominal sub-
jects (see example (m) below) should be interpreted as cataphoric zero pronouns.
In contrast to anaphora, in cataphora the cataphoric expression is situated before
the nominal group to which it points (Real Academia Española, 2009). Tanaka (2000)
and Mitkov (2002) point out that there is some scepticism about the concept of cat-
aphora in the NLP literature. For example, Kuno (1972) asserts that there is no genuine
cataphora in its literal sense, as the referent of a seemingly cataphoric pronoun must
already be mentioned in the preceding discourse and, therefore, is predictable when
a reader encounters the pronoun. This viewpoint was refuted by Carden (1982) and
Tanaka (2000) who describe empirical data which shows cases of genuine cataphora
where the pronoun is the first mention of its referent in the discourse (Carden, 1982;
Tanaka, 2000). Although some examples of genuine cataphora were found in their cor-
pus (Tanaka, 2000), none were found in the eszic Corpus except for occurrences of the
elision of noun heads where the antecedent is postponed, as in example (e).
The annotation guidelines developed for the current work considered these cases
which involve postponed clauses as non-nominal subjects.
(m) Artículo 46.
No pueden contraer matrimonio:
Los menores de edad no emancipados.
Los que estén ligados con vínculo matrimonial.
Article 46.
(They) cannot get married:
The non-emancipated minors.
The ones who are already married.
Finally, the borderline cases in impersonal constructions are debated in Spanish. The
decision of how to classify reflex impersonal clauses containing se is frequently a diffi-
cult one to make due to the ambiguity of these instances. For example, in the sentence
Se secaron (see example (n) below), the particle se has four possible semantic interpre-
tations in Spanish (Real Academia Española, 2009). In these cases, the decision taken
by the annotator depends on the meaning given by the context.
(n) Se secaron (Particle se = reflexive pronoun)
(They) dried (themselves).
Se secaron (Particle se = reciprocal pronoun)
(They) dried (each other).
Se secaron (Particle se = pronominal pronoun, and there is an elliptic subject which does
not have control over the action, for instance, the trees.)
The trees got dried.
Se secaron (Particle se = reflex passive, in which the referent of the subject, for instance
some people, would have to perform the described action of their own free will over an
object, for instance, the clothes.)
(They) dried (the clothes).
There can be ambiguity between reflex passives containing a zero pronoun and imper-
sonal constructions in which the object is not human (o).
(o) Se firmará el acuerdo.
Ø will sign the agreement.
In such instances, the annotation criterion followed is to annotate them as reflex passive
clauses containing a zero pronoun.
3.2.3 Features
Fourteen features were proposed in order to classify instances according to the types
presented in Section 3.1. The values (see Table 3.4) for the features were derived from
information provided both by Connexor’s Machinese Syntax (Connexor Oy, 2006b)
parser, which processed the eszic Corpus, and a set of lists. An additional program
was implemented in order to extract the values of features for every instance in the
corpus (see Section 3.2.4). These values were used to produce a training vector for each
instance. For a detailed explanation of the feature values see Section 3.2.4.
For the purpose of description, it is convenient to describe each of the features as
broadly belonging to one of ten classes, detailed below.
1 PARSER: the presence or absence of a subject in the clause, as identified by the
parser. It was observed (Rello & Illisei, 2009b) that the analysis returned by Con-
nexor’s Machinese Syntax is particularly inaccurate when identifying coordinated
subjects, subjects containing prepositional modifiers, and appositions occurring
between commas (see example (p) below). Other common cases of parsing error
involve subjects which are distant from the finite verb in the clause. Features 7
and 8 were proposed in an effort to take into consideration potential candidates
for the subject.
(p) La publicidad, por su propia ındole, es una actividad que atraviesa las fronteras.
Advertising, due to its own nature, is an activity which goes beyond boundaries.
2 CLAUSE: the clause types considered are: main clauses, relative clauses, clauses
starting with a complex conjunction, clauses starting with a simple conjunction,
and clauses introduced using punctuation marks (commas, semicolons, etc). A
Feature         Definition                         Value
1  PARSER       Parsed subject                     True, False
2  CLAUSE       Clause type                        Main, Rel, Imp, Prop, Punct
3  LEMMA        Verb lemma                         Parser's lemma tag
4  NUMBER       Verb morphological number          SG, PL
5  PERSON       Verb morphological person          P1, P2, P3
6  AGREE        Agreement in person,               FTFF, TTTT, FFFF, TFTF, TTFF,
                number, tense and mood             FTFT, FTTF, TFTT, FFFT, TTTF,
                                                   FFTF, TFFT, FFTT, FTTT, TFFF,
                                                   TTFT
7  NHPREV       Previous noun phrases              Number of noun phrases
                                                   preceding the verb
8  NHTOT        Total noun phrases                 Number of noun phrases
                                                   in the clause
9  INF          Infinitives                        Number of infinitives
                                                   in the clause
10 SE           Particle se                        se, no
11 A            Preposition a                      True, False
12 POSpre       Four parts of speech               292 different values combining
                preceding the verb                 the parser's pos tags, e.g.
                                                   @HN, @CC, @MAIN, etc.
13 POSpos       Four parts of speech               280 different values combining
                following the verb                 the parser's pos tags, e.g.
                                                   @HN, @CC, @MAIN, etc.
14 VERBtype     Type of verb: copulative,          CIPX, XIXX, XXXT, XXPX, XXXI,
                impersonal, pronominal,            CIXX, XXPT, XIPX, XIPT, XXXX,
                transitive and intransitive        XIXI, CXPI, XXPI, XIPI, XXEX
Table 3.4: Features: definitions and values.
method was implemented to identify these different types of clause as the parser
does not explicitly mark the boundaries of clauses within sentences (see Section
3.2.4).
3 LEMMA: lexical information extracted from the parser: the lemma of the finite
verb.
4-5 NUMBER, PERSON: morphological information features of the verb: its
grammatical number (singular or plural) and its person (first, second, or third
person).
6 AGREE: feature which encodes the tense, mood, person, and number of the
verb in the clause, and its agreement in person, number, tense, and mood with
the preceding verb in the sentence and also with the main verb of the sentence.
When a finite verb appears in a subordinate clause, its tense and mood can assist
in recognition of these features in the verb of the main clause and help to enforce
some restrictions required by this verb, especially when both verbs share the same
referent as subject.
7-9 NHPREV, NHTOT, INF: the candidates for the subject of the clause are
represented by the number of noun phrases in the clause that precede the verb,
the total number of noun phrases in the clause, and the number of infinitive verbs
in the clause.
10 SE: this is a binary feature encoding the presence or absence of the particle se
in close proximity to the verb. When se occurs immediately before or after the
verb or with a maximum of one token (see example (q) below) lying between the
verb and itself, this is considered “close proximity.”
(q) No podrá sacarse una ventaja indebida de la reputación de una marca.
(It) is not allowed to take unfair advantage of a brand's reputation.
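The close-proximity test for se can be sketched as follows; the tokenisation and the `se_feature` helper are illustrative assumptions, not the thesis implementation.

```python
# A hedged sketch of the SE feature: "se" counts as close to the finite
# verb when at most one token separates them. Token lists are illustrative.
def se_feature(tokens, verb_index):
    """Return "se" if the particle occurs within one token of the verb, else "no"."""
    lo = max(0, verb_index - 2)
    hi = min(len(tokens), verb_index + 3)
    window = tokens[lo:verb_index] + tokens[verb_index + 1:hi]
    return "se" if "se" in window else "no"

# "No podrá sacarse ..." tokenised with "se" split off the infinitive
print(se_feature(["no", "podra", "sacar", "se"], 1))  # se
print(se_feature(["la", "publicidad", "es", "una", "actividad"], 2))  # no
```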
11 A: this is a binary feature encoding the presence or absence of the preposition
a in the clause. The distinction between passive reflex clauses with zero
pronouns and impersonal constructions sometimes relies on the appearance of the
preposition a (to, for, etc.). For instance, example (r) is a passive reflex clause
containing a zero pronoun while example (s) is an impersonal construction.
(r) Se admiten los alumnos que reúnan los requisitos.
(They) accept the students who fulfill the requirements.
(s) Se admite a los alumnos que reúnan los requisitos.
(It) is accepted for the students who fulfill the requirements.
12-13 POSpre, POSpos: the pos of eight tokens, that is, the four words preceding and
the four words following the instance1.
14 VERBtype: the verb is classified as copulative (yes/no), as a verb with an im-
personal use (yes/no), as a pronominal verb (yes/no), and as a transitive verb
(yes/no/both).
3.2.4 Purpose Built Tools
As training data is required in order to exploit the methods distributed in the Weka
package (Witten & Frank, 2005), a method was implemented to extract the values of
the previously described features for instances occurring in the eszic Corpus. For each
instance (each annotated finite verb) a new line is written in the training data file
with values for the fourteen features separated by commas, together with the manual
classification of the vector, using the standard CSV (comma-separated values) format.
The values of features 7-9 are numerical while the values of the remaining features are
nominal (i.e. symbolic).
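The training-file serialisation described above might look like this in outline; the feature values shown are illustrative placeholders, not values taken from the eszic training data.

```python
# A minimal sketch of serialising one instance as a comma-separated
# training line: fourteen feature values followed by the manual class.
import csv, io

def write_instance(writer, features, label):
    """Append one training line: the feature values plus the manual class."""
    writer.writerow(features + [label])

buffer = io.StringIO()
writer = csv.writer(buffer)
features = ["False", "Main", "entrar", "SG", "P3", "TTTT",
            "1", "2", "0", "no", "False", "@NH @CC @MAIN @NH",
            "@NH @CC @MAIN @NH", "XXXT"]
write_instance(writer, features, "ZERO")
print(buffer.getvalue().strip())
```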
To extract the features, ad hoc software was implemented in Python. The program
exploits morphological and syntactic information, dependency relations reported by the
parser, and lists of verbs grouped by their syntactic and morphological properties (e.g.
transitivity, pronominal use, etc.).
The method implemented includes the following purpose built tools which are de-
scribed below. The description includes information on the particular features whose
values are computed using the tools.
1 Clause splitter module (CLAUSE): since Connexor’s Machinese Syntax (Con-
nexor Oy, 2006a) does not provide any information about the clause boundaries
within sentences, this clause splitter module is required. Each clause is built by
identifying finite verbs in a sentence and then searching for signals that indicate
the boundaries of the clause (relative pronouns, conjunctions, punctuation marks,
etc.). In theory, each clause could be built using dependency information given by
the parser by grouping together all the words dependent on the finite verb. How-
ever, this strategy was not used in order to avoid parsing errors in the dependency
information reported by the parser. Errors of this type are especially common
1This set of features can be regarded as useful for identifying non-nominal it (Evans, 2001).
when long sentences are parsed using Connexor’s Machinese Syntax. The Clause
splitter module also identifies the type of clause in which the finite verb occurs.
The feature attributes corresponding to the type of clause are:
1.1 Main (Main): when the finite verb belongs to the main clause.
1.2 Relative (Rel): when the finite verb belongs to a relative clause. A list of relative
pronouns was used to identify this type of clause (e.g. que (that), cuyo (whose),
quien (who), etc.).
1.3 Improper conjunction (Imp): when the finite verb belongs to a clause starting with
an improper conjunction. A list of improper conjunctions was used to identify the
value of this attribute (e.g. porque (because), luego (so), aunque (although), etc.).
1.4 Proper conjunction (Prop): when the finite verb belongs to a clause starting with
a proper conjunction. A list of proper conjunctions was used (e.g. y, e (and), o, u
(or), ni (neither), pero (but) and sino (but rather)).
1.5 Punctuation marks (Punct): when the clause in which the finite verb occurs is
preceded by a punctuation mark (‘.’, ‘,’, ‘:’, ‘;’, ‘?’, ‘!’, ‘"’, ‘-’, ‘(’ and ‘)’).
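A hedged sketch of how the clause-type attribute could be assigned from the first token of a clause; the word lists are small illustrative samples of the full inventories described above.

```python
# Illustrative samples of the lists used by the clause splitter module.
RELATIVES = {"que", "cuyo", "quien"}
IMPROPER = {"porque", "luego", "aunque"}
PROPER = {"y", "e", "o", "u", "ni", "pero", "sino"}
PUNCT = {".", ",", ":", ";", "?", "!", "-", "(", ")"}

def clause_type(first_token, is_main_clause=False):
    """Map the clause-opening token to one of the five CLAUSE values."""
    if is_main_clause:
        return "Main"
    word = first_token.lower()
    if word in RELATIVES:
        return "Rel"
    if word in IMPROPER:
        return "Imp"
    if word in PROPER:
        return "Prop"
    if word in PUNCT:
        return "Punct"
    return "Main"

print(clause_type("aunque"))  # Imp
print(clause_type("que"))     # Rel
```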
2 Noun phrase module (NHPREV, NHTOT): in order to obtain the subject
candidates, this module identifies and counts the noun phrases that precede and
follow the finite verb in the clause. As is the case for the clause splitter, this
module exploits dependency information returned by the parser (Connexor Oy,
2006a).
3 Counter (NHPREV, NHTOT, INF): this module is used to determine the
total number, in the clause, of noun phrases (nhprev, nhtot) and infinitival
forms (inf).
4 Tag taker (PARSER, LEMMA, NUMBER, PERSON, A, POSpre,
POSpos): these Python functions process the attributes of the XML tags output
by the parser (eszic Corpus) to generate a set of features for the eszic train-
ing data. A function generates a binary value that indicates whether or not the
finite verb has a dependent subject (parser). A function consults the lemma
of the verb and takes it as the value for feature (lemma). Other functions ex-
ploit morphological information obtained by the parser such as the number of the
finite verb (number), which can be either singular (SG) or plural (PL), or the
morphological person of the finite verb (person), which can be first, second or
third person (P1, P2, P3). Another function identifies whether the preposition a
occurs in the clause (a). This information is used as the values for the features.
Finally, there is another method which obtains the pos of the four words
that precede the instance in the clause (pospre) and the four words that follow
it (pospos).
5 Agreement module (AGREE): this module checks whether the verb used in
the clause agrees (true, T) or disagrees (false, F) in tense and mood, and in person
and number with the main verb that occurs in the sentence1 and the previous
verb occurring within the sentence. This agreement information is combined into
one symbolic feature, such as TTTT (with respect to the verb used in the clause,
the first T denotes agreement in number and person with the main verb of the
sentence, the second T denotes agreement in tense and mood with the main verb
of the sentence, the third T denotes agreement in number and person with the
previous verb in the sentence and the fourth T denotes agreement in tense and
mood with the previous verb in the sentence) or TTFF (when there is agreement
in between the verb in the clause and the main sentence verb but no agreement
with the previous clause verb). There are sixteen possible combinations of true
(T) and false (F) values.
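The four-flag agreement string could be computed along these lines; the dictionary representation of a verb's morphology is an assumption made for the sketch.

```python
# A minimal sketch of the AGREE feature: four T/F flags comparing the
# clause verb with the sentence's main verb and with the previous verb.
def agree_feature(verb, main_verb, prev_verb):
    """Return a 4-character string such as TTTT or TTFF (see the text above)."""
    def person_number(a, b):
        return a["person"] == b["person"] and a["number"] == b["number"]
    def tense_mood(a, b):
        return a["tense"] == b["tense"] and a["mood"] == b["mood"]
    flags = [person_number(verb, main_verb), tense_mood(verb, main_verb),
             person_number(verb, prev_verb), tense_mood(verb, prev_verb)]
    return "".join("T" if f else "F" for f in flags)

v = {"person": "P3", "number": "SG", "tense": "PRES", "mood": "IND"}
main = {"person": "P3", "number": "SG", "tense": "PRES", "mood": "IND"}
prev = {"person": "P3", "number": "PL", "tense": "PRES", "mood": "IND"}
print(agree_feature(v, main, prev))  # TTFT
```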
6 Se identifier (SE): this function identifies whether the particle se occurs in
close proximity to the finite verb. Again, in this context, a distance of at most
one token between the finite verb and se is considered “close proximity.” The
value for this feature can be (yes), when se appears, or (no), when it does not.
7 Verb classifier (VERBtype): this module specifies the value of four features
of the finite verb that occurs in the clause. The features encode information
about whether or not the verb appears in four different lists of verbs (the same
instance can occur in more than one list). These four lists2 contain 11,060 different
verb lemmas which are present in the Royal Spanish Academy Dictionary (Real
1 In this study, it is considered that sentences may contain several verbs whereas clauses contain
only one finite verb.
2 The lists 7.2-7.4 of infinitive verb forms were provided by Molino de Ideas s.a.
Academia Española, 2001). The criterion on which these lists (items 7.2-7.4) were
built was the information contained in the dictionary definitions of the verbs (Real
Academia Española, 2001):
7.1 Copulative verbs (C): a list containing the copulative verbs, i.e. ser (to be), parecer
(to seem like), etc.;
7.2 Impersonal verbs (I): a list containing all the verbs whose use is impersonal. Such
use is specified in their definition, i.e. llover (to rain), nevar (to snow), etc.;
7.3 Pronominal verbs (P): a list which includes all the pronominal verbs (verbs whose
lemma in the dictionary appears with se) and all the potential pronominal verbs
whose definitions specify a potential pronominal use; and
7.4 Transitive and intransitive verbs (T): a list containing transitive verbs and intran-
sitive verbs that meet the criteria detailed previously in item 7.
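Membership in the four verb lists could be encoded along these lines; since the exact letter scheme behind values such as XXEX in Table 3.4 is not spelled out here, this encoding (and the tiny sample lists) is an assumption.

```python
# A hedged sketch of the VERBtype feature: one letter per list, marking
# membership (C, I, P, T) or absence (X). The sample lists are illustrative.
COPULATIVE = {"ser", "parecer", "estar"}
IMPERSONAL = {"llover", "nevar"}
PRONOMINAL = {"secarse", "quejarse"}
TRANSITIVE = {"firmar", "admitir", "ser"}

def verb_type(lemma):
    """Build a 4-letter membership code for the finite verb's lemma."""
    code = "C" if lemma in COPULATIVE else "X"
    code += "I" if lemma in IMPERSONAL else "X"
    code += "P" if lemma in PRONOMINAL or lemma + "se" in PRONOMINAL else "X"
    code += "T" if lemma in TRANSITIVE else "X"
    return code

print(verb_type("llover"))  # XIXX
print(verb_type("firmar"))  # XXXT
```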
3.2.5 The WEKA Package
The Weka workbench1 is a collection of state-of-the-art machine learning algorithms
and data preprocessing tools (Hall et al., 2009; Witten & Frank, 2005). Both Weka
interfaces, the Explorer and the Experimenter, were used to discover the methods and
parameter settings that work best for the current classification task.
Standard evaluation measures –precision, recall, f-measure and accuracy (Manning
& Schütze, 1999)– provided by Weka are used. In these measures, true positives (tp)
and true negatives (tn) are the number of cases that the system got right. The wrongly
selected cases are the false positives (fp) while the cases that the system failed to select
are the false negatives (fn). In the current context, true positives and true negatives
would be the numbers of correctly classified instances while the false positives and false
negatives are the numbers of falsely classified instances (Manning & Schütze, 1999).
Precision is defined as the ratio of selected items that the system got right, that is,
the ratio of true positives to the sum of true positives and false positives:
p = tp / (tp + fp).
Recall is defined as the proportion of target items that the system selected, that is,
the ratio of the number of true positives to the sum of true positives and false negatives:
r = tp / (tp + fn).
1 Weka is available at: http://www.cs.waikato.ac.nz/ml/weka/.
Figure 3.3: An example of the Weka Explorer interface.
F-measure is a single measure of overall performance which combines precision and
recall:
F = 2 / (1/r + 1/p).
Accuracy is the proportion of correctly classified objects:
A = (tp + tn) / (tp + tn + fp + fn).
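The four measures above follow directly from the confusion counts, as this short sketch shows (the counts used in the example are illustrative):

```python
# The standard evaluation measures, computed from true/false
# positive/negative counts exactly as defined in the text.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r):
    return 2 / (1 / r + 1 / p)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

p, r = precision(90, 10), recall(90, 30)
print(round(p, 3), round(r, 3), round(f_measure(p, r), 3))  # 0.9 0.75 0.818
print(accuracy(90, 70, 10, 30))  # 0.8
```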
Chapter 4
Evaluation
“Then you should say what you mean” [...]
“I do,” Alice hastily replied; “at least I mean what I say that’s the same thing,
you know.”
“Not the same thing a bit!” said the Hatter. “Why, you might just as well say
that ‘I see what I eat’ is the same thing as ‘I eat what I see’ !”
Alice in Wonderland, Lewis Carroll
This chapter presents the evaluation of the Elliphant system and some optimisa-
tion experiments carried out with the machine learning method (see Section 4.1). A
comparative evaluation of Elliphant’s performance with that of Connexor’s Machinese
Syntax parser is also described (see Section 4.2).
Standard evaluation measures (precision, recall, f-measure and accuracy) are used
to evaluate Elliphant with regard to the identification of the three classes: explicit
subjects, zero pronouns and impersonal constructions.
4.1 Experiments
A set of experiments was executed using the Weka package with the purpose of
answering the following questions:
(1) Which method and parameter values work best for our problem? (see Section 4.1.1)
(2) How many instances are needed to train the algorithm? (see Section 4.1.2)
4. Evaluation 4.1 Experiments
(3) Does the genre matter? (see Section 4.1.3)
(4) Which are the most significant features and what are the most effective combinations of
features? (see Section 4.1.4)
4.1.1 Method Selected: K* Algorithm
A comparison of the learning algorithms implemented in Weka (Witten & Frank,
2005) was carried out to determine the most accurate method for each classification
task. Table 4.1 presents the accuracy levels of all the Weka classifiers which
exploit the features utilised in the Elliphant system, with default parameter settings.
The experiment was executed using 20% of the
instances in the training data, which were selected randomly. Ten-fold cross-validation
was used in the evaluation. All methods with an accuracy within 1% of K*’s are marked
in italics.
The seven1 highest-performing classifiers were compared using 100% of the training
data and 10-fold cross-validation. The Bayes classifiers (BayesNet, NaiveBayes and
NaiveBayesUpdateable) obtained an accuracy score of 0.846, the function classifier
(RBFNetwork) offers an accuracy of 0.850 and the tree classifier (LADTree) an accuracy
of 0.830. With an accuracy of 0.860, the lazy learning classifier K* is the best performing
one, and hence our chosen technique.
Although lazy learning requires a relatively large amount of memory to store the
entire training set, the eszic training data is small enough that it can be classified
within a few minutes.
Instance-based learners classify new instances by comparing them to the manually
classified instances in the training data. The fundamental assumption is that similar
instances will have similar classifications. Nearest neighbor algorithms are the simplest
of the instance-based learners. They use a domain-specific distance measure to retrieve
the single most similar instance from the training set. In a nearest-neighbor method
each instance in the training set is represented by a vector of feature values that has
been explicitly classified. When a new vector of feature values is presented, a distance
measure is computed between the new vector and the set of vectors held in the training
1Unfortunately, due to hardware limitations, it was not possible to obtain results from the NBTree
classifier and the JRip rule classifier when using the entire set of training data.
Weka classifiers Accuracy Weka classifiers Accuracy
Bayes: BayesNet 0.848 Meta: RacedIncrementalLogitBoost 0.717
Bayes: NaiveBayes 0.848 Meta: RandomSubSpace 0.731
Bayes: NaiveBayesSimple 0.842 Meta: Stacking 0.717
Bayes: NaiveBayesUpdateable 0.848 Meta: StackingC 0.717
Functions: RBFNetwork 0.848 Meta: Vote 0.717
Lazy: IB1 0.804 Misc: HyperPipes 0.715
Lazy: IBk 0.810 Misc: VFI 0.704
Lazy: K* 0.850 Rules: ConjunctiveRule 0.809
Lazy: LWL 0.809 Rules: DecisionTable 0.834
Meta: AdaBoostM1 0.81 Rules: DTNB 0.834
Meta: AttributeSelectedClassifier 0.836 Rules: JRip 0.845
Meta: ClassificationViaClustering 0.66 Rules: NNge 0.740
Meta: CVParameterSelection 0.717 Rules: OneR 0.762
Meta: Decorate 0.795 Rules: PART 0.795
Meta: END 0.809 Rules: Ridor 0.821
Meta: EnsembleSelection 0.762 Rules: ZeroR 0.717
Meta: FilteredClassifier 0.810 Trees: BFTree 0.760
Meta: Grading 0.717 Trees: DecisionStump 0.810
Meta: LogitBoost 0.841 Trees: J48 0.810
Meta: MultiBoostAB 0.810 Trees: J48graft 0.813
Meta: MultiClassClassifier 0.661 Trees: LADTree 0.846
Meta: MultiScheme 0.717 Trees: NBTree 0.850
NestedDichotomies: ClassBalancedND 0.809 Trees: RandomForest 0.793
NestedDichotomies: DataNearBalancedND 0.809 Trees: RandomTree 0.749
NestedDichotomies: ND 0.809 Trees: REPTree 0.723
Meta: OrdinalClassClassifier 0.810 Trees: SimpleCart 0.763
Table 4.1: Weka classifiers accuracy (20% of the eszic training set).
set (Cleary & Trigg, 1995). The k nearest ones are identified and the new vector is
assigned the class shared by the majority of the nearest neighbors1.
K* is an instance-based classifier. The class of a test instance is based upon the
classes of those training instances that are similar to it, as determined by some
similarity function. K* differs from other instance-based learners in that it computes
the distance between two instances using a method motivated by information theory,
in which an entropy-based distance function is used (Cleary & Trigg, 1995; Witten &
Frank, 2005). The distance between two instances is defined as the complexity of
transforming one instance into the other; the calculation of this complexity is
detailed in Cleary & Trigg (1995).
1Evans (2001) and Boyd et al. (2005) ran their experiments with the k-nearest-neighbor
classifier, which is also a lazy learning algorithm.
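The nearest-neighbour scheme described above can be sketched in a few lines. The following is a simplified stand-in: it uses symbolic feature vectors and a plain mismatch count in place of Cleary & Trigg's entropic transformation cost, and the feature names and values are invented for illustration only.

```python
from collections import Counter

def mismatch_distance(a, b):
    # Count of differing feature values; a crude stand-in for K*'s
    # entropy-based transformation complexity.
    return sum(x != y for x, y in zip(a, b))

def knn_classify(train, new_vector, k=1):
    # train: list of (feature_vector, class_label) pairs.
    ranked = sorted(train, key=lambda inst: mismatch_distance(inst[0], new_vector))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training instances: (number, person, parsed-subject flag).
train = [(("sg", "3", "yes"), "explicit"),
         (("sg", "3", "no"), "zero"),
         (("pl", "3", "no"), "zero"),
         (("sg", "impers", "no"), "impersonal")]
print(knn_classify(train, ("sg", "3", "no")))  # → zero
```

With k=1 this reduces to retrieving the single most similar instance, as in the instance-based learners discussed above.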
When using K*, the most effective classification is obtained with a global blending
parameter1 of 40%2, while the rest of the parameters keep their default values: the
missing mode parameter3 is set to average column entropy curves and the entropic
auto blend parameter is set to false. Table 4.2 presents the evaluation of Elliphant
when exploiting the K* classifier with these parameter settings, using ten-fold
cross-validation.
Class Precision Recall F-measure
Explicit subjects 0.900 0.923 0.911
Zero pronouns 0.772 0.740 0.756
Impersonal constructions 0.889 0.626 0.734
eszic training data Accuracy: 0.867 (ten-fold cross-validation)
Table 4.2: eszic training data evaluation with K* -B 40 -M a.
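The per-class measures reported in Table 4.2 follow the standard definitions of precision, recall and f-measure. As a reminder of how they are derived from a confusion matrix, here is a minimal sketch; the counts are invented toy figures, not the eszic results.

```python
def per_class_metrics(confusion, cls):
    # confusion[true_class][predicted_class] = count of instances.
    classes = confusion.keys()
    tp = confusion[cls][cls]
    fp = sum(confusion[t][cls] for t in classes if t != cls)
    fn = sum(confusion[cls][p] for p in classes if p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Toy confusion matrix over the three classes (illustrative counts only).
confusion = {
    "explicit":   {"explicit": 90, "zero": 8,  "impersonal": 2},
    "zero":       {"explicit": 10, "zero": 74, "impersonal": 16},
    "impersonal": {"explicit": 2,  "zero": 12, "impersonal": 6},
}
p, r, f = per_class_metrics(confusion, "zero")
print(round(p, 3), round(r, 3), round(f, 3))
```

Accuracy, by contrast, is a single figure: the sum of the diagonal counts divided by the total number of instances.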
There is a marginal reduction in accuracy when the system is evaluated using ten-fold
cross-validation (0.867) instead of leave-one-out cross-validation (0.869), a difference
of minimal statistical significance. As the proportion of training data used decreases,
the difference in performance between the two evaluation methods remains stable,
reaching at most 0.005 (when 50% of the training data is used). Although leave-one-out
cross-validation obtains more accurate results, since it is easier to classify test
instances using almost 100% of the training data than using only 90% of it, in practice
a classifier is trained and tested on instances derived from different data sets. Ten-fold
cross-validation is thus a more accurate simulation of real-world classification scenarios.
Moreover, it can be computed far more quickly than leave-one-out cross-validation.
1The parameter for global blending.
2Blending percentages up to 50% were tested.
3The missing mode determines how missing attribute values are treated.
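Both evaluation schemes can be reproduced with a single generic k-fold routine: leave-one-out is simply k-fold cross-validation with k equal to the number of instances. The sketch below uses a trivial majority-class learner as a placeholder for K*, on an invented two-class data set.

```python
from collections import Counter

def k_fold_accuracy(data, classify, k):
    # data: list of (features, label); classify(train, features) -> label.
    folds = [data[i::k] for i in range(k)]  # k disjoint folds
    correct = total = 0
    for i, test in enumerate(folds):
        train = [inst for j, fold in enumerate(folds) if j != i
                 for inst in fold]
        for feats, label in test:
            correct += classify(train, feats) == label
            total += 1
    return correct / total

def majority_classify(train, feats):
    # Placeholder learner: always predicts the majority class of the fold.
    return Counter(label for _, label in train).most_common(1)[0][0]

data = ([((i,), "explicit") for i in range(14)]
        + [((i,), "zero") for i in range(6)])
tenfold = k_fold_accuracy(data, majority_classify, 10)
loo = k_fold_accuracy(data, majority_classify, len(data))  # leave-one-out
print(tenfold, loo)  # both 0.7 for this toy data
```

For this degenerate learner the two schemes agree exactly; with a real learner such as K*, leave-one-out typically scores slightly higher because each test instance is classified against almost the whole training set.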
eszic training data Ten-fold Leave-one-out
percentage cross-validation cross-validation
10% 0.836 0.834
20% 0.859 0.862
30% 0.854 0.851
40% 0.855 0.858
50% 0.858 0.863
60% 0.860 0.862
70% 0.860 0.862
80% 0.865 0.863
90% 0.866 0.869
100% 0.867 0.868
Table 4.3: Leave-one-out and ten-fold cross-validation comparison.
4.1.2 Learning Curve
A learning curve shows how accuracy changes with varying sample sizes, plotting the
number of correctly classified instances against the number of instances in the training
data. To calculate the learning curve of the Elliphant system, the eszic training data
was used to generate ten training samples, representing 10%, 20%, 30%, 40%, 50%,
60%, 70%, 80%, 90% and 100% of the data set. The instances contained in the eszic training
file were randomly ordered so that the genre variable could not influence the results
presented below. In these experiments, the K* algorithm was used with the parameter
settings described in Section 4.1.1 and the evaluation was carried out using ten-fold
cross-validation.
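The sampling procedure just described amounts to: shuffle the instances once, take increasing prefixes, and score each prefix. The following minimal sketch assumes a held-out test split and a 1-NN stand-in for K* on synthetic data; the actual experiments score each sample with ten-fold cross-validation over the eszic instances.

```python
import random

def nn1(train, feats):
    # 1-NN with a symbolic mismatch distance, standing in for K*.
    best = min(train, key=lambda inst: sum(a != b for a, b in zip(inst[0], feats)))
    return best[1]

def learning_curve(data, fractions, test_frac=0.2, seed=0):
    rng = random.Random(seed)
    data = data[:]
    rng.shuffle(data)  # random order so genre grouping cannot bias the samples
    cut = int(len(data) * test_frac)
    test, pool = data[:cut], data[cut:]
    acc = lambda train: sum(nn1(train, f) == y for f, y in test) / len(test)
    return [(frac, acc(pool[:max(1, int(len(pool) * frac))]))
            for frac in fractions]

# Synthetic instances whose label is determined by the first feature.
data = [((i % 3, i % 5), str(i % 3)) for i in range(100)]
curve = learning_curve(data, [0.1 * n for n in range(1, 11)])
for frac, accuracy in curve:
    print(f"{frac:.0%} of pool: accuracy {accuracy:.2f}")
```

Plotting the resulting (fraction, accuracy) pairs yields a curve of the kind shown in Figure 4.1.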
The learning curve shown in Figure 4.1 presents the increase in accuracy obtained
by the Elliphant system using the eszic training data. Performance reaches a plateau
at its maximum level when using 90% of the training instances.1
Figure 4.2 displays the precision, recall and f-measure of classification for all classes
1One thing to be noted is that the ordering of the instances makes a slight difference to the accuracy
of classification. While the system obtains an accuracy of 0.867 when the instances are placed in their
original order of occurrence in the eszic training data, 0.866 is obtained when the same instances are
presented in random order to the classifier using ten-fold cross validation. This difference also occurs
when leave-one-out cross-validation is used. In this case, the method obtains an accuracy of 0.869
when the instances are placed in their original order of occurrence and 0.868 when presented in random
order.
[Figure: accuracy plotted against the percentage of eszic training data used, rising from 0.836 at 10% to 0.866 at 90% and 100%.]
Figure 4.1: eszic training data learning curve for accuracy.
in the eszic training data. The values of the three measures are maximal when uti-
lizing 90% of the training set. While recall plateaus at this sample size, precision
and f-measure decrease slightly when the amount of training data is further increased,
although this decline is not sufficiently marked to be attributed to overtraining.
[Figure: precision, recall and f-measure plotted against the percentage of eszic training data used; all three measures rise from roughly 0.83 at 10% to about 0.865 at 90%.]
Figure 4.2: eszic training data learning curve for precision, recall and f-measure.
The learning curve in Figure 4.3 shows the classification accuracy for each of the
classes, while Figure 4.4 presents this accuracy in relation to the number of training
instances for each section of the eszic training data.
Under all conditions, subjects are classified with a high accuracy since the infor-
mation given by the parser (collected in the features) facilitates an f-measure of 0.801
for the identification of explicit subjects. By contrast, the parser recognises neither
zero pronouns nor impersonal constructions; it can only detect that a clause has no
explicit subject. The accuracy with which these two types can be classified thus begins
at a lower level (0.662 and 0.621 respectively). Classification of both zero pronouns
and impersonal constructions reaches its maximum when 90% of the training data is
exploited. There is also some evidence of overtraining in the classification of impersonal
constructions when using 100% of the training data.
[Figure: per-class scores plotted against the percentage of training data used. Explicit subjects rise from 0.895 at 10% to 0.911 at 80%–100%; zero pronouns from 0.662 to 0.754; impersonal constructions from 0.621 to a maximum of 0.736 at 90%, falling to 0.721 at 100%.]
Figure 4.3: Learning curve for accuracy, recall and f-measure of the classes.
The zero pronoun class has the steepest learning curve. Utilising only 735 instances
(50% of the training set), the Elliphant system obtains an accuracy (0.741) close to that
obtained when using 100% of the training data. The learning curve for the subject class
is more gradual due to the great variety of subjects occurring in the training data. In
addition, increasing accuracy from a greater starting point (0.907 using just 20% of the
training data) is far more expensive in terms of the addition of training instances. The
impersonal sentence class is also learned rapidly by Elliphant: utilising a training set
of only 179 instances, it reaches a classification accuracy of 0.721 (see Figure 4.4).
[Figure: per-class scores plotted against the number of training instances of each class. Explicit subjects range from 498 to 4,854 instances, zero pronouns from 167 to 1,793, and impersonal constructions from 17 to 179.]
Figure 4.4: Learning curve for accuracy, recall and f-measure in relation to the number
of instances of each class.
This demonstrates that Elliphant is not heavily reliant on very large sets of expensive
training data and is able to reach adequate levels of performance when exploiting
far fewer training instances. Overall, we see that only a small set of annotated
instances (1,500) is needed to achieve reasonable results.
4.1.3 Most Effective Features
With Weka’s Attribute Selection option, it is possible to evaluate the features by
considering the individual predictive ability of each of the features along with the degree
of redundancy between them. Table 4.4 shows the relevant features, in ranked order,
as evaluated using different algorithms implemented in Weka's attribute selection
module which can handle the feature types (symbolic, numerical, etc.) from the eszic
training data.
The filters used for each Attribute Selection method are the ones provided by default
in Weka1.
Considering the group of features selected using each Weka Attribute Selection
algorithm, 11 classifications using the K* classifier were made over the complete eszic
1BestFirst filter for the CfsSubsetEval method; Attribute ranking filter for the ChiSquaredAttributeEval,
FilteredAttributeEval, GainRatioAttributeEval, InfoGainAttributeEval, OneRAttributeEval,
ReliefFAttributeEval and SymmetricalUncertAttributeEval methods; and Greedy Stepwise filter for the
ConsistencySubsetEval and FilteredSubsetEval methods.
Weka Attribute Selection Selected features
CfsSubsetEval PARSER, NUMBER, NHPREV, NHTOT,
VERBtype, PERSON
ChiSquaredAttributeEval LEMMA, POSpos, NHTOT, NHPREV, POSpre,
PARSER
ConsistencySubsetEval PARSER, LEMMA, NUMBER, AGREE, NHTOT,
POSpos, POSpre
FilteredAttributeEval POSpos, LEMMA, NHPREV, NHTOT, PARSER,
POSpre
FilteredSubsetEval PARSER, NHPREV, NHTOT
GainRatioAttributeEval NHPREV, PARSER, PERSON, NHTOT, POSpos,
CLAUSE
InfoGainAttributeEval POSpos, LEMMA, NHPREV, NHTOT, PARSER,
POSpre
OneRAttributeEval NHTOT, POSpos, CLAUSE, PERSON, NHPREV,
PARSER
ReliefFAttributeEval POSpos, VERBtype, LEMMA, PARSER, CLAUSE,
POSpre
SymmetricalUncertAttributeEval NHPREV, PARSER, NHTOT, POSpos, PERSON,
LEMMA
Table 4.4: Features selected by the Weka Attribute Selection methods.
training data using only the features selected by each method. Table 4.5 presents the
accuracy of each classification using ten-fold cross-validation.
The most effective group of six features in combination is the one selected by
Weka’s SymmetricalUncertAttributeEval Attribute Selection algorithm, since the clas-
sification using those six features together already offers an accuracy of 0.851. Likewise,
a group consisting of only three features (parser, nhprev, nhtot) was selected by
the FilteredSubsetEval algorithm. These three features are the most frequently selected
ones among those chosen by all the Attribute Selection methods. A classification which
exploits only these three features obtains an accuracy of 0.819.
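Several of the rankers in Table 4.4 (for example InfoGainAttributeEval) score each feature by information gain: the reduction in class entropy obtained by splitting the instances on that feature's values. A self-contained sketch on an invented two-feature data set:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(instances, labels, idx):
    # Class entropy minus the entropy remaining after splitting on feature idx.
    groups = defaultdict(list)
    for inst, label in zip(instances, labels):
        groups[inst[idx]].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

X = [("sg", "legal"), ("sg", "health"), ("pl", "legal"), ("pl", "health")]
y = ["zero", "zero", "explicit", "explicit"]
ranking = sorted(range(len(X[0])), key=lambda i: info_gain(X, y, i), reverse=True)
print(ranking)  # feature 0 predicts the class perfectly, so it ranks first: [0, 1]
```

The other evaluators in the table use different criteria (chi-squared statistics, consistency, ReliefF sampling, and so on), but all produce a ranking or subset of features in the same way.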
A set of experiments was conducted in which features were selected on the basis
of the degree of computational effort needed to generate them. Two sets of features
were proposed. One group corresponds to features intrinsic to the parser, whose values
can be obtained by trivial exploitation of the tags produced in its output (parser,
Weka Attribute Selection Accuracy
CfsSubsetEval 0.824
ChiSquaredAttributeEval 0.848
ConsistencySubsetEval 0.843
FilteredAttributeEval 0.848
FilteredSubsetEval 0.819
GainRatioAttributeEval 0.833
InfoGainAttributeEval 0.848
OneRAttributeEval 0.833
ReliefFAttributeEval 0.825
SymmetricalUncertAttributeEval 0.851
Table 4.5: Classification using the selected features groups: accuracy.
lemma, person, pospos, pospre). The second group of features (clause, agree,
nhprev, nhtot, verbtype) has values derived by methods extrinsic to the parser
and rules for the recognition of elements that are independent of it. Derivation of
this second group of features necessitated the implementation of more sophisticated
modules to identify the boundaries of syntactic constituents such as clauses and noun
phrases. These modules are rule-based and operate over the often erroneous output
of the parser (see Section 3.2.4). The results obtained when the classifier exclusively
exploits each of these intrinsic and extrinsic groups of features are shown in Tables 4.6
and 4.7.
A recurrent issue in anaphora resolution studies is determining the quantity and
type of knowledge needed for identification of candidates and selection of a candidate
as antecedent. In Mitkov (2002) it is stated that, given the natural linguistic ambiguity
of various cases, the resolution of any kind of anaphor requires not only morphological,
lexical, and syntactic knowledge but also semantic knowledge, discourse knowledge, and
real world knowledge. Nevertheless, current anaphora resolution methods rely mainly
on restrictions and preference heuristics, which employ information originating from
morpho-syntactic or shallow semantic analysis (Ferrandez & Peral, 2000; Mitkov, 1998),
while some previous approaches have exploited full parsing (Hobbs, 1977; Lappin &
Leass, 1994). As described in this dissertation, Elliphant makes use of deep dependency
parsing plus the morphological knowledge contained in the verb lists used.
There are two findings of note in Tables 4.6 and 4.7. The first is that no impersonal
constructions are identified when only features extrinsic to the parser are used. The second
is that there is a reduction in recall when using only intrinsic features. It is therefore
better to classify instances using a feature group that combines both types of features.
eszic training data Precision Recall F-measure
Explicit subjects 0.654 0.664 0.659
Zero pronouns 0.865 0.891 0.878
Impersonal constructions 0 0 0
Extrinsic parser features eszic training data accuracy: 0.808
Table 4.6: Extrinsic parser features classification results.
eszic training data Precision Recall F-measure
Explicit subjects 0.866 0.312 0.459
Zero pronouns 0.779 0.983 0.869
Impersonal constructions 0.944 0.285 0.438
Intrinsic parser features eszic training data accuracy: 0.789
Table 4.7: Intrinsic parser features classification results.
To estimate the weight of each feature, classifications were made in which each
feature was omitted from the training instances that were presented to the classifier
and ten-fold cross-validation was applied. Table 4.8 presents the accuracy of these
classifications. Omission of every feature except a led to a reduction in accuracy,
justifying their inclusion in the training instances.
Feature omitted Accuracy Feature omitted Accuracy
PARSER 0.854 VERBtype 0.863
NHTOT 0.860 NUMBER 0.864
LEMMA 0.861 INF 0.864
POSpos 0.861 AGREE 0.865
NHPREV 0.862 POSpre 0.866
PERSON 0.863 SE 0.866
CLAUSE 0.863 A 0.867
Table 4.8: Single feature omission classifications: accuracy.
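The single-feature omission study can be reproduced with a small ablation harness: drop one feature column at a time, re-evaluate, and compare against the full-feature score. The sketch below plugs in a trivial lookup-table scorer on invented data; in the thesis the evaluator is the K* classifier under ten-fold cross-validation.

```python
from collections import Counter, defaultdict

def table_accuracy(data):
    # Score of a lookup-table learner: predict the majority label of each
    # identical feature tuple (the ceiling a memorising learner can reach).
    groups = defaultdict(list)
    for feats, label in data:
        groups[feats].append(label)
    correct = sum(Counter(labels).most_common(1)[0][1]
                  for labels in groups.values())
    return correct / len(data)

def ablation(data, evaluate, n_features):
    # Re-evaluate with each feature removed in turn.
    scores = {}
    for i in range(n_features):
        reduced = [(tuple(v for j, v in enumerate(feats) if j != i), label)
                   for feats, label in data]
        scores[i] = evaluate(reduced)
    return scores

data = [(("sg", "x"), "zero"), (("sg", "x"), "zero"),
        (("pl", "x"), "explicit"), (("pl", "x"), "explicit")]
print(table_accuracy(data))               # 1.0 with all features
print(ablation(data, table_accuracy, 2))  # dropping feature 0 hurts: {0: 0.5, 1: 1.0}
```

A feature whose removal leaves the score unchanged (here feature 1, and feature a in Table 4.8) contributes nothing that the remaining features do not already provide.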
4.1.4 Genre Analysis
As the eszic training data is composed of instances belonging to two different genres
(legal and health), two subgroups of the eszic training data were generated: the Legal
eszic training data and the Health eszic training data containing all the instances
derived from legal and health texts, respectively. A comparative evaluation using ten-
fold cross-validation over the two subgroups shows that Elliphant is more successful
when classifying instances of explicit subjects in legal texts (see Table 4.9). This may
be explained by the uniformity of the sentences in the legal texts which present less
variation than the ones from the health genre. Texts from the health genre present
the additional complication of specialised named entities and acronyms which are used
quite frequently in the health texts from the eszic Corpus (e.g. CCDSD1, DSM-IV2 or
TLP3). Further, there is a larger number of explicit subjects in the legal training data
(2,739, compared with 2,116 explicit subjects occurring in the health texts). Similarly,
better performance in the detection of zero pronouns and impersonal sentences in the
health texts may be due to their higher occurrence in the health genre: 108 impersonal
constructions and 1,174 zero pronouns compared with 71 impersonal constructions and
619 zero pronouns in the legal texts (see Table 3.2 for details about the number of class
instances in each subgroup of the training data).
Class Precision Recall F-measure
Legal genre Explicit subjects 0.920 0.955 0.937
Health genre Explicit subjects 0.881 0.888 0.884
Legal genre Zero pronouns 0.761 0.649 0.701
Health genre Zero pronouns 0.784 0.796 0.790
Legal genre Impersonal constructions 0.786 0.620 0.693
Health genre Impersonal constructions 0.905 0.620 0.736
Legal genre accuracy: 0.893 (ten-fold cross-validation)
Health genre accuracy: 0.848 (ten-fold cross-validation)
Table 4.9: Legal and health genres comparative evaluation.
1Cuestionario Clínico para el Diagnóstico del Síndrome Depresivo (Clinical Questionnaire for Depressive Syndrome Diagnosis).
2Manual Diagnóstico y Estadístico de los Trastornos Mentales IV (Diagnostic and Statistical Manual of Mental Disorders IV).
3Trastorno límite de la personalidad (Borderline Personality Disorder).
We have also studied the effect of training the classifier on data derived from one
genre and testing it on instances derived from a different genre. Table 4.10 shows that
instances from legal texts are not only more homogeneous, since the classifier obtains
higher accuracy when training and testing only on legal instances (0.895), but also
more informative: when both legal and health genres are combined as training data,
testing the algorithm only on instances from the health genre shows significantly
increased accuracy (0.933). These results imply that the instances from the health
genre are the most heterogeneous ones. Subsets of legal documents where our method
achieves an accuracy of 0.942 were also found.
Training set \ Testing set Legal Health eszic Corpus
Legal 0.895 0.859 0.885
Health 0.858 0.841 0.887
eszic Corpus (all) 0.920 0.933 0.869
Accuracy: cross-genre training and testing (ten-fold cross-validation)
Table 4.10: Cross-genre training and testing evaluation.
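The cross-genre comparison in Table 4.10 is a simple grid: train on each split, test on each split. A generic harness for producing such a grid follows; the majority-class scorer and the miniature genre splits are invented stand-ins for the real train-and-evaluate step with K*.

```python
from collections import Counter

def majority_score(train, test):
    # Stand-in evaluator: accuracy of always predicting the training
    # split's majority label on the test split.
    guess = Counter(label for _, label in train).most_common(1)[0][0]
    return sum(label == guess for _, label in test) / len(test)

def cross_genre_grid(splits, score):
    # splits: genre name -> list of (features, label) instances.
    names = list(splits)
    return {(tr, te): score(splits[tr], splits[te])
            for tr in names for te in names}

legal = [((i,), "explicit") for i in range(8)] + [((i,), "zero") for i in range(2)]
health = [((i,), "explicit") for i in range(5)] + [((i,), "zero") for i in range(5)]
splits = {"legal": legal, "health": health, "all": legal + health}
grid = cross_genre_grid(splits, majority_score)
print(grid[("legal", "legal")])  # 0.8: majority "explicit" matches 8 of 10
```

Reading off the grid by (training set, testing set) pairs reproduces the layout of Table 4.10.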
4.2 Comparative Evaluation
Due to the lack of previous work on this topic, a comparison with other methods is
not feasible. Despite its similarities to this approach, Ferrandez & Peral (2000) use a
different definition for zero pronouns, and therefore a comparison is not appropriate.
As a guideline, the results obtained by Connexor’s Machinese Syntax are presented
regarding the existence (or not) of a subject inside the clause. Since this parser does
not distinguish between referential and non-referential elliptic subjects, both categories
have been merged into one. Needless to say, a comparison of the results obtained by
these two methods should be made with caution. They are presented here only as a
point of reference. It is clear from the figures that the Elliphant system offers not only
improved f-measure in the classification of both elliptic subject classes, but also obtains
superior f-measure when classifying the non-omitted subject class.
The evaluation was carried out using both the entire set of eszic training data and
also the genre-specific subsets of the training data (Legal and Health eszic training
data). The evaluation of the Elliphant system was made using leave-one-out cross-
validation.
eszic training data Precision Recall F-measure
Elliphant Explicit subjects 0.901 0.924 0.913
Elliphant Zero pronouns 0.774 0.743 0.758
Elliphant Impersonal constructions 0.889 0.626 0.734
Elliphant eszic training data accuracy: 0.869 (leave-one-out cross-validation)
Table 4.11: Elliphant eszic training data results.
eszic training data Precision Recall F-measure
Machinese Explicit subjects 0.911 0.716 0.802
Machinese Zero pronouns
+ Impersonal constructions 0.543 0.829 0.656
Machinese eszic training data accuracy: 0.749
Table 4.12: Machinese eszic training data results.
When evaluating over the entire eszic training set, Elliphant outperforms the parser
on every measure. When detecting explicit subjects, Elliphant obtains a considerably
higher recall score (0.924, compared to the parser's 0.716). The averages of the
evaluation measures obtained for the identification of zero pronouns and impersonal
constructions (precision: 0.831; recall: 0.684; f-measure: 0.746) were also compared.
The comparison demonstrated Elliphant's superiority over Connexor's Machinese
Syntax parser in this task for all measures except recall.
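The averaged figures quoted for the two elision classes are unweighted (macro) means of the per-class measures; any small discrepancy with the rounded figures in the text arises from where rounding is applied. A minimal sketch using the per-class scores from Table 4.11:

```python
def macro_average(per_class):
    # per_class: {class name: (precision, recall, f-measure)}.
    # Unweighted mean of each measure across the classes.
    n = len(per_class)
    return tuple(sum(scores[i] for scores in per_class.values()) / n
                 for i in range(3))

# Elliphant's two elision classes, taken from Table 4.11.
elision = {"zero pronouns": (0.774, 0.743, 0.758),
           "impersonal constructions": (0.889, 0.626, 0.734)}
p, r, f = macro_average(elision)
print(p, r, f)
```

A macro average weights both classes equally despite impersonal constructions being far rarer than zero pronouns, which is appropriate here since both elision types matter equally for downstream anaphora resolution.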
Legal genre eszic training data Precision Recall F-measure
Legal genre Elliphant Explicit subjects 0.922 0.955 0.938
Legal genre Elliphant Zero pronouns 0.760 0.654 0.703
Legal genre Elliphant Impersonal constructions 0.797 0.662 0.723
Elliphant Legal eszic training accuracy: 0.895
Table 4.13: Elliphant Legal eszic training results.
When processing only the Legal eszic training data, the accuracy of the parser is
reduced (0.726), while the performance of the Elliphant system is improved (0.895).
Legal genre eszic training data Precision Recall F-measure
Legal genre Machinese Explicit subjects 0.940 0.702 0.803
Legal genre Machinese Zero pronouns
+ Impersonal constructions 0.410 0.823 0.547
Machinese Legal eszic training accuracy: 0.726
Table 4.14: Machinese Legal eszic training results.
The two systems were used to classify instances of elision (zero pronouns and impersonal
constructions) in texts from the legal genre. The averaged evaluation measures
obtained by the Elliphant system (precision: 0.778; recall: 0.658; f-measure: 0.713)
were found to be superior to those obtained by the parser (precision: 0.675; recall:
0.763; f-measure: 0.675) for all measures except recall.
Health genre eszic training data Precision Recall F-measure
Health genre Elliphant Explicit subjects 0.879 0.879 0.879
Health genre Elliphant Zero pronouns 0.773 0.795 0.784
Health genre Elliphant Impersonal constructions 0.882 0.620 0.728
Elliphant Health eszic training data accuracy: 0.841
Table 4.15: Elliphant Health eszic training data results.
Health genre eszic training data Precision Recall F-measure
Health genre Machinese Explicit subjects 0.879 0.735 0.801
Health genre Machinese Zero pronouns
+ Impersonal constructions 0.656 0.833 0.734
Machinese Health eszic training data accuracy: 0.772
Table 4.16: Machinese Health eszic training data results.
When classifying instances derived from texts in the health genre (using Health
eszic training data), the accuracy of both the Elliphant system and the parser was
reduced. However, Elliphant still outperforms the parser in this context.
When considering the classification of instances of elision in the health genre, Connexor's
Machinese Syntax parser does obtain a higher recall (0.833) than the averaged evaluation
measures of Elliphant (precision: 0.827; recall: 0.707; f-measure: 0.756).
Nevertheless, unlike the parser, the Elliphant system distinguishes referential (zero
pronouns) and non-referential (impersonal constructions) elided subjects. This can be
considered one of its main contributions as this task is necessary in order to improve
practical anaphora resolution systems.
Chapter 5
Conclusions and Future Work
In this dissertation, a machine learning approach to the identification of zero pronouns,
impersonal constructions, and explicit subjects was presented. In treating this range
of classes, complete coverage is provided for all possible constituents which may occur
in subject position in Spanish clauses.
In order to enable a machine learning approach to classification, a parsed corpus of
Spanish texts from the health and legal genres was compiled. The corpus was manually
annotated to encode information about the element in subject position for every finite
verb in the corpus (the eszic Corpus). A set of 14 features was formulated and training
data consisting of 6,827 instances represented by vectors of the feature values was cre-
ated (eszic training data). The training data was utilised by classification algorithms
distributed with the Weka package. Empirical observation revealed that use of the K*
algorithm was optimal for the purpose of this classification. The performance of this
machine learning approach was compared with that of Connexor’s Machinese Syntax
parser. Elliphant offers a classification with superior accuracy in the recognition of
both of the elliptic classes (zero pronouns and impersonal constructions), and also in
the classification of the non-elliptic subject class (explicit subjects). The method pre-
sented in this dissertation is also able to identify impersonal constructions in Spanish.
This is a task which appears not to have been dealt with before in the literature.
In addition to presenting results with regard to algorithm selection, further experiments
carried out with the underlying method covered parameter optimisation, identification
of the most effective combinations of features, the optimal number of instances to
include in the training data, and the relationships between the results and the different
genres on which the Elliphant system was tested. This chapter presents the findings
of all of these experiments (see section 5.1). In future research, it is intended that op-
timisation of the approach and its adaptability to other genres will be investigated in
more depth (see section 5.2).
5.1 Main Observations
Algorithm selection: the instance-based learning algorithm K* was selected for clas-
sification of elliptic vs. explicit subject instances and referential vs. non-referential
subject instances. This decision was taken after comparing the accuracy of this classifier
with that of the other classifiers available in the Weka package. In terms of accuracy,
the K* algorithm is closely followed by the Bayes-based algorithms in Weka.
Parameter optimisation was investigated by checking the impact of the param-
eter setting on the performance of the K* classifier. Although Weka provides sensible
default settings, it is by no means certain that they will be optimal for this particular
task. The default settings were changed so that a blending parameter of 40% was used
with regard to the K* algorithm.
Feature selection: the set of experiments conducted to determine an optimal
group of features to be utilised by the classification algorithm revealed that of the en-
tire set of 14 features, the most effective group comprises six of the features: nhprev
(number of noun phrases previous to the verb), parser (parsed subject), nhtot (num-
ber of noun phrases in the clause), pospos (four pos following the verb), person (verb
morphological person), and lemma (verbal lemma). This study showed that feature a
(preposition a) does not make any meaningful contribution to the classification.
Training data required: learning curve experiments showed the correlation between
the accuracy of the classifier and the size of the training set; performance reaches a
plateau at its maximum level when 90% of the available data is used.
Genre interference: We evaluated the performance of the Elliphant system sep-
arately in two different genres, legal and health, showing that there is some genre
interference in the classification tasks. Elliphant classifies explicit subjects in legal
texts with a higher accuracy than is the case in health texts. By contrast, zero
pronouns and impersonal constructions are more accurately classified in health texts. Cross-
genre training and testing demonstrated that legal instances are more informative and
homogeneous than health genre cases.
5.2 Future Research
Future research goals are related to improvements in: (1) optimisation of the Elliphant
system, (2) adaptation of the system to other genres, (3) inter-annotator agreement
on the eszic Corpus, (4) the comparison of Elliphant with a rule-based approach, and
(5) the design of an algorithm to resolve zero anaphora in Spanish.
Firstly, with regard to further improvement of the Elliphant system, the interaction
between (a) feature selection and parameter optimisation, and (b) class distribution will
be addressed. In related work, it was found that optimal settings for feature selection
and parameter optimisation should not be sought independently of one another since
there is an interaction between the two. The joint optimisation of feature selection
and parameter optimisation can cause variations in the accuracy levels obtained by
classifiers (Hoste, 2005). Additionally, an investigation will be made into how the class
distribution of the data affects learning. This will facilitate the compilation of an
optimal set of training instances as it has been found that training data containing a
lower distribution of negative instances can be beneficial to classification (Hoste, 2005).
In future work, evaluation and learning curve experiments in which training in-
stances derived from texts in one genre are used to classify instances derived from texts
in a different genre will provide an insight into the optimal type/combination of train-
ing data that enables better classification using fewer instances in various types/genres
of text, as well as providing additional robustness to our system.
Inter-annotator agreement will be measured, and it is planned to design a rule-based
algorithm to identify and to resolve zero anaphora in Spanish, as there is some debate
about which approach, machine learning or rule-based, brings optimal performance
when applied in anaphora resolution systems (Mitkov, 2002).
References
Aldea Muñoz, S. (2003). Un caso de intervención psicológica de la depresión infantil. psiquiatria.com, 7. 28

Aldea Muñoz, S. (2006). Influencia del autoconcepto y de la competencia social en la depresión infantil. psiquiatria.com, 10. 28

Alonso-Ovalle, L. & D'Introno, F. (2000). Full and null pronouns in Spanish: the zero pronoun hypothesis. In H. Campos, E. Herburger, A. Morales-Front & T.J. Walsh, eds., Hispanic linguistics at the turn of the millennium. Papers from the 3rd Hispanic Linguistics Symposium, 189–210, Cascadilla Press, Somerville, MA. 6

Balcázar Nava, P., Bonilla Muñoz, M.P., Gurrola Peña, G.M., Oudhof van Barneveld, H. & Aguilar Mercado, M.R. (2005). La depresión como problema de salud mental en los adolescentes mexicanos. psiquiatria.com, 9. 28

Barreras, J. (1993). Resolución de elipsis y técnicas de parsing en una interficie de lenguaje natural. Procesamiento del lenguaje natural, 13, 247–258. 7, 8

Beavers, J. & Sag, I. (2004). Coordinate ellipsis and apparent non-constituent coordination. In S. Müller, ed., Proceedings of the 11th International Conference on Head-Driven Phrase Structure Grammar (HPSG-04), 48–69, CSLI Publications, Stanford, CA. 17

Bello, A. ([1847] 1981). Gramática de la lengua castellana destinada al uso de los americanos. Instituto Universitario de Lingüística Andrés Bello, Cabildo Insular de Tenerife, Santa Cruz de Tenerife. 15, 19

Bergsma, S., Lin, D. & Goebel, R. (2008). Distributional identification of non-referential pronouns. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL/HLT-08), 10–18. 2, 10, 12

Bosque, I. (1989). Clases de sujetos tácitos. In J. Borrego Nieto, ed., Philologica: homenaje a Antonio Llorente, vol. 2, 91–112, Servicio de Publicaciones, Universidad Pontificia de Salamanca, Salamanca. 15, 16, 18, 19, 24

Boyd, A., Gegg-Harrison, W. & Byron, D. (2005). Identifying non-referential it: a machine learning approach incorporating linguistically motivated patterns. In Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in Natural Language Processing. 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), 40–47. 8, 10, 12, 13, 22, 45
Brucart, J.M. (1987). La elisión sintáctica en español. Universitat Autònoma de Barcelona, Bellaterra. 15
Brucart, J.M. (1999). La elipsis. In I. Bosque & V. Demonte, eds., Gramática descriptiva de la lengua española, vol. 2, 2787–2863, Espasa-Calpe, Madrid. ix, 15, 16, 17, 19, 23, 24
Carden, G. (1982). Backwards anaphora in discourse context. Journal of Linguistics, 18, 361–387. 33, 34
Chinchor, N. & Hirschman, L. (1997). MUC-7 Coreference task definition (version 3.0). In Proceedings of the 1997 Message Understanding Conference (MUC-97). 2
Chomsky, N. (1965). Aspects of the theory of syntax. The MIT Press, Cambridge, MA. 15
Chomsky, N. ([1968] 2006). Language and mind. Cambridge University Press, Cambridge, 3rd edn. 14
Chomsky, N. (1981). Lectures on government and binding. Mouton de Gruyter, Berlin, New York. 1, 6, 19
Chomsky, N. (1995). The minimalist program. The MIT Press, Cambridge, MA. 15
Chung, S., Ladusaw, W. & McCloskey, J. (1995). Sluicing and logical form. Natural Language Semantics, 3, 239–282. 17
Cleary, J. & Trigg, L. (1995). K*: an instance-based learner using an entropic distance measure. In Proceedings of the 12th International Conference on Machine Learning (ICML-95), 108–114. 13, 45, 46
Clemente, J., Torisawa, K. & Satou, K. (2004). Improving the identification of non-anaphoric it using Support Vector Machines. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP-04), 58–61. 10, 12
Código Civil (1889). Texto de la edición del Código Civil mandada publicar por el Real Decreto de 24 del corriente en cumplimiento de la ley de 26 de mayo último. Gaceta de Madrid, 206, 249–312. 26
Connexor Oy (2006a). Conexor functional dependency grammar 3.7. User's manual. 29, 38, 39
Connexor Oy (2006b). Machinese language model. 13, 29, 35
Constitución Española (1978). Constitución Española de 27 de diciembre de 1978. Boletín Oficial del Estado, 311, 29313–29424. 26
Corpas Pastor, G. (2008). Investigar con corpus en traducción: los retos de un nuevo paradigma. Peter Lang, Frankfurt am Main. 7, 8
Corpas Pastor, G., Mitkov, R., Afzal, N. & Pekar, V. (2008). Translation universals: do they exist? A corpus-based NLP study of convergence and simplification. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas (AMTA-08), 75–81. 2, 7, 8, 10
Danlos, L. (2005). Automatic recognition of French expletive pronoun occurrences. In R. Dale, K.F. Wong, J. Su & O.Y. Kwong, eds., Natural language processing. Proceedings of the 2nd International Joint Conference on Natural Language Processing (IJCNLP-05), 73–78, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 3651. 2, 10, 11, 12
Denber, M. (1998). Automatic resolution of anaphora in English. Tech. rep., Eastman Kodak Co. 10, 11, 12
Díaz Morfa, J. (2004). La crisis de las aventuras en las relaciones de pareja. psiquiatria.com, 8. 28
Díscolo, A. ([2nd century] 1987). Sintaxis. Gredos, Madrid. 14
Evans, R. (2000). A comparison of rule-based and machine learning methods for identifying non-nominal it. In D.N. Christodoulakis, ed., Natural Language Processing - NLP 2000. Proceedings of the 2nd International Conference on Natural Language Processing (NLP-2000), 233–241, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 1835. 10, 12
Evans, R. (2001). Applying machine learning: toward an automatic classification of it. Literary and Linguistic Computing, 16, 45–57. 2, 10, 12, 13, 22, 29, 38, 45
Fernández Soriano, O. & Táboas Baylín, S. (1999). Construcciones impersonales no reflejas. In I. Bosque & V. Demonte, eds., Gramática descriptiva de la lengua española, vol. 2, 1631–1722, Espasa-Calpe, Madrid. 18, 19
Ferrández, A. & Peral, J. (2000). A computational approach to zero-pronouns in Spanish. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL-2000), 166–172. 2, 6, 7, 8, 9, 11, 17, 22, 52, 55
Ferrández, A., Palomar, A. & Moreno, L. (1997). El problema del núcleo del sintagma nominal: ¿elipsis o anáfora? Procesamiento del lenguaje natural, 20, 13–26. 24
Ferrández, A., Palomar, A. & Moreno, L. (1998). Anaphor resolution in unrestricted texts with partial parsing. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (ACL/COLING-98), 385–391. 9
Ferrández, A., Palomar, A. & Moreno, L. (1999). An empirical approach to Spanish anaphora resolution. Machine Translation, 14, 191–216. 9
Fiengo, R. & May, R. (1994). Indices and identity. The MIT Press, Cambridge, MA. 17
Francis, W. (1958). The structure of American English. Ronald Press, New York. 15
Fries, C. (1940). American English grammar. Appleton-Century-Crofts, New York. 15
García Jurado, F. (2007). La etimología como historia de las palabras. E-excellence, Área de Cultura Clásica, Filología Clásica, 39, 1–27. 14
García Losa, E. (2008). Efectividad, operatividad y potenciación del tratamiento en patología fóbica, en el contexto de los servicios especializados de salud mental públicos: la utilización en la sala de consulta de los recursos de Internet. psiquiatria.com, 12. 26
Gómez Torrego, L. (1992). La impersonalidad gramatical: descripción y norma. Arco Libros, Madrid. 17, 18, 19, 23, 25
Grice, H. (1975). Logic and conversation. In P. Cole & J.L. Morgan, eds., Syntax and semantics, vol. 3: Speech Acts, 41–58, Academic Press, New York. 15
Gundel, J., Hedberg, N. & Zacharski, R. (2005). Pronouns without NP antecedents: how do we know when a pronoun is referential? In A. Branco, T. McEnery & R. Mitkov, eds., Anaphora processing: linguistic, cognitive and computational modelling, 351–364, John Benjamins, Amsterdam. 10, 12
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. & Witten, I.H. (2009). The WEKA data mining software: an update. SIGKDD Explorations, 11, 10–18. 41
Halliday, M.A.K. & Hasan, R. (1976). Cohesion in English. Longman, London. 15
Han, N. (2004). Korean null pronouns: classification and annotation. In Proceedings of the Workshop on Discourse Annotation. 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), 33–40. 7
Hernández Terrés, J.M. (1984). La elipsis en la teoría gramatical. Universidad de Murcia, Murcia. 14
Hirano, T., Matsuo, Y. & Kikui, G. (2007). Detecting semantic relations between named entities in text using contextual features. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics. Companion volume proceedings of the demo and poster sessions (ACL-07), 157–160. 2, 7, 8
Hobbs, J. (1977). Resolving pronoun references. Lingua, 44, 311–338. 52
Hoste, V. (2005). Optimization issues in machine learning of coreference resolution. Ph.D. thesis, University of Antwerp. 61
Hu, Q. (2008). A corpus-based study on zero anaphora resolution in Chinese discourse. Ph.D. thesis, City University of Hong Kong. 7, 8
Iida, R., Inui, K. & Matsumoto, Y. (2006). Exploiting syntactic patterns as clues in zero-anaphora resolution. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics and the 21st International Conference on Computational Linguistics (ACL/COLING-06), 625–632. 7, 8
Iida, R., Inui, K. & Matsumoto, Y. (2009). Capturing salience with a trainable cache model for zero-anaphora resolution. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL/AFNLP-09), 647–655. 2, 7, 8
Imamura, K., Saito, K. & Izumi, T. (2009). Discriminative approach to predicate-argument structure analysis with zero-anaphora resolution. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL/AFNLP-09), 85–88. 2, 7, 8
Isozaki, H. & Hirao, T. (2003). Japanese zero pronoun resolution based on ranking rules and machine learning. In Theoretical Issues in Natural Language Processing. Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP-03), 184–191. 7, 8
Järvinen, T. & Tapanainen, P. (1998). Towards an implementable dependency grammar. In A. Polguère & S. Kahane, eds., Proceedings of the Workshop on Processing of Dependency-Based Grammars. 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (ACL/COLING-98), 1–10. 28
Järvinen, T., Laari, M., Lahtinen, T., Paajanen, S., Paljakka, P., Soininen, M. & Tapanainen, P. (2004). Robust language analysis components for practical applications. In Proceedings of the 20th International Conference on Computational Linguistics (COLING-04), 53–56. 28, 29
Kawahara, D. & Kurohashi, S. (2004). Improving Japanese zero pronoun resolution by global word sense disambiguation. In Proceedings of the 20th International Conference on Computational Linguistics (COLING-04), 343–349. 2, 7, 8
Kibrik, A.A. (2004). Zero anaphora vs. zero person marking in Slavic: a chicken/egg dilemma? In Proceedings of the 5th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC-04), 87–90. 2, 7, 8
Kratzer, A. (1998). More structural analogies between pronouns and tenses. In Proceedings of Semantics and Linguistic Theory VIII (SALT-98), Cornell University, Ithaca, NY. 6
Kuno, S. (1972). Functional sentence perspective: a case study from Japanese and English. Linguistic Inquiry, 3, 269–320. 33
Lambrecht, K. (2001). A framework for the analysis of cleft constructions. Linguistics, 39, 463–516. 10, 12
Lancelot, C. & Arnauld, A. ([1660] 1980). Gramática general y razonada. Sociedad General Española de Librería, Madrid. 14
Lappin, S. & Leass, H. (1994). An algorithm for pronominal anaphora resolution. Computational Linguistics, 20, 535–561. 10, 11, 12, 52
Lee, S. & Byron, D. (2004). Semantic resolution of zero and pronoun anaphors in Korean. In Proceedings of the 5th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC-04), 103–108. 2, 7
Lee, S., Byron, D. & Jang, S. (2005). Why is zero marking important in Korean? In R. Dale, K.F. Wong, J. Su & O.Y. Kwong, eds., Natural language processing. Proceedings of the 2nd International Joint Conference on Natural Language Processing (IJCNLP-05), 588–599, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 3651. 7
Ley 29/1998 (1998). Ley 29/1998, de 13 de julio, reguladora de la Jurisdicción Contencioso-administrativa. Boletín Oficial del Estado, 167, 23516–23551. 26
Ley 29/2005 (2005). Ley 29/2005, de 29 de diciembre, de Publicidad y Comunicación Institucional. Boletín Oficial del Estado, 312, 42902–42905. 26
Ley 3/1991 (1991). Ley 3/1991, de 10 de enero, de Competencia Desleal. Boletín Oficial del Estado, 10, 959–962. 26
Ley Orgánica 10/1995 (1995). Ley Orgánica 10/1995, de 23 de noviembre, del Código Penal. Boletín Oficial del Estado, 281, 33987–34058. 26
Ley Orgánica 1/2002 (2002). Ley Orgánica 1/2002, de 22 de marzo, reguladora del Derecho de Asociación. Boletín Oficial del Estado, 73, 11981–11991. 26
Ley Orgánica 6/2001 (2001). Ley Orgánica 6/2001, de 21 de diciembre, de Universidades. Boletín Oficial del Estado, 307, 49400–49425. 26
Li, Y., Musilek, P. & Wyard-Scott, L. (2009). Identification of pleonastic it using the web. Computer Engineering, 34, 339–389. 10, 12
López Ortega, M.A. (2009). El cine como herramienta ilustrativa en la enseñanza de los trastornos de la personalidad. psiquiatria.com, 13. 26
Manning, C. & Schütze, H. (1999). Foundations of statistical natural language processing. The MIT Press, Cambridge, MA. 41
Matsui, T. (1999). Approaches to Japanese zero pronouns: centering and relevance. In D. Cristea, N. Ide & D. Marcu, eds., Proceedings of the Workshop on the Relation of Discourse/Dialogue Structure and Reference. 37th Annual Meeting of the Association for Computational Linguistics (ACL-99), 11–20. 2, 7, 8
Mel'čuk, I. (2003). Levels of dependency in linguistic description: concepts and problems. In Dependency and valency. An international handbook of contemporary research, 188–229, Mouton de Gruyter, Berlin, New York. 17
Mel'čuk, I. (2006). Zero sign in morphology. In Aspects of the theory of morphology, 447–495, Mouton de Gruyter, Berlin, New York. 6, 19
Mendikoetxea, A. (1994). La semántica de la impersonalidad. In C. Sánchez, ed., Las construcciones con se, 239–267, Visor, Madrid. 18
Mendikoetxea, A. (1999). Construcciones con se: medias, pasivas e impersonales. In I. Bosque & V. Demonte, eds., Gramática descriptiva de la lengua española, vol. 2, 1575–1630, Espasa-Calpe, Madrid. 18
Merchant, J. (2001). The syntax of silence. Sluicing, islands and the theory of ellipsis. Oxford University Press, Oxford. 17
Mitkov, R. (1998). Robust pronoun resolution with limited knowledge. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (ACL/COLING-98), 869–875. 12, 52
Mitkov, R. (2001). Outstanding issues in anaphora resolution. In A. Gelbukh, ed., Proceedings of the 2nd International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-01), 110–125, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 2004. 10
Mitkov, R. (2002). Anaphora resolution. Longman, London. 6, 8, 10, 33, 52, 61
Mitkov, R. (2010). Discourse processing. In A. Clark, C. Fox & S. Lappin, eds., The handbook of computational linguistics and natural language processing, 599–629, Wiley Blackwell, Oxford. 2, 5, 10
Mitkov, R., Evans, R. & Orasan, C. (2002). A new, fully automatic version of Mitkov's knowledge-poor pronoun resolution method. In Proceedings of the 3rd International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-02), 69–83, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 2276. 10, 12
Molina López, D. (2008). Y de los hermanos ¿qué? Cómo ayudar a los hermanos de un TLP. psiquiatria.com, 12. 28
Mori, T. & Nakagawa, H. (1996). Zero pronouns and conditionals in Japanese instruction manuals. In Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), 782–787. 7, 8
Müller, C. (2006). Automatic detection of nonreferential it in spoken multi-party dialog. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06), 49–56. 10, 12, 13
Murata, M., Isahara, H. & Nagao, M. (1999). Pronoun resolution in Japanese sentences using surface expressions and examples. In A. Bagga, B. Baldwin & S. Shelton, eds., Proceedings of the Workshop on Coreference and Its Applications. 37th Annual Meeting of the Association for Computational Linguistics (ACL-99), 39–46. 7, 8
Nakagawa, H. (1992). Zero pronouns as experiencer in Japanese discourse. In Proceedings of the 15th International Conference on Computational Linguistics (COLING-92), 324–330. 7, 8
Nakaiwa, H. (1997). Automatic identification of zero pronouns and their antecedents within aligned sentence pairs. In Proceedings of the 3rd Annual Meeting of the Association for Natural Language Processing in Japan (ANLP-97), 127–141. 7, 8
Nakaiwa, H. & Ikehara, S. (1992). Zero pronoun resolution in a Japanese to English machine translation system by using verbal semantic attributes. In Proceedings of the 3rd Conference on Applied Natural Language Processing (ANLP-92), 201–208. 7, 8
Nakaiwa, H. & Shirai, S. (1996). Anaphora resolution of Japanese zero pronouns with deictic reference. In Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), 812–817. 7, 8
Ng, V. & Cardie, C. (2002). Identifying anaphoric and non-anaphoric noun phrases to improve coreference resolution. In Proceedings of the 19th International Conference on Computational Linguistics (COLING-02), 1–7. 10, 12
Nomoto, T. & Yoshihiko, N. (1993). Resolving zero anaphora in Japanese. In Proceedings of the 6th Conference of the European Chapter of the Association for Computational Linguistics (EACL-93), 315–321. 7, 8
Okumura, M. & Tamura, K. (1996). Zero pronoun resolution in Japanese discourse based on centering theory. In Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), 871–876. 1, 7
Paice, C.D. & Husk, G.D. (1987). Towards an automatic recognition of anaphoric features in English text: the impersonal pronoun it. Computer Speech and Language, 2, 109–132. 10, 11, 12
Peng, J. & Araki, K. (2007a). Zero anaphora resolution in Chinese and its application in Chinese-English machine translation. In Z. Kedad, N. Lammari, E. Métais, F. Meziane & Y. Rezgui, eds., Natural language processing and information systems. Proceedings of the 12th International Conference on Applications of Natural Language to Information Systems (NLDB-07), 364–375, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 4592. 7
Peng, J. & Araki, K. (2007b). Zero-anaphora resolution in Chinese using maximum entropy. IEICE Transactions on Information and Systems, E90-D, 1092–1102. 7, 8
Peral, J. (2002). Resolución y generación de la anáfora nominal en español e inglés en un sistema de traducción automática. Procesamiento del lenguaje natural, 28, 127–128. 7, 8
Peral, J. & Ferrández, A. (2000). Generation of Spanish zero-pronouns into English. In D.N. Christodoulakis, ed., Natural Language Processing - NLP 2000. Proceedings of the 2nd International Conference on Natural Language Processing (NLP-2000), 252–260, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 1835. 2, 7, 8
Pintor García, M. (2007). Análisis factorial de las actitudes personales en educación secundaria. Un estudio empírico en la Comunidad de Madrid. psiquiatria.com, 11. 28
Pollard, C. & Sag, I. (1994). Head-Driven Phrase Structure Grammar. CSLI Publications, Stanford, CA. 19
Real Academia Española (1977). Esbozo de una nueva gramática de la lengua española. Espasa-Calpe, Madrid. 19
Real Academia Española (2001). Diccionario de la lengua española. Espasa-Calpe, Madrid, 22nd edn. 15, 40, 41
Real Academia Española (2009). Nueva gramática de la lengua española. Espasa-Calpe, Madrid. ix, 6, 15, 16, 17, 18, 19, 22, 23, 24, 25, 33, 34
Recasens, M. & Hovy, E. (2009). A deeper look into features for coreference resolution. In L.D. Sobha, A. Branco & R. Mitkov, eds., Anaphora Processing and Applications. Proceedings of the 7th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC-09), 29–42, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 5847. 2, 6, 11
Rello, L. & Illisei, I. (2009a). A comparative study of Spanish zero pronoun distribution. In Proceedings of the International Symposium on Data and Sense Mining, Machine Translation and Controlled Languages, and their application to emergencies and safety critical domains (ISMTCL-09), 209–214, Presses Universitaires de Franche-Comté, Besançon. 3, 7
Rello, L. & Illisei, I. (2009b). A rule-based approach to the identification of Spanish zero pronouns. In Student Research Workshop. International Conference on Recent Advances in Natural Language Processing (RANLP-09), 209–214. 3, 7, 8, 9, 10, 11, 22, 35
Rello, L., Baeza-Yates, R. & Mitkov, R. (2010a). Improved subject ellipsis detection in Spanish. Submitted. 3
Rello, L., Suárez, P. & Mitkov, R. (2010b). A machine learning method for identifying non-referential impersonal sentences and zero pronouns in Spanish. Procesamiento del Lenguaje Natural, 45, 281–287. 3
Ross, J. (1967). Constraints on variables in syntax. Ph.D. thesis, Massachusetts Institute of Technology. 17
Sánchez de las Brozas, F. ([1562] 1976). Minerva. De la propiedad de la lengua latina. Cátedra, Madrid. 14
Sasano, R., Kawahara, D. & Kurohashi, S. (2008). A fully-lexicalized probabilistic model for Japanese zero anaphora resolution. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING-08), 769–776. 7, 8
Seco, M. (1988). Manual de gramática española. Aguilar, Madrid. 19
Seki, K., Fujii, A. & Ishikawa, T. (2002). A probabilistic method for analyzing Japanese anaphora integrating zero pronoun detection and resolution. In Proceedings of the 19th International Conference on Computational Linguistics (COLING-02), 911–917. 7, 8
Sevillano Arroyo, M.A. & Ducret Rossier, F.E. (2008). Las emociones en la psiquiatría. psiquiatria.com, 12. 26
Shopen, T. (1973). Ellipsis as grammatical indeterminacy. Foundations of Language, 10, 65–77. 15
Steinberger, J., Poesio, M., Kabadjov, M.A. & Ježek, K. (2007). Two uses of anaphora resolution in summarization. Information Processing and Management, 43, 1663–1680. 2, 7
Streb, J., Hennighausen, E. & Rösler, F. (2004). Different anaphoric expressions are investigated by event-related brain potentials. Journal of Psycholinguistic Research, 33, 175–201. 15
Takada, S. & Doi, N. (1994). Centering in Japanese: a step towards better interpretation of pronouns and zero-pronouns. In Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 1151–1156. 7, 8
Tanaka, I. (2000). Cataphoric personal pronouns in English news reportage. In Proceedings of the 3rd Discourse Anaphora and Anaphor Resolution Colloquium (DAARC-2000), 108–117. 33, 34
Tapanainen, P. (1996). The constraint grammar parser CG-2. Department of General Linguistics, University of Helsinki, Publications, Vol. 27. 28
Tapanainen, P. & Järvinen, T. (1997). A non-projective dependency parser. In Proceedings of the 5th Conference on Applied Natural Language Processing (ANLP-97), 64–71. 13, 28
Tesnière, L. (1959). Éléments de syntaxe. Klincksieck, Paris. 28
Theune, M., Hielkema, F. & Hendriks, P. (2006). Performing aggregation and ellipsis using discourse structures. In Research on Language & Computation, vol. 4, 353–375, Springer, Berlin, Heidelberg, New York. 7
Wilder, C. (1997). Some properties of ellipsis in coordination. In Studies in universal grammar and typological variation, 59–107, John Benjamins, Amsterdam. 17
Witten, I.H. & Frank, E. (2005). Data mining: practical machine learning tools and techniques. Morgan Kaufmann, London, 2nd edn. 26, 38, 41, 44, 46
Yeh, C. & Chen, Y. (2003a). Using zero anaphora resolution to improve text categorization. In Proceedings of the 17th Pacific Asia Conference on Language, Information and Computation (PACLIC-03), 423–430. 2, 7, 8
Yeh, C. & Chen, Y. (2003b). Zero anaphora resolution in Chinese with partial parsing based on centering theory. In Proceedings of the International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE-03), 683–688. 7, 8
Yeh, C. & Chen, Y. (2007). Topic identification in Chinese based on centering model. Journal of Chinese Language and Computing, 17, 83–96. 2, 7, 8
Yeh, C. & Mellish, C. (1997). An empirical study on the generation of zero anaphors in Chinese. Computational Linguistics, 23, 171–190. 7, 8
Yoshimoto, K. (1988). Identifying zero pronouns in Japanese dialogue. In Proceedings of the 12th International Conference on Computational Linguistics (COLING-88), 779–784. 7, 8
Zhao, S. & Ng, H. (2007). Identification and resolution of Chinese zero pronouns: a machine learning approach. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP/CoNLL-07), 541–550. 2, 7, 8