Elliphant: A Machine Learning Method for Identifying Subject Ellipsis and Impersonal Constructions in Spanish Luz Rello Main advisor: Ruslan Mitkov Co-advisor: Xavier Blanco A thesis submitted for the degree of Erasmus Mundus International Master in Natural Language Processing and Human Language Technology Research Group in Computational Linguistics Laboratori fLexSem University of Wolverhampton Universitat Autònoma de Barcelona June 2010
Elliphant:
A Machine Learning Method
for Identifying Subject Ellipsis
and Impersonal Constructions
in Spanish
Luz Rello
Main advisor: Ruslan Mitkov
Co-advisor: Xavier Blanco
A thesis submitted for the degree of Erasmus Mundus International Master in
Natural Language Processing and Human Language Technology
Research Group in Computational Linguistics Laboratori fLexSem
University of Wolverhampton Universitat Autònoma de Barcelona
1 In this work, explicit subjects in the examples are presented in italics, zero pronouns in the
examples are represented by the symbol Ø, while in the English translations the subjects which are
elided in Spanish are marked with parentheses. Impersonal constructions in the examples are not
explicitly indicated using a symbol (see Section 3.1).
2 The other two reasons given for the low success rate in the identification of verbs with no subject
are the lack of semantic information and the inaccuracy of the grammar used (Ferrández & Peral,
2000).
2. Related Work 2.1 NLP Approaches
– Machine learning approaches (Bergsma et al., 2008; Boyd et al., 2005; Clemente
et al., 2004; Evans, 2000, 2001; Mitkov et al., 2002; Muller, 2006; Ng & Cardie,
2002);
– Web-based approach (Li et al., 2009); and
– Descriptive studies from contextual (Lambrecht, 2001) and intonational points of
view (Gundel et al., 2005).
Paice & Husk (1987) introduce a rule-based method for identifying non-referential
it, while Lappin & Leass (1994) and Denber (1998) describe rule-based components of
their pronoun resolution systems which detect non-referential uses of it. Mitkov’s first
anaphora resolution algorithm did not incorporate an approach for detecting pleonastic
it (Mitkov, 1998), while more recent versions of MARS (Mitkov’s Anaphora Resolution
System) use the machine learning system of Evans (2001) to detect pleonastic it (Mitkov
et al., 2002).
Instance-based learning approaches are used for identifying pleonastic it in English,
while the only approach for the identification of expletive pronouns in French employs
a rule-based methodology (Danlos, 2005).
Evans (2001)1 describes the first attempt using a machine learning method to clas-
sify pleonastic it into seven types while Boyd et al. (2005) present a linguistically
motivated classification of non-referential it into four types.
A comparison replicating the approaches developed by Paice & Husk (1987) and
Evans (2001) with the system implemented by Boyd et al. (2005) corroborates the
finding that machine learning outperforms rule-based approaches (Boyd et al., 2005).
Further, it is pointed out that rule-based methods are limited due to their reliance on
lists of verbs and adjectives commonly used in the patterns that they exploit, which
can make them less portable and more difficult to adapt to new texts. Nevertheless, the
basic grammatical patterns are still reasonably consistent indicators of non-referential
occurrences of it (Boyd et al., 2005).
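The dependence of rule-based methods on word lists can be illustrated with a minimal sketch. The two patterns and both word lists below are hypothetical and deliberately tiny; they are not the rules of Paice & Husk (1987) or of any other cited system, and only show why coverage is bounded by the lists.

```python
import re

# Hypothetical word lists; real rule-based systems rely on much larger, hand-built lists.
MODAL_ADJECTIVES = {"important", "necessary", "likely", "possible", "clear"}
WEATHER_VERBS = {"rains", "snows", "raining", "snowing"}

# "it is/was/seems ADJ that/to ..." and "it (is/was) WEATHER-VERB"
ADJ_PATTERN = re.compile(r"\bit\s+(?:is|was|seems)\s+(\w+)\s+(?:that|to)\b", re.IGNORECASE)
WEATHER_PATTERN = re.compile(r"\bit\s+(?:is\s+|was\s+)?(\w+)\b", re.IGNORECASE)


def is_pleonastic_it(sentence: str) -> bool:
    """Return True if an occurrence of 'it' matches one of the illustrative patterns."""
    adjective = ADJ_PATTERN.search(sentence)
    if adjective and adjective.group(1).lower() in MODAL_ADJECTIVES:
        return True
    for match in WEATHER_PATTERN.finditer(sentence):
        if match.group(1).lower() in WEATHER_VERBS:
            return True
    return False
```

With these lists, "It is important to note the result." is flagged, while "It is evident that he left." is missed simply because "evident" is absent from the adjective list, which is the portability problem described above.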
Certain aspects of the work described in this dissertation were inspired by the
methodology of the machine learning approaches for the identification of pleonastic it,
specifically those of Evans (2001) and Boyd et al. (2005).
1 This method is currently incorporated as a component of MARS (Mitkov et al., 2002).
Because non-referential zero pronouns are not very common1, the size of our corpus
was increased in order to obtain a sufficient number of instances for each class. The
training data exploited by the Elliphant system contains 6,827 instances, of which 179
are non-referential examples. In Evans (2001), 3,171 instances of it were classified into
seven classes, while in Boyd et al. (2005) 2,337 examples were classified into four classes.
Our corpus was analyzed, as in the approach described by Evans (2001), using a
functional dependency parser, Connexor’s Machinese Syntax2 (Connexor Oy, 2006b;
Tapanainen & Järvinen, 1997). Moreover, some of the features used in the Elliphant
system, such as the lemmas and the parts of speech (POS) of the preceding and
following material, were also implemented in the approach of Evans (2001).
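As a rough illustration of what features based on the lemmas and parts of speech of the surrounding material might look like, the sketch below collects them from a window around a verb. The `Token` structure, the feature names and the window size are assumptions made for this example; they do not reproduce the output format of Machinese Syntax or the actual Elliphant feature set.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str
    lemma: str
    pos: str


def context_features(tokens: list[Token], verb_index: int, window: int = 2) -> dict[str, str]:
    """Collect lemma and POS of the tokens around the verb at verb_index."""
    features = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # skip the verb itself
        position = verb_index + offset
        inside = 0 <= position < len(tokens)
        side = "prev" if offset < 0 else "next"
        key = f"{side}{abs(offset)}"
        # Positions outside the sentence receive a placeholder value.
        features[f"{key}_lemma"] = tokens[position].lemma if inside else "NONE"
        features[f"{key}_pos"] = tokens[position].pos if inside else "NONE"
    return features


# "Ø No vendrán" ((They) won't come): the verb is the last token.
sentence = [Token("No", "no", "ADV"), Token("vendrán", "venir", "V")]
features = context_features(sentence, verb_index=1, window=1)
```

Here the verb has an adverb to its left and nothing to its right, so the right-hand features fall back to the placeholder.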
In contrast to previous work, the K* algorithm (Cleary & Trigg, 1995) was found
to provide the most accurate classification in the current study. Other approaches have
employed various classification algorithms, including K-nearest neighbors in TiMBL
(Boyd et al., 2005; Evans, 2001) and JRip in Weka (Muller, 2006).
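The classifiers mentioned here are instance-based: a new instance is labelled by comparing it with stored training instances. The sketch below shows the general idea with a plain overlap distance and a majority vote. It is neither K*, which uses an entropy-based distance (Cleary & Trigg, 1995), nor the TiMBL implementation, and the toy instances and feature values are invented.

```python
from collections import Counter


def overlap_distance(a: tuple, b: tuple) -> int:
    """Number of feature positions at which two instances differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def classify(instance: tuple, training: list[tuple[tuple, str]], k: int = 3) -> str:
    """Majority label among the k stored instances closest to `instance`."""
    nearest = sorted(training, key=lambda item: overlap_distance(instance, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy instances: (verb lemma, has 'se' particle, POS of previous token). Values are invented.
training = [
    (("llover", "no", "NONE"), "impersonal"),
    (("nevar", "no", "NONE"), "impersonal"),
    (("venir", "no", "ADV"), "zero_pronoun"),
    (("decir", "no", "NONE"), "zero_pronoun"),
    (("ser", "no", "N"), "explicit"),
    (("tener", "no", "N"), "explicit"),
]
prediction = classify(("establecer", "no", "N"), training, k=3)
```

The unseen verb is assigned the label of the stored instances it shares the most feature values with, here those preceded by a noun.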
2.2 Linguistic Approaches
Literature related to ellipsis in linguistic theory has served as one basis for establishing
the linguistically motivated classes and the annotation criteria in the current work. The
linguistically related work on this topic is focused on the definition and description of
the use of ellipsis in natural language and the limits of that use.
In Spanish, the use of ellipsis is very widespread. It is a phenomenon that occurs
in a wide range of contexts and is therefore much discussed in the field of linguistics.
To illustrate, some controversial topics in linguistics that pertain to instances of ellipsis
found in our corpus include: the establishment of different types of ellipsis, the identifi-
cation of impersonal sentences (non-referential expressions), the definition of particular
syntactic categories which can function as subjects, and the intricate differentiation of
reflex passive with elliptic subject from impersonal sentences in different varieties of
Spanish.
The concepts used in both types of literature (NLP and linguistic) to distinguish
different types of ellipsis and zero signs are extremely broad and are well debated in
the linguistic literature. Elements of the elliptic typology used in this work which were
derived from the literature are stated next, while the linguistic and formal criteria used
to identify the chosen classes, which served as the basis for the corpus annotation,
including a typology of the examples found, are explained in Sections 3.1.1, 3.1.2, 3.1.3
and 3.2.2.
1 Only 3% of the verbs found in our corpus (see Section 3.2.1) have non-referential elliptic subjects.
2 http://www.connexor.eu/technology/machinese/demo/syntax/.
2.2.1 Linguistic Approaches to Subject Ellipsis
The study of the omission of some element from the sentence or the discourse in natural
language has been a challenge not only in computing but also in Spanish linguistics
itself –from the Renaissance period through to the present day.
The first Western grammarian to treat ellipsis as a grammatical phenomenon
(Hernández Terrés, 1984) was Francisco Sánchez de las Brozas, El Brocense (1523-1600)
(Sánchez de las Brozas, [1562] 1976, p. 317), who took the concept of ellipsis
from Apolonio Díscolo (Díscolo, [2nd century] 1987) and defined it as:
“La elipsis es la falta de una palabra o de varias en una construcción correcta [...].”
“Ellipsis is the omission of one or more items from a correct construction [...].”
This conception, in which grammar serves as a basis for a rational explanation of the
surface form of the language:
“No hay, pues, ninguna duda de que se debe buscar la explicación racional de
las cosas, también de las palabras.” Sánchez de las Brozas ([1562] 1976) cited in
García Jurado (2007, p. 12)
“There is, then, no doubt that a rational explanation must be sought for things, and
also for words.”
later inspired the rational grammar of Port-Royal (Lancelot & Arnauld, [1660] 1980)
which was a precursor of Chomsky’s work (Chomsky, [1968] 2006, p. 5):
“One, particularly crucial in the present context, is the very great interest in the
potentialities and capacities of automata, a problem that intrigued the seventeenth-
century mind as fully as it does our own. [...] A similar realisation lies at the base
of Cartesian philosophy.”
In order to elide something, a meaning which is not expressed needs to be assumed. It
thus follows that ellipsis was one of the basic mechanisms used to explain the transition
from D-structure to S-structure, and it became a central issue (Brucart, 1987) in generative
grammar from its original model, the Standard Theory (Chomsky, 1965), to its latest
revisions (Chomsky, 1995).
Different branches of linguistics have considered ellipsis from different points of
view:
– Semantic: traditionally, the criteria used to define ellipsis were semantic or logical
(Bello, [1847] 1981) and prescriptive (Real Academia Espanola, 2001);
– Descriptive and explicative: (Brucart, 1999);
– Distributional: although structuralism rejected the study of units which were not
codified in the signifier or phonetic realization, some classifications of ellipsis were
presented (Francis, 1958; Fries, 1940);
– Pragmatic: in diverse pragmatic paradigms the role of ellipsis is crucial as it
influences the interpretation of text. As a result it has given rise to several lines
of investigation such as implications through ellipsis (Grice, 1975), ellipsis studied
as a factor to activate textual coherence (Halliday & Hasan, 1976), or indefinite
ellipsis in which a word can stand for one or more sentences in a restrictive code
(Shopen, 1973); and
– Cognitive: in terms of ellipsis processing by the brain (Streb et al., 2004, p. 175):
“Ellipses and pronouns/proper names are processed by distinct mechanisms
being implemented in distinct cortical cell assemblies.”
or as part of the explanation of the language faculty (Chomsky, 1965).
The terminology and linguistic explanations relevant for this work consider both zero
pronouns and non-referential expressions to be different types of ellipsis (Brucart, 1999).
Four kinds of Spanish subject ellipsis are distinguished (Brucart, 1999, p. 2851).
This classification is presented in correlation with a verb classification (Real Academia
Espanola, 2009), which is related to the omitted subject classification presented in
Bosque (1989).
The classification of Spanish omitted subjects presented in Bosque (1989) is: omit-
ted subjects from finite verbs, which can be referential and non-referential and omitted
subjects from non-finite verbs which can be argumental and non-argumental. The ar-
gumental omitted subjects can in turn be referential and non-referential. In this study
non-argumental omitted subjects are claimed not to exist (Bosque, 1989), although in
Brucart (1999), non-argumental omitted subjects are considered a type of ellipsis (Type
4 in Figure 2.1).
Types of subject ellipsis (Brucart, 1999):
(1) Omitted subject in a clause containing a finite verb:
Ø No vendrán
(They) won’t come
Ø Dicen que vendrá
(They) say he will come / (It is) said he will come
(2) Argumental impersonal subject:
En este estudio Ø se trabaja bien.
In this room (one) can work properly.
(3) Non-argumental impersonal subject:
Ø Nieva
(It) is snowing
(4) Omitted subject in a non-finite verb clause:
Juan intentaba (Ø decírselo a María).
John tried ((John) to tell Mary).
Types of verbs depending on their subject (Real Academia Española, 2009):
– Verb with no argumental subject
– Verb with argumental omitted subject which is represented by the pronoun se
– Verb with argumental omitted subject with an unspecific interpretation
– Verb with argumental omitted subject with a specific interpretation
Figure 2.1: Types of subject ellipsis (Brucart, 1999) and types of verbs (Real Academia
Española, 2009).
The first type of ellipsis (see (1) in Figure 2.1) represents omitted subjects and
corresponds to zero pronouns in the NLP literature. An omitted subject is the result
of nominal ellipsis, where a lexical element that is not phonetically or orthographically
realized (the omitted subject), and which is needed for the interpretation of the meaning
and the structure of the sentence, is omitted since it can be retrieved from its context
(Brucart, 1999). Despite their lack of phonetic realization, omitted subjects are part of
the clause (Real Academia Española, 2009).
Two types of syntactic ellipsis or lexical-syntactic ellipsis can be distinguished:
verbal ellipsis and nominal ellipsis. These types of subject ellipsis can affect the whole
argument of the verb or be partial and just affect the head of the argument (Brucart,
1999). As detailed in Section 3.2.2, the annotation of our corpus includes both complete
noun phrase ellipsis and noun phrase head ellipsis. Note that nominal ellipsis affects
not only subjects but also the other arguments of the verb –datives, direct
objects or infinitive objects– although their ellipsis is subject to more restrictive conditions
(Brucart, 1999). However, this fact is not acknowledged in some prior approaches in
NLP (Ferrández & Peral, 2000, p. 166):
“While in other languages, zero-pronouns may appear in either the subject’s or
the object’s grammatical position, (e.g. Japanese), in Spanish texts, zero-pronouns
only appear in the position of the subject.”
The interpretation of Type 1 ellipsis can be definite and specific (Brucart, 1999)
or indefinite (Real Academia Española, 2009). Since omitted subjects are referential,
they can be lexically retrieved (Gómez Torrego, 1992). An example of an omitted subject
is:
(d) Las leyes no tendrán efecto retroactivo si Ø no dispusieren lo contrario.
Laws will not have retroactive effect unless (they) specify otherwise.
The nature of the omitted subject [Ø] itself has been discussed in the linguistic literature
(Real Academia Espanola, 2009). While recent approaches in linguistics agree that the
omitted subject has a pronominal nature (elided pronoun), others contend that the
subject is expressed in the morphology of the verb inflection.
In Generative Grammar, subject ellipsis has been understood (1) as a pro-form
(Beavers & Sag, 2004; Chung et al., 1995; Fiengo & May, 1994; Wilder, 1997) or (2) as
a syntactic realization without a phonetic constituent (Merchant, 2001; Ross, 1967).
The Meaning-Text Theory (MTT) contends that ellipsis occurs in the SSyntS (surface
syntax) when the elliptic element is deleted during the transition from SSyntS to
DMorphS (deep morphology) (or vice versa) and an empty node stands in for the
representation of the elliptic element. This procedure for treating ellipsis is also proposed
in MTT for the description of all coordinate structures (Mel’čuk, 2003).
The identification of omitted subjects is not problematic when the zero pronoun
belongs to the first or second person. When it is a third person omitted subject, however,
the reference can be anaphoric or cataphoric (Type 1 ellipsis in Figure 2.1) or non-specific1.
A generic or non-specific interpretation can arise in some clauses with singular second
person and plural third person zero pronouns (Real Academia Española, 2009).
However, depending on discourse knowledge, clauses which are formally identical can
alternate between specific and non-specific interpretations, as the next example
shows:
(e) Ø Me han regalado un reloj. (In this example both interpretations, specific and non-
specific, are possible.)
(1) (They) gave me a watch. (When the agent referred to by “they” has been mentioned
previously in the discourse.)
(2) (I) was given a watch. (When no agent has been mentioned previously in the dis-
course.)
where the non-specific interpretation does not exclude a possible specific one (Real
Academia Espanola, 2009). Therefore, both groups of argumental subjects with specific
and non-specific interpretations are included in the same class.
2.2.2 Linguistic Approaches to Non-referential Ellipsis
On the other hand, Type 2 and Type 3 ellipsis listed in Figure 2.1 correspond to
non-referential expressions or impersonal sentences. Type 2 ellipsis is composed of
impersonal sentences containing the Spanish particle se, whose argumental omitted
subject always has an unspecific interpretation and is referred to using the pronoun se
(Mendikoetxea, 1994). Type 3 ellipsis corresponds to the set of sentences called impersonal
sentences. Although the types of impersonal constructions in Spanish are heterogeneous,
all of them lack some properties of the subject (Fernández Soriano
& Táboas Baylín, 1999). Some studies consider different kinds of Spanish impersonality,
e.g. semantic and syntactic impersonality (Gómez Torrego, 1992), while others
distinguish several semantic degrees of impersonality (Mendikoetxea, 1999).
1 In journalistic headlines with an omitted subject, a non-specific interpretation can occur (Bosque,
1989) even in non-pro-drop languages such as English, French or German (Real Academia Española,
2009). Such non-specific interpretations can occur when the antecedent or referent was not previously
mentioned in the discourse.
Traditionally –from a semantic point of view– impersonal sentences have been considered
to be those which cannot contain a subject, the agent of the action described
(Real Academia Española, 1977). This impersonality can be due either to the nature
of the verb,
(f) Llueve.
(It) rains.
or due to the speaker’s ignorance of the subject (Seco, 1988):
(g) Llaman a la puerta.
(Someone) is knocking at the door.
where the subject is unidentified and it is therefore impossible to assign a reference to
it (Bello, [1847] 1981).
The controversy of treating non-referential expressions as a type of ellipsis, given
that they cannot be lexically retrieved, has already been discussed (Gomez Torrego,
1992). While Brucart (1999) considers them a case of ellipsis, as do some Generative
Grammar approaches1, others (Bosque, 1989; Mel’cuk, 2006)2 consider that such elliptic
and non-referential subjects do not exist in language.
A descriptive point of view (Fernandez Soriano & Taboas Baylın, 1999) would
regard impersonal sentences as belonging to either of two main groups (1) impersonal
sentences without a subject and (2) cases of impersonal verbs with the inherent feature
of not having a subject.
In the current dissertation, a prescriptive and descriptive approach (Real Academia
Espanola, 2009) to the consideration of impersonal sentences is taken (See Section
3.1.3).
Type 4 ellipsis (Brucart, 1999) in Figure 2.1 is not considered in our work. However, this
fourth type is much debated in the literature; for example, Head-Driven Phrase Structure
Grammar does not consider the infinitive subject as a null category (slash), nor do
Pollard and Sag in their work (Pollard & Sag, 1994).
1 Generative Grammar explains these impersonal sentences by labeling the absence of the subject
with a pro-form which presents the same syntactic features as the subject although it has no
phonological realization. Following the Extended Projection Principle this pro-form embodies all the
syntactic requirements of a subject except for its phonological realization (Chomsky, 1981).
2 MTT uses the concept of the zero sign to characterize elements whose signifier is empty and is by
no means realized as a perceptible phonetic pause (Mel’čuk, 2006).
Chapter 3
Detecting Ellipsis in Spanish
This chapter describes the methodology used in this study. The first step is to create
a linguistically motivated classification system (Section 3.1) for all instances of elliptic
and non-elliptic as well as referential and non-referential subjects. Second, since the machine
learning method requires training data, a corpus (the ESZIC Corpus) was compiled
(see Section 3.2.1), and a purpose-built tool for its annotation was developed, as were
annotation guidelines (see Section 3.2.2). The third task consisted of implementing a method
to extract the features (Section 3.2.3) of instances from the corpus and create training
data (ESZIC training data; see Section 3.2.4). Finally, once the features of instances
are derived from a document, they are exploited for classification by machine learning
using the Weka package (Section 3.2.5).
3.1 Classification
The first step is to create a classification system for all instances of subjects and
impersonal constructions. Subjects were characterized along two dimensions: elliptic
versus non-elliptic, and referential versus non-referential. Combining these two
dimensions results in a ternary classification:
(1) Explicit subjects: non-elliptic and referential1;
(2) Zero pronouns: elliptic and referential2; and
1 Explicit subjects in the examples are presented in italics.
2 Zero pronouns in the examples are represented by the symbol Ø. In the English translations the
subjects which are elided in Spanish are marked with parentheses.
(3) Impersonal constructions: elliptic and non-referential1.
A subject can be non-elliptic (explicit) or elliptic (omitted subject or zero pronoun).
A sign can be referential or non-referential. The distinction lies in the fact that, while
the former can be lexically retrieved, the latter cannot (impersonal construction).
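The way the two binary properties combine into three classes can be written down as a small sketch. The names `SubjectClass` and `subject_class` are introduced here for illustration only; the fourth combination (non-elliptic and non-referential) has no class in the scheme described above, so the sketch rejects it.

```python
from enum import Enum


class SubjectClass(Enum):
    EXPLICIT_SUBJECT = "explicit subject"
    ZERO_PRONOUN = "zero pronoun"
    IMPERSONAL_CONSTRUCTION = "impersonal construction"


def subject_class(elliptic: bool, referential: bool) -> SubjectClass:
    """Map the two binary properties onto the ternary classification."""
    if not elliptic and referential:
        return SubjectClass.EXPLICIT_SUBJECT
    if elliptic and referential:
        return SubjectClass.ZERO_PRONOUN
    if elliptic and not referential:
        return SubjectClass.IMPERSONAL_CONSTRUCTION
    # Non-elliptic and non-referential: not one of the three classes listed above.
    raise ValueError("no class is defined for non-elliptic, non-referential subjects")
```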
This treatment of the classification as ternary differs from previous work, whose
division of subjects was binary: elliptic (zero pronoun) and non-elliptic, both referential
(Ferrández & Peral, 2000; Rello & Illisei, 2009b) (see Section 2.1.1). In Evans (2001),
the sevenfold classification of pleonastic it is based on the type of referent, while in
Boyd et al. (2005) classification follows syntactic and semantic criteria (see Section
2.1.2).
In the following sections, each class is described. With regard to cases in which
classification can be controversial, different annotation criteria were applied (see Section
3.2.2).
3.1.1 Explicit Subjects: Non-elliptic and Referential
This class is the one to which explicit subjects belong. They are phonetically realised,
usually by a nominal group: noun, pronoun, noun phrase (a), free relatives, semi-free
relatives, substantival adjectives (Real Academia Espanola, 2009).
(a) Las fuentes del ordenamiento jurídico español son la ley, la costumbre y los principios
generales del derecho.
The sources of the Spanish legal system are the law, custom and the general
principles of law2.
The syntactic position of subjects can be pre-verbal or post-verbal. The occurrence
of post-verbal subjects is subject to certain conditions (Real Academia Española,
2009).
(b) Carecerán de validez las disposiciones que contradigan otra de rango superior.
Provisions which contradict another of higher rank will not be valid.
1 Impersonal constructions in the examples are not explicitly indicated using a symbol.
2 Unless otherwise specified, all the examples provided are taken from our corpus (Section 3.2.1).
Post-verbal subjects, as well as preverbal ones, are also found in passive construc-
tions and passive reflex constructions. As in active clauses, preverbal subjects without
a definite article are rare while post-verbal subjects without a definite article are more
frequent (Real Academia Espanola, 2009).
Projections of non-nominal categories such as clauses containing an infinitive or
a conjugated verb, interrogative indirect clauses, or indirect exclamative clauses, can
function as subjects (Real Academia Espanola, 2009).
(c) Corresponde a los poderes públicos promover las condiciones para que la libertad y la
igualdad del individuo y de los grupos en que se integra sean reales y efectivas.
It is the responsibility of the public authorities to promote the conditions for the liberty
and equality of individuals and of the groups to which they belong to be real and effective.
3.1.2 Zero Pronouns: Elliptic and Referential
Class 2 is formed by elliptic but referential subjects called zero pronouns. An elliptic
subject is the result of a nominal ellipsis, where a non-phonetically realised lexical
element –elliptic subject– which is needed for the interpretation of the meaning and
the structure of the sentence, is omitted since it can be retrieved from its context (Brucart,
1999). Despite their lack of phonetic realisation, elliptic subjects are considered part
of the clause (Real Academia Espanola, 2009).
(d) La Constitución Española_i (title in text)
Ø_i Fue refrendada por el pueblo español el 6 de diciembre de 1978.
The Spanish Constitution_i (title in text)
(It)_i was countersigned by the Spanish population on the 6th of December of 1978.
Elliptic subjects are considered to be a personal pronoun variant which is not pho-
netically realised (Real Academia Espanola, 2009). Where referential, they can be
lexically retrieved (Gomez Torrego, 1992). That is to say that they can be substituted
by explicit pronouns without changing or losing any of the meaning of the clauses in
which they occur.
The elision of the subject can affect not only the noun head, but also the entire
noun phrase (Brucart, 1999). The noun head can be omitted in Spanish when the
subject of which it is a part fulfills some structural requirements (Brucart, 1999). This
includes cases in which the subject is referential (Brucart, 1999). The processing of
these subjects has been addressed by the development of specific algorithms in previous
work (Ferrández et al., 1997).
Ellipsis of the head of the noun phrase is only possible when a definite article occurs.
(e) El Ø que está obsesionado con que todo el mundo piensa mal es Javier.
The (one) who is obsessed with the idea that everyone thinks badly is Javier.
The article possesses a referential value which could be either anaphoric or cataphoric
(Real Academia Espanola, 2009). Such examples of subjects with an elided head are
instances of semi-free relatives (Real Academia Espanola, 2009) and, as expected, they
are not as frequent in our corpus as elisions of the entire subject noun phrase.
3.1.3 Impersonal Constructions: Elliptic and Non-referential
According to Bosque (1989), subjects that are both non-referential and elliptic do not
exist: impersonal constructions simply have no subject1.
Clauses containing zero pronouns and clauses containing impersonal constructions are
superficially similar. Class 3 is composed of impersonal constructions, which are formed by (1)
impersonal and (2) reflex impersonal clauses (impersonal clauses with se).
Impersonal clauses have no argumental subject. Since the subject does not exist,
it cannot be lexically retrieved by any means and no phonetic realisation of it can be
expected (Bosque, 1989). The following cases are considered to be impersonal sentences:
– Non-reflex impersonal clauses with verbs haber (to be), hacer (to do), ser (to
be), estar (to be)2, ir (to go) and dar (to give):
1 The existence of a non-phonetically realised element in subject position is postulated (see Section
2.2). While Generative Grammar defends their existence (pro-form), MTT does not (zero sign).
2 Depending on the verbal aspect, there are different Spanish verbs which correspond to the
English verb to be.
(g) En un kilogramo de gas hay tanta materia como en un kilogramo de sólido.
In a kilogram of gas (there) is the same amount of mass as in a kilogram of solid.
(Existential use of the verb haber).
– Non-reflex impersonal clauses with other verbs such as sobrar con (to be too
much), bastar con (to be enough) or faltar con (to lack), or pronominal
unipersonal verbs1 with a zero subject such as tratarse de (to be about):
(h) Deberán adoptar las precauciones necesarias para su seguridad, especialmente cuando
se trate de niños.
Necessary measures should be taken, especially when (it) is about children.
(i) Basta con tres sesiones.
(It) is enough with three sessions.
Verbs in such impersonal sentences (Gómez Torrego, 1992) are called lexical impersonal
verbs (Real Academia Española, 2009). Due to their lack of subject they are not easily
distinguished from verbs with omitted –but existing– subjects.
Secondly, reflex impersonal clauses have an omitted subject whose reference is non-
specific and cannot be lexically retrieved.
(j) Se estará a lo que establece el apartado siguiente.
(One) shall abide by what is established in the next section.
These clauses are formed with the particle se. This particle also serves other syntactic
functions (reflexive pronoun, pronominal pronoun, reciprocal pronoun, etc.) in clauses
with an elided subject.
3.2 Machine Learning Approach
Our corpus was compiled and parsed in order to create training data (referred to as the
ESZIC training data) for use by a machine learning classification method, as explained
in the next section.
1 A verb which is only conjugated in the third person.
A tool was developed for annotation of the corpus (see Section 3.2.2). Fourteen
features were proposed for the purpose of classifying instances of subjects (see Section
3.2.3). The feature vectors, together with their manual classifications, were written to
a training file. A method for obtaining the values of those features for each instance
was implemented. The classification algorithm employed was the K* instance-based
learner available in the Weka package (Witten & Frank, 2005) (see Section 3.2.5).
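As an illustration of what writing feature vectors to a training file can involve, the sketch below serialises nominal feature vectors, with the manually assigned class as the last attribute, in Weka's ARFF text format. The function `to_arff`, the relation name and the two features shown are hypothetical; they are not the fourteen features of the system.

```python
def to_arff(relation: str, attributes: list[tuple[str, list[str]]], rows: list[list[str]]) -> str:
    """Serialise nominal feature vectors (class label last) as ARFF text."""
    lines = [f"@relation {relation}", ""]
    for name, values in attributes:
        # Nominal attributes list their admissible values between braces.
        lines.append(f"@attribute {name} {{{','.join(values)}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"


# Two hypothetical features plus the manually assigned class.
attributes = [
    ("has_se_particle", ["yes", "no"]),
    ("prev_pos", ["N", "ADV", "NONE"]),
    ("class", ["explicit", "zero_pronoun", "impersonal"]),
]
rows = [
    ["no", "N", "explicit"],
    ["no", "ADV", "zero_pronoun"],
    ["yes", "NONE", "impersonal"],
]
arff_text = to_arff("eszic_sketch", attributes, rows)
```

A file with this content can be loaded as training data by Weka classifiers, including the K* learner (`weka.classifiers.lazy.KStar`).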
3.2.1 Building the Training Data
The ESZIC training data used by the Elliphant system is obtained from the ESZIC corpus,
which was created ad hoc. The corpus is named after its annotated content: “Explicit Subjects,
Zero-pronouns and Impersonal Constructions”.
The corpus contains a total of 79,615 words (titles and sentences that do not contain
at least one finite verb are ignored), including 6,825 finite verbs. Of these verbs, 71%
have an explicit subject, 26% have a zero pronoun and 3% belong to an impersonal
construction. There is an average of 2.3 clauses per sentence with 11.7 words per clause
and 26.9 words per sentence.
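The figures in this paragraph can be cross-checked with a few lines of arithmetic. Because the percentages are rounded, the class counts derived below are approximations only; for instance, the 179 non-referential instances out of 6,827 reported in Section 2.1 amount to about 2.6%, which rounds to the 3% given here.

```python
finite_verbs = 6825
shares = {"explicit subject": 0.71, "zero pronoun": 0.26, "impersonal construction": 0.03}

# Approximate instance counts implied by the rounded percentages.
approx_counts = {label: round(finite_verbs * share) for label, share in shares.items()}

# Words per sentence should be close to clauses per sentence times words per clause.
clauses_per_sentence = 2.3
words_per_clause = 11.7
implied_words_per_sentence = clauses_per_sentence * words_per_clause

# Share of non-referential instances reported in Section 2.1.
non_referential_share = 179 / 6827
```

The implied value of roughly 26.9 words per sentence agrees with the reported average.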
The corpus compiled to extract the training data is composed of seventeen docu-
ments, originally written in Spanish, and belonging to two genres: legal and health.
The legal texts1 are composed of laws taken from the: (1) Spanish Constitution
Law for Administrative-contentious Jurisdiction (title 1, articles 1 to 17) (Ley 29/1998,
1998), (5) Civil Code (first book, until title V) (Código Civil, 1889), (6) Law for Universities
(introduction) (Ley Orgánica 6/2001, 2001), (7) Law for Associations (chapter
1) (Ley Orgánica 1/2002, 2002) and (8) Law for Advertisements (whole text) (Ley
29/2005, 2005).
The nine health texts are taken from psychiatric papers compiled from a Spanish
digital journal of psychiatry, Psiquiatría.com2: (1) Cinema as a tool for teaching
personality disorders (López Ortega, 2009), (2) Efficacy, functionality, and empowerment
for phobic pathology treatment, in the context of specialised public Mental
Health Services (García Losa, 2008), (3) Emotions in Psychiatry (Sevillano Arroyo
& Ducret Rossier, 2008), (4) And what about siblings? How to help TLP3 siblings
¹ All the legal texts are available online at: http://noticias.juridicas.com/base_datos/
² The full-text articles from Psiquiatría.com Journal are available online at: http://www.psiquiatria.com/.
³ Trastorno límite de la personalidad (Borderline Personality Disorder).
Health genre eszic training data (Elliphant)

Class                       Precision   Recall   F-measure
Explicit subjects           0.879       0.879    0.879
Zero pronouns               0.773       0.795    0.784
Impersonal constructions    0.882       0.620    0.728

Elliphant Health eszic training data accuracy: 0.841

Table 4.15: Elliphant Health eszic training data results.
Health genre eszic training data (Machinese)

Class                                      Precision   Recall   F-measure
Explicit subjects                          0.879       0.735    0.801
Zero pronouns + Impersonal constructions   0.656       0.833    0.734

Machinese Health eszic training data accuracy: 0.772

Table 4.16: Machinese Health eszic training data results.
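The F-measure column of Tables 4.15 and 4.16 can be checked against the precision and recall columns. The sketch below assumes that the reported F-measure is the balanced harmonic mean of precision and recall, which is not stated in this passage. The function and variable names are chosen for illustration only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F-measure), copied from Tables 4.15 and 4.16.
ROWS = {
    "Elliphant / explicit subjects": (0.879, 0.879, 0.879),
    "Elliphant / zero pronouns": (0.773, 0.795, 0.784),
    "Elliphant / impersonal constructions": (0.882, 0.620, 0.728),
    "Machinese / explicit subjects": (0.879, 0.735, 0.801),
    "Machinese / zero pronouns + impersonal": (0.656, 0.833, 0.734),
}

def check_rows(rows=ROWS, tolerance=0.001):
    """Recompute F for every row and verify it against the reported value."""
    recomputed = {}
    for name, (precision, recall, reported) in rows.items():
        value = f_measure(precision, recall)
        # Fails loudly if a row disagrees with the balanced F-measure.
        assert abs(value - reported) <= tolerance, name
        recomputed[name] = value
    return recomputed

if __name__ == "__main__":
    for name, value in check_rows().items():
        print(f"{name}: {value:.3f}")
```

Under this assumption, every reported F-measure in the two tables agrees with its precision and recall to within rounding.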
When classifying instances derived from texts in the health genre (using Health
eszic training data), the accuracy of both the Elliphant system and the parser was
reduced. However, Elliphant still outperforms the parser in this context.
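When only the elliptic classes in the health genre are considered, Connexor’s Machinese Syntax parser obtains a higher recall for its single merged class (0.833) than Elliphant obtains when its measures for zero pronouns and impersonal constructions are averaged (precision: 0.827; recall: 0.707; f-measure: 0.756), although Elliphant’s averaged precision and f-measure remain higher than those of the parser (0.656 and 0.734).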
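The averaged figures quoted above can be reproduced from Table 4.15. The sketch below assumes they are the unweighted mean of each measure over the two elliptic classes (zero pronouns and impersonal constructions). The dictionary layout and the helper name `macro_average` are illustrative.

```python
from statistics import mean

# Per-class Elliphant scores on the health genre, copied from Table 4.15.
ELLIPTIC = {
    "zero pronouns": {"precision": 0.773, "recall": 0.795, "f": 0.784},
    "impersonal constructions": {"precision": 0.882, "recall": 0.620, "f": 0.728},
}
# Single merged elliptic class reported for the parser in Table 4.16.
MACHINESE = {"precision": 0.656, "recall": 0.833, "f": 0.734}

def macro_average(per_class):
    """Unweighted mean of each measure over the given classes."""
    measures = ("precision", "recall", "f")
    return {m: mean(scores[m] for scores in per_class.values()) for m in measures}

averaged = macro_average(ELLIPTIC)
for measure, value in averaged.items():
    print(f"{measure}: Elliphant {value:.4f} vs Machinese {MACHINESE[measure]:.3f}")
```

On these figures the parser's merged class has the higher recall, while the averaged Elliphant precision and F-measure are higher.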
4. Evaluation 4.2 Comparative Evaluation
Nevertheless, unlike the parser, the Elliphant system distinguishes referential (zero
pronouns) and non-referential (impersonal constructions) elided subjects. This can be
considered one of its main contributions as this task is necessary in order to improve
practical anaphora resolution systems.
Chapter 5
Conclusions and Future Work
In this dissertation, a machine learning approach to the identification of zero pronouns,
impersonal constructions, and explicit subjects was presented. In treating this range
of classes, complete coverage is provided for all possible constituents which may occur
in subject position in Spanish clauses.
In order to enable a machine learning approach to classification, a parsed corpus of
Spanish texts from the health and legal genres was compiled. The corpus was manually
annotated to encode information about the element in subject position for every finite
verb in the corpus (the eszic Corpus). A set of 14 features was formulated and training
data consisting of 6,827 instances represented by vectors of the feature values was cre-
ated (eszic training data). The training data was utilised by classification algorithms
distributed with the Weka package. Empirical observation revealed that use of the K*
algorithm was optimal for the purpose of this classification. The performance of this
machine learning approach was compared with that of Connexor’s Machinese Syntax
parser. Elliphant offers a classification with superior accuracy in the recognition of
both of the elliptic classes (zero pronouns and impersonal constructions), and also in
the classification of the non-elliptic subject class (explicit subjects). The method pre-
sented in this dissertation is also able to identify impersonal constructions in Spanish.
This is a task which appears not to have been dealt with before in the literature.
In addition to algorithm selection, further experiments carried out with the underlying method addressed parameter optimisation, the most effective combinations of features, the optimal number of instances to include in the training data, and the relationship between the results and the different genres on which the Elliphant system was tested. This chapter presents the findings
of all of these experiments (see section 5.1). In future research, it is intended that op-
timisation of the approach and its adaptability to other genres will be investigated in
more depth (see section 5.2).
5.1 Main Observations
Algorithm selection: the instance-based learning algorithm K* was selected for the classification of elliptic vs. explicit subject instances and of referential vs. non-referential subject instances. This decision was based on a comparison of the accuracy of this classifier with that of the other classifiers available in the Weka package. In terms of accuracy, the K* algorithm is closely followed by the Bayes-based algorithms in Weka.
Parameter optimisation was investigated by checking the impact of the param-
eter setting on the performance of the K* classifier. Although Weka provides sensible
default settings, it is by no means certain that they will be optimal for this particular
task. The default settings were changed so that a blending parameter of 40% was used
with regard to the K* algorithm.
Feature selection: the set of experiments conducted to determine an optimal group of features for the classification algorithm revealed that, of the entire set of 14 features, the most effective group comprises six: nhprev (number of noun phrases previous to the verb), parser (parsed subject), nhtot (number of noun phrases in the clause), pospos (four parts of speech following the verb), person (verb morphological person), and lemma (verbal lemma). This study also showed that feature a (preposition a) does not make any meaningful contribution to the classification.
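As a rough illustration of what restricting the classifier to this subset means in practice, the sketch below projects a feature vector onto the six selected features. The feature names come from the paragraph above. The example values, their types and the helper name `select_features` are assumptions made for illustration and do not reproduce the actual eszic encoding.

```python
# Names of the six features reported as the most effective group.
SELECTED_FEATURES = ("nhprev", "parser", "nhtot", "pospos", "person", "lemma")

def select_features(instance: dict, selected=SELECTED_FEATURES) -> dict:
    """Return a copy of the instance restricted to the selected features.

    Raises KeyError if the instance lacks one of the selected features.
    """
    return {name: instance[name] for name in selected}

# Hypothetical instance: every value below is invented for illustration.
example = {
    "nhprev": 0,
    "parser": "no_subject",
    "nhtot": 1,
    "pospos": ("DET", "N", "PREP", "N"),
    "person": "3",
    "lemma": "haber",
    "a": False,  # the feature reported as making no meaningful contribution
}

reduced = select_features(example)
print(reduced)
```

Keeping the selection as a tuple of names makes it easy to rerun an experiment with a different subset by passing another tuple as `selected`.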
Training data required: learning curve experiments showed how the accuracy of the classifier correlates with the size of the training set: performance reaches its maximum level and plateaus when 90% of the available data is used.
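A learning curve experiment of this kind can be sketched as follows. This is a minimal illustration, not the procedure used in the thesis: a 1-nearest-neighbour learner with a simple overlap distance stands in for K*, and the instances are synthetic. All names, features and the labelling rule are assumptions.

```python
import random

def overlap_distance(a, b):
    """Count the feature positions at which two instances differ."""
    return sum(x != y for x, y in zip(a, b))

def predict_1nn(train, features):
    """Return the label of the closest training instance (first one on ties)."""
    nearest = min(train, key=lambda inst: overlap_distance(inst[0], features))
    return nearest[1]

def accuracy(train, test):
    """Share of test instances whose predicted label matches the gold label."""
    hits = sum(predict_1nn(train, feats) == label for feats, label in test)
    return hits / len(test)

def learning_curve(train, test, fractions=(0.1, 0.3, 0.5, 0.7, 0.9, 1.0), seed=0):
    """Accuracy on a fixed test set for growing random subsets of the training set."""
    shuffled = list(train)
    random.Random(seed).shuffle(shuffled)
    curve = []
    for fraction in fractions:
        size = max(1, round(fraction * len(shuffled)))
        curve.append((fraction, accuracy(shuffled[:size], test)))
    return curve

def toy_instances(n, seed):
    """Generate synthetic (features, label) pairs; this is NOT the eszic data."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        parsed_subject = rng.choice(["yes", "no"])
        person = rng.choice(["1", "2", "3"])
        nps_before_verb = rng.choice(["0", "1", "2+"])
        # Invented labelling rule, used only to give the learner something to fit.
        if parsed_subject == "yes":
            label = "explicit"
        elif nps_before_verb == "0" and person == "3":
            label = "impersonal"
        else:
            label = "zero"
        if rng.random() < 0.1:  # 10% label noise
            label = rng.choice(["explicit", "zero", "impersonal"])
        data.append(((parsed_subject, person, nps_before_verb), label))
    return data

if __name__ == "__main__":
    for fraction, acc in learning_curve(toy_instances(300, 1), toy_instances(100, 2)):
        print(f"{fraction:.0%} of training data: accuracy {acc:.3f}")
```

Holding the test set fixed and varying only the training fraction keeps the points of the curve comparable with one another.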
Genre interference: We evaluated the performance of the Elliphant system sep-
arately in two different genres, legal and health, showing that there is some genre
interference on the classification tasks. Elliphant classifies zero pronouns and explicit
subjects in legal texts with a higher accuracy than is the case in health texts. By con-
trast, impersonal constructions are more accurately classified in health texts. Cross-
genre training and testing demonstrated that legal instances are more informative and
homogeneous than health genre cases.
5.2 Future Research
Future research goals are related to improvements in: (1) optimisation of the Elliphant system, (2) adaptation of the system to other genres, (3) inter-annotator agreement of the eszic Corpus, (4) the comparison of Elliphant with a rule-based approach and (5) the design of an algorithm to resolve zero anaphora in Spanish.
Firstly, with regard to further improvement of the Elliphant system, the interaction
between (a) feature selection and parameter optimisation, and (b) class distribution will
be addressed. In related work, it was found that optimal settings for feature selection
and parameter optimisation should not be sought independently of one another since
there is an interaction between the two. The joint optimisation of feature selection
and parameter optimisation can cause variations in the accuracy levels obtained by
classifiers (Hoste, 2005). Additionally, the way in which the class distribution of the data affects learning will be investigated. This will facilitate the compilation of an optimal set of training instances, as it has been found that training data containing a lower proportion of negative instances can be beneficial to classification (Hoste, 2005).
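One simple way to vary the class distribution of a training set is to downsample one class, as sketched below. Which class would count as "negative" in this three-class setting is not specified here. Treating the majority class (explicit subjects) as the one to thin out is an assumption, as are the function names and the toy data.

```python
import random
from collections import Counter

def downsample(instances, majority_label, keep_ratio, seed=0):
    """Keep every other instance and a random share of the majority class."""
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in [0, 1]")
    rng = random.Random(seed)
    majority = [inst for inst in instances if inst[1] == majority_label]
    others = [inst for inst in instances if inst[1] != majority_label]
    kept = rng.sample(majority, round(keep_ratio * len(majority)))
    result = others + kept
    rng.shuffle(result)
    return result

def class_distribution(instances):
    """Relative frequency of each label in a list of (features, label) pairs."""
    counts = Counter(label for _, label in instances)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Toy data mirroring the 71% / 26% / 3% split reported for the corpus.
toy = (
    [((i,), "explicit") for i in range(710)]
    + [((i,), "zero") for i in range(260)]
    + [((i,), "impersonal") for i in range(30)]
)
thinned = downsample(toy, "explicit", keep_ratio=0.5)
print(class_distribution(toy))
print(class_distribution(thinned))
```

Training the same classifier on several such variants and comparing accuracy on a fixed test set would show how sensitive it is to the class distribution.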
In future work, evaluation and learning curve experiments in which training instances derived from texts in one genre are used to classify instances derived from texts in a different genre will provide an insight into the optimal type or combination of training data that enables better classification using fewer instances in various genres of text, and will make our system more robust.
Inter-annotator agreement will be measured. It is also planned to design a rule-based algorithm to identify and resolve zero anaphora in Spanish, as there is some debate about which approach, machine learning or rule-based, brings optimal performance when applied in anaphora resolution systems (Mitkov, 2002).
References
Aldea Muñoz, S. (2003). Un caso de intervención psicológica de la depresión infantil. psiquiatria.com, 7. 28
Aldea Muñoz, S. (2006). Influencia del autoconcepto y de la competencia social en la depresión infantil. psiquiatria.com, 10. 28
Alonso-Ovalle, L. & D’Introno, F. (2000). Full and null pronouns in Spanish: the zero pronoun hypothesis. In H. Campos, E. Herburger, A. Morales-Front & T.J. Walsh, eds., Hispanic linguistics at the turn of the millennium. Papers from the 3rd Hispanic Linguistics Symposium, 189–210, Cascadilla Press, Somerville, MA. 6
Balcázar Nava, P., Bonilla Muñoz, M.P., Gurrola Peña, G.M., Oudhof van Barneveld, H. & Aguilar Mercado, M.R. (2005). La depresión como problema de salud mental en los adolescentes mexicanos. psiquiatria.com, 9. 28
Barreras, J. (1993). Resolución de elipsis y técnicas de parsing en una interficie de lenguaje natural. Procesamiento del lenguaje natural, 13, 247–258. 7, 8
Beavers, J. & Sag, I. (2004). Coordinate ellipsis and apparent non-constituent coordination. In S. Müller, ed., Proceedings of the 11th International Conference on Head-Driven Phrase Structure Grammar (HPSG-04), 48–69, CSLI Publications, Stanford, CA. 17
Bello, A. ([1847] 1981). Gramática de la lengua castellana destinada al uso de los americanos. Instituto Universitario de Lingüística Andrés Bello, Cabildo Insular de Tenerife, Santa Cruz de Tenerife. 15, 19
Bergsma, S., Lin, D. & Goebel, R. (2008). Distributional identification of non-referential pronouns. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL/HLT-08), 10–18. 2, 10, 12
Bosque, I. (1989). Clases de sujetos tácitos. In J. Borrego Nieto, ed., Philologica: homenaje a Antonio Llorente, vol. 2, 91–112, Servicio de Publicaciones, Universidad Pontificia de Salamanca, Salamanca. 15, 16, 18, 19, 24
Boyd, A., Gegg-Harrison, W. & Byron, D. (2005). Identifying non-referential it: a machine learning approach incorporating linguistically motivated patterns. In Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in Natural Language Processing. 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), 40–47. 8, 10, 12, 13, 22, 45
Brucart, J.M. (1987). La elisión sintáctica en español. Universitat Autònoma de Barcelona, Bellaterra. 15
Brucart, J.M. (1999). La elipsis. In I. Bosque & V. Demonte, eds., Gramática descriptiva de la lengua española, vol. 2, 2787–2863, Espasa-Calpe, Madrid. ix, 15, 16, 17, 19, 23, 24
Carden, G. (1982). Backwards anaphora in discourse context. Journal of Linguistics, 18, 361–87. 33, 34
Chinchor, N. & Hirschman, L. (1997). MUC-7 Coreference task definition (version 3.0). In Proceedings of the 1997 Message Understanding Conference (MUC-97). 2
Chomsky, N. (1965). Aspects of the theory of syntax. The MIT Press, Cambridge, MA. 15
Chomsky, N. ([1968] 2006). Language and mind. Cambridge University Press, Cambridge, 3rd edn. 14
Chomsky, N. (1981). Lectures on government and binding. Mouton de Gruyter, Berlin, New York. 1, 6, 19
Chomsky, N. (1995). The minimalist program. The MIT Press, Cambridge, MA. 15
Chung, S., Ladusaw, W. & McCloskey, J. (1995). Sluicing and logical form. Natural Language Semantics, 3, 239–282. 17
Cleary, J. & Trigg, L. (1995). K*: an instance-based learner using an entropic distance measure. In Proceedings of the 12th International Conference on Machine Learning (ICML-95), 108–114. 13, 45, 46
Clemente, J., Torisawa, K. & Satou, K. (2004). Improving the identification of non-anaphoric it using Support Vector Machines. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP-04), 58–61. 10, 12
Código Civil (1889). Texto de la edición del Código Civil mandada publicar por el Real Decreto de 24 del corriente en cumplimiento de la ley de 26 de mayo último. Gaceta de Madrid, 206, 249–312. 26
Connexor Oy (2006b). Machinese language model. 13, 29, 35
Constitución Española (1978). Constitución Española de 27 de diciembre de 1978. Boletín Oficial del Estado, 311, 29313–29424. 26
Corpas Pastor, G. (2008). Investigar con corpus en traducción: los retos de un nuevo paradigma. Peter Lang, Frankfurt am Main. 7, 8
Corpas Pastor, G., Mitkov, R., Afzal, N. & Pekar, V. (2008). Translation universals: do they exist? A corpus-based NLP study of convergence and simplification. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas (AMTA-08), 75–81. 2, 7, 8, 10
Danlos, L. (2005). Automatic recognition of French expletive pronoun occurrences. In R. Dale, K.F. Wong, J. Su & O.Y. Kwong, eds., Natural language processing. Proceedings of the 2nd International Joint Conference on Natural Language Processing (IJCNLP-05), 73–78, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 3651. 2, 10, 11, 12
Denber, M. (1998). Automatic resolution of anaphora in English. Tech. rep., Eastman Kodak Co. 10, 11, 12
Díaz Morfa, J. (2004). La crisis de las aventuras en las relaciones de pareja. psiquiatria.com, 8. 28
Díscolo, A. ([2nd century] 1987). Sintaxis. Gredos, Madrid. 14
Evans, R. (2000). A comparison of rule-based and machine learning methods for identifying non-nominal it. In D.N. Christodoulakis, ed., Natural Language Processing - NLP 2000. Proceedings of the 2nd International Conference on Natural Language Processing (NLP-2000), 233–241, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 1835. 10, 12
Evans, R. (2001). Applying machine learning: toward an automatic classification of it. Literary and Linguistic Computing, 16, 45–57. 2, 10, 12, 13, 22, 29, 38, 45
Fernández Soriano, O. & Táboas Baylín, S. (1999). Construcciones impersonales no reflejas. In I. Bosque & V. Demonte, eds., Gramática descriptiva de la lengua española, vol. 2, 1631–1722, Espasa-Calpe, Madrid. 18, 19
Ferrández, A. & Peral, J. (2000). A computational approach to zero-pronouns in Spanish. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL-2000), 166–172. 2, 6, 7, 8, 9, 11, 17, 22, 52, 55
Ferrández, A., Palomar, A. & Moreno, L. (1997). El problema del núcleo del sintagma nominal: ¿elipsis o anáfora? Procesamiento del lenguaje natural, 20, 13–26. 24
Ferrández, A., Palomar, A. & Moreno, L. (1998). Anaphor resolution in unrestricted texts with partial parsing. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (ACL/COLING-98), 385–391. 9
Ferrández, A., Palomar, A. & Moreno, L. (1999). An empirical approach to Spanish anaphora resolution. Machine Translation, 14, 191–216. 9
Fiengo, R. & May, R. (1994). Indices and identity. The MIT Press, Cambridge, MA. 17
Francis, W. (1958). The structure of American English. Ronald Press, New York. 15
Fries, C. (1940). American English grammar. Appleton-Century-Crofts, New York. 15
García Jurado, F. (2007). La etimología como historia de las palabras. E-excellence, Área de Cultura Clásica, Filología Clásica, 39, 1–27. 14
García Losa, E. (2008). Efectividad, operatividad y potenciación del tratamiento en patología fóbica, en el contexto de los servicios especializados de salud mental públicos: la utilización en la sala de consulta de los recursos de Internet. psiquiatria.com, 12. 26
Gómez Torrego, L. (1992). La impersonalidad gramatical: descripción y norma. Arco Libros, Madrid. 17, 18, 19, 23, 25
Grice, H. (1975). Logic and conversation. In P. Cole & J.L. Morgan, eds., Syntax and semantics, vol. 3: Speech Acts, 41–58, Academic Press, New York. 15
Gundel, J., Hedberg, N. & Zacharski, R. (2005). Pronouns without NP antecedents: how do we know when a pronoun is referential? In A. Branco, T. McEnery & R. Mitkov, eds., Anaphora processing: linguistic, cognitive and computational modelling, 351–364, John Benjamins, Amsterdam. 10, 12
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. & Witten, I.H. (2009). The WEKA data mining software: an update. SIGKDD Explorations, 11, 10–18. 41
Halliday, M.A.K. & Hasan, R. (1976). Cohesion in English. Longman, London. 15
Han, N. (2004). Korean null pronouns: classification and annotation. In Proceedings of the Workshop on Discourse Annotation. 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), 33–40. 7
Hernández Terrés, J.M. (1984). La elipsis en la teoría gramatical. Universidad de Murcia, Murcia. 14
Hirano, T., Matsuo, Y. & Kikui, G. (2007). Detecting semantic relations between named entities in text using contextual features. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics. Companion volume proceedings of the demo and poster sessions (ACL-07), 157–160. 2, 7, 8
Hobbs, J. (1977). Resolving pronoun references. Lingua, 44, 311–338. 52
Hoste, V. (2005). Optimization issues in machine learning of coreference resolution. Ph.D. thesis, University of Antwerp. 61
Hu, Q. (2008). A corpus-based study on zero anaphora resolution in Chinese discourse. Ph.D. thesis, City University of Hong Kong. 7, 8
Iida, R., Inui, K. & Matsumoto, Y. (2006). Exploiting syntactic patterns as clues in zero-anaphora resolution. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics and the 21st International Conference on Computational Linguistics (ACL/COLING-06), 625–632. 7, 8
Iida, R., Kentaro, I. & Matsumoto, Y. (2009). Capturing salience with a trainable cache model for zero-anaphora resolution. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL/AFNLP-09), 647–655. 2, 7, 8
Imamura, K., Saito, K. & Izumi, T. (2009). Discriminative approach to predicate-argument structure analysis with zero-anaphora resolution. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL/AFNLP-09), 85–88. 2, 7, 8
Isozaki, H. & Hirao, T. (2003). Japanese zero pronoun resolution based on ranking rules and machine learning. In Theoretical Issues in Natural Language Processing. Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP-03), 184–191. 7, 8
Järvinen, T. & Tapanainen, P. (1998). Towards an implementable dependency grammar. In A. Polguère & S. Kahane, eds., Proceedings of the Workshop on Processing of Dependency-Based Grammars. 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (ACL/COLING-98), 1–10. 28
Järvinen, T., Laari, M., Lahtinen, T., Paajanen, S., Paljakka, P., Soininen, M. & Tapanainen, P. (2004). Robust language analysis components for practical applications. In Proceedings of the 20th International Conference on Computational Linguistics (COLING-04), 53–56. 28, 29
Kawahara, D. & Kurohashi, S. (2004). Improving Japanese zero pronoun resolution by global word sense disambiguation. In Proceedings of the 20th International Conference on Computational Linguistics (COLING-04), 343–349. 2, 7, 8
Kibrik, A.A. (2004). Zero anaphora vs. zero person marking in Slavic: a chicken/egg dilemma? In Proceedings of the 5th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC-04), 87–90. 2, 7, 8
Kratzer, A. (1998). More structural analogies between pronouns and tenses. In Proceedings of Semantics and Linguistic Theory VIII (SALT-98), Cornell University, Ithaca, NY. 6
Kuno, S. (1972). Functional sentence perspective: a case study from Japanese and English. Linguistic Inquiry, 3, 269–320. 33
Lambrecht, K. (2001). A framework for the analysis of cleft constructions. Linguistics, 39, 463–516. 10, 12
Lancelot, C. & Arnauld, A. ([1660] 1980). Gramática general y razonada. Sociedad General Española de Librería, Madrid. 14
Lappin, S. & Leass, H. (1994). An algorithm for pronominal anaphora resolution. Computational Linguistics, 20, 535–561. 10, 11, 12, 52
Lee, S. & Byron, D. (2004). Semantic resolution of zero and pronoun anaphors in Korean. In Proceedings of the 5th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC-04), 103–108. 2, 7
Lee, S., Byron, D. & Jang, S. (2005). Why is zero marking important in Korean? In R. Dale, K.F. Wong, J. Su & O.Y. Kwong, eds., Natural language processing. Proceedings of the 2nd International Joint Conference on Natural Language Processing (IJCNLP-05), 588–599, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 3651. 7
Ley 29/1998 (1998). Ley 29/1998, de 13 de julio, reguladora de la Jurisdicción Contencioso-administrativa. Boletín Oficial del Estado, 167, 23516–23551. 26
Ley 29/2005 (2005). Ley 29/2005, de 29 de diciembre, de Publicidad y Comunicación Institucional. Boletín Oficial del Estado, 312, 42902–42905. 26
Ley 3/1991 (1991). Ley 3/1991, de 10 de enero, de Competencia Desleal. Boletín Oficial del Estado, 10, 959–962. 26
Ley Orgánica 10/1995 (1995). Ley Orgánica 10/1995, de 23 de noviembre, del Código Penal. Boletín Oficial del Estado, 281, 33987–34058. 26
Ley Orgánica 1/2002 (2002). Ley Orgánica 1/2002, de 22 de marzo, reguladora del Derecho de Asociación. Boletín Oficial del Estado, 73, 11981–11991. 26
Ley Orgánica 6/2001 (2001). Ley Orgánica 6/2001, de 21 de diciembre, de Universidades. Boletín Oficial del Estado, 307, 49400–49425. 26
Li, Y., Musilek, P. & Wyard-Scott, L. (2009). Identification of pleonastic it using the web. Computer Engineering, 34, 339–389. 10, 12
López Ortega, M.A. (2009). El cine como herramienta ilustrativa en la enseñanza de los trastornos de la personalidad. psiquiatria.com, 13. 26
Manning, C. & Schütze, H. (1999). Foundations of statistical natural language processing. The MIT Press, Cambridge, MA. 41
Matsui, T. (1999). Approaches to Japanese zero pronouns: centering and relevance. In D. Cristea, N. Ide & D. Marcu, eds., Proceedings of the Workshop on the Relation of Discourse/Dialogue Structure and Reference. 37th Annual Meeting of the Association for Computational Linguistics (ACL-99), 11–20. 2, 7, 8
Mel’čuk, I. (2003). Levels of dependency in linguistic description: concepts and problems. In Dependency and valency. An International handbook of contemporary research, 188–229, Mouton de Gruyter, Berlin, New York. 17
Mel’čuk, I. (2006). Zero sign in morphology. In Aspects of the theory of morphology, 447–495, Mouton de Gruyter, Berlin, New York. 6, 19
Mendikoetxea, A. (1994). La semántica de la impersonalidad. In C. Sánchez, ed., Las construcciones con se, 239–267, Visor, Madrid. 18
Mendikoetxea, A. (1999). Construcciones con se: medias, pasivas e impersonales. In I. Bosque & V. Demonte, eds., Gramática descriptiva de la lengua española, vol. 2, 1575–1630, Espasa-Calpe, Madrid. 18
Merchant, J. (2001). The syntax of silence. Sluicing, islands and the theory of ellipsis. Oxford University Press, Oxford. 17
Mitkov, R. (1998). Robust pronoun resolution with limited knowledge. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (ACL/COLING-98), 869–875. 12, 52
Mitkov, R. (2001). Outstanding issues in anaphora resolution. In A. Gelbukh, ed., Proceedings of the 2nd International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-01), 110–125, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 2004. 10
Mitkov, R. (2010). Discourse processing. In A. Clark, C. Fox & S. Lappin, eds., The handbook of computational linguistics and natural language processing, 599–629, Wiley Blackwell, Oxford. 2, 5, 10
Mitkov, R., Evans, R. & Orăsan, C. (2002). A new, fully automatic version of Mitkov’s knowledge-poor pronoun resolution method. In Proceedings of the 3rd International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-02), 69–83, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 2276. 10, 12
Molina López, D. (2008). Y de los hermanos ¿qué? Cómo ayudar a los hermanos de un TLP. psiquiatria.com, 12. 28
Mori, T. & Nakagawa, H. (1996). Zero pronouns and conditionals in Japanese instruction manuals. In Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), 782–787. 7, 8
Müller, C. (2006). Automatic detection of nonreferential it in spoken multi-party dialog. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06), 49–56. 10, 12, 13
Murata, M., Isahara, H. & Nagao, M. (1999). Pronoun resolution in Japanese sentences using surface expressions and examples. In A. Bagga, B. Baldwin & S. Shelton, eds., Proceedings of the Workshop on Coreference and Its Applications. 37th Annual Meeting of the Association for Computational Linguistics (ACL-99), 39–46. 7, 8
Nakagawa, H. (1992). Zero pronouns as experiencer in Japanese discourse. In Proceedings of the 15th International Conference on Computational Linguistics (COLING-92), 324–330. 7, 8
Nakaiwa, H. (1997). Automatic identification of zero pronouns and their antecedents within aligned sentence pairs. In Proceedings of the 3rd Annual Meeting of the Association for Natural Language Processing in Japan (ANLP-97), 127–141. 7, 8
Nakaiwa, H. & Ikehara, S. (1992). Zero pronoun resolution in a Japanese to English machine translation system by using verbal semantic attributes. In Proceedings of the 3rd Conference on Applied Natural Language Processing (ANLP-92), 201–208. 7, 8
Nakaiwa, H. & Shirai, S. (1996). Anaphora resolution of Japanese zero pronouns with deictic reference. In Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), 812–817. 7, 8
Ng, V. & Cardie, C. (2002). Identifying anaphoric and non-anaphoric noun phrases to improve coreference resolution. In Proceedings of the 19th International Conference on Computational Linguistics (COLING-02), 1–7. 10, 12
Nomoto, T. & Yoshihiko, N. (1993). Resolving zero anaphora in Japanese. In Proceedings of the 6th Conference of the European Chapter of the Association for Computational Linguistics (EACL-93), 315–321. 7, 8
Okumura, M. & Tamura, K. (1996). Zero pronoun resolution in Japanese discourse based on centering theory. In Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), 871–876. 1, 7
Paice, C.D. & Husk, G.D. (1987). Towards an automatic recognition of anaphoric features in English text: the impersonal pronoun it. Computer Speech and Language, 2, 109–132. 10, 11, 12
Peng, J. & Araki, K. (2007a). Zero anaphora resolution in Chinese and its application in Chinese-English machine translation. In Z. Kedad, N. Lammari, E. Métais, F. Meziane & Y. Rezgui, eds., Natural language processing and information systems. Proceedings of the 12th International Conference on Applications of Natural Language to Information Systems (NLDB-07), 364–375, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 4592. 7
Peng, J. & Araki, K. (2007b). Zero-anaphora resolution in Chinese using maximum entropy. IEICE - Transactions on Information and Systems, E90-D, 1092–1102. 7, 8
Peral, J. (2002). Resolución y generación de la anáfora nominal en español e inglés en un sistema de traducción automática. Procesamiento del lenguaje natural, 28, 127–128. 7, 8
Peral, J. & Ferrández, A. (2000). Generation of Spanish zero-pronouns into English. In D.N. Christodoulakis, ed., Natural Language Processing - NLP 2000. Proceedings of the 2nd International Conference on Natural Language Processing (NLP-2000), 252–260, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 1835. 2, 7, 8
Pintor García, M. (2007). Análisis factorial de las actitudes personales en educación secundaria. Un estudio empírico en la Comunidad de Madrid. psiquiatria.com, 11. 28
Pollard, C. & Sag, I. (1994). Head Driven Phrase Structure Grammar. CSLI Publications, Stanford, CA. 19
Real Academia Española (1977). Esbozo de una nueva gramática de la lengua española. Espasa-Calpe, Madrid. 19
Real Academia Española (2001). Diccionario de la lengua española. Espasa-Calpe, Madrid, 22nd edn. 15, 40, 41
Real Academia Española (2009). Nueva gramática de la lengua española. Espasa-Calpe, Madrid. ix, 6, 15, 16, 17, 18, 19, 22, 23, 24, 25, 33, 34
Recasens, M. & Hovy, E. (2009). A deeper look into features for coreference resolution. In L.D. Sobha, A. Branco & R. Mitkov, eds., Anaphora Processing and Applications. Proceedings of the 7th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC-09), 29–42, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 5847. 2, 6, 11
Rello, L. & Illisei, I. (2009a). A comparative study of Spanish zero pronoun distribution. In Proceedings of the International Symposium on Data and Sense Mining, Machine Translation and Controlled Languages, and their application to emergencies and safety critical domains (ISMTCL-09), 209–214, Presses Universitaires de Franche-Comté, Besançon. 3, 7
Rello, L. & Illisei, I. (2009b). A rule-based approach to the identification of Spanish zero pronouns. In Student Research Workshop. International Conference on Recent Advances in Natural Language Processing (RANLP-09), 209–214. 3, 7, 8, 9, 10, 11, 22, 35
Rello, L., Baeza-Yates, R. & Mitkov, R. (2010a). Improved subject ellipsis detection inSpanish. submitted . 3
Rello, L., Suarez, P. & Mitkov, R. (2010b). A machine learning method for identify-ing non-referential impersonal sentences and zero pronouns in Spanish. Procesamiento delLenguaje Natural , 45, 281–287. 3
Ross, J. (1967). Constrains on variables in syntax . Ph.D. thesis, Massachusetts Institute ofTechnology. 17
Sanchez de las Brozas, F. ([1562] 1976). Minerva. De la propiedad de la lengua latina.Catedra, Madrid. 14
Sasano, R., Kawahara, D. & Kurohashi, S. (2008). A fully-lexicalized probabilistic modelfor Japanese zero anaphora resolution. In Proceedings of the 22nd International Conferenceon Computational Linguistics (COLING-08), 769–776. 7, 8
Seco, M. (1988). Manual de gramatica espanola. Aguilar, Madrid. 19
Seki, K., Fujii, A. & Ishikawa, T. (2002). A probabilistic method for analyzing Japaneseanaphora integrating zero pronoun detection and resolution. In Proceedings of the 19th In-ternational Conference on Computational Linguistics (COLING-02), 911–917. 7, 8
Sevillano Arroyo, M.A. & Ducret Rossier, F.E. (2008). Las emociones en la psiquiatrıa.psiquiatria.com, 12. 26
Shopen, T. (1973). Ellipsis as grammatical indeterminacy. Foundations of Language, 10, 65–77. 15
Steinberger, J., Poesio, M., Kabadjov, M.A. & Jeek, K. (2007). Two uses of anaphoraresolution in summarization. Information Processing and Management , 43, 1663–1680. 2, 7
Streb, J., Hennighausen, E. & Rosler, F. (2004). Different anaphoric expressions areinvestigated by event-related brain potentials. Journal of Psycholinguistic Research, 33, 175–201. 15
Takada, S. & Doi, N. (1994). Centering in Japanese: a step towards better interpretation ofpronouns and zero-pronouns. In Proceedings of the 15th International Conference on Com-putational Linguistics (COLING-94), 1151–1156. 7, 8
Tanaka, I. (2000). Cataphoric personal pronouns in English news reportage. In Proceedings ofthe 3rd Discourse Anaphora and Anaphor Resolution Colloquium (DAARC-2000), 108–117.33, 34
Tapanainen, P. (1996). The constraint grammar parser CG-2. Department of General Linguistics, University of Helsinki, Publications, Vol. 27. 28
Tapanainen, P. & Järvinen, T. (1997). A non-projective dependency parser. In Proceedings of the 5th Conference on Applied Natural Language Processing (ANLP-97), 64–71. 13, 28
Tesnière, L. (1959). Éléments de syntaxe. Klincksieck, Paris. 28
Theune, M., Hielkema, F. & Hendriks, P. (2006). Performing aggregation and ellipsis using discourse structures. In Research on Language & Computation, vol. 4, 353–375, Springer, Berlin, Heidelberg, New York. 7
Wilder, C. (1997). Some properties of ellipsis in coordination. In Studies in universal grammar and typological variation, 59–107, John Benjamins, Amsterdam. 17
Witten, I.H. & Frank, E. (2005). Data mining: practical machine learning tools and techniques. Morgan Kaufmann, London, 2nd edn. 26, 38, 41, 44, 46
Yeh, C. & Chen, Y. (2003a). Using zero anaphora resolution to improve text categorization. In Proceedings of the 17th Pacific Asia Conference on Language, Information and Computation (PACLIC-03), 423–430. 2, 7, 8
Yeh, C. & Chen, Y. (2003b). Zero anaphora resolution in Chinese with partial parsing based on centering theory. In Proceedings of the International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE-03), 683–688. 7, 8
Yeh, C. & Chen, Y. (2007). Topic identification in Chinese based on centering model. Journal of Chinese Language and Computing, 17, 83–96. 2, 7, 8
Yeh, C. & Mellish, C. (1997). An empirical study on the generation of zero anaphors in Chinese. Computational Linguistics, 23, 171–190. 7, 8
Yoshimoto, K. (1988). Identifying zero pronouns in Japanese dialogue. In Proceedings of the 12th International Conference on Computational Linguistics (COLING-88), 779–784. 7, 8
Zhao, S. & Ng, H. (2007). Identification and resolution of Chinese zero pronouns: a machine learning approach. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP/CoNLL-07), 541–550. 2, 7, 8