Elliphant:
A Machine Learning Method
for Identifying Subject Ellipsis
and Impersonal Constructions
in Spanish
Luz Rello
Main advisor: Ruslan Mitkov
Co-advisor: Xavier Blanco
A thesis submitted for the degree of Erasmus Mundus International Master in
Natural Language Processing and Human Language Technology
Research Group in Computational Linguistics, University of Wolverhampton
Laboratori fLexSem, Universitat Autonoma de Barcelona
June 2010
In memory of Juan Rello
“And then again,” Grandpa Joe went on speaking very slowly
now so that Charlie wouldn’t miss a word, “Mr Willy Wonka
can make marshmallows that taste of violets, and rich caramels
that change colour every ten seconds as you suck them, and little
feathery sweets that melt away deliciously the moment you put
them between your lips. He can make chewing-gum that never
loses its taste, and sugar balloons that you can blow up to enor-
mous sizes before you pop them with a pin and gobble them up.
And, by a most secret method, he can make lovely blue birds’
eggs with black spots on them, and when you put one of these in
your mouth, it gradually gets smaller and smaller until suddenly
there is nothing left except a tiny little pink sugary baby bird
sitting on the tip of your tongue.”
Charlie and the Chocolate Factory, Roald Dahl
Abstract
This thesis presents Elliphant, a machine learning system for classifying
Spanish subject ellipsis as either referential or non-referential. Linguisti-
cally motivated features are incorporated in a system which performs a
ternary classification: verbs with explicit subjects, verbs with omitted but
referential subjects (zero pronouns), and verbs with no subject (impersonal
constructions). To the best of our knowledge, this is the first attempt to
automatically identify non-referential ellipsis in Spanish. In order to en-
able a memory-based strategy, the eszic Corpus was created and manually
annotated. The corpus is composed of Spanish legal and health texts and
contains more than 6,800 annotated instances. A set of 14 features was
defined and a separate training file was created, containing the instances
represented as vectors of feature values. The training data was used with
the Weka package and a set of optimization experiments was carried out
to determine the best machine learning algorithm to use, the parameter op-
timization, the most effective combinations of features, the optimal number
of instances needed to train the classifier, and the optimal settings for clas-
sifying instances occurring in different genres. A comparative evaluation
of Elliphant with Connexor’s Machinese Syntax parser shows the superior-
ity of our system. The overall accuracy of the system is 86.9%. Given the
fairly frequent elision of subjects in Spanish, this system is useful: the
classification of elliptic subjects as referential or non-referential can improve
the accuracy of Natural Language Processing applications that require zero
anaphora resolution, inter alia information extraction, machine translation,
automatic summarization and text categorization.
Acknowledgements
First, my sincere acknowledgements to Prof. Ruslan Mitkov for providing
everything that can be asked of a supervisor: constant trust, support and
encouragement from the very beginning until the end of this thesis.
There are three other persons without whom this work would not have
been possible (alphabetically): Thank you, Ricardo Baeza-Yates, for your
brilliant ideas; thank you, Richard Evans, for your guidance; and thank
you, Pablo Suarez, for helping the project to become a reality.
I would like to acknowledge the Computational Linguistics Group at the
University of Wolverhampton, where my collaboration during the first year
brought its first results, especially Iustina Ilisei and Naveed Afzal.
Thank you for the assistance received at the Universitat Autonoma de Barcelona
from my co-advisor Xavier Blanco and from Jose María Brucart and Joaquim
Llisterri.
I am indebted to the Grupo de Investigacion en Tratamiento Automatico
del Lenguaje Natural of Universitat Pompeu Fabra for their support and
feedback during this last semester, particularly to Gabriela Ferraro and Leo
Wanner.
Finally, thank you to Igor Mel’cuk and Ignacio Bosque for resolving doubts
and to Sang Yoon Kim and Ana Suarez Fernandez for their help throughout
the annotation process.
These master’s studies were supported by a “La Caixa” grant (Becas de “La
Caixa” para estudios de master en Espana. Convocatoria 2008).
Contents
1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Results
1.4 Thesis Outline
2 Related Work
2.1 NLP Approaches
2.1.1 NLP Approaches to Zero Pronouns
2.1.2 NLP Approaches to Identifying Non-referential Constructions
2.2 Linguistic Approaches
2.2.1 Linguistic Approaches to Subject Ellipsis
2.2.2 Linguistic Approaches to Non-referential Ellipsis
3 Detecting Ellipsis in Spanish
3.1 Classification
3.1.1 Explicit Subjects: Non-elliptic and Referential
3.1.2 Zero Pronouns: Elliptic and Referential
3.1.3 Impersonal Constructions: Elliptic and Non-referential
3.2 Machine Learning Approach
3.2.1 Building the Training Data
3.2.2 Annotation Software and Annotation Guidelines
3.2.3 Features
3.2.4 Purpose Built Tools
3.2.5 The WEKA Package
4 Evaluation
4.1 Experiments
4.1.1 Method Selected: K* Algorithm
4.1.2 Learning Curve
4.1.3 Most Effective Features
4.1.4 Genre Analysis
4.2 Comparative Evaluation
5 Conclusions and Future Work
5.1 Main Observations
5.2 Future Research
References
List of Figures
2.1 Types of subject ellipsis (Brucart, 1999) and types of verbs (Real Academia Espanola, 2009).
3.1 An example of the output of Connexor’s Machinese Syntax parser for Spanish.
3.2 Screenshot of the annotation program interface.
3.3 An example of the Weka Explorer interface.
4.1 eszic training data learning curve for accuracy.
4.2 eszic training data learning curve for precision, recall and f-measure.
4.3 Learning curve for accuracy, recall and f-measure of the classes.
4.4 Learning curve for accuracy, recall and f-measure in relation to the number of instances of each class.
List of Tables
3.1 eszic Corpus: tokens, sentences and clauses.
3.2 eszic Corpus: number of instances per class.
3.3 eszic Corpus annotation tags.
3.4 Features: definitions and values.
4.1 Weka classifiers’ accuracy (20% of the eszic training set).
4.2 eszic training data evaluation with K* -B 40 -M a.
4.3 Leave-one-out and ten-fold cross-validation comparison.
4.4 Selected features by Weka Attribute Selection methods.
4.5 Classification using the selected feature groups: accuracy.
4.6 Extrinsic parser features classification results.
4.7 Intrinsic parser features classification results.
4.8 Single feature omission classifications: accuracy.
4.9 Legal and health genres comparative evaluation.
4.10 Cross-genre training and testing evaluation.
4.11 Elliphant eszic training data results.
4.12 Machinese eszic training data results.
4.13 Elliphant Legal eszic training results.
4.14 Machinese Legal eszic training results.
4.15 Elliphant Health eszic training data results.
4.16 Machinese Health eszic training data results.
Chapter 1
Introduction
This introduction explains the three primary motivations for this research
(Section 1.1) and its objectives (Section 1.2), and briefly describes its outcomes,
which include the results of an evaluation of the implemented system and the
publications produced over the course of the study (see Section 1.3). The overall
structure of the thesis is presented in Section 1.4.
1.1 Motivation
Three reasons motivated the decision to choose this research topic and to develop
a tool, Elliphant, that identifies zero pronouns (referential elliptic
subjects) and impersonal constructions (non-referential elliptic subjects)
in Spanish.
The three justifications for this work are: (1) the highly frequent occurrence of zero
pronouns in Spanish; (2) identification of zero pronouns is a prerequisite for anaphora
resolution in Spanish and also for other Natural Language Processing (nlp) applica-
tions; and (3) this challenge had not yet been fully addressed in the field. The system
presented in this dissertation represents the first attempt to automatically identify
non-referential ellipsis in Spanish.
Since Spanish is a pro-drop language (Chomsky, 1981), subject ellipsis is a recur-
ring phenomenon. It was noted that 26% of the 6,878 cases annotated in the corpus
exploited in this work have an elliptic subject, while only 3% of them occur in imper-
sonal constructions. The topic of subject ellipsis has been addressed in previous work
on other pro-drop languages such as Japanese (Okumura & Tamura, 1996), Chinese
(Zhao & Ng, 2007), Korean (Lee & Byron, 2004) and Russian (Kibrik, 2004). The
related topic of the identification of non-referential pronouns has been addressed in
non-pro-drop languages such as English (Evans, 2001) and French (Danlos, 2005).
The identification of zero pronouns and non-referential impersonal constructions is
necessary for anaphora resolution: zero pronouns (zero anaphora) must be identified
before they can be resolved, and their identification first requires that they be
distinguished from non-referential constructions (Mitkov, 2010).
Coreference and anaphora resolution, and in particular zero anaphora resolution,
have been found to be crucial in a number of nlp applications. These include, but
are not limited to, information extraction (Chinchor & Hirschman, 1997), machine
translation (Peral & Ferrandez, 2000), automatic summarisation (Steinberger et al.,
2007), text categorisation (Yeh & Chen, 2003a), topic recognition (Yeh & Chen, 2007),
salience identification (Iida et al., 2009) and word sense disambiguation (Kawahara &
Kurohashi, 2004). Moreover, there is additional research showing that zero pronoun
identification is useful in order to make further developments in centering theory (Matsui, 1999), for named entity recognition (Hirano et al., 2007), for the investigation of
convergence universals in translation (Corpas Pastor et al., 2008) and to discriminate
predicate-argument structure (Imamura et al., 2009).
Finally, the difficulty in detecting non-referential pronouns has been acknowledged
since computational resolution of anaphora was first attempted (Bergsma et al., 2008)
and this task is currently needed in nlp for Spanish. The need for automatic tools
able to detect ellipticals has been stated by Recasens & Hovy (2009), who note that
their application would improve existing methods for zero anaphora resolution in Span-
ish (Ferrandez & Peral, 2000). One particular contribution of the current research is
the recognition of Spanish impersonal constructions which, following from the litera-
ture review presented in Chapter 2, appears not to have been addressed before in the
literature.
1.2 Objectives
The goal of the fully automatic method presented in this dissertation (Elliphant) is
to identify zero pronouns (referential elliptic subjects) and impersonal constructions
(non-referential elliptic subjects) in Spanish. In order to accomplish this objective, it is
also necessary to identify the explicit subjects that occur in complementary
distribution with them in the subject position. For this reason, the identification
of explicit subjects was carried out using a learning-based method, leading to a
ternary classification that covers all the elements (elliptic and explicit,
referential and non-referential) of the subject position in the clause. These three
classes are explicit subjects, zero pronouns and impersonal constructions.
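To make the task concrete, the ternary classification can be pictured as assigning each clause's main verb one of three labels on the basis of a vector of feature values. The sketch below is illustrative only, not the thesis implementation; the feature names and values are invented for the example:

```python
# Illustrative sketch of the ternary classification task: each clause's
# main verb is one instance, represented as feature values plus a label.
from dataclasses import dataclass

CLASSES = ("explicit_subject", "zero_pronoun", "impersonal_construction")

@dataclass
class VerbInstance:
    features: dict  # hypothetical feature names -> values
    label: str      # one of CLASSES

# "Llueve." ((It) is raining) has no subject at all: an impersonal construction.
instance = VerbInstance(
    features={"lemma": "llover", "person": 3, "number": "sg",
              "clause_type": "main"},   # invented example features
    label="impersonal_construction",
)
print(instance.label)  # prints impersonal_construction
```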
1.3 Results
The results obtained by the Elliphant system and the level of performance that it
reaches are encouraging since this tool not only identifies zero pronouns and imper-
sonal constructions but also outperforms a dependency parser (Connexor’s Machinese
Syntax) in identifying explicit subjects as well as elliptic subjects. A series of ex-
periments undertaken with the algorithm has enabled discovery of the most effective
features for use in the classification tasks. The performance results obtained for the
identification of impersonal constructions are, according to the survey of previous work
carried out in Chapter 2, the first presented for this task in the literature.
The classification results obtained by the algorithm were presented in Rello et al.
(2010b); that paper, however, undertook no further investigation into the efficacy
of the features used, which is presented in Rello et al. (2010a).
Two previous studies contributed to the design of the Elliphant system: one
concerning the distribution of zero pronouns (Rello & Illisei, 2009a) and the other
presenting a rule-based method for their identification (Rello & Illisei, 2009b).
It should be noted, however, that despite their contribution, Elliphant differs
considerably from these initial studies in terms of methodology (corpus used,
linguistic criteria exploited, and the overall approach) and the classification
task itself (classes to be identified). Overall, the Elliphant system represents a
considerable advancement on those works.
1.4 Thesis Outline
The remainder of this thesis is structured in four chapters. Chapter 2 provides a lit-
erature review of nlp approaches (see Section 2.1) to zero pronouns (Section 2.1.1)
and identification of non-referential expressions (Section 2.1.2). The review also covers
work in the field of Linguistics, including approaches to referential and non-referential
subject ellipsis (Section 2.2.1 and 2.2.2). Chapter 3 describes the methodology embod-
ied by the Elliphant system. Firstly, the classification task (see Section 3.1) and an
explanation of each of the classes is presented: explicit subjects (Section 3.1.1), zero
pronouns (Section 3.1.2) and impersonal constructions (Section 3.1.3). Secondly, the
machine learning method (see Section 3.2) is described, beginning with the compilation
of the corpus (Section 3.2.1), the guidelines established and the software developed to
facilitate annotation of the corpus by human annotators (Section 3.2.2), a description
of the features (see Section 3.2.3) derived from the corpus and the purpose built tools
(Section 3.2.4) implemented to generate the training data exploited by the machine
learning package, Weka (Section 3.2.5). Elliphant is evaluated in Chapter 4. A set of experiments (Section 4.1) was carried out to determine the method and parameter values
which work best for these classification tasks (Section 4.1.1), its learning curves (Sec-
tion 4.1.2) and the most effective groups of features (Section 4.1.3). A comparative
evaluation of the Elliphant system with an existing parser is presented in Section 4.2.
Finally, in Chapter 5, conclusions are drawn and plans for future work are considered.
Chapter 2
Related Work
Both the nlp and linguistics literature address referential and non-referential subject
ellipsis. Although the nlp literature is directly related to this dissertation in terms of
objectives and methodology, more general literature in linguistics contributes various
means by which classes of subject ellipsis and annotation criteria can be established.
Related work in nlp (see Section 2.1) on this topic can be classified as (a) liter-
ature related to zero pronouns (Section 2.1.1), which is mainly concerned with their
identification, resolution and generation, and (b) literature related to the identification
of non-referential constructions (Section 2.1.2).
The literature in linguistics (Section 2.2) concerning different types of ellipsis, in
which both zero pronouns (see Section 2.2.1) and non-referential constructions (see
Section 2.2.2) are included, is focused on the definition, delimitation and description of
their use in language.
2.1 NLP Approaches
The nlp literature in this area broadly concerns two topics, namely zero pronouns
(Section 2.1.1) and non-referential constructions (Section 2.1.2). The number and
variety of studies in the first group are considerably larger than in the second.
Both topics are mainly related to coreference and anaphora resolution systems, as
the resolution of zero pronouns (zero anaphora) implies their prior identification.
That identification in turn requires that zero pronouns be distinguished from
non-referential constructions (Mitkov, 2010).
While undertaking this literature review, no specific studies on the identification
of non-referential constructions in Spanish were found, although this has been
identified as a necessary task (Ferrandez & Peral, 2000; Recasens & Hovy, 2009) in anaphora
and coreference resolution. For this reason it is expected that the method presented in
this dissertation will complement current Spanish pronoun resolution systems.
2.1.1 NLP Approaches to Zero Pronouns
A zero pronoun is the resultant “gap” (zero anaphor) where zero anaphora or ellipsis
occurs, when an anaphoric pronoun is omitted but is nevertheless understood (Mitkov,
2002). In linguistics, zero pronouns are also referred to as null subjects, empty subjects,
elliptic subjects, elided subjects, tacit subjects, understood subjects and non-explicit sub-
jects, among others. In the nlp literature such omitted subjects are broadly denoted as
zero pronouns. Some linguistic studies also make use of the term “zero pronoun” which
is not equivalent to the computational concept. The Meaning-Text Theory (mtt) con-
siders a zero pronoun in subject position to be a non-argumental impersonal subject
(Mel’cuk, 2006):
Llueve.
(It) is raining.
while in Generative Grammar, following the Zero Hypothesis (Kratzer, 1998), a zero
pronoun can have phonetic content (full pronoun) or not (null pronoun). In this theory,
the concept of zero pronoun has to do only with its lack of lexical content in contrast
to lexical pronouns (Alonso-Ovalle & D’Introno, 2000). In this work, a zero pronoun
(Mitkov, 2002) corresponds to an omitted subject (Real Academia Espanola, 2009)
in Spanish.
Zero pronouns become crucial when processing any pro-drop language (Chomsky,
1981) –also known as null subject languages– since zero anaphora is fairly frequent in
such languages. By way of example, of the 6,827 annotated cases in our corpus, 26%
of them have an omitted subject.
The current literature review indicates that the following are the pro-drop
languages on which related work on zero pronoun processing has been carried out:
– Japanese (Hirano et al., 2007; Iida et al., 2006, 2009; Imamura et al., 2009; Isozaki
& Hirao, 2003; Kawahara & Kurohashi, 2004; Matsui, 1999; Mori & Nakagawa,
1996; Murata et al., 1999; Nakagawa, 1992; Nakaiwa, 1997; Nakaiwa & Ikehara,
1992; Nakaiwa & Shirai, 1996; Nomoto & Yoshihiko, 1993; Okumura & Tamura,
1996; Sasano et al., 2008; Seki et al., 2002; Takada & Doi, 1994; Yoshimoto, 1988);
– Chinese (Hu, 2008; Peng & Araki, 2007a,b; Yeh & Chen, 2003a,b, 2007; Yeh &
Mellish, 1997; Zhao & Ng, 2007);
– Korean (Han, 2004; Lee & Byron, 2004; Lee et al., 2005);
– Spanish (Barreras, 1993; Corpas Pastor, 2008; Corpas Pastor et al., 2008; Ferrandez
& Peral, 2000; Peral, 2002; Peral & Ferrandez, 2000; Rello & Illisei, 2009a,b); and
– Russian (Kibrik, 2004).
These studies of zero pronouns address a variety of topics. Depending on their goal,
the literature on zero pronouns can be divided into the following classes:
– Zero pronoun classification or annotation: (Han, 2004; Kibrik, 2004; Lee &
Byron, 2004; Lee et al., 2005; Rello & Illisei, 2009a);
– Zero pronoun identification (Corpas Pastor, 2008; Corpas Pastor et al., 2008;
Nakaiwa, 1997; Rello & Illisei, 2009b; Yoshimoto, 1988);
– Resolution of zero pronouns, including their prior identification (Barreras,
1993; Ferrandez & Peral, 2000; Hu, 2008; Isozaki & Hirao, 2003; Kawahara &
Kurohashi, 2004; Murata et al., 1999; Nakaiwa & Shirai, 1996; Nomoto & Yoshi-
hiko, 1993; Okumura & Tamura, 1996; Peng & Araki, 2007b; Sasano et al., 2008;
Seki et al., 2002; Yeh & Chen, 2003b; Zhao & Ng, 2007); and
– Zero pronoun generation (Peral, 2002; Peral & Ferrandez, 2000; Theune et al.,
2006; Yeh & Mellish, 1997).
Other nlp applications where zero pronouns are taken into consideration are: ma-
chine translation (Nakaiwa & Ikehara, 1992; Nakaiwa & Shirai, 1996; Peng & Araki,
2007a; Peral, 2002; Peral & Ferrandez, 2000); named entity recognition (Hirano et al.,
2007); summarisation (Steinberger et al., 2007); text categorisation (Yeh & Chen,
2003a); topic identification (Yeh & Chen, 2007) and identifying salience in text (Iida
et al., 2009); and word sense disambiguation (Kawahara & Kurohashi, 2004).
Further research topics where zero pronoun identification is useful are: predicate-
argument structure discrimination (Imamura et al., 2009); for further developments
in centering theory (Matsui, 1999) such as improved interpretation of zero pronouns
(Takada & Doi, 1994); or for the investigation of convergence universals in translation
(Corpas Pastor, 2008; Corpas Pastor et al., 2008).
Other studies address specific cases of zero pronouns, such as those whose referents
take the semantic role of experiencer (Nakagawa, 1992), zero pronouns in relation to
conditional constructions (Mori & Nakagawa, 1996), or descriptions of the syntactic
patterns in which zero pronouns are used (Iida et al., 2006), among others.
In terms of methodology, rule-based, machine learning, and a variety of other ap-
proaches have been taken toward zero pronoun identification and resolution:
– Rule-based approaches (Barreras, 1993; Corpas Pastor et al., 2008; Ferrandez
& Peral, 2000; Hu, 2008; Kawahara & Kurohashi, 2004; Kibrik, 2004; Matsui,
1999; Mori & Nakagawa, 1996; Murata et al., 1999; Nakagawa, 1992; Nakaiwa &
Ikehara, 1992; Nakaiwa & Shirai, 1996; Nomoto & Yoshihiko, 1993; Peral, 2002;
Peral & Ferrandez, 2000; Rello & Illisei, 2009b; Yeh & Chen, 2003a,b, 2007; Yeh
& Mellish, 1997; Yoshimoto, 1988);
– Machine learning approaches (Hirano et al., 2007; Iida et al., 2006, 2009; Kawa-
hara & Kurohashi, 2004; Peng & Araki, 2007b; Zhao & Ng, 2007);
– Hybrid methods combining rules and learning algorithms (Isozaki & Hirao, 2003);
– Probabilistic models (Sasano et al., 2008; Seki et al., 2002); and
– other techniques such as the exploitation of parallel corpora (Nakaiwa, 1997).
Although it is clear that machine learning methods perform better than other ap-
proaches when identifying non-referential expressions (Boyd et al., 2005), there is some
debate about which approach brings optimal performance when applied in anaphora
resolution systems (Mitkov, 2002).
In Spanish, the most influential work on this topic is the Ferrandez and Peral
algorithm for zero pronoun resolution (Ferrandez & Peral, 2000) together with their
previous related work (Ferrandez et al., 1998, 1999). Their implementation of a zero
pronoun identification and resolution module forms part of a system known as the Slot
Unification Parser for Anaphora resolution (supar) (Ferrandez et al., 1999).
Although substantially related, the work described in this dissertation differs both
in form and in aim from this previous research on Spanish (Ferrandez & Peral, 2000).
Firstly, their definition of zero pronouns is broader since it is suited to a different
purpose: the zero class includes not only those zero signs whose referent lies in previous
clauses (anaphoric, according to their classification) and those that lie outside the text
(exophoric), but also those that occur after the verb (cataphoric). Here, it is considered
that those subjects that are within the clause, irrespective of whether they appear before
or after the verb, belong to the explicit subject class.
Secondly, Ferrandez & Peral (2000) take a rule-based approach while the system de-
scribed in this dissertation performs the classification using an instance-based learner.
Additionally, their rules are based on partial parsing, while some of the features ex-
ploited by the Elliphant system make use of information obtained from an analysis
of our corpus by a deep dependency parser. Ferrandez & Peral (2000) tested their
approach to zero pronoun identification and resolution using 1,599 cases, while the ma-
chine learning approach presented in this dissertation was tested on a corpus containing
6,827 classified verbal instances.
Finally, they do not provide a method for the identification of non-referential zero
pronouns. They also make no overt mention of automatic classification of zero pronouns
of the anaphoric or cataphoric kind (Ferrandez & Peral, 2000).
Despite the similarities of Ferrandez & Peral’s (2000) work to the approach described
in this dissertation, the fact that they adopt a different definition of zero pronouns
means that a comparison with the method described in the current work is not feasible
(Section 4.2).
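The instance-based classification described above can be made concrete with a toy sketch. The classifier below is a minimal memory-based learner in the spirit of K*: it stores every training instance and labels a new case by its nearest stored neighbour. The features and instances are invented for illustration, and the simple feature-overlap distance stands in for K*'s entropy-based distance, which this sketch does not attempt:

```python
# Minimal memory-based (instance-based) classifier sketch.
def overlap_distance(a, b):
    """Count the features on which two instances disagree."""
    return sum(1 for f in a if a[f] != b.get(f))

def classify(instance, training_data):
    """Return the label of the nearest stored training instance (1-NN)."""
    nearest = min(training_data, key=lambda ex: overlap_distance(instance, ex[0]))
    return nearest[1]

# Invented toy instances: (feature dict, class label) pairs.
training = [
    ({"explicit_np": "yes", "person": 3, "verb_type": "plain"},   "explicit"),
    ({"explicit_np": "no",  "person": 3, "verb_type": "plain"},   "zero"),
    ({"explicit_np": "no",  "person": 3, "verb_type": "weather"}, "impersonal"),
]
new_case = {"explicit_np": "no", "person": 3, "verb_type": "weather"}
print(classify(new_case, training))  # prints impersonal
```

The design point this illustrates is that a memory-based learner needs no hand-written rules: adding annotated instances is enough to change its behaviour, which is why corpus size and feature choice dominate the experiments in Chapter 4.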
This study improves on previous work by the current author (Rello & Illisei, 2009b),
differing from it in the design of the classification and in the methodology. In
Rello & Illisei (2009b), a binary classification as either elliptic-subject or
non-elliptic-subject was made by means of a rule-based method which applies only to
zero pronouns, whilst the present study offers a ternary classification which covers
all the possible instances of subject position in Spanish. Moreover, while zero
pronouns were annotated in Rello & Illisei (2009b), in the present
study the zero pronouns themselves were left unmarked. Instead, the main verb of
each clause is annotated and classified into one of three types. The baseline rule-based
algorithm described in Rello & Illisei (2009b) was based on the zero pronoun
identification methodology developed in Corpas Pastor et al. (2008), which treats
every clause that does not have an explicit subject as containing a zero pronoun.
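That baseline rule can be sketched in a few lines. The clause representation (a dict with a "subject" key) is invented here for illustration; the second example shows why impersonal constructions produce false positives under such a rule:

```python
# Sketch of the baseline rule: any clause lacking an explicit subject
# is assumed to contain a zero pronoun (i.e. is labelled elliptic).
def baseline_label(clause):
    """Binary rule-based classification: elliptic vs. non-elliptic subject."""
    return "non-elliptic" if clause.get("subject") else "elliptic"

clauses = [
    {"verb": "comió",  "subject": "Oscar"},  # explicit subject
    {"verb": "llueve", "subject": None},     # impersonal, yet labelled elliptic
]
print([baseline_label(c) for c in clauses])  # prints ['non-elliptic', 'elliptic']
```

The weather verb "llueve" has no subject at all, referential or otherwise, yet the rule labels it elliptic just like a genuine zero pronoun; this is exactly the class of error that motivates the ternary classification.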
2.1.2 NLP Approaches to Identifying Non-referential Constructions
The identification of non-referential pronouns1 is a crucial step in coreference (Boyd
et al., 2005; Mitkov, 2010) and anaphora resolution systems (Mitkov, 2001, 2002). In
comparison to the work addressing zero pronouns, previous research on this topic is
fairly limited, and, as implied by this survey of related work, the approach described in
this dissertation is the first attempt to automatically identify impersonal constructions
in Spanish.
The literature describing approaches to the identification of non-referential expres-
sions is focused on:
– Identification of pleonastic it in English (Denber, 1998; Lappin & Leass, 1994;
Paice & Husk, 1987). Work by Evans (2000, 2001) is exploited by an anaphora
resolution system in Mitkov et al. (2002). Further relevant studies include
Bergsma et al. (2008), Boyd et al. (2005), Clemente et al. (2004), Gundel et al.
(2005), Lambrecht (2001), Li et al. (2009), Muller (2006) and Ng & Cardie (2002); and
– Identification of expletive pronouns in French (Danlos, 2005).
Nevertheless, in those languages where approaches to the identification of non-
referential expressions have been implemented, there is actually an explicit word with
some grammatical information (a third person pronoun) in the text, which is non-
referential (Mitkov, 2010). By contrast, in Spanish, non-referential expressions are not
realised by expletive or pleonastic pronouns but by a certain kind of ellipsis. For this
reason, it is easy to wrongly identify them as zero pronouns, which are referential. For
example, pleonastic pronouns such as:
1In previous work these pronouns have also been referred to as pleonastic, expletive, non-anaphoric,
and non-referential pronouns.
(a.1) (It)1 must be stated that Oskar behaved impeccably.
(b.1) (It) rains, (Il) pleut, (Es) regnet.
(c.1) (It)’s three o’clock.
are all elided in Spanish, resulting in the following non-referential impersonal construc-
tions:
(a.2) Se dice que Oscar se comporto impecablemente.
(b.2) Llueve.
(c.2) Son las tres en punto.
A sizable proportion of the false positives obtained in previous work on identifying
zero pronouns were caused by such non-referential impersonal constructions (Rello &
Illisei, 2009b). Ferrandez & Peral (2000) noted that an inability to identify verbs used
in impersonal constructions has a negative effect on the performance of their anaphora
resolution algorithm2, while in Recasens & Hovy (2009, p. 41) the need for a tool to
identify ellipsis is observed:
“In contrast with previous work, many of the features relied on gold standard
annotations, pointing out the need for automatic tools for ellipticals detection and
deep parsing.”
Four approaches have been implemented to identify non-referential expressions and
described in the literature:
– Rule-based approaches (Danlos, 2005; Denber, 1998; Lappin & Leass, 1994; Paice
& Husk, 1987);
1 In this work, explicit subjects in the examples are presented in italics, zero
pronouns in the examples are represented by the symbol Ø, while in the English
translations the subjects which are elided in Spanish are marked with parentheses.
Impersonal constructions in the examples are not explicitly indicated using a
symbol (see Section 3.1).
2 The other two reasons given for the low success rate in the identification of
verbs with no subject are the lack of semantic information and the inaccuracy of
the grammar used (Ferrandez & Peral, 2000).
– Machine learning approaches (Bergsma et al., 2008; Boyd et al., 2005; Clemente
et al., 2004; Evans, 2000, 2001; Mitkov et al., 2002; Muller, 2006; Ng & Cardie,
2002);
– A Web-based approach (Li et al., 2009); and
– Descriptive studies from contextual (Lambrecht, 2001) and intonational points of
view (Gundel et al., 2005).
Paice & Husk (1987) introduce a rule-based method for identifying non-referential
it while Lappin & Leass (1994) and Denber (1998) describe rule-based components of
their pronoun resolution systems which detect non-referential uses of it. Mitkov's
first anaphora resolution algorithm did not incorporate an approach for detecting
pleonastic it (Mitkov, 1998), while more recent versions of mars (Mitkov's Anaphora
Resolution System) use Evans's (2001) machine learning system to detect pleonastic
it (Mitkov et al., 2002).
Instance-based learning approaches are used for identifying pleonastic it in English,
while the only approach for the identification of expletive pronouns in French employs
a rule-based methodology (Danlos, 2005).
Evans (2001)1 describes the first attempt to use a machine learning method to classify
pleonastic it into seven types, while Boyd et al. (2005) present a linguistically
motivated classification of non-referential it into four types.
A comparison replicating the approaches developed by Paice & Husk (1987) and
Evans (2001) with the system implemented by Boyd et al. (2005) corroborates the
finding that machine learning outperforms rule-based approaches (Boyd et al., 2005).
Further, it is pointed out that rule-based methods are limited due to their reliance on
lists of verbs and adjectives commonly used in the patterns that they exploit, which
can make them less portable and more difficult to adapt to new texts. Nevertheless, the
basic grammatical patterns are still reasonably consistent indicators of non-referential
occurrences of it (Boyd et al., 2005).
Certain aspects of the work described in this dissertation were inspired by the
methodology of the machine learning approaches for the identification of pleonastic it
specifically by Evans (2001) and Boyd et al. (2005).
1This method is currently incorporated as a component of mars (Mitkov et al., 2002).
Because the occurrence of non-referential zero pronouns is not very common1, the
size of our corpus was increased in order to achieve a sufficient number of instances
for each class. The training data exploited by the Elliphant system contains 6,827
instances, of which 179 are non-referential examples. In Evans (2001), 3,171 instances
of it were classified into seven classes, while in Boyd et al. (2005), 2,337 examples
were classified into four classes.
Our corpus was analysed, as in the approach described by Evans (2001), using a
functional dependency parser, Connexor's Machinese Syntax2 (Connexor Oy, 2006b;
Tapanainen & Jarvinen, 1997). Moreover, some of the features used in the Elliphant
system, such as the lemmas and the parts of speech (pos) of the preceding and
following material, were also implemented in Evans's (2001) approach.
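The use of the lemmas and pos tags of the surrounding tokens as features can be illustrated with a small sketch. The token representation, tag names and window size below are invented for illustration and are not the system's actual configuration:

```python
# Sketch: build a symbolic feature vector for a verb from the lemmas and
# POS tags of the tokens around it (window size and fields are illustrative).
def context_features(tokens, verb_index, window=2):
    """tokens: list of (form, lemma, pos) triples from a parsed sentence."""
    features = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # the verb itself is described by other features
        i = verb_index + offset
        if 0 <= i < len(tokens):
            _, lemma, pos = tokens[i]
        else:
            lemma, pos = "<pad>", "<pad>"  # position falls outside the sentence
        features[f"lemma[{offset:+d}]"] = lemma
        features[f"pos[{offset:+d}]"] = pos
    return features

sentence = [("Se", "se", "PRON"), ("dice", "decir", "V"),
            ("que", "que", "CS"), ("vendrá", "venir", "V")]
feats = context_features(sentence, 1)
# feats["lemma[-1]"] == "se"; feats["pos[+1]"] == "CS"
```

Such symbolic vectors, one per finite verb, are the kind of input an instance-based learner can compare directly.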
In contrast to previous work, the K* algorithm (Cleary & Trigg, 1995) was found
to provide the most accurate classification in the current study. Other approaches have
employed various classification algorithms, including K-nearest neighbors in TiMBL
(Boyd et al., 2005; Evans, 2001) and JRip in Weka (Muller, 2006).
2.2 Linguistic Approaches
Literature related to ellipsis in linguistic theory has served as one basis for establishing
the linguistically motivated classes and the annotation criteria in the current work. The
linguistically related work on this topic is focused on the definition and description of
the use of ellipsis in natural language and the limits of that use.
In Spanish, the use of ellipsis is very widespread. It is a phenomenon that occurs
in a wide range of contexts and is therefore much discussed in the field of linguistics.
To illustrate, some controversial topics in linguistics that pertain to instances of ellipsis
found in our corpus include: the establishment of different types of ellipsis, the identifi-
cation of impersonal sentences (non-referential expressions), the definition of particular
syntactic categories which can function as subjects, and the intricate differentiation of
reflex passive with elliptic subject from impersonal sentences in different varieties of
Spanish.
The concepts used in both types of literature (nlp and linguistic) to distinguish
different types of ellipsis and zero signs are extremely broad and are widely debated in
1 Only 3% of the verbs found in our corpus (see Section 3.2.1) have non-referential elliptic subjects.
2 http://www.connexor.eu/technology/machinese/demo/syntax/.
2. Related Work 2.2 Linguistic Approaches
the linguistic literature. Elements of the elliptic typology used in this work which were
derived from the literature are presented next, while the linguistic and formal criteria
used to identify the chosen classes, which served as the basis for the corpus annotation,
including a typology of the examples found, are explained in Sections 3.1.1, 3.1.2, 3.1.3
and 3.2.2.
2.2.1 Linguistic Approaches to Subject Ellipsis
The study of the omission of some element from the sentence or the discourse in natural
language has been a challenge not only in computing but also in Spanish linguistics
itself –from the Renaissance period through to the present day.
The first Western grammarian who treated ellipsis as a grammatical phenomenon
(Hernandez Terres, 1984) was Francisco Sanchez de las Brozas, El Brocense (1523-
1600) (Sanchez de las Brozas, [1562] 1976, p. 317), who took the concept of ellipsis
from Apolonio Díscolo (Díscolo, [2nd century] 1987) and defined it as:
“La elipsis es la falta de una palabra o de varias en una construcción correcta [...].”
“Ellipsis is the omission of one or more items from a correct construction [...].”
This conception, in which grammar serves as a basis for a rational explanation of the
surface form of the language:
“No hay, pues, ninguna duda de que se debe buscar la explicación racional de
las cosas, también de las palabras.” Sanchez de las Brozas ([1562] 1976) cited in
García Jurado (2007, p. 12)
“There is, then, no doubt that a rational explanation of things, and also of words,
must be pursued.”
later inspired the rational grammar of Port-Royal (Lancelot & Arnauld, [1660] 1980)
which was a precursor of Chomsky’s work (Chomsky, [1968] 2006, p. 5):
“One, particularly crucial in the present context, is the very great interest in the
potentialities and capacities of automata, a problem that intrigued the seventeenth-
century mind as fully as it does our own. [...] A similar realisation lies at the base
of Cartesian philosophy.”
In order to elide something, a meaning which is not expressed needs to be assumed. It
thus follows that ellipsis itself was one of the basic mechanisms to explain the transition
from D-structure to S-structure, becoming a central issue in generative grammar
(Brucart, 1987) from its original model, the Standard Theory (Chomsky, 1965), to its
latest revisions (Chomsky, 1995).
Different branches of linguistics have considered ellipsis from different points of
view:
– Semantic: traditionally, the criteria used to define ellipsis were semantic or logical
(Bello, [1847] 1981) and prescriptive (Real Academia Espanola, 2001);
– Descriptive and explicative (Brucart, 1999);
– Distributional: although structuralism rejected the study of units which were not
codified in the signifier or phonetic realization, some classifications of ellipsis were
presented (Francis, 1958; Fries, 1940);
– Pragmatic: in diverse pragmatic paradigms the role of ellipsis is crucial, as it
influences the interpretation of text. As a result, it has given rise to several lines
of investigation, such as implications conveyed through ellipsis (Grice, 1975), ellipsis
studied as a factor that activates textual coherence (Halliday & Hasan, 1976), and
indefinite ellipsis, in which a word can stand for one or more sentences in a restrictive
code (Shopen, 1973); and
– Cognitive: in terms of ellipsis processing by the brain (Streb et al., 2004, p. 175):
“Ellipses and pronouns/proper names are processed by distinct mechanisms
being implemented in distinct cortical cell assemblies.”
or as part of the explanation of the language faculty (Chomsky, 1965).
The terminology and linguistic explanations relevant to this work consider both zero
pronouns and non-referential expressions to be different types of ellipsis (Brucart, 1999).
Four kinds of Spanish subject ellipsis are distinguished (Brucart, 1999, p. 2851).
This classification is presented in correlation with a verb classification (Real Academia
Espanola, 2009), which is related to the omitted subject classification presented in
Bosque (1989).
The classification of Spanish omitted subjects presented in Bosque (1989) is: omitted
subjects of finite verbs, which can be referential or non-referential, and omitted
subjects of non-finite verbs, which can be argumental or non-argumental. Argumental
omitted subjects can in turn be referential or non-referential. In that study,
non-argumental omitted subjects are claimed not to exist (Bosque, 1989), although in
Brucart (1999) non-argumental omitted subjects are considered a type of ellipsis (Type
4 in Figure 2.1).
Types of subject ellipsis (Brucart, 1999):

(1) Omitted subject in a clause containing a finite verb:
Ø No vendrán.
(They) won't come.
Ø Dicen que vendrá.
(They) say he will come. / (It is) said he will come.

(2) Argumental impersonal subject:
En este estudio Ø se trabaja bien.
In this room (one) can work properly.

(3) Non-argumental impersonal subject:
Ø Nieva.
(It) is snowing.

(4) Omitted subject in a non-finite verb clause:
Juan intentaba (Ø decírselo a María).
John tried ((John) to tell Mary).

Types of verbs depending on their subject (Real Academia Española, 2009):
– Verb with no argumental subject;
– Verb with an argumental omitted subject which is represented by the pronoun se;
– Verb with an argumental omitted subject with an unspecific interpretation;
– Verb with an argumental omitted subject with a specific interpretation.

Figure 2.1: Types of subject ellipsis (Brucart, 1999) and types of verbs (Real Academia
Española, 2009).
The first type of ellipsis (see (1) in Figure 2.1) represents omitted subjects and
corresponds to zero pronouns in the nlp literature. An omitted subject is the result
of nominal ellipsis where a non-phonetically/orthographically realized lexical element –
omitted subject– which is needed for the interpretation of the meaning and the structure
of the sentence, is omitted since it can be retrieved from its context (Brucart, 1999).
Despite their lack of phonetic realization, omitted subjects are part of the clause (Real
Academia Espanola, 2009).
Two types of syntactic ellipsis or lexical-syntactic ellipsis can be distinguished:
verbal ellipsis and nominal ellipsis. These types of subject ellipsis can affect the whole
argument of the verb or be partial and just affect the head of the argument (Brucart,
1999). As detailed in Section 3.2.2, the annotation of our corpus includes both complete
noun phrase ellipsis and noun phrase head ellipsis. Note that nominal ellipsis affects
not only the subjects but also the other arguments of the verb –datives, direct
objects or infinitive objects– although their ellipsis is subject to more restrictive
conditions (Brucart, 1999). However, this fact is not acknowledged in some prior
approaches in nlp (Ferrandez & Peral, 2000, p. 166):
“While in other languages, zero-pronouns may appear in either the subject’s or
the object’s grammatical position, (e.g. Japanese), in Spanish texts, zero-pronouns
only appear in the position of the subject.”
The interpretation of Type 1 ellipsis can be definite and specific (Brucart, 1999)
or indefinite (Real Academia Espanola, 2009). Since omitted subjects are referential,
they can be lexically retrieved (Gomez Torrego, 1992). An example of an omitted
subject is:
(d) Las leyes no tendrán efecto retroactivo si Ø no dispusieren lo contrario.
The laws will not have retroactive effect unless (they) provide otherwise.
The nature of the omitted subject [Ø] itself has been discussed in the linguistic literature
(Real Academia Espanola, 2009). While recent approaches in linguistics agree that the
omitted subject has a pronominal nature (elided pronoun), others contend that the
subject is expressed in the morphology of the verb inflection.
In Generative Grammar subject ellipsis has been understood as a (1) pro-form
(Beavers & Sag, 2004; Chung et al., 1995; Fiengo & May, 1994; Wilder, 1997) or as (2)
a syntactic realization without a phonetic constituent (Merchant, 2001; Ross, 1967).
The Meaning-Text Theory (mtt) contends that ellipsis occurs in the SSyntS
(surface-syntactic structure) when the elliptic element is deleted during the transition
from SSyntS to DMorphS (deep-morphological structure) (or vice versa), and an empty
node stands in for the representation of the elliptic element. This procedure for
treating ellipses is also proposed in mtt for the description of all coordinate structures
(Mel'cuk, 2003).
The identification of omitted subjects is not problematic when the zero pronoun
belongs to the first or second person, but when it is a third person omitted subject, the
reference can be anaphoric or cataphoric (Type 1 ellipsis in Figure 2.1) or non-specific1.
A generic or non-specific interpretation can follow in some clauses with singular sec-
ond person and plural third person zero pronouns (Real Academia Espanola, 2009).
However, depending on discourse knowledge, the interpretation can alternate between
specific and non-specific in clauses which are formally identical, as the next example
shows:
(e) Ø Me han regalado un reloj. (In this example both interpretations, specific and non-
specific, are possible.)
(1) (They) gave me a watch. (When the agent referred to by “they” has been mentioned
previously in the discourse.)
(2) (I) was given a watch. (When no agent has been mentioned previously in the dis-
course.)
where the non-specific interpretation does not exclude a possible specific one (Real
Academia Espanola, 2009). Therefore, both groups of argumental subjects with specific
and non-specific interpretations are included in the same class.
2.2.2 Linguistic Approaches to Non-referential Ellipsis
On the other hand, Type 2 and Type 3 ellipsis listed in Figure 2.1 correspond to
non-referential expressions or impersonal sentences. Type 2 ellipsis is composed of
impersonal sentences containing the Spanish particle se, whose argumental omitted
subject always has an unspecific interpretation and is referred to using the pronoun se
(Mendikoetxea, 1994). Type 3 ellipsis corresponds to the set of sentences called
impersonal sentences. Although the types of impersonal constructions in Spanish are
heterogeneous, all of them share a lack of some properties of the subject (Fernandez
Soriano & Taboas Baylín, 1999). Some studies consider different kinds of Spanish
impersonality, e.g. semantic and syntactic impersonality (Gomez Torrego, 1992), while others
distinguish several semantic degrees of impersonality (Mendikoetxea, 1999).
1In journalistic headlines with an omitted subject, a non-specific interpretation can occur (Bosque,
1989) even in non-pro-drop languages such as English, French or German (Real Academia Espanola,
2009). Such non-specific interpretations can occur when the antecedent or referent was not previously
mentioned in the discourse.
Traditionally –from a semantic point of view– impersonal sentences have been con-
sidered to be those which cannot contain a subject, the agent of the action described
(Real Academia Espanola, 1977). This impersonality can be due either to the nature
of the verb,
(f) Llueve.
(It) rains.
or due to the speaker’s ignorance of the subject (Seco, 1988):
(g) Llaman a la puerta.
(Someone) is knocking at the door.
where the subject is unidentified and it is therefore impossible to assign a reference to
it (Bello, [1847] 1981).
The controversy of treating non-referential expressions as a type of ellipsis, given
that they cannot be lexically retrieved, has already been discussed (Gomez Torrego,
1992). While Brucart (1999) considers them a case of ellipsis, as do some Generative
Grammar approaches1, others (Bosque, 1989; Mel’cuk, 2006)2 consider that such elliptic
and non-referential subjects do not exist in language.
A descriptive point of view (Fernandez Soriano & Taboas Baylín, 1999) would
regard impersonal sentences as belonging to either of two main groups: (1) impersonal
sentences without a subject and (2) cases of impersonal verbs with the inherent feature
of not having a subject.
In the current dissertation, a prescriptive and descriptive approach (Real Academia
Espanola, 2009) to the consideration of impersonal sentences is taken (see Section
3.1.3).
Type 4 ellipsis (Brucart, 1999) in Figure 2.1 is ignored in our work. However, this
fourth type is much debated in the literature; for example, Head-Driven Phrase
Structure Grammar (Pollard & Sag, 1994) does not consider the infinitive subject to
be a null category (slash).
1 Generative Grammar explains these impersonal sentences by labeling the absence of the subject
with a pro-form which presents the same syntactic features as the subject although it has no
phonological realization. Following the Extended Projection Principle, this pro-form embodies all the
syntactic requirements of a subject except for its phonological realization (Chomsky, 1981).
2 mtt uses the concept of the zero sign to characterize elements whose signifier is empty and is by
no means realized as a perceptible phonetic pause (Mel'cuk, 2006).
Chapter 3
Detecting Ellipsis in Spanish
This chapter describes the methodology used in this study. The first step is to create
a linguistically motivated classification system (Section 3.1) for all instances of elliptic
and non-elliptic as well as referential and non-referential subjects. Since the machine
learning method requires training data, a corpus (the eszic Corpus) was compiled
(see Section 3.2.1) and a purpose-built tool for its annotation was developed, as were
guidelines (see Section 3.2.2). The third task consisted of implementing a method to
extract the features (Section 3.2.3) of instances from the corpus and create training
data (eszic training data; see Section 3.2.4). Finally, once the features of instances
are derived from a document they are exploited for classification by machine learning
using the Weka package (Section 3.2.5).
3.1 Classification
The first step is to create a classification system for all instances of subjects and
impersonal constructions. The subjects were divided according to two distinctions:
elliptic versus non-elliptic, and referential versus non-referential. These two distinctions
result in a ternary classification:
(1) Explicit subjects: non-elliptic and referential1;
(2) Zero pronouns: elliptic and referential2; and
1 Explicit subjects in the examples are presented in italics.
2 Zero pronouns in the examples are represented by the symbol Ø. In the English translations, the
subjects which are elided in Spanish are marked with parentheses.
3. Detecting Ellipsis in Spanish 3.1 Classification
(3) Impersonal constructions: elliptic and non-referential1.
A subject can be non-elliptic (explicit) or elliptic (omitted subject or zero pronoun).
A sign can be referential or non-referential. The distinction lies in the fact that, while
the former can be lexically retrieved, the latter cannot (impersonal construction).
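The way the two binary distinctions combine into the three classes can be made explicit with a minimal sketch (the function and the class labels are illustrative, not the system's internal representation):

```python
# Ternary classification from the two binary properties of a subject:
# elliptic (vs. explicit) and referential (vs. non-referential).
def classify_subject(elliptic: bool, referential: bool) -> str:
    """Map the two binary distinctions onto the three classes."""
    if not elliptic and referential:
        return "explicit subject"         # class 1: non-elliptic, referential
    if elliptic and referential:
        return "zero pronoun"             # class 2: elliptic, referential
    if elliptic and not referential:
        return "impersonal construction"  # class 3: elliptic, non-referential
    # The fourth combination does not occur in this scheme: Spanish realises
    # non-referential subjects only through ellipsis, never explicitly.
    raise ValueError("non-elliptic and non-referential: not a valid combination")
```

The missing fourth combination reflects the observation made in Chapter 2: unlike English expletive it, Spanish has no explicit non-referential subject pronoun.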
This treatment of the classification as ternary differs from previous work whose
division of subjects was binary: elliptic (zero pronoun) and non-elliptic, both referential
(Ferrandez & Peral, 2000; Rello & Illisei, 2009b) (see Section 2.1.1). In Evans (2001),
the sevenfold classification of pleonastic it is based on the type of referent, while in
Boyd et al. (2005), classification follows syntactic and semantic criteria (see Section
2.1.2).
In the following sections, each class is described. With regard to cases in which
classification can be controversial, different annotation criteria were applied (see Section
3.2.2).
3.1.1 Explicit Subjects: Non-elliptic and Referential
This class is the one to which explicit subjects belong. They are phonetically realised,
usually by a nominal group: a noun, a pronoun, a noun phrase (a), a free relative, a
semi-free relative or a substantival adjective (Real Academia Espanola, 2009).
(a) Las fuentes del ordenamiento jurídico español son la ley, la costumbre y los principios
generales del derecho.
The sources of the Spanish legal system are the law, the judicial custom and the general
principles of law2.
The syntactic positions of subjects can be pre-verbal or post-verbal. The occur-
rence of post-verbal subjects is restricted by some conditions (Real Academia Espanola,
2009).
(b) Carecerán de validez las disposiciones que contradigan otra de rango superior.
The dispositions which contradict the higher range ones will not be valid.
1 Impersonal constructions in the examples are not explicitly indicated using a symbol.
2 Unless otherwise specified, all the examples provided are taken from our corpus (Section 3.2.1).
Post-verbal subjects, as well as preverbal ones, are also found in passive construc-
tions and passive reflex constructions. As in active clauses, preverbal subjects without
a definite article are rare while post-verbal subjects without a definite article are more
frequent (Real Academia Espanola, 2009).
Projections of non-nominal categories such as clauses containing an infinitive or
a conjugated verb, interrogative indirect clauses, or indirect exclamative clauses, can
function as subjects (Real Academia Espanola, 2009).
(c) Corresponde a los poderes públicos promover las condiciones para que la libertad y la
igualdad del individuo y de los grupos en que se integra sean reales y efectivas.
It corresponds to the public authorities to promote the conditions so that individual and
group liberty and equality are real and effective.
3.1.2 Zero Pronouns: Elliptic and Referential
Class 2 is formed by elliptic but referential subjects called zero pronouns. An elliptic
subject is the result of a nominal ellipsis, where a non-phonetically realised lexical
element –elliptic subject– which is needed for the interpretation of the meaning and
the structure of the sentence, is omitted since it can be retrieved from its context (Brucart,
1999). Despite their lack of phonetic realisation, elliptic subjects are considered part
of the clause (Real Academia Espanola, 2009).
(d) La Constitución Españolai (title in text)
Øi Fue refrendada por el pueblo español el 6 de diciembre de 1978.
The Spanish Constitutioni (title in text)
(It)i was countersigned by the Spanish population on the 6th of December of 1978.
Elliptic subjects are considered to be a personal pronoun variant which is not pho-
netically realised (Real Academia Espanola, 2009). Where referential, they can be
lexically retrieved (Gomez Torrego, 1992). That is to say that they can be substituted
by explicit pronouns without changing or losing any of the meaning of the clauses in
which they occur.
The elision of the subject can affect not only the noun head, but also the entire
noun phrase (Brucart, 1999). The noun head can be omitted in Spanish when the
subject of which it is a part fulfills some structural requirements (Brucart, 1999). This
includes cases in which the subject is referential (Brucart, 1999). The processing of
these subjects has been addressed by the development of specific algorithms in previous
work (Ferrandez et al., 1997).
Ellipsis of the head of the noun phrase is only possible when a definite article occurs.
(e) El Ø que está obsesionado con que todo el mundo piensa mal es Javier.
The (one) who is obsessed with everyone thinking wrong is Javier.
The article possesses a referential value which could be either anaphoric or cataphoric
(Real Academia Espanola, 2009). Such examples of subjects with an elided head are
instances of semi-free relatives (Real Academia Espanola, 2009) and, as expected, they
are not as frequent in our corpus as elisions of the entire subject noun phrase.
3.1.3 Impersonal Constructions: Elliptic and Non-referential
In impersonal constructions, a subject that is both non-referential and elliptic is
claimed not to exist (Bosque, 1989)1.
The appearance of clauses containing zero pronouns and impersonal constructions
is similar. Class 3 is composed of impersonal constructions, which are formed by (1)
non-reflex impersonal clauses and (2) reflex impersonal clauses (impersonal clauses with se).
Impersonal clauses have no argumental subject. Since the subject does not exist,
it cannot be lexically retrieved by any means and no phonetic realisation of it can be
expected (Bosque, 1989). The following cases are considered to be impersonal sentences
(Real Academia Espanola, 2009):
– Non-reflex impersonal clauses denoting natural phenomena describing meteoro-
logical situations:
(f) Nieva.
(It) snows.
– Non-reflex impersonal clauses with verbs haber (to be), hacer (to do), ser (to
be), estar (to be)2, ir (to go) and dar (to give):
1 The existence of a non-phonetically realised element in subject position is postulated (see Section
2.2). While Generative Grammar defends their existence (pro-form), mtt does not (zero sign).
2 Depending on the verbal aspect, there are different Spanish verbs which correspond with the
English verb to be.
3. Detecting Ellipsis in Spanish 3.2 Machine Learning Approach
(g) En un kilogramo de gas hay tanta materia como en un kilogramo de sólido.
In a kilogram of gas (there) is the same amount of mass as in a kilogram of solid.
(Existential use of the verb haber).
– Non-reflex impersonal clauses with other verbs such as sobrar con (to be too
much), bastar con (to be enough) or faltar con (to lack), or the pronominal
unipersonal verb1 with a zero subject, such as tratarse de (to be about):
(h) Deberán adoptar las precauciones necesarias para su seguridad, especialmente
cuando se trate de niños.
Necessary measures should be taken, especially when (it) is a matter of children.
(i) Basta con tres sesiones.
(It) is enough with three sessions.
Verbs in such impersonal sentences (Gomez Torrego, 1992) are called lexical impersonal
verbs (Real Academia Espanola, 2009). Due to their lack of a subject, they are not easily
distinguished from verbs with omitted –but existing– subjects.
Secondly, reflex impersonal clauses have an omitted subject whose reference is non-
specific and cannot be lexically retrieved.
(j) Se estará a lo que establece el apartado siguiente.
(It) will be what is established in the next section.
These clauses are formed with the particle se. This particle also serves other syntactic
functions (reflexive pronoun, pronominal pronoun, reciprocal pronoun, etc.) in clauses
with an elided subject.
3.2 Machine Learning Approach
Our corpus was compiled and parsed in order to create training data (referred to as the
eszic training data) for use by a machine learning classification method as explained
in the next section.
A tool was developed for annotation of the corpus (see Section 3.2.2). Fourteen
features were proposed for the purpose of classifying instances of subjects (see Section
1A verb which is only conjugated in the third person.
3.2.3). The feature vectors, together with their manual classifications, were written to
a training file. A method for obtaining the values of those features for each instance
was implemented. The classification algorithm employed was the K* instance-based
learner available in the Weka package (Witten & Frank, 2005) (see Section 3.2.5).
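K* is an instance-based learner: a new instance is classified according to the classes of the training instances most similar to it. The following pure-Python sketch illustrates the instance-based idea only; it substitutes a simple feature-overlap similarity for K*'s entropy-based distance, and all feature names, values and training examples are invented for illustration:

```python
# Sketch of instance-based classification over symbolic feature vectors,
# in the spirit of the k-NN family (the actual system uses Weka's K*,
# whose distance measure is entropy-based; here plain overlap is used).
from collections import Counter

def similarity(a, b):
    """Number of feature positions on which two instances agree."""
    return sum(1 for x, y in zip(a, b) if x == y)

def knn_classify(instance, training_data, k=3):
    """Vote among the k training instances most similar to `instance`."""
    neighbours = sorted(training_data,
                        key=lambda ex: similarity(instance, ex[0]),
                        reverse=True)[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy instances: (feature vector, class). The four features are invented:
# (verb lemma, person/number, POS of the preceding token, clause has "se").
training = [
    (("decir",   "P3SG", "PREP",  "yes"), "impersonal construction"),
    (("llover",  "P3SG", "PUNCT", "no"),  "impersonal construction"),
    (("venir",   "P3PL", "PUNCT", "no"),  "zero pronoun"),
    (("ser",     "P3PL", "N",     "no"),  "explicit subject"),
    (("carecer", "P3PL", "N",     "no"),  "explicit subject"),
]
# e.g. knn_classify(("nevar", "P3SG", "PUNCT", "no"), training)
#      -> "impersonal construction"
```

The design choice behind instance-based learning is relevant here: no explicit rules about verbs or contexts need to be written, which is precisely the limitation of the rule-based approaches discussed in Section 2.1.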
3.2.1 Building the Training Data
The eszic training data used by the Elliphant system is obtained from the eszic corpus
created ad hoc. The corpus is named after its annotated content “Explicit Subjects,
Zero-pronouns and Impersonal Constructions”.
The corpus contains a total of 79,615 words (titles and sentences that do not contain
at least one finite verb are ignored), including 6,825 finite verbs. Of these verbs, 71%
have an explicit subject, 26% have a zero pronoun and 3% belong to an impersonal
construction. There is an average of 2.3 clauses per sentence with 11.7 words per clause
and 26.9 words per sentence.
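Assuming these percentages derive from the per-class counts reported in Table 3.2, they can be reproduced as follows:

```python
# Per-class instance counts from Table 3.2 and the resulting class
# distribution; note the strong imbalance towards explicit subjects.
counts = {
    "explicit subjects": 4855,
    "zero pronouns": 1793,
    "impersonal constructions": 179,
}
total = sum(counts.values())  # 6,827 instances
shares = {cls: round(100 * n / total) for cls, n in counts.items()}
# shares == {'explicit subjects': 71, 'zero pronouns': 26,
#            'impersonal constructions': 3}
```

The 3% figure for impersonal constructions is what motivated enlarging the corpus until each class had a sufficient number of instances.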
The corpus compiled to extract the training data is composed of seventeen docu-
ments, originally written in Spanish, and belonging to two genres: legal and health.
The legal texts1 are composed of laws taken from: (1) the Spanish Constitution
(whole text) (Constitucion Espanola, 1978), (2) the Law on Unfair Competition (whole
text) (Ley 3/1991, 1991), (3) Penal Code (first book) (Ley Organica 10/1995, 1995), (4)
Law for Administrative-contentious Jurisdiction (title 1, articles 1 to 17) (Ley 29/1998,
1998), (5) Civil Code (first book, until title V) (Codigo Civil, 1889), (6) Law for Univer-
sities (introduction) (Ley Organica 6/2001, 2001), (7) Law for Associations (chapter
1) (Ley Organica 1/2002, 2002) and (8) Law for Advertisements (whole text) (Ley
29/2005, 2005).
The nine health texts are taken from psychiatric papers compiled from a Spanish
digital journal of psychiatry, Psiquiatría.com2: (1) Cinema as a tool for teaching
personality disorders (Lopez Ortega, 2009), (2) Efficacy, functionality, and empow-
erment for phobic pathology treatment, in the context of specialised public Mental
Health Services (García Losa, 2008), (3) Emotions in Psychiatry (Sevillano Arroyo
& Ducret Rossier, 2008), (4) And what about siblings? How to help TLP3 siblings
1 All the legal texts are available online at: http://noticias.juridicas.com/base_datos/
2 The full-text articles from the Psiquiatría.com Journal are available online at:
http://www.psiquiatria.com/.
3 Trastorno límite de la personalidad (Borderline Personality Disorder).
eszic Corpus    Number of Tokens    Number of Sentences    Number of Clauses
Legal text 1 9,972 941 600
Legal text 2 1,147 47 56
Legal text 3 17,960 1,035 1,181
Legal text 4 3,578 189 191
Legal text 5 12,456 746 891
Legal text 6 3,962 130 219
Legal text 7 2,159 131 136
Legal text 8 5,219 291 282
Health text 1 2,753 110 270
Health text 2 11,339 658 1,028
Health text 3 1,854 47 140
Health text 4 1,937 84 124
Health text 5 2,183 93 148
Health text 6 1,568 63 210
Health text 7 1,296 69 89
Health text 8 1,687 53 127
Health text 9 12,441 525 1,394
Total 93,511 5,212 7,086
Table 3.1: eszic Corpus: tokens, sentences and clauses.
(Molina Lopez, 2008), (5) Factorial analysis of personal attitudes in secondary educa-
tion (Pintor García, 2007), (6) The influence of the concept of self and social competence
in children’s depression (Aldea Munoz, 2006), (7) Depression as a mental health prob-
lem in Mexican teenagers (Balcazar Nava et al., 2005), (8) Relationship difficulties in
couples (Dıaz Morfa, 2004), and (9) A case of psychological intervention for children’s
depression (Aldea Munoz, 2003).
Table 3.2 presents the number of instances found in the eszic corpus by class.
Two columns illustrate the number of instances by genre (legal and health) within the
corpus.
Number of instances per class    Legal eszic Corpus    Health eszic Corpus    eszic Corpus
Explicit subjects 2,739 2,116 4,855
Zero pronouns 619 1,174 1,793
Impersonal constructions 71 108 179
Total 3,429 3,398 6,827
Table 3.2: eszic Corpus: number of instances per class.
The text containing instances to be classified was analysed using Connexor's Machi-
nese Syntax (Järvinen & Tapanainen, 1998; Järvinen et al., 2004; Tapanainen & Järvi-
nen, 1997)1. This dependency parser returns information on the pos, morphology and
lemma of words in a text, as well as the dependency relations between those
words. The parsing system employed uses Functional Dependency Grammar (FDG)
(Järvinen & Tapanainen, 1998; Tapanainen & Järvinen, 1997) and combines (Järvinen
et al., 2004) a lexicon and a morphological disambiguator based on constraint grammar
(Tapanainen, 1996). When performing fully automatic parsing it is necessary to ad-
dress word-order phenomena. The formalism used in the parser is capable of referring
simultaneously both to the order in which syntactic dependencies apply and to linear
order. This feature is an extension of Tesnière's theory (Tesnière, 1959), which does
not formalise linearisation. In the parsed output the linear order is preserved while the
structural order requires that functional information is not coded in the canonical order
of the dependents. The functional information is represented explicitly using arcs with
labels of syntactic functions, as shown in Figure 3.1 (Järvinen et al., 2004).
1 A demo of Connexor's Machinese Syntax is available at: http://www.connexor.eu/technology/machinese/.
Figure 3.1: An example of the output of Connexor's Machinese Syntax parser for
Spanish.
The dependency information allows the identification of complex constituents in a
text. For example, complex noun phrases can be identified by transitively grouping
together all the words dependent on a noun head (Evans, 2001). Additional software
was implemented to perform this and allow identification of clauses and noun phrases
which are required for implementation of some of the features used in our classification
(see Section 3.2.4).
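The transitive grouping described above can be sketched as follows; the (id, word, head) token representation and the toy sentence are illustrative assumptions, not the thesis software itself.

```python
# A minimal sketch of grouping a noun head with all of its direct and
# transitive dependents, given a parse represented as (id, word, head_id)
# triples. Token ids and words are illustrative, not from the eszic Corpus.
def phrase_of(head_id, tokens):
    """Return head_id plus the ids of all tokens transitively dependent on it."""
    children = {}
    for tok_id, _word, head in tokens:
        children.setdefault(head, []).append(tok_id)
    group, stack = set(), [head_id]
    while stack:
        current = stack.pop()
        group.add(current)
        stack.extend(children.get(current, []))
    return sorted(group)

# "la depresion ocupa ..." -> the noun phrase headed by "depresion"
tokens = [(1, "la", 2), (2, "depresion", 3), (3, "ocupa", 0)]
print(phrase_of(2, tokens))  # [1, 2]
```

The same traversal, started from a finite verb instead of a noun head, yields the clause-like units mentioned above.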
The eszic training data makes use of the three types of information returned by
Connexor’s Machinese Syntax parser (Connexor Oy, 2006a,b):
1. morphological tags generated for verbs –singular (SG), third person (3P), indica-
tive (IND), among many others– including the pos tags –verb (V), noun (N),
preposition (PREP), etc.–;
2. syntactic tags –main element (@MAIN), nominal head (@NH), auxiliary verb (@AUX),
etc.–; and
3. syntactic relations –subject (subj), verb chained (v-ch), determiner (det)–. The
lexical information (LEMMA) given by the parser was also taken into consideration
in the set of features.
3.2.2 Annotation Software and Annotation Guidelines
A program was written in Python (see Figure 3.2) to extract all occurrences of finite
verbs from the eszic Corpus and to assign to each the vector of feature values described
in Section 3.1. Two annotators were presented with the clause in which each verb
appears and prompted to classify the verb into one of thirteen classes.
Figure 3.2: Screenshot of the annotation program interface.
Although the goal is to develop training data for a classifier making a ternary
classification of the subject position elements, an annotation scheme which gives more
detail about each instance was used. This annotation scheme served a dual
purpose: to get the most from the annotation task, since the instances occur in a
wide range of constructions, and to produce a more detailed annotation that could be
useful in future work. The thirteen classes are grouped into three types: (1) explicit
subjects, (2) zero pronouns or (3) impersonal constructions. In Table 3.3, the linguistic
motivation for each of the annotated classes is shown in correlation with the types to
which they belong. For each annotation class, in addition to the two criteria that
are crucial for this study –elliptic vs. non-elliptic and referential vs. non-referential– a
combination of syntactic, semantic and discourse knowledge can also be encoded during
the annotation. This knowledge includes information about whether the subject is
nominal or non-nominal, whether it is an active or a passive subject or whether the
subject refers to an active participant in the action, state or process denoted by the
verb.
The annotation program extracts from the parsed eszic Corpus the clause in which
each finite verb occurs. As Connexor’s Machinese Syntax parser does not explicitly
perform clause splitting but only sentence splitting, a method was developed to ac-
complish the clause identification task. The method identifies the finite verbs in the
corpus and transitively groups together the words directly and indirectly dependent
upon them1. The identified clauses are then presented to the annotators who are asked
to label the verb.
For each verb classified by an annotator, an xml tag (i.e. <subject>ZERO</subject>)
with its class is added in the token line of the parsed eszic Corpus where the verb oc-
curs. An example (k) of an annotated verb whose subject is a zero pronoun follows:
(k) <token id="w53"><text>entró </text><lemma>entrar </lemma>
<depend head="w51">mod </depend><tags><syntax>@MAIN
</syntax><morpho>V IND PRET SG P3 </morpho><subject>ZERO
</subject> </tags></token>
This manual classification, together with the features (see Section 3.2.3) are written to
the eszic training file.
1 A clause splitter module was implemented to extract the features from the eszic Corpus (see
Section 3.2.4).
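A minimal sketch of how such a `<subject>` tag might be inserted into a parsed token, using example (k)'s token; the `annotate` helper is hypothetical and the thesis program may differ in detail.

```python
# A hedged sketch: record an annotator's decision as a <subject> element
# inside the token's <tags> element. The token string mirrors example (k).
import xml.etree.ElementTree as ET

token_xml = ('<token id="w53"><text>entró</text><lemma>entrar</lemma>'
             '<depend head="w51">mod</depend><tags><syntax>@MAIN</syntax>'
             '<morpho>V IND PRET SG P3</morpho></tags></token>')

def annotate(token_xml, label):
    """Return the token XML with a <subject> tag holding the chosen class."""
    token = ET.fromstring(token_xml)
    subject = ET.SubElement(token.find("tags"), "subject")
    subject.text = label
    return ET.tostring(token, encoding="unicode")

annotated = annotate(token_xml, "ZERO")
print("<subject>ZERO</subject>" in annotated)  # True
```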
eszic Corpus Annotation Tags

Linguistic information: Phonetic Realization (Elliptic noun phrase, Elliptic noun
phrase head), Syntactic category (Nominal subject), Verbal Diathesis (Active),
Semantic interpretation (Active participant), Discourse (Referential subject).

Elliphant Classes   Linguistic characteristics             ENP  ENPH  Nom  Act  Part  Ref
Class 1             Explicit subject                        –    –    +    +    +    +
(Explicit           Reflex passive subject                  –    –    +    +    –    +
subject)            Passive subject                         –    –    +    –    –    +
Class 2             Omitted subject                         +    –    +    +    +    +
(Zero               Omitted subject head                    –    +    +    +    +    +
pronoun)            Non-nominal subject                     –    –    –    +    +    +
                    Reflex passive omitted subject          +    –    +    +    –    +
                    Reflex passive omitted subject head     –    +    +    +    –    +
                    Reflex passive non-nominal subject      –    –    –    +    –    +
                    Passive omitted subject                 +    –    +    –    –    +
                    Passive non-nominal subject             –    –    –    –    –    +
Class 3             Reflex impersonal clause (with se)      –    –    n/a  –    n/a  –
(Impersonal         Impersonal construction (without se)    –    –    n/a  +    n/a  –
construction)

ENP = Elliptic noun phrase; ENPH = Elliptic noun phrase head; Nom = Nominal
subject; Act = Active; Part = Active participant; Ref = Referential subject.

Table 3.3: eszic Corpus annotation tags.
Annotating explicit and elliptic subjects as well as impersonal constructions in Span-
ish is not a trivial task. Guidelines were established for the annotation of borderline
instances whose classification is a frequent source of disagreement between annotators.
The following text presents some of these borderline cases that belong to the three
types of finite verb classes, together with the criteria adopted for their annotation.
When distinguishing explicit subjects, in addition to nouns, there are other syntactic
categories which may arguably function as heads of subjects. In the case of adverbial
and prepositional categories, it was decided that they should be considered subjects if
they can be focalised (Real Academia Española, 2009).
(l) De acuerdo con la Organización Mundial de la Salud, la depresión ocupa el cuarto lugar
entre las enfermedades más incapacitantes y aproximadamente de 100 a 200 millones de
personas la padecen.
According to the World Health Organization, depression is ranked fourth among
the most disabling illnesses, and approximately 100 to 200 million people
suffer from it.
While conditional clauses could be considered subjects, in this work an alternative
analysis is followed. Under this approach, a sentence with a conditional clause func-
tioning as subject is considered to contain a zero pronoun, as its elliptic subject can
be retrieved from the preceding discourse (Real Academia Española, 2009). Neverthe-
less, no examples were found of conditional clauses functioning as subjects in the eszic
corpus used in this dissertation.
The correct classification of zero pronouns is also a source of disagreement between
annotators as it may be argued that some instances with postponed non-nominal sub-
jects (see example (m) below) should be interpreted as cataphoric zero pronouns.
In contrast to anaphora, in cataphora the cataphoric expression is situated before
the nominal group to which it points (Real Academia Española, 2009). Tanaka (2000)
and Mitkov (2002) point out that there is some scepticism about the concept of cat-
aphora in the NLP literature. For example, Kuno (1972) asserts that there is no genuine
cataphora in its literal sense, as the referent of a seemingly cataphoric pronoun must
already be mentioned in the preceding discourse and, therefore, is predictable when
a reader encounters the pronoun. This viewpoint was refuted by Carden (1982) and
Tanaka (2000) who describe empirical data which shows cases of genuine cataphora
where the pronoun is the first mention of its referent in the discourse (Carden, 1982;
Tanaka, 2000). Although some examples of genuine cataphora were found in their cor-
pus (Tanaka, 2000), none were found in the eszic Corpus except for occurrences of the
elision of noun heads where the antecedent is postponed, as in example (e).
The annotation guidelines developed for the current work considered these cases
which involve postponed clauses as non-nominal subjects.
(m) Artículo 46.
No pueden contraer matrimonio:
Los menores de edad no emancipados.
Los que estén ligados con vínculo matrimonial.
Article 46.
(They) cannot get married:
The non-emancipated minors.
The ones who are already married.
Finally, the borderline cases in impersonal constructions are debated in Spanish. The
decision of how to classify reflex impersonal clauses containing se is frequently a diffi-
cult one to make due to the ambiguity of these instances. For example, in the sentence
Se secaron (see example (n) below), the particle se has four possible semantic interpre-
tations in Spanish (Real Academia Española, 2009). In these cases, the decision taken
by the annotator depends on the meaning given by the context.
(n) Se secaron (Particle se = reflexive pronoun)
(They) dried (themselves).
Se secaron (Particle se = reciprocal pronoun)
(They) dried (each other).
Se secaron (Particle se = pronominal pronoun, and there is an elliptic subject which does
not have control over the action, for instance, the trees.)
The trees got dried.
Se secaron (Particle se = reflex passive, in which the referent of the subject, for instance
some people, would have to perform the described action of their own free will over an
object, for instance, the clothes.)
(They) dried (the clothes).
There can be ambiguity between reflex passives containing a zero pronoun and imper-
sonal constructions in which the object is not human (o).
(o) Se firmará el acuerdo.
Ø will sign the agreement.
In such instances, the annotation criterion followed is to annotate them as reflex passive
clauses containing a zero pronoun.
3.2.3 Features
Fourteen features were proposed in order to classify instances according to the types
presented in Section 3.1. The values (see Table 3.4) for the features were derived from
information provided both by Connexor’s Machinese Syntax (Connexor Oy, 2006b)
parser, which processed the eszic Corpus, and a set of lists. An additional program
was implemented in order to extract the values of features for every instance in the
corpus (see Section 3.2.4). These values were used to produce a training vector for each
instance. For a detailed explanation of the feature values see Section 3.2.4.
For the purpose of description, it is convenient to describe each of the features as
broadly belonging to one of ten classes, detailed below.
1 PARSER: the presence or absence of a subject in the clause, as identified by the
parser. It was observed (Rello & Illisei, 2009b) that the analysis returned by Con-
nexor’s Machinese Syntax is particularly inaccurate when identifying coordinated
subjects, subjects containing prepositional modifiers, and appositions occurring
between commas (see example (p) below). Other common cases of parsing error
involve subjects which are distant from the finite verb in the clause. Features 7
and 8 were proposed in an effort to take into consideration potential candidates
for the subject.
(p) La publicidad, por su propia ındole, es una actividad que atraviesa las fronteras.
Advertising, due to its own nature, is an activity which goes beyond boundaries.
2 CLAUSE: the clause types considered are: main clauses, relative clauses, clauses
starting with a complex conjunction, clauses starting with a simple conjunction,
and clauses introduced using punctuation marks (commas, semicolons, etc). A
Feature         Definition                         Value
1  PARSER       Parsed subject                     True, False
2  CLAUSE       Clause type                        Main, Rel, Imp, Prop, Punct
3  LEMMA        Verb lemma                         Parser's lemma tag
4  NUMBER       Verb morphological number          SG, PL
5  PERSON       Verb morphological person          P1, P2, P3
6  AGREE        Agreement in person,               FTFF, TTTT, FFFF, TFTF, TTFF,
                number, tense and mood             FTFT, FTTF, TFTT, FFFT, TTTF,
                                                   FFTF, TFFT, FFTT, FTTT, TFFF,
                                                   TTFT
7  NHPREV       Previous noun phrases              Number of noun phrases
                                                   preceding the verb
8  NHTOT        Total noun phrases                 Number of noun phrases
                                                   in the clause
9  INF          Infinitives                        Number of infinitives
                                                   in the clause
10 SE           Particle se                        se, no
11 A            Preposition a                      True, False
12 POSpre       Four parts of speech               292 different values combining
                preceding the verb                 the parser's pos tags, e.g.
                                                   @HN, @CC, @MAIN, etc.
13 POSpos       Four parts of speech               280 different values combining
                following the verb                 the parser's pos tags, e.g.
                                                   @HN, @CC, @MAIN, etc.
14 VERBtype     Type of verb: copulative,          CIPX, XIXX, XXXT, XXPX, XXXI,
                impersonal, pronominal,            CIXX, XXPT, XIPX, XIPT, XXXX,
                transitive and intransitive        XIXI, CXPI, XXPI, XIPI, XXEX
Table 3.4: Features: definitions and values.
method was implemented to identify these different types of clause as the parser
does not explicitly mark the boundaries of clauses within sentences (see Section
3.2.4).
3 LEMMA: lexical information extracted from the parser: the lemma of the finite
verb.
4-5 NUMBER, PERSON: morphological information features of the verb: its
grammatical number (singular or plural) and its person (first, second, or third
person).
6 AGREE: feature which encodes the tense, mood, person, and number of the
verb in the clause, and its agreement in person, number, tense, and mood with
the preceding verb in the sentence and also with the main verb of the sentence.
When a finite verb appears in a subordinate clause, its tense and mood can assist
in recognition of these features in the verb of the main clause and help to enforce
some restrictions required by this verb, especially when both verbs share the same
referent as subject.
7-9 NHPREV, NHTOT, INF: the candidates for the subject of the clause are
represented by the number of noun phrases in the clause that precede the verb,
the total number of noun phrases in the clause, and the number of infinitive verbs
in the clause.
10 SE: this is a binary feature encoding the presence or absence of the particle se
in close proximity to the verb. When se occurs immediately before or after the
verb or with a maximum of one token (see example (q) below) lying between the
verb and itself, this is considered “close proximity.”
(q) No podrá sacarse una ventaja indebida de la reputación de una marca.
(It) is not allowed to take unfair advantage of a brand's reputation.
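The close-proximity test for se can be sketched as follows; the tokenisation and the `se_feature` helper are illustrative assumptions, not the thesis implementation.

```python
# A hedged sketch of the SE feature: "se" counts as close to the finite
# verb when at most one token separates them. Token lists are illustrative.
def se_feature(tokens, verb_index):
    """Return "se" if the particle occurs within one token of the verb, else "no"."""
    lo = max(0, verb_index - 2)
    hi = min(len(tokens), verb_index + 3)
    window = tokens[lo:verb_index] + tokens[verb_index + 1:hi]
    return "se" if "se" in window else "no"

# "No podrá sacarse ..." tokenised with "se" split off the infinitive
print(se_feature(["no", "podra", "sacar", "se"], 1))  # se
print(se_feature(["la", "publicidad", "es", "una", "actividad"], 2))  # no
```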
11 A: this is a binary feature encoding the presence or absence of the preposition
a in the clause. The distinction between passive reflex clauses with zero
pronouns and impersonal constructions sometimes relies on the appearance of the
preposition a (to, for, etc.). For instance, example (r) is a passive reflex clause
containing a zero pronoun while example (s) is an impersonal construction.
(r) Se admiten los alumnos que reúnan los requisitos.
(They) accept the students who fulfill the requirements.
(s) Se admite a los alumnos que reúnan los requisitos.
(It) is accepted for the students who fulfill the requirements.
12-13 POSpre, POSpos: the pos of eight tokens, that is, the four words preceding and
the four words following the instance1.
14 VERBtype: the verb is classified as copulative (yes/no), as a verb with an im-
personal use (yes/no), as a pronominal verb (yes/no), and as a transitive verb
(yes/no/both).
3.2.4 Purpose Built Tools
As training data is required in order to exploit the methods distributed in the Weka
package (Witten & Frank, 2005), a method was implemented to extract the values of
the previously described features for instances occurring in the eszic Corpus. For each
instance (each annotated finite verb) a new line is written in the training data file
with values for the fourteen features separated by commas, together with the manual
classification of the vector, using the standard CSV (comma-separated values) format.
The values of features 7-9 are numerical while the values of the remaining features are
nominal (i.e. symbolic).
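The training-file serialisation described above might look like this in outline; the feature values shown are illustrative placeholders, not values taken from the eszic training data.

```python
# A minimal sketch of serialising one instance as a comma-separated
# training line: fourteen feature values followed by the manual class.
import csv, io

def write_instance(writer, features, label):
    """Append one training line: the feature values plus the manual class."""
    writer.writerow(features + [label])

buffer = io.StringIO()
writer = csv.writer(buffer)
features = ["False", "Main", "entrar", "SG", "P3", "TTTT",
            "1", "2", "0", "no", "False", "@NH @CC @MAIN @NH",
            "@NH @CC @MAIN @NH", "XXXT"]
write_instance(writer, features, "ZERO")
print(buffer.getvalue().strip())
```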
To extract the features, ad hoc software was implemented in Python. The program
exploits morphological and syntactic information, dependency relations reported by the
parser, and lists of verbs grouped by their syntactic and morphological properties (e.g.
transitivity, pronominal use, etc.).
The method implemented includes the following purpose built tools which are de-
scribed below. The description includes information on the particular features whose
values are computed using the tools.
1 Clause splitter module (CLAUSE): since Connexor’s Machinese Syntax (Con-
nexor Oy, 2006a) does not provide any information about the clause boundaries
within sentences, this clause splitter module is required. Each clause is built by
identifying finite verbs in a sentence and then searching for signals that indicate
the boundaries of the clause (relative pronouns, conjunctions, punctuation marks,
etc.). In theory, each clause could be built using dependency information given by
the parser by grouping together all the words dependent on the finite verb. How-
ever, this strategy was not used in order to avoid parsing errors in the dependency
information reported by the parser. Errors of this type are especially common
1This set of features can be regarded as useful for identifying non-nominal it (Evans, 2001).
when long sentences are parsed using Connexor’s Machinese Syntax. The Clause
splitter module also identifies the type of clause in which the finite verb occurs.
The feature attributes corresponding to the type of clause are:
1.1 Main (Main): when the finite verb belongs to the main clause.
1.2 Relative (Rel): when the finite verb belongs to a relative clause. A list of relative
pronouns was used to identify this type of clause (e.g. que (that), cuyo (whose),
quien (who), etc.).
1.3 Improper conjunction (Imp): when the finite verb belongs to a clause starting with
an improper conjunction. A list of improper conjunctions was used to identify the
value of this attribute (e.g. porque (because), luego (so), aunque (although), etc.).
1.4 Proper conjunction (Prop): when the finite verb belongs to a clause starting with
a proper conjunction. A list of proper conjunctions was used (e.g. y, e (and), o, u
(or), ni (neither), pero (but) and sino (but rather)).
1.5 Punctuation marks (Punct): when the clause in which the finite verb occurs is
preceded by a punctuation mark (‘.’, ‘,’, ‘:’, ‘;’, ‘?’, ‘!’, ‘"’, ‘-’, ‘(’ and ‘)’).
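A hedged sketch of how the clause-type attribute could be assigned from the first token of a clause; the word lists are small illustrative samples of the full inventories described above.

```python
# Illustrative samples of the lists used by the clause splitter module.
RELATIVES = {"que", "cuyo", "quien"}
IMPROPER = {"porque", "luego", "aunque"}
PROPER = {"y", "e", "o", "u", "ni", "pero", "sino"}
PUNCT = {".", ",", ":", ";", "?", "!", "-", "(", ")"}

def clause_type(first_token, is_main_clause=False):
    """Map the clause-opening token to one of the five CLAUSE values."""
    if is_main_clause:
        return "Main"
    word = first_token.lower()
    if word in RELATIVES:
        return "Rel"
    if word in IMPROPER:
        return "Imp"
    if word in PROPER:
        return "Prop"
    if word in PUNCT:
        return "Punct"
    return "Main"

print(clause_type("aunque"))  # Imp
print(clause_type("que"))     # Rel
```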
2 Noun phrase module (NHPREV, NHTOT): in order to obtain the subject
candidates, this module identifies and counts the noun phrases that precede and
follow the finite verb in the clause. As is the case for the clause splitter, this
module exploits dependency information returned by the parser (Connexor Oy,
2006a).
3 Counter (NHPREV, NHTOT, INF): this module is used to determine the
total number, in the clause, of noun phrases (nhprev, nhtot) and infinitival
forms (inf).
4 Tag taker (PARSER, LEMMA, NUMBER, PERSON, A, POSpre,
POSpos): these Python functions process the attributes of the XML tags output
by the parser (eszic Corpus) to generate a set of features for the eszic train-
ing data. A function generates a binary value that indicates whether or not the
finite verb has a dependent subject (parser). A function consults the lemma
of the verb and takes it as the value for feature (lemma). Other functions ex-
ploit morphological information obtained by the parser such as the number of the
finite verb (number), which can be either singular (SG) or plural (PL), or the
morphological person of the finite verb (person), which can be first, second or
third person (P1, P2, P3). Another function identifies whether the preposition a
occurs in the clause (a). This information is used as the values for the features.
Finally, there is another method which obtains the pos of the four words
that precede the instance in the clause (pospre) and the four words that follow
it (pospos).
5 Agreement module (AGREE): this module checks whether the verb used in
the clause agrees (true, T) or disagrees (false, F) in tense and mood, and in person
and number with the main verb that occurs in the sentence1 and the previous
verb occurring within the sentence. This agreement information is combined into
one symbolic feature, such as TTTT (with respect to the verb used in the clause,
the first T denotes agreement in number and person with the main verb of the
sentence, the second T denotes agreement in tense and mood with the main verb
of the sentence, the third T denotes agreement in number and person with the
previous verb in the sentence and the fourth T denotes agreement in tense and
mood with the previous verb in the sentence) or TTFF (when there is agreement
in between the verb in the clause and the main sentence verb but no agreement
with the previous clause verb). There are sixteen possible combinations of true
(T) and false (F) values.
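The four-flag agreement string could be computed along these lines; the dictionary representation of a verb's morphology is an assumption made for the sketch.

```python
# A minimal sketch of the AGREE feature: four T/F flags comparing the
# clause verb with the sentence's main verb and with the previous verb.
def agree_feature(verb, main_verb, prev_verb):
    """Return a 4-character string such as TTTT or TTFF (see the text above)."""
    def person_number(a, b):
        return a["person"] == b["person"] and a["number"] == b["number"]
    def tense_mood(a, b):
        return a["tense"] == b["tense"] and a["mood"] == b["mood"]
    flags = [person_number(verb, main_verb), tense_mood(verb, main_verb),
             person_number(verb, prev_verb), tense_mood(verb, prev_verb)]
    return "".join("T" if f else "F" for f in flags)

v = {"person": "P3", "number": "SG", "tense": "PRES", "mood": "IND"}
main = {"person": "P3", "number": "SG", "tense": "PRES", "mood": "IND"}
prev = {"person": "P3", "number": "PL", "tense": "PRES", "mood": "IND"}
print(agree_feature(v, main, prev))  # TTFT
```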
6 Se identifier (SE): this function identifies whether the particle se occurs in
close proximity to the finite verb. Again, in this context, a distance of at most
one token between the finite verb and se is considered “close proximity.” The
value for this feature can be (yes), when se appears, or (no), when it does not.
7 Verb classifier (VERBtype): this module specifies the value of four features
of the finite verb that occurs in the clause. The features encode information
about whether or not the verb appears in four different lists of verbs (the same
instance can occur in more than one list). These four lists2 contain 11,060 different
verb lemmas which are present in the Royal Spanish Academy Dictionary (Real
1 In this study, it is considered that sentences may contain several verbs whereas clauses contain
only one finite verb.
2 The lists 7.2-7.4 of infinitive verb forms were provided by Molino de Ideas s.a.
Academia Española, 2001). The criterion on which these lists (items 7.2-7.4) were
built was the information contained in the dictionary definitions of the verbs (Real
Academia Española, 2001):
7.1 Copulative verbs (C): a list containing the copulative verbs, i.e. ser (to be), parecer
(to seem like), etc.;
7.2 Impersonal verbs (I): a list containing all the verbs whose use is impersonal. Such
use is specified in their definition, i.e. llover (to rain), nevar (to snow), etc.;
7.3 Pronominal verbs (P): a list which includes all the pronominal verbs (verbs whose
lemma in the dictionary appears with se) and all the potential pronominal verbs
whose definitions specify a potential pronominal use; and
7.4 Transitive and intransitive verbs (T): a list containing transitive verbs and intran-
sitive verbs that meet the criteria detailed previously in item 7.
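Membership in the four verb lists could be encoded along these lines; since the exact letter scheme behind values such as XXEX in Table 3.4 is not spelled out here, this encoding (and the tiny sample lists) is an assumption.

```python
# A hedged sketch of the VERBtype feature: one letter per list, marking
# membership (C, I, P, T) or absence (X). The sample lists are illustrative.
COPULATIVE = {"ser", "parecer", "estar"}
IMPERSONAL = {"llover", "nevar"}
PRONOMINAL = {"secarse", "quejarse"}
TRANSITIVE = {"firmar", "admitir", "ser"}

def verb_type(lemma):
    """Build a 4-letter membership code for the finite verb's lemma."""
    code = "C" if lemma in COPULATIVE else "X"
    code += "I" if lemma in IMPERSONAL else "X"
    code += "P" if lemma in PRONOMINAL or lemma + "se" in PRONOMINAL else "X"
    code += "T" if lemma in TRANSITIVE else "X"
    return code

print(verb_type("llover"))  # XIXX
print(verb_type("firmar"))  # XXXT
```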
3.2.5 The WEKA Package
The Weka workbench1 is a collection of state-of-the-art machine learning algorithms
and data preprocessing tools (Hall et al., 2009; Witten & Frank, 2005). Both Weka
interfaces, the Explorer and the Experimenter, were used to discover the methods and
parameter settings that work best for the current classification task.
Standard evaluation measures –precision, recall, f-measure and accuracy (Manning
& Schütze, 1999)– provided by Weka are used. In these measures, true positives (tp)
and true negatives (tn) are the number of cases that the system got right. The wrongly
selected cases are the false positives (fp) while the cases that the system failed to select
are the false negatives (fn). In the current context, true positives and true negatives
would be the numbers of correctly classified instances while the false positives and false
negatives are the numbers of falsely classified instances (Manning & Schütze, 1999).
Precision is defined as the ratio of selected items that the system got right, that is,
the ratio of true positives to the sum of true positives and false positives:
p = tp / (tp + fp).
Recall is defined as the proportion of target items that the system selected, that is,
the ratio of the number of true positives to the sum of true positives and false negatives:
r = tp / (tp + fn).
1 Weka is available at: http://www.cs.waikato.ac.nz/ml/weka/.
Figure 3.3: An example of the Weka Explorer interface.
F-measure is a single measure of overall performance which combines precision and
recall:
F = 2 / (1/r + 1/p).
Accuracy is the proportion of correctly classified objects:
A = (tp + tn) / (tp + tn + fp + fn).
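The four measures above follow directly from the confusion counts, as this short sketch shows (the counts used in the example are illustrative):

```python
# The standard evaluation measures, computed from true/false
# positive/negative counts exactly as defined in the text.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r):
    return 2 / (1 / r + 1 / p)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

p, r = precision(90, 10), recall(90, 30)
print(round(p, 3), round(r, 3), round(f_measure(p, r), 3))  # 0.9 0.75 0.818
print(accuracy(90, 70, 10, 30))  # 0.8
```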
Chapter 4
Evaluation
“Then you should say what you mean” [...]
“I do,” Alice hastily replied; “at least I mean what I say that’s the same thing,
you know.”
“Not the same thing a bit!” said the Hatter. “Why, you might just as well say
that ‘I see what I eat’ is the same thing as ‘I eat what I see’ !”
Alice in Wonderland, Lewis Carroll
This chapter presents the evaluation of the Elliphant system and some optimisa-
tion experiments carried out with the machine learning method (see Section 4.1). A
comparative evaluation of Elliphant’s performance with that of Connexor’s Machinese
Syntax parser is also described (see Section 4.2).
Standard evaluation measures (precision, recall, f-measure and accuracy) are used
to evaluate Elliphant with regard to the identification of the three classes: explicit
subjects, zero pronouns and impersonal constructions.
4.1 Experiments
A set of experiments was executed using the Weka package with the purpose of
answering the following questions:
(1) Which method and parameter values work best for our problem? (see Section 4.1.1)
(2) How many instances are needed to train the algorithm? (see Section 4.1.2)
4. Evaluation 4.1 Experiments
(3) Does the genre matter? (see Section 4.1.3)
(4) Which are the most significant features and what are the most effective combinations of
features? (see Section 4.1.4)
4.1.1 Method Selected: K* Algorithm
A comparison of the learning algorithms implemented in Weka (Witten & Frank,
2005) was carried out to determine the most accurate method for each classification
task. Table 4.1 presents the accuracy levels of all the Weka classifiers which
exploit the features utilised in the Elliphant system, with default parameter settings.
The experiment was executed using 20% of the
instances in the training data, which were selected randomly. Ten-fold cross-validation
was used in the evaluation. All methods with an accuracy within 1% of K*’s are marked
in italics.
The seven1 highest-performing classifiers were compared using 100% of the training
data and 10-fold cross-validation. The Bayes classifiers (BayesNet, NaiveBayes and
NaiveBayesUpdateable) obtained an accuracy score of 0.846, the function classifier
(RBFNetwork) offers an accuracy of 0.850 and the tree classifier (LADTree) an accuracy
of 0.830. With an accuracy of 0.860, the lazy learning classifier K* is the best performing
one, and hence our chosen technique.
Although lazy learning requires a relatively large amount of memory to store the
entire training set, the eszic training data is small enough that it can be classified
within a few minutes.
Instance-based learners classify new instances by comparing them to the manually
classified instances in the training data. The fundamental assumption is that similar
instances will have similar classifications. Nearest neighbor algorithms are the simplest
of the instance-based learners. They use a domain-specific distance measure to retrieve
the single most similar instance from the training set. In a nearest-neighbor method
each instance in the training set is represented by a vector of feature values that has
been explicitly classified. When a new vector of feature values is presented, a distance
measure is computed between the new vector and the set of vectors held in the training
1Unfortunately, due to hardware limitations, it was not possible to obtain results from the NBTree
classifier and the JRip rule classifier when using the entire set of training data.
Weka classifiers Accuracy Weka classifiers Accuracy
Bayes: BayesNet 0.848 Meta: RacedIncrementalLogitBoost 0.717
Bayes: NaiveBayes 0.848 Meta: RandomSubSpace 0.731
Bayes: NaiveBayesSimple 0.842 Meta: Stacking 0.717
Bayes: NaiveBayesUpdateable 0.848 Meta: StackingC 0.717
Functions: RBFNetwork 0.848 Meta: Vote 0.717
Lazy: IB1 0.804 Misc: HyperPipes 0.715
Lazy: IBk 0.810 Misc: VFI 0.704
Lazy: K* 0.850 Rules: ConjunctiveRule 0.809
Lazy: LWL 0.809 Rules: DecisionTable 0.834
Meta: AdaBoostM1 0.81 Rules: DTNB 0.834
Meta: AttributeSelectedClassifier 0.836 Rules: JRip 0.845
Meta: ClassificationViaClustering 0.66 Rules: NNge 0.740
Meta: CVParameterSelection 0.717 Rules: OneR 0.762
Meta: Decorate 0.795 Rules: PART 0.795
Meta: END 0.809 Rules: Ridor 0.821
Meta: EnsembleSelection 0.762 Rules: ZeroR 0.717
Meta: FilteredClassifier 0.810 Trees: BFTree 0.760
Meta: Grading 0.717 Trees: DecisionStump 0.810
Meta: LogitBoost 0.841 Trees: J48 0.810
Meta: MultiBoostAB 0.810 Trees: J48graft 0.813
Meta: MultiClassClassifier 0.661 Trees: LADTree 0.846
Meta: MultiScheme 0.717 Trees: NBTree 0.850
NestedDichotomies: ClassBalancedND 0.809 Trees: RandomForest 0.793
NestedDichotomies: DataNearBalancedND 0.809 Trees: RandomTree 0.749
NestedDichotomies: ND 0.809 Trees: REPTree 0.723
Meta: OrdinalClassClassifier 0.810 Trees: SimpleCart 0.763
Table 4.1: Weka classifiers accuracy (20% of the eszic training set).
set (Cleary & Trigg, 1995). The k nearest ones are identified and the new vector is
assigned the class shared by the majority of the nearest neighbors1.
K* is an instance-based classifier. The class of a test instance is based upon the
classes of those training instances that are similar to it, as determined by some
similarity function. K* differs from other instance-based learners in that it computes
the distance between two instances using a method motivated by information theory,
in which an entropy-based distance function is used (Cleary & Trigg, 1995; Witten &
Frank, 2005). The distance between two instances is defined as the complexity of
transforming one instance into the other; the calculation of this complexity is
detailed in Cleary & Trigg (1995).
1Evans (2001) and Boyd et al. (2005) ran their experiments with the k-nearest-neighbor
classifier, which is also a lazy learning algorithm.
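The nearest-neighbour scheme described above can be sketched in a few lines. The following is a simplified stand-in: it uses symbolic feature vectors and a plain mismatch count in place of Cleary & Trigg's entropic transformation cost, and the feature names and values are invented for illustration only.

```python
from collections import Counter

def mismatch_distance(a, b):
    # Count of differing feature values; a crude stand-in for K*'s
    # entropy-based transformation complexity.
    return sum(x != y for x, y in zip(a, b))

def knn_classify(train, new_vector, k=1):
    # train: list of (feature_vector, class_label) pairs.
    ranked = sorted(train, key=lambda inst: mismatch_distance(inst[0], new_vector))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training instances: (number, person, parsed-subject flag).
train = [(("sg", "3", "yes"), "explicit"),
         (("sg", "3", "no"), "zero"),
         (("pl", "3", "no"), "zero"),
         (("sg", "impers", "no"), "impersonal")]
print(knn_classify(train, ("sg", "3", "no")))  # → zero
```

With k=1 this reduces to retrieving the single most similar instance, as in the instance-based learners discussed above.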
When using K*, the most effective classification is obtained with a global blending
parameter1 of 40%2, while the rest of the parameters keep their default values: the
missing mode parameter3 is set to average column entropy curves and the entropic
auto blend parameter is set to false. Table 4.2 presents the evaluation of Elliphant
when exploiting the K* classifier with these parameter settings, using ten-fold
cross-validation.
Class Precision Recall F-measure
Explicit subjects 0.900 0.923 0.911
Zero pronouns 0.772 0.740 0.756
Impersonal constructions 0.889 0.626 0.734
eszic training data Accuracy: 0.867 (ten-fold cross-validation)
Table 4.2: eszic training data evaluation with K* -B 40 -M a.
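The per-class measures reported in Table 4.2 follow the standard definitions of precision, recall and f-measure. As a reminder of how they are derived from a confusion matrix, here is a minimal sketch; the counts are invented toy figures, not the eszic results.

```python
def per_class_metrics(confusion, cls):
    # confusion[true_class][predicted_class] = count of instances.
    classes = confusion.keys()
    tp = confusion[cls][cls]
    fp = sum(confusion[t][cls] for t in classes if t != cls)
    fn = sum(confusion[cls][p] for p in classes if p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Toy confusion matrix over the three classes (illustrative counts only).
confusion = {
    "explicit":   {"explicit": 90, "zero": 8,  "impersonal": 2},
    "zero":       {"explicit": 10, "zero": 74, "impersonal": 16},
    "impersonal": {"explicit": 2,  "zero": 12, "impersonal": 6},
}
p, r, f = per_class_metrics(confusion, "zero")
print(round(p, 3), round(r, 3), round(f, 3))
```

Accuracy, by contrast, is a single figure: the sum of the diagonal counts divided by the total number of instances.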
There is a marginal reduction in accuracy when the system is evaluated using ten-fold
cross-validation (0.867) instead of leave-one-out cross-validation (0.869), a difference
of minimal statistical significance. As the proportion of training data used decreases,
the difference in performance between the two evaluation methods remains stable,
reaching at most 0.005 (when 50% of the training data is used). Although leave-one-out
cross-validation obtains more accurate results, since it is easier to classify test
instances using almost 100% of the training data than using only 90% of it, in practice
a classifier is trained and tested on instances derived from different data sets. Ten-fold
cross-validation is thus a more accurate simulation of real-world classification scenarios.
Moreover, it can be computed far more quickly than leave-one-out cross-validation.
1The parameter for global blending.
2Blending percentages up to 50% were tested.
3The missing mode determines how missing attribute values are treated.
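Both evaluation schemes can be reproduced with a single generic k-fold routine: leave-one-out is simply k-fold cross-validation with k equal to the number of instances. The sketch below uses a trivial majority-class learner as a placeholder for K*, on an invented two-class data set.

```python
from collections import Counter

def k_fold_accuracy(data, classify, k):
    # data: list of (features, label); classify(train, features) -> label.
    folds = [data[i::k] for i in range(k)]  # k disjoint folds
    correct = total = 0
    for i, test in enumerate(folds):
        train = [inst for j, fold in enumerate(folds) if j != i
                 for inst in fold]
        for feats, label in test:
            correct += classify(train, feats) == label
            total += 1
    return correct / total

def majority_classify(train, feats):
    # Placeholder learner: always predicts the majority class of the fold.
    return Counter(label for _, label in train).most_common(1)[0][0]

data = ([((i,), "explicit") for i in range(14)]
        + [((i,), "zero") for i in range(6)])
tenfold = k_fold_accuracy(data, majority_classify, 10)
loo = k_fold_accuracy(data, majority_classify, len(data))  # leave-one-out
print(tenfold, loo)  # both 0.7 for this toy data
```

For this degenerate learner the two schemes agree exactly; with a real learner such as K*, leave-one-out typically scores slightly higher because each test instance is classified against almost the whole training set.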
eszic training data Ten-fold Leave-one-out
percentage cross-validation cross-validation
10% 0.836 0.834
20% 0.859 0.862
30% 0.854 0.851
40% 0.855 0.858
50% 0.858 0.863
60% 0.860 0.862
70% 0.860 0.862
80% 0.865 0.863
90% 0.866 0.869
100% 0.867 0.868
Table 4.3: Leave-one-out and ten-fold cross-validation comparison.
4.1.2 Learning Curve
A learning curve shows how accuracy changes with varying sample sizes, plotting the
number of correctly classified instances against the number of instances in the training
data. To calculate the learning curve of the Elliphant system, the eszic training data
was used to generate ten training samples, representing 10%, 20%, 30%, 40%, 50%,
60%, 70%, 80%, 90% and 100% of the data set. The instances contained in the eszic training
file were randomly ordered so that the genre variable could not influence the results
presented below. In these experiments, the K* algorithm was used with the parameter
settings described in Section 4.1.1 and the evaluation was carried out using ten-fold
cross-validation.
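The sampling procedure just described amounts to: shuffle the instances once, take increasing prefixes, and score each prefix. The following minimal sketch assumes a held-out test split and a 1-NN stand-in for K* on synthetic data; the actual experiments score each sample with ten-fold cross-validation over the eszic instances.

```python
import random

def nn1(train, feats):
    # 1-NN with a symbolic mismatch distance, standing in for K*.
    best = min(train, key=lambda inst: sum(a != b for a, b in zip(inst[0], feats)))
    return best[1]

def learning_curve(data, fractions, test_frac=0.2, seed=0):
    rng = random.Random(seed)
    data = data[:]
    rng.shuffle(data)  # random order so genre grouping cannot bias the samples
    cut = int(len(data) * test_frac)
    test, pool = data[:cut], data[cut:]
    acc = lambda train: sum(nn1(train, f) == y for f, y in test) / len(test)
    return [(frac, acc(pool[:max(1, int(len(pool) * frac))]))
            for frac in fractions]

# Synthetic instances whose label is determined by the first feature.
data = [((i % 3, i % 5), str(i % 3)) for i in range(100)]
curve = learning_curve(data, [0.1 * n for n in range(1, 11)])
for frac, accuracy in curve:
    print(f"{frac:.0%} of pool: accuracy {accuracy:.2f}")
```

Plotting the resulting (fraction, accuracy) pairs yields a curve of the kind shown in Figure 4.1.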
The learning curve shown in Figure 4.1 presents the increase in accuracy obtained
by the Elliphant system using the eszic training data. Performance reaches a plateau
at its maximum level when using 90% of the training instances.1
Figure 4.2 displays the precision, recall and f-measure of classification for all classes
1One thing to be noted is that the ordering of the instances makes a slight difference to the accuracy
of classification. While the system obtains an accuracy of 0.867 when the instances are placed in their
original order of occurrence in the eszic training data, 0.866 is obtained when the same instances are
presented in random order to the classifier using ten-fold cross validation. This difference also occurs
when leave-one-out cross-validation is used. In this case, the method obtains an accuracy of 0.869
when the instances are placed in their original order of occurrence and 0.868 when presented in random
order.
[Figure: accuracy plotted against the percentage of eszic training data used, rising from 0.836 at 10% to 0.866 at 90% and 100%.]
Figure 4.1: eszic training data learning curve for accuracy.
in the eszic training data. The values of the three measures are maximal when uti-
lizing 90% of the training set. While recall plateaus at this sample size, precision
and f-measure decrease slightly when the amount of training data is further increased,
although this decline is not sufficiently marked to be attributed to overtraining.
[Figure: precision, recall and f-measure plotted against the percentage of eszic training data used; all three measures rise from roughly 0.83 at 10% to about 0.865 at 90%.]
Figure 4.2: eszic training data learning curve for precision, recall and f-measure.
The learning curve in Figure 4.3 shows the classification accuracy for each of the
classes, while Figure 4.4 presents this accuracy in relation to the number of training
instances for each section of the eszic training data.
Under all conditions, subjects are classified with a high accuracy since the infor-
mation given by the parser (collected in the features) facilitates an f-measure of 0.801
for the identification of explicit subjects. By contrast, the parser recognises neither
zero pronouns nor impersonal constructions; it can only detect that a clause has no
explicit subject. The accuracy with which these two types can be classified thus begins
at a lower level (0.662 and 0.621 respectively). Classification of both zero pronouns
and impersonal constructions reaches its maximum when 90% of the training data is
exploited. There is also some evidence of overtraining in the classification of impersonal
constructions when using 100% of the training data.
[Figure: per-class scores plotted against the percentage of training data used. Explicit subjects rise from 0.895 at 10% to 0.911 at 80%–100%; zero pronouns from 0.662 to 0.754; impersonal constructions from 0.621 to a maximum of 0.736 at 90%, falling to 0.721 at 100%.]
Figure 4.3: Learning curve for accuracy, recall and f-measure of the classes.
The zero pronoun class has the steepest learning curve. Utilising only 735 instances
(50% of the training set), the Elliphant system obtains an accuracy (0.741) close to that
obtained when using 100% of the training data. The learning curve for the subject class
is more gradual due to the great variety of subjects occurring in the training data. In
addition, increasing accuracy from a greater starting point (0.907 using just 20% of the
training data) is far more expensive in terms of the addition of training instances. The
impersonal sentence class is also learned rapidly by Elliphant: utilising a training set
of only 179 instances, it reaches a classification accuracy of 0.721 (see Figure 4.4).
[Figure: per-class scores plotted against the number of training instances of each class. Explicit subjects range from 498 to 4,854 instances, zero pronouns from 167 to 1,793, and impersonal constructions from 17 to 179.]
Figure 4.4: Learning curve for accuracy, recall and f-measure in relation to the number
of instances of each class.
This demonstrates that Elliphant is not heavily reliant on very large sets of expensive
training data and is able to reach adequate levels of performance when exploiting
far fewer training instances. Overall, we see that only a small set of annotated
instances (1,500) is needed to achieve reasonable results.
4.1.3 Most Effective Features
With Weka’s Attribute Selection option, it is possible to evaluate the features by
considering the individual predictive ability of each of the features along with the degree
of redundancy between them. Table 4.4 shows the relevant features, in ranked order,
as evaluated using different algorithms implemented in Weka's attribute selection
module which can handle the feature types (symbolic, numerical, etc.) from the eszic
training data.
The filters used for each Attribute Selection method are the ones provided by default
in Weka1.
Considering the group of features selected using each Weka Attribute Selection
algorithm, 11 classifications using the K* classifier were made over the complete eszic
1BestFirst filter for the CfsSubsetEval method; Attribute ranking filter for the ChiSquaredAttributeEval,
FilteredAttributeEval, GainRatioAttributeEval, InfoGainAttributeEval, OneRAttributeEval,
ReliefFAttributeEval and SymmetricalUncertAttributeEval methods; and Greedy Stepwise filter for the
ConsistencySubsetEval and FilteredSubsetEval methods.
Weka Attribute Selection Selected features
CfsSubsetEval PARSER, NUMBER, NHPREV, NHTOT,
VERBtype, PERSON
ChiSquaredAttributeEval LEMMA, POSpos, NHTOT, NHPREV, POSpre,
PARSER
ConsistencySubsetEval PARSER, LEMMA, NUMBER, AGREE, NHTOT,
POSpos, POSpre
FilteredAttributeEval POSpos, LEMMA, NHPREV, NHTOT, PARSER,
POSpre
FilteredSubsetEval PARSER, NHPREV, NHTOT
GainRatioAttributeEval NHPREV, PARSER, PERSON, NHTOT, POSpos,
CLAUSE
InfoGainAttributeEval POSpos, LEMMA, NHPREV, NHTOT, PARSER,
POSpre
OneRAttributeEval NHTOT, POSpos, CLAUSE, PERSON, NHPREV,
PARSER
ReliefFAttributeEval POSpos, VERBtype, LEMMA, PARSER, CLAUSE,
POSpre
SymmetricalUncertAttributeEval NHPREV, PARSER, NHTOT, POSpos, PERSON,
LEMMA
Table 4.4: Features selected by the Weka Attribute Selection methods.
training data using only the features selected by each method. Table 4.5 presents the
accuracy of each classification using ten-fold cross-validation.
The most effective group of six features in combination is the one selected by
Weka’s SymmetricalUncertAttributeEval Attribute Selection algorithm, since the clas-
sification using those six features together already offers an accuracy of 0.851. Likewise,
a group consisting of only three features (parser, nhprev, nhtot) was selected by
the FilteredSubsetEval algorithm. These three features are the most frequently selected
ones among those chosen by all the Attribute Selection methods. A classification which
exploits only these three features obtains an accuracy of 0.819.
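Several of the rankers in Table 4.4 (for example InfoGainAttributeEval) score each feature by information gain: the reduction in class entropy obtained by splitting the instances on that feature's values. A self-contained sketch on an invented two-feature data set:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(instances, labels, idx):
    # Class entropy minus the entropy remaining after splitting on feature idx.
    groups = defaultdict(list)
    for inst, label in zip(instances, labels):
        groups[inst[idx]].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

X = [("sg", "legal"), ("sg", "health"), ("pl", "legal"), ("pl", "health")]
y = ["zero", "zero", "explicit", "explicit"]
ranking = sorted(range(len(X[0])), key=lambda i: info_gain(X, y, i), reverse=True)
print(ranking)  # feature 0 predicts the class perfectly, so it ranks first: [0, 1]
```

The other evaluators in the table use different criteria (chi-squared statistics, consistency, ReliefF sampling, and so on), but all produce a ranking or subset of features in the same way.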
A set of experiments was conducted in which features were selected on the basis
of the degree of computational effort needed to generate them. Two sets of features
were proposed. One group corresponds to features intrinsic to the parser, whose values
can be obtained by trivial exploitation of the tags produced in its output (parser,
Weka Attribute Selection Accuracy
CfsSubsetEval 0.824
ChiSquaredAttributeEval 0.848
ConsistencySubsetEval 0.843
FilteredAttributeEval 0.848
FilteredSubsetEval 0.819
GainRatioAttributeEval 0.833
InfoGainAttributeEval 0.848
OneRAttributeEval 0.833
ReliefFAttributeEval 0.825
SymmetricalUncertAttributeEval 0.851
Table 4.5: Classification using the selected features groups: accuracy.
lemma, person, pospos, pospre). The second group of features (clause, agree,
nhprev, nhtot, verbtype) has values derived by methods extrinsic to the parser
and rules for the recognition of elements that are independent of it. Derivation of
this second group of features necessitated the implementation of more sophisticated
modules to identify the boundaries of syntactic constituents such as clauses and noun
phrases. These modules are rule-based and operate over the often erroneous output
of the parser (see Section 3.2.4). The results obtained when the classifier exclusively
exploits each of these intrinsic and extrinsic groups of features are shown in Tables 4.6
and 4.7.
A recurrent issue in anaphora resolution studies is determining the quantity and
type of knowledge needed for identification of candidates and selection of a candidate
as antecedent. In Mitkov (2002) it is stated that, given the natural linguistic ambiguity
of various cases, the resolution of any kind of anaphor requires not only morphological,
lexical, and syntactic knowledge but also semantic knowledge, discourse knowledge, and
real world knowledge. Nevertheless, current anaphora resolution methods rely mainly
on restrictions and preference heuristics, which employ information originating from
morpho-syntactic or shallow semantic analysis (Ferrandez & Peral, 2000; Mitkov, 1998),
while some previous approaches have exploited full parsing (Hobbs, 1977; Lappin &
Leass, 1994). As described in this dissertation, Elliphant makes use of deep dependency
parsing plus the morphological knowledge contained in the verb lists used.
There are two findings of note in Tables 4.6 and 4.7. The first is that no impersonal
constructions are identified when only features extrinsic to the parser are used. The second
is that there is a reduction in recall when using only intrinsic features. It is therefore
better to classify instances using a feature group that combines both types of features.
eszic training data Precision Recall F-measure
Explicit subjects 0.654 0.664 0.659
Zero pronouns 0.865 0.891 0.878
Impersonal constructions 0 0 0
Extrinsic parser features eszic training data accuracy: 0.808
Table 4.6: Extrinsic parser features classification results.
eszic training data Precision Recall F-measure
Explicit subjects 0.866 0.312 0.459
Zero pronouns 0.779 0.983 0.869
Impersonal constructions 0.944 0.285 0.438
Intrinsic parser features eszic training data accuracy: 0.789
Table 4.7: Intrinsic parser features classification results.
To estimate the weight of each feature, classifications were made in which each
feature was omitted from the training instances that were presented to the classifier
and ten-fold cross-validation was applied. Table 4.8 presents the accuracy of these
classifications. Omission of every feature except a led to a reduction in accuracy,
justifying their inclusion in the training instances.
Feature omitted Accuracy Feature omitted Accuracy
PARSER 0.854 VERBtype 0.863
NHTOT 0.860 NUMBER 0.864
LEMMA 0.861 INF 0.864
POSpos 0.861 AGREE 0.865
NHPREV 0.862 POSpre 0.866
PERSON 0.863 SE 0.866
CLAUSE 0.863 A 0.867
Table 4.8: Single feature omission classifications: accuracy.
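The single-feature omission study can be reproduced with a small ablation harness: drop one feature column at a time, re-evaluate, and compare against the full-feature score. The sketch below plugs in a trivial lookup-table scorer on invented data; in the thesis the evaluator is the K* classifier under ten-fold cross-validation.

```python
from collections import Counter, defaultdict

def table_accuracy(data):
    # Score of a lookup-table learner: predict the majority label of each
    # identical feature tuple (the ceiling a memorising learner can reach).
    groups = defaultdict(list)
    for feats, label in data:
        groups[feats].append(label)
    correct = sum(Counter(labels).most_common(1)[0][1]
                  for labels in groups.values())
    return correct / len(data)

def ablation(data, evaluate, n_features):
    # Re-evaluate with each feature removed in turn.
    scores = {}
    for i in range(n_features):
        reduced = [(tuple(v for j, v in enumerate(feats) if j != i), label)
                   for feats, label in data]
        scores[i] = evaluate(reduced)
    return scores

data = [(("sg", "x"), "zero"), (("sg", "x"), "zero"),
        (("pl", "x"), "explicit"), (("pl", "x"), "explicit")]
print(table_accuracy(data))               # 1.0 with all features
print(ablation(data, table_accuracy, 2))  # dropping feature 0 hurts: {0: 0.5, 1: 1.0}
```

A feature whose removal leaves the score unchanged (here feature 1, and feature a in Table 4.8) contributes nothing that the remaining features do not already provide.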
4.1.4 Genre Analysis
As the eszic training data is composed of instances belonging to two different genres
(legal and health), two subgroups of the eszic training data were generated: the Legal
eszic training data and the Health eszic training data containing all the instances
derived from legal and health texts, respectively. A comparative evaluation using ten-
fold cross-validation over the two subgroups shows that Elliphant is more successful
when classifying instances of explicit subjects in legal texts (see Table 4.9). This may
be explained by the uniformity of the sentences in the legal texts which present less
variation than the ones from the health genre. Texts from the health genre present
the additional complication of specialised named entities and acronyms which are used
quite frequently in the health texts from the eszic Corpus (e.g. CCDSD1, DSM-IV2 or
TLP3). Further, there is a larger number of explicit subjects in the legal training data
(2,739, compared with 2,116 explicit subjects occurring in the health texts). Similarly,
better performance in the detection of zero pronouns and impersonal sentences in the
health texts may be due to their higher occurrence in the health genre: 108 impersonal
constructions and 1,174 zero pronouns compared with 71 impersonal constructions and
619 zero pronouns in the legal texts (see Table 3.2 for details about the number of class
instances in each subgroup of the training data).
Class Precision Recall F-measure
Legal genre Explicit subjects 0.920 0.955 0.937
Health genre Explicit subjects 0.881 0.888 0.884
Legal genre Zero pronouns 0.761 0.649 0.701
Health genre Zero pronouns 0.784 0.796 0.790
Legal genre Impersonal constructions 0.786 0.620 0.693
Health genre Impersonal constructions 0.905 0.620 0.736
Legal genre accuracy: 0.893 (ten-fold cross-validation)
Health genre accuracy: 0.848 (ten-fold cross-validation)
Table 4.9: Legal and health genres comparative evaluation.
1Cuestionario Clínico para el Diagnóstico del Síndrome Depresivo (Clinical Questionnaire for Depressive Syndrome Diagnosis).
2Manual Diagnóstico y Estadístico de los Trastornos Mentales IV (Diagnostic and Statistical Manual of Mental Disorders IV).
3Trastorno límite de la personalidad (Borderline Personality Disorder).
We have also studied the effect of training the classifier on data derived from one
genre and testing it on instances derived from a different genre. Table 4.10 shows that
instances from legal texts are not only more homogeneous, since the classifier obtains
higher accuracy when training and testing only on legal instances (0.895), but also
more informative: when both legal and health genres are combined as training data,
testing the algorithm only on instances from the health genre shows significantly
increased accuracy (0.933). These results imply that the instances from the health
genre are the most heterogeneous ones. Subsets of legal documents where our method
achieves an accuracy of 0.942 were also found.
Training set \ Testing set Legal Health eszic Corpus
Legal 0.895 0.859 0.885
Health 0.858 0.841 0.887
eszic Corpus (all) 0.920 0.933 0.869
Accuracy: cross-genre training and testing (ten-fold cross-validation)
Table 4.10: Cross-genre training and testing evaluation.
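The cross-genre comparison in Table 4.10 is a simple grid: train on each split, test on each split. A generic harness for producing such a grid follows; the majority-class scorer and the miniature genre splits are invented stand-ins for the real train-and-evaluate step with K*.

```python
from collections import Counter

def majority_score(train, test):
    # Stand-in evaluator: accuracy of always predicting the training
    # split's majority label on the test split.
    guess = Counter(label for _, label in train).most_common(1)[0][0]
    return sum(label == guess for _, label in test) / len(test)

def cross_genre_grid(splits, score):
    # splits: genre name -> list of (features, label) instances.
    names = list(splits)
    return {(tr, te): score(splits[tr], splits[te])
            for tr in names for te in names}

legal = [((i,), "explicit") for i in range(8)] + [((i,), "zero") for i in range(2)]
health = [((i,), "explicit") for i in range(5)] + [((i,), "zero") for i in range(5)]
splits = {"legal": legal, "health": health, "all": legal + health}
grid = cross_genre_grid(splits, majority_score)
print(grid[("legal", "legal")])  # 0.8: majority "explicit" matches 8 of 10
```

Reading off the grid by (training set, testing set) pairs reproduces the layout of Table 4.10.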
4.2 Comparative Evaluation
Due to the lack of previous work on this topic, a comparison with other methods is
not feasible. Despite its similarities to this approach, Ferrandez & Peral (2000) use a
different definition for zero pronouns, and therefore a comparison is not appropriate.
As a guideline, the results obtained by Connexor’s Machinese Syntax are presented
regarding the existence (or not) of a subject inside the clause. Since this parser does
not distinguish between referential and non-referential elliptic subjects, both categories
have been merged into one. Needless to say, a comparison of the results obtained by
these two methods should be made with caution. They are presented here only as a
point of reference. It is clear from the figures that the Elliphant system offers not only
improved f-measure in the classification of both elliptic subject classes, but also obtains
superior f-measure when classifying the non-omitted subject class.
The evaluation was carried out using both the entire set of eszic training data and
also the genre-specific subsets of the training data (Legal and Health eszic training
data). The evaluation of the Elliphant system was made using leave-one-out cross-
validation.
eszic training data Precision Recall F-measure
Elliphant Explicit subjects 0.901 0.924 0.913
Elliphant Zero pronouns 0.774 0.743 0.758
Elliphant Impersonal constructions 0.889 0.626 0.734
Elliphant eszic training data accuracy: 0.869 (leave-one-out cross-validation)
Table 4.11: Elliphant eszic training data results.
eszic training data Precision Recall F-measure
Machinese Explicit subjects 0.911 0.716 0.802
Machinese Zero pronouns
+ Impersonal constructions 0.543 0.829 0.656
Machinese eszic training data accuracy: 0.749
Table 4.12: Machinese eszic training data results.
When evaluating over the entire eszic training set, Elliphant outperforms the parser
on every measure. When detecting explicit subjects, Elliphant obtains a considerably
higher recall score (0.924, compared to the parser's 0.716). The averages of the
evaluation measures obtained for the identification of zero pronouns and impersonal
constructions (precision: 0.831; recall: 0.684; f-measure: 0.746) were also compared.
The comparison demonstrated Elliphant's superiority over Connexor's Machinese
Syntax parser in this task for all measures except recall.
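The averaged figures quoted for the two elision classes are unweighted (macro) means of the per-class measures; any small discrepancy with the rounded figures in the text arises from where rounding is applied. A minimal sketch using the per-class scores from Table 4.11:

```python
def macro_average(per_class):
    # per_class: {class name: (precision, recall, f-measure)}.
    # Unweighted mean of each measure across the classes.
    n = len(per_class)
    return tuple(sum(scores[i] for scores in per_class.values()) / n
                 for i in range(3))

# Elliphant's two elision classes, taken from Table 4.11.
elision = {"zero pronouns": (0.774, 0.743, 0.758),
           "impersonal constructions": (0.889, 0.626, 0.734)}
p, r, f = macro_average(elision)
print(p, r, f)
```

A macro average weights both classes equally despite impersonal constructions being far rarer than zero pronouns, which is appropriate here since both elision types matter equally for downstream anaphora resolution.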
Legal genre eszic training data Precision Recall F-measure
Legal genre Elliphant Explicit subjects 0.922 0.955 0.938
Legal genre Elliphant Zero pronouns 0.760 0.654 0.703
Legal genre Elliphant Impersonal constructions 0.797 0.662 0.723
Elliphant Legal eszic training accuracy: 0.895
Table 4.13: Elliphant Legal eszic training results.
When processing only the Legal eszic training data, the accuracy of the parser is
reduced (0.726), while the performance of the Elliphant system is improved (0.895).
Legal genre eszic training data Precision Recall F-measure
Legal genre Machinese Explicit subjects 0.940 0.702 0.803
Legal genre Machinese Zero pronouns
+ Impersonal constructions 0.410 0.823 0.547
Machinese Legal eszic training accuracy: 0.726
Table 4.14: Machinese Legal eszic training results.
The two systems were used to classify instances of elision (zero pronouns and impersonal
constructions) in texts from the legal genre. The averaged evaluation measures
obtained by the Elliphant system (precision: 0.778; recall: 0.658; f-measure: 0.713)
were found to be superior to those obtained by the parser (precision: 0.675; recall:
0.763; f-measure: 0.675) for all measures except recall.
Health genre eszic training data Precision Recall F-measure
Health genre Elliphant Explicit subjects 0.879 0.879 0.879
Health genre Elliphant Zero pronouns 0.773 0.795 0.784
Health genre Elliphant Impersonal constructions 0.882 0.620 0.728
Elliphant Health eszic training data accuracy: 0.841
Table 4.15: Elliphant Health eszic training data results.
Health genre eszic training data Precision Recall F-measure
Health genre Machinese Explicit subjects 0.879 0.735 0.801
Health genre Machinese Zero pronouns
+ Impersonal constructions 0.656 0.833 0.734
Machinese Health eszic training data accuracy: 0.772
Table 4.16: Machinese Health eszic training data results.
When classifying instances derived from texts in the health genre (using Health
eszic training data), the accuracy of both the Elliphant system and the parser was
reduced. However, Elliphant still outperforms the parser in this context.
When considering the classification of instances of elision in the health genre, Connexor's
Machinese Syntax parser does obtain a higher recall (0.833) than the averaged evaluation
measures of Elliphant (precision: 0.827; recall: 0.707; f-measure: 0.756).
Nevertheless, unlike the parser, the Elliphant system distinguishes referential (zero
pronouns) and non-referential (impersonal constructions) elided subjects. This can be
considered one of its main contributions as this task is necessary in order to improve
practical anaphora resolution systems.
Chapter 5
Conclusions and Future Work
In this dissertation, a machine learning approach to the identification of zero pronouns,
impersonal constructions, and explicit subjects was presented. In treating this range
of classes, complete coverage is provided for all possible constituents which may occur
in subject position in Spanish clauses.
In order to enable a machine learning approach to classification, a parsed corpus of
Spanish texts from the health and legal genres was compiled. The corpus was manually
annotated to encode information about the element in subject position for every finite
verb in the corpus (the eszic Corpus). A set of 14 features was formulated and training
data consisting of 6,827 instances represented by vectors of the feature values was cre-
ated (eszic training data). The training data was utilised by classification algorithms
distributed with the Weka package. Empirical observation revealed that use of the K*
algorithm was optimal for the purpose of this classification. The performance of this
machine learning approach was compared with that of Connexor’s Machinese Syntax
parser. Elliphant offers a classification with superior accuracy in the recognition of
both of the elliptic classes (zero pronouns and impersonal constructions), and also in
the classification of the non-elliptic subject class (explicit subjects). The method pre-
sented in this dissertation is also able to identify impersonal constructions in Spanish.
This is a task which appears not to have been dealt with before in the literature.
In addition to presenting results with regard to algorithm selection, further experiments
carried out with the underlying method covered parameter optimisation, identification
of the most effective combinations of features, the optimal number of instances to
include in the training data, and the relationships between the results and the different
genres on which the Elliphant system was tested. This chapter presents the findings
of all of these experiments (see section 5.1). In future research, it is intended that op-
timisation of the approach and its adaptability to other genres will be investigated in
more depth (see section 5.2).
5.1 Main Observations
Algorithm selection: the instance-based learning algorithm K* was selected for clas-
sification of elliptic vs. explicit subject instances and referential vs. non-referential
subject instances. This decision was taken after comparing the accuracy of this classifier
with that of the other classifiers available in the Weka package. In terms of accuracy,
the K* algorithm is closely followed by the Bayes-based algorithms in Weka.
Parameter optimisation was investigated by checking the impact of the param-
eter setting on the performance of the K* classifier. Although Weka provides sensible
default settings, it is by no means certain that they will be optimal for this particular
task. The default settings were changed so that a blending parameter of 40% was used
with regard to the K* algorithm.
Feature selection: the set of experiments conducted to determine an optimal
group of features to be utilised by the classification algorithm revealed that of the en-
tire set of 14 features, the most effective group comprises six of the features: nhprev
(number of noun phrases previous to the verb), parser (parsed subject), nhtot (num-
ber of noun phrases in the clause), pospos (four pos following the verb), person (verb
morphological person), and lemma (verbal lemma). This study showed that feature a
(preposition a) does not make any meaningful contribution to the classification.
Training data required: learning curve experiments showed the correlation between
the accuracy of the classifier and the size of the training set; performance reaches a
plateau at its maximum level when 90% of the available data is used.
Genre interference: We evaluated the performance of the Elliphant system sep-
arately in two different genres, legal and health, showing that there is some genre
interference in the classification tasks. Elliphant classifies explicit subjects in legal
texts with a higher accuracy than is the case in health texts. By contrast, zero
pronouns and impersonal constructions are more accurately classified in health texts. Cross-
genre training and testing demonstrated that legal instances are more informative and
homogeneous than health genre cases.
5.2 Future Research
Future research goals are related to improvements in: (1) optimisation of the Elliphant
system, (2) adaptation of the system to other genres, (3) inter-annotator agreement
on the eszic Corpus, (4) the comparison of Elliphant with a rule-based approach, and
(5) the design of an algorithm to resolve zero anaphora in Spanish.
Firstly, with regard to further improvement of the Elliphant system, the interaction
between (a) feature selection and parameter optimisation, and (b) class distribution will
be addressed. In related work, it was found that optimal settings for feature selection
and parameter optimisation should not be sought independently of one another since
there is an interaction between the two. The joint optimisation of feature selection
and parameter optimisation can cause variations in the accuracy levels obtained by
classifiers (Hoste, 2005). Additionally, an investigation will be made into how the class
distribution of the data affects learning. This will facilitate the compilation of an
optimal set of training instances as it has been found that training data containing a
lower distribution of negative instances can be beneficial to classification (Hoste, 2005).
In future work, evaluation and learning curve experiments in which training in-
stances derived from texts in one genre are used to classify instances derived from texts
in a different genre will provide an insight into the optimal type/combination of train-
ing data that enables better classification using fewer instances in various types/genres
of text, as well as providing additional robustness to our system.
Inter-annotator agreement will be measured, and it is planned to design a rule-based
algorithm to identify and to resolve zero anaphora in Spanish, as there is some debate
about which approach, machine learning or rule-based, brings optimal performance
when applied in anaphora resolution systems (Mitkov, 2002).
References
Aldea Muñoz, S. (2003). Un caso de intervención psicológica de la depresión infantil. psiquiatria.com, 7. 28

Aldea Muñoz, S. (2006). Influencia del autoconcepto y de la competencia social en la depresión infantil. psiquiatria.com, 10. 28

Alonso-Ovalle, L. & D'Introno, F. (2000). Full and null pronouns in Spanish: the zero pronoun hypothesis. In H. Campos, E. Herburger, A. Morales-Front & T.J. Walsh, eds., Hispanic linguistics at the turn of the millennium. Papers from the 3rd Hispanic Linguistics Symposium, 189–210, Cascadilla Press, Somerville, MA. 6

Balcázar Nava, P., Bonilla Muñoz, M.P., Gurrola Peña, G.M., Oudhof van Barneveld, H. & Aguilar Mercado, M.R. (2005). La depresión como problema de salud mental en los adolescentes mexicanos. psiquiatria.com, 9. 28

Barreras, J. (1993). Resolución de elipsis y técnicas de parsing en una interficie de lenguaje natural. Procesamiento del lenguaje natural, 13, 247–258. 7, 8

Beavers, J. & Sag, I. (2004). Coordinate ellipsis and apparent non-constituent coordination. In S. Müller, ed., Proceedings of the 11th International Conference on Head-Driven Phrase Structure Grammar (HPSG-04), 48–69, CSLI Publications, Stanford, CA. 17

Bello, A. ([1847] 1981). Gramática de la lengua castellana destinada al uso de los americanos. Instituto Universitario de Lingüística Andrés Bello, Cabildo Insular de Tenerife, Santa Cruz de Tenerife. 15, 19

Bergsma, S., Lin, D. & Goebel, R. (2008). Distributional identification of non-referential pronouns. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL/HLT-08), 10–18. 2, 10, 12

Bosque, I. (1989). Clases de sujetos tácitos. In J. Borrego Nieto, ed., Philologica: homenaje a Antonio Llorente, vol. 2, 91–112, Servicio de Publicaciones, Universidad Pontificia de Salamanca, Salamanca. 15, 16, 18, 19, 24

Boyd, A., Gegg-Harrison, W. & Byron, D. (2005). Identifying non-referential it: a machine learning approach incorporating linguistically motivated patterns. In Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in Natural Language Processing. 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), 40–47. 8, 10, 12, 13, 22, 45
Brucart, J.M. (1987). La elisión sintáctica en español. Universitat Autònoma de Barcelona, Bellaterra. 15
Brucart, J.M. (1999). La elipsis. In I. Bosque & V. Demonte, eds., Gramática descriptiva de la lengua española, vol. 2, 2787–2863, Espasa-Calpe, Madrid. ix, 15, 16, 17, 19, 23, 24
Carden, G. (1982). Backwards anaphora in discourse context. Journal of Linguistics, 18, 361–387. 33, 34
Chinchor, N. & Hirschman, L. (1997). MUC-7 Coreference task definition (version 3.0). In Proceedings of the 1997 Message Understanding Conference (MUC-97). 2
Chomsky, N. (1965). Aspects of the theory of syntax. The MIT Press, Cambridge, MA. 15
Chomsky, N. ([1968] 2006). Language and mind. Cambridge University Press, Cambridge, 3rd edn. 14
Chomsky, N. (1981). Lectures on government and binding. Mouton de Gruyter, Berlin, New York. 1, 6, 19
Chomsky, N. (1995). The minimalist program. The MIT Press, Cambridge, MA. 15
Chung, S., Ladusaw, W. & McCloskey, J. (1995). Sluicing and logical form. Natural Language Semantics, 3, 239–282. 17
Cleary, J. & Trigg, L. (1995). K*: an instance-based learner using an entropic distance measure. In Proceedings of the 12th International Conference on Machine Learning (ICML-95), 108–114. 13, 45, 46
Clemente, J., Torisawa, K. & Satou, K. (2004). Improving the identification of non-anaphoric it using Support Vector Machines. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP-04), 58–61. 10, 12
Código Civil (1889). Texto de la edición del Código Civil mandada publicar por el Real Decreto de 24 del corriente en cumplimiento de la ley de 26 de mayo último. Gaceta de Madrid, 206, 249–312. 26
Connexor Oy (2006a). Conexor functional dependency grammar 3.7. User's manual. 29, 38, 39
Connexor Oy (2006b). Machinese language model. 13, 29, 35
Constitución Española (1978). Constitución Española de 27 de diciembre de 1978. Boletín Oficial del Estado, 311, 29313–29424. 26
Corpas Pastor, G. (2008). Investigar con corpus en traducción: los retos de un nuevo paradigma. Peter Lang, Frankfurt am Main. 7, 8
Corpas Pastor, G., Mitkov, R., Afzal, N. & Pekar, V. (2008). Translation universals: do they exist? A corpus-based NLP study of convergence and simplification. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas (AMTA-08), 75–81. 2, 7, 8, 10
Danlos, L. (2005). Automatic recognition of French expletive pronoun occurrences. In R. Dale, K.F. Wong, J. Su & O.Y. Kwong, eds., Natural language processing. Proceedings of the 2nd International Joint Conference on Natural Language Processing (IJCNLP-05), 73–78, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 3651. 2, 10, 11, 12
Denber, M. (1998). Automatic resolution of anaphora in English. Tech. rep., Eastman Kodak Co. 10, 11, 12
Díaz Morfa, J. (2004). La crisis de las aventuras en las relaciones de pareja. psiquiatria.com, 8. 28
Díscolo, A. ([2nd century] 1987). Sintaxis. Gredos, Madrid. 14
Evans, R. (2000). A comparison of rule-based and machine learning methods for identifying non-nominal it. In D.N. Christodoulakis, ed., Natural Language Processing - NLP 2000. Proceedings of the 2nd International Conference on Natural Language Processing (NLP-2000), 233–241, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 1835. 10, 12
Evans, R. (2001). Applying machine learning: toward an automatic classification of it. Literary and Linguistic Computing, 16, 45–57. 2, 10, 12, 13, 22, 29, 38, 45
Fernández Soriano, O. & Táboas Baylín, S. (1999). Construcciones impersonales no reflejas. In I. Bosque & V. Demonte, eds., Gramática descriptiva de la lengua española, vol. 2, 1631–1722, Espasa-Calpe, Madrid. 18, 19
Ferrández, A. & Peral, J. (2000). A computational approach to zero-pronouns in Spanish. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL-2000), 166–172. 2, 6, 7, 8, 9, 11, 17, 22, 52, 55
Ferrández, A., Palomar, A. & Moreno, L. (1997). El problema del núcleo del sintagma nominal: ¿elipsis o anáfora? Procesamiento del lenguaje natural, 20, 13–26. 24
Ferrández, A., Palomar, A. & Moreno, L. (1998). Anaphor resolution in unrestricted texts with partial parsing. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (ACL/COLING-98), 385–391. 9
Ferrández, A., Palomar, A. & Moreno, L. (1999). An empirical approach to Spanish anaphora resolution. Machine Translation, 14, 191–216. 9
Fiengo, R. & May, R. (1994). Indices and identity. The MIT Press, Cambridge, MA. 17
Francis, W. (1958). The structure of American English. Ronald Press, New York. 15
Fries, C. (1940). American English grammar. Appleton-Century-Crofts, New York. 15
García Jurado, F. (2007). La etimología como historia de las palabras. E-excellence, Área de Cultura Clásica, Filología Clásica, 39, 1–27. 14
García Losa, E. (2008). Efectividad, operatividad y potenciación del tratamiento en patología fóbica, en el contexto de los servicios especializados de salud mental públicos: la utilización en la sala de consulta de los recursos de Internet. psiquiatria.com, 12. 26
Gómez Torrego, L. (1992). La impersonalidad gramatical: descripción y norma. Arco Libros, Madrid. 17, 18, 19, 23, 25
Grice, H. (1975). Logic and conversation. In P. Cole & J.L. Morgan, eds., Syntax and semantics, vol. 3: Speech Acts, 41–58, Academic Press, New York. 15
Gundel, J., Hedberg, N. & Zacharski, R. (2005). Pronouns without NP antecedents: how do we know when a pronoun is referential? In A. Branco, T. McEnery & R. Mitkov, eds., Anaphora processing: linguistic, cognitive and computational modelling, 351–364, John Benjamins, Amsterdam. 10, 12
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. & Witten, I.H. (2009). The WEKA data mining software: an update. SIGKDD Explorations, 11, 10–18. 41
Halliday, M.A.K. & Hasan, R. (1976). Cohesion in English. Longman, London. 15
Han, N. (2004). Korean null pronouns: classification and annotation. In Proceedings of the Workshop on Discourse Annotation. 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), 33–40. 7
Hernández Terrés, J.M. (1984). La elipsis en la teoría gramatical. Universidad de Murcia, Murcia. 14
Hirano, T., Matsuo, Y. & Kikui, G. (2007). Detecting semantic relations between named entities in text using contextual features. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics. Companion volume proceedings of the demo and poster sessions (ACL-07), 157–160. 2, 7, 8
Hobbs, J. (1977). Resolving pronoun references. Lingua, 44, 311–338. 52
Hoste, V. (2005). Optimization issues in machine learning of coreference resolution. Ph.D. thesis, University of Antwerp. 61
Hu, Q. (2008). A corpus-based study on zero anaphora resolution in Chinese discourse. Ph.D. thesis, City University of Hong Kong. 7, 8
Iida, R., Inui, K. & Matsumoto, Y. (2006). Exploiting syntactic patterns as clues in zero-anaphora resolution. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics and the 21st International Conference on Computational Linguistics (ACL/COLING-06), 625–632. 7, 8
Iida, R., Inui, K. & Matsumoto, Y. (2009). Capturing salience with a trainable cache model for zero-anaphora resolution. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL/AFNLP-09), 647–655. 2, 7, 8
Imamura, K., Saito, K. & Izumi, T. (2009). Discriminative approach to predicate-argument structure analysis with zero-anaphora resolution. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL/AFNLP-09), 85–88. 2, 7, 8
Isozaki, H. & Hirao, T. (2003). Japanese zero pronoun resolution based on ranking rules and machine learning. In Theoretical Issues in Natural Language Processing. Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP-03), 184–191. 7, 8
Järvinen, T. & Tapanainen, P. (1998). Towards an implementable dependency grammar. In A. Polguère & S. Kahane, eds., Proceedings of the Workshop on Processing of Dependency-Based Grammars. 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (ACL/COLING-98), 1–10. 28
Järvinen, T., Laari, M., Lahtinen, T., Paajanen, S., Paljakka, P., Soininen, M. & Tapanainen, P. (2004). Robust language analysis components for practical applications. In Proceedings of the 20th International Conference on Computational Linguistics (COLING-04), 53–56. 28, 29
Kawahara, D. & Kurohashi, S. (2004). Improving Japanese zero pronoun resolution by global word sense disambiguation. In Proceedings of the 20th International Conference on Computational Linguistics (COLING-04), 343–349. 2, 7, 8
Kibrik, A.A. (2004). Zero anaphora vs. zero person marking in Slavic: a chicken/egg dilemma? In Proceedings of the 5th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC-04), 87–90. 2, 7, 8
Kratzer, A. (1998). More structural analogies between pronouns and tenses. In Proceedings of Semantics and Linguistic Theory VIII (SALT-98), Cornell University, Ithaca, NY. 6
Kuno, S. (1972). Functional sentence perspective: a case study from Japanese and English. Linguistic Inquiry, 3, 269–320. 33
Lambrecht, K. (2001). A framework for the analysis of cleft constructions. Linguistics, 39, 463–516. 10, 12
Lancelot, C. & Arnauld, A. ([1660] 1980). Gramática general y razonada. Sociedad General Española de Librería, Madrid. 14
Lappin, S. & Leass, H. (1994). An algorithm for pronominal anaphora resolution. Computational Linguistics, 20, 535–561. 10, 11, 12, 52
Lee, S. & Byron, D. (2004). Semantic resolution of zero and pronoun anaphors in Korean. In Proceedings of the 5th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC-04), 103–108. 2, 7
Lee, S., Byron, D. & Jang, S. (2005). Why is zero marking important in Korean? In R. Dale, K.F. Wong, J. Su & O.Y. Kwong, eds., Natural language processing. Proceedings of the 2nd International Joint Conference on Natural Language Processing (IJCNLP-05), 588–599, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 3651. 7
Ley 29/1998 (1998). Ley 29/1998, de 13 de julio, reguladora de la Jurisdicción Contencioso-administrativa. Boletín Oficial del Estado, 167, 23516–23551. 26
Ley 29/2005 (2005). Ley 29/2005, de 29 de diciembre, de Publicidad y Comunicación Institucional. Boletín Oficial del Estado, 312, 42902–42905. 26
Ley 3/1991 (1991). Ley 3/1991, de 10 de enero, de Competencia Desleal. Boletín Oficial del Estado, 10, 959–962. 26
Ley Orgánica 10/1995 (1995). Ley Orgánica 10/1995, de 23 de noviembre, del Código Penal. Boletín Oficial del Estado, 281, 33987–34058. 26
Ley Orgánica 1/2002 (2002). Ley Orgánica 1/2002, de 22 de marzo, reguladora del Derecho de Asociación. Boletín Oficial del Estado, 73, 11981–11991. 26
Ley Orgánica 6/2001 (2001). Ley Orgánica 6/2001, de 21 de diciembre, de Universidades. Boletín Oficial del Estado, 307, 49400–49425. 26
Li, Y., Musilek, P. & Wyard-Scott, L. (2009). Identification of pleonastic it using the web. Computer Engineering, 34, 339–389. 10, 12
López Ortega, M.A. (2009). El cine como herramienta ilustrativa en la enseñanza de los trastornos de la personalidad. psiquiatria.com, 13. 26
Manning, C. & Schütze, H. (1999). Foundations of statistical natural language processing. The MIT Press, Cambridge, MA. 41
Matsui, T. (1999). Approaches to Japanese zero pronouns: centering and relevance. In D. Cristea, N. Ide & D. Marcu, eds., Proceedings of the Workshop on the Relation of Discourse/Dialogue Structure and Reference. 37th Annual Meeting of the Association for Computational Linguistics (ACL-99), 11–20. 2, 7, 8
Mel'čuk, I. (2003). Levels of dependency in linguistic description: concepts and problems. In Dependency and valency. An international handbook of contemporary research, 188–229, Mouton de Gruyter, Berlin, New York. 17
Mel'čuk, I. (2006). Zero sign in morphology. In Aspects of the theory of morphology, 447–495, Mouton de Gruyter, Berlin, New York. 6, 19
Mendikoetxea, A. (1994). La semántica de la impersonalidad. In C. Sánchez, ed., Las construcciones con se, 239–267, Visor, Madrid. 18
Mendikoetxea, A. (1999). Construcciones con se: medias, pasivas e impersonales. In I. Bosque & V. Demonte, eds., Gramática descriptiva de la lengua española, vol. 2, 1575–1630, Espasa-Calpe, Madrid. 18
Merchant, J. (2001). The syntax of silence. Sluicing, islands and the theory of ellipsis. Oxford University Press, Oxford. 17
Mitkov, R. (1998). Robust pronoun resolution with limited knowledge. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (ACL/COLING-98), 869–875. 12, 52
Mitkov, R. (2001). Outstanding issues in anaphora resolution. In A. Gelbukh, ed., Proceedings of the 2nd International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-01), 110–125, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 2004. 10
Mitkov, R. (2002). Anaphora resolution. Longman, London. 6, 8, 10, 33, 52, 61
Mitkov, R. (2010). Discourse processing. In A. Clark, C. Fox & S. Lappin, eds., The handbook of computational linguistics and natural language processing, 599–629, Wiley Blackwell, Oxford. 2, 5, 10
Mitkov, R., Evans, R. & Orasan, C. (2002). A new, fully automatic version of Mitkov's knowledge-poor pronoun resolution method. In Proceedings of the 3rd International Conference on Computational Linguistics and Intelligent Text Processing (CICLing-02), 69–83, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 2276. 10, 12
Molina López, D. (2008). Y de los hermanos ¿qué? Cómo ayudar a los hermanos de un TLP. psiquiatria.com, 12. 28
Mori, T. & Nakagawa, H. (1996). Zero pronouns and conditionals in Japanese instruction manuals. In Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), 782–787. 7, 8
Müller, C. (2006). Automatic detection of nonreferential it in spoken multi-party dialog. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06), 49–56. 10, 12, 13
Murata, M., Isahara, H. & Nagao, M. (1999). Pronoun resolution in Japanese sentences using surface expressions and examples. In A. Bagga, B. Baldwin & S. Shelton, eds., Proceedings of the Workshop on Coreference and Its Applications. 37th Annual Meeting of the Association for Computational Linguistics (ACL-99), 39–46. 7, 8
Nakagawa, H. (1992). Zero pronouns as experiencer in Japanese discourse. In Proceedings of the 15th International Conference on Computational Linguistics (COLING-92), 324–330. 7, 8
Nakaiwa, H. (1997). Automatic identification of zero pronouns and their antecedents within aligned sentence pairs. In Proceedings of the 3rd Annual Meeting of the Association for Natural Language Processing in Japan (ANLP-97), 127–141. 7, 8
Nakaiwa, H. & Ikehara, S. (1992). Zero pronoun resolution in a Japanese to English machine translation system by using verbal semantic attributes. In Proceedings of the 3rd Conference on Applied Natural Language Processing (ANLP-92), 201–208. 7, 8
Nakaiwa, H. & Shirai, S. (1996). Anaphora resolution of Japanese zero pronouns with deictic reference. In Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), 812–817. 7, 8
Ng, V. & Cardie, C. (2002). Identifying anaphoric and non-anaphoric noun phrases to improve coreference resolution. In Proceedings of the 19th International Conference on Computational Linguistics (COLING-02), 1–7. 10, 12
Nomoto, T. & Yoshihiko, N. (1993). Resolving zero anaphora in Japanese. In Proceedings of the 6th Conference of the European Chapter of the Association for Computational Linguistics (EACL-93), 315–321. 7, 8
Okumura, M. & Tamura, K. (1996). Zero pronoun resolution in Japanese discourse based on centering theory. In Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), 871–876. 1, 7
Paice, C.D. & Husk, G.D. (1987). Towards an automatic recognition of anaphoric features in English text: the impersonal pronoun it. Computer Speech and Language, 2, 109–132. 10, 11, 12
Peng, J. & Araki, K. (2007a). Zero anaphora resolution in Chinese and its application in Chinese-English machine translation. In Z. Kedad, N. Lammari, E. Métais, F. Meziane & Y. Rezgui, eds., Natural language processing and information systems. Proceedings of the 12th International Conference on Applications of Natural Language to Information Systems (NLDB-07), 364–375, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 4592. 7
Peng, J. & Araki, K. (2007b). Zero-anaphora resolution in Chinese using maximum entropy. IEICE Transactions on Information and Systems, E90-D, 1092–1102. 7, 8
Peral, J. (2002). Resolución y generación de la anáfora nominal en español e inglés en un sistema de traducción automática. Procesamiento del lenguaje natural, 28, 127–128. 7, 8
Peral, J. & Ferrández, A. (2000). Generation of Spanish zero-pronouns into English. In D.N. Christodoulakis, ed., Natural Language Processing - NLP 2000. Proceedings of the 2nd International Conference on Natural Language Processing (NLP-2000), 252–260, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 1835. 2, 7, 8
Pintor García, M. (2007). Análisis factorial de las actitudes personales en educación secundaria. Un estudio empírico en la Comunidad de Madrid. psiquiatria.com, 11. 28
Pollard, C. & Sag, I. (1994). Head-Driven Phrase Structure Grammar. CSLI Publications, Stanford, CA. 19
Real Academia Española (1977). Esbozo de una nueva gramática de la lengua española. Espasa-Calpe, Madrid. 19
Real Academia Española (2001). Diccionario de la lengua española. Espasa-Calpe, Madrid, 22nd edn. 15, 40, 41
Real Academia Española (2009). Nueva gramática de la lengua española. Espasa-Calpe, Madrid. ix, 6, 15, 16, 17, 18, 19, 22, 23, 24, 25, 33, 34
Recasens, M. & Hovy, E. (2009). A deeper look into features for coreference resolution. In L.D. Sobha, A. Branco & R. Mitkov, eds., Anaphora Processing and Applications. Proceedings of the 7th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC-09), 29–42, Springer, Berlin, Heidelberg, New York, Lecture Notes in Computer Science, Vol. 5847. 2, 6, 11
Rello, L. & Illisei, I. (2009a). A comparative study of Spanish zero pronoun distribution. In Proceedings of the International Symposium on Data and Sense Mining, Machine Translation and Controlled Languages, and their application to emergencies and safety critical domains (ISMTCL-09), 209–214, Presses Universitaires de Franche-Comté, Besançon. 3, 7
Rello, L. & Illisei, I. (2009b). A rule-based approach to the identification of Spanish zero pronouns. In Student Research Workshop. International Conference on Recent Advances in Natural Language Processing (RANLP-09), 209–214. 3, 7, 8, 9, 10, 11, 22, 35
Rello, L., Baeza-Yates, R. & Mitkov, R. (2010a). Improved subject ellipsis detection in Spanish. Submitted. 3
Rello, L., Suárez, P. & Mitkov, R. (2010b). A machine learning method for identifying non-referential impersonal sentences and zero pronouns in Spanish. Procesamiento del Lenguaje Natural, 45, 281–287. 3
Ross, J. (1967). Constraints on variables in syntax. Ph.D. thesis, Massachusetts Institute of Technology. 17
Sánchez de las Brozas, F. ([1562] 1976). Minerva. De la propiedad de la lengua latina. Cátedra, Madrid. 14
Sasano, R., Kawahara, D. & Kurohashi, S. (2008). A fully-lexicalized probabilistic model for Japanese zero anaphora resolution. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING-08), 769–776. 7, 8
Seco, M. (1988). Manual de gramática española. Aguilar, Madrid. 19
Seki, K., Fujii, A. & Ishikawa, T. (2002). A probabilistic method for analyzing Japanese anaphora integrating zero pronoun detection and resolution. In Proceedings of the 19th International Conference on Computational Linguistics (COLING-02), 911–917. 7, 8
Sevillano Arroyo, M.A. & Ducret Rossier, F.E. (2008). Las emociones en la psiquiatría. psiquiatria.com, 12. 26
Shopen, T. (1973). Ellipsis as grammatical indeterminacy. Foundations of Language, 10, 65–77. 15
Steinberger, J., Poesio, M., Kabadjov, M.A. & Ježek, K. (2007). Two uses of anaphora resolution in summarization. Information Processing and Management, 43, 1663–1680. 2, 7
Streb, J., Hennighausen, E. & Rösler, F. (2004). Different anaphoric expressions are investigated by event-related brain potentials. Journal of Psycholinguistic Research, 33, 175–201. 15
Takada, S. & Doi, N. (1994). Centering in Japanese: a step towards better interpretation of pronouns and zero-pronouns. In Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 1151–1156. 7, 8
Tanaka, I. (2000). Cataphoric personal pronouns in English news reportage. In Proceedings of the 3rd Discourse Anaphora and Anaphor Resolution Colloquium (DAARC-2000), 108–117. 33, 34
Tapanainen, P. (1996). The constraint grammar parser CG-2. Department of General Linguistics, University of Helsinki, Publications, Vol. 27. 28
Tapanainen, P. & Järvinen, T. (1997). A non-projective dependency parser. In Proceedings of the 5th Conference on Applied Natural Language Processing (ANLP-97), 64–71. 13, 28
Tesnière, L. (1959). Éléments de syntaxe. Klincksieck, Paris. 28
Theune, M., Hielkema, F. & Hendriks, P. (2006). Performing aggregation and ellipsis using discourse structures. In Research on Language & Computation, vol. 4, 353–375, Springer, Berlin, Heidelberg, New York. 7
Wilder, C. (1997). Some properties of ellipsis in coordination. In Studies in universal grammar and typological variation, 59–107, John Benjamins, Amsterdam. 17
Witten, I.H. & Frank, E. (2005). Data mining: practical machine learning tools and techniques. Morgan Kaufmann, London, 2nd edn. 26, 38, 41, 44, 46
Yeh, C. & Chen, Y. (2003a). Using zero anaphora resolution to improve text categorization. In Proceedings of the 17th Pacific Asia Conference on Language, Information and Computation (PACLIC-03), 423–430. 2, 7, 8
Yeh, C. & Chen, Y. (2003b). Zero anaphora resolution in Chinese with partial parsing based on centering theory. In Proceedings of the International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE-03), 683–688. 7, 8
Yeh, C. & Chen, Y. (2007). Topic identification in Chinese based on centering model. Journal of Chinese Language and Computing, 17, 83–96. 2, 7, 8
Yeh, C. & Mellish, C. (1997). An empirical study on the generation of zero anaphors in Chinese. Computational Linguistics, 23, 171–190. 7, 8
Yoshimoto, K. (1988). Identifying zero pronouns in Japanese dialogue. In Proceedings of the 12th International Conference on Computational Linguistics (COLING-88), 779–784. 7, 8
Zhao, S. & Ng, H. (2007). Identification and resolution of Chinese zero pronouns: a machine learning approach. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP/CoNLL-07), 541–550. 2, 7, 8