Natural Language Generation for the Semantic Web:
Unsupervised Template Extraction

Daniel Duma

MSc Speech and Language Processing
Philosophy, Psychology and Language Sciences
University of Edinburgh
2012
Abstract
I propose an architecture for a Natural Language Generation system that
automatically learns sentence templates, together with statistical document
planning, from parallel RDF data and text. To this end, I design, build and
test a proof-of-concept system (“LOD-DEF”) trained on un-annotated text
from the Simple English Wikipedia and RDF triples from DBpedia, with the
communicative goal of generating short descriptions of entities in an RDF
ontology. Inspired by previous work, I implement a baseline triple-to-text
generation system and I conduct human evaluation of the LOD-DEF system
against the baseline and human-generated output. LOD-DEF significantly
outperforms the baseline on two of the three measures: non-redundancy, and structure and
coherence.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is my own
except where explicitly stated otherwise in the text, and that this work has not been submitted
for any other degree or professional qualification except as specified.
(Daniel Duma)
Acknowledgements
I am indebted to the many people who have, directly or indirectly, contributed to this effort.
First, to my parents, Eugenia and Calin Duma, for without them I would not be here to tell this
story, and to Decebal Duma, for his financial support and a wealth of stories to entertain friends
with.
Second, to everyone who helped in some way to the completion of this thesis, and especially to
Austin Leirvik, Ben Dawson, Cristian Kliesch, Dan Maftei and Magda Aniol for supplementing
my lack of knowledge with patient explanations and helpful hints.
Third, to my supervisor, Ewan Klein, for being a continuous source of encouragement and for his
many helpful pointers along the way.
Finally, I want to thank everyone who has made this year of my life something more than one
never-ending night in DSB. And to you, caffeine, for packing three days into one.
Table of Contents

Chapter 1 Introduction and background
1.1 Introduction
1.2 Overview of this thesis
1.3 Semantic Web and Linked Data
1.4 RDF triples: data format for the Semantic Web
1.5 DBpedia: the hub of the LOD Cloud
1.6 Natural Language Generation
1.6.1 Shallow vs. Deep NLG
Chapter 2 Previous approaches
2.1 Hand-coded, rule-based
2.1.1 Assessment
2.2 Generating directly from RDF
2.2.1 Assessment
2.3 Unsupervised trainable NLG
2.3.1 Assessment
2.4 Automatic summarisation
Chapter 3 Design
3.1 Design overview
3.2 Goal
3.3 Tasks
3.3.1 Aligning data and text
3.3.2 Extracting templates
3.3.3 Dealing with Linked Open Data
3.3.4 Modelling different classes
3.3.5 Document planning
3.4 Baseline
3.4.1 Coherent text
Chapter 4 Implementation: Training
4.1 Obtaining the data
4.1.1 Wikipedia text
4.1.2 DBpedia triples
4.2 Tokenizing and text normalisation
4.3 Aligning: Named Entity Recognition
4.3.1 Surface realisation generation
4.3.2 Spotting
4.4 Class selection
4.5 Coreference resolution
4.6 Parsing
4.7 Syntactic pruning
4.8 Store annotations
4.9 Post-processing
4.9.1 Cluster predicates into pools
4.9.2 Purge and filter sentences
4.9.3 Compute n-gram probabilities and store model
Chapter 5 Implementation: Generation
5.1 Retrieve RDF triples
5.2 Choose best class for entity
5.3 Chart generation
5.4 Viterbi generation
5.5 Filling the slots
Chapter 6 Experiments
6.1 Problems with the data
6.2 Performance of the system
6.2.1 Spotting performance
6.2.2 Parser performance
6.2.3 Class selection performance
6.2.4 Template extraction
6.2.5 Examples of errors in output
Chapter 7 Evaluation
7.1 Approach
7.2 Selection of data
7.3 Human generation
7.4 LOD-DEF generation
7.5 Human rating
7.6 Results
7.7 Discussion
Chapter 8 Conclusion and future work
8.1 Conclusion
8.2 Future work
Appendix A: Human generation
Appendix B: Human evaluation
References
Chapter 1
Introduction and background
1.1 Introduction
The next generation of the web is in the making. The amount of information on the Semantic
Web is growing fast; this open, structured, explicitly meaningful machine-readable data on the
Web is already forming a web of data, a “giant global graph consisting of billions of RDF
statements from numerous sources covering all sorts of topics” (Heath & Bizer, 2011).
This information space is, however, designed to be used by machines rather than humans (Gerber
et al., 2006); humans are meant to access it via intelligent user agents, such as information
brokers, search agents and information filters (Decker et al., 2000) which on the whole are
expected to take the shape of question-answering systems (Bontcheva & Davis, 2009).
A crucial element of such a question-answering system is then the ability to communicate with
the user using natural language (Bontcheva & Davis, 2009), both understanding user input and
generating natural language to relay information to the user.
This is why the role of Natural Language Generation is potentially key in the Semantic Web
vision (Galanis & Androutsopoulos, 2007) where, for applications generating textual output, the
text presented to users can be generated or updated by NLG systems using data on the web as
input. However, Natural Language Generation systems have traditionally relied on hand-built
templates and schemas and many expert-hours of work. While this has been a successful
approach in several domains (e.g. Androutsopoulos et al., 2001), it is frequently observed that it
does not scale well, is not easy to transfer across domains, and requires many expert man-hours,
which makes it expensive and impractical for many applications (Busemann & Horacek,
1998). The scale and decentralised nature of the Semantic Web suggests this is one of these
applications.
Recent initiatives by organisations and governments, coupled with efforts in text mining have
made large knowledge bases publicly available, such as census results, biomedical databases and
more general ones like DBpedia. These now contain information also found in natural language
texts, starting with the very ones the information was mined from. I propose here that, given the
widespread availability on the web of these parallel text and data resources, and the mature state
of key Natural Language Processing technologies, NLG systems could be automatically trained
from these resources with little or no human intervention. The wealth of research done in
trainable NLG and automatic summarisation suggests that it is feasible for these systems to learn
how to generate natural language by analysing existing human-authored natural language text in
an unsupervised fashion. This would make them inexpensive to build and deploy, easier to
transfer to other domains, and potentially multilingual, which would contribute massively to
making the Semantic Web vision a reality.
The aim of this project is to propose an architecture for an NLG system that can automatically
learn sentence templates and document planning from parallel data and text, for the
communicative goal of describing entities in an RDF ontology. The emphasis of the system is on
expressing literal, non-temporal, factual data.
The system is trained from text and data by performing four main actions: given parallel data
and text about a number of entities, first it aligns the text and the data by finding literal values
in the text. Second, it extracts and modifies sentences that express these values, so as to use
them as sentence templates. Third, it collects statistics about the frequency with which a spotted
property follows another. Finally, it determines the class of entity that the text and data describe
and builds a separate model for that class.
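The first two of these steps can be illustrated with a minimal sketch (the entity data, text and function names below are invented for illustration and do not reflect the actual LOD-DEF implementation):

```python
# Toy parallel data and text for one entity; values and property names are
# invented for illustration, not taken from DBpedia.
triples = {"name": "Edinburgh", "country": "Scotland", "population": "495360"}
text = "Edinburgh is a city in Scotland. It has a population of 495360."

def extract_templates(triples, text):
    """Align data and text by spotting literal values, then turn each
    matched sentence into a sentence template with property slots."""
    templates = []
    for sentence in text.split(". "):
        template, spotted = sentence, []
        for prop, value in triples.items():
            if value in template:                       # step 1: align
                template = template.replace(value, "[" + prop + "]")
                spotted.append(prop)
        if spotted:                                     # step 2: keep as template
            templates.append((template, spotted))
    return templates

for template, props in extract_templates(triples, text):
    print(props, "->", template)
```

Steps three and four would then operate on the spotted property sequences: counting how often one spotted property follows another, and storing the result in a per-class model.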
The nature of this project is exploratory rather than exhaustive. To this end, I design, build and
evaluate a proof-of-concept system (LOD-DEF) trained on text from the Simple English
Wikipedia and data from DBpedia. No part of its architecture is exhaustively optimised, and all
the modules in the pipeline can be seen as baselines for their specific function. The system is
thus in itself a baseline implementation; its goal is to explore the feasibility of this approach
and, hopefully, to inspire others to improve upon it.
1.2 Overview of this thesis
In this chapter, I present the case for trainable Natural Language Generation for the Semantic
Web and provide an overview of these two technology areas.
In Chapter 2 I review the recent approaches most relevant to the present one, on which this
project is either based, or to which it is theoretically related.
Chapter 3 considers the project from a design standpoint: it formulates the goals of the project
and the criteria it must abide by, identifies the problem areas and lays out the design of the
solutions. In this chapter I also present the full specification of the non-trainable baseline
generation system implemented.
In Chapter 4 the training pipeline is laid out and technically detailed with step-by-step examples.
The same is done for the generation pipeline in Chapter 5.
In Chapter 6 a number of experiments are discussed, reporting the performance of the system on
different metrics and analysing some interesting findings.
Chapter 7 contains the detailed analysis of the human evaluation of the system. Chapter 8
contains the conclusion and suggests possible directions for future work.
1.3 Semantic Web and Linked Data
Where the terms Semantic Web and Web of Data refer to a vision of the web to come (Berners-
Lee et al., 2001), the term Linked Data is more concrete, referring to “a set of best practices for
publishing and interlinking structured data on the Web” (Heath & Bizer, 2011). These best
practices (the “Linked Data Principles”) require the use of a number of web-based open formats
for publishing this data, such as HTTP as a transport layer, URIs as identifiers for resources and
RDF for linking datasets (see section 1.4 for a detailed explanation). Essentially, “the basic idea
of Linked Data is to apply the general architecture of the World Wide Web to the task of sharing
structured data on global scale” (Heath & Bizer, 2011). To realise the SW vision, inference must
be added to this Linked Data.

Linked Data does not need to be open, i.e. accessible to anyone, but increasingly organisations
are publishing Linked Open Data (LOD). The rate of publication of this information has been
steadily increasing over the past years, forming a big “data cloud” containing, among others,
“geographic locations, people, companies, books, scientific publications, films, music, television
and radio programmes, genes, proteins, drugs and clinical trials, statistical data, census results,
online communities and reviews” (Heath & Bizer, 2011).

Figure 1.1 Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch (Cyganiak & Jentzsch, 2011)

As of late, there were over 31 billion RDF triples (statements) in datasets linked in the LOD
Cloud (Bizer et al., 2011), from 295 different datasets. Figure 1.1 represents these nodes as
bubbles, interconnected by edges.

It is these edges that are most interesting; perhaps the key aspect of this effort is that the
published datasets are explicitly linked together by using common vocabularies and ontologies.
Both the vocabularies and the data can be published by any organisation or individual, leading
to the somewhat famous observation that “Anyone can say Anything about Anything” (Klyne &
Carroll, 2002). Throughout this thesis, “Semantic Web” and “Web of Data” are both taken to
mean Linked Open Data, thus referring to data published in adherence to the Linked Data
Principles¹.
1.4 RDF triples: data format for the Semantic Web
Resource Description Framework (RDF) is the default and recommended data format for the
Web of Data (Heath & Bizer 2011). RDF triples represent the simplest statement that can be
made, involving two entities (nodes in a conceptual graph) and a relation between them (an
edge). These are often called subject, predicate and object, and a triple must contain all three of
them. Another way of reading this information is that the subject has a property (predicate), the
value of which is the object. I use both naming conventions throughout this thesis.
An example of a triple would be:

http://dbpedia.org/resource/United_States_of_America
http://dbpedia.org/property/leaderName
http://dbpedia.org/resource/Barack_Obama

Figure 1.1 Example of an RDF triple
where United_States_of_America has a property leaderName, the value of which is
Barack_Obama. This could be conceptually read as “the name of the leader of the USA is
Barack Obama”. This triple then connects two entities in the graph, United_States_of_America
and Barack_Obama, via the edge leaderName.
A central aspect of RDF triples is the fact that subjects and predicates must necessarily be URIs
(Uniform Resource Identifiers). The concept of a URL (Uniform Resource Locator) is perhaps a
familiar one, given how commonplace the use of web addresses (e.g.
http://www.google.com) has become. A URI differs from a URL in that, although it must also
be globally unique, it does not need to be “dereferenceable”, that is, if we point a web browser to
that address, the browser may not be able to load a web page to show. It is recommended good
practice that URIs be made dereferenceable (Heath & Bizer 2011), but it is not required.
Objects of triples can either be URIs (e.g. “http://dbpedia.org/resource/Barack_Obama” in the
previous example) or literal values, such as character strings (e.g. “Hogwarts”), dates (e.g. “1066-
10-12”^^xsd:date) or numbers in different formats. RDF literals may have either a data type or a
language suffix (e.g. @en for English), but not both. A language suffix automatically identifies
the value as a string literal.
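The constraint on literals can be made concrete with a small sketch (a toy representation for illustration, not a real RDF library):

```python
from collections import namedtuple

# A toy representation of an RDF literal: it may carry a datatype OR a
# language suffix, never both.
Literal = namedtuple("Literal", ["value", "datatype", "lang"])

def make_literal(value, datatype=None, lang=None):
    if datatype is not None and lang is not None:
        raise ValueError("a literal cannot have both a datatype and a language suffix")
    return Literal(value, datatype, lang)

school = make_literal("Hogwarts", lang="en")              # plain string literal
battle = make_literal("1066-10-12", datatype="xsd:date")  # typed literal
```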
A number of serializations of RDF exist, for a number of purposes; the one in use throughout
this document is N3/Turtle (Berners-Lee & Connolly, 2011). This serialization is intended to be
easier for humans to access directly: among other things, it allows for the shortening of
namespaces to make triples easier to read. In this serialisation, we can define a number of
prefixes for namespaces: @prefix : <http://dbpedia.org/resource/> @prefix dbprop: <http://dbpedia.org/property/> @prefix dbont: <http://dbpedia.org/ontology/>
This allows us to write the previous triple as: dbpedia:United_States_of_America dbprop:leaderName dbpedia:Barack_Obama
1 See http://lab.linkeddata.deri.ie/2010/star-scheme-by-example/ for an intuitive overview of Linked Open Data.
When accessed, each of the URIs in this triple would be expanded back to the values in Figure
1.1. Finally, Notation 3 permits defining and using a default namespace, identified by a single
colon. Henceforth, the default namespace used in examples throughout this document is
<http://dbpedia.org/resource/> for ease of reading:
:United_States_of_America dbprop:leaderName :Barack_Obama.
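The shortening works by simple string substitution; a minimal sketch of the expansion step, using the prefix table from the examples above:

```python
# Prefix table from the examples above; "" stands for the default namespace ":".
PREFIXES = {
    "dbpedia": "http://dbpedia.org/resource/",
    "dbprop": "http://dbpedia.org/property/",
    "dbont": "http://dbpedia.org/ontology/",
    "": "http://dbpedia.org/resource/",
}

def expand(name):
    """Expand a prefixed name like 'dbprop:leaderName' into a full URI."""
    prefix, _, local = name.partition(":")
    return PREFIXES[prefix] + local

print(expand("dbprop:leaderName"))
# -> http://dbpedia.org/property/leaderName
print(expand(":United_States_of_America"))
# -> http://dbpedia.org/resource/United_States_of_America
```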
Ontologies can be built on top of RDF using a standard class inheritance mechanism via the
predicate rdf:type. RDF implements multiple inheritance, that is, an instance can belong to any
number of classes. On top of this basic framework, more complex mechanisms to allow for
inheritance and reasoning have been implemented, most importantly for this project RDFS
(RDF Schema) and OWL (Web Ontology Language). Different extensions of OWL (OWL Lite,
OWL DL, OWL Full) can encode different types of formal logic (Bechhofer et al., 2004), but this
is outside the scope of this project.
A property, being a URI, can have properties itself. For every property, its rdfs:domain property
restricts the classes and instances which can have this property and its rdfs:range specifies what
values this property can take.
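As an illustration, a toy consistency check using rdfs:domain and rdfs:range (the schema and type assignments below are invented for the example, not DBpedia's actual declarations, and a real checker would also follow subclass relations):

```python
# Invented schema and type assertions, for illustration only.
SCHEMA = {"dbprop:leaderName": {"domain": "dbont:Country", "range": "dbont:Person"}}
TYPES = {
    "dbpedia:United_States_of_America": "dbont:Country",
    "dbpedia:Barack_Obama": "dbont:Person",
}

def consistent(subject, predicate, obj):
    """True if the subject's and object's classes satisfy the property's
    rdfs:domain and rdfs:range (ignoring subclass inference)."""
    constraints = SCHEMA[predicate]
    return (TYPES.get(subject) == constraints["domain"]
            and TYPES.get(obj) == constraints["range"])

print(consistent("dbpedia:United_States_of_America",
                 "dbprop:leaderName", "dbpedia:Barack_Obama"))   # -> True
```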
There are many standard prefixes for namespaces defining vocabularies with widely-used and
well-defined semantics. Two examples are foaf (“Friend Of A Friend”) and dc (“Dublin Core”).
Very important to the design of the system presented here are the widely used properties
foaf:name, the description of which simply stands for “a name for some thing” (Brickley &
Miller, 2010), and rdfs:label, which “may be used to provide a human-readable version of a
resource's name” (Brickley & Guha, 2004).
Beyond storage, the emphasis of the LOD approach is that this data can be queried in highly
complex ways. The default query language for this is SPARQL (Prud'hommeaux & Seaborne,
2008) which is a structured query language based on matching patterns of triples, allowing for
highly complex filters using logic, inference and regular expressions. Some examples of these
queries, expressed in natural language, could be:
• Skyscrapers in China that have more than 50 floors
• Albums from the Beach Boys that were released between 1980 and 1990
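SPARQL answers such queries by matching triple patterns containing variables against the graph and filtering the bindings. A minimal Python analogue of this matching, over an invented toy graph where None plays the role of a variable:

```python
# An invented toy graph; entity and property names are illustrative only.
graph = [
    ("Shanghai_Tower", "locatedIn", "China"),
    ("Shanghai_Tower", "floorCount", 128),
    ("Willis_Tower", "locatedIn", "USA"),
    ("Willis_Tower", "floorCount", 108),
]

def match(graph, pattern):
    """Return all triples matching a pattern; None matches anything."""
    s, p, o = pattern
    return [t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "Skyscrapers in China that have more than 50 floors":
in_china = {s for s, _, _ in match(graph, (None, "locatedIn", "China"))}
tall = {s for s, _, o in match(graph, (None, "floorCount", None)) if o > 50}
print(in_china & tall)   # -> {'Shanghai_Tower'}
```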
1.5 DBpedia: the hub of the LOD Cloud
At the centre of the LOD cloud (Figure 1.1) lies DBpedia. This multi-lingual knowledge
base contains knowledge that has been extracted from the infobox systems of Wikipedias in 15
different languages (Mendes et al., 2012). Infoboxes are human-authored tables of information,
akin to collections of attribute-value pairs, that appear on the side of an article on Wikipedia.
They contain factual information such as dates, population sizes, titles of national anthems, etc.,
in a format that is easy to mine for data. This extracted data is stored as RDF triples, using a
number of standard vocabularies for properties (e.g. foaf, dc).
As described in (Mendes et al., 2012), “the DBpedia Ontology organizes the knowledge on
Wikipedia in 320 classes which form a subsumption hierarchy and are described by 1,650
different properties. It features labels and abstracts for 3.64 million things in up to 97 different
languages of which 1.83 million are classified in a consistent ontology, including 416,000 persons,
526,000 places, 106,000 music albums, 60,000 films, 17,500 video games, 169,000 organizations,
183,000 species and 5,400 diseases.”
Due to its status as the unofficial hub of the LOD cloud (being used to interlink many datasets),
and its breadth of coverage, the data in DBpedia seems to be an ideal starting dataset for
any approach that aims to generate natural language using data in the LOD Cloud.
1.6 Natural Language Generation
Natural Language Generation is the process of creating text in natural language (e.g. English)
from an input of conceptual information. NLG allows for adapting the text to specific
communicative goals and to user preferences and for generating text in different natural
languages using the same underlying representation. In what is perhaps the canonical reference of
the field, Reiter & Dale (2000), the authors propose and describe a standard architecture for a
Natural Language Generation system. While the approach to NLG in the present project is far
from this level of sophistication, it is relevant to present an overview of this architecture here to
put the task at hand in context.
According to Reiter & Dale (2000), the architecture of a Natural Language Generation system is
modularised, formed of a number of distinct, well-defined and easily integrated modules which
perform different functions. A graphical representation of this architecture is provided in Figure
1.2.
It is sometimes observed that NLG consists in making choices (what to say, in what order, with
what words, etc.). These choices depend on each other, but can be separated in different levels,
forming a pipeline. In this pipeline, domain data in some internal semantic representation is
input at one end and natural language text is output at the other.
Figure 1.2 Natural Language Generation Architecture (Based on Reiter and Dale, 2000)
The pipeline consists of three main stages, implemented by as many components:
1. In the document planning stage, the data to be included in the generated text is chosen
(content determination), as is the order in which to present it (document structuring).
These processes produce an intermediate representation of the structure of the document,
labelled “document plan” in the diagram, typically a tree structure.
Document planning takes domain data as input together with a communicative goal, that
is, the purpose of the text that is to be generated, such as “describing an entity”,
“recommending a restaurant”, or “comparing flights”, depending on the application. The
communicative goal typically determines both content determination and document
structuring.
Document planning, as well as the other components in this pipeline, can be informed,
among others, by: discourse strategies (helping realise the communicative goal), dialogue
history (in a dialogue system), constraints upon the output (the resulting text might
need to fit in a constrained space, etc.).
Most importantly however, it can be informed by a user model, capturing preferences or
specific circumstances that characterise the target audience of the text. Depending on the
application, system and communicative goal, this could mean e.g. a preference on
sentence length or for the ordering of an argument in the case of a recommendation.
2. In the microplanning stage, the document plan is taken as input to a number of sub-
processes, which are to a great extent dependent on each other.
a. Lexicalisation is choosing the content words required in order to express the
content selected by the previous module.
b. Aggregation consists in joining together short sentences or chunks of text to
create longer sentences. Both coordination and subordination strategies may play
a role in this.
c. Referring Expression Generation (REG) deals with how to refer to entities in the
discourse. There are multiple ways in which we can refer to the same real-world
entity. For example, “Barack Obama” might be referred to as “President
Obama”, “Obama”, “the President”, or simply “he”, depending on the
communicative context. A distinction is usually made between the first time an
entity is mentioned (“initial reference”) and “subsequent reference”. Depending on
other factors such as pragmatic considerations, we might want to avoid repetition
by using personal pronouns and other referring expressions. These also depend on
style considerations of the textual domain. For instance, in newswire it might be
preferred to use “President Obama” and “the President” instead of “he”.
3. The surface realisation component takes text specifications as input and outputs surface
text. Surface realisation often adopts an “overgenerate and rank” approach, where a
number of possible surface realisations are generated and then ranked using a language
model (i.e. how likely that sequence of words is) or other scoring functions.
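As a concrete illustration of the overgenerate-and-rank idea, the following sketch (my own; the toy bigram model and function names are invented for illustration, not taken from any particular system) scores candidate realisations with a bigram language model and returns them best-first:

```python
import math

def score(candidate, bigram_logprob):
    """Sum log-probabilities of adjacent word pairs (a toy bigram LM);
    unseen bigrams get a small floor probability."""
    words = candidate.split()
    return sum(bigram_logprob.get((a, b), math.log(1e-6))
               for a, b in zip(words, words[1:]))

def rank(candidates, bigram_logprob):
    """Overgenerate-and-rank: return candidate realisations best-first
    according to the language model score."""
    return sorted(candidates, key=lambda c: score(c, bigram_logprob),
                  reverse=True)

lm = {("was", "born"): math.log(0.4), ("born", "in"): math.log(0.5)}
rank(["Bach was born in Eisenach", "Bach born was Eisenach in"], lm)
```

A real system would use a language model trained on a large corpus; the principle of scoring and sorting candidates is the same.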
This pipeline allows for much control over the output text, permits a high degree of confidence
that the text will be grammatical and semantically accurate by design, and most importantly,
allows for adapting and adjusting the output text according to a user model. This has been
called the “deep” model for generation, in contrast with “shallow” methods based on templates,
as outlined in the following.
1.6.1 Shallow vs. Deep NLG
It has been noted that there is a continuum between shallow and deep NLG methods
(Busemann, 2011). Considering “canned text” to be at the shallow end of the scale and “deep”
NLG to be at the other, a number of intermediate approaches can be situated in between them,
depending on what modules and functionality they implement, as represented in Figure 1.3.
Prefabricated texts (shallow)
“Fill in the slots”
With flexible templates
With aggregation
With sentence planning
With document planning (deep)
Figure 1.3 Shallow to deep NLG transition (based on Busemann, 2011)
The approach presented herein stands on the shallow end of this scale, as the generation is not
inherently knowledge-based or theoretically motivated (Busemann & Horacek, 1998), but based
on sentence templates with “slots” in them. These templates are sequences of text tokens of two
types: static text (words or punctuation) and placeholders for values, linking that slot to the
value of a property or variable. A template as we define it here could take the form of:
[name] was born on [dateOfBirth].
In this example, [name] and [dateOfBirth] are the names of properties whose values would be
substituted in that sentence in lieu of the properties themselves (e.g. “John Doe was born on 14
October 1066.”).
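Instantiating such a template amounts to simple string substitution. The sketch below is illustrative only (`instantiate` is a hypothetical helper, not part of LOD-DEF); it fills each [property] slot from a dictionary of property values and fails loudly when a slot has no value:

```python
import re

def instantiate(template, values):
    """Replace each [property] slot with its value; raise KeyError on a
    missing property so incomplete templates are not realised."""
    def fill(match):
        prop = match.group(1)
        if prop not in values:
            raise KeyError(f"no value for slot [{prop}]")
        return values[prop]
    return re.sub(r"\[([^\]]+)\]", fill, template)

instantiate("[name] was born on [dateOfBirth].",
            {"name": "John Doe", "dateOfBirth": "14 October 1066"})
# → "John Doe was born on 14 October 1066."
```

Failing on a missing property matters: a template whose slots cannot all be filled should not be realised for that entity.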
Templates can deal to a large extent with the issue of lexicalisation, for they already contain
many of the words used and as such are a lexical choice, and with that of aggregation, as they
can contain complex grammatical structures where only values have to be substituted in.
The approach presented herein also incorporates characteristics of deeper NLG, by performing a
kind of document planning as described in 3.3.5.
Chapter 2
Previous approaches
A number of previous approaches to Natural Language Generation for the Semantic Web have
been adopted. Of these, a majority have been concerned with verbalising OWL ontologies (cf.
Stevens et al., 2011; Hewlett & Kalyanpur, 2005; Liang et al., 2012), and the verbalisation of
factual data has remained somewhat under-addressed. In the following I situate my work in the
context of previous efforts by providing an overview of the most relevant ones.
2.1 Hand-coded, rule-based
In a first category, there have been a number of approaches to NLG for the SW that employed a
deep NLG architecture like the one described in Chapter 1. Perhaps the most interesting of these
to date is the NaturalOWL system (Galanis & Androutsopoulos, 2007), which could potentially
be more easily applied across domains. It builds upon the M-PIRO system (Androutsopoulos et
al., 2001), used for multilingual generation of museum exhibit descriptions, adapting it to use OWL
ontologies and RDF data. The classes and properties in the ontologies are explicitly annotated
with text in multiple languages to carry out the generation, which enables the system to generate
multilingual text using RDF data.
2.1.1 Assessment
NaturalOWL is a versatile and powerful system, including a full NLG pipeline adapted from an
already-successful system with commercial applications. It can achieve high quality output, and
is multilingual by design. In essence, this system is a solid implementation for Linked Open Data
of the NLG Architecture described in section 1.6. As such, we can see in it the same benefits and
shortcomings. Great control over the output comes with a requirement for many expert man-
hours and limited transferability between textual domains. Furthermore, the approach requires
publishers of Linked Data to provide non-trivial annotations of the ontologies they publish. It
remains to be seen to what extent this is a realistic expectation.
2.2 Generating directly from RDF
A competing approach is generating directly from RDF with few hand-coded rules, particularly
representative of which is the Triple-Text system of Sun & Mellish (2007). The authors note that
RDF predicates typically encode rich linguistic information, that is, their URIs are meaningful
chunks of natural language. Sun & Mellish (2007) have exploited this information to
automatically generate natural English text from triples without using domain dependent
knowledge.
Their approach, the Triple-Text (TT) system, is based on processing the predicate of the triple.
Words forming predicates that are meaningful in English are typically concatenated into one
string with no spaces or space-equivalent characters (e.g. underscores), but they are also typically
“camel-cased”, that is, uppercase characters are used to mark the boundaries between words (e.g.
“hasProperty”, “wasBornIn”). This makes it easy to tokenise the predicate into its building
blocks (e.g. “has” + “property”, “was” + “born” + “in”). This sequence of tokens is then
assigned part-of-speech (POS) tags and classified into one of 6 categories, depending on its
format (e.g. “has” + [unit]* + [noun]). For each a different rule is applied to build the output
sentence (e.g. “has” + det* + “units” + “noun”).
As an example, given the triple “North industryOfArea manufacturing_sector”, the system
generates the sentence “The industry of the North is manufacturing sector.”
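The camel-case tokenisation step can be sketched in a few lines (an illustrative reconstruction, not the authors' code):

```python
import re

def split_predicate(predicate):
    """Tokenise a camel-cased RDF predicate into lowercase words, the
    first step of a Triple-Text style analysis of the predicate."""
    return [t.lower()
            for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", predicate)]

split_predicate("industryOfArea")  # → ["industry", "of", "area"]
split_predicate("wasBornIn")       # → ["was", "born", "in"]
```

The resulting tokens would then be POS-tagged and matched against the format categories described above.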
2.2.1 Assessment
Simple as it is, this approach is interesting in many ways: it is reasonably domain-independent, it
is very fast, inexpensive and intuitive to deploy and can provide an immediate lexicalisation of
triples to natural language text without the need for domain-dependent knowledge.
However, its shortcomings severely limit its applications. First, generation from single RDF
triples is limited by the fact that the relations encoded in a triple hold only between two
entities. Human discourse is on average much richer, often including relations that can involve
more than two entities, like ditransitive verbs, which require a subject, a direct object and an
indirect object (e.g. "John gave the book to Mary"). The authors point this out and suggest that
the next step is generating from multiple triples. Second, no mechanism is provided to perform
document planning (as described in 1.6) when dealing with a collection of triples that should be
lexicalised together. Finally, the output is not always grammatically correct and it cannot be
easily adapted to a specific domain, as it does not take into account the ambiguity inherent to
natural language (e.g. polysemy) and relies on using the same words found in the RDF
predicates for realisation.
The baseline implemented as part of this project (see section 3.4) draws much inspiration from
this approach, extending it to use rdfs:label properties for verbalisation and combining it with a
baseline Referring Expression Generation algorithm.
2.3 Unsupervised trainable NLG
Perhaps the most relevant previous work on trainable NLG is that of Duboue & McKeown
(2003), where they describe a system that learns content determination rules statistically from
parallel text and semantic input data. They collect the information in the knowledge base for
this application by crawling websites with celebrity fact-sheets and obtain the biography texts
for these celebrities by crawling other websites.
They align the data with the text (i.e. the “Matching” stage) using a two-step approach. In the
first step they identify the spans of text that are verbatim copies of values in the data. The
second step is building statistical language models and using the entropy of these to determine if
other spans of text are indicative of other values being expressed. There is an amount of
inference and reasoning involved in this approach, such as deciding that “young” describes
someone within a certain age range. This is specifically applied to short biographies.
2.3.1 Assessment
This work focuses on a limited domain and only on content determination. The output of their
system is still exclusively dependent on hand-written rules and is specifically targeted at
the constrained domain of biography generation. Nevertheless, their approach to automatic learning
of content determination is undoubtedly far superior to what I present in this paper. The
approach taken in this project is equivalent to the baseline of Duboue & McKeown (2003), or the
first matching step in their algorithm: only literal values found in the data are matched in the
text.
2.4 Automatic summarisation
Very related to the approach presented herein is the wealth of work done in the field of
automatic summarisation, which consists in creating a summary of a text by automatically
choosing the most relevant information in it and collating it.2 A subfield of automatic
summarisation, frequently called “text-to-text” Natural Language Generation (to differentiate it
from full “concept-to-text” NLG) deals with the generation of documents by extracting
information from multiple input documents.
This can be seen as very related to this project, insofar as multi-document summarisation also
deals with the extraction of sentences and with concatenating them in an organised way to
create a new document. However, a main difference stands out: text-to-text NLG only deals with
processing documents that are all about the same entity, subject or topic and extracting the
most relevant sentences from those documents to create a new document. This stands in contrast
with the problem we are tackling here: we want to generate natural language describing an
instance in an ontology for which there may be no such text available. We then need to identify
sentences about an entity that will be transferable, that is, will be true of other entities of the
same type, more particularly, sentences that express values of properties that other entities of
that type will have.
Where those sentences are not directly available in the text, we can try to modify them to make
them transferable. This is much related to the frequently addressed task of sentence compression
in automatic summarisation, which consists in creating a summary of a single sentence. Often
this is addressed using tree operations, where a sentence is parsed and the parse tree is modified,
with the most frequent operation being the deletion of constituents. Where for summarisation
these constituents are removed because they are deemed of lesser importance, in the present
approach they are deleted where there is no evidence for them in the data.
A number of previous approaches to this deletion problem exist (e.g. Cohn & Lapata, 2009,
Filippova & Strube, 2008) but here I specifically borrow the term syntactic pruning from the
work of Gagnon & Da Sylva (2006). Their approach is to parse the sentences with a dependency
parser and apply a number of hand-built pruning rules and filters to simplify and compress those
sentences. The approach presented here is similarly rule-based.
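As a toy illustration of this idea (the real implementation operates on dependency parses; the flat constituent representation below is an assumption for brevity), optional constituents containing no spotted value are deleted:

```python
def prune(constituents, spotted_values):
    """Drop optional constituents (e.g. adjuncts) that contain no
    spotted property value; obligatory constituents are always kept.
    Each constituent is a (text, is_optional) pair."""
    kept = [text for text, is_optional in constituents
            if not is_optional or any(v in text for v in spotted_values)]
    return " ".join(kept)

sent = [("Bach", False),
        ("one of the greatest composers", True),   # no data support
        ("was born in Eisenach", False),
        ("into a musical family", True)]           # no data support
prune(sent, {"Eisenach"})  # → "Bach was born in Eisenach"
```

Where the summarisation literature deletes constituents for low salience, here the deletion criterion is lack of evidence in the input data.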
2 Methods for summarisation are generally classified into extractive, i.e. extracting sentences from the text based on
their salience score and joining them, and abstractive, i.e. producing a new, shorter text (Gagnon & Da Sylva, 2006).
Both of these categories are relevant here.
Chapter 3
Design
3.1 Design overview
The present approach is based on two main intuitions. One is that if we can identify sentences in
text expressing factual information about an entity that are transferable (i.e. would also be true
of another entity of the same class) we can use them as sentence templates for that class.
Figure 3.1 System overview
This requires that we first identify literal values in the text that are the values of properties in
the data and then select and edit the sentences to make sure they express no information that
would not be true of another entity, i.e. that is not a value expressed from the input data. An
example of this would be “[foaf:name] was one of the greatest [rdf:type].”, as this template
contains a value judgement that is not supported by data.
The second intuition is that we can model the content of an article by collecting statistics about
the properties whose values we have spotted in the text.
Figure 3.2 Aligning data and text: spotting property values
As an illustration of this intuition, consider the RDF triples and text shown in Figure 3.2. We
can paraphrase the data in RDF as “:Johann_Sebastian_Bach is an entity of type German
Composers, his death place is Leipzig, his birth place is Eisenach”, etc.
In this particular example, we can align all the values of all the RDF properties with spans of
text (illustrated by arrows in Figure 3.2), and this is here called spotting. Although it is often the
case that string literals in the RDF can be matched with identical strings in the text, sometimes
a conversion of these values is required. For example, the value “1750-07-28” is matched with the
non-identical string “28 July 1750”, as these different formats for dates represent in fact the same
value.
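One way to handle such conversions is to generate surface variants of each data value before matching; the sketch below (my own illustration, with an assumed set of date formats) does this for ISO dates:

```python
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def date_variants(iso_value):
    """Generate surface variants of an ISO date value ("1750-07-28") so
    it can be spotted in text as e.g. "28 July 1750"."""
    d = date.fromisoformat(iso_value)
    month = MONTHS[d.month - 1]
    return {iso_value,
            f"{d.day} {month} {d.year}",
            f"{month} {d.day}, {d.year}"}

date_variants("1750-07-28")
# contains "28 July 1750" and "July 28, 1750"
```

Analogous normalisation would be needed for quantities, units, and other non-string literals.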
Having spotted the values, we can assume we have spotted the properties that generated them,
which then allows us to replace each of those values in the text by a symbolic link to the
property that generated it and extract the sentence template:
[foaf:name] (b. [dbont:birthPlace], [dbont:birthDate], d. [dbont:deathPlace], [dbont:deathDate]) was a [dbprop:shortDescription].
This template contains no information that is not supported by the data, and is therefore
transferable and can be instantiated for any other entity of the same class (in this case,
yago:GermanComposers3) for which we have the same properties, e.g.:
Ludwig van Beethoven (b. Bonn, 17 December 1770, d. Vienna, 26 March 1827) was a German composer.
This approach can be seen to a certain extent as a conceptual hybrid between shallow NLG
systems, where the information to be represented is stored in a symbolic data structure, and
text-to-text NLG, where the content determination and document structuring are automatically
3 Throughout this thesis, I refer to classes used by DBpedia with the namespace prefix yago: defined as @prefix yago: <http://dbpedia.org/class/yago/>
learned from text, and the surface realisation is carried out via templates that are also
automatically learned from text.
The hypothesis presented here is that this system, trained to learn document planning and
sentence templates, will perform better in subjective human evaluation than a simple baseline
generating directly from English RDF predicates. The system is also ranked against human-
generated text for the same data, which is expected to perform better and thus to serve as an
upper bound.
3.2 Goal
The system must generate descriptions, approximately equivalent to Wikipedia “stubs”, for any
entity in an RDF ontology, focussing on factual data. This is therefore the one hard-wired
communicative goal of the system.
This system must be inexpensive and fast to deploy, using readily available resources and
avoiding any manual annotation of data, while also keeping hand-written rules to a minimum
and avoiding domain-specific ones (e.g. specific for biographies, descriptions of cities, etc.).
At the same time, it should significantly overcome the shortcomings of direct generation from RDF
triples by performing content determination and document structuring as described in section
1.6. Given that the rules for this are not present in the data, they must be statistically acquired
from text aligned with the data.
Crucially, the system must be able to extract sentence templates from the training text for use in
generation. These templates should verbalise the values of properties identified in them in the
training text. Most importantly, these sentences should be transferable between instances of the
same class: it should only extract sentences that would hold true for other instances of the same
class with different property values.
Also, this system is specifically targeted to use Linked Open Data, which implies that it should
exhibit a degree of robustness to inconsistencies and redundancies in the data. Linked Open
Data, unlike a relational database, has a very flexible schema, so the system should be able to
deal with e.g. a property value being available for one instance of a class but not for another.
Likewise, it should not depend on hand-picked lists of predicates, with some exceptions for very
widely used ones (e.g. rdfs:label).
As opposed to e.g. Stevens et al. (2011), the aim is not to fully process OWL semantics; the
focus here is exclusively on factual, literal, non-temporal data that can be expressed in quantities
and string literals.
Finally, it should be easy to adapt to other domains. A specific format of text and a specific
schema for the data should not be required, as long as one article of description text per entity is
available, together with RDF triples that express properties of an instance in an
ontology. It should also be conceivable to adapt the system to other languages for which the
required resources exist (e.g. a trained parser).
3.3 Tasks
3.3.1 Aligning data and text
Automatically aligning the RDF data with the text can be seen as a case of Named Entity
Recognition, a well-established task used for such processes as Text Mining and Sentiment
Analysis, which consists in identifying and annotating a number of “entities” in a text (Feldman
& Sanger 2007). These entities can be names or values, such as quantities (both as digits and as
string literals), dates, etc.
General-purpose NER systems are usually based on three different matching techniques:
dictionaries (also known as gazetteers), regular expressions and trained classifiers (Conditional
Random Fields, MaxEnt, etc.). In this implementation, the NER task only uses gazetteers,
regular expressions and heuristics for normalising and recognising quantities. Despite being a
very simple approach, it is adequate as a baseline for this specific project. Given that we have
prior knowledge about what entities we expect to find in the text, the problem is limited to
finding them: cases of ambiguity are much reduced, and the task reduces to recognising
values where they appear.
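A gazetteer-style spotter of this kind can be sketched as follows (an illustrative reconstruction; the normalisation of quantities and dates is omitted, and longer values are matched first so that e.g. “Johann Sebastian Bach” beats “Bach”):

```python
def spot(text, property_values):
    """Find each property's literal values in the text and return
    (property, value, offset) matches; longest values are tried first
    and overlapping spans are claimed only once."""
    matches = []
    pairs = sorted(((p, v) for p, vs in property_values.items() for v in vs),
                   key=lambda pv: -len(pv[1]))
    claimed = set()
    for prop, value in pairs:
        start = text.find(value)
        while start != -1:
            span = set(range(start, start + len(value)))
            if not span & claimed:
                claimed |= span
                matches.append((prop, value, start))
            start = text.find(value, start + 1)
    return matches

text = "Johann Sebastian Bach was born in Eisenach."
spot(text, {"foaf:name": ["Johann Sebastian Bach"],
            "dbont:birthPlace": ["Eisenach"]})
```

The offsets returned make it possible to replace each spotted span with a symbolic link to its property, as required for template extraction.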
Selecting the RDF properties whose values are to be spotted in the text is dependent on the
domain of the text and on the way data is encoded in RDF/OWL. We can think of this as a
window over the graph, defined by a number of edges or relations. The ideal distance in edges to
consider is dataset-specific: depending on the design of the dataset, very complex
relationships may exist between nodes in the graph. In the present approach, only triples that
have as subject the main topic of the article being processed (title entity) are retrieved (i.e. s p o,
where s=title entity URI). These triples can be any property and have any value. This was
deemed to be sufficient for this dataset, and retrieving longer paths through the graph was found
to significantly increase the complexity of the extraction.
3.3.2 Extracting templates
The system must extract sentence templates with “slots” in them as described in 1.6.1. These
slots correspond to spotted properties of the class being described. As others before (Sripada et
al., 2003), I find Grice’s Cooperative Principle and its maxims (Grice, 1975) capture
fundamentally important aspects of an NLG system. Crucial to the approach presented herein is
the maxim of quality: “contribute only what you know to be true; do not say that for which you
lack evidence”. As per this maxim, the output of the system is desired to be truthful, which
means that we should make sure the textual output is supported by evidence in the input data.
The sentence templates extracted should then be transferable, that is, they should hold true for
any entity of a given class that has the same properties as the entity for which the original
sentence was written.
In order to realise this, similarly to (Gagnon & Da Sylva, 2006), the system needs to parse the
source sentences to prune them of constituents, but in our case, these constituents are those for
which we have no evidence in the data.
Parsing consists in taking a sentence in natural language and determining what its most likely
parse tree would be, i.e. how its grammatical constituents are clustered and nested. It is
important to note here that parsing is an area of active research and, due to natural language
ambiguity, far from a solved problem; parsing is thus another step in the pipeline that is likely
to introduce a significant number of errors.
3.3.3 Dealing with Linked Open Data
A brief examination of data from the DBpedia brings key issues to the fore. First, properties of
an instance can have multiple values. For example, let us examine the following triples:
:Carl_Maria_von_Weber dbont:birthPlace :Eutin.
:Carl_Maria_von_Weber dbont:birthPlace :Holstein.
The property dbont:birthPlace here has two values, both of them are correct, and both of them
may be necessary, as Eutin is contained in Holstein. It would perhaps have been better to store
this information only as the more finely-grained value (:Eutin) and to leave the task of performing
inference to reach the coarser-grained container (:Holstein) to the smart agent. However, given
that both these values are in the data, the system must be able to deal with this. Our approach
is to group together spotted entities in text that are the values of the same property and to keep
track of properties spotted as lists in text. When generating, we can use this information to
determine if only one surface realisation must be chosen from the options given or if all of them
must be shown as a list.
Second, multiple properties can have the same value. These properties may have very different
meanings, as in the following example:
:Carl_Orff dbont:birthPlace :Munich.
:Carl_Orff dbont:deathPlace :Munich.
This creates a different problem with arguably more difficult solutions, namely disambiguating
between surface realisations. A number of approaches could be applied to this, for instance
some variant of the EM algorithm.
Third, there are many properties that have the same meaning and yet are often present for the
same entity, which makes them completely redundant. These redundant triples are either kept for
backwards compatibility or result from an incomplete alignment of vocabularies when aggregating
different sources of data. Consider, for example, these triples:
:Johann_Sebastian_Bach dbont:birthPlace :Eisenach.
:Johann_Sebastian_Bach dbprop:birthPlace :Eisenach.
:Johann_Sebastian_Bach dbont:placeOfBirth :Eisenach.
:Johann_Sebastian_Bach dbprop:placeOfBirth :Eisenach.
It is immediately clear in this example that the four predicates shown are actually one and the
same in meaning, and their values confirm this. The equivalence of these predicates is well known
and documented (Mendes et al., 2012), but it is not retrievable from the triples themselves. OWL
implements a system to link equal URIs, via the owl:sameAs property (Bechhofer et al., 2004),
which is available for some properties but not for others4.
The solution to the last two problems is to compute predicate similarity (by essentially counting
the times the two predicates have the same value) and grouping significantly co-occurring
properties together into predicate “pools”. The similarity of predicates is their similarity
4 While we would expect this to be a solved problem when dealing only with data from a self-contained knowledge
base such as DBpedia, it can be seen as a good example of a common problem of LOD. As such, this is an opportunity
to rise to the challenge and provide a solution for it.
coefficient, conceptually identical to Dice's coefficient. It is computed by dividing the number
of times the two properties have the same value for an entity of class C by the number of times
they both appear for entities of that class. It is only computed over the set of entities of that
class for which both properties are defined (i.e. have a value other than a null string).
For example, the four predicates seen above are frequently grouped into a single pool, which
takes the name of the most frequent of these predicates (or the first in alphabetical order in case
of a tie).
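The pooling step can be sketched as follows (a simplified illustration; the threshold value and the single-link grouping are assumptions, not the exact implementation):

```python
from itertools import combinations

def similarity(p, q, entities):
    """Share of entities for which p and q both have a value and that
    value is the same; computed only where both are defined."""
    both = [e for e in entities if p in e and q in e]
    if not both:
        return 0.0
    return sum(e[p] == e[q] for e in both) / len(both)

def pool(predicates, entities, threshold=0.8):
    """Group predicates whose similarity exceeds the threshold into
    pools (naive single-link clustering)."""
    pools = [{p} for p in predicates]
    for p, q in combinations(predicates, 2):
        if similarity(p, q, entities) >= threshold:
            a = next(s for s in pools if p in s)
            b = next(s for s in pools if q in s)
            if a is not b:
                a |= b
                pools.remove(b)
    return pools

entities = [{"dbont:birthPlace": "Eisenach", "dbprop:birthPlace": "Eisenach"},
            {"dbont:birthPlace": "Eutin", "dbprop:birthPlace": "Eutin",
             "dbont:deathPlace": "London"}]
pool(["dbont:birthPlace", "dbprop:birthPlace", "dbont:deathPlace"], entities)
```

Here the two birthPlace predicates always agree and end up in one pool, while dbont:deathPlace stays in a pool of its own.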
Once this similarity metric has been computed, it is used to discard those sentence templates that
contain conflicting predicates in their slots, e.g. where “birthPlace” and “deathPlace” have been
spotted with the same value but belong to different “pools”, that is, over all the text their Dice
coefficient is smaller than a constant. Other options were considered, such as a semantic
similarity metric between the rdfs:label of the predicates and the context words. Duboue &
McKeown (2003) have a different approach to clustering which includes hand-input rules for
inference (e.g. people whose age is 1 < age < 25 are labelled “young”). For this baseline
implementation however, the simplest option was chosen.
3.3.4 Modelling different classes
The system has a single communicative goal: the description of an instance of a class. However,
descriptions of entities belonging to different classes can be seen as belonging to different textual
sub-domains. What is relevant in the description of, for instance, a rock band, is unlikely to be
relevant (or even apply) to the description of a species of animal. Similarly, the same predicates
need not be expressed in the same order for all classes. Finally, sentence templates can also be
expected to depend on the class of the entity, both because of the predicates they realise and of
the lexical items and structures used in them. To a certain degree, this also applies to classes
that belong to a common super-class, e.g. while both being instances of the super-class
“Company”, the airline KLM and the software company Oracle should intuitively be treated as
instances of different classes.
This means we need to choose, from those available, the class that an entity would be most
prototypical of, as defined by Prototype Theory (Rosch, 1973). This class must have the right
granularity or level of detail, both for training and generating. It must not be too general that
the statements are too generic or irrelevant, or too specific that the extracted templates do not
generalise to other entities of the same class.
This task is surprisingly nontrivial, as the standard class inheritance mechanism implemented by
RDF (i.e. “entity rdf:type class”) allows for multiple class inheritance, and as an initial
exploratory analysis of the data shows, this mechanism is very frequently used and a single entity
typically belongs to a number of classes (for example, J.S. Bach belongs to both
“ComposersForPipeOrgan” and “PeopleFromEisenach”5). This makes choosing the “right” class
for an entity a problem that requires nontrivial inference to solve.
5 It could be argued that this information would be much better encoded using properties, e.g.
”:Johann_Sebastian_Bach :composedForInstrument :PipeOrgan”
”:Johann_Sebastian_Bach :bornIn :Eisenach”
This, however, would require extracting (mining) this information from the names of Wikipedia categories.
For the present approach I develop a baseline class selection algorithm, essentially consisting in
computing a score for each class based on term frequency scores (i.e. the count of times a word
appears in the class names of the entity) and selecting the n-highest. The intuition is that
category overlap can help determine which class is more central to the entity, and help choose
the class with an adequate granularity6. That is, if an entity belongs to a number of classes with
“composer” in the name, the degree of confidence should be higher that the entity’s class should
be “composer” or a subclass of it. A detailed explanation of this is offered in section 4.4.
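The intuition can be sketched as follows (a simplified reconstruction of the baseline described in section 4.4; the exact scoring details here are my own, using average word frequency across the entity's class names):

```python
import re
from collections import Counter

def select_class(class_names, n=1):
    """Score each class name by the average corpus frequency, over all
    of the entity's class names, of the words it contains; words shared
    by many class names (e.g. "Composers") push those classes up."""
    def words(name):
        return [w.lower() for w in re.findall(r"[A-Z][a-z]+", name)]
    tf = Counter(w for name in class_names for w in words(name))
    scored = sorted(class_names,
                    key=lambda c: sum(tf[w] for w in words(c))
                                  / max(1, len(words(c))),
                    reverse=True)
    return scored[:n]

select_class(["GermanComposers", "ComposersForPipeOrgan",
              "PeopleFromEisenach", "BaroqueComposers"])
# "Composers" occurs in three names, so a composer class ranks first
```

In this toy example the shared word “composers” dominates the scores, matching the intuition that the entity's central class is composer-related.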
3.3.5 Document planning
An n-gram is a sequence of n items in succession; n-gram models capture the probability of such
a sequence appearing (Jurafsky & Martin 2009, pp. 117-124). N-gram models are
frequently used to model language, and have also been applied to capturing the likelihood of a
sequence of concepts, rather than words, in text. This approach has been previously applied by
e.g. Galley et al. (2001), who used it to aid in document structuring for a dialogue system. Their
“word-concept n-grams” are in our situation equivalent to RDF predicates.
Duboue and McKeown (2003) also used n-grams to refine their statistical model for content
determination. Here I implement their baseline: we aim to learn content determination by
collecting unigrams (1-grams) of spotted predicates in the text: if the frequency of a predicate
was below a threshold in the articles for a given class of entity, even if an instance of this class
has this property in the data, the system should not output it.
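This unigram filter can be sketched in a few lines (an illustrative reconstruction; the frequency threshold is an assumption):

```python
from collections import Counter

def content_filter(spotted_per_article, min_freq):
    """Keep only predicates spotted in at least min_freq training
    articles for the class; all others are suppressed at generation
    time even if the instance has a value for them."""
    counts = Counter(p for article in spotted_per_article
                     for p in set(article))
    return {p for p, c in counts.items() if c >= min_freq}

articles = [["foaf:name", "dbont:birthPlace"],
            ["foaf:name", "dbont:birthPlace", "foaf:isPrimaryTopicOf"],
            ["foaf:name"]]
content_filter(articles, 2)
# → {"foaf:name", "dbont:birthPlace"}
```

The threshold trades off coverage against relevance: a predicate rarely verbalised by human authors is unlikely to belong in a generated description.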
According to Reiter and Dale (2000), document structuring carries out more complex selection
and ordering of information than just sequencing; it treats the text as a tree structure and
clusters related items of information. Given that the sentence templates extracted from the text
contain several properties expressed, we can think of them as partial trees, part of the bigger tree
required for document structuring, so I expect that extracting these templates and ordering them
in the right way will yield good results.
3.4 Baseline
In order to evaluate this approach, a comparable baseline approach is necessary. The baseline
generation system I implement here is exclusively based on direct generation from RDF triples.
It is loosely based on Sun & Mellish (2007), in that single triples are used to generate single
sentences and I use a shallow linguistic analysis of the words in the predicate to determine the
structure of these sentences.
However, a number of differences stand out. First, as opposed to Sun & Mellish (2007), this
baseline does not directly split out the “camel case” in predicates. RDF predicates, being URIs
themselves, have properties like rdfs:label, and these triples are available on DBpedia. Given this,
I first attempt to retrieve the rdfs:label of the predicate for use in generation. It is an underlying
6 This does not attempt to provide a definitive solution to this problem, but to solve it to a satisfactory degree for the
present application. A more sophisticated approach would perhaps have to account for the fact that the “right” class
necessarily depends on the application and the context: for instance, when deciding whether Mauritius is an
IslandCountry or an AfricanCountry, it may be just as prototypical a member of either category.
assumption of this approach that labels in all the languages we are concerned with (here
exclusively English) will be available in the triples. If a label is not available, the system then
backs off to splitting the words in the predicate URI.
The first sentence created by the baseline is an expression of the class of the entity, formed by
the name of the entity (i.e. its rdfs:label) followed by “is a” and the rdfs:label of the class of the
entity, e.g. “Johann Sebastian Bach is a German composer.” This class is chosen using the class
selection algorithm detailed in section 4.4.
All subsequent sentences are composed according to the following logic:
• If the retrieved or created label starts with an auxiliary verb (i.e. “is” or “has”), the
article “a” is inserted after that first word, the first word of the sentence is made to be
the nominative personal pronoun (i.e. “he”, “she”, “it”), and the value(s) are appended
to the sentence, separated with a colon.
For example, from London_Heathrow_Airport foaf:isPrimaryTopicOf
<http://en.wikipedia.org/wiki/London_Heathrow_Airport>, the resulting
sentence is:
“It is a primary topic of <http://en.wikipedia.org/wiki/London_Heathrow_Airport>”
• If the label starts with “was”, the sentence is created using the template “[personal pronoun] [property label] [values].”
• Otherwise, the sentence is created with “[possessive pronoun] [property label] [is/are] [values].”
The predicate text is converted to plural if there are several values. Multiple values are
presented as a list, e.g. “His names are X, Y and Z”. It is an implied assumption that
predicate labels will be in the singular.
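As a rough illustration, the three label-prefix rules above could be sketched as follows. This is a minimal sketch, not the thesis implementation: the function name `realise_triple`, the `PRONOUNS` table and the naive pluralisation are all my own choices.

```python
# Hedged sketch of the baseline's per-triple realisation rules (section 3.4).
# Names (realise_triple, PRONOUNS) are illustrative, not from the thesis code.

PRONOUNS = {"Male": ("he", "his"), "Female": ("she", "her")}

def realise_triple(label, values, gender=None):
    """Turn a predicate label and its value(s) into one baseline sentence."""
    pers, poss = PRONOUNS.get(gender, ("it", "its"))
    # Several values are presented as a list: "X, Y and Z".
    if len(values) > 1:
        value_text = ", ".join(values[:-1]) + " and " + values[-1]
    else:
        value_text = values[0]
    words = label.split()
    if words[0] in ("is", "has"):
        # e.g. "is primary topic of" -> "It is a primary topic of <value>."
        rest = " ".join(words[1:])
        return f"{pers.capitalize()} {words[0]} a {rest} {value_text}."
    if words[0] == "was":
        return f"{pers.capitalize()} {label} {value_text}."
    copula = "are" if len(values) > 1 else "is"
    # Naive pluralisation of the predicate label when there are several values.
    if len(values) > 1 and not label.endswith("s"):
        label += "s"
    return f"{poss.capitalize()} {label} {copula} {value_text}."
```

For instance, `realise_triple("name", ["X", "Y", "Z"], "Male")` yields “His names are X, Y and Z.”, matching the list example in the text.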
3.4.1 Coherent text
As opposed to Sun & Mellish (2007), who only generated single sentences from single triples, we
are dealing with a collection of triples encoding information about a single entity, and we wish to
present this information as a coherent text, made up of several sentences connected using
coherence devices like coreference.
For this, the baseline implements a very simple Referring Expression Generation algorithm,
which operates in the following way: The initial reference to the title entity expresses the full
name of the entity being described, by retrieving its foaf:name or rdfs:label for the language we
are generating in (throughout this paper, always “en” for English). Subsequent references to that
entity will use a personal pronoun (e.g. “he”, “she”, “it”). The system specifically retrieves the
value for foaf:gender for the title entity whose description is being generated and chooses the
right pronoun and possessive based on its value (e.g. “he” and “his” for “Male”, “she” and “her”
for “Female”).
This is a simple approach that produces acceptable results in English. It is perhaps relevant to
note here that different languages will have different requirements for the treatment of
grammatical gender and probably a more complex approach would be necessary. In the baseline
implementation of the system, no attempt to order the sentences is made. Also, no attempt is
made at document planning, with one exception: simple heuristics and an “ignore list” filter out
from the input properties with inappropriate values. Specifically, values containing more than ten
words and values which are integers below 31 are ignored, together with a number of predicates such
as purl:subject. This essentially helps filter out output that would be too verbose and visually
strident, with the aim of making the baseline competitive in evaluation.
Chapter 4
Implementation: Training
In this chapter I present an implementation of the training phase according to the design
principles outlined in Chapter 3: the LOD-DEF system. The system is trained on a corpus of
text documents, each of which is an article from the Simple English Wikipedia, and RDF triples
for the same entity, retrieved at runtime from DBpedia Live. The full training pipeline is
described in this chapter: what steps are taken to train the NLG model from the text and data.
The diagram in Figure 4.1 represents the pipeline for each article, while the post-processing steps
are represented in Figure 4.3.
4.1 Obtaining the data
4.1.1 Wikipedia text
The text from the Simple English Wikipedia was downloaded from the Wikipedia dump site7.
This is one large compressed XML file, containing the latest revision of all the articles on the
SEW, together with a small amount of metadata, e.g. author information for the most recent
edit. This text is in “wiki markup” format, including information such as infoboxes, inter-language
links, markers such as stub and redirection, etc.
This text then had to be processed to remove all this mark-up unnecessary to the task, and leave
only the article text with the links to other articles. This step is done only once, previous to
running the training pipeline on any article. Steps 2-7 apply to every processed article.
4.1.2 DBpedia triples
RDF triples from DBpedia are retrieved at runtime for each article via SPARQL queries. The
standard SPARQL endpoints for this were the ones provided by DBpedia Live8. The endpoints
proved to be quite problematic, as they frequently experienced heavy load, which meant
slower response times, and several times they were offline for hours. To deal with
this and to speed up testing, I implemented an offline triple cache using SQLite3, as an
alternative to setting up a special triplestore for this project, which would have been a
considerable expenditure of time and resources.
Triples are retrieved using the SPARQL query:
SELECT ?pred, ?obj WHERE { <http://dbpedia.org/resource/%s> ?pred ?obj.}
7 http://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2
8 http://live.dbpedia.org/sparql and http://dbpedia-live.openlinksw.com/sparql
This returns all the triples in DBpedia with the entity marked by %s as subject9, using its
“wiki link” as search keyword. The triples returned by this query are then first filtered by
language, as we are only concerned with triples for English in this case10. Triples with literal
values in English often do not have a marked language suffix, as English is often considered the
“default” language, so we need to include both “en” and “” (blank string) as valid languages.
Second, triples using a predicate from an “exclude” list are filtered out. These are predicates that
we found to be useless for our purpose and to induce noise, add data overhead and increase
training time. Examples of these predicates are "http://www.w3.org/2002/07/owl#sameAs" and
"http://purl.org/dc/terms/subject" (the values of which are already available as rdf:type
properties connected to YAGO classes).
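The query substitution and the two client-side filtering steps can be sketched as below. The query string mirrors the one given above; the `EXCLUDE` set reproduces the two example predicates, and the function names are my own, not the thesis code's.

```python
# Client-side equivalent of the two filtering steps described above.
# Names (build_query, keep_triple, EXCLUDE) are illustrative.

QUERY = "SELECT ?pred, ?obj WHERE { <http://dbpedia.org/resource/%s> ?pred ?obj.}"

EXCLUDE = {
    "http://www.w3.org/2002/07/owl#sameAs",
    "http://purl.org/dc/terms/subject",
}

def build_query(wiki_link):
    """Substitute the entity's wiki link for the %s marker (cf. footnote 9)."""
    return QUERY % wiki_link

def keep_triple(pred, obj_lang):
    """Keep English ("en") and untagged ("") literals; drop excluded predicates."""
    if pred in EXCLUDE:
        return False
    return obj_lang in ("en", "")
```

The result set returned by the endpoint would then simply be iterated and filtered with `keep_triple` before further processing.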
Figure 4.1 Training pipeline for each article
9 The special %s marker is interpreted by a formatting function which substitutes it for a parameter string.
10 Using the FILTER keyword in SPARQL queries resulted in much longer response times from the endpoint, together
with occasional time-outs, more so the more complex a filter, so I decided to use it sparingly and filter the results
client-side for increased robustness and faster execution. This is perhaps less than ideal within the vision of the
Semantic Web, but due to the limitations of the available servers, it is often more practical to retrieve more
information than needed and filter it client-side rather than relying on filtering by the SPARQL endpoint.
4.2 Tokenizing and text normalisation
The first step in the pipeline is to tokenize the Wikipedia text – separate text into words,
punctuation and sentences. Several standard tokenizers were tested for this task, and their
results proved quite unsatisfactory11. Therefore a custom-built algorithm is used, taking into
account the format of Wikipedia mark-up. Tokens are considered to be individual elements of the
sentence, and so punctuation is individually tokenized: commas, colons, semicolons, parentheses,
brackets, etc., are all considered to be individual tokens. The exception is the apostrophe (’), so
clitics like “n’t” and the genitive “’s” will be tokenized as one single word, remaining attached to
the root word.
These rules are applied in order to facilitate the next processing step, the spotting of values in
text. To further ease this, a number of processing stages normalise values found in the text,
mainly using Regular Expression matching and replacement. As an example, if a number is
expressed in words (e.g. “fourteen”) it is converted to its equivalent in digits as a string (“14”).
The same date may appear in the Wikipedia in several formats, e.g. “17 Aug 2012”, “17 Aug,
2012”, “17 August 2012”. We could either normalise these first in the text or perform a RegEx
matching for each generated date for each processed text. In the interest of processing ease and
efficiency, we do it once and normalise all dates found to one standard format: YYYY-MM-DD
(year-month-day) using all digits. This is the same format used in xsd:date values (ISO 860112),
which further eases spotting.
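A minimal sketch of the date normalisation step is given below; it handles only the three example formats from the text, and the month-matching shortcut (comparing the first three letters) is my simplification, not the thesis's regular expressions.

```python
import re

# Sketch of date normalisation: map "17 Aug 2012", "17 Aug, 2012" and
# "17 August 2012" to ISO 8601 (YYYY-MM-DD). Simplified, illustrative only.
MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}

DATE_RE = re.compile(r"\b(\d{1,2}) ([A-Za-z]+)\.?,? (\d{4})\b")

def normalise_dates(text):
    def repl(m):
        day, month, year = m.group(1), m.group(2).lower(), m.group(3)
        # Accept both full ("August") and abbreviated ("Aug") month names.
        full = next((name for name in MONTHS if name.startswith(month[:3])), None)
        if full is None:
            return m.group(0)  # leave unrecognised spans untouched
        return f"{year}-{MONTHS[full]:02d}-{int(day):02d}"
    return DATE_RE.sub(repl, text)
```

For example, `normalise_dates("born 17 Aug 2012 in Bath")` gives “born 2012-08-17 in Bath”.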
4.3 Aligning: Named Entity Recognition
4.3.1 Surface realisation generation
In a first step, we generate possible surface realisations for the triples retrieved. This is equivalent
to building a gazetteer list. The way it is done will depend on the type of object of each triple.
If the object is a URI, the rdfs:label for this URI (again identified by %s in the query) is retrieved
via a SPARQL query:
SELECT ?label WHERE { <%s> <http://www.w3.org/2000/01/rdf-schema#label> ?label. FILTER (langMatches(lang(?label),"") || langMatches(lang(?label),"en")) }
If the object is a typed literal, that is, it has an associated data type, the conversion will depend
on the data type. Strings (xsd:string) are taken as they are with no modification, whereas
xsd:int, xsd:decimal and xsd:double are converted to integers and xsd:float is converted to float
and rounded to two decimals.
4.3.2 Spotting
A second step is clustering together tokens to facilitate a maximum span matching: all surface
realisations generated in the step above are ordered in inverse order of length (highest first). For
11 As an example, the Punkt sentence splitter included with the NLTK Python library would consistently fail to
separate sentences like “.. was born in Bath.Later in life..” or “…was born in [Bath].[London] was his first…”.
12 http://books.xmlschemata.org/relaxng/ch19-77041.html
each surface realisation r, if a series of tokens K matches r, assuming a whitespace character
between elements of K, then K are concatenated together into a single token t, with the insertion
of one whitespace character between every two tokens in K.
Flexible matching is then done between each token t and each surface realisation r using a
regular expression to accept whitespace, hyphens or any other character occurring between two
words.
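The maximum-span matching can be sketched as follows. This is a simplification: flexible separator matching is reduced to plain whitespace, and the function name is mine.

```python
# Sketch of the spotting step: longest surface realisations are tried first,
# and a matching run of tokens K is merged into a single token t. Flexible
# matching over hyphens and other separators is omitted for brevity.

def spot(tokens, realisations):
    merged = list(tokens)
    for r in sorted(realisations, key=len, reverse=True):
        parts = r.split()
        i = 0
        while i <= len(merged) - len(parts):
            if merged[i:i + len(parts)] == parts:
                # concatenate the matched run into one token
                merged[i:i + len(parts)] = [" ".join(parts)]
            i += 1
    return merged
```

Trying the longest realisations first ensures that “Johann Sebastian Bach” is merged as one token before the shorter “Bach” could match on its own.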
4.4 Class selection
This step could in practice be located anywhere in the pipeline, as it only affects what “class
model” will be updated with the learned templates and n-gram counts (as detailed in 5.9). Class
models are the data structures holding the extracted sentence templates and stored annotations
from the training documents, together with other information after post-processing. Determining
what class models the title entity13 belongs to at this stage makes it more straightforward to save
this information with no intervening temporary data structure.
As pointed out earlier, determining the “right” class for an entity is not straightforward. As a
working example, consider the rdf:type triples available for Johann_Sebastian_Bach:
:Johann_Sebastian_Bach rdf:type :AnglicanSaints, yago:ComposersForViolin, foaf:Person,
yago:ComposersForCello, yago:GermanComposers, yago:GermanClassicalOrganists,
yago:PeopleCelebratedInTheLutheranLiturgicalCalendar, yago:ComposersForPipeOrgan,
yago:ComposersForLute, yago:OrganistsAndComposersInTheNorthGermanTradition,
yago:18th-centuryGermanPeople, yago:PeopleFromSaxe-Eisenach, yago:BaroquEComposers,
yago:PeopleFromEisenach, yago:ClassicalComposersOfChurchMusic.
As the example illustrates, entities in DBpedia are aligned with YAGO classes14, which are
automatically mined from crowd-sourced Wikipedia categories.
We choose the class using the following steps:
1. We retrieve the rdfs:label values for each of the classes. Using a bag-of-words approach,
we put all these labels in a single list of words.
2. We add to this vector the words from the first sentence of the Wikipedia article.
3. We remove the stopwords from this list, i.e. prepositions, conjunctions, articles (“for”,
“from”, “and”, “a”, “the”, etc.).
4. We compute term frequency (tf) scores for each word in this list, i.e. count how many
times they occur in it.
13 The title entity is the entity that is the main topic of the article on the Wikipedia and whose URI is the subject of
the triples in DBpedia.
14 YAGO is a freely available knowledge base, derived from Wikipedia, WordNet and GeoNames (Kasneci et al., 2008).
5. We compute a normalized sum of tf scores for every class label, using the formula:

score(w) = (1/M) · Σᵢ₌₁ᴺ tf(wᵢ),   where M = N if N > 1, and M = 1.5 if N = 1

where w is the class label string, wi is the ith element (word) in the string, tf is the term
frequency score and N is the total number of elements in w. Note: tf(wi) will return 0
when wi is a stopword. M is adjusted here to reflect a dispreference against one-word
class names.
6. We order all the classes in descending order of score and select the n highest as the
classes the entity belongs to. We train for several models at the same time, given that we
cannot be confident the class we chose is the only one that the entity is prototypical of.
During the experiments, we set the value of this n to 5.
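The scoring steps above can be sketched as below. Two caveats: the camel-case splitting helper stands in for retrieving each class's rdfs:label, and the value M = 1.5 for one-word names is my reading of the dispreference described in step 5; both are assumptions, not the thesis code.

```python
from collections import Counter
import re

# Illustrative sketch of the class-selection scoring (section 4.4).
STOPWORDS = {"for", "from", "and", "a", "the", "of", "in"}

def split_camel(label):
    """'GermanComposers' -> ['german', 'composers'] (stand-in for rdfs:label)."""
    return [w.lower() for w in re.findall(r"[A-Z]?[a-z]+|\d+", label)]

def class_scores(class_labels, first_sentence_words, n=5):
    # Steps 1-3: bag of words from all labels plus the article's first
    # sentence, with stopwords removed.
    bag = [w for lab in class_labels for w in split_camel(lab)]
    bag += [w.lower() for w in first_sentence_words]
    tf = Counter(w for w in bag if w not in STOPWORDS)  # step 4
    scores = {}
    for lab in class_labels:
        words = [w for w in split_camel(lab) if w not in STOPWORDS]
        # step 5: normalised tf sum; M penalises one-word class names
        m = len(words) if len(words) > 1 else 1.5
        scores[lab] = sum(tf[w] for w in words) / m
    # step 6: n-best list in descending order of score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

On a toy input, a class whose words also occur in the first sentence (e.g. “German composer”) wins, mirroring why yago:GermanComposers tops Table 4.1.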
As an example, training for :Johann_Sebastian_Bach, the n-best list is shown in Table 4.1. For
each of these classes, a class model is created (or updated if already existent).
rdf:type Score
yago:GermanComposers 6.0
yago:BaroquEComposers15 3.3
yago:ComposersForViolin 3.0
yago:ComposersForCello 3.0
yago:ComposersForLute 3.0
Table 4.1 Class scores, n-best list
4.5 Coreference resolution
Coreference resolution can be a complex task and is an active area of research. A number of well-
known algorithms exist for this, but given the domain of text we are dealing with, for our
purposes a very simple approach to coreference resolution is sufficient.
A very simple heuristic is used for coreference resolution: the first pronoun appearing in the text
is assumed to refer to the entity the text is describing (the title entity), and so are all forms of it
throughout the text. The title entity, that is, the entity the article is about, its main topic, is
henceforth referred to as “$self”. Consider the following example, taking the first two sentences
from the article on Johann Sebastian Bach. Coreferent spans of text are in bold face:
“ Johann Sebastian Bach (b. Eisenach, 21 March 1685; d. Leipzig, 28 July 1750) was a German composer and organist. He lived in the last part of the Baroque period.”
15 Note that the spelling “BaroquEComposers” is taken verbatim from the data. This just offers a hint of how careful
one must be when dealing with automatically-mined data like that of DBpedia and YAGO. We explore this issue in
more depth in 6.1.
The first pronoun to appear is the “He” starting the second sentence. From this moment on,
“he” will be assumed to be coreferential with the entity :Johann_Sebastian_Bach, and so will be
the form “his”. For each coreferent token, a coreference annotation is stored.
4.6 Parsing
For parsing, I employ the Stanford parser with the pre-trained PCFG English model16 (Klein and
Manning, 2003), a widely-used, state-of-the-art, self-contained parser, which also provides a
number of pre-trained probabilistic models for other languages. Distributed as a Java Archive
(.jar), it is easy to interface or use from the command line or other programming languages17. A
number of freely licensed and open-sourced parsers were considered (e.g. C&C parser, NLTK
Viterbi parser, Berkeley parser) and the final choice was motivated by its robustness, speed, and
ease of interfacing.
Of all the sentences containing at least one coreferential token, the ones that also contain at least
one spotted property value that is not coreferential with $self are selected as template
candidates. For each of these sentences, a specially prepared pre-parse version is created, where
for each spotted entity (or each annotation) a placeholder variable is created. This variable takes
the name “var_n”, where n is an automatic counter, with values from 1 to N, the number of
tokens in the sentence with an annotated spotted entity. So, for instance, given the sentence:
“Carl Maria von Weber (born Eutin, Holstein, baptised 20 November 1786; died 5 June, 1826 in London) was one of the most important German composers of the early Romantic period.”
After date normalisation and spotted entity substitution, this sentence becomes:
“Var_1 (born Var_2, Var_3, baptised 1786-11-20; died var_4 in var_5) was one of the most important Var_6 of the early Romantic period.”
The parser assigns a noun (NN) Part-Of-Speech tag by default to unknown words, which all the
placeholders are in this case. This is conceptually consistent with the fact that they are spotted
entities in the text, and can therefore be nouns. This is done in order to preserve these spotted
entities as one unit each. The parser will often nest entities formed by more than one word in the
parse tree in ways that complicate the posterior retrieval of those entities and even more so the
pruning of the tree.
4.7 Syntactic pruning
While parsing can be helpful for a number of tasks (e.g. it can inform coreference resolution by
identifying the subject of the sentence), here it is only deemed necessary in order to carry out
syntactic pruning to ensure the templates are transferable.
16 Version 1.6.5, from 30/11/2010. The Python library used was designed for the older API and incompatible with
more recent versions.
17 This implementation uses jPype (http://jpype.sourceforge.net/) to interface the Java Virtual Machine from Python.
For this, it is considered here that the following grammatical categories require support in the
data: nouns, adjectives, adverbs and numerals. The corresponding tags of these categories
returned by the parser are: N* (e.g. NNS – plural), JJ*, RB*, CD*. The asterisk here is meant as
a wildcard for zero or more characters: NN* should match both NNS and NN. These tags are the
ones used in the Penn Treebank, the annotated corpus the Stanford parser English PCFG model
was trained on.
By “require support” it is meant that the words with those corresponding tags must have been
aligned to values found in the data, i.e. must have a “spotted” predicate. Note that several
grammatical categories do not require support, most relevantly verbs. This is because what verbs
do require is objects, and it is these that require support. This concept can be seen as very
related to techniques of Relation Extraction (Sarawagi, 2008).
The pruning proceeds in three stages:
• Stage 1: Each leaf of the tree (i.e. word in the sentence) whose Part-of-Speech tag
matches one of the masks (N*, JJ*, RB*, CD*) and which does not have a “spotted”
value in the data is deleted.
• Stage 2: Context-Free Grammar rules are inverted. For example, if in a standard CFG a
constituent is expanded via the rule NP -> DET + N (a Noun Phrase is formed by a
head Noun and a DETerminer), if the head noun of an NP is deleted, then the whole NP
must be deleted too, together with all the constituents it may contain. The rules used in
stage 2 are:
o VP requires V leaf
o NP requires N* leaf (N, NP, NN, NNS, all valid)
o PP requires P and NP
o Verb requires object: either VB* or VP containing it must have sister constituent
to the right18
o WH-phrase (WDT, WP, WP$) requires VP or S leaf
o Coordinating conjunction (CC) requires sister nodes to its left and right of the
same type
The rules in this stage are applied successively and repeatedly to the parse tree until no
modification is made.
• Stage 3:
o Resulting parses are first filtered by the number of spotted entities they still
contain. If there is not at least one token that refers to $self and one spotted
value, we discard the template candidate.
o Finally, we apply a number of rules to ensure correct punctuation by deleting
empty brackets, duplicate commas, etc.
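Stage 1, plus the removal of phrases left empty, can be sketched over a toy tree of nested lists with (POS, word) leaves. This structure and the function name are mine; the real system operates on Stanford parser trees, and the full stage-2 rule inversion is not reproduced here.

```python
# Sketch of stage-1 pruning (and empty-phrase removal) over a toy parse tree.
# A leaf is a (pos, word) tuple; an internal node is [label, child, ...].
OPEN_CLASS = ("N", "JJ", "RB", "CD")  # the masks N*, JJ*, RB*, CD*

def prune_stage1(tree, spotted):
    """Drop open-class leaves with no spotted value; drop emptied phrases."""
    if isinstance(tree, tuple):  # a (pos, word) leaf
        pos, word = tree
        if pos.startswith(OPEN_CLASS) and word not in spotted:
            return None  # requires support in the data, but has none
        return tree
    children = [prune_stage1(c, spotted) for c in tree[1:]]
    children = [c for c in children if c is not None]
    if not children:  # a phrase with no surviving leaves is deleted too
        return None
    return [tree[0]] + children
```

Note that verb leaves (VB*) pass through untouched, consistent with the observation above that verbs themselves do not require support.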
An example of this processing applied to the previous sentence from section 4.6 can be seen in
Figure 4.2:
18 This accounts for inconsistencies in the parse trees returned by the Stanford parser.
• In stage 1 the following leaves require evidence in the data, and as this is missing, they
are deleted: “1786-11-20” (CD), “most” (RBS), “important” (JJ), “early” (JJ),
“Romantic” (JJ), “period” (NN).
• In stage 2, the NP containing “period”, having lost the only NN that supported it, is
deleted. The ADJP “most important”, having no leaves in it, is also deleted.
• In post-processing rules, “one of the” is substituted by “a”.
This pruning produces the following pruned template candidate:
[foaf:name] (born [dbont:birthPlace], baptised died [dbont:deathDate] in [dbont:deathPlace]) was a [rdf:type].
This is not a perfectly correct template, given the presence of “baptised” in it with no
complement. Although it is extracted and stored, this template is not judged grammatical
enough as per the criteria defined in section 6.2.4.
Figure 4.2 Parse tree, with removed constituents underlined.
4.8 Store annotations
At this stage, a list of annotations from step 3 (NER) is collected and saved for the model as a
separate list for each document processed. This is done by iterating through all the tokens in the
text, ignoring sentence boundaries and storing only annotations that do not refer to the title
entity (e.g. foaf:name, rdfs:label, dbprop:name). This will be used in step 4.9.3 to compute counts
and probabilities for predicate n-grams.
An example list of annotations for an article would be:
{foaf:name, rdfs:label, dbprop:name}, {dbont:birthD ate}, {dbont:birthPlace}, {dbont:deathDate}, {dbont:death Place}, {rdf:type}, {dbont:knownFor}
Figure 4.3 Training: post-processing steps
4.9 Post-processing
Post-processing is applied to every class model independently, executing the following steps in
succession.
4.9.1 Cluster predicates into pools
First the owl:sameAs property is retrieved for every property spotted (i.e. for every entry in the
1-gram list) and it and its object are added to a single “predicate pool”. Second, the lists of
annotations for the analysed articles are processed and properties that are seen to have the same
value with high frequency are grouped together in pools. This is done by computing a similarity
coefficient based on the Dice coefficient formula:
sim(p1, p2) = 2 · same(p1, p2) / (count(p1) + count(p2))
This is twice the amount of times they have the same value divided by the number of times they
appear individually. This is only computed for the set D of entities of class C for which both p1
and p2 are defined (i.e. have a value other than null string). If this coefficient is above a
threshold, the predicates are deemed to be equivalent. Experimentally the value of this threshold
is set to 0.9. As an example, foaf:name, rdfs:label and dbprop:name are clustered together in a
single pool for all classes used in the experiments.
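The similarity computation and the resulting pooling can be sketched as follows; the greedy merging strategy and all names here are illustrative, assuming each predicate's values are given as a mapping from entity to value.

```python
# Sketch of predicate pooling (section 4.9.1). vals maps entity -> value for
# one predicate; similarity is computed only over entities where both are
# defined, as in the text.

def dice(vals1, vals2):
    d = set(vals1) & set(vals2)  # the set D: both predicates defined
    if not d:
        return 0.0
    same = sum(1 for e in d if vals1[e] == vals2[e])
    # restricted to D, each predicate appears len(d) times
    return 2.0 * same / (len(d) + len(d))

def cluster(pred_values, threshold=0.9):
    """Greedily merge predicates whose pairwise similarity exceeds threshold."""
    pools = []
    for p in pred_values:
        for pool in pools:
            if all(dice(pred_values[p], pred_values[q]) > threshold for q in pool):
                pool.append(p)
                break
        else:
            pools.append([p])
    return pools
```

With the 0.9 threshold, predicates like foaf:name and rdfs:label, which share a value for almost every entity, end up in one pool, while dbont:birthDate stays separate.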
The predicate pool is identified by the most frequent predicate of those in the pool. If they are
equally frequent, the first one in the list returned by the sorting function is used. This most
frequent predicate is then substituted in all the sentence templates in that model in place of all
the other predicates in the pool. This is mostly for aesthetic reasons and to simplify the
generation. It does not affect the behaviour of the system, but makes the visualisation of the
extracted templates more intuitive.
4.9.2 Purge and filter sentences
After the predicate pools have been built, each sentence in the model is checked for conflicts
between predicates. This is done to account for the fact that predicates with different and
possibly opposed meanings can have the same object.
For every sentence template t, for every slot s in t, if s contains more than one predicate after the
clustering carried out in the previous step, the predicates are checked for similarity. The Dice
coefficient of each pair of predicates is checked (having been computed previously), and if it falls
below a threshold (experimentally set to 0.1), the sentence template is discarded.
Ideally, sentences that express the same set of properties would be filtered here according to some
criteria (e.g. length in tokens, amount of symbol tokens present, an overall character length
preference, etc.), and the best or n-best ones would be kept. This is, however, not implemented
in the LOD-DEF system.
4.9.3 Compute n-gram probabilities and store model
The n-gram counts collected are adjusted to reflect probabilities using Maximum Likelihood
Estimation. A very simple smoothing technique is applied, equivalent to add-α smoothing with a
very small α. Trigrams are used throughout this implementation. The model is finally stored in a
file, for which Python’s built-in serialisation is used.
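A sketch of the smoothed trigram estimation over predicate sequences is given below; the padding symbol "" mirrors the n-gram entries in Figure 4.4, while the function signature and the vocabulary-based normalisation of α are my own assumptions.

```python
from collections import Counter

# Illustrative add-alpha smoothed trigram model over predicate sequences
# (section 4.9.3). Padding with "" mirrors Figure 4.4's n-gram entries.

def trigram_probs(sequences, alpha=1e-4):
    tri, bi = Counter(), Counter()
    vocab = set()
    for seq in sequences:
        padded = ["", ""] + list(seq)
        vocab.update(seq)
        for i in range(len(seq)):
            ctx = tuple(padded[i:i + 2])
            tri[ctx + (padded[i + 2],)] += 1
            bi[ctx] += 1
    v = len(vocab) or 1

    def p(w3, w1, w2):
        # MLE count ratio with a very small additive constant alpha
        return (tri[(w1, w2, w3)] + alpha) / (bi[(w1, w2)] + alpha * v)

    return p
```

Unseen trigrams thus receive a small non-zero probability (α / (α·V) = 1/V when the context is also unseen) rather than zero.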
Class: yago:GermanComposers
Templates:
(1) [foaf:name] ([dbont:birthDate] -- [dbont:deathDate]) was a [purl:description].
(2) [foaf:name] (born [dbont:birthDate] [dbont:birthPlace]; died [dbont:deathDate] [dbont:deathPlace]) was a [rdf:type].
…
Pools:
(1) {foaf:name, rdfs:label, dbprop:name, dbprop:caption, dbprop:cname}
(2) {rdf:type, purl:description, dbprop:shortDescription}
(3) {dbprop:dateOfDeath, dbprop:deathDate, dbont:deathDate}
(4) {dbprop:dateOfBirth, dbprop:birthDate, dbont:birthDate}
(5) {dbont:knownFor}
…
n-grams:
(“”, “”, foaf:name)
(“”, foaf:name, dbont:birthDate)
(dbont:birthDate, dbont:deathDate, purl:description)
(dbont:deathDate, purl:description, dbont:knownFor)
…
Figure 4.4 Example trained model for yago:GermanComposers
The resulting output of the training pipeline is not just one class model, but a model collection,
as for every entity a number of class models may be created or updated.
Chapter 5
Implementation: Generation
The generation algorithm described here takes as input a collection of trained class models as
defined in the previous chapter, and the URI of an entity for which to generate a description
article.
5.1 Retrieve RDF triples
Values are retrieved from the SPARQL endpoint for each predicate pool saved in the model. A
pool may contain any number of different predicates, but these being considered completely
equal, only the values of one of them are retrieved and saved for the whole pool. A query is made
for each predicate until values are returned, which are then stored for the whole pool.
5.2 Choose best class for entity
The first step of this procedure is identical to that detailed in section 4.4 with the difference that
there is no article text to be considered, so this is not added to the vector of words, therefore
step 2 (as defined in 4.4) is omitted. This generates an n-best list (experimentally, n=5) of
classes. For each of these classes, if a model is found in the model collection, a score is computed
for this model. This score is the number of pools in the model that would get instantiated
through the sentence templates available in the model. Only pools for which there are values in
the triples for the entity are considered. For example, from the n-best list for
:Johann_Sebastian_Bach from Table 4.1, if two models were available such that:
Model                                               yago:GermanComposers   yago:ComposersForCello
Number of templates                                           2                      3
Pools for which there are values in the data                  7                      6
Pools that would be instantiated through templates            5                      6
The chosen model would be yago:ComposersForCello, even though it received a lower score in
the first step, only because a higher number of values would be (potentially) expressed through
templates. This does not take into account the fact that due to the constraint on number of uses
of property values, not all these templates might be instantiated.
The motivation behind this choice is that an extracted sentence template is expected to generate
higher quality text, so a model instantiating more predicates through extracted templates is
preferred. This is especially important for the subjective human valuation conducted as part of
this project (see Chapter 7 for details).
5.3 Chart generation
We use chart generation: all sentence templates in the model for which there are enough triples
in the data are put on a chart and combinations of them are generated. The following steps are
taken:
1. For each template t, where S(t)i is the ith slot in it, we discard any template for which
some slot S(t)i has no value in the set of retrieved property values V: every template t
must have a value in V for every one of its slots.
2. For each pool in the model, a simple sentence template is generated in exactly the same
way as for the baseline and added to the chart. This is done in order to deal with the
situation where pools (spotted properties in the training text) would not be expressed for
a lack of a template expressing them.
5.4 Viterbi generation
We now need to select and order sentence templates from the chart to produce a combination.
Ideally, we would want to find the combination of sentences that expresses all the values of the
pools in the model, yet uses as many extracted templates and as few simple generated
ones as possible.
In order to deal with the combinatorial explosion, instead of an “overgenerate and rank”
approach, we apply the Viterbi criterion (Jurafsky & Martin, 2009). This means that we
compute scores for all the options at every step, select the one with the highest and discard all
the others, thus only ever keeping one possible combination. This is not guaranteed to be the
optimal solution to the requirements outlined above, but it is a satisfactory trade-off between
quality and speed and keeps the algorithm simple and the generation running in polynomial
time. The computational complexity of the algorithm presented in Figure 5.1 is O(n² log n).
used_ngram_list = null predicate (beginning of document)
combination = new list of sentence templates
(1) Do while len(combination) < len(chart):
    considered = new list of templates
    (2) Do for each template in chart:
        If template not in combination:
            If template does not require more uses of pools than allowed19:
                Compute n-gram score of template using only the first pool
                Add to score a tenth of the number of pools used by template20
                Add sentence with score to considered list
    If no templates were considered, exit loop (1)
    Take the template with the highest score, add it to combination
    Increase the counter of times used of each pool used by template
    Add all pools used to used_ngram_list
Figure 5.1 Pseudocode for the Viterbi generation algorithm
19 One pool does not have to satisfy this constraint; this is the one representing the name of the entity being described,
“$self”, clearly identified based on the fact that it contains rdfs:label.
20 This has the effect that, where more than one template has the same n-gram score, the one using more pools
(i.e. the longest one) is selected.
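The greedy selection of Figure 5.1 can be rendered as runnable Python. The representation of a template as a (pools, n-gram score) pair and all names here are my own simplification; in particular, the n-gram score is taken as precomputed rather than derived from the first pool at selection time.

```python
# Sketch of the greedy (Viterbi-style) template selection of Figure 5.1.
# Each template is a (pools, ngram_score) pair; names are illustrative.

def select_templates(chart, pool_limit=1, self_pool="$self"):
    combination, uses = [], {}
    while len(combination) < len(chart):
        considered = []
        for t in chart:
            if t in combination:
                continue
            pools, ngram_score = t
            # pools other than $self may only be used pool_limit times
            if any(uses.get(p, 0) >= pool_limit for p in pools if p != self_pool):
                continue
            # a tenth of the pool count breaks ties in favour of longer templates
            score = ngram_score + 0.1 * len(pools)
            considered.append((score, t))
        if not considered:
            break
        best = max(considered, key=lambda st: st[0])[1]
        combination.append(best)
        for p in best[0]:
            uses[p] = uses.get(p, 0) + 1
    return combination
```

On the Woody Woodpecker chart of Figure 5.3 this reproduces the worked example: the extracted template using rdf:type and dbprop:creator is chosen first, blocking the two shorter templates over the same pools, and the significant-other template follows, giving the combination (1, 4).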
To illustrate this algorithm, consider we are generating an article for the entity
:Woody_Woodpecker using the following example model (Figure 5.2), where two templates were
extracted from text. Consider also that in the input data there are only values available for pools
(1) to (4), so pool (5) has no value.
Class: yago:FictionalAnthropomorphicCharacters
Templates:
(1) [foaf:name] is a [rdf:type] created by [dbprop:creator].
(2) [foaf:name] first appeared in [dbprop:first].
Pools:
(1) {foaf:name, rdfs:label, dbprop:name} = “Woody Woodpecker”
(2) {rdf:type} = “fictional anthropomorphic characters”
(3) {dbprop:creator} = “Walter Lantz”, “Ben Hardaway”, “Alex Lovy”
(4) {dbprop:significantother} = “Winnie Woodpecker”
(5) {dbprop:first} = *empty*
Figure 5.2 Example model for generation
Having selected the model, the templates available for which the required values are available are
put on a chart (Figure 5.3). Here, template (2) requires values from pool (5), for which no values
were found in the RDF triples, so it is not added to the chart. Template (1) fulfils all
requirements and is added. Next, simple templates are generated for each of the four pools
except pool (1) as this one contains rdfs:label.
(1) [$self] is a [rdf:type] created by [dbprop:creator].
(2) [$self] is a [rdf:type].
(3) [$self-possessive] creator is [dbprop:creator].
(4) [$self-possessive] significant other is [dbprop:significantother].
Figure 5.3 Chart for generation
In the first iteration, none of the pools have been used, which makes all templates on the chart
selectable. Considering that the stored n-gram probabilities have rdf:type as the most likely
predicate to follow the null property (beginning of document), and given that only the first
property expressed is considered when computing the n-gram score, both (1) and (2) would have
the same score. However, given the formula adds to this score a value proportional to the number
of properties that would be instantiated by the template, template (1) is chosen and added to
the final combination. This has two pools marked as used once: pool (2) and (3). This still leaves
one pool that does not refer to $self to be expressed: number (4).
In the second iteration, templates (1), (2) and (3) cannot be selected as candidates to follow in
the combination, as they require properties that have already been used once. Only template (4)
is available for selection, so independently of its score it will be added next. The final template
combination is then (1,4).
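The selection loop of Figure 5.1 can be sketched in Python. This is an illustrative reconstruction, not the thesis implementation: templates are dicts carrying a list of required pool names, and the stored content n-gram model is passed in as a scoring function stub.

```python
def combine_templates(chart, pools, ngram_score, max_uses=1):
    """Greedily build a template combination, mirroring Figure 5.1.

    chart: list of templates, each a dict with a 'pools' list of pool names.
    pools: dict mapping pool name -> list of values.
    ngram_score: function(template, combination_so_far) -> float, a stub
                 standing in for the stored content n-gram probabilities.
    """
    combination = []
    uses = {name: 0 for name in pools}          # times each pool was used
    while len(combination) < len(chart):
        considered = []
        for template in chart:
            if template in combination:
                continue
            # "$self" (the pool containing rdfs:label) is exempt from the limit
            if any(uses[p] >= max_uses for p in template['pools']
                   if p != '$self'):
                continue
            score = ngram_score(template, combination)
            score += len(template['pools']) / 10.0   # prefer longer templates
            considered.append((score, template))
        if not considered:
            break                                    # nothing selectable: stop
        best = max(considered, key=lambda pair: pair[0])[1]
        combination.append(best)
        for p in best['pools']:
            uses[p] += 1
    return combination
```

On the Woody Woodpecker chart above, with equal n-gram scores, the length bonus selects template (1) first, after which only template (4) remains selectable.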
5.5 Filling the slots
For every template in the combination created in the step above, we must select values for its
slots. For slots that refer to $self, the title entity, LOD-DEF implements a very simple
Referring Expression Generation algorithm, similar to the baseline described in section 3.4.
The initial reference to an entity is its foaf:name or rdfs:label. For classes which have been
observed to be referred to using a singular pronoun with grammatical gender (“he” and “she”),
as was done for the baseline, the system retrieves the value of foaf:gender for the entity whose
description is being generated and chooses the appropriate pronoun based on it. This
objective would ideally be attained by performing inference on the classes the entity belongs to
or by checking with the SPARQL endpoint whether the entity is of class “Person” (using any of
the available URIs identifying a person, e.g. foaf:Person). However, given the experience of dealing
with remote data detailed in section 6.1, the implementation trusts the text rather than the
data. If no foaf:gender value is available for an entity for which “he” and/or “she” referring
pronouns were observed in training, the fallback gender is the most frequent one observed during
training.
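A minimal sketch of this pronoun-choice fallback follows; the function and data-structure names for the training statistics are hypothetical, not the thesis code:

```python
def choose_pronoun(entity, gendered_classes, gender_counts):
    """Pick a referring pronoun for subsequent mentions of an entity.

    entity: dict of predicate -> value(s), e.g. {'foaf:gender': 'female'}.
    gendered_classes: set of classes observed in training with "he"/"she".
    gender_counts: e.g. {'male': 12, 'female': 7} observed during training.
    Returns 'he', 'she', or 'it'.
    """
    if not (set(entity.get('rdf:type', [])) & gendered_classes):
        return 'it'                      # class never seen with he/she
    gender = entity.get('foaf:gender')
    if gender is None:
        # fall back to the most frequent gender seen during training
        gender = max(gender_counts, key=gender_counts.get)
    return 'she' if gender == 'female' else 'he'
```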
For all other slots, if only one value is available for the pool, it is rendered depending on its
type (e.g. dates are formatted from 1066-10-14 to “14 October 1066”). Finally, a number of
regular expressions help keep the output grammatically correct, by adjusting spaces between
punctuation tokens, changing the article “a” to “an” before a word starting with a vowel, etc.
Continuing the previous example, the resulting output is:
Woody Woodpecker is a fictional anthropomorphic character created by Walter Lantz, Ben Hardaway and Alex Lovy. His significantother is Winnie Woodpecker.
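The slot rendering and surface post-processing described above can be sketched with the standard library. The exact rules and regular expressions of the implementation are not reproduced here, so these are illustrative:

```python
import re

MONTHS = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']

def format_date(iso):
    """Render an ISO date such as 1066-10-14 as '14 October 1066'."""
    year, month, day = (int(part) for part in iso.split('-'))
    return f'{day} {MONTHS[month - 1]} {year}'

def tidy(text):
    """Apply simple surface fixes: punctuation spacing and a -> an."""
    text = re.sub(r'\s+([,.;])', r'\1', text)              # no space before , . ;
    text = re.sub(r'\ba(\s+[aeiouAEIOU])', r'an\1', text)  # "a" before a vowel
    return re.sub(r'\s{2,}', ' ', text)                    # collapse double spaces
```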
Chapter 6
Experiments
6.1 Problems with the data
Months of testing support the conclusion that a great degree of caution must be exercised when
relying on DBpedia data. To begin with, as mentioned before, the schema is rather unreliable.
Redundancy is high, as very often several properties with the same meaning are provided (e.g.
dbprop:birthPlace, dbprop:placeOfBirth and dbont:birthPlace all have the same
meaning). These properties are meant to have owl:sameAs links to identify them as equal, yet
when these triples exist they always point back to the same URI (e.g. dbprop:dateOfBirth
owl:sameAs dbprop:dateOfBirth). As detailed in Chapter 3 and Chapter 4, this is addressed by
the LOD-DEF system by learning pools of equivalent predicates.
Further, it is remarkable that the rdf:type properties on DBpedia link to supposed class URIs
with incorrect spellings like “SpanishFootballCluBs”, “CarManuFACturers” and
“BaroquEComposers”. It remains unclear what the reasons behind these spellings are, but these
are clearly errors in the data, as there are no triples in the triplestore with these URIs as
subjects. Their correctly-spelled counterparts do have triples, e.g. yago:CarManufacturers rdfs:label "Car manufacturers"@en.
6.2 Performance of the system
Evaluation of the system’s performance is somewhat problematic due to the cumulative error
rate introduced by the number of interdependent modules in the architecture pipeline. Each
stage depends on output from the previous stage, so for instance an error at the spotting stage is
sure to impact the extraction of a sentence template.
I manually evaluate here two main aspects of the system: the success of the template extraction
process and the class selection algorithm. For this, the training pipeline was run to train a single
model collection for the classes in Table 6.1, for a maximum of 30 entities of each class. These
classes are the same ones used for human evaluation (see Chapter 7), although the entities the
model was trained on need not be the same ones. Other aspects, such as spotting performance,
are evaluated through examples and critical discussion.
yago:EnglishComedians                      yago:CarManuFACturers
yago:AmericanPopSingers                    yago:AfricanCountries
yago:FictionalAnthropomorphicCharacters    yago:SpanishFootballCluBs
yago:ArgentineFootballers                  yago:SingingCompetitions
yago:SpeedMetalMusicalGroups               yago:GermanComposers
yago:CapitalsInEurope                      dbont:TelevisionShow
Table 6.1 Classes used for testing
6.2.1 Spotting performance
As only a gazetteer is applied in this baseline, the performance of the spotting will exclusively
depend on the extent to which the value literals in the RDF triples mirror those found in text. In
the case of spelling differences or naming inconsistencies, the spotting will fail.
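Literal gazetteer spotting of this kind amounts to exact string search of the pool values over the article text, which is why spelling differences defeat it. A sketch (identifiers are illustrative, not the thesis code):

```python
import re

def spot(text, pools):
    """Literal gazetteer spotting: find pool values in the text.

    pools: dict mapping pool name -> list of value strings.
    Returns a list of (pool, value, start, end) matches.
    """
    matches = []
    for pool, values in pools.items():
        for value in values:
            # exact string matching only: any spelling variation fails
            for m in re.finditer(re.escape(value), text):
                matches.append((pool, value, m.start(), m.end()))
    return matches
```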
There are inconsistencies in the data, such as the spelling of names cross-language. For example,
for a single entity we find “George Frideric Handel” in the triple values and “George Fredrick
Handel (German: Georg Friedrich Händel)” in its associated article text. Note that neither of the
spellings found in the text can be matched to the one in the triple values.
Another example is the category name “Argentine Footballers”. This string is seldom spotted in
the text, as the surface realisation is “football players”. The two terms are synonymous, and
ideally the system should be able to determine that the surface realisation of “footballer” is
“football player”. Clearly then this task requires a more sophisticated approach than literal string
matching. DBpedia provides a lexicalisation dataset which can be applied to this task and is
indeed used by DBpedia Spotlight (Mendes & Jakob, 2011). Another option for a robust NER
solution is OpenCalais (Butuc, 2009).
6.2.2 Parser performance
Although the parser introduces a significant error rate to the system due to inconsistencies in
nesting constituents, I do not directly evaluate its performance here, lacking a gold standard to
compare against for this specific domain. It should however be noted that a different PCFG
model could help improve performance. Also, it was noted before that other approaches to
sentence compression use dependency parsing. This was tested for the present project but was
deemed unsuitable due to the low accuracy of the output from the parsers tested. A different
dependency parser might, however, prove better suited to the task.
6.2.3 Class selection performance
The class selection algorithm was manually evaluated, by comparing the n-best classes identified
by the algorithm with the first line in the full English Wikipedia article for that entity.
The criteria adopted were as follows. Consider two sets, A and B, where A is the set of classes
mentioned in the first description sentence in the text, and B is the set of n-best classes chosen
by the class selection algorithm. The criteria for establishing matches between these sets are
shown in Table 6.2.
Match type      Criterion
No match        No element of A is equivalent to an element in B
Partial match   At least one element in A is equivalent to an element in B
Correct match   A1 (the first class mentioned in the text) is also in B
Table 6.2 Match criteria
For this testing, n was set dynamically depending on the number of classes available for an
entity. For entities belonging to 9 or more classes, n is set to 5. For entities with fewer than 9
classes, n is set to half the number of classes, rounded up.
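This choice of n can be stated directly in code (a sketch of the rule as described):

```python
import math

def n_best(num_classes, cap=5, threshold=9):
    """n-best cut-off for class selection: 5 for entities with 9 or more
    classes, otherwise half the number of classes, rounded up."""
    if num_classes >= threshold:
        return cap
    return math.ceil(num_classes / 2)
```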
To give an example, for the entity :Woody_Woodpecker, the best class as chosen by the algorithm
is yago:FictionalAnthropomorphicCharacters, whereas the article text begins “Woody Woodpecker
is an animated cartoon character, an anthropomorphic acorn woodpecker”. This is judged to be a
correct match, although “animated” and “cartoon” do not appear in the class name.
Similarly, for an entity of class yago:CapitalsInEurope, where no class with “city” or “capital” in
the name is available, yago:PopulatedPlace is accepted as equivalent of “city”.
For testing, a set C is also defined: B is the set of n-best classes chosen directly from the
triples, while C is the set chosen after adding the first sentence of the article text to the bag
of words.
                  Set B (n-best)   Set C (n-best with text)
Entities tested   95               95
No match          1                1
Partial match     0                0
Correct match     94               94
Table 6.3 Results of class selection evaluation
The results of the evaluation suggest this is a robust algorithm, with almost a 100% correct
match rate as defined above. This evaluation is admittedly dependent on subjective
interpretation of the meaning and overlap of class names, so further testing, refinement of the
criteria and, ideally, evaluation by other humans (like the one in Chapter 7) should be
undertaken.
The one entity for which no match was found was :Life_in_Hell. Revealingly, it is said to be of
class yago:FictionalAnthropomorphicCharacters, yet the Wikipedia article states “Life in Hell was a
weekly comic strip […] The strip features anthropomorphic rabbits and a gay couple.” While the
entity clearly contains fictional characters, it was decided that it should be of class “comic strip”
or equivalent.
6.2.4 Template extraction
The templates extracted were manually judged on transferability (Y/N) and on their
grammaticality. Grammaticality was judged according to the criteria outlined in Table 6.4.
Score   Meaning
5       Perfectly grammatical
4       Minor punctuation defects (e.g. stranded commas)
3       Missing determiner or stranded conjunction
1-2     Lack of verb or no meaning
Table 6.4 Grammaticality scores and criteria
Intuitively there are also different degrees of non-transferability, but this was not taken into
account here. The judging was binary: if a template was not perfectly transferable, it was judged
not transferable at all.
Item                                  Total
Processed articles                    268
Sentences considered for extraction   199
Discarded during pruning              98 (49%)
Discarded after pruning (filtered)    26 (13%)
Extracted templates                   74 (37%)
Non-transferable                      14 (19%)
Transferable templates                60 (81%)
Average grammaticality                4.15
5-star grammaticality                 34
4-star grammaticality                 9
Final accuracy                        43/74 (58%)
Table 6.5 Extracted templates statistics
For the purposes of evaluation, here I adopt as the final performance metric (accuracy) of the
template extraction process the percentage of final extracted templates that are both
transferable and have a grammaticality score of 4 or 5. As Table 6.5 shows, of a total of 74
extracted templates, 60 (81%) are transferable, of which 43 (58% of total extracted templates)
have a rating of 4 or 5 on grammaticality.
Note here that of the 14 non-transferable templates reported, 3 were purged in the post-
processing stage because of conflicting predicates in their slots. However, this is not taken into
account here, as this is an independent step with a different purpose.
The final accuracy metric can clearly be improved on, and one way of doing so is refining the
rules for pruning. The development of these rules did not follow a data-driven approach, but was
based on first principles of Context Free Grammar and on a summary examination of the data.
It only became apparent during evaluation that these rules were not sufficient to ensure the
grammaticality and transferability of the extracted templates.
6.2.5 Examples of errors in output
• “Casablanca of Morocco is Rabat.” Here the spotting failed mainly due to
the low quality of the data. During training, the title entity’s dbprop:largestCity property
had the value “capital” as a string literal. This prompted the extraction of the previous
template [dbprop:largestCity] of [dbprop:commonName] is [dbont:capital]. This property
has no rdfs:range specified in the schema, which means it can take any value. This is
unfortunate, as it could be argued that its values should be of type City, and a string
literal like “capital” here is of little use and adds noise to the data.
• “Her active is 1981.” What this means is that the person who is the title entity
has been active since 1981, but the rdfs:label for this does not say so.
• “Although Hyundai Motor Company started in Public.” Here, the
pruning rules were clearly not enough to make this sentence grammatical. Either the
“although” should have been removed or the whole template dropped.
• [foaf:surname] is married to [rdf:type] [dbprop:spouse]. This
template for EnglishComedians may well happen to be true once instantiated in text, if
and only if the spouse of the title entity is of the same type (i.e. EnglishComedians). This
means that this sentence is not transferable and should be identified as such and
discarded.
• “Mercyful Fate is a Speed metal musical group from, Denmark and
Copenhagen.” This sentence is mildly ungrammatical due to the presence of a
comma. While “Denmark and Copenhagen” is an odd combination, it is due to the generation
algorithm, not to the extraction.
Chapter 7
Evaluation
7.1 Approach
Given the exploratory nature of this project, the evaluation relies on human ratings of the
system’s output, assessed under equal conditions alongside output from two other systems: the
baseline described in section 3.4 and expert human output. I adopt a two-panel (i.e. two separate
groups of subjects) approach to compare the three generation systems, very similar to the
evaluation undertaken by Sun & Mellish (2007). Humans in Panel A generate descriptions of the
same 12 entities and humans in Panel B rate the different outputs of System A (baseline),
System B (LOD-DEF) and System C (human generation) across a number of dimensions.
The hypothesis is that LOD-DEF will be rated higher on average in human evaluation than a
system generating exclusively from English words in RDF predicates. For comparison with an
upper bound, the system is also ranked against human-generated text for the same data.
Human-generated text need not always be an upper bound in subjective evaluation, but given
the simplicity of the two NLG systems, this is the hypothesis here.
Given the relatedness of the present approach to automatic summarisation, three of the
criteria for evaluation used by the Document Understanding Conference 2007 were found to be
very appropriate for the task at hand. The texts are rated on grammaticality, non-redundancy,
and structure and coherence. No direct evaluation of content determination is carried out: here it
is evaluated implicitly through the dimension of “non-redundancy”, given that its main effect in
this implementation is filtering out redundant and unnecessary information.
7.2 Selection of data
Classes used for evaluation were not chosen at random. Given that one of the aims was to
evaluate the effect of the sentence templates as opposed to the baseline, I purposefully applied a
bias towards classes for which a higher number of templates were extracted and more properties
were spotted in text, which correlates with classes for which more factual information (strings
and quantities) was available on DBpedia21. This aimed to ensure richer output was
generated by the LOD-DEF system, to allow for a more meaningful rating from human judges
and so that we can evaluate the performance of the system at document structuring.
Within these constraints an attempt was made to select classes as varied as possible. While in
the final test set there are four instances of subclasses of “Person”, these are markedly different
kinds of person, with several different RDF properties. Also, their share of the test set
(roughly 30%) approximately matches the proportion of entities of type “person” available in the
consistent DBpedia ontology, around 23% (Mendes et al., 2012).
21 This also correlates with subclasses of Person.
Subject 1                                     Subject 2
Jennifer Jane Saunders (English Comedians)    Fernando Gago (Argentine Footballers)
Hyundai Motor Company (Car Manufacturers)     American Idol (Singing Competitions)
Nicole Scherzinger (American Pop Singers)     Mercyful Fate (Speed Metal Musical Groups)
Morocco (African Countries)                   William Herschel (German Composers)
Woody Woodpecker (Fictional Characters)       Belgrade (Capitals in Europe)
Real Zaragoza (Football Clubs)                Winx Club (Television Shows)
Table 7.1 Entities chosen for evaluation and subject generating each
For each class, the aim was to select a lesser-known instance and so prevent the subjects from adding
extraneous information to the output. For example, for “Fictional Character”, instead of “Mickey
Mouse”, “Woody Woodpecker” was chosen, still a widely known character but arguably one that
is less heavy with associations.
During the development of the LOD-DEF system, the development set of entities against which I
adjusted the several subcomponents was formed mostly of instances of yago:GermanComposers.
The template extraction system was adjusted to extract more grammatical sentences from
this set, which was also included in the evaluation.
William Herschel is best known not as a German composer but as an astronomer. Although he was
both, this is an instance where the algorithm failed given the data available. As this was not
known at the time of organising the survey, the survey reflects it as such. An important
observation is that he is best known for discovering the planet Uranus; since the generated
article does not specify this, a reader could assume that “Uranus” refers to a piece of music.
7.3 Human generation
Panel A is given triples related to the chosen entities and instructions on how to proceed. Panel
A is formed by two native speakers of English, both of them linguistics postgraduate students.
Their task is to write summary descriptions of the entities the data is about by expressing as
much of this data as possible in text.
The triples are grouped by the entity they relate to, one entity on each page. The information is
printed in a human-friendly format, where the rdfs:label is retrieved for every predicate, followed
by the equal sign and a list of n values, which are all the values the property has. If a value is a
URI, the rdfs:label for that URI is retrieved and printed instead. Otherwise the value is presented
with its literal value. For example, “birth date = 1958-07-06”, “place of birth = Sleaford,
Lincolnshire, England”. The full instructions used for this experiment can be found in Appendix
A.
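The human-friendly rendering described above can be sketched as follows, assuming the rdfs:label strings have already been fetched into a dictionary (function and variable names are illustrative):

```python
def render_triples(triples, labels):
    """Print 'predicate = value' lines in a human-friendly format.

    triples: list of (predicate_uri, value) pairs for one entity.
    labels: dict mapping URIs to their rdfs:label strings.
    """
    lines = []
    for predicate, value in triples:
        pred_label = labels.get(predicate, predicate)
        # URIs are replaced by their label; literals are printed as-is
        value_text = labels.get(value, value)
        lines.append(f'{pred_label} = {value_text}')
    return '\n'.join(lines)
```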
Triples given to Panel A were selected from the same ones identified by the LOD-DEF system as
pools. Triples were then curated and filtered by hand to further remove redundancy and to
randomise the order in which they are presented. As I have already pointed out, much factual
information is encoded in Wikipedia categories, and thus in the names of YAGO classes. For this
reason, only one class is included in the triples: the one I manually picked as, intuitively and
subjectively, the most representative of the available rdf:type triples.
I avoided giving the subjects examples of what kind of output was expected, thus taking care not
to prime them. I did include an example of generating from one triple as a warning to avoid
including extraneous information.
7.4 LOD-DEF generation
For the training of the LOD-DEF system, a separate model collection was trained on a maximum
of 30 entities of each of the manually chosen classes (see Table 7.1). These entities were taken
in the order returned by the SPARQL endpoint, provided articles on the Simple English Wikipedia
were available for them.
7.5 Human rating
Subjects were asked to complete an online survey. For this survey, the same 12 entities (Table
7.1) were described by the three systems, which produced 36 short texts, rated by 25 subjects.
The participants self-identified as having an upper-intermediate or above level of English.
The texts were presented to the subjects in pseudo-random order, to avoid texts about the same
entity occurring within a page of each other (four texts were presented on every page). This
avoided direct side-by-side comparison. Each subject was asked to rate every text on a scale of
1 (lowest) to 5 (highest) on the following three criteria, adapted from the DUC 2007
criteria22: grammaticality, non-redundancy and structure and coherence. For the full
description of these criteria and the instructions given to the subjects, see Appendix B.
It was not disclosed to the subjects until the end of the experiment that humans generated the
texts of one of the systems being tested.
7.6 Results
An exploratory analysis of the data collected showed clear differences between the mean
ratings of the three systems (Table 7.2). To establish the significance of these differences I
conducted a one-way ANOVA (as opposed to paired-rank t-tests, to adjust for the multiple comparisons
made) for each of the three criteria the texts were rated on. All three ANOVAs were statistically
significant: for grammaticality (F(2,72)=119.001, p < 0.001), for non-redundancy
(F(2,72)=129.053, p < 0.001) and for structure and coherence (F(2,72)=129.053, p < 0.001). I
conducted Tukey’s Post-Hoc test to establish which comparisons were significant for each; Table
7.3, Table 7.4 and Table 7.5 show the differences in mean and the results of the Tukey tests.
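For reference, the one-way ANOVA F statistic underlying these tests can be computed directly from the per-system ratings. The sketch below shows the standard computation in pure Python; in practice a statistics package would be used, which also supplies p-values and the Tukey post-hoc test:

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over k groups of ratings."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    # between-group sum of squares (k - 1 degrees of freedom)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # within-group sum of squares (n - k degrees of freedom)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```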
For the ratings on structure and coherence, three outliers were found to affect the normality of
the distribution for System C. With the outliers removed, the assumption of normality held
(as reported by the Shapiro-Wilk test), and the ANOVA and Tukey tests were run both with and
without them. The same main effects were found in both models; I therefore report the main
effects with the outliers included.
22 http://www-nlpir.nist.gov/projects/duc/duc2007/quality-questions.txt

System         Grammaticality   Non-redundancy   Structure and coherence
A (baseline)   2.29             1.89             1.95
B (LOD-DEF)    2.58             3.03             2.70
C (humans)     4.48             4.66             4.49
Table 7.2 Means
Baseline vs. LOD-DEF   Grammaticality   Non-redundancy   Structure and coherence
Difference             0.29             1.14             0.75
Significance           p = 0.151        p < 0.001        p < 0.001
Significant            No               Yes              Yes
Table 7.3 Differences and significance
LOD-DEF vs. Humans   Grammaticality   Non-redundancy   Structure and coherence
Difference           1.90             1.63             1.79
Significance         p < 0.001        p < 0.001        p < 0.001
Significant          Yes              Yes              Yes
Table 7.4 Differences and significance
Humans vs. Baseline   Grammaticality   Non-redundancy   Structure and coherence
Difference            2.19             2.77             2.54
Significance          p < 0.001        p < 0.001        p < 0.001
Significant           Yes              Yes              Yes
Table 7.5 Differences and significance
7.7 Discussion
As expected, expert human generation is an upper bound in this evaluation, being
consistently superior to the other two systems tested. LOD-DEF does not improve on the
perception of grammaticality of the baseline, but it does significantly outperform the baseline on
non-redundancy and structure and coherence.
The difference between the average score of humans and the baseline is at its lowest for
grammaticality, as the output of both the baseline and LOD-DEF was judged surprisingly high
on grammaticality. LOD-DEF scored very slightly higher (a difference of 0.29 on means) but this
is not statistically significant (p = 0.151). The most significant improvement of LOD-DEF over
the baseline is on the non-redundancy metric, with a difference of 1.14.
The fact that, in spite of the simple approach taken and the many errors in output (as discussed
in the previous chapter), LOD-DEF still significantly outperforms the baseline on both non-
redundancy and structure and coherence is very encouraging. These results suggest that
automatic training of NLG systems is a promising approach that should be pursued further.
Chapter 8
Conclusion and future work
8.1 Conclusion
This project has focussed on describing, implementing and testing a trainable shallow Natural
Language Generation system for factual Linked Open Data based on the extraction of sentence
templates and document planning via content n-grams.
The main contributions of this work are:
• Describing a full architecture for this system, including both the training and generation
stages. To my knowledge, this system as a whole represents a new approach to trainable
NLG, never before attempted in its entirety.
• Building a baseline implementation of this architecture, the LOD-DEF system, and both
evaluating its performance at template extraction and class selection, and conducting
human evaluation of this system against a baseline and human-generated output.
• Showing that even an exceedingly simple system such as LOD-DEF is rated
significantly higher than the baseline in human evaluation. In essence, this project shows
that this approach is a promising one and that it should be pursued further.
As per the criteria outlined in Chapter 3, it is clear that much could be improved. Most
importantly, much more work on template extraction needs to be done. With little extra effort,
the system could easily improve on its current performance of 58% of extracted templates that
are both transferable and grammatical.
This project was met with a measure of success. However, were I to start again now, I would
approach it in different ways. First, I would perhaps focus on one of the many problems I have
tackled, e.g. the class selection algorithm, and investigate it more thoroughly. Second, while
building this whole system from the ground up was a highly instructive experience, I would
strive to use or adapt an existing architecture.
This three-month project started life as a PhD research proposal. It is immediately apparent that
this is but one twelfth of that initial project, and many interesting lines of research had to be
abandoned for lack of time and experience. Sourced both from the original proposal and from the
findings of this project, in the next section I offer some directions for future work.
8.2 Future work
First, the implementation described here is but a baseline. As I have already suggested, more
robust systems exist for every main module of this application: Named Entity Recognition,
parsing, coreference resolution, etc.
Within the same shallow approach, substituting these modules in the pipeline would surely help
improve the results, as the analyses of errors in Chapter 6 show. For the NER task, using an
established system like DBpedia Spotlight (Mendes & Jakob, 2011) or OpenCalais (Butuc, 2009)
would be a first step, which would also allow integrating inference in the selection of the data to
be spotted in text.
Also, the architecture implemented for this project duplicates readily-available general-purpose
architectures, of which the General Architecture for Text Engineering (GATE) (Cunningham et
al. 1996) is a prototypical example.
But beyond these improvements on the shallow approach, the crucial steps involve moving
towards deeper natural language understanding and with it to deeper generation. The most
sophisticated approaches to document planning use another level of abstraction from text:
discourse relations. The original aim was to automatically extract these relations, to which there
exist a number of approaches (e.g. Soricut & Marcu, 2003).
Whether we think of the rhetorical relations in a text as a tree (as in Rhetorical Structure
Theory – Mann & Thompson, 1988) or as a graph (e.g. Segmented Discourse Representation
Theory – Asher & Lascarides, 2003), it is clear that the structure and coherence of a text are
more than just a succession of properties and their values.
This move must probably be accompanied by an application of techniques of relation extraction,
ideally informed by a deeper understanding of the argument structure of predicates in natural
language text, that is, what arguments verbs take and their thematic roles. The FrameNet and
VerbNet projects, coupled with WordNet are likely to play a role in this (Shi & Mihalcea, 2005).
These steps would allow us to move towards the automatic learning of rules for deeper
generation. A number of more general-purpose NLG architectures exist (e.g. NaturalOWL as
described before, but also others not specifically targeted to the Semantic Web like OpenCCG –
White, 2008).
With better identification of the relations between the spotted entities in text and an
understanding of the rhetorical relations between sentences, we could extract full document
planning and aggregation rules, which could be converted for use by one of those systems.
Finally, an interesting problem for Named Entity Recognition is that of vagueness (Klein &
Rovatsos, 2011), when dealing, for instance, with large numbers. For example, the population of
a country is a figure in the millions, which is often reported in text as “about 30 million people”,
but has an exact value in the data (e.g. 27,543,216).
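One possible rendering policy, rounding to one significant figure in the millions, would turn such an exact figure into the vague expression seen in text. This policy is an illustration, not a proposal from the cited work:

```python
def vague_number(n):
    """Render a large exact figure vaguely, e.g. 27543216 -> 'about 30 million'."""
    if n >= 1_000_000:
        millions = n / 1_000_000
        if millions >= 10:
            millions = round(millions / 10) * 10   # nearest ten million
        else:
            millions = round(millions)             # nearest million
        return f'about {millions} million'
    return f'{n:,}'                                # smaller figures stay exact
```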
In brief, the approach presented herein does no more than scratch the surface.
Appendix A: Human generation
The following text was given to each of the two subjects. It is followed by the first data for
generation as an example.
Description generation
In this experiment you are required to write short descriptions based on some given information.
You will find a block of information at the beginning of each page. This information consists of
facts about a real-world entity (a person, a country, etc.) You might have never heard of that
person or thing but this does not matter.
Your task is to write a short description of this entity based on the information at the beginning
of the page. You are writing for a general audience with no previous knowledge about the entity
you are describing (but with knowledge of other entities of the same type, e.g. countries) and you
want to get all this information across (think of an article on the English Wikipedia).
Use the blank area of each page to write your text. Feel free to copy and paste names and other
chunks of text.
Please do not use any other information resources for this (i.e. don't look it up on Google or
Wikipedia until you have finished the experiment). It is essential you write the text that seems
more natural to you from the given information only.
Please do not include any value judgements (e.g. “the best”, “one of the greatest”, “a very
famous”, “the most important”) unless these are present in the information provided.
The information is in random order. You should report it in the order that seems more logical to
you in a description.
You can use any format you prefer for dates, numbers and other amounts. You can use any
grammatical construction and vocabulary.
It is very important that you include in your text no other information and that you use all the
information that you can infer from what is given (that is relevant).
Example:
From this information:
name = John
date of death = 1666-02-01
You could write:
1. John died in 1666.
2. John died accompanied by his wife and 3 pigs, in a barge that was pushed blazing into
Dunsapie Loch.
3. John died on 1st Feb 1666.
(1) is bad because it omits date information.
(2) is bad because it adds extraneous information.
(3) is good.
[new page]
Information:
Category: English comedians
active = 1981,
spouse = Adrian Edmondson,
description = British comedienne,
place of birth = Sleaford, Lincolnshire, England,
birth name = Jennifer Jane Saunders,
spouse = Adrian Edmondson,
birth date = 1958-07-06,
notable work = Various in French & Saunders, Edina Monsoon in Absolutely Fabulous, Fairy
Godmother in Shrek 2,
name = Jennifer Saunders,
Text (write text here):
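The task illustrated by answer (3) in the example above, mapping attribute–value pairs to a single sentence, amounts to filling a sentence template with data. A minimal sketch of that idea (the template string and function are my own illustration, not part of either system):

```python
from datetime import date

def fill_template(template: str, data: dict) -> str:
    """Instantiate a sentence template with attribute-value data."""
    return template.format(**data)

# Hypothetical template mirroring answer (3) in the example above.
info = {"name": "John", "date_of_death": date(1666, 2, 1)}
sentence = fill_template("{name} died on {date_of_death:%d %b %Y}.", info)
print(sentence)  # → John died on 01 Feb 1666.
```

The date format spec is handed to `strftime` via `str.format`, which is why the same template can render any date value without extra code.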
Appendix B: Human evaluation
Hello! This is an evaluation questionnaire that compares output from 3 Natural Language
Generation systems, that is, software that takes data and outputs text in English. Two of these I
have created myself.
There are 36 very short text snippets in this questionnaire, generated directly from information
on the Semantic Web. It should take you about 25 minutes.
All you need to do is rate each text from 1 (very poor) to 5 (very good) on these measures:
Grammaticality
Non-redundancy
Structure and Coherence.
Don't worry, these are all explained at the top of each page.
NOTE: I assume that your level of English is upper-intermediate or above.
Let's go!
These are the criteria for rating the texts; please take a moment to read them. They appear
again at the beginning of every page.
Grammaticality
The text should have no system-internal formatting, capitalization errors or obviously
ungrammatical sentences (e.g., fragments, missing components) that make the text difficult to
read.
Non-redundancy
There should be no unnecessary repetition in the text. Unnecessary repetition might take the
form of whole sentences that are repeated, or repeated facts, or the repeated use of a noun or
noun phrase (e.g., "Bill Clinton") when a pronoun ("he") would suffice.
Structure and Coherence
The text should be well-structured and well-organized.
1. Very Poor
2. Poor
3. Barely Acceptable
4. Good
5. Very Good
Full text generated by systems B and C, and example of A
System A (baseline)
1 Jennifer Saunders is an English television actor. Her birth date is 6 July 1958. Her description is British comedienne.
Her spouse is Adrian Edmondson. Her genres are Comedy and Parody. Her caption is Saunders in November 2008.
Her birth name is Jennifer Jane Saunders. Her wordnet type is synset-actor-noun-1. Her nationality is British
people. Her medium is Television, film. Her short description is British comedienne. Her place of birth is Sleaford,
Lincolnshire, England. She is a primary topic of : Jennifer Saunders. Her page is Jennifer Saunders. Her notable
works are Edina Monsoon in Absolutely Fabulous, Various in French & Saunders and Fairy Godmother in Shrek 2.
Her name is Jennifer Saunders. She has a photo collection : Jennifer Saunders. Her label is Jennifer Saunders. Her
given name is Jennifer. Her surname is Saunders. Her birth places are Sleaford and Lincolnshire.
4 Morocco is an Arab LeaguE member state. Its cctld is .ma. Its geometry is POINT(-6.85 34.0333). Its area total
(km2)s are 446550.0 and 446739.2791875256. Its sovereignty types are Monarchy and Independence. Its demonym is
Moroccan. Its time zone is Western European Time. Its lats are 34.0333 and 32.0. Its established events are from
France, Alaouite dynasty, Mauretania and from Spain. Its percentage of area water is 250.0. Its leader names are
Abdelillah Benkirane, Abdelilah Benkirane and Mohammed VI of Morocco. Its points are 34.03333333333333 -6.85
and 32.0 -6.0. Its gdp ppp is 1.62617E11. Its image maps are 29.0 and Morocco on the globe .svg. Its official
languagess are Berber, Arabic and Arabic language. Its government types are Parliamentary system, Constitutional
Monarchy and Unitary state. Its leader titles are Prime Minister of Morocco, List of heads of government of
Morocco, List of rulers of Morocco and King of Morocco. Its currency is Moroccan dirham. Its conventional long
name is Kingdom of Morocco. Its legislature is Parliament of Morocco. Its national anthem is "Cherifian Anthem".
Its longs are -6.85 and -6.0. Its percent waters are 250.0. Its languages type is Native languages. Its time zone dst is
Western European Summer Time. It has a photo collection : Morocco. Its page is Morocco. Its north is
Mediterranean Sea. Its anthem is Cherifian Anthem. Its homepage is http://www.maroc.ma/PortailInst/An/. Its
longname is Kingdom of Morocco. Its established dates are 7 April 1956, 1666, 2 March 1956 and 110. Its founding
dates are 7 April 1956 and 2 March 1956. Its capital is Rabat. Its largest city is Casablanca. Its lower house is
Assembly of Representatives of Morocco. Its ethnic groups are North African Arabs, Berber people and Berber
Jews. Its gdp nominal is 9.9241E10. It is a primary topic of : Morocco. Its gdp ppp per capita is 5052.0. Its longew is
W. Its drives on is right. Its common name is Morocco. Its languages is Berber, Moroccan Arabic, Hassaniya.. Its
southwest is Atlantic Ocean. Its area footnote is or 710,850 km2. Its population density (/sqkm)s are 71.622 and
71.6. Its official languages are Arabic language and Berber languages. Its latns is N. Its gdp nominal per capita is
3083.0. Its calling codes are Telephone numbers in Morocco and %2B212. Its hdi category is medium. Its northeast
is Mediterranean Sea. Its label is Morocco. Its languages are Hassaniya language, Moroccan Arabic, Arabic language
and Berber languages. Its titles are Languages, Geographic locale and International membership. Its currency code
is MAD. Its national motto is "God, Homeland, King". Its mottoes are "God, Homeland, King", (Berber) and
(Arabic). Its northwest is Atlantic Ocean. Its name is Morocco. Its west is Atlantic Ocean. Its upper house is
Assembly of Councillors.
5 Woody Woodpecker is a Fictional anthropomorphic character. His last appearance is The New Woody Woodpecker
Show. His portrayers are Kent Rogers, Grace Stafford, Ben Hardaway, Daniel Webb, Mel Blanc, Cherry Davis,
Danny Webb and Billy West. His families are Splinter and Knothead and Scrooge Woodpecker. His creators are
Walter Lantz, Ben Hardaway and Alex Lovy. His significantother is Winnie Woodpecker. His caption is 1951.0. He
is a primary topic of : Woody Woodpecker. His species is Woodpecker. His last is I Know What You Did Last
Night. His first is Knock Knock. His gender is Male. He has a photo collection : Woody Woodpecker. His labels are
Woody Woodpecker. His page is Woody Woodpecker. His first appearance is Knock Knock (1940 cartoon). His
name is Woody Woodpecker. His homepage is www.woodywoodpecker.com.
System B (LOD-DEF)
1 Jennifer Saunders (6 July 1958, Sleaford and Lincolnshire) is a British comedienne. Her spouse is Adrian
Edmondson. Her active is 1981. Her place of birth is Sleaford, Lincolnshire, England. Her notable works are Edina
Monsoon in Absolutely Fabulous, Various in French & Saunders and Fairy Godmother in Shrek 2. Her nationality
is British people.
2 Hyundai Motor Company is a Car manuFACturer. Hyundai Motor Company started on 29 December 1967.
Although Hyundai Motor Company started in Public. Its parent company is Hyundai Motor Group. Its founded by
is Hyundai Motor Company. Its location country is South Korea. Its subsid is Hyundai Motor India Limited. Its
location cities are Seoul. Its key people is Chung Mong-koo. Its products is Automobiles, commercial vehicles,
engines. Its key person is Chung Mong-koo. Its products are Commercial vehicle and Internal combustion engine. Its
production is 2943529.
3 Nicole Scherzinger (born 29 June 1978) is an American female singer. Scherzinger worked in Hawaii and Honolulu.
Her labels are Polydor Records, Interscope Records and A&M Records. Her associated musical artists are Days of
the New, Pussycat Dolls and Eden's Crush. Her titles are "Jai Ho! ", "Poison" and Dancing with the Stars (US)
winner. Her alternative names is Kea, Nicole. Her befores are Donny Osmond and Kym Johnson. Her years is
Season 10.
4 Morocco (called as Kingdom of Morocco) is an African country. Casablanca of Morocco is Rabat. Morocco's leader
names are Abdelillah Benkirane, Abdelilah Benkirane and Mohammed VI of Morocco. Its west is Atlantic Ocean.
Its official languagess are Berber, Arabic and Arabic language. Its northeast is Mediterranean Sea. Its demonym is
Moroccan. Its founding dates are 7 April 1956 and 2 March 1956. Its established events are from France, Alaouite
dynasty, Mauretania and from Spain. It is an African country. Its demonym is Moroccan. Its established dates are 7
April 1956, 1666, 2 March 1956 and 110. Its leader titles are King and Prime Minister. Its largest city is Casablanca.
5 Woody Woodpecker is a Fictional anthropomorphic character created by Walter Lantz, Ben Hardaway and Alex
Lovy. His species is Woodpecker. His first is Knock Knock. His last is I Know What You Did Last Night. His first
appearance is Knock Knock (1940 cartoon). His significantother is Winnie Woodpecker.
6 Real Zaragoza's clubname is Real Zaragoza. Its nats are Italy, Portugal, Mexico, ESP, Serbia, ITA, Croatia, BRA,
Paraguay, Hungary, Argentina and Spain. Its league is La Liga. Its titles are Inter-Cities Fairs Cup, UEFA Cup
Winners' Cup and UEFA Cup Winners%27 Cup. Its founded is 1932. Its fullname is Real Zaragoza, S.A.D. It is a
Spanish football cluB.
7 Fernando Gago's teams are Real Madrid C.F., Boca Juniors and Argentina national football team. His clubss are
Real Madrid C.F. and Boca Juniors. His birth date is 10 April 1986. His playername is Fernando Gago. His
fullname is Fernando Rubén Gago. His currentclub is Real Madrid C.F. His dateofbirth is 10 April 1986. He is an
Argentina international footballer.
8 American Idol is a Creative Work run by the 19 Entertainment and FremantleMedia.Its presenters are Brian
Dunkleman and Ryan Seacrest. Its judges are Randy Jackson, Ryan Seacrest and Mariah Carey.
9 Mercyful Fate is an Speed metal musical group from, Denmark and Copenhagen. Their former band members are
Timi Hansen, Snowy Shaw and Michael Denner. Their associated musical artists are Fate (band), Arch Enemy,
Force of Evil (band), Spiritual Beggars, Memento Mori (band), King Diamond (band), Brats (band), Black Rose
(band) and Metallica. Their band members are Hank Shermann, Mike Wead, King Diamond and Sharlee D'Angelo.
Their record labels are Roadrunner Records, Combat Records, Rave On (record label) and Metal Blade Records.
Their labels are Roadrunner Records, Combat Records, Metal Blade Records and Rave On %28record label%29.
Their years active is 1981.
10 Friedrich Wilhelm Herschel (15 November 1738 in Holy Roman Empire, Hanover and Electorate of Brunswick-
Lüneburg – 25 August 1822 in England, Berkshire and Slough) was a German composer. His known fors are Uranus
and Infrared.
11 City of Belgrade is the populated place. Its is a part of : Belgrade%23 Municipalities. Its leader names are Party of
United Pensioners of Serbia, Dragan Đilas, Democratic Party (Serbia), Milan Krkobabić and Socialist Party of
Serbia. Its population demonym is Belgrader. Its official name is Belgrade. It is a populated place. Its native names
are Град Београд, Београд and Beograd.
12 Winx Club is a Nickelodeon, Rai Due, 4Kids TV, 4KidsTV and Rai 2 series made by Alfred R. Kahn, Norman J.
Grossfeld and Joanna Lee. Its director is Iginio Straffi. Its first aired is 28 January 2004. It is a Creative Work. Its
country is Italy.
System C (humans)
1 Jennifer Jane Saunders (Born 06/07/1958) is an English comedienne, originally from Sleaford, Lincolnshire. Jennifer
has been active as a comedienne since 1981 and a selection of her most notable roles include Edina Monsoon in
Absolutely Fabulous, the Fairy Godmother in Shrek 2, whilst also appearing in French and Saunders. Her spouse is
Adrian Edmondson.
2 Hyundai Motor Company is a South Korean company based in Seoul and is part of the Hyundai Motor Group. The
company was founded on the 29th of December, 1967 by Chung Ju-yung. They make various products ranging from
automobiles, commerical vehicles and internal combustion engines.
3 Nicole Prescovia Elikolani Valiente (also known as Nicole Scherzinger, Kea, Nicole) was born on the 29th of August
1978, in Honolulu, Hawaii, USA and is a singer from the noughties. She is associated with a variety of different acts
including Days of the New, Pussycat Dolls and Eden’s Crush. Record labels that she has been signed to include
A&M Records, Polydor Records and Interscope Records.
4 Morocco (or Kingdom of Morocco) is a country that is part of the continent of Africa. The capital city is Rabat and
the largest city is Casabalanca. The total area of Morocco is approximately 445739 Km2, with a hdi categorisation
of medium. The country has a population density of 186 people per square mile with the official population
demonym being Moroccan. It is geographically located with the Mediterranean Sea to the Northeast and the
Atlantic Ocean to the Southwest of the country. Arabic and Berber are the officially spoken languages, although
Hassaniva is also spoken. The present leaders in the country are King Mohammed VI and the Prime Minister is
Abdelilah Benkirane. The modern country was officially established in 1956 on the 7th of April, which marks the
independence from the Alouite Dynasty of France. Earlier reports of the country’s establishment relate to 1666, with
the event of Mauretania from Spain.
5 Woody Woodpecker is a fictional cartoon woodpecker. Created by Ben Hardaway, Walter Lantz and Alex Lovy, his
first appearance was in the 1940 cartoon ‘Knock Knock’. Since then various people have portrayed the character
including, Mel Blanc, Billy West, Kent Rogers, Ben Hardaway, Daniel Webb, Grace Stafford, Cherry Davis and
Danny Webb. The last appearance which Woody Woodpecker was featured in was ‘I Know What You Did Last
Night’. Other related characters include his significant other, Winnie Woodpecker.
6 Real Zaragoza are a Spanish football team, who play in the Spanish league La Liga. Their ground is La Romareda,
Aragon in Zaragoza. Founded in 1932, the club have won the Inter-Cities Fairs Cup and the UEFA cup Winners’
Cup. Players for the team come from a variety of different nations including Argentina, Italy, Hungary, Serbia,
Croatia, Paraguay, Mexico, Spain, Portugal and Brazil.
7 Fernando Rubén Gago, born on 10 April 1986, is an Argentine footballer. Gago currently plays for the club Real
Madrid C.F., as well as for the team Boca Juniors and the Argentine national football team; he has played in four
other clubs prior to joining Real Madrid. Gago has thus far scored no goals for his national team.
8 American Idol is a singing competition aired on television by the Fox Broadcasting Company, and produced by
FremantleMedia and 19 Entertainment. It was first aired on 11 June 2002. The programme is presented by Brian
Dunkleman and Ryan Seacrest, and the panel of judges is composed of Mariah Carey, Randy Jackson, Simon
Cowell, Steven Tyler, Ellen DeGeneres, Paula Abdul, Jennifer Lopez and Kara Dio Guardi. Its producers are Shane
Drake, Ken Warwick, Bruce Gowers, Nigel Lythgoe, Gregg Gelfand, John Pritchett and Andrew Scheer.
9 Mercyful Fate is a speed metal musical group from Copenhagen, Denmark. It has been associated with the bands
Metallica, Arch Enemy, King Diamond, Memento Mori, Brats, Black Rose, Force of Evil, Spiritual Beggars and
Fate. The band has been active since 1981. Its current members are King Diamond, Hank Shermann, Sharlee
D’Angelo, Mike Wead and Bjarne T. Holm; past members are Snowy Shaw, Michael Denner, Timi Hansen and Kim
Ruzz. It has released records on the labels Roadrunner Records, Combat Records, Metal Blade Records and Rave
On.
10 William Herschel (born Friedrich Wilhelm Herschel) was a German composer. He was born on 15 November 1738 in
Hanover, Electorate of Brunswick-Lüneburg, Holy Roman Empire. Herschel was known for the pieces Uranus and
Infrared. He died on 25 August 1822 in Slough, Berkshire, England.
11 Belgrade (officially the City of Belgrade, native name Beograd) is the capital of Serbia. It is a city with an area of
359.96 km2 and forms part of the Belgrade Municipalities. Its City Council is ruled by the Socialist Party of Serbia
and the Party of United Pensioners of Serbia; the current Mayor is Milan Krkobabić, and the Deputy Mayor is
Dragan Đilas. Belgrade was established prior to 279 BC. The population demonym of Belgrade is Belgrader.
12 Winx Club is an Italian animated television show aired in stereo on the networks 4Kids TV, Rai 2, and
Nickelodeon. It is directed by Iginio Straffi and released on 28 January 2004, and has so far run for 104 episodes
over four seasons. It is narrated by Joanna Lee, Alfred R. Kahn and Norman J. Grossfeld.
Order of the articles in the survey:
Page Text number Generated by system
1 1 C
1 3 B
1 6 B
1 4 A
2 7 C
2 5 A
2 2 C
2 8 B
3 10 A
3 4 B
3 9 C
3 11 C
4 12 A
4 2 B
4 3 A
4 5 B
5 10 C
5 1 A
5 8 C
5 12 B
6 3 C
6 9 B
6 2 A
6 7 B
7 11 A
7 6 C
7 8 A
7 1 B
8 12 C
8 11 B
8 9 A
8 4 C
9 10 B
9 5 C
9 7 A
9 6 A
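The ordering above is balanced: each system contributes 12 of the 36 snippets, and each of the 12 texts appears exactly once per system. A short sanity check on the design, with the (page, text number, system) rows transcribed from the table above:

```python
from collections import Counter

# (page, text number, system) rows transcribed from the table above.
order = [
    (1, 1, "C"), (1, 3, "B"), (1, 6, "B"), (1, 4, "A"),
    (2, 7, "C"), (2, 5, "A"), (2, 2, "C"), (2, 8, "B"),
    (3, 10, "A"), (3, 4, "B"), (3, 9, "C"), (3, 11, "C"),
    (4, 12, "A"), (4, 2, "B"), (4, 3, "A"), (4, 5, "B"),
    (5, 10, "C"), (5, 1, "A"), (5, 8, "C"), (5, 12, "B"),
    (6, 3, "C"), (6, 9, "B"), (6, 2, "A"), (6, 7, "B"),
    (7, 11, "A"), (7, 6, "C"), (7, 8, "A"), (7, 1, "B"),
    (8, 12, "C"), (8, 11, "B"), (8, 9, "A"), (8, 4, "C"),
    (9, 10, "B"), (9, 5, "C"), (9, 7, "A"), (9, 6, "A"),
]

assert len(order) == 36
# Each system contributes 12 snippets ...
assert Counter(s for _, _, s in order) == {"A": 12, "B": 12, "C": 12}
# ... and each of the 12 texts appears exactly once per system.
pairs = Counter((t, s) for _, t, s in order)
assert all(pairs[(t, s)] == 1 for t in range(1, 13) for s in "ABC")
print("balanced design confirmed")
```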
References
Androutsopoulos, I., Kokkinaki, V., Dimitromanolaki, A., Calder, J., Oberlander, J., Not, E.
(2001). Generating Multilingual Personalized Descriptions of Museum Exhibits – The M-
PIRO Project. Retrieved from http://arxiv.org/ftp/cs/papers/0110/0110057.pdf
Asher, N. & Lascarides, A. (2003). Logics of Conversation. Studies in Natural Language
Processing. Cambridge University Press.
Bechhofer, S., van Harmelen, F., Hendler, J., Horrocks, I., McGuinness, D.L.,
Patel-Schneider, P. & Stein, L.A. World Wide Web Consortium (W3C). (2004). OWL
Web Ontology Language Reference. Retrieved from http://www.w3.org/TR/owl-ref/
Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic web. Scientific American.
Retrieved from http://campus.fsu.edu/bbcswebdav/users/bstvilia/lis5916metadata/
readings/scientific-american_0.pdf
Berners-Lee, T. & Connolly, D. (W3C). (2011). Notation3 (N3): A readable RDF syntax.
Retrieved from http://www.w3.org/TeamSubmission/n3/
Bizer, C., Jentzsch, A. & Cyganiak, R. (2011). State of the LOD Cloud. Retrieved from
http://www4.wiwiss.fu-berlin.de/lodcloud/state/
Bontcheva, K., & Davis, B. (2009). Natural Language Generation from Ontologies. In J. Davies,
M. Grobelnik, & D. Mladenic (Eds.), Semantic Knowledge Management: Integrating
Ontology Management, Knowledge Discovery and Human Language Technology (pp. 113–
127). Springer.
Brickley, D., & Guha, R.V. (W3C). (2004). RDF Vocabulary Description Language 1.0: RDF
Schema. Retrieved from http://www.w3.org/TR/2004/REC-rdf-schema-20040210/
Brickley, D. & Miller, L. (2010). FOAF Vocabulary Specification 0.98. Retrieved from
http://xmlns.com/foaf/spec/20100809.html
Busemann, S., & Horacek, H. (1998). A Flexible Shallow Approach to Text Generation.
Computation and Language, arXiv preprint cs.CL/9812018. Retrieved from
http://arxiv.org/abs/cs.CL/9812018
Busemann, S. (2011). Shallow Text Generation. Retrieved from
http://www.coli.uni-saarland.de/courses/LT1/2011/slides/shallow-nlg-lecture_WS1112.pdf
Butuc, M.G. (2009). Semantically enriching content using OpenCalais. Retrieved from
www.eed.usv.ro/SistemeDistribuite/2009/Butuc1.pdf
Cohn, T., & Lapata, M. (2009). Sentence compression as tree transduction. Journal of Artificial
Intelligence …, 1–38. Retrieved from http://eprints.pascal-network.org/archive/00005887/
Cunningham, H., Wilks, Y. & Gaizauskas, R.J. (1996). Gate: a general architecture for text
engineering. Proceedings of the 16th conference on Computational linguistics-Volume 2,
pp. 1057--1060
Cyganiak, R. & Jentzsch, A. (2011). The Linking Open Data cloud diagram. Retrieved from
http://lod-cloud.net/
Decker, S., Van Harmelen, F., Broekstra, J., Erdmann, M., Fensel, D., Horrocks, I., Klein, M.,
Melnik, S. (2000). The Semantic Web: on the respective Roles of XML and RDF. Retrieved
from
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.6109&rep=rep1&type=pdf
Duboue, P. A., & Mckeown, K. R. (2003). Statistical acquisition of content selection rules for
natural language generation. Proceedings of the 2003 conference on Empirical methods in
natural language processing, pp. 121-128. Retrieved from
http://dl.acm.org/citation.cfm?id=1119371
Feldman, R. & Sanger, J. (2007). The Text Mining Handbook - Advanced Approaches in
Analyzing Unstructured Data. Cambridge University Press.
Filippova, K., & Strube, M. (2008). Dependency tree based sentence compression. Proceedings of
the Fifth International Natural Language Generation Conference on - INLG ’08, 25.
doi:10.3115/1708322.1708329
Gagnon, M., & Sylva, L. D. (2006). Text Compression by Syntactic Pruning. Advances in
Artificial Intelligence 312–323. Springer.
Galanis, D., & Androutsopoulos, I. (2007). Generating multilingual descriptions from
linguistically annotated OWL ontologies: the NaturalOWL system. Proceedings of the
Eleventh European Workshop on Natural Language Generation, 143–146. Retrieved from
http://dl.acm.org/citation.cfm?id=1610188
Galley, M., Fosler-Lussier, E., & Potamianos, A. (2001). Hybrid natural language generation for
spoken dialogue systems. In Proceedings of the 7th European Conference on Speech
Communication and Technology (Interspeech-Eurospeech). September 3-7, 2001. Aalborg,
Denmark.
Grice, H. P. (1975). Logic and Conversation. In P. Cole & J. L. Morgan (Eds.), Syntax and Semantics, Vol
3: Speech Acts (pp. 43–58). New York, NY: Academic Press.
Heath, T. & Bizer, C., (2011). Linked Data: Evolving the Web into a Global Data Space
(1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136.
Morgan & Claypool.
Hewlett, D., Kalyanpur, A., Kolovski, V., & Halaschek-Wiener, C. (2005). Effective NL paraphrasing
of ontologies on the Semantic Web. In Workshop on End-User Semantic Web Interaction,
4th Int. Semantic Web conference, Galway, Ireland. Retrieved from
http://www.mindswap.org/papers/nlpowl.pdf
Jurafsky, D. & Martin, J.H. (2009). Speech and language processing: An introduction to natural
language processing, computational linguistics, and speech recognition. Prentice Hall: New
Jersey.
Kasneci, G., Ramanath, M., Suchanek, F., & Weikum, G. (2008). The YAGO-NAGA Approach
to Knowledge Discovery. Retrieved from
http://dl.acm.org/citation.cfm?id=1519103.1519110
Klein, D. & Manning, C. (2003). Accurate Unlexicalized Parsing. Proceedings of the 41st
Meeting of the Association for Computational Linguistics, pp. 423-430.
Klein, E. and Rovatsos, M. (2011). Temporal vagueness, coordination and communication. In
Nouwen, R., Schmitz, H.-C., van Rooij, R., and Sauerland, U., editors, Vagueness in
Communication, LNCS. Springer.
Klyne, G., & Carroll, J. (W3C). (2002). Resource Description Framework (RDF): Concepts and
Abstract Data Model. Retrieved from http://www.w3.org/TR/2002/WD-rdf-concepts-
20020829/
Liang, S.F., Stevens, R., Scott, D. & Rector, A. (2012). OntoVerbal: a Protege plugin for
verbalising ontology classes. Proceedings of the Third International Conference on
Biomedical Ontology , (ICBO'2012), Graz, Austria.
Mann, W.C. & Thompson, S.A. (1988). Rhetorical structure theory: Toward a functional theory
of text organization. Text, 8(3), 243–281.
Mendes, P., & Jakob, M. (2011). DBpedia spotlight: shedding light on the web of documents.
Proceedings of the 7th …, 1–8. Retrieved from http://dl.acm.org/citation.cfm?id=2063519
Mendes, P., Jakob, M., & Bizer, C. (2012). DBpedia: A Multilingual Cross-Domain Knowledge
Base. Retrieved from http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/
research/publications/Mendes-Jakob-Bizer-DBpedia-LREC2012.pdf
Prud'hommeaux, E., & Seaborne, A. (W3C). (2008). SPARQL Query Language for RDF,
Retrieved from http://www.w3.org/TR/rdf-sparql-query/
Reiter, E., & Dale, R. (2000). Building Natural Language Generation systems. Cambridge
University Press.
Rosch, E.H. (1973). Natural categories. Cognitive Psychology 4 (3): 328–50. DOI:10.1016/0010-
0285(73)90017-0.
Sarawagi, S. (2008). Information Extraction. Foundations and Trends in Databases, 1(3), 261–377.
Shi, L. & Mihalcea, R. (2005). Putting Pieces Together: Combining FrameNet, VerbNet and
WordNet for Robust Semantic Parsing. Computational Linguistics and Intelligent Text
Processing. Lecture Notes in Computer Science, DOI: 10.1007/978-3-540-30586-6_9
Soricut, R., & Marcu, D. (2003). Sentence level discourse parsing using syntactic and lexical
information. Proceedings of the 2003 Conference of the North …, (June), 149–156.
Retrieved from http://dl.acm.org/citation.cfm?id=1073475
Sripada, S. G., Reiter, E., Hunter, J., & Yu, J. (2003). Generating English summaries of time
series data using the Gricean maxims. Proceedings of the ninth ACM SIGKDD
international conference on Knowledge discovery and data mining - KDD ’03, 187.
doi:10.1145/956755.956774
Stevens, R., Malone, J., Williams, S., Power, R., and Third, A., (2011). Automating generation of
textual class definitions from OWL to English. Retrieved from
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3102894/
Sun, X. & Mellish, C. (2007). An Experiment on “Free Generation” from Single RDF triples.
Retrieved from www.aclweb.org/anthology/W07/W07-2316.pdf
White, M., (2008). OpenCCG Realizer Manual. Documentation of the OpenCCG Realizer.
Retrieved from https://svn.kwarc.info/repos/lamapun/lib/LaMaPUn/External/Math-
CCG/docs/realizer-manual.pdf