Institut für Informatik der
Friedrich-Schiller-Universität Jena
An approach for semantic enrichment of social media resources for context dependent processing
Diplomarbeit zur Erlangung des akademischen Grades
Diplom-Informatiker
vorgelegt von
Oliver Schimratzki
betreut von
Birgitta König-Ries
Fedor Bakalov
January 26, 2010
Department of Computer Science at
Friedrich-Schiller-University Jena
An approach for semantic enrichment of social media resources for context dependent processing
Diploma Thesis submitted for the degree of
Diplom-Informatiker
submitted by
Oliver Schimratzki
supervised by
Birgitta König-Ries
Fedor Bakalov
January 26, 2010
Abstract
This diploma thesis provides the functional basis for information filtering in the domain
of complexity. It helps to create the domain-specific, adaptive portal CompleXys, which
filters blog entries and similar social media resources according to their relevance to a
specific context.
The first of the two required modules that are developed throughout this work is a se-
mantic enrichment module. Its purpose is to extract and provide semantic data for each
input document. This semantic data should be appropriate for a relevance decision with respect to the domain of complexity as well as for further usage in the filter module. It utilizes
various approaches to perform a multi-label text classification onto a fixed complexity
thesaurus.
The second implemented module is a content filter module. It provides a dynamic
system of filters, which forms an access interface to the document store. It uses the
previously extracted annotation and classification data to enable complex, semantically
based filter queries.
Though the total system performance will only be testable after the complete system
is implemented, this thesis also conducts a first proof-of-concept evaluation of the two
created modules. It investigates the classification quality of the semantic enrichment
module as well as the response time behavior of the content filter module.
Acknowledgements
This thesis is the result of my research and implementation work in a project of the
Heinz-Nixdorf Endowed Chair of Practical Computer Science at the Friedrich-Schiller
University of Jena. I have been really fortunate to get the possibility to finish my studies
within such a pleasant and interesting environment. For this chance I would like to give special
thanks to my two supervisors Birgitta König-Ries and Fedor Bakalov. Without them I
would have never been able to create this work.
Furthermore I would like to thank Adrian Knoth and again Fedor Bakalov, who contributed a lot to the basic project architecture upon which I built my work and who implemented the basic functions I relied on. Adrian has also been kind enough to provide a
database server for my work.
Additionally I am indebted to my thesis reviewers Fedor Bakalov, Birgitta König-Ries
and Gerald Albe. They all helped to improve the text with their various comments and
suggestions.
Yet another important source for this thesis were the developers of the tools I worked
with and the writers of the papers and books I cited. Among them, special mention should be given to the makers of GATE and KEA++, who were the most important
external supporters of my work.
Of course this page also has a place that is solely reserved for my parents, who set
my personal longtime record of twenty-six years nonstop support. I can hardly express
how grateful I am. Just...thank you!
Last but not least, I would like to give a huge thank-you to my beloved fiancée Monika Heyer
for her steady patience, encouragement and love. You are great. =o)
CHAPTER 1
Introduction
This chapter introduces the subject of this thesis, describes the task and clarifies the
further working procedure. First it introduces and motivates the general topic in the
Sections 1.1 and 1.2. Then it clarifies the objectives in Section 1.3 and sets the thesis
scope in Section 1.4. Finally it outlines the further chapters in Section 1.5.
1.1 Background
The world wide web is by far the greatest data repository mankind has created. But a majority of the information stored therein is incomprehensible when one lacks the semantic context it is stored in. Most people are able to manually reconstruct this context out of a text, but searching the web for complex information is often an incredibly time-consuming and hard task, even in times of elaborate search engines. Enhanced automation capabilities would therefore be a great achievement in the evolution of the www. Unfortunately, machines still perform far more poorly at text understanding than people do. A possibility to overcome this problem is to change the structure of the data itself and to explicitly provide the additional semantics that people normally add implicitly in their minds.
Tim Berners-Lee, who is credited with the invention of the world wide web, proposed
a corresponding concept already back in 1989 [4]. Back then he suggested not just mere
hyperlinks, but typed ones - "the web of relationships amongst named objects". These ideas resulted in the first HTML version [60], which contains the type element as well as
rel and rev attributes. Type is used to define the kind of relationship the source doc-
ument has towards another resource. The rel attribute is applicable to other HTML
elements and can be used to describe the appropriate type of semantic relationship to-
wards a second resource. Rev is the reverse - an adequate, passive version of the active rel attribute. Indeed, type became popular for defining structural references like the related stylesheet document for a website and linking the alternate printing version or RSS feed, but it never got widely established as a semantic informant, while rel and rev remained mostly unused. The semantic HTML elements were misused for presentational purposes for a long time, until the W3C CSS Level 1 Recommendation [38] in 1996 started the slowly progressing counterrevolution of strictly divided presentation and content. This development finally led to a rediscovery of semantic HTML in today's microformats movement, which will be further described in Subsection 3.1.3.
Figure 1.1: The Semantic Web layers1
However, these initial problems did not stop Tim Berners-Lee from pursuing his vision further and from publishing a Semantic Web Road Map in 1998 [59], which marks the starting point of the W3C's Semantic Web activities towards the machine-understandable
web. The Semantic Web layer diagram in Figure 1.1 shows the components that should finally achieve what the first attempt had not. It is easily perceptible that the whole
approach is based on the traditional URIs2 and the Resource Description Framework
RDF3, which are used to reference and describe resources in a standardized way. Fur-
thermore the ontology component is of special interest for the topic of this thesis, be-
cause ontologies can be used to describe a domain in a machine-understandable way.
1 Accessed on January 20, 2010: http://www.w3.org/2007/03/layerCake.png
2 http://tools.ietf.org/html/rfc3986
3 http://www.w3.org/RDF/
While the development and establishment of the Semantic Web is still the focus of many
researchers and organizations, the Web 2.0, another fundamental change to the usage
of the internet, seemingly overtook their efforts throughout the preceding decade. It
is often characterized as the step from a read-only (Web 1.0) environment to a read-
write ecology [56]. With blogs, social networks and wikis, today's web is no longer mostly a consumption medium, but an infrastructure for everyone to collaborate and publish. Among other things, this leads to such interesting usage approaches as crowdsourcing [28], which harnesses the collective knowledge and creativity of network users to produce outcomes that are competitive with those of task experts, but often notably less expensive.
Another important development in web related systems is context awareness. Among other things, web applications are nowadays often aware of the habits and interests of the accessing user and tailor the content and structure of the user interface individually to his special needs. For example, this approach is successfully applied by the Amazon4
and Sevenload5 recommendations to improve their respective portals.
1.2 Motivation
Given their characteristics, it is a reasonable step to combine Web 2.0, the Semantic Web
and context awareness in order to achieve an even more useful web environment.
While the Web2.0 emergence provides an enormous and steadily growing amount of
new information, Semantic Web and context awareness are powerful approaches to
efficiently utilize all this data for single users, without them getting lost in a state of
information overload. A contribution to such efforts can be made by exploring the
possibilities to use Semantic Web components for crowdsourcing and context aware
content systems. Accordingly, this thesis investigates the question of how semantic data and Semantic Web technologies can be applied to the task of utilizing resources that are freely published in scientific blogs or news sites as content for a fixed-domain, adaptive portal.
This could help to improve information portals in two ways. It assists in the process of
automatically picking potentially relevant social media resources out of a multitude of
distributed sources. Furthermore the automatic extraction and annotation of semantic
metadata can be used to estimate the resources’ usefulness for the users. By doing so
this information can be used to provide content recommendations and to dynamically
4 http://www.amazon.com
5 http://sevenload.com
adapt the portal in order to display the optimal content for each individual.
1.3 Objectives
The goal of this thesis is to investigate the applicability of semantic data that is automatically extracted from heterogeneous social media resources for various tasks in the environment of a domain specific, adaptive information portal. The first task is the binary decision whether a particular resource should be regarded as relevant for the field of a certain domain and hence be processed further. The second task is the categorization of the relevant resources into several main domain categories, which can, among other things, be used to organize the contents for intuitive browsing across several subpages. The third task is the assignment of resources to a domain set of finer-grained topical terms that can be used to outline their subject.
The result of the third task can also be used to match user interest information to the
available set of resources to identify suitable content recommendations. However, to do
so the user interest has to be recorded in a way that is comparable or equal to the same topical term set. Assuming that this is the case, the second goal of this thesis is to explore the possibilities to efficiently pick out resources that match certain, potentially complex conditions concerning their previously annotated semantic attributes.
The two described goals cannot be successfully accomplished without an underlying
set of domain relevant terms. Thus it is necessarily a third goal of this thesis to provide
a sufficient domain model, in order to perform a proof-of-concept and enable a proper
evaluation of the previous goals.
1.4 Scope Of Thesis
This section clarifies the scope of this thesis and therewith provides a statement of what
should be achieved and what not. The first Subsection 1.4.1 sets the thesis' work into the context of the CompleXys project, of which it is a component. The succeeding three subsections describe the scopes of the implementation units that emerge from the particular goals identified in Section 1.3. Subsection 1.4.2 provides the scope
of the Semantic Content Enrichment module, Subsection 1.4.3 that of the CompleXys
Domain Model and Subsection 1.4.4 is dedicated to the scope of the Semantic Filter
module.
1.4.1 Complex Systems Portal
This work is part of the CompleXys project, which intends to provide a domain specific,
adaptive portal for complexity. This portal should be able to provide complexity re-
lated social media resources chosen by context. To achieve this, it needs to collect the
resources from the internet, enrich them with semantic data, match them with a domain
model for classification, filter them based on the raised data as well as on the context
and finally display them to the user. This thesis contributes to the system by providing a
module for the semantic enrichment and classification of the collected social resources,
a complexity domain model as necessary basis for these tasks and a module for the
content filtering itself. The scope of the single elements will be detailed further in the
succeeding subsections.
1.4.2 Semantic Content Enrichment
The first module that should be provided aims to enrich social resources with semantic content and to classify them therewith. To achieve this, there is first a need to find and apply ways to analyze the resources and extract semantic data from them. This task involves complex subfields of natural language processing that have received extensive research over many decades, so it is reasonable to assume that a single part of this thesis is unlikely to be sufficient for outperforming the existing solutions. Thus the focus is
set on identifying and utilizing fitting state-of-the-art tools for the special requirements
of this module. Furthermore the module needs to be able to use the semantic data for
several classification tasks and to persist the extracted data in a usefully accessible way.
1.4.3 Complexity Domain Model
To provide a sufficient domain model, a set of complexity specific terms has to be collected and usefully structured. The model will be used as a basis for the classification and filter processes, so it has to be extensive and specific enough to successfully identify many texts out of the broad, interdisciplinary area of complexity. However, the creation of a comprehensive model is a very time-consuming task and beyond the scope of this thesis. So a good prototype will be enough, as long as the access interface to it is flexible and abstract enough to improve the model subsequently without problems. Accordingly, a suitable data structure for the representation of the model has to be found.
1.4.4 Semantic Filter
The second module should provide a method for content filtering, so that social re-
sources can be displayed according to a set of predefined filter criteria. The filter should
provide a possibility to express complex queries regarding the semantic attributes of
the resources and to efficiently access the subsets of resources, that match these queries.
The intention of this thesis is thereby not to provide an exhaustive set of imaginable
filters, but a flexible, freely extensible system as well as a basic set of useful filters for
example and presentation purposes.
1.5 Thesis Outline
This introductory chapter gave an insight into the general topic as well as the particular
research problem. It clarified the motivation, the objectives and the thesis scope.
Chapter 2 supplies a presentation of the CompleXys project and information about gen-
eral considerations, needs and design decisions. Section 2.1 identifies and formulates
the requirements of the system. Section 2.2 introduces the CompleXys architecture and
its several working steps.
Chapter 3 provides the background knowledge for the remainder of this work. There-
fore, Section 3.1 treats options for notating semantic data. Section 3.2 gives an overview of natural language processing as a fertile research field for semantic data extraction.
Chapter 4 presents tools and standards that proved to be useful within the
practical part of this work. Section 4.1 introduces SIOC as a standard for the metadata
of social media resources. Section 4.2 describes GATE, which is basically an architec-
ture and framework for language processing systems. Section 4.3 deals with the taxon-
omy standard SKOS. Section 4.4 pays attention to the elaborate keyphrase extraction package KEA. Finally, Section 4.5 gives an insight into the OpenCalais toolkit and web service, which is capable of enriching content with semantic data upon request.
Chapter 5 provides an overview of previous research in the field of information filtering. Various approaches to the task are presented and set into context with this thesis.
Chapter 6 discusses the Semantic Content Annotator module. Section 6.1 is devoted
to the CompleXys domain model and Section 6.2 explains the concept of the Semantic
Content Annotator pipeline and gives a detailed description of its implementations.
Chapter 7 provides an insight into the semantic filter module. Section 7.1 treats the
AbstractFilter concept and its implementations. Section 7.2 presents the output variants
for the filtered data, that have been implemented for presentation purposes.
Chapter 8 evaluates the two introduced modules. Section 8.1 examines the classifica-
tion quality of the Semantic Content Annotator module. Section 8.2 tests the runtime
performance of the Semantic Filter module. The evaluation results are summarized in
Section 8.3.
Finally Chapter 9 provides a summary of the thesis and considers possible future work,
that is based on the obtained results.
CHAPTER 2
CompleXys
The CompleXys project is the environment in which this thesis' work is embedded. This chapter introduces the system. To this end, a requirement analysis is performed in
Section 2.1 and a general architectural overview is given in Section 2.2.
2.1 Requirement Analysis
This section is dedicated to the requirements of CompleXys. To obtain these, it is helpful to first identify the relevant actors and respective use cases and only then deduce the actual requirements. Accordingly, Subsection 2.1.1 introduces the identified
actors. Subsection 2.1.2 is dedicated to the use cases, linking the actors to the system.
Subsection 2.1.3 specifies the performance requirements of the system and Subsection
2.1.4 the design constraints. This section is loosely oriented on standard requirement analyses, but due to the much smaller scope of this chapter it is of course rather shortened and abstracted in comparison to a full-sized software requirements analysis.
2.1.1 Actors
The actors are parties outside the system that interact with the system [10]. These par-
ties can be users or other systems, and they can be divided into consuming entities, which use the functionality of the system, and assisting entities, which help the system to achieve
its purpose. Five different actors could be identified for CompleXys:
• Information Consumers
• Information Providers
• Assisting Systems
• Administrators
• Developers
Information Consumers are the central clients of the system. They are the ones getting
value out of the system, whose main purpose indeed is to be a mediating software
layer between large resource sets and these same information consumers. They are
characterized by an interest focus on complexity related topics. Furthermore every
information consumer is supposed to have personal preferences and special interest
fields within this domain, so it is sensible to treat him as an individual. His interests
may change over time. It is expected that information consumers are average world
wide web users and have at least basic web browsing skills. The usage frequency can
vary from one-time uses to many times a day, depending on the individual information
needs, time and access possibilities. The number of information consumers could in the short term vary from very few to hundreds and may become notably higher in the long term.
Information Providers are the second most important actor class, because they provide
the resources that will be displayed to the information consumers. Potentially everyone on the internet could be an information provider, as long as he publishes content and allows agents to crawl his site. They do not necessarily care or even know that
their resources are processed within CompleXys. Thus there is no implicit control over
topic, quality, publishing frequency, size, form, language, media type and subsequent
modification of resources. Likely examples of information providers for CompleXys
are scientific bloggers or researchers who publish their papers freely on the internet.
Assisting systems are all those external systems that are utilized within CompleXys. They may serve various purposes that are beyond the scope of CompleXys. Up to now
this is only the OpenCalais web service, because the majority of the reused software is
applied internally, but this might change in the future. However, being dependent on
external systems always comes with the risk of externally caused outages or errors, so
new systems must be integrated with care.
Administrators are the entities that assist and support the running system. They are
responsible for generally maintaining the system. Furthermore they can manually add
information sources and resources to the resource set. The expertise level of the ad-
ministrators is naturally very high inside their respective working domain, but due to
specialization reasons it can not be assumed, that every administrator is able to main-
tain every part of the system. They should be promptly available whenever problems
with the system occur. Their number depends on the size of the system and the spe-
cialization level of the administrators. Special data administrators may manage the resources that should be harvested, and further administrators may be entrusted with
database, system and network management.
Developers are the only actor class that is not occupied with the running system, but with the code itself. They are responsible for evolving and extending CompleXys beyond
its initial version. Possible goals for these actors can be the elimination of flaws, new
functions or performance enhancement. They need to be skilled programmers.
2.1.2 Use Cases
The use cases are descriptions of how an actor uses a system to achieve a goal and
what the system does for the actor to achieve that goal. A use case tells the story of how the
system and its actors collaborate to deliver something of value for at least one of the
actors [6]. Use cases are strongly related to the functional requirements of a software
requirements analysis and are quite convenient for identifying the external interfaces by regarding their relationship to the actors. Due to the coarse grained perspective of this thesis onto the requirements and the fact that they tend to be much more detailed
than the corresponding use cases, this subsection will act as an abstract surrogate for
the functional requirements and external interface requirements subsections.
The following eleven use cases could be identified:
• get information recommendations
• search
• modify user interest manually
• get digest
• gather resources
• use assisting service
• manage source or resource list manually
• maintain system
• add feature
• identify users
• record user interest
Figure 2.1 visualizes how the particular use cases relate to the roles of the previously
identified actors towards the system.
Figure 2.1: The relationships between actors and use cases
"Get Information Recommendations" is one of the most essential features of the sys-
tem and a typical use case of the information consumer. More precisely the use case
involves the selection and dynamic display of resources depending on their estimated
level of interest to a particular information consumer. In order to perform successfully,
the use case assumes that several conditions are fulfilled. First, it is dependent on the use case "Identify Users", because a user has to be recognized before the system can make useful personal assumptions about him. Furthermore it is dependent on "Record User
Interest", because the system needs a possibility to store user interest representations,
for afterwards matching them to the available resources. And finally it is dependent
on the use case "Provide Information", because the displayed resources must obviously
be obtained in the first place. Additionally the resources have to actually match the
user interest. To avoid front-end error messages, the system has to behave sensibly whenever these assumptions are not met. There should be a possibility to display
resources in an unpersonalized way, if a user can not be identified, if no user inter-
est data has been raised yet or if there are simply no matching resources available for
the particular user. The use case involves the sequential steps "Load Stored User In-
terest", "Load Resources", "Match Resources to User Interest" and "Display Matching
Resources". Important demands are an acceptable response time, high recall and high
precision. These reappear in Subsection 2.1.3 and are discussed in more detail at that point of the text. Furthermore, the use case must be intuitively accessible for users with
the assumed average internet expertise level of the information consumer actors. This
involves the need of dynamically reflecting the probability of a resource to be interesting in display attributes like size, position and highlighting. To efficiently achieve
this the implicit relevance rating done in the step "Match Resources To User Interest"
should be expressed in relative probability values instead of binary decisions or an in-
teger sorted order. Because of its core importance to the system and its meaning for
information consumers as the central actor class "Get Information Recommendations"
can be rated with highest priority1.
"Search" is another important use case, that is related to information consumer. While
"Get Information Recommendations" is characterized by a passive information con-
sumption of the user, "Search" is the active querying of needed data. It is basically in-
dependent of all use cases other than "Gather Resources", but may be used as an information
source by the "Record User Interest" use case, when the user is additionally identified by
the "Identify User" use case. It involves the sequential steps "User Send Search Query",
"Search For Matching Resources" and "Display Search Results". Important demands are
good response times that do not seriously interrupt the users' browsing flow, and an
intuitively understandable and controllable user interface. The use case is rated with
high priority because, although it is not actually a core feature of CompleXys, the internet
user is highly accustomed to this function and is likely to insist on needing it.
"Manually Modify User Interest" is another use case related to the information con-
sumer. Its goal is to visualize the recorded interest model to the described user and
let him alter it as he wishes. This assists in improving system transparency and possibly also the system's value to the user, because it is capable of establishing a very up-to-date
and correct user interest model. This helps to smooth away three common flaws of in-
formation filtering systems. Firstly it helps to rapidly adapt the system to new interest
emphases of the user, secondly it helps to remove expired interests instantly and thirdly
it provides the possibility to correct erroneously added interest entries. Traditional sys-
tems may require quite a long time to autonomically adapt to the cases one and two,
because they usually require a certain amount of related behavior data. The third case
is worse, because the system may repeatedly draw the wrong conclusion and the un-
wanted topic may not even lose importance over time, when autonomic adaptation is the
only option to change user models. This use case is dependent on the "Identify Users"
use case, because the actual user must be recognized by the system in order to find and
visualize his user model for him and to persistently store changes for future usage. Fur-
thermore, the use case is dependent on the assumption, that users benefit from a more
accurate interest profile. This is true as long as use cases like "Get Information Recom-
mendation" apply the profiles to produce value for the user. The use case involves the
tic data. But the semantic enrichment can obviously not be done, without collecting
the semantic data in the first place. For that reason Section 3.2 gives an overview over
natural language processing as a fertile research field for semantic data extraction.
3.1 Notation of Semantic Data
Semantic data can be displayed and stored in various ways, depending on the quality
and quantity of the data, as well as on the kind of intended reuse. The three succeed-
ing subsections will introduce important notation possibilities. Subsection 3.1.1 will
deal with ontologies, Subsection 3.1.2 with annotations and Subsection 3.1.3 with mi-
croformats.
3.1.1 Ontologies
An ontology is a formal, explicit specification of a shared conceptualization [21]. It pro-
vides the needed syntax and semantics to describe relevant aspects of a domain in a
way others and especially machines can understand. This is achieved by determining
concepts and the relations between them. A tiny example ontology might be a concept
dogOwner, a concept dog and a relation owns that can connect both. Special properties
may add more information to a concept. For example, dog may need to have a dogTag property. Furthermore, axioms are defined to assign semantic information to those concepts and relations. Axioms are sets of logical terms, which can be used to describe facts like: every dogOwner needs to have at least one owns relation towards a dog.
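In description logic notation, such an axiom can be sketched as the concept inclusion
\[ \mathrm{dogOwner} \sqsubseteq \exists\, \mathrm{owns}.\mathrm{dog} \]
which states that every instance of dogOwner is linked by at least one owns relation to an instance of dog.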
Frequently used ontology modeling languages are nowadays the Web Ontology Lan-
guage OWL1, the Web Service Modeling Language WSML2 and the Simple Knowledge
Organization System SKOS3. The latter will be described in detail in Section 4.3, be-
cause it fits the requirements of this thesis best.
3.1.2 Semantic Annotations
The knowledge structure represented in ontologies is an important step towards a
working semantic web, but up to this point the data is still abstract and not yet con-
nected to the actual world wide web. Therefore, today's websites need additional metadata that describes their semantic meaning in a machine-understandable way. The pro-
cess of adding this metadata to a document is called semantic annotation.
There are basically three ways to link semantic data to a document [52]: Embedded,
internally and externally referenced annotations. Embedded annotations are directly
written into the HTML document. This can be done either by using an object or script element or by writing it into an HTML comment. Either way, it is not displayed by
common browsers, but can be parsed and used by any semantically based application.
The advantage of this possibility is that the semantics are always present and do not
need to be fetched in a second loading step. The disadvantage is that much semantic
data may result in confusing source code, and annotations in elements like script may
violate the code’s validation rules.
Internally referenced annotations refer to an external annotation storage from within
their code. This can be done in a link element with the rel attribute set to ’meta’ and
type attribute for instance set to ’application/rdf+xml’ in case of RDF based metadata
notation. References starting from object elements or anchors are also possible.
As a third option, the external metadata document can reference the annotated one.
To address special parts of the website XPointer or simple offset values may be used.
While the other possibilities expect direct write access to the source document, this
can be done by externals and can thereby be applied to a wide set of scenarios like
personalized annotations or social meta-tagging systems.
Beside the question of how annotations are linked to a document, it is also interesting who actually does it. Manual annotation is of course a valuable option. But it is probably
not sufficient, because even when incentives in combination with crowd sourcing prin-
ciples and useful annotation tools like SMORE [50], CREAM [22] and Annotea [33] may
accomplish a lot, the sheer mass of documents in the www is still likely to exceed what
can be achieved this way. Fortunately the field of natural language processing provides
promising approaches for automatic annotation. These will be discussed in Section 3.2.
3.1.3 Microformats
The web itself was originally conceptualized for managing semantic information as we
already stated in Section 1.1. The microformats idea is about utilizing this fact and pro-
ducing machine-understandable semantics just by providing special purpose notation
standards based on POSH, a recently created abbreviation for 'Plain Old Semantic HTML'. Beside predefined semantic elements like address for contact informa-
tion or blockquote for quotes, the class attribute is applicable to every element and can
be used to assign other semantic descriptors. But to be useful for machine processing
these descriptors need to follow a common convention, that can be parsed. There-
fore, the microformats community defines modular, open data format standards. The
schema in Figure 3.1 reflects the principles of coherence and reusability, that are central
to the microformats idea. Fine grained elemental microformats are always reused to
build up the more complex, compound microformats like hCalendar4 or hCard5.
Figure 3.1: The basic microformats schema6
A typical example for microformats is the hCard format, which describes identity and contact information of a person or organization in plain HTML.
4 http://microformats.org/wiki/hcalendar
5 http://microformats.org/wiki/hcard
6 Accessed on January 12, 2010: http://microformats.org/media/2008/micro-diagram.gif
It is obvious how naturally anyone with HTML skills can adopt this style. It is extremely simple, lightweight and pragmatic. It concentrates on modular, specific
topics and is quite human-readable. Furthermore it is self-contained because it is based
on embedded annotations and avoids language redundancy by reusing existing and
well-known HTML elements.
On the other hand microformats do not support URI identification of entities, which
leads to problems when trying to interoperate with the Semantic Web concepts around
the RDF-based W3C initiative. Additionally microformats do have a flat vocabulary
structure without namespaces, which may become problematic when different micro-
formats with equal class attributes are supposed to be combined on a single page. And
finally microformats are controlled by a small, closed community that standardizes ex-
isting and common formats. This approach makes it unlikely to ever provide dozens
of domain-specific formats, therefore "the long tail" [1] of the social web will probably
always stay excluded from this kind of semantics.
3.2 Natural Language Processing
Natural language processing (NLP) is an interdisciplinary research field, that resides
between linguistics and computer science and strongly interrelates with artificial intel-
ligence. It is concerned with the processing of natural language by computers. NLP
emerged originally from machine translation research in the middle of the twenti-
eth century [46]. Today’s applications involve useful tasks like spellcheckers, machine
translation, speech recognition and information extraction. In this particular thesis the
subproblem of text classification is a central task of the Semantic Content Annotator
module. Therefore, it will be discussed separately in Subsection 3.2.7.
There are various basic approaches to handle the problems of natural language process-
ing, which will be discussed in Subsection 3.2.1. The essential subfields of text analysis
are mostly derived from the linguistic language description layers. Namely they are
lexical, morphological, syntactic, semantic and pragmatic analysis [37]. These are the-
matized in the Subsections 3.2.2 to 3.2.6.
3.2.1 Approaches
The basic approaches for natural language processing can be broadly divided into sym-
bolic, statistical, connectionist and hybrid approaches [37]. The symbolic approach rests
on the usage of explicit knowledge representations like logic propositions, rule sets or
semantic networks for language analysis. It is based on the assumption that an exhaus-
tive formal representation of words, grammar rules, possible syntactic and semantic
word relations and other linguistic data must provide a machine with all necessary in-
formation to perform text processing. A given text shall thereby be stepwise analyzed
and transformed, until it is directly displayed in the intended machine-understandable
format.
The statistical approaches are to a widely varying degree based on mathematical statistics and often strongly related to machine learning techniques. The corresponding
methods make use of large sets of already worked out machine-understandable text
data. These data sets can for example be used to train naive Bayesian networks, which
thereon build up a statistical model. This model can afterwards be used to transform
unprocessed texts in the same way, that was shown in the training data.
Connectionist approaches are based on the idea of neural networks, that intelligence emerges from the parallel interaction of many single neuron units. These approaches combine symbolic knowledge representation with statistical methods. The knowledge is stored in the weights of the neuronal connections, but the network is trained like the statistical approaches, until it is capable of solving unprocessed cases itself.
Finally, hybrid approaches pay attention to the fact that all three preceding approaches have strengths and weaknesses and may be optimally used in combination, by utilizing them in those NLP subtasks whose individual requirements they fit best.
3.2.2 Lexical Analysis
Lexical analysis deals with text segmentation tasks. The central program of this analysis is called a tokenizer, and it divides the text into known token units like words, punctuation and numbers. The sentence splitter is responsible for the segmentation of the text into separate sentences. Another related tool is the part of speech tagger, which matches sentence parts to word classes like noun, verb and adjective. This is necessary to resolve ambiguities for the tokenizer.
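As a rough illustration of this step, a tokenizer can be sketched in a few lines of Java, using a regular expression to separate word, number and punctuation tokens. The class name and the pattern are purely illustrative and not taken from any particular NLP toolkit.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleTokenizer {

    // Words, numbers or single punctuation marks count as tokens.
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z]+|[0-9]+|[.,;:!?]");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }
}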
3.2.3 Morphological Analysis
Morphological analysis is concerned with word structure and the morphological pro-
cesses it results from. The goal is to normalize a word into a morphology independent
form. This is important to simply reduce the size and complexity of the underlying lex-
icons. It is easier to store morphology independent forms and a set of rules, expressing
how any word can be reduced to it, than to store every possible morphologically trans-
formed form and maybe even to add word heritage relations just to gain a comparable
expressiveness.
Morphology independent forms can be stems or lexemes. Stems are the remaining part
of a word, when all suffixes are cut off. For example the words ’category’, ’categorical’
and ’categories’ do share the same stem ’categ’. Lexemes on the other hand are basic
words like those one can find in a lexicon. For example the lexeme of words like ’took’,
’taken’ and ’taking’ would be ’take’. The latter is harder to implement but also more
expressive, because a stemmer would not be able to match 'took' to 'take' or 'better' to 'good', while a lexeme-based program will.
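A naive suffix-stripping stemmer can be sketched as follows. The suffix list is purely illustrative and chosen so that the three example words above end up with the shared stem 'categ'; real stemmers such as the Porter or Lovins algorithms use far richer rule sets.

import java.util.List;

public class NaiveStemmer {

    // Illustrative suffixes, tried repeatedly until none of them matches anymore.
    private static final List<String> SUFFIXES = List.of("ies", "y", "al", "ic", "or");

    public static String stem(String word) {
        String w = word.toLowerCase();
        boolean stripped = true;
        while (stripped) {
            stripped = false;
            for (String suffix : SUFFIXES) {
                // Only strip if a reasonably long stem remains.
                if (w.endsWith(suffix) && w.length() > suffix.length() + 3) {
                    w = w.substring(0, w.length() - suffix.length());
                    stripped = true;
                    break;
                }
            }
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("category"));    // categ
        System.out.println(stem("categorical")); // categ
        System.out.println(stem("categories"));  // categ
    }
}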
Part of speech taggers, which were already introduced in Subsection 3.2.2, are of importance for this
analysis layer too, because the way morphological processing is done relies strongly on
the affected part of speech type. Therefore, it is reasonable to use morphological data
for part of speech tagging and vice versa.
3.2.4 Syntactic Analysis
Syntactic analysis works with the syntax of sentences. It deals with word order and
phrase structure. Phrases are thereby word groups with a collective function in this
particular sentence. The word order in those phrases and the phrase distribution in
the sentence follow language inherent rules and carry information about grammatical
states. So syntactic analysis contributes to the extraction of central linguistic concepts
like sentence type, tense and morphologic case. Another application for this analysis
level is text parsing, which is the verification of a sentence by means of syntactic well-
formedness.
3.2.5 Semantic Analysis
Semantic analysis aims to perceive the meaning of text. It is generally divisible into lexical and compositional semantics. Lexical semantics deals with the semantics of single words or phrases. This may involve the classification into a relation network of similar synonyms, hierarchically related higher classed hypernyms or lower classed hyponyms, contrary antonyms and others. An important application of this analysis step is word sense disambiguation.
Compositional semantics deal with those semantics emerging from the composition of
words and phrases into bigger clusters like sentences or whole texts. An instance of this
is the semantic deduction, which is drawn from a reflexive pronoun referring to the
noun of a preceding sentence. The application of semantic analysis to the interrelation
between sentences of a text is called discourse analysis.
3.2.6 Pragmatic Analysis
Pragmatic analysis is responsible for the highest level of text understanding. The se-
mantic meaning is considered by its relation to a wide-ranging set of context, back-
ground knowledge and conventions to extract hidden inherent information like action
implications, speaker motivation, irony or citation. The ability to master this analysis level is probably a main obstacle for a machine to reliably pass the Turing test and may, according to Alan Turing [62], therefore count as equivalent to human intelligence.
However, highest does in no way suggest, that other levels are of lesser importance -
pragmatic understanding can not be achieved without profound preparatory work at
the preceding levels.
3.2.7 Text Classification
Text classification is a subfield of natural language processing. It determines whether a
text is a member of certain categories. Such categories may for instance refer to the text
genre or to topic domains. The latter categorization task can be labeled separately as
term assignment, but in this thesis we will include it under the term text classification
for reasons of clarity and simplicity. Generally, classification is useful for supporting
effective access to big amounts of information. Hence it is especially of great interest in
regard to the rapidly growing world wide web.
Text classification is based on a controlled vocabulary, which lists all permitted clas-
sification terms. The opposite is text clustering, which freely arranges document sets
according to shared words, phrases or even just shared relations to words. On the one
hand this free indexing strategy has the advantage, that it is domain independent and
more flexible towards unexpected inputs. Controlled indexing on the other hand pro-
vides better performance in its special domain and provides predictable output that is easier to work with on the application side. Furthermore, it can be used semantically more easily, by preparing a specialized semantic net for the anticipated outputs, and its consistency with human classification is higher.
Text classifiers usually consist of a knowledge extractor and a filter. The knowledge extractor creates class models containing sets of weighted features. These are mostly represented as word or letter n-grams and capture extracted text data like frequency counts, entropy and correlations. Each module can either work in a static way, which is usually symbolic and rule-based, or in a self-learning way, which involves training data and statistical or connectionist methods.
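For the controlled-vocabulary case, such a classifier can be sketched roughly as follows: each class model maps classification terms to weights, and every class whose accumulated score exceeds a threshold is assigned as a label. Class names, weights and the threshold are hypothetical and only serve to illustrate the extractor/filter split.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class VocabularyClassifier {

    // Per-class model: classification term -> weight, as produced by a knowledge extractor.
    private final Map<String, Map<String, Double>> classModels = new HashMap<>();
    private final double threshold;

    public VocabularyClassifier(double threshold) {
        this.threshold = threshold;
    }

    public void addClassModel(String className, Map<String, Double> weightedTerms) {
        classModels.put(className, weightedTerms);
    }

    // Multi-label decision: returns every class whose accumulated term weight exceeds the threshold.
    public List<String> classify(List<String> documentTokens) {
        List<String> labels = new ArrayList<>();
        for (Map.Entry<String, Map<String, Double>> model : classModels.entrySet()) {
            double score = 0.0;
            for (String token : documentTokens) {
                score += model.getValue().getOrDefault(token.toLowerCase(), 0.0);
            }
            if (score > threshold) {
                labels.add(model.getKey());
            }
        }
        return labels;
    }
}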
CHAPTER 4
Tools and Standards
This chapter describes important tools and standards, that were used during the thesis
work. Section 4.1 explains the purpose and features of the SIOC ontologies, which are
used by CompleXys’ Content Type Indexer to express social media specific metadata.
Section 4.2 surveys the GATE project, that is utilized as a basic framework for the Se-
mantic Content Annotator module and provides one of the implemented methods for
semantic extraction and annotation. Section 4.3 treats the taxonomy description lan-
guage SKOS, that serves as ontology description language for the CompleXys domain
taxonomy. Section 4.4 discusses the keyphrase extraction package KEA and its follow-
ups, on which another approach for semantic extraction is based on. Finally section
4.5 introduces the Calais initiative and its semantic annotation web service OpenCalais,
that was the third utilized semantic data source within the Semantic Content Annotator.
4.1 SIOC
The abbreviation SIOC [9] is short for Semantically-Interlinked Online Communities.
The initiative aims to bridge the gap between the social web and semantic web tech-
nologies. To achieve this it provides a series of ontologies, defining a description stan-
dard for the domain of online communities.
The ontology structure specifies different abstraction levels, that relate to each other.
For example Figure 4.1 presents the semantic net of the SIOC main classes. The abstract
items relate to a superordinate container, which in turn belongs to a certain space. In
case more details are known, the item may be more precisely described as a post and
the container as a forum, both located in a concrete site. A post may have replies, tags,
categories and a creator. The creator may be a member of a usergroup, have a function in the forum and may be further related to special person description ontologies like FOAF [11].
Generally speaking, these concepts enable people to describe and consolidate their identity
across the social web and possibly merge all the multiple accounts of today’s web life
into a coherent web identity. This is coupled with rapid access and processing capacity
for community related data and thereby with many interesting application options.
But on the other hand it may also lead to an increased potential of abusive data storage,
hence increasing the necessity for public awareness towards data parsimony.
Figure 4.1: The SIOC main classes in relation1
4.2 GATE
The abbreviation GATE [16] is short for a General Architecture for Text Engineering. It
is an infrastructure for language processing software development. The contained soft-
ware architecture defines a fundamental organization schema for NLP software based
on loosely coupled GATE layers. These can also be externally utilized, by accessing the
corresponding open source API set of the GATE Embedded framework, whose compo-
nents are visualized in Figure 4.2.
1 Accessed on January 25, 2010: http://wiki.sioc-project.org/images_sioc/f/f2/Sioc_spec_5_small.png
Figure 4.2: The APIs, which form the GATE architecture2
It is easily perceivable that GATE has a meticulous focus on clean level separation, dividing its APIs into an IDE GUI, Application, Processing, Language Resource, Corpus, Document Format, DataStore and Index layer as well as Web Services. The internal resources are structured in three categories. Basic data and language documents like lexicons, ontologies and corpora are termed Language Resources (LR). Algorithmic components like part of speech taggers, tokenizers and parsers are called Processing Resources (PR). Visualization and GUI related components are denoted as Visual Resources. This division obviously mirrors the Model-View-Controller architectural pattern. The combined set of these three resource types is collectively known as CREOLE, which is short for Collection of REusable Objects for Language Engineering.
Furthermore, GATE contains a graphical IDE, a ready-to-use data model for corpora and documents, discussed in Subsection 4.2.1, and an elaborate Information Extraction system called ANNIE, which will be discussed in detail in Subsection 4.2.2.
2 Accessed in January 2010: http://gate.ac.uk/sale/talks/gate-apis.png
4.2.1 Corpus Data Model
The corpus data model is used as document and annotation format for the Semantic
Content Annotator and Semantic Filter modules of CompleXys. It can be described by
the six essential data objects, whose relation network is visualized in Figure 4.3.
Figure 4.3: A data model diagram for GATE’s corpus layer
The corpus object is by definition a large, structured set of texts and therefore contains an arbitrarily large set of documents. Additionally, it is identified by a name and may con-
tain a FeatureMap, which lists descriptive features of an object in key-value pairs. The
documents possess the actual document content, a name, a source URL, a FeatureMap
and AnnotationSets. An AnnotationSet contains any number of annotations and an
identifying set name.
An Annotation has an id and a type, potentially connecting it to an ontology concept.
Further information can also be noted in an attached FeatureMap. The Annotation ob-
jects are implementations of the externally referenced semantic annotation approach that is discussed in Subsection 3.1.2. This means that the annotations are neither em-
bedded in the content itself nor even referred to from within the content, but point
to the respective text interval by simply externally describing a start node and an end
node offset. The format, which is a modified form of TIPSTER [20], is useful on the
one hand, because it cleanly divides content and semantic description and preserves
the original text. On the other hand even slight modifications of the text will result
in reference inconsistencies from annotation to content. Thus a more flexible reference
approach would be desirable.
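A minimal sketch of this data model in use, assuming the GATE Embedded library is on the classpath (in practice GATE may additionally require its home directory and plugins to be configured), could look like this. The annotated offsets, the annotation type "Mention" and the ontology URI placed in the FeatureMap are invented for illustration.

import gate.AnnotationSet;
import gate.Corpus;
import gate.Document;
import gate.Factory;
import gate.FeatureMap;
import gate.Gate;

public class CorpusModelExample {
    public static void main(String[] args) throws Exception {
        Gate.init(); // initialise the GATE Embedded framework

        // A named corpus containing one document with content.
        Corpus corpus = Factory.newCorpus("ComplexityCorpus");
        Document doc = Factory.newDocument("Chaos theory studies complex dynamical systems.");
        corpus.add(doc);

        // Annotate the text span "Chaos theory" (offsets 0..12) in the default AnnotationSet.
        FeatureMap features = Factory.newFeatureMap();
        features.put("ontology", "http://example.org/complexity#ChaosTheory"); // hypothetical concept URI
        AnnotationSet defaultSet = doc.getAnnotations();
        defaultSet.add(0L, 12L, "Mention", features);
    }
}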
4.2.2 ANNIE
GATE is delivered in a bundle with a ready set of Processing Resources for information extraction, named A Nearly-New Information Extraction system, or ANNIE for short. Its central processing resources are the tokenizer, sentence splitter, part of speech tagger, gazetteers and, for our purposes, the semantic tagger.
The tokenizer is responsible for dividing the text into known token units like words, punctuation and numbers, while the sentence splitter has to identify the sentences and split the text into them. The part of speech tagger categorizes tokens and token clusters as parts of speech like noun, verb or adjective. All three were already introduced as lexical analysis related tools in Subsection 3.2.2.
A gazetteer is, by word heritage, a geographical directory listing information about places. However, in the domain of NLP the term's meaning changed and now generally implies a set of wordlists, each referring to a certain category, for example lists of persons, cities or companies. Listed words need not exclusively be entity names, but can also be mere indicators, like Ltd. is for a company. Furthermore, a GATE gazetteer module provides lookup functionality to match text parts to words occurring in the respective list and to annotate them with the respective list category. This functionality can be implemented by finite state machines or hashtables.
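A hashtable-based variant of this lookup functionality can be sketched as follows. The list categories and entries are invented for illustration, and the sketch only matches single tokens rather than scanning running text.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashtableGazetteer {

    // Maps a (lower-cased) list entry to its list category, e.g. "ltd." -> "company_indicator".
    private final Map<String, String> entries = new HashMap<>();

    public void addList(String category, List<String> words) {
        for (String word : words) {
            entries.put(word.toLowerCase(), category);
        }
    }

    // Returns the category of the token if it occurs in any word list, otherwise null.
    public String lookup(String token) {
        return entries.get(token.toLowerCase());
    }

    public static void main(String[] args) {
        HashtableGazetteer gazetteer = new HashtableGazetteer();
        gazetteer.addList("city", List.of("Jena", "Berlin"));
        gazetteer.addList("company_indicator", List.of("Ltd.", "GmbH"));
        System.out.println(gazetteer.lookup("Jena")); // city
    }
}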
The semantic tagger builds upon the gazetteer principle, using JAPE rules to further
describe matching patterns and the resulting annotations. JAPE [17] is short for Java
Annotations Pattern Engine and it provides finite state transduction over annotations
based on regular expressions. Hereby it is possible to assign and process rules like "When a text part has already been tagged by the gazetteer with the name x, then
add a feature ontology referring to the corresponding ontology concept y.". In this way
gazetteers can be used to automatically assign semantic annotations to text. This is one
of the semantic extraction approaches, that are used in the Semantic Content Annotator.
Its application is explained in detail within Subsection 6.2.3.
4.3 SKOS
SKOS [31] is a W3C standard ontology description language and a particular imple-
mentation language for the ontology concept described in Subsection 3.1.1. The ab-
breviation SKOS is short for Simple Knowledge Organization System. It is based on
RDF and thereby natively integrated in the semantic web environment. Furthermore
it is a light-weight modeling language specialized in hierarchical data structures like thesauri, taxonomies and classification schemes.
SKOS concepts can possess three kinds of labels - prefLabel, altLabel and hiddenLabel. The prefLabel property defines the preferred label of the concept and altLabel defines al-
ternative labels, which is useful to assign synonyms, acronyms and abbreviations. A
hiddenLabel is a label, that can be used internally, for tasks like search operations and
text-based indexing, but should never be visibly displayed. A practical example might be common misspellings of actual labels. Every kind of label can optionally
have a language tag, that restricts the scope of a label to this particular language and
by doing this, enables an executing entity to preferably display the label in the native
language of the calling instance.
Figure 4.4: An exemplary SKOS taxonomy
SKOS allows three relation types - broader, narrower and related. Broader and narrower can
be used to build up a concept hierarchy as demonstrated in the example in Figure 4.4.
A relation broader from a concept tinyBooks towards a concept books would express that tinyBooks is a subconcept of books, and so every instance of tinyBooks is also an instance of books. A relation narrower from the concept books towards tinyBooks implic-
itly expresses the same. The relation related can be used to express a non-hierarchical
connection between two concepts. For example dog and dogOwner are in no way sub-
concepts of each other, but they are naturally related.
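The label and relation structure described above can be mirrored in a small, purely illustrative in-memory model. These are plain Java classes, not the API of any particular SKOS library.

import java.util.ArrayList;
import java.util.List;

public class SkosConcept {

    private final String prefLabel;
    private final List<String> altLabels = new ArrayList<>();
    private final List<SkosConcept> broader = new ArrayList<>();
    private final List<SkosConcept> related = new ArrayList<>();

    public SkosConcept(String prefLabel) { this.prefLabel = prefLabel; }

    public String getPrefLabel()              { return prefLabel; }
    public void addAltLabel(String label)     { altLabels.add(label); }
    public void addBroader(SkosConcept other) { broader.add(other); }
    public void addRelated(SkosConcept other) { related.add(other); }

    public static void main(String[] args) {
        // The books/tinyBooks example from Figure 4.4.
        SkosConcept books = new SkosConcept("books");
        SkosConcept tinyBooks = new SkosConcept("tinyBooks");
        tinyBooks.addAltLabel("small books"); // a synonym as an alternative label
        tinyBooks.addBroader(books);          // tinyBooks is a subconcept of books
    }
}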
4.4 KEA
The abbreviation KEA [63] is short for Keyphrase Extraction Algorithm. This piece of
Java-based open source software is supposed to analyze text documents to extract a set
of keywords or keyphrases, the latter being multi-word units. Keyphrases are widely used in corpora to briefly describe the content of single documents and to provide a basic
sort of semantic metadata, that can be reused by other processing tasks.
The task of assigning keyphrases to a document is called keyphrase indexing. Tradi-
tionally authors or special indexing experts have done this task manually, but with an
increasing amount of texts in digital libraries and the whole world wide web this ap-
proach is no longer sufficient. KEA provides a software-driven, free indexing approach
to automate this task.
Figure 4.5: The KEA algorithm diagram together with KEA++3
The diagram in Figure 4.5 visualizes the overall process of KEA. It can be divided into two basic subtasks. The first step, candidate extraction, is accordingly termed extract candidates within the schema. It is further described in Subsection 4.4.1. The second step is the filter process to extract those keyphrases that are most likely to be useful. This
3 Accessed January 6, 2010: http://www.nzdl.org/Kea/img/kea_diagram.gif
subtask essentially involves the schema entities compute features and compute entities when it actually works, but also includes compute model while in training mode. It will be detailed in Subsection 4.4.2. Finally, Subsection 4.4.3 will introduce the KEA advancement KEA++.
4.4.1 Candidate Extraction
Candidate Extraction is responsible for extracting candidate phrases out of a plain text
using lexical methods (see Subsection 3.2.2 for general information about lexical analysis). This subtask is again divisible into the three basic steps input cleaning, candidate
identification and normalization of phrase candidates.
Input cleaning normalizes the raw input text into a standardized format. To this end, the text is divided into tokens, using spaces and punctuation as splitting clues. The outcome is modified by separating out single or framing symbols like marks, brackets, numbers and apostrophes, as well as non-token characters and those tokens that do not contain any letters. Furthermore, hyphenated words are split into their parts.
Candidate identification is the task of considering all token sequences as possible phrases and finding the suitable candidates among them. KEA uses three conditions to assess suitability. The first condition requires that a candidate phrase is composed of a limited number of tokens; a maximum of three words has turned out to be a good length configuration. The second condition requires that proper names cannot be candidate phrases, and the third one that phrases cannot begin or end with a stopword. Stopwords are drawn from a word list containing word types like conjunctions, articles, particles, prepositions, pronouns, anomalous verbs, adjectives and adverbs, which are unlikely to begin or end a useful phrase.
The third task normalizes the identified phrase candidates by stemming and case-folding. Stemming is usually achieved by iteratively cutting off suffixes of the candidate until just the stem is left. Case-folding is simply done by a general lower-case conversion. Additionally, the tokens of multi-word phrases can be re-ordered, so that for instance Technical Supervisor and supervising technician both result in the normalized form supertech. This extracted form is called a pseudo-phrase. Besides, the most frequent original phrase of every pseudo-phrase is determined, so that it can be presented as the phrase label to human users.
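The following is a minimal, simplified sketch of such a normalization step. It is not KEA's actual implementation: the stemmer is reduced to naive suffix stripping and the stopword list is only a stub, both purely for illustration.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class PseudoPhrase {
    // Illustrative stub; KEA uses a much larger stopword list.
    static final Set<String> STOPWORDS = Set.of("the", "of", "and", "a", "an", "in");

    // Very naive stand-in for a real stemmer (KEA applies an iterated suffix stemmer).
    static String stem(String token) {
        for (String suffix : new String[] {"ician", "ical", "ising", "isor", "ing", "s"}) {
            if (token.length() > suffix.length() + 2 && token.endsWith(suffix)) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Case-folds, stems and re-orders the tokens of a candidate phrase.
    static String normalize(String phrase) {
        List<String> stems = new ArrayList<>();
        for (String token : phrase.toLowerCase().split("\\s+")) {
            if (!STOPWORDS.contains(token)) {
                stems.add(stem(token));
            }
        }
        Collections.sort(stems);           // token order no longer matters
        return String.join(" ", stems);    // the resulting pseudo-phrase
    }

    public static void main(String[] args) {
        // Both candidates collapse to the same pseudo-phrase with this toy stemmer.
        System.out.println(normalize("Technical Supervisor"));
        System.out.println(normalize("supervising technician"));
    }
}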
4.4.2 Filtering
The filtering task is responsible for choosing the most suitable keyphrases out of a given set of keyphrase candidates. To achieve this, the candidates have to be measured in a way that makes them comparable. Thereupon, it must be decided which candidates will be chosen as keyphrases. The three metrics applied for free indexing are TFxIDF, first occurrence and phrase length.
TFxIDF is a frequency metric that relates the phrase occurrence in a particular document to its occurrence frequency in all preceding documents. The idea behind this is that a phrase is more likely to be a keyphrase when it occurs often inside the respective document while being generally rare in the corpus average. Rareness relates in this case to unpredictability and thereby to a higher amount of information gain. For example, the fact that a document is related to chemistry is not very surprising inside a purely chemical corpus, but may be a useful descriptor when the document is part of a computer science corpus. The formula for TFxIDF is:
TFxIDF = \frac{freq(P, D)}{size(D)} \times \left( -\log_2 \frac{df(P)}{N} \right), where

1. TF = freq(P, D) / size(D) is the term frequency in the actual document,
2. IDF = -\log_2(df(P)/N) is the inverse document frequency, which measures the probability of a term to occur in a document of the corpus,
3. freq(P, D) is the number of times term P occurs in document D,
4. size(D) is the number of words in document D,
5. df(P) is the number of documents containing the term P in the global corpus,
6. N is the size of the global corpus.
The second feature metric is the relative position of first occurrence. It is calculated
with the following formula:
FO = \frac{prec(P, D)}{size(D)}, where

1. FO is the relative position of the first occurrence of the term,
2. prec(P, D) is the number of words in document D preceding the first occurrence of the term P,
3. size(D) is the total number of words in document D.
The third feature is the phrase length. This metric takes into account the observation that human indexing experts tend to choose two-word phrases instead of one- or three-word phrases. Therefore, it may be reasonable to weight these candidates higher.
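To make the three features concrete, the following minimal sketch computes them for a single candidate. The counts and corpus statistics are made-up values and this is not KEA's actual code.

public class CandidateFeatures {

    // TFxIDF = freq(P,D)/size(D) * -log2(df(P)/N)
    static double tfIdf(int freqInDoc, int docSize, int docFreq, int corpusSize) {
        double tf = (double) freqInDoc / docSize;
        double idf = -(Math.log((double) docFreq / corpusSize) / Math.log(2));
        return tf * idf;
    }

    // Relative position of the first occurrence: words before P divided by the document length.
    static double firstOccurrence(int wordsBeforeFirst, int docSize) {
        return (double) wordsBeforeFirst / docSize;
    }

    // Phrase length in tokens; human indexers tend to favour two-word phrases.
    static int phraseLength(String pseudoPhrase) {
        return pseudoPhrase.split("\\s+").length;
    }

    public static void main(String[] args) {
        // Hypothetical counts for the candidate "complex system" in one document.
        System.out.printf("TFxIDF = %.5f%n", tfIdf(12, 4000, 30, 1000));
        System.out.printf("FO     = %.5f%n", firstOccurrence(85, 4000));
        System.out.printf("length = %d%n", phraseLength("complex system"));
    }
}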
After the features have been derived, the selection itself must be performed. For this purpose, KEA applies a machine-learning algorithm based on WEKA [64] that learns how to valuate a phrase candidate and afterwards does so autonomously. Within the categorization of approaches introduced in Subsection 3.2.2, KEA uses a statistical approach, working with a naive Bayes classifier to build up the prediction model. Hence KEA needs a set of already annotated documents in the first place to train the model how to distinguish useful keyphrases among the candidates. Once the model is sufficiently trained, KEA is able to differentiate useful and useless keyphrases in unknown documents quite well.
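The sketch below shows how such a model could be trained and applied with WEKA's naive Bayes implementation, assuming a recent WEKA release. It is not the code KEA itself ships: the feature set is reduced to two numeric features and the training values are invented.

import java.util.ArrayList;
import java.util.Arrays;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;

public class KeyphraseModelSketch {

    public static void main(String[] args) throws Exception {
        // Feature vector: TFxIDF, first occurrence, and the class "is this a keyphrase?"
        ArrayList<Attribute> attrs = new ArrayList<>();
        attrs.add(new Attribute("tfidf"));
        attrs.add(new Attribute("firstOccurrence"));
        attrs.add(new Attribute("keyphrase", new ArrayList<>(Arrays.asList("no", "yes"))));

        Instances train = new Instances("candidates", attrs, 0);
        train.setClassIndex(2);

        // Two hypothetical training candidates taken from manually indexed documents.
        train.add(candidate(train, 0.031, 0.02, "yes"));
        train.add(candidate(train, 0.002, 0.71, "no"));

        NaiveBayes model = new NaiveBayes();
        model.buildClassifier(train);

        // Score an unseen candidate: probability of the class "yes".
        Instance unseen = candidate(train, 0.018, 0.10, null);
        double pKeyphrase = model.distributionForInstance(unseen)[1];
        System.out.printf("P(keyphrase) = %.3f%n", pKeyphrase);
    }

    static Instance candidate(Instances header, double tfidf, double fo, String label) {
        Instance inst = new DenseInstance(header.numAttributes());
        inst.setDataset(header);       // attach to the dataset so attribute metadata is known
        inst.setValue(0, tfidf);
        inst.setValue(1, fo);
        if (label != null) {
            inst.setClassValue(label); // "yes" or "no"; left missing for unseen candidates
        }
        return inst;
    }
}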
4.4.3 KEA++
KEA++ [43] is an advancement of KEA that enhances it with the possibility of controlled indexing. Since version 4 it is also included in the KEA main distribution. Controlled indexing, in contrast to free indexing, restricts the set of possible keyphrases to a fixed set of predetermined phrases. The advantages and disadvantages of these two approaches were already discussed in Subsection 3.2.7. Summarizing, one may say that controlled indexing is very useful in fixed domains and in cases where predictable keyphrases are an important requirement. The most fundamental resource of controlled indexing is the thesaurus. KEA++ is hereby designed for using SKOS taxonomies (see Section 4.3 for information on SKOS).
An effect of the advancement on the candidate extraction is that a phrase has to be successfully matched with a thesaurus entry before it is considered a keyphrase candidate. The matching process is done by normalizing the thesaurus entries in the same way as the candidates and comparing the pseudo-phrases instead of the originals in order to avoid complex morphology handling.
An additional metric in the feature extraction process of the controlled indexing approach is the node degree, which uses the number of direct semantic relations from one candidate to the others as a clue to its representativeness, thus modifying its weight. This feature has the interesting effect that even phrases that do not actually appear in the text might become keyphrase candidates, just because they are well connected to the other candidates. For instance, a text that mentions astronomy, biology, physics, chemistry and earth science might be well described by the common related term natural science, albeit it does not appear in the text.
Another new metric, directly resulting from the preceding one, is the actual appearance of a phrase in the text. Although a particular node degree might suggest that a thesaurus term which does not appear in the text may be a good candidate, appearance is still a strong indicator that should be considered in the selection process.
4.5 Calais
Calais [12] is a strategic initiative at Thomson Reuters that aims to improve the interoperability of content. To this end, it utilizes state-of-the-art natural language processing techniques to "turn static text into Smart Media that is enriched with open data and connected to a dynamic Linked Content Economy" [61]. More precisely, Calais provides free metatagging services, developer tools and an open standard for the generation of semantic content. The key component of these efforts is the OpenCalais web service, which will be detailed in Subsection 4.5.1. Finally, the underlying data format will be treated separately in Subsection 4.5.2.
Figure 4.6: Input and output data of the OpenCalais web service4
4.5.1 OpenCalais WebService
OpenCalais is the web service at the core of the Calais initiative. It is an API that takes unstructured plain text as input, processes it with natural language processing and machine learning methods and returns a semantically annotated version of the text to the user.
4Accessed on January 8, 2010: http://enioaragon.files.wordpress.com/2009/12/12-03-calais.jpg?w=450&h=325
Access to the web service is basically free, even for commercial use, but requires a registration to obtain an API key, which is mandatory for every request. Furthermore, the request frequency for a single API key is currently limited to fifty thousand transactions per day and four transactions per second.
Method invocation can be done by sending either SOAP or REST requests. Calais takes the submitted content, which must not be larger than one hundred thousand characters, identifies the entities occurring therein and tags them with metadata. Relevant entity classes are categories, named entities, facts and events, as shown in Figure 4.6. The exact data model is described further in Subsection 4.5.2. Web service responses return the enriched content with all the assigned tags, document IDs and URIs as RDF, Microformats, JSON [15] or Calais' hybrid Simple Format.
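A REST invocation could look roughly like the following sketch using plain HttpURLConnection. The endpoint URL, header names and parameter values shown here are assumptions based on the historical REST interface and must be verified against the Calais documentation [12]; the API key is a placeholder.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class OpenCalaisClientSketch {

    public static String enrich(String text) throws IOException {
        // Assumed REST endpoint of the 2010-era OpenCalais service; check the documentation.
        URL url = new URL("http://api.opencalais.com/tag/rs/enrich");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        // Header names are assumptions, not verified against the official API description.
        con.setRequestProperty("x-calais-licenseID", "YOUR-API-KEY");
        con.setRequestProperty("Content-Type", "text/raw; charset=UTF-8");
        con.setRequestProperty("Accept", "application/json");

        try (OutputStream out = con.getOutputStream()) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }

        // Read the annotated response (RDF or JSON, depending on the Accept header).
        StringBuilder response = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                response.append(line).append('\n');
            }
        }
        return response.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(enrich("Complexity science studies emergent behavior in networks."));
    }
}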
4.5.2 Data Model
OpenCalais' data model is strongly oriented towards the linked data design principle propagated by W3C director Tim Berners-Lee [5]. This principle can be described by four simple rules:
1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
4. Include links to other URIs, so that they can discover more things.
According to rules one and two, OpenCalais identifies every relevant object with an HTTP URI. Common URIs relate to types, type instances, documents, text instances or resolution nodes. Types are the predefined entity categories that Calais provides. Their URIs are statically formed like http://s.opencalais.com/1/type/em/e/Company. Type instances are specific individuals of a certain type. For example, the ClearForest Ltd. would be a type instance of Company. Their URIs are composed of a type-related prefix and an instance-specific hash token. The URI of the ClearForest Ltd. would be http://d.opencalais.com/comphash-1/899a2db3-ce69-3926-ba4f-6dea099c3fc9. If the relevance feature is turned on, the RDF also includes a score that estimates the importance of the entity for the document.
Document URIs refer to the actual text that is sent within the request and are composed like type instance URIs, but with a prefix referring to the document. An example URI is
This section introduces the Semantic Content Annotator pipeline. This pipeline is responsible for the extraction of semantic data out of the incoming documents and for the annotation of this data back to the resources. Furthermore, it is meant to decide whether a resource is relevant for complexity and to which main classification it should be assigned. The succeeding subsections describe how these problems are solved by explaining the principle of the CompleXys Tasks and their implementation instances. Subsection 6.2.1 gives an overview of the structure and purpose of the pipeline and the Subsections 6.2.2 to 6.2.6 describe the components Crawled Content Reader, Onto Gazetteer Annotator, KEA Annotator, Open Calais Annotator and Content Writer.
6.2.1 Introduction
The Semantic Content Annotator module is meant to take a potentially high number of documents as input, to analyze them in order to extract semantic data, to decide whether they are relevant for complexity, to fuzzily classify them into the topics of the domain model and to finally output them again. Obviously this involves several sequential steps in which each document has to be processed. That makes this module a perfect candidate for a parallel processing pipeline structure. The main advantage of such an approach is the effective exploitation of distributed processing and, above all, of multi-core processor systems. It is thereby a strong means to raise processing performance and scalability.
Figure 6.2: The CompleXysTask principle
The Semantic Content Annotator utilizes the Java package java.util.concurrent to implement such a pipeline. The basic principle is visualized in Figure 6.2. Every coherent component is implemented as a runnable task object and submitted to a thread pool. ConcurrentLinkedQueues handle the communication between the several tasks. Each queue has a sender task and a receiver task. Whenever a sender task has finished its function for a certain document, it sends it to its output queue. At the other side of the queue the receiver task takes every document in first-in-first-out order and starts processing it. Every task possesses a unique name and a set of features that can be used to transmit any kind of special information a task may need. For example, the standard feature used up to now is debug. It enables a centralized control of debugging output in the Semantic Content Annotator main class. Generally there are three kinds of tasks in this module, differentiated by the number and usage type of their queues, by their termination dependency and by their basic duties: the initiating task, CompleXys Tasks and the finishing task. Furthermore, every task is linked to a Future object, which is basically a flag that describes the termination state of the thread it runs in. Every task, except the first, listens to the preceding task of the pipeline and terminates exactly when the preceding task has terminated and no document is left in its input queue. A simplified sketch of this wiring is given below.
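The following is a minimal sketch of that principle using java.util.concurrent; the class and queue names are illustrative and simplified, not the actual CompleXys classes.

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PipelineSketch {

    // A simplified stand-in for a pipeline stage with an input and an output queue.
    static class Stage implements Runnable {
        final String name;
        final ConcurrentLinkedQueue<String> in;
        final ConcurrentLinkedQueue<String> out;
        final Future<?> predecessor;   // termination flag of the preceding task

        Stage(String name, ConcurrentLinkedQueue<String> in,
              ConcurrentLinkedQueue<String> out, Future<?> predecessor) {
            this.name = name;
            this.in = in;
            this.out = out;
            this.predecessor = predecessor;
        }

        public void run() {
            // Terminate only when the predecessor is done and the input queue is drained.
            while (!(predecessor.isDone() && in.isEmpty())) {
                String doc = in.poll();
                if (doc == null) continue;            // nothing to do yet, try again
                out.add(doc + " [" + name + "]");     // placeholder for the real analysis step
            }
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        ConcurrentLinkedQueue<String> q1 = new ConcurrentLinkedQueue<>();
        ConcurrentLinkedQueue<String> q2 = new ConcurrentLinkedQueue<>();

        // The initiating task only fills its output queue and then terminates.
        Future<?> reader = pool.submit(() -> {
            for (int i = 1; i <= 3; i++) q1.add("doc" + i);
        });
        Future<?> annotator = pool.submit(new Stage("annotated", q1, q2, reader));

        annotator.get();               // wait until the whole (tiny) pipeline has finished
        pool.shutdown();
        q2.forEach(System.out::println);
    }
}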
The initiating task is the first task in the pipeline. Accordingly, it possesses an output, but no input queue. Instead, it is responsible for collecting the necessary resources and the already included metadata itself, which is also its main purpose. It terminates when no documents are left to collect. The implemented initiating task of CompleXys is the Crawled Content Reader. It will be described in detail in Subsection 6.2.2.
Figure 6.3: The Semantic Content Annotator Pipeline
CompleXys Tasks are implementations of the abstract class CompleXysTask. They are characterized by possessing both an input and an output queue. Their purpose is the actual analysis, semantic data extraction and classification of the incoming documents. Three CompleXysTask instances were implemented throughout this thesis' work. The Onto Gazetteer Annotator is further described in Subsection 6.2.3, the KEA Annotator in Subsection 6.2.4 and the Open Calais Annotator in Subsection 6.2.5.
The finishing task is the last task in the pipeline. Accordingly, it possesses an input, but no output queue. Instead, it is responsible for outputting the documents itself. The whole pipeline, and therewith the Semantic Content Annotator, terminates when the finishing task is done. In CompleXys the finishing task is called Content Writer. It will be further described in Subsection 6.2.6.
Figure 6.3 provides an overview of the implemented pipeline and its connected data
stores.
6.2.2 Crawled Content Reader
The Crawled Content Reader is the first component of the pipeline. Its main purpose is to gather the documents from the input data store, to decide whether they should be processed, to wrap them into the GATE data format (see Subsection 4.2.2) and to send them into the output queue for further processing.
First it builds up connections to both the input data store, where the unprocessed documents are stored, and the output data store, where the processed documents are stored. The former will be referred to as Harvester DB, because the Harvester module is responsible for steadily filling it with documents, and the latter will be referred to as Semantic DB, because it is filled by the Semantic Content Annotator module and the stored data additionally includes the semantic information. The connection to the Semantic DB is built by using an intermediate persistence layer consisting of a set of DataAccessObjects7 (DAO), factories and GATE's Hibernate persistence layer, resulting in a strong layer division and high data store exchangeability.
The reader fetches all the documents stored in the Harvester DB and checks whether a document is already stored in the Semantic DB and thus has already been processed once. This must be done because the Harvester DB actually needs the old documents to check for subsequent modification and to decide whether a resource is new. This step should become obsolete in the future, because a modification time stamp or an unprocessed flag can help to access only new and modified documents in a targeted way. If the document can be found in the Semantic DB, both versions are compared by a hash value of their content to find out if the text was modified in the meantime. If it was not modified and the hash values are equal, the document is ignored, because everything is still up-to-date. But if something was changed, the correctness of all existing annotations and potentially even of the classification is uncertain. This dilemma is currently solved by simply deleting the document from the Semantic DB and treating it further as if it were a new one. A sketch of this modification check is shown below.
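A content hash comparison of this kind could look like the following sketch; the choice of SHA-1 and the helper names are illustrative assumptions, not the actual CompleXys code.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ModificationCheck {

    // Hex-encoded SHA-1 digest of a document's textual content.
    static String contentHash(String content) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest(content.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Decides how an incoming document should be treated, based on the stored hash.
    static String decide(String storedHash, String crawledContent) throws NoSuchAlgorithmException {
        if (storedHash == null) {
            return "NEW";             // not yet in the Semantic DB: process it
        }
        if (storedHash.equals(contentHash(crawledContent))) {
            return "UNCHANGED";       // annotations are still up-to-date: ignore it
        }
        return "MODIFIED";            // delete the stored version and reprocess
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        String stored = contentHash("old blog entry text");
        System.out.println(decide(stored, "old blog entry text"));    // UNCHANGED
        System.out.println(decide(stored, "edited blog entry text")); // MODIFIED
        System.out.println(decide(null, "a completely new entry"));   // NEW
    }
}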
If a document needs further processing, it is wrapped as a GATE document object, thereby committing it to the GATE persistency management of the Semantic DB. After this is accomplished, the document is sent to the output queue and the next document is handled. The Crawled Content Reader terminates when no unprocessed or modified documents are left.
The evaluation of the Semantic Content Annotator is basically performed in terms of text classification quality. For this purpose, the test data set that was introduced in the preceding subsection is processed by the Semantic Content Annotator pipeline. This execution is done for a series of comparable configurations. In the first test the Onto Gazetteer Annotator is the only component that performs the classification. In the following ten test configurations the Kea Annotator processes the documents. The latter differ by the occurrence threshold that has to be exceeded by a term before it counts as relevant for the text. This variable is supposed to significantly influence the performance of the classification, because complexity terms like "complexity" or "chaos" also frequently occur in texts that are not pertinent to complexity as such. But those irrelevant words are likely to occur significantly less often than relevant words, so a well-chosen threshold can help to sort the wheat from the chaff.
The obtained binary classification data is used to calculate the standard metrics precision and recall, which can be compared between the configuration cases, but also to the performance of other text classification systems. In addition to the relevance decision, the test stores the URL, the main category and the weight of the main category. The main category weight is the relative share of terms from a certain main category within the set of all term annotations of a resource. It is used to compare certain main categories in order to identify the most important ones for the text. This data is used to take random samples to empirically evaluate the quality of the categorization into main categories and to analyze the correlation of correctness and weight.
8.1.3 Test Results
The results of the classification quality tests have to be considered within the trade-off of recall and precision. Accordingly, Figure 8.1 presents the measured values distributed across these two dimensions. The particular points represent the several test runs. The label gaz refers to the Onto Gazetteer Annotator test and the kea labels to the Kea Annotator tests, with the number standing for the different minimum term occurrence values. It can be perceived that the gaz test achieves a very high recall value, but in doing so clearly fails to fulfill the precision target value of 0.7. The kea2 test performs even worse, but the higher the occurrence threshold is adjusted, the better the precision values become. This tendency continues up to kea80, which misses the target value by just 0.06, while still complying with the recall requirement.
Figure 8.1: The distribution of the quality test runs across the dimensions precision and recall
However, kea90 significantly declines in both recall and precision, so it can be assumed that kea80 forms a local optimum and cannot simply be further improved by increasing the minimum occurrence threshold.
The main category values that were annotated to the resources are evaluated by taking random classification samples. These were manually compared to the actual content to get an empirical clue to how well this classification task performs. The samples generally achieve a success rate of approximately 50%. Considering the negative effect that a one-out-of-two error rate has on the user experience, this is obviously not a very good result. However, this flaw seems to be partially a consequence of previous false complexity classifications. This impression is underlined by the fact that the success rate of those resources of the random samples that were correctly classified as complexity is approximately three out of four, which is significantly higher. Furthermore, the main category classifications are often understandable and vaguely right, but rated as false, because one or two other categories would definitely fit better. Due to the interdisciplinary nature of complexity, this is a frequently occurring case. So this key characteristic is likely to be utilizable in order to improve the quality of main category classifications. Further improvement proposals are made in the future work considerations in Chapter 9.
Additionally, the analysis of the main category classifications and their weights uncovers another interesting relation. Table 8.1 shows that the resources that were classified to their main category with a total weight of 1.0 have a significantly lower complexity classification precision than the others.
weight range    document number    average precision
0.0 - 0.4       11                 0.72
0.4 - 0.5       15                 0.66
0.5 - 0.6       17                 0.65
0.6 - 1.0       23                 0.78
1.0             21                 0.38

Table 8.1: Average precision and number of documents within the top test kea80, clustered by main category weight ranges with at least 10 documents
The rapidly falling trend line in Figure 8.2 visualizes this effect. Based on this knowledge, the result of the top-performing test kea80 was re-evaluated by simply discarding all resources with the 1.0 value. This virtual test run is referred to as kea80+ and it is listed among the other tests in Figure 8.1. On the one hand, the Kea Annotator finally exceeds the precision requirement value in this configuration, but on the other hand it also slightly misses the target recall by 0.02. However, it is still the best result that was achieved throughout the tests, and if it can be improved a little further, it will fully meet the requirements. Suggestions for what these improvements might look like are discussed in Chapter 9.
Figure 8.2: The correlation of main category weight and precision within the top test kea80
8.2 Response Time
This section is dedicated to the performance evaluation of the Semantic Filter. The
performance is measured in response time (see also Subsection 2.2.3). Subsection 8.2.1
explains the applied test strategy and Subsection 8.2.2 discusses the test results.
8.2.1 Test Strategy
The Semantic Filter is evaluated by its response time behavior, because it is time-critical insofar as it can directly influence the time a user has to wait for the system response. To evaluate this response time, a test should be able to simulate various influencing variables. Thus the tests vary in four basic dimensions. The number of documents that have to be filtered scales in the steps 10, 100 and 1000. The number of considered terms scales from 1 to 251 in steps of fifty. The usage of logical filters is varied by either using an AndFilter or an OrFilter, each with all BasicFilters inverted by NotFilters, without any inversion, or both randomly mixed. Finally, a complex nested filter is simulated by randomly chunked BasicFilters that are nested in randomly chosen logical filters, which can be nested within other filters again and so on. This mix is supposed to simulate complexity in the structure of filter systems and to measure its effects on the performance. All filter combinations are visualized in Table 8.2, and a sketch of the underlying filter classes follows the table. The tested documents are only required to possess a certain number of random semantic annotations, so they can be instantly and automatically created.
Test              and   or   not   random not
and plain          x
and not            x          x
and random not     x                 x
or plain                 x
or not                   x    x
or random not            x           x
mixed              x     x           x

Table 8.2: The characteristics of the performed test series
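To make the tested filter types concrete, the sketch below shows a composite filter structure of the kind described above. The interface and class names mirror those used in the text, but the code itself is only an illustrative reconstruction, not the actual Semantic Filter implementation.

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FilterSketch {

    // A filter decides whether a document (represented by its annotated terms) passes.
    interface Filter {
        boolean accepts(Set<String> annotations);
    }

    // Passes documents that carry one specific term annotation.
    static class BasicFilter implements Filter {
        private final String term;
        BasicFilter(String term) { this.term = term; }
        public boolean accepts(Set<String> annotations) { return annotations.contains(term); }
    }

    // Inverts the decision of the wrapped filter.
    static class NotFilter implements Filter {
        private final Filter inner;
        NotFilter(Filter inner) { this.inner = inner; }
        public boolean accepts(Set<String> annotations) { return !inner.accepts(annotations); }
    }

    // Passes only if all nested filters pass; shortcuts on the first failure.
    static class AndFilter implements Filter {
        private final List<Filter> filters;
        AndFilter(List<Filter> filters) { this.filters = filters; }
        public boolean accepts(Set<String> annotations) {
            for (Filter f : filters) {
                if (!f.accepts(annotations)) return false;
            }
            return true;
        }
    }

    // Passes if at least one nested filter passes; shortcuts on the first success.
    static class OrFilter implements Filter {
        private final List<Filter> filters;
        OrFilter(List<Filter> filters) { this.filters = filters; }
        public boolean accepts(Set<String> annotations) {
            for (Filter f : filters) {
                if (f.accepts(annotations)) return true;
            }
            return false;
        }
    }

    public static void main(String[] args) {
        // Nested query: annotated with "chaos" and "network", but not with "sports".
        Filter query = new AndFilter(Arrays.asList(
                new BasicFilter("chaos"),
                new BasicFilter("network"),
                new NotFilter(new BasicFilter("sports"))));

        Set<String> doc = new HashSet<String>(Arrays.asList("chaos", "network", "emergence"));
        System.out.println(query.accepts(doc));   // true
    }
}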
The tests were performed on a MacBook with a two gigahertz Intel Core 2 Duo processor, two gigabytes of DDR2 SDRAM, the operating system Mac OS X 10.4.11 and Java 5. This is not a representative server system, but it should be sufficient to reveal the basic runtime behavior and eventual scalability problems.
8.2.2 Test Results
The performance evaluation depends on many variables, so the results are displayed from two perspectives. The first one calculates the average values over the respective term numbers in order to examine the relation between document number and response time. Its results are visualized in Figure 8.3.
Figure 8.3: Average response times over several test series and numbers of handled documents
The results reveal that the OrFilter test runs are not significantly influenced by the number of documents. The response time of the AndFilter test runs, on the other hand, steadily increases over the three document scales. Furthermore, those AndFilters that contain additional NotFilters are slower than those that do not, and the response time of the mixed filters also increases constantly. This behavior can be explained by the combination of two facts. Firstly, it was more likely for a proposition to be false than true, with approximately 120 successes occurring in 1000 documents. Secondly, NotFilters are an additional filter layer that always costs extra time. The first fact leads to a better performance of and plain and the inverted OrFilters, because they can frequently shortcut their decisions. Opposed to that, or plain and the inverted AndFilters have to check more BasicFilters before they can return their results.
However, the increased number of iterations alone apparently does not cause any noteworthy problems. These emerge only when the iterations are multiplied by the additional execution time of a NotFilter.
Speed requirements alone can often be satisfied by simply using better hardware for the servers, but this approach soon ceases to be feasible if the response times grow exponentially. The software attribute that captures this behavior is scalability. To observe it for the number of documents, the response times are normalized to a relative response time per document value. The results of this procedure are presented in Figure 8.4. They reveal that none of the test series grows faster than the linearly increasing number of documents, so it can be concluded that the system is scalable within this dimension. More than that, the relative response time decreases, which is likely to be caused by a relatively high initial loading time for the code, followed by efficient processing iterations.
Figure 8.4: Normalized average response times over several test series and numbers of handled documents
The second perspective of the evaluation is the number of used terms. In order to evaluate this dimension, the average values for the test runs with different numbers of documents are calculated and presented in Figure 8.5. It can be perceived that the OrFilters are again almost equally fast in every test. The and plain test run rises to a saturation level and stagnates. Only those filters that include both AndFilter and NotFilter steadily depend on the number of terms. But, as the normalized presentation in
Figure 8.5: Average response times over several test series and numbers of terms
Figure 8.6 shows, this growth is not exponential and hence not critical. The behavior of the mixed test is too random to provide useful clues in this dimension.
Figure 8.6: Normalized average response times over several test series and numbers of terms
8.3 Discussion
The evaluation of the complexity relevance decision reveals that the unadjusted KEA Annotator as well as the Onto Gazetteer Annotator fail to achieve the precision requirements that were stated in Subsection 2.2.3. However, further configuration of the minimal term occurrence variable and the additional discarding of a special class of documents, whose main category classification was performed with a weight of 1.0, leads to a test run that fulfills the precision requirement and only slightly misses the target recall value. So after adjustment the module already performs this task nearly satisfactorily. Approaches for a further improvement of its performance are suggested in Chapter 9.
According to the random samples, the main category classification performs with an approximate error ratio of fifty percent. This is an unusable state, so it is necessary to investigate the causes of this flaw and to further improve the solution of this task.
The response time of the Semantic Filter module never grows exponentially, so it can be considered scalable. The tests revealed a clear performance difference between the runtimes of OrFilters and those of AndFilters with and without nested NotFilters. And combined with not is apparently a slow combination. Several theories for the causes of certain performance patterns were constructed, but have not yet been verified. Generally, the performance requirements of the Semantic Filter should be achievable if some further code optimization is performed and the system runs on more powerful server hardware.
CHAPTER 9
Summary and Future Work
This thesis investigated the applicability of semantic metadata for the task of utilizing social media resources in topic-specific and context-aware systems. It is embedded into the CompleXys project, which develops an adaptive information portal for the field of complexity. More precisely, this thesis implemented the two modules Semantic Content Annotator and Semantic Filter. The former uses GATE, KEA and OpenCalais to extract semantic data from incoming documents. Thereupon it decides whether the resource is considered relevant for the topic of complexity and into which domain category it should be classified. A newly created complexity taxonomy was used as a controlled vocabulary for this process. The Semantic Filter applies the combined concept of filter iterators and propositional logic to provide a flexible access interface to the semantically indexed documents.
The evaluation of this work is split into a quality evaluation of the classification process in the Semantic Content Annotator and a performance evaluation of the time-critical Semantic Filter. The quality requirements that were stated in Subsection 2.2.3 demand a precision value of at least 0.7 and a recall value of at least 0.5 for the complexity classification. While several tested configurations were able to meet one of these two goals, none was able to meet both at the same time. However, the kea80+ test run exceeds the precision threshold and misses the required recall by just 0.02. So it can be regarded as a good starting point for further quality improvement. This top configuration is based on a surprisingly high minimal term occurrence value of eighty. It can be assumed that the success of this value was caused by the huge average text size of the scientific documents in the test set. But not all documents of the set are as big as this average, and a broader set of sources is likely to cause an even bigger variance of text sizes. Unfortunately, high occurrence values are nearly a guarantee for smaller-than-average documents to be discarded without distinction. A possibility to handle this fact is to implement the minimal term occurrence not as a constant, but as a value relative to the size of the document.
size of the document. Another effect, that was harnessed by the kea80+ test run, was
that a main classification with a one-sided weight of 1.0 towards a single category is
likely to be a false success and can be rejected to increase the precision. However, it had
to be clear that doing so probably also decreases the recall value, because it skips all
documents that just contain words from one main category. Additionally, it is possible
that this approach fails for small documents, because short texts are far more likely to
contain just terms of one category.
A further empirical evaluation of the main category classifications revealed an error rate of approximately fifty percent. It must therefore be regarded as a current weakness of the system and should be improved. A first approach to do so is to increase the precision of the complexity classification, because documents that are not relevant for complexity can hardly be correctly classified into a complexity category. Furthermore, many classifications are not strictly wrong, but merely choose a category that would not be the first choice of a human classifier. The interdisciplinary nature of complexity even boosts this effect, because most of the texts could possibly be classified into more than one main category. Therefore, it is worth considering whether a multi-value classification would not be a better choice than the current one.
General improvements to the quality of the Semantic Content Annotator can also be made by exploiting the document structure, for instance by increasing the term candidate weight of words with emphasizing markup like boldness or headline elements. Yet another possibility is the use of additional data sources. For example, the links in the text could be loaded and analyzed too, the already annotated OpenCalais data could be applied, and the title could be searched on sites like Google, CiteULike, Technorati or Delicious to extract additional context and collaborative classification suggestions. User tags in CompleXys itself can also help to improve the classification. They can not just subsequently refine the classification quality, but also provide feedback for certain classification decisions, which can be used for a steady training of the classifier.
The performance evaluation of the Semantic Filter revealed a sufficient scalability of the filter systems. Minor flaws are likely to be compensated by powerful hardware and additionally reduced by further code optimization. Performance improvements can generally be made by decoupling the text annotations from the filter process. Up to now the BasicFilters iterate over all semantic annotations of a certain text to find fitting terms. This step can be accelerated by separately storing the occurring terms and their occurrence numbers, which would limit the maximum number of accesses to the number of terms in the taxonomy. If this data is additionally sorted, the filters can also apply advanced search techniques to accelerate the processing. Apart from the performance, an improvement of usefulness for information filtering purposes can be achieved by implementing fuzzy filters that pick documents not according to discrete criteria, but just return the top matching candidates sorted by their additive interest probability.
References
[1] C. Anderson. The Long Tail. Random House Business, 2006.
[2] A. Baruzzo, A. Dattolo, N. Pudota, and C. Tasso. A general framework for personalized text classification and annotation. In Workshop on Adaptation and Personalization for Web 2.0, UMAP'09, 2009.
[3] N.J. Belkin and W.B. Croft. Information filtering and information retrieval: two
sides of the same coin? Commun. ACM, 1992.
[4] T. Berners-Lee. Information Management: A Proposal, 1989. URL http://www.w3.org/History/1989/proposal.html.
[5] T. Berners-Lee. Linked Data - Design Issue, 2006. URL http://www.w3.org/DesignIssues/LinkedData.html.
[6] K. Bittner. Use Case Modeling. Addison-Wesley Longman Publishing Co., Inc.,
2002.
[7] S. Bloehdorn, P. Cimiano, and A. Hotho. Learning ontologies to improve text clustering and classification. In From Data and Information Analysis to Knowledge Engineering: Proceedings of the 29th Annual Conference of the German Classification Society (GfKl'05), 2005.
[8] J. Bogg and R. Geyer. Complexity, science and society. Radcliffe Medical Press, Ox-
ford, 2008.
[9] U. Bojars and J.G. Breslin et al. SIOC Core Ontology Specification, 2009. URL http://rdfs.org/sioc/spec/. Revision 1.33.
[10] G. Booch, I. Jacobson, and J. Rumbaugh. The Unified Modeling Language User Guide.
Addison-Wesley, 1999.
[11] D. Brickley and L. Miller. FOAF Vocabulary Specification 0.96, 2009. URL http:
[12] Open Calais Documentation. Calais, 2009. URL http://www.opencalais.com/documentation/opencalais-documentation.
[13] P. Casoto, A. Dattolo, F. Ferrara, N. Pudota, P. Omero, and C. Tasso. Generating and sharing personal information spaces. In Proc. of the Workshop on Adaptation for the Social Web, 5th ACM Int. Conf. on Adaptive Hypermedia and Adaptive Web-Based Systems, 2008.
[14] J. Chen, D. DeWitt, F. Tian, and Y. Wang. Niagaracq: A scalable continuous query
system for internet databases. In In Proc. of SIGMOD, 2000, 2000.
[15] D. Crockford. The application/json Media Type for JavaScript Object Notation (JSON), 2006. URL http://tools.ietf.org/html/rfc4627.
[16] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: A framework
and graphical development environment for robust NLP tools and applications.
In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, 2002.
[17] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. The GATE User Guide,
[25] P. Herron. Automatic text classification of consumer health web sites using WordNet. Technical report, The University of North Carolina at Chapel Hill, 2005.
[26] A. Heß, P. Dopichaj, and C. Maaß. Multi-value classification of very short texts. In KI '08: Proceedings of the 31st annual German conference on Advances in Artificial Intelligence, 2008.
[27] F. Heylighen. Encyclopedia of Library and Information Sciences, chapter Complexity
and Self-organization. Marcel Dekker, 2008.
[28] J. Howe. Crowdsourcing: A definition. Crowdsourcing: Tracking the Rise of the Amateur (weblog, 2 June), 2006. URL http://crowdsourcing.typepad.com/cs/2006/06/crowdsourcing_a.html. (accessed on January 10, 2010).
[29] F. Iacobelli, K. Hammond, and L. Birnbaum. Makemypage: Social media meets
automatic content generation. In Proc. of ICWSM 2009, 2009.
[30] IEEE. IEEE Recommended Practice for Software Requirements Specifications, 1998. URL
[51] I. Peacock. Showing robots the door, (w)hat is (r)obots (e)xclusion (p)rotocol? Ariadne, 1998.
[52] T. Pellegrini and A. Blumauer. Semantic Web: Wege zur vernetzten Wissensgesellschaft. X.media.press, 2006.
[53] C. Da Costa Pereira and A. Tettamanzi. An evolutionary approach to ontology-based user model acquisition. In WILF, volume 2955 of Lecture Notes in Computer Science, 2003.
[54] A. Montejo Raez, L.A. Urena-Lopez, and R. Steinberger. Automatic Text Categorization of Documents in the High Energy Physics Domain. PhD thesis, Granada Univ.,
2006.
[55] L. Razmerita, S. Antipolis, G. Gouardères, E. Conté, and M. Saber. Ontology based
user modeling for personalization of grid learning services. In ELeGI Conference,
2005.
[56] D. Rosem and C. Nelson. Web 2.0: A new generation of learners and education.
Computers in the Schools, 2008.
[57] S. Schmidt. PHP Design Patterns. O’Reilly, 2006.
[58] A.V. Smirnov and A.A. Krizhanovsky. Information filtering based on wiki index
database. CoRR, 2007.
[59] T. Berners-Lee. Semantic Web Road map, 1998. URL http://www.w3.org/DesignIssues/Semantic.html.
[60] T. Berners-Lee and D. Conolly. Hypertext Markup Language (HTML) - A Representation of Textual Information and MetaInformation for Retrieval and Interchange, 1993.
[65] B. Yang and G. Jeh. Retroactive answering of search queries. In WWW '06: Proceedings of the 15th international conference on World Wide Web, 2006.
[66] C. Zimmer. Approximate Information Filtering in Structured Peer-to-Peer Networks.