Learning Lightweight Ontologies from
Text across Different Domains using the
Web as Background Knowledge
Wilson Yiksen Wong
M.Sc. (Information and Communication Technology), 2005
B.IT. (HONS) (Data Communication), 2003
This thesis is presented for the degree of
Doctor of Philosophy
of The University of Western Australia
School of Computer Science and Software Engineering.
September 2009
To my wife Saujoe
and
my parents and sister
Abstract
The ability to provide abstractions of documents in the form of important con-
cepts and their relations is a key asset, not only for bootstrapping the Semantic
Web, but also for relieving us from the pressure of information overload. At present,
the only viable solution for arriving at these abstractions is manual curation. In
this research, ontology learning techniques are developed to automatically discover
terms, concepts and relations from text documents.
Ontology learning techniques rely on extensive background knowledge, ranging
from unstructured data such as text corpora, to structured data such as a semantic
lexicon. Manually-curated background knowledge is a scarce resource for many do-
mains and languages, and the effort and cost required to keep the resource abreast
of time is often high. More importantly, the size and coverage of manually-curated
background knowledge is often inadequate to meet the requirements of most on-
tology learning techniques. This thesis investigates the use of the Web as the sole
source of dynamic background knowledge across all phases of ontology learning for
constructing term clouds (i.e. visual depictions of terms) and lightweight ontolo-
gies from documents. To appreciate the significance of term clouds and lightweight
ontologies, a system for ontology-assisted document skimming and scanning is de-
veloped.
This thesis presents a novel ontology learning approach that is devoid of any
manually-curated resources, and is applicable across a wide range of domains (the
current focus is medicine, technology and economics). More specifically, this research
proposes and develops a set of novel techniques that take advantage of Web data to
address the following problems: (1) the absence of integrated techniques for cleaning
noisy data; (2) the inability of current term extraction techniques to systematically
explicate, diversify and consolidate their evidence; (3) the inability of current corpus
construction techniques to automatically create very large, high-quality text corpora
using a small number of seed terms; and (4) the difficulty of locating and preparing
features for clustering and extracting relations.
This dissertation is organised as a series of published papers that contribute to
a complete and coherent theme. The work into the individual techniques of the
proposed ontology learning approach has resulted in a total of nineteen published
articles: two book chapters, four journal articles, and thirteen refereed conference
papers. The proposed approach consists of several major contributions to each task
in ontology learning. These include (1) a technique for simultaneously correcting
noise such as spelling errors, expanding abbreviations and restoring improper casing
in text; (2) a novel probabilistic measure for recognising multi-word phrases; (3) a
probabilistic framework for recognising domain-relevant terms using formal word
distribution models; (4) a novel technique for constructing very large, high-quality
text corpora using only a small number of seed terms; and (5) novel techniques for
clustering terms and discovering coarse-grained semantic relations using featureless
similarity measures and dynamic Web data. In addition, a comprehensive review
is included to provide background on ontology learning and recent advances in this
area. The implementation details of the proposed techniques are provided at the
end, together with a description of how the system is used to automatically discover
term clouds and lightweight ontologies for document skimming and scanning.
Acknowledgements
First and foremost, this dissertation would not have come into being without the
continuous support provided by my supervisors Dr Wei Liu and Prof Mohammed
Bennamoun. Their insightful guidance, financial support and broad interest made
my research journey at the School of Computer Science and Software Engineering
(CSSE) an extremely fruitful and enjoyable one. I am proud to have Wei and
Mohammed as my mentors and personal friends.
I would also like to thank Dr Krystyna Haq, Mrs Jo Francis and Prof Robyn
Owens for being there to answer my questions on general research skills and scholar-
ships. A very big thank you goes to the Australian Government and the University
of Western Australia for sponsoring this research under the International Postgrad-
uate Research Scholarship and the University Postgraduate Award for International
Students. I am also very grateful to CSSE, and Dr David Glance of the Centre
for Software Practice (CSP) for providing me with a multitude of opportunities to
pursue this research further.
I would like to thank the other members of CSSE including Prof Rachell Cardell-
Oliver, Assoc/Prof Chris McDonald and Prof Michael Wise for their advice. My
appreciation goes to my office mates Faisal, Syed, Suman and Majigaa. A special
thank you to the members of the CSSE’s support team, namely, Laurie, Ashley,
Ryan, Sam and Joe, for always being there to restart the virtual machine and to
fix my laptop computers due to accidental spills. Not forgetting the amicable peo-
ple in CSSE’s administration office, namely, Jen Redman, Nicola Hallsworth, Ilse
Lorenzen, Rachael Offer, Jayjay Jegathesan and Jeff Pollard for answering my ad-
ministrative and travel needs, and making my stay at CSSE an extremely enjoyable
one.
I also had the pleasure of meeting with many reputable researchers during my
travel whose advice has been invaluable. To name a few, Prof Kyo Kageura, Prof
Udo Hahn, Prof Robert Dale, Assoc/Prof Christian Gutl, Prof Arno Scharl, Prof
Albert Yeap, Dr Timothy Baldwin, and Assoc/Prof Stephen Bird. A special thank
you to the wonderful people at the Department of Information Science, University of
Otago for being such a gracious host during my visit to Dunedin. I would also like to
extend my gratitude for the constant support and advice provided by researchers at
the Curtin University of Technology, namely, Prof Moses Tade, Assoc/Prof Hongwei
Wu, Dr Nicoleta Balliu and Prof Tharam Dillon. In addition, my appreciation goes
to my previous mentors Assoc/Prof Ongsing Goh and Prof Shahrin Sahib of the
Technical University of Malaysia Malacca (UTeM), and Assoc/Prof R. Mukundan
of the University of Canterbury. I should also acknowledge my many friends and
colleagues at the Faculty of Information and Communication Technology at UTeM.
My thank you also goes to the anonymous reviewers who have commented on all
publications that have arisen from this thesis.
Last but not least, I will always remember the unwavering support provided by
my wife Saujoe, my parents and my only sister, without which I would not have
cruised through this research journey so pleasantly. Also a special appreciation to
the city of Perth for being such a nice place to live in and to undertake this research.
measure, 1,300 term candidates, and the GENIA corpus as benchmark. The evalu-
ation showed that term recognition using the SPARTAN-based corpus achieved the
best precision at 99.56%.
1.3.5 Term Clustering and Relation Acquisition (Chapters 7 and 8)
Figure 1.6: Overview of the ARCHILES technique used in the relation acquisition
phase of the proposed ontology learning system. The ARCHILES technique is
described in Chapter 8, while the TTA clustering technique and noW measure are
described in Chapter 7.
Figure 1.6 shows an overview of the techniques for relation acquisition. The
flat lists of domain-relevant terms obtained during the previous phase are organ-
ised into hierarchical structures during the relation acquisition phase. A review of
the techniques for acquiring semantic relations between terms was conducted. Cur-
rent techniques rely heavily on the presence of syntactic cues and static background
knowledge such as semantic lexicons for acquiring relations. A novel technique
named Acquiring Relations through Concept Hierarchy Disambiguation, Associa-
tion Inference and Lexical Simplification (ARCHILES) is proposed for constructing
lightweight ontologies using coarse-grained relations derived from Wikipedia and
search engines. ARCHILES combines word disambiguation, which uses the distance
measure n-degree of Wikipedia (noW), and lexical simplification to handle complex
and ambiguous terms. ARCHILES also includes association inference using a novel
multi-pass Tree-Traversing Ant (TTA) clustering algorithm with the Normalised
Web Distance (NWD)4 as the similarity measure to cope with terms not covered
by Wikipedia. This technique can be used to complement conventional techniques
for acquiring fine-grained relations. Two small experiments using 11 terms in the
genetics domain and 31 terms in the food domain revealed precision scores between
80% and 100%. The details of TTA and noW are provided in Chapter 7, while the
description of ARCHILES is included in Chapter 8.
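The NWD used above belongs to the family of featureless measures computed purely from search-engine page counts. A minimal sketch of the underlying Normalised Google Distance formula is shown below; the page counts and the function name are fabricated for illustration and are not from the thesis.

```python
import math

def normalised_web_distance(fx, fy, fxy, n):
    """Normalised Web Distance from search-engine page counts.

    fx, fy -- hit counts for the terms x and y queried alone
    fxy    -- hit count for the conjunctive query "x AND y"
    n      -- (estimated) number of pages indexed by the engine
    """
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))

# Fabricated counts: strongly related terms score near 0,
# unrelated terms score near 1 or above.
print(normalised_web_distance(9_000_000, 8_000_000, 6_000_000, 8_000_000_000))
print(normalised_web_distance(9_000_000, 8_000_000, 1_000, 8_000_000_000))
```

The measure needs no features to be extracted from the terms themselves, which is what makes it attractive for clustering terms not covered by resources such as Wikipedia.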
1.4 Contributions
The standout contribution of this dissertation is the exploration of a complete
solution to the complex problem of automatic ontology learning from text. This
research has produced several other contributions to the field of ontology learning.
The complete list is as follows:
• A technique which consolidates various evidence from existing tools and from
search engines for simultaneously correcting spelling errors, expanding abbre-
viations and restoring improper casing.
• Two measures for determining the collocational strength of word sequences
using page counts from search engines, namely, an adaptation of existing word
association measures, and a novel probabilistic measure.
• In-depth experiments on parameter estimation and linear regression involving
various word distribution models.
• Two measures for determining term relevance based on explicitly defined term
characteristics and the distributional behaviour of terms across different cor-
pora. The first measure is a heuristic measure, while the second measure
is based on a novel probabilistic framework for consolidating evidence using
formal word distribution models.
• In-depth experiments on the effects of search engine and page count variations
on corpus construction.
• A novel technique for corpus construction that requires only a small num-
ber of seed terms to automatically produce very large, high-quality text cor-
pora through the systematic analysis of website contents. The on-demand
construction of new text corpora enables this and many other term recogni-
tion techniques to be widely applicable across different domains. A generally-
applicable heuristic technique is also introduced for removing HTML tags and
boilerplate, and extracting relevant content from webpages.

4 NWD is a generalisation of the Normalised Google Distance (NGD) [50] that employs any
available Web search engine.
• In-depth experiments on the peculiarities of clustering terms as compared to
other forms of feature-based data clustering.
• A novel technique for constructing lightweight ontologies in an iterative pro-
cess of lexical simplification, association inference through term clustering, and
word disambiguation using only Wikipedia and search engines. A generally-
applicable technique is introduced for multi-pass term clustering using feature-
less similarity measurement based on Wikipedia and page counts by search
engines.
• Demonstration of the use of term clouds and lightweight ontologies to assist
the skimming and scanning of documents.
1.5 Layout of Thesis
Overall, this dissertation is organised as a series of papers published in interna-
tionally refereed book chapters, journals and conferences. Each paper constitutes an
independent set of work into ontology learning. However, these papers together con-
tribute to a complete and coherent theme. In Chapter 2, a background to ontology
learning and a review on several prominent ontology learning systems is presented.
The core content of this dissertation is laid out in Chapters 3 to 8. Each of these
chapters describes one of the five phases in our ontology learning system.
• Chapter 3 (Text Preprocessing) features an IJCAI workshop paper on the text
cleaning technique called ISSAC.
• In Chapter 4 (Text Processing), an IJCNLP conference paper describing the
two word association measures UH and OU is included.
• An Intelligent Data Analysis journal paper on the two term relevance measures
TH and OT is included in Chapter 5 (Term Recognition).
• In Chapter 6 (Corpus Construction for Term Recognition), a Language Re-
sources and Evaluation journal paper is included to describe the SPARTAN
technique for automatically constructing text corpora for term recognition.
• A Data Mining and Knowledge Discovery journal paper that describes the
TTA clustering technique and noW distance measure is included in Chapter
7 (Term Clustering for Relation Acquisition).
• In Chapter 8 (Relation Acquisition), a PAKDD conference paper is included
to describe the ARCHILES technique for acquiring coarse-grained relations
using TTA and noW.
After the core content, Chapter 9 elaborates on the implementation details of
the proposed ontology learning system, and the application of term clouds and
lightweight ontologies for document skimming and scanning. In Chapter 10, we
summarise our conclusions and provide suggestions for future work.
CHAPTER 2
Background
“A while ago, the Artificial Intelligence research community got
together to find a way to enable knowledge sharing...They proposed an
infrastructure stack that could enable this level of information exchange,
and began work on the very difficult problems that arise.”
- Thomas Gruber, Ontology of Folksonomy (2007)
This chapter provides a comprehensive review on ontology learning. It also
serves as a background introduction to ontologies in terms of what they are, why
they are important, how they are obtained and where they can be applied. The
definition of an ontology is first introduced before a discussion on the differences
between lightweight ontologies and the conventional understanding of ontologies
is provided. Then the process of ontology learning is described, with a focus on
types of output, commonly-used techniques and evaluation approaches. Finally,
several current applications and prominent systems are explored to appreciate the
significance of ontologies and the remaining challenges in ontology learning.
2.1 Ontologies
Ontologies can be thought of as directed graphs consisting of concepts as nodes,
and relations as the edges between the nodes. A concept is essentially a mental
symbol often realised by a corresponding lexical representation (i.e. natural language
name). For instance, the concept “food” denotes the set of all substances that can
be consumed for nutrition or pleasure. In Information Science, an ontology is a
“formal, explicit specification of a shared conceptualisation” [92]. This definition
imposes the requirement that the names of concepts, and how the concepts are
related to one another have to be explicitly expressed and represented using formal
languages such as Web Ontology Language (OWL). An important benefit of a formal
representation is the ability to specify axioms for reasoning to determine validity
and to define constraints in ontologies.
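The graph view of an ontology described above can be made concrete with a small sketch: concepts as nodes, labelled binary relations as directed edges, and taxonomic queries as graph traversal. The concepts and relation names below are illustrative only, not drawn from any particular ontology.

```python
# A minimal sketch of a lightweight ontology as a directed graph:
# nodes are concepts, edges are labelled binary relations.
class LightweightOntology:
    def __init__(self):
        self.edges = {}  # (source concept, relation label) -> set of targets

    def add_relation(self, source, relation, target):
        self.edges.setdefault((source, relation), set()).add(target)

    def targets(self, source, relation):
        return self.edges.get((source, relation), set())

    def hypernyms(self, concept):
        """Transitive closure over is-a edges (taxonomic relations)."""
        found, frontier = set(), {concept}
        while frontier:
            c = frontier.pop()
            for parent in self.targets(c, "is-a"):
                if parent not in found:
                    found.add(parent)
                    frontier.add(parent)
        return found

onto = LightweightOntology()
onto.add_relation("egg tart", "is-a", "tart")          # taxonomic
onto.add_relation("tart", "is-a", "food")              # taxonomic
onto.add_relation("food", "consumed-for", "nutrition") # non-taxonomic

print(onto.hypernyms("egg tart"))  # direct and inherited hypernyms
```

A heavyweight ontology would additionally attach axioms and constraints to such a graph; a lightweight one, as discussed above, stops at concepts and relations.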
As research into ontology progresses, the definition of what constitutes an on-
tology evolves. The extent of relational and axiomatic richness, and the formality
of representation eventually gave rise to a spectrum of ontology kinds [253] as il-
lustrated in Figure 2.1. At one end of the spectrum, we have ontologies that make
little or no use of axioms referred to as lightweight ontologies [89]. At the other
end, we have heavyweight ontologies [84] that make intensive use of axioms for spec-
ification. Ontologies are fundamental to the success of the Semantic Web as they
Figure 2.1: The spectrum of ontology kinds, adapted from Giunchiglia & Zaihrayeu
[89].
enable software agents to exchange, share, reuse and reason about concepts and
relations using axioms. In the words of Tim Berners-Lee [24], “For the semantic
web to function, computers must have access to structured collections of informa-
tion and sets of inference rules that they can use to conduct automated reasoning”.
However, the truth remains that the automatic learning of axioms is not an easy
task. Despite certain success, many ontology learning systems are still struggling
with the basics of extracting terms and relations [84]. For this reason, the majority
of ontology learning systems that claim to learn ontologies are in fact
creating lightweight ontologies. At the moment, lightweight ontologies appear to
be the most common type of ontologies in a variety of Semantic Web applications
(e.g. knowledge management, document retrieval, communities of practice, data
integration) [59, 75].
2.2 Ontology Learning from Text
Ontology learning from text is the process of identifying terms, concepts, rela-
tions and optionally, axioms from natural language text, and using them to construct
and maintain an ontology. Even though the area of ontology learning is still in its
infancy, many proven techniques from established fields such as text mining, data
mining, natural language processing, information retrieval, as well as knowledge rep-
resentation and reasoning have powered a rapid growth in recent years. Information
retrieval provides various algorithms to analyse associations between concepts in
texts using vectors, matrices [76] and probabilistic theorems [280]. On the other
hand, machine learning and data mining provide ontology learning with the ability to
extract rules and patterns out of massive datasets in a supervised or unsupervised
manner based on extensive statistical analysis. Natural language processing pro-
vides the tools for analysing natural language text on various language levels (e.g.
morphology, syntax, semantics) to uncover concept representations and relations
through linguistic cues. Knowledge representation and reasoning enables the onto-
logical elements to be formally specified and represented such that new knowledge
can be deduced.
Figure 2.2: Overview of the outputs, tasks and techniques of ontology learning.
In the following subsections, we look at the types of output, common techniques
and evaluation approaches of a typical ontology learning process.
2.2.1 Outputs from Ontology Learning
There are five types of output in ontology learning, namely, terms, concepts,
taxonomic relations, non-taxonomic relations and axioms. Some researchers [35]
refer to this as the “Ontology Learning Layer Cake”. To obtain each output, cer-
tain tasks have to be accomplished and the techniques employed for each task may
vary between systems. This view of output-task relation that is independent of any
implementation details promotes modularity in designing and implementing ontol-
ogy learning systems. Figure 2.2 shows the output and the corresponding tasks.
Each output is a prerequisite for obtaining the next output as shown in the figure.
Terms are used to form concepts which in turn are organised according to relations.
Relations can be further generalised to produce axioms.
Terms are the most basic building blocks in ontology learning. Terms can be
simple (i.e. single-word) or complex (i.e. multi-word), and are considered as lexical
realisations of everything important and relevant to a domain. The main tasks as-
sociated with terms are to preprocess texts and extract terms. Preprocessing ensures
that the input texts are in an acceptable format. Some of the techniques relevant
to preprocessing include noisy text analytics and the extraction of relevant contents
from webpages (i.e. boilerplate removal). The extraction of terms usually begin
with some kind of part-of-speech tagging and sentence parsing. Statistical or prob-
abilistic measures are then used to determine the extent of collocational strength
and domain relevance of the term candidates.
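One common statistical measure of collocational strength is pointwise mutual information, which scores how much more often two words co-occur than chance would predict. A minimal sketch follows; the corpus counts and word pairs are fabricated for illustration.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information of a two-word sequence, computed
    from corpus counts: log-ratio of the observed co-occurrence
    probability to the probability expected under independence."""
    p_xy = count_xy / total
    p_x, p_y = count_x / total, count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Fabricated counts over a 1,000,000-token corpus: "strong tea" behaves
# like a collocation, "strong door" does not.
print(pmi(count_xy=30, count_x=500, count_y=800, total=1_000_000))
print(pmi(count_xy=1, count_x=500, count_y=900, total=1_000_000))
```

A high PMI score marks the word pair as a candidate multi-word term; domain relevance is then assessed separately, as described above.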
Concepts can be abstract or concrete, real or fictitious. Broadly speaking, a
concept can be anything about which something is said. Concepts are formed by
grouping similar terms. The main tasks are therefore to form concepts and label
concepts. The task of forming concepts involves discovering the variants of a term
and grouping them together. Term variants can be determined using predefined
background knowledge, syntactic structure analysis or through clustering based on
some similarity measures. As for deciding on the suitable label for a concept, existing
background knowledge such as WordNet may be used to find the name of the nearest
common ancestor. If a concept is determined through syntactic structure analysis,
the heads of the complex terms can be used as the corresponding label. For instance,
the common head noun “tart” can be used as the label for the concept comprising
“egg tart”, “French apple tart”, “chocolate tart”, etc.
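The head-based labelling described above can be sketched as a small routine that groups complex terms under a shared head noun, taken naively here to be the last token of each term (a common heuristic for English noun phrases); the example terms follow the text.

```python
from collections import defaultdict

def group_by_head(terms):
    """Group complex terms under their head noun, naively taken to be
    the last token of each term."""
    concepts = defaultdict(list)
    for term in terms:
        head = term.split()[-1]
        concepts[head].append(term)
    return dict(concepts)

terms = ["egg tart", "French apple tart", "chocolate tart", "apple pie"]
print(group_by_head(terms))
```

Here the label “tart” covers the first three terms; a real system would refine such groups with the similarity measures and background knowledge discussed above.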
Relations are used to model the interactions between the concepts in a domain.
There are two types of relations, namely, taxonomic relations and non-taxonomic
relations. Taxonomic relations are the hypernymies between concepts. The main
task is to construct hierarchies. Organising concepts into a hierarchy involves the
discovery of hypernyms and hence, some researchers may also refer to this task as
extracting taxonomic relations. Hierarchy construction can be performed in various
ways such as using predefined relations from existing background knowledge, using
statistical subsumption models, relying on semantic relatedness between concepts,
and utilising linguistic and logical rules or patterns. Non-taxonomic relations are the
interactions between concepts (e.g. meronymy, thematic roles, attributes, possession
and causality) other than hypernymy. The less explicit and more complex use of
words for specifying relations other than hypernymy causes the tasks to discover
non-taxonomic relations and label non-taxonomic relations to be more challenging.
Discovering and labelling non-taxonomic relations are mainly reliant on the analysis
of syntactic structures and dependencies. In this aspect, verbs are taken as good
indicators for non-taxonomic relations, and help from domain experts is usually
required to label such relations.
Lastly, axioms are propositions or sentences that are always taken as true. Ax-
ioms act as a starting point for deducing other truths, verifying the correctness of existing
ontological elements and defining constraints. The task involved here is to discover
axioms. The task of learning axioms usually involves the generalisation or deduction
of a large number of known relations that satisfy certain criteria.
2.2.2 Techniques for Ontology Learning
The techniques employed by different systems may vary depending on the tasks
to be accomplished. The techniques can generally be classified into statistics-based,
linguistics-based, logic-based, or hybrid. Figure 2.2 illustrates the various commonly-
used techniques, and each technique may be applicable to more than one task.
The various statistics-based techniques for accomplishing the tasks in ontology
learning are mostly derived from information retrieval, machine learning and data
mining. The lack of consideration for the underlying semantics and relations between
the components of a text makes statistics-based techniques more prevalent in the
early stages of ontology learning. Some of the common techniques include clustering
[272], latent semantic analysis [252], co-occurrence analysis [34], term subsumption
[77], contrastive analysis [260] and association rule mining [239]. The main idea
behind these techniques is that the extent of occurrence of terms and their contexts
in documents often provide reliable estimates about the semantic identity of terms.
• In clustering, a measure of relatedness (e.g. similarity, distance) is em-
ployed to assign terms into groups for discovering concepts or constructing
hierarchies [152]. The process of clustering can either begin with individual
terms or concepts and group the most related ones (i.e. agglomerative
clustering), or begin with all terms or concepts and divide them into smaller
groups to maximise within-group relatedness (i.e. divisive clustering). Some
of the major issues in clustering are working with high-dimensional data, and
feature extraction and preparation for similarity measurement. This gave rise
to a class of feature-less similarity and distance measures based solely on the
co-occurrence of words in large text corpora. The Normalised Web Distance
(NWD) is one example [262].
• Relying on raw data to measure relatedness may lead to data sparseness [35].
In latent semantic analysis, dimension reduction techniques such as singular
value decomposition are applied on the term-document matrix to overcome the
problem [139]. In addition, inherent relations between terms can be revealed
by applying correlation measures on the dimensionally-reduced matrix, leading
to the formation of groups.
• The analysis of the occurrence of two or more terms within a well-defined
unit of information, such as a sentence or more generally an n-gram, is known as co-
occurrence analysis. Co-occurrence analysis is usually coupled with some mea-
sures to determine the association strength between terms or the constituents
of terms. Some of the popular measures include dependency measures (e.g.
mutual information [47]), log-likelihood ratios [206] (e.g. chi-square test), rank
correlations (e.g. Pearson’s and Spearman’s coefficients [244]), distance mea-
sures (e.g. Kullback-Leibler divergence [161]), and similarity measures (e.g.
cosine measures [223]).
• In term subsumption, the conditional probabilities of the occurrence of terms
in documents are employed to discover hierarchical relations between them
[77]. A term subsumption measure is used to quantify the extent of a term x
being more general than another term y. The higher the subsumption value,
the more general term x is with respect to y.
• The extent of occurrence of terms in individual documents and in text corpora
is employed for relevance analysis. Some of the common relevance measures
from information retrieval include the Term Frequency-Inverse Document Fre-
quency (TF-IDF) [215] and its variants, and others based on language mod-
elling [56] and probability [83]. Contrastive analysis [19] is a kind of relevance
analysis based on the heuristic that general language-dependent phenomena
should spread equally across different text corpora, while special-language phe-
nomena should portray odd behaviours.
• Given a set of concept pairs, association rule mining is employed to describe the
associations between the concepts at the appropriate level of abstraction [115].
In the example by [162], given the already known concept pairs {chips, beer}
and {peanuts, soda}, association rule mining is then employed to generalise
the pairs to provide {snacks, drinks}. The key to determining the degree of
abstraction in association rules is provided by user-defined thresholds such as
confidence and support.
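The term subsumption idea described above can be sketched with document sets: term x is taken as more general than term y when x occurs in (nearly) all documents containing y, but not vice versa. The document sets and threshold below are fabricated for illustration.

```python
def subsumes(docs_x, docs_y, threshold=0.8):
    """Document-based term subsumption: returns True when x (with
    document set docs_x) is more general than y (docs_y), i.e. when
    P(x|y) is high while P(y|x) stays below the threshold."""
    overlap = len(docs_x & docs_y)
    p_x_given_y = overlap / len(docs_y)
    p_y_given_x = overlap / len(docs_x)
    return p_x_given_y >= threshold and p_y_given_x < threshold

docs_fruit = {1, 2, 3, 4, 5, 6}
docs_apple = {2, 3, 4}  # "fruit" co-occurs wherever "apple" does

print(subsumes(docs_fruit, docs_apple))  # fruit subsumes apple
print(subsumes(docs_apple, docs_fruit))  # but not the other way round
```

Applying this test pairwise over a set of terms yields the hierarchical relations used for constructing a taxonomy.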
Linguistics-based techniques are applicable to almost all tasks in ontology learn-
ing and are mainly dependent on natural language processing tools. Some of
the techniques include part-of-speech tagging, sentence parsing, syntactic structure
analysis and dependency analysis. Other techniques rely on the use of semantic
lexicons, lexico-syntactic patterns, semantic templates, subcategorisation frames, and
seed words.
• Part-of-speech tagging and sentence parsing provide the syntactic structures
and dependency information required for further linguistic analysis. Some ex-
amples of part-of-speech taggers are the Brill Tagger [33] and TreeTagger [219].
Principar [149], Minipar [150] and the Link Grammar Parser [247] are among
the common sentence parsers. Other more comprehensive toolkits for nat-
ural language processing include General Architecture for Text Engineering
(GATE) [57], and Natural Language Toolkit (NLTK) [25]. Despite the place-
ment under the linguistics-based category, certain parsers are built on statis-
tical parsing systems. For instance, the Stanford Parser [132] is a lexicalised
probabilistic parser.
• Syntactic structure analysis and dependency analysis examine syntactic and
dependency information to uncover terms and relations at the sentence level.
In syntactic structure analysis, words and modifiers in syntactic structures (e.g.
noun phrases, verb phrases and prepositional phrases) are analysed to discover
potential terms and relations. For example, ADJ-NN or DT-NN can be extracted
as potential terms, while ignoring phrases containing other parts of speech such
as verbs. In particular, the head-modifier principle has been employed exten-
sively to identify complex terms related through hyponymy with the heads of
the terms assuming the hypernym role [105]. In dependency analysis, gram-
matical relations such as subject, object, adjunct and complement are used
for determining more complex relations [86, 48].
• Semantic lexicons can either be general, such as WordNet [177], or domain-
specific, such as the Unified Medical Language System (UMLS) [151]. Semantic
lexicons offer easy access to a large collection of predefined words and rela-
tions. Concepts in a semantic lexicon are usually organised in sets of similar
words (i.e. synsets). These synonyms are employed for discovering variants of
terms [250]. Relations from semantic lexicons have also been proven useful to
ontology learning. These relations include hypernym-hyponym (i.e. parent-
child relation) and meronym-holonym (i.e. part-whole relation). Much of the
work related to the use of relations in WordNet can be found in the areas of
word sense disambiguation [265, 145] and lexical acquisition [190].
• The use of lexico-syntactic patterns was proposed by [102], and has been em-
ployed to extract hypernyms [236] and meronyms. Lexico-syntactic patterns
capture hypernymy relations using patterns such as NP such as NP, NP,...,
and NP. For extracting meronyms, patterns such as NP is part of NP can be
useful. The use of patterns provides reasonable precision but the recall is low
[35]. Due to the cost and time involved in manually producing such patterns,
efforts [234] have been taken to study the possibility of learning them. Seman-
tic templates [238, 257] are similar to lexico-syntactic patterns in terms of their
purpose. However, semantic templates offer more detailed rules and conditions
to extract not only taxonomic relations but also complex non-taxonomic rela-
tions.
• In linguistic theory, the subcategorisation frame [5, 85] of a word specifies the
number and kinds of other words that it selects when appearing in a sentence. For
example, in the sentence “Joe wrote a letter”, the verb “write” selects “Joe”
and “letter” as its subject and object, respectively. In other words, “Person”
and “Written-Communication” are the restrictions of selection for the subject
and object of the verb “write”. The restrictions of selection extracted from
parsed texts can be used in conjunction with clustering techniques to discover
concepts [68].
• The use of seed words (i.e. seed terms) [281] is a common practice in many
systems to guide a wide range of tasks in ontology learning. Seed words provide
good starting points for the discovery of additional terms relevant to that
particular domain [110]. Seed words are also used to guide the automatic
construction of text corpora from the Web [15].
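The lexico-syntactic patterns described above can be sketched with a regular expression. In the sketch below, single words stand in for noun phrases, which is a deliberate simplification: a real system would use part-of-speech-based noun-phrase chunking rather than this word-level approximation.

```python
import re

# Hearst-style pattern: "NP such as NP, NP, ... and NP" signals hypernymy.
# Single words stand in for noun phrases here -- a simplifying assumption;
# a real system would use a proper NP chunker.
SUCH_AS = re.compile(r"(\w+) such as (\w+(?:, \w+)*(?:,? and \w+)?)")

def extract_hypernyms(sentence):
    """Return (hypernym, hyponym) pairs matched by the 'such as' pattern."""
    pairs = []
    for m in SUCH_AS.finditer(sentence):
        hypernym, tail = m.group(1), m.group(2)
        for hyponym in re.split(r",\s*|\s+and\s+", tail):
            pairs.append((hypernym, hyponym))
    return pairs

print(extract_hypernyms("They keep pets such as cats, dogs and rabbits."))
# [('pets', 'cats'), ('pets', 'dogs'), ('pets', 'rabbits')]
```

As the surrounding text notes, such patterns tend to be precise but sparse: the regex fires only when the exact surface form occurs, which is why recall is low.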
Logic-based techniques are the least common in ontology learning and are mainly
adopted for more complex tasks involving relations and axioms. Logic-based tech-
niques have connections with advances in knowledge representation and reasoning,
and machine learning. The two main techniques employed are inductive logic pro-
gramming [141, 283] and logical inference [227].
• In inductive logic programming, rules are derived from existing collection of
concepts and relations which are divided into positive and negative examples.
The rules must prove all of the positive examples and none of the negative ones. In an
example by Oliveira et al. [191], induction begins with the first positive ex-
ample “tigers have fur”. With the second positive example “cats have fur”, a
generalisation of “felines have fur” is obtained. Given the third positive exam-
ple “dogs have fur”, the technique will attempt to generalise that “mammals
have fur”. When encountered with a negative example “humans do not have
fur”, then the previous generalisation will be dropped, giving only “canines
and felines have fur”.
• In logical inference, implicit relations are derived from existing ones using
rules such as transitivity and inheritance. Using the classic example, given the
premises “Socrates is a man” and “All men are mortal”, we can discover a
new attribute relation stating that "Socrates is mortal". Despite the power of
inference, invalid or conflicting relations may be introduced
if the design of the rules is not complete. Consider the example where
"human eats chicken" and "chicken eats worm" yield a new relation that is
not valid. This happens because the intransitivity of the relation "eat" was
not explicitly specified in advance.
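The fur example above can be sketched as a toy induction over a hand-built taxonomy. Both the taxonomy and the generalisation strategy below are illustrative assumptions, far simpler than a real inductive logic programming system:

```python
# Toy taxonomy (invented for this sketch): child -> parent.
parent = {"tiger": "feline", "cat": "feline", "dog": "canine",
          "human": "primate", "feline": "mammal", "canine": "mammal",
          "primate": "mammal"}

def ancestors(x):
    out = []
    while x in parent:
        x = parent[x]
        out.append(x)
    return out

def induce(positives, negatives):
    """Generalise each positive example up the taxonomy, stopping below
    any ancestor that would also cover a negative example."""
    blocked = {a for n in negatives for a in ancestors(n)}
    hypothesis = set()
    for p in positives:
        best = p
        for a in ancestors(p):
            if a in blocked:
                break
            best = a
        hypothesis.add(best)
    return hypothesis

# "tigers/cats/dogs have fur" vs. the negative "humans do not have fur":
# the generalisation stops at felines and canines, not mammals.
print(induce({"tiger", "cat", "dog"}, {"human"}))
```

Without the negative example, the same procedure would generalise all the way to "mammals have fur", mirroring the over-generalisation described in the text.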
2.2.3 Evaluation of Ontology Learning Techniques
Evaluation is an important aspect of ontology learning, just like any other re-
search area. Evaluation allows individuals who use ontology learning systems to
assess the resulting ontologies, and to possibly guide and refine the learning process.
An interesting aspect about evaluation in ontology learning, as opposed to informa-
tion retrieval and other areas, is that ontologies are not an end product but rather,
a means to achieve some other tasks. In this sense, an evaluation approach is also
useful to assist users in choosing the best ontology that fits their requirements when
faced with a multitude of options.
In document retrieval, the object of evaluation is documents and how well sys-
tems provide documents that satisfy user queries, either qualitatively or quantita-
tively. However, in ontology learning, we cannot simply measure how well a system
constructs an ontology without raising more questions. For instance, is the ontology
good enough? If so, with respect to what application? An ontology is made up of
different layers such as terms, concepts and relations. If an ontology is inadequate
for an application, then which part of the ontology is causing the problem? Consid-
ering the intricacies of evaluating ontologies, a myriad of evaluation approaches have
been proposed in the past few years. Generally, these approaches can be grouped
into one of four main categories, depending on the kind of ontologies that are
being evaluated and the purpose of the evaluation [30]:
• The first approach evaluates the adequacy of ontologies in the context of other
applications. For example Porzel & Malaka [202] evaluated the use of ontolog-
ical relations in the context of speech recognition. The output from the speech
recognition system is compared with a gold standard generated by humans.
• The second approach uses domain-specific data sources to determine to what
extent the ontologies are able to cover the corresponding domain. For instance,
Brewster et al. [31] described a number of methods to evaluate the ‘fit’ between
an ontology and the domain knowledge in the form of text corpora.
• The third approach is used for comparing ontologies using benchmarks includ-
ing other ontologies [164].
• The last approach relies on domain experts to assess how well an ontology meets
a set of predefined criteria [158].
Due to the complex nature of ontologies, evaluation approaches can also be
distinguished by the layers of an ontology (e.g. term, concept, relation) they evaluate
[202]. More specifically, evaluations can be performed to assess the (1) correctness at
the terminology layer, (2) coverage at the conceptual layer, (3) wellness at the
taxonomy layer, and (4) adequacy of the non-taxonomic relations.
The focus of evaluation at the terminology layer is to determine if the terms
used to identify domain-relevant concepts are included and correct. Some form of
lexical reference or benchmark is typically required for evaluation in this layer. Typ-
ical precision and recall measures from information retrieval are used together with
exact matching or edit distance [164] to determine performance at the terminology
layer. The lexical precision and recall reflect how well the extracted terms cover
the target domain. Lexical Recall (LR) measures the number of relevant terms ex-
tracted (erelevant) divided by the total number of relevant terms in the benchmark
(brelevant), while Lexical Precision (LP) measures the number of relevant terms ex-
tracted (erelevant) divided by the total number of terms extracted (eall). LR and LP
are defined as [214]:
LP = erelevant / eall (2.1)
LR = erelevant / brelevant (2.2)
The precision and recall measures can also be combined to compute the corresponding
Fβ-score. The general formula for non-negative real β is:
Fβ = ((1 + β^2) × precision × recall) / (β^2 × precision + recall) (2.3)
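Equations 2.1 to 2.3 can be computed directly from sets of terms. The sketch below assumes the extracted and benchmark term lists are available as Python sets (the example terms are invented):

```python
def lexical_scores(extracted, benchmark, beta=1.0):
    """Lexical Precision, Lexical Recall and F-beta (Equations 2.1-2.3)."""
    relevant = extracted & benchmark            # e_relevant
    lp = len(relevant) / len(extracted)         # LP = e_relevant / e_all
    lr = len(relevant) / len(benchmark)         # LR = e_relevant / b_relevant
    f = ((1 + beta**2) * lp * lr / (beta**2 * lp + lr)) if lp + lr else 0.0
    return lp, lr, f

extracted = {"ontology", "term", "graph", "noise"}
benchmark = {"ontology", "term", "graph", "axiom"}
print(lexical_scores(extracted, benchmark))  # (0.75, 0.75, 0.75)
```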
Evaluation measures at the conceptual level are concerned with whether the de-
sired domain-relevant concepts are discovered or otherwise. Lexical Overlap (LO)
measures the intersection between the discovered concepts (Cd) and the recom-
mended concepts (Cm). LO is defined as:
LO = |Cd ∩ Cm| / |Cm| (2.4)
Ontological Improvement (OI) and Ontological Loss (OL) are two additional mea-
sures to account for newly discovered concepts that are absent from the benchmark,
and for concepts which exist in the benchmark but were not discovered, respectively.
They are defined as [214]:
OI = |Cd − Cm| / |Cm| (2.5)

OL = |Cm − Cd| / |Cm| (2.6)
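Like the terminology-layer measures, LO, OI and OL (Equations 2.4 to 2.6) reduce to set operations over the discovered and recommended concept sets (the example concepts are invented):

```python
def concept_scores(discovered, recommended):
    """Lexical Overlap, Ontological Improvement and Ontological Loss
    (Equations 2.4-2.6), given discovered (Cd) and recommended (Cm) sets."""
    m = len(recommended)
    lo = len(discovered & recommended) / m
    oi = len(discovered - recommended) / m
    ol = len(recommended - discovered) / m
    return lo, oi, ol

print(concept_scores({"hotel", "room", "spa"},
                     {"hotel", "room", "tour", "guide"}))  # (0.5, 0.25, 0.5)
```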
Evaluation at the taxonomy layer is more complicated. Performance measures
for the taxonomy layer are typically divided into local and global [60]. The similarity
of the concepts’ positions in the learned taxonomy and in the benchmark is used
to compute the local measure. The global measure is then derived by averaging
the local scores for all concept pairs. One of the few measures for the taxonomy
layer is the Taxonomic Overlap (TO) [164]. The computation of the global similarity
between two taxonomies begins with the local overlap of their individual terms. The
semantic cotopy, the set of all super- and sub-concepts, of a term varies depending
on the taxonomy. The local similarity between two taxonomies given a particular
term is determined based on the overlap of the term’s semantic cotopy. The global
taxonomic overlap is then defined as the average of the local overlaps of all the
terms in the two taxonomies. The same idea can be applied to compare the
adequacy of non-taxonomic relations.
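A minimal sketch of the semantic cotopy idea, with taxonomies represented as child-to-parent maps. The Jaccard coefficient is used here as the local overlap measure, which is one common instantiation rather than the exact formulation of [164], and the toy taxonomies are invented:

```python
def supers(term, parent):
    """All super-concepts of term in a child -> parent taxonomy map."""
    out, x = set(), term
    while x in parent:
        x = parent[x]
        out.add(x)
    return out

def cotopy(term, parent):
    """Semantic cotopy: the term plus all of its super- and sub-concepts."""
    subs = {t for t in parent if term in supers(t, parent)}
    return {term} | supers(term, parent) | subs

def taxonomic_overlap(parent1, parent2):
    """Average local (Jaccard) overlap of cotopies over shared terms."""
    terms = (set(parent1) | set(parent1.values())) & \
            (set(parent2) | set(parent2.values()))
    local = [len(cotopy(t, parent1) & cotopy(t, parent2)) /
             len(cotopy(t, parent1) | cotopy(t, parent2)) for t in terms]
    return sum(local) / len(local)

learned = {"cat": "feline", "feline": "mammal"}
benchmark = {"cat": "mammal"}
print(round(taxonomic_overlap(learned, benchmark), 2))  # 0.67
```

The learned taxonomy's extra intermediate concept ("feline") lowers every local overlap, so the global score falls below 1 even though no relation is outright wrong.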
2.3 Existing Ontology Learning Systems
Before looking into some of the prominent systems and recent advances in on-
tology learning, a recap of three previous independent surveys is conducted. The
first is a report by the OntoWeb Consortium [90], a body funded by the Information
Society Technologies Programme of the Commission of the European Communi-
ties. This survey listed 36 approaches for ontology learning from text. Some of the
important findings presented by this review paper are:
• There is no detailed methodology that guides the ontology learning process
from text.
• There is no fully automated system for ontology learning. Some of the systems
act as tools to assist in the acquisition of lexical-semantic knowledge, while
others help to extract concepts and relations from annotated corpora with the
involvement of users.
• There is no general approach for evaluating the accuracy of ontology learning,
and for comparing the results produced by different systems.
The second survey, released around the same time as the OntoWeb Consortium
survey, was performed by Shamsfard & Barforoush [226]. The authors claimed to
have studied over fifty different approaches before selecting and including seven
prominent ones in their survey. The main focus of the review was to introduce a
framework for comparing ontology learning approaches. The approaches included in
the review merely served as test cases to be fitted into the framework. Consequently,
the review provided an extensive coverage of the state-of-the-art of the relevant
techniques but was limited in terms of discussions on the underlying problems and
future outlook. The review arrived at the following list of problems:
• Much work has been conducted on discovering taxonomic relations, while non-
taxonomic relations were given less attention.
• Research into axiom learning was nearly unexplored.
• The focus of most research is on building domain ontologies. Most of the tech-
niques were designed to make heavy use of domain-specific patterns and static
background knowledge, with little regard to the portability of the systems
across different domains.
• Current ontology learning systems are evaluated within the confinement of
their domains. Finding a formal, standard method to evaluate ontology learn-
ing systems remains an open problem.
• Most systems are either semi-automated or tools for supporting domain ex-
perts in curating ontologies. Complete automation and elimination of user
involvement requires more research.
Lastly, Ding & Foo [62] presented a survey of 12 major ontology learning projects.
The authors wrapped up their survey with the following findings:
• Input data are mostly structured. Learning from free texts remains within the
realm of research.
• The task of discovering relations is very complex and a difficult problem to
solve. It has turned out to be the main impediment to the progress of ontology
learning.
• The techniques for discovering concepts have reached a certain level of matu-
rity.
A closer look into the three survey papers revealed a consensus on several aspects of
ontology learning that required more work. These conclusions are in fact in line with
the findings of our literature review in the following Sections 2.3.1 and 2.3.2. These
conclusions are (1) fully automated ontology learning is still in the realm of research,
(2) current approaches are heavily dependent on static background knowledge, and
may face difficulty in porting across different domains and languages, (3) there is no
common evaluation platform for ontology learning, and (4) there is a lack of research
on discovering relations. The validity of some of these conclusions will become more
evident as we look into several prominent systems and recent advances in ontology
learning in the following two sections.
2.3.1 Prominent Ontology Learning Systems
A summary of the techniques used by five prominent ontology learning systems,
and the evaluation of these techniques are provided in this section.
OntoLearn
OntoLearn [178, 182, 259, 260], together with Consys (for ontology validation by
experts) and SymOntoX (for updating and managing ontology by experts) are part
of a project for developing an interoperable infrastructure for small and medium
enterprises in the tourism sector under the Federated European Tourism Informa-
tion System1 (FETISH). OntoLearn employs both linguistics and statistics-based
techniques in four major tasks to discover terms, concepts and taxonomic relations.
• Preprocess texts and extract terms: Domain and general corpora are first
processed using part-of-speech tagging and sentence parsing tools to produce
syntactic structures including noun phrases and prepositional phrases. For
relevance analysis, the approach adopts two metrics known as Domain Rel-
evance (DR) and Domain Consensus (DC). Domain relevance measures the
specificity of term t with respect to the target domain Dk through comparative
analysis across a list of predefined domains D1, ..., Dn. The measure is defined
as
DR(t, Dk) = P(t|Dk) / Σi=1...n P(t|Di)

where P(t|Dk) and P(t|Di) are estimated as ft,k / Σt∈Dk ft,k and ft,i / Σt∈Di ft,i,
respectively. ft,k and ft,i are the frequencies of term t in domain Dk and Di,
respectively. Domain consensus, on the other hand, is used to measure the
appearance of a term in a single document as compared to the overall occur-
rence in the target domain. The domain consensus of a term t in domain Dk
is an entropy defined as
DC(t, Dk) = Σd∈Dk P(t|d) log(1 / P(t|d))
where P (t|d) is the probability of encountering term t in document d of domain
Dk.
• Form concepts: After the list of relevant terms has been identified, concepts
and glossary from WordNet are employed for associating the terms to existing
concepts and to provide definitions. The authors refer to this process as seman-
tic interpretation. If multi-word terms are involved, the approach evaluates all
possible sense combinations by intersecting and weighting common semantic
patterns in the glossary until it selects the best sense combinations.
• Construct hierarchy: Once semantic interpretation has been performed on the
terms to form concepts, taxonomic relations are discovered using hypernyms
from WordNet to organise the concepts into domain concept trees.
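The DR and DC measures introduced in OntoLearn's term extraction step can be sketched from raw term frequencies. The domain and per-document counts below are invented for illustration:

```python
import math

# Term frequencies per domain: freq[domain][term] (invented counts).
freq = {"tourism": {"hotel": 40, "room": 25, "time": 10},
        "finance": {"bond": 30, "time": 12, "rate": 28}}

def domain_relevance(term, target, freq):
    """DR: P(t|Dk) normalised over all candidate domains."""
    def p(t, d):
        return freq[d].get(t, 0) / sum(freq[d].values())
    denom = sum(p(term, d) for d in freq)
    return p(term, target) / denom if denom else 0.0

def domain_consensus(doc_freqs):
    """DC: entropy of P(t|d) over the documents of the target domain.
    `doc_freqs` maps document id -> frequency of the term in it."""
    total = sum(doc_freqs.values())
    probs = [f / total for f in doc_freqs.values() if f]
    return sum(p * math.log(1 / p) for p in probs)

print(domain_relevance("hotel", "tourism", freq))  # 1.0: tourism-only term
print(domain_relevance("time", "tourism", freq))   # < 1: occurs in both domains
```

A term spread evenly across many documents of the target domain maximises DC, while a term concentrated in a single document scores near zero, which is the consensus intuition behind the entropy.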
1 More information is available via http://sourceforge.net/projects/fetishproj/. Last accessed 25 May 2009.
An evaluation of the term extraction technique was performed using the F-
measure. A tourism corpus was manually constructed from the Web containing
about 200,000 words. The evaluation was done by manually inspecting 6,000 of the
14,383 candidate terms, marking all the terms judged as good domain terms,
and comparing the obtained list with the list of terms automatically filtered by the
system. A precision of 85.42% and recall of 52.74% were achieved.
Text-to-Onto
Text-to-Onto [51, 162, 163, 165] is a semi-automated system that is part of an
ontology management infrastructure called KAON2. KAON is a comprehensive tool
suite for ontology creation and management. The authors claimed that the approach
has been applied to the tourism and insurance sector, but no further information
was presented. Instead, ontologies3 for some toy domains4 have been constructed
using this approach. Text-to-Onto employs both linguistics and statistics-based
techniques in six major tasks to discover terms, concepts, taxonomic relations and
non-taxonomic relations.
• Preprocess texts and extract terms: Plain text extraction is performed to ex-
tract plain domain texts from semi-structured sources (i.e. HTML documents)
and other formats (e.g. PDF documents). Abbreviation expansion is per-
formed on the plain texts using rules and dictionaries to replace abbreviations
and acronyms. Part-of-speech tagging and sentence parsing are performed on
the preprocessed texts to produce syntactic structures and dependencies. Syn-
tactic structure analysis is performed using weighted finite state transducers
to identify important noun phrases as terms. These natural language process-
ing tools are provided by a system called Saarbruecken Message Extraction
System (SMES) [184].
• Form concepts: Concepts from domain lexicon are required to assign new
terms to predefined concepts. Unlike other approaches that employ general
background knowledge such as WordNet, the lexicon adopted by Text-to-Onto
is domain-specific, containing over 120,000 terms. Each term is associated
with concepts available in a concept taxonomy. Other techniques for concept
2 More information is available via http://kaon.semanticweb.org/. Last accessed 25 May 2009.
3 The ontologies can be downloaded from http://kaon.semanticweb.org/ontologies. Last accessed 25 May 2009.
4 The term toy domain is in wide use in the research community to describe work in extremely restricted domains.
formation are also performed, such as the use of co-occurrence analysis, but no
additional information was provided.
• Construct hierarchy: Once the concepts have been formed, taxonomic relations
are discovered by exploiting the hypernyms from WordNet. Lexico-syntactic
patterns are also employed to identify hypernymy relations in the texts. The
authors refer to these hypernyms as oracle, denoted by H. The projection
H(t) will return a set of tuples (x, y) where x is a hypernym for term t and y
is the number of times the algorithm has found evidence for it. Using cosine
measure for similarity and the oracle, a bottom-up hierarchical clustering is
carried out with a list T of n terms as input. When given two terms which
are similar according to the cosine measure, the algorithm works by ordering
them as sub-concepts if one is a hypernym of the other. If the previous case
does not apply, the most frequent common hypernym h is selected to create a
new concept to accommodate both terms as siblings.
• Discover non-taxonomic relations and label non-taxonomic relations: For non-
taxonomic relations extraction, association rules together with two user-defined
thresholds (i.e. confidence, support) are employed to determine associations
between concepts at the right level of abstraction. Typically, users start with
low support and confidence to explore general relations, and later increase
the values to explore more specific relations. User participation is required to
validate and label the non-taxonomic relations.
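The support/confidence mechanism can be sketched as follows, with sentences acting as transactions of co-occurring concepts. The concept pairs, counts and thresholds below are invented:

```python
from itertools import combinations

# Each "transaction" is the set of concepts co-occurring in one sentence.
transactions = [{"hotel", "pool"}, {"hotel", "pool"},
                {"hotel", "bar"}, {"pool", "bar"}]

def association_rules(transactions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples for
    concept pairs exceeding both user-defined thresholds."""
    n = len(transactions)
    rules = []
    concepts = sorted(set().union(*transactions))
    for a, b in combinations(concepts, 2):
        support = sum(1 for t in transactions if {a, b} <= t) / n
        for x, y in ((a, b), (b, a)):
            freq_x = sum(1 for t in transactions if x in t) / n
            confidence = support / freq_x if freq_x else 0.0
            if support >= min_support and confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return rules

# Low thresholds surface general relations; raising them keeps specific ones.
for rule in association_rules(transactions, 0.5, 0.6):
    print(rule)
```

As in Text-to-Onto, the discovered associations are unlabelled; a user still has to validate each pair and name the relation.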
An evaluation of the relation discovery technique was performed using a measure
called the Generic Relations Learning Accuracy (RLA). Given a set of discovered
relations D, precision is defined as |D ∩ R|/|D| and recall as |D ∩ R|/|R| where R
is the non-taxonomic relations prepared by domain experts. RLA is a measure to
capture intuitive notions for relation matches such as utterly wrong, rather bad, near
miss and direct hit. RLA is the averaged accuracy that the instances of discovered
relations match against their best counterpart from manually-curated gold-standard.
As the learning algorithm is controlled by support and confidence parameters, the
evaluation is done by varying the support and the confidence values. When both the
support and the confidence thresholds are set to 0, 8,058 relations were produced
with an RLA of 0.51. Both the number of relations and the recall decrease with
growing support and confidence. Precision increases at first but drops when so few
relations are discovered that almost none is a direct hit. The best RLA of 0.67 is
achieved with support set to 0.04 and confidence set to 0.01.
ASIUM
ASIUM [71, 70, 69] is a semi-automated ontology learning system that is part of
an information extraction infrastructure called INTEX, by the Laboratoire d’Automatique
Documentaire et Linguistique de l’Universite de Paris 7. The aim of this approach
is to learn semantic knowledge from texts and use the knowledge for the expansion
(i.e. portability from one domain to the other) of INTEX. The authors mentioned
that the system has been tested by Dassault Aviation, and has been applied to a toy
domain using cooking recipe corpora in French. ASIUM employs both linguistics
and statistics-based techniques to carry out five tasks to discover terms, concepts
and taxonomic relations.
• Preprocess texts and discover subcategorisation frames: Sentence parsing is
applied on the input text using functionalities provided by a sentence parser
called SYLEX [54]. SYLEX produces all interpretations of parsed sentences in-
cluding attachments of noun phrases to verbs and clauses. Syntactic structure
and dependency analysis is performed to extract instantiated subcategorisation
frames in the form of <verb><syntactic role|preposition:head noun>∗
where the wildcard character ∗ indicates the possibility of multiple occurrences.
• Extract terms and form concepts: The nouns in the arguments of the sub-
categorisation frames extracted from the previous step are gathered to form
basic classes based on the assumption “head words occurring after the same,
different prepositions (or with the same, different syntactic roles), and with the
same, different verbs represent the same concept” [68]. To illustrate, suppose
that we have the nouns "ballpoint pen", "pencil" and "fountain pen" occurring
in different clauses as adjunct of the verb “to write” after the preposition
“with”. At the same time, these nouns are the direct object of the verb “to
purchase”. From the assumption, these nouns are thus considered as variants
representing the same concept.
• Construct hierarchy: The basic classes from the previous task are successively
aggregated to form concepts of the ontology and reveal the taxonomic relations
using clustering. Distance between all pairs of basic classes is computed and
two basic classes are only aggregated if the distance is less than the threshold
set by the user. On the one hand, two classes containing the same words with
the same frequencies have a distance of 0. On the other hand, a pair of classes
without a single common word has a distance of 1. The clustering
algorithm works bottom-up and performs first-best using basic classes as input
and builds the ontology level by level. User participation is required to validate
each new cluster before it can be aggregated to a concept.
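The distance boundary cases above can be sketched as follows. The frequency weighting used for partial overlaps is a simplified stand-in; ASIUM's actual measure differs in its details:

```python
def class_distance(class1, class2):
    """Distance between two basic classes, each a word -> frequency map:
    0 for identical distributions, 1 for classes with no common word.
    The weighting of partial overlaps is a simplifying assumption."""
    common = sum(min(class1[w], class2[w]) for w in set(class1) & set(class2))
    total = sum(class1.values()) + sum(class2.values())
    return 1 - 2 * common / total

writing = {"ballpoint pen": 3, "pencil": 2, "fountain pen": 1}
print(class_distance(writing, dict(writing)))  # 0.0 -> aggregate the classes
print(class_distance(writing, {"brush": 4}))   # 1.0 -> keep the classes apart
```

In the bottom-up clustering, two basic classes would be aggregated only when this distance falls below the user-set threshold.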
An evaluation of the term extraction technique was performed using the precision
measure. The evaluation uses texts from the French journal Le Monde that have
been manually filtered to ensure the presence of terrorist event descriptions. The
results were evaluated by two domain experts who were not aware of the ontology
building process using the following indicators: OK if extracted information is cor-
rect, FALSE if extracted information is incorrect, NONE if there were no extracted
information, and FALSE for all other cases. Two precision values are computed,
namely, precision1, which is the ratio between OK and FALSE, and precision2, which
is the same as precision1 but also takes NONE into consideration. Precision1 and
precision2 have values of 86% and 89%, respectively.
TextStorm/Clouds
TextStorm/Clouds [191, 198] is a semi-automated ontology learning system that
is part of an idea sharing and generation system called Dr. Divago [197]. The
aim of this approach is to build and refine domain ontology for use in Dr. Divago
for searching resources in a multi-domain environment to generate musical pieces
or drawings. No information was provided on the availability of any real-world
applications, nor testing on toy domains. TextStorm/Clouds employs logic and
linguistics-based techniques to carry out six tasks to discover terms, taxonomic
relations, non-taxonomic relations and axioms.
• Preprocess texts and extract terms: The part-of-speech information in Word-
Net is used to annotate the input text. Later, syntactic structure and depen-
dency analysis is performed using an augmented grammar to extract syntactic
structures in the form of binary predicates. The Prolog-like binary predicates
represent relations between two terms. Two types of binary predicates are
considered. The first type captures terms in the form of subject and object
connected by a main verb. The second type captures the property of com-
pound nouns, usually in the form of modifiers. For example, the sentence “Ze-
bra eat green grass” will result in two binary predicates namely eat(Zebra,
grass) and property(grass, green). When working with dependent sen-
tences, finding the concepts may not be straightforward and this approach
performs anaphora resolution to resolve ambiguities. The anaphora resolution
uses a history list of discourse entities generated from preceding sentences
[6]. In the presence of an anaphora, the most recent entities are given higher
priority.
• Construct hierarchy, discover non-taxonomic relations and label non-taxonomic
relations: Next, the binary predicates are employed to gradually aggregate
terms and relations to an existing ontology with user participation. Hy-
pernymy relations appear in binary predicates in the form of is-a(X,Y)
while part-of(X,Y) and contain(X,Y) provide good indicators for meronyms.
Attribute-value relations are obtainable from the predicates in the form of
property(X,Y). During the aggregation process, users may be required to in-
troduce new predicates to connect certain terms and relations to the ontology.
For example, in order to attach the predicate is-a(predator, animal) to
an ontology with the root node living entity, the user will have to introduce
is-a(animal, living entity).
• Extract axioms: The approach employs inductive logic programming to learn
regularities by observing the recurrent concepts and relations in the predicates.
For instance, the approach using the extracted predicates below
1: is-a(panther, carnivore)
2: eat(panther, zebra)
3: eat(panther, gazelle)
4: eat(zebra, grass)
5: is-a(zebra,herbivore)
6: eat(gazelle, grass)
7: is-a(gazelle,herbivore)
will arrive at the conclusions that
1: eat(A, zebra):- is-a(A, carnivore)
2: eat(A, grass):- is-a(A, herbivore)
These axioms describe relations between concepts in terms of its context (i.e.
the set of neighbourhood connections that the arguments have).
Using the accuracy measure, the performance of the binary predicate extraction
task was evaluated to determine if the relations hold between the corresponding
concepts. A total of 21 articles from the scientific domain were collected and analysed
by the system. Domain experts then determined the coherence of the predicates and
their accuracy with respect to the corresponding input text. The authors reported
an average accuracy of 52%.
SYNDIKATE
SYNDIKATE [96, 95] is a stand-alone automated ontology learning system. The
authors have applied this approach in two toy domains, namely, information tech-
nology and medicine. However, no information was provided on the availability of
any real-world applications. SYNDIKATE employs purely linguistics-based tech-
niques to carry out five tasks to discover terms, concepts, taxonomic relations and
non-taxonomic relations.
• Extract terms: Syntactic structure and dependency analysis is performed on
the input text using a lexicalised dependency grammar to capture binary va-
lency5 constraints between a syntactic head (e.g. noun) and possible modifiers
(e.g. determiners, adjectives). In order to establish a dependency relation be-
tween a head and a modifier, the term order, morpho-syntactic features com-
patibility and semantic criteria have to be met. Anaphora resolution based on
the centering model is included to handle pronouns.
• Form concepts, construct hierarchy, discover non-taxonomic relations and la-
bel non-taxonomic relations: Using predefined semantic templates, each term
in the syntactic dependency graph is associated with a concept in the domain
knowledge and at the same time, used to instantiate the text knowledge base.
The text knowledge base is essentially an annotated representation of the in-
put texts. For example, the term “hard disk” in the graph is associated with
the concept HARD DISK in domain knowledge, and at the same time, an in-
stance called HARD DISK3 will be created in the text knowledge base. The
approach then tries to find all relational links between conceptual correlates
of two words in the subgraph if both grammatical and conceptual constraints
are fulfilled. The linkage may either be constrained by dependency relations,
by intervening lexical materials, or by conceptual compatibility between the
concepts involved. In the case where unknown words occur, semantic inter-
pretation of the dependency graph involving unknown lexical items in the text
knowledge base is employed to derive concept hypothesis. The structural pat-
terns of consistency, mutual justification and analogy relative to the already
5 Valency refers to the capacity of a verb to take a specific number and type of arguments (noun phrase positions).
available concept descriptions in the text knowledge base will be used as initial
evidence to create linguistic and conceptual quality labels. An inference en-
gine is then used to estimate the overall credibility of the concept hypotheses
by taking into account the quality labels.
An evaluation using the precision, recall and accuracy measures was conducted
to assess the concepts and relations extracted by this system. The use of semantic
interpretation to discover the relations between conceptual correlates yielded 57%
recall and 97% precision, and 31% recall and 94% precision, for medicine and infor-
mation technology texts, respectively. As for the formation of concepts, an accuracy
of 87% was achieved. The authors also presented the performance of other aspects
of the system. For example, sentence parsing in the system exhibits a linear time
complexity while a third-party parser runs in exponential time complexity. This
behaviour was caused by the latter’s ability to cope with ungrammatical input. The
incompleteness of the system’s parser results in a 10% loss of structural information
as compared to the complete third-party parser.
2.3.2 Recent Advances in Ontology Learning
Since the publication of the three survey papers [62, 226, 90], the research activ-
ities within the ontology learning community have been mainly focusing on (1) the
advancement of relation acquisition techniques, (2) the automatic labelling of con-
cept and relation, (3) the use of structured and unstructured Web data for relation
acquisition, and (4) the diversification of evidence for term recognition.
On the advancement of relation acquisition techniques, Specia & Motta [237]
presented an approach for extracting semantic relations between pairs of entities
from texts. The approach makes use of a lemmatiser, syntactic parser, part-of-speech
tagger, and word sense disambiguation models for language processing. New entities
are recognised using a named-entity recognition system. The approach also relies
on a domain ontology, a knowledge base, and lexical databases. Extracted entities
that exist in the knowledge base are semantically annotated with their properties.
Ciaramita et al. [48] employ syntactic dependencies as potential relations. The
dependency paths are treated as bi-grams, and scored with statistical measures of
correlation. At the same time, the arguments of the relations can be generalised to
obtain abstract concepts using algorithms for Selectional Restrictions Learning [208].
Snow et al. [234, 235] also presented an approach that employs the dependency paths extracted from parse trees. The approach is trained using sets of text containing known hypernym pairs. The approach then automatically discovers
34 Chapter 2. Background
useful dependency paths that can be applied to new corpora for identifying new
hypernyms.
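The core idea behind such path-based discovery can be illustrated with a toy sketch. This is not Snow et al.'s actual implementation; the dependency-path strings, training pairs and thresholds below are invented for illustration. The sketch scores each path by how often it connects known hypernym pairs, then uses the high-scoring paths to propose new pairs:

```python
from collections import Counter

# Hypothetical training data: (hyponym, hypernym, dependency path)
observations = [
    ("cat", "animal", "X/NN <nsubj< is >attr> Y/NN"),
    ("dog", "animal", "X/NN <nsubj< is >attr> Y/NN"),
    ("cat", "animal", "Y/NN >prep_such_as> X/NN"),
    ("paris", "france", "X/NN >prep_in> Y/NN"),   # not a hypernym pair
]
known_hypernyms = {("cat", "animal"), ("dog", "animal")}

# Count how often each path links a known hypernym pair
path_scores = Counter()
for hypo, hyper, path in observations:
    if (hypo, hyper) in known_hypernyms:
        path_scores[path] += 1

# Paths seen with at least two known pairs are kept as hypernym indicators
useful_paths = {p for p, c in path_scores.items() if c >= 2}

def propose(pairs_with_paths):
    """Return candidate hypernym pairs whose connecting path is useful."""
    return [(x, y) for x, y, p in pairs_with_paths if p in useful_paths]

# Applying the learnt paths to an unseen pair from a new corpus
new = propose([("sparrow", "bird", "X/NN <nsubj< is >attr> Y/NN")])
```

In the real system, the learnt paths are features for a supervised classifier rather than a simple frequency cut-off; the sketch only conveys the training-then-application structure.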
On automatic concept and relation labelling, Kavalec & Svatek [123] studied the feasibility of label identification for relations using a semantically-tagged corpus and other background knowledge. The authors suggested that verbs, identified through part-of-speech tagging, can be viewed as a rough approximation of relation labels. With the help of a semantically-tagged corpus to resolve the verbs to the correct word sense, the quality of relation labelling may be increased. In addition, the authors suggested that abstract verbs identified through generalisation via WordNet can be useful labels. In her PhD research, Jones [119] proposed a semi-automated technique for identifying concepts and a simple technique for labelling concepts using user-defined seed words. This research was carried out exclusively
using small lists of words as input. In another PhD research by Rosario [211], the
author proposed the use of statistical semantic parsing to extract concepts and rela-
tions from bioscience text. In addition, the research presented the use of statistical
machine learning techniques to build a knowledge representation of the concepts.
The concepts and relations extracted by the proposed approach are intended to be
combined by some other systems to produce larger propositions which can then be
used in areas such as abductive reasoning or inductive logic programming. This
approach has only been tested with a small amount of data from toy domains.
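The verbs-as-labels idea above can be made concrete with a minimal sketch. This is not Kavalec & Svatek's system; the tagged sentence and the heuristic of taking the first verb between two concept mentions are illustrative assumptions only:

```python
# Illustrative sketch: given a POS-tagged sentence and two concept
# mentions, take the first verb occurring between them as a rough
# approximation of the relation label.

def label_relation(tagged_tokens, concept_a, concept_b):
    """tagged_tokens: list of (word, pos) pairs; returns a verb or None."""
    words = [w for w, _ in tagged_tokens]
    try:
        i, j = words.index(concept_a), words.index(concept_b)
    except ValueError:
        return None  # one of the concepts is not in the sentence
    between = tagged_tokens[min(i, j) + 1:max(i, j)]
    verbs = [w for w, pos in between if pos.startswith("VB")]
    return verbs[0] if verbs else None

sent = [("aspirin", "NN"), ("treats", "VBZ"), ("headaches", "NNS")]
label = label_relation(sent, "aspirin", "headaches")
```

Word sense disambiguation and generalisation via WordNet, as suggested by the authors, would then refine such raw verb labels.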
On the use of Web data for relation acquisition, Sombatsrisomboon [236] pro-
posed a simple 3-step technique for discovering taxonomic relations (i.e. hyper-
nym/hyponym) between pairs of terms using search engines. Search engine queries
are first constructed using the term pairs and patterns such as X is a/an Y. The
webpages provided by search engines are then gathered to create a small corpus.
Sentence parsing and syntactic structure analysis are then performed on the corpus to discover taxonomic relations between the terms. Such use of Web data redundancy
and patterns can also be extended to discover non-taxonomic relations. Sanchez
& Moreno [217] proposed methods for discovering non-taxonomic relations using
Web data. The authors developed a technique for learning domain patterns using
domain-relevant verb phrases extracted from webpages provided by search engines.
These domain patterns are then used to extract and label non-taxonomic relations
using linguistic and statistical analysis. There is also an increasing interest in the
use of structured Web data such as Wikipedia for relation acquisition. Pei et al.
[196] proposed an approach for constructing ontologies using Wikipedia. The approach uses a two-step technique, namely, name mapping and logic-based mapping, to deduce the type of relations between concepts in Wikipedia. Similarly,
Liu et al. [154] developed a technique called Catriple for automatically extract-
ing triples using Wikipedia’s categorical system. The approach focuses on category
pairs containing both an explicit property and an explicit value (e.g. “Category:Songs by artist”-“Category:The Beatles songs”, where “artist” is the property and “The Beatles” is the value), and category pairs containing an explicit value but an implicit property (e.g. “Category:Rock songs”-“Category:British rock songs”, where “British” is a value with no property). Sentence parsers and syntactic rules are used to extract the
explicit properties and values from the category names. Weber & Buitelaar [267]
proposed a system called Information System for Ontology Learning and Domain Exploration (ISOLDE) that derives domain ontologies using manually-curated text corpora, a general-purpose named-entity tagger, and structured data on the Web (i.e. Wikipedia, Wiktionary and a German online dictionary known as DWDS).
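The pattern-based querying behind the 3-step technique can be sketched as follows. The search step is stubbed out with a tiny in-memory "corpus"; a real system would send the instantiated patterns as search engine queries and harvest the returned webpages. The patterns and example documents are assumptions for illustration:

```python
# Lexico-syntactic patterns for taxonomic (hypernym/hyponym) evidence
PATTERNS = ["{x} is a {y}", "{x} is an {y}", "{y} such as {x}"]

# Stand-in for webpages gathered from search engine results
corpus = [
    "a wiki is a website that allows collaborative editing",
    "websites such as a wiki support collaboration",
]

def taxonomic_evidence(x, y, documents):
    """Count pattern matches suggesting x is a hyponym of y."""
    hits = 0
    for doc in documents:
        for pattern in PATTERNS:
            if pattern.format(x=x, y=y) in doc:
                hits += 1
    return hits

evidence = taxonomic_evidence("wiki", "website", corpus)
```

The redundancy of the Web is what makes such naive substring matching viable: across many retrieved pages, genuine taxonomic pairs accumulate far more pattern hits than spurious ones.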
On the diversification of evidence for term recognition, Sclano & Velardi [222] de-
veloped a system called TermExtractor for identifying relevant terms in two steps.
TermExtractor uses a sentence parser to parse texts and extract syntactic struc-
tures such as noun compounds, and ADJ-N and N-PREP-N sequences. The list of
term candidates is then ranked and filtered using a combination of measures for
realising different evidence, namely, Domain Pertinence (DP), Domain Consensus
[103], CREAM [101], MnM8 [258], and OntoAnnotate [241].
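Of the evidence measures named above, Domain Pertinence can be sketched briefly. The formulation below (target-domain frequency normalised by the maximum frequency of the term across contrastive domains) is the commonly cited definition and is an assumption here, not a quotation from TermExtractor; the frequencies are invented:

```python
def domain_pertinence(term, target_freq, contrastive_freqs):
    """DP: freq of term in the target domain corpus, divided by the
    maximum of its frequencies across all corpora (contrastive dict
    maps domain name -> freq of the term in that domain)."""
    denom = max(list(contrastive_freqs.values()) + [target_freq])
    return target_freq / denom if denom else 0.0

# A term frequent in the target domain but rare elsewhere scores high...
dp_specific = domain_pertinence("ontology", 120, {"sports": 1, "finance": 3})
# ...while a term common everywhere is penalised.
dp_generic = domain_pertinence("system", 500, {"sports": 800, "finance": 600})
```

Candidates are ranked by a weighted combination of such measures before filtering.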
In addition to the above-mentioned research areas, ontologies have also been
deployed in applications across different domains. One of the most successful application areas of ontologies is bioinformatics. Bioinformatics has thrived on the advances in ontology learning techniques and the availability of manually-curated
terminologies and ontologies (e.g. Unified Medical Language System [151], Gene
Ontology [8] and other small domain ontologies at www.obofoundry.org). The
computable knowledge in ontologies is also proving to be a valuable resource for
reasoning and knowledge discovery in biomedical decision support systems. For
example, the inference that a disease of the myocardium is a heart problem is possible
using the subsumption relations in an ontology of disease classification based on
anatomic locations [52]. In addition, terminologies and ontologies are commonly
used for annotating biological datasets, biomedical literature and patient records,
and improving the access and retrieval of biomedical information [27]. For instance,
Baker et al. [13] presented a document query and delivery system for the field of
lipidomics9. The main aim of the system is to overcome the navigation challenges
that hinder the translation of scientific literature into actionable knowledge. The system allows users to access tagged documents containing lipid, protein and disease names using a description logic-based query capability that comes with the semi-automatically created lipid ontology. The lipid ontology contains a total of 672
concepts. The ontology is the result of merging existing biological terminologies,
knowledge from domain experts, and output from a customised text mining system
that recognises lipid-specific nomenclature.
Another visible application of ontologies is in the manufacturing industry. Cho
et al. [43] looked at the current approach for locating and comparing parts informa-
tion in an e-procurement setting. At present, buyers are faced with the challenge
of accessing and navigating through different parts libraries from multiple suppliers
using different search procedures. The authors introduced the use of the “Parts
Library Concept Ontology” to integrate heterogeneous parts libraries, enabling the consistent identification and systematic structuring of domain concepts. Lemaignan et al. [143] presented a proposal for a manufacturing upper ontology. The authors stressed the importance of ontologies as a common way of describing manufacturing processes for product lifecycle management. The use of ontologies ensures
the uniformity in assertions throughout a product’s lifecycle, and the seamless flow
8http://projects.kmi.open.ac.uk/akt/MnM/index.html
9Lipidomics is the study of pathways and networks of cellular lipids in biological systems.
of data between heterogeneous manufacturing environments. For instance, assume
that we have these relations in an ontology:
isMadeOf(part,rawMaterial)
isA(aluminium,rawMaterial)
isA(drilling,operation)
isMachinedBy(rawMaterial,operation)
and the drilling operation has the attributes drillSpeed and drillDiameter. Using
these elements, we can easily specify rules such as if isMachinedBy(aluminium,drilling)
and the drillDiameter is less than 5mm, then drillSpeed should be 3000 rpm
[143]. This ontology allows a uniform interpretation of assertions such as
isMadeOf(part,aluminium) anywhere along the product lifecycle, thus facilitating
the inference of standard information such as the drill speed.
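The rule above can be operationalised with a small sketch. The predicate names follow the text; the rule engine, the attribute value and the fact set are illustrative assumptions only:

```python
# Toy fact base mirroring the relations given in the text
facts = {
    ("isMadeOf", "part", "aluminium"),
    ("isA", "aluminium", "rawMaterial"),
    ("isA", "drilling", "operation"),
    ("isMachinedBy", "aluminium", "drilling"),
}
attributes = {"drilling": {"drillDiameter": 4}}  # mm, hypothetical value

def infer_drill_speed(material, operation):
    """If the material is machined by a drilling operation with a
    diameter below 5 mm, return the drill speed specified by the rule."""
    if ("isMachinedBy", material, operation) in facts \
            and ("isA", operation, "operation") in facts \
            and attributes[operation]["drillDiameter"] < 5:
        return 3000  # rpm, as given by the rule in the text
    return None

speed = infer_drill_speed("aluminium", "drilling")
```

Because every party along the lifecycle shares the same predicate vocabulary, the same inference yields the same result wherever the assertion isMadeOf(part,aluminium) is encountered.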
2.5 Chapter Summary
In this chapter, an overview of ontologies and ontology learning from text was provided. In particular, we looked at the types of output, techniques and evaluation methods related to ontology learning. The differences between a heavyweight (i.e. formal) and a lightweight ontology were also explained. Several prominent ontology
learning systems, and some recent advances in the field were summarised. Finally,
some noticeable current applications were included to demonstrate the applicability of ontologies to a wide range of domains. The use of ontologies for real-world
applications in the area of bioinformatics and the manufacturing industry was high-
lighted.
Overall, it was concluded that the automatic and practical construction of full-
fledged formal ontologies from text across different domains is currently beyond
the reach of conventional systems. Many current ontology learning systems are
still struggling to achieve high-performance term recognition, let alone more com-
plex tasks (e.g. relation acquisition, axiom learning). An interesting point revealed
during the literature review is that most systems ignore the fact that the static background knowledge relied upon by their techniques is a scarce resource and may not have adequate size and coverage. In particular, all existing term recognition techniques rest on the false assumption that the required domain corpora will always be available. Only recently has there been growing interest in automatically constructing text corpora using Web data. However, the governing philosophy behind
these existing corpus construction techniques is inadequate for creating very large
high-quality text corpora. In regard to relation acquisition, existing techniques rely heavily on static background knowledge, especially semantic lexicons such as WordNet. While there is increasing interest in the use of dynamic Web data for relation
acquisition, more research work is still required. For instance, new techniques are
appearing every now and then that make use of Wikipedia for finding semantic re-
lations between two words. However, these techniques often leave out the details
on how to cope with words that do not appear in Wikipedia. Moreover, the use of
clustering techniques for acquiring semantic relations may appear less attractive due
to the complications in feature extraction and preparation. The literature review
also exposes the lack of treatment for data cleanliness during ontology learning. As
the use of Web data becomes more common, integrated techniques for removing noise from texts are becoming a necessity.
All in all, it is safe to conclude that there is currently no single system that sys-
tematically uses dynamic Web data to meet the requirements for every stage of the
ontology learning process. There are several key areas that require more attention,
namely, (1) integrated techniques for cleaning noisy text, (2) high-performance term
recognition techniques, (3) high-quality corpus construction for term recognition,
and (4) dynamic Web data for clustering and relation acquisition. Our proposed
ontology learning system is designed specifically to address these key areas. In the
subsequent six chapters (i.e. Chapter 3 to 8), details are provided on the design,
development and testing of novel techniques for the five phases (i.e. text prepro-
cessing, text processing, term recognition, corpus construction, relation acquisition)
of the proposed system.
CHAPTER 3
Text Preprocessing
Abstract
An increasing number of ontology learning systems are gearing towards the use of online sources such as company intranets and the World Wide Web. Despite this rise, little work can be found on the preprocessing and cleaning of noisy texts from online sources. This chapter presents an enhancement of the Integrated
Scoring for Spelling Error Correction, Abbreviation Expansion and Case Restoration
(ISSAC) technique. ISSAC is implemented as part of the text preprocessing phase
in an ontology learning system. New evaluations performed on the enhanced ISSAC
using 700 chat records reveal an improved accuracy of 98% as compared to 96.5%
and 71% based on the use of basic ISSAC and of Aspell, respectively.
3.1 Introduction
Ontologies are gaining applicability across a wide range of applications such as
information retrieval, knowledge acquisition and management, and the Semantic
Web. The manual construction and maintenance of ontologies was never a long-term
solution due to factors such as the high cost of expertise and the constant change in
knowledge. These factors have prompted an increasing effort in automatic and semi-
automatic learning of ontologies using texts from electronic sources. A particular
source of text that is becoming popular is the World Wide Web.
The quality of texts from online sources for ontology learning can vary anywhere
between noisy and clean. On the one hand, the quality of texts in the form of
blogs, emails and chat logs can be extremely poor. The sentences in noisy texts are
typically full of spelling errors, ad-hoc abbreviations and improper casing. On the
other hand, clean sources are typically prepared in conformance with certain standards, such as those of academia and journalism. Some common clean sources include news articles from online media sites, and scientific papers. Different text qualities require different treatments during the preprocessing phase, and noisy texts can be much more demanding.
An increasing number of approaches are gearing towards the use of online sources
0This chapter appeared in the Proceedings of the IJCAI Workshop on Analytics for Noisy
Unstructured Text Data (AND), Hyderabad, India, 2007, with the title “Enhanced Integrated
Scoring for Cleaning Dirty Texts”.
such as corporate intranets [126] and documents retrieved via search engines [51] for
different aspects of ontology learning. Despite such growth, only a small number of
researchers [165, 187] acknowledge the effect of text cleanliness on the quality of their
ontology learning output. With the prevalence of online sources, this “...annoying
phase of text cleaning...”[176] has become inevitable and ontology learning systems
can no longer ignore the issue of text cleanliness. An effort by Tang et al. [246]
showed that the accuracy of term extraction in text mining improved by 38-45%
(F1-measure) with the additional cleaning performed on the input texts (i.e. emails).
Integrated techniques for correcting spelling errors, abbreviations and improper
casing are becoming increasingly appealing as the boundaries between different er-
rors in online sources are blurred. Along the same line of thought, Clark [53] de-
fended that “...a unified tool is appropriate because of certain specific sorts of er-
rors”. To illustrate this idea, consider the error word “cta”. Do we immediately
take it as a spelling error and correct it as “cat”, or is there a problem with the
letter casing, which makes it a probable acronym? It is obvious that the problems
of spelling error, abbreviation and letter casing are inter-related to a certain extent.
The challenge of providing a highly accurate integrated technique for automatically
cleaning noisy text in ontology learning remains to be addressed.
In an effort to provide an integrated technique to solve spelling errors, ad-hoc
abbreviations and improper casing simultaneously, we have developed an Integrated
Scoring for Spelling Error Correction, Abbreviation Expansion and Case Restoration
(ISSAC)1 technique [273]. The basic ISSAC uses six weights from different sources
for automatically correcting spelling errors, expanding abbreviations and restoring improper casing. These include the original rank from the spell checker Aspell [9],
reuse factor, abbreviation factor, normalised edit distance, domain significance and
general significance. Despite the achievement of 96.5% in accuracy by the basic
ISSAC, several drawbacks have been identified that require additional work. In
this chapter, we present the enhancement of the basic ISSAC. New evaluations
performed on seven different sets of chat records yield an improved accuracy of 98%
as compared to 96.5% and 71% based on the use of basic ISSAC and of Aspell,
respectively.
In Section 2, we present a summary of work related to spelling error detection and
correction, abbreviation expansion, and other cleaning tasks in general. In Section 3,
1This foundation work on ISSAC appeared in the Proceedings of the 5th Australasian Con-
ference on Data Mining (AusDM), Sydney, Australia, 2006, with the title “Integrated Scoring for
Spelling Error Correction, Abbreviation Expansion and Case Restoration in Dirty Text”.
we summarise the basic ISSAC. In Section 4, we propose the enhancement strategies
for ISSAC. The evaluation results and discussions are presented in Section 5. We
summarise and conclude this chapter with future outlook in Section 6.
3.2 Related Work
Spelling error detection and correction is the task of recognising misspellings in
texts and providing suggestions for correcting the errors. For example, detecting “cta” as an error and suggesting that it be replaced with “cat”, “act” or “tac”. More information is usually required to select the correct replacement from
a list of suggestions. Two of the most studied classes of techniques are minimum
edit distance and similarity key. The idea of minimum edit distance techniques
began with Damerau [58] and Levenshtein [146]. Damerau-Levenshtein distance is
the minimal number of insertions, deletions, substitutions and transpositions needed
to transform one string into the other. For example, changing the word “wear” to
“beard” requires a minimum of two operations, namely, a substitution of ‘w’ with
‘b’, and an insertion of ‘d’. Many variants were developed subsequently such as
the algorithm by Wagner & Fischer [266]. The second class of techniques is the
similarity key. The main idea behind similarity key techniques is to map every
string into a key such that similarly spelt strings will have identical keys [135].
Hence, the key, computed for each spelling error, will act as a pointer to all similarly
spelt words (i.e. suggestions) in the dictionary. One of the earliest implementations is the SOUNDEX system [189]. SOUNDEX is a phonetic algorithm for indexing
words based on their pronunciation in English. SOUNDEX works by mapping a
word into a key consisting of its first letter followed by a sequence of numbers. For
example, SOUNDEX replaces the letter li ∈ {A, E, I, O, U, H, W, Y} with 0 and li ∈ {R} with 6, and hence, wear → w006 → w6 and ware → w060 → w6. Since
SOUNDEX, many improved variants were developed such as the Metaphone and
the Double-metaphone algorithm [199], Daitch-Mokotoff Soundex [138] for Eastern
European languages, and others [108]. One famous implementation that utilises the similarity key technique is Aspell [9]. Aspell is based on the Metaphone
algorithm and the near-miss strategy from its predecessor Ispell [134]. Aspell begins
by converting a misspelt word to its soundslike equivalent (i.e. metaphone) and then
finding all words that have a soundslike within one or two edit distances from the
original word’s soundslike2. These soundslike words are the basis of the suggestions
by Aspell.
2Source from http://aspell.net/man-html/Aspell-Suggestion-Strategy.html
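The minimum edit distance described above can be made concrete with a short implementation. The following is an illustrative sketch of the Damerau-Levenshtein distance (insertions, deletions, substitutions and transpositions); production spell checkers such as Aspell rely on optimised strategies rather than this direct dynamic programming:

```python
def damerau_levenshtein(a, b):
    """Minimal number of insertions, deletions, substitutions and
    adjacent transpositions needed to transform string a into b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

# "wear" -> "beard": substitute 'w' with 'b', insert 'd' -> distance 2
# "cta"  -> "cat":   transpose 't' and 'a'               -> distance 1
```

The transposition case is what distinguishes Damerau-Levenshtein from plain Levenshtein distance and matches typing errors such as "cta" for "cat" with a single operation.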
Most of the work on detecting and correcting spelling errors, and on expanding abbreviations, is carried out separately. The task of abbreviation expansion deals
with recognising shorter forms of words (e.g. “abbr.” or “abbrev.”), acronyms (e.g.
“NATO”) and initialisms (e.g. “HTML”, “FBI”), and expanding them to their cor-
responding words3. The work on detecting and expanding abbreviations is mostly conducted in the realm of named-entity recognition and word-sense disambiguation.
The technique presented by Schwartz & Hearst [221] begins with the extraction of
all abbreviations and definition candidates based on the adjacency to parentheses.
A candidate is considered the correct definition for an abbreviation if both appear in the same sentence, and the candidate has no more than min(|A|+5, |A|∗2) words, where |A| is the number of characters in the abbreviation A. Park & Byrd
[195] presented an algorithm based on rules and heuristics for extracting definitions
for abbreviations from texts. Several factors are employed in this technique such as
syntactic cues, priority of rules, distance between abbreviation and definition and
word casing. Pakhomov [193] proposed a semi-supervised technique that employs
a hand-crafted table of abbreviations and their definitions for training a maximum
entropy classifier. For case restoration, improper letter casings in words are detected
and restored. For example, detecting the letter ‘j’ in “jones” as improper and cor-
recting the word to produce “Jones”. Lita et al. [153] presented an approach for
restoring cases based on the context in which the word exists. The approach first
captures the context surrounding a word and approximates the meaning using n-
grams. The casing of the letters in a word will depend on the most likely meaning of
the sentence. Mikheev [176] presented a technique for identifying sentence bound-
aries, disambiguating capitalised words and identifying abbreviations using a list of
common words. The technique can be described in four steps: identify abbreviations
in texts, disambiguate ambiguously capitalised words, assign unambiguous sentence
boundaries and disambiguate sentence boundaries if an abbreviation is followed by
a proper name.
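The length constraint from Schwartz & Hearst's technique can be sketched in a few lines. Only the min(|A|+5, |A|∗2) check is shown; the candidate extraction itself (parenthesis adjacency and character matching) is omitted, and the example abbreviations are illustrative:

```python
def within_length_limit(abbreviation, candidate_definition):
    """A candidate definition for abbreviation A may have at most
    min(|A| + 5, |A| * 2) words, where |A| is the character length of A."""
    max_words = min(len(abbreviation) + 5, len(abbreviation) * 2)
    return len(candidate_definition.split()) <= max_words

ok = within_length_limit("HMM", "hidden markov model")            # 3 <= min(8, 6)
too_long = within_length_limit("IR", "a very long and rambling candidate")  # 6 > min(7, 4)
```

The bound reflects the observation that genuine definitions are roughly proportional in length to their abbreviations, which filters out overly long candidates cheaply.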
In the context of ontology learning and other related areas such as text min-
ing, spelling error correction and abbreviation expansion are mainly carried out as
part of the text preprocessing (i.e. text cleaning, text normalisation) phase. Some
other common tasks in text preprocessing include plain text extraction (i.e. format
conversion, HTML/XML tag stripping, table identification [185]), sentence bound-
ary detection [243], case restoration [176], part-of-speech tagging [33] and sentence
3Some researchers refer to this relationship as abbreviation and definition or short-form and
long-form.
parsing [149]. A review by Gomez-Perez & Manzano-Macho [90] showed that nearly
all ontology learning systems in the survey perform only shallow linguistic analysis
such as part-of-speech tagging during the text preprocessing phase. These exist-
ing systems require the input to be clean and hence, the techniques for correcting
spelling errors, expanding abbreviations and restoring cases are considered as un-
necessary. Ontology learning systems such as Text-to-Onto [165] and BOLE [187]
are the few exceptions. In addition to shallow linguistic analysis, these systems
incorporate some cleaning tasks. Text-to-Onto extracts plain text from various for-
mats such as PDF, HTML, XML, and identifies and replaces abbreviations using
substitution rules based on regular expressions. The text preprocessing phase of
BOLE consists of sentence boundary detection, irrelevant sentence elimination and
text tokenisation using Natural Language Toolkit (NLTK).
In a text mining system for extracting topics from chat records, Castellanos
[38] presented a comprehensive list of text preprocessing techniques. The system
employs a thesaurus, constructed using the Smith-Waterman algorithm [233], for
correcting spelling errors and identifying abbreviations. In addition, the system
removes program code from texts, and detects sentence boundaries based on simple heuristics (e.g. shorter lines in program code, and punctuation marks followed by an upper-case letter). Tang et al. [246] presented a cascaded technique for cleaning
emails prior to text mining. The technique is composed of four passes: non-text
filtering for eliminating irrelevant data such as email header, sentence normalisation,
case restoration, and spelling error correction for transforming relevant text into
canonical form. Many of the techniques mentioned above perform only one out
of the three cleaning tasks (i.e. spelling error correction, abbreviation expansion,
case restoration). In addition, the evaluations conducted to obtain the reported accuracies are performed in different settings (e.g. no common benchmark, test data or agreed measure of accuracy). Hence, it is not possible to compare these different techniques based
on the accuracy reported in the respective papers. As pointed out earlier, only
a small number of integrated techniques are available for handling all three tasks.
Such techniques are usually embedded as part of a larger text preprocessing module.
Consequently, the evaluations of the individual cleaning task in such environments
are not available.
3.3 Basic ISSAC as Part of Text Preprocessing
ISSAC was designed and implemented as part of the text preprocessing phase
in an ontology learning system that uses chat records as input. The use of chat
Figure 3.1: Examples of spelling errors, ad-hoc abbreviations and improper casing
in a chat record.
records has required us to place more effort in ensuring text cleanliness during
the preprocessing phase. Figure 3.1 highlights the various spelling errors, ad-hoc
abbreviations and improper casing that occur much more frequently in chat records
than in clean texts.
Prior to spelling error correction, abbreviation expansion and case restoration,
three tasks are performed as part of the text preprocessing phase. Firstly, plain text
extraction is conducted to remove HTML and XML tags from the chat records us-
ing regular expressions and Perl modules, namely, XML::Twig4 and HTML::Strip5.
Secondly, identification of URLs, emails, emoticons6 and tables is performed. Such
information is extracted and set aside for assisting in other business intelligence
analysis. Tables are removed using the signatures of a table such as multiple spaces
between words, and words aligned in columns for multiple lines [38]. Thirdly, sen-
tence boundary detection is performed using Lingua::EN::Sentence7 Perl module.
Firstly, each sentence in the input text (e.g. chat record) is tokenised to obtain a set of words T = {t1, ..., tw}. The set T is then fed into Aspell. For each word e that Aspell considers erroneous, a list of ranked suggestions S is produced. Initially,
4http://search.cpan.org/dist/XML-Twig-3.26/
5http://search.cpan.org/dist/HTML-Strip-1.06/
6An emoticon, also called a smiley, is a sequence of ordinary printable characters or a small image, intended to represent a human facial expression and convey an emotion.
7http://search.cpan.org/dist/Lingua-EN-Sentence/
S = (s1,1, ..., sn,n) is an ordered list of n suggestions where sj,i is the jth suggestion
with rank i (smaller i indicates higher confidence in the suggested word). If e appears
in the abbreviation dictionary, the list S is augmented by adding all the correspond-
ing m expansions in front of S as additional suggestions with rank 1. In addition,
the error word e is appended at the end of S with rank n + 1. These augmentations
produce an extended list S = (s1,1, ..., sm,1, sm+1,1, ..., sm+n,n, sm+n+1,n+1), which is
a combination of m suggestions from the abbreviation dictionary (if e is a potential
abbreviation), n suggestions by Aspell, and the error word e itself. Placing the error
word e back into the list of possible replacements serves one purpose: to ensure that
if no better replacement is available, we keep the error word e as it is. Once the
extended list S is obtained, each suggestion sj,i is re-ranked using ISSAC. The new
score for the jth suggestion with original rank i is defined as
NS(sj,i) = i−1 + NED(e, sj,i) + RF (e, sj,i) + AF (sj,i) + DS(l, sj,i, r) + GS(l, sj,i, r)
where
• NED(e, sj,i) ∈ (0, 1] is the normalised edit distance defined as (ED(e, sj,i) +
1)−1 where ED is the minimum edit distance between e and sj,i.
• RF (e, sj,i) ∈ {0, 1} is the boolean reuse factor for providing more weight to
suggestion sj,i that has been previously used for correcting error e. The reuse
factor is obtained through a lookup against a history list that ISSAC keeps
to record previous corrections. RF (e, sj,i) provides factor 1 if the error e has
been previously corrected with sj,i and 0 otherwise.
• AF (sj,i) ∈ {0, 1} is the abbreviation factor for denoting that sj,i is a potential
abbreviation. Through a lookup against the abbreviation dictionary, AF (sj,i) yields 1 if suggestion sj,i exists in the dictionary and 0 otherwise. When the
scoring process takes place and the corresponding expansions for potential
abbreviations are required, www.stands4.com is consulted. A copy of the
expansion is stored in a local abbreviation dictionary for future reference.
• DS(l, sj,i, r) ∈ [0, 1] measures the domain significance of suggestion sj,i based
on its appearance in the domain corpora by taking into account the neigh-
bouring words l and r. This domain significance weight is inspired by the
TF-IDF [210] measure commonly used for information retrieval. The weight is
defined as the ratio between the frequency of occurrence of sj,i (individually,
and within l and r) in the domain corpora and the sum of the frequencies of
occurrences of all suggestions (individually, and within l and r).
• GS(l, sj,i, r) ∈ [0, 1] measures the general significance of suggestion sj,i based
on its appearance in the general collection (e.g. webpages indexed by the Google search engine). The purpose of this general significance weight is similar
to that of the domain significance. In addition, the use of dynamic Web data
allows ISSAC to cope with language change that is not possible with static
corpora and Aspell. The weight is defined as the ratio between the number
of documents in the general collection containing sj,i within l and r and the
number of documents in the general collection that contains sj,i alone. Both
the ratios in DS and GS are offset by a measure similar to that of the IDF
[210]. For further details on DS and GS, please refer to Wong et al. [273].
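A simplified sketch of the re-ranking score NS is given below. The DS and GS weights require domain corpora and Web document counts, so they are passed in as precomputed values here; all of the numbers, suggestions and contexts are illustrative assumptions, not values from the actual system:

```python
def normalised_edit_distance(ed):
    """NED = (ED + 1)^-1, where ED is the minimum edit distance."""
    return 1.0 / (ed + 1)

def ns(rank, ed, reused, is_abbreviation, ds, gs):
    """NS = i^-1 + NED + RF + AF + DS + GS for one suggestion."""
    return (1.0 / rank
            + normalised_edit_distance(ed)
            + (1 if reused else 0)           # RF: suggestion used before for e
            + (1 if is_abbreviation else 0)  # AF: found in abbreviation dict
            + ds + gs)                       # DS, GS in [0, 1]

# Two hypothetical suggestions for the error "cta":
# "cat" (rank 1, previously reused, strong corpus/Web support) vs.
# "act" (rank 2, no correction history, weaker support).
score_cat = ns(rank=1, ed=2, reused=True, is_abbreviation=False, ds=0.8, gs=0.6)
score_act = ns(rank=2, ed=2, reused=False, is_abbreviation=False, ds=0.1, gs=0.2)
best = "cat" if score_cat > score_act else "act"
```

The suggestion with the highest NS is taken as the replacement; because the six weights come from independent sources (Aspell's rank, correction history, the abbreviation dictionary, the domain corpora and the Web), no single unreliable source dominates the decision.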
3.4 Enhancement of ISSAC
The list of suggestions and the initial ranks provided by Aspell are an integral
part of ISSAC. Figure 3.2 summarises the accuracy of basic ISSAC obtained from
the previous evaluations [273] using four sets of chat records (where each set contains
100 chat records). The achievement of 74.4% accuracy by Aspell from the previous
evaluations, given the extremely poor nature of the texts, demonstrated the strength
of the Metaphone algorithm and the near-miss strategy. The further increase of 22%
in accuracy using basic ISSAC demonstrated the potential of the combined weights
NS(sj,i).
                                     Evaluation 1   Evaluation 2   Evaluation 3   Evaluation 4   Average
Correct replacements using ISSAC        97.06%         97.07%         95.92%         96.20%      96.56%
Correct replacements using Aspell       74.61%         75.94%         71.81%         75.19%      74.39%
Figure 3.2: The accuracy of basic ISSAC from previous evaluations.
Based on the previous evaluation results, we discuss in detail the three causes
behind the remaining 3.5% of errors which were incorrectly replaced. Figure 3.3
shows the breakdown of the causes behind the incorrect replacements by the basic
ISSAC. The three causes are summarised as follows:
Causes                                        Basic ISSAC
Correct replacement not in suggestion list          2.00%
Inadequate/erroneous neighbouring words             1.00%
Anomalies                                           0.50%
Figure 3.3: The breakdown of the causes behind the incorrect replacements by basic
ISSAC.
1. The accuracy of the corrections by basic ISSAC is bounded by the coverage
of the list of suggestions S produced by Aspell. About 2% of the wrong
replacements are due to the absence of the correct suggestions produced by
Aspell. For example, the error “prder” in the context of “The prder number”
was incorrectly replaced by both Aspell and basic ISSAC as “parader” and
“prder”, respectively. After a look into the evaluation log, we realised that the
correct replacement “order” was not in S.
2. The use of the two immediate neighbouring words l and r to inject more con-
textual consideration into domain and general significance has contributed to
a huge increase in accuracy. Nonetheless, the use of l and r in ISSAC is
by no means perfect. About 1% of the wrong replacements are due to two
flaws related to l and r, namely, neighbouring words with incorrect spelling,
and inadequate neighbouring words. Incorrectly spelt neighbouring words in-
ject false contextual information into the computation of DS and GS. The
neighbouring words may also be considered as inadequate due to their indis-
criminative nature. For example, the left word “both” in “both ocats are” is
too general and does not offer much discriminatory power for distinguishing
between suggestions such as “coats”, “cats” and “acts”.
3. The remaining 0.5% are considered anomalies which basic ISSAC cannot
address. There are two cases of anomalies: the equally likely nature of all pos-
sible suggestions, and the contrasting value of certain weights. As an example
for the first case, consider the error “Janice cheung has”. The left word is
correctly spelt and has adequately confined the suggestions to proper names.
In addition, the correct replacement “Cheung” is present in the suggestion list
S. Despite all these, both Aspell and ISSAC decided to replace “cheung” with
“Cheng”. A look into the evaluation log reveals that the surname “Cheung”
is as common as “Cheng”. In such cases, the probability of replacing e with
the correct replacement is c−1 where c is the number of suggestions with
approximately the same NS(sj,i). The second case of anomalies is due to the
contrasting values of certain weights, especially NED and i−1, that cause wrong
replacements. For example, in the case “cannot chage an”, basic ISSAC replaced
the error “chage” with “charge” instead of “change”. All the other weights
for “change” are comparatively higher (i.e. DS and GS) or the same (i.e.
RF , NED and AF ) as those for “charge”. These cues indicate that “change” is
the more appropriate replacement. Nonetheless, the original rank by Aspell for
“charge” is i = 1 while that for “change” is i = 6. As a smaller i indicates
higher confidence, the inverse of the original Aspell rank, i−1, causes the
combined weight for “change” to plummet.
In this chapter, we approach the enhancement of ISSAC from the perspective
of the first and second causes. For this purpose, we proposed three modifications
to the basic ISSAC:
1. We proposed the use of additional spell checking facilities as the answer to the
first cause (i.e. compensating for the inadequacy of Aspell). Google spellcheck,
which is based on statistical analysis of words on the World Wide Web8, ap-
pears to be the ideal candidate for complementing Aspell. Using the Google
SOAP API9, we can have easy access to one of the many functions provided
by Google, namely, Google spellcheck. Our new evaluations show that Google
spellcheck works well for certain errors where Aspell fails to suggest the cor-
rect replacements. Similar to adding the expansions for abbreviations and the
suggestions by Aspell, the suggestion provided by Google is added at the front
of the list S with rank 1. This places the suggestion by Google on the same
rank as the first suggestion by Aspell.
2. The basic ISSAC relies only on Aspell for determining if a word is an error.
For this purpose, we decided to include Google spellcheck as a complement.
If a word is detected as a possible error by either Aspell or Google spellcheck,
then we have adequate evidence to proceed and correct it using enhanced
ISSAC. In addition, errors that result in valid words are not recognised by
Aspell. For example, Aspell does not recognise “hat” as an error. If we were
to take into consideration the neighbours that it co-occurs with, namely, “suret
hat they”, then “hat” is certainly an error. Google contributes in this aspect.
“transcription factor”, “CREB”, “C-Fos”, “E2F”. Using WX , SLOP gathered
80,633 webpage URLs for downloading. A total of 76,876 pages were actually
downloaded while the remaining 3,743 could not be reached for reasons such as
connection errors. Finally, HERCULES is used to extract contents from the
downloaded pages for constructing the Web-derived corpus. About 15% of the
webpages were discarded by HERCULES due to the absence of proper contents.
The final Web-derived corpus, denoted as SPARTAN-L (the letter L refers to local),
is composed of N = 64,578 documents with F = 118,790,478 tokens. We have made
available an online query tool for SPARTAN-L13. It is worth pointing out that using
SPARTAN and the same number of seed terms, we can easily construct a corpus
11 More information on Yahoo! Search, including API key registration, is available at
http://developer.yahoo.com/search/web/V1/webSearch.html.
12 A demo is available at http://explorer.csse.uwa.edu.au/research/data_virtualcorpus.pl. Note
that slow response time is possible when the server is under heavy load.
13 A demo is available at http://explorer.csse.uwa.edu.au/research/data_localcorpus.pl. Note
that slow response time is possible when the server is under heavy load.
that is at least 20 times larger than a BootCat-derived corpus.
Many researchers have found good use of page counts for a wide range of NLP
applications using search engines as gateways to the Web (i.e. general virtual cor-
pus). In order to justify the need for content analysis during the construction of
virtual corpora by SPARTAN, we included the use of guided search engine queries as
a form of specialised virtual corpus during term recognition. We refer to this virtual
corpus as SREQ, the seed-restricted querying of the Web. Quite simply, we append
the conjunction of the seed terms W for every query made to the search engines. In
a sense, we can consider SREQ as the portion of the Web which contains the seed
terms W. For instance, the normal approach for obtaining the general page count
(i.e. the number of pages on the Web) for “TNF beta” is by submitting the n-gram
as a query to any search engine. Using Yahoo, the general virtual corpus has 56,400
documents containing “TNF beta”. In SREQ, the conjunction of the seeds in W is
appended to “TNF beta”, resulting in the query q = “TNF beta” “transcription
factor” “blood cell” “human”. Using this query, Yahoo provides us with 218 webpages,
while the conjunction of the seed terms alone results in the page count N = 149,000.
We can consider the latter as the size of SREQ (i.e. the total number of documents in
SREQ), and the former as the number of documents in SREQ which contain the
term “TNF beta”.
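The query construction described above can be sketched as a small helper (an illustration of the scheme, not the actual SREQ implementation):

```python
def sreq_query(term, seed_terms):
    """Append the conjunction of the seed terms W to the n-gram, each
    as a quoted phrase, so that only the seed-restricted portion of
    the Web is queried."""
    return " ".join('"%s"' % p for p in [term] + list(seed_terms))
```

For example, `sreq_query("TNF beta", ["transcription factor", "blood cell", "human"])` yields the query string shown above.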
GENIA Corpus and the Preparations for Term Recognition
In this section, we evaluate the performance of term recognition using the dif-
ferent corpora discussed in the preceding sections. Terms are content-bearing
words which are unambiguous, highly specific and relevant to a certain domain of
interest. Most existing term recognition techniques identify terms from among the
candidates through some scoring and ranking mechanisms. The performance of
term recognition is heavily dependent on the quality and the coverage of the text
corpora. Therefore, we find it appropriate to use this task to judge the adequacy
and applicability of both SPARTAN-V and SPARTAN-L in real-world applications.
The term candidates and gold standard employed in this evaluation come with
the GENIA corpus [130]. The term candidates were extracted from the GENIA
corpus based on the readily-available part-of-speech and semantic mark-up. A
gold standard, denoted as the set G, was constructed by extracting the terms which
have semantic descriptors enclosed by cons tags. For practicality reasons, we
randomly selected 1,300 term candidates for evaluation, denoted as T. We manually
inspected the list of candidates and compared them against the gold standard. Out
of the 1,300 candidates, 121 are non-terms (i.e. misses) while the remaining 1,179
are domain-relevant terms (i.e. hits).
Figure 6.7: The number of documents and tokens from the local and virtual corpora
used in this evaluation.
Instead of relying on some complex measures, we used a simple, unsupervised
technique based solely on the cross-domain distributional behaviour of words for
term recognition. Our intention is to observe the extent of contribution of the quality
of corpora towards term recognition without being obscured by the complexity of
state-of-the-art techniques. We employed relative frequencies to determine whether
a word (i.e. term candidate) is a domain-relevant term or otherwise. The idea
is simple: if a word is encountered more often in a specialised corpus than the
contrastive corpus, then the word is considered as relevant to the domain represented
by the former. As such, this technique places even more emphasis on the coverage
and adequacy of the corpora to achieve good term recognition performance. For the
contrastive corpus, we have prepared a collection comprising texts from a broad
range of domains other than our domain of interest, which is molecular
biology. Figure 6.7 summarises the composition of the contrastive corpus.
The term recognition procedure is performed as follows. Firstly, we took note of
the total number of tokens F in each local corpus (i.e. BootCat, GENIA, SPARTAN-
L, contrastive corpus). For the two virtual corpora, namely, SPARTAN-V and
SREQ, the total page count (i.e. total number of documents) N is used instead.
Secondly, the word frequency ft for each candidate t ∈ T is obtained from each
local corpus. We use page counts (i.e. document frequencies), nt as substitutes for
the virtual corpora. Thirdly, the relative frequency, pt for each t ∈ T are calcu-
lated as either ft/F or nt/N depending on the corpus type (i.e. virtual or local).
Fourthly, we evaluated the performance of term recognition using these relative
frequencies. Please take note that when comparing local corpora (i.e. BootCat,
Algorithm 3 assessBinaryClassification(t,dt,ct,G)
1: initialise decision
2: if dt ≥ ct ∧ t ∈ G then
3: decision := “true positive”
4: else if dt ≥ ct ∧ t /∈ G then
5: decision := “false positive”
6: else if dt < ct ∧ t ∈ G then
7: decision := “false negative”
8: else if dt < ct ∧ t /∈ G then
9: decision := “true negative”
10: return decision
GENIA, SPARTAN-L) with the contrastive corpus, the pt based on word frequency
is used. The pt based on document frequency is used for comparing virtual corpora
(i.e. SPARTAN-V, SREQ) with the contrastive corpus. If the pt by a specialised
corpus (i.e. BootCat, GENIA, SPARTAN-L, SPARTAN-V, SREQ), denoted as dt,
is larger than or equal to the pt by the contrastive corpus, ct, then the candidate
t is classified as a term. The candidate t is classified as a non-term if dt < ct. An
assessment function described in Algorithm 3 is employed to grade the decisions
achieved using the various specialised corpora.
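The classification step and Algorithm 3 can be sketched in Python as follows. This is a sketch only: frequency and count lookups are assumed to be supplied by the caller, and, following the prose, a candidate with dt equal to ct is classified as a term.

```python
def relative_frequency(count, total):
    """p_t: f_t/F for a local corpus, or n_t/N for a virtual corpus."""
    return count / total

def assess_binary_classification(t, dt, ct, gold):
    """Algorithm 3: grade the decision for candidate t, where dt and
    ct are its relative frequencies in the specialised and contrastive
    corpora, and `gold` is the gold standard set G."""
    if dt >= ct:
        return "true positive" if t in gold else "false positive"
    return "false negative" if t in gold else "true negative"
```

Running this assessment over all candidates in T yields the counts from which the contingency tables in the next section are built.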
Term Recognition Results
Contingency tables are constructed using the numbers of false positives and
negatives, and true positives and negatives obtained from Algorithm 3. Figure 6.8
summarises the errors introduced during the classification process for term recog-
nition using several different specialised corpora. We then computed the precision,
accuracy, F1 and F.5 score using the values in the contingency tables. Figure 6.9
summarises the performance metrics for term recognition using the different corpora.
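The metrics reported in Figure 6.9 can be computed from each contingency table with the standard definitions (a sketch; F.5 is the F-beta score with beta = 0.5, which weighs precision more heavily than recall):

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy, F1 and F0.5 from one contingency table."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    beta2 = 0.5 ** 2  # F0.5 weighs precision more heavily than recall
    f05 = (1 + beta2) * precision * recall / (beta2 * precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "F1": f1, "F0.5": f05}
```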
Firstly, in the context of local corpora, Figure 6.9 shows that SPARTAN-L
achieved a better performance compared to BootCat. While SPARTAN-L is merely
2.5% more precise than BootCat, the latter had the worst recall, at 65.06%, among
all the corpora included in the evaluation. The poor recall by BootCat
is due to its high false negative rate. In other words, true terms are not classified
as terms by BootCat due to its low-quality composition (e.g. poor coverage and
specificity). Many domain-relevant terms in the vocabulary of molecular biology are not
covered by the BootCat-derived corpus. Despite being 19 times larger than GENIA,
(a) Results using the GENIA corpus.
(b) Results using the SPARTAN-V corpus.
(c) Results using the SPARTAN-L corpus.
(d) Results using the SREQ corpus.
(e) Results using the BootCat corpus.
Figure 6.8: The contingency tables summarising the term recognition results using
the various specialised corpora.
Figure 6.9: A summary of the performance metrics for term recognition.
the F1 score of the BootCat-derived corpus is far from ideal. The SPARTAN-L
corpus, which is 295 times larger than GENIA in terms of token size, has the closest
performance to the gold standard at F1 = 92.87%. Assuming that size does matter,
we speculate that a specialised Web-derived corpus at least 419 times larger than
GENIA (using linear extrapolation) would be required to match the latter’s high
vocabulary coverage and specificity for achieving a 100% F1 score. At the moment, this
conjecture remains to be tested. Given its inferior performance and effortless setup,
BootCat-derived corpora can only serve as baselines in the task of term recognition
using specialised Web-derived corpora.
Secondly, in the context of virtual corpora, term recognition using SPARTAN-V
achieved the best performance across all metrics with a 99.56% precision, even
outperforming the local version SPARTAN-L. An interesting point here is that the
other virtual corpus, SREQ, achieved a good result with precision and recall close
to 90% despite the relative ease of setting up the apparatus required for guided
search engine querying. For this reason, we regard SREQ as the baseline for
comparing the use of specialised virtual corpora in term recognition. In our opinion, a
9% improvement in precision justifies the additional systematic analysis of website
content performed by SPARTAN for creating a virtual corpus. From our experience,
the analysis of 200 websites generally requires on average, ceteris paribus, 1 to 1.5
hours of processing time using Yahoo API on a standard 1GHz computer with a
256 Mbps Internet connection. The ad-hoc use of search engines for accessing the
general virtual corpus may work for many NLP tasks. However, the relatively poor
performance by SREQ here justifies the need for more systematic techniques such
as SPARTAN when the Web is used as a specialised corpus for tasks such as term
recognition.
Thirdly, comparing between virtual and local corpora, only SPARTAN-V scored
a recall above 90% at 96.44%. Upon localising, the recall of SPARTAN-L dropped to
89.40%. This further confirms that term recognition requires large corpora with high
vocabulary coverage, and that the SPARTAN technique has the ability to system-
atically construct virtual corpora with the required coverage. It is also interesting
to note that a large 118 million token local corpus (i.e. SPARTAN-L) matches the
recall of a 149,000-document virtual corpus (i.e. SREQ). However, due to the
heterogeneous nature of the Web and the inadequacy of simple seed term restriction,
SREQ scored 6% less than SPARTAN-L in precision. This concurs with our earlier
conclusion that ad-hoc querying, as in SREQ, is not the optimal way of using
the Web as a specialised virtual corpus. Even the considerably smaller BootCat-
derived corpus achieved a 4% higher precision compared to SREQ. This shows that
size and coverage (there are 46 times more documents in SREQ than in BootCat)
contribute only to recall, which explains SREQ’s 24% better recall than BootCat.
Due to SREQ’s lack of vocabulary specificity, it had the lowest precision at 90.44%.
Overall, certain tasks indeed benefit from larger corpora, obviously when metic-
ulously constructed. More specifically, tasks which do not require local access to
the texts in the corpora, such as term recognition, may well benefit from the
considerably larger and distributed nature of virtual corpora. This is evident as
the SPARTAN-based corpus fared 3–7% worse across all metrics upon localising
(i.e. SPARTAN-L). Furthermore, the very close F1 scores achieved by the worst
performing virtual corpus (i.e. the baseline SREQ) and the best performing local
corpus SPARTAN-L show that virtual corpora may indeed be more suitable for the
task of term recognition. We speculate that several reasons are at play, including
the ever-evolving vocabulary on the Web, and the sheer size of that vocabulary,
which even Web-derived corpora cannot match.
In short, in the context of term recognition, the two most important factors which
determine the adequacy of the constructed corpora are coverage and specificity.
On the one hand, larger corpora, even when conceived in an ad-hoc manner, can
potentially lead to higher coverage, which in turn contributes significantly to recall.
On the other hand, the extra effort spent on systematic analysis leads to a more
specific vocabulary, which in turn contributes to precision. Most existing techniques
lack focus on one or both factors, leading to poorly constructed and inadequate
virtual corpora and Web-derived corpora. For instance, BootCat has difficulty in
practically constructing very large corpora, while ad-hoc techniques such as SREQ
lack systematic analysis, which results in poor specificity. From our evaluation, only
SPARTAN-V achieved a balanced F1 score exceeding 95%. In other words, the virtual
corpora constructed using SPARTAN are both adequately large with high coverage
and have a specific enough vocabulary to achieve highly desirable term recognition
performance. We can construct much larger specialised corpora using SPARTAN by
adjusting certain thresholds. We can adjust τC , τS and τA to allow more websites
to be included in the virtual corpora. We can also permit more related terms to
be included as extended seed terms during STEP. This will allow more webpages to
be downloaded to create even larger Web-derived corpora. This is possible since the
maximum number of pages derivable from the 43 websites is 84,963,524 as shown in
Figure 6.7. During the localisation phase, only 64,578 webpages, a mere 0.07%
of the total, were actually downloaded. In other words, the SPARTAN technique
is highly customisable, able to create both small and very large virtual and
Web-derived corpora using only a few thresholds.
6.5 Conclusions
The sheer volume of textual data available on the Web, the ubiquitous coverage
of topics, and the growth of content have become the catalysts in promoting a wider
acceptance of the Web for corpus construction in various applications of knowledge
discovery and information extraction. Despite the extensive use of the Web as a
general virtual corpus, very few studies have focused on the systematic analysis
of website contents for constructing specialised corpora from the Web. Existing
techniques such as BootCat simply pass the responsibility of deciding on suitable
webpages to the search engines. Others allow their Web crawlers to run astray
(subsequently resulting in topic drift) without systematic controls while downloading
webpages for corpus construction. In the face of these inadequacies, we introduced
a novel technique called SPARTAN which places emphasis on the analysis of the
domain representativeness of websites for constructing virtual corpora. This tech-
nique also provides the means to extend the virtual corpora in a systematic way
to construct specialised Web-derived corpora with high vocabulary coverage and
specificity.
Overall, we have shown that SPARTAN is independent of the search engines used
during corpus construction. SPARTAN performed the re-ranking of websites pro-
vided by search engines based on their domain representativeness to allow those with
the highest vocabulary coverage, specificity and authority to surface. The system-
atic analysis performed by SPARTAN is adequately justified when the performance
of term recognition using SPARTAN-based corpora achieved the best precision and
recall in comparison to all other corpora based on existing techniques. Moreover,
our evaluation showed that only the virtual corpora constructed using SPARTAN
are both adequately large with high coverage and have a specific enough vocabulary
to achieve a balanced term recognition performance (i.e. the highest F1 score). Most
existing techniques lack focus on either one or both factors. We conclude that larger
corpora, when constructed with consideration for vocabulary coverage and speci-
ficity, deliver the prerequisites required for producing consistent and high-quality
output during term recognition.
Several directions for future work have been planned to further assess SPARTAN. In the near
future, we hope to study the effect of corpus construction using different seed terms
W . We also intend to examine how the content of SPARTAN-based corpora
evolves over time and its effect on term recognition. Furthermore, we
are also planning to study the possibility of extending the use of virtual corpora to
other applications which require contrastive analysis.
6.6 Acknowledgement
This research was supported by the Australian Endeavour International Post-
graduate Research Scholarship. The authors would like to thank the anonymous
reviewers for their invaluable comments.
6.7 Other Publications on this Topic
Wong, W., Liu, W. & Bennamoun, M. (2008) Constructing Web Corpora through
Topical Web Partitioning for Term Recognition. In the Proceedings of the 21st Aus-
tralasian Joint Conference on Artificial Intelligence (AI), Auckland, New Zealand.
This paper reports the preliminary ideas on the SPARTAN technique for creating
text corpora using data from the Web. The SPARTAN technique was later improved
and extended to form the core content of this chapter.
CHAPTER 7
Term Clustering for Relation Acquisition
Abstract
Many conventional techniques for concept formation in ontology learning rely on
the use of predefined templates and rules, and static background knowledge such as
WordNet. Not only are these techniques difficult to scale across different domains
and to adapt to knowledge change, but their results are also far from desirable. This
chapter proposes a new multi-pass clustering algorithm for concept formation known
as Tree-Traversing Ant (TTA) as part of an ontology learning system. This tech-
nique uses Normalised Google Distance (NGD) and n-degree of Wikipedia (noW)
as measures for similarity and distance between terms to achieve highly adaptable
clustering across different domains. Evaluations using seven datasets show promis-
ing results with an average lexical overlap of 97% and an ontological improvement
of 48%. In addition, the evaluations demonstrated several advantages that are not
simultaneously present in standard ant-based and other conventional clustering tech-
niques.
7.1 Introduction
Ontologies are gaining increasing importance in modern information systems for
providing inter-operable semantics. Increasing demand on ontologies makes labour-
intensive creation more and more undesirable, if not impossible. Exacerbating the
situation is the problem of knowledge change that results from ever growing infor-
mation sources, both online and offline. Since the late nineties, more and more
researchers started looking for solutions to relieve knowledge engineers from this
increasingly acute situation. One of the main research areas with high impact, if
successful, is the automatic or semi-automatic construction and maintenance of
ontologies from electronic text. Ontology learning from text is the process of identifying con-
cepts and relations from natural language text, and using them to construct and
maintain ontologies. In ontology learning, terms are the lexical realisations of im-
portant concepts for characterising a domain. Consequently, the task of grouping
together variants of terms to form concepts, known as term clustering, constitutes
0 This chapter appeared in Data Mining and Knowledge Discovery, Volume 15, Issue 3, Pages
349–381, with the title “Tree-Traversing Ant Algorithm for Term Clustering based on Featureless
Similarities”.
a crucial fundamental step in ontology learning.
Unlike documents [242], webpages [44], and pixels in image segmentation and
object recognition [113], terms alone are lexically featureless. Similarity of such objects can
be established by feature analysis based on visible (e.g. physical and behavioural)
traits. Unfortunately, using object names (i.e. terms) alone, similarity depends
on something less tangible, namely, background knowledge which humans acquired
through their senses over the years. The absence of features requires certain ad-
justments to be made with regard to the term clustering techniques. One of the
most evident adaptation required is the use of context and other linguistic evidence
as features for the computation of similarity. A recent survey [90] revealed that all
ontology learning systems which apply clustering techniques rely on the contextual
cues surrounding the terms as features. The large collection of documents, and pre-
defined patterns and templates required for the extraction of contextual cues makes
the portability of such ontology learning systems difficult. Consequently, non-feature
similarity measures are fast becoming a necessity for term clustering in ontology
learning from text. Along the same line of thought, Lagus et al. [137] stated that “In
principle a document might be encoded as a histogram of its words...symbolic words
as such retain no information of their relatedness”. In addition to the problems as-
sociated with feature extraction in term clustering, much work is still required with
respect to the clustering algorithm itself. Researchers [98] have shown that certain
commonly adopted algorithms such as K-means and average-link agglomerative
clustering yield mediocre results in comparison with ant-based algorithms, a
relatively new paradigm. Handl et al. [98] demonstrated certain desirable properties in
ant-based algorithms such as the tolerance to different cluster sizes, and the ability
to identify the number of clusters. Despite such advantages, the potential of ant-
based algorithms remains relatively unexplored for possible applications in ontology
learning.
In this chapter, we employ the established Normalised Google Distance (NGD)
[50] together with a new hybrid, multi-pass algorithm called Tree-Traversing Ant
(TTA)1 for clustering terms in ontology learning. TTA fuses the strengths of
standard ant-based and conventional clustering techniques with the advantages of
featureless-similarity measures. In addition, a second-pass is introduced in TTA
1 This foundation work on term clustering using featureless similarity measures appeared in the
Proceedings of the International Symposium on Practical Cognitive Agents and Robots (PCAR),
Perth, Australia, 2006, with the title “Featureless Similarities for Terms Clustering using Tree-
Traversing Ants”.
for refining the results produced using NGD. During the second-pass, the TTA em-
ploys a new distance measure called n-degree of Wikipedia (noW) for quantifying the
distance between two terms based on Wikipedia’s categorical system. Evaluations
using seven datasets show promising results, and revealed several advantages which
are not simultaneously present in existing clustering algorithms. In Section 2, we
give an introduction to the current term clustering techniques for ontology learning.
In Section 3, a description of the NGD measure and an introduction to standard
ant-based clustering are presented. In Section 4, we present the TTA, and how NGD
and noW are employed to support term clustering. In Section 5, we summarise the
results and findings from our evaluations. Finally, we conclude this chapter with an
outlook to future work in Section 6.
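Since both TTA passes hinge on featureless measures, it is worth recalling the standard NGD formulation [50], which needs nothing more than page counts. A minimal sketch, with the counts assumed to be supplied by the caller (e.g. from a search engine):

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalised Google Distance computed from page counts: fx and fy
    are the counts for each term alone, fxy the count for pages
    containing both terms, and n the total number of pages indexed."""
    lx, ly = math.log(fx), math.log(fy)
    return (max(lx, ly) - math.log(fxy)) / (math.log(n) - min(lx, ly))
```

Terms that always co-occur yield a distance of 0, while terms that rarely appear together yield larger distances.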
7.2 Existing Techniques for Term Clustering
Faure & Nedellec [69] presented a corpus-based conceptual clustering technique
as part of an ontology learning system called ASIUM. The clustering technique is
designed for aggregating basic classes based on a distance measure inspired by the
Hamming distance. The basic classes are formed prior to clustering in a phase for
extracting sub-categorisation frames [71]. Terms that appear on at least two different
occasions with the same verb, and the same preposition or syntactic role, can be
regarded as semantically similar such that they can be substituted with one another in
that particular context. These semantically similar terms form the basic classes. The
basic classes form the lowest level of the ontology and are successively aggregated
to construct a hierarchy from bottom-up. Each time, only two basic classes are
compared. The clustering begins by computing the distance between all pairs of
basic classes and aggregate those with distance less than a user-defined threshold.
The distance between two classes containing the same words with same frequencies
have a distance 0. On the other hand, two classes without a single common word
have a distance 1. In other words, the terms in the basic classes act as features,
allowing for inter-class comparison. The measure for distance is defined as
\[
\mathrm{distance}(C_1, C_2) = 1 - \left( \frac{\frac{\sum F_{C_1} \times N_{comm}}{card(C_1)} + \frac{\sum F_{C_2} \times N_{comm}}{card(C_2)}}{\sum_{i=1}^{card(C_1)} f(word_i^{C_1}) + \sum_{i=1}^{card(C_2)} f(word_i^{C_2})} \right)
\]
where card(C1) and card(C2) are the numbers of words in C1 and C2, respectively,
and Ncomm is the number of words common to both C1 and C2. ∑FC1 and ∑FC2
are the sums of the frequencies of the words in C1 and C2 which also occur in C2
and C1, respectively. f(word_i^C1) and f(word_i^C2) are the frequencies of the ith word
of class C1 and C2, respectively.
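The ASIUM distance measure can be sketched in Python, representing each basic class as a mapping from words to frequencies (a representation we assume here for illustration; it is not from the original system):

```python
def asium_distance(c1, c2):
    """ASIUM-style distance between two basic classes, each given as a
    dict mapping words to frequencies. Identical classes score 0;
    classes sharing no word score 1."""
    common = set(c1) & set(c2)
    n_comm = len(common)
    sum_f_c1 = sum(c1[w] for w in common)  # frequencies in C1 of shared words
    sum_f_c2 = sum(c2[w] for w in common)  # frequencies in C2 of shared words
    numerator = (sum_f_c1 * n_comm / len(c1)) + (sum_f_c2 * n_comm / len(c2))
    denominator = sum(c1.values()) + sum(c2.values())
    return 1 - numerator / denominator
```

As the text notes, two classes containing the same words with the same frequencies have distance 0, and two classes without a single common word have distance 1.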
Maedche & Volz [165] presented a bottom-up hierarchical clustering technique
that is part of the ontology learning system Text-to-Onto. This term clustering
technique relies on an all-knowing oracle, denoted by H, which is capable of re-
turning possible hypernyms for a given term. In other words, the performance of
the clustering algorithm has an upper-bound limited by the ability of the oracle to
know all possible hypernyms for a term. The oracle is constructed using WordNet
and lexico-syntactic patterns [51]. During the clustering phase, the algorithm is
provided with a list of terms and the similarity between each pair is computed using
the cosine measure. For this purpose, the syntactic dependencies of each term are
extracted and used as the features for that term. The algorithm is an extremely
long list of nested if-else statements. For the sake of brevity, it suffices to know
that the algorithm examines the hypernymy relations between all pairs of terms
before it decides on the placement of terms as parents, children or siblings of other
terms. Each time the information about the hypernym relations between two terms
is required, the oracle is consulted. The projection H(t) returns a set of tuples (x, y)
where x is a hypernym of term t and y is the number of times the algorithm has
found the evidence for it.
Shamsfard & Barforoush [225] presented two clustering algorithms as part of the
ontology learning system Hasti. Concepts have to be formed prior to the clustering
phase. It suffices to know that the process of forming the concepts and extracting
relations that are used as features for clustering involves a knowledge extractor,
where "the knowledge extractor is a combination of logical, template driven and semantic
analysis methods" [227]. In the concept-based clustering technique, a similarity
matrix, consisting of the similarities of all possible pairs of concepts, is computed. The
pair with the maximum similarity that is also greater than the merge-threshold is
chosen to form a new super concept. In this technique, each intermediate (i.e. non-
leaf) node in the conceptual hierarchy has at most two children, but the hierarchy is
not a binary tree as each node may have more than one parent. As for the relation-
based clustering technique, only non-taxonomic relations are considered. For every
concept c, a set of assertions about the non-taxonomic relations Nf(c) that c has
with other concepts is identified. In other words, these relations can be regarded as
features that allow concepts to be merged according to what they share. If at least
one related concept is common between assertions about that relation, then the set
comprising the other concepts (called merge-set) contains good candidates for merg-
ing. After all the relations have been examined, a list of merge-sets is obtained. The
merge-set with the highest similarity between its members is chosen for merging. In
both clustering algorithms, the similarity measure employed is defined as
similarity(a, b) = \sum_{j=1}^{maxlevel} \left( \sum_{i=1}^{card(cm)} \left( W_{cm(i).r} + \sum_{k=1}^{valence(cm(i).r)} W_{cm(i).arg(k)} \right) \right) \times L_j
where cm = Nf(a) ∩ Nf(b) is the intersection between the sets of assertions (i.e.
common relations) about a and b, and card(cm) is the cardinality of cm. W_{cm(i).r} is
the weight for each common relation and \sum_{k=1}^{valence(cm(i).r)} W_{cm(i).arg(k)} is the sum of
the weights of all terms related to the common relation cm(i).r. L_j is the level constant
assigned to each similarity level which decreases as the level increases. The main
aspect of the similarity measure is the common features between two concepts a and
b (i.e. the intersection between the sets of non-taxonomic assertions Nf(a)∩Nf(b)).
Each common feature cm(i).r together with the corresponding weight Wcm(i).r and
the weight of the related terms are accumulated. In other words, the more features
two concepts have in common, the higher the similarity between them.
Regardless of how the existing techniques described in this section are named,
they share a common point, namely, the reliance on some linguistic (e.g. subcategorisation
frames, lexico-syntactic patterns) or predefined semantic (e.g. WordNet)
resources as features. These features are necessary for the computation of similarity
using conventional measures and clustering algorithms. The ease of scalability across
different domains and the resources required for feature extraction are among the
issues our new clustering technique attempts to address. In addition, the
new clustering technique fuses the strengths of recent innovations such as ant-based
algorithms and featureless similarity measures that have yet to benefit ontology
learning systems.
7.3 Background
7.3.1 Normalised Google Distance
Normalised Google Distance (NGD) computes the semantic distance between
objects based on their names using only page counts from the Google search engine.
A more generic name for the measure that employs page counts provided by any
Web search engines is the Normalised Web Distance (NWD) [262]. NGD is a non-
feature distance measure which attempts to capture every effective distance (e.g.
Hamming distance, Euclidean distance, edit distances) into a single metric. NGD
is based on the notions of Kolmogorov Complexity [93] and Shannon-Fano coding
[142].
The basis of NGD begins with the idea of the shortest binary program capable
of producing a string x as an output. The Kolmogorov Complexity of the string x,
K(x) is just the length of that program in binary bits. Extending this notion to
include an additional string y produces the Information Distance [23] where E(x, y)
is the length of the shortest binary program that can produce x given y, and y given
x. It was shown that [23]:
E(x, y) = K(x, y) - \min\{K(x), K(y)\} \quad (7.1)
where E(x, x) = 0, E(x, y) > 0 for x ≠ y, and E(x, y) = E(y, x). Next, for every
other computable distance D that is non-negative and symmetric, there is a binary
program, given strings x and y, with a length equal to D(x, y). Formally,
E(x, y) \leq D(x, y) + c_D

where c_D is a constant that depends on the distance D and not on x and y. E(x, y) is
called universal because it acts as the lower bound for all computable distances. In
other words, if two strings x and y are close according to some distance D, then they
are at least as close according to E [49]. Since all computable distances compare
the closeness of strings through the quantification of certain common features they
share, we can consider that information distance determines the distance between
two strings according to the feature by which they are most similar.
By normalising information distance, we have NID(x, y) ∈ [0, 1] where 0 means
the two strings are the same and 1 means they are completely different in the sense
that they share no features. The normalised information distance is defined as:

NID(x, y) = \frac{K(x, y) - \min\{K(x), K(y)\}}{\max\{K(x), K(y)\}}
Nonetheless, referring back to Kolmogorov Complexity and Equation 7.1, the non-computability
of K(x) implies the non-computability of NID(x, y). However,
an approximation of K can be achieved using real compression programs [261]. If
C is a compressor, then C(x) denotes the length of the compressed version of string
x. Approximating K(x) with C(x) results in:

NCD(x, y) = \frac{C(x, y) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}
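Since any off-the-shelf compressor can stand in for C, NCD is directly computable. Here is a small sketch using Python's zlib as the compressor; the choice of zlib and the sample strings are assumptions for illustration:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalised Compression Distance: approximate K with zlib compressed lengths."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"colourless green ideas sleep furiously " * 20
print(ncd(a, a))  # close to 0: a string shares all its structure with itself
print(ncd(a, b))  # noticeably larger: the two strings share little structure
```

The quality of the approximation depends entirely on how well the compressor exploits shared structure, which motivates the move to the Web-based estimate below.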
The derivation of NGD continues by observing the working behind compressors.
Compressors encode source words x into code words x′ such that the length |x′| < |x|.
We can consider these code words from the perspective of Shannon-Fano coding.
Shannon-Fano coding encodes a source word x using a code word that has the
length log(1/p(x)). p(x) can be thought of as a probability mass function that maps each
source word x to the code that achieves optimal compression of x. In Shannon-Fano
coding, p(x) = n_x/N captures the probability of encountering source word x in a text
or a stream of data from a source, where n_x is the number of occurrences of x and N is the total
number of source words in the same text. Cilibrasi & Vitanyi [49] discussed the use of
compressors for NCD and concluded that the existing compressors’ inability to take
into consideration external knowledge during compression makes them inadequate.
Instead, the authors proposed to make use of a source that “...stands out as the most
inclusive summary of statistical information” [49], namely, the World Wide Web.
More specifically, the authors proposed the use of the Google search engine to devise
a probability mass function that reflects the Shannon-Fano code. Google's
equivalent of the Shannon-Fano code, known as the Google code, has length
defined by [49]:

G(x) = \log \frac{1}{g(x)}

G(x, y) = \log \frac{1}{g(x, y)}

where g(x) = |x|/N and g(x, y) = |x ∩ y|/N are the new probability mass functions
that capture the probability of occurrences of search terms x and y. x is the set
of webpages returned by Google containing the single search term x (i.e. singleton
set) and similarly, x ∩ y is the set of webpages returned by Google containing both
search terms x and y (i.e. doubleton set). N is obtained by summing the sizes of
all unique singleton and doubleton sets.
Consequently, the Google search engine can be considered as a compressor for
encoding search terms (i.e. source words) x to produce the meaning (i.e. compressed
code words) that has the length G(x). By rewriting the NCD, we obtain the new
NGD defined as:
NGD(x, y) = \frac{G(x, y) - \min\{G(x), G(y)\}}{\max\{G(x), G(y)\}} \quad (7.2)
All in all, NGD is an approximation of NCD and hence, NID to overcome the non-
computability of Kolmogorov Complexity. NGD employs the Google search engine
as a compressor to generate Google codes based on the Shannon-Fano coding. From
the perspective of term clustering, NGD provides an innovative starting point which
demonstrates the advantages of featureless similarity measures. In our new term
clustering technique, we take such innovation a step further by employing NGD in
a new clustering technique that combines the strengths from both conventional and
ant-based algorithms.
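For reference, Equation 7.2 can be computed directly from page counts by expanding G(x) = log(1/g(x)) = log(N/f(x)), where f(x) denotes the page count for term x. A sketch follows; the counts and index size are made up for illustration, not live search-engine figures:

```python
from math import log

def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """NGD from page counts via G(x) = log(N / f(x)), plugged into Eq. 7.2."""
    gx, gy, gxy = log(n / fx), log(n / fy), log(n / fxy)
    return (gxy - min(gx, gy)) / max(gx, gy)

# Illustrative, made-up page counts; n approximates the index size.
print(ngd(fx=10_000, fy=8_000, fxy=6_000, n=1e10))
```

Two terms that always co-occur yield a distance of 0, and the distance grows as the doubleton count f(x, y) shrinks relative to the singleton counts.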
7.3.2 Ant-based Clustering
The idea of ant-based clustering was first proposed by Deneubourg et al. [61] in
1991 as part of an attempt to explain the different types of emergent technologies
inspired by nature. During simulation, the ants are represented as agents that move
around the environment, a square grid, at random. Objects are randomly placed in
this environment and the ants can pick up, move and drop them. These
three basic operations are influenced by the distribution of the objects. Objects that
are surrounded by dissimilar ones are more likely to be picked up and later dropped
elsewhere in the surrounding of more similar ones. The picking up
and dropping of objects are governed by the probabilities:
P_{pick}(i) = \left( \frac{k_p}{k_p + f(i)} \right)^2

P_{drop}(i) = \left( \frac{f(i)}{k_d + f(i)} \right)^2
where f(i) is an estimation of the distribution density of the objects in the ant's
immediate environment (i.e. local neighbourhood) with respect to the object that
the ant is considering picking up or dropping. The choice of f(i) varies depending on
the cost and other factors related to the environment and the data items. As f(i)
decreases below kp, the probability of picking up the object is very high, and the
opposite occurs when f(i) exceeds kp. As for the probability of dropping an object,
high f(i) exceeding kd induces the ants to give up the object, while f(i) less than
kd encourages the ants to hold on to the object. The combination of these three
simple operations and the heuristics behind them gave birth to the notion of basic
ants for clustering, also known as standard ant clustering algorithm (SACA).
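The interplay of the two probabilities can be sketched as follows; k_p and k_d are free parameters of the model, and the values used below are merely illustrative:

```python
def p_pick(f_i: float, k_p: float = 0.1) -> float:
    """Pick-up probability: high when the local density f(i) is low relative to k_p."""
    return (k_p / (k_p + f_i)) ** 2

def p_drop(f_i: float, k_d: float = 0.15) -> float:
    """Drop probability: high when the local density f(i) is high relative to k_d."""
    return (f_i / (k_d + f_i)) ** 2

# an isolated object (low density): readily picked up, rarely dropped
print(p_pick(0.01), p_drop(0.01))
# an object among similar neighbours (high density): rarely picked up, readily dropped
print(p_pick(0.9), p_drop(0.9))
```

The squaring sharpens the transition around the constants, so ants behave almost deterministically far from the thresholds.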
Gutowitz [94] examined the basic ants described by Deneubourg et al. and
proposed a variant ant known as complexity-seeking ants. Such ants are capable of
sensing local complexity and are inclined to work in regions of high interest (i.e. high
complexity). Regions with high complexity are determined using a local measure
that assesses the neighbouring cells and counts the number of pairs of contrasting
cells (i.e. occupied or empty). Neighbourhoods with all empty or all occupied immediate
cells have zero complexity while regions with checkerboard patterns have high
complexity. Hence, these modified ants are able to accomplish their task faster be-
cause they are more inclined to manipulate objects in regions with higher complexity
[263].
Lumer & Faieta [160] further extended and improved the idea of ant-based clus-
tering in terms of the numerical aspects of the algorithm and the convergence time.
The authors represented the objects in terms of numerical vectors and the distance
between the vectors is computed using the Euclidean distance. Hence, with
δ(i, j) ∈ [0, 1] as the Euclidean distance between object i (i.e. i is the location of the
object in the centre of the neighbourhood) and every other neighbouring object j,
the neighbourhood function f(i) is defined by the authors as:
f(i) = \begin{cases} \frac{1}{s^2} \sum_j \left( 1 - \frac{\delta(i, j)}{\alpha} \right) & \text{if } f(i) > 0 \\ 0 & \text{otherwise} \end{cases} \quad (7.3)
where s2 is the size of the local neighbourhood, and α ∈ [0, 1] is a constant for scaling
the distance among objects. In other words, an ant has to consider the average
similarity of object i with respect to all other objects j in the local neighbourhood
before performing an operation (i.e. pickup or drop). As the value of f(i) is obtained
by averaging the total similarities with the number of neighbouring cells s2, empty
cells which do not contribute to the overall similarity must be penalised. In addition,
the radius of perception (i.e. the extent to which objects are taken into consideration
for f(i)) of each ant at the centre of the local neighbourhood is given by s−12
. The
clustering algorithm using the basic ant SACA is defined in Algorithm 4.
Handl & Meyer [100] introduced several enhancements to make ant-based clus-
tering more efficient. The first is the concept of eager ants where idle phases are
avoided by having the ants immediately pick up objects as soon as existing ones
are dropped. The second is the notion of stagnation control. There are occasions in
ant-based clustering when ants are occupied or blocked due to objects that are difficult
to dispose of. In such cases, the ants are forced to drop whatever they are carrying
after a certain number of unsuccessful drops. In a different paper [98], the authors
have also demonstrated that the ant-based algorithm has several advantages:
• tolerance to different cluster sizes
• ability to identify the number of clusters
• performance increases with the size of the datasets
• graceful degradation in the face of overlapping clusters.
Algorithm 4 Basic ant-based clustering defined by Handl et al. [99]
1: begin
2: //INITIALISATION PHASE
3: Randomly scatter data items on the toroidal grid
4: for each j in 1 to #agents do
5:   i := random_select(remaining_items)
6:   pick_up(agent(j), i)
7:   g := random_select(remaining_empty_grid_locations)
8:   place_agent(agent(j), g)
9: //MAIN LOOP
10: for each it_ctr in 1 to #iterations do
11:   j := random_select(all_agents)
12:   step(agent(j), stepsize)
13:   i := carried_item(agent(j))
14:   drop := drop_item?(f(i))
15:   if drop = TRUE then
16:     while pick = FALSE do
17:       i := random_select(free_data_items)
18:       pick := pick_item?(f(i))
Nonetheless, the authors have also highlighted two shortcomings of ant-based clustering,
namely, the inability to distinguish more refined clusters within coarser-level
ones, and the inability to specify the number of clusters, which can be a disadvantage
when users have precise ideas about it.
Vizine et al. [263] proposed an adaptive ant clustering algorithm (A2CA) that
improves upon the algorithm by Lumer & Faieta. The authors introduced two major
modifications, namely, progressive vision scheme and the use of pheromones on grid
cells. The progressive vision scheme allows the dynamic adjustment of s2. Whenever
an ant perceives a larger cluster, it increases its radius of perception from the originals−12
to the new s′
−12
. The second enhancement allows ants to mark regions that are
recently constructed or under construction. The pheromones attract other ants,
resulting in an increase in the probability of deconstruction of relatively smaller
regions, and increases the probability of dropping objects at denser clusters.
Ant-based algorithms have been employed to cluster objects that can be repre-
sented using numerical vectors. Similar to conventional algorithms, the similarity or
distance measures used by existing ant-based algorithms are still feature-based. Con-
7.4. The Proposed Tree-Traversing Ants 171
sequently, they share similar problems such as difficult portability across domains.
In addition, despite the strengths of standard ant-based algorithms, two disadvan-
tages were identified. In our new technique, we make use of the known strengths of
standard ant-based algorithms and some desirable traits from conventional ones for
clustering terms using featureless similarity.
7.4 The Proposed Tree-Traversing Ants
The Tree-Traversing Ant (TTA) clustering technique is based on dynamic tree
structures as compared to toroidal grids in the case of standard ants. The dynamic
tree begins with one root node r0 consisting of all terms T = t1, ..., tn, and branches
out to new sub-nodes as required. In other words, the clustering process begins with
r0 = t1, ..., tn. For example, the first snapshot in Figure 7.1 shows the start of the
TTA clustering process with the root node r0 initialised with the terms t1, ...tn=10.
Essentially, each node in the tree is a set of terms ru = t1, ..., tq. The sizes of new
sub-nodes |ru| reduce as less and less terms are assigned to them in the process of
creating nodes with higher intra-node similarity.
The clustering starts with only one ant, while an unbounded number of ants
await work at each of the new sub-nodes created. In the third snapshot in Figure
7.1, while the first ant moves on to work at the left sub-node r01, a new second
ant proceeds to process the right sub-node r02. The number of possible new sub-
nodes for each main node (i.e. branching factor) in this version of TTA is two. In
other words, for each main node rm, we have the sub-nodes {rm1, rm2}. Similar to
some of the current enhanced ants, the TTA ants are endowed with the ability
of short-term memory for remembering similarities and distances acquired through
their senses. The TTA is equipped with two types of senses, namely, NGD and
n-degree of Wikipedia (noW). The standard ants have a radius of perception defined
in terms of cells immediately surrounding the ants. Instead, the perception radius
of TTA ants covers all terms in the two sub-nodes created for each current node. A
current node is simply a node originally consisting of the terms to be sorted into the new
sub-nodes.
The TTA adopts a two-pass approach for term clustering. During the first-pass,
the TTA recursively breaks nodes into sub-nodes and relocates terms until the ideal
clusters are achieved. The resulting trees created in the first-pass are often good
enough to reflect the natural clusters. Nonetheless, discrepancies do occur due to
certain oddities in the co-occurrences of terms on the World Wide Web that manifest
themselves through NGD. Accordingly, a second-pass is created that uses noW for
Figure 7.1: Example of TTA at work
relocating terms which are displaced due to NGD. The second-pass can be regarded
as a refinement phase for producing clusters with higher quality.
7.4.1 First-Pass using Normalised Google Distance
The TTA begins clustering at the root node which consists of all n terms, r0 =
{t1, ..., tn}. Each term can be considered as an element in the node. A TTA ant
randomly picks a term, and proceeds to sense its similarity with every other term
on that same node. The ant repeats this for all n terms until the similarities of all
possible pairs of terms have been memorised. The similarity between two terms tx
and ty is defined as:
s(t_x, t_y) = 1 - \frac{NGD(t_x, t_y)}{\alpha} \quad (7.4)

where NGD(tx, ty) is the distance between terms tx and ty estimated using the original
NGD defined in Equation 7.2. α is a constant for scaling the distance between
the two terms. The algorithm then grows two new sub-nodes to accommodate the
two least similar terms ta and tb. The ant moves the first term ta from the main
node rm to the first sub-node while emitting pheromones that trace back to tb in
the process. The ant then follows the pheromone trail back to the second term tb
to move it to the second sub-node.
The second snapshot in Figure 7.1 shows two new sub-nodes r01 and r02. The
ant moved the term t1 to r01 and the least similar term t6 to r02. Nonetheless, prior
to the creation of new sub-nodes and the relocation of terms, an ideal intra-node
similarity condition must be tested. The operation of moving the two least similar
terms from the current node to create and initialise new sub-nodes is essentially a
partitioning process. Eventually, each leaf node will end up with only one term if
the TTA does not know when to stop. For this reason, we adopt an ideal intra-node
similarity threshold sT for controlling the extent of branching out. Whenever an ant
senses that the similarity between the two least similar terms exceeds sT , no further
sub-nodes will be created and the partitioning process at that branch will cease. A
high similarity (higher than sT ) between the two most dissimilar terms in a node
provides a simple but effective indication that the intra-node similarity has reached
an ideal stage. More refined factors such as the mean and standard deviation of
intra-node similarity are possible but have not been considered.
If the similarity between the two most dissimilar terms is still less than sT ,
further branching out will be performed. In this case, the TTA ant repeatedly picks
up the remaining terms on the current node one by one and senses their similarities
with every other term already located in the sub-nodes. Formally, the
probability of picking up term ti by an ant in the first-pass is defined as:
P^1_{pick}(t_i) = \begin{cases} 1 & \text{if } t_i \in r_m \\ 0 & \text{otherwise} \end{cases} \quad (7.5)
where rm is the set of terms in the current node. In other words, the probability
of picking up terms by an ant is always 1 as long as there are still terms remain-
ing in the current node. Each term ti ∈ rm is moved to one of the two sub-nodes
ru that has the term tj ∈ ru with the highest similarity with ti. In other words,
an ant considers multiple neighbourhoods prior to dropping a term. Snapshot 3 in
Figure 7.1 illustrates the corresponding two sub-nodes r01 and r02 that have been
populated with all the terms which were previously located at the current node
r0. The standard neighbourhood function f(i) defined in Equation 7.3 represents
the density of the neighbourhood as the average of the similarities between ti with
every other term in its immediate surrounding (i.e. local neighbourhood) confined
by s2. Unlike the sense of basic ants which covers only the surrounding cells s2,
the extent to which a TTA ant perceives covers all terms in the two sub-nodes (i.e.
multiple neighbourhoods) corresponding to the immediate current node. Accordingly,
instead of estimating f(i) as the averaged similarity defined over s2 terms
surrounding the ant, the new neighbourhood function fTTA(ti, u) is defined as the
maximum similarity between term ti ∈ rm and the neighbourhood (i.e. sub-nodes)
ru. The maximum similarity between ti and ru is the highest similarity between ti
and all other terms tj ∈ ru. Formally, we define the density of neighbourhood ru
with respect to term ti during the first-pass as:
f^1_{TTA}(t_i, r_u) = \max_{t_j \in r_u} s(t_i, t_j) \quad (7.6)
where the similarity between the two terms s(ti, tj) is computed using Equation 7.4.
Besides deciding on whether to drop an object or not, like in the case of basic
ants, the TTA ant has to decide on one additional issue, namely, where to drop.
The TTA decides on where to drop a term based on the fTTA(ti, ru) that it has
memorised for all sub-nodes ru of the current node rm. Formally, the decision on
whether to drop term ti ∈ rm on sub-node rv depends on:
P^1_{drop}(t_i, r_v) = \begin{cases} 1 & \text{if } f^1_{TTA}(t_i, r_v) = \max_{r_u \in \{r_{m1}, r_{m2}\}} f^1_{TTA}(t_i, r_u) \\ 0 & \text{otherwise} \end{cases} \quad (7.7)
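The first-pass logic of Equations 7.4 to 7.7 for a single node can be sketched as follows; the toy similarity function stands in for the NGD-based s of Equation 7.4, and partition_node is an illustrative name, not the thesis implementation:

```python
from itertools import combinations

def partition_node(terms, s):
    """One first-pass step: seed two sub-nodes with the least similar pair,
    then drop every remaining term into the sub-node holding its most
    similar member (Eqs. 7.5-7.7)."""
    ta, tb = min(combinations(terms, 2), key=lambda pair: s(*pair))
    r1, r2 = [ta], [tb]
    for t in terms:
        if t in (ta, tb):
            continue
        f1 = max(s(t, u) for u in r1)  # f1_TTA(t, r1), Eq. 7.6
        f2 = max(s(t, u) for u in r2)  # f1_TTA(t, r2)
        (r1 if f1 >= f2 else r2).append(t)  # drop decision, Eq. 7.7
    return r1, r2

# toy similarity standing in for the NGD-based s of Eq. 7.4
sim = lambda a, b: 1.0 if a[0] == b[0] else 0.1
print(partition_node(["cat", "cow", "dog", "deer"], sim))
```

In the full algorithm this step is applied recursively to each sub-node until the intra-node similarity threshold sT is satisfied.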
The current version of the TTA clustering algorithm is implemented in two parts.
The first is the main function while the second one is a recursive function. The
main function is defined in Algorithm 5 while the recursive function for the first-
pass elaborated in this subsection is reported in Algorithm 6.
Algorithm 5 Main function
1: input A list of terms, T = {t1, ..., tn}.
2: Create an initial tree with a root node r0 containing n terms.
3: Define the ideal intra-node similarity threshold sT and the outlier discrimination threshold δT.
4: //first-pass using NGD
5: ant := new_ant()
6: ant.ant_traverse(r0, r0)
7: //second-pass using noW
8: leafnodes := ant.pickup_trail() //return all leaf nodes marked by pheromones
9: for each rnext ∈ leafnodes do
10:   ant.ant_refine(leafnodes, rnext)
Algorithm 6 Function ant_traverse(rm, r0) using NGD
1: if |rm| = 1 then
2:   leave_trail(rm, r0) //leave trail from current leaf node to root node, for use in second-pass
3:   return //only one term left. return to root
4: ta, tb := find_most_dissimilar_terms(rm)
5: if s(ta, tb) > sT then
6:   leave_trail(rm, r0) //leave trail from current leaf node to root node, for use in second-pass
7:   return //ideal cluster has been achieved. return to root node
8: else
9:   rm1, rm2 := grow_sub_nodes(rm)
10:   move_terms(ta, tb, rm1, rm2)
11:   for each term ti ∈ rm do
12:     pick(ti) //based on Eq. 7.5
13:     for each ru ∈ {rm1, rm2} do
14:       for each term tj ∈ ru do
15:         s(ti, tj) := sense_similarity(ti, tj) //based on Eq. 7.4
16:         remember_similarity(s(ti, tj))
17:       f1_TTA(ti, ru) := sense_neighbourhood() //based on Eq. 7.6
18:       remember_neighbourhood(f1_TTA(ti, ru))
19:     ∀u, f1_TTA(ti, ru) := recall_neighbourhood()
20:     rv := decide_drop(∀u, f1_TTA(ti, ru)) //based on Eq. 7.7
21:     drop(ti, rv)
22:   antm1 := new_ant()
23:   antm1.ant_traverse(rm1, r0) //repeat the process recursively for each sub-node
24:   antm2 := new_ant()
25:   antm2.ant_traverse(rm2, r0) //repeat the process recursively for each sub-node
7.4.2 n-degree of Wikipedia: A New Distance Metric
The use of NGD for quantifying the similarity between two objects based on
their names alone can occasionally produce low-quality clusters. We will highlight
some of these discrepancies during our initial experiments in the next section. The
initial tree of clusters generated by the TTA using NGD demonstrated promising
results. Nonetheless, we reckoned that higher-quality clusters could be generated
if we allow the TTA ants to visit the nodes again for the purpose of refinement.
Instead of using NGD, we present a new way to gauge the similarity between terms.
Google can be regarded as the gateway to the huge volume of documents on
the World Wide Web. The sheer size of Google’s index enables a relatively reliable
estimate of term usage and occurrence using NGD. The page counts provided by
the Google search engine, which are the essence of NGD, are used to compute
the similarity between two terms based on the mutual information that they both
share at the compressed level. As for Wikipedia, its number of articles is only a
fraction of what Google indexes. Nonetheless, the restrictions imposed on the
authoring of Wikipedia's articles and their organisation provide a possibly new
way of looking at similarity between terms. n-degree of Wikipedia (noW) [272] is
inspired by a game for Wikipedians. 6-degree of Wikipedia[2] is a task set out to
study the characteristics of Wikipedia in terms of the similarity between its articles.
An article in Wikipedia can be regarded as an entry of encyclopaedic information
describing a particular topic. The articles are organised using categorical indices
which eventually leads to the highest level, namely, “Categories”3. Each article
can appear under more than one category. Hence, the organisation of articles in
Wikipedia appears more as a directed acyclic graph with a root node instead of a
pure tree structure4. The huge volume of articles in Wikipedia, the organisation
of articles in a graph structure, the open-source nature of the articles, and the
availability of the articles in electronic form makes Wikipedia the ideal candidate
for our endeavour.
[2] http://en.wikipedia.org/wiki/Six_Degrees_of_Wikipedia
[3] http://en.wikipedia.org/wiki/Category:Categories
[4] http://en.wikipedia.org/wiki/Wikipedia:Categorization#Categories_do_not_form_a_tree
[5] http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons
[6] http://en.wikipedia.org/wiki/Wikipedia:Largest_encyclopedia

We define Wikipedia as a directed graph W := (V, E). W is essentially a network
of linked articles where V = {a1, ..., aω} is the set of articles. We limit the vertices to
English articles only. At the moment, ω = |V| is reported to be 1,384,729[5], making
it the largest encyclopaedia[6] in merely five years since its conception. The interconnections
between articles are represented as the set of ordered pairs of vertices
E. At the moment, the edges are uniformly assigned with weight 1. Each article can
be considered as an elaboration of a particular event, an entity or an abstract idea.
In this sense, an article in Wikipedia is a manifestation of the information encoded
in the terms. Consequently, we can represent each term ti using the corresponding
article ai ∈ V in Wikipedia. Hence, the problem of finding the distance between two
terms ti and tj can be reduced to the discovery of how closely situated are the two
corresponding articles ai and aj in the Wikipedia categorical indices. The problem
of finding the degree of separation between two articles can be addressed in terms of
the single-source shortest path problem. Since the weights are all positive, we have
resorted to Dijkstra’s Algorithm for finding the shortest-path between two vertices
(i.e. articles). Other algorithms for the shortest-path problem are available. How-
ever, a discussion on these algorithms is beyond the scope of this chapter. Formally,
the noW value between terms tx and ty is defined as

noW(t_x, t_y) = \delta(a_x, a_y) = \begin{cases} \sum_{k=1}^{|SP|} c_{e_k} & \text{if } a_x \neq a_y \land a_x, a_y \in V \\ 0 & \text{if } a_x = a_y \land a_x, a_y \in V \\ \infty & \text{otherwise} \end{cases} \quad (7.8)
where δ(ax, ay) is the degree of separation between the articles ax and ay which
correspond to the terms tx and ty, respectively. The degree of separation is computed
as the sum of the costs of all edges along the shortest path between articles ax and
ay in the graph of Wikipedia articles W. SP is the set of edges along the shortest
path and e_k is the kth edge or element in set SP. |SP| is the number of edges
along the shortest path and c_{e_k} is the cost associated with the kth edge. It is also
worth mentioning that while δ(ax, ay) ≥ 0 for ax, ay ∈ V, no upper bound can
be ascertained. The noW value between terms that do not have corresponding
articles in Wikipedia is set to ∞. There is a hypothesis[7] stating that no two articles
in Wikipedia are separated by more than six degrees. However, some
Wikipedians have shown that certain articles can be separated by up to eight
steps[8]. This is the reason why we adopted the name n-degree of Wikipedia instead
of 6-degree of Wikipedia.
[7] http://tools.wikimedia.de/sixdeg/index.jsp
[8] http://en.wikipedia.org/wiki/Six_Degrees_of_Wikipedia
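Since all edge costs are currently 1, the shortest-path computation behind Equation 7.8 is straightforward. Here is a sketch using Dijkstra's algorithm on a tiny, made-up fragment of the article graph; the articles and links are illustrative only:

```python
import heapq

def now_distance(graph, a_x, a_y):
    """noW (Eq. 7.8) via Dijkstra; graph maps article -> {neighbour: edge cost}.
    Returns infinity when an article is missing or no path exists."""
    if a_x not in graph or a_y not in graph:
        return float("inf")
    dist = {a_x: 0}
    heap = [(0, a_x)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == a_y:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost in graph[u].items():
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")

# tiny made-up fragment of the article graph, all edges of cost 1
g = {"Ant": {"Insect": 1}, "Insect": {"Ant": 1, "Animal": 1},
     "Animal": {"Insect": 1, "Dog": 1}, "Dog": {"Animal": 1}}
print(now_distance(g, "Ant", "Dog"))  # -> 3
```

With uniform unit costs a breadth-first search would suffice; Dijkstra is used here because the thesis leaves the door open to weighted edges.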
7.4.3 Second-Pass using n-degree of Wikipedia
Upon completing the first-pass, there are at most n leaf nodes, which occurs when
each term in the initial set of all terms T ends up in an individual node (i.e. cluster). There are
only two possibilities for such extreme cases. The first is when the ideal intra-node
similarity threshold sT is set too high, while the second is when all the terms are
extremely unrelated. In normal cases, most of the terms will be nicely clustered
into nodes with intra-node similarities exceeding sT. Only a small number of terms
is usually isolated into individual nodes. We refer to these terms as isolated terms.
There are two possibilities that lead to isolated terms in normal cases, namely, (1)
the term has been displaced during the first-pass due to discrepancies related to
NGD, or (2) the term is in fact an outlier. The TTA ants leave pheromone trails on
their return trip to the root node (as in line 2 and line 7 of Algorithm 6) to mark the
paths to the leaf nodes. In order to relocate the isolated terms to other more suitable
nodes, the TTA ants return to the leaf nodes by following the pheromone trails. At
each leaf node rl, the probability of picking up a term ti during the second-pass is
1 if the leaf node has only one term (isolated term):
P^2_{pick}(t_i) = \begin{cases} 1 & \text{if } |r_l| = 1 \land t_i \in r_l \\ 0 & \text{otherwise} \end{cases} \quad (7.9)
After picking up an isolated term, the TTA ant continues to move from one leaf
node to the next. At each leaf node, the ant determines whether that particular
leaf node (i.e. neighbourhood) rl is the most suitable one to house the isolated
term ti based on the average distance between ti and all other existing terms in rl.
Formally, the density of neighbourhood rl with respect to the isolated term ti during
the second-pass is defined as:
f^2_{TTA}(t_i, r_l) = \frac{\sum_{j=1}^{|r_l|} noW(t_i, t_j)}{|r_l|}
\quad (7.10)
where |rl| is the number of terms in the leaf node rl and the noW value between
the two terms ti and tj is computed using Equation 7.8.
This process of sensing the distance of the isolated term with all other terms in
a leaf node is performed for all leaf nodes. The probability of the ant dropping the
isolated term ti on the most suitable leaf node rv is evaluated once the ant returns
to the original leaf node that used to contain ti. Back at the original leaf node of
ti, the ant recalls the neighbourhood density f^2_{TTA}(t_i, r_l) that it has memorised for
all neighbourhoods (i.e. leaf nodes). The TTA ant drops the isolated term ti on the
leaf node rv if all terms in rv collectively yield the minimum average distance with
ti that satisfies the outlier discrimination threshold δT . Formally,
P^2_{drop}(t_i, r_v) =
\begin{cases}
1 & \text{if } f^2_{TTA}(t_i, r_v) = \min_{r_l \in L} f^2_{TTA}(t_i, r_l) \wedge f^2_{TTA}(t_i, r_v) \le \delta_T \\
0 & \text{otherwise}
\end{cases}
\quad (7.11)
where L is the set of all leaf nodes. After the ant has visited all the leaf nodes
and has failed to drop the isolated term, the term will be returned to its original
location. The failure to drop the isolated term in a more suitable node indicates
that the term is an outlier.
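The pick-and-drop rules above (Equations 7.9 to 7.11) can be sketched procedurally as follows. For brevity, the pheromone-guided ant traversal is replaced with a plain loop over the leaf nodes, and the distance function, the threshold value and the helper names are hypothetical:

```python
def second_pass(leaves, now):
    """Illustrative second-pass relocation of isolated terms (Eqs. 7.9-7.11).
    `leaves` maps leaf-node ids to lists of terms; `now(t_i, t_j)` is a noW
    distance function. delta_T is the outlier discrimination threshold
    (value chosen arbitrarily here)."""
    delta_T = 3.0

    def density(t_i, terms):            # Eq. 7.10: average noW distance
        return sum(now(t_i, t_j) for t_j in terms) / len(terms)

    for origin, terms in leaves.items():
        if len(terms) != 1:             # Eq. 7.9: pick only isolated terms
            continue
        t_i = terms[0]
        # sense the neighbourhood density of every other non-empty leaf
        scores = {l: density(t_i, ts) for l, ts in leaves.items()
                  if l != origin and ts}
        if not scores:
            continue
        best = min(scores, key=scores.get)
        if scores[best] <= delta_T:     # Eq. 7.11: drop at the densest leaf
            leaves[best].append(t_i)
            leaves[origin].remove(t_i)
        # otherwise t_i stays put and is deemed an outlier
    return leaves
```

With a toy distance function that keeps mammal names close together, an isolated "fox" would be relocated into the mammal leaf, while a term far from every neighbourhood (distance above delta_T everywhere) remains in its own node as an outlier.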
Referring back to the example in Figure 7.1, assume that snapshot 5 represents
the end of the first-pass where the intra-node similarities of all nodes have satisfied
sT . While all other leaf nodes, namely, r011, r012 and r021 consist of multiple terms,
leaf node r022 contains only one term t6. Hence, at the end of the first-pass, all ants,
namely, ant1, ant2, ant3 and ant4 retreat back to the root node r0. Then, during
the second-pass, one TTA ant is deployed to relocate the isolated term t6 from r022
to either leaf node r011, r012 or r021, depending on the average distances of these leaf
nodes with respect to t6. The algorithm for the second-pass using noW is described
in Algorithm 7. Unlike the ant traverse() function in Algorithm 6 where each new
sub-node is processed as a separate iteration of ant traverse() using an independent
TTA ant, there is only one ant required throughout the second-pass.
7.5 Evaluations and Discussions
In this section, we focus on evaluations at the conceptual layer of ontologies to
verify the taxonomic structures discovered using TTA. We employ three existing
metrics. The first is known as Lexical Overlap (LO) for evaluating the intersection
between the discovered concepts (Cd) and the recommended (i.e. manually created)
concepts (Cm) [164]. The manually created concepts can be regarded as the reference
for our evaluations. LO is defined as:
LO = \frac{|C_d \cap C_m|}{|C_m|}
\quad (7.12)
Some minor changes were made in terms of how the intersection between the set
of recommended clusters and discovered clusters (i.e. Cd ∩ Cm) is computed. The
Algorithm 7 Function ant refine(leafnodes, ru) using noW
1: if |ru| = 1 then
2:   // current leaf node has isolated term ti
3:   pick(ti) // based on Eq. 7.9
4:   for each rl ∈ leafnodes do
5:     for each term tj in current leaf node rl do
6:       // jump from one leaf node to the next to sense neighbourhood density
7:       δ(ti, tj) := sense distance(ti, tj) // based on Eq. 7.8
8:       remember distance(δ(ti, tj))
9:     f^2_{TTA}(ti, rl) := sense neighbourhood() // based on Eq. 7.10
10:    remember neighbourhood(f^2_{TTA}(ti, rl))
11:  // back to original leaf node of term ti after visiting all other leaves
12:  ∀l, f^2_{TTA}(ti, rl) := recall neighbourhood()
13:  rv := decide drop(∀l, f^2_{TTA}(ti, rl)) // based on Eq. 7.11
14:  if rv not null then
15:    drop(ti, rv) // drop at ideal leaf node
16:  else
17:    drop(ti, ru) // outlier: no ideal leaf node; drop back at original leaf node
normal way of having exact lexical matching of the concept identifiers cannot be
applied to our experiments. Due to the ability of the TTA to discover concepts
with varying levels of granularity depending on sT , we have to take into consideration
the possibility of sub-clusters that collectively correspond to some recommended
clusters. For our evaluations, the presence of discovered sub-clusters that correspond
to some recommended clusters is considered as a valid intersection. In other words,
given that C_d = \{c_1, ..., c_n\} and C_m = \{c_x\} where c_x \notin C_d, then

|C_d \cap C_m| = 1 \quad \text{if } c_1 \cup ... \cup c_n = c_x
The second metric is used to account for valid discovered concepts that are absent
from the reference set, while the third metric ensures that concepts which exist in
the reference set but are not discovered are also taken into consideration. The second
metric is referred to as Ontological Improvement (OI) and the third metric is known
as Ontological Loss (OL). They are defined as [214]:
OI = \frac{|C_d - C_m|}{|C_m|}
\quad (7.13)
Table 1. Summary of the datasets employed for experiments. Column
Cm lists the recommended clusters and column Cd the clusters automatically
discovered using TTA.
OL = \frac{|C_m - C_d|}{|C_m|}
\quad (7.14)
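The three metrics, together with the sub-cluster relaxation of the intersection described earlier, might be computed as in the following sketch. The cluster representation (frozensets of terms) and the function name are our own assumptions, and treating every discovered sub-cluster of a recommended cluster as non-new for OI is one reading of the set differences under that relaxation:

```python
def evaluate(discovered, recommended):
    """Illustrative computation of Lexical Overlap (Eq. 7.12), Ontological
    Improvement (Eq. 7.13) and Ontological Loss (Eq. 7.14). Clusters are
    frozensets of terms. A recommended cluster counts as matched when it is
    discovered exactly, or when discovered sub-clusters union to it."""
    matched = set()
    for c_m in recommended:
        subs = [c_d for c_d in discovered if c_d <= c_m]
        if c_m in discovered or (subs and frozenset().union(*subs) == c_m):
            matched.add(c_m)
    lo = len(matched) / len(recommended)                       # Lexical Overlap
    new = [c_d for c_d in discovered
           if not any(c_d <= c_m for c_m in recommended)]
    oi = len(new) / len(recommended)                   # Ontological Improvement
    ol = (len(recommended) - len(matched)) / len(recommended)  # Ontological Loss
    return lo, oi, ol
```

For example, if the recommended clusters are {a, b, c} and {d}, and TTA discovers {a, b}, {c} and {e}, then {a, b, c} is matched via the union of its sub-clusters, {e} counts towards OI, and the undiscovered {d} counts towards OL.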
Ontology learning is an incremental process that involves the continuous maintenance
of the ontology every time new terms are added. As such, we do not see clustering
large datasets as a problem. In this section, we employ seven datasets to assess the
quality of the discovered clusters using the three metrics described above. The origin
of the datasets and some brief descriptions are provided below:
• Three of the datasets used for our experiments were obtained from the UCI
Machine Learning Repository9. These sets are labelled as WINE 15T, MUSHROOM 16T
and DISEASE 20T. The accompanying numerical attributes, which
were designed for use with feature-based similarities, were removed.
9 http://www.ics.uci.edu/~mlearn/MLRepository.html
Table 2. Summary of the evaluation results for all ten experiments
using the three metrics LO, OI and OL.
• We also employ the original animals dataset (i.e. ANIMAL 16T) proposed for
use with Self-Organising Maps (SOMs) by Ritter & Kohonen [209].
• We constructed the remaining three datasets called ANIMALGOOGLE 16T,
MIX 31T and MIX 60T. ANIMALGOOGLE 16T is similar to the ANIMAL 16T
dataset except for a single replacement with the term “Google”. The other two
MIX datasets consist of a mixture of terms from a large number of domains.
Table 1 summarises the datasets employed for our experiments. The column Cm
lists the recommended clusters and the column Cd the clusters automatically discovered
using TTA. Table 2 summarises the evaluation of TTA using the three metrics for all
ten experiments. The high lexical overlap (LO) shows the good domain coverage of
the discovered clusters. The occasionally high ontological improvement (OI) demonstrates
the ability of TTA to highlight new, interesting concepts that were overlooked
during the manual creation of the recommended clusters.
During the experiments, snapshots were produced to show the results in two
parts: results after the first-pass using NGD, and results after the second-pass using
noW. The first experiment uses WINE 15T. The original dataset has 178 nameless
instances spread out over 3 clusters. Each instance has 13 attributes for use with
feature-based similarity measures. We augment the dataset by introducing famous
names from the wine domain and removing the numerical attributes. We maintain
the three clusters, namely, “white”, “red” and “mix”. “Mix” refers to wines that were
Figure 7.2: Experiment using 15 terms from the wine domain. Setting sT = 0.92
results in 5 clusters. Cluster A is simply red wine grapes or red wines, while Cluster E
represents white wine grapes or white wines. Cluster B represents wines named after
famous regions around the world and they can either be red, white or rose. Cluster
C represents white noble grapes for producing great wines. Cluster D represents
red noble grapes. Even though uncommon, Shiraz is occasionally admitted to this
group.
named after famous wine regions around the world. Such wines can either be red or
white. As shown in Figure 7.2, setting sT = 0.92 produces five clusters. Clusters A
and D are actually sub-clusters for the recommended cluster “red”, while Clusters C
and E are sub-clusters for the recommended cluster “white”. Cluster B corresponds
exactly to the recommended cluster “mix”. The second experiment uses MUSH-
ROOM 16T. The original dataset has 8124 nameless instances spread out over two
clusters. Each instance has 22 nominal attributes for use with feature-based simi-
larity measures. We augment the dataset by introducing names of mushrooms that
fit into one of the two recommended clusters, namely, “edible” and “poisonous”. As
shown in Figure 7.3, setting sT = 0.89 produces 4 clusters. Cluster A corresponds
exactly to the recommended cluster “poisonous”. The remaining three clusters are
actually sub-clusters of the recommended cluster “edible”. Cluster B contains edible
mushrooms prominent in East Asia, while Clusters C and D comprise mushrooms
found mostly in North America and Europe, and are prominent in Western cuisines.
Similarly, the third experiment was conducted using DISEASE 20T with the results
shown in Figure 7.4. At sT = 0.86, TTA discovered hidden sub-clusters within
the four recommended clusters, namely, “skin”, “blood”, “cardiovascular” and “di-
gestion”. In relation to this, Handl et al. [99] highlighted a shortcoming in their
Figure 7.3: Experiment using 16 terms from the mushroom domain. Setting sT =
0.89 results in 4 clusters. Cluster A represents poisonous mushrooms. Cluster B
comprises edible mushrooms which are prominent in East Asian cuisine except for
Agaricus Blazei. Nonetheless, this mushroom was included in this cluster probably
due to its high content of beta glucan for potential use in cancer treatment, just
like Shiitake. Moreover, China is the major exporter of Agaricus Blazei, also known
as Himematsutake, further relating this mushroom to East Asia. Clusters C and D
comprise edible mushrooms found mainly in Europe and North America, and are
more prominent in Western cuisines.
evaluation of ant-based clustering algorithms. The authors stated that the algo-
rithm “...only manages to identify these upper-level structures and fails to further
distinguish between groups of data within them.”. In other words, unlike existing
ant-based algorithms, the first three experiments demonstrated that our TTA has
the ability to further distinguish hidden structures within clusters.
The fourth and fifth experiments were conducted using the ANIMAL 16T dataset.
This dataset has been employed to evaluate both the standard ant-based clustering
(SACA) and the improved version called A2CA by Vizine et al. [263]. The original
dataset consists of 16 named instances, each representing an animal using binary
feature attributes. Both SACA and A2CA discovered two natural clusters, one
for “mammal” and the other for “bird”. While SACA was inconsistent in its
results, A2CA yielded 100% recall rate over ten runs. The authors of A2CA stated
that the dataset can also be represented as three recommended clusters. In the
spirit of the evaluation by Vizine et al., we performed the clustering of the 16
animals using TTA over ten runs. In our case, no feature was used. Just like all
experiments in this chapter, the 16 animals were clustered based on their names.
Figure 7.4: Experiment using 20 terms from the disease domain. Setting sT = 0.86
results in 7 clusters. Cluster A represents skin diseases. Cluster B represents a
class of blood disorders known as anaemia. Cluster C represents other kinds of
blood disorders. Cluster D represents blood disorders characterised by the relatively
low count of leukocytes (i.e. white blood cells) or platelets. Cluster E represents
digestive diseases. Cluster F represents cardiovascular diseases characterised by
both the inflammation and thrombosis (i.e. clotting) of arteries and veins. Cluster
G represents cardiovascular diseases characterised by the inflammation of veins only.
As shown in the fourth experiment in Figure 7.5, by setting sT = 0.60, the TTA
automatically discovered the two recommended clusters after the second-pass: “bird”
and “mammal”. While ant-based techniques are known for their intrinsic capability
in identifying clusters automatically, conventional clustering techniques (e.g. K-
means, average link agglomerative clustering) rely on the specification of the number
of clusters [99]. The inability to control the desired number of natural clusters can
be troublesome. According to Vizine et al. [263], “in most cases, they generate a
number of clusters that is much larger that the natural number of clusters”. Unlike
both extremes, TTA has the flexibility in regard to the discovery of clusters. The
granularity and number of discovered clusters in TTA can be adjusted by simply
modifying the threshold sT . By setting higher sT , the number of discovered clusters
for ANIMAL 16T has been increased to five as shown in Figure 7.6. A lower value
of the desired ideal intra-node similarity sT results in less branching out and hence,
fewer clusters. Conversely, setting higher sT produces more tightly coupled terms
where the similarities between elements in the leaf nodes are very high. In the
fifth experiment depicted in Figure 7.6, the value sT was raised to 0.72 and more
refined clusters were discovered: “bird”, “mammal hoofed”, “mammal kept as pet”,
“predatory canine” and “predatory feline”.
Figure 7.5: Experiment using 16 terms from the animal domain. Setting sT = 0.60
produces 2 clusters. Cluster A comprises birds and Cluster B represents mammals.
The next three experiments were conducted using the ANIMALGOOGLE 16T
dataset. These three experiments are meant to reveal another advantage of TTA
through the presence of an outlier, namely, the term “Google”. An outlier can be
simply considered as a term that does not fit into any of the clusters. In Figure 7.7,
TTA successfully isolated the term “Google” while discovering clusters at different
levels of granularity based on different sT . As similar terms are clustered into the
same node, outliers are eventually singled out as isolated terms in individual leaf
nodes. Consequently, unlike some conventional techniques such as K-means [282],
clustering using TTA is not susceptible to poor results due to outliers. In fact, there
are two ways of looking at the term “Google”, one as an outlier as described above,
or the second as an extremely small cluster with one term. Either way, the term
“Google” demonstrates two abilities of TTA: the capability to identify and isolate
outliers, and tolerance of differing cluster sizes, like its predecessors. Handl et al.
[99] have shown through experiments that certain conventional clustering techniques
such as K-means and one-dimensional self-organising maps perform poorly in the
face of increasing deviations between cluster sizes.
The last two experiments were conducted using MIX 31T and MIX 60T. Fig-
ure 7.8 shows the results after the first-pass and second-pass using 31 terms while
Figure 7.6: Experiment using 16 terms from the animal domain (the same dataset
from the experiment in Figure 7.5). Setting sT = 0.72 results in 5 clusters. Cluster
A represents birds. Cluster B includes hoofed mammals (i.e. ungulates). Cluster C
corresponds to predatory felines while Cluster D represents predatory canines. Cluster
E constitutes animals kept as pets.
Figure 7.9 shows the final results using 60 terms. Similar to the previous experi-
ments, the first-pass resulted in a number of clusters plus some isolated terms. The
second-pass aims to relocate these isolated terms to the most appropriate clusters.
Despite the rise in the number of terms from 31 to 60, all the clusters formed by
the TTA after the second-pass correspond precisely to their occurrences in real-life
(i.e. natural clusters). With the absolute consistency of the results over ten runs,
these two experiments yield 100% recall just like the previous experiments. Consequently,
we can claim that TTA is able to produce consistent results, unlike the
standard ant-based clustering where the solution does not stabilise and fails to converge.
For example, in the evaluation by Vizine et al. [263], the standard ant-based
clustering was inconsistent in its performance over the ten runs using the ANIMAL 16T
dataset. This is a very common problem in ant-based clustering when
“they constantly construct and deconstruct clusters during the iterative procedure of
adaptation” [263].
There is also another advantage of TTA that is not found in the standard ants,
namely, the ability to identify taxonomic relations between clusters. Referring to
all ten experiments conducted, we noticed that there is implicit hierarchical
information that connects the discovered clusters. For example, referring to the most
Figure 7.7: Experiment using 15 terms from the animal domain plus an additional
term “Google”. Setting sT = 0.58 (left screenshot), sT = 0.60 (middle screenshot)
and sT = 0.72 (right screenshot) result in 2 clusters, 3 clusters and 5 clusters, respec-
tively. In the left screenshot, Cluster A acts as the parent for the two recommended
clusters “bird” and “mammal”, while Cluster B includes the term “Google”. In the
middle screenshot, the recommended clusters “bird” and “mammal” were clearly
reflected through Clusters A and C, respectively. By setting sT higher, we dissected
the recommended cluster “mammal” to obtain the discovered sub-clusters C, D and
E as shown in the right screenshot.
recent experiment in Figure 7.8, the two discovered Clusters A (which contains
“Sandra Bullock”, “Jackie Chan”, “Brad Pitt”) and B (which contains “3 Doors
Down”, “Aerosmith”, “Rod Stewart”) after the second-pass share the same parent
node. We can employ the graph of Wikipedia articles W to find the nearest common
ancestor of the two natural clusters and label it with the category name provided by
Wikipedia. In our case, we can label the parent node of the two natural clusters as
“Entertainers”. In fact, the labels of the natural clusters themselves can be named
using the same approach. For example, the terms in the discovered cluster B (which
contains “3 Doors Down”, “Aerosmith”, “Rod Stewart”) fall under the same category
“American musicians” in Wikipedia and hence, we can accordingly label this cluster
using that category name. In other words, clustering using TTA with the help of
NGD and noW not only produces flexible and consistent natural clusters, but is
also able to identify implicit taxonomic relations between clusters. Nonetheless,
we would like to point out that not all hierarchies of natural clusters formed by
the TTA correspond to real-life hierarchical relations. More research is required to
properly validate this capability of the TTA.
Figure 7.8: Experiment using 31 terms from various domains. Setting sT = 0.70
results in 8 clusters. Cluster A represents actors and actresses. Cluster B represents
musicians. Cluster C represents countries. Cluster D represents politics-related
notions. Cluster E is transport. Cluster F includes finance and accounting matters.
Cluster G constitutes technology and services on the Internet. Cluster H represents
food.
One can notice that in all the experiments in this section, the quality of the
clustering output using TTA would be less desirable if we relied only on the results
from the first-pass. As pointed out earlier, the second-pass is necessary to produce
naturally-occurring clusters. The results after the first-pass usually contain isolated
terms due to discrepancies in NGD. This is mainly caused by word pairs whose
popularity on the Web does not reflect their natural relatedness. For example, given
the words “Fox”, “Wolf” and “Entertainment”, the first two should go together
naturally. Unfortunately, due to the popularity of the name “Fox Entertainment”,
a Google search using the pair “Fox” and “Wolf” generates a lower page count than
“Fox” and “Entertainment”. A lower page count has adverse effects on Equation
7.2, resulting in lower similarity. Using Equation 7.4, “Fox” and “Entertainment”
achieve a similarity of 0.7488 while “Fox” and “Wolf” yield a lower similarity of
0.7364. Despite such shortcomings, search engine page counts and Wikipedia offer
TTA the ability to handle technical terms and common words of any domain re-
gardless of whether they have been around for some time or merely beginning to
evolve into common use on the Web. Due to the mere reliance on names or nouns
for clustering, some readers may question the ability of TTA to handle various
linguistic issues such as synonyms and word senses. Looking back at Figure 7.4, the
Figure 7.9: Experiment using 60 terms from various domains. Setting sT = 0.76
results in 20 clusters. Clusters A and B represent herbs. Cluster C comprises pastry
dishes while Cluster D represents dishes of Italian origin. Cluster E represents
computing hardware. Cluster F is a group of politicians. Cluster G represents cities
or towns in France while Cluster H includes countries and states other than France.
Cluster I constitutes trees of the genus Eucalyptus. Cluster J represents marsupials.
Cluster K represents finance and accounting matters. Cluster L comprises transports
with four or more wheels. Cluster M includes plant organs. Cluster N represents
beverages. Cluster O represents predatory birds. Cluster P comprises birds other
than predatory birds. Cluster Q represents two-wheeled transports. Clusters R and
S represent predatory mammals. Cluster T includes trees of the genus Acacia.
terms “Buerger’s disease” and “Thromboangiitis obliterans” are actually synonyms
referring to the acute inflammation and thrombosis (clotting) of arteries and veins
of the hands and feet. In the context of the experiment in Figure 7.2, the term “Bor-
deaux” was treated as “Bordeaux wine” instead of the “city of Bordeaux”, and was
successfully clustered together with other wines from other famous regions such as
“Burgundy”. In another experiment in Figure 7.9, the same term “Bordeaux” was
automatically disambiguated and treated as a port-city in the Southwest of France
instead. The TTA then automatically clustered this term together with other cities
in France such as “Chamonix” and “Paris”. In short, TTA has the inherent capa-
bility of coping with synonyms, word senses and the fluctuation in term usage.
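For concreteness, the page-count computation underlying the discussion above can be sketched as follows using the standard Normalised Google Distance formulation, which the thesis' Equation 7.2 is based on. The mapping from distance to similarity shown here (sim = 1 − NGD) is our assumption rather than necessarily Equation 7.4, and the counts in the usage note are invented for illustration:

```python
from math import log

def ngd(f_x, f_y, f_xy, N):
    """Normalised Google Distance from page counts: f_x and f_y are the page
    counts of the two terms, f_xy the count of pages containing both, and N
    the (estimated) number of pages indexed by the search engine."""
    return ((max(log(f_x), log(f_y)) - log(f_xy)) /
            (log(N) - min(log(f_x), log(f_y))))

def sim(f_x, f_y, f_xy, N):
    """Similarity derived from the distance; the 1 - NGD mapping is an
    assumption for illustration, not necessarily the thesis' Eq. 7.4."""
    return 1.0 - ngd(f_x, f_y, f_xy, N)
```

With hypothetical counts, two terms whose joint page count f_xy is higher (relative to the same individual counts) obtain a smaller distance and hence a higher similarity, which is exactly the mechanism behind the "Fox"/"Entertainment" versus "Fox"/"Wolf" discrepancy discussed above.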
The quality of the clustering results is very much dependent on the choice of sT
and to a lesser extent, δT . Nonetheless, as an effective rule-of-thumb, sT should be
set as high as possible. Higher sT will result in more leaf nodes with each having
possibly a smaller number of terms that are tightly coupled together. High sT
will also enable the isolation of potential outliers. The isolated terms and outliers
generated by a high sT can then be further refined in the second-pass. The ideal
range of sT derived through our experiments is within 0.60–0.90. Setting sT too low
will result in very coarse clusters like the ones shown in Figure 7.5 where potential
sub-clusters are left uncovered. Regarding the value of δT , it is usually set inversely
proportional to sT . As shown during our evaluations, the higher we set sT , the more
we decrease the value of δT . The reason behind the choices of these two threshold
values can be explained as follows: as we lower sT , TTA produces coarser clusters
with loosely coupled terms. The intra-node distances of such clusters are inevitably
higher compared to the finer clusters because the terms in these coarse clusters are
more likely to be less similar. In order for the second-pass to function appropriately
during the relocation of isolated terms and the isolation of outliers, δT has to be set
comparatively higher. Besides, lower sT will not provide the adequate discriminative
ability for the TTA to distinguish or pick out the outliers. Another interesting point
about sT is that by setting it to the maximum (i.e. 1.0), it results in a divisive
clustering effect. In divisive clustering, the process starts with one, all-inclusive
cluster and at each step, splits the cluster until only singleton clusters of individual
terms remain [242].
7.6 Conclusion and Future Work
In this chapter, we introduced a decentralised multi-agent system for term clustering
in ontology learning. Unlike document clustering or other forms of clustering
in pattern recognition, clustering terms in ontology learning requires a different
approach. The most evident adjustment required in term clustering is the measure of
similarity and distance. Existing term clustering techniques in many ontology learn-
ing systems remain confined within the realm of conventional clustering algorithms
and feature-based similarity measures. Since there is no explicit feature attached to
terms, these existing techniques have come to rely on contextual cues surrounding
the terms. These clustering techniques require extremely large collections of domain
documents to reliably extract contextual cues for the computation of similarity
matrices. In addition, the static background knowledge required for term clustering,
such as WordNet, patterns and templates, makes such techniques even more difficult
to scale across domains.
Consequently, we introduced the use of featureless similarity and distance mea-
sures called Normalised Google Distance (NGD) and n-degree of Wikipedia (noW)
for term clustering. The use of these two measures as part of a new multi-pass clus-
tering algorithm called Tree-Traversing Ant (TTA) demonstrated excellent results
during our evaluations. Standard ant-based techniques exhibit certain characteris-
tics that have been shown to be useful and superior compared to conventional clus-
tering techniques. The TTA is the result of an attempt to inherit these strengths
while avoiding some inherent drawbacks. In the process, certain advantages from the
conventional divisive clustering were incorporated, resulting in the appearance of a
hybrid between ant-based and conventional algorithms. Seven of the most notable
strengths of the TTA with NGD and noW are: (1) the ability to further distinguish
hidden structures within clusters, (2) flexibility in regard to the discovery of clusters,
(3) the capability to identify and isolate outliers, (4) tolerance of differing cluster
sizes, (5) the ability to produce consistent results, (6) the ability to identify implicit
taxonomic relations between clusters, and (7) the inherent capability of coping with
synonyms, word senses and fluctuations in term usage.
Nonetheless, much work is still required in certain aspects. One of our main
plans for future work is to ascertain the validity of, and make good use of, the
implicit hierarchical relations discovered using TTA. The next issue that interests
us is the automatic labelling of the natural clusters and the nodes in the hierarchy
using Wikipedia. Labelling has always been a hard problem in clustering, especially
document and term clustering. We are also keen on conducting more studies on the
interaction between the two thresholds in TTA, namely, sT and δT . If possible, we
intend to find ways to enable the automatic adjustment of these threshold values to
maximise the quality of clustering output.
7.7 Acknowledgement
This research was supported by the Australian Endeavour International Post-
graduate Research Scholarship, and a Research Grant 2006 from the University of
Western Australia. The authors would like to thank the anonymous reviewers for
their invaluable comments.
7.8 Other Publications on this Topic
Wong, W., Liu, W. & Bennamoun, M. (2006) Featureless Similarities for Terms
Clustering using Tree-Traversing Ants. In the Proceedings of the International Sym-
posium on Practical Cognitive Agents and Robots (PCAR), Perth, Australia.
This paper reports the preliminary work on clustering terms using featureless simi-
larity measures. The resulting clustering technique called TTA was later refined to
contribute towards the core contents of Chapter 7.
Wong, W., Liu, W. & Bennamoun, M. (2008) Featureless Data Clustering. M.
Song and Y. Wu (eds.), Handbook of Research on Text and Web Mining Technolo-
gies, IGI Global.
The research on TTA reported in Chapter 7 was generalised in this book chapter
to work with both terms and Internet domain names.
CHAPTER 8
Relation Acquisition
Abstract
Common techniques for acquiring semantic relations rely on static domain and
linguistic resources, predefined patterns, and the presence of syntactic cues. This
chapter proposes a hybrid technique which brings together established and novel
techniques in lexical simplification, word disambiguation and association inference
for acquiring coarse-grained relations between potentially ambiguous and composite
terms using only dynamic Web data. Our experiments using terms from two different