LINKING NAMED ENTITIES TO A STRUCTURED KNOWLEDGE BASE

By
Kranthi Reddy. B
200502008

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
Master of Science (by Research)
in
Computer Science & Engineering

Search and Information Extraction Lab
Language Technologies Research Centre
International Institute of Information Technology
Hyderabad, India
June 2010
It is certified that the work contained in this thesis, titled “Linking Named Entities
to a Structured Knowledge Base” by Kranthi Reddy. B (200502008), submitted
in partial fulfillment for the award of the degree of Master of Science (by Research)
in Computer Science & Engineering, has been carried out under my supervision
and has not been submitted elsewhere for a degree.

Date Advisor :
Dr. Vasudeva Varma
Associate Professor
IIIT, Hyderabad
Acknowledgements
I am grateful to my advisor, Dr. Vasudeva Varma, for his advice and for believing
in me throughout the duration of my thesis work. His regular suggestions have been
of great value. I would also like to thank Dr. Prasad Pingali for his valuable insights
on research. It has been a great pleasure and joy to work with him for the whole
duration of my MS by Research studies. I have been fortunate to get timely advice
and quick feedback from Dr. Prasad Pingali and Dr. Vasudeva Varma in spite of their
hectic schedules. I would like to thank Mr. Babji, who worked tirelessly to keep the
IE lab servers running 24/7.

I would also like to acknowledge the time, help and guidance provided by Praneeth
and Sai Krishna. Both have been monumental in giving shape to my thesis
draft, without whom it would have been a herculean task. Along with Kiran, they not
only helped me through the difficult times but also helped me cope with the
pressure. Their confidence in me gave me a lot of moral support. I have had the pleasure of
working and publishing with all three of them. Thanks to them for showing
that research can be done with interest and fun.
I thank all my colleagues at Setu Software Systems Pvt. Ltd., where I have been
working as an intern during the entire period of my thesis. I have had a great time
and fun working in their companionship.

A person can be defined by the social circle he is associated with, and I think I had
one of the best circles of friends during my stay at IIIT. I thank Abhilash and Ambati
for their inputs and discussions on my thesis work. A special thanks to Phani
Chaitanya, Ganesh, Girish, Gopal, Vijay, Harsha and Samrat, who have been my
close-knit group of friends. Their frequent visits to campus during my research lifted
my spirits many a time. Special thanks to Charan, who always gave philosophical
and motivating talks whenever he saw me in a dull mood.

Last, but not least, I would like to thank my parents and sister for their
trust in my abilities. They gave me the freedom and space to grow as an individual. I
thank them for being my invisible sources of moral and mental support.
Abstract
The World Wide Web (WWW) is a huge, widely distributed, global source of
information for web users. Web documents are broadly classified into unstructured
and structured documents. Users prefer structured documents when looking for a
piece of information. Hence, in the past decade the research community has focused on
mining structured information from unstructured documents and on preserving
it in the form of attribute-value pairs, tables, flow charts, etc. However, the focus
has been only on extracting information at the document level or in particular domains
like disaster, finance and medicine. These techniques never attempted to integrate the
extracted information into common knowledge repositories like Wikipedia, DBpedia,
etc.
Structured databases like Wikipedia and DBpedia are created through collaborative
contributions from volunteers and organizations. Since they rely heavily on
manual effort, the process of updating these databases is not only tedious and time
consuming but also fraught with many drawbacks. Hence, automatically updating
structured databases has become one of the hot topics of research in the past few
years. The task can be broken down into two sub-problems: Entity Linking and Slot
Filling. In this thesis, we address Entity Linking, the task of linking named entities
occurring in a document to entries in a Knowledge Base. This is a challenging task
because entities not only occur in various forms, viz. acronyms, nicknames,
spelling variations, etc., but also occur in various contexts.
Once named entities from documents are linked to entries in a knowledge base,
information can be integrated across them. Current IE techniques can be used to extract
information from documents. Person name disambiguation and co-reference
resolution are two tasks that share many similarities with Entity Linking. These
tasks have attempted to link entities across documents but never attempted to integrate
them into a common Knowledge Base.
Our approach to Entity Linking begins with building an Entity Repository
(ER). The ER contains information about the different forms of named entities and is built
using Wikipedia structural information, namely redirect pages, disambiguation pages
and the bold text of the first paragraph. Our core algorithm for Entity Linking can be
broken down into two steps: Candidate List Generation (CLG) and Ranking.

In the CLG phase, we use the ER, web search results and a named entity recognizer
to identify all possible variations of a given named entity. Using these
variations, we obtain an unordered list of candidate nodes from the KB that could
be linked to the given named entity in a document. In the ranking phase, we rank
the unordered list of candidate nodes using various similarity techniques. We calculate
the similarity between the text of the candidate nodes and the document in
which the named entity occurs. We experiment with ranking using various similarity
functions, viz. cosine similarity, Naïve Bayes, maximum entropy, Tf-idf ranking
and re-ranking using pseudo-relevance feedback. Our experiments show that cosine
similarity and Naïve Bayes perform close to the state of the art, and the Tf-idf ranking
function performs better in some cases.
Our approach was tested on the standard Entity Linking datasets provided as part
of the Text Analysis Conference (TAC) Knowledge Base Population (KBP) shared
task. We evaluated our approach using the Micro-Average Score (MAS), the standard
evaluation metric. We achieved MAS of 83% and 85% on the TAC-KBP
Entity Linking 2009 and 2010 datasets respectively, which secured the top spot in
these shared tasks.
Publications
• Kranthi Reddy, Karun Kumar, Sai Krishna, Prasad Pingali, Vasudeva Varma, “Linking Named Entities to a Structured Knowledge Base”, in CICLing 2010. Published in the International Journal of Computational Linguistics and Applications, ISSN 0976-0962.
• Vasudeva Varma, Vijay Bharath Reddy, Sudheer K, Praveen Bysani, GSK Santosh, Kiran Kumar, Kranthi Reddy, Karuna Kumar, Nithin M, “IIIT Hyderabad at TAC 2009”, in the Working Notes of the Text Analysis Conference (TAC), National Institute of Standards and Technology, Gaithersburg, Maryland, USA, November 2009.
• Praveen Bysani, Kranthi Reddy, Vijay Bharath Reddy, Sudheer Kovelamudi, Prasad Pingali, Vasudeva Varma, “IIIT Hyderabad in Guided Summarization and Knowledge Base Population”, in the Working Notes of the Text Analysis Conference (TAC), National Institute of Standards and Technology, Gaithersburg, Maryland, USA, November 2010.
6.4 The above table indicates the number of queries (2010 query set) having a particular candidate list size.
6.5 The above table indicates the number of queries (2009 query set) having a particular candidate list size.
6.6 The above table indicates the failure to list the correct candidate node in the Candidate List even though the mapping node exists in the Knowledge Base.
6.7 Average Micro-Average Score and baseline scores obtained by the various participating universities/teams for the TAC-KBP Entity Linking task on the 2009 and 2010 query sets.
6.8 Micro-Average Score for individual heuristics for the 2010 query set. Google Search includes both Google spell suggestion and Google directive search.
6.9 Micro-Average Score for individual heuristics for the 2009 query set. Google Search includes both Google spell suggestion and Google directive search.
6.10 Statistics of NIL predictions and their accuracy for the 2010 query set.
6.11 Statistics of NIL predictions and their accuracy for the 2009 query set.
6.12 Performance comparison with the top 5 systems at the TAC-KBP 2010 Entity Linking sub-task.
6.13 Performance comparison with the top 5 systems at the TAC-KBP 2009 Entity Linking sub-task.
If the best-ranked candidate node's similarity score is greater than 0.4, it is returned as the
mapping node; otherwise NIL is predicted.
The system developed by us and the one by Xianpei Han et al. bear a lot of similarities. Both
systems create candidate sets and then rank them using bag-of-words (BOW) features. The difference
between the systems is that we use a more finely tuned module for generating the candidate
sets and for handling acronyms. Another key difference is in the approach to NIL detection.
CHAPTER 2. RELATED WORK
We augment the KB with Wikipedia in order to predict NIL for entities that don't have a
mapping node in the KB, whereas Xianpei Han et al. predict a mapping node or NIL based
on a fixed threshold.

The main drawback of this approach is that the manually written heuristics for candidate
detection cover only a limited set of patterns. Another drawback concerns the NIL
prediction methodology proposed by Xianpei Han et al.: fixing the same threshold for
query entities occurring across various contexts is never a good strategy.
2.3.3 Supervised Machine Learning for Entity Linking
Fangtao Li et al. [28] use a “learning to rank” strategy to find the mapping node in the
KB for a query entity. They employ a listwise learning-to-rank model and augment it
with a Naïve Bayes binary classifier to find a mapping node. Their algorithm can be broken
down into multiple steps, but the main components remain the same, i.e. candidate node
generation and ranking. We now explain the algorithm in detail.

• Preprocessing : Since the KB can contain documents in the order of millions, Fangtao Li et al. index
them for faster access. Also, since query entities might sometimes be
misspelled, they use the query correction functions of Google, AltaVista 7, etc.
• Query Expansion : Fangtao Li et al. argue that using only the given query entity
is not sufficient to find the correct mapping node in the KB. Hence, they use
various strategies, such as using the document associated with the query entity to find the
expanded form of abbreviations, and using Wikipedia redirect, disambiguation and link
information to obtain the possible variations of an entity.
• Candidate Generation : Using the obtained variations, they retrieve the top 20 documents
from the KB by forming an “OR” query from the entity variations. The
obtained set of candidate nodes is then ranked to identify the mapping node.
7http://www.altavista.com/
• Listwise Learning to Rank : Using a small training set of 285 queries, they adopt
ListNet, a learning-to-rank algorithm proposed by Zhe Cao et al. [9]. The candidate
nodes obtained are ranked using the model built. Then, they use a Naïve Bayes
binary classifier to decide whether the top-ranked node is correct or whether NIL should be
predicted.
The drawback of the approach of Fangtao Li et al. is that it requires a large corpus
of human-annotated data to train the model. Creating training data for the three categories,
namely person, location and organization, covering various contexts is a difficult and time-consuming
task. McNamee et al. [37] also propose a supervised machine learning approach similar
to that of Fangtao Li et al. The only difference is that McNamee et al. consider absence (NIL) as another entry
to rank and select the top-ranked node directly, unlike Fangtao Li et al., who use a Naïve
Bayes binary classifier. We show that our approach scales easily to large KBs and
performs better than all the above algorithms without any training data.
2.4 Conclusions
In this chapter, we discussed in detail the literature related to EL. We discussed
seminal work on person name disambiguation and co-reference resolution, as they share
many similarities with EL. Then, we discussed the seminal work on EL by Cucerzan and by
Bunescu and Pasca. Later, we explained in detail the three systems developed as part of the TAC-KBP
EL shared task and their shortcomings, and discussed how our approach
overcomes them. In the next chapter, we explain the first phase of our
algorithm, Candidate List Generation (CLG).
Chapter 3
Candidate List Generation
Given a KB, the task of EL is to determine, for each named entity occurring in a document,
which KB node is being referred to, or whether it is a new entity not present in the KB.
As discussed in Section 1.5, we break EL into two steps. In the first step, we build an
entity repository (ER), which contains the different forms of various named entities. The ER is
built using various features of Wikipedia. The ER is a prerequisite for identifying candidate
nodes because it contains information about the various forms in which a named entity can
occur.

In the next step, the query entity1 is expanded to obtain its variations. In addition to using
the ER for identifying query entity variations, we use web search results and the Stanford NER.
These variations are used to generate candidate nodes from the KB, referred to as the Candidate
List (CL). This phase of generating the CL is referred to as the Candidate List Generation (CLG)
phase. The candidate nodes are finally ranked using various similarity techniques. In this
chapter, we explain the CLG phase in detail.
1Query Entity refers to a named entity occurring in a document which is to be linked to a node in the
Knowledge Base, if any.
3.1 Building Entity Repository
In the real world, a named entity can be referred to using various forms, such as nicknames,
aliases, acronyms and spelling variations. We introduced these forms with examples in
Chapter 1. In order to handle these variations, we build an ER which contains the various
forms in which an entity can be referred to. Though the web contains various forms of named
entities, it is not an ideal place to extract entity variations, for the following reasons.

• The web is voluminous and continues to grow at an astounding rate, both in
traffic and in size. Valuable information about entities is sparsely distributed
across the web, and mining entity variations from such voluminous data
is tedious and time consuming.

• A large percentage of web documents are unstructured. Inferring information from
such a wide range of documents is extremely difficult and not an ideal solution.

• Most of the information available on the web is never moderated. Hence, extracting
information from the web can yield false and unauthenticated data.
Hence, we use Wikipedia, the largest semi-structured database available [55],
to mine the various forms of named entities. The advantages of using Wikipedia are:

• It has better coverage of named entities [69]. Since the KB provided by the TAC-KBP
shared task covers only named entities, Wikipedia acts as a perfect platform for building
our ER.

• Articles in Wikipedia are heavily linked and structured. We use the information
encoded in redirect and disambiguation pages for extracting named entity variations.

• With over 3.5 million articles, Wikipedia is big enough to provide
information about name variants.
• Since data on Wikipedia is moderated, we can be assured of a certain level of
authenticity of the information present in it.
The existing literature [18, 41, 45] confirms that valuable information can be
mined from Wikipedia. A sample Wikipedia article/document encoded in XML is shown
in Figure 3.1.

Figure 3.1 A sample article/document in Wikipedia.

A Wikipedia article contains a unique title, an ID, text carrying information about an
entity/event, and some meta-information. We use the title and text of an article for identifying
name variants.

The features we use for extracting name variants from Wikipedia are:
• Redirect Pages : A redirect page in Wikipedia is an aid to navigation; it contains
no content, only a link to another article (the target page), and relates strongly to
the concept of the target page. In layman's terms, a redirect is a page which has no
content itself, but sends the reader to another article, or to a section of an article,
usually from an alternative title. Redirect pages help in identifying the following
name variants.
– Alternative names (for example, “Edison Arantes do Nascimento” redirects to
“Pelé”).
– Plurals (for example, “Greenhouse gases” redirects to “Greenhouse gas”).
– Closely related words (for example, “Symbiont” redirects to “Symbiosis”).
– Less specific forms of names, for which the article subject is still the primary
topic. For example, “Hitler” redirects to “Adolf Hitler”.
– More specific forms of names (for example, “Articles of Confederation and
Perpetual Union” redirects to “Articles of Confederation”).
– Abbreviations (for example, “DSM-IV” redirects to “Diagnostic and Statistical
Manual of Mental Disorders”).
– Alternative spellings or punctuation (for example, “Colour” redirects to “Color”,
and “Al-Jazeera” redirects to “Al Jazeera”).
– Likely misspellings (for example, “Condoleeza Rice” redirects to “Condoleezza
Rice”).
– Likely alternative capitalizations (for example, “Natural Selection” redirects to
“Natural selection”).
A sample redirect page encoded in XML is shown in Figure 3.2. A redirect page
contains a unique title and redirect information pointing to the original article. For example,
from Figure 3.2, we obtain “Tendulkar” as a name variant of “Sachin Tendulkar”.
Figure 3.2 A sample redirect document in Wikipedia.
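The mapping encoded by a redirect page can be extracted with a few lines of code. The sketch below assumes the MediaWiki dump convention of a `<title>` element plus a `<redirect title="..."/>` element; the exact markup of the dump used here may differ.

```python
import re

def extract_redirect(page_xml):
    """Return (variant, target) if the page is a redirect, else None."""
    title = re.search(r"<title>(.*?)</title>", page_xml)
    target = re.search(r'<redirect title="(.*?)"\s*/>', page_xml)
    if title and target:
        # The redirect's own title is a name variant of the target article.
        return title.group(1), target.group(1)
    return None

page = '<page><title>Tendulkar</title><redirect title="Sachin Tendulkar" /></page>'
print(extract_redirect(page))  # ('Tendulkar', 'Sachin Tendulkar')
```

In a full build, this function would be applied to every page of the dump, accumulating (variant, target) pairs into the ER.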
• Disambiguation Pages : Disambiguation pages are specifically created for ambiguous
entities and consist of links to articles defining the different meanings of the
entity. They resolve conflicts in article titles that occur when a single term can be
associated with more than one topic, making that term a likely title for more than
one article. In other words, disambiguation pages are paths leading to different
articles which could, in principle, have the same title. For example, the word
“Mercury” can refer to an element, a planet, a Roman god, and many other things.
This feature helps in homonym resolution.

A sample disambiguation page encoded in XML is shown in Figure 3.3. From Figure
3.3, we can conclude that “Sachin” is a name variant of “Sachin Tendulkar”, “Sachin
Pilgaonkar”, etc.
• Bold Text From First Paragraph : On randomly analyzing a few pages in Wikipedia,
we found that the bold text in the first paragraph of a Wikipedia article generally
gives the full name or nickname of a named entity. This feature thus helps in identifying
full names and nicknames of an entity.

From Figure 3.1, we can conclude that “Sachin Ramesh Tendulkar” (the bold
text) is a name variant of “Sachin Tendulkar”.
Figure 3.3 A sample disambiguation document in Wikipedia.
Using the above features of Wikipedia, we obtain different variations of a named
entity. For example, the variations obtained for “Sachin Tendulkar” are:

• “Sachin Ramesh Tendulkar” from the bold text of the first paragraph, which is in fact the
full name of “Sachin Tendulkar”.

• “Tendulkar” from a redirect page, which is a less specific form of “Sachin Tendulkar”.

• “Sachin” from a disambiguation page.

All these variations are indexed using Lucene2, a high-performance, full-featured text
search engine, to enable fast retrieval of documents.

The ER is important because it gathers information about the various forms of named entities in
one place. These variations are used in our CLG phase to identify candidate nodes from
the KB.

2http://lucene.apache.org
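The resulting ER is essentially a many-to-many map from surface forms to article titles. The system indexes it with Lucene; in the minimal sketch below a plain dictionary stands in for that index, populated with the Sachin Tendulkar examples from above.

```python
from collections import defaultdict

# Toy entity repository: variant (lower-cased) -> set of canonical titles.
er = defaultdict(set)

def add_variant(variant, entity):
    er[variant.lower()].add(entity)

# Pairs mined from redirect pages, disambiguation pages and bold text.
add_variant("Tendulkar", "Sachin Tendulkar")
add_variant("Sachin", "Sachin Tendulkar")
add_variant("Sachin", "Sachin Pilgaonkar")
add_variant("Sachin Ramesh Tendulkar", "Sachin Tendulkar")

def lookup(query):
    """Return the entities a surface form may refer to."""
    return sorted(er.get(query.lower(), set()))

print(lookup("Sachin"))  # ['Sachin Pilgaonkar', 'Sachin Tendulkar']
```

An ambiguous form such as “Sachin” maps to several entities, which is exactly why a later ranking phase is needed.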
3.2 Identifying Query Entity Variations
In this phase, we identify all possible variations of the query entity. We use the query document
in context, web search results and the ER to identify query entity variations. The entity variants
obtained are then used during the candidate list3 (CL) identification phase to identify
mapping nodes in the KB. We now describe the various steps in identifying query entity
variations in detail.
3.2.1 Using Query Document in Context
We use the given query document for two purposes. First, we use it to identify the expanded
form of the query entity, if it is an acronym. Second, we use it to identify the full name, nickname,
alias name, etc., if any. We use the Stanford NER for the second task. We
now describe each in detail.
Acronym Expansion : Here the goal is to find the expanded form of the query entity if
it is an acronym. We check whether the query entity is an acronym, i.e. contains all upper-case
characters. If so, we try to find the expanded form
in the corresponding query document, if any. We use an N-gram based approach: we remove
stop words from the document
and check whether any sequence of N continuous tokens has the same initials as our query entity.
If an expanded form is found, we use it along with the query entity (acronym) to search the
ER. The intuition is that it is common for entities to be introduced in text in their
full forms and subsequently referred to by shorter forms or pronouns. Resolving these in-document
co-reference links to retrieve the full form can thus have a substantial impact on
candidate ambiguity.
For example, given the following sentences :
• ...the newly-formed All Basotho Convention (ABC) is far from certain...

3The unordered list of candidate nodes obtained using query entity variations is referred to as the Candidate List
(CL).
• ...Abbott Laboratories (ABT:NYSE) ...

• ...the Anti-Corruption Unit (ACU) of the International Cricket Council (ICC) ...

• ...member countries of the Asian Clearing Union (ACU) recorded...

We can easily identify the expanded forms of all the above acronyms using this simple
N-gram based technique. For example, ABC refers to All Basotho Convention, the first
ACU refers to Anti-Corruption Unit and the second ACU refers to Asian Clearing Union.
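The N-gram heuristic above can be sketched as follows. The small stop-word list and the letters-only tokenization are simplifying assumptions of this sketch, not the system's exact preprocessing.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "from"}

def expand_acronym(acronym, document):
    """Return the first N-token window whose initials spell the acronym."""
    if not acronym.isupper():
        return None  # only all-upper-case query entities are treated as acronyms
    tokens = [t for t in re.findall(r"[A-Za-z]+", document)
              if t.lower() not in STOP_WORDS]
    n = len(acronym)
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if "".join(w[0].upper() for w in window) == acronym:
            return " ".join(window)
    return None

text = "the newly-formed All Basotho Convention (ABC) is far from certain"
print(expand_acronym("ABC", text))  # All Basotho Convention
```

Note that scanning left to right naturally picks the nearest expansion, which resolves the two different “ACU” mentions to their respective documents' expansions.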
Stanford Named Entity Recognizer 4 : The Stanford NER provides a general implementation
of linear-chain Conditional Random Field (CRF) sequence models, coupled with
well-engineered feature extractors for NER. It can identify persons, locations and organizations.

We run the Stanford NER on the query document. It tokenizes the text, extracts named
entity mentions and tags them as “PERSON”, “LOCATION” or “ORGANIZATION”.
Phrases belonging to any of the three categories and having our query entity
as a substring are identified as possible variations of the query entity. This feature
helps us identify the full name, nickname, alias name, etc. of the query entity, if any. The
purpose of this heuristic is to use the least ambiguous mentions in the document as the basis
for CL identification. It is common for entities to be introduced in discourse in their full forms
and subsequently referred to by shorter forms or pronouns. Resolving these in-document
co-reference links to retrieve the full form can thus have a substantial impact on candidate
ambiguity, and subsequently on an EL system.

For example, the mention of “Columbus” will be co-referred to the full form “Columbus,
Ohio” if the latter is extracted as a mention from the query document.
4http://nlp.stanford.edu/software/CRF-NER.shtml
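Once an NER run has produced the mention list, the filtering step reduces to a substring test. In this sketch the mentions are hard-coded; in the actual system they come from the Stanford NER output.

```python
def variations_from_mentions(query, mentions):
    """Keep tagged mentions that contain the query entity as a proper
    substring; these act as fuller, less ambiguous forms of the query."""
    q = query.lower()
    return [m for m in mentions if q in m.lower() and m.lower() != q]

# Mentions as an NER tool might extract them from the query document.
mentions = ["Columbus, Ohio", "Columbus", "Ohio State University"]
print(variations_from_mentions("Columbus", mentions))  # ['Columbus, Ohio']
```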
3.2.2 Using Entity Repository
Using the ER built above, we obtain all possible name variants of the given query entity. In
simple terms, the variations obtained are the name variants of the query entity found in
Wikipedia. The given query entity is searched against the Lucene index built over the ER, and
the results obtained are its name variants.

For example, “George W. Bush”, “George H. W. Bush” and “George P. Bush” are name
variants of “George Bush” found in the ER.
3.2.3 Using Web Search Results
We use the Google search engine to identify query entity variations, employing Google's spell
suggestion and Google's site-specific search feature. We now describe each in detail.

Google Spell Suggestion : Google's spell checking compares the words entered
against a constantly changing list of the most common searches and detects when a user
may have intended to enter a different word or words. Because it does not depend on a rigid
dictionary, it is effective at identifying words and phrases that are commonly used
but often not included in formal dictionaries, i.e. named entities. Google's checker is
particularly good at recognizing frequently made typos, misspellings and misconceptions.
Although most query entity strings are well formed, some still contain
spelling errors, so we try to correct them using the spell suggestion feature
of the Google search engine. We input the query entity string to the search engine,
which returns a corrected spelling of the string if the original one
was wrong. Since our query entities are named entities, this returns the best
possible spelling.
Google Site-Specific Search : Google allows a user to specify a single website from
which to get results. For example, the query [ Iraq site:nytimes.com ] will return pages
about Iraq, but only from nytimes.com. This feature performs a site-specific search on
that particular website and returns a ranked set of documents from it. We use it to obtain
a ranked set of Wikipedia documents for our query entities; “site:en.wikipedia.org” restricts
the results to the Wikipedia domain. From the ranked set of web search results, we take the
title of the top-ranked result as a variation of our query entity. This helps us identify name
variants of the query entity when Wikipedia documents are ranked by the Google search engine.

For example, “HDFC Bank” is obtained as a variation of the query entity “HDFC”.
3.3 Candidate Nodes Identification
Once the set of name variants of the query entity is obtained, we need to identify the set
of possible mapping nodes in the KB. We search for the name variants of the query entity in
the titles of the KB nodes. This search is an important step because if the correct mapping
node isn't picked into the Candidate List 5 (CL), the system will fail irrespective of how good
the ranking algorithm is. We believe that as long as the correct mapping node is picked into
the CL, the likelihood of it being returned as the mapping node after ranking is very high.
The search of name variants over the KB titles is done in the following way.
• Token Search : The name variants of the query entity are searched over the titles of
KB nodes. A Boolean “AND” search of all the tokens of each query entity variation is
done against the KB node titles. If all the tokens are present, we add the KB node to the CL.

For example, if the given query entity is “CCP” and we find its name variant to
be “Chinese Communist Party”, we would retrieve nodes with the title “Chinese
Communist Party” or “Communist Party of China”.
5The unordered list of candidate nodes obtained during candidate node identification is referred as Can-
didate List.
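The token search can be sketched as a set-containment test: every token of the variation must appear among the tokens of the KB node title, in any order. (Retrieving a title such as “Communist Party of China” additionally relies on further variants mined for the query; only the plain AND test is shown here.)

```python
def token_and_match(variant, title):
    """Boolean AND over tokens, case-insensitive and order-independent."""
    title_tokens = set(title.lower().split())
    return all(tok in title_tokens for tok in variant.lower().split())

print(token_and_match("Chinese Communist Party", "Chinese Communist Party"))   # True
print(token_and_match("Party Communist", "Communist Party of China"))          # True
print(token_and_match("Chinese Communist Party", "Communist Party of China"))  # False
```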
3.4 Adding Wikipedia Article to the Candidate List
As we need to predict NIL for query entities that don't have a mapping node in the KB, we
also add Wikipedia nodes to the CL. We search Wikipedia using the same name variants
obtained for a query entity; the token search used for the KB is used for Wikipedia
as well. We add to the CL only those Wikipedia nodes which are not present in the KB.

Adding Wikipedia articles to the CL allows us to consider strong matches for query
entities that have no corresponding node in the KB, in which case we can return NIL.
That is, if the ranking function maps a given query entity to a Wikipedia article in
the CL, we can confirm the absence of a KB node for the query entity. This
method of augmenting the given KB is a far better strategy than fixing a threshold
value for predicting NIL.
The result of the CLG phase is an unordered list of candidate nodes, which we need to rank
in order to find the correct mapping node. We have experimented with
various similarity functions for ranking, which are explained in the next chapter.
A flow chart of our CLG phase is shown in Figure 3.4.
3.5 Conclusions
In this chapter, we described the various features used to build the ER from Wikipedia. We used
Wikipedia-specific structure, i.e. redirect pages, disambiguation pages and the bold text of the first
paragraph of an article, to build the ER. We then used the ER, web search results and the Stanford
NER to identify query entity variations. Using these variations, we search the given KB
and Wikipedia to identify an unordered list of candidate nodes, referred to as the CL. In the next
chapter, we use various similarity techniques to rank the nodes in the CL to obtain the mapping
node.
Figure 3.4 Flow Chart of Candidate List Generation Phase
Chapter 4
Entity Linking as Ranking
In this chapter, we describe the core part of our approach, i.e. predicting the mapping node
from the generated list of candidate nodes (CL). We rank the candidate nodes based on their
similarity to the query document. Predicting the mapping node can be broken down into
three steps :

1. The candidate nodes and the query document are tokenized and represented as
token vectors.

2. We use a wide variety of IR similarity techniques to compute the similarity between
the candidate node vectors and the query document vector. The candidate node with the
highest similarity score is referred to as the Best Ranked Node (BRN).

3. The mapping node or NIL is predicted based on whether BRN ∈ KB or BRN ∈ Wikipedia.
To calculate the similarity between the candidate nodes and the query document, we have
experimented with cosine similarity, Naïve Bayes, maximum entropy, Tf-idf ranking and
pseudo-relevance feedback ranking.
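Steps 2 and 3 above amount to an argmax followed by a membership test. A sketch, with the similarity scores assumed to be already computed by one of the functions listed; the node ids are hypothetical:

```python
def predict_mapping(scored_candidates):
    """scored_candidates: (node_id, source, score) tuples, where source
    is "KB" or "WIKI" (a Wikipedia node added to the CL). Returns the
    KB node id, or "NIL" when the best-ranked node lies outside the KB
    (or when the candidate list is empty)."""
    if not scored_candidates:
        return "NIL"
    brn = max(scored_candidates, key=lambda c: c[2])  # Best Ranked Node
    return brn[0] if brn[1] == "KB" else "NIL"

# Hypothetical scores: the augmented Wikipedia node outranks every KB node,
# so the query entity is judged absent from the KB.
cands = [("E0042", "KB", 0.61), ("W:Some_Article", "WIKI", 0.74)]
print(predict_mapping(cands))  # NIL
```

This makes the contrast with threshold-based NIL detection concrete: no fixed cutoff appears anywhere; NIL falls out of the competition between KB and Wikipedia nodes.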
4.1 Entity Linking as Ranking
The result of the CLG phase in Chapter 3 is an unordered list of candidate nodes. If |CL|=0,1
we return NIL; otherwise we rank the candidate nodes to predict the mapping node.
|CL|=0 is the case where no name variant of the query entity is present in the KB or
Wikipedia titles; we predict NIL for such cases as no candidate node could be obtained.
When |CL| ≠ 0, similarity is calculated between the candidate nodes and the query document
using various techniques. For this, we represent the query document Dq and the
candidate nodes C = {C1, C2, ..., Cn}, where Ci ∈ C, as vectors. Similarity is calculated
between the vector representations of the query document (Dq) and the candidate nodes (Ci).
4.2 Vector Representation of Documents
In this section, we briefly describe the process of obtaining the vector representation of a
document. First, the query document (Dq) and the candidate nodes (Ci) are tokenized using space
as a delimiter. Tokens belonging to the stop word list² are removed and the remaining
tokens are stemmed to obtain vectors for each document. The representation of a set of
documents as vectors in a common vector space is known as the Vector Space Model [60]
and is fundamental to a host of information retrieval operations, ranging from scoring
documents on a query to document classification and document clustering.

Let S denote the set of all stop words. Consider the document associated with the
query entity as Dq, where Dq = {q1, q2, ..., qn} with qi ∉ S and each qi a stemmed word. Let
V⃗(Dq) = (q1, q2, ..., qn) be the vector representation of the query document.

Similarly, let the set of candidate nodes be C, where C = {C1, C2, ..., Cn}. Each candidate
node Ci ∈ C is Ci = {w1, w2, ..., wm} with wi ∉ S and each wi a stemmed word. Let
C⃗ = {V⃗(D1), ..., V⃗(Dn)}, with V⃗(Di) = (wi1, wi2, ..., wim), be the vector representation of
the candidate nodes.

¹|CL| refers to the size of the candidate list (CL).
²We used a list of 200 frequently occurring stop words from the web.
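The preprocessing described above can be sketched as follows. This is an illustrative stand-in, not our exact implementation: the stop word list here is a tiny sample rather than the 200-word web list, and the suffix stripper is a crude substitute for a proper stemmer.

```python
import re

# Small sample stop list; the actual system uses ~200 stop words from the web.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "in", "to", "for"}

def stem(token):
    # Crude suffix stripping; a real system would use e.g. Porter stemming.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def to_vector(document):
    """Tokenize on whitespace, strip punctuation, drop stop words, stem."""
    tokens = document.lower().split()
    tokens = [re.sub(r"[^a-z0-9]", "", t) for t in tokens]
    return [stem(t) for t in tokens if t and t not in STOP_WORDS]
```

The resulting token lists are the document vectors used by the similarity techniques below.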
We now discuss the various techniques we experimented with to calculate the similarity between
the candidate nodes and the query document.
4.3 Cosine Similarity
In this section, we describe in detail how we identify the BRN from the CL using cosine
similarity. The model is based on the intuition that documents sharing a higher number of
common terms are more similar. In this model, we view the set of candidate nodes as a
set of vectors in a vector space in which there is one axis for each token. We compute the
similarity between the query document and each candidate node as the cosine of the angle
between the vector V⃗(Dq) and the candidate node vector.
Figure 4.1 Cosine Similarity.
The cosine similarity between the query document Dq and a candidate node Ci is computed as

sim(Dq, Ci) = (V⃗(Dq) · V⃗(Ci)) / (|V⃗(Dq)| |V⃗(Ci)|)    (4.1)

where the numerator represents the dot product (also known as the inner product) of the
vectors V⃗(Dq) and V⃗(Ci), while the denominator is the product of their Euclidean lengths.
The dot product V⃗(Dq) · V⃗(Ci) of the two vectors is defined as ∑_{j=1}^{M} Dq,j × Ci,j, where M
is the size of the union of the tokens representing the documents Dq and Ci. The Euclidean
length of Dq is defined as √(∑_{j=1}^{M} V⃗(Dq)j²); the Euclidean length of Ci is calculated
similarly.
The effect of the denominator in Equation (4.1) is to length-normalize the vectors
V⃗(Dq) and V⃗(Ci) to unit vectors v⃗(Dq) = V⃗(Dq)/|V⃗(Dq)| and v⃗(Ci) = V⃗(Ci)/|V⃗(Ci)|.
We can then rewrite (4.1) as

sim(Dq, Ci) = v⃗(Dq) · v⃗(Ci)    (4.2)

Thus, (4.2) can be viewed as the dot product of the normalized versions of the two
vectors. This measure is the cosine of the angle θ between the two vectors, shown in Figure
4.1.
The candidate node Ci with the highest cosine similarity score to the query document Dq is
returned as the BRN.
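Equation 4.1 can be sketched over token lists as follows. This is an illustrative sketch, not our exact implementation; `best_ranked_node` is a hypothetical helper name.

```python
import math
from collections import Counter

def cosine_similarity(query_tokens, candidate_tokens):
    """Cosine of the angle between two bag-of-words vectors (Equation 4.1)."""
    q, c = Counter(query_tokens), Counter(candidate_tokens)
    dot = sum(q[t] * c[t] for t in q.keys() & c.keys())
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_c = math.sqrt(sum(v * v for v in c.values()))
    if norm_q == 0 or norm_c == 0:
        return 0.0
    return dot / (norm_q * norm_c)

def best_ranked_node(query_tokens, candidates):
    """Index of the candidate node with the highest cosine score (the BRN)."""
    scores = [cosine_similarity(query_tokens, c) for c in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])
```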
4.4 Classification Model
In the field of IR, document classification is the task of assigning a document to one or
more classes based on its features. This task is also referred to as text classification, text
categorization, topic classification or topic spotting. The notion of classification is very
general and has many applications within and beyond IR. In our scenario, we assume each
candidate node Ci represents a unique class label (Li). We need to determine which class
(Li) is the closest mapping class for our query document Dq. We have experimented with
two classification techniques:
• Naïve Bayes.
• Maximum Entropy.
We use the implementations of Naïve Bayes and maximum entropy available in the Rainbow
Text Classifier³.
Supervised classification models like Naïve Bayes and maximum entropy require labeled
training data. The labeled training data is obtained by using a set of features to represent
each document. Selecting a set of features to represent a document is called feature selection.
We now explain the importance of the feature selection process and later describe how
we use features to represent the training documents (Ci).
Feature Selection: Feature selection is the process of selecting a subset of the terms
occurring in the training set (C) and using only this subset as features in text classification.
Feature selection serves two main purposes.
• First, it makes training and applying a classifier more efficient by decreasing the size
of the effective vocabulary.
• Second, feature selection often increases classification accuracy by eliminating noise
features⁴.
Our representation of the candidate nodes (C) obtained in Section 4.2 serves this purpose.
Through stop word removal and tokenization we obtain the subset of effective vocabulary
terms that represents the candidate nodes (Ci) better.
4.4.1 Naïve Bayes
In this section, we explain how Naïve Bayes is used for identifying the BRN. Naïve Bayes is a
simple probabilistic classifier based on applying Bayes' theorem with strong independence
assumptions. It has been used for a wide range of applications like text classification [26,
56, 1, 61], word sense disambiguation [49, 15], sentiment classification [48, 64, 38] etc.

³http://www.cs.cmu.edu/∼mccallum/bow/rainbow/
⁴A noise feature is one that, when added to the document representation, increases the classification error
on new data.
We now describe how Naïve Bayes is used for identifying the BRN. The probability of the
query document Dq being in class Li (candidate node Ci) is computed as

P(Li|Dq) ∝ P(Li) ∏_{1≤k≤n} P(qk|Li)    (4.3)

where P(qk|Li) is the conditional probability of term qk occurring in a candidate node
of class Li. We interpret P(qk|Li) as a measure of how much evidence qk contributes that
Li is the correct class. P(Li) is the prior probability of a candidate node occurring in class
Li. If a candidate node's terms do not provide clear evidence for one class versus another,
we choose the one that has a higher prior probability. ⟨q1, q2, ..., qn⟩ are the tokens in the
query document Dq that are part of the vocabulary we use for classification, and n is the
number of such tokens in Dq.
Our goal is to find the best mapping class (Li) for the query document (Dq). The best
class in Naïve Bayes classification is the most likely, or maximum a posteriori (MAP), class
cmap:

cmap = argmax_{Li ∈ C} P̂(Li|Dq) = argmax_{Li ∈ C} P̂(Li) ∏_{1≤k≤n} P̂(qk|Li)    (4.4)

We write P̂ for P because we do not know the true values of the parameters P(Li) and
P(qk|Li), but estimate them from the training set.
We obtain the likelihood for each candidate node (Li) and rank the nodes accordingly. The
candidate node (Ci) with the highest likelihood score is returned as the BRN.
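As a sketch, a multinomial Naïve Bayes with add-one smoothing over the candidate nodes might look as follows. We actually used the Rainbow toolkit; this illustrative version assumes uniform priors, since each candidate node forms its own single-document class.

```python
import math
from collections import Counter

def train_nb(candidates):
    """candidates: list of token lists, one per class Li (Equations 4.3-4.4)."""
    vocab = set(t for c in candidates for t in c)
    priors = [1.0 / len(candidates)] * len(candidates)  # uniform priors (assumption)
    cond = []
    for c in candidates:
        counts = Counter(c)
        denom = len(c) + len(vocab)  # add-one (Laplace) smoothing
        cond.append({t: (counts[t] + 1) / denom for t in vocab})
    return vocab, priors, cond

def classify_nb(query_tokens, vocab, priors, cond):
    """Return the class index with the highest log posterior (the BRN)."""
    best, best_lp = 0, float("-inf")
    for i, p in enumerate(priors):
        lp = math.log(p)
        for t in query_tokens:
            if t in vocab:  # out-of-vocabulary tokens are ignored
                lp += math.log(cond[i][t])
        if lp > best_lp:
            best, best_lp = i, lp
    return best
```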
4.4.2 Maximum Entropy
In this section, we describe the maximum entropy technique for identifying the BRN from the
candidate node set (C). Maximum entropy has been widely used for a variety of natural
language tasks like language modeling [10, 58], part-of-speech tagging [52] and prepositional
phrase attachment [53]. The overriding principle in maximum entropy is that when
nothing is known, the distribution should be as uniform as possible, that is, have maximal
entropy.
Since our setting closely resembles text classification, maximum entropy estimates the
conditional distribution of the class label (Li) given a candidate node Ci. We use the
representation of the candidate nodes (C) obtained in Section 4.2 with bag of words as the feature.
The labeled training data is used to estimate the expected value of the tokens on a class-by-class
basis. First, we introduce how to select a feature set for setting the constraints and building
the training model. Then, we explain how it is used for identifying the BRN.
Constraints and Features: In maximum entropy, we use the training data (Ci belonging
to class Li) to set constraints on the conditional distribution. We let any real-valued
function of the candidate node Ci and the class Li be a feature, fi(Ci, Li). Maximum
entropy allows us to restrict the model distribution to have the same expected value for this
feature as seen in the training data, the candidate node set C. Thus, we stipulate that the learned
conditional distribution P(Li|Ci) must have the property:

(1/|C|) ∑_{Ci ∈ C} fi(Ci, c(Ci)) = ∑_{Ci} P(Ci) ∑_{Li} P(Li|Ci) fi(Ci, Li)    (4.5)
Thus, when using maximum entropy, the first step is to identify a set of feature functions
that will be useful for classification. Then, for each feature, we measure its expected value
over the training data and take this to be a constraint for the model distribution. More
specifically, for each word-class combination we instantiate a feature as:

f_{w,L′i}(Ci, Li) = { 0,                 if Li ≠ L′i
                    { N(Ci, w) / N(Ci),  otherwise
    (4.6)

where N(Ci, w) is the number of times word w occurs in document Ci, and N(Ci) is
the number of words in Ci. With this representation, if a word occurs often in one class,
we would expect the weight for that word-class pair to be higher than for the word paired
with other classes.
We use the representation of the documents obtained in Section 4.2 to train a maximum
entropy probability distribution model and use it to classify the query document Dq. The
candidate node (Ci) that receives the highest probability estimate is returned as the BRN.
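For illustration, a toy maximum entropy model (multinomial logistic regression trained by gradient ascent) over the word-frequency features of Equation 4.6 could look as follows. This is a stand-in for the Rainbow implementation; the function names, learning rate and epoch count are our own choices, not the thesis configuration.

```python
import math
from collections import Counter

def featurize(tokens, vocab):
    """Word-frequency features N(Ci, w) / N(Ci), as in Equation 4.6."""
    counts = Counter(tokens)
    n = max(len(tokens), 1)
    return [counts[w] / n for w in vocab]

def train_maxent(candidates, lr=0.5, epochs=200):
    """Each candidate node is its own class; weights fit by gradient ascent."""
    vocab = sorted(set(t for c in candidates for t in c))
    xs = [featurize(c, vocab) for c in candidates]
    n_cls = len(candidates)
    w = [[0.0] * len(vocab) for _ in range(n_cls)]
    for _ in range(epochs):
        for y, x in enumerate(xs):
            scores = [sum(wi * xi for wi, xi in zip(w[c], x)) for c in range(n_cls)]
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]  # softmax, numerically stable
            z = sum(exps)
            for c in range(n_cls):
                grad = (1.0 if c == y else 0.0) - exps[c] / z
                w[c] = [wi + lr * grad * xi for wi, xi in zip(w[c], x)]
    return vocab, w

def classify_maxent(tokens, vocab, w):
    """Class with the highest score under the learned weights (the BRN)."""
    x = featurize(tokens, vocab)
    scores = [sum(wi * xi for wi, xi in zip(wc, x)) for wc in w]
    return max(range(len(w)), key=lambda c: scores[c])
```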
4.5 Tf-idf Ranking
The Tf-idf weight (term frequency-inverse document frequency) [59] is often used for vari-
ous tasks in information retrieval and text mining. This weight is a statistical measure used
to evaluate how important a word is to a document in a collection or corpus. The impor-
tance increases proportionally to the number of times a word appears in the document but
is offset by the frequency of the word in the corpus. The intuition behind this model is that
a document that mentions a term more often has more to do with that term and therefore
should receive a higher score. Variations of the tf-idf weighting scheme are often used by
search engines as a central tool in scoring and ranking a document’s relevance given a user
query. We now explain how Tf-idf ranking is used in identifying the BRN.
4.5.1 Term frequency and weighting
Term frequency (TF) refers to how often a term appears in a specific document. Each
term in a candidate node Ci is assigned a weight depending on its number of occurrences
in Ci. For every qi ∈ Dq, we compute a score between the query term qi and the candidate
node Ci based on the weight of qi in Ci. We assign the weight to be equal to the number of
occurrences of term qi in document Ci. This weighting scheme is referred to as term
frequency and is denoted tf_{qi,Ci}, with the subscripts denoting the query term and the
candidate node in order. The ordering of the terms in Ci is ignored; only the number of
occurrences of each qi is retained.
4.5.2 Inverse document frequency
Inverse document frequency (IDF) is a measure of the general importance of a term.
The raw term frequency described above suffers from a critical problem: all terms are
considered equally important when assessing relevance on a query term qi. In fact, certain qi
have little or no discriminating power in determining relevance. To this end, we introduce
a mechanism for attenuating the effect of terms qi that occur in too many candidate nodes C to
be meaningful for relevance determination. An immediate idea is to scale down the term
weights of qi with high collection frequency, defined as the total number of occurrences
of qi in C; that is, to reduce the tf weight of qi by a factor that grows with its frequency in
the candidate nodes C. By using this document-level statistic (the number of documents
containing qi) we discriminate between the Ci for the purpose of scoring. IDF is
given by

idf_{qi} = log( |C| / (1 + |{Ci : qi ∈ Ci}|) )    (4.7)

where |C| is the total number of candidate nodes and |{Ci : qi ∈ Ci}| is the number of
candidate nodes in which qi appears. If qi does not appear in any candidate node, dividing
by |{Ci : qi ∈ Ci}| alone would lead to a division by zero; hence we use 1 + |{Ci : qi ∈ Ci}|.
4.5.3 Tf-idf Weighting
Combining the definitions of TF and IDF, we produce a composite weight for each qi in
each Ci. The tf-idf weighting scheme assigns to each qi a weight in document Ci given by

tf-idf_{qi,Ci} = tf_{qi,Ci} × idf_{qi}    (4.8)
In other words, tf-idf_{qi,Ci} assigns to qi a weight in Ci that is
• highest when qi occurs many times within a small number of candidate nodes (thus
lending high discriminating power to those candidate nodes);
• lower when qi occurs fewer times in Ci, or occurs in many candidate nodes (thus
offering a less pronounced relevance signal);
• lowest when qi occurs in virtually all candidate nodes.
Finally, the similarity between the query document Dq and a candidate node Ci ∈ C
is given by

Similarity(Ci, Dq) = ∑_{qi ∈ Dq} tf(qi, Ci) × idf(qi)    (4.9)
The candidate nodes Ci are ranked in descending order of this score, and the candidate node
with the highest tf-idf score is returned as the Best Ranked Node (BRN).
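Equations 4.7-4.9 can be sketched as follows (an illustrative sketch; the function names are ours):

```python
import math
from collections import Counter

def idf(term, candidates):
    """Equation 4.7: inverse document frequency over the candidate nodes."""
    df = sum(1 for c in candidates if term in c)
    return math.log(len(candidates) / (1 + df))

def tfidf_score(query_tokens, candidate, candidates):
    """Equation 4.9: sum of tf x idf over the query document's terms."""
    tf = Counter(candidate)
    return sum(tf[q] * idf(q, candidates) for q in query_tokens)

def rank_candidates(query_tokens, candidates):
    """Candidate indices sorted by descending tf-idf score; index 0 is the BRN."""
    scores = [tfidf_score(query_tokens, c, candidates) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: -scores[i])
```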
4.6 Pseudo Relevance Feedback for Re-ranking
In this section we give a brief overview of the Hyperspace Analogue to Language (HAL)
model. Later, we show how HAL is used to re-rank the ranked set of candidate nodes
obtained by Tf-idf ranking to identify the BRN.
4.6.1 Pseudo Relevance Feedback
Pseudo relevance feedback, also known as blind relevance feedback, provides a method for
automatic local analysis. It automates the manual part of relevance feedback, so that the
user gets improved retrieval performance without an extended interaction. The method is to
perform normal retrieval to find an initial set of most relevant documents, assume that the
top k ranked documents are relevant, and then perform relevance feedback as before under
this assumption. Following this intuition, the top k documents are used to build a language
model with the HAL model, which is then used to re-rank the candidate nodes.
4.6.2 Hyperspace Analogue to Language (HAL) Model
The Hyperspace Analogue to Language [31] model captures the dependencies of a word w on
other words based on their occurrence in the context of w in a sufficiently large corpus. The
intuition underlying HAL spaces is that when humans encounter a new concept, they derive
its meaning from accumulated experience of the contexts in which the concept appears. Thus
the meaning of the new concept can be learned from its usage with other concepts within
the same context. Lund and Burgess [31] discuss the use of lexical co-occurrence to
construct high-dimensional semantic spaces in which a word can be represented as a point.
The representational model of this space can be constructed automatically from a corpus of
text.
The construction of a HAL space can be seen as building a vector representation of each word w
in the vocabulary T in a high-dimensional space spanned by the different words of the
vocabulary. This process results in a |T| × |T| HAL matrix, where |T| is the number
of distinct words in the vocabulary. The HAL matrix is constructed by taking a window
of length K words and moving it across the corpus in one-word increments. All words in
the window are said to co-occur with the first word, with strengths inversely proportional
to the distance between them. In our approach we have considered the co-occurrence to be
bidirectional, because it is generally agreed that preserving word order is not useful for
IR. The weights assigned to each co-occurrence of terms are accumulated over the entire
corpus. That is, if n(w, k, w′) denotes the number of times word w′ occurs at distance k ≤ K
from w within a window of length K, and W(k) = K − k + 1 denotes the strength of this
co-occurrence between the two words, then

HAL(w′|w) = ∑_{k=0}^{K} W(k) n(w, k, w′)    (4.10)
The window length invariably influences the quality of the associations between a pair of
terms: as the window size increases, so does the chance of capturing spurious associations
between terms. Various window sizes, from 2 to 10, have been used. While it is unclear
what the best window size is, experiments [31] suggest a window of 4 or 8 for the purposes
of IR. The original HAL space is direction sensitive because it records co-occurrence
information for the terms preceding every term. In general, it was found that preserving
this term order was not useful for IR, and the combination of the row and column vectors
for a term (thus a bidirectional window) was more effective. For instance, with the
sentences "The black cat ..." and "The cat is black.", while the ordering differs, the notion
that the cat is a particular color, black, is preserved when both directions are taken into
account.
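The HAL construction of Equation 4.10 can be sketched as follows, under the simplifying assumption of a single token stream. The weighting W(k) = K − k + 1 and the bidirectional window follow the description above; we sum over distances k = 1..K, since in this sketch a word does not co-occur with itself at distance 0. The `expand_query` helper is a hypothetical name for the query-expansion step used in Section 4.6.3.

```python
from collections import defaultdict

def build_hal(tokens, K=4):
    """Accumulate HAL co-occurrence weights over a token stream."""
    hal = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for k in range(1, K + 1):          # distances k = 1..K within the window
            if i + k < len(tokens):
                weight = K - k + 1         # W(k) = K - k + 1
                other = tokens[i + k]
                hal[w][other] += weight    # forward direction
                hal[other][w] += weight    # bidirectional window
    return hal

def expand_query(entity, hal, top_n=10):
    """Top co-occurring words around the query entity, for query expansion."""
    neighbors = hal.get(entity, {})
    return sorted(neighbors, key=lambda t: -neighbors[t])[:top_n]
```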
4.6.3 Re-ranked Candidate Nodes
Using the popular tf-idf weighting for ranking places the most important candidate nodes
at the top. Though the most important candidate nodes might appear in the
top-k results, we are still left with the problem of choosing a single node as the BRN. For
this, we re-rank the candidate nodes using a pseudo relevance feedback approach.
We build a HAL matrix over the top-k ranked candidate nodes; based on experimental
results, we consider the top-5 ranked candidate nodes for building the HAL matrix. From
the HAL matrix, we take all words co-occurring with our query entity within a window of
size four and use them to expand the query. Experiments with various HAL window sizes
showed that fixing it at four captures sufficient context. We then re-rank the candidate
nodes using the expanded query, obtaining a re-ranked score for each candidate node.
The final score of each candidate node is a weighted linear combination of its rank score
and its re-rank score:

Final Score = λ × RankingScore + (1 − λ) × Re-rankedScore    (4.11)

where λ weights the two scores. Experimental results show that setting λ to 0.7
gave the best results.
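Equation 4.11 reduces to a one-line combination; a sketch:

```python
def final_score(rank_score, rerank_score, lam=0.7):
    """Equation 4.11: weighted combination of the tf-idf rank score and
    the pseudo-relevance-feedback re-rank score; lam = 0.7 worked best."""
    return lam * rank_score + (1 - lam) * rerank_score
```

The candidate node with the highest combined score is then taken as the BRN.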
We illustrate how pseudo relevance feedback works with an example. For the query
entity "Laguna Beach" from the query set, the correct mapping node is "Laguna Beach,
California". The query document contains terms like "show, MTV, Jessica, Jason" etc.,
which leads the tf-idf ranking function to assign a higher rank to "Laguna Beach: The
Real Orange County", an MTV reality show.
After query expansion using HAL, the words "lifeguards, coastal, land, geography" etc.
are added to the query. This results in a higher re-ranked score for the candidate
node "Laguna Beach, California". The final score (a linear combination of the ranking and
re-ranking scores) thus yields "Laguna Beach, California" as the BRN.
4.7 Mapping Node Identification
Using the above five techniques, we obtain a ranked set of candidate nodes from the initially
unordered set. The candidate node with the highest similarity score to the query document
is returned as the BRN. The BRN can come either from the KB or from Wikipedia. If
BRN ∈ KB, we return it as the map for the query entity; otherwise we return NIL.
The output of our system for a query entity is summarized in Equation 4.12.
Mapping Node = { NIL      if |CL| = 0
               { NIL      if |CL| ≥ 1 and BRN ∈ Wikipedia
               { Node Id  if |CL| ≥ 1 and BRN ∈ KB
    (4.12)
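The decision rule of Equation 4.12 can be sketched as follows; the (node_id, source, score) triple is an illustrative representation of a scored candidate node, not the thesis's data structure:

```python
def map_query(candidates):
    """Equation 4.12: return a KB node id, or NIL.
    candidates: list of (node_id, source, score), source in {"KB", "Wikipedia"}."""
    if not candidates:                     # |CL| = 0: no name variant found
        return "NIL"
    node_id, source, _ = max(candidates, key=lambda c: c[2])  # the BRN
    return node_id if source == "KB" else "NIL"
```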
4.8 Conclusions
In this chapter, we discussed various similarity techniques to rank the unordered list of
candidate nodes. We experimented with cosine similarity, Naïve Bayes, maximum entropy,
Tf-idf ranking and pseudo relevance feedback ranking. The node with the highest similarity
to the query document was returned as the BRN. The mapping node or NIL was predicted
based on whether BRN ∈ KB or BRN ∈ Wikipedia. In the next chapter, we discuss the
structure of the data set used to evaluate our algorithm. We also describe the evaluation metric.
Chapter 5
Data Set
In this chapter, we give the background of the Text Analysis Conference (TAC). We then give a
brief overview of the data set required for evaluating an EL algorithm. We explain
in detail the general structure of the Knowledge Base, the Document Collection¹ and
the Query Entities when encoded in XML. We conclude the chapter with an overview of the
evaluation metric.
5.1 Text Analysis Conference
Recently there has been widespread interest in community-wide evaluations for research
in information technologies. The Text Analysis Conference (TAC) is a series of evaluation
workshops organized to encourage research in Natural Language Processing (NLP) and
related applications by providing a large test collection, common evaluation procedures,
and a forum for organizations to share their results. TAC comprises sets of tasks known
as "tracks", each of which focuses on a particular subproblem of NLP. TAC tracks focus
on end-user tasks, but also include component evaluations situated within the context of
end-user tasks.
Question answering and information extraction have been studied over the past decade;
¹The set of query documents is referred to as the Document Collection.
however, evaluation has generally been limited to isolated targets or small scopes (i.e., single
documents). The Knowledge Base Population (KBP) track at TAC was proposed to
explore extraction of information about entities with reference to an external knowledge
source. Using a basic schema for persons, organizations, and locations, nodes in an ontology
must be created and populated using unstructured information found in text. This task has
been broken down into two subproblems: Entity Linking, where names must be aligned to
entities in the KB, and Slot Filling, which involves mining information about entities from
text. The EL subtask was present in both TAC-KBP 2009 and 2010.
Compared to previous information extraction evaluations such as the Message Understanding
Conference (MUC) and Automatic Content Extraction (ACE), KBP differs in
the following respects:
• Extraction at large scale (e.g. 1 million documents).
• Using a representative collection (not selected for relevance).
• Cross-document entity resolution (extending the limited effort in ACE).
• Linking the facts in text to KB.
• Rapid adaptation to new relations.
We have evaluated the performance of our algorithm against the TAC-KBP 2009 and
2010 EL data sets. In the next section, we explain in detail the data set provided for the
TAC-KBP track. We then give a brief overview of the evaluation metrics used to evaluate an EL system.
Table 6.5 Number of queries (2009 query set) with a particular candidate list size.
Similarly, Table 6.5 shows the mapping between the number of query entities and a specific
CL size for the various experiments (Runs) on the 2009 query set. We can clearly see the same
trend of the average CL size increasing with the number of heuristics used for identifying
the name variants. It can also be seen that when only the redirect pages from Wikipedia
are used, the average CL size is 0.95. The major difference between the two data sets is the
percentage of queries for which the CL size is one under each heuristic. A large percentage
(48.6%) of queries in the 2009 query set resulted in a CL of size one, compared to 40.3% in
the 2010 query set for Run No. 6. Using the redirect pages feature from Wikipedia results
in either a single name variant or none. This redirect feature has a very high impact on the
performance of our EL system, which is shown in Section 6.5.
Another key difference is the impact of Stanford NER and Google Search in identifying
the name variants. Both data sets show that by using Stanford NER and Google
* indicates from Wikipedia; ** indicates from the first paragraph of a Wikipedia article. Google Search
includes both Google spell suggestion and Google directive search. The same notation is followed for the rest
of this chapter.
CHAPTER 6. EVALUATION
Search, there was a marginal increase in CL size. The reason is that the name
variants obtained from Stanford NER and Google Search are often already present in our
ER. Though both Stanford NER and Google Search result in only a small increase in CL size,
the impact of these two heuristics is very high, as discussed in Section 6.5.
6.4 Candidate List Generation Phase Analysis
Failure to list the correct mapping node in the CLG phase results in failure of the
system irrespective of the ranking algorithm used. We believe that as long as the correct
mapping node is present in the CL, the context of the query entity will help in linking
it correctly. The column "Wrong Map" in Table 6.6 indicates the failure to list the correct
candidate node in the CL even though the mapping node exists in the KB. The probability of
the correct mapping node appearing in the CL increases as we add more heuristics to identify
named entity variations.
Run No. | Heuristics Used           | Wrong Map - 2009 | Wrong Map - 2010
1       | Disambiguation pages*     | 555              | 468
2       | Bold text**               | 525              | 458
3       | Redirect pages*           | 609              | 422
4       | Run No.s 1+2+3            | 279              | 212
5       | Run No. 4 + Stanford NER  | 266              | 195
6       | Run No. 5 + Google Search | 241              | 117

Table 6.6 Failure to list the correct candidate node in the Candidate List even though the mapping node exists in the Knowledge Base.
The Google Search heuristic had more impact on the 2010 query set than on the 2009 query set for
identifying name variants not present in our ER. These name variants in turn resulted in
the correct mapping node being picked into the CL. This is evident from the reduction in
wrong maps from 195 to 117 for Run No. 6, i.e. for only 117 of the 2250 queries
we could not pick the correct candidate node into the CL.
6.5 Entity Linking System Performance
In this section we evaluate the performance of our EL system. We use the Micro-Average
Score (MAS), the standard metric proposed by the TAC-KBP EL track for evaluating system
performance.
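Assuming MAS is the fraction of queries whose predicted KB node (or NIL) matches the gold-standard answer, i.e. the micro-averaged accuracy used in TAC-KBP EL, it can be computed as:

```python
def micro_average_score(predictions, gold):
    """Fraction of queries whose predicted node id (or "NIL") matches the gold
    answer; this sketch assumes MAS is plain micro-averaged accuracy."""
    assert len(predictions) == len(gold)
    correct = sum(1 for p, g in zip(predictions, gold) if p == g)
    return correct / len(gold)
```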
Table 6.7 gives a brief overview of the number of participants in the EL task at the 2009
and 2010 TAC-KBP tracks. There was a slight increase in the number of participants for
the 2010 EL task compared to 2009. Each participating team is entitled to submit a
maximum of three runs. The TAC-KBP organizers evaluate each run against the gold
standard data and report each team's performance. The baseline score, obtained
by predicting NIL for all query entities, is 57% and 54.6% for the 2009 and 2010 EL query
sets respectively. The average Micro-Average Score over the 35 runs submitted at TAC-KBP
for the 2009 EL task is 71.08%, and 68.36% for 2010.
Year | No. of participating teams | Total runs submitted | Baseline | Best   | Average MAS
2009 | 13                         | 35                   | 57%      | 82.17% | 71.08%
2010 | 16                         | 46                   | 54.6%    | 86.80% | 68.36%

Table 6.7 Average Micro-Average Score and baseline scores obtained by the participating universities/teams for the TAC-KBP Entity Linking task on the 2009 and 2010 query sets.
Table 6.9 and Table 6.8 show the MAS obtained by our EL system on the 2009 and 2010
EL query sets respectively. Our best system achieved an MAS of 84.76% on the 2010 query
set and 83.12% on the 2009 query set, performing close to the current state-of-the-art
algorithms on EL. In fact, our system outperforms all the systems submitted at the TAC-KBP
2009 EL task and is only marginally behind the best system submitted at TAC-KBP 2010.
This shows the robustness of our algorithm; its performance is also very high compared
to the baseline score and the average MAS over all runs submitted at the 2009 and 2010
TAC-KBP EL tasks.
It is evident from Table 6.9 and Table 6.8 that pseudo relevance feedback for re-ranking
has performed very well. This shows that using co-occurrence statistics of a named entity
with other words helps in disambiguation and efficient ranking of candidate nodes. Re-ranking
has worked significantly well with all the CLG heuristics except redirects,
where the increase in performance is comparatively low. Cosine similarity and Naïve Bayes
performed almost equally using bag of words as the feature. This shows that a simple bag
of words approach is sufficient to build a fairly well performing EL system; it outperforms
the baseline (54.6%) and the median (68.36%) across all 46 runs submitted for the EL task
at 2010 TAC-KBP, as well as for 2009. Maximum entropy did not fare well, as the data
available for training the model was not sufficient, i.e. certain candidate nodes had
sufficient text to describe an entity whereas others did not. Hence, maximum entropy
underperformed the other techniques.
Run No. | Heuristics Used           | Scores
5       | Run No. 4 + Stanford NER  | 76.27  80.40  80.53  81.56  82.00
6       | Run No. 5 + Google Search | 77.11  81.51  81.59  82.89  84.76

Table 6.8 Micro-Average Score for individual heuristics on the 2010 query set. Google Search includes both Google spell suggestion and Google directive search.

Run No. | Heuristics Used           | Scores
5       | Run No. 4 + Stanford NER  | 78.76  81.58  81.66  82.12  82.79
6       | Run No. 5 + Google Search | 79.02  81.81  81.86  82.69  83.12

Table 6.9 Micro-Average Score for individual heuristics on the 2009 query set. Google Search includes both Google spell suggestion and Google directive search.
6.6 Precision Vs Top “N” results
In this section, we plot Precision vs. Top "N" results for non-NIL queries for the five
techniques. Figure 6.1 and Figure 6.2 show the plots for the 2010 and 2009 TAC-KBP EL
query sets respectively. It can be clearly seen that as we consider a higher number of hits,
the probability of finding the correct map for the query entity in the hit list increases.
From both figures it is evident that the Tf-idf technique ranks the mapping node higher
(in the ranked list) compared to the others. Further, simple techniques
like cosine similarity and Naïve Bayes perform consistently better than maximum entropy,
which shows that word occurrence statistics are sufficient for building a decently performing
EL system. This is also reflected in the results presented in Section 6.5.
The pseudo relevance feedback re-ranking strategy results in picking the mapping node as
the BRN. Re-ranking works only as long as the mapping node is present in the top 5
ranked nodes, because we consider only the top 5 ranked nodes for query expansion. If the
mapping node is present in the top 5 ranked nodes, there is a very good probability that
it will be the BRN after re-ranking. There can be at most one mapping node among these
top 5 ranked nodes. If the mapping node is not present in the top 5 ranked nodes, query
expansion using pseudo relevance feedback results in the addition of irrelevant tokens and
hence in performance degradation, which is evident from Figures 6.1 and 6.2.

Figure 6.1 Precision vs. Top "N" results for non-NIL queries from the 2010 TAC-KBP Entity Linking query set.

Figure 6.2 Precision vs. Top "N" results for non-NIL queries from the 2009 TAC-KBP Entity Linking query set.
6.7 NIL Prediction Accuracy
In total, there are 1230 (54.6%) queries in the 2010 TAC-KBP EL query set and 2229 (57%)
queries in the 2009 TAC-KBP EL query set for which there is no mapping node in the KB.
Table 6.10 and Table 6.11 show the number of queries for which NIL was predicted
when |CL| = 0 and when |CL| ≥ 1, for the various approaches. The tables also show the correct NIL
prediction counts and accuracies. Since a query entity can occur in any of its variations, it is
very important to search the KB with all possible variations; the approach that
extracts the most variations is likely to have better NIL accuracy. Experimental results for
Run No. 6 on the 2009 and 2010 EL query sets support this intuition.
Run No. |        |CL| = 0                     | |CL| ≥ 1 and BRN ∈ Wikipedia
        | Predicted | Correct | Accuracy      | Predicted | Correct | Accuracy
   1    |    417    |   304   |  72.9%        |   1203    |   859   |  71.4%
   2    |    638    |   499   |  78.2%        |    921    |   665   |  72.2%
   3    |    577    |   424   |  73.4%        |    972    |   747   |  76.8%
   4    |    704    |   575   |  81.7%        |    630    |   553   |  87.7%
   5    |    694    |   574   |  82.7%        |    626    |   553   |  88.3%
   6    |    654    |   565   |  86.4%        |    535    |   513   |  95.8%
Table 6.10 Statistics of NIL predictions and their accuracy for the 2010 Query Set.
Run No. |        |CL| = 0                     | |CL| ≥ 1 and BRN ∈ Wikipedia
        | Predicted | Correct | Accuracy      | Predicted | Correct | Accuracy
   1    |   1005    |   855   |  85.07%       |   1468    |  1112   |  75.74%
   2    |   1388    |  1203   |  86.67%       |   1016    |   759   |  74.70%
   3    |   1403    |  1135   |  80.89%       |   1234    |   959   |  77.7%
   4    |   1468    |  1338   |  91.14%       |    679    |   572   |  84.24%
   5    |   1468    |  1338   |  91.14%       |    679    |   572   |  84.24%
   6    |   1410    |  1282   |  90.92%       |    602    |   532   |  88.37%
Table 6.11 Statistics of NIL predictions and their accuracy for the 2009 Query Set.
6.8 Comparison with Top 5 systems at TAC-KBP
We compare the MAS of our best system with the top 5 runs submitted at the 2009 and 2010
TAC-KBP EL task [36] [24]. Siel is the team name under which we participated. Our
system performed the best at the 2009 TAC-KBP EL task and was the runner up at the 2010
TAC-KBP EL task. Table 6.12 and Table 6.13 compare the performance of our system against
the best ranked systems developed by other teams. The participating teams included IBM
Research, Johns Hopkins University and Stanford University, among others.
Our system obtained an MAS of 83.73% and 82.17% on the TAC-KBP 2010 and 2009 EL shared
tasks respectively. After post analysis and improvements to the algorithm, we obtained an
MAS of 84.76% and 83.12% for the 2010 and 2009 EL data sets respectively.
Table 6.12 Performance Comparison with Top 5 systems at TAC-KBP 2010 Entity Linking sub task.
Team           | Micro-Average Score
Siel           | 82.17%
QUANTA1        | 80.33%
hltcoe1        | 79.84%
Stanford UBC2  | 78.84%
NLPR KBP1      | 76.72%
Table 6.13 Performance Comparison with Top 5 systems at TAC-KBP 2009 Entity Linking sub task.
6.9 Error Analysis
In this section, we give a few example queries for which our EL system failed. Our system
can fail either in the CLG phase or in the ranking phase. In the CLG phase, our system
failed for queries like "Air Group Inc.", "Marufu" and "LULAC", whose correct mapping
nodes are "Midwest-airlines", "Grace Mugabe" and "Texas's 21st Congressional District"
respectively. This is because our heuristics in the CLG phase could not identify the
latter as variations of the query entities, and since these variations could not be
identified, we could not pick those nodes from the KB into the CL.
In the ranking phase, a query entity might be wrongly mapped to a KB node because the
document context in which the query entity occurs might not be sufficient to disambiguate
it. For the four techniques, i.e. cosine similarity, maximum entropy, Naïve Bayes and
Tf-idf ranking, once the wrong node is mapped we cannot correct it. But in the case of
the pseudo relevance feedback re-ranking strategy, we make use of the ranked results to
expand the query for re-ranking. Here we found that certain generic and ambiguous query
entities which were wrongly mapped during the ranking phase were correctly mapped after
re-ranking. For example, generic and ambiguous queries like "Cleveland", "George Bush"
and "UC" were correctly mapped to "Cleveland, Ohio", "George W. Bush" and "University of
Cincinnati" respectively, when the contextual information from HAL was used for
re-ranking. (They were wrongly mapped to "Grover Cleveland", "George H. W. Bush" and
"Xavier University (Cincinnati)" respectively when only ranking was done to predict the
mapping node.)
Our manual examination of the 2010 TAC-KBP EL gold standard data showed that 5
queries had been wrongly mapped. We raised these issues with the TAC organizing
committee and our suggestions were deemed correct. For example, for the query entity
"Jeff Fisher" the gold standard result was "2006 Tennessee Titans season", whereas the
correct answer is "NIL". Jeff Fisher was the head coach during the "2006 Tennessee Titans
season", but linking the two is wrong. The other errors were on similar lines. On
incorporating these changes into the gold standard data, the score of our best system,
i.e. 84.76%, would become 84.98%.
6.10 Conclusions
In this chapter, we did a detailed comparison of the TAC-KBP EL query sets for the years
2009 and 2010. Later, we discussed in detail the impact of each feature we used during
the CLG phase. Further, we described the performance of our algorithm on the TAC-KBP EL
data sets. We also compared the performance of our algorithm against the top 5
participants at the TAC-KBP EL shared task. In the next chapter, we state the
contributions of this thesis and conclude with a real world application of an EL system.
Chapter 7
Conclusion
Structured KBs are a rich source of data for various NLP, IE and IR tasks. Recently, with
the emergence of publicly available databases, they have been exploited for a number of
IE tasks ranging from NER to relation extraction. But KBs face quite a few problems:
inconsistency in the information present, incompleteness, inaccuracy of facts and
outdated information. These problems arise from the fact that KBs are maintained
manually. In this thesis, we addressed the problem of linking named entities from a
document to nodes in a KB, a key component for automatically updating KBs. In the last
decade, many techniques were proposed to extract structured information from unstructured
documents, but they never focused on integrating the extracted information into globally
available KBs. This motivated us to work on methodologies that can be used to link
entities in textual documents to KB nodes. We believe that research on EL will help
reduce the manual effort put in by contributors across the world to keep the information
in public KBs up to date. This new area of research moves beyond the problems of NER,
CR and CDCR. EL breaks the document barrier and helps in automating the task of updating
KBs. It opens up a range of applications from information aggregation to automated
reasoning over extracted information.
We showed that the process of creating and updating KBs can be automated. The
process of automating this task can be broken down into two sub problems.
• Entity Linking
• Slot Filling
In this thesis, we addressed the problem of EL. We discussed in detail the current
approaches to EL and their shortcomings. Most of the current approaches are either too
rigid, cannot scale to large KBs or require huge amounts of training data. We discussed
various challenges involved in EL, like mention ambiguity, variations in named entities
(viz. acronyms, nicknames and spelling variations) and NIL detection. We proposed a
robust solution which addresses the above issues and scales to large KBs with millions
of entries.
Our proposed technique uses Wikipedia syntax to find variants of named entities.
Wikipedia specific features like redirect pages, disambiguation pages and bold text from
the first paragraph were used to identify synonyms, homonyms, etc. Google spell
suggestion and Google site specific search were also used to obtain name variants from
the web. Additionally, an NER was used to find name variants of the query entity in the
given query document context. Using the variations obtained, a Boolean "AND" search was
done on the KB node titles. A subset of nodes, referred to as candidate nodes (the
Candidate List, CL), was thus obtained from the KB that could be linked to the query
entity. Similarly, nodes from Wikipedia were also added to the CL. Adding Wikipedia
articles to the CL allows us to consider strong matches for query entities that do not
have any corresponding node in the KB, and hence to return NIL: for a given query
entity, if the ranking function maps to a Wikipedia article from the CL, we can confirm
the absence of a node about the query entity in the KB. The identification of these
candidate nodes from the KB and Wikipedia was referred to as the Candidate List
Generation (CLG) phase.
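The Boolean "AND" title search of the CLG phase can be sketched roughly as follows. The function and argument names here are hypothetical, and a real system over millions of titles would use an inverted index rather than this linear scan.

```python
def candidate_list(variations, kb_titles):
    """Return KB node titles that contain, for at least one variation,
    every token of that variation (a Boolean "AND" search on titles)."""
    candidates = set()
    for variation in variations:
        tokens = variation.lower().split()
        for title in kb_titles:
            title_tokens = set(title.lower().split())
            # All tokens of the variation must appear in the title.
            if all(t in title_tokens for t in tokens):
                candidates.add(title)
    return candidates
```

For example, the variation "George Bush" would pull both "George W. Bush" and "George H. W. Bush" into the CL, leaving disambiguation to the ranking phase.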
Once the list of candidate nodes was obtained, the candidate nodes and the query document
were tokenized and represented as token vectors. Using these vectors, a similarity score
was calculated between the query document and each candidate node, using five techniques:
cosine similarity, Naïve Bayes, maximum entropy, Tf-idf and pseudo relevance feedback for
re-ranking. The candidate node with the highest similarity score was returned as the Best
Ranked Node (BRN). If BRN ∈ KB, we return it as the map for the query entity, and NIL
otherwise.
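A minimal sketch of this ranking step, using only the cosine similarity variant. The names `link`, `kb_nodes` and `wiki_nodes` are hypothetical, and the real system additionally applies stop word removal and stemming before vectorization.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def link(query_doc, kb_nodes, wiki_nodes):
    """Pick the Best Ranked Node over KB and Wikipedia candidates;
    return its id if it is a KB node, and NIL otherwise."""
    q = Counter(query_doc.lower().split())
    candidates = {**kb_nodes, **wiki_nodes}  # id -> node text
    brn = max(candidates,
              key=lambda n: cosine(q, Counter(candidates[n].lower().split())))
    return brn if brn in kb_nodes else "NIL"
```

The Wikipedia-only candidates act as decoys: when one of them wins the ranking, the sketch returns NIL without any similarity threshold, mirroring the NIL detection strategy above.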
Our algorithm was evaluated on standard data sets obtained from the TAC-KBP EL shared
task: the TAC-KBP 2009 and 2010 EL data sets. The Micro Average Score (MAS) was used to
evaluate our algorithm's performance. We obtained very impressive MAS values of 83% and
85% on the 2009 and 2010 EL data sets respectively. Our results in Chapter 6 show that
simple techniques like cosine similarity, Naïve Bayes, etc. perform close to the state of
the art. Pseudo relevance feedback performed close to state-of-the-art algorithms and
performed the best on the 2009 EL data set.
In this chapter, we discuss the contributions of this thesis and possible future
directions, and conclude with a discussion of possible real world applications of an EL
system.
7.1 Contributions
Most of the research community has focused on extracting structured information from
unstructured documents, but using this extracted information to update KBs has received
very little attention. In this thesis, we attempted to fill this gap by linking entities
occurring in textual documents to nodes in a large KB. Once entities are linked to nodes
in a KB, the document barrier is broken and information can be integrated across
documents. We approached EL as a two stage problem, focusing on developing algorithms
which are robust and can scale to large KBs. We experimented with various similarity
techniques and showed that simple approaches can perform close to state-of-the-art
algorithms and sometimes better. This was the major contribution of the thesis. Some of
the other contributions are:
• Identifying Named Entity Variations: We proposed three different methodologies to
identify named entity variations. We used Wikipedia specific syntax, i.e. redirect
pages, disambiguation pages and bold text from the first paragraph, for identifying
synonyms, homonyms, nicknames, alias names, etc. Additionally, web search results and an
NER were also used to identify the various forms in which a named entity could occur.
These two features generate very few variations of an entity, but their prediction
accuracy is very high, which shows that they are highly important features. We used the
Google spell suggestion feature and Google site specific search to identify spelling
errors as well as further entity variations. The obtained variations were used to
identify candidate nodes from the KB.
• Robust Candidate Node Generation: Our system is flexible enough to find name variants
but sufficiently restrictive to produce a manageable candidate list despite a
large-scale KB. We used a Boolean "AND" search to identify candidate nodes from the KB
and Wikipedia. Table 6.6 shows that our system was able to identify the mapping node in
the CL for a high percentage of queries. We firmly believe that as long as the correct
mapping node is present in the CL, the likelihood of it being returned as the mapping
node is very high, which is reflected in our results. Furthermore, our system can scale
to large KBs with millions of entries.
• Features for Entity Disambiguation and Ranking: We developed a rich and extensible set
of features based on the query entity mention, the query document and the KB nodes. We
used tokenization, stop word removal and stemming to represent the documents as vectors.
This basic feature set had a high impact on the final performance of the system because
of the cleaner representation of the documents. We also experimented with various
similarity techniques to rank the candidate nodes; to the best of our knowledge, no
prior work experimented with so many different similarity techniques. This is one of the
major contributions of this thesis. We showed that simple techniques like cosine
similarity, Naïve Bayes, Tf-idf ranking, etc. perform close to state-of-the-art
approaches, without any training data.
• NIL Detection: We proposed a technique of appending a given KB with Wikipedia
documents in order to identify NIL mapping entities, which obviates hand tuning.
From Table 6.10 and Table 6.11, it is clearly evident that this is a very useful feature.
Unlike other current approaches, this technique removes the need to fix a threshold for
predicting NIL.
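As a rough illustration of the variant-identification contribution above, assuming the Wikipedia structures have already been extracted into plain dictionaries (all names here are hypothetical):

```python
def name_variants(entity, redirects, disambiguations, bold_first_para):
    """Collect surface-form variations for an entity from pre-extracted
    Wikipedia structures: redirect titles, disambiguation page entries,
    and bold text in the first paragraph of the entity's article."""
    variants = {entity}
    variants.update(redirects.get(entity, []))         # e.g. acronyms, misspellings
    variants.update(disambiguations.get(entity, []))   # homonyms sharing the mention
    variants.update(bold_first_para.get(entity, []))   # nicknames / alias names
    return variants
```

The web search and NER features described above would feed further entries into the same set before the Boolean "AND" search over KB titles.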
Our experiments were conducted on the standard data sets for EL provided by TAC-KBP: the
TAC-KBP 2009 and 2010 EL data sets. Each data set consisted of a KB, a document
collection (DC) and a query set. The DC consisted of news and blog articles, providing
real world documents and contexts to test our approach. Our evaluation conformed to the
standard evaluation metric for the EL task, MAS.
Results of our experiments were reported in Chapter 6. Our algorithm achieved good
accuracy while linking named entities to nodes in a large KB. Our results were on par
with or better than the state-of-the-art approaches and systems developed as part of the
TAC-KBP shared task. Our system was ranked first and second in the TAC-KBP EL shared
tasks in 2009 and 2010 respectively.
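MAS, as used throughout, reduces to micro-averaged accuracy over the query set: the fraction of queries, NIL queries included, whose predicted answer matches the gold standard. A minimal sketch (hypothetical function name):

```python
def micro_average_score(predictions, gold):
    """Micro-Average Score: the fraction of queries (NIL queries included)
    whose predicted KB node matches the gold-standard answer."""
    correct = sum(1 for q in gold if predictions.get(q) == gold[q])
    return correct / len(gold)
```

Because every query carries equal weight, frequent entities dominate the score, unlike a macro average over distinct entities.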
7.2 Future Directions
Our approach can be considered a building block for future research on EL. In this
thesis, we explored simple techniques like cosine similarity, Naïve Bayes, Tf-idf
ranking, etc. and showed that they perform close to the state of the art and sometimes
better. We feel that now, with training data available from TAC-KBP, machine learning
techniques can be explored. Another area that can be looked into is refining the document
context to capture only terms that describe the query entity. This would result in
better ranking of the candidate nodes and hence higher accuracy.
We firmly believe that as long as the mapping node is present in the CL, there is a very
high likelihood of it being identified as such. The current research community has been
focusing more on the ranking algorithms, as it is an interesting field. However, the
performance of an EL system is highly dependent on the CL identification phase: the
higher the accuracy with which candidate nodes are identified, the higher the probability
of identifying the mapping node. It is worthwhile to consider candidate generation
strategies carefully.
Also, in our current approach we have exploited Wikipedia, an NER and web search results
for identifying name variants. Another area of research is to separate this module and
make it independent of any particular resource.
NIL node clustering is an area that needs the focus of the research community. Current
EL systems either predict a mapping node or NIL if no mapping node is present in the KB.
If NIL node clustering is done, the data for a single named entity could be integrated
and then used to create new nodes in the KB.
A cross lingual EL system would be a very promising area of research, because work in
this area will help in building KBs for local languages. With the growth of new websites
and blogs in the local languages of different regions, cross lingual EL would give less
resourced languages the opportunity to have a KB of their own, which in due time will
help grow the community of users working in local languages.
7.3 Application of Entity Linking
We discuss some real world applications of EL.
Metadata Integration: A possible application is importing metadata from KBs by linking
the named entities in a document. On successful linkage, metadata from the KBs can be
imported into the document, metadata which otherwise might not be explicitly present in
it. The imported metadata can contain property-value pairs like Age:35, Name:Sachin, and
complex queries like those of SPARQL [51] can then be fired. Figure 7.1 shows the
information flow of such a system. Figure 7.2 shows how a document would look when
information about the entities "Assange, WikiLeaks, Elmers" is imported from a KB. With
this integration of information into a document, the search space for a user is
increased, because the document is now retrieved even for keyword searches like "whistle
blowers, swiss people, online archives", as this metadata is imported from the KB.
Figure 7.1 An application of Entity Linking flow chart.
Figure 7.2 Possible application of Entity Linking.
Financial Domain: In the finance domain, EL can be used to identify company names in a
textual document and link them to a KB of tradable companies listed on the stock
markets. This can be used to aggregate company information into the document, such as
stock market codes, or to analyze the relationship between news and share prices.
Search Feature Enhancement: Current day search engines return documents relevant to the
query posted by a user. The results are in general a list of ranked documents, where
each result contains a title, a URL, a snippet, etc. We can use an EL system to link the
named entities present in the snippets/titles to a publicly available KB. By doing this,
we can import information from the KB into the search results, enhancing the user
experience and also providing the user with structured information.
Bibliography
[1] I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C.D. Spyropoulos, and P. Stamatopoulos. Learning to filter spam e-mail: A comparison of a naive bayesian and a memory-based approach. Arxiv preprint cs/0009009, 2000.
[2] A. Bagga and B. Baldwin. Entity-based cross-document coreferencing using the vector space model. In Proceedings of the 17th International Conference on Computational Linguistics - Volume 1, pages 79–85. Association for Computational Linguistics, 1998.
[3] M. Banko, O. Etzioni, and T. Center. The tradeoffs between open and traditional relation extraction. Proceedings of ACL-08: HLT, pages 28–36, 2008.
[4] R. Barzilay and M. Elhadad. Using lexical chains for text summarization. In Proceedings of the ACL Workshop on Intelligent Scalable Text Summarization, volume 17, Madrid, Spain, 1997.
[5] R. Barzilay, K.R. McKeown, and M. Elhadad. Information fusion in the context of multi-document summarization. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 550–557. Association for Computational Linguistics, 1999.
[6] R. Bunescu and R. Mooney. Subsequence kernels for relation extraction. Advances in Neural Information Processing Systems, 18:171, 2006.
[7] R. Bunescu and M. Pasca. Using encyclopedic knowledge for named entity disambiguation. In Proceedings of EACL, volume 6, 2006.
[8] R.C. Bunescu and R.J. Mooney. A shortest path dependency kernel for relation extraction. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 724–731. Association for Computational Linguistics, 2005.
[9] Z. Cao, T. Qin, T.Y. Liu, M.F. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129–136. ACM, 2007.
[10] S.F. Chen and R. Rosenfeld. A Gaussian prior for smoothing maximum entropy models. Technical report, Carnegie Mellon University, School of Computer Science, 1999.
[11] H.L. Chieu and H.T. Ng. Named entity recognition: a maximum entropy approach using global information. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, pages 1–7. Association for Computational Linguistics, 2002.
[12] S.P. Converse. Resolving pronominal references in Chinese with the Hobbs algorithm. In Proceedings of the 4th SIGHAN Workshop on Chinese Language Processing, pages 116–122, 2005.
[13] S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of EMNLP-CoNLL, volume 2007, pages 708–716, 2007.
[14] A. Culotta and J. Sorensen. Dependency tree kernels for relation extraction. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2004.
[15] G. Escudero, L. Marquez, and G. Rigau. Naive Bayes and exemplar-based approaches to word sense disambiguation revisited. Arxiv preprint cs/0007011, 2000.
[16] O. Etzioni, M. Cafarella, D. Downey, A.M. Popescu, T. Shaked, S. Soderland, D.S. Weld, and A. Yates. Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence, 165(1):91–134, 2005.
[17] R. Florian, A. Ittycheriah, H. Jing, and T. Zhang. Named entity recognition through classifier combination. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pages 168–171. Association for Computational Linguistics, 2003.
[18] E. Gabrilovich and S. Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 6–12, 2007.
[19] E. Gabrilovich and S. Markovitch. Wikipedia-based semantic interpretation for natural language processing. Journal of Artificial Intelligence Research, 34(1):443–498, 2009.
[20] C. Giuliano, A. Lavelli, and L. Romano. Exploiting shallow linguistic information for relation extraction from biomedical literature. In Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics (EACL-2006), pages 5–7, 2006.
[21] X. Han and J. Zhao. NLPR KBP in TAC 2009 KBP Track: A Two-Stage Method to Entity Linking. In Proceedings of the Text Analysis Conference 2009 (TAC 09).
[22] E. Hovy and C.Y. Lin. Automated text summarization in SUMMARIST. Advances in Automatic Text Summarization, 94, 1999.
[23] A. Iftene and A. Balahur-Dobrescu. Named entity relation mining using Wikipedia. In Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), 2008.
[24] H. Ji, R. Grishman, H.T. Dang, and K. Griffitt. Overview of the TAC 2010 Knowledge Base Population Track [DRAFT].
[25] J. Kazama and K. Torisawa. Exploiting Wikipedia as external knowledge for named entity recognition. In Proceedings of EMNLP-CoNLL, pages 698–707, 2007.
[26] S.B. Kim, K.S. Han, H.C. Rim, and S.H. Myaeng. Some effective techniques for naive Bayes text classification. IEEE Transactions on Knowledge and Data Engineering, pages 1457–1466, 2006.
[27] M. Knights. Web 2.0. Communications Engineer, 5(1):30–35, 2007.
[28] F. Li, Z. Zhang, F. Bu, Y. Tang, X. Zhu, and M. Huang. THU QUANTA at TAC 2009 KBP and RTE Track. In Text Analysis Conference (TAC), 2009.
[29] C.Y. Lin and E. Hovy. From single to multi-document summarization: A prototype system and its evaluation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 457–464. Association for Computational Linguistics, 2002.
[30] V. Lopez, M. Pasin, and E. Motta. AquaLog: An ontology-portable question answering system for the semantic web. The Semantic Web: Research and Applications, pages 546–562, 2005.
[31] K. Lund and C. Burgess. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, and Computers, 28(2):203–208, 1996.
[32] I. Mani and E. Bloedorn. Multi-document summarization by graph search and matching. Arxiv preprint cmp-lg/9712004, 1997.
[33] I. Mani and M.T. Maybury. Advances in Automatic Text Summarization. The MIT Press, 1999.
[34] G.S. Mann and D. Yarowsky. Unsupervised personal name disambiguation. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pages 33–40. Association for Computational Linguistics, 2003.
[35] T. McArthur. Worlds of Reference: Lexicography, Learning and Language from the Clay Tablet to the Computer. 1986.
[36] P. McNamee and H.T. Dang. Overview of the TAC 2009 knowledge base population track. In Text Analysis Conference (TAC), 2009.
[37] P. McNamee, M. Dredze, A. Gerber, N. Garera, T. Finin, J. Mayfield, C. Piatko, D. Rao, D. Yarowsky, and M. Dreyer. HLTCOE approaches to knowledge base population at TAC 2009. In Text Analysis Conference (TAC), 2009.
[38] P. Melville, W. Gryc, and R.D. Lawrence. Sentiment analysis of blogs by combining lexical knowledge with text classification. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1275–1284. ACM, 2009.
[39] R. Mihalcea. Using Wikipedia for automatic word sense disambiguation. In Proceedings of NAACL HLT, volume 2007, 2007.
[40] A. Mikheev, M. Moens, and C. Grover. Named entity recognition without gazetteers. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics, pages 1–8. Association for Computational Linguistics, 1999.
[41] D. Milne, O. Medelyan, and I.H. Witten. Mining domain-specific thesauri from Wikipedia: A case study. In Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, pages 442–448. IEEE Computer Society, 2006.
[42] D. Milne and I.H. Witten. Learning to link with Wikipedia. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 509–518. ACM, 2008.
[43] D.N. Milne, I.H. Witten, and D.M. Nichols. A knowledge-based search engine powered by Wikipedia. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pages 445–454. ACM, 2007.
[44] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Girju, R. Goodrum, and V. Rus. The structure and performance of an open-domain question answering system. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 563–570. Association for Computational Linguistics, 2000.
[45] K. Nakayama, T. Hara, and S. Nishio. Wikipedia mining for an association web thesaurus construction. Web Information Systems Engineering - WISE 2007, pages 322–334, 2007.
[46] D.P.T. Nguyen, Y. Matsuo, and M. Ishizuka. Relation extraction from Wikipedia using subtree mining. In Proceedings of the National Conference on Artificial Intelligence, volume 22, page 1414. AAAI Press, 2007.
[47] F. Ortega, J.M. Gonzalez-Barahona, and G. Robles. On the inequality of contributions to Wikipedia. In Proceedings of the 41st Annual Hawaii International Conference on System Sciences, page 304. IEEE, 2008.
[48] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, pages 79–86. Association for Computational Linguistics, 2002.
[49] T. Pedersen. A simple approach to building ensembles of Naive Bayesian classifiers for word sense disambiguation. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 63–69. Morgan Kaufmann Publishers Inc., 2000.
[50] W.J. Plath. REQUEST: a natural language question-answering system. IBM Journal of Research and Development, 20(4):326–335, 1976.
[51] E. Prud'hommeaux, A. Seaborne, et al. SPARQL query language for RDF. W3C Working Draft, 4, 2006.
[52] A. Ratnaparkhi et al. A maximum entropy model for part-of-speech tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 1, pages 133–142, 1996.
[53] A. Ratnaparkhi, J. Reynar, and S. Roukos. A maximum entropy model for prepositional phrase attachment. In Proceedings of the Workshop on Human Language Technology, pages 250–255. Association for Computational Linguistics, 1994.
[54] D. Ravichandran and E. Hovy. Learning surface text patterns for a question answering system. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 41–47. Association for Computational Linguistics, 2002.
[55] M. Remy. Wikipedia: The free encyclopedia. Reference Reviews, 16(6):5, 2002.
[56] J.D.M. Rennie. Improving multi-class text classification with naive Bayes. PhD thesis, 2001.
[57] A.E. Richman and P. Schone. Mining wiki resources for multilingual named entity recognition. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1–9, 2008.
[58] R. Rosenfeld. Adaptive statistical language modeling: a maximum entropy approach. PhD thesis, 2005.
[59] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.
[60] G. Salton, A. Wong, and C.S. Yang. A vector space model for information retrieval. Journal of the American Society for Information Science, 18(11):613–620, 1975.
[61] K.M. Schneider. A comparison of event models for Naive Bayes anti-spam e-mail filtering. In Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics - Volume 1, pages 307–314. Association for Computational Linguistics, 2003.
[62] Si Li, Sanyuan Gao, Zongyu Zhang, Xinsheng Li, Jingyi Guan, Weiran Xu, and Jun Guo. PRIS at TAC 2009: Experiments in KBP Track. In Proceedings of the Text Analysis Conference 2009 (TAC 09).
[63] R. Srihari and W. Li. A question answering system supported by information extraction. In Proceedings of the Sixth Conference on Applied Natural Language Processing, pages 166–172. Association for Computational Linguistics, 2000.
[64] S. Tan, X. Cheng, Y. Wang, and H. Xu. Adapting naive Bayes to domain adaptation for sentiment analysis. Advances in Information Retrieval, pages 337–349, 2009.
[65] E.F. Tjong Kim Sang and F. De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pages 142–147. Association for Computational Linguistics, 2003.
[66] M. Vela and T. Declerck. Concept and relation extraction in the finance domain. In Proceedings of the Eighth International Conference on Computational Semantics, pages 346–350. Association for Computational Linguistics, 2009.
[67] D.L. Waltz. An English language question answering system for a large relational database. Communications of the ACM, 21(7):526–539, 1978.
[68] D. Zelenko, C. Aone, and A. Richardella. Kernel methods for relation extraction. The Journal of Machine Learning Research, 3:1083–1106, 2003.
[69] T. Zesch, I. Gurevych, and M. Mühlhäuser. Analyzing and accessing Wikipedia as a lexical semantic resource. Data Structures for Linguistic Resources and Applications, pages 197–205, 2007.
[70] T. Zesch, C. Muller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the Conference on Language Resources and Evaluation (LREC), pages 1646–1652, 2008.
[71] G.D. Zhou and J. Su. Named entity recognition using an HMM-based chunk tagger. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 473–480. Association for Computational Linguistics, 2002.