
Modeling Mention, Context and Entity with Neural Networks for Entity Disambiguation

Yaming Sun†, Lei Lin†∗, Duyu Tang†, Nan Yang‡, Zhenzhou Ji†, Xiaolong Wang†
†Harbin Institute of Technology, Harbin, China

‡Microsoft Research, Beijing, China
[email protected], {linl, wangxl}@insun.hit.edu.cn,

[email protected], [email protected], [email protected]

Abstract

Given a query consisting of a mention (name string) and a background document, entity disambiguation calls for linking the mention to an entity from a reference knowledge base such as Wikipedia. Existing studies typically use hand-crafted features to represent mention, context and entity, which is labor-intensive and weak at discovering the explanatory factors of the data. In this paper, we address this problem by presenting a new neural network approach. The model takes into consideration the semantic representations of mention, context and entity, encodes them in continuous vector space and effectively leverages them for entity disambiguation. Specifically, we model variable-sized contexts with a convolutional neural network, and embed the positions of context words to factor in the distance between a context word and the mention. Furthermore, we employ a neural tensor network to model the semantic interactions between context and mention. We conduct experiments for entity disambiguation on two benchmark datasets from TAC-KBP 2009 and 2010. Experimental results show that our method yields state-of-the-art performances on both datasets.

1 Introduction

Entity disambiguation is a fundamental task in the field of natural language processing [Zheng et al., 2010; Ratinov et al., 2011; Han et al., 2011; Kataria et al., 2011; Sen, 2012], and a crucial step for knowledge base population [Ji and Grishman, 2011]. Given a document and a mention, which is usually a text span occurring in the document, entity disambiguation aims at mapping the mention to an entity from a reference knowledge base such as Wikipedia¹. For example, given the text span "President Obama" in the document "After campaigning on the promise of health care reform, President Obama gave a speech in March 2010 in Pennsylvania." as input, the purpose of entity disambiguation is to link the mention "President Obama" in this context to an entity in Wikipedia. The ground truth in this example is Barack Obama².

∗ Corresponding author.
¹ https://www.wikipedia.org/

Previous studies in the literature typically regard entity disambiguation as a ranking problem, and utilize similarity measures to compare the context of a mention with the text associated with a candidate entity (e.g. the text in the corresponding page in the reference KB). Since the performance of entity disambiguation heavily depends on the choice of feature representations of mention and entity, a variety of algorithms have been developed to represent them effectively for obtaining better disambiguation performance. Representative mention features include document surface features such as lexical and part-of-speech tags of context words, entropy-based representations [Mendes et al., 2011], and structured text representations such as dependency paths and topic feature representations [Taylor Cassidy et al., 2011]. Typical entity features include name tagging, KB infoboxes, synonyms and semantic categories [Chen and Ji, 2011; Cassidy et al., 2012].

Feature engineering is important but labor-intensive and insufficient to disentangle the underlying explanatory factors of data. In the natural language processing community, an effective feature learning approach is to compose the representation of a text (e.g. a phrase, sentence or document) from the representations of its words using a neural network [Mitchell and Lapata, 2010; Socher et al., 2013b; Kalchbrenner et al., 2014]. For the task of entity disambiguation, [He et al., 2013a] use a deep neural network to learn the representations of an input document containing a mention as well as a KB document referring to a candidate entity. They feed a document as input and employ Stacked Denoising Auto-encoders [Vincent et al., 2008] to produce the semantic representation. However, we argue that this approach is not effective enough, as it ignores the mention which is to be linked. Taking again the example of "President Obama" given before, the document representations for the two different mentions "President Obama" and "Pennsylvania" are identical under He et al. [2013a]'s approach. This is problematic for entity disambiguation because the target to be linked is the mention rather than the document.

In this paper, we introduce a new neural network approach that simultaneously takes into consideration mention, context and entity for entity disambiguation.

² http://en.wikipedia.org/wiki/Barack_Obama


Figure 1: The proposed neural network method for entity disambiguation. In this example, the mention "President Obama" comes from an original document "After campaigning on the promise of health care reform, President Obama gave a speech in March 2010 in Pennsylvania.". The candidate entity in this example is "Barack Obama".

The neural architecture is illustrated in Figure 1. We cast entity disambiguation as a ranking task by comparing the similarities between an input (mention, context) pair and candidate entities. Specifically, we embed mention, context and entity in continuous vector space to capture their semantic representations. The variable-sized contexts are modeled with convolutional neural networks. Since a closer context word might be more informative than a farther one for disambiguating a mention, we also embed the distance between a context word and the mention in continuous vector space. Furthermore, we exploit a low-rank neural tensor network to model the semantic composition between context and mention. We design a ranking-type hinge loss function, and collect 1M anchor texts from Wikipedia as training data for parameter learning without using any manual annotation.

We apply the neural network to entity disambiguation on two benchmark datasets from the Text Analysis Conference - Knowledge Base Population (TAC-KBP) in 2009 and 2010. We compare to the top-performing systems in the KBP evaluations along with state-of-the-art methods [Han et al., 2011; He et al., 2013a]. Experimental results show that our method yields state-of-the-art performances on both datasets.

The main contributions of this work are as follows.
• We present a new neural network approach which effectively captures the semantics of mention, context and entity simultaneously for entity disambiguation.
• We factor in context words as well as their position information with a convolutional neural network, and leverage a low-rank neural tensor network to model semantic composition between mention and context.
• We report empirical results on two benchmark datasets from KBP 2009 and KBP 2010. We show that the proposed method yields state-of-the-art performances on both datasets.

2 Methodology

We describe the proposed neural network for entity disambiguation in this section. We first give an overview of the approach, followed by the methods for modeling context, mention and entity, respectively. Afterwards, we describe the use of our method for entity disambiguation and the strategy for model training.

2.1 An Overview of the Approach

A bird's-eye view of the proposed neural network for entity disambiguation is given in Figure 1. As is shown, the input includes three parts, namely a mention, the context of the mention and a candidate entity from the reference knowledge base. The output of our neural network stands for the similarity between a candidate entity and a pair of mention and context. Specifically, we learn the continuous representations of context words with convolutional neural networks, and produce their semantic composition with the mention using a neural tensor network (detailed in Section 2.2). Meanwhile, we learn the continuous representation of a candidate entity. We then apply the learned representations of context, mention and entity to calculate the similarity between a candidate entity and a given (mention, context) pair, which is conveniently applied to entity disambiguation (detailed in Section 2.3).

2.2 Modeling Context, Mention and Entity

We describe our method for learning continuous representations of mention, context and entity as well as their semantic composition in this section.

Context Modeling. The representation of a context is composed of the representations of the words it contains, according to the principle of compositionality [Frege, 1892]. In addition, we argue that the representation of a context is also influenced by the distance between a context word and the mention. This is based on the consideration that a closer context word might be more informative than a farther one for disambiguating the mention.

Figure 2: Context modeling with convolutional neural network. The input of the context convolution includes word embeddings and position embeddings. The weights with the same color (e.g. red, purple or green) are shared across different filters.

To this end, the vector of each context word is made up of two parts: a word embedding e_w = L_w i_w and a position embedding e_p = L_p i_p, where L_w ∈ R^(d_w×|V_w|) and L_p ∈ R^(d_p×|V_p|) are the lookup tables of words and positions, respectively; d_w and d_p are the dimensions of the word vector and the position vector, respectively; i_w and i_p are binary vectors which are zero in all positions except at the w-th and p-th index. The position of a context word is its distance to the mention in a given piece of text. Since the number of context words is variable, we use a convolutional neural network, which is a state-of-the-art semantic composition approach [Kalchbrenner et al., 2014; Kim, 2014], to produce a fixed-length vector for a context. The convolution layer is a list of linear layers whose parameters are shared across different filter windows, as shown in Figure 2. Formally, suppose the filter window size of each convolution layer is K; the output vector of a convolution layer is calculated as follows:

O_conv = W_conv · in_conv + b_conv    (1)

where W_conv ∈ R^(h_l × K·(d_w+d_p)), h_l is the output length of the convolution layer, in_conv ∈ R^(K·(d_w+d_p)) is the concatenation of the representations of the K words in a filter window, and b_conv ∈ R^(h_l). The subsequent pooling layer captures the global information of the context, and outputs a fixed-length vector for a context of variable length. In this paper, we use an average pooling layer, but the method can naturally incorporate other pooling functions such as max pooling or k-max pooling [Kalchbrenner et al., 2014].
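To make the context encoder concrete, the following is a minimal numpy sketch of Equation 1 with position embeddings and average pooling. The variable names mirror the notation above; the toy vocabulary sizes, the random initialization, and the helper context_vector are illustrative assumptions rather than the authors' released code.

```python
import numpy as np

d_w, d_p, h_l, K = 50, 7, 100, 2                     # word dim, position dim, conv output, filter window
L_w = np.random.uniform(-0.01, 0.01, (d_w, 1000))    # word lookup table (toy vocabulary of 1000)
L_p = np.random.uniform(-0.01, 0.01, (d_p, 200))     # position lookup table (distances 0..199)
W_conv = np.random.uniform(-0.01, 0.01, (h_l, K * (d_w + d_p)))
b_conv = np.zeros(h_l)

def context_vector(word_ids, mention_index):
    """Encode a variable-length context into a fixed-length vector (Eq. 1 + average pooling)."""
    # each context word = its word embedding concatenated with an embedding of its
    # distance to the mention (closer words may be more informative)
    vecs = [np.concatenate([L_w[:, w], L_p[:, abs(i - mention_index)]])
            for i, w in enumerate(word_ids)]
    # convolution: apply the same linear layer to every window of K consecutive words
    windows = [np.concatenate(vecs[i:i + K]) for i in range(len(vecs) - K + 1)]
    O = np.array([W_conv @ win + b_conv for win in windows])
    return O.mean(axis=0)                            # average pooling -> context vector v_c

v_c = context_vector(word_ids=[3, 17, 42, 8, 99], mention_index=2)
print(v_c.shape)    # (100,)
```

With a filter window of K = 2 (the setting used in Section 2.4), each convolution window covers a bigram of word-plus-position vectors before pooling.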

Mention Modeling. Since a mention typically spans one to three words, we simply represent it as the average of the embeddings of the words it contains [Socher et al., 2013b]. Recall that we cast entity disambiguation as a ranking task, which requires the similarity between a candidate entity and a pair of mention and its context. Under this perspective, we need to calculate the representation of an input document consisting of mention and context based on the representation of the mention, the representation of the context and their semantic compositionality [Frege, 1892]. We employ a neural tensor network [Socher et al., 2013c] as the composition function because it is a state-of-the-art performer in the field of vector-based semantic composition [Mitchell and Lapata, 2010]. A standard neural tensor network with rank 3 is a list of bilinear layers, each of which conducts a bilinear operation on two input vectors and outputs a scalar. A bilinear layer is typically parameterized by a matrix M ∈ R^(N×N), where N is the dimension of each input vector.

Figure 3: An illustration of the low-rank neural tensor network for modeling the semantic composition between mention and context.

In this paper, we follow [Socher et al., 2013c] and represent each input as the concatenation of the mention vector and the context vector. To decrease the number of parameters in the standard neural tensor network, we make a low-rank approximation that represents each matrix by two low-rank matrices plus a diagonal, as illustrated in Figure 3. Formally, the parameter of the i-th slice is M_i^appr = M_i1 × M_i2 + diag(m_i), where M_i1 ∈ R^(N×r), M_i2 ∈ R^(r×N), m_i ∈ R^N. The output of the neural tensor layer is formalized as follows:

v_mc = [v_m; v_c]^T [M^appr]^[1:L] [v_m; v_c]    (2)

where [v_m; v_c] ∈ R^N is the concatenation of the mention vector v_m and the context vector v_c; [M^appr]^[1:L] is the low-rank tensor that defines multiple low-rank bilinear layers; and L is the number of slices of the neural tensor network, which is also equal to the output length of v_mc.
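As a concrete illustration of Equation 2, here is a small numpy sketch of the low-rank tensor composition, together with the simple averaging used to form the mention vector. The dimensions (a 50-dimensional mention vector, a 100-dimensional context vector, rank r = 2, L = 30 slices) and the random initialization are assumptions chosen only to make the example self-contained.

```python
import numpy as np

N, r, L = 150, 2, 30     # N = len([v_m; v_c]); L = 30 slices as in Section 2.4
M1 = np.random.uniform(-0.01, 0.01, (L, N, r))
M2 = np.random.uniform(-0.01, 0.01, (L, r, N))
m_diag = np.random.uniform(-0.01, 0.01, (L, N))

def mention_vector(word_vectors):
    """A mention (typically 1-3 words) is the average of its word embeddings."""
    return np.mean(word_vectors, axis=0)

def ntn_compose(v_m, v_c):
    """Low-rank neural tensor composition (Eq. 2): one bilinear form per slice."""
    x = np.concatenate([v_m, v_c])                   # [v_m; v_c] in R^N
    out = np.empty(L)
    for i in range(L):
        M_appr = M1[i] @ M2[i] + np.diag(m_diag[i])  # M_i^appr = M_i1 M_i2 + diag(m_i)
        out[i] = x @ M_appr @ x                      # scalar output of the i-th slice
    return out                                       # v_mc in R^L

v_m = mention_vector(np.random.randn(2, 50))   # e.g. a two-word mention with 50-dim word vectors
v_c = np.random.randn(100)                     # context vector from the convolutional encoder
v_mc = ntn_compose(v_m, v_c)
print(v_mc.shape)                              # (30,)
```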

Entity Modeling. We model the semantics of an entity in the knowledge base from two aspects: entity surface words and entity class. For example, the surface words of the entity Barack Obama are barack and obama. The entity class of an entity is a word or a phrase provided in the infobox of the reference knowledge base, which indicates the category information of the entity. For example, the class of Barack Obama is president of the united states. We use the embeddings of class words to represent the semantics of the entity class. This is based on the consideration that entity classes are semantically related with each other in a continuous vector space rather than independent in a discrete vector space. Since surface words and class words are both short and variable-sized, we average them separately to produce an entity word vector and an entity class vector [Socher et al., 2013b]. In order to encode the interactions between these two vectors, we use the low-rank neural tensor network detailed above to produce the final entity representation.
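The entity side can be sketched analogously: average the surface-word embeddings and the class-word embeddings separately, then compose the two averages with a low-rank tensor layer of its own. The composition function passed in below is assumed to be built like ntn_compose above, with parameters sized for the entity-side input.

```python
import numpy as np

def entity_vector(surface_word_vecs, class_word_vecs, compose):
    """v_e: compose the averaged entity surface words (e.g. "barack", "obama") with the
    averaged entity class words (e.g. "president", "of", "the", "united", "states")."""
    v_ew = np.mean(surface_word_vecs, axis=0)   # entity word vector
    v_ec = np.mean(class_word_vecs, axis=0)     # entity class vector
    return compose(v_ew, v_ec)
```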

2.3 Entity Disambiguation

We apply the learned representation of a candidate entity as well as the composed context and mention for entity disambiguation in a ranking-based framework. Given the representation of a candidate entity v_e and the representation of a mention-context pair v_mc, we use the cosine similarity between these two vectors to represent their semantic relatedness, namely

sim(e, mc) = cosine(v_e, v_mc)    (3)

In the prediction process, we calculate the similarity of a context-mention pair with each candidate entity, and select the closest one as the final result. For effectively training the model, we devise a ranking-type loss function as given in Equation 4. The basic idea is that the output score of a correct entity should be larger than the score of a randomly selected candidate entity by a margin of 1.

loss = Σ_{(m,c)∈T} max(0, 1 − sim(e, mc) + sim(e′, mc))    (4)

where e is the gold-standard entity, and e′ is a corrupted entity which is randomly selected from the entire entity vocabulary of the reference KB.
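The scoring and training objective of Equations 3 and 4 are straightforward to write down. The sketch below assumes v_mc and the entity vectors come from the components described earlier, and that the corrupted entity e′ has already been sampled from the KB entity vocabulary.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def sim(v_e, v_mc):
    """Eq. 3: relatedness between a candidate entity and a (mention, context) pair."""
    return cosine(v_e, v_mc)

def hinge_loss(v_mc, v_gold, v_corrupt):
    """Eq. 4 for one instance: the gold entity should outscore a corrupted entity by a margin of 1."""
    return max(0.0, 1.0 - sim(v_gold, v_mc) + sim(v_corrupt, v_mc))

def predict(v_mc, candidate_vectors):
    """Prediction: pick the candidate entity closest to the (mention, context) representation."""
    return int(np.argmax([sim(v_e, v_mc) for v_e in candidate_vectors]))
```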

2.4 Model Training

It is commonly accepted that large training data is crucial for obtaining better performance with neural networks. In order to obtain massive training data without any manual annotation, we collect queries (including mention and context) and the corresponding target entities using anchor text from Wikipedia. For example, a document containing the anchor text President Obama linked to the entity Barack Obama will be regarded as a gold disambiguation instance, whose mention is President Obama and whose ground truth is Barack Obama. We train the word embeddings on our training set with SkipGram [Mikolov et al., 2013], which is integrated into the widely used word2vec toolkit³. We set the dimension of the word vectors to 50 and the window size to 5. We convert all words to lowercase and normalize digits with a special symbol. The vocabulary size of the word embedding is 1.63M. We train the neural network by taking the derivative of the loss through back-propagation with respect to the whole set of parameters. The parameters of the linear layer and the low-rank neural tensor network are initialized from a uniform distribution U(−rnd, rnd), where rnd = 0.01. We empirically set the learning rate to 0.01, the window size of the convolutional neural network to 2, and the output length of the neural tensor layer to 30.
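A minimal sketch of the stated initialization and hyperparameters follows; the parameter shapes are placeholders, and the plain SGD update is an assumption, since the paper only states that the loss is back-propagated with respect to all parameters.

```python
import numpy as np

rnd, lr = 0.01, 0.01              # initialization range and learning rate (Section 2.4)
conv_window, ntn_out_len = 2, 30  # CNN filter window and output length of the tensor layer

def init_param(*shape):
    """Initialize a linear-layer or low-rank-tensor parameter from U(-rnd, rnd)."""
    return np.random.uniform(-rnd, rnd, shape)

def sgd_step(param, grad):
    """One plain gradient step (assumed optimizer)."""
    param -= lr * grad
    return param
```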

3 Experiment

In this section, we describe the experimental settings and empirical results on the task of entity disambiguation.

³ https://code.google.com/p/word2vec/

3.1 Experimental Setting

We conduct experiments on two benchmark datasets for entity disambiguation from the Text Analysis Conference - Knowledge Base Population (TAC-KBP⁴) in 2009 [McNamee and Dang, 2009] and 2010 [Ji and Grishman, 2011]. TAC-KBP officially provides a collection of queries, each of which contains a mention and its background document. Participants are asked to link the mention in a query to an entity from the officially provided reference knowledge base. Each entity in the reference knowledge base is accompanied by its infobox and description text.

⁴ http://www.nist.gov/tac/

We follow the experimental protocol described in [He et al., 2013a], and use only non-NIL queries (whose target entities are in the KB) from KBP 2009 and KBP 2010 for testing. The numbers of non-NIL queries from KBP 2009 and KBP 2010 are 1,675 and 1,020, respectively. The reference knowledge base contains 818,741 entities and 2,344 entity classes in total. For model training, we collect anchor texts which contain the entities covered by the reference knowledge base. We finally get 1M instances as training data to train our neural network. We use micro-averaged accuracy as the evaluation metric, which measures whether a top-ranked entity candidate is the ground truth.

We use several heuristic rules to obtain candidate entities for a given query, as detailed below. We save the entities which are (a) exact matches of a given mention, (b) the anchor entities of a mention in Wikipedia, (c) the redirected entities of a mention if they are contained in redirect pages in Wikipedia, and (d) the entities whose minimum edit distance to the mention is smaller than two. To reduce the number of candidates, we use the context of the mention to filter out some candidates with simple string matching rules. The final recalls of our candidate entities on KBP 2009 and KBP 2010 are 90.08% and 91.17%, respectively.
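A rough sketch of candidate-generation rules (a)-(d) is shown below. The name-to-entity indexes built from Wikipedia anchors and redirects are assumed data structures; the paper only lists the rules, and the additional context-based filtering step is omitted here.

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def generate_candidates(mention, entity_names, anchor_index, redirect_index):
    """Rules (a)-(d): exact match, anchor targets, redirect targets, edit distance < 2."""
    m = mention.lower()
    cands = set()
    cands.update(e for e in entity_names if e.lower() == m)                  # (a) exact match
    cands.update(anchor_index.get(m, []))                                    # (b) anchor entities
    cands.update(redirect_index.get(m, []))                                  # (c) redirected entities
    cands.update(e for e in entity_names if levenshtein(m, e.lower()) < 2)   # (d) near matches
    return cands
```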

3.2 Experimental Results

We report empirical results of our method as well as baseline methods for entity disambiguation on two benchmark datasets from TAC-KBP 2009 and 2010.

The methods presented in this work can be divided into four models with incremental integration of semantics. We describe the details of these four models as follows.
• Model 1. We only use the semantics of the mention and the candidate entity surface words, without using the contexts of the mention or the class information of the entity. We simply average the word vectors of a mention and an entity as their representations. This is analogous to the method used in [Blanco et al., 2015].
• Model 2. We use the semantics of the mention, context words, and candidate entity in this model. We extend Model 1 by using a convolutional neural network to capture the semantics of context words. In Model 2, we simply concatenate the mention vector and the context vector without capturing their interactions. For the entity component, we integrate the entity class information and concatenate its vector with the entity surface word vector as the entity representation.
• Model 3. We extend Model 2 by taking the position information of context words into consideration. We embed each position into a continuous vector space, and concatenate it with the context word vector for subsequent use in the convolutional neural network. We use concatenation as the semantic composition function in both the mention-context part and the entity part.
• Model 4. We extend Model 3 by incorporating the interactions between (a) the context vector and the mention vector, as well as (b) the entity surface word vector and the entity class vector. We use the low-rank neural tensor network to model semantic composition in both components, as detailed in Section 2.2.

We report empirical results of our models and baseline methods on the TAC-KBP 2009 and 2010 test datasets. The official 1st, 2nd and 3rd ranked systems of KBP 2009 and KBP 2010 are marked as Rank 1, Rank 2 and Rank 3. We also compare with a collective entity disambiguation method [Han et al., 2011] and a state-of-the-art neural network approach [He et al., 2013a]. Our methods are abbreviated as Model 1-4. Experimental results are illustrated in Table 1.

Method               KBP 2009   KBP 2010
Rank 1               77.25      80.59
Rank 2               76.54      75.20
Rank 3               75.88      73.73
[Han et al., 2011]   79.00      –
[He et al., 2013a]   –          80.97
Model 1              73.85      75.98
Model 2              80.47      81.56
Model 3              80.75      83.92
Model 4              82.26      81.07

Table 1: Experimental results on the test sets of TAC-KBP 2009 and 2010 for entity disambiguation. The evaluation metric is micro-averaged accuracy (in KB). The best result is in bold.

We can see that our method (Model 3, Model 4) yields the best performance on both datasets compared with many strong baselines. The performance of Model 1 is relatively low because it only utilizes the surface word-level semantics of mention and entity, and ignores the crucial contextual information of a mention. Model 2 obtains a significant performance boost over Model 1 by integrating the semantic representations of context words. Besides, we find, somewhat surprisingly, that Model 2 already outperforms the best baseline methods on both datasets. This result verifies the effectiveness of context information for the task of entity disambiguation. Comparing Model 3 with Model 2, we can see that a further improvement (0.28% and 2.36% in accuracy) is achieved by incorporating the position information of context words. This is intuitive, since a closer context word might be more informative than a farther one for disambiguating the mention. Comparing Model 4 with Model 3, we can see that the neural tensor network is more powerful than vector concatenation for semantic composition on the TAC-KBP 2009 dataset. The reason is that the neural tensor network better captures the semantic interactions between mention and context.

3.3 Model Analysis

We investigate the influential factors of our method for entity disambiguation in this part.

Figure 4: Experiments of our neural network (Model 3) for entity disambiguation with different dimensions of position embedding (x-axis: dimension of position embedding, from 0 to 9; y-axis: in-KB accuracy on KBP 2009).

We first explore the effect of position embedding on KBP 2009. Specifically, we investigate how the dimension of position embedding affects the performance of our Model 3 for entity disambiguation. We vary the dimension of position embedding δ_p from 0 to 9 in steps of 1. Results with different dimensions of position embedding are given in Figure 4. The model with δ_p = 0 stands for Model 2, which does not use position information. We can see that position embedding is effective for entity disambiguation, because all models with δ_p > 0 outperform the model with δ_p = 0. Model 3 performs best when δ_p is 7.

We also vary the rank size of the low-rank neural tensor network in Model 4. The best performances on the TAC-KBP 2009 and 2010 datasets are achieved at rank sizes of 1 and 2, respectively. The training time costs of Model 4 for one iteration with different rank sizes are illustrated in Figure 5. We can see that the time cost increases (almost) linearly with the rank size. This is because the number of parameters of the low-rank neural tensor network increases linearly with the rank size. We run the experiments on one machine with 64G memory and a 24-core Intel Xeon CPU.

Figure 5: The training time costs of our neural network (Model 4) with different rank sizes (x-axis: rank size of the low-rank neural tensor network; y-axis: time cost in minutes).

4 Related Work

We briefly review existing studies on entity disambiguation and neural network approaches for natural language processing in this section.

4.1 Entity Disambiguation

Entity disambiguation is typically regarded as a ranking task, which calls for measuring the similarity between the context of a mention and the text associated with a candidate entity (e.g. the text in the corresponding page in the KB). Existing algorithms for entity disambiguation can be generally divided into local approaches and global (collective) approaches. The former [Zheng et al., 2010; Mendes et al., 2011; Ji and Grishman, 2011] use local statistics of a mention m_i and an entity title t_i. The latter [Han et al., 2011; Ratinov et al., 2011; He et al., 2013b] take all mentions in a given document into consideration simultaneously. Both directions require the semantic relatedness between mention m_i and entity t_i. Representative mention features in the literature include document surface features such as lexical and part-of-speech tags of context words, entropy-based representations [Mendes et al., 2011], and structured text representations such as dependency paths and topic feature representations [Taylor Cassidy et al., 2011]. Typical entity features include name tagging, KB infoboxes, synonyms and semantic categories [Chen and Ji, 2011; Cassidy et al., 2012]. Since feature engineering is time-consuming and weak at discovering the underlying explanatory factors of data, it is desirable to learn features automatically from data. Under this perspective, [He et al., 2013a] investigate Stacked Denoising Autoencoders to learn continuous representations of context text and entity document. Unlike the dominant existing studies that use hand-crafted features, we learn discriminative features with a neural network automatically from data. Our method differs from [He et al., 2013a] in two aspects. On one hand, we use continuous representations of context positions to capture the distance between a context word and the mention. On the other hand, we explicitly model the semantic composition between the context vector and the mention vector with a low-rank neural tensor network.

4.2 Neural Network for Natural Language Processing

We briefly introduce neural network approaches for natural language processing (NLP) in the literature. Existing neural network approaches can be divided into two directions. One is learning continuous representations of words [Mikolov et al., 2013]. The other direction focuses on semantic composition [Mitchell and Lapata, 2010] in order to obtain the representations of phrases, sentences and documents. The semantic representation of text can be effectively used as features for a variety of NLP tasks, including machine translation [Cho et al., 2014], syntactic parsing [Socher et al., 2013a], discourse parsing [Li et al., 2014], relation classification [Zeng et al., 2014], sentiment analysis [Socher et al., 2013c; Tang et al., 2014; Li, 2014], part-of-speech tagging and named entity recognition [Collobert et al., 2011]. Our approach for modeling variable-sized context representations is relevant to the field of vector-based semantic composition [Mitchell and Lapata, 2010]. Representative algorithms in this field are the recursive neural network [Socher et al., 2013c] and the convolutional neural network [Kalchbrenner et al., 2014; Kim, 2014]. These methods are based on the principle of compositionality, which states that the representation of a text (e.g. a sentence or a document) is composed from the representations of the words it contains. In this work, we prefer the convolutional neural network, as it does not rely on a fixed parse tree and is a state-of-the-art performer in this field. We take into consideration the continuous representations of context positions, which have been exploited as useful signals for relation classification [Zeng et al., 2014] and semantic role labeling [Collobert et al., 2011]. We model the semantic composition of context and mention with a neural tensor network, which has been explored as a powerful composition function for sentiment analysis [Socher et al., 2013c] and knowledge base completion [Socher et al., 2013b]. Our strategy for approximating the standard neural tensor network with a low-rank form is inspired by [Socher et al., 2012], which represents each matrix with a low-rank approximation.

5 Conclusion

We present a new neural network approach in this work for entity disambiguation. The model leverages the semantics of mention, context and entity as well as their compositionality in a unified way. We represent contexts with a convolutional neural network, and encode the positions of context words in continuous space to capture the distance between a context word and the mention. We use a low-rank neural tensor network to model semantic composition between context and mention as well as between entity surface words and entity class. We apply the model to entity disambiguation on the TAC-KBP 2009 and 2010 datasets. Empirical results show that the model outperforms previous studies on both datasets. We show that incorporating the semantics of contexts significantly boosts the performance of entity disambiguation.

Acknowledgments

We gratefully acknowledge the helpful discussions with Shujie Liu and Yuhang Guo. We thank the anonymous reviewers for their insightful feedback. This work was partly supported by the National Natural Science Foundation of China (No. 61300114, No. 61272383).

References

[Blanco et al., 2015] Roi Blanco, Giuseppe Ottaviano, and Edgar Meij. Fast and space-efficient entity linking in queries. In WSDM, pages 2061–2069, 2015.
[Cassidy et al., 2012] Taylor Cassidy, Heng Ji, Lev-Arie Ratinov, Arkaitz Zubiaga, and Hongzhao Huang. Analysis and enhancement of wikification for microblogs with context expansion. In COLING, pages 441–456, 2012.
[Chen and Ji, 2011] Zheng Chen and Heng Ji. Collaborative ranking: A case study on entity linking. In EMNLP, pages 771–781, 2011.
[Cho et al., 2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, pages 1724–1734, 2014.
[Collobert et al., 2011] Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 2011.
[Frege, 1892] Gottlob Frege. On sense and reference. Ludlow (1997), pages 563–584, 1892.
[Han et al., 2011] Xianpei Han, Le Sun, and Jun Zhao. Collective entity linking in web text: a graph-based method. In SIGIR, pages 765–774. ACM, 2011.
[He et al., 2013a] Zhengyan He, Shujie Liu, Mu Li, Ming Zhou, Longkai Zhang, and Houfeng Wang. Learning entity representation for entity disambiguation. In ACL, pages 30–34, 2013.
[He et al., 2013b] Zhengyan He, Shujie Liu, Yang Song, Mu Li, Ming Zhou, and Houfeng Wang. Efficient collective entity linking with stacking. In EMNLP, pages 426–435, 2013.
[Ji and Grishman, 2011] Heng Ji and Ralph Grishman. Knowledge base population: Successful approaches and challenges. In ACL, pages 1148–1158, 2011.
[Kalchbrenner et al., 2014] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. In ACL, pages 655–665, 2014.
[Kataria et al., 2011] Saurabh S Kataria, Krishnan S Kumar, Rajeev R Rastogi, Prithviraj Sen, and Srinivasan H Sengamedu. Entity disambiguation with hierarchical topic models. In SIGKDD, pages 1037–1045. ACM, 2011.
[Kim, 2014] Yoon Kim. Convolutional neural networks for sentence classification. In EMNLP, pages 1746–1751, 2014.
[Li et al., 2014] Jiwei Li, Rumeng Li, and Eduard Hovy. Recursive deep models for discourse parsing. In EMNLP, pages 2061–2069, 2014.
[Li, 2014] Jiwei Li. Feature weight tuning for recursive neural networks. arXiv preprint, 1412.3714, 2014.
[McNamee and Dang, 2009] Paul McNamee and Hoa Trang Dang. Overview of the TAC 2009 knowledge base population track. In Text Analysis Conference (TAC), pages 111–113, 2009.
[Mendes et al., 2011] Pablo N Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. DBpedia Spotlight: shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems, pages 1–8. ACM, 2011.
[Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
[Mitchell and Lapata, 2010] Jeff Mitchell and Mirella Lapata. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429, 2010.
[Ratinov et al., 2011] Lev Ratinov, Dan Roth, Doug Downey, and Mike Anderson. Local and global algorithms for disambiguation to Wikipedia. In ACL, pages 1375–1384, 2011.
[Sen, 2012] Prithviraj Sen. Collective context-aware topic models for entity disambiguation. In WWW, pages 729–738. ACM, 2012.
[Socher et al., 2012] Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic compositionality through recursive matrix-vector spaces. In EMNLP, pages 1201–1211, 2012.
[Socher et al., 2013a] Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. Parsing with compositional vector grammars. In ACL, pages 455–465, 2013.
[Socher et al., 2013b] Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Ng. Reasoning with neural tensor networks for knowledge base completion. In NIPS, pages 926–934, 2013.
[Socher et al., 2013c] Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, pages 1631–1642, 2013.
[Tang et al., 2014] Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. Learning sentiment-specific word embedding for Twitter sentiment classification. In ACL, pages 1555–1565, 2014.
[Taylor Cassidy et al., 2011] Taylor Cassidy, Zheng Chen, Javier Artiles, Heng Ji, Hongbo Deng, Lev-Arie Ratinov, Jing Zheng, Jiawei Han, and Dan Roth. CUNY-UIUC-SRI TAC-KBP2011 entity linking system description. In Proceedings of the Text Analysis Conference (TAC 2011), 2011.
[Vincent et al., 2008] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, pages 1096–1103. ACM, 2008.
[Zeng et al., 2014] Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. Relation classification via convolutional deep neural network. In COLING, pages 2335–2344, 2014.
[Zheng et al., 2010] Zhicheng Zheng, Fangtao Li, Minlie Huang, and Xiaoyan Zhu. Learning to link entities with knowledge base. In NAACL, pages 483–491, 2010.
