
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1531–1542, Lisbon, Portugal, 17-21 September 2015. © 2015 Association for Computational Linguistics.

Syntax-Aware Multi-Sense Word Embeddings for Deep Compositional Models of Meaning

Jianpeng Cheng
University of Oxford
Department of Computer Science
[email protected]

Dimitri Kartsaklis
Queen Mary University of London
School of Electronic Engineering and Computer Science
[email protected]

Abstract

Deep compositional models of meaning, acting on distributional representations of words in order to produce vectors of larger text constituents, are evolving into a popular area of NLP research. We detail a compositional distributional framework based on a rich form of word embeddings that aims at facilitating the interactions between words in the context of a sentence. Embeddings and composition layers are jointly learned against a generic objective that enhances the vectors with syntactic information from the surrounding context. Furthermore, each word is associated with a number of senses, the most plausible of which is selected dynamically during the composition process. We evaluate the produced vectors qualitatively and quantitatively with positive results. At the sentence level, the effectiveness of the framework is demonstrated on the MSRPar task, for which we report results within the state-of-the-art range.

1 Introduction

Representing the meaning of words by using their distributional behaviour in a large text corpus is a well-established technique in NLP research that has been proved useful in numerous tasks. In a distributional model of meaning, the semantic representation of a word is given as a vector in some high-dimensional vector space, obtained either by explicitly collecting co-occurrence statistics of the target word with words belonging to a representative subset of the vocabulary, or by directly optimizing the word vectors against an objective function in some neural network-based architecture (Collobert and Weston, 2008; Mikolov et al., 2013).

Regardless of their method of construction, distributional models of meaning do not scale up to larger text constituents such as phrases or sentences, since the uniqueness of multi-word expressions would inevitably lead to data sparsity problems and thus to unreliable vectorial representations. The problem is usually addressed by the provision of a compositional function, the purpose of which is to prepare a vectorial representation for a phrase or sentence by combining the vectors of the words therein. While the nature and complexity of these compositional models may vary, approaches based on deep-learning architectures have been shown to be especially successful in modelling the meaning of sentences for a variety of tasks (Socher et al., 2012; Kalchbrenner et al., 2014).

The mutual interaction of distributional word vectors by means of a compositional model provides many opportunities for interesting research, the majority of which still remains to be explored. One such direction is to investigate in what way lexical ambiguity affects the compositional process. In fact, recent work has shown that shallow multi-linear compositional models that explicitly handle extreme cases of lexical ambiguity in a step prior to composition present consistently better performance than their “ambiguous” counterparts (Kartsaklis and Sadrzadeh, 2013; Kartsaklis et al., 2014). A first attempt to test these observations in a deep compositional setting has been presented by Cheng et al. (2014) with promising results.

Furthermore, a second important question relates to the very nature of the word embeddings used in the context of a compositional model. In a setting of this form, word vectors are no longer just a means for discriminating words based on their underlying semantic relationships; the main goal of a word vector is to contribute to a bigger whole—a task in which syntax, along with semantics, also plays a very important role. It is a central point of this paper, therefore, that in a compositional distributional model of meaning word vectors should be injected with information that reflects their syntactic roles in the training corpus.


The purpose of this work is to improve the current practice in deep compositional models of meaning in relation to both the compositional process itself and the quality of the word embeddings used therein. We propose an architecture for jointly training a compositional model and a set of word embeddings, in a way that imposes dynamic word sense induction for each word during the learning process. Note that this is in contrast with recent work in multi-sense neural word embeddings (Neelakantan et al., 2014), in which the word senses are learned without any compositional considerations in mind.

Furthermore, we make the word embeddings syntax-aware by introducing a variation of the hinge loss objective function of Collobert and Weston (2008), in which the goal is not only to predict the occurrence of a target word in a context, but also to predict the position of the word within that context. A qualitative analysis shows that our vectors reflect both semantic and syntactic features in a concise way.

In all current deep compositional distributional settings, the word embeddings are internal parameters of the model, with no use for any purpose other than the task for which they were specifically trained. In this work, one of our main considerations is that the joint training step should be generic enough not to be tied to any particular task. In this way the word embeddings and the derived compositional model can be learned on data much more diverse than any task-specific dataset, reflecting a wider range of linguistic features. Indeed, experimental evaluation shows that the produced word embeddings can serve as a high-quality general-purpose semantic word space, presenting performance on the Stanford Contextual Word Similarity (SCWS) dataset of Huang et al. (2012) competitive with, and even better than, that of well-established neural word embedding sets.

Finally, we propose a dynamic disambiguation framework for a number of existing deep compositional models of meaning, in which the multi-sense word embeddings and the compositional model of the original training step are further refined according to the purposes of a specific task at hand. In the context of paraphrase detection, we achieve a result very close to the current state-of-the-art on the Microsoft Research Paraphrase Corpus (Dolan and Brockett, 2005). An interesting side aspect of the paraphrase detection experiment is that, in contrast to mainstream approaches that mainly rely on simple forms of classifiers, we approach the problem by following a siamese architecture (Bromley et al., 1993).

2 Background and related work

2.1 Distributional models of meaning

Distributional models of meaning follow the distributional hypothesis (Harris, 1954), which states that two words that occur in similar contexts have similar meanings. Traditional approaches for constructing a word space rely on simple counting: a word is represented by a vector of numbers (usually smoothed by the application of some function such as point-wise mutual information) which show how frequently this word co-occurs with other possible context words in a corpus of text.

In contrast to these methods, a recent class of distributional models treats word representations as parameters directly optimized on a word prediction task (Bengio et al., 2003; Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014). Instead of relying on observed co-occurrence counts, these models aim to maximize the objective function of a neural net-based architecture; Mikolov et al. (2013), for example, compute the conditional probability of observing words in a context around a target word (an approach known as the skip-gram model). Recent studies have shown that, compared to their co-occurrence counterparts, neural word vectors reflect better the semantic relationships between words (Baroni et al., 2014) and are more effective in compositional settings (Milajevs et al., 2014).

2.2 Syntactic awareness

Since the main purpose of distributional models until now was to measure the semantic relatedness of words, relatively little effort has been put into making word vectors aware of information regarding the syntactic role under which a word occurs in a sentence. In some cases the vectors are POS-tag specific, so that ‘book’ as noun and ‘book’ as verb are represented by different vectors (Kartsaklis and Sadrzadeh, 2013). Furthermore, word spaces in which the context of a target word is determined by means of grammatical dependencies (Pado and Lapata, 2007) are more effective in capturing syntactic relations than approaches based on simple word proximity.

For word embeddings trained in neural settings, syntactic information is not usually taken explicitly into account, with some notable exceptions. At the lexical level, Levy and Goldberg (2014) propose an extension of the skip-gram model based on grammatical dependencies. Following a different approach, Mnih and Kavukcuoglu (2013) weight the vector of each context word depending on its distance from the target word. With regard to compositional settings (discussed in the next section), Hashimoto et al. (2014) use dependency-based word embeddings by employing a hinge loss objective, while Hermann and Blunsom (2013) condition their objectives on the CCG types of the involved words.

As we will see in Section 3, the current paper offers an appealing alternative to those approaches that does not depend on grammatical relations or types of any form.

2.3 Compositionality in distributional models

The methods that aim to equip distributional models of meaning with compositional abilities come in many different levels of sophistication, from simple element-wise vector operators such as addition and multiplication (Mitchell and Lapata, 2008) to category theory (Coecke et al., 2010). In this latter work relational words (such as verbs or adjectives) are represented as multi-linear maps acting on vectors representing their arguments (nouns and noun phrases). In general, the above models are shallow in the sense that they do not have functional parameters and the output is produced by the direct interaction of the inputs; yet they have been shown to capture the compositional meaning of sentences to an adequate degree.

The idea of using neural networks for compositionality in language appeared 25 years ago in a seminal paper by Pollack (1990), and has been recently re-popularized by Socher and colleagues (Socher et al., 2011a; Socher et al., 2012). The compositional architecture used in these works is that of a recursive neural network (RecNN) (Socher et al., 2011b), where the words get composed by following a parse tree. A particular variant of the RecNN is the recurrent neural network (RNN), in which a sentence is assumed to be generated by aggregating words in sequence (Mikolov et al., 2010). Furthermore, some recent work (Kalchbrenner et al., 2014) models the meaning of sentences by utilizing the concept of a convolutional neural network (LeCun et al., 1998), the main characteristic of which is that it acts on small overlapping parts of the input vectors. In all the above models, the word embeddings and the weights of the compositional layers are optimized against a task-specific objective function.

In Section 3 we will show how to remove the restriction of a supervised setting, introducing a generic objective that can be trained on any general-purpose text corpus. While we focus on recursive and recurrent neural network architectures, the general ideas we will discuss are in principle model-independent.

2.4 Disambiguation in composition

Regardless of the way they address composition, all the models of Section 2.3 rely on ambiguous word spaces, in which every meaning of a polysemous word is merged into a single vector. Especially for cases of homonymy (such as ‘bank’, ‘organ’ and so on), where the same word is used to describe two or more completely unrelated concepts, this approach is problematic: the semantic representation of the word becomes the average of all senses, inadequate to express any of them in a reliable way.

To address this problem, a prior disambiguation step on the word vectors is often introduced, the purpose of which is to find the word representations that best fit the given context, before composition takes place (Reddy et al., 2011; Kartsaklis et al., 2013; Kartsaklis and Sadrzadeh, 2013; Kartsaklis et al., 2014). This idea has been tested on algebraic and tensor-based compositional functions with very positive results. Furthermore, it has also been found to provide minimal benefits for a RecNN compositional architecture in a number of phrase and sentence similarity tasks (Cheng et al., 2014). This latter work clearly suggests that explicitly dealing with lexical ambiguity in a deep compositional setting is an idea worth exploring further. While treating disambiguation as only a preprocessing step is a less than optimal strategy for a neural setting, one would expect the benefits to be greater for an architecture in which the disambiguation takes place in a dynamic fashion during training.

We are now ready to start detailing a compositional model that takes into account the above considerations. The issue of lexical ambiguity is covered in Section 4; Section 3 below deals with generic training and syntactic awareness.

3 Syntax-based generic training

We propose a novel architecture for learning word embeddings and a compositional model to use them in a single step. The learning takes place in the context of a RecNN (or an RNN), and both word embeddings and parameters of the compositional layer are optimized against a generic objective function that uses a hinge loss.


[Figure 1: Recursive (a) and recurrent (b) neural networks.]

Figure 1 shows the general form of recursive and recurrent neural networks. In architectures of this form, a compositional layer is applied on each pair of inputs x_1 and x_2 in the following way:

p = g(W x_{[1:2]} + b)    (1)

where x_{[1:2]} denotes the concatenation of the two vectors, g is a non-linear function, and W, b are the parameters of the model. In the RecNN case, the compositional process continues recursively by following a parse tree until a vector for the whole sentence or phrase is produced; on the other hand, an RNN assumes that a sentence is generated in a left-to-right fashion, taking into consideration no dependencies other than word adjacency.
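To make the recursive application of Equation 1 concrete, the sketch below gives a minimal NumPy implementation (an illustration, not the authors' Theano code): compose applies Equation 1 to a pair of vectors, and compose_tree walks a binary parse tree bottom-up. The nested-tuple tree format and the tanh non-linearity are assumptions made for the example.

```python
import numpy as np

def compose(x1, x2, W, b):
    """Equation 1: p = g(W [x1; x2] + b), with g = tanh (an assumed choice)."""
    x = np.concatenate([x1, x2])   # x_[1:2]: concatenation of the two inputs
    return np.tanh(W @ x + b)      # non-linear composition

def compose_tree(node, W, b):
    """RecNN-style composition: recursively combine children along a binary parse tree.
    A leaf is a word vector (np.ndarray); an internal node is a (left, right) tuple."""
    if isinstance(node, np.ndarray):
        return node
    left, right = node
    return compose(compose_tree(left, W, b), compose_tree(right, W, b), W, b)

# Toy usage with random 300-dimensional vectors and parameters
d = 300
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(d, 2 * d))
b = np.zeros(d)
kids, play, ball = (rng.normal(size=d) for _ in range(3))
sentence_vec = compose_tree((kids, (play, ball)), W, b)  # vector for "[kids [play ball]]"
```

An RNN variant would instead fold the word vectors in left-to-right order, composing each new word with the running state.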

We amend the above setting by introducing a novel layer on top of the compositional one, which scores the linguistic plausibility of the composed sentence or phrase vector with regard to both syntax and semantics. Following Collobert and Weston (2008), we convert the unsupervised learning problem to a supervised one by corrupting training sentences. Specifically, for each sentence s we create two sets of negative examples. In the first set, S', the target word within a given context is replaced by a random word; as in the original C&W paper, this set is used to enforce semantic coherence in the word vectors. Syntactic coherence is enforced by a second set of negative examples, S'', in which the words of the context have been randomly shuffled. The objective function is defined in terms of the following hinge losses:

\sum_{s \in S} \sum_{s' \in S'} \max(0,\, m - f(s) + f(s'))    (2)

\sum_{s \in S} \sum_{s'' \in S''} \max(0,\, m - f(s) + f(s''))    (3)

where S is the set of sentences, f the compositional layer, and m a margin we wish to retain between the scores of the positive training examples and the negative ones. During training, all parameters in the scoring layer, the compositional layers and the word representations are jointly updated by error back-propagation. As output, we get both general-purpose syntax-aware word representations and weights for the corresponding compositional model.
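As an illustration of Equations 2 and 3, the following sketch (my own, not the released code) builds the two kinds of negative examples for a single sentence and accumulates the corresponding hinge losses, assuming a scoring function f that maps a sequence of word vectors to a plausibility score.

```python
import random

def hinge_objective(sentence, f, vocab_vectors, margin=1.0, n_neg=5):
    """Hinge losses of Eqs. 2-3 for one sentence (a list of word vectors).

    f:             assumed callable mapping a list of vectors to a scalar plausibility score
    vocab_vectors: list of word vectors to draw random replacement words from
    """
    pos_score = f(sentence)
    loss = 0.0
    for _ in range(n_neg):
        # S': replace one word with a random word -> enforces semantic coherence (Eq. 2)
        corrupted = list(sentence)
        corrupted[random.randrange(len(corrupted))] = random.choice(vocab_vectors)
        loss += max(0.0, margin - pos_score + f(corrupted))

        # S'': shuffle the words of the context -> enforces syntactic coherence (Eq. 3)
        shuffled = list(sentence)
        random.shuffle(shuffled)
        loss += max(0.0, margin - pos_score + f(shuffled))
    return loss
```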

4 From words to senses

We now extend our model to address lexical ambiguity. We achieve that by applying a gated architecture, similar to the one used in the multi-sense model of Neelakantan et al. (2014), but advancing the main idea to the compositional setting detailed in Section 3.

We assume a fixed number of n senses per word.[1] Each word is associated with a main vector (obtained for example by using an existing vector set, or by simply applying the process of Section 3 in a separate step), as well as with n vectors denoting cluster centroids and an equal number of sense vectors. Both cluster centroids and sense vectors are randomly initialized at the beginning of the process. For each word w_t in a training sentence, we prepare a context vector by averaging the main vectors of all other words in the same context. This context vector is compared with the cluster centroids of w_t by cosine similarity, and the sense corresponding to the closest cluster is selected as the most representative of w_t in the current context. The selected cluster centroid is updated by the addition of the context vector, and the associated sense vector is passed as input to the compositional layer. The selected sense vectors for each word in the sentence are updated by back-propagation, based on the objectives of Equations 2 and 3. The overall architecture of our model, as described in this and the previous section, is illustrated in Figure 2.

[1] Note that in principle the fixed-number-of-senses assumption is not necessary; Neelakantan et al. (2014), for example, present a version of their model in which new senses are added dynamically when appropriate.
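The gating step described above can be sketched as follows. This is an illustrative NumPy rendering of the description, with hypothetical container names (main_vecs, centroids, sense_vecs), not the authors' actual code.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def select_sense(t, sentence_ids, main_vecs, centroids, sense_vecs):
    """Pick the sense of the word at position t given its sentential context.

    sentence_ids: word ids of the sentence
    main_vecs:    {word_id: main (ambiguous) vector}
    centroids:    {word_id: array of shape (n_senses, d)} cluster centroids
    sense_vecs:   {word_id: array of shape (n_senses, d)} sense vectors
    """
    w = sentence_ids[t]
    # Context vector: average of the main vectors of all *other* words in the context
    context = np.mean([main_vecs[u] for i, u in enumerate(sentence_ids) if i != t], axis=0)
    # Compare the context with each cluster centroid of w by cosine similarity
    sims = [cosine(context, c) for c in centroids[w]]
    k = int(np.argmax(sims))
    # Update the winning centroid with the context vector (running cluster update)
    centroids[w][k] += context
    # The corresponding sense vector is what gets fed to the compositional layer
    return sense_vecs[w][k]
```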

5 Task-specific dynamic disambiguation

The model of Figure 2 decouples the training of word vectors and compositional parameters from a specific task, and as a consequence from any task-specific training dataset. However, note that by replacing the plausibility layer with a classifier trained for some task at hand, we get a task-specific network that transparently trains multi-sense word embeddings and applies dynamic disambiguation on the fly. While this idea of single-step direct training seems appealing, one consideration is that the task-specific dataset used for the training will probably not reflect the linguistic variety that is required to exploit the expressiveness of the setting to its full extent. Additionally, in many cases the size of datasets tied to specific tasks is prohibitive for training a deep architecture.

[Figure 2: Training of syntax-aware multi-sense embeddings in the context of a RecNN.]

It is a merit of this proposal that, in cases like these, it is possible to train the generic model of Figure 2 on any large corpus of text, and then use the produced word vectors and compositional weights to initialize the parameters of a more specific version of the architecture. As a result, the trained parameters will be further refined according to the task-specific objective. Figure 3 illustrates the generic case of a compositional framework applying dynamic disambiguation. Note that here sense selection takes place via a soft-max layer, which can be directly optimized on the task objective.
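In the task-specific setting, hard cluster assignment is thus replaced by a soft-max over senses (Figure 3). A minimal sketch of that soft selection follows, assuming per-word sense vectors and a compatibility score of each sense against the context vector; scoring by dot product is an illustrative assumption rather than the paper's exact parameterization.

```python
import numpy as np

def soft_sense_vector(context, senses, temperature=1.0):
    """Soft dynamic disambiguation: weight the sense vectors of a word by a
    soft-max over their compatibility with the context vector.

    context: (d,) context vector
    senses:  (n_senses, d) sense vectors of the word
    Returns the soft-max-weighted sense vector fed to the compositional layer(s).
    """
    scores = senses @ context / temperature        # assumed compatibility scores
    scores -= scores.max()                         # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()  # soft-max over senses
    return probs @ senses                          # weighted average of sense vectors
```

Because the weights are differentiable in the scores, the whole selection can be optimized directly against the task objective by back-propagation, which is the point of the soft variant.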

6 A siamese network for paraphrase detection

We will test the dynamic disambiguation framework of Section 5 in a paraphrase detection task. A paraphrase is a restatement of the meaning of a sentence using different words and/or syntax. The goal of a paraphrase detection model, thus, is to examine two sentences and decide whether they express the same meaning.

While the usual way to approach this problem is to utilize a classifier that acts (for example) on the concatenation of the two sentence vectors, in this work we follow a novel perspective: specifically, we apply a siamese architecture (Bromley et al., 1993), a concept that has been extensively used in computer vision (Hadsell et al., 2006; Sun et al., 2014). While siamese networks have also been used in the past for NLP purposes (for example, by Yih et al. (2011)), to the best of our knowledge this is the first time that such a setting is applied to paraphrase detection.

In our model, two networks sharing the same parameters are used to compute the vectorial representations of two sentences, the paraphrase relation of which we wish to detect; this is achieved by employing a cost function that compares the two vectors. There are two commonly used cost functions: the first is based on the L2 norm (Hadsell et al., 2006; Sun et al., 2014), while the second is based on the cosine similarity (Nair and Hinton, 2010; Sun et al., 2014). The L2 norm variation is capable of handling differences in the magnitude of the vectors. Formally, the cost function is defined as:

E_f = \begin{cases} \frac{1}{2}\|f(s_1) - f(s_2)\|_2^2 & \text{if } y = 1 \\ \frac{1}{2}\max(0,\, m - \|f(s_1) - f(s_2)\|_2)^2 & \text{otherwise} \end{cases}

where s_1, s_2 are the input sentences, f the compositional layer (so f(s_1) and f(s_2) refer to sentence vectors), and y = 1 denotes a paraphrase relationship between the sentences; m stands for the margin, a hyper-parameter chosen in advance. On the other hand, the cost function based on cosine similarity handles only directional differences, as follows:

E_f = \frac{1}{2}\big(y - \sigma(w d + b)\big)^2    (4)

where d = \frac{f(s_1) \cdot f(s_2)}{\|f(s_1)\|_2 \|f(s_2)\|_2} is the cosine similarity of the two sentence vectors, w and b are the scaling and shifting parameters to be optimized, σ is the sigmoid function and y is the label. In the experiments that follow in Section 7.4, both of these cost functions are evaluated. The overall architecture is shown in Figure 4.

[Figure 3: Dynamic disambiguation in a generic compositional deep net.]

[Figure 4: A siamese network for paraphrase detection.]
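For concreteness, here is a small NumPy sketch of the two cost functions above for a single sentence pair; it illustrates the equations only, not the full training procedure (which would also need gradients through f and the shared network parameters).

```python
import numpy as np

def l2_contrastive_cost(v1, v2, y, margin):
    """L2-norm siamese cost: pull paraphrase pairs together, push non-paraphrase
    pairs apart up to the margin (the case-wise equation above)."""
    dist = np.linalg.norm(v1 - v2)
    if y == 1:                                 # paraphrase pair
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, margin - dist) ** 2  # non-paraphrase pair

def cosine_cost(v1, v2, y, w, b):
    """Cosine siamese cost (Equation 4): squared error between the label and
    a sigmoid of the scaled, shifted cosine similarity."""
    d = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
    sigma = 1.0 / (1.0 + np.exp(-(w * d + b)))
    return 0.5 * (y - sigma) ** 2
```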

In Section 7.4 we will use the pre-trained vectors and compositional weights for deriving sentence representations that will be subsequently fed to the siamese network. When the dynamic disambiguation framework is used, the sense vectors of the words are updated during training so that the sense selection process is gradually refined.

7 Experiments

We evaluate the quality of the compositional word vectors and the proposed deep compositional framework in the tasks of word similarity and paraphrase detection, respectively.

7.1 Model pre-training

In all experiments the word representations and compositional models are pre-trained on the British National Corpus (BNC), a general-purpose text corpus that contains 6 million sentences of written and spoken English. For comparison we train two sets of word vectors and compositional models, one ambiguous and one multi-sense (fixing 3 senses per word). The dimension of the embeddings is set to 300.

As our compositional architectures we use a RecNN and an RNN. In the RecNN case, the words are composed by following the result of an external parser, while for the RNN the composition takes place in sequence from left to right. To avoid the exploding or vanishing gradient problem (Bengio et al., 1994) for long sentences, we employ a long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997). During the training of each model, we minimize the hinge losses of Equations 2 and 3. The plausibility layer is implemented as a 2-layer network with 150 units at the hidden layer, and is applied at each individual node (as opposed to a single application at the sentence level). All parameters are updated in mini-batches with the AdaDelta gradient descent method (Zeiler, 2012) (λ = 0.03, initial α = 0.05).

7.2 Qualitative evaluation of the word vectors

As a first step, we qualitatively evaluate the trained word embeddings by examining the nearest-neighbour lists of a few selected words. We compare the results with those produced by the skip-gram model (SG) of Mikolov et al. (2013) and the language model (CW) of Collobert and Weston (2008). We refer to our model as SAMS (Syntax-Aware Multi-Sense). The results in Table 1 show clearly that our model tends to group words that are both semantically and syntactically related; for example, and in contrast with the compared models, which group words only at the semantic level, our model is able to retain tenses, numbers (singulars and plurals), and gerunds.

The observed behaviour is comparable to that of embedding models with objective functions conditioned on grammatical relations between words; Levy and Goldberg (2014), for example, present a similar table for their dependency-based extension of the skip-gram model. The advantage of our approach over such models is twofold: firstly, the word embeddings are accompanied by a generic compositional model that can be used for creating sentence representations independently of any specific task; and secondly, the training is quite robust to the data sparsity problems that a dependency-based approach would in general intensify (since context words are paired with the grammatical relations under which they occur with the target word). As a result, a small corpus such as the BNC is sufficient for producing high-quality syntax-aware word embeddings.


Word        | SG                                  | CW                                  | SAMS
begged      | beg, begging, cried                 | begging, pretended, beg             | persuaded, asked, cried
refused     | refusing, refuses, refusal          | refusing, declined, refuse          | declined, rejected, denied
interrupted | interrupting, punctuated, interrupt | interrupts, interrupt, interrupting | punctuated, preceded, disrupted
themes      | thematic, theme, notions            | theme, concepts, subtext            | meanings, concepts, ideas
patiently   | impatiently, waited, waits          | impatiently, queue, expectantly     | impatiently, silently, anxiously
player      | players, football, league           | game, club, team                    | athlete, sportsman, team
prompting   | prompted, prompt, sparking          | prompt, amid, triggered             | sparking, triggering, forcing
reproduce   | reproducing, replicate, humans      | reproducing, thrive, survive        | replicate, produce, repopulate
predictions | prediction, predict, forecasts      | predicting, assumption, predicted   | expectations, projections, forecasts

Table 1: Nearest neighbours for a number of words with various embedding models.

7.3 Word similarity

We now proceed to a quantitative evaluation of our embeddings on the Stanford Contextual Word Similarity (SCWS) dataset of Huang et al. (2012). The dataset contains 2,003 pairs of words and the contexts they occur in. We can therefore make use of the contextual information in order to select the most appropriate sense for each ambiguous word. Similarly to Neelakantan et al. (2014), we use three different metrics: globalSim measures the similarity between two ambiguous word vectors; localSim selects a single sense for each word based on the context and computes the similarity between the two sense vectors; avgSim represents each word as a weighted average of all senses in the given context and computes the similarity between the two weighted sense vectors.
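The three metrics can be summarized with the small sketch below (an illustration in the spirit of Neelakantan et al. (2014); deriving the avgSim weights from context-to-centroid cosine similarities is an assumption on my part):

```python
import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def global_sim(main1, main2):
    """globalSim: cosine between the two main (ambiguous) vectors."""
    return cos(main1, main2)

def local_sim(senses1, senses2, ctx1, ctx2, centroids1, centroids2):
    """localSim: pick one sense per word from its context, then compare."""
    k1 = int(np.argmax([cos(ctx1, c) for c in centroids1]))
    k2 = int(np.argmax([cos(ctx2, c) for c in centroids2]))
    return cos(senses1[k1], senses2[k2])

def avg_sim(senses1, senses2, ctx1, ctx2, centroids1, centroids2):
    """avgSim: compare context-weighted averages of the sense vectors
    (weights from context/centroid similarity are an assumed choice)."""
    w1 = np.array([max(cos(ctx1, c), 0.0) for c in centroids1]) + 1e-8
    w2 = np.array([max(cos(ctx2, c), 0.0) for c in centroids2]) + 1e-8
    v1 = (w1 / w1.sum()) @ senses1
    v2 = (w2 / w2.sum()) @ senses2
    return cos(v1, v2)
```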

We compute and report the Spearman correlation between the embedding similarities and human judgments (Table 2). In addition to the skip-gram and Collobert and Weston models, we also compare against the CBOW model (Mikolov et al., 2013) and the multi-sense skip-gram (MSSG) model of Neelakantan et al. (2014).

Model | globalSim | localSim | avgSim
CBOW  | 59.5      | –        | –
SG    | 61.8      | –        | –
CW    | 55.3      | –        | –
MSSG  | 61.3      | 56.7     | 62.1
SAMS  | 59.9      | 58.5     | 62.5

Table 2: Results for the word similarity task (Spearman's ρ × 100).

Among all methods, only the MSSG model and ours are capable of learning multi-prototype word representations. Our embeddings show top performance for the localSim and avgSim measures, and performance competitive to that of MSSG and SG for globalSim, both of which use a hierarchical soft-max as their objective function. Compared to the original C&W model, our version presents an improvement of 4.6%—a clear indication of the effectiveness of the proposed learning method and the enhanced objective.

7.4 Paraphrase detection

In the last set of experiments, the proposed compositional distributional framework is evaluated on the Microsoft Research Paraphrase Corpus (MSRPC) (Dolan and Brockett, 2005), which contains 5,800 pairs of sentences. This is a binary classification task, with labels provided by human annotators. We apply the siamese network detailed in Section 6.

While MSRPC is one of the most used datasets for evaluating paraphrase detection models, its size is prohibitive for any attempt at training a deep architecture. Therefore, for our training we rely on a much larger external dataset, the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013). The PPDB contains more than 220 million paraphrase pairs, of which 73 million are phrasal paraphrases and 140 million are paraphrase patterns that capture syntactic transformations of sentences. We use these phrase- and sentence-level paraphrase pairs as additional training contexts to fine-tune the generic compositional model parameters and word embeddings and to train the baseline models. The original training set of the MSRPC is used as a validation set for deciding hyperparameters, such as the margin of the error function and the number of training epochs.

The evaluation covers various aspects, and the models are gradually refined to demonstrate performance within the state-of-the-art range.

Comparison of the two error functions  In the first evaluation, we compare the two error functions of the siamese network using only ambiguous vectors. As we can see in Table 3, the cosine error function consistently outperforms the L2 norm-based one for both compositional models, providing yet another confirmation of the already well-established fact that similarity in semantic vector spaces is better reflected by length-invariant measures.

Model | L2   | Cosine
RecNN | 73.8 | 74.9
RNN   | 73.0 | 74.3

Table 3: Results with different error functions for the paraphrase detection task (accuracy × 100).

Effectiveness of disambiguation  We now proceed to compare the effectiveness of the two compositional models when using ambiguous vectors and multi-sense vectors, respectively. Our error function is set to cosine similarity, following the results of the previous evaluation. When dynamic disambiguation is applied, we test two methods of selecting sense vectors: in the hard case the vector of the most plausible sense is selected, while in the soft case a new vector is prepared as the weighted average of all sense vectors according to the probabilities returned by the soft-max layer (see Figure 3). As a baseline we use a simple compositional model based on vector addition.

The dynamic disambiguation models and the additive baseline are compared with variations that use a simple prior disambiguation step applied on the word vectors. This is achieved by first selecting for each word the sense vector that is closest to the average of all other word vectors in the same sentence, and then composing the selected sense vectors without further considerations regarding ambiguity. The baseline model and the prior disambiguation variants are trained as separate logistic regression classifiers. The results are shown in Table 4.

Model    | Ambig. | Prior | Hard DD | Soft DD
Addition | 69.9   | 71.3  | –       | –
RecNN    | 74.9   | 75.3  | 75.7    | 76.0
RNN      | 74.3   | 74.6  | 75.1    | 75.2

Table 4: Different disambiguation choices for the paraphrase detection task (accuracy × 100).

Overall, disambiguated vectors work better than the ambiguous ones, with the improvement being more significant for the additive model; there, a simple prior disambiguation step produces 1.4% gains. For the deep compositional models, simple prior disambiguation is still helpful, with small improvements, a result which is consistent with the findings of Cheng et al. (2014). The small gains of the prior disambiguation models over the ambiguous models clearly show that deep architectures are quite capable of performing this elementary form of sense selection intrinsically, as part of the learning process itself. However, the situation changes when the dynamic disambiguation framework is used, where the gains over the ambiguous version become more significant. Comparing the two ways of dynamic disambiguation (hard method and soft method), the numbers that the soft method gives are slightly higher, producing a total gain of 1.1% over the ambiguous version for the RecNN case.[2]

Note that, at this stage, the advantage of using the dynamic disambiguation framework over simple prior disambiguation is still small (0.7% for the case of RecNN). We seek the reason behind this in the recursive nature of our architecture, which tends to progressively “hide” local features of word vectors, thus diminishing the effect of the fine-tuned sense vectors produced by the dynamic disambiguation mechanism. The next section discusses the problem and provides a solution.

The role of pooling  One of the problems of the recursive and recurrent compositional architectures, especially in grammars with a strict branching structure such as that of English, is that any given composition is usually the product of a terminal and a non-terminal; i.e. a single word can contribute to the meaning of a sentence to the same extent as the rest of the sentence as a whole, as below:

[[kids]NP [play ball games in the park]VP]S

In the above case, the contribution of the words within the verb phrase to the final sentence representation will be faded out due to the recursive composition mechanism. Inspired by related work in computer vision (Sun et al., 2014), we attempt to alleviate this problem by introducing an average pooling layer at the sense vector level and adding the resulting vector to the sentence representation. By doing this we expect that the new sentence vector will reflect local features from all words in the sentence that can help the classification in a more direct way. The results for the new deep architectures are shown in Table 5, where we see substantial improvements for both deep nets. More importantly, the effect of dynamic disambiguation now becomes more significant, as expected by our analysis.

[2] For all subsequent experiments, the reported results are based on the soft selection method.
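A sketch of the pooling modification, assuming the composed sentence vector and the selected sense vectors are already available; combining the pooled vector with the sentence vector by simple addition follows the description above.

```python
import numpy as np

def sentence_with_pooling(sentence_vec, selected_sense_vecs):
    """Add an average-pooled vector of all selected sense vectors to the
    recursively composed sentence vector, so that local word features are
    not faded out by the recursive composition."""
    pooled = np.mean(selected_sense_vecs, axis=0)  # average pooling over sense vectors
    return sentence_vec + pooled
```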

Table 5 also includes results for two models trained in a single step, with word and sense vectors randomly initialized at the beginning of the process. We see that, despite the large size of the training set, the results are much lower than the ones obtained when using the pre-training step. This demonstrates the importance of the initial training on a general-purpose corpus: the resulting vectors reflect linguistic information that, although not obtainable from the task-specific training, can make a great difference in the result of the classification.

Model                | Ambig. | Prior | Dynamic
RecNN+pooling        | 75.5   | 76.3  | 77.6
RNN+pooling          | 74.8   | 75.9  | 76.6
1-step RecNN+pooling | 74.4   | –     | 72.9
1-step RNN+pooling   | 73.6   | –     | 73.1

Table 5: Results with average pooling for the paraphrase detection task (accuracy × 100).

Cross-model comparison  In this section we propose a method to further improve the performance of our models, and we present an evaluation against some of the previously reported results.

We notice that using distributional properties alone cannot efficiently capture subtle aspects of a sentence, for example numbers or human names. However, even small differences in those aspects between two sentences can lead to a different classification result. Therefore, we train (using the MSRPC training data) an additional logistic regression classifier which is based not only on the embedding similarity, but also on a few hand-engineered features. We then ensemble the new classifier (C1) with the original one. In terms of feature selection, we follow Socher et al. (2011a) and Blacoe and Lapata (2012) and add the following features: the difference in sentence length, the unigram overlap between the two sentences, and features related to numbers (including the presence or absence of numbers in a sentence and whether or not the numbers in the two sentences are the same). In Table 6 we report results of the original model and the ensembled model, and we compare with the performance of other existing models.
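A minimal sketch of the hand-engineered features listed above, operating on tokenized sentences; the exact tokenization and number handling in the released system may differ, so treat this as illustrative.

```python
import re

def surface_features(tokens1, tokens2):
    """Hand-engineered features for the auxiliary classifier C1:
    sentence-length difference, unigram overlap, and number-related features."""
    len_diff = abs(len(tokens1) - len(tokens2))

    set1, set2 = set(tokens1), set(tokens2)
    overlap = len(set1 & set2) / max(len(set1 | set2), 1)  # unigram overlap ratio

    nums1 = set(t for t in tokens1 if re.fullmatch(r"\d+(\.\d+)?", t))
    nums2 = set(t for t in tokens2 if re.fullmatch(r"\d+(\.\d+)?", t))
    has_numbers = int(bool(nums1) or bool(nums2))  # presence of numbers
    same_numbers = int(nums1 == nums2)             # numbers identical in both sentences

    return [len_diff, overlap, has_numbers, same_numbers]

# These features would be concatenated with the embedding-based similarity and fed
# to a logistic regression classifier, which is then ensembled with the siamese model.
```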

In all of the implemented models (including the additive baseline), disambiguation is performed to guarantee the best performance. We see that by ensembling the original classifier with C1, we improve the result of the previous section by another 1%. This is the second-best result reported so far for the specific task, with a 0.6 difference in F-score from the first (Ji and Eisenstein, 2013).[3]

Model                           | Acc. | F1
Baselines:
  All positive                  | 66.5 | 79.9
  Addition (disamb.)            | 71.3 | 81.1
Dynamic disambiguation:
  RecNN                         | 76.0 | 84.0
  RecNN+Pooling                 | 77.6 | 84.7
  RecNN+Pooling+C1              | 78.6 | 85.3
  RNN                           | 75.2 | 83.6
  RNN+Pooling                   | 76.6 | 84.3
  RNN+Pooling+C1                | 77.5 | 84.6
Published results:
  Mihalcea et al. (2006)        | 70.3 | 81.3
  Rus et al. (2008)             | 70.6 | 80.5
  Qiu et al. (2006)             | 72.0 | 81.6
  Islam and Inkpen (2009)       | 72.6 | 81.3
  Fernando and Stevenson (2008) | 74.1 | 82.4
  Wan et al. (2006)             | 75.6 | 83.0
  Das and Smith (2009)          | 76.1 | 82.7
  Socher et al. (2011a)         | 76.8 | 83.6
  Madnani et al. (2012)         | 77.4 | 84.1
  Ji and Eisenstein (2013)      | 80.4 | 85.9

Table 6: Cross-model comparison in the paraphrase detection task.

8 Conclusion and future work

The main contribution of this paper is a deep compositional distributional model acting on linguistically motivated word embeddings.[4] The effectiveness of the syntax-aware, multi-sense word vectors and the dynamic compositional disambiguation framework in which they are used was demonstrated by appropriate tasks at the lexical and sentence level, respectively, with very positive results. As an aside, we also demonstrated the benefits of a siamese architecture in the context of a paraphrase detection task. While the architectures tested in this work were limited to a RecNN and an RNN, the ideas we presented are in principle directly applicable to any kind of deep network. As a future step, we aim to test the proposed models on a convolutional compositional architecture, similar to that of Kalchbrenner et al. (2014).

Acknowledgments

The authors would like to thank the three anonymous reviewers for their useful comments, as well as Nal Kalchbrenner and Ed Grefenstette for early discussions and suggestions on the paper, and Simon Suster for comments on the final draft. Dimitri Kartsaklis gratefully acknowledges financial support by AFOSR.

[3] Source: ACL Wiki (http://www.aclweb.org/aclwiki), August 2015.

[4] Code in Python/Theano and the word embeddings can be found at https://github.com/cheng6076.


References

Marco Baroni, Georgiana Dinu, and German Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1.

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.

Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155.

William Blacoe and Mirella Lapata. 2012. A comparison of vector-based representations for semantic composition. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 546–556. Association for Computational Linguistics.

Jane Bromley, James W. Bentz, Leon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore, Eduard Sackinger, and Roopak Shah. 1993. Signature verification using a siamese time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 7(04):669–688.

Jianpeng Cheng, Dimitri Kartsaklis, and Edward Grefenstette. 2014. Investigating the role of prior disambiguation in deep-learning compositional models of meaning. In 2nd Workshop on Learning Semantics, NIPS 2014, Montreal, Canada, December.

B. Coecke, M. Sadrzadeh, and S. Clark. 2010. Mathematical foundations for a compositional distributional model of meaning. Lambek Festschrift. Linguistic Analysis, 36:345–384.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM.

Dipanjan Das and Noah A. Smith. 2009. Paraphrase identification as probabilistic quasi-synchronous recognition. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, pages 468–476. Association for Computational Linguistics.

W. B. Dolan and C. Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Third International Workshop on Paraphrasing (IWP2005).

Samuel Fernando and Mark Stevenson. 2008. A semantic similarity approach to paraphrase detection. In Proceedings of the 11th Annual Research Colloquium of the UK Special Interest Group for Computational Linguistics, pages 45–52. Citeseer.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In HLT-NAACL, pages 758–764.

Raia Hadsell, Sumit Chopra, and Yann LeCun. 2006. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 1735–1742. IEEE.

Zellig S. Harris. 1954. Distributional structure. Word.

Kazuma Hashimoto, Pontus Stenetorp, Makoto Miwa, and Yoshimasa Tsuruoka. 2014. Jointly learning word representations and composition functions using predicate-argument structures. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1544–1555, Doha, Qatar, October. Association for Computational Linguistics.

Karl Moritz Hermann and Phil Blunsom. 2013. The role of syntax in vector space models of compositional semantics. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 894–904, Sofia, Bulgaria, August. Association for Computational Linguistics.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 873–882. Association for Computational Linguistics.

Aminul Islam and Diana Inkpen. 2009. Semantic similarity of short texts. Recent Advances in Natural Language Processing V, 309:227–236.

Yangfeng Ji and Jacob Eisenstein. 2013. Discriminative improvements to distributional sentence similarity. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 891–896, Seattle, Washington, USA, October. Association for Computational Linguistics.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, June.

Dimitri Kartsaklis and Mehrnoosh Sadrzadeh. 2013. Prior disambiguation of word tensors for constructing sentence vectors. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1590–1601, Seattle, Washington, USA, October. Association for Computational Linguistics.

Dimitri Kartsaklis, Mehrnoosh Sadrzadeh, and Stephen Pulman. 2013. Separating disambiguation from composition in distributional semantics. In Proceedings of the 17th Conference on Computational Natural Language Learning (CoNLL), pages 114–123, Sofia, Bulgaria, August.

Dimitri Kartsaklis, Nal Kalchbrenner, and Mehrnoosh Sadrzadeh. 2014. Resolving lexical ambiguity in tensor regression models of meaning. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Vol. 2: Short Papers), pages 212–217, Baltimore, USA, June. Association for Computational Linguistics.

Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 302–308, Baltimore, Maryland, June. Association for Computational Linguistics.

Nitin Madnani, Joel Tetreault, and Martin Chodorow. 2012. Re-examining machine translation metrics for paraphrase identification. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 182–190. Association for Computational Linguistics.

Rada Mihalcea, Courtney Corley, and Carlo Strapparava. 2006. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI, volume 6, pages 775–780.

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 1045–1048.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Dmitrijs Milajevs, Dimitri Kartsaklis, Mehrnoosh Sadrzadeh, and Matthew Purver. 2014. Evaluating neural word representations in tensor-based compositional settings. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 708–719, Doha, Qatar, October. Association for Computational Linguistics.

Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Proceedings of ACL-08: HLT, pages 236–244, Columbus, Ohio, June. Association for Computational Linguistics.

Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pages 2265–2273.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814.

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient non-parametric estimation of multiple embeddings per word in vector space. In Proceedings of EMNLP.

S. Pado and M. Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33(2):161–199.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 12.

Jordan B. Pollack. 1990. Recursive distributed representations. Artificial Intelligence, 46(1):77–105.

Long Qiu, Min-Yen Kan, and Tat-Seng Chua. 2006. Paraphrase recognition via dissimilarity significance classification. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 18–26. Association for Computational Linguistics.

Siva Reddy, Ioannis P. Klapaftis, Diana McCarthy, and Suresh Manandhar. 2011. Dynamic and static prototype vectors for semantic composition. In IJCNLP, pages 705–713.

Vasile Rus, Philip M. McCarthy, Mihai C. Lintean, Danielle S. McNamara, and Arthur C. Graesser. 2008. Paraphrase identification with lexico-syntactic graph subsumption. In FLAIRS Conference, pages 201–206.

Richard Socher, Eric H. Huang, Jeffrey Pennin, Christopher D. Manning, and Andrew Y. Ng. 2011a. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, pages 801–809.

Richard Socher, Cliff C. Lin, Chris Manning, and Andrew Y. Ng. 2011b. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 129–136.

R. Socher, B. Huval, C. Manning, and A. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Conference on Empirical Methods in Natural Language Processing 2012.

Yi Sun, Yuheng Chen, Xiaogang Wang, and Xiaoou Tang. 2014. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems, pages 1988–1996.

Stephen Wan, Mark Dras, Robert Dale, and Cecile Paris. 2006. Using dependency-based features to take the para-farce out of paraphrase. In Proceedings of the Australasian Language Technology Workshop, volume 2006.

Wen-tau Yih, Kristina Toutanova, John C. Platt, and Christopher Meek. 2011. Learning discriminative projections for text similarity measures. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 247–256, Portland, Oregon, USA, June. Association for Computational Linguistics.

Matthew D. Zeiler. 2012. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.