Dimension Reduction for Short Text Similarity and its Applications
Weiwei Guo
Submitted in partial fulfillment of the
requirements for the degree
of Doctor of Philosophy
in the Graduate School of Arts and Sciences
COLUMBIA UNIVERSITY
2015
© 2015
Weiwei Guo
All Rights Reserved
ABSTRACT
Dimension Reduction for Short Text Similarity and its Applications
Weiwei Guo
Recently, due to the burst of online text data, much of the focus of natural language processing
(NLP) research has shifted from long documents to shorter ones such as sentences and utterances.
However, short texts pose significant challenges from an NLP perspective, especially if the goal is
to get at sentence-level semantics in the absence of larger contexts. Motivated by this challenge,
this thesis focuses on the problem of predicting the similarity between two short text samples by
extracting latent representations of the text data, and we apply the resulting models to various
NLP tasks that involve short text similarity computation.
The major challenge of computing short text similarity is the insufficient information in the text
snippets. In a sentence similarity benchmark [Agirre et al., 2012], a sentence has on average 10.8
words. Hence, there are very few overlapping words in a text pair even when the texts are semantically
related, and the widely used bag-of-words representation fails to capture the semantic relatedness.
To this end, we propose several weighted matrix factorization models for learning latent
representations of texts, which induce meaningful similarity scores:
1. Modeling Missing Words: To address the word sparsity issue, we propose to model the
missing words (words that are not in the short text), a feature that is typically overlooked in the liter-
ature. We define the missing words of a text as the whole vocabulary in a corpus minus the observed
words in the text. The model carefully handles the missing words that by assigning them a small
weight in the matrix factorization framework. In the experiments, the new model weighted matrix
factorization (WMF) achieves superior performance to Latent Dirichlet Allocation (LDA) [Blei et
al., 2003], which does not use missing words, and latent semantic analysis (LSA) [Deerwester et
al., 1990], which uses missing words but does not distinguish missing words from observed words.
2. Modeling Lexical Semantics: We improve the previous WMF model in terms of lexical
semantics. For short text similarity, it is crucial to robustly model each word in the text to capture the
complete semantic picture of the text, since there is very little repeated information in the short
context. To this end, we incorporate both corpus-based (bigrams) and knowledge-based (similar
words extracted from a dictionary) lexical semantics into the WMF model. The experiments show
that both sources of additional information are helpful and complementary to each other.
3. Similarity Computing for Large-Scale Data Sets: We tackle the short text similarity
problem in a large-scale setting, i.e., given a query tweet, compute the similarity/distance to all
other data points in a database, and rank them by similarity/distance score. To reduce the
computation time, we exploit binary coding to transform each data sample into a compact binary
code, which enables highly efficient similarity computations via Hamming distances between the
generated codes. In order to preserve as much of the original data as possible in the binary bits, we restrict
the projection directions to be nearly orthogonal, thereby reducing redundant information. The resulting
model demonstrates better performance on both the short text similarity task and a tweet retrieval task.
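What makes binary codes attractive is that the Hamming distance between two codes reduces to a single XOR followed by a population count. A minimal illustration (not the thesis implementation), packing each code's bits into an integer:

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two binary codes packed as integers."""
    # XOR leaves a 1 exactly where the two codes disagree; count those bits.
    return bin(a ^ b).count("1")

# Two 8-bit codes that disagree in two positions.
print(hamming_distance(0b10110010, 0b10011010))  # → 2
```

Ranking a query against a database then amounts to computing this distance against every stored code, which is far cheaper than real-valued cosine similarity.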
We are not only interested in the short text similarity task itself, but also concerned with
how much the model could contribute to other NLP tasks. Accordingly, we adapt the short text
similarity models to several NLP tasks closely associated with semantics, which involve intensive similarity
computation:
4. Text Summarization Evaluation: The pyramid method is one of the most popular methods
for evaluating content selection in summarization, but it requires manual inspection during
evaluation. Recently, some efforts have been made to automate the evaluation process: Harnly et al.
[2005] searched for key facts/concepts covered in the summaries based on surface word matching.
We apply the WMF model to this task to enable more accurate identification of key facts in summaries.
The resulting automated pyramid scores correlate very well with manual pyramid scores.
5. Word Sense Disambiguation: Unsupervised Word Sense Disambiguation (WSD) systems
rely heavily on a sense similarity module that returns a similarity score given two senses.
Currently the most popular sense similarity measure is Extended Lesk [Banerjee and Pedersen, 2003],
which calculates the similarity score based on the number of overlapping words and phrases between
two extended dictionary definitions. We propose a new sense similarity measure wmfvec by running
WMF on the sense definition data and integrating WordNet [Fellbaum, 1998] features. The WSD
system using wmfvec significantly outperforms traditional surface form based WSD algorithms as
well as LDA based systems.
6. Linking Tweets to News: In this task we target social media and news data. We
propose a new task of linking a tweet to the news article that is most relevant to it. The
motivation of the task is to augment the context of a tweet with a news article. We extend the WMF
model and incorporate multiple Twitter/news-specific features, i.e., hashtags, named entities and
timestamps, in the new model. Our experiments show significant improvements of the new model
over baselines on various evaluation metrics.
Table of Contents
List of Figures v
List of Tables viii
1 Introduction 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Related Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.2 STS Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.3 Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
I Dimension Reduction for Short Text Similarity 15
2 Enrich Short Text by Modeling Missing Words 16
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Limitations of LDA and LSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 The Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.1 Weighted Matrix Factorization . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.2 Modeling Missing Words . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4.1 Experiment setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3 Enrich Lexical Features by Modeling Bigrams and Similar Words 32
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3 Incorporating Bigrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.1 Incorporating Bigrams from Dependency Tree . . . . . . . . . . . . . . . 37
3.4 Incorporating Similar Word Pairs . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5.1 Experiment Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.6 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4 Binary Coding for Large Scale Similarity Computing 45
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.2 Binary Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.3 Applications in NLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 The Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.1 Binarized version of WMF . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.2 Removing Redundant Information . . . . . . . . . . . . . . . . . . . . . . 50
4.3.3 Implementation of Orthogonal Projections . . . . . . . . . . . . . . . . . . 53
4.4 Experiments on Twitter Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.1 Experiment Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.4.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.5 Experiments on STS Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.5.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.6 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
II Applications 64
5 Automated Pyramid Method for Summaries 65
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.3 A Scoring Approach based on Distributional Similarity . . . . . . . . . . . . . . . 70
5.3.1 A Student Summary Corpus . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3.2 Criteria for Automated Scoring of Student Summaries . . . . . . . . . . . 70
5.3.3 A Dynamic Programming Approach . . . . . . . . . . . . . . . . . . . . . 71
5.4 Experiments on Student Summaries . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4.1 Experiment Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.5 Experiments on TAC 2011 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.6 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6 Unsupervised Word Sense Disambiguation 78
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.3 A New Sense Similarity Measure – wmfvec . . . . . . . . . . . . . . . . . . . . . 82
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.4.1 Experiment Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.4.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.5 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
7 Linking Tweets to News 91
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.3 Searching Complementary texts via Twitter/News Features . . . . . . . . . . . . . 95
7.3.1 Hashtags and Named Entities . . . . . . . . . . . . . . . . . . . . . . . . 95
7.3.2 Temporal Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.3.3 Authorship . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.3.4 Creating Relations on News . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.4 WMF on Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.5.1 Experiment Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.6 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
III Conclusions 108
8 Conclusions 109
8.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
8.2 Limitations and Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
IV Bibliography 115
Bibliography 116
List of Figures
2.1 An example to illustrate why missing words should be helpful: the red dots are
observed words in the text; the green dots represent missing words; the black dot
denotes the hypothesis of the latent vector of the text data. After taking into consid-
eration the missing words, we will have a better estimation for the text, i.e., where
the black dot should be. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Matrix Factorization: the M ×N matrix X is factorized into two matrices, K ×M
matrix P and K ×N matrix Q; K denotes the number of latent dimensions. . . . . 23
2.3 Pearson’s Correlation percentage scores of WMF on each data set: the missing word
weight wm varies from 0.001 to 0.1; the dimension K is fixed to 100; regularization
factor λ is fixed to 20. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4 Pearson’s Correlation percentage scores of WMF and LDA on each data set: the
dimension K varies from 50 to 200; missing word weight wm is fixed to 0.01;
regularization factor λ is fixed to 20. . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1 In current dimension reduction models (WMF/LSA and LDA), the features to rep-
resent a word are simply document IDs, which are denoted by the red circles. . . . 34
3.2 Each bigram is integrated in the original corpus matrix X as an additional column.
From the model’s perspective, a bigram is treated as a pseudo-text; accordingly,
only two cells in a bigram column have non-zero values. . . . . . . . . . . . . . . 37
3.3 WMF+BK model (WMF + corpus-based [B]igram semantics + [K]nowledge-based
similar word pairs semantics): a w/d/b node represents a word/document/bigram,
respectively; the extra node in Figure 3.3c denotes w2 and w3 constitute a similar
word pair. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4 Pearson’s Correlation percentage scores of WMF-B (with corpus-based [B]igram
semantics alone) on each data set: corpus-based semantics weight γ is chosen from
{0, 1, 2}; the dimension K is 100; missing word weight wm is fixed as 0.01; regu-
larization factor λ is fixed as 20. . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5 Pearson’s Correlation percentage scores of WMF-K (with [K]nowledge-based simi-
lar word pairs semantics alone) on each data set: knowledge-based semantics weight
δ is chosen from {0, 10, 30, 50}; the dimension K is 100; missing word weight wm
is fixed as 0.01; regularization factor λ is fixed as 20. . . . . . . . . . . . . . . . . 44
4.1 Two views of the P matrix: K is the number of dimensions, and M is the number
of distinct words. The first view, the columns of P, is frequently used in the
WMF model (Algorithm 1). We now apply the second view, the rows of P,
which are projections, to improve the WMF model. . . . . . . . . . . . . . . . . 51
4.2 Three examples to illustrate the noisiness in the P matrix. In general, we would like
to remove as much noise as possible. . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3 Hamming ranking on tweet retrieval data set: precision curve under top 1000 re-
turned list of all 6 binary coding models, with dimension K = {64, 96, 128}. . . . 56
4.4 Hamming ranking on tweet retrieval data set: recall curve under top 100,000 re-
turned list of all 6 binary coding models, with dimension K = {64, 96, 128}. . . . 57
4.5 Impact of the missing word weight wm on the MP@1000 performance for OrMF
and WMF models: wm is chosen from 0.05 to 0.2; regularization factor λ is fixed
as 20. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.6 Pearson’s Correlation percentage scores of OrMF, WMF and LDA on each data set:
the dimension K varies from 50 to 200; missing word weight wm is fixed as 0.01;
regularization factor λ is fixed as 20. . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.1 The pipeline for pyramid method to evaluate student summaries: the first annotation
of pyramid method is to create pyramids from model summaries; the second anno-
tation is to find the SCUs in target summaries. After the procedure, we can score a
target summary based on how many SCUs it has. . . . . . . . . . . . . . . . . . . 67
5.2 Notation used for the 45 variants of automated pyramid methods. The 5 thresholds
correspond to inverse cumulative density function. . . . . . . . . . . . . . . . . . . 73
6.1 Unsupervised Graph-based Word Sense Disambiguation System: several sense nodes
are created for each word; the weights on edges are similarity scores between the
two senses; for simplicity, the edges between walk senses and friend sense are not
shown. The final decisions for the disambiguated words are the sense nodes that achieve
the maximum indegree values. . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.2 The WSD performance, measured by F-measure, of ldavec and wmfvec on each data
set. The latent dimension K varies from 50 to 150. . . . . . . . . . . . . . . . . . 87
7.1 The general framework for linking a tweet to its most relevant news article, by first
transforming the textual data into latent representation, and then choosing the one
with maximum cosine similarity score. . . . . . . . . . . . . . . . . . . . . . . . . 92
7.2 The tweet nodes t and news nodes n are connected by hashtags, named entities or
temporal edges. For simplicity, the missing tokens are not shown in the figure. All
the grey nodes are observed information, such as TF-IDF values, while white nodes
are latent vectors to be inferred. . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
7.3 Impact of the weight of links δ of model WMF-G on development set and test set
evaluated by three evaluation metrics: latent dimension K = 100, and neighbor
tweets number is k = 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.4 Impact of latent dimension K of model WMF-G on test set evaluated by three met-
rics: the neighbor tweet number is fixed k = 4. Dimension K varies from 50 to
150. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
List of Tables
1.1 Genres and data sources for each of the subset data for STS12/STS13/STS14 . . . 9
1.2 This table contains the word pairwise similarity. The content words in the first
sentence are cemetery, place, body, ash; the content words in the second sentence are
graveyard, area, land, sometime, near, church. Each cell stores the word similarity
value; the numbers in red denote the word pair alignment that maximizes the total
sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 List of features used in DKPro [Bär et al., 2013] . . . . . . . . . . . . . . . . . . . 12
2.1 Three possible latent vectors hypotheses for the text data, which is the WordNet
sense definition of bank#n#1: a financial institution that accepts deposits and
channels the money into lending activities. Assume there are only three topics in
the corpus: finance, sport, institution. Ro denotes the relatedness score between
the hypothesis with observed words; Rm denotes the relatedness score between the
hypothesis with missing words. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Pearson’s correlation (in percentage) on the four data sets: latent dimension K =
100 for LSA/LDA/WMF. For WMF models, the regularization factor λ is fixed as
20. Model 4-6 are WMF with different missing word weight wm, where the first
two models are analogous to LSA and LDA, respectively. . . . . . . . . . . . . . . 26
2.3 Pearson’s correlation (in percentage) on the four data sets: the models are trained
on long documents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1 Pearson’s correlation (in percentage) on the four data sets. Latent dimension K =
100 for LSA/LDA/WMF/WMF-BK. For matrix factorized based models, the regu-
larization factor λ is fixed as 20. Model 5 is WMF with bigram semantics alone;
model 6 is WMF with similar word pairs alone; model 7 is the final model with both
semantics incorporated. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1 Symbols used in binary coding. . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Mean precision among top 1000 returned list (MP@1000) on the tweet retrieval
data set. TF-IDF is the only system that does not use binary encoding, and serves
as the upper bound of the task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3 Pearson’s correlation (in percentage) on the data sets. Latent dimension K = 100
for LSA/LDA/WMF/OrMF. We use the real-valued vectors produced by OrMF for
short text similarity evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.1 An example of summary content unit created from 5 model summaries. The concept
has 4 contributors, all expressing the same meaning yet with different wording.
Accordingly this SCU has a weight of 4. . . . . . . . . . . . . . . . . . . . . . . 66
5.2 Five top performing variants out of 45 variants ranked by correlation scores, with
confidence interval and rank (P=Pearson’s, S=Spearman’s, K=Kendall’s tau) . . . . 74
5.3 SCU selection results: averaged recall, precision and F-measure over the 20 student
summaries, for each combination of similarity method and method of comparison
to the SCU (9 categories). The number in bracket is the standard deviation for
precision and recall. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.4 SCU selection results: averaged recall, precision and F-measure over the 20 student
summaries, for variants of the top five variants in Table 5.2. . . . . . . . . . . . . . 76
6.1 The statistics of annotated senses in the four WSD data sets, as well as the distribu-
tion per part-of-speech. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.2 The WSD performance measured by F-measure of 7 models on each data set, as
well as the performance per part-of-speech. For the models ldavec and wmfvec,
they are trained with the latent dimension K = 100. . . . . . . . . . . . . . . . . . 86
6.3 The similarity values of wmfvec and elesk in two examples. The first example is
the target word mouse in a biology context that contains words gene, cell, etc. The
second example is the word church in a context that involves stop, chat. . . . . . . 89
7.1 Performance for Linking-Tweets-to-News under three evaluation metrics (latent di-
mension K = 100 for LDA/WMF/WMF-G) . . . . . . . . . . . . . . . . . . . . . 101
7.2 Contribution of subgraphs of hashtag/named entity/temporal/author, when K =
100, k = 4, δ = 3, measured by gain over baseline WMF. . . . . . . . . . . . . . . 104
Acknowledgments
Writing a thesis is a great process to review not only my academic work, but also the journey I
took as a PhD student. Throughout the years I spent at Columbia, I have been fortunate to come
across so many brilliant researchers and genuine friends. It is the people I met who shaped who I
am today. My gratitude goes out to all of them. I would like to send my thanks in particular:
Above all, I would like to express my deepest gratitude to my supervisor, Professor Mona Diab, for
her great support over the past years. It is Mona who encouraged me to enter the challenging field of natural
language processing. Her passion and positive attitude will always be an inspiration to me. She
provided me with great opportunities to participate in many interesting research projects and to organize
different academic activities. This thesis would not exist without her patience and guidance.
And to Professor Kathy McKeown, my departmental advisor, who always has confidence in
me and encourages me to pursue a higher standard. To Owen Rambow, who patiently helped me
solve the issues in our projects. Their insightful guidance and sense of responsibility motivate me
toward becoming a professional researcher.
To my coauthors, Professor Heng Ji, Doctor Rebecca Passonneau and Doctor Smaranda Muresan.
I am very grateful for their generous help in addressing research issues and revising papers. I
also learned a lot from their serious attitude toward academia. To my intern mentor Rakesh Gupta,
who, together with Prof. Ji, Prof. McKeown, and my advisor Mona Diab, provided generous help during
my job search.
To my colleagues and friends at Columbia University, Apoorv Agarwal, Mohamed Altantawy,
Daniel Bauer, Or Biran, Hao Dang, Zhe He, Weiwei Jiang, Ahmed El Kholy, Heba Elfardy, Noura
Farra, Wei-Yun Ma, Vinodkumar Prabhakaran, Mohammad Sadegh Rasooli, Wael Salloum, Xiaorui
Sun, Kapil Thadani. I will never forget the surprise birthday parties, wonderful trips and crazy
deadlines I spent with you. To my dear friends, Jinai A, Ti-wei David Chen, Yiding Cheng, Pradeep
Dasigi, Mevlana Gemici, Jun Hu, Jianzhao Huang, Qiao Hui, Jia Liu, Peng Liu, Shih-hao Liao,
Yuan Ma, Misagh Mb, Ruiyang Wu, Yu Xie, Aya Zerikly, Fan Zhang, Hang Zhao, who shared the
excitement of studying abroad.
Also to Hao Li, Qi Li, Wei Liu, Junming Xu, Feng Song, Xiaoxiao Shi, who happened to pursue
their PhD degrees with me at the same time and left precious memories for me, including my best
“comrades” Boyi Xie and Leon Wu, who helped me deal with all sorts of things at CCLS.
To CCLS staff members, Daniel Alicea, Hatim Diab, Kathy Hickey, Idrija Ibrahimagic, Derrick
Lim, Axinia Radeva, who make it a big family to me.
My special appreciation goes to my parents Wenliang Guo, Xiaoying Li, and my girlfriend,
Wenhui Li, with whom I always share my good news and frustration. My parents always support me
to pursue what interests me. I could hardly have achieved anything without their unconditional
love. My girlfriend helps me in every aspect of my life. And my deepest thanks go to my grandpa
Jianzhong Li, who cares for me more than himself.
Finally, I would like to thank all committee members, Prof. Mona Diab, Prof. Kathy McKeown,
Dr. Owen Rambow, Dr. Smaranda Muresan, and Prof. David Blei, for attending my PhD thesis
defense, and all the staff in the Department of Computer Science at Columbia University.
To my parents
CHAPTER 1. INTRODUCTION 1
Chapter 1
Introduction
This thesis is dedicated to developing dimension reduction models. We especially focus on
addressing the short text similarity task as well as its applications to natural language
processing (NLP) tasks.
1.1 Overview
Recently, online communication has come to make up a large portion of social media content, especially
microblogs such as Facebook comments and Twitter data, which have gained tremendous popularity. The
latter has now become a major source containing first-story breaking news before it is reported
in traditional media. Because of this trend, a significant amount of NLP research focus has shifted
from large, lengthy documents to smaller texts such as sentences and utterances.
Identifying the degree of semantic similarity between two short texts is at the crux of many
NLP applications that address sentence level semantics. In Machine Translation [Kauchak and
Barzilay, 2006] and Text Summarization [Zhou et al., 2006], sentence similarity based metrics
have been applied to evaluate the closeness between yielded translation/summary and reference.
In Text Coherence Detection [Lapata and Barzilay, 2005], different sentences are linked by their
similarity scores. In Word Sense Disambiguation, Lesk [1986] measured the relatedness of two
senses by their definition similarity. Moreover, computing similarity between short text data is an
indispensable step in social media analysis research. Taking Twitter data as an example, such
research includes tweet recommendation [Yan et al., 2012], tweet retrieval [Ramage et al.,
2010], tweet paraphrase detection [Xu et al., 2014], event summarization [Shen et al., 2013] and
extraction [Ritter et al., 2012] on tweets, etc. (more details on the applications of short text similarity
can be found in Section 1.2.4).
Due to the relevance of the problem, this thesis presents a comprehensive study of computing
similarity between short texts. The task of Short Text Similarity (STS) requires systems to calculate
a score that reflects the similarity between a pair of sentences/short snippets of text; e.g., a score of
0.47 on a [0, 1] scale is given to the following sentence pair from [Li et al., 2006]:
• Cord is a strong, thick string.
• String is a thin rope made of twisted threads, used for tying things together or tying up parcels.
At first glance, the short text similarity (STS) task closely resembles sentence-level paraphrase
recognition. However, paraphrase recognition only provides a binary score for a short text
pair, whereas STS seeks a more nuanced, fine-grained continuous score. The continuous score makes
STS more applicable to NLP tasks, since in most cases two sentences are not exactly semantically
equivalent, and knowing the amount of overlapping information could be very helpful, e.g., any two
WordNet senses are not exactly the same, but the degree of similarity between senses is essential
for word sense disambiguation. Also, STS shares some commonality with Textual Entailment.
Intuitively, if text A entails text B, their similarity score should usually be high. Therefore, textual
similarity scores could be a useful feature for Textual Entailment. In Section 1.2.1, we highlight
the differences among these tasks.
Unlike similarity calculation for typical lengthy documents, extracting similarity for short
texts is very difficult due to the limited features observed in each data unit. Hence, the widely used
TF-IDF weighting on bag-of-words representations fails to capture the semantic relatedness of two sentences
unless they share many overlapping words. In our example above, typical measures of similarity will not
succeed since the two sentences share very few words overall.
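The failure mode is easy to see with a plain bag-of-words cosine similarity. The sketch below (illustrative only, not the thesis code) scores the cord/string pair from above:

```python
import math
import re

def bow_cosine(s1: str, s2: str) -> float:
    """Cosine similarity between binary bag-of-words vectors."""
    w1 = set(re.findall(r"[a-z]+", s1.lower()))
    w2 = set(re.findall(r"[a-z]+", s2.lower()))
    return len(w1 & w2) / math.sqrt(len(w1) * len(w2))

cord = "Cord is a strong, thick string."
rope = ("String is a thin rope made of twisted threads, "
        "used for tying things together or tying up parcels.")
# ≈ 0.30, and the overlap is mostly stopwords ("is", "a");
# the only shared content word is "string".
print(bow_cosine(cord, rope))
```

Despite the two definitions being closely related, surface overlap yields a low score, which is what motivates moving to a latent representation.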
Previous research on STS falls into two main thrusts. The first set of approaches works within
the high-dimensional word space exploiting lexical semantics techniques such as word similarity
measures, which are either corpus based [Islam and Inkpen, 2008], or knowledge based [Li et
al., 2006; Mihalcea et al., 2006; Tsatsaronis et al., 2010]. The majority of this work was introduced
within the context of early work on STS [Li et al., 2006; Mihalcea et al., 2006; Islam and Inkpen,
CHAPTER 1. INTRODUCTION 3
2008; Tsatsaronis et al., 2010]. The second set of approaches works within the low-dimensional
space, which is represented by dimension reduction techniques, such as Latent
Semantic Analysis (LSA) [Deerwester et al., 1990], Probabilistic Latent Semantic Analysis (PLSA)
[Hofmann, 1999], and Latent Dirichlet Allocation (LDA) [Blei et al., 2003]. Such techniques can
fully exploit word co-occurrence information by modeling the semantics of words and sentences
simultaneously in the low-dimensional latent space. However, early attempts at addressing STS
using LSA [Mihalcea et al., 2006; O’Shea et al., 2008], or LDA (experiments shown in [Guo and
Diab, 2012b]), performed significantly below high dimensional word similarity based models. (cf.
we present previous work on STS in section 1.2).
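To make the low-dimensional route concrete, here is a minimal, generic LSA sketch (illustrative only, not the model proposed in this thesis): a term-by-text count matrix is truncated via SVD, and texts are compared by cosine similarity in the latent space.

```python
import numpy as np

texts = ["the cat sat on the mat",
         "a kitten rests on a rug",
         "stock markets fell sharply today"]
vocab = sorted({w for t in texts for w in t.split()})

# Term-by-text count matrix: rows are words, columns are texts.
X = np.array([[t.split().count(w) for t in texts] for w in vocab], float)

# Truncated SVD: keep the top-k singular components.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dimensional vector per text

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# The two pet-related texts end up closer than the unrelated one.
print(cos(doc_vecs[0], doc_vecs[1]), cos(doc_vecs[0], doc_vecs[2]))
```

Note that with only around 10 observed words per text, such models receive very weak evidence for each latent vector, which is the deficiency the following chapters address.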
In the first part of the thesis, we introduce several of our approaches to improve STS perfor-
mance in a matrix decomposition framework. From lexical semantics based approaches, we ob-
serve that the key to inducing robust sentence similarity is to introduce additional information to
overcome the data sparseness issue (in the [Agirre et al., 2012] data set, on average only 10.8 words
exist in a short text snippet). In Chapter 2, we propose an unsupervised approach, Weighted Matrix
Factorization (WMF) [Guo and Diab, 2012b], that accounts for and explicitly models
“the missing words” for each short text. We hypothesize that the semantic profile of a sentence is
defined by both what is *in* the text as observed words and what is *not* in the text. Accordingly,
we define the missing words of a short text as all the vocabulary in a training corpus minus the
observed words in the short text. Modeling missing words in practice adds thousands more features
for a text; by contrast, other low-dimensional models such as LDA only leverage the
observed words (around 10) to infer a 100-dimension latent vector for a text.
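The definition of missing words is straightforward to state in code (the toy corpus below is invented for illustration):

```python
# Missing words of a text = corpus vocabulary minus the text's observed words.
corpus = ["the gem is a jewel or stone",
          "a jewel is a precious stone used in rings",
          "stock markets fell sharply today"]
vocab = {w for text in corpus for w in text.split()}

text = corpus[0]
observed = set(text.split())
missing = vocab - observed   # negative evidence: what the text is *not* about

print(len(vocab), len(observed), len(missing))
```

For a realistic corpus the vocabulary has tens of thousands of entries, so the missing-word set contributes thousands of additional (negative) features per text.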
In Chapter 3, we propose an approach for more robust modeling of the lexical items in short
texts [Guo and Diab, 2013], which further improves short text semantics. Explicitly modeling
lexical semantics nuances for each word in canonical dimension reduction algorithms has not drawn
much attention in the community, as these models are typically used for long documents, which in
turn have abundant word features to induce the document level semantics. However, in the short
text similarity setting, it is crucial to make good use of each word in the text, in order not to miss
salient topics represented in the short text. Accordingly, we explicitly encode lexical semantics,
derived from both corpus-based and knowledge-based information, in the weighted matrix factor-
ization (WMF) model. The experiments illustrate that these new models achieve even better short text
similarity scores.
Moreover, given the massive flow of Twitter data online, we note the need to process such
large collections of data efficiently for several NLP applications on Twitter, such as first story
detection. We exploit binary coding to tackle the scalability issue: each data sample is compressed
into a compact binary code, which enables highly efficient similarity computation via Hamming
distances between the generated codes. One obvious side effect of using binary bits is that much
nuanced information is lost. To alleviate this issue, we convert the WMF model into a binarized
version, and force the projection directions in the model to be nearly orthogonal, reducing the
redundant information in the resulting binary bits. Our proposed technique finds the most similar
tweets given a query tweet in the large scale Twitter data scenario. Also, experiments on STS data
sets show its superiority over previous models. More details can be found in Chapter 4.
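The binary-coding idea can be sketched as follows (an illustrative setup with random vectors, not the exact model of Chapter 4): real-valued latent vectors are binarized by sign, and neighbors are ranked by Hamming distance between the resulting codes.

```python
import numpy as np

rng = np.random.default_rng(0)
latent = rng.standard_normal((1000, 64))   # latent vectors for 1000 tweets (toy)
codes = (latent > 0).astype(np.uint8)      # sign-binarized 64-bit codes

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return int(np.count_nonzero(a != b))

query = codes[0]
dists = np.array([hamming(query, c) for c in codes])
nearest = np.argsort(dists)[:5]            # indices of the 5 most similar tweets
print(nearest[:3])
```

In production systems the codes are packed into machine words so each Hamming distance reduces to an XOR followed by a population count, which is why the scheme scales to very large collections.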
We show the efficacy of our proposed models not only intrinsically on the short text similar-
ity task, but also extrinsically on several applications for various NLP tasks, which is the focus of
the second part of the thesis. From Chapter 5 to Chapter 7, we show how our matrix factorization mod-
els are very powerful at extracting quality latent semantics of text at different levels of granularity
(phrases/sentences/short texts) that address specific applications.
The first application is automatic pyramid evaluation for text summarization [Passonneau et
al., 2013]. One key component in pyramid evaluation is to identify whether a summary covers
an important concept in the original documents. In traditional pyramid evaluation [Nenkova and
Passonneau, 2004], this is done manually by searching for a word sequence in the summary that covers
a key concept, which can be expensive. To automate this process, we extract all the
phrase-level ngrams, convert them into low dimensional vectors, and apply a dynamic programming
approach to automatically match concepts. Our approach is shown to correlate better with manual
scores than two string matching baselines on a student summary assessment task. With further
experiments, we find the new approach can extract concepts with higher precision and recall.
We also investigate the impact of distributional similarity in an unsupervised word sense dis-
ambiguation setting in Chapter 6. Traditional sense similarity measures compute sense similarity
by counting overlapping words and phrases [Lesk, 1986; Banerjee and Pedersen, 2003]. However,
surface word matching in the sparse word space does not reveal the true semantic relatedness of two
senses, especially since sense definitions are usually short. To obtain meaningful
sense similarity values, we convert sense definitions to low dimensional dense vectors. We fur-
ther construct a more powerful sense similarity measure, wmfvec, using WordNet defined relations.
The WSD system using wmfvec significantly outperforms surface word based WSD systems and LDA based
algorithms.
In Chapter 7, we apply our proposed STS techniques to social network Twitter data [Guo et al.,
2013]. The short nature of tweets poses a big challenge for NLP tools to extract useful information
from the data. To enable NLP tools to better understand Twitter feeds, we propose the task of
linking a tweet to a relevant news article, in effect augmenting the context of the tweet. We develop
a new model that is able to capture tweet/news relatedness in the data. Our model utilizes
tweet specific features (e.g., hashtags) and news specific features (e.g., named entities) as well as
temporal constraints to find news articles that are on the same topic as (and hence complementary to) a
specific tweet. We crawl a data set of tweet-news pairs, and the new model significantly outperforms
the baselines on three different evaluation metrics on this data set.
1.2 Related Work
In this section, we summarize the previous work focusing on the task of short text similarity. The
related work of applying short text similarity for other NLP tasks, e.g., word sense disambiguation
and pyramid evaluation for text summarization, will be presented in each application chapter.
We first review two related tasks, Textual Entailment and Paraphrase Recognition, comparing their
differences and commonalities with the STS task. Then we briefly introduce
the unsupervised and supervised approaches for this problem, as well as the recent development of
the STS data sets. Finally, we briefly summarize the applications of STS.
1.2.1 Related Tasks
Paraphrase Recognition (sentence level) is a task closely related to STS. In [Dolan et al., 2004],
Paraphrase Recognition is defined as identifying two texts “which are more or less semantically
equivalent”, but which may differ in syntactic structure or in the amount of shared details. Under this
definition, the two tasks closely resemble one another: if two texts are paraphrases, then their semantic
similarity score should be very high. One distinct difference is when one text is the negation of the
other: then they are not paraphrases, yet in STS they still have a relatively high similarity score.
This is partly caused by the design nature of the two tasks. STS aims at reflecting the degree of
information overlap and hence has a continuous score, whereas the Paraphrase Recognition score is
a binary value focusing on exact semantic equivalence. Paraphrase Recognition has a very strict
constraint on the positive label; for example, the meanings of the following two sentences are very close, but their
label is negative (not paraphrases):
• Ballmer has been vocal in the past warning that Linux is a threat to Microsoft.
• In the memo, Ballmer reiterated the open-source threat to Microsoft.
Due to this characteristic, supervised models are much more popular in the task of paraphrase
recognition, yielding significantly better results, since supervised models can exploit human
designed features that are highly discriminative for the task. In contrast, in the STS task, unsuper-
vised approaches are able to achieve comparable performance.1 Meanwhile, it is worth noting that
the current best performing systems on the Microsoft Paraphrase corpus [Dolan et al., 2004] are
dimension reduction models plus manually designed features in a supervised setting: Socher et al.
[2011a] applied a recursive neural network model to the task, achieving an accuracy of 76.8
on the Microsoft paraphrase corpus; Ji and Eisenstein [2013] developed a discriminative dimension
reduction model with 80.41 accuracy on the same data set.
Textual Entailment is defined as the directional relationship between a text T (text) and a second
text H (hypothesis) where T entails H (T ⇒ H) “if the meaning of H can be inferred from the
meaning of T, as would typically be interpreted by people” [Dagan et al., 2006]. Intuitively, if T
entails H, T and H are often highly similar. However, sometimes H can be logically inferred from
T while the similarity value between T and H is not very high. Textual Entailment differs from
STS in two respects: (1) Textual Entailment is directional: T entails H, but the opposite does
not necessarily hold, i.e., H need not entail T; (2) similar to Paraphrase Recognition, Textual Entailment outputs
a binary decision while STS is defined in a graded continuous space. Based on these observations, we
can conclude that STS scores could be a very helpful feature for the Textual Entailment task.

1 Results for the *SEM 2013 STS shared task are available at http://ixa2.si.ehu.es/sts/index.php;
the DEFT system, an unsupervised system based on our approaches, was ranked 3rd among 89 runs from 34 STS systems,
which include many supervised methods.
1.2.2 STS Datasets
LI06: The most popular data set before 2012 is LI06 [Li et al., 2006]. The LI06 data set consists of
65 pairs of noun definitions selected from the Collins Cobuild Dictionary [Sinclair, 2001]. A subset
of 30 pairs is further selected by Li et al. to render the similarity scores evenly distributed. Each pair
is associated with a continuous score from 0 to 1, which is the average judgment of 32 human
annotators. For example, a score of 0.65 is assigned to the following pair:
• A gem is a jewel or stone that is used in jewelry.
• A jewel is a precious stone used to decorate valuable things that you wear, such as rings or
necklaces.
Typically in the literature, Pearson’s correlation coefficient or Spearman’s rank correlation coefficient
between an STS system’s output and the groundtruth similarity scores is used to evaluate the performance
of an STS system. While this is an ideal data set for evaluating STS, its small size makes it impossible
to tune STS algorithms or derive significant performance conclusions.
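The Pearson-based evaluation protocol can be sketched in a few lines (the gold and system scores below are invented for illustration):

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

gold   = [0.65, 0.10, 0.90, 0.40, 0.75]   # annotator judgments in [0, 1]
system = [0.70, 0.20, 0.80, 0.35, 0.60]   # hypothetical system output
print(round(pearson(gold, system), 4))
```

Spearman's coefficient is computed the same way after replacing each score with its rank, which makes it sensitive only to the ordering of pairs rather than the exact values.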
LEE05: A less popular data set was developed by Lee et al. [2005], comprising 50 short texts drawn from
newspaper articles in the political domain. Every two texts among the 50 constitute a pair,
resulting in 1225 pairs in total. Each pair is annotated with a similarity score on a discrete 1-5 scale.
The main reason this data set has drawn less attention in the NLP community might be that
the text units are relatively long. An example from this data set is shown below:
• Beijing has abruptly withdrawn a new car registration system after drivers demonstrated “an
unhealthy fixation” with symbols of Western military and industrial strength - such as FBI and
007. Senior officials have been infuriated by a popular demonstration of interest in American
institutions such as the FBI. Particularly galling was one man’s choice of TMD, which stands
for Theatre Missile Defense, a US-designed missile system that is regularly vilified by Chinese
propaganda channels.
• The Russian defense minister said residents shouldn’t feel threatened by the growing number
of Chinese workers seeking employment in the country’s sparsely populated Far Eastern and
Siberian regions. There are no exact figures for the number of Chinese working in Russia,
but estimates range from 200,000 to as many as 5 million. Most are in the Russian Far
East, where they arrive with legitimate work visas to do seasonal work on Russia’s low-tech,
labor-intensive farms.
The length of a short text in LEE05 ranges from 45 to 126 words. Many NLP tasks deal with a much smaller
context. For example, in word sense disambiguation, a crucial component is sense similarity cal-
culation between sense definitions, each with around 15 words. Meanwhile, for NLP tasks that
process large contexts, operating in the surface word space can achieve reasonably good results,
hence obfuscating the need to pay special attention to incorporating lexical semantics explicitly.
MSR04: Another data set widely used to evaluate STS models is the Microsoft Paraphrase Corpus
(MSR04) [Dolan et al., 2004]. The MSR04 data set comprises a larger set of sentence pairs: 4,076
training and 1,725 test pairs, taken from web news sources. The data set originally targeted
the task of Paraphrase Recognition, and accordingly is not accompanied by continuous scores. The
paraphrase ratings are binary labels: similar/not similar. This is not a problem per se; however, the
issue is that it is very conservative in its assignment of a positive label (similar). For example, the
following sentence pair, as cited in [Islam and Inkpen, 2008], is rated as not semantically equivalent:
• Ballmer has been vocal in the past warning that Linux is a threat to Microsoft.
• In the memo, Ballmer reiterated the open-source threat to Microsoft.
Since the labels are binary, apart from accuracy, F-measure is also used to evaluate performance in
terms of precision and recall of positive examples.
STS12, STS13 and STS14: In the Semantic Textual Similarity task, SemEval 2012 Task 6 [Agirre et
al., 2012] (STS12), a large collection of sentence pairs (a training set of 2,234 pairs and a test set of
3,150 pairs) is annotated with graded similarity scores in the range [0, 5]. The scale is inspired
by the annotation schema of LI06.
STS12/STS13/STS14 include sentence pairs from very different data genres. The sentence pairs of
STS12 cover dictionary sense definitions from WordNet and OntoNotes [Hovy et al., 2006], machine
data set    size (pairs)    description
STS12 train    750    msr-par: Microsoft Research Paraphrase Corpus [Dolan et al., 2004]
    750    msr-vid: Microsoft Research Video Description Corpus [Chen and Dolan, 2011]
    734    smt-eur: shared task of the 2007 ACL Workshop on Statistical Machine Translation [Callison-Burch et al., 2007]
STS12 test    750    msr-par: same as STS12 train
    750    msr-vid: same as STS12 train
    459    smt-eur: same as STS12 train
    399    smt-news: news conversation sentence pairs from the Workshop on Machine Translation [Callison-Burch et al., 2008]
    750    on-wn: pairs of sentences where the first sentence is an OntoNotes [Hovy et al., 2006] gloss and the second sentence is a WordNet gloss
STS13    750    headlines: news headlines mined from several news sources by European Media Monitor [Clive et al., 2005] leveraging RSS feeds
    189    fn-wn: pairs of sentences where the first sentence is a FrameNet [Baker et al., 1998] gloss and the second sentence is a WordNet gloss
    561    on-wn: same as STS12 test
    750    smt: an SMT dataset derived from the DARPA GALE HTER and HyTER datasets, where one sentence is an MT output and the other is a reference translation
STS14    750    headlines: same as STS13
    750    on-wn: same as STS12 test
    450    deft-forum: a subset of discussion forum data in the DARPA DEFT data collection
    300    deft-news: a subset of news article data in the DARPA DEFT data collection
    750    images: a subset of the Image Descriptions data set from PASCAL VOC-2008 [Rashtchian et al., 2010]
    750    tweet-news: a subset of the Linking-Tweets-to-News short text pairs [Guo et al., 2013], where the first text snippet is from tweets and the second text snippet is from news headlines
Table 1.1: Genres and data sources for each of the subset data sets in STS12/STS13/STS14
translation output and reference pairs from the translation shared tasks of the 2007 and 2008 ACL
Workshops on Statistical Machine Translation [Callison-Burch et al., 2007; Callison-Burch et al.,
2008], some pairs from a video paraphrase corpus [Chen and Dolan, 2011], and the existing news
paraphrase data set, the Microsoft Paraphrase Corpus. Later, in the *SEM 2013 Shared Task [Agirre
et al., 2013] (STS13), a similar test set of 2,250 pairs was developed under the same annotation
guidelines, containing the new genres of news headline pairs gathered by the Europe Media Monitor
engine [Clive et al., 2005] and FrameNet [Baker et al., 1998] gloss to WordNet gloss pairs. In SemEval
2014 Task 10 [Agirre et al., 2014] (STS14), a test set of 3,750 English pairs was released. The new
genres in STS14 are DEFT forum data, image descriptions [Rashtchian et al., 2010], and tweets/news
pairs [Guo et al., 2013]. For the first time, the organizers also developed a test set of 804 Spanish
pairs. A brief description of the genres of STS12/STS13/STS14 is presented in Table 1.1.
The development of large training data in STS12/STS13/STS14 has had a very big impact,
enabling supervised learning on the similarity scores. Because of the large data size and non-binary
similarity scores, these three data sets are highly beneficial for future work. In our evaluation, we
conduct experiments on these three data sets.
1.2.3 Approaches
We can see a clear correlation between supervised/unsupervised methods and the data sets they are eval-
uated on. Early work on STS is mostly unsupervised and evaluated on small data sets such
as LI06 or MSR04. The most recent work benefits from the development of the large data sets
STS12/STS13/STS14, and thus supervised approaches are extensively adopted.
Unsupervised Approaches: STS enjoys a close relationship to lexical semantics, as the STS task was
first introduced by the lexical semantics community. Accordingly, early work on short text similarity
[Li et al., 2006; Mihalcea et al., 2006; Islam and Inkpen, 2008; Tsatsaronis et al., 2010] focuses on
leveraging lexical semantics techniques to discover the similarity between different words within
the two sentences and thereby determine whether the sentences are related.
The general framework of these works is to: (1) first decompose the short text similarity problem
into word similarity problems; (2) then calculate the overall textual similarity by summing up some
of the word similarity values with normalization.
Specifically, the lexical semantics techniques are sense/word similarity measures, which are
graveyard area land sometime near church
cemetery 0.505 0.010 0.195 0.162 0.297 0.449
place 0.018 0.248 0.204 0.083 0.017 0.011
body 0.242 0 0.039 0.071 0.032 0.044
ash 0.416 0.041 0.134 0.133 0.225 0.124
Table 1.2: Word pairwise similarities. The content words in the first sentence
are cemetery, place, body, ash; the content words in the second sentence are graveyard, area, land,
sometime, near, church. Each cell stores a word similarity value; the numbers in red denote the
word pair alignment that maximizes the total sum
knowledge-based [Li et al., 2006; Feng et al., 2008; Ho et al., 2010; Tsatsaronis et al., 2010],
corpus-based [Islam and Inkpen, 2008] or hybrid [Mihalcea et al., 2006]. Most knowledge-based
word similarity measures rely on machine readable dictionaries, of which the most widely used is
WordNet [Fellbaum, 1998], where the graph structure of the taxonomy is the main resource for com-
puting word similarity. In the corpus-based approach [Islam and Inkpen, 2008], word similarity
is computed based on mutual information between words in a corpus. Ho et al. [2010] went be-
yond the word token and transformed the sentence into a sense representation after performing word
sense disambiguation, so the second step is replaced by summing up sense similarity scores. They
achieved better, though not statistically significant, performance on the LI06 data set.
In terms of the second step, calculating the overall short text similarity, we present some
representative methods. Mihalcea et al. [2006] calculated the text similarity as the sum of word
similarities normalized by inverse document frequency (IDF) values:
sim(T_1, T_2) = \frac{1}{2} \left( \frac{\sum_{w \in T_1} maxSim(w, T_2) \cdot idf(w)}{\sum_{w \in T_1} idf(w)} + \frac{\sum_{w \in T_2} maxSim(w, T_1) \cdot idf(w)}{\sum_{w \in T_2} idf(w)} \right)

where maxSim(w, T) is the highest similarity between w and any word in T. Instead of choosing the maximum similarity value for a word, Islam and Inkpen [2008] searched for an
alignment between words in two texts, and then computed the sum of the similarity of the aligned
word pairs. The aligned word pairs are chosen to maximize the sum. An example of such an align-
ment from their paper [Islam and Inkpen, 2008] is illustrated in Table 1.2; accordingly, the textual
similarity is the sum of the similarity scores of these four aligned word pairs.
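The maximization step can be recomputed by brute force over all assignments of first-sentence words to distinct second-sentence words (a sketch of the objective over the values in Table 1.2, not Islam and Inkpen's actual search algorithm):

```python
from itertools import permutations

rows = ["cemetery", "place", "body", "ash"]
cols = ["graveyard", "area", "land", "sometime", "near", "church"]
# Word pairwise similarities, as in Table 1.2 (rows x cols).
sim = [[0.505, 0.010, 0.195, 0.162, 0.297, 0.449],
       [0.018, 0.248, 0.204, 0.083, 0.017, 0.011],
       [0.242, 0.000, 0.039, 0.071, 0.032, 0.044],
       [0.416, 0.041, 0.134, 0.133, 0.225, 0.124]]

# Each permutation assigns one distinct column (word) to each row (word);
# keep the assignment with the largest total similarity.
best = max(permutations(range(len(cols)), len(rows)),
           key=lambda p: sum(sim[i][j] for i, j in enumerate(p)))
score = sum(sim[i][j] for i, j in enumerate(best))
for i, j in enumerate(best):
    print(rows[i], "->", cols[j], sim[i][j])
print(round(score, 3))
```

Brute force is fine at this scale (360 candidate assignments); for longer sentences a polynomial-time assignment algorithm such as the Hungarian method would be used instead.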
The second set of approaches works within the low-dimensional space. Dimension reduction
techniques, such as LSA/LDA, can fully exploit word co-occurrence information and subsequently
map the short texts to low dimensional dense vectors. Textual similarity is computed as the cosine
similarity between two vectors. However, early attempts at addressing STS using LSA [Mihalcea
et al., 2006; O’Shea et al., 2008], or LDA (experiments shown in [Guo and Diab, 2012b]), are
significantly outperformed by lexical semantics based models. Recently, many supervised methods
directly use the similarity scores returned by LSA or LDA as features; however, there has been very
little effort on improving the dimension reduction models themselves.
Supervised Approaches: The development of the large scale dataset STS12 makes supervised
systems for short text similarity possible. A supervised system is able to combine NLP features
from different aspects and train a regression model on these features to better approximate the
groundtruth similarity scores. To create features, a common technique researchers adopt is stack-
ing, which is to train a model to combine the predictions of several other learning algorithms.
This is evident in many competitive supervised systems [Bar et al., 2013; Severyn et al., 2013;
Han et al., 2013]. Table 1.3 shows a list of such features used in the system DKpro [Bar et al.,
2013].
features    description
string similarity    the number of overlapping ngram characters
pairwise word similarity    similar to the approach of [Mihalcea et al., 2006]
vector space model    the similarity value returned by LSA
syntactic similarity    overlap of POS ngrams
stylistic similarity    a measure which compares function word frequencies
phonetic similarity    pairwise phonetic comparisons of words
Table 1.3: List of features used in DKpro [Bar et al., 2013]
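The stacking technique described above can be sketched in a toy form (the base predictors and data below are invented for illustration): several base similarity scores are computed per sentence pair, and a linear regression learns how to combine them.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented per-pair predictions from three base similarity systems
# (e.g., string overlap, LSA cosine, word-similarity alignment).
n = 40
base_preds = rng.random((n, 3))
# Synthetic gold scores: a noisy combination of the base predictors.
gold = base_preds @ np.array([0.5, 0.3, 0.2]) + 0.05 * rng.standard_normal(n)

# Stacking: least-squares regression over the base predictions plus a bias.
A = np.hstack([base_preds, np.ones((n, 1))])
w, *_ = np.linalg.lstsq(A, gold, rcond=None)
stacked = A @ w

def mse(p):
    return float(np.mean((p - gold) ** 2))

# The combined predictor tracks gold at least as well as any single one.
print(mse(stacked), min(mse(base_preds[:, i]) for i in range(3)))
```

Real systems such as DKpro use a regularized log-linear regressor and far richer features, but the combination principle is the same.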
Apart from that, the supervised labels enable exploiting another interesting category of features,
namely, sentential structural features. Usually a sentence contains a large number of structural
features, not all of which are relevant; given supervised labels, the useful ones can typically be
identified more easily.
In the following, we elaborate on some of the features and techniques used in supervised ap-
proaches by reviewing several successful systems.
DKpro [Bar et al., 2013] is an example of a stacking system, and is the best performing system
among the SemEval 2012 Task participants. DKpro achieves a Pearson’s correlation of 0.8239 on
the STS12 test data set. They used simple surface lexical features, such as character and word ngrams
and common letter subsequences, combined with complex features such as LSA latent vectors and word
similarity scores. Also, to alleviate the word sparsity problem, they employed lexical substitution
and machine translation to obtain more lexical features. All these features are subsequently fed into
a log-linear regression model.
UMBC EBIQUITY [Han et al., 2013] is the best performing participant system on STS13 (with a
weighted Pearson’s correlation of 0.6181). Han et al. trained a support vector regression model
using features such as lexical semantics with WordNet, word ngrams, word alignment between the
two sentences, and stacking with tree kernel similarity.
Severyn et al. [2013] were the first to incorporate structural features in this task. They converted
the input texts to syntactic trees and relied on tree kernels to learn relevant features. Several syn-
tactic tree representations are combined in a tree kernel, while other features (such as string
similarity and word pair similarity) are incorporated in a stacking fashion. Together with domain adap-
tation, their model achieved a state-of-the-art Pearson’s correlation of 0.8810 on STS12.
Another interesting structural feature is probabilistic soft logic [Kimmig et al., 2012; Bach et al.,
2013], which is applied to STS by Beltagy et al. [2014]. The benefits of using probabilistic soft logic
are: (1) it allows fast inference; (2) it is designed for computing similarity between complex structured
objects; (3) compared to tree kernels, the logic representation captures more direct semantics. Their
model is evaluated on the msr-vid and msr-par data sets in STS12, receiving lower Pearson’s
scores than DKpro [Bar et al., 2013]: 0.83 on msr-vid and 0.49 on msr-par, compared to 0.87 and
0.68 for DKpro.
1.2.4 Applications
STS is a core component in many sentential semantics based NLP tasks, and hence it is applied in a
wide range of tasks. In Text Coherence Detection [Lapata and Barzilay, 2005], similarity between
adjacent sentences is calculated to measure the local coherence of machine-generated texts. In
unsupervised Word Sense Disambiguation, sense relatedness plays a crucial role in disambiguating
senses. Lesk [1986] measured the relatedness of senses by the similarity of two sense definitions,
counting the number of overlapping words/phrases between the two sense definition sentences. We
acquire a more accurate sense similarity by projecting a definition sentence into a latent vector,
where the sense similarity is the cosine similarity of the two latent vectors [Guo and Diab, 2012a].
In automated pyramid evaluation for text summarization [Passonneau et al., 2013], phrase similarity
is employed to identify the same concepts appearing in the model summary and submitted summaries.
Moreover, computing similarity between tweets is a common step in Twitter related research.
In tweet clustering [Jin et al., 2011], extensive pairwise tweet similarity is computed during clus-
tering. To overcome the word sparsity problem, URLs present in the tweets are used to augment
the tweet data, impacting performance significantly and boosting the clustering purity score from 0.280
to 0.392. In tweet recommendation [Yan et al., 2012] and tweet retrieval [Huang et al., 2012], the
tweets most relevant to a given tweet or keywords are identified based on similarity scores.
In tweet paraphrase detection [Xu et al., 2014], tweet pairwise similarity is a strong unsupervised
baseline. In event summarization [Shen et al., 2013], a hybrid TF-IDF approach is used to extract
representative tweets.
Part I
Dimension Reduction for Short Text
Similarity
Chapter 2
Enrich Short Text by Modeling Missing
Words
To date, most of the NLP community has focused on document level similarity, where abundant words
exist in a document and thus accurate similarity scores can be obtained simply by cosine similarity
in the original word space. With the pervasive presence of social media such as Twitter feeds and
SMS, the notion of a document has changed from hundreds of words to simple utterances or
sentences, rendering the need for computing meaningful similarity scores for short text snippets.
However, due to the small context of these short texts, the cosine similarity approach in the original
word space fails to identify many semantically relevant pairs: most text pairs have a cosine similarity
of 0, because of the few common words between them (even though they may be semantically
relevant). In this chapter, we present our first attempt to solve this problem.
We believe that the bottleneck is that the explicitly available features (the observed
high dimensional words in the short text) that represent such short text data are far too few.
Thereby, we focus our efforts on augmenting these explicit features with other features, namely,
modeling the missing words of the short text. The missing words of a text are defined as the total vo-
cabulary in the collection excluding the words that are present in the text. Our intuition
behind explicitly modeling missing words is that the missing words serve as negative examples
telling us what the text is not about. Together with the observed words in the text, the missing words
complete the full semantic map of the utterance being modeled. Explicitly modeling missing words
in practice adds thousands more features for each text, which leads to robust modeling
of the short text data.
2.1 Introduction
The challenge of the short text similarity (STS) problem lies in the sparsity of features present in
the text data. In the data set released by Agirre et al. [2012], on average there are only 10.8 words in
each text snippet. Such a small number of words typically results in very few overlapping words in
a short text pair, yielding a cosine similarity score of 0 for most short text pairs and ignoring many
cases where the two texts are indeed highly semantically related.
One natural solution is to leverage dimension reduction models, such as Latent Semantic Anal-
ysis (LSA) [Deerwester et al., 1990], Probabilistic Latent Semantic Analysis (PLSA) [Hofmann,
1999] or Latent Dirichlet Allocation (LDA) [Blei et al., 2003], to extract a low dimensional rep-
resentation for each short text, on which meaningful cosine similarity scores can be calculated.
However, previous attempts at addressing the short text similarity task using LSA performed signif-
icantly below high dimensional word similarity based models [Mihalcea et al., 2006; O’Shea et al.,
2008]. When topic models are applied to short text data, we observe that only one dominant topic
can be extracted. The reason, again, is that there are very few observed words in a text: it is very hard
for a topic model to learn a K-dimensional vector from only around 10 words.
We believe that the dimension reduction approaches applied to date have not yielded positive
results due to deficient modeling of the sparsity in the semantic space. In this thesis, we propose
to model the missing words (words that are not observed in the text data), a feature that is typically
overlooked in the text modeling literature, to address the sparseness issue for the short text similarity
task. We define the missing words of a text as the whole vocabulary in a corpus minus the observed
words in the text. Our intuition is that since the observed words in a short text are too few to tell us what
the text is about, the missing words can be used to tell us what the text is not about. We want to use
the missing words as negative examples to guide us in finding the optimal semantic hypothesis for
a text. Our idea is illustrated in Figure 2.1.
CHAPTER 2. ENRICH SHORT TEXT BY MODELING MISSING WORDS 18
After analyzing the way traditional dimension reduction models (LSA/PLSA/LDA) handle missing words, we decide to model the data using a weighted matrix factorization approach [Srebro and Jaakkola, 2003], which allows us to treat observed words and missing words differently. We handle missing words using a weighting scheme that distinguishes missing words from observed words, yielding robust latent vectors for short texts.

Figure 2.1: An example to illustrate why missing words should be helpful: the red dots are observed words in the text; the green dots represent missing words; the black dot denotes the hypothesis of the latent vector of the text data. (a) When only observed words are explicitly taken into account, the text node lies at the center of the observed words and is therefore close to all the missing words. (b) After missing words are explicitly taken into account, the text node's position is adjusted so that it is also away from the missing words. After taking the missing words into consideration, we obtain a better estimate of where the black dot should be.
The properties of our model are: (1) it is an unsupervised approach that requires no annotated labels; (2) it is a simple model that exploits only bag-of-words features for short texts (exactly
the same information LSA/LDA uses); (3) since we use the missing word feature, which is already
implied by the text itself, our approach is very general (similar to LSA/LDA) in that it can be applied
to any format of short texts. In contrast, existing work on modeling short texts focuses on exploiting
additional data, e.g., Ramage et al. [2010] modeled tweets using their metadata (author, hashtag,
etc.).
2.2 Limitations of LDA and LSA
Usually dimension reduction models aim to find a latent semantic profile for a text that is most
relevant to the observed words. By explicitly modeling missing words, we set another criterion
to the latent semantic profile: it should not be associated with the missing words from the text.
Intuitively, missing words are not as informative as observed words, but they bear on the overall semantic picture for textual data, as they inform us what the text is not about. Therefore there is a need for a model that represents this information well: the missing words are relevant, but they must be modeled with the right level of emphasis/impact.
LSA and PLSA/LDA work on a word-document co-occurrence matrix (in our context, each
short text is considered a document). Given a corpus, the rows of the matrix are the M unique words in the corpus, and the N columns are the document IDs. The resulting M × N co-occurrence matrix X contains a TF-IDF value in each cell Xij, namely the TF-IDF value of word wi in document dj. All zero cells (Xij = 0) are missing words.
Topic models (PLSA/LDA) do not explicitly model missing words. PLSA assumes each document has a distribution over K topics P(zk|dj), k = 1, 2, ..., K, j = 1, 2, ..., N, and each topic has a distribution over the whole vocabulary of the corpus P(wi|zk), i = 1, 2, ..., M. Therefore, PLSA finds
a topic distribution for each document that maximizes the log likelihood of the corpus X (LDA has
a similar form):

\[ \sum_{i}\sum_{j} X_{ij} \log \sum_{k} P(z_k \mid d_j)\, P(w_i \mid z_k) \tag{2.1} \]
In this formulation, missing words do not contribute to the estimation of document semantics, i.e.,
excluding missing words (Xij = 0) in equation 2.1 does not make a difference.
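To make this concrete, the following toy computation (all numbers are invented for illustration) verifies that cells with Xij = 0 contribute nothing to the log likelihood in equation 2.1:

```python
import math

# Toy corpus: 3 words, 2 documents, 2 topics (all values hypothetical).
X = [[2.0, 0.0],   # X[i][j]: TF-IDF of word i in document j; 0.0 marks a missing word
     [0.0, 1.5],
     [1.0, 0.0]]
P_z_d = [[0.9, 0.1],   # P(z_k | d_j), one row per document j
         [0.2, 0.8]]
P_w_z = [[0.5, 0.1],   # P(w_i | z_k), one row per word i, topics as columns
         [0.2, 0.6],
         [0.3, 0.3]]

def log_likelihood(X, P_z_d, P_w_z, skip_zeros=False):
    ll = 0.0
    for i, row in enumerate(X):
        for j, x_ij in enumerate(row):
            if skip_zeros and x_ij == 0.0:
                continue
            p = sum(P_z_d[j][k] * P_w_z[i][k] for k in range(2))
            ll += x_ij * math.log(p)
    return ll

# Cells with X_ij = 0 contribute 0 * log(p) = 0, so the two sums are identical.
assert log_likelihood(X, P_z_d, P_w_z) == log_likelihood(X, P_z_d, P_w_z, skip_zeros=True)
```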
However, empirical results show that given a small number of observed words in a document,
usually topic models can only find one dominant topic (the most evident topic) for a document,
e.g., the concept definitions of bank#n#1 and stock#n#1 are both assigned the financial topic alone, without any further discernibility. As a result, many documents are assigned exactly the same semantic profile as long as they pertain to the same domain/topic. Therefore, any two documents in the same topic will have a cosine similarity of 1; otherwise the cosine similarity is 0. This is not a desirable feature, since applications need distinguishable similarity scores. The reason for extracting only the dominant topic is that these topic models try to learn
a 100-dimension latent vector (assume dimension K = 100) from very few features (10 observed
words on average). It would be desirable if topic models could exploit missing words (a lot more data than observed words) to render more nuanced latent semantics, so that pairs of documents in the same domain can be differentiated.
On the other hand, LSA explicitly models missing words but not at the right level of emphasis.
LSA finds another matrix $\hat{X}$ with rank K to approximate X using Singular Value Decomposition ($X \approx \hat{X} = U_K \Sigma_K V_K^{\top}$), such that the Frobenius norm of the difference between the two matrices is minimized:

\[ \sqrt{\sum_{i}\sum_{j} \left( X_{ij} - \hat{X}_{ij} \right)^{2}} \tag{2.2} \]
In effect, LSA allows missing and observed words to equally impact the objective function.
Given the inherently short length of the texts, LSA (equation 2.2) allows much more potential influence from missing words than from observed words (99.9% of the cells in X are 0). Hence the contribution of the observed words is significantly diminished. Moreover, the true semantics of the
document is actually related to some missing words, but such true semantics will not be favored
by the objective function, since equation 2.2 allows for too strong an impact by forcing Xij = 0
for any missing word. Therefore the LSA model, in the context of short texts, is allowing missing
words to have a significant “uncontrolled” impact on the model.
model             financial  sport  institution   Ro    Rm   Ro − Rm   Ro − 0.01Rm
topic models: v1     1         0        0         20   600    -580         14
LSA: v2              0.2       0.3      0.2        5   100     -95          4
ideal: v3            0.6       0        0.1       18   300    -282         15

Table 2.1: Three possible latent vector hypotheses for the text data, which is the WordNet sense definition of bank#n#1: a financial institution that accepts deposits and channels the money into lending activities. Assume there are only three topics in the corpus: financial, sport, institution. Ro denotes the relatedness score between the hypothesis and the observed words; Rm denotes the relatedness score between the hypothesis and the missing words.
2.2.1 An Example
We list three latent semantic profiles for the short text corresponding to the concept definition of bank#n#1 in Table 2.1, which illustrates our analysis of topic models and LSA. Assume there are three dimensions: financial, sport, institution. We use Ro to denote the sum of semantic relatedness scores between a latent vector v and all observed words; similarly, Rm is the sum of relatedness scores between v and all missing words. The first vector profile v1 is chosen by maximizing Ro (its Ro = 20 is the largest of the three), hence it is the one generated by topic models. It suggests bank#n#1 is only related
to the financial dimension. The second latent vector (found by LSA) has the maximum value of
Ro−Rm = −95, but obviously the latent vector is not related to bank#n#1 at all. This is because
LSA treats observed words and missing words exactly the same, and due to the large number of missing words, the information from the observed words is lost: Ro − Rm ≈ −Rm. The third vector is the
ideal semantic profile, since it is also related to the institution dimension. It has a slightly smaller
Ro in comparison to the first vector, yet it has a substantially smaller Rm.
In order to favor the ideal vector over other hypotheses, we simply need to adjust the objective
function by assigning a smaller weight to Rm, such as: Ro − 0.01 × Rm in the 8th column of
Table 2.1. Accordingly, we use weighted matrix factorization [Srebro and Jaakkola, 2003] to model
missing words.
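The comparison can be reproduced in a few lines; the (Ro, Rm) values below are taken from Table 2.1, while the scoring function follows the discussion above:

```python
# Relatedness scores (Ro, Rm) for the three hypotheses in Table 2.1.
hypotheses = {"topic models: v1": (20, 600),
              "LSA: v2": (5, 100),
              "ideal: v3": (18, 300)}

def score(Ro, Rm, wm):
    # Objective favoring observed words, with missing words down-weighted by wm.
    return Ro - wm * Rm

# wm = 1 (LSA-like) favors v2; wm = 0 (topic-model-like) favors v1;
# a small wm = 0.01 favors the ideal vector v3.
best = {wm: max(hypotheses, key=lambda h: score(*hypotheses[h], wm))
        for wm in (1, 0, 0.01)}
print(best)  # {1: 'LSA: v2', 0: 'topic models: v1', 0.01: 'ideal: v3'}
```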
2.3 The Proposed Approach
2.3.1 Weighted Matrix Factorization
The weighted matrix factorization approach is very similar to SVD, except that it allows for direct
control on each matrix cell Xij . The model factorizes the original matrix X into two matrices P
and Q such that X ≈ P>Q, where X is an M ×N matrix, P is a K ×M matrix, and Q is a K ×N
matrix (Figure 2.2).
The model parameters (the vectors in P and Q) are optimized by minimizing the objective function:

\[ \sum_{i}\sum_{j} W_{ij} \left( P_{\cdot,i}^{\top} Q_{\cdot,j} - X_{ij} \right)^{2} + \lambda \lVert P \rVert_{2}^{2} + \lambda \lVert Q \rVert_{2}^{2} \tag{2.3} \]
where λ is a free regularization factor, and the weight matrix W defines a weight for each cell in X .
Accordingly, P·,i is a K-dimensional latent semantic vector profile for word wi; similarly, Q·,j is a K-dimensional vector profile that represents the text dj. Operations on these K-dimensional vectors have very intuitive semantic meanings:
(1) the inner product of P·,i and Q·,j is used to approximate semantic relatedness of word wi and
document dj : P·,i · Q·,j ≈ Xij , as the shaded parts in Figure 2.2; a large value of Xij means P·,i
and Q·,j should be more similar; in other words, they should share more common topics;
(2) equation 2.3 explicitly requires that a document not be related to its missing words by forcing P·,i · Q·,j = 0 for missing words (Xij = 0);
(3) we can compute the similarity of two documents dj and dj′ using the cosine similarity between
vectors Q·,j and Q·,j′ .
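As a small illustration of point (3), assuming an already-factorized Q (the latent values below are invented):

```python
import numpy as np

def text_similarity(Q, j1, j2):
    """Cosine similarity between the latent profiles of documents j1 and j2."""
    q1, q2 = Q[:, j1], Q[:, j2]
    return float(q1 @ q2 / (np.linalg.norm(q1) * np.linalg.norm(q2)))

# Toy K = 3 latent space with 3 documents as columns; values are illustrative.
Q = np.array([[0.9, 0.8, 0.0],
              [0.1, 0.2, 0.9],
              [0.0, 0.1, 0.1]])
print(text_similarity(Q, 0, 1))  # high: documents 0 and 1 share topics
print(text_similarity(Q, 0, 2))  # low: little topic overlap
```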
Alternating least squares [Srebro and Jaakkola, 2003] can be used to compute the latent vectors in P and Q: P and Q are first randomly initialized, then computed iteratively by the following
equations (derivation can be found in [Srebro and Jaakkola, 2003]):
\[ P_{\cdot,i} = \left( Q W^{(i)} Q^{\top} + \lambda I \right)^{-1} Q W^{(i)} X_{i,\cdot}^{\top}, \qquad Q_{\cdot,j} = \left( P W^{(j)} P^{\top} + \lambda I \right)^{-1} P W^{(j)} X_{\cdot,j} \tag{2.4} \]
where W(i) = diag(Wi,·) is an N × N diagonal matrix containing the ith row of the weight matrix W. Similarly, W(j) = diag(W·,j) is an M × M diagonal matrix containing the jth column of W.
It is worth noting that P and Q are computed iteratively, i.e., in an iteration each P·,i (i = 1, ..., M) is calculated based on Q, and then each Q·,j (j = 1, ..., N) is calculated based on P.
Figure 2.2: Matrix factorization (X ≈ P⊤ × Q): the M × N matrix X is factorized into two matrices, the K × M matrix P and the K × N matrix Q; K denotes the number of latent dimensions.
This can be computed efficiently since: (1) all P·,i share the same QQ>; similarly all Q·,j share
the same PP>; (2) X is very sparse. More details on accelerating the computation can be found in
[Steck, 2010].
2.3.2 Modeling Missing Words
It is straightforward to implement the idea in section 2.2.1 (choosing a latent vector that maximizes
Ro − 0.01× Rm) in the weighted matrix factorization framework, by assigning a small weight for
all the missing words in equation 2.3:
\[ W_{ij} = \begin{cases} 1, & \text{if } X_{ij} \neq 0 \\ w_m, & \text{if } X_{ij} = 0 \end{cases} \tag{2.5} \]
We refer to the resulting model as Weighted Matrix Factorization (WMF). The algorithm, which uses alternating least squares, is presented in Algorithm 1.
This solution is elegant: (1) it explicitly tells the model that, in general, all missing words should not be related to the short text; (2) meanwhile, latent semantics are mainly generalized from observed words, and the model is not penalized too much (wm is very small) when it is very confident that the text is highly related to a small subset of its missing words based on their latent semantic profiles (e.g., the bank#n#1 definition text is strongly related to its missing words check and loan).
In fact, the weight value reflects the confidence we have in the cells of X. If Xij = 0 (a missing word), most likely word wi is irrelevant to document dj. However, there is still a small chance that wi is a related word, such as check or loan for the bank#n#1 sense definition. Therefore we are less confident about the 0 values, and assign them a small weight.
We adopt the same approach of assigning a small weight to some cells (feature values), as proposed for recommender systems [Steck, 2010]. In recommender systems, an incomplete rating matrix R is formed, where rows are users and columns are items. Typically, a user rates only a small portion of the items, hence the recommender system needs to predict the missing ratings.

Algorithm 1: WMF
Procedure P, Q = WMF(X, W, λ, n_itr)
    n_words, n_docs ← size(X)
    randomly initialize P, Q
    for itr ← 1 to n_itr do
        for j ← 1 to n_docs do
            Qj,· ← (P⊤W(j)P + λI)−1 P⊤W(j) X·,j
        for i ← 1 to n_words do
            Pi,· ← (Q⊤W(i)Q + λI)−1 Q⊤W(i) X⊤i,·
(In the algorithm, words and documents are stored as rows of P and Q, i.e., the transpose of the convention used in equation 2.4.)
Steck [2010] imputed a value for all the missing cells, and set a small weight for those cells.
Compared to [Steck, 2010], we are facing a different problem and targeting a different goal. We
have a full matrixX where missing words have a 0 value, while the missing ratings in recommender
systems are unavailable – the values are unknown, hence the rating matrix R is not complete. In
the recommender system setting, they are interested in predicting individual ratings, while we are
interested in the text semantics. More importantly, they do not have the sparsity issue (on average
each movie has been rated over 250 times in the MovieLens data1) and robust predictions can be
made based on the observed ratings alone.
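As a concrete illustration, equations 2.3-2.5 and Algorithm 1 can be sketched with NumPy. This is our own minimal dense-matrix sketch, not the thesis implementation (which relies on the sparse speed-ups of Steck [2010]); the function name, initialization scale, and defaults are our assumptions:

```python
import numpy as np

def wmf(X, w_m=0.01, lam=20.0, K=10, n_itr=20, seed=0):
    """Weighted matrix factorization via alternating least squares.

    X is the M x N word-document TF-IDF matrix (dense here for clarity).
    Returns P (K x M word vectors) and Q (K x N document vectors)."""
    M, N = X.shape
    rng = np.random.default_rng(seed)
    P = 0.01 * rng.standard_normal((K, M))
    Q = 0.01 * rng.standard_normal((K, N))
    W = np.where(X != 0, 1.0, w_m)       # equation 2.5: down-weight missing words
    reg = lam * np.eye(K)
    for _ in range(n_itr):
        for j in range(N):               # update document vectors (equation 2.4)
            PW = P * W[:, j]             # K x M, column i scaled by W_ij
            Q[:, j] = np.linalg.solve(PW @ P.T + reg, PW @ X[:, j])
        for i in range(M):               # update word vectors (equation 2.4)
            QW = Q * W[i, :]             # K x N, column j scaled by W_ij
            P[:, i] = np.linalg.solve(QW @ Q.T + reg, QW @ X[i, :])
    return P, Q
```

Two short texts dj and dj' are then compared by the cosine between the columns Q[:, j] and Q[:, j'].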
2.4 Experiments
2.4.1 Experiment setting
Task and data sets: the WMF model is evaluated within the context of the short text similarity task, where a system needs to predict a similarity score for a pair of short texts. The evaluation metric
is the Pearson correlation coefficient between the gold similarity scores and a system's predicted scores.
1http://www.grouplens.org/node/73, with the 1M data set being the most widely used.
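For reference, Pearson's correlation can be computed directly from its definition; the gold and predicted scores below are invented for illustration:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical gold scores (0-5 scale) and system cosine similarities (0-1 scale);
# Pearson's r is invariant to the scale of the predictions, only linearity matters.
gold = [5.0, 4.2, 3.1, 1.0, 0.2]
pred = [0.95, 0.90, 0.62, 0.30, 0.05]
print(pearson(gold, pred))
```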
The evaluation data sets are developed from the SemEval-2012 Semantic Textual Similarity task (STS12) [Agirre et al., 2012], the *SEM 2013 shared STS task (STS13) [Agirre et al., 2013],
and SemEval-2014 STS task (STS14) [Agirre et al., 2014].2 For STS12, the training data (2234
pairs) is used as the tuning set for setting the parameters of our models. This data comprises msr-par news sentence paraphrases from the Microsoft Paraphrase Corpus [Dolan et al., 2004], msr-vid video description paraphrases [Chen and Dolan, 2011], and smt-eur translation data [Callison-Burch et al., 2007]. Once the models are tuned, we evaluate them on the STS12 test data set, STS13 data set
and STS14 data set. It is worth noting that the tuning data and test data are not from the same
sources: the STS12 test set comprises out-of-domain sentence pairs such as OntoNotes [Hovy et al., 2006] dictionary glosses; STS13 has FrameNet [Baker et al., 1998] glosses, OntoNotes glosses, and news headlines; the new genres in STS14 are tweets-news data [Guo et al., 2013], OntoNotes glosses, news headlines, image descriptions [Rashtchian et al., 2010], and Deft-forum forum data.
Baselines: The performance of WMF is compared against (a) TF-IDF: a surface word based TF-
IDF weighting schema in the original high dimensional space, (b) LSA, and (c) LDA that uses
Collapsed Gibbs Sampling for inference [Griffiths and Steyvers, 2004]. The similarity of two short
texts is computed by cosine similarity either in the original word space (TF-IDF) or latent space
(LSA, LDA, WMF).
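A minimal sketch of the TF-IDF baseline follows; the sentences and IDF values are invented, and a real system would estimate IDF from a background corpus and apply the same stemming pipeline:

```python
import math
from collections import Counter

def tfidf_cosine(s1, s2, idf):
    """Cosine similarity of two short texts in the raw TF-IDF word space.
    idf: a dict word -> IDF weight estimated from a background corpus."""
    v1 = {w: c * idf.get(w, 1.0) for w, c in Counter(s1.split()).items()}
    v2 = {w: c * idf.get(w, 1.0) for w, c in Counter(s2.split()).items()}
    dot = sum(v1[w] * v2[w] for w in v1 if w in v2)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

idf = {"man": 2.0, "guitar": 3.5, "plays": 2.5, "a": 0.1, "the": 0.1}
print(tfidf_cosine("a man plays a guitar", "the man plays the guitar", idf))
# No overlapping words at all yields a similarity of exactly 0 -- the sparsity
# problem that motivates the latent-space models:
print(tfidf_cosine("a man plays a guitar", "stocks fell sharply today", idf))
```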
To eliminate randomness in WMF and LDA, all the reported results are averaged over 10 runs.
We run 20 iterations for WMF and 5000 iterations for LDA; each LDA model is averaged over
the last 10 Gibbs Sampling iterations to obtain more robust predictions.
The latent vector of a text is computed by: (1) equation 7.2 in WMF, or (2) summing up the
latent vectors of all the constituent words weighted by Xij in LSA and LDA, following the work
reported in [Mihalcea et al., 2006]. For LDA, the latent vector of a word is computed by P (z|w).
It is worth noting that we could directly use the estimated topic distribution θ of a text to represent a sentence; however, the topic distribution has non-zero values in only one or two topics, hence it loses a lot of nuanced information, leading to much worse performance.
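The weighted-sum construction for LSA/LDA text vectors can be sketched as follows; the word vectors (e.g., P(z|w) rows from LDA) and TF-IDF weights below are invented:

```python
import numpy as np

def text_vector(word_vecs, weights):
    """Latent vector of a text: sum of its word vectors weighted by the
    TF-IDF values X_ij, following [Mihalcea et al., 2006]; normalized for
    convenience (cosine similarity is scale-invariant anyway)."""
    v = sum(w * word_vecs[word] for word, w in weights.items())
    return v / np.linalg.norm(v)

# Hypothetical K = 3 word vectors and TF-IDF weights for a 3-word text.
word_vecs = {"crude": np.array([0.7, 0.2, 0.1]),
             "oil":   np.array([0.6, 0.1, 0.3]),
             "price": np.array([0.2, 0.7, 0.1])}
weights = {"crude": 2.1, "oil": 1.8, "price": 1.2}
print(text_vector(word_vecs, weights))
```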
Corpora: The corpora used by the dimension reduction models (LSA/LDA/WMF) comprise definitions from two dictionaries, WordNet and Wiktionary3, and the Brown corpus.
2A description of the data sets is summarized in Table 1.1.

Models      Parameters              STS12 tune   STS12 test   STS13   STS14
1. TF-IDF   -                          72.8         66.2       58.4    70.2
2. LSA      -                          16.1         23.0       24.9    27.5
3. LDA      α = 0.05, β = 0.05         73.5         67.1       72.5    63.6
4. WMF      wm = 1, λ = 20             15.7         23.9       23.7    27.4
5. WMF      wm = 0, λ = 20             58.7         55.7       48.7    46.2
6. WMF      wm = 0.01, λ = 20          74.3         71.7       71.8    71.7

Table 2.2: Pearson's correlation (in percentage) on the four data sets: latent dimension K = 100 for LSA/LDA/WMF. For the WMF models, the regularization factor λ is fixed at 20. Models 4-6 are WMF with different missing word weights wm, where the first two are analogous to LSA and LDA, respectively.

All definitions are
simply treated as individual documents. For the Brown corpus, each sentence is treated as a document in order to mimic short-text documents, thereby creating more coherent co-occurrence
data. All data is tokenized and stemmed using the Porter Stemmer [Porter, 2001]. The importance
of words in a text is measured by the TF-IDF schema. All the dimension reduction models (LSA,
LDA, WMF) are built on the same set of corpora: WordNet+Wiktionary+Brown (393,667 short texts, 5,252,143 tokens, and 81,848 distinct words).
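Building the word-document TF-IDF matrix X from such a corpus can be sketched as follows, with toy documents standing in for the real corpora (real preprocessing would also apply tokenization and Porter stemming):

```python
import math
from collections import Counter

def build_tfidf_matrix(docs):
    """Build the M x N word-document TF-IDF matrix X from tokenized short
    texts, each treated as one document; zero cells are the missing words."""
    vocab = sorted({w for d in docs for w in d})
    df = Counter(w for d in docs for w in set(d))      # document frequencies
    N = len(docs)
    idf = {w: math.log(N / df[w]) for w in vocab}
    X = [[0.0] * N for _ in vocab]
    for j, d in enumerate(docs):
        for w, c in Counter(d).items():
            X[vocab.index(w)][j] = c * idf[w]          # TF x IDF
    return vocab, X

docs = [["bank", "deposit", "loan"], ["bank", "river"], ["stock", "market", "price"]]
vocab, X = build_tfidf_matrix(docs)
```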
2.4.2 Results
Table 2.2 summarizes the Pearson correlation coefficient values on the tuning and test sets. All
parameters are tuned based on the tuning set. In LDA, we chose an optimal combination of α and β
from {0.01, 0.05, 0.1, 0.5}.4 In WMF, we choose the best parameters for the weight wm for missing
words and λ for regularization. We fix the dimension K = 100. Later in section 2.4.3, we will see
3http://en.wiktionary.org/wiki/Wiktionary:Main Page
4Here α (Dirichlet prior for topic distribution of a document θj) and β (Dirichlet prior for word distribution given a
topic φk) serve as the regularization terms in LDA just like λ in WMF, since a larger α or β makes P (z|θj) and P (z|wi)
more evenly distributed.
that a larger value of K can further improve performance.
There are several interesting observations in Table 2.2. Firstly, LSA is the most ineffective model. This is caused by the fact that most cells in the corpus matrix X are missing words with 0 values; LSA is overwhelmed by the missing words, which accordingly induce noisy latent vectors
for the texts. Also, we note that compared to the TF-IDF model that works in high dimensional word
space, LDA is not consistently better: LDA achieved better Pearson’s correlation than TF-IDF on
STS12 and STS13 data sets, but not on STS14. One major reason is that STS14 is a relatively easier
corpus for the TF-IDF model in the sense that it contains more common words between text pairs: in STS14, on average 54% of the words in a pair appear in both sentences, while the percentage is 51.8% for STS13, excluding the translation pairs.5
WMF that models missing words using a small weight (model 6 with wm = 0.01) outperforms
the second best model LDA by a large margin on all data sets except STS13 (+4.6%, −0.7%,
+8.1% on STS12 test, STS13, STS14, respectively). This is because LDA only uses 10 observed
words to infer a 100 dimension vector for a text, while WMF takes advantage of overwhelmingly
more missing words to learn more robust latent profiles.
We also present models 4 and 5 (both WMF) to show the impact of: (1) modeling missing words with equal weight to observed words (wm = 1), mimicking LSA, and (2) not modeling missing words at all (wm = 0), mimicking LDA, in the context of the WMF model.
As expected, both model 4 and model 5 generate much worse results.
Both LDA and model 5 ignore missing words, with better correlation scores achieved by LDA.
This may be due to the different inference algorithms: Gibbs sampling is a better inference algo-
rithm than alternating least squares which finds a local optimum solution. Model 4 and LSA are
comparable, where missing words are used with a large weight wm = 1. Both of them yield low
results. This confirms our assumption that allowing for equal impact of both observed and missing
words is not the appropriate manner for modeling the semantic space.
5The percentage is around 60% for STS12 tune and STS12 test; however, the ground truth similarity scores have a lower correlation with the TF-IDF model.
2.4.3 Analysis
In WMF and LDA models, there are several essential parameters: weight of missing words wm,
and dimension K. Figure 2.3 and 2.4 illustrate the impact of these parameters on predicting text
similarity scores.
Figure 2.3 shows the influence of wm on performance. When wm ≤ 0.01, the correlation scores
are very stable (all better than LDA except on STS13), with wm = 0.005 at the peak (even better than our tuned parameter wm = 0.01). The scores drop significantly once wm becomes larger than 0.01. Accordingly, we can conclude that the short text similarity task prefers a small missing word weight wm ≤ 0.01.
We also illustrate the influence of the dimension K = {50, 75, 100, 150, 200} on LDA and
WMF in Figure 2.4, where parameters for WMF are fixed as wm = 0.01, λ = 20, and for LDA are
α = 0.05, β = 0.05. We observed two trends: (1) in most cases, a larger dimension K produces
higher Pearson’s correlation for both models, as more dimensions allow the encoding of more se-
mantics about the original data; (2) WMF outperforms LDA in all dimensions, except the STS13
data set. In Figure 2.4c on STS13, it seems a smaller number of dimensions (K = 75) yields the best score for WMF, whereas LDA continues to benefit from a larger dimension (K = 200).
Based on the two figures, we can conclude that WMF outperforms LDA in most cases; the result
is robust with different values of the K dimension and the missing word weight wm.
Another interesting factor is the training data. In all the experiments the models are trained on
short text data; as we observe, LSA performs particularly poorly because of the length of the data.
To investigate the impact of the length of the training data, we train the models on long documents: 400,000 Wikipedia documents, the same size as the short text corpora, with 156 tokens on average per document (versus 13.3 in the short text corpora). The results are shown in Table 2.3.
With more words per document, LSA is able to improve its results by a large margin, though it remains worse than LDA and WMF. The performance of LDA and WMF degrades compared to training on the short texts. The reason may be that the dictionary definitions cover more topics.
(Figure 2.3 comprises four panels, (a) STS12 train, (b) STS12 test, (c) STS13, (d) STS14; each plots Pearson's correlation (%) of WMF on the y-axis against wm on the x-axis.)
Figure 2.3: Pearson's correlation percentage scores of WMF on each data set: the missing word weight wm varies from 0.001 to 0.1; the dimension K is fixed to 100; the regularization factor λ is fixed to 20.
(Figure 2.4 comprises four panels, (a) STS12 train, (b) STS12 test, (c) STS13, (d) STS14; each plots Pearson's correlation (%) of WMF and LDA on the y-axis against K on the x-axis.)
Figure 2.4: Pearson's correlation percentage scores of WMF and LDA on each data set: the dimension K varies from 50 to 200; the missing word weight wm is fixed to 0.01; the regularization factor λ is fixed to 20.
Models    Parameters             STS12 tune   STS12 test   STS13   STS14
TF-IDF    -                         72.8         66.2      58.4    70.2
LSA       -                         48.80        43.22     43.28   43.13
LDA       α = 0.05, β = 0.05        73.46        66.83     59.68   54.77
WMF       wm = 0.01, λ = 20         76.93        71.55     64.65   68.84

Table 2.3: Pearson's correlation (in percentage) on the four data sets: the models are trained on long documents.
2.5 Summary and Discussion
In this chapter, we analyzed how traditional models (LSA and topic models) handle missing words. Accordingly, we give missing words special treatment to alleviate the sparsity problem in modeling short texts. Experimental results on three data sets confirm our hypothesis and show that our model, WMF, significantly outperforms existing methods.
One limitation of the bag-of-words based models is that they neglect a lot of nuanced semantics, such as phrase structure and word order. For example, the two sentences the dog bit him and he bit
the dog have the same feature vectors. This is a factor that may not hurt similarity prediction too
much, but it likely prevents short text similarity from being used in other NLP tasks. Therefore, in future work, we would like to integrate phrases as additional features for short texts. One major challenge is that most phrases are very infrequent and hence too sparse to model. Therefore, we intend to filter a set of meaningful phrases and concentrate only on those.
Chapter 3

Enrich Lexical Features by Modeling Bigrams and Similar Words
In the last chapter, we introduced the bottleneck of short text modeling: simply employing the observed words provides too few features for a short text, which causes inaccurate low dimensional text representations. To this end, we integrated the missing words as additional features for a text in the matrix factorization framework, and achieved significantly more robust representations for short text data.
In this chapter, we further investigate the representation of short texts from another perspective. We argue that current dimension reduction models, including our WMF model, do not pay enough attention to lexical semantics. In LSA/LDA/WMF, the features used to represent a word are simply document IDs (see Figure 3.1), hence not very expressive. Under this simple assumption, a
lot of nuanced lexical information such as selectional preference is lost. Therefore, in this chapter
we focus on extracting and incorporating more features for words under the matrix factorization
framework, in order to infer robust lexical semantics. We believe modeling robust lexical items
is very important in the short text modeling context, since a short text contains very few observed
words, and the text representation will benefit significantly from quality word representation. The
experimental results support our hypothesis, where the new model [Guo and Diab, 2013] signifi-
cantly outperforms the WMF model in short text similarity data sets.
3.1 Introduction
Our proposed Weighted Matrix Factorization (WMF) [Guo and Diab, 2012b] has outperformed
LSA [Deerwester et al., 1990] and LDA [Blei et al., 2003] by a large margin in the short text
similarity task, yielding previous state-of-the-art performance among unsupervised systems on the
STS12 [Agirre et al., 2012] data sets. However, all three of these models make oversimplified
assumptions on how a token is generated: (1) in LSA/WMF (Figure 3.1a), a token is generated by
the inner product of the word latent vector and the corresponding document latent vector; (2) in
LDA (Figure 3.1b), all the tokens in a document are sampled from the same document level topic
distribution. Under this assumption, all these models ignore rich lexical linguistic phenomena such
as inter-word dependency, semantic scope of words, and so on; accordingly all the models simply
assume each word is related to all other words in the document. This is a result of merely using
document IDs as features to represent a word (As shown in Figure 2.2, in the data matrix X , each
row represents a word, hence the columns, which are document IDs, can be seen as features for the
word).
It is worth noting that in the text modeling community, using document IDs alone to represent a word is prevalent, since these dimension reduction techniques are usually applied to documents, where abundant words exist for extracting document level semantics. Nonetheless, we believe
this simple assumption is harmful in the short text setting. Given the limited number of words in
a text, it is crucial to make good use of each word; if one word is not modeled accurately, the
corresponding topics might not appear in the latent vector of the short text.
In this chapter, we focus on creating more features to induce quality latent semantic vectors
for words. This is motivated by the belief that a reasonable word generation story will encourage
robust lexical semantics, which can further boost the short text semantics. The features that we are interested in belong to two very different categories: the first is bigrams, which are purely corpus-based lexical semantic evidence; the second is similar word pairs, extracted from a human constructed knowledge base. These two kinds of lexical semantics are naturally different, and hence complementary to each other. We integrate both of them into the WMF model, resulting
in even better performance in the short text similarity task.
(Figure 3.1 comprises two panels: (a) matrix factorization, showing the word-document matrix X with words w1 ... wM as rows and documents d1 ... dN as columns; (b) topic models, the LDA plate diagram with variables θ, z, w and priors α, β over K topics φ.)
Figure 3.1: In current dimension reduction models (WMF/LSA and LDA), the features to represent a word are simply document IDs, which are denoted by the red circles.
3.2 Related Work
The modeling of bigrams is closely related to selectional preference. Selectional Preference de-
notes a word’s likelihood to co-occur with certain lexical sets, by “encoding the set of admissible
argument values for a relation” [Ritter et al., 2010]. For example, the word drink prefers drink-
able objects after it; names of people are more likely to appear in the argument of the verb meet.
Selectional preference proves to be helpful for a number of NLP applications, such as syntactic dis-
ambiguation [Hindle and Rooth, 1993], semantic role labeling [Gildea and Jurafsky, 2002], textual
inference [Pantel et al., 2007] and word sense disambiguation [Resnik, 1997], and many more.
Much previous work has proposed models to address selectional preference. Resnik [1996]
made use of the WordNet predefined noun word classes, and calculated the selectional preference
strength between the noun classes and observed verbs. Erk [2007] demonstrated that an approach
of computing similarity between arguments is able to provide better lexical coverage. Rooth et al.
[1999] studied the relations and arguments in a generative probabilistic model, which is extended
by Ritter et al. [2010] in the LDA framework.
In this chapter, we target modeling selectional preference to achieve a better latent representation for lexical items. As shown in the next section, we relax the traditional notion of selectional preference (modeling the association between nouns and verbs in [Resnik, 1996]), and model the association between the two words of a bigram, which is purely data driven and requires no human-constructed resources. Also, our approach has the benefit of learning co-occurrence tendencies for all words, compared to other work that targets a specific lexical type.
The other type of resource we exploit is knowledge based information: similar word pairs extracted from the dictionary WordNet [Fellbaum, 1998]. Human constructed knowledge is a great complement to corpus-based data. Because of its robustness, researchers have found the
knowledge based semantics extremely valuable in various NLP tasks such as paraphrasing [Barzilay
and Lee, 2003], lexical semantics [Yih et al., 2012], etc. In this chapter, we extract similar word
pairs from WordNet, and test its influence for short text similarity.
3.3 Incorporating Bigrams
The additional corpus-based information we exploit, other than word-document co-occurrence, is bigrams, a feature already present in the data yet ignored by most distributional similarity models. Bigrams encode the admissible arguments of a word, thus capturing more nuanced semantics than document IDs. Consider the following example (in our data set, a short text counts as a document):
Many analysts say the global Brent crude oil benchmark price, currently around $111 a barrel
By the nature of WMF/LSA/LDA, a word receives semantics from all the other words in a document; therefore the word oil, in the above example, will be assigned the incorrect finance topic, which is the dominant topic at the text level. Moreover, the problem worsens for adjectives, adverbs and verbs, which have a much narrower semantic scope than the whole sentence/short text/document. For example, the verb say should only be associated with analyst (only receiving semantics from analyst), as its semantics is not related to any other word in the sentence. In contrast, the word oil, according to its selectional preference, should only be associated with its modifier crude, which indicates the correct resource topic. We believe that modeling bigrams, which capture local evidence, completes the semantic picture for words, subsequently rendering better short text semantics. To the best of our knowledge, this is the first work to model bigrams for short text semantics.
If two words form a bigram, then the two words should share similar latent topics.1 In the previous example, crude and oil form a bigram, and they share the resource topic. In our framework, this is implemented by adding extra columns to X, so that each additional column corresponds to a bigram, treating each bigram as a pseudo-document that contains only those two words, as shown in Figure 3.2. The corresponding graphical model is illustrated in Figure 3.3b, where the extra b nodes stand for the bigrams. Therefore, oil will receive more of the resource topic from crude through the bigram crude oil, instead of only the finance topic from the sentence as a whole.
Each non-zero cell in the new columns of X, i.e., an observed token in a bigram (pseudo-
1Note this distinguishes our work from previous efforts that mainly work on noun-verb relations, e.g., admissible nouns for a verb. Since we target enhancing the latent representation of all words, our approach is very general and can be applied to any word.
Figure 3.2: Each bigram is integrated into the original corpus matrix X as an additional column (document columns d1 … dN followed by bigram pseudo-document columns b1, b2, …, such as analyst says and crude oil). From the model's perspective, a bigram is treated as a pseudo-text; accordingly, only two cells in a bigram column have non-zero values.
document), is given a different weight:

$$W_{i,j} = \begin{cases} 1, & \text{if } X_{ij} \neq 0 \text{ and } j \text{ is a document index,} \\ \gamma \cdot \mathrm{freq}(j), & \text{if } X_{ij} \neq 0 \text{ and } j \text{ is a bigram index,} \\ w_m, & \text{if } X_{ij} = 0. \end{cases} \quad (3.1)$$
freq(j) denotes the frequency of bigram j in the corpus; hence the strength of association is differentiated such that higher weights are assigned to more frequent bigrams. The coefficient γ, whose value is manually set, is a hyperparameter that controls the importance of the bigram evidence. Assigning a large γ value indicates that the bigram evidence is more trustworthy than the global textual semantics.
3.3.1 Incorporating Bigrams from Dependency Tree
An alternative way to incorporate the relation between two words is to extract bigrams from the syntactic dependency tree. The benefit of doing so is that we can extract long-range word relations. After parsing each sentence with the Stanford parser [Klein and Manning, 2003], we derive bigrams as tuples of modifier and head. However, we were not able to obtain better performance in the short text similarity task; the results are analyzed in the experiment section.
3.4 Incorporating Similar Word Pairs
We also integrate knowledge-based semantics in the WMF framework. Knowledge-based semantics, as a type of clean, human-annotated resource, is an important complement to noisy corpus-based co-occurrence information. In this section, the knowledge-based semantics we exploit is similar word pairs extracted from WordNet [Fellbaum, 1998].
These similar word pairs are very valuable for improving the quality of the latent profiles of infrequent words, because the model does not observe enough contexts to understand an infrequent word. Leveraging the pairs, an infrequent word such as purchase can "borrow" a relatively robust latent vector from a synonym such as buy, since buy appears much more frequently and the model is able to capture its semantics more accurately.
Similar word pairs can be seamlessly modeled in WMF, since in the matrix factorization framework a latent vector profile is explicitly created for each word, so we can directly operate on these word vectors. By contrast, in LDA all the data structures are designed for documents rather than words. To integrate the knowledge, we construct a graph that connects words according to the extracted similar word pairs, encouraging similar words to enjoy similar latent vector profiles, as the nodes w2 and w4 in Figure 3.3c show.
We first extract synonym pairs from WordNet: words associated with the same sense, aka synset. We further expand the set by exploiting the relations defined in WordNet. For each extracted word, we consider its first sense, and if that sense is connected to other senses by any of the WordNet-defined relations (such as hypernym, meronym, etc.), we treat the words associated with those other senses as similar words. In total, we discover more than 80,000 pairs of similar words for the 46,000 distinct words in our corpus.
Given a pair of similar words w_{i1} and w_{i2}, we want the two corresponding latent vectors P_{·,i1} and P_{·,i2} to be as close as possible, namely their cosine similarity to be close to 1. Accordingly, a term is added to equation 2.3 for each similar word pair w_{i1}, w_{i2}:

$$\delta \cdot \left( \frac{P_{\cdot,i_1}^{\top} P_{\cdot,i_2}}{|P_{\cdot,i_1}|\,|P_{\cdot,i_2}|} - 1 \right)^2 \quad (3.2)$$
|P_{·,i}| denotes the Euclidean length of the vector P_{·,i}. The coefficient δ, analogous to γ, denotes the importance of the knowledge-based evidence. Figure 3.3c shows the final WMF+BK model (WMF + corpus-based [B]igram semantics + [K]nowledge-based semantics), where the
extra link connecting w2 and w4 denotes the term in equation 3.2 that forces the two corresponding
word profile vectors to be similar.
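The penalty term of equation 3.2 for a single pair can be computed as in this small sketch (the function name and column-major layout of P are illustrative):

```python
import numpy as np

def pair_penalty(P, i1, i2, delta):
    """Penalty of equation 3.2 for one similar word pair (i1, i2):
    delta * (cos(P[:, i1], P[:, i2]) - 1)^2.
    The term vanishes when the two word profiles point in the same direction."""
    v1, v2 = P[:, i1], P[:, i2]
    cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return delta * (cos - 1.0) ** 2
```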
3.5 Experiments
3.5.1 Experiment Setting
The experiment setting is almost the same as in the last chapter (more details can be found in section 2.4).
Task and data sets: All the models are evaluated on the short text similarity task with data sets STS12, STS13 and STS14. The STS12 training data set is used as the tuning set.
Baselines: (a) TF-IDF: a surface-word TF-IDF weighting schema in the original high-dimensional space, (b) LSA, (c) LDA using Collapsed Gibbs Sampling for inference [Griffiths and Steyvers, 2004], and (d) WMF.
Corpora: The co-occurrence corpora are definitions from two dictionaries, WordNet and Wiktionary, plus the Brown corpus.
3.5.2 Results
Table 3.1 lists the results at dimension K = 100 (the dimension of latent topics). To remove randomness, each reported number is the average over 10 runs. Based on the STS12 tuning set, we experiment with different values for the bigram weight, γ ∈ {0, 1, 2}, and likewise for the similar word pairs weight, δ ∈ {0, 10, 30, 50}. The performance on STS12 tuning and STS12 test, STS13 and STS14 is illustrated in Figures 3.4 and 3.5. The parameters of model 7 in Table 3.1 (γ = 2, δ = 50) are the values chosen based on tuning set performance.
Table 3.1 shows that WMF is already a very strong baseline: it outperforms TF-IDF, LSA and LDA by a large margin. Using corpus-based bigram semantics alone (model 5, WMF+B in Table 3.1) boosts the performance of WMF by +0.4% to +0.7% on the test sets, while using knowledge-based semantics alone (model 6, WMF+K) improves over the WMF results by at most +1.1% absolute (on STS13). That the performance gain from similar word pairs is larger than from bigrams is expected, since the former is a cleaner source of semantics created by human annotators.
[Figure 3.3 graphical models: word nodes w1, w2, w3 attached to document d1 and w4, w5 attached to d2; in (b) and (c), bigram nodes b1 … b4 attach to pairs of word nodes.]
(a) WMF. (b) WMF with bigrams. (c) WMF with bigrams and similar words (full model).
Figure 3.3: WMF+BK model (WMF + corpus-based [B]igram semantics + [K]nowledge-based similar word pairs semantics): a w/d/b node represents a word/document/bigram, respectively; the extra link in Figure 3.3c denotes that w2 and w4 constitute a similar word pair.
Models | Parameters | STS12 tune | STS12 test | STS13 | STS14
1. TF-IDF | - | 72.8 | 66.2 | 58.4 | 70.2
2. LSA | - | 16.1 | 23.0 | 24.9 | 27.5
3. LDA | α = 0.05, β = 0.05 | 73.5 | 67.1 | 72.5 | 63.6
4. WMF | - | 74.3 | 71.7 | 71.8 | 71.7
5. WMF+B | γ = 2, δ = 0 | 74.5 | 72.2 | 72.6 | 72.5
6. WMF+K | γ = 0, δ = 50 | 74.6 | 72.7 | 72.9 | 72.3
7. WMF+BK | γ = 2, δ = 50 | 74.8 | 73.1 | 73.0 | 72.8
8. WMF+syn | γ = 2, δ = 0 | 73.2 | 71.6 | 71.3 | 70.5

Table 3.1: Pearson's correlation (in percentage) on the four data sets. Latent dimension K = 100 for LSA/LDA/WMF/WMF+BK. For matrix factorization based models, the regularization factor λ is fixed at 20. Model 5 is WMF with bigram semantics alone; model 6 is WMF with similar word pairs alone; model 7 is the final model with both sources incorporated; model 8 (WMF+syn) uses bigrams extracted from dependency trees.
Combining them (model 7, WMF+BK) yields the best results, with an absolute increase of +0.7% to +1.4%, which suggests that the two sources of semantic evidence are not only useful but, more importantly, complementary to each other.
Observing the performance under different weight values in Figure 3.4 (corpus-based semantics weight γ) and Figure 3.5 (knowledge-based semantics weight δ), we conclude that bigrams and similar word pairs yield very promising results; the trends hold across parameter conditions with a consistent improvement.
Finally, we present the WMF+syn setting, where the bigrams are extracted from the dependency trees of the sentences. The performance is worse than the baseline WMF on all data sets. The reason might be that the dependency parser is not robust enough for our corpus: we use sense definitions, which do not have the same structure as the natural language in the news-genre data sets.
3.6 Summary and Discussion
Motivated by the importance of recognizing the correct topics of words in a short text context, we incorporate corpus-based (bigram) and knowledge-based (similar word pair) lexical semantics into our matrix factorization model. Our system yields significant unsupervised performance gains on short text similarity data sets over the existing strong WMF baseline.
This method bridges the gap between lexical semantics and short text similarity by applying lexical semantics techniques to the P matrix of word latent vectors. Yet there is still room to benefit from knowledge-based semantics in the current framework. One direction is similar to the idea introduced in [Yih et al., 2012; Chang et al., 2013], where the proposed model is aware of the relations between sense pairs (synonym, antonym, meronym...). Intuitively, different word relations should have different impacts on the word profile vectors; e.g., in [Yih et al., 2012] two words that are antonyms should have a −1 cosine similarity between their latent vectors. In the current model, however, all relations are simply abstracted as word neighbors without further differentiation (all word pairs should have a cosine similarity close to 1). We expect that explicitly modeling the sense relations would yield better word latent profiles that encode more linguistic intuition.
[Figure 3.4 plots: panels (a) STS12 train, (b) STS12 test, (c) STS13, (d) STS14, each showing WMF-B's Pearson's correlation % (y-axis, 70 to 75) against the bigram weight γ (x-axis, 0 to 2).]
Figure 3.4: Pearson's correlation percentage scores of WMF-B (with corpus-based [B]igram semantics alone) on each data set: the corpus-based semantics weight γ is chosen from {0, 1, 2}; the dimension K is 100; the missing word weight wm is fixed at 0.01; the regularization factor λ is fixed at 20.
[Figure 3.5 plots: panels (a) STS12 train, (b) STS12 test, (c) STS13, (d) STS14, each showing WMF-K's Pearson's correlation % (y-axis, 70 to 75) against the similar word pair weight δ (x-axis, 0 to 50).]
Figure 3.5: Pearson's correlation percentage scores of WMF-K (with [K]nowledge-based similar word pairs semantics alone) on each data set: the knowledge-based semantics weight δ is chosen from {0, 10, 30, 50}; the dimension K is 100; the missing word weight wm is fixed at 0.01; the regularization factor λ is fixed at 20.
Chapter 4
Binary Coding for Large Scale Similarity Computing
In the previous two chapters, we presented matrix factorization models that convert text data into low-dimensional real-valued vectors, and demonstrated that the models are very effective at predicting semantic similarity for short text pairs.
We now turn our attention to computing similarity scores in a massive data set, specifically Twitter data, where millions of tweets are posted each day. One obvious issue is that massive data leads to time-consuming cosine similarity computation. To overcome this problem, we focus on exploiting binary bit representations, rather than the real-valued representations of the previous two chapters, for textual semantic similarity computation. We introduce a new model that removes potentially redundant information, and produces better performance on the tweet retrieval task and the short text similarity task.
4.1 Introduction
Twitter is rapidly gaining worldwide popularity, with 500 million active users generating more than 340 million tweets daily1. Massive-scale tweet data is freely available on the Web and contains rich linguistic phenomena and valuable information, making it one of the most popular data
1http://en.wikipedia.org/wiki/Twitter
sources used by a variety of Natural Language Processing (NLP) applications. Successful examples
include first story detection [Petrovic et al., 2010], local event detection [Agarwal et al., 2012],
Twitter event discovery [Benson et al., 2011], extraction [Ritter et al., 2012] and summarization
[Chakrabarti and Punera, 2011], etc.
In these NLP applications, one of the core technical components is tweet similarity computation: searching for desired tweets with respect to some sample tweets. For example, in first story detection [Petrovic et al., 2010], the purpose is to find an incoming tweet that reports a novel event not revealed by previous tweets. This is done by measuring the cosine similarity between the incoming tweet and each previous tweet.
One obvious issue is that cosine similarity computation over Twitter data becomes very slow once the scale of the data grows drastically. In this chapter, we investigate the problem of computing similarity scores in a large-scale data set. We evaluate the similarity scores on the task of tweet retrieval, where a system searches for the most similar tweets given a query tweet.2 Specifically, we propose a binary coding approach to render computationally efficient tweet comparisons that should benefit practical NLP applications, especially in massive data scenarios. Using the proposed approach, each tweet is compressed into short-length binary bits (i.e., a compact binary code), so that tweet comparisons can be performed substantially faster by measuring Hamming distances between the generated compact codes. Crucially, Hamming distance computation only involves very cheap XOR and popcount operations instead of the floating-point operations needed by cosine similarity computation.
Since Twitter messages contain very few words, we can naturally apply the WMF model on Twitter data to obtain quality latent vectors, and then convert the real-valued vectors to a binarized version. Intuitively, the binary bits lose a lot of information compared to the real-valued vectors. Therefore, we focus on improving the WMF model to preserve as much information as possible in the binary strings for tweets, and to reduce any redundant information in the model.
Looking at the objective function, we find that the WMF model solely focuses on exhaustively encoding the local context, i.e., whether a word appears in a short text. One issue caused by this local approach is that it introduces some overlapping information, which is reflected in the associated projections (the P matrix in Figure 2.2). In order to remove the redundant information and meanwhile
2We also evaluate the model on the short text similarity task, using the real-valued latent vectors rather than binary bits.
Symbol | Definition
N | Number of tweets in the corpus.
M | Dimension of a tweet vector, i.e., the vocabulary size.
x_i | The sparse TF-IDF weighted vector corresponding to the i-th tweet in the corpus.
x̄_i | The vector subtracted by the mean µ of the tweet corpus: x̄_i = x_i − µ.
X, X̄ | The tweet corpus in matrix format, and the zero-centered tweet data.
K | The number of binary coding functions, i.e., the number of latent topics.
f_k | The k-th binary coding function.

Table 4.1: Symbols used in binary coding.
discover more distinct topics, we employ a gradient descent method to make the projection directions nearly orthogonal. We name the improved model Orthogonal Matrix Factorization (OrMF) [Guo et al., 2014].
In our experiments, we evaluate the quality of similarity/dissimilarity scores by searching for the most similar tweets given a query tweet. We use Twitter hashtags to create the gold (i.e., groundtruth) labels, where tweets with the same hashtag are considered semantically related, hence relevant. We collect a tweet data set consisting of 1.35 million tweets over 3 months, where each tweet has exactly one hashtag. The experimental results show that our proposed model OrMF significantly outperforms competing binary coding methods.
4.2 Background and Related Work
4.2.1 Preliminaries
We first introduce some notation used in this chapter to formulate our problem. Suppose that we are given a data set of N tweets and the size of the vocabulary is M. A tweet is represented by all the words it contains. We use x ∈ R^M to denote the sparse M-dimensional TF-IDF weighted vector corresponding to a tweet, where each word stands for a dimension. For ease of notation, we represent all N tweets in a matrix X = [x_1, x_2, · · · , x_N] ∈ R^{M×N}. For binary coding, we seek K binarization functions {f_k : R^M → {1, −1}}_{k=1}^{K} so that a tweet x_i is encoded into a K-bit
binary code (i.e., a string of K binary bits). Table 4.1 illustrates the symbols used in this chapter for
notation.
Hamming Ranking: In this chapter we evaluate the quality of binary codes in terms of Hamming ranking. Given a query tweet, all data items are ranked in ascending order of the Hamming distances between their binary codes and the query's binary code, where a Hamming distance is the number of bit positions in which two codes differ. Compared with cosine similarity, computing Hamming distance is substantially more efficient: fixed-length binary bits enable very cheap logic operations, whereas real-valued vectors require floating-point operations for cosine similarity computation. Since logic operations are much faster than floating-point operations, Hamming distance computation is typically significantly faster than cosine similarity computation.
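A minimal sketch of Hamming ranking, assuming each binary code is packed into a machine integer (the function names are illustrative):

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two binary codes stored as Python ints:
    XOR marks the differing bit positions, popcount counts them."""
    return bin(a ^ b).count("1")

def hamming_rank(query_code, codes):
    """Return item indices ranked by ascending Hamming distance to the query."""
    return sorted(range(len(codes)), key=lambda i: hamming(query_code, codes[i]))
```

On modern CPUs the popcount would be a single instruction per machine word, which is where the speedup over floating-point cosine similarity comes from.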
4.2.2 Binary Coding
Early explorations of binary coding focused on using random permutations or random projections to obtain binary coding functions (aka hash functions), such as Min-wise Hashing (MinHash) [Broder et al., 1998] and Locality-Sensitive Hashing (LSH) [Indyk and Motwani, 1998]. MinHash and LSH are generally considered data-independent approaches, as their coding functions are generated in a randomized fashion. In the context of Twitter, the simple LSH scheme proposed in [Charikar, 2002] is of particular interest. Charikar proved that the probability that the bits of two data points differ is proportional to the angle between them, and employed a random projection w ∈ R^M to construct a binary coding function:

$$f(x) = \mathrm{sgn}(w^{\top}x) = \begin{cases} 1, & \text{if } w^{\top}x > 0, \\ -1, & \text{otherwise.} \end{cases} \quad (4.1)$$
The currently held view is that data-dependent binary coding can lead to better performance. A data-dependent coding scheme typically includes two steps: 1) learning a series of binary coding functions from a small amount of training data; 2) applying the learned functions to larger-scale data to produce binary codes.
In the context of tweet data, Latent Semantic Analysis (LSA) [Deerwester et al., 1990] can directly be used for data-dependent binary coding. Let X̄ be the zero-centered data matrix, where each tweet vector x_i is subtracted by the mean vector µ, resulting in x̄_i = x_i − µ. LSA reduces the dimensionality of the data by performing singular value decomposition (SVD): X̄ = UΣV^⊤. The K coding functions are then constructed using the K left singular vectors u_1, u_2, · · · , u_K associated with the K largest singular values, that is, f_k(x) = sgn(u_k^⊤ x̄) = sgn(u_k^⊤(x − µ)), for k = 1, · · · , K.
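The LSA coding construction above might look like this in practice (a sketch; the helper name and the use of numpy's SVD are our assumptions):

```python
import numpy as np

def lsa_coding_functions(X, K):
    """Derive K binary coding functions from LSA: zero-center the tweet
    matrix, take the top-K left singular vectors U, and code a new tweet x
    as sgn(U^T (x - mu))."""
    mu = X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(X - mu, full_matrices=False)
    Uk = U[:, :K]                      # top-K left singular vectors
    def encode(x):
        return np.where(Uk.T @ (x - mu.ravel()) > 0, 1, -1)
    return encode
```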
Iterative Quantization (ITQ) [Gong and Lazebnik, 2011] is another popular unsupervised binary coding approach. ITQ attempts to find an orthogonal rotation matrix R ∈ R^{K×K} to minimize the squared quantization error ‖B − RV‖²_F, where B ∈ {1, −1}^{K×N} contains the binary codes of all data, V ∈ R^{K×N} contains the LSA-projected and zero-centered vectors, and ‖·‖_F denotes the Frobenius norm. After R is optimized, the binary codes are simply obtained by B = sgn(RV).
Much recent work learns nonlinear binary coding functions, including Spectral Hashing [Weiss et al., 2008], Anchor Graph Hashing [Liu et al., 2011a], Bilinear Hashing [Liu et al., 2012b], and Kernelized LSH [Kulis and Grauman, 2012]. Concurrently, supervised information defined among training samples has been incorporated into coding function learning, as in Minimal Loss Hashing [Norouzi and Fleet, 2011] and Kernel-Based Supervised Hashing [Liu et al., 2012a]. Our proposed method falls into the category of unsupervised, linear, data-dependent binary coding.
4.2.3 Applications in NLP
The NLP community has successfully applied LSH in several tasks such as first story detection [Petrovic et al., 2010] and paraphrase retrieval for relation extraction [Bhagat and Ravichandran, 2008]. This chapter shows that our proposed data-dependent binary coding approach is superior to data-independent LSH in terms of the quality of the generated binary codes.
Subercaze et al. [2013] proposed a binary coding approach to encode user profiles for recommendations. Compared to [Subercaze et al., 2013], in which a data unit is a whole user profile consisting of all of a user's Twitter posts, we tackle a more challenging problem, since our data units are extremely short: a single tweet.
4.3 The Proposed Approach
4.3.1 Binarized version of WMF
Our approach is based on the WMF model. Adapting WMF to binary coding is straightforward. Following LSA (section 4.2.2), we use the matrix P to linearly project tweets into low-dimensional vectors, and then apply the sign function. The k-th binarization function uses the k-th row of the P matrix (P_{k,·}) as follows:

$$f_k(x) = \mathrm{sgn}(P_{k,\cdot}\,\bar{x}) = \begin{cases} 1, & \text{if } P_{k,\cdot}\,\bar{x} > 0, \\ -1, & \text{otherwise.} \end{cases} \quad (4.2)$$

Note that we use the zero-centered version x̄, which is the original data vector x minus the mean of all tweets µ: x̄ = x − µ. The goal of using the zero-centered data X̄ is to obtain a balanced number of 1 bits and −1 bits in the data set.
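Equation 4.2 applied to a whole corpus can be sketched as follows (illustrative names; P is assumed to be the learned K × M projection matrix):

```python
import numpy as np

def wmf_binarize(P, X):
    """Binarize WMF latent representations per equation 4.2: each row P_k
    of the projection matrix P gives one bit, f_k(x) = sgn(P_k . xbar),
    where xbar is the tweet vector minus the corpus mean."""
    mu = X.mean(axis=1, keepdims=True)
    return np.where(P @ (X - mu) > 0, 1, -1)  # K x N matrix of +/-1 bits
```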
4.3.2 Removing Redundant Information
Transforming a real-valued vector to binary bits loses a lot of information. Therefore, in this section
we aim to preserve as much original information as possible, and reduce redundant information in
the model.
We now elaborate how to remove redundant information from the word semantics matrix P. Firstly, it is worth noting that there are two readings of the K × M matrix P, as in Figure 4.1. The columns of P (Figure 4.1a), denoted by P_{·,i}, may be viewed as the collection of K-dimensional latent profiles of words, which we observe frequently in the WMF model. On the other hand, the rows of P (Figure 4.1b) can be seen as projection vectors, denoted by P_{k,·}, which are analogous to the eigenvectors U obtained by LSA. The projection vector P_{k,·} is multiplied with a zero-centered data vector x̄ to generate one bit of the text's binary code: sgn(P_{k,·} x̄). In this section, we focus on the properties of the rows of P.
To compute the optimal P and Q, each column of P and Q is iteratively optimized to approximate the data, P_{·,i}^⊤ Q_{·,j} ≈ X_{ij}, as shown in lines 6-9 of Algorithm 2 (which is essentially equation 2.4). While this does a good job of preserving the existence/relevance of each word in a short text, it might encode repetitive information in the projection vectors P_{k,·} (the rows of P).
(a) Each column P_{·,i} represents a word profile. (b) Each row P_{k,·} is a projection vector.
Figure 4.1: Two views of the P matrix: K is the number of dimensions, and M is the number of distinct words. The first view, the columns of P, is frequently used in the WMF model (Algorithm 1). We now apply the second view, the rows of P as projections, to improve the WMF model.
Figure 4.2 illustrates the redundant information (noisiness) in the P matrix. With the local approach adopted in WMF, it is very likely to produce the topics in Figure 4.2a (which contain some redundant information): the first projection vector P_{1,·} may be 90% about the politics topic and 10% about the war topic, while the second projection vector P_{2,·} is 95% about war and 5% about food. An extreme case of such redundancy is presented in Figure 4.2b, where the second topic is exactly the same as the first.3
Ideally, we would like the dimensions to be uncorrelated, so that more distinct topics of the data could be captured, as in Figure 4.2c, where the first dimension is only about the politics topic and the second dimension is only about the war topic. We believe such a model is able to encode richer information by removing the repetitive information.
Inspired by LSA, one way to ensure uncorrelatedness is to force P to be orthogonal, i.e., PP^⊤ = I, which implies P_{j,·} P_{k,·}^⊤ = 0 for j ≠ k.
3This would not happen in a real-world setting; we use this example only to illustrate the noisiness.
(a) A noisy case: the politics topic (obama, congress, budget, government, war, army) contains some war words; the war topic (war, soldier, food, water, iraq, weapon) contains some food words.
(b) The extreme case: the two topics are exactly the same (both: obama, congress, budget, government, war, army).
(c) The perfect case: each topic only contains relevant words (politics: obama, congress, budget, government, election, policy; war: war, soldier, injure, peace, iraq, weapon).
Figure 4.2: Three examples illustrating the noisiness in the P matrix. In general, we would like to remove as much noise as possible.
Algorithm 2: OrMF
1  Procedure P = OrMF(X, W, λ, n_itr, α)
2      n_words, n_docs ← size(X)
3      randomly initialize P, Q
4      itr ← 1
5      while itr < n_itr do
6          for j ← 1 to n_docs do
7              Q_{·,j} = (P W^{(j)} P^⊤ + λI)^{-1} P W^{(j)} X_{·,j}
8          for i ← 1 to n_words do
9              P_{·,i} = (Q W^{(i)} Q^⊤ + λI)^{-1} Q W^{(i)} X_{i,·}^⊤
10         c ← mean(diag(P P^⊤))
11         P ← P − α(P P^⊤ − cI)P
12         itr ← itr + 1
4.3.3 Implementation of Orthogonal Projections
To produce nearly orthogonal projections in the current framework, we could add a regularizer β‖PP^⊤ − I‖² with weight β to the objective function of the WMF model (equation 2.4). In practice, however, this method does not lead to the convergence of P, mainly because any word profile P_{·,i} becomes dependent on all other word profiles after an iteration.
Therefore, we adopt a simpler method, gradient descent, in which P is updated by taking a small step in the direction of the negative gradient of ‖PP^⊤ − I‖². It should be noted that this term requires each projection P_{k,·} to be a unit vector, since it forces P_{k,·} P_{k,·}^⊤ = 1, which is infeasible when the nonzero values in X are large. Therefore, we multiply the matrix I by a coefficient c, calculated as the mean of the diagonal of PP^⊤ in the current iteration. The following two lines are added at the end of each iteration:

$$c \leftarrow \mathrm{mean}(\mathrm{diag}(PP^{\top})), \qquad P \leftarrow P - \alpha\,(PP^{\top} - cI)\,P. \quad (4.3)$$

Using the coefficient c, the magnitude of P is not affected. The step size α is fixed to 0.0001.
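The two update lines of equation 4.3 can be sketched and iterated as follows (the larger step size and iteration count used in the usage note below are chosen for a tiny example, not the thesis setting of α = 0.0001):

```python
import numpy as np

def orthogonalize_step(P, alpha=1e-4, n_steps=1):
    """Gradient steps of equation 4.3: nudge the rows of P toward
    orthogonality without shrinking their overall magnitude, via
    P <- P - alpha * (P P^T - c I) P with c = mean(diag(P P^T))."""
    K = P.shape[0]
    for _ in range(n_steps):
        G = P @ P.T
        c = np.mean(np.diag(G))
        P = P - alpha * (G - c * np.eye(K)) @ P
    return P
```

For example, starting from two correlated rows such as (1, 1, 0) and (1, 0, 1) and iterating with a larger step size drives the off-diagonal entries of PP^⊤ toward zero while the diagonal entries approach the common value c.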
This procedure is presented in Algorithm 2. We refer to this new model as Orthogonal Matrix
Factorization (OrMF).
4.4 Experiments on Twitter Data
4.4.1 Experiment Setting
Twitter data: We crawled English tweets spanning three months, from October 5th 2013 to January 5th 2014, using the Twitter API.4 We cleaned the data such that each hashtag appears at least 100 times in the corpus, and each word appears at least 10 times. The resulting collection consists of 1,350,159 tweets, 15 million word tokens, 30,608 unique words, and 3,214 unique hashtags.
One of the main reasons to use hashtags is that they enhance access to topically similar tweets [Efron, 2010]. In a large-scale data setting, it is impossible to manually identify relevant tweets for a query tweet. Therefore, we use Twitter hashtags to create groundtruth labels: tweets marked by the same hashtag as the query tweet are considered relevant. Accordingly, all hashtags are removed from the original data corpus in our experiments. We choose a subset of the most frequent hashtags to create groundtruth labels: we manually remove tags that are not topic-related (e.g., #truth, #lol) or are ambiguous; we also remove all tags referring to TV series (their relevant tweets can be trivially obtained by named entity matching). The resulting subset contains 18 hashtags.5
For each of the 18 hashtags, 100 tweets are randomly selected as queries (test data). The median number of relevant tweets per query is 5,621. The small proportion of relevant tweets makes the task relatively challenging: we need to identify 5,621 tweets (0.42% of the whole data set) out of 1.35 million.
200,000 tweets are randomly selected (not including the 1,800 queries) as training data for
the data dependent models (LSAH, ITQ, SH, WMF, OrMF) to learn binarization functions.6 The
functions are subsequently applied on all the 1.35 million tweets, including the 1,800 query tweets.
4. https://dev.twitter.com
5. The tweet data set and their associated list of hashtags will be available upon request.
6. Although we use the word "training", the hashtags are never seen by the models. The training data is used for the models to learn word co-occurrences and to construct the binary coding functions.
Evaluation metric: We evaluate a model by its search quality: given a tweet as query, we would like to rank the relevant tweets (the tweets sharing the same hashtag as the query tweet) as high as possible. Following previous work [Weiss et al., 2008; Liu et al., 2011a], we use mean precision among the top 1000 returned list (MP@1000) to measure the ranking quality. Let pre@k be the precision among the top k returned data; then MP@1000 is the average value of pre@1, pre@2, ..., pre@1000. MP rewards systems that rank relevant data in the top places, e.g., if the highest ranked tweet is relevant, then all the precision values (pre@1, pre@2, pre@3, ...) are increased. We also calculate precision and recall curves at varying values of the top k returned list.
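As a concrete sketch (our own helper, not code from the thesis), MP@n over a ranked 0/1 relevance list is:

```python
def mean_precision_at_n(ranked_relevance, n=1000):
    """Average of pre@1 .. pre@n, where pre@k is the fraction of
    relevant items among the top k of the ranked list."""
    hits, total = 0, 0.0
    for k, rel in enumerate(ranked_relevance[:n], start=1):
        hits += rel
        total += hits / k
    return total / n

# A relevant item at rank 1 lifts every subsequent pre@k, which is
# why MP rewards placing relevant tweets at the very top.
score = mean_precision_at_n([1, 0, 1, 0], n=4)
```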
Baselines: We evaluate the proposed unsupervised binary coding model OrMF against 5 other unsupervised methods: LSH, SH, LSA, ITQ, and WMF. All the binary coding functions except LSH are learned on the 200,000 tweet set. All the methods have the same form of binary coding function, sgn(P_·,k^T x); they differ only in the projection vector P_·,k. The retrieved tweets are ranked by their Hamming distance to the query, i.e., the number of differing bit positions between the binary codes of a tweet and the query.
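The encode-and-rank scheme can be sketched as follows (hypothetical helper names; we assume the K projection vectors are stored as the rows of P):

```python
import numpy as np

def binarize(X, P):
    """Encode row vectors X (n x V) into K-bit codes: bit k of a
    vector x is 1 iff the projection of x onto row k of P is positive."""
    return (X @ P.T) > 0                     # n x K boolean codes

def hamming_rank(query_code, codes):
    """Rank items by Hamming distance to the query code, i.e. by the
    number of differing bit positions."""
    dists = np.count_nonzero(codes != query_code, axis=1)
    return np.argsort(dists, kind="stable"), dists

codes = np.array([[1, 0, 1], [0, 0, 0], [1, 0, 0]], dtype=bool)
order, dists = hamming_rank(np.array([1, 0, 1], dtype=bool), codes)
```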
For ITQ and SH, we use the code provided by the authors. Note that the matrix XX^T is infeasible to compute as a dense matrix due to the large vocabulary; therefore we compute it with sparse matrix operations. For the two matrix factorization based methods (WMF, OrMF) we run 10 iterations. The regularizer λ in equation 2.3 is fixed at 20 as in our previous experiments [Guo and Diab, 2012b]. A small set of 500 tweets is selected from the training set as a tuning set to choose the missing word weight wm in the baseline WMF; its value is then fixed for OrMF. In fact, WMF/OrMF are very stable, consistently outperforming the baselines regardless of the value of wm, as shown later in Figure 4.5.
We also present the results of cosine similarity in the original word space (TF-IDF) as an upper bound for the binary coding methods. We implemented an efficient algorithm for TF-IDF, Algorithm 1 in [Petrovic et al., 2010]: it first normalizes each data point to a unit vector, then computes cosine similarity by traversing the tweets only once via an inverted word index.
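A loose, stdlib-only re-implementation of that single-pass scheme (our own function names, not the code of Petrovic et al.):

```python
import math
from collections import defaultdict

def rank_by_cosine(tweets, query):
    """tweets: list of {word: tfidf} dicts; query: a {word: tfidf} dict.
    Every vector is normalized to unit length, so accumulating products
    of weights through an inverted word index yields cosine similarity,
    touching each posting list only once."""
    def unit(v):
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        return {t: w / norm for t, w in v.items()}

    index = defaultdict(list)                # word -> [(tweet_id, weight)]
    for i, v in enumerate(tweets):
        for t, w in unit(v).items():
            index[t].append((i, w))

    scores = defaultdict(float)
    for t, qw in unit(query).items():        # only the query's words are visited
        for i, w in index[t]:
            scores[i] += qw * w              # dot product of unit vectors
    return sorted(scores.items(), key=lambda kv: -kv[1])

ranked = rank_by_cosine([{'a': 1.0, 'b': 1.0}, {'c': 1.0}], {'a': 1.0})
```

Tweets sharing no word with the query are never touched, which is what makes the exact TF-IDF baseline tractable at this scale.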
[Figure: three precision curves, panels (a) K = 64, (b) K = 96, (c) K = 128; x-axis: number of returned samples (0-1000), y-axis: precision (0.15-0.5); one curve per model: OrMF, WMF, ITQ, LSA, SH, LSH.]
Figure 4.3: Hamming ranking on tweet retrieval data set: precision curve under top 1000 returned list of all 6 binary coding models, with dimension K = {64, 96, 128}.
[Figure: three recall curves, panels (a) K = 64, (b) K = 96, (c) K = 128; x-axis: number of returned samples (0-100,000), y-axis: recall (0-0.35); one curve per model: OrMF, WMF, ITQ, LSA, SH, LSH.]
Figure 4.4: Hamming ranking on tweet retrieval data set: recall curve under top 100,000 returned list of all 6 binary coding models, with dimension K = {64, 96, 128}.
[Figure: three panels, (a) K = 64, (b) K = 96, (c) K = 128; x-axis: wm ∈ {0.05, 0.08, 0.1, 0.15, 0.2}, y-axis: MP@1000 (26-32); one curve each for OrMF and WMF.]
Figure 4.5: Impact of the missing word weight wm on the MP@1000 performance for OrMF and WMF models: wm is chosen from 0.05 to 0.2; regularization factor λ is fixed as 20.
Models Parameters K = 64 K = 96 K = 128
LSH – 19.21% 21.84% 23.75%
SH – 18.29% 19.32% 19.95%
LSA – 21.04% 22.07% 22.67%
ITQ – 20.8% 22.06% 22.86%
WMF wm = 0.1 26.64% 29.39% 30.38%
OrMF wm = 0.1 27.7% 30.48% 31.26%
TF-IDF – 33.68%
Table 4.2: Mean precision among top 1000 returned list (MP@1000) on the tweet retrieval data set.
TF-IDF is the only system that does not use binary encoding, and serves as the upper bound of the
task.
4.4.2 Results
Table 4.2 presents the ranking performance measured by MP@1000 (the mean precision among the top 1000 returned list). Figures 4.3 and 4.4 illustrate the corresponding precision and recall curves for the Hamming distance ranking. The number K of binary coding functions corresponds to the number of dimensions in the 5 data-dependent models LSA, SH, ITQ, WMF, OrMF. The missing word weight wm is fixed at 0.1, based on the tuning set, for the two weighted matrix factorization based models WMF and OrMF. Later, in Figure 4.5, we experiment with different values of wm.
As the number of bits increases, all binary coding models yield better results. This is understandable, since each binary bit records only a tiny amount of information about a tweet; with more bits, the codes can capture more semantic information.
SH has the worst MP@1000 performance. The reason might be that it is designed for vision data, where the data vectors are relatively dense. ITQ yields comparable results to LSA in terms of MP@1000, yet the recall curves in Figures 4.4b (K = 96) and 4.4c (K = 128) clearly show the superiority of ITQ over LSA.
WMF outperforms LSA by a large margin (around 5% to 7%) through properly modeling missing words, which was also observed in Chapters 2 and 3. Although WMF already reaches a very high MP@1000 performance level, OrMF still achieves around a 1% improvement over WMF, which can be attributed to the orthogonal projections that capture more distinct topics. The trend holds consistently across all conditions. The precision and recall curves in Figures 4.3 and 4.4 confirm the trend shown in Table 4.2 as well.
All the binary coding models yield worse performance than the TF-IDF baseline. This is expected, as the binary bits trade accuracy for efficiency: 128 bits significantly compress the data, losing a lot of nuanced information, whereas in the high dimensional word space 128 bits can only record two words (32 bits for two word indices and 32 bits for two TF-IDF values). We manually examined the ranking lists and found that the binary coding models produce many ties (128-bit codes allow only 129 possible Hamming distance values), whereas the TF-IDF baseline can correctly rank such cases by detecting the subtle differences signaled by the real-valued TF-IDF scores.
4.4.3 Analysis
We are interested in whether other values of missing word weight wm can generate good results –
in other words, whether the performance is robust to the parameter value. Accordingly, we present
the influence of wm on MP@1000 in Figure 4.5, where the missing word weight wm is chosen
from {0.05, 0.08, 0.1, 0.15, 0.2}. The figure indicates we can achieve even better MP@1000 around
33.2% when selecting the optimal wm = 0.05. In general, the curves for all the code length are
very smooth; the chosen value of wm does not have a negative impact, i.e., the gain from OrMF
over WMF is always positive.
4.5 Experiments on STS Data
We repeat the experiments on STS data for the OrMF model. As in the previous two chapters, OrMF is evaluated on the short text similarity task, on the data sets STS12, STS13 and STS14. Since there is no parameter in OrMF to tune, the STS12 training set can be treated as a test set. OrMF is trained on a corpus consisting of sense definitions from two dictionaries, WordNet and Wiktionary, and the Brown corpus.
The baselines are: (a) TF-IDF: a surface word based TF-IDF weighting schema in the origi-
nal high dimensional space, (b) LSA, (c) LDA that uses Collapsed Gibbs Sampling for inference
Models Parameters STS12 tune STS12 test STS13 STS14
1. TF-IDF - 72.8 66.2 58.4 70.2
2. LSA - 16.1 23.0 24.9 27.5
3. LDA α = 0.05, β = 0.05 73.5 67.1 72.5 63.6
4. WMF wm = 0.01, λ = 20 74.3 71.7 71.8 71.7
5. WMF+BK γ = 2, δ = 50 74.8 73.1 73.0 72.8
6. OrMF wm = 0.01, λ = 20 76.7 72.6 74.1 71.9
Table 4.3: Pearson’s correlation (in percentage) on the data sets. Latent dimension K = 100 for
LSA/LDA/WMF/OrMF. We use the real-valued vectors produced by OrMF for short text similarity
evaluation.
[Griffiths and Steyvers, 2004], (d) WMF, and (e) WMF+BK.
In these experiments, we are not evaluating the binary coding performance of OrMF. Therefore, the output of OrMF is the real-valued low-dimensional vector for a short text. The missing word weight wm and regularization factor λ are set to the optimal values for WMF: wm = 0.01, λ = 20.
4.5.1 Results
Table 4.3 summarizes the Pearson's correlation values for the six models. It is clear that OrMF is the strongest model among those using no additional knowledge, consistently yielding better scores than WMF on all 4 data sets.
It is interesting to observe that the improvement of OrMF over WMF on STS14 is the smallest (+0.2%), compared to an average improvement of +1.7% on the other 3 data sets. Recall that STS14 is the easiest data set in the sense that its pairs share many common surface words; this suggests that OrMF performs even better when the task is more challenging.
As usual, we also present the Pearson's correlation scores obtained by varying the number of dimensions K for OrMF/WMF/LDA on the four data sets in Figure 4.6, where K = {50, 100, 150, 200}. OrMF consistently outperforms WMF on the first three data sets by 1%–2%, with almost the same performance on STS14. This is notable given that OrMF does not use any additional features compared to WMF.
4.6 Summary and Discussion
In this chapter, we propose a novel unsupervised binary coding model that provides efficient similarity search in massive tweet data. The resulting model, Orthogonal Matrix Factorization (OrMF), improves an existing matrix factorization model by learning nearly orthogonal projection directions. We collect a data set whose groundtruth labels are created from Twitter hashtags. Our first experiment, conducted on this data set, shows significant performance gains of OrMF over the competing methods. We also evaluate on short text similarity tasks, where OrMF consistently outperforms WMF on all 4 short text similarity data sets.
To further enhance the accuracy of the tweet retrieval task, we can introduce supervised labels
to make hashtags visible to the models. Previous work on supervised hashing [Liu et al., 2012a]
already demonstrated significant improvement. In our task, we want to learn binary bits that are
similar among those tweets tagged by the same hashtag and hence triggered by the same event.
Another promising direction is to model the timestamp as a feature of the tweet, motivated by the observation that many tweets describing the same event are posted within a short period of time. We believe that two tweets with close timestamps should be more likely to be similar. Our preliminary
approach is to build a model for each time span, following the idea in [Blei and Lafferty, 2006], so
that tweets within the same timestamp are generated from the same time specific model.
[Figure: four panels, (a) STS12 train, (b) STS12 test, (c) STS13, (d) STS14; x-axis: K ∈ {50, 75, 100, 150, 200}, y-axis: Pearson's correlation % (60-78); one curve per model: OrMF, WMF, LDA.]
Figure 4.6: Pearson's correlation percentage scores of OrMF, WMF and LDA on each data set: the dimension K varies from 50 to 200; missing word weight wm is fixed as 0.01; regularization factor λ is fixed as 20.
Part II
Applications
CHAPTER 5. AUTOMATED PYRAMID METHOD FOR SUMMARIES 65
Chapter 5
Automated Pyramid Method for
Summaries
Short text similarity has a wide range of applications in NLP tasks. The first task studied in this thesis is automated pyramid evaluation for text summarization. Text summarization is the process of compressing a text document into a short summary that retains the most important points of the original document.
The pyramid method is an evaluation method for assessing the quality of summaries. Essentially, it assigns a summary a score that is high if the summary covers many of the facts in the original documents. The score is computed mainly by manually identifying text snippets in summaries that cover the key concepts in the source documents; hence, pyramid evaluation requires manual human annotation. In previous work, Harnly et al. [2005] proposed a dynamic programming approach to automatically compute pyramid scores relying on bag-of-words matching.
In this chapter, we propose to use the Weighted Matrix Factorization (WMF) model to determine whether the important facts are included in a summary. We believe the current surface word matching based approach can be improved by a latent semantic approach, as there are many different ways to express the same fact in natural language. Our experiments show that our approach identifies the facts covered in summaries with greater precision and recall, which leads to better correlation with human judgments of summary quality.
Index 105
label matter is what makes up all objects or substances
contributor 1 matter is what makes up all objects or substances
contributor 2 matter as the stuff that all objects and substances in the universe are made of
contributor 3 matter is identified as being present everywhere and in all substances
contributor 4 matter is all the objects and substances around us
weight 4
Table 5.1: An example of a summary content unit (SCU) created from five model summaries. The concept has four contributors, all expressing the same meaning with different wording; accordingly, this SCU has a weight of 4.
5.1 Introduction
The pyramid method [Nenkova and Passonneau, 2004] is an annotation and scoring procedure to
measure how much content is covered by a summary. It is designed in an attempt to address a key
problem in summarization – namely the fact that different humans choose different content when
writing summaries. It has been shown to yield reliable rankings of text summarization systems on
multiple summarization tasks.
The pyramid method consists of two phases of manual annotation: (1) identifying content units in the model summaries, which are written by humans and serve as gold standard summaries; (2) identifying which content units are included in a system summary, and accordingly assigning it a score. The procedure is illustrated in Figure 5.1.
The first annotation phase yields Summary Content Units (SCUs), sets of text segments that express the same basic content in the model summaries. Each SCU is weighted by the number of model summaries it occurs in; accordingly, more frequent SCUs have larger weights. Intuitively, an SCU that appears in all model summaries is a more important fact, hence the higher weight. After manual annotation, the set of SCUs extracted from the same set of model summaries is referred to as a pyramid.
Table 5.1 demonstrates an example of an SCU extracted from five model summaries. The elements of an SCU are its index, a label, its contributors, and its weight. In this example, (1) the index is 105. (2)
[Figure: flowchart from original documents to model summaries and student summaries; the first annotation of the pyramid method turns model summaries into pyramids; the second annotation identifies SCUs in student summaries.]
Figure 5.1: The pipeline of the pyramid method for evaluating student summaries: the first annotation creates pyramids from the model summaries; the second annotation finds the SCUs in target summaries. After the procedure, we can score a target summary based on how many SCUs it contains.
The label is a sentence describing the content unit, written by the annotators. (3) Each contributor is a text snippet in one distinct model summary that refers to the SCU. In this example, four out of five model summaries (hence 4 contributors) express the SCU Matter is what makes up
all objects or substances; therefore, (4) this SCU has a weight of 4. The weight of an SCU ranges from 1 to M, where M is the number of model summaries. In sum, the first phase of manual annotation in the pyramid method consists of identifying the contributors and writing the corresponding label.
The procedure of scoring a target summary basically consists of identifying those SCUs that are expressed in the summary. Because each summary uses paraphrases, with different words referring to the same concepts, identifying the SCUs again requires human effort; this is the second phase of manual annotation. As shown in Table 5.1, the contributors have lexical items in common (matter, objects, substances), but also many differences (stuff, present, around).
In this chapter, we aim to complete the second phase without human annotation, employing latent representations of the labels/contributors of SCUs to automatically identify SCUs in a system summary and accordingly score the system summary. This is an ideal application of the WMF model, which is good at capturing the semantic similarity between two text snippets expressed with different words. The general procedure is to run the WMF model on the SCU labels and contributors, as well as on the ngrams in the system summaries, and then score the system summaries. Generally, if the similarity between a label/contributor and an ngram exceeds a threshold, the summary is considered to potentially match the SCU.
We evaluate our automated pyramid method on an assessment task of student reading comprehension. Previous work on the automated pyramid method performed well at ranking systems over many document sets, but is not precise enough on a single document (a student summary of reading materials). For evaluation, we produced manual pyramid scores for 20 student summaries, which serve as gold standard scores. We tested three automated pyramid scoring procedures, and the one based on WMF correlates best with the manual pyramid scores. It also has the best precision and recall for matching SCUs in the student summaries.
5.2 Related Work
ROUGE [Lin and Hovy, 2003; Lin, 2004] is the most popular automated evaluation method for text summarization; it originates from the BLEU score in machine translation [Papineni et al., 2002]. ROUGE contains a set of metrics that compare the summary against references (model summaries) based on ngram matching. Because it relies on string matching, it performs better with large sets of model summaries. Compared to ROUGE, the pyramid method is more robust, as it requires as few as four model summaries.
Nenkova and Passonneau [2004] proposed the pyramid method. It is based on the idea that no single model summary is perfect, and hence assigns differential weights to content units based on their frequency across all the model summaries. Essentially, a pyramid is a weighted inventory of SCUs, created for each document (set) to be summarized. The weight attached to each SCU differentiates the importance of the content units, which yields more reliable and stable scores. The pyramid method has been shown to perform well at ranking summarization systems.
Harnly et al. [2005] proposed the first automated summary evaluation to replace the second manual annotation phase, making use of the labels and contributors of SCUs. They reduced the problem to similarity computation between summary ngrams and SCU labels/contributors based on unigram overlap. The automated method yielded higher correlation with the manual pyramid method than the ngram overlap based ROUGE systems. Our method is an extension of this framework, and the experimental results show the superiority of our method in terms of ranking summaries, as well as identifying the correct SCUs.
On the other hand, distributional similarity models have been applied to reading comprehension. Foltz et al. [2000] found that LSA [Deerwester et al., 1990] correlates well with reading comprehension. More recently, LSA has been combined with word matching to assess students' reading comprehension skills [Boonthum-Denecke et al., 2011]. The resulting tool, and similar assessment tools such as Coh-Metrix, assess aspects of the readability of texts, such as coherence, but do not assess students' comprehension through their writing [Graesser et al., 2004; Graesser et al., 2011].
5.3 A Scoring Approach based on Distributional Similarity
In this section, we first introduce how we create the student summary corpus, and what are the
criteria for a good automated scoring schema. Then we explain the details of our approach based
on dynamic programming that is designed to match the criteria.
5.3.1 A Student Summary Corpus
Pyramid scores of student summaries correlate well with a manual main ideas score developed for
an intervention study with community college freshmen who attended remedial classes [Perin et al.,
2013]. Twenty student (target) summaries by students who attended the same college and took the
same remedial course were selected from a larger set of 322 that summarized an elementary physics
text. All were native speakers of English, and scored within 5 points of the mean reading score
for the larger sample. For the intervention study, student summaries had been assigned a score to
represent how many main ideas from the source text were covered [Perin et al., 2013]. Interrater
reliability of the main ideas score, as given by the Pearson correlation coefficient, was 0.92.
We first collected model summaries written by proficient Masters of Education students. Then, Perin created a model pyramid from the model summaries, annotated the 20 target (student) summaries against this pyramid, and scored the results. There are several ways to score a target summary: (1) the raw score of a target summary is simply the sum of the weights of its identified SCUs; (2) pyramid scores normalize the raw score by the number of SCUs in the target summary (analogous to precision), (3) or by the average number of SCUs in the model summaries (analogous to recall); (4) in this chapter, we normalize raw scores as the average of the two previous normalizations (analogous to F-measure). The resulting pyramid scores have a high Pearson's correlation of 0.85 with the main idea score [Perin et al., 2013] that was manually and directly assigned to each student summary.
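The four scores can be sketched as follows (a simplified reading of the normalizations above; the function and argument names are ours):

```python
def pyramid_scores(matched_weights, n_target_scus, avg_model_scus):
    """raw: sum of matched SCU weights; p: raw normalized by the number
    of SCUs in the target summary (precision-like); r: raw normalized by
    the average SCU count of the model summaries (recall-like); f: the
    average of the two normalizations (F-measure-like)."""
    raw = float(sum(matched_weights))
    p = raw / n_target_scus
    r = raw / avg_model_scus
    return raw, p, r, (p + r) / 2

# A summary matching SCUs of weight 4, 2 and 1, with 3 annotated SCUs,
# against model summaries averaging 7 SCUs each:
raw, p, r, f = pyramid_scores([4, 2, 1], n_target_scus=3, avg_model_scus=7)
```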
5.3.2 Criteria for Automated Scoring of Student Summaries
To be pedagogically useful, an automated method for assigning pyramid scores to student summaries should meet the following two criteria: 1) reliably rank student summaries of a source text, i.e., preserve the ranking generated by the manual pyramid scores, and 2) identify the correct SCUs
in each student summary. A method should clearly do well on criterion 1, as we want to find the best summaries. Criterion 2 matters as well: since each weight partition contains more than one SCU and a score is a sum of SCU weights, it is possible to produce the correct numeric final score by matching incorrect SCUs that happen to have the correct weights. Compared to previous methods, our method meets the first criterion and has superior performance on the second.
5.3.3 A Dynamic Programming Approach
Harnly et al. [2005] observed that the assignment of SCUs to a target summary can be cast as a dynamic programming problem. The method presented there relied on unigram overlap to score the closeness of the match between each eligible substring in a target summary and each SCU in the pyramid, and returned the set of matches that yielded the highest score for the summary. It produced good rankings across summarization tasks, but assigned scores much lower than those assigned by humans. This is because the surface word matching is so strict that many SCUs worded differently from the summary string are not discovered by the algorithm. Therefore, in this section we extend the dynamic programming approach in two ways: we test two new semantic text similarities, a string comparison method and a distributional semantic method, and we present a general mechanism to set a threshold value for an arbitrary text similarity computation, below which a match between a summary substring and an SCU is not considered.
Unigram overlap ignores word order and cannot capture the latent semantic content of a string. To take word order into account, we use Ratcliff/Obershelp (R/O), which measures the overlap of common subsequences [Ratcliff and Metzener, 1988]. To take the underlying semantics into account, we use the cosine similarity of 100-dimensional latent vectors of the candidate strings (ngrams from the target summary) and of the textual components of the SCU (label and contributors). Because the algorithm maximizes the total sum over all SCUs, many false matches occur when there is no similarity threshold for counting a match. Therefore, we add a threshold to the algorithm, below which matches are not considered. Because each similarity metric has different properties and distributions, a single absolute threshold value is not comparable across metrics; we present a method to set comparable thresholds across metrics.
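For the R/O comparison, Python's standard library is enough as a stand-in: difflib.SequenceMatcher implements a close variant of Ratcliff/Obershelp (gestalt pattern matching), so a word-level similarity can be sketched as:

```python
from difflib import SequenceMatcher

def ro_similarity(a, b):
    """Ratcliff/Obershelp-style similarity over word sequences; long
    common subsequences score highly, so word order matters."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()

same = ro_similarity("matter makes up all objects",
                     "matter makes up all objects")
reordered = ro_similarity("makes matter objects up all",
                          "matter makes up all objects")
```

Identical word sequences score 1.0; reordering the same words lowers the score, which is exactly the property unigram overlap lacks.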
Latent Representation: To represent the latent semantics of SCUs and candidate substrings of
target summaries, we apply the weighted matrix factorization model (WMF) [Guo and Diab, 2012b].
Comparing summary substrings with SCUs is an ideal setting for WMF, since both kinds of text data are at the phrase level, and WMF is able to learn a robust latent representation for short texts using missing words, as introduced in the previous chapters.
A 100-dimensional latent vector representation is learned for every span of contiguous words within sentence bounds in a target summary, for all 20 summaries. The training data is selected to be domain independent, so that our model can be used for summaries across domains. Thus we prepare a corpus that is balanced across topics and genres, drawn from WordNet sense definitions, Wiktionary sense definitions, and the Brown corpus. It yields a co-occurrence matrix X of unique words by sentences of size 46,619 × 393,666, where Xij holds the TF-IDF value of word wi in sentence sj. Similarly, the contributors to and the label for an SCU are given a 100-dimensional latent vector representation. These representations are then used to compare candidates from a summary to SCUs in the pyramid.
Three Comparison Methods: An SCU consists of at least two text strings: the SCU label and its contributors. As in Harnly et al. [2005], we use three similarity comparisons scusim(ngram, SCU), where ngram is the target summary string. When the comparison parameter is set to min (max, or mean), the similarity of the ngram to each SCU contributor and to the label is computed in turn, and the minimum (maximum, or mean) similarity value is returned.
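A sketch of this wrapper (hypothetical helper; `sim` can be any of the three similarity functions):

```python
def scusim(ngram, scu_texts, sim, mode="max"):
    """Compare a candidate ngram against every textual component of an
    SCU (label plus contributors) and reduce with min, max or mean."""
    scores = [sim(ngram, text) for text in scu_texts]
    if mode == "min":
        return min(scores)
    if mode == "max":
        return max(scores)
    return sum(scores) / len(scores)

exact = lambda a, b: float(a == b)       # toy similarity for illustration
scu = ["matter is what makes up all objects", "matter is everywhere"]
```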
Similarity Thresholds: We define a threshold parameter for a candidate match between a summary substring and a pyramid SCU, based on the distribution of scores each similarity method gives to the target SCUs identified by the human annotator. Annotation of the target summaries yields 204 SCUs in total. The similarity score being a continuous random variable, the empirical sample of 204 scores is very sparse. Hence, we use a Gaussian kernel density estimator to provide a non-parametric estimate of the probability density of the scores assigned by each of the similarity methods to the manually identified SCUs. We then select five threshold values corresponding to those for which the inverse cumulative density function (icdf) equals 0.05, 0.10, 0.15, 0.20 and 0.25. Each threshold represents the probability that a manually identified SCU will be missed.
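A stdlib-only sketch of this threshold selection (a hand-rolled Gaussian KDE with a rule-of-thumb bandwidth; all names and the bandwidth choice are ours, and the thesis presumably used a standard estimator):

```python
import math
import random

def kde_icdf_thresholds(scores, probs=(0.05, 0.10, 0.15, 0.20, 0.25), n_grid=2000):
    """Fit a Gaussian kernel density estimate to the similarity scores of
    manually identified SCUs, then read thresholds off its inverse CDF:
    a threshold at icdf(p) misses a true SCU with probability p."""
    n = len(scores)
    mean = sum(scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    h = 1.06 * std * n ** -0.2               # Silverman's rule of thumb
    lo, hi = min(scores) - 3 * h, max(scores) + 3 * h
    grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    pdf = [sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in scores) for x in grid]
    total = sum(pdf)
    thresholds, acc, targets = [], 0.0, sorted(probs)
    for x, d in zip(grid, pdf):
        acc += d / total
        while targets and acc >= targets[0]:
            thresholds.append(x)
            targets.pop(0)
    return thresholds

random.seed(0)
sample = [random.gauss(0.5, 0.1) for _ in range(204)]   # stand-in for the 204 SCU scores
ts = kde_icdf_thresholds(sample)
```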
5.4 Experiments on Student Summaries
5.4.1 Experiment Setting
The three similarity computations (Uni, R/O, WMF), three methods of comparing against SCUs (max, min, mean), and five icdf thresholds yield 45 variants, as shown in Figure 5.2. Each variant was evaluated by comparing its unnormalized automated score, e.g., WMF, max, 0.64 (its 0.15 icdf), to the human gold standard scores, using each of the evaluation metrics described in the next subsection. To compute confidence intervals for the evaluation metrics for each variant, we use bootstrapping with 1000 samples [Efron and Tibshirani, 1986].
(3 similarities) × (3 comparisons) × (5 thresholds) = 45
(Uni, R/O, WMF) × (max, min, mean) × (0.05, ..., 0.25)
Figure 5.2: Notation used for the 45 variants of automated pyramid methods. The 5 thresholds correspond to inverse cumulative density function values.
To assess the 45 variants, we compare their automated pyramid scores to the manual scores. By
our criterion 1, an automated score that correlates well with manual scores for summaries of a given
text could be used to indicate how well students rank against other students. We report several types
of correlation tests. Pearson's coefficient tests the strength of a linear correlation between the two
sets of scores; it will be high if the same order is produced, with the same distances between pairs of
scores. The Spearman rank correlation is said to be preferable for ordinal comparisons, where the
absolute unit interval is less relevant. Kendall's tau, an alternative rank correlation, is less sensitive
to outliers and more intuitive: it is the proportion of concordant pairs (pairs in the same order) minus
the proportion of discordant pairs. Since correlations can be high even when differences are uniform,
we use Student's t-test to check whether the differences in score means are statistically significant.
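For concreteness, the three correlation statistics can be written as tie-free sketches in plain Python (a statistics package would normally be used, and ties handled properly):

```python
def pearson(x, y):
    """Linear correlation: covariance normalized by the standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(v):
    """Rank transform (1-based); this sketch does not handle ties."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    for rank, i in enumerate(order):
        r[i] = float(rank + 1)
    return r

def spearman(x, y):
    """Pearson correlation of the rank-transformed scores."""
    return pearson(ranks(x), ranks(y))

def kendall_tau(x, y):
    """Proportion of concordant pairs minus proportion of discordant pairs."""
    conc = disc = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
    return (conc - disc) / (n * (n - 1) / 2)
```

All three return 1 for identically ordered, equally spaced scores and -1 for fully reversed orders.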
We also evaluate at the SCU level, as another set of experiments to assess the 45 variants. In
Perin’s annotation, the correct SCUs mentioned in each student summary are manually identified.
According to criterion 2, the best variant would be able to retrieve the correct SCUs. Therefore, we
use precision, recall and F-score to measure the performance.
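Criterion 2 reduces to set-level precision and recall against the manual SCU annotation; a minimal sketch with a hypothetical `scu_prf` helper:

```python
def scu_prf(predicted, gold):
    """Precision, recall and F-score of the automatically selected SCUs
    against the manually identified ones, both given as sets of SCU ids."""
    tp = len(predicted & gold)  # correctly retrieved SCUs
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# toy example: 3 of 4 selected SCUs are correct, out of 5 gold SCUs
p, r, f = scu_prf({1, 2, 3, 5}, {2, 3, 4, 5, 6})
```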
Variant (with icdf) P (95% conf.), rank S (95% conf.), rank K (95% conf.), rank
WMF, max, 0.64(0.15) 0.93(0.92, 0.94), 1 0.94(0.93, 0.97), 1 0.88(0.85, 0.91), 1
R/O, mean, 0.23(0.15) 0.92(0.91, 0.93), 3 0.93(0.91, 0.95), 2 0.83(0.80, 0.86), 3
R/O, mean, 0.26(0.20) 0.92(0.90, 0.93), 4 0.92(0.90, 0.94), 4 0.80(0.78, 0.83), 5
WMF, max, 0.59(0.10) 0.91(0.89, 0.92), 8 0.93(0.91, 0.95), 3 0.83(0.80, 0.87), 2
WMF, min, 0.40(0.20) 0.92(0.90, 0.93), 2 0.87(0.84, 0.91), 11 0.74(0.69, 0.79), 11
Table 5.2: Five top performing variants out of 45 variants ranked by correlation scores, with confidence interval and rank (P=Pearson's, S=Spearman's, K=Kendall's tau)
5.4.2 Results
The correlation tests indicate that several variants achieve sufficiently high correlations between
scores for students' summaries and the manual gold scores (criterion 1). On all correlation tests, the
highest ranking automated method is WMF, max, 0.64; this similarity threshold corresponds to the
0.15 icdf. As shown in Table 5.2, it has the best Pearson correlation (0.93), Spearman's correlation
(0.94) and Kendall's tau (0.88). This can be attributed to WMF's ability to go beyond the surface
words and extract more accurate SCU matches. We also observe that R/O achieves better results
than Uni, thanks to R/O's capture of word order.
The differences between the unnormalized scores computed by the automated systems and the
scores assigned by human annotation are consistently positive. Inspection of the SCUs retrieved by
each automated variant reveals that the automated systems tend to identify false positives (to match
more SCUs even when the summary does not cover the SCU). This may result from the dynamic
programming implementation decision to maximize the score. To measure the degree of overlap
between the SCUs that were selected automatically versus manually, we computed recall and
precision for the various methods.
Table 5.3 shows the mean recall, precision (with standard deviations) and F-measure scores
across all five thresholds for each combination of similarity method and method of comparison to
the SCU. The low standard deviations show that recall and precision are relatively similar across
thresholds for each variant. The WMF methods (shown as LCv, for latent vectors, in Table 5.3)
outperform the R/O and unigram overlap methods, indicating that the use of distributional semantics
is a superior approach for pyramid summary scoring than
Variant recall (std) precision (std) F-measure
Uni, min 0.69(0.08) 0.35(0.02) 0.52
Uni, max 0.70(0.03) 0.35(0.04) 0.53
Uni, mean 0.69(0.02) 0.39(0.04) 0.54
R/O, min 0.69(0.08) 0.34(0.01) 0.51
R/O, max 0.72(0.03) 0.33(0.04) 0.52
R/O, mean 0.71(0.06) 0.38(0.02) 0.54
LCv, min 0.61(0.03) 0.38(0.04) 0.49
LCv, max 0.74(0.06) 0.48(0.01) 0.61
LCv, mean 0.75(0.06) 0.50(0.02) 0.62
Table 5.3: SCU selection results: averaged recall, precision and F-measure over the 20 student
summaries, for each combination of similarity method and method of comparison to the SCU (9
categories). The numbers in parentheses are the standard deviations for recall and precision.
methods based on string matching. It is worth noting that the high F-measure scores achieved by
WMF mainly come from precision, which confirms our hypothesis that the unigram and R/O methods
produce too many false positive SCU matches.
In Table 5.4, we also collect the SCU selection performance of the top five variants from Table 5.2.
Generally, the variants achieving better correlation scores for the summaries also perform well on
selecting the SCUs. The table also reveals an interesting observation about the best WMF and R/O
models: WMF beats R/O because it is able to find more SCUs, increasing recall while maintaining
precision.
5.5 Experiments on TAC 2011
We are also interested in the performance of our evaluation method on machine-generated summaries.
Therefore, we apply it to the data set of the traditional summarization task in the Text Analysis
Conference (TAC) 2011. TAC 2011 contains 44 topics. Each topic falls into one of 5 predefined
event categories and contains 10 related news documents. TAC recruited four writers to produce
model summaries for each topic.
Variant recall precision F-measure
WMF, max, 0.64(0.15) 0.78 0.51 0.61
R/O, mean, 0.23(0.15) 0.71 0.48 0.56
R/O, mean, 0.26(0.20) 0.70 0.50 0.57
WMF, max, 0.59(0.10) 0.82 0.51 0.62
WMF, min, 0.40(0.20) 0.54 0.49 0.51
Table 5.4: SCU selection results: averaged recall, precision and F-measure over the 20 student
summaries, for the top five variants in Table 5.2.
There are 50 team submissions in TAC 2011, for which TAC manually calculated the Pyramid
scores. To evaluate the performance of our automated pyramid method, we compute the correlation
between our pyramid scores and the gold standard manual scores.
We test the WMF, max variant with similarity threshold values of 0.59 and 0.64, the best two
variants shown in Table 5.2. The Pearson's correlation is 0.93 and 0.92 for the similarity thresholds
of 0.59 and 0.64, respectively. This demonstrates that the automated pyramid evaluation is also
reliable for differentiating the performance of different methods on machine-generated summaries.
5.6 Summary and Discussion
We extend a dynamic programming approach [Harnly et al., 2005] to automate pyramid scores more
accurately by applying our WMF model to phrase level data. Our contribution mainly results from
principled thresholds for similarity scores, and from extracting latent vector representations for the
short spans of text. We propose two criteria for a good automated pyramid method, and accordingly
design two experiments: evaluation at the summary level (the correlation with the final gold manual
pyramid scores) and at the SCU level (identifying the correct SCUs). We find that the latent
semantics based methods perform best on both criteria for a pedagogically useful automatic metric.
For future work, we are interested in applying our approach to text summarization systems
in an attempt to improve summarization quality. Since our approach is able to identify with
higher precision and recall whether a text snippet contains the key concepts, it could also be helpful
for choosing which n-grams to include in the summary. We hope that by incorporating our model,
the resulting fixed-length summary can convey maximum information from the source documents.
CHAPTER 6. UNSUPERVISED WORD SENSE DISAMBIGUATION 78
Chapter 6
Unsupervised Word Sense
Disambiguation
In this chapter, we study the impact of short text similarity on a lexical semantics task – word sense
disambiguation (WSD). WSD is the task of identifying which sense of a word is used in a given
context. Usually the sense inventory is obtained from a lexicon such as WordNet [Fellbaum, 1998].
In many unsupervised WSD systems, the most important component is a sense similarity measure
that returns a similarity score given two sense IDs. Previous work adopted very simple approaches
to compute the similarity score: most similarity measures use the taxonomy structure of WordNet,
such as jcn [Jiang and Conrath, 1997], while Extended Lesk (elesk) [Banerjee and Pedersen, 2003]
computes the number of overlapping words/phrases between the two sense definitions. The latter
has gained much wider popularity: since many other similarity measures rely on taxonomies, they
can only compute similarity between noun or verb pairs, while adjectives and adverbs do not have
a taxonomic representation structure in WordNet.
Because of the short nature of sense definitions, we believe that exploiting our WMF model can
yield more meaningful sense similarity scores given two sense definitions. We first apply the WMF
model to the sense definition data sets to get a low dimensional representation of the data, based on
which we construct a new sense similarity measure, wmfvec [Guo and Diab, 2012a]. We make some
crucial adjustments to the procedure of sense similarity computation, inspired by notable traits of
the Extended Lesk (elesk) measure. To the best of our knowledge, wmfvec is the first sense
[Figure 6.1 depicts a sense graph for the sentence "I am walking on the bank with my friend":
candidate sense nodes (walk by foot, escort; riverbank, financial bank; friend) are connected by
weighted edges (e.g., 1.2, 0.5, 0.7).]
Figure 6.1: Unsupervised graph-based word sense disambiguation system: several sense nodes
are created for each word; the weights on edges are similarity scores between the two senses; for
simplicity, the edges between the walk senses and the friend sense are not shown. The final
disambiguation decision for each word is the sense node that achieves the maximum indegree value.
similarity measure calculated on low dimensional representations of sense definitions. Extensive
WSD experiments performed on four standard benchmarks demonstrate that our proposed sense
similarity measure outperforms the baselines by a large margin.
6.1 Introduction
To date, many unsupervised WSD systems rely heavily on a sense similarity module that returns
a similarity score given two senses. For example, graph-based WSD systems [Mihalcea et
al., 2006; Guo and Diab, 2010] build a graph where nodes are senses of content words, and the
weight on an edge denotes the sense similarity score between the two senses (Figure 6.1).
Disambiguation is performed by choosing the sense node with the maximum indegree value (the
sum of the weights of the edges associated with the node), since such nodes are perceived to have
the maximum relatedness with the context words.
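A minimal sketch of this indegree scheme follows (`indegree_wsd` is a hypothetical name; `sim` can be any sense similarity measure, and senses of the same word are not connected):

```python
def indegree_wsd(words, senses, sim):
    """senses maps each word to its candidate sense ids. Every sense is
    scored by the summed similarity (indegree) to the senses of all other
    context words, and the highest-scoring sense per word is returned."""
    indegree = {}
    for w1 in words:
        for s1 in senses[w1]:
            indegree[(w1, s1)] = sum(
                sim(s1, s2)
                for w2 in words if w2 != w1   # no edges within one word
                for s2 in senses[w2])
    return {w: max(senses[w], key=lambda s: indegree[(w, s)]) for w in words}

# toy graph: the riverbank sense sits closer to the walking context
pair_sim = {frozenset(("riverbank", "walk_foot")): 1.0,
            frozenset(("financial", "walk_foot")): 0.2}
sim = lambda a, b: pair_sim.get(frozenset((a, b)), 0.0)
choice = indegree_wsd(["bank", "walk"],
                      {"bank": ["riverbank", "financial"],
                       "walk": ["walk_foot"]}, sim)
```

On this toy input, riverbank wins because its edge to the walking sense carries the larger weight, mirroring Figure 6.1.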
Because the sense similarity measure is the most crucial component in many unsupervised WSD
systems, much effort in the lexical semantics community has been devoted to developing useful
sense similarity measures based on knowledge base lexicons such as WordNet [Fellbaum, 1998].
For example, many similarity measures take advantage of the taxonomy structure of WordNet,
which is constructed from "is-a" relations. The sense similarity value is computed based on the
positions of the two senses and their least common subsumer in the noun/verb hierarchy. However,
this only allows noun-noun and verb-verb pair similarity computation, as the other parts of speech
(adjectives and adverbs) do not have a taxonomic representation structure.
The most popular sense similarity measure is the Extended Lesk (elesk) measure [Banerjee
and Pedersen, 2003]. In elesk, a similarity score is computed based on the length of overlapping
words/phrases between two extended dictionary definitions (hence it works for all part-of-speech
types). The definitions are extended by definitions of neighbor senses to discover more overlapping
words. However, exact word matching is lossy. Below are two definitions from WordNet:
• bank#n#1: a financial institution that accepts deposits and channels the money into lending
activities
• stock#n#1: the capital raised by a corporation through the issue of shares entitling holders
to an ownership interest (equity)
Despite the high semantic relatedness of the two senses, the only overlapping words in the two
definitions are a and the, yielding a very low sense similarity score.
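This lossiness is easy to verify directly on the two definitions above (a sketch using whitespace tokenization, with the parenthesized "(equity)" written without parentheses):

```python
bank_def = ("a financial institution that accepts deposits and channels "
            "the money into lending activities")
stock_def = ("the capital raised by a corporation through the issue of "
             "shares entitling holders to an ownership interest equity")

# bag-of-words overlap between the two sense definitions
overlap = set(bank_def.split()) & set(stock_def.split())
```

Only the two function words survive the intersection, so any overlap-based score is near zero despite the topical relatedness.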
Accordingly, we are interested in extracting latent semantics from sense definitions to improve on
elesk. However, the challenge lies in the fact that sense definitions are typically too short/sparse for
latent variable models to learn accurate semantics, since these models are designed for long
documents. For example, topic models such as LDA [Blei et al., 2003] can only find the dominant
topic (the finance topic in bank#n#1 and stock#n#1) without further discernibility. In this case,
many senses will share the same latent semantics profile as long as they are in the same topic/domain,
which results in a cosine similarity of 1, or a cosine similarity of 0 if their dominant topics differ.
To obtain quality latent vector representations for senses and enable meaningful textual similarity,
we apply the WMF model to the WordNet sense definitions. We then show how to use WordNet
neighbor sense definitions to construct a more nuanced sense similarity measure, wmfvec, relying
on the inferred latent semantic vectors of senses. The WordNet neighbor senses are induced by
the sense relations defined in WordNet. We show that wmfvec is superior to elesk and LDA based
approaches on four all-words WSD data sets. To the best of our knowledge, wmfvec is the first sense
similarity measure based on the latent semantics of sense definitions.
6.2 Related Work
Many systems have been proposed for the WSD task over the years. A thorough review of the state
of the art through the late 1990s is presented in [Ide and Veronis, 1998], and more recently in
[Navigli, 2009]. Several techniques have been used to address the problem, ranging from rule
based/knowledge based approaches to unsupervised and supervised machine learning techniques.
In this chapter, we focus on the unsupervised all-words task, where systems are required to disam-
biguate all the content words (nouns, adjectives, adverbs and verbs) in documents.
Sense similarity measures have been core components in many unsupervised WSD systems and
in lexical semantics research and applications. Among these sense similarity measures, elesk is the
most widely used; jcn is sometimes used to obtain the similarity of noun-noun pairs. McCarthy
et al. [2004] tested elesk and jcn for finding the predominant word sense, where elesk produced better
performance. Patwardhan et al. [2005] built a WSD system that integrated elesk similarity values
between target words and their neighbor words. Mihalcea [2005] constructed a graph where
nodes are senses of context words and edges are sense similarity values returned by elesk. Following
the graph framework of [Mihalcea, 2005], researchers [Sinha and Mihalcea, 2007; Guo and Diab,
2010] replaced elesk with jcn for noun-noun and verb-verb pairs and obtained better WSD results.
Sense similarity measures can be broken into three categories on the basis of the resources they
depend on: (1) WordNet relations, (2) the WordNet noun/verb taxonomy plus information content,
and (3) WordNet relations plus sense definitions.
lch [Leacock and Chodorow, 1998], wup [Wu and Palmer, 1994] and path [Pedersen et al., 2004]
are instances of the first group; e.g., path returns the inverse of the shortest path length between two
senses in the WordNet sense graph, where senses are connected by relations. However, all of them
simply use WordNet relations to create a graph connecting the senses, without further exploiting
any information about the senses.
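For instance, the path measure can be sketched as a breadth-first search over the relation graph (hypothetical `path_similarity` helper; adjacency lists stand in for the WordNet sense graph):

```python
from collections import deque

def path_similarity(graph, s1, s2):
    """Inverse of the shortest path length between two senses in an
    undirected sense graph given as adjacency lists (0.0 if unreachable)."""
    if s1 == s2:
        return 1.0
    seen = {s1}
    frontier = deque([(s1, 0)])
    while frontier:
        node, depth = frontier.popleft()
        for nb in graph.get(node, ()):
            if nb == s2:
                return 1.0 / (depth + 1)
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, depth + 1))
    return 0.0

# toy relation graph: a -- b -- c
g = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
```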
The second category, first proposed by Resnik [1995] in the similarity measure res and then
followed by lin [Lin, 1998] and jcn, combines information content with the noun/verb taxonomy
(the hypernym/hyponym relations defined in WordNet). Information content is the frequency
information of a sense. The ideal way to obtain information content is to use a sense-annotated
corpus, but such corpora are expensive to build. Hence the frequency information is estimated in
alternative ways.
in alternative ways. To illustrate how information content (IC) is incorporated into sense similarity
computation, we present the formula of jcn to calculate sense similarity:
sim(s1, s2) = IC(s1) + IC(s2)− 2× IC(LCS(s1, s2)) (6.1)
where LCS(s1, s2) is the least common subsumer of senses s1 and s2 in the taxonomy. It should be
noted that all three measures (res, lin and jcn) use the IC of the least common subsumer of senses
s1 and s2. The disadvantage is clear: they require a taxonomy, so only noun-noun and verb-verb
pair similarities can be computed.
Finally, elesk and Gloss Vector (glsvec) [Patwardhan and Pedersen, 2006] are sense-definition
based similarity measures. Where WordNet relations are not available, elesk and glsvec can still
return similarity values, while the sense similarity measures in the previous two categories fail to do
so. This feature enables them to be applied to other dictionaries, as long as sense definitions are
available. glsvec is similar to elesk except that it converts the definitions into high-dimensional word
space vector representations and returns the cosine similarity of the two sense vectors. Therefore
glsvec also does not extract the latent semantics of definitions.
Our similarity measure wmfvec exploits the same information (sense definitions and WordNet
relations) that elesk uses, and outperforms it significantly on four data sets. Therefore, we believe
wmfvec will be a useful contribution to the lexical semantics community. To the best of our
knowledge, we are the first to construct a sense similarity measure from the latent semantics of
sense definitions.
6.3 A New Sense Similarity Measure – wmfvec
We first run WMF on the WordNet sense definition data sets. Thus, a sense is represented by the
K-dimensional vector induced from its sense definition. A natural way to obtain sense similarity is
to calculate the cosine similarity of the two corresponding K-dimensional vectors. Inspired by
elesk, we make some crucial changes when constructing the sense similarity measure, as explained
in detail in the following.
After applying WMF to the WordNet sense definitions, we can further use the features of WordNet
to construct a better low dimensional representation for senses. The most important feature of
WordNet is that senses are connected to each other through relations such as hypernymy, meronymy,
holonymy, similar attributes, etc. In our experiments, we use all 28 relations defined in WordNet
3.0. We observe that neighbor senses are semantically similar in most cases; for example, air bag is
a meronym of car. Hence the semantics of the neighboring senses can be a good indicator of the
latent semantics of the target sense.
We use these WordNet neighbors in a manner similar to elesk. As shown in section 6.1, sense
definitions are fairly short and do not provide sufficient vocabulary to capture relatedness. To
address this issue, elesk augments a definition with the definitions of its neighbor senses, in order
to yield more overlapping words/phrases. Accordingly, in our method, a sense is represented by the
sum of its original latent vector and its neighbors' latent vectors. Let N(j) be the set of neighbor
senses of sense j; then the new latent vector becomes:
Qnew·,j = Q·,j + ∑k∈N(j) Q·,k (6.2)
It is also worth noting that the similarity score of elesk is not normalized by the length of the sense
definitions. This is understandable, since normalization would give an unfair advantage to short
definitions. Hence we adopt a similar idea: the inner product (instead of the cosine similarity) of the
two resulting low dimensional sense vectors is used to calculate the sense pair similarity. We refer
to our sense similarity measure as wmfvec.
6.4 Experiments
6.4.1 Experiment Setting
Task and data sets: We choose the fine-grained all-words sense disambiguation task for evaluation.
The data sets we use are the all-words tasks in SensEval2 [Palmer et al., 2001], SensEval3
[Snyder and Palmer, 2004], SemEval-2007 [Pradhan et al., 2007], and Semcor [Miller et al., 1993].
Statistics of the annotated senses in the four data sets are listed in Table 6.1. We tune the parameters
of all models based on their performance on SensEval2, and then directly apply the tuned models
to the other three data sets.
data set docs noun adj adv verb
SensEval2 3 1064 465 301 554
SensEval3 3 902 358 - 732
SemEval-2007 3 159 - - 296
Semcor 381 86994 31706 18947 88320
Table 6.1: The statistics of annotated senses in the four WSD data sets, as well as the distribution
per part-of-speech.
Data: The sense inventory is WordNet 3.0 for the four WSD data sets. WMF and LDA are built on
the corpus of sense definitions of two dictionaries: WordNet and Wiktionary.1 We do not link the
senses across dictionaries; hence Wiktionary is only used as additional data for the distributional
models to better learn word latent profiles. All data is tokenized, POS tagged with the Stanford POS
Tagger [Toutanova et al., 2003] and lemmatized,2 resulting in 341,557 sense definitions and
3,563,649 words.
WSD Algorithm: To perform WSD we need two components: (1) a sense similarity measure that
returns a similarity score given two senses, which will be the baselines in the next paragraph; (2)
a disambiguation algorithm that determines which senses to choose as final answers based on the
sense pair similarity scores. We choose the Indegree algorithm used in [Sinha and Mihalcea, 2007;
Guo and Diab, 2010] as our disambiguation algorithm.
Baselines: We compare with (1) elesk, the most widely used sense similarity measure, and (2)
glsvec, which is similar to elesk except that it converts the definitions into high-dimensional word
space vector representations and returns the cosine similarity of the vectors. We use the
implementation of elesk and glsvec in [Pedersen et al., 2004].
The third baseline is (3) ldavec, LDA using Gibbs sampling [Griffiths and Steyvers, 2004].
We calculate the latent vector of a sense definition by summing up P(z|w) of all constituent
words, weighted by Xij (more details can be found in section 2). (4) Finally, we compare wmfvec
1http://en.wiktionary.org/
2The lemmatization is conducted with the WordNet::QueryData package
with a sense similarity combination from a very mature WSD system, jcn+elesk, introduced in
[Sinha and Mihalcea, 2007], where the authors evaluated six sense similarity measures, selected the
best of them and combined them into one system. Specifically, in their implementation they use jcn
[Jiang and Conrath, 1997] for noun-noun and verb-verb pairs, and elesk for other pairs. jcn+elesk
with the Indegree algorithm [Sinha and Mihalcea, 2007] used to be the state-of-the-art system on
SensEval2 and SensEval3.
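The ldavec baseline's definition vector can be sketched as follows, under the assumption that Xij is the word's TF-IDF weight in the definition (all names here are illustrative):

```python
def ldavec(definition_words, p_z_given_w, tfidf):
    """Sum each constituent word's topic distribution P(z|w), weighted by
    the word's weight Xij in the definition (assumed to be TF-IDF)."""
    K = len(next(iter(p_z_given_w.values())))  # number of topics
    vec = [0.0] * K
    for w in definition_words:
        weight = tfidf.get(w, 0.0)
        for z, p in enumerate(p_z_given_w.get(w, [0.0] * K)):
            vec[z] += weight * p
    return vec

# toy topic posteriors over K = 2 topics and toy TF-IDF weights
p_z = {"bank": [0.9, 0.1], "money": [0.8, 0.2]}
weights = {"bank": 2.0, "money": 1.0}
vec = ldavec(["bank", "money"], p_z, weights)
```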
6.4.2 Results
The disambiguation results (K = 100) are summarized in Table 6.2. We also present in Figure
6.2 results using other values of the dimension K for wmfvec and ldavec. Very few words are not
covered (due to failure of lemmatization or POS tag mismatches), and therefore F-measure is
reported.
Based on SensEval2, wmfvec's parameters are tuned as λ = 20, wm = 0.01; ldavec's parameters
are tuned as α = 0.05, β = 0.05. We run WMF on WordNet+Wiktionary for 30 iterations,
and LDA for 2000 iterations. For LDA, a more robust P(w|z) is generated by averaging over the
last 10 sampling iterations. We also set a threshold on elesk similarity values, which yields better
performance: as in [Sinha and Mihalcea, 2007], values of elesk larger than 240 are set to 1, and
the rest are mapped to [0,1].
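The elesk normalization step is a simple capped linear map (sketch):

```python
def normalize_elesk(score, cap=240.0):
    """Map raw elesk scores to [0, 1]: values above the cap become 1,
    the rest are scaled linearly (as in Sinha and Mihalcea, 2007)."""
    return 1.0 if score > cap else score / cap
```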
glsvec vs. elesk: Both glsvec and elesk compute similarity based on surface word matching, yet
glsvec produces much worse WSD results than elesk. The reason may be that glsvec uses cosine
similarity: even when two definitions have many words in common, the similarity value is still
small if one definition is lengthy. Another disadvantage compared to elesk is that glsvec cannot
capture phrases.
elesk vs. wmfvec: wmfvec outperforms elesk consistently in all POS cases (noun, adjective, adverb
and verb) on the four data sets by a large margin (2.9%−4.5% in the total case). Observing the
results per POS, we find that a large improvement comes from nouns. The same trend has been
reported for other distributional methods based on word co-occurrence [Cai et al., 2007; Li et al., 2010;
Guo and Diab, 2011]. More interestingly, wmfvec also improves verb accuracy significantly, which
is notable since verbs are a harder POS to disambiguate.
Data Model Total Noun Adj Adv Verb
SensEval2 random 40.7 43.9 43.6 58.2 21.6
glsvec 49.1 51.8 55.7 66.4 28.5
elesk 56.0 63.5 63.9 62.1 30.8
ldavec 58.6 68.6 60.2 66.1 33.2
wmfvec 60.5 69.7 64.5 67.1 34.9
jcn+elesk 60.1 69.3 63.9 62.8 37.1
jcn+wmfvec 62.1 70.8 64.5 67.1 39.9
SensEval3 random 33.5 39.9 44.1 - 33.5
glsvec 39.8 45.6 54.0 - 24.7
elesk 52.3 58.5 57.7 - 41.4
ldavec 53.5 58.1 60.8 - 43.7
wmfvec 55.8 61.5 64.4 - 43.9
jcn+elesk 55.4 60.5 57.7 - 47.4
jcn+wmfvec 57.4 61.2 64.4 - 48.8
SemEval-2007 random 25.6 27.4 - - 24.6
glsvec 31.6 33.3 - - 30.7
elesk 42.2 47.2 - - 39.5
ldavec 43.7 49.7 - - 40.5
wmfvec 45.1 52.2 - - 41.2
jcn+elesk 44.5 52.8 - - 40.0
jcn+wmfvec 45.5 53.5 - - 41.2
Semcor random 35.26 40.13 50.02 58.90 20.08
glsvec 39.1 42.2 57.2 67.6 23.5
elesk 55.43 61.04 69.30 62.85 43.36
ldavec 58.17 63.15 70.08 67.97 46.91
wmfvec 59.10 64.64 71.44 67.05 47.52
jcn+elesk 61.61 69.61 69.30 62.85 50.72
jcn+wmfvec 63.05 70.64 71.45 67.05 51.72
Table 6.2: The WSD performance, measured by F-measure, of 7 models on each data set, as well as
the performance per part-of-speech. The models ldavec and wmfvec are trained with latent
dimension K = 100.
[Figure 6.2 contains four line plots, (a) SensEval2, (b) SensEval3, (c) SemEval-2007 and (d) Semcor,
showing F-measure (%) on the y-axis (45 to 60) against the latent dimension K (50 to 150) for
wmfvec and ldavec.]
Figure 6.2: The WSD performance, measured by F-measure, of ldavec and wmfvec on each data
set. The latent dimension K varies from 50 to 150.
ldavec vs. wmfvec: ldavec also performs very well, again proving the superiority of latent semantics
over surface word matching. However, wmfvec outperforms ldavec in every POS case (by at least
+1% in the total case) except Semcor adverbs. The results in Figure 6.2, where different dimensions
are used for ldavec and wmfvec, verify that the trend is consistent. These results confirm our
argument that, given the same text data, WMF outperforms LDA at modeling the latent semantics of
senses by exploiting missing words. Another interesting observation is that the number of
dimensions does not have a large impact on performance when K ≥ 100.
jcn+elesk vs. jcn+wmfvec: jcn+elesk is a very mature sense similarity combination that takes
advantage of the strong performance of jcn on noun-noun and verb-verb pairs. Although wmfvec
does much better than elesk, wmfvec alone is sometimes outperformed by jcn+elesk on nouns and
verbs. Therefore, to beat jcn+elesk, we replace the elesk in jcn+elesk with wmfvec (hence
jcn+wmfvec). Similar to [Sinha and Mihalcea, 2007], we normalize the similarity values of wmfvec
such that values greater than 400 are set to 1, and the rest are mapped to [0,1]. We choose the value
400 based on the WSD performance on the tuning set SensEval2. As expected, the resulting
jcn+wmfvec further improves on jcn+elesk in all cases. Moreover, jcn+wmfvec produces results
comparable to state-of-the-art unsupervised systems on SensEval2 (61.92% F-measure in [Guo and
Diab, 2010]) and SensEval3 (57.4% in [Agirre and Soroa, 2009]). This shows that wmfvec is robust:
it not only performs very well individually, but can also be easily combined with existing evidence
such as that represented by jcn.
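The combination scheme can be sketched as follows (hypothetical `combined_sim`; `jcn` and `wmfvec` are any callables returning raw scores):

```python
def combined_sim(s1, s2, pos1, pos2, jcn, wmfvec):
    """jcn for noun-noun and verb-verb pairs, normalized wmfvec otherwise;
    raw wmfvec values above 400 map to 1, the rest linearly to [0, 1]."""
    if pos1 == pos2 and pos1 in ("n", "v"):
        return jcn(s1, s2)
    w = wmfvec(s1, s2)
    return 1.0 if w > 400 else w / 400.0
```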
6.4.3 Analysis
We look closely into the WSD results to obtain an intuitive sense of what is and is not captured by
wmfvec. We mainly compare wmfvec with the surface word matching approach, elesk. The
different behaviors, and hence the different performance, of wmfvec and elesk are exhibited by the
sense similarity scores listed in Table 6.3. The first example involves the target word mouse in the
following context:
• ... in experiments with mice that a gene called p53 could transform normal cells into cancerous ones...
sense similarity target senses gene#n#1 cell#n#2
wmfvec animal mouse 27.00 16.57
computer mouse 7.14 0.01
elesk animal mouse 68 78
computer mouse 80 68
sense similarity target senses stop#v#1 chat#v#1
wmfvec church place 3.44 4.89
church service 10.53 6.26
elesk church place 48 17
church service 12 6
Table 6.3: The similarity values of wmfvec and elesk in two examples. The first example is the
target word mouse in a biology context that contains words gene, cell, etc. The second example is
the word church in a context that involves stop, chat.
elesk returns the wrong sense computer mouse, due to the lack of overlapping words between the
sense definition of animal mouse and the context words. However, wmfvec chooses the correct sense
animal mouse, by recognizing the biology dimension of the animal mouse sense and the related
context words gene, cell, cancerous.
We also perform some basic analysis of the items that wmfvec is not capable of capturing. A
negative example shows a deficiency of distributional similarity models:
• ... stop to chat at the church door...
Here church clearly refers to the meaning "a place for public (especially Christian) worship".
wmfvec chooses a similar sense, "a service conducted in a house of worship". wmfvec may not have
a specific latent dimension for the concept place, hence it cannot differentiate place from service.
In contrast, elesk can distinguish place from service via surface word matching. The exact sense
similarity scores in Table 6.3 support our hypothesis.
6.5 Summary and Discussion
We construct a sense similarity measure, wmfvec, based on the latent semantics of WordNet sense
definitions, by explicitly modeling missing words in the weighted matrix factorization framework.
To the best of our knowledge, we are the first to construct a sense similarity measure based on the
latent semantics of sense definitions. Experimental results on four fine-grained all-words WSD data
sets show that wmfvec significantly outperforms the previous definition-based similarity measures
elesk and glsvec, as well as LDA based vectors. Moreover, jcn+wmfvec produces results comparable
to state-of-the-art systems on the SensEval2 and SensEval3 data sets.
Although only WSD experiments are conducted in this chapter, our method is applicable to many
other sense related tasks. For example, it could be applied to first sense acquisition [McCarthy et
al., 2007], which aims to find the most frequent sense of a word. Given the word embeddings from
the WMF model and the sense embeddings from our method, the first sense could be chosen as the
one with the maximum cosine similarity to the target word. The wide applicability of our method
stems from the fact that we learn embeddings for senses, which is quite unique.
In future work, we look forward to further exploiting the features of WordNet to bring more
semantics into the sense representations and strengthen the quality of the sense vectors. For example,
the part-of-speech and super tags of the senses can enrich the syntactic information of senses,
whereas the current model only captures semantic relatedness. We believe this will result in a more
robust sense similarity measure, as the current framework has very little information about a sense
entry other than its definition. In addition, modeling another WordNet feature, antonymy, in wmfvec
is very challenging yet quite useful, since such a similarity measure, with sentiment polarity
incorporated, would be beneficial for many sentiment related tasks.
Chapter 7
Linking Tweets to News
In this chapter, we focus on applying our model to social media data. A common observation on Twitter data is that the short nature of tweets makes it very hard for NLP tools to understand the data. Sentiment analysis is one example: a bag-of-words SVM model performs much better on paragraph-level product reviews than on sentence-level reviews, as shown in [Wang and Manning, 2012; Li et al., 2012]. Therefore, we propose the Linking Tweets to News task, which aims to find the most relevant news article for a tweet if the tweet discusses a newsworthy event. We believe the news article serves as a much larger context for the tweet, so that NLP tools can better understand Twitter data; in the sentiment analysis case, for example, a significant amount of sentiment clues can be supplied by the news article.
A straightforward solution to the linking tweets to news task is: (1) first apply our WMF model to both the Twitter data and the news data; (2) for each tweet, choose the news article with the maximum similarity score to the tweet as the most relevant one, as shown in Figure 7.1.
However, this simple approach ignores a distinct characteristic of Twitter data: due to the length constraint, a tweet does not retain all the information of a news event; in most cases it covers only one aspect of the event. This can lead to inaccurate linking. To this end, we propose to search for the missing information in other tweets on the same topic as the target tweet, in an attempt to complete its full semantic picture. This is motivated by the observation that many tweets are triggered by the same event and thereby become dependent on each other.
To find relevant tweets for a target tweet, we mainly exploit three features: hashtags, named
[Figure 7.1 graphic: a tweet ("Pray for Mali...") with its latent vector, alongside candidate news titles ("French troops attack rebels in Mali", "With California Rebounding, Governor Pushes Big Projects", "Pakistani province in mourning after blasts kill scores"), each with its own latent vector]
Figure 7.1: The general framework for linking a tweet to its most relevant news article: first transform the textual data into the latent representation, then choose the article with the maximum cosine similarity score.
entities and timestamps. We extend the original WMF model and incorporate correlation between
short texts, such as the target tweet and relevant tweets [Guo et al., 2013]. Our experiments analyze
the impact of the three individual features, and demonstrate significant improvement of the new
model over the baselines.
7.1 Introduction
Recently there has been increasing interest in language understanding of Twitter messages. Some researchers [Speriosu et al., 2011; Brody and Diakopoulos, 2011] focused on sentiment analysis of Twitter feeds and opinion mining towards targets such as political issues or politicians [Tumasjan et al., 2010; Conover et al., 2011; Jiang et al., 2011]. Others [Ramage et al., 2010; Jin et al., 2011] summarized tweets using topic models. Although these NLP techniques are mature, their performance on tweets inevitably degrades, mainly due to the inherent sparsity in short
texts.1 In the case of sentiment analysis, many previous efforts have reported an accuracy drop from around 87% on a paragraph-level movie review dataset released in [Pang and Lee, 2004] to around 75% [Wang and Manning, 2012] on a sentence-level movie review dataset released in [Pang and Lee, 2005]. The problem worsens when existing NLP systems can hardly produce any results given such short texts. Consider the following tweet:
Pray for Mali...
As shown in [Benson et al., 2011; Ritter et al., 2012], a typical event extraction/discovery system [Ji
and Grishman, 2008] would likely be unable to discover the war event due to the lack of contextual
clues, and thus fails to shed light on the user’s focus/interests.
To enable the NLP tools to better understand Twitter feeds, we propose the task of linking a
tweet to a news article that is relevant to the tweet, thereby augmenting the context of the tweet. For
example, we want to supplement the implicit context of the above tweet with a news article such as one entitled:
State of emergency declared in Mali
To address the Linking-Tweets-to-News task, we face two main challenges: (1) Tweets are too short. In our Twitter data set, a tweet contains only 14 words on average. It is very hard to pinpoint the relevant news article based on so little information. (2) Tweets are incomplete, in the sense that usually only one aspect of the event is covered. In the Pray for Mali example, the tweet only contains the location Mali, while the event is about the French army's participation in the Mali war. In this scenario, we would like to find the missing dimensions of the tweet, such as French and war, from other complementary short texts, to complete the semantic picture of the Pray for Mali tweet.
For the first challenge, we can directly apply our WMF model to the tweets to generate low-dimensional representations, since WMF handles short text contexts very well by modeling missing words. After that, we compute cosine similarities and choose the most relevant news document according to the similarity values.
For the second issue, we extend the WMF model and incorporate the inter short text correlations
1Apart from the short context issue, tweets exhibit other irregularities of social media data, such as slang, disfluency,
ungrammaticality, informality [Eisenstein, 2013].
(relevance between two texts) into the dimension reduction model. We show that using a tweet-specific feature (hashtags) and a news-specific feature (named entities), as well as temporal constraints, we are able to extract relevant texts that may be complementary to the target tweet. We focus on explicitly integrating these text relevance relations into the matrix factorization framework; accordingly, the semantic picture of a tweet is completed by receiving semantics from its related tweets.
We created a data set of news and tweets, where the ground truth (the most relevant news article for a tweet) is automatically obtained by extracting the URL in the tweet. Our experiments show significant improvement of our new model over the baselines under three different evaluation metrics.
7.2 Related Work
We target a new task, linking a tweet to a news article, which is related to several existing natural language processing tasks. In the remainder of this section, we briefly introduce the related tasks and highlight the differences among them.
Modeling Tweets in a Latent Space: Ramage et al. [2010] leveraged hashtags to improve the latent representation of tweets in an LDA framework, Labeled-LDA [Ramage et al., 2009], treating each hashtag as a label. Jin et al. [2011] proposed an LDA-based model for Twitter data by incorporating the documents referred to by URLs in tweets. The semantics of the long documents were transferred to the topic
distribution of tweets. Evaluated on tweet clustering, the new model increased the purity score from
0.28 to 0.39.
News recommendation: A news recommender system [Claypool et al., 1999; Corso et al., 2005; Lee and Park, 2007] recommends news articles to a user based on features (e.g., keywords, tags, categories) of the documents the user likes, which form a training set. Our work resembles news recommendation in searching for a related news article. However, we aim at "recommending" news articles based only on a tweet, which is a much smaller context than the set of favorite documents chosen by a user.
Linking on Tweets: In tweet ranking [Duan et al., 2010], the availability of a URL is an important feature. However, one possible bottleneck preventing their approach from broader application is that
the number of tweets with an explicit URL is very limited. Similarly, Huang et al. [2012] proposed a graph-based framework to propagate tweet ranking scores, in which relevant web documents were found helpful for discovering informative tweets. Both works could take advantage of ours to either extract potential URL features or retrieve topically similar web documents.
Sankaranarayanan et al. [2009] aimed at capturing tweets that correspond to late breaking news. They adopted a simple approach: clustering tweets and choosing a URL-referred news article in those tweets as the related news for the whole cluster (the URLs are visible to their system). Compared to our work, their approach lacks variety, since the whole cluster of tweets is assigned the same news URL. The work presented in [Abel et al., 2011] is the most closely related to ours; however, their focus is user profiling, so they did not provide a paired tweet/news data set and had to conduct manual evaluation.
7.3 Searching Complementary Texts via Twitter/News Features
WMF exploits the text-to-word information in a very nuanced way, whereas the dependency between texts is ignored (Figure 7.2a). However, in the social media context, many tweets and news articles are in fact dependent on or complementary to each other, as they are triggered by the same event. In this section, we introduce how to extract similar tweets to find the missing elements of a given tweet. We exploit three features: hashtags, named entities and timestamps. These features help induce better latent representations for tweets/news.
7.3.1 Hashtags and Named Entities
Hashtags highlight the topics of a tweet, e.g., The #flu season has started. We believe two tweets sharing the same hashtag should be related, hence we place a link between the two tweet nodes to explicitly inform the model that they should be similar (Figure 7.2b).
We find that only 8,701 of the 34,888 tweets in our collected data set include hashtags. In fact, we observe that many hashtag words are mentioned in tweets without explicitly being tagged with #. Hence, we adopt a simple but effective approach to overcome the hashtag sparseness issue: we collect all the hashtags in the dataset, and automatically hashtag any word in a tweet if that word appears hashtagged in any other tweet. After the automatic hashtag discovery, we start extracting
relevant tweets: for each tweet, and for each hashtag it contains, we extract k tweets that contain
this hashtag, assuming they are complementary to the target tweet, and place a link between the k
tweets and the target tweet, as in Figure 7.2b.2
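The hashtag auto-tagging and link extraction steps can be sketched as below; the tweet tuple format and all names are illustrative assumptions, not the exact implementation:

```python
import re
from collections import defaultdict

def build_hashtag_links(tweets, k=4):
    """tweets: list of (tweet_id, text, timestamp) tuples (hypothetical format).
    Returns a set of undirected (id1, id2) links between tweets sharing a hashtag."""
    # Step 1: collect every explicit hashtag in the collection.
    known = {m.lower() for _, text, _ in tweets
             for m in re.findall(r"#(\w+)", text)}
    # Step 2: auto-tag -- a known hashtag word counts even without '#'.
    by_tag = defaultdict(list)
    for tid, text, ts in tweets:
        words = {w.lower() for w in re.findall(r"\w+", text)}
        for tag in words & known:
            by_tag[tag].append((ts, tid))
    # Step 3: for each tweet and tag, link to the k chronologically
    # closest tweets carrying the same tag (cf. footnote 2).
    links = set()
    for tag, entries in by_tag.items():
        for ts, tid in entries:
            others = sorted((e for e in entries if e[1] != tid),
                            key=lambda e: abs(e[0] - ts))[:k]
            links.update((min(tid, o[1]), max(tid, o[1])) for o in others)
    return links
```

Storing links as sorted id pairs deduplicates the symmetric links that arise when two tweets each select the other as a neighbor.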
Named entities are among the most salient features in event-based text data. Directly applying Named Entity Recognition (NER) tools to news titles or tweets results in many errors [Liu et al., 2011b], due to the noisy nature of the data, such as slang in tweets and inconsistent capitalization in news titles. Accordingly, we first apply the NER tool to news summaries, then label named entities in the tweets in the same manner as the hashtags: if a string in the tweet matches a named entity from the summaries, the string is labeled as a named entity in the tweet.3 To create the similar tweet set, we find k tweets that also contain the named entity.
7.3.2 Temporal Relations
Intuitively, tweets published in the same time interval have a larger chance of being on the same topic than those that are not chronologically close [Wang and McCallum, 2006]. However, we cannot simply assume any two tweets are similar based only on the timestamp. Therefore, for each tweet we link it to the k most similar tweets whose publication time is within 24 hours of the target tweet's timestamp. To find the most similar ones, we use the latent representations returned by the WMF model to measure the similarity of two tweets.
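The temporal linking step can be sketched as follows, under the assumptions that timestamps are unix seconds and the WMF vectors are already available (the names are ours):

```python
import numpy as np

DAY = 24 * 3600  # 24 hours in seconds

def temporal_links(ids, times, vecs, k=4):
    """Link each tweet to its k most similar tweets (by cosine over the WMF
    latent vectors) among those published within 24 hours of it.

    ids   : list of tweet ids
    times : parallel list of unix timestamps
    vecs  : (n, K) array of WMF latent vectors, row-aligned with ids
    """
    norm = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    links = set()
    for i in range(len(ids)):
        # Candidates inside the 24-hour window, excluding the tweet itself.
        cand = [j for j in range(len(ids))
                if j != i and abs(times[j] - times[i]) <= DAY]
        cand.sort(key=lambda j: -float(norm[i] @ norm[j]))
        for j in cand[:k]:
            links.add((min(ids[i], ids[j]), max(ids[i], ids[j])))
    return links
```

The time window first restricts the candidate set, and only then is latent-space similarity used to rank candidates, matching the two-stage filtering described above.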
7.3.3 Authorship
We also experimented with other features such as authorship, and found that it does not contribute positively to this problem. While authorship information helps in news/tweet recommendation for a user [Corso et al., 2005; Yan et al., 2012], it is too general for this task, where we aim at "recommending" a news article for a single tweet. The results of using the author subgraph can be found in Table 7.2.
2If there are more than k tweets found, we choose the top k ones whose publishing timestamps are most chronologi-
cally close to that of the target tweet.
3Note that there are some false positive named entities detected such as apple. We plan to address removing noisy
named entities and hashtags in our future work.
7.3.4 Creating Relations on News
We can also extract the three subgraphs (based on hashtags, named entities and time) for news articles. However, automatically tagging hashtags or named entities leads to much worse performance (around 93% ATOP, a 3% decrease from the baseline WMF). This is because a news article is long enough to contain many hashtag words and named entities, some of which are not very relevant to the theme of the event, resulting in noisy matching. Therefore we only extract temporal relations for news articles.
7.4 WMF on Graphs
We now focus on incorporating the links generated in the previous section into the WMF model.
If two texts are connected by a link, they should be semantically similar, i.e., share a similar latent profile. In the matrix factorization framework, we would like the latent vectors of two linked text nodes Q·,j1, Q·,j2 to be as similar as possible, i.e., their cosine similarity should be close to 1. To implement this, we add a regularization term to the objective function of WMF (equation 2.3) for each linked pair Q·,j1, Q·,j2 in Figure 7.2b:
\[
\delta \cdot \left( \frac{Q_{\cdot,j_1} \cdot Q_{\cdot,j_2}}{|Q_{\cdot,j_1}|\,|Q_{\cdot,j_2}|} - 1 \right)^2 \tag{7.1}
\]
where |Q·,j| denotes the length of the vector Q·,j. The coefficient δ controls the importance of the text-to-text links: a larger δ puts more weight on the text-to-text links and less on the text-to-word links. We refer to this model as WMF-G (WMF on graphs); its graphical model is illustrated in Figure 7.2b.
Alternating Least Squares [Srebro and Jaakkola, 2003] is used for inference in weighted matrix factorization. However, alternating least squares is no longer directly applicable with the new regularization term (equation 7.1), which involves the lengths of the text vectors |Q·,j| and hence is not in quadratic form. We therefore approximate the objective function by treating the vector lengths |Q·,j| as fixed values
[Figure 7.2 graphic: (a) Applying WMF on tweets and news data sets: tweet nodes t1–t3 and news nodes n1–n2 connected only to their word nodes w1–w8. (b) The WMF-G model: the same nodes, with additional text-to-text edges, e.g., a #healthcare hashtag edge, an Obama named entity edge, and a temporal edge]
Figure 7.2: The tweet nodes t and news nodes n are connected by hashtags, named entities or
temporal edges. For simplicity, the missing tokens are not shown in the figure. All the grey nodes
are observed information, such as TF-IDF values, while white nodes are latent vectors to be inferred.
during the alternating least squares iterations:
\[
\begin{aligned}
P_{\cdot,i} &= \left( Q W^{(i)} Q^{\top} + \lambda I \right)^{-1} Q W^{(i)} X_{i,\cdot}^{\top} \\
Q_{\cdot,j} &= \left( P W^{(j)} P^{\top} + \lambda I + \delta L_j^2\, Q_{\cdot,n(j)}\, \mathrm{diag}\!\left(L_{n(j)}\right)^2 Q_{\cdot,n(j)}^{\top} \right)^{-1}
\left( P W^{(j)} X_{\cdot,j} + \delta L_j\, Q_{\cdot,n(j)} L_{n(j)} \right)
\end{aligned} \tag{7.2}
\]
We define n(j) as the set of linked neighbors of short text j, and Q·,n(j) as the latent vectors of j's neighbors. The reciprocals of the lengths of these vectors in the current iteration are stored in L_n(j); similarly, the reciprocal of the length of the text vector Q·,j is L_j. W(i) = diag(W_i,·) is a diagonal matrix containing the ith row of the weight matrix W, and W(j) = diag(W·,j) is defined analogously from the jth column.
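The approximate ALS updates above can be sketched with dense NumPy operations as follows. This is an illustrative implementation of equation 7.2, not the exact thesis code: it loops over rows and columns explicitly, whereas a practical implementation would exploit the structure of W (a constant small weight for all missing words) for efficiency.

```python
import numpy as np

def als_sweep(X, W, P, Q, neighbors, lam=20.0, delta=3.0):
    """One alternating-least-squares sweep for WMF-G (illustrating eq. 7.2).

    X : (M, N) TF-IDF matrix (words x texts)
    W : (M, N) weight matrix (small constant for missing words)
    P : (K, M) word latent vectors,  Q : (K, N) text latent vectors
    neighbors : list of lists; neighbors[j] holds indices of texts linked to j
    Vector lengths |Q.,j| are treated as constants from the previous sweep.
    """
    K = P.shape[0]
    I = np.eye(K)
    # Word vector updates P.,i (the link regularizer does not involve P).
    for i in range(X.shape[0]):
        Wi = np.diag(W[i, :])
        P[:, i] = np.linalg.solve(Q @ Wi @ Q.T + lam * I, Q @ Wi @ X[i, :])
    # Reciprocal vector lengths L_j, held fixed during the Q updates.
    L = 1.0 / np.linalg.norm(Q, axis=0)
    for j in range(X.shape[1]):
        Wj = np.diag(W[:, j])
        A = P @ Wj @ P.T + lam * I
        b = P @ Wj @ X[:, j]
        n = neighbors[j]
        if n:
            Qn = Q[:, n]               # (K, |n|) neighbor latent vectors
            Ln = L[n]                  # their reciprocal lengths
            A += delta * L[j] ** 2 * (Qn * Ln ** 2) @ Qn.T
            b += delta * L[j] * Qn @ Ln
        Q[:, j] = np.linalg.solve(A, b)
    return P, Q
```

The column-scaling trick `(Qn * Ln ** 2) @ Qn.T` computes Q·,n(j) diag(L_n(j))² Q·,n(j)ᵀ without materializing the diagonal matrix.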
7.5 Experiments
7.5.1 Experiment Setting
Task and Data: Given the text of a tweet, a system aims to find the most relevant news article. For gold standard annotation, we harvest all tweets that have a single URL link to a CNN or NYTIMES news article, dated from the 11th to the 27th of January, 2013. In evaluation, we consider this URL-referred news article as the gold standard, i.e., the most relevant document for the tweet. We remove the URL from the text of the tweet so that URLs are invisible to the algorithms. We also collect all news articles from both the CNN and NYTIMES RSS feeds during the same timeframe. Each tweet entry has a published time, author, text and URL; each news entry contains a published time, title, news summary and URL. The tweet/news pairs are extracted by matching URLs. We manually filter "trivial" tweets whose content is simply the news title or the news summary. The final data set has 34,888 tweets and 12,704 news articles.
For our task evaluation, ideally we would like the system to identify exactly the news article referred to by the URL within each tweet in the gold standard. However, this is very difficult given the large number of potential news article candidates, especially news documents with slight variations. Therefore, the systems are measured by the ranking performance of the URL-referred news document.
We use three metrics for evaluating the ranking of the correct news article:
• Area under the top-k recall curve (ATOP), as used in the concept definition retrieval task in
[Guo and Diab, 2012b]. Basically, it is the normalized ranking ∈ [0, 1] of the correct news article among all candidate articles: ATOP = 1 means the URL-referred news article has the highest similarity value with the tweet among all news candidates; ATOP = 0.95 means its similarity value is larger than that of 95% of the candidates, i.e., it is within the top 5% of the candidates. ATOP is calculated as follows:
\[
\mathrm{ATOP} = \int_{0}^{1} \mathrm{TOPK}(k)\, dk \tag{7.3}
\]
where TOPK(k) = 1 if the URL-referred news article is in the "top k" list, and TOPK(k) = 0 otherwise. Here k ∈ [0, 1] is the relative position (when k = 1, the "top k" list contains all the candidates).
• Reciprocal Rank (RR), which is the reciprocal of the rank of the correct news article, e.g.,
RR = 1/3 if the correct news article is ranked at the 3rd highest place in the returned list.
• Top 10 recall rate (TOP10), e.g., TOP10 = 1 if the correct news article is among the top 10 of the returned list, otherwise TOP10 = 0.
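For a single tweet, the three metrics can be computed from the similarity scores as sketched below; the tie-breaking rule (a tied candidate counts as outranking the correct article) is our assumption:

```python
def ranking_metrics(correct_score, candidate_scores):
    """ATOP, RR and TOP10 for one tweet.

    correct_score    : similarity of the URL-referred article with the tweet
    candidate_scores : similarities of the other candidate articles
    """
    # ATOP: fraction of candidates the correct article outranks.
    beaten = sum(s < correct_score for s in candidate_scores)
    atop = beaten / len(candidate_scores)
    # Rank of the correct article (ties counted pessimistically).
    rank = 1 + sum(s >= correct_score for s in candidate_scores)
    rr = 1.0 / rank
    top10 = 1.0 if rank <= 10 else 0.0
    return atop, rr, top10
```

The reported numbers in Table 7.1 would then be averages of these per-tweet scores over the whole test set.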
Similar to [Guo and Diab, 2012b], for each tweet we collected as the candidate set the 1,000 news articles published prior to the tweet whose dates of publication are closest to that of the tweet. The cosine similarity score between the URL-referred news article and the tweet is compared against the scores of these 1,000 news articles to calculate the three metric scores. 10% of the gold standard tweet/news pairs are used as a development set, on which all model parameters are tuned.
Corpora: We use the same corpora as in [Guo and Diab, 2012b]: the Brown corpus (each sentence treated as a document) and the sense definitions of Wiktionary and WordNet [Fellbaum, 1998]. The tweets and news articles are included in the corpus as well, yielding a total of 441,258 short texts and 5,149,122 tokens.
Baseline: We present 4 baselines: 1. An Information Retrieval model (IR), which simply treats a tweet as a document and performs traditional surface word matching. 2. LDA-θ with Gibbs sampling as the inference method; we use the inferred topic distribution θ as the latent vector representing the tweet/news. 3. LDA-wvec, where the latent vector is the average of the word latent vectors P(z|w)
Models     Parameters          ATOP                  TOP10                 RR
                               dev       test        dev       test       dev       test
IR         -                   90.795%   90.743%     73.478%   74.103%    46.024%   46.281%
LDA-θ      α = 0.05, β = 0.05  81.368%   81.251%     32.328%   31.207%    13.134%   12.469%
LDA-wvec   α = 0.05, β = 0.05  94.148%   94.196%     53.500%   53.952%    28.743%   27.904%
WMF        -                   95.964%   96.092%     75.327%   76.411%    45.310%   46.270%
WMF-G      k = 3, δ = 3        96.450%   96.543%     76.485%   77.479%    47.516%   48.665%
WMF-G      k = 5, δ = 3        96.613%   96.701%     76.029%   77.176%    47.197%   48.189%
WMF-G      k = 4, δ = 3        96.510%   96.610%     77.782%   77.782%    47.917%   48.997%
Table 7.1: Performance for Linking-Tweets-to-News under three evaluation metrics (latent dimen-
sion K = 100 for LDA/WMF/WMF-G)
weighted by TF-IDF. 4. WMF. In these baselines, hashtags and named entities are simply treated as
words.
To curtail variation in results due to randomness, each reported number is the average of 10 runs. For WMF and WMF-G, we assign the same initial random values and run 20 iterations. In both systems we fix the missing-word weight at wm = 0.01 and the regularization coefficient at λ = 20, the best configuration of WMF found in Chapter 2. For LDA-θ and LDA-wvec, we run Gibbs-sampling-based LDA for 2000 iterations and average the estimated variables over the last 10 iterations.
7.5.2 Results
Table 7.1 summarizes the performance of the baselines and WMF-G at latent dimension K = 100.
All the parameters are chosen based on the development set. For WMF-G, we try different values
of k (the number of neighbors linked to a tweet/news for a hashtag/NE/time constraint) and δ (the
weight of link information). We decided to integrate the links in four subgraphs: (a) hashtags in
tweets; (b) named entities in tweets; (c) timestamp in tweets; (d) timestamp in news articles. For
LDA we tune the hyperparameters α (the Dirichlet prior on the topic distribution of a document) and β (the Dirichlet prior on the word distribution of a topic). It is worth noting that ATOP measures the overall ranking among 1,000 samples, whereas TOP10/RR focus more on whether the ground truth news article is among the first few returned results.
[Figure 7.3 plots: panels (a) ATOP, (b) TOP10 and (c) RR, each showing dev and test curves as δ ranges over 0–4]
Figure 7.3: Impact of the link weight δ of model WMF-G on the development and test sets evaluated by the three evaluation metrics; latent dimension K = 100, and the neighbor number is k = 4.
[Figure 7.4 plots: panels (a) ATOP, (b) TOP10 and (c) RR, comparing WMF and WMF-G as the dimension K ranges over 50–150]
Figure 7.4: Impact of the latent dimension K of model WMF-G on the test set evaluated by the three metrics; the neighbor number is fixed at k = 4, and K varies from 50 to 150.
Conditions                        Links     ATOP                TOP10               RR
                                            dev       test      dev       test      dev       test
hashtag tweets                    375,371   +0.397%   +0.379%   +1.015%   +1.021%   +0.504%   +0.641%
NE tweets                         164,412   +0.141%   +0.130%   +0.598%   +0.479%   +0.278%   +0.294%
time tweets                       139,488   +0.126%   +0.136%   +0.512%   +0.503%   +0.241%   +0.327%
time news                          50,008   +0.036%   +0.026%   +0.156%   +0.256%   +1.890%   +1.924%
full model (all 4 subgraphs)      573,999   +0.546%   +0.518%   +1.556%   +1.371%   +2.607%   +2.727%
full model minus hashtag tweets   336,963   +0.288%   +0.276%   +1.129%   +1.037%   +2.488%   +2.541%
full model minus NE tweets        536,333   +0.528%   +0.503%   +1.518%   +1.393%   +2.580%   +2.680%
full model minus time tweets      466,207   +0.457%   +0.426%   +1.281%   +1.145%   +2.449%   +2.554%
full model minus time news        523,991   +0.508%   +0.490%   +1.300%   +1.190%   +0.632%   +0.785%
author tweets                      21,318   +0.043%   +0.042%   +0.028%   +0.057%   −0.003%   −0.017%
full model plus author tweets     593,483   +0.575%   +0.545%   +1.465%   +1.336%   +2.415%   +2.547%
Table 7.2: Contribution of the hashtag/named entity/temporal/author subgraphs, with K = 100, k = 4, δ = 3, measured by the gain over the baseline WMF.
As reported in [Guo and Diab, 2012b], LDA-θ has the worst results, a consequence of directly using the inferred topic distribution θ of a text: the inferred topic vector has only a few non-zero values, hence a lot of information is missing. LDA-wvec preserves more information by creating a dense latent vector from the topic distribution of each word P(z|w), and thus does much better in ATOP.
It is interesting that the IR model has a very low ATOP (90.795%) but an acceptable RR (46.281%), in contrast to LDA-wvec with a high ATOP (94.148%) and a low RR (27.904%). This is caused by the nature of the two models. LDA-wvec identifies global coarse-grained topic information (such as politics vs. economics), hence it achieves a high ATOP by excluding the most irrelevant news articles; however, it does not distinguish fine-grained differences such as Hillary vs. Obama. The IR model exerts the opposite influence via word matching: it ranks a correct news article very high if overlapping words exist (leading to a high RR), but very low if no words overlap (hence a low ATOP).
We can conclude that WMF is a very strong baseline given that it achieves high scores on all three metrics. As a dimension reduction model, it captures global topics (+1.89% ATOP over LDA-wvec); moreover, by explicitly modeling missing words, the existence of a word is also encoded in the latent vector (+2.31% TOP10 and −0.011% RR relative to the IR model).
Even with WMF being a very challenging baseline, WMF-G still significantly improves all 3
metrics. In the case k = 4, δ = 3, compared to WMF, WMF-G gains +1.371% TOP10, +2.727% RR, and +0.518% ATOP (a significant improvement of the ATOP value considering that it is averaged over roughly 30,000 data points at an already high level of 96%, reducing the error rate by 13%). All improvements of WMF-G over WMF are statistically significant at the 99% confidence level under a two-tailed paired t-test.
We also present results using different numbers of links k in WMF-G in Table 7.1. We experimented with k = {3, 4, 5}; k = 4 is found to be the overall optimal value (although k = 5 has a better ATOP). Figure 7.3 shows the influence of δ = {0, 1, 2, 3, 4} on each metric when k = 4. Note that when δ = 0 no links are used, which reduces to the baseline WMF. We can see that using links is always helpful. When δ = 4, we obtain a higher ATOP value but lower TOP10 and RR.
Figure 7.4 illustrates the impact of the dimension K = {50, 75, 100, 125, 150} on WMF and WMF-G (k = 4) over the test set. The trends hold across K values with a consistent improvement; generally a larger K leads to better performance, and in all conditions WMF-G outperforms WMF.
7.5.2.1 Contribution of Subgraphs
We are interested in the contribution of each feature subgraph, hence we list the impact of the individual components in Table 7.2. The impact of each subgraph is evaluated in two conditions: (a) the subgraph only; (b) the full model minus the subgraph. The full model is the combination of the four subgraphs (also the best model, k = 4, in Table 7.1). In the last two rows of Table 7.2 we also present the results of using authorship only and of the full model plus authorship. The 2nd column lists the number of links in each subgraph. To highlight the differences, we report the gain of each model over the baseline WMF.
We make several interesting observations from Table 7.2. It is clear that the hashtag subgraph on tweets is the most useful: it has the best ATOP and TOP10 values in the subgraph-only condition (ATOP: +0.379% vs. the 2nd best +0.136%; TOP10: +1.021% vs. the 2nd best +0.503%), while in the full-model-minus condition, removing hashtags yields the lowest ATOP and TOP10. Observing that it also contains the most links, we believe coverage is another important reason for its strong performance.
The named entity subgraph seems to help the least. Looking into the extracted named entities and hashtags, we found that many popular named entities are already covered by hashtags. That said,
adding the named entity subgraph to the full model still has a positive contribution.
It is worth noting that the time-news subgraph has the most positive influence on RR. This is because temporal information is very salient in the news domain: usually several reports describe an event within a short period, so a news latent vector is strengthened by receiving semantics from its neighbors.
Finally, we analyze the influence of the authorship of tweets. Adding authorship to the full model greatly hurts TOP10 and RR, whereas it helps ATOP. This is understandable: by introducing author links between tweets, to some degree we average the latent vectors of tweets written by the same person. Therefore, a tweet whose topic is vague and hard to detect gains some prior knowledge of topics through the author links (hence the increased ATOP), whereas this prior knowledge becomes noise for tweets that are already handled very well by the WMF-G model (hence the decreased TOP10 and RR).
7.6 Summary and Discussion
Motivated by the difficulty of understanding tweets, we propose the Linking-Tweets-to-News task, which potentially benefits many NLP applications in which off-the-shelf NLP tools can be applied to the most relevant news. We also collect a gold standard dataset by crawling tweets, each with a URL referring to a news article. We formalize the linking task as a short text modeling problem, and extract Twitter/news-specific features to derive text-to-text relations, which are then incorporated into the matrix factorization framework. The new model achieves significant improvement over the baselines.
Aiming at increasing the accuracy of the linking, it is worth investigating a supervised setting for this task, which can be cast as a classic ranking problem. With the ground truth available, we can extract plenty of interesting features that disclose the relatedness between a tweet and a news article, such as surface word similarity, whether a named entity appears on both sides, and many more.
More importantly, since the goal of the Linking-Tweets-to-News task is to provide more context for understanding tweets, it would be valuable if the predictions of our model could actually improve the performance of other NLP tasks focused on tweets. In the future, we would like to collaborate with researchers
working on tasks such as tweet summarization and event extraction on tweets, to test this influence.
Part III
Conclusions
Chapter 8
Conclusions
Nowadays the internet generates massive amounts of short text data, yet limited progress has been made toward computing meaningful similarity beyond surface word overlap or word-level semantic comparison, which are typically ineffective given the short context, or too time-consuming. To this end, the first part of this thesis focuses on developing dimension reduction models to address this issue. This thesis has been an exploration of unsupervised methods for modeling short text representations in a latent space. We use the task of calculating semantic textual similarity to illustrate the efficacy of our approach.
In the second part of the thesis we further exemplify the impact our models have on several
NLP application tasks. We address adapting short text similarity models within the context of
several semantics based NLP tasks: word sense disambiguation, automated pyramid evaluation,
and linking tweets to news.
In the following we discuss some new challenges to the short text similarity task and some
potential future work. Firstly, focusing on the current matrix factorization framework, we note
that it only exploits bag-of-words features, and overlooks the structural information in the text,
such as word order, syntax. We believe the structural features convey more subtle and nuanced
semantics that cannot be covered by individual words. Secondly, given that our current models learn
a latent representation for a text, it would be very interesting to see the impact of neural networks
on this task, since neural networks are well known for learning semantic embeddings of words
or texts. In addition, neural network based models are flexible enough to incorporate
syntactic information. Another direction for enhancing short text modeling performance
is by adding supervised information.
8.1 Summary
Our work on modeling short text data is motivated by the nature of the texts, which can be character-
ized by two features/challenges: (1) very few words per text; (2) large scale of data,
especially in the online-generated portion. Our first two models concentrate on the first trait by (a)
integrating more features for texts and (b) integrating more features for words; the third model (c)
targets the second trait by exploiting binary coding. From the perspective of the matrix factorization
model, Chapter 2 is devoted to improving the Q matrix (textual latent profiles) in Figure 2.2, while
Chapters 3 and 4 work on modeling a better P matrix (lexical latent profiles).
In the first half of the thesis, we begin our investigation of the word sparsity characteristic of
short texts in Chapter 2, by modeling missing words for short text data. The bottleneck for short text
similarity is that the number of words in a text is very small. Using missing words adds thousands
more features, thereby alleviating the data sparsity issue. We analyze two classic models, LSA
and LDA, and provide some important insight on how they handle missing words. Accordingly,
we design a mechanism that sits between the two methods and models the missing words at the
appropriate level of granularity: weighted matrix factorization uses all the missing words but
gives them a small weight, so that the impact of the observed words is not diminished
by the sheer overwhelming number of missing words.
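The weighting mechanism above can be sketched as follows; the toy vocabulary, the counts, and the missing-word weight of 0.01 are illustrative assumptions, not the exact settings used in Chapter 2.

```python
import numpy as np

def build_weight_matrix(term_doc, w_missing=0.01):
    """Weight matrix for weighted matrix factorization: observed words get
    weight 1.0, missing words get a small weight, so the overwhelming number
    of missing cells does not drown out the few observed words."""
    return np.where(term_doc > 0, 1.0, w_missing)

# Toy term-document matrix: 4 vocabulary words x 2 short texts.
X = np.array([[2, 0],
              [0, 1],
              [1, 0],
              [0, 0]], dtype=float)
W = build_weight_matrix(X)
# Observed cells carry weight 1.0; the many missing cells carry 0.01.
```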
We extend our effort in Chapter 3, where we tackle the same challenge from another direction
by robustly modeling lexical semantics, which translates into improving the P matrix in Figure 2.2.
The intuition is that because there are only around 10 observed words in a short text, we need to
make very good use of each word, lest some important topics be missing from the text
semantics. We integrate corpus-based semantics (bigrams) and knowledge-based semantics (similar
word pairs) in the weighted matrix factorization framework. Because they are very different kinds
of lexical semantics, they are complementary to each other. It is worth noting that this approach is
able to improve the WMF model, which already performs significantly above LSA/LDA.
We then move to applying our model to massive data sets such as Twitter data. In the online
data scenario, the new challenge is the sheer size of the data: each day 500 million tweets are
generated. To this end, we convert our model into a binarized version, which produces a binary bit
string for each tweet. The binary strings allow Hamming distances to be computed directly in
hardware, which is much faster than cosine similarity on real-valued vectors. Since the succinct
binary bits lose nuanced semantics, we also propose a method to reduce redundancy in the
P matrix in order to store as much information as possible.
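The speed advantage of binary codes can be illustrated with a minimal sketch: Hamming distance reduces to an XOR followed by a population count of the differing bits. The 8-bit codes below are toy values, far shorter than a realistic tweet code.

```python
def hamming_distance(a: int, b: int) -> int:
    """Hamming distance between two binary codes stored as integers:
    XOR sets a bit wherever the codes disagree; count the set bits."""
    return bin(a ^ b).count("1")

# Two toy 8-bit tweet codes.
code1 = 0b10110100
code2 = 0b10011100
d = hamming_distance(code1, code2)  # -> 2 (bits 3 and 5 differ)
```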
We developed several models to improve Pearson's correlation coefficient on predicting short
text similarity. Yet this is not adequate: we need to know whether the improvement is substantial
enough to boost other tasks that come with short text similarity computation components. There-
fore, we take a step forward and evaluate the performance on other tasks as an extrinsic evaluation.
In theory, any task that involves similarity computation should benefit from our work. In this
thesis, we select several NLP tasks that are strongly associated with semantics.
The first task is the automated pyramid method for summarization evaluation. The similarity computation
happens in the process of determining which key concepts from the original documents are men-
tioned in the ngrams of summaries. The primary difference from the short text
similarity task is the text granularity: here the texts are ngram phrases. Evaluated on student summaries, the
dimension reduction based method is able to extract key concepts with higher precision and recall,
and hence achieves a higher correlation with manual scores, than previous methods.
Another NLP task that involves heavy similarity computation is unsupervised word sense disam-
biguation (WSD). We note that WSD systems rely heavily on sense similarity measures. Moreover,
sense definitions are usually very short, rendering this task an ideal test bed for our models.
By exploiting the sense relations in WordNet, we construct a new sense similarity measure, wmfvec,
where each sense is represented by a latent vector learned from its WordNet definition. In WSD
experiments, wmfvec significantly outperforms the LDA based similarity measure and the surface word
comparison based elesk measure [Banerjee and Pedersen, 2003].
We then apply our model to social media and news data. Here, we show that our model, WMF,
can be easily extended to adapt to a new task. We identify the key challenge of modeling tweets
for events: a tweet is fragmented, usually covering only one aspect of the event. We integrate the
tweet-specific feature (hashtags) and the news-specific feature (named entities) to find the complementary
tweets that contain the missing aspects for the target tweet. The resulting model achieves even better
performance than the WMF model.
8.2 Limitations and Future work
Despite the progress presented in this thesis, there remain some interesting and exciting challenges
for the short text similarity task. In the following, we discuss some limitations of the currently proposed
models, and several promising topics that we will explore in our future research. In general, we
intend to continue working on this task from four aspects: (1) adding new features; (2) exploiting
new embedding techniques; (3) providing supervised labels; (4) producing new properties for text
embeddings.
1. Adding new features – Incorporating Syntax: In our current models, one major impediment
is that we mainly use bag-of-words features. Intuitively, individual words are not capable of expressing
the subtle semantics that phrases can, and breaking texts into individual words loses a lot of
information such as word order. An example from [Feng et al., 2011]: the meaning of prevent cancer
is completely reversed from that of cancer; yet the textual similarity score, computed by either a
lexical semantics based method or a dimension reduction model, is relatively high. This may be
why short text similarity is rarely applied in sentiment analysis.
To overcome this issue, modeling a new feature, syntactic structure, could be helpful. Previous
work [Severyn et al., 2013] showed that tree kernels boost similarity scores, where the
similarity of two texts is the sum of common subtrees extracted from constituent and dependency
trees.
Considering that our model is a dimension reduction model, one preliminary idea is to integrate
vector-based compositional semantics [Mitchell and Lapata, 2008] into the factorization model.
Compositional semantics studies how the meanings of individual text units can be combined to pro-
duce the meaning of bigger units such as phrases or sentences [Hodges, 1997]. Mitchell and Lapata
[2008] investigated constructing the vector representations of phrases from words. In our case, we
can generate the phrase vectors following the constituent tree structure of the short text: the vectors
of two nodes are merged into a new vector following the compositional operations. We hope that
compositional semantics is able to correctly model phrases such as not bad.
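The additive and multiplicative composition operations of Mitchell and Lapata [2008] are simple to state; the 4-dimensional latent vectors below are hypothetical values used purely for illustration.

```python
import numpy as np

def compose_additive(u, v):
    """Additive composition: the phrase vector is the element-wise sum."""
    return u + v

def compose_multiplicative(u, v):
    """Multiplicative composition: the element-wise product, which
    emphasizes latent dimensions (topics) shared by both words."""
    return u * v

# Toy latent word vectors (hypothetical values).
not_vec = np.array([0.1, 0.9, 0.2, 0.0])
bad_vec = np.array([0.0, 0.8, 0.1, 0.3])
phrase_add = compose_additive(not_vec, bad_vec)
phrase_mul = compose_multiplicative(not_vec, bad_vec)
```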
2. Exploiting word embeddings – Neural Networks as a new method to produce textual em-
beddings: Nowadays neural network techniques have proven to be very powerful models
for learning word and text embeddings, such as recurrent neural networks [Mikolov et al., 2010],
recursive autoencoders [Socher et al., 2011b]. They have been found successful in a variety of
NLP tasks, including paraphrase detection [Socher et al., 2011a], sentiment analysis [Socher et al.,
2011b], and language modeling [Mnih and Hinton, 2007], by employing automatic feature extraction.
Given the great performance on these NLP tasks, this direction is worth exploring in the context of short text
similarity. Moreover, because of their recursive property, such models provide a natural way to model the syntax
of sentences, as shown in [Socher et al., 2011b]. The first step could be to apply recurrent neural
networks [Mikolov et al., 2010] to learn a short text embedding, which preserves word order by
treating the text as a word sequence. Then we can move on to applying recursive autoencoders [Socher et
al., 2011a] to the syntactic tree of the text data. Note that these methods already model compositional
semantics.
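As a concrete starting point, a vanilla recurrent cell that folds a word sequence into a single vector can be sketched as below; the dimensions, the random untrained weights, and the use of the final hidden state as the text embedding are illustrative simplifications, not the actual models of Mikolov et al. or Socher et al.

```python
import numpy as np

def rnn_text_embedding(word_vecs, W_h, W_x):
    """Run a vanilla recurrent cell over a word sequence and return the
    final hidden state as the text embedding; word order matters because
    each state depends on the previous one."""
    h = np.zeros(W_h.shape[0])
    for x in word_vecs:
        h = np.tanh(W_h @ h + W_x @ x)
    return h

rng = np.random.default_rng(0)
d_hidden, d_word = 5, 3
W_h = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
W_x = rng.normal(scale=0.1, size=(d_hidden, d_word))
# A toy 4-word "sentence" of random word vectors.
sentence = [rng.normal(size=d_word) for _ in range(4)]
emb = rnn_text_embedding(sentence, W_h, W_x)  # shape (5,)
```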
3. Providing Supervised Labels – Supervised Tweet Retrieval: Another direction for enhancing
short text similarity performance is to add supervised labels. We want to test the performance
on the tweet retrieval task, since a lot of noisy labels, hashtags, are already available. By observing the
hashtag labels, the model will learn a more informative binary string for each tweet, where Ham-
ming distances are minimized among tweets with the same labels, and simultaneously maximized
among tweets with different labels. This direction is worth exploring since the labels,
being hashtags, can be easily obtained without expensive manual annotation.
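A minimal sketch of the quantity such a supervised model would optimize: the mean Hamming distance within same-hashtag pairs should be small, and the mean across different hashtags large. The 4-bit codes and the labels are hypothetical toy values.

```python
from itertools import combinations

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def label_separation(codes, labels):
    """Mean Hamming distance over same-label pairs vs. different-label
    pairs; a supervised model would minimize the former and maximize
    the latter."""
    within, across = [], []
    for i, j in combinations(range(len(codes)), 2):
        (within if labels[i] == labels[j] else across).append(
            hamming(codes[i], codes[j]))
    return sum(within) / len(within), sum(across) / len(across)

# Toy 4-bit tweet codes with hashtag labels.
codes = [0b1100, 0b1101, 0b0010, 0b0011]
labels = ["#Oscar2015", "#Oscar2015", "#SB49", "#SB49"]
w, a = label_separation(codes, labels)  # within=1.0, across=3.5
```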
Meanwhile, this raises new challenges in the Twitter context. Because tweets are short and
hence so fragmented that a tweet reports only one aspect of an event, two tweets sharing the same
hashtag label are not necessarily talking about the same aspect of the event. Consider the two following
tweets,
• my favorite on #Oscar2015 Red Carpet. @ladygaga agreed, saying she looked ”beautiful.”
• JulieAndrews presents Oscar for Best Film Score. #Oscar2015
Both of them are Oscar 2015 related; however, one focuses on the red carpet and the other talks
about the best film score. To address this special issue, we may need to relax the strong assumption that
all tweets sharing the same labels should have similar binary bits.
4. Producing new properties for text embeddings – Sentiment-aware short text embeddings:
The previous three topics focus on improving the quality of embeddings so that the text embeddings
encode more similarity information. Now we consider another aspect of embeddings: augmenting
them with new properties such as sentiment. Our idea is inspired by the work in [Yih
et al., 2012], where the induced word embeddings are able to distinguish antonyms. In their model,
ideally the cosine similarity between hot and cold should be close to −1, which implies the two
words are negatively correlated. We would like our model to enjoy a similar effect – two sentences
that express opposite semantics should have a cosine similarity value of −1. Such an embedding
would have significant impact on research in areas such as sarcasm detection [Gonzalez-Ibanez et
al., 2011; Riloff et al., 2013], where the contradiction of two text segments is an underlying attribute.
To accomplish this goal, we will need to annotate new data, and dramatically change the current
similarity annotation schema. A preliminary annotation schema should take into consideration both
topical similarity as well as sentiment polarity.
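The desired geometry can be illustrated with a toy example: if opposite-polarity texts received sign-flipped embeddings, their cosine similarity would be exactly −1. The vectors below are hypothetical, not learned by any model.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings where opposite polarity flips the sign.
hot = np.array([0.6, -0.2, 0.8])
cold = -hot                      # idealized antonym embedding
sim = cosine(hot, cold)          # -> -1.0
```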
In terms of applications, it is interesting to build joint models to bridge the gap between the short
text similarity task and the application tasks. Examples of such tasks include paraphrase detection
and textual entailment. Here, we briefly discuss the application and challenges of applying our
models to novelty detection for events.
Novelty detection aims to find an event that has not been covered by the news media. Thus,
short text similarity is a great baseline for the task: if a text has very low similarity scores
with all the previous news articles, it is considered a potential novel event. However, there is one
major challenge we need to face, which is named entities. Identifying a novel event relies heavily
on the named entities involved in the event, but our models do not handle new named entities
very well: any distributional model needs to see a string multiple times before it can learn the correct
semantics of the string. In this case, it makes more sense to combine our model with the surface
word matching method to get a robust similarity score for the novelty detection task.
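One simple way to realize this combination is a linear interpolation of the latent similarity with a surface overlap measure; the Jaccard overlap, the mixing weight alpha, and the example texts are illustrative assumptions, not a tuned configuration.

```python
def jaccard(text1: str, text2: str) -> float:
    """Surface word overlap, which still fires on unseen named entities."""
    s1, s2 = set(text1.lower().split()), set(text2.lower().split())
    return len(s1 & s2) / len(s1 | s2)

def combined_similarity(latent_sim: float, text1: str, text2: str,
                        alpha: float = 0.7) -> float:
    """Interpolate the latent-space similarity with surface overlap;
    alpha is an illustrative mixing weight."""
    return alpha * latent_sim + (1 - alpha) * jaccard(text1, text2)

# A low latent score can be corrected by entity overlap on the surface.
score = combined_similarity(0.2, "earthquake hits Nepal",
                            "Nepal earthquake today")
```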
Part IV
Bibliography
BIBLIOGRAPHY 116
Bibliography
[Abel et al., 2011] Fabian Abel, Qi Gao, Geert-Jan Houben, and Ke Tao. Semantic enrichment of
twitter posts for user profile construction on the social web. In Proceedings of the 49th Annual
Meeting of the Association for Computational Linguistics, 2011.
[Agarwal et al., 2012] Puneet Agarwal, Rajgopal Vaithiyanathan, Saurabh Sharma, and Gautam
Shroff. Catching the long-tail: Extracting local news events from twitter. In Proceedings of the
Sixth International AAAI Conference on Weblogs and Social Media, 2012.
[Agirre and Soroa, 2009] Eneko Agirre and Aitor Soroa. Personalizing pagerank for word sense
disambiguation. In Proceedings of the 12th Conference of the European Chapter of the ACL,
2009.
[Agirre et al., 2012] Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. Semeval-
2012 task 6: A pilot on semantic textual similarity. In First Joint Conference on Lexical and
Computational Semantics (*SEM), 2012.
[Agirre et al., 2013] Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei
Guo. *sem 2013 shared task: Semantic textual similarity. In Second Joint Conference on Lexical
and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the
Shared Task: Semantic Textual Similarity, 2013.
[Agirre et al., 2014] Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor
Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. Semeval-
2014 task 10: Multilingual semantic textual similarity. In SemEval 2014, 2014.
[Bach et al., 2013] Stephen H. Bach, Bert Huang, Ben London, and Lise Getoor. Hinge-loss markov
random fields: Convex inference for structured prediction. In Uncertainty in Artificial Intelli-
gence, 2013.
[Baker et al., 1998] Collin F Baker, Charles J Fillmore, and John B Lowe. The berkeley framenet
project. In Proceedings of the 36th Annual Meeting of the Association for Computational Lin-
guistics and 17th International Conference on Computational Linguistics-Volume 1, 1998.
[Banerjee and Pedersen, 2003] Satanjeev Banerjee and Ted Pedersen. Extended gloss overlaps as
a measure of semantic relatedness. In Proceedings of the 18th International Joint Conference on
Artificial Intelligence, pages 805–810, 2003.
[Bar et al., 2013] Daniel Bar, Torsten Zesch, and Iryna Gurevych. Dkpro similarity: An open
source framework for text similarity. In Proceedings of the 51st Annual Meeting of the Asso-
ciation for Computational Linguistics: System Demonstrations, 2013.
[Barzilay and Lee, 2003] Regina Barzilay and Lillian Lee. Learning to paraphrase: an unsuper-
vised approach using multiple-sequence alignment. In Proceedings of the 2003 Conference of
the North American Chapter of the Association for Computational Linguistics on Human Lan-
guage Technology-Volume 1, 2003.
[Beltagy et al., 2014] Islam Beltagy, Katrin Erk, and Raymond Mooney. Probabilistic soft logic
for semantic textual similarity. Proceedings of the 52nd Annual Meeting of the Association for
Computational Linguistics, 2014.
[Benson et al., 2011] Edward Benson, Aria Haghighi, and Regina Barzilay. Event discovery in
social media feeds. In Proceedings of the 49th Annual Meeting of the Association for Computa-
tional Linguistics: Human Language Technologies, 2011.
[Bhagat and Ravichandran, 2008] Rahul Bhagat and Deepak Ravichandran. Large scale acquisition
of paraphrases for learning surface patterns. In Proceedings of ACL-08: HLT, 2008.
[Blei and Lafferty, 2006] David M Blei and John D Lafferty. Dynamic topic models. In Proceed-
ings of the 23rd international conference on Machine learning, 2006.
[Blei et al., 2003] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation.
Journal of Machine Learning Research, 3, 2003.
[Boonthum-Denecke et al., 2011] Chutima Boonthum-Denecke, Philip M McCarthy, Travis Alan
Lamkin, G Tanner Jackson, Joseph Magliano, and Danielle S McNamara. Automatic natural lan-
guage processing and the detection of reading skills and reading comprehension. In Proceedings
of the Twenty-Fourth International Florida Artificial Intelligence Research Society Conference,
2011.
[Broder et al., 1998] Andrei Z Broder, Moses Charikar, Alan M Frieze, and Michael Mitzenmacher.
Min-wise independent permutations. In Proceedings of the Thirtieth Annual ACM Symposium
on Theory of Computing, 1998.
[Brody and Diakopoulos, 2011] Samuel Brody and Nicholas Diakopoulos. Coooooooooooooooll-
llllllllllll!!!!!!!!!!!!!! using word lengthening to detect sentiment in microblogs. In Proceedings
of the 2011 Conference on Empirical Methods in Natural Language Processing, 2011.
[Cai et al., 2007] Jun Fu Cai, Wee Sun Lee, and Yee Whye Teh. Improving word sense disambigua-
tion using topic features. In Proceedings of the 2007 Joint Conference on Empirical Methods in
Natural Language Processing and Computational Natural Language Learning, 2007.
[Callison-Burch et al., 2007] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof
Monz, and Josh Schroeder. (meta-) evaluation of machine translation. In Proceedings of the
Second Workshop on Statistical Machine Translation, 2007.
[Callison-Burch et al., 2008] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof
Monz, and Josh Schroeder. Further meta-evaluation of machine translation. In Proceedings
of the Third Workshop on Statistical Machine Translation, 2008.
[Chakrabarti and Punera, 2011] Deepayan Chakrabarti and Kunal Punera. Event summarization
using tweets. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social
Media, 2011.
[Chang et al., 2013] Kai-Wei Chang, Wen-tau Yih, and Christopher Meek. Multi-relational latent
semantic analysis. In Proceedings of the 2013 Conference on Empirical Methods in Natural
Language Processing, 2013.
[Charikar, 2002] Moses S. Charikar. Similarity estimation techniques from rounding algorithms.
In Proceedings of the Thiry-fourth Annual ACM Symposium on Theory of Computing, 2002.
[Chen and Dolan, 2011] David L. Chen and William B. Dolan. Collecting highly parallel data
for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for
Computational Linguistics, 2011.
[Claypool et al., 1999] Mark Claypool, Anuja Gokhale, Tim Miranda, Pavel Murnikov, Dmitry
Netes, and Matthew Sartin. Combining content-based and collaborative filters in an online news-
paper. In Proceedings of the ACM SIGIR Workshop on Recommender Systems, 1999.
[Clive et al., 2005] Clive Best, Erik van der Goot, Ken Blackler, Teofilo Garcia, and David Horby.
Europe media monitor – system description. EUR Report, 2005.
[Conover et al., 2011] Michael Conover, Jacob Ratkiewicz, Matthew Francisco, Bruno Goncalves,
Filippo Menczer, and Alessandro Flammini. Political polarization on twitter. In Proceedings of
the Fifth International AAAI Conference on Weblogs and Social Media, 2011.
[Corso et al., 2005] Gianna M. Del Corso, Antonio Gulli, and Francesco Romani. Ranking a stream
of news. In Proceedings of the 14th international conference on World Wide Web, 2005.
[Dagan et al., 2006] Ido Dagan, Oren Glickman, and Bernardo Magnini. The pascal recognising
textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncer-
tainty, Visual Object Classification, and Recognising Tectual Entailment. Springer, 2006.
[Deerwester et al., 1990] Scott C. Deerwester, Susan T Dumais, Thomas K. Landauer, George W.
Furnas, and Richard A. Harshman. Indexing by latent semantic analysis. Journal of the American
Society for Information Science, 1990.
[Dolan et al., 2004] William Dolan, Chris Quirk, and Chris Brockett. Unsupervised construction
of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the
20th International Conference on Computational Linguistics, 2004.
[Duan et al., 2010] Yajuan Duan, Long Jiang, Tao Qin, Ming Zhou, and Heung-Yeung Shum. An
empirical study on learning to rank of tweets. In COLING, 2010.
[Efron and Tibshirani, 1986] Bradley Efron and Robert Tibshirani. Bootstrap methods for standard
errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, 1986.
[Efron, 2010] Miles Efron. Information search and retrieval in microblogs. In Journal of the Amer-
ican Society for Information Science and Technology, 2010.
[Eisenstein, 2013] Jacob Eisenstein. What to do about bad language on the internet. In Proceedings
of NAACL-HLT 2013, pages 359–369, 2013.
[Erk, 2007] Katrin Erk. A simple, similarity-based model for selectional preferences. In Proceed-
ings of the 45th Annual Meeting of the Association for Computational Linguistics, 2007.
[Fellbaum, 1998] Christiane Fellbaum. WordNet: An Electronic Lexical Database. MIT Press,
1998.
[Feng et al., 2008] Jin Feng, Yi-Ming Zhou, and Trevor Martin. Sentence similarity based on rele-
vance. In Proceedings of IPMU, 2008.
[Feng et al., 2011] Song Feng, Ritwik Bose, and Yejin Choi. Learning general connotation of
words using graph-based algorithms. In Proceedings of the Conference on Empirical Methods in
Natural Language Processing, 2011.
[Foltz et al., 2000] Peter W Foltz, Sara Gilliam, and Scott Kendall. Supporting content-based feed-
back in on-line writing evaluation with lsa. Interactive Learning Environments, pages 111–127,
2000.
[Gildea and Jurafsky, 2002] Daniel Gildea and Daniel Jurafsky. Automatic labeling of semantic
roles. Computational linguistics, 2002.
[Gong and Lazebnik, 2011] Yunchao Gong and Svetlana Lazebnik. Iterative quantization: A pro-
crustean approach to learning binary codes. In Proceedings of IEEE Conference on Computer
Vision and Pattern Recognition, 2011.
[Gonzalez-Ibanez et al., 2011] Roberto Gonzalez-Ibanez, Smaranda Muresan, and Nina Wa-
cholder. Identifying sarcasm in twitter: a closer look. In Proceedings of the 49th Annual Meeting
of the Association for Computational Linguistics: Human Language Technologies, 2011.
[Graesser et al., 2004] Arthur C Graesser, Danielle S McNamara, Max M Louwerse, and Zhiqiang
Cai. Coh-metrix: Analysis of text on cohesion and language. Behavior Research Methods,
Instruments, & Computers, 2004.
[Graesser et al., 2011] Arthur C Graesser, Danielle S McNamara, and Jonna M Kulikowich. Coh-
metrix providing multilevel analyses of text characteristics. Educational Researcher, 2011.
[Griffiths and Steyvers, 2004] Thomas L. Griffiths and Mark Steyvers. Finding scientific topics.
Proceedings of the National Academy of Sciences, 101, 2004.
[Guo and Diab, 2010] Weiwei Guo and Mona Diab. Combining orthogonal monolingual and mul-
tilingual sources of evidence for all words wsd. In Proceedings of the 48th Annual Meeting of
the Association for Computational Linguistics, 2010.
[Guo and Diab, 2011] Weiwei Guo and Mona Diab. Semantic topic models: Combining word
distributional statistics and dictionary definitions. In Proceedings of the 2011 Conference on
Empirical Methods in Natural Language Processing, 2011.
[Guo and Diab, 2012a] Weiwei Guo and Mona Diab. Learning the latent semantics of a concept by
its definition. In Proceedings of the 50th Annual Meeting of the Association for Computational
Linguistics, 2012.
[Guo and Diab, 2012b] Weiwei Guo and Mona Diab. Modeling sentences in the latent space. In
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 2012.
[Guo and Diab, 2013] Weiwei Guo and Mona Diab. Improving lexical semantics for sentential
semantics: Modeling selectional preference and similar words in a latent variable model. In
Proceedings of the 2013 Conference of the North American Chapter of the Association for Com-
putational Linguistics: Human Language Technologies, 2013.
[Guo et al., 2013] Weiwei Guo, Hao Li, Heng Ji, and Mona Diab. Linking tweets to news: A
framework to enrich online short text data in social media. In Proceedings of the 51th Annual
Meeting of the Association for Computational Linguistics, 2013.
[Guo et al., 2014] Weiwei Guo, Wei Liu, and Mona Diab. Fast tweet retrieval with compact binary
codes. In Proceedings of COLING 2014, the 25th International Conference on Computational
Linguistics, 2014.
[Han et al., 2013] Lushan Han, Abhay Kashyap, Tim Finin, James Mayfield, and Jonathan Weese.
Umbc ebiquity-core: Semantic textual similarity systems. In Second Joint Conference on Lexical
and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the
Shared Task, 2013.
[Harnly et al., 2005] Aaron Harnly, Ani Nenkova, Rebecca Passonneau, and Owen Rambow. Au-
tomation of summary evaluation by the pyramid method. In Recent Advances in Natural Lan-
guage Processing, 2005.
[Hindle and Rooth, 1993] Donald Hindle and Mats Rooth. Structural ambiguity and lexical rela-
tions. Computational linguistics, 1993.
[Ho et al., 2010] Chukfong Ho, Masrah Azrifah Azmi Murad, Rabiah Abdul Kadir, and Shya-
mala C. Doraisamy. Word sense disambiguation-based sentence similarity. In Proceedings of
the 23rd International Conference on Computational Linguistics, 2010.
[Hodges, 1997] Wilfrid Hodges. Compositional semantics for a language of imperfect information.
Logic Journal of IGPL, 1997.
[Hofmann, 1999] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the
22nd annual international ACM SIGIR conference on Research and development in information
retrieval, 1999.
[Hovy et al., 2006] Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph
Weischedel. Ontonotes: The 90% solution. In Proceedings of the Human Language Technology
Conference of the North American Chapter of the ACL, 2006.
[Huang et al., 2012] Hongzhao Huang, Arkaitz Zubiaga, Heng Ji, Hongbo Deng, Dong Wang, Hieu
Le, Tarek Abdelzather, Jiawei Han, Alice Leung, John Hancock, and Clare Voss. Tweet rank-
ing based on heterogeneous networks. In Proceedings of the 24th International Conference on
Computational Linguistics, 2012.
[Ide and Veronis, 1998] Nancy Ide and Jean Veronis. Introduction to the special issue on word
sense disambiguation: the state of the art. Computational linguistics, 1998.
[Indyk and Motwani, 1998] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: to-
wards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM sympo-
sium on Theory of computing, 1998.
[Islam and Inkpen, 2008] Aminul Islam and Diana Inkpen. Semantic text similarity using corpus-
based word similarity and string similarity. ACM Transactions on Knowledge Discovery from
Data, 2, 2008.
[Ji and Eisenstein, 2013] Yangfeng Ji and Jacob Eisenstein. Discriminative improvements to dis-
tributional sentence similarity. In Proceedings of the 2013 Conference on Empirical Methods in
Natural Language Processing, 2013.
[Ji and Grishman, 2008] Heng Ji and Ralph Grishman. Refining event extraction through cross-
document inference. In Proceedings of ACL-08: HLT, 2008.
[Jiang and Conrath, 1997] Jay J. Jiang and David W. Conrath. Semantic similarity based on cor-
pus statistics and lexical taxonomy. In Proceedings of International Conference Research on
Computational Linguistics, 1997.
[Jiang et al., 2011] Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. Target-
dependent twitter sentiment classification. In Proceedings of the 49th Annual Meeting of As-
sociation for Computational Linguistics, 2011.
[Jin et al., 2011] Ou Jin, Nathan N. Liu, Kai Zhao, Yong Yu, and Qiang Yang. Transferring topical
knowledge from auxiliary long texts for short text clustering. In Proceedings of the 20th ACM
international conference on Information and knowledge management, 2011.
[Kauchak and Barzilay, 2006] David Kauchak and Regina Barzilay. Paraphrasing for automatic
evaluation. In Proceedings of the Human Language Technology Conference of the North Ameri-
can Chapter of the ACL, 2006.
[Kimmig et al., 2012] Angelika Kimmig, Stephen Bach, Matthias Broecheler, Bert Huang, and
Lise Getoor. A short introduction to probabilistic soft logic. In Proceedings of the NIPS Work-
shop on Probabilistic Programming: Foundations and Applications, 2012.
[Klein and Manning, 2003] Dan Klein and Christopher D Manning. Accurate unlexicalized pars-
ing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-
Volume 1, pages 423–430. Association for Computational Linguistics, 2003.
[Kulis and Grauman, 2012] Brian Kulis and Kristen Grauman. Kernelized locality-sensitive hash-
ing. IEEE Transactions On Pattern Analysis and Machine Intelligence, 34(6):1092–1104, 2012.
[Lapata and Barzilay, 2005] Mirella Lapata and Regina Barzilay. Automatic evaluation of text co-
herence: Models and representations. In Proceedings of the 19th International Joint Conference
on Artificial Intelligence, 2005.
[Leacock and Chodorow, 1998] Claudia Leacock and Martin Chodorow. Combining local context
and wordnet similarity for word sense identification. Fellbaum, C., ed., WordNet: An electronic
lexical database, 1998.
[Lee and Park, 2007] H. J. Lee and Sung Joo Park. Moners: A news recommender for the mobile
web. Expert Systems with Applications, 2007.
[Lee et al., 2005] Michael David Lee, BM Pincombe, and Matthew Brian Welsh. An empirical
evaluation of models of text document similarity. In Proceedings of the 27th Annual Conference
of the Cognitive Science Society, 2005.
[Lesk, 1986] Michael Lesk. Automatic sense disambiguation using machine readable dictionaries:
How to tell a pine cone from an ice cream cone. In Proceedings of the ACM SIGDOC Conference,
pages 24–26, 1986.
[Li et al., 2006] Yuhua Li, David McLean, Zuhair A. Bandar, James D. O’Shea, and Keeley Crock-
ett. Sentence similarity based on semantic nets and corpus statistics. IEEE Transaction on
Knowledge and Data Engineering, 18, 2006.
[Li et al., 2010] Linlin Li, Benjamin Roth, and Caroline Sporleder. Topic models for word sense
disambiguation and token-based idiom detection. In Proceedings of the 48th Annual Meeting of
the Association for Computational Linguistics, 2010.
[Li et al., 2012] Hao Li, Yu Chen, Heng Ji, Smaranda Muresan, and Dequan Zheng. Combining
social cognitive theories with linguistic features for multi-genre sentiment analysis. In Proceedings of the 26th Pacific
Asia Conference on Language, Information and Computation, 2012.
[Lin and Hovy, 2003] Chin-Yew Lin and Eduard Hovy. Automatic evaluation of summaries using
n-gram co-occurrence statistics. In Proceedings of the 2003 Conference of the North Ameri-
can Chapter of the Association for Computational Linguistics on Human Language Technology-
Volume 1, 2003.
[Lin, 1998] Dekang Lin. An information-theoretic definition of similarity. In Proceedings of the 15th Interna-
tional Conference on Machine Learning, 1998.
[Lin, 2004] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text
Summarization Branches Out: Proceedings of the ACL-04 Workshop, 2004.
[Liu et al., 2011a] Wei Liu, Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Hashing with graphs.
In Proceedings of the 28th International Conference on Machine Learning, 2011.
[Liu et al., 2011b] Xiaohua Liu, Shaodian Zhang, Furu Wei, and Ming Zhou. Recognizing named entities in tweets. In The Semantic Web: Research and Applications, 2011.
[Liu et al., 2012a] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, and Shih-Fu Chang. Super-
vised hashing with kernels. In Proceedings of IEEE Conference on Computer Vision and Pattern
Recognition, 2012.
[Liu et al., 2012b] Wei Liu, Jun Wang, Yadong Mu, Sanjiv Kumar, and Shih-Fu Chang. Compact
hyperplane hashing with bilinear functions. In Proceedings of the 29th International Conference
on Machine Learning, 2012.
[McCarthy et al., 2004] Diana McCarthy, Rob Koeling, Julie Weeds, and John Carroll. Finding
predominant word senses in untagged text. In Proceedings of the 42nd Meeting of the Association
for Computational Linguistics, 2004.
[McCarthy et al., 2007] Diana McCarthy, Rob Koeling, Julie Weeds, and John Carroll. Unsuper-
vised acquisition of predominant word senses. Computational Linguistics, 2007.
[Mihalcea et al., 2006] Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based
and knowledge-based measures of text semantic similarity. In Proceedings of the 21st National
Conference on Artificial Intelligence, 2006.
[Mihalcea, 2005] Rada Mihalcea. Unsupervised large-vocabulary word sense disambiguation with
graph-based algorithms for sequence data labeling. In Proceedings of the Joint Conference on
Human Language Technology and Empirical Methods in Natural Language Processing, pages
411–418, 2005.
[Mikolov et al., 2010] Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev
Khudanpur. Recurrent neural network based language model. In INTERSPEECH 2010, 11th
Annual Conference of the International Speech Communication Association, 2010.
[Miller et al., 1993] George Miller, Claudia Leacock, Randee Tengi, and Ross Bunker. A semantic
concordance. In 3rd DARPA Workshop on Human Language Technology, 1993.
[Mitchell and Lapata, 2008] Jeff Mitchell and Mirella Lapata. Vector-based models of semantic composition. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, 2008.
[Mnih and Hinton, 2007] Andriy Mnih and Geoffrey Hinton. Three new graphical models for sta-
tistical language modelling. In Proceedings of the 24th international conference on Machine
learning, 2007.
[Navigli, 2009] Roberto Navigli. Word sense disambiguation: A survey. ACM Computing Surveys
(CSUR), 2009.
[Nenkova and Passonneau, 2004] Ani Nenkova and Rebecca Passonneau. Evaluating content selection in summarization: The pyramid method. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, 2004.
[Norouzi and Fleet, 2011] Mohammad Norouzi and David J. Fleet. Minimal loss hashing for com-
pact binary codes. In Proceedings of the 28th International Conference on Machine Learning,
2011.
[O’Shea et al., 2008] James O’Shea, Zuhair Bandar, Keeley Crockett, and David McLean. A com-
parative study of two short text semantic similarity measures. In Proceedings of the Agent
and Multi-Agent Systems: Technologies and Applications, Second KES International Symposium
(KES-AMSTA), 2008.
[Palmer et al., 2001] Martha Palmer, Christiane Fellbaum, Scott Cotton, Lauren Delfs, and
Hoa Trang Dang. English tasks: All-words and verb lexical sample. In Proceedings of
SENSEVAL-2: Second International Workshop on Evaluating Word Sense Disambiguation Sys-
tems, 2001.
[Pang and Lee, 2004] Bo Pang and Lillian Lee. A sentimental education: Sentiment analysis using
subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Meeting of the
Association for Computational Linguistics, 2004.
[Pang and Lee, 2005] Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for
sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting
of the Association for Computational Linguistics, 2005.
[Pantel et al., 2007] Patrick Pantel, Rahul Bhagat, Bonaventura Coppola, Timothy Chklovski, and Eduard H. Hovy. Isp: Learning inferential selectional preferences. In Proceedings of Human Language Technologies: The Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2007.
[Papineni et al., 2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a
method for automatic evaluation of machine translation. In Proceedings of the 40th annual
meeting on association for computational linguistics, 2002.
[Passonneau et al., 2013] Rebecca J. Passonneau, Emily Chen, Weiwei Guo, and Dolores Perin.
Automated pyramid scoring of summaries using distributional semantics. In Proceedings of the
51st Annual Meeting of the Association for Computational Linguistics, 2013.
[Patwardhan and Pedersen, 2006] Siddharth Patwardhan and Ted Pedersen. Using wordnet-based
context vectors to estimate the semantic relatedness of concepts. In Proceedings of the EACL
2006 Workshop Making Sense of Sense - Bringing Computational Linguistics and Psycholin-
guistics Together, 2006.
[Patwardhan et al., 2005] Siddharth Patwardhan, Satanjeev Banerjee, and Ted Pedersen. Senserelate::targetword - a generalized framework for word sense disambiguation. In Proceedings of the Demonstration and Interactive Poster Session of the 43rd Annual Meeting of the Association for Computational Linguistics, 2005.
[Pedersen et al., 2004] Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. Wordnet::similarity - measuring the relatedness of concepts. In Proceedings of the Fifth Annual Meeting of the North American Chapter of the Association for Computational Linguistics, 2004.
[Perin et al., 2013] Dolores Perin, Rachel Hare Bork, Stephen T. Peverly, and Linda H. Mason. A
contextualized curricular supplement for developmental reading and writing. Journal of College
Reading and Learning, 2013.
[Petrovic et al., 2010] Sasa Petrovic, Miles Osborne, and Victor Lavrenko. Streaming first story
detection with application to twitter. In Human Language Technologies: The 2010 Annual Con-
ference of the North American Chapter of the Association for Computational Linguistics, 2010.
[Porter, 2001] Martin Porter. Snowball: A language for stemming algorithms, 2001.
[Pradhan et al., 2007] Sameer S. Pradhan, Edward Loper, Dmitriy Dligach, and Martha Palmer.
Semeval-2007 task 17: English lexical sample, srl and all words. In Proceedings of the 4th
International Workshop on Semantic Evaluations (SemEval-2007), 2007.
[Ramage et al., 2009] Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Man-
ning. Labeled lda: A supervised topic model for credit attribution in multi-labeled corpora. In
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing,
2009.
[Ramage et al., 2010] Daniel Ramage, Susan Dumais, and Dan Liebling. Characterizing mi-
croblogs with topic models. In Proceedings of the Fourth International AAAI Conference on
Weblogs and Social Media, 2010.
[Rashtchian et al., 2010] Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier.
Collecting image annotations using amazon’s mechanical turk. In Proceedings of the NAACL
HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk,
2010.
[Ratcliff and Metzener, 1988] John W. Ratcliff and David E. Metzener. Pattern matching: the gestalt approach. Dr. Dobb's Journal, 1988.
[Resnik, 1995] Philip Resnik. Using information content to evaluate semantic similarity in a tax-
onomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence,
1995.
[Resnik, 1996] Philip Resnik. Selectional constraints: An information-theoretic model and its com-
putational realization. Cognition, 1996.
[Resnik, 1997] Philip Resnik. Selectional preference and sense disambiguation. In Proceedings
of the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How?,
1997.
[Riloff et al., 2013] Ellen Riloff, Ashequl Qadir, Prafulla Surve, Lalindra De Silva, Nathan Gilbert,
and Ruihong Huang. Sarcasm as contrast between a positive sentiment and negative situation. In
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013.
[Ritter et al., 2010] Alan Ritter, Oren Etzioni, et al. A latent dirichlet allocation method for selec-
tional preferences. In Proceedings of the 48th Annual Meeting of the Association for Computa-
tional Linguistics, 2010.
[Ritter et al., 2012] Alan Ritter, Oren Etzioni, Sam Clark, et al. Open domain event extraction
from twitter. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge
discovery and data mining, 2012.
[Rooth et al., 1999] Mats Rooth, Stefan Riezler, Detlef Prescher, Glenn Carroll, and Franz Beil.
Inducing a semantically annotated lexicon via em-based clustering. In Proceedings of the 37th
annual meeting of the Association for Computational Linguistics on Computational Linguistics,
1999.
[Sankaranarayanan et al., 2009] Jagan Sankaranarayanan, Hanan Samet, Benjamin E. Teitler,
Michael D. Lieberman, and Jon Sperling. Twitterstand: news in tweets. In Proceedings of
the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information
Systems, 2009.
[Severyn et al., 2013] Aliaksei Severyn, Massimo Nicosia, and Alessandro Moschitti. Learning
semantic textual similarity with structural representations. In Proceedings of the 51st Annual
Meeting of the Association for Computational Linguistics, 2013.
[Shen et al., 2013] Chao Shen, Fei Liu, Fuliang Weng, and Tao Li. A participant-based approach
for event summarization using twitter streams. In Proceedings of NAACL-HLT 2013, 2013.
[Sinclair, 2001] John McHardy Sinclair. Collins COBUILD English dictionary for advanced learn-
ers. HarperCollins, 2001.
[Sinha and Mihalcea, 2007] Ravi Sinha and Rada Mihalcea. Unsupervised graph-based word sense
disambiguation using measures of word semantic similarity. In Proceedings of the IEEE Inter-
national Conference on Semantic Computing, pages 363–369, 2007.
[Snyder and Palmer, 2004] Benjamin Snyder and Martha Palmer. The english all-words task. In
Proceedings of the 3rd International Workshop on the Evaluation of Systems for the Semantic
Analysis of Text (Senseval-3), pages 41–43, 2004.
[Socher et al., 2011a] Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, and
Christopher D. Manning. Dynamic pooling and unfolding recursive autoencoders for paraphrase
detection. In Proceedings of Advances in Neural Information Processing Systems, 2011.
[Socher et al., 2011b] Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2011.
[Speriosui et al., 2011] Michael Speriosui, Nikita Sudan, Sid Upadhyay, and Jason Baldridge.
Twitter polarity classification with label propagation over lexical links and the follower graph.
In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing,
2011.
[Srebro and Jaakkola, 2003] Nathan Srebro and Tommi Jaakkola. Weighted low-rank approxima-
tions. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.
[Steck, 2010] Harald Steck. Training and testing of recommender systems on data missing not
at random. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 2010.
[Subercaze et al., 2013] Julien Subercaze, Christophe Gravier, and Frederique Laforest. Towards
an expressive and scalable twitter’s users profiles. In IEEE/WIC/ACM International Joint Con-
ferences on Web Intelligence and Intelligent Agent Technologies, 2013.
[Toutanova et al., 2003] Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer.
Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the
2003 Conference of the North American Chapter of the Association for Computational Linguis-
tics on Human Language Technology, 2003.
[Tsatsaronis et al., 2010] George Tsatsaronis, Iraklis Varlamis, and Michalis Vazirgiannis. Text
relatedness based on a word thesaurus. Journal of Artificial Intelligence Research, 37, 2010.
[Tumasjan et al., 2010] Andranik Tumasjan, Timm Oliver Sprenger, Philipp G. Sandner, and Is-
abell M. Welpe. Predicting elections with twitter: What 140 characters reveal about political
sentiment. In Fourth International AAAI Conference on Weblogs and Social Media, 2010.
[Wang and Manning, 2012] Sida Wang and Christopher D Manning. Baselines and bigrams: Sim-
ple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the
Association for Computational Linguistics: Short Papers-Volume 2, 2012.
[Wang and McCallum, 2006] Xuerui Wang and Andrew McCallum. Topics over time: a non-Markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 2006.
[Weiss et al., 2008] Yair Weiss, Antonio Torralba, and Rob Fergus. Spectral hashing. In Advances
in Neural Information Processing Systems, 2008.
[Wu and Palmer, 1994] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 1994.
[Xu et al., 2014] Wei Xu, Alan Ritter, Chris Callison-Burch, William B Dolan, and Yangfeng Ji.
Extracting lexically divergent paraphrases from twitter. Transactions of the Association for Com-
putational Linguistics, 2014.
[Yan et al., 2012] Rui Yan, Mirella Lapata, and Xiaoming Li. Tweet recommendation with graph
co-ranking. In Proceedings of the 24th International Conference on Computational Linguistics,
2012.
[Yih et al., 2012] Wen-tau Yih, Geoffrey Zweig, and John C. Platt. Polarity inducing latent seman-
tic analysis. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural
Language Processing and Computational Natural Language Learning, 2012.
[Zhou et al., 2006] Liang Zhou, Chin-Yew Lin, Dragos Stefan Munteanu, and Eduard Hovy. Paraeval: Using paraphrases to evaluate summaries automatically. In Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, 2006.