Dimension Reduction for Short Text Similarity and its Applications
Weiwei Guo
Submitted in partial fulfillment of the
requirements for the degree
of Doctor of Philosophy
in the Graduate School of Arts and Sciences
COLUMBIA UNIVERSITY
2015
© 2015
Weiwei Guo
All Rights Reserved
ABSTRACT
Dimension Reduction for Short Text Similarity and its Applications
Weiwei Guo
Recently, due to the burst of online text data, much of the focus of natural language processing
(NLP) research has shifted from long documents to shorter ones such as sentences and utterances.
However, short texts pose significant challenges from an NLP perspective, especially if the goal is
to get at sentence-level semantics in the absence of larger contexts. Motivated by this challenge,
this thesis focuses on the problem of predicting the similarity between two short text samples by
extracting latent representations of the text data, and we apply the resulting models to various
NLP tasks that involve short text similarity computation.
The major challenge of computing short text similarity is the insufficient information in the text
snippets. In a sentence similarity benchmark [Agirre et al., 2012], a sentence has on average 10.8
words. Hence, there are very few overlapping words in a text pair even when the texts are semantically
related, and the widely used bag-of-words representation fails to capture the semantic relatedness.
To this end, we propose several weighted matrix factorization models for learning latent
representations of texts, which induce meaningful similarity scores:
1. Modeling Missing Words: To address the word sparsity issue, we propose to model the
missing words (words that are not in the short text), a feature that is typically overlooked in the liter-
ature. We define the missing words of a text as the whole vocabulary in a corpus minus the observed
words in the text. The model carefully handles the missing words that by assigning them a small
weight in the matrix factorization framework. In the experiments, the new model weighted matrix
factorization (WMF) achieves superior performance to Latent Dirichlet Allocation (LDA) [Blei et
al., 2003], which does not use missing words, and latent semantic analysis (LSA) [Deerwester et
al., 1990], which uses missing words but does not distinguish missing words from observed words.
2. Modeling Lexical Semantics: We improve the previous WMF model in terms of lexical
semantics. For short text similarity, it is crucial to robustly model each word in the text to capture the
complete semantic picture of the text, since there is very little repeated information in the short
context. To this end, we incorporate both corpus-based (bigrams) and knowledge-based (similar
words extracted from a dictionary) lexical semantics into the WMF model. The experiments show
that both sources of additional information are helpful and complementary to each other.
3. Similarity Computing for Large-Scale Data Sets: We tackle the short text similarity
problem in a large-scale setting, i.e., given a query tweet, compute the similarity/distance to all
other data points in a database, and rank them by similarity/distance score. To reduce the
computation time, we exploit binary coding to transform each data sample into a compact binary
code, which enables highly efficient similarity computations via Hamming distances between the
generated codes. In order to preserve as much of the original data as possible in the binary bits, we restrict
the projection directions to be nearly orthogonal, thereby reducing redundant information. The resulting
model demonstrates better performance on both the short text similarity task and a tweet retrieval task.
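What makes binary codes attractive is that the Hamming distance between two codes reduces to a single XOR followed by a population count. A minimal illustration (not the thesis implementation), packing each code's bits into an integer:

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two binary codes packed as integers."""
    # XOR leaves a 1 exactly where the two codes disagree; count those bits.
    return bin(a ^ b).count("1")

# Two 8-bit codes that disagree in two positions.
print(hamming_distance(0b10110010, 0b10011010))  # → 2
```

Ranking a query against a database then amounts to computing this distance against every stored code, which is far cheaper than real-valued cosine similarity.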
We are not only interested in the short text similarity task itself, but also concerned with
how much the model could contribute to other NLP tasks. Accordingly, we adapt the short text
similarity models to several NLP tasks closely associated with semantics, which involve intensive similarity
computation:
4. Text Summarization Evaluation: The pyramid method is one of the most popular methods
for evaluating content selection in summarization, but it requires manual inspection during
evaluation. Recently, some efforts have been made to automate the evaluation process: Harnly et al.
[2005] searched for key facts/concepts covered in the summaries based on surface word matching.
We apply the WMF model to this task to enable more accurate identification of key facts in summaries.
The resulting automated pyramid scores correlate very well with manual pyramid scores.
5. Word Sense Disambiguation: Unsupervised Word Sense Disambiguation (WSD) systems
rely heavily on a sense similarity module that returns a similarity score given two senses.
Currently the most popular sense similarity measure is Extended Lesk [Banerjee and Pedersen, 2003],
which calculates the similarity score based on the number of overlapping words and phrases between
two extended dictionary definitions. We propose a new sense similarity measure wmfvec by running
WMF on the sense definition data and integrating WordNet [Fellbaum, 1998] features. The WSD
system using wmfvec significantly outperforms traditional surface form based WSD algorithms as
well as LDA based systems.
6. Linking Tweets to News: In this task we target social media and news data. We
propose a new task of linking a tweet to the news article that is most relevant to it. The
motivation of the task is to augment the context of a tweet with a news article. We extend the WMF
model and incorporate multiple Twitter/news-specific features, i.e., hashtags, named entities and
timestamps, in the new model. Our experiments show significant improvements of the new model
over baselines on various evaluation metrics.
Table of Contents
List of Figures v
List of Tables viii
1 Introduction 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Related Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.2 STS Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.3 Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
I Dimension Reduction for Short Text Similarity 15
2 Enrich Short Text by Modeling Missing Words 16
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Limitations of LDA and LSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 The Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.1 Weighted Matrix Factorization . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.2 Modeling Missing Words . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4.1 Experiment setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3 Enrich Lexical Features by Modeling Bigrams and Similar Words 32
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3 Incorporating Bigrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.1 Incorporating Bigrams from Dependency Tree . . . . . . . . . . . . . . . 37
3.4 Incorporating Similar Word Pairs . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5.1 Experiment Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.6 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4 Binary Coding for Large Scale Similarity Computing 45
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.2 Binary Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.3 Applications in NLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 The Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.1 Binarized version of WMF . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.2 Removing Redundant Information . . . . . . . . . . . . . . . . . . . . . . 50
4.3.3 Implementation of Orthogonal Projections . . . . . . . . . . . . . . . . . . 53
4.4 Experiments on Twitter Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.1 Experiment Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.4.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.5 Experiments on STS Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.5.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.6 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
II Applications 64
5 Automated Pyramid Method for Summaries 65
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.3 A Scoring Approach based on Distributional Similarity . . . . . . . . . . . . . . . 70
5.3.1 A Student Summary Corpus . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3.2 Criteria for Automated Scoring of Student Summaries . . . . . . . . . . . 70
5.3.3 A Dynamic Programming Approach . . . . . . . . . . . . . . . . . . . . . 71
5.4 Experiments on Student Summaries . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4.1 Experiment Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.5 Experiments on TAC 2011 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.6 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6 Unsupervised Word Sense Disambiguation 78
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.3 A New Sense Similarity Measure – wmfvec . . . . . . . . . . . . . . . . . . . . . 82
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.4.1 Experiment Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.4.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.5 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
7 Linking Tweets to News 91
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.3 Searching Complementary texts via Twitter/News Features . . . . . . . . . . . . . 95
7.3.1 Hashtags and Named Entities . . . . . . . . . . . . . . . . . . . . . . . . 95
7.3.2 Temporal Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.3.3 Authorship . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.3.4 Creating Relations on News . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.4 WMF on Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.5.1 Experiment Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.6 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
III Conclusions 108
8 Conclusions 109
8.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
8.2 Limitations and Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
IV Bibliography 115
Bibliography 116
List of Figures
2.1 An example to illustrate why missing words should be helpful: the red dots are
observed words in the text; the green dots represent missing words; the black dot
denotes the hypothesis of the latent vector of the text data. After taking into consid-
eration the missing words, we will have a better estimation for the text, i.e., where
the black dot should be. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Matrix Factorization: the M ×N matrix X is factorized into two matrices, K ×M
matrix P and K ×N matrix Q; K denotes the number of latent dimensions. . . . . 23
2.3 Pearson’s Correlation percentage scores of WMF on each data set: the missing word
weight wm varies from 0.001 to 0.1; the dimension K is fixed to 100; regularization
factor λ is fixed to 20. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4 Pearson’s Correlation percentage scores of WMF and LDA on each data set: the
dimension K varies from 50 to 200; missing word weight wm is fixed to 0.01;
regularization factor λ is fixed to 20. . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1 In current dimension reduction models (WMF/LSA and LDA), the features to rep-
resent a word are simply document IDs, which are denoted by the red circles. . . . 34
3.2 Each bigram is integrated in the original corpus matrix X as an additional column.
From the model’s perspective, a bigram is treated as a pseudo-text; accordingly,
only two cells in a bigram column have non-zero values. . . . . . . . . . . . . . . 37
3.3 WMF+BK model (WMF + corpus-based [B]igram semantics + [K]nowledge-based
similar word pairs semantics): a w/d/b node represents a word/document/bigram,
respectively; the extra node in Figure 3.3c denotes w2 and w3 constitute a similar
word pair. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4 Pearson’s Correlation percentage scores of WMF-B (with corpus-based [B]igram
semantics alone) on each data set: corpus-based semantics weight γ is chosen from
{0, 1, 2}; the dimension K is 100; missing word weight wm is fixed as 0.01; regu-
larization factor λ is fixed as 20. . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5 Pearson’s Correlation percentage scores of WMF-K (with [K]nowledge-based simi-
lar word pairs semantics alone) on each data set: knowledge-based semantics weight
δ is chosen from {0, 10, 30, 50}; the dimension K is 100; missing word weight wm
is fixed as 0.01; regularization factor λ is fixed as 20. . . . . . . . . . . . . . . . . 44
4.1 Two views of the P matrix: K is the number of dimensions, and M is the number
of distinct words. The first view, the columns of P, is frequently used in the
WMF model (Algorithm 1). We now apply the second view, the rows of P,
which are projections, to improve the WMF model. . . . . . . . . . . . . . . . . 51
4.2 Three examples to illustrate the noisiness in the P matrix. In general, we would like
to remove as much noise as possible. . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3 Hamming ranking on tweet retrieval data set: precision curve under top 1000 re-
turned list of all 6 binary coding models, with dimension K = {64, 96, 128}. . . . 56
4.4 Hamming ranking on tweet retrieval data set: recall curve under top 100,000 re-
turned list of all 6 binary coding models, with dimension K = {64, 96, 128}. . . . 57
4.5 Impact of the missing word weight wm on the MP@1000 performance for OrMF
and WMF models: wm is chosen from 0.05 to 0.2; regularization factor λ is fixed
as 20. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.6 Pearson’s Correlation percentage scores of OrMF, WMF and LDA on each data set:
the dimension K varies from 50 to 200; missing word weight wm is fixed as 0.01;
regularization factor λ is fixed as 20. . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.1 The pipeline for pyramid method to evaluate student summaries: the first annotation
of pyramid method is to create pyramids from model summaries; the second anno-
tation is to find the SCUs in target summaries. After the procedure, we can score a
target summary based on how many SCUs it has. . . . . . . . . . . . . . . . . . . 67
5.2 Notation used for the 45 variants of automated pyramid methods. The 5 thresholds
correspond to inverse cumulative density function. . . . . . . . . . . . . . . . . . . 73
6.1 Unsupervised Graph-based Word Sense Disambiguation System: several sense nodes
are created for each word; the weights on edges are similarity scores between the
two senses; for simplicity, the edges between walk senses and friend sense are not
shown. The final decisions for the disambiguated words are the sense nodes that achieve
the maximum indegree values. . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.2 The WSD performance, measured by F-measure, of ldavec and wmfvec on each data
set. The latent dimension K varies from 50 to 150. . . . . . . . . . . . . . . . . . 87
7.1 The general framework for linking a tweet to its most relevant news article, by first
transforming the textual data into latent representation, and then choosing the one
with maximum cosine similarity score. . . . . . . . . . . . . . . . . . . . . . . . . 92
7.2 The tweet nodes t and news nodes n are connected by hashtags, named entities or
temporal edges. For simplicity, the missing tokens are not shown in the figure. All
the grey nodes are observed information, such as TF-IDF values, while white nodes
are latent vectors to be inferred. . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
7.3 Impact of the weight of links δ of model WMF-G on development set and test set
evaluated by three evaluation metrics: latent dimension K = 100, and neighbor
tweets number is k = 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.4 Impact of latent dimension K of model WMF-G on test set evaluated by three met-
rics: the neighbor tweet number is fixed k = 4. Dimension K varies from 50 to
150. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
List of Tables
1.1 Genres and data sources for each of the subset data for STS12/STS13/STS14 . . . 9
1.2 This table contains the word pairwise similarity. The content words in the first
sentence are cemetery, place, body, ash; the content words in the second sentence are
graveyard, area, land, sometime, near, church. Each cell stores the word similarity
value; the numbers in red denote the word pair alignment that maximizes the total
sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 List of features used in DKPro [Bär et al., 2013] . . . . . . . . . . . . . . . . . . . 12
2.1 Three possible latent vectors hypotheses for the text data, which is the WordNet
sense definition of bank#n#1: a financial institution that accepts deposits and
channels the money into lending activities. Assume there are only three topics in
the corpus: finance, sport, institution. Ro denotes the relatedness score between
the hypothesis with observed words; Rm denotes the relatedness score between the
hypothesis with missing words. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Pearson’s correlation (in percentage) on the four data sets: latent dimension K =
100 for LSA/LDA/WMF. For WMF models, the regularization factor λ is fixed as
20. Model 4-6 are WMF with different missing word weight wm, where the first
two models are analogous to LSA and LDA, respectively. . . . . . . . . . . . . . . 26
2.3 Pearson’s correlation (in percentage) on the four data sets: the models are trained
on long documents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1 Pearson’s correlation (in percentage) on the four data sets. Latent dimension K =
100 for LSA/LDA/WMF/WMF-BK. For matrix factorized based models, the regu-
larization factor λ is fixed as 20. Model 5 is WMF with bigram semantics alone;
model 6 is WMF with similar word pairs alone; model 7 is the final model with both
semantics incorporated. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1 Symbols used in binary coding. . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Mean precision among top 1000 returned list (MP@1000) on the tweet retrieval
data set. TF-IDF is the only system that does not use binary encoding, and serves
as the upper bound of the task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3 Pearson’s correlation (in percentage) on the data sets. Latent dimension K = 100
for LSA/LDA/WMF/OrMF. We use the real-valued vectors produced by OrMF for
short text similarity evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.1 An example of summary content unit created from 5 model summaries. The concept
has 4 contributors, all expressing the same meaning yet with different wording.
Accordingly this SCU has a weight of 4. . . . . . . . . . . . . . . . . . . . . . . 66
5.2 Five top performing variants out of 45 variants ranked by correlation scores, with
confidence interval and rank (P=Pearson’s, S=Spearman’s, K=Kendall’s tau) . . . . 74
5.3 SCU selection results: averaged recall, precision and F-measure over the 20 student
summaries, for each combination of similarity method and method of comparison
to the SCU (9 categories). The number in bracket is the standard deviation for
precision and recall. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.4 SCU selection results: averaged recall, precision and F-measure over the 20 student
summaries, for variants of the top five variants in Table 5.2. . . . . . . . . . . . . . 76
6.1 The statistics of annotated senses in the four WSD data sets, as well as the distribu-
tion per part-of-speech. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.2 The WSD performance measured by F-measure of 7 models on each data set, as
well as the performance per part-of-speech. For the models ldavec and wmfvec,
they are trained with the latent dimension K = 100. . . . . . . . . . . . . . . . . . 86
6.3 The similarity values of wmfvec and elesk in two examples. The first example is
the target word mouse in a biology context that contains words gene, cell, etc. The
second example is the word church in a context that involves stop, chat. . . . . . . 89
7.1 Performance for Linking-Tweets-to-News under three evaluation metrics (latent di-
mension K = 100 for LDA/WMF/WMF-G) . . . . . . . . . . . . . . . . . . . . . 101
7.2 Contribution of subgraphs of hashtag/named entity/temporal/author, when K =
100, k = 4, δ = 3, measured by gain over baseline WMF. . . . . . . . . . . . . . . 104
Acknowledgments
Writing a thesis is a great process to review not only my academic work, but also the journey I
took as a PhD student. Throughout the years I spent at Columbia, I have been fortunate to come
across so many brilliant researchers and genuine friends. It is the people I met who shaped who I
am today. My gratitude goes out to all of them. I would like to send my thanks in particular:
Above all, I would like to express my deepest gratitude to my supervisor, Professor Mona Diab, for
her great support over the past years. It is Mona who encouraged me to enter the challenging field of natural
language processing. Her passion and positive attitude will always be an inspiration to me. She
provided me with great opportunities to participate in many interesting research projects and to organize
different academic activities. This thesis would not exist without her patience and guidance.
And to Professor Kathy McKeown, my departmental advisor, who always has confidence in
me and encourages me to pursue a higher standard. To Owen Rambow, who patiently helped me
solve the issues in our projects. Their insightful guidance and sense of responsibility motivate me
toward becoming a professional researcher.
To my coauthors, Professor Heng Ji, Doctor Rebecca Passonneau and Doctor Smaranda Muresan.
I am very grateful for their generous help in addressing research issues and revising papers. I
also learned a lot from their serious attitude toward academia. To my intern mentor Rakesh Gupta,
who, together with Prof. Ji, Prof. McKeown, and my advisor Mona Diab, provided generous help during
my job search.
To my colleagues and friends at Columbia University, Apoorv Agarwal, Mohamed Altantawy,
Daniel Bauer, Or Biran, Hao Dang, Zhe He, Weiwei Jiang, Ahmed El Kholy, Heba Elfardy, Noura
Farra, Wei-Yun Ma, Vinodkumar Prabhakaran, Mohammad Sadegh Rasooli, Wael Salloum, Xiaorui
Sun, Kapil Thadani. I will never forget the surprise birthday parties, wonderful trips and crazy
deadlines I spent with you. To my dear friends, Jinai A, Ti-wei David Chen, Yiding Cheng, Pradeep
Dasigi, Mevlana Gemici, Jun Hu, Jianzhao Huang, Qiao Hui, Jia Liu, Peng Liu, Shih-hao Liao,
Yuan Ma, Misagh Mb, Ruiyang Wu, Yu Xie, Aya Zerikly, Fan Zhang, Hang Zhao, who shared the
excitement of studying abroad.
Also to Hao Li, Qi Li, Wei Liu, Junming Xu, Feng Song, Xiaoxiao Shi, who happened to pursue
their PhD degrees with me at the same time and left precious memories for me, including my best
“comrades” Boyi Xie and Leon Wu, who helped me deal with all sorts of things at CCLS.
To CCLS staff members, Daniel Alicea, Hatim Diab, Kathy Hickey, Idrija Ibrahimagic, Derrick
Lim, Axinia Radeva, who make it a big family to me.
My special appreciation goes to my parents Wenliang Guo, Xiaoying Li, and my girlfriend,
Wenhui Li, with whom I always share my good news and frustration. My parents always support me
to pursue what interests me. I could hardly have achieved anything without their unconditional
love. My girlfriend helps me in every aspect of my life. And my deepest thanks go to my grandpa
Jianzhong Li, who cares for me more than himself.
Finally, I would like to thank all committee members, Prof. Mona Diab, Prof. Kathy McKeown,
Dr. Owen Rambow, Dr. Smaranda Muresan, and Prof. David Blei, for attending my PhD thesis
defense, and all the staff in the Department of Computer Science at Columbia University.
To my parents
CHAPTER 1. INTRODUCTION 1
Chapter 1
Introduction
This thesis is dedicated to developing dimension reduction models. We especially focus on
addressing the short text similarity task as well as its applications to natural language
processing (NLP) tasks.
1.1 Overview
Recently, online communication has come to make up a large portion of social media content, especially
microblogs such as Facebook comments and Twitter data, which have gained tremendous popularity. The
latter has now become a major source containing first-story breaking news before it is reported
in traditional media. Because of this trend, a significant amount of NLP research focus has shifted
from large, lengthy documents to smaller texts such as sentences and utterances.
Identifying the degree of semantic similarity between two short texts is at the crux of many
NLP applications that address sentence level semantics. In Machine Translation [Kauchak and
Barzilay, 2006] and Text Summarization [Zhou et al., 2006], sentence similarity based metrics
have been applied to evaluate the closeness between yielded translation/summary and reference.
In Text Coherence Detection [Lapata and Barzilay, 2005], different sentences are linked by their
similarity scores. In Word Sense Disambiguation, Lesk [1986] measured the relatedness of two
senses by their definition similarity. Moreover, computing similarity between short text data is an
indispensable step in social media analysis research. Taking Twitter data as an example, such
research includes tweet recommendation [Yan et al., 2012], tweet retrieval [Ramage et al.,
2010], tweet paraphrase detection [Xu et al., 2014], event summarization [Shen et al., 2013] and
extraction [Ritter et al., 2012] on tweets, etc. (more details on the applications of short text similarity
can be found in Section 1.2.4).
Due to the relevance of the problem, this thesis presents a comprehensive study of computing
similarity between short texts. The task of Short Text Similarity (STS) requires systems to calculate
a score that reflects the similarity between a pair of sentences/short snippets of text; e.g., a score of
0.47 on a [0, 1] scale is given to the following sentence pair from [Li et al., 2006]:
• Cord is a strong, thick string.
• String is a thin rope made of twisted threads, used for tying things together or tying up parcels.
At first glance, the short text similarity (STS) task closely resembles sentence-level paraphrase
recognition. However, paraphrase recognition only provides a binary score for a short text
pair, whereas STS seeks a more nuanced, fine-grained continuous score. The continuous score makes
STS more applicable to NLP tasks, since in most cases two sentences are not exactly semantically
equivalent, and knowing the amount of overlapping information could be very helpful, e.g., any two
WordNet senses are not exactly the same, but the degree of similarity between senses is essential
for word sense disambiguation. Also, STS shares some commonality with Textual Entailment.
Intuitively, if text A entails text B, their similarity score should usually be high. Therefore, textual
similarity scores could be a useful feature for Textual Entailment. In Section 1.2.1, we highlight
the differences among these tasks.
Unlike similarity calculation for typical lengthy documents, extracting similarity for short
texts is very difficult due to the limited features observed in each data unit. Hence, the widely used
TF-IDF weighting on bag-of-words representations fails to capture the semantic relatedness of two sentences
unless they share many overlapping words. In our example above, typical measures of similarity will not
succeed since the two sentences share very few words overall.
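The failure mode is easy to see with a plain bag-of-words cosine similarity. The sketch below (illustrative only, not the thesis code) scores the cord/string pair from above:

```python
import math
import re

def bow_cosine(s1: str, s2: str) -> float:
    """Cosine similarity between binary bag-of-words vectors."""
    w1 = set(re.findall(r"[a-z]+", s1.lower()))
    w2 = set(re.findall(r"[a-z]+", s2.lower()))
    return len(w1 & w2) / math.sqrt(len(w1) * len(w2))

cord = "Cord is a strong, thick string."
rope = ("String is a thin rope made of twisted threads, "
        "used for tying things together or tying up parcels.")
# ≈ 0.30, and the overlap is mostly stopwords ("is", "a");
# the only shared content word is "string".
print(bow_cosine(cord, rope))
```

Despite the two definitions being closely related, surface overlap yields a low score, which is what motivates moving to a latent representation.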
Previous research on STS falls into two main thrusts. The first set of approaches works within
the high-dimensional word space exploiting lexical semantics techniques such as word similarity
measures, which are either corpus based [Islam and Inkpen, 2008], or knowledge based [Li et
al., 2006; Mihalcea et al., 2006; Tsatsaronis et al., 2010]. The majority of this work was introduced
within the context of early work on STS [Li et al., 2006; Mihalcea et al., 2006; Islam and Inkpen,
CHAPTER 1. INTRODUCTION 3
2008; Tsatsaronis et al., 2010]. The second set of approaches works within the low-dimensional
space, which is represented by dimension reduction techniques, such as Latent
Semantic Analysis (LSA) [Deerwester et al., 1990], Probabilistic Latent Semantic Analysis (PLSA)
[Hofmann, 1999], and Latent Dirichlet Allocation (LDA) [Blei et al., 2003]. Such techniques can
fully exploit word co-occurrence information by modeling the semantics of words and sentences
simultaneously in the low-dimensional latent space. However, early attempts at addressing STS
using LSA [Mihalcea et al., 2006; O’Shea et al., 2008], or LDA (experiments shown in [Guo and
Diab, 2012b]), performed significantly below high dimensional word similarity based models. (cf.
we present previous work on STS in section 1.2).
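To make the low-dimensional route concrete, here is a minimal, generic LSA sketch (illustrative only, not the model proposed in this thesis): a term-by-text count matrix is truncated via SVD, and texts are compared by cosine similarity in the latent space.

```python
import numpy as np

texts = ["the cat sat on the mat",
         "a kitten rests on a rug",
         "stock markets fell sharply today"]
vocab = sorted({w for t in texts for w in t.split()})

# Term-by-text count matrix: rows are words, columns are texts.
X = np.array([[t.split().count(w) for t in texts] for w in vocab], float)

# Truncated SVD: keep the top-k singular components.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dimensional vector per text

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# The two pet-related texts end up closer than the unrelated one.
print(cos(doc_vecs[0], doc_vecs[1]), cos(doc_vecs[0], doc_vecs[2]))
```

Note that with only around 10 observed words per text, such models receive very weak evidence for each latent vector, which is the deficiency the following chapters address.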
In the first part of the thesis, we introduce several of our approaches to improve STS perfor-
mance in a matrix decomposition framework. From lexical semantics based approaches, we ob-
serve that the key to inducing robust sentence similarity is to introduce additional information to
overcome the data sparseness issue (in the [Agirre et al., 2012] data set, on average only 10.8 words
exist in a short text snippet). In Chapter 2, we propose an unsupervised approach, Weighted Matrix
Factorization (WMF) [Guo and Diab, 2012b], that accounts for and explicitly models
“the missing words” for each short text. We hypothesize that the semantic profile of a sentence is
defined by both what is *in* the text as observed words and what is *not* in the text. Accordingly,
we define the missing words of a short text as all the vocabulary in a training corpus minus the
observed words in the short text. Modeling missing words in practice adds thousands more features
for a text; by contrast, other low-dimensional models such as LDA only leverage the
observed words (around 10) to infer a 100-dimension latent vector for a text.
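The definition of missing words is straightforward to state in code (the toy corpus below is invented for illustration):

```python
# Missing words of a text = corpus vocabulary minus the text's observed words.
corpus = ["the gem is a jewel or stone",
          "a jewel is a precious stone used in rings",
          "stock markets fell sharply today"]
vocab = {w for text in corpus for w in text.split()}

text = corpus[0]
observed = set(text.split())
missing = vocab - observed   # negative evidence: what the text is *not* about

print(len(vocab), len(observed), len(missing))
```

For a realistic corpus the vocabulary has tens of thousands of entries, so the missing-word set contributes thousands of additional (negative) features per text.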
In Chapter 3, we propose an approach for more robust modeling of the lexical items in short
texts [Guo and Diab, 2013], which further improves short text semantics. Explicitly modeling
lexical semantics nuances for each word in canonical dimension reduction algorithms has not drawn
much attention in the community, as these models are typically used for long documents, which in
turn have abundant word features to induce the document level semantics. However, in the short
text similarity setting, it is crucial to make good use of each word in the text, in order not to miss
salient topics represented in the short text. Accordingly, we explicitly encode lexical semantics,
derived from both corpus-based and knowledge-based information, in the weighted matrix factor-
ization (WMF) model. The experiments illustrate that these new models achieve even better short text
similarity scores.
Moreover, given the massive flow of Twitter data online, we note the need to process such
large collections of data efficiently for several NLP applications on Twitter, such as first story
detection. We exploit binary coding to tackle the scalability issue: each data sample is compressed
into a compact binary code, which enables highly efficient similarity computation via Hamming
distances between the generated codes. One obvious side effect of using binary bits is that much
nuanced information is lost. To alleviate this issue, we convert the WMF model into a binarized
version, and force the projection directions in the model to be nearly orthogonal, reducing the
redundant information in the resulting binary bits. Our proposed technique finds the most similar
tweets given a query tweet in the large scale Twitter data scenario. Also, experiments on STS data
sets show its superiority over previous models. More details can be found in Chapter 4.
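The binary-coding idea can be sketched as follows (an illustrative setup with random vectors, not the exact model of Chapter 4): real-valued latent vectors are binarized by sign, and neighbors are ranked by Hamming distance between the resulting codes.

```python
import numpy as np

rng = np.random.default_rng(0)
latent = rng.standard_normal((1000, 64))   # latent vectors for 1000 tweets (toy)
codes = (latent > 0).astype(np.uint8)      # sign-binarized 64-bit codes

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return int(np.count_nonzero(a != b))

query = codes[0]
dists = np.array([hamming(query, c) for c in codes])
nearest = np.argsort(dists)[:5]            # indices of the 5 most similar tweets
print(nearest[:3])
```

In production systems the codes are packed into machine words so each Hamming distance reduces to an XOR followed by a population count, which is why the scheme scales to very large collections.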
We show the efficacy of our proposed models not only intrinsically on the short text similar-
ity task, but also extrinsically on several applications for various NLP tasks, which is the focus of
the second part of the thesis. From Chapter 5 to Chapter 7, we show how our matrix factorization mod-
els are very powerful at extracting quality latent semantics of text at different levels of granularity
(phrases/sentences/short texts) that address specific applications.
The first application is automatic pyramid evaluation for text summarization [Passonneau et
al., 2013]. One key component in pyramid evaluation is to identify whether a summary covers
an important concept in the original documents. In traditional pyramid evaluation [Nenkova and
Passonneau, 2004], this is done manually by searching for a word sequence in the summary that covers
a key concept, which can be expensive. To automate this process, we extract all the
phrase-level ngrams, convert them into low dimensional vectors, and apply a dynamic programming
approach to automatically match concepts. Our approach is shown to correlate better with manual
scores than two string matching baselines on a student summary assessment task. With further
experiments, we find the new approach can extract concepts with higher precision and recall.
We also investigate the impact of distributional similarity in an unsupervised word sense dis-
ambiguation setting in Chapter 6. Traditional sense similarity measures compute sense similarity
by counting overlapping words and phrases [Lesk, 1986; Banerjee and Pedersen, 2003]. However,
surface word matching in the sparse word space does not reveal the true semantic relatedness of two
senses, especially since sense definitions are usually short. To obtain meaningful
sense similarity values, we convert sense definitions to low dimensional dense vectors. We fur-
ther construct a more powerful sense similarity measure, wmfvec, using WordNet defined relations.
The WSD system using wmfvec significantly outperforms surface word based WSD systems and LDA based
algorithms.
In Chapter 7, we apply our proposed STS techniques to social network Twitter data [Guo et al.,
2013]. The short nature of tweets poses a big challenge for NLP tools to extract useful information
from the data. To enable NLP tools to better understand Twitter feeds, we propose the task of
linking a tweet to a relevant news article, in effect augmenting the context of the tweet. We develop
a new model that is able to capture tweet/news relatedness in the data. Our model utilizes
tweet specific features (e.g., hashtags) and news specific features (e.g., named entities) as well as
temporal constraints to find news articles that are on the same topic as (and hence complementary to) a
specific tweet. We crawl a data set of tweet-news pairs, and the new model significantly outperforms
the baselines on three different evaluation metrics on this data set.
1.2 Related Work
In this section, we summarize the previous work focusing on the task of short text similarity. The
related work of applying short text similarity for other NLP tasks, e.g., word sense disambiguation
and pyramid evaluation for text summarization, will be presented in each application chapter.
We first review two related tasks, Textual Entailment and Paraphrase Recognition, comparing their
differences and commonalities with the STS task. Then we briefly introduce
the unsupervised and supervised approaches for this problem, as well as the recent development of
the STS data sets. Finally, we briefly summarize the applications of STS.
1.2.1 Related Tasks
Paraphrase Recognition (sentence level) is a task closely related to STS. In [Dolan et al., 2004],
Paraphrase Recognition is defined as identifying two texts “which are more or less semantically
equivalent”, but which may differ in syntactic structure or in the amount of shared details. Under this
definition, the two tasks closely resemble one another: if two texts are paraphrases, then their semantic
similarity score should be very high. One distinct difference is when one text is the negation of the
other: then they are not paraphrases, yet in STS they still have a relatively high similarity score.
This is partly caused by the design nature of the two tasks. STS aims at reflecting the degree of
information overlap and hence has a continuous score, whereas the Paraphrase Recognition score is
a binary value focusing on exact semantic equivalence. Paraphrase Recognition has a very strict
constraint on the positive label; for example, the meanings of the following two sentences are very close, but their
label is negative (not paraphrases):
• Ballmer has been vocal in the past warning that Linux is a threat to Microsoft.
• In the memo, Ballmer reiterated the open-source threat to Microsoft.
Due to this characteristic, supervised models are much more popular in the task of paraphrase
recognition, yielding significantly better results, since supervised models can exploit human
designed features that are highly discriminative for the task. In contrast, in the STS task, unsuper-
vised approaches are able to achieve comparable performance.1 Meanwhile, it is worth noting that
the current best performing systems on the Microsoft Paraphrase corpus [Dolan et al., 2004] are
dimension reduction models plus manually designed features in a supervised setting: Socher et al.
[2011a] applied a recursive neural network model to the task, achieving an accuracy of 76.8
on the Microsoft paraphrase corpus; Ji and Eisenstein [2013] developed a discriminative dimension
reduction model with 80.41 accuracy on the same data set.
Textual Entailment is defined as the directional relationship between a text T (text) and a second
text H (hypothesis) where T entails H (T ⇒ H) “if the meaning of H can be inferred from the
meaning of T, as would typically be interpreted by people” [Dagan et al., 2006]. Intuitively, if T
entails H, T and H are often highly similar. However, sometimes H can be logically inferred from
T while the similarity value between T and H is not very high. Textual Entailment differs from
STS in two respects: (1) Textual Entailment is directional: T entails H, but the opposite does
not necessarily hold, i.e., H need not entail T; (2) similar to Paraphrase Recognition, Textual Entailment outputs
a binary decision while STS is defined in a graded continuous space. Based on these observations, we
can conclude that STS scores could be a very helpful feature for the Textual Entailment task.

1 Results for the *SEM 2013 STS shared task are available at http://ixa2.si.ehu.es/sts/index.php;
the DEFT system, an unsupervised system based on our approaches, was ranked 3rd among 89 runs from 34 STS systems,
which include many supervised methods.
1.2.2 STS Datasets
LI06: The most popular data set before 2012 is LI06 [Li et al., 2006]. The LI06 data set consists of
65 pairs of noun definitions selected from the Collins Cobuild Dictionary [Sinclair, 2001]. A subset
of 30 pairs is further selected by Li et al. to render the similarity scores evenly distributed. Each pair
is associated with a continuous score from 0 to 1, which is the average judgment of 32 human
annotators. For example, a score of 0.65 is assigned to the following pair:
• A gem is a jewel or stone that is used in jewelry.
• A jewel is a precious stone used to decorate valuable things that you wear, such as rings or
necklaces.
Typically in the literature, Pearson’s correlation coefficient or Spearman’s rank correlation coefficient
between an STS system’s output and the groundtruth similarity scores is used to evaluate the performance
of an STS system. While this is an ideal data set for evaluating STS, its small size makes it impossible
to tune STS algorithms or derive significant performance conclusions.
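The Pearson-based evaluation protocol can be sketched in a few lines (the gold and system scores below are invented for illustration):

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

gold   = [0.65, 0.10, 0.90, 0.40, 0.75]   # annotator judgments in [0, 1]
system = [0.70, 0.20, 0.80, 0.35, 0.60]   # hypothetical system output
print(round(pearson(gold, system), 4))
```

Spearman's coefficient is computed the same way after replacing each score with its rank, which makes it sensitive only to the ordering of pairs rather than the exact values.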
LEE05: A less popular data set was developed by Lee et al. [2005], comprising 50 short texts drawn from
newspaper articles in the political domain. Every two texts among the 50 constitute a pair,
resulting in 1225 pairs in total. Each pair is annotated with a similarity score on a discrete 1-5 scale.
The main reason this data set has drawn less attention in the NLP community might be that
the text units are relatively long. An example from this data set is shown below:
• Beijing has abruptly withdrawn a new car registration system after drivers demonstrated “an
unhealthy fixation” with symbols of Western military and industrial strength - such as FBI and
007. Senior officials have been infuriated by a popular demonstration of interest in American
institutions such as the FBI. Particularly galling was one man’s choice of TMD, which stands
for Theatre Missile Defense, a US-designed missile system that is regularly vilified by Chinese
propaganda channels.
• The Russian defense minister said residents shouldn’t feel threatened by the growing number
of Chinese workers seeking employment in the country’s sparsely populated Far Eastern and
Siberian regions. There are no exact figures for the number of Chinese working in Russia,
but estimates range from 200,000 to as many as 5 million. Most are in the Russian Far
East, where they arrive with legitimate work visas to do seasonal work on Russia’s low-tech,
labor-intensive farms.
The length of a short text in LEE05 ranges from 45 to 126 words. Many NLP tasks deal with a much smaller
context. For example, in word sense disambiguation, a crucial component is sense similarity cal-
culation between sense definitions, each with around 15 words. Meanwhile, for NLP tasks that
process large contexts, operating in the surface word space can achieve reasonably good results,
hence obfuscating the need to pay special attention to incorporating lexical semantics explicitly.
MSR04: Another data set widely used to evaluate STS models is the Microsoft Paraphrase Corpus
(MSR04) [Dolan et al., 2004]. The MSR04 data set comprises a larger set of sentence pairs: 4,076
training and 1,725 test pairs, taken from web news sources. The data set originally targeted
the task of Paraphrase Recognition, and accordingly is not accompanied by continuous scores. The
paraphrase ratings are binary labels: similar/not similar. This is not a problem per se; however, the
issue is that it is very conservative in its assignment of a positive label (similar). For example, the
following sentence pair, as cited in [Islam and Inkpen, 2008], is rated as not semantically equivalent:
• Ballmer has been vocal in the past warning that Linux is a threat to Microsoft.
• In the memo, Ballmer reiterated the open-source threat to Microsoft.
Since the labels are binary, apart from accuracy, F-measure is also used to evaluate performance in
terms of precision and recall of positive examples.
STS12, STS13 and STS14: In the Semantic Textual Similarity task, SemEval 2012 Task 6 [Agirre et
al., 2012] (STS12), a large collection of sentence pairs (a training set of 2,234 pairs and a test set of
3,150 pairs) is annotated with graded similarity scores in the range [0, 5]. The scale is inspired
by the annotation schema of LI06.
STS12/STS13/STS14 include sentence pairs from very different data genres. The sentence pairs of
STS12 cover dictionary sense definitions from WordNet and OntoNotes [Hovy et al., 2006], machine
data set    size (pairs)    description
STS12 train    750    msr-par: Microsoft Research Paraphrase Corpus [Dolan et al., 2004]
    750    msr-vid: Microsoft Research Video Description Corpus [Chen and Dolan, 2011]
    734    smt-eur: shared task of the 2007 ACL Workshop on Statistical Machine Translation [Callison-Burch et al., 2007]
STS12 test    750    msr-par: same as STS12 train
    750    msr-vid: same as STS12 train
    459    smt-eur: same as STS12 train
    399    smt-news: news conversation sentence pairs from the Workshop on Machine Translation [Callison-Burch et al., 2008]
    750    on-wn: pairs of sentences where the first sentence is an OntoNotes [Hovy et al., 2006] gloss and the second sentence is a WordNet gloss
STS13    750    headlines: news headlines mined from several news sources by European Media Monitor [Clive et al., 2005] leveraging RSS feeds
    189    fn-wn: pairs of sentences where the first sentence is a FrameNet [Baker et al., 1998] gloss and the second sentence is a WordNet gloss
    561    on-wn: same as STS12 test
    750    smt: an SMT dataset derived from the DARPA GALE HTER and HyTER datasets, where one sentence is an MT output and the other is a reference translation
STS14    750    headlines: same as STS13
    750    on-wn: same as STS12 test
    450    deft-forum: a subset of discussion forum data in the DARPA DEFT data collection
    300    deft-news: a subset of news article data in the DARPA DEFT data collection
    750    images: a subset of the Image Descriptions data set from PASCAL VOC-2008 [Rashtchian et al., 2010]
    750    tweet-news: a subset of the Linking-Tweets-to-News short text pairs [Guo et al., 2013], where the first text snippet is from tweets and the second text snippet is from news headlines
Table 1.1: Genres and data sources for each of the subset data sets in STS12/STS13/STS14
translation output and reference pairs from the translation shared tasks of the 2007 and 2008 ACL
Workshops on Statistical Machine Translation [Callison-Burch et al., 2007; Callison-Burch et al.,
2008], some pairs from a video paraphrase corpus [Chen and Dolan, 2011], and the existing news
paraphrase data set, the Microsoft Paraphrase Corpus. Later, in the *SEM 2013 Shared Task [Agirre
et al., 2013] (STS13), a similar test set of 2,250 pairs was developed under the same annotation
guidelines, containing the new genres of news headline pairs gathered by the Europe Media Monitor
engine [Clive et al., 2005] and FrameNet [Baker et al., 1998] gloss to WordNet gloss pairs. In SemEval
2014 Task 10 [Agirre et al., 2014] (STS14), a test set of 3,750 English pairs was released. The new
genres in STS14 are DEFT forum data, image descriptions [Rashtchian et al., 2010], and tweets/news
pairs [Guo et al., 2013]. For the first time, the organizers also developed a test set of 804 Spanish
pairs. A brief description of the genres of STS12/STS13/STS14 is presented in Table 1.1.
The development of large training data in STS12/STS13/STS14 has had a very big impact,
enabling supervised learning on the similarity scores. Because of the large data size and non-binary
similarity scores, these three data sets are highly beneficial for future work. In our evaluation, we
conduct experiments on these three data sets.
1.2.3 Approaches
We can see a clear correlation between supervised/unsupervised methods and the data sets they are eval-
uated on. Early work on STS is mostly unsupervised and evaluated on small data sets such
as LI06 or MSR04. The most recent work benefits from the development of the large data sets
STS12/STS13/STS14, and thus supervised approaches are extensively adopted.
Unsupervised Approaches: STS enjoys a close relationship to lexical semantics, as the STS task was
first introduced by the lexical semantics community. Accordingly, early work on short text similarity
[Li et al., 2006; Mihalcea et al., 2006; Islam and Inkpen, 2008; Tsatsaronis et al., 2010] focuses on
leveraging lexical semantics techniques to discover the similarity between different words within
the two sentences and thereby determine whether the sentences are related.
The general framework of these works is to: (1) first decompose the short text similarity problem
into word similarity problems; (2) then calculate the overall textual similarity by summing up some
of the word similarity values with normalization.
Specifically, the lexical semantics techniques are sense/word similarity measures, which are
graveyard area land sometime near church
cemetery 0.505 0.010 0.195 0.162 0.297 0.449
place 0.018 0.248 0.204 0.083 0.017 0.011
body 0.242 0 0.039 0.071 0.032 0.044
ash 0.416 0.041 0.134 0.133 0.225 0.124
Table 1.2: Word pairwise similarities. The content words in the first sentence
are cemetery, place, body, ash; the content words in the second sentence are graveyard, area, land,
sometime, near, church. Each cell stores a word similarity value; the numbers in red denote the
word pair alignment that maximizes the total sum
knowledge-based [Li et al., 2006; Feng et al., 2008; Ho et al., 2010; Tsatsaronis et al., 2010],
corpus-based [Islam and Inkpen, 2008] or hybrid [Mihalcea et al., 2006]. Most knowledge-based
word similarity measures rely on machine readable dictionaries, of which the most widely used is
WordNet [Fellbaum, 1998], where the graph structure of the taxonomy is the main resource for com-
puting word similarity. In the corpus-based approach [Islam and Inkpen, 2008], word similarity
is computed based on mutual information between words in a corpus. Ho et al. [2010] went be-
yond the word token and transformed the sentence into a sense representation after performing word
sense disambiguation, so the second step is replaced by summing up sense similarity scores. They
achieved better, though not statistically significant, performance on the LI06 data set.
In terms of the second step, calculating the overall short text similarity, we present some
representative methods. Mihalcea et al. [2006] calculated the text similarity as the sum of word
similarities normalized by inverse document frequency (IDF) values:
sim(T_1, T_2) = \frac{1}{2} \left( \frac{\sum_{w \in T_1} maxSim(w, T_2) \cdot idf(w)}{\sum_{w \in T_1} idf(w)} + \frac{\sum_{w \in T_2} maxSim(w, T_1) \cdot idf(w)}{\sum_{w \in T_2} idf(w)} \right)

where maxSim(w, T) is the highest similarity between w and any word in T. Instead of choosing the maximum similarity value for a word, Islam and Inkpen [2008] searched for an
alignment between words in two texts, and then computed the sum of the similarity of the aligned
word pairs. The aligned word pairs are chosen to maximize the sum. An example of such an align-
ment from their paper [Islam and Inkpen, 2008] is illustrated in Table 1.2; accordingly, the textual
similarity is the sum of the similarity scores of these four aligned word pairs.
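The maximization step can be recomputed by brute force over all assignments of first-sentence words to distinct second-sentence words (a sketch of the objective over the values in Table 1.2, not Islam and Inkpen's actual search algorithm):

```python
from itertools import permutations

rows = ["cemetery", "place", "body", "ash"]
cols = ["graveyard", "area", "land", "sometime", "near", "church"]
# Word pairwise similarities, as in Table 1.2 (rows x cols).
sim = [[0.505, 0.010, 0.195, 0.162, 0.297, 0.449],
       [0.018, 0.248, 0.204, 0.083, 0.017, 0.011],
       [0.242, 0.000, 0.039, 0.071, 0.032, 0.044],
       [0.416, 0.041, 0.134, 0.133, 0.225, 0.124]]

# Each permutation assigns one distinct column (word) to each row (word);
# keep the assignment with the largest total similarity.
best = max(permutations(range(len(cols)), len(rows)),
           key=lambda p: sum(sim[i][j] for i, j in enumerate(p)))
score = sum(sim[i][j] for i, j in enumerate(best))
for i, j in enumerate(best):
    print(rows[i], "->", cols[j], sim[i][j])
print(round(score, 3))
```

Brute force is fine at this scale (360 candidate assignments); for longer sentences a polynomial-time assignment algorithm such as the Hungarian method would be used instead.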
The second set of approaches works within the low-dimensional space. Dimension reduction
techniques, such as LSA/LDA, can fully exploit word co-occurrence information and subsequently
map the short texts to low dimensional dense vectors. Textual similarity is computed as the cosine
similarity between two vectors. However, early attempts at addressing STS using LSA [Mihalcea
et al., 2006; O’Shea et al., 2008], or LDA (experiments shown in [Guo and Diab, 2012b]), are
significantly outperformed by lexical semantics based models. Recently, many supervised methods
directly use the similarity scores returned by LSA or LDA as features; however, there has been very
little effort on improving the dimension reduction models themselves.
Supervised Approaches: The development of the large scale dataset STS12 makes supervised
systems for short text similarity possible. A supervised system is able to combine NLP features
from different aspects and train a regression model on these features to better approximate the
groundtruth similarity scores. To create features, a common technique researchers adopt is stack-
ing, which is to train a model to combine the predictions of several other learning algorithms.
This is evident in many competitive supervised systems [Bar et al., 2013; Severyn et al., 2013;
Han et al., 2013]. Table 1.3 shows a list of such features used in the system DKpro [Bar et al.,
2013].
features    description
string similarity    the number of overlapping ngram characters
pairwise word similarity    similar to the approach of [Mihalcea et al., 2006]
vector space model    the similarity value returned by LSA
syntactic similarity    overlap of POS ngrams
stylistic similarity    a measure which compares function word frequencies
phonetic similarity    pairwise phonetic comparisons of words
Table 1.3: List of features used in DKpro [Bar et al., 2013]
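The stacking technique described above can be sketched in a toy form (the base predictors and data below are invented for illustration): several base similarity scores are computed per sentence pair, and a linear regression learns how to combine them.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented per-pair predictions from three base similarity systems
# (e.g., string overlap, LSA cosine, word-similarity alignment).
n = 40
base_preds = rng.random((n, 3))
# Synthetic gold scores: a noisy combination of the base predictors.
gold = base_preds @ np.array([0.5, 0.3, 0.2]) + 0.05 * rng.standard_normal(n)

# Stacking: least-squares regression over the base predictions plus a bias.
A = np.hstack([base_preds, np.ones((n, 1))])
w, *_ = np.linalg.lstsq(A, gold, rcond=None)
stacked = A @ w

def mse(p):
    return float(np.mean((p - gold) ** 2))

# The combined predictor tracks gold at least as well as any single one.
print(mse(stacked), min(mse(base_preds[:, i]) for i in range(3)))
```

Real systems such as DKpro use a regularized log-linear regressor and far richer features, but the combination principle is the same.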
Apart from that, the supervised labels enable exploiting another interesting category of features,
namely, sentential structural features. Usually a sentence contains a large number of structural
features, not all of which are relevant; given supervised labels, the useful ones can typically be
identified more easily.
In the following, we elaborate on some of the features and techniques used in supervised ap-
proaches by reviewing several successful systems.
DKpro [Bar et al., 2013] is an example of a stacking system, and is the best performing system
among the SemEval 2012 Task participants. DKpro achieves a Pearson’s correlation of 0.8239 on
the STS12 test data set. They used simple surface lexical features, such as character and word ngrams
and common letter subsequences, combined with complex features such as LSA latent vectors and word
similarity scores. Also, to alleviate the word sparsity problem, they employed lexical substitution
and machine translation to obtain more lexical features. All these features are subsequently fed into
a log-linear regression model.
UMBC EBIQUITY [Han et al., 2013] is the best performing participant system on STS13 (with a
weighted Pearson’s correlation of 0.6181). Han et al. trained a support vector regression model
using features such as lexical semantics with WordNet, word ngrams, word alignment between the
two sentences, and stacking with tree kernel similarity.
Severyn et al. [2013] were the first to incorporate structural features in this task. They converted
the input texts to syntactic trees and relied on tree kernels to learn relevant features. Several syn-
tactic tree representations are combined in a tree kernel, while other features (such as string
similarity and word pair similarity) are incorporated in a stacking fashion. Together with domain adap-
tation, their model achieved a state-of-the-art Pearson’s correlation of 0.8810 on STS12.
Another interesting structural feature is probabilistic soft logic [Kimmig et al., 2012; Bach et al.,
2013], which is applied to STS by Beltagy et al. [2014]. The benefits of using probabilistic soft logic
are: (1) it allows fast inference; (2) it is designed for computing similarity between complex structured
objects; (3) compared to tree kernels, the logic representation captures more direct semantics. Their
model is evaluated on the msr-vid and msr-par data sets in STS12, receiving lower Pearson’s
scores than DKpro [Bar et al., 2013]: 0.83 on msr-vid and 0.49 on msr-par, compared to 0.87 and
0.68 for DKpro.
1.2.4 Applications
STS is a core component in many sentential semantics based NLP tasks, and hence it is applied in a
wide range of tasks. In Text Coherence Detection [Lapata and Barzilay, 2005], similarity between
adjacent sentences is calculated to measure the local coherence of machine-generated texts. In
unsupervised Word Sense Disambiguation, sense relatedness plays a crucial role in disambiguating
senses. Lesk [1986] measured the relatedness of senses by the similarity of two sense definitions,
counting the number of overlapping words/phrases between the two sense definition sentences. We
acquire a more accurate sense similarity by projecting a definition sentence into a latent vector,
where the sense similarity is the cosine similarity of the two latent vectors [Guo and Diab, 2012a].
In automated pyramid evaluation for text summarization [Passonneau et al., 2013], phrase similarity
is employed to identify the same concepts appearing in the model summary and submitted summaries.
Moreover, computing similarity between tweets is a common step in Twitter related research.
In tweet clustering [Jin et al., 2011], extensive pairwise tweet similarity is computed during clus-
tering. To overcome the word sparsity problem, URLs present in the tweets are used to augment
the tweet data, impacting performance significantly and boosting the clustering purity score from 0.280
to 0.392. In tweet recommendation [Yan et al., 2012] and tweet retrieval [Huang et al., 2012], the
tweets most relevant to a given tweet or keywords are identified based on similarity scores.
In tweet paraphrase detection [Xu et al., 2014], tweet pairwise similarity is a strong unsupervised
baseline. In event summarization [Shen et al., 2013], a hybrid TF-IDF approach is used to extract
representative tweets.
Part I
Dimension Reduction for Short Text
Similarity
Chapter 2
Enrich Short Text by Modeling Missing
Words
To date, most of the NLP community has focused on document level similarity, where abundant words
exist in a document and thus accurate similarity scores can be obtained simply by cosine similarity
in the original word space. With the pervasive presence of social media such as Twitter feeds and
SMS, the notion of a document has changed from hundreds of words to simple utterances or
sentences, rendering the need for computing meaningful similarity scores for short text snippets.
However, due to the small context of these short texts, the cosine similarity approach in the original
word space fails to identify many semantically relevant pairs: most text pairs have a cosine similarity
of 0, because of the few common words between them (even though they may be semantically
relevant). In this chapter, we present our first attempt to solve this problem.
We believe that the bottleneck is that the explicitly available features (the observed
high dimensional words in the short text) that represent such short text data are far too few.
Thereby, we focus our efforts on augmenting these explicit features with other features, namely,
modeling the missing words of the short text. The missing words of a text are defined as the total vo-
cabulary in the collection excluding the words that are present in the text. Our intuition
behind explicitly modeling missing words is that the missing words serve as negative examples
telling us what the text is not about. Together with the observed words in the text, the missing words
complete the full semantic map of the utterance being modeled. Explicitly modeling missing words
in practice adds thousands more features for each text, which leads to robust modeling
of the short text data.
2.1 Introduction
The challenge of the short text similarity (STS) problem lies in the sparsity of features present in
the text data. In the data set released by Agirre et al. [2012], on average there are only 10.8 words in
each text snippet. Such a small number of words typically results in very few overlapping words in
a short text pair, yielding a cosine similarity score of 0 for most short text pairs and ignoring many
cases where the two texts are indeed highly semantically related.
One natural solution is to leverage dimension reduction models, such as Latent Semantic Anal-
ysis (LSA) [Deerwester et al., 1990], Probabilistic Latent Semantic Analysis (PLSA) [Hofmann,
1999] or Latent Dirichlet Allocation (LDA) [Blei et al., 2003], to extract a low dimensional rep-
resentation for each short text, on which meaningful cosine similarity scores can be calculated.
However, previous attempts at addressing the short text similarity task using LSA performed signif-
icantly below high dimensional word similarity based models [Mihalcea et al., 2006; O’Shea et al.,
2008]. When topic models are applied to short text data, we observe that only one dominant topic
can be extracted. The reason, again, is that there are very few observed words in a text: it is very hard
for a topic model to learn a K-dimensional vector from only around 10 words.
We believe that the dimension reduction approaches applied to date have not yielded positive
results due to deficient modeling of the sparsity in the semantic space. In this thesis, we propose
to model the missing words (words that are not observed in the text data), a feature that is typically
overlooked in the text modeling literature, to address the sparseness issue for the short text similarity
task. We define the missing words of a text as the whole vocabulary in a corpus minus the observed
words in the text. Our intuition is that since the observed words in a short text are too few to tell us what
the text is about, the missing words can be used to tell us what the text is not about. We want to use
the missing words as negative examples to guide us in finding the optimal semantic hypothesis for
a text. Our idea is illustrated in Figure 2.1.
CHAPTER 2. ENRICH SHORT TEXT BY MODELING MISSING WORDS 18
After analyzing the way traditional dimension reduction models (LSA/PLSA/LDA) handle missing words, we decide to model the data using a weighted matrix factorization approach [Srebro and Jaakkola, 2003], which allows us to treat observed words and missing words differently. We handle missing words using a weighting scheme that distinguishes missing words from observed words, yielding robust latent vectors for short texts.

Figure 2.1: An example to illustrate why missing words should be helpful: the red dots are observed words in the text; the green dots represent missing words; the black dot denotes the hypothesis of the latent vector of the text data. (a) When only observed words are explicitly taken into account, the text node lies at the center of the observed words and is therefore close to all the missing words. (b) After missing words are explicitly taken into account, the text node's position is adjusted so that it is also away from the missing words. After taking the missing words into consideration, we obtain a better estimate of where the black dot should be.
The properties of our model are: (1) it is an unsupervised approach that requires no annotated labels; (2) it is a simple model that exploits only bag-of-words features for short texts (exactly
the same information LSA/LDA uses); (3) since we use the missing word feature, which is already
implied by the text itself, our approach is very general (similar to LSA/LDA) in that it can be applied
to any format of short texts. In contrast, existing work on modeling short texts focuses on exploiting
additional data, e.g., Ramage et al. [2010] modeled tweets using their metadata (author, hashtag,
etc.).
2.2 Limitations of LDA and LSA
Usually dimension reduction models aim to find a latent semantic profile for a text that is most
relevant to the observed words. By explicitly modeling missing words, we set another criterion
to the latent semantic profile: it should not be associated with the missing words from the text.
Intuitively, missing words are not as informative as observed words, but they bear on the overall semantic picture for textual data, as they inform us what the text is not about. Therefore there is a need for a model that represents this information well: the missing words are relevant, but they must be modeled with the right level of emphasis/impact.
LSA and PLSA/LDA work on a word-document co-occurrence matrix (in our context, each
short text is considered a document). Given a corpus, the rows of the matrix are the M unique words in the corpus, and the N columns are the document IDs. The resulting M × N co-occurrence matrix X contains a TF-IDF value in each cell Xij, namely the TF-IDF value of word wi in document dj. All zero cells (Xij = 0) are missing words.
Topic models (PLSA/LDA) do not explicitly model missing words. PLSA assumes each document has a distribution over K topics P(zk|dj), k = 1, 2, ..., K, j = 1, 2, ..., N, and each topic has a distribution over the whole vocabulary of the corpus P(wi|zk), i = 1, 2, ..., M. Therefore, PLSA finds
a topic distribution for each document that maximizes the log likelihood of the corpus X (LDA has
a similar form):

\[ \sum_{i}\sum_{j} X_{ij} \log \sum_{k} P(z_k \mid d_j)\, P(w_i \mid z_k) \tag{2.1} \]
In this formulation, missing words do not contribute to the estimation of document semantics, i.e.,
excluding missing words (Xij = 0) in equation 2.1 does not make a difference.
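To make this concrete, the following toy computation (all numbers are invented for illustration) verifies that cells with Xij = 0 contribute nothing to the log likelihood in equation 2.1:

```python
import math

# Toy corpus: 3 words, 2 documents, 2 topics (all values hypothetical).
X = [[2.0, 0.0],   # X[i][j]: TF-IDF of word i in document j; 0.0 marks a missing word
     [0.0, 1.5],
     [1.0, 0.0]]
P_z_d = [[0.9, 0.1],   # P(z_k | d_j), one row per document j
         [0.2, 0.8]]
P_w_z = [[0.5, 0.1],   # P(w_i | z_k), one row per word i, topics as columns
         [0.2, 0.6],
         [0.3, 0.3]]

def log_likelihood(X, P_z_d, P_w_z, skip_zeros=False):
    ll = 0.0
    for i, row in enumerate(X):
        for j, x_ij in enumerate(row):
            if skip_zeros and x_ij == 0.0:
                continue
            p = sum(P_z_d[j][k] * P_w_z[i][k] for k in range(2))
            ll += x_ij * math.log(p)
    return ll

# Cells with X_ij = 0 contribute 0 * log(p) = 0, so the two sums are identical.
assert log_likelihood(X, P_z_d, P_w_z) == log_likelihood(X, P_z_d, P_w_z, skip_zeros=True)
```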
However, empirical results show that given a small number of observed words in a document,
usually topic models can only find one dominant topic (the most evident topic) for a document,
e.g., the concept definitions of bank#n#1 and stock#n#1 are both assigned the financial topic alone, without any further discernibility. As a result, many documents are assigned exactly the same semantic profile as long as they pertain to the same domain/topic. Therefore, any two documents in the same topic will have a cosine similarity of 1; otherwise the cosine similarity is 0. This is not a desirable feature, since applications need distinguishable similarity scores. The reason for extracting only the dominant topic is that these topic models try to learn
a 100-dimension latent vector (assume dimension K = 100) from very few features (10 observed
words on average). It would be desirable if topic models could exploit missing words (a lot more data than observed words) to render more nuanced latent semantics, so that pairs of documents in the same domain can be differentiated.
On the other hand, LSA explicitly models missing words but not at the right level of emphasis.
LSA finds another matrix $\hat{X}$ with rank K to approximate X using Singular Value Decomposition ($X \approx \hat{X} = U_K \Sigma_K V_K^{\top}$), such that the Frobenius norm of the difference between the two matrices is minimized:

\[ \sqrt{\sum_{i}\sum_{j} \left( X_{ij} - \hat{X}_{ij} \right)^{2}} \tag{2.2} \]
In effect, LSA allows missing and observed words to equally impact the objective function.
Given the inherently short length of the texts, LSA (equation 2.2) allows much more potential influence from missing words than from observed words (99.9% of the cells in X are 0). Hence the contribution of the observed words is significantly diminished. Moreover, the true semantics of the
document is actually related to some missing words, but such true semantics will not be favored
by the objective function, since equation 2.2 allows for too strong an impact by forcing Xij = 0
for any missing word. Therefore the LSA model, in the context of short texts, is allowing missing
words to have a significant “uncontrolled” impact on the model.
model             financial  sport  institution   Ro    Rm   Ro − Rm   Ro − 0.01Rm
topic models: v1     1         0        0         20   600    -580         14
LSA: v2              0.2       0.3      0.2        5   100     -95          4
ideal: v3            0.6       0        0.1       18   300    -282         15

Table 2.1: Three possible latent vector hypotheses for the text data, which is the WordNet sense definition of bank#n#1: a financial institution that accepts deposits and channels the money into lending activities. Assume there are only three topics in the corpus: financial, sport, institution. Ro denotes the relatedness score between the hypothesis and the observed words; Rm denotes the relatedness score between the hypothesis and the missing words.
2.2.1 An Example
We list three latent semantic profiles for the short text corresponding to the concept definition of bank#n#1 in Table 2.1, which illustrates our analysis of topic models and LSA. Assume there are three dimensions: financial, sport, institution. We use Ro to denote the sum of semantic relatedness scores between a latent vector v and all observed words; similarly, Rm is the sum of relatedness scores between v and all missing words. The first vector profile v1 is chosen by maximizing Ro (its Ro = 20 is the largest of the three), hence it is the one generated by topic models. It suggests bank#n#1 is only related
to the financial dimension. The second latent vector (found by LSA) has the maximum value of
Ro−Rm = −95, but obviously the latent vector is not related to bank#n#1 at all. This is because
LSA treats observed words and missing words exactly the same, and due to the large number of missing words, the information from the observed words is lost: Ro − Rm ≈ −Rm. The third vector is the
ideal semantic profile, since it is also related to the institution dimension. It has a slightly smaller
Ro in comparison to the first vector, yet it has a substantially smaller Rm.
In order to favor the ideal vector over other hypotheses, we simply need to adjust the objective
function by assigning a smaller weight to Rm, such as: Ro − 0.01 × Rm in the 8th column of
Table 2.1. Accordingly, we use weighted matrix factorization [Srebro and Jaakkola, 2003] to model
missing words.
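The comparison can be reproduced in a few lines; the (Ro, Rm) values below are taken from Table 2.1, while the scoring function follows the discussion above:

```python
# Relatedness scores (Ro, Rm) for the three hypotheses in Table 2.1.
hypotheses = {"topic models: v1": (20, 600),
              "LSA: v2": (5, 100),
              "ideal: v3": (18, 300)}

def score(Ro, Rm, wm):
    # Objective favoring observed words, with missing words down-weighted by wm.
    return Ro - wm * Rm

# wm = 1 (LSA-like) favors v2; wm = 0 (topic-model-like) favors v1;
# a small wm = 0.01 favors the ideal vector v3.
best = {wm: max(hypotheses, key=lambda h: score(*hypotheses[h], wm))
        for wm in (1, 0, 0.01)}
print(best)  # {1: 'LSA: v2', 0: 'topic models: v1', 0.01: 'ideal: v3'}
```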
2.3 The Proposed Approach
2.3.1 Weighted Matrix Factorization
The weighted matrix factorization approach is very similar to SVD, except that it allows for direct
control on each matrix cell Xij . The model factorizes the original matrix X into two matrices P
and Q such that X ≈ P>Q, where X is an M ×N matrix, P is a K ×M matrix, and Q is a K ×N
matrix (Figure 2.2).
The model parameters (the vectors in P and Q) are optimized by minimizing the objective function:

\[ \sum_{i}\sum_{j} W_{ij} \left( P_{\cdot,i}^{\top} Q_{\cdot,j} - X_{ij} \right)^{2} + \lambda \lVert P \rVert_{2}^{2} + \lambda \lVert Q \rVert_{2}^{2} \tag{2.3} \]
where λ is a free regularization factor, and the weight matrix W defines a weight for each cell in X .
Accordingly, P·,i is a K-dimensional latent semantic vector profile for word wi; similarly, Q·,j is a K-dimensional vector profile that represents the text dj. Operations on these K-dimensional vectors have very intuitive semantic meanings:
(1) the inner product of P·,i and Q·,j is used to approximate semantic relatedness of word wi and
document dj : P·,i · Q·,j ≈ Xij , as the shaded parts in Figure 2.2; a large value of Xij means P·,i
and Q·,j should be more similar; in other words, they should share more common topics;
(2) equation 2.3 explicitly requires that a document not be related to its missing words by forcing P·,i · Q·,j = 0 for missing words (Xij = 0);
(3) we can compute the similarity of two documents dj and dj′ using the cosine similarity between
vectors Q·,j and Q·,j′ .
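As a small illustration of point (3), assuming an already-factorized Q (the latent values below are invented):

```python
import numpy as np

def text_similarity(Q, j1, j2):
    """Cosine similarity between the latent profiles of documents j1 and j2."""
    q1, q2 = Q[:, j1], Q[:, j2]
    return float(q1 @ q2 / (np.linalg.norm(q1) * np.linalg.norm(q2)))

# Toy K = 3 latent space with 3 documents as columns; values are illustrative.
Q = np.array([[0.9, 0.8, 0.0],
              [0.1, 0.2, 0.9],
              [0.0, 0.1, 0.1]])
print(text_similarity(Q, 0, 1))  # high: documents 0 and 1 share topics
print(text_similarity(Q, 0, 2))  # low: little topic overlap
```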
Alternating least squares [Srebro and Jaakkola, 2003] can be used to compute the latent vectors in P and Q: P and Q are first randomly initialized, then computed iteratively by the following
equations (derivation can be found in [Srebro and Jaakkola, 2003]):
\[ P_{\cdot,i} = \left( Q W^{(i)} Q^{\top} + \lambda I \right)^{-1} Q W^{(i)} X_{i,\cdot}^{\top}, \qquad Q_{\cdot,j} = \left( P W^{(j)} P^{\top} + \lambda I \right)^{-1} P W^{(j)} X_{\cdot,j} \tag{2.4} \]
where W(i) = diag(Wi,·) is an N × N diagonal matrix containing the ith row of the weight matrix W. Similarly, W(j) = diag(W·,j) is an M × M diagonal matrix containing the jth column of W.
It is worth noting that P and Q are computed iteratively, i.e., in an iteration each P·,i (i = 1, ..., M) is calculated based on Q, and then each Q·,j (j = 1, ..., N) is calculated based on P.
Figure 2.2: Matrix factorization (X ≈ P⊤ × Q): the M × N matrix X is factorized into two matrices, the K × M matrix P and the K × N matrix Q; K denotes the number of latent dimensions.
This can be computed efficiently since: (1) all P·,i share the same QQ>; similarly all Q·,j share
the same PP>; (2) X is very sparse. More details on accelerating the computation can be found in
[Steck, 2010].
2.3.2 Modeling Missing Words
It is straightforward to implement the idea in section 2.2.1 (choosing a latent vector that maximizes
Ro − 0.01× Rm) in the weighted matrix factorization framework, by assigning a small weight for
all the missing words in equation 2.3:
\[ W_{ij} = \begin{cases} 1, & \text{if } X_{ij} \neq 0 \\ w_m, & \text{if } X_{ij} = 0 \end{cases} \tag{2.5} \]
We refer to the resulting model as Weighted Matrix Factorization (WMF). The algorithm, which uses alternating least squares, is presented in Algorithm 1.
This solution is elegant: (1) it explicitly tells the model that, in general, all missing words should not be related to the short text; (2) meanwhile, latent semantics are mainly generalized from observed words, and the model is not penalized too much (wm is very small) when it is very confident that the text is highly related to a small subset of its missing words based on their latent semantic profiles (e.g., the bank#n#1 definition text is strongly related to its missing words check and loan).
In fact, the weight value reflects the confidence we have in the cells of X. If Xij = 0 (a missing word), most likely word wi is irrelevant to document dj. However, there is still a small chance that wi is a related word, such as check or loan for the bank#n#1 sense definition. Therefore we are less confident about the 0 values, and assign them a small weight.
We adopt the same approach of assigning a small weight to some cells (feature values), as proposed for recommender systems [Steck, 2010]. In recommender systems, an incomplete rating matrix R is formed, where rows are users and columns are items. Typically, a user rates only a small portion of the items, hence the recommender system needs to predict the missing ratings.

Algorithm 1: WMF
Procedure P, Q = WMF(X, W, λ, n_itr)
    n_words, n_docs ← size(X)
    randomly initialize P, Q
    for itr ← 1 to n_itr do
        for j ← 1 to n_docs do
            Qj,· ← (P⊤W(j)P + λI)−1 P⊤W(j) X·,j
        for i ← 1 to n_words do
            Pi,· ← (Q⊤W(i)Q + λI)−1 Q⊤W(i) X⊤i,·
(In the algorithm, words and documents are stored as rows of P and Q, i.e., the transpose of the convention used in equation 2.4.)
Steck [2010] imputed a value for all the missing cells, and set a small weight for those cells.
Compared to [Steck, 2010], we are facing a different problem and targeting a different goal. We
have a full matrixX where missing words have a 0 value, while the missing ratings in recommender
systems are unavailable – the values are unknown, hence the rating matrix R is not complete. In
the recommender system setting, they are interested in predicting individual ratings, while we are
interested in the text semantics. More importantly, they do not have the sparsity issue (on average
each movie has been rated over 250 times in the MovieLens data1) and robust predictions can be
made based on the observed ratings alone.
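As a concrete illustration, equations 2.3-2.5 and Algorithm 1 can be sketched with NumPy. This is our own minimal dense-matrix sketch, not the thesis implementation (which relies on the sparse speed-ups of Steck [2010]); the function name, initialization scale, and defaults are our assumptions:

```python
import numpy as np

def wmf(X, w_m=0.01, lam=20.0, K=10, n_itr=20, seed=0):
    """Weighted matrix factorization via alternating least squares.

    X is the M x N word-document TF-IDF matrix (dense here for clarity).
    Returns P (K x M word vectors) and Q (K x N document vectors)."""
    M, N = X.shape
    rng = np.random.default_rng(seed)
    P = 0.01 * rng.standard_normal((K, M))
    Q = 0.01 * rng.standard_normal((K, N))
    W = np.where(X != 0, 1.0, w_m)       # equation 2.5: down-weight missing words
    reg = lam * np.eye(K)
    for _ in range(n_itr):
        for j in range(N):               # update document vectors (equation 2.4)
            PW = P * W[:, j]             # K x M, column i scaled by W_ij
            Q[:, j] = np.linalg.solve(PW @ P.T + reg, PW @ X[:, j])
        for i in range(M):               # update word vectors (equation 2.4)
            QW = Q * W[i, :]             # K x N, column j scaled by W_ij
            P[:, i] = np.linalg.solve(QW @ Q.T + reg, QW @ X[i, :])
    return P, Q
```

Two short texts dj and dj' are then compared by the cosine between the columns Q[:, j] and Q[:, j'].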
2.4 Experiments
2.4.1 Experiment setting
Task and data sets: the WMF model is evaluated within the context of the short text similarity task, where a system needs to predict a similarity score for a pair of short texts. The evaluation metric
is the Pearson correlation coefficient between the gold similarity scores and a system's predicted scores.
1http://www.grouplens.org/node/73, with the 1M data set being the most widely used.
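For reference, Pearson's correlation can be computed directly from its definition; the gold and predicted scores below are invented for illustration:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical gold scores (0-5 scale) and system cosine similarities (0-1 scale);
# Pearson's r is invariant to the scale of the predictions, only linearity matters.
gold = [5.0, 4.2, 3.1, 1.0, 0.2]
pred = [0.95, 0.90, 0.62, 0.30, 0.05]
print(pearson(gold, pred))
```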
The evaluation data sets are developed from the SemEval-2012 Semantic Textual Similarity task (STS12) [Agirre et al., 2012], the *SEM 2013 shared STS task (STS13) [Agirre et al., 2013],
and SemEval-2014 STS task (STS14) [Agirre et al., 2014].2 For STS12, the training data (2234
pairs) is used as the tuning set for setting the parameters of our models. This data comprises msr-par news sentence paraphrases from the Microsoft Paraphrase Corpus [Dolan et al., 2004], msr-vid video description paraphrases [Chen and Dolan, 2011], and smt-eur translation data [Callison-Burch et al., 2007]. Once the models are tuned, we evaluate them on the STS12 test data set, STS13 data set
and STS14 data set. It is worth noting that the tuning data and test data are not from the same
sources: the STS12 test set comprises out-of-domain sentence pairs such as OntoNotes [Hovy et al., 2006] dictionary glosses; STS13 has FrameNet [Baker et al., 1998] glosses, OntoNotes glosses, and news headlines; the new genres in STS14 are tweets-news data [Guo et al., 2013], OntoNotes glosses, news headlines, image descriptions [Rashtchian et al., 2010], and Deft-forum forum data.
Baselines: The performance of WMF is compared against (a) TF-IDF: a surface word based TF-
IDF weighting schema in the original high dimensional space, (b) LSA, and (c) LDA that uses
Collapsed Gibbs Sampling for inference [Griffiths and Steyvers, 2004]. The similarity of two short
texts is computed by cosine similarity either in the original word space (TF-IDF) or latent space
(LSA, LDA, WMF).
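A minimal sketch of the TF-IDF baseline follows; the sentences and IDF values are invented, and a real system would estimate IDF from a background corpus and apply the same stemming pipeline:

```python
import math
from collections import Counter

def tfidf_cosine(s1, s2, idf):
    """Cosine similarity of two short texts in the raw TF-IDF word space.
    idf: a dict word -> IDF weight estimated from a background corpus."""
    v1 = {w: c * idf.get(w, 1.0) for w, c in Counter(s1.split()).items()}
    v2 = {w: c * idf.get(w, 1.0) for w, c in Counter(s2.split()).items()}
    dot = sum(v1[w] * v2[w] for w in v1 if w in v2)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

idf = {"man": 2.0, "guitar": 3.5, "plays": 2.5, "a": 0.1, "the": 0.1}
print(tfidf_cosine("a man plays a guitar", "the man plays the guitar", idf))
# No overlapping words at all yields a similarity of exactly 0 -- the sparsity
# problem that motivates the latent-space models:
print(tfidf_cosine("a man plays a guitar", "stocks fell sharply today", idf))
```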
To eliminate randomness in WMF and LDA, all the reported results are averaged over 10 runs.
We run 20 iterations for WMF and 5000 iterations for LDA; each LDA model is averaged over
the last 10 Gibbs Sampling iterations to obtain more robust predictions.
The latent vector of a text is computed by: (1) equation 7.2 in WMF, or (2) summing up the
latent vectors of all the constituent words weighted by Xij in LSA and LDA, following the work
reported in [Mihalcea et al., 2006]. For LDA, the latent vector of a word is computed by P (z|w).
It is worth noting that we could directly use the estimated topic distribution θ of a text to represent a sentence; however, the topic distribution has non-zero values in only one or two topics, hence it loses a lot of nuanced information, leading to much worse performance.
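The weighted-sum construction for LSA/LDA text vectors can be sketched as follows; the word vectors (e.g., P(z|w) rows from LDA) and TF-IDF weights below are invented:

```python
import numpy as np

def text_vector(word_vecs, weights):
    """Latent vector of a text: sum of its word vectors weighted by the
    TF-IDF values X_ij, following [Mihalcea et al., 2006]; normalized for
    convenience (cosine similarity is scale-invariant anyway)."""
    v = sum(w * word_vecs[word] for word, w in weights.items())
    return v / np.linalg.norm(v)

# Hypothetical K = 3 word vectors and TF-IDF weights for a 3-word text.
word_vecs = {"crude": np.array([0.7, 0.2, 0.1]),
             "oil":   np.array([0.6, 0.1, 0.3]),
             "price": np.array([0.2, 0.7, 0.1])}
weights = {"crude": 2.1, "oil": 1.8, "price": 1.2}
print(text_vector(word_vecs, weights))
```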
Corpora: The corpora used by the dimension reduction models (LSA/LDA/WMF) comprise definitions from two dictionaries, WordNet and Wiktionary3, and the Brown corpus.
2A description of the data sets is summarized in Table 1.1.

Models      Parameters              STS12 tune   STS12 test   STS13   STS14
1. TF-IDF   -                          72.8         66.2       58.4    70.2
2. LSA      -                          16.1         23.0       24.9    27.5
3. LDA      α = 0.05, β = 0.05         73.5         67.1       72.5    63.6
4. WMF      wm = 1, λ = 20             15.7         23.9       23.7    27.4
5. WMF      wm = 0, λ = 20             58.7         55.7       48.7    46.2
6. WMF      wm = 0.01, λ = 20          74.3         71.7       71.8    71.7

Table 2.2: Pearson's correlation (in percentage) on the four data sets: latent dimension K = 100 for LSA/LDA/WMF. For the WMF models, the regularization factor λ is fixed at 20. Models 4-6 are WMF with different missing word weights wm, where the first two are analogous to LSA and LDA, respectively.

All definitions are
simply treated as individual documents. For the Brown corpus, each sentence is treated as a document in order to mimic short-text documents, thereby creating more coherent co-occurrence
data. All data is tokenized and stemmed using the Porter Stemmer [Porter, 2001]. The importance
of words in a text is measured by the TF-IDF schema. All the dimension reduction models (LSA,
LDA, WMF) are built on the same set of corpora: WordNet+Wiktionary+Brown (393,667 short texts, 5,252,143 tokens, and 81,848 distinct words).
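Building the word-document TF-IDF matrix X from such a corpus can be sketched as follows, with toy documents standing in for the real corpora (real preprocessing would also apply tokenization and Porter stemming):

```python
import math
from collections import Counter

def build_tfidf_matrix(docs):
    """Build the M x N word-document TF-IDF matrix X from tokenized short
    texts, each treated as one document; zero cells are the missing words."""
    vocab = sorted({w for d in docs for w in d})
    df = Counter(w for d in docs for w in set(d))      # document frequencies
    N = len(docs)
    idf = {w: math.log(N / df[w]) for w in vocab}
    X = [[0.0] * N for _ in vocab]
    for j, d in enumerate(docs):
        for w, c in Counter(d).items():
            X[vocab.index(w)][j] = c * idf[w]          # TF x IDF
    return vocab, X

docs = [["bank", "deposit", "loan"], ["bank", "river"], ["stock", "market", "price"]]
vocab, X = build_tfidf_matrix(docs)
```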
2.4.2 Results
Table 2.2 summarizes the Pearson correlation coefficient values on the tuning and test sets. All
parameters are tuned based on the tuning set. In LDA, we chose an optimal combination of α and β
from {0.01, 0.05, 0.1, 0.5}.4 In WMF, we choose the best parameters for the weight wm for missing
words and λ for regularization. We fix the dimension K = 100. Later in section 2.4.3, we will see
3http://en.wiktionary.org/wiki/Wiktionary:Main Page
4Here α (Dirichlet prior for topic distribution of a document θj) and β (Dirichlet prior for word distribution given a
topic φk) serve as the regularization terms in LDA just like λ in WMF, since a larger α or β makes P (z|θj) and P (z|wi)
more evenly distributed.
that a larger value of K can further improve performance.
There are several interesting observations in Table 2.2. Firstly, LSA is the most ineffective model. This is caused by the fact that most cells in the corpus matrix X are missing words with 0 values; LSA is overwhelmed by the missing words, which accordingly induce noisy latent vectors
for the texts. Also, we note that compared to the TF-IDF model that works in high dimensional word
space, LDA is not consistently better: LDA achieved better Pearson’s correlation than TF-IDF on
STS12 and STS13 data sets, but not on STS14. One major reason is that STS14 is a relatively easier
corpus for the TF-IDF model in the sense that it contains more common words between text pairs: in STS14, on average 54% of the words in a pair appear in both sentences, while the percentage is 51.8% for STS13, excluding the translation pairs.5
WMF that models missing words using a small weight (model 6 with wm = 0.01) outperforms
the second best model LDA by a large margin on all data sets except STS13 (+4.6%, −0.7%,
+8.1% on STS12 test, STS13, STS14, respectively). This is because LDA only uses 10 observed
words to infer a 100 dimension vector for a text, while WMF takes advantage of overwhelmingly
more missing words to learn more robust latent profiles.
We also present models 4 and 5 (both WMF) to show the impact of: (1) modeling missing words with equal weight to observed words (wm = 1), mimicking LSA, and (2) not modeling missing words at all (wm = 0), mimicking LDA, in the context of the WMF model.
As expected, both model 4 and model 5 generate much worse results.
Both LDA and model 5 ignore missing words, with better correlation scores achieved by LDA.
This may be due to the different inference algorithms: Gibbs sampling is a better inference algo-
rithm than alternating least squares which finds a local optimum solution. Model 4 and LSA are
comparable, where missing words are used with a large weight wm = 1. Both of them yield low
results. This confirms our assumption that allowing for equal impact of both observed and missing
words is not the appropriate manner for modeling the semantic space.
5The percentage is around 60% for STS12 tune and STS12 test; however, the ground truth similarity scores have a lower correlation with the TF-IDF model.
2.4.3 Analysis
In WMF and LDA models, there are several essential parameters: weight of missing words wm,
and dimension K. Figure 2.3 and 2.4 illustrate the impact of these parameters on predicting text
similarity scores.
Figure 2.3 shows the influence of wm on performance. When wm ≤ 0.01, the correlation scores
are very stable (all better than LDA except on STS13), with wm = 0.005 at the peak (even better than our tuned parameter wm = 0.01). The scores drop significantly once wm becomes larger than 0.01. Accordingly, we can conclude that the short text similarity task prefers a small missing word weight wm ≤ 0.01.
We also illustrate the influence of the dimension K = {50, 75, 100, 150, 200} on LDA and
WMF in Figure 2.4, where parameters for WMF are fixed as wm = 0.01, λ = 20, and for LDA are
α = 0.05, β = 0.05. We observed two trends: (1) in most cases, a larger dimension K produces
higher Pearson’s correlation for both models, as more dimensions allow the encoding of more se-
mantics about the original data; (2) WMF outperforms LDA in all dimensions, except the STS13
data set. In Figure 2.4c on STS13, it seems a smaller number of dimensions (K = 75) yields the best score for WMF, whereas LDA continues to benefit from a larger dimension (K = 200).
Based on the two figures, we can conclude that WMF outperforms LDA in most cases; the result
is robust with different values of the K dimension and the missing word weight wm.
Another interesting factor is the training data. In all the experiments the models are trained on
short text data; as we observe, LSA performs particularly poorly because of the length of the data.
To investigate the impact of the length of the training data, we train the models on long documents: 400,000 Wikipedia documents, the same size as the short text corpora, with 156 tokens on average per document (versus 13.3 in the short text corpora). The results are shown in Table 2.3.
With more words per document, LSA is able to improve its results by a large margin, though it remains worse than LDA and WMF. The performance of LDA and WMF degrades compared to training on the short texts. The reason may be that the dictionary definitions cover more topics.
(Figure 2.3 comprises four panels, (a) STS12 train, (b) STS12 test, (c) STS13, (d) STS14; each plots Pearson's correlation (%) of WMF on the y-axis against wm on the x-axis.)
Figure 2.3: Pearson's correlation percentage scores of WMF on each data set: the missing word weight wm varies from 0.001 to 0.1; the dimension K is fixed to 100; the regularization factor λ is fixed to 20.
(Figure 2.4 comprises four panels, (a) STS12 train, (b) STS12 test, (c) STS13, (d) STS14; each plots Pearson's correlation (%) of WMF and LDA on the y-axis against K on the x-axis.)
Figure 2.4: Pearson's correlation percentage scores of WMF and LDA on each data set: the dimension K varies from 50 to 200; the missing word weight wm is fixed to 0.01; the regularization factor λ is fixed to 20.
Models    Parameters             STS12 tune   STS12 test   STS13   STS14
TF-IDF    -                         72.8         66.2      58.4    70.2
LSA       -                         48.80        43.22     43.28   43.13
LDA       α = 0.05, β = 0.05        73.46        66.83     59.68   54.77
WMF       wm = 0.01, λ = 20         76.93        71.55     64.65   68.84

Table 2.3: Pearson's correlation (in percentage) on the four data sets: the models are trained on long documents.
2.5 Summary and Discussion
In this chapter, we analyzed how traditional models (LSA and topic models) handle missing words. Accordingly, we give missing words special treatment to alleviate the sparsity problem in modeling short texts. Experimental results on three data sets confirm our hypothesis and show that our model, WMF, significantly outperforms existing methods.
One limitation of the bag-of-words based models is that they neglect a lot of nuanced semantics, such as phrase structure and word order. For example, the two sentences the dog bit him and he bit
the dog have the same feature vectors. This is a factor that may not hurt similarity prediction too
much, but it likely prevents short text similarity from being used in other NLP tasks. Therefore, in future work, we would like to integrate phrases as additional features for short texts. One major challenge is that most phrases are very infrequent and hence too sparse to model. Therefore, we intend to filter a set of meaningful phrases and concentrate only on those.
Chapter 3

Enrich Lexical Features by Modeling Bigrams and Similar Words
In the last chapter, we introduced the bottleneck of short text modeling: simply employing the observed words provides too few features for a short text, which causes inaccurate low dimensional text representations. To this end, we integrated the missing words as additional features for a text in the matrix factorization framework, and achieved significantly more robust representations for short text data.
In this chapter, we further investigate the representation of short texts from another perspective. We argue that current dimension reduction models, including our WMF model, do not pay enough attention to lexical semantics. In LSA/LDA/WMF, the features used to represent a word are simply document IDs (see Figure 3.1), hence not very expressive. Under this simple assumption, a
lot of nuanced lexical information such as selectional preference is lost. Therefore, in this chapter
we focus on extracting and incorporating more features for words under the matrix factorization
framework, in order to infer robust lexical semantics. We believe modeling robust lexical items
is very important in the short text modeling context, since a short text contains very few observed
words, and the text representation will benefit significantly from quality word representation. The
experimental results support our hypothesis, where the new model [Guo and Diab, 2013] signifi-
cantly outperforms the WMF model in short text similarity data sets.
3.1 Introduction
Our proposed Weighted Matrix Factorization (WMF) [Guo and Diab, 2012b] has outperformed
LSA [Deerwester et al., 1990] and LDA [Blei et al., 2003] by a large margin in the short text
similarity task, yielding previous state-of-the-art performance among unsupervised systems on the
STS12 [Agirre et al., 2012] data sets. However, all three of these models make oversimplified
assumptions on how a token is generated: (1) in LSA/WMF (Figure 3.1a), a token is generated by
the inner product of the word latent vector and the corresponding document latent vector; (2) in
LDA (Figure 3.1b), all the tokens in a document are sampled from the same document level topic
distribution. Under this assumption, all these models ignore rich lexical linguistic phenomena such
as inter-word dependency, semantic scope of words, and so on; accordingly all the models simply
assume each word is related to all other words in the document. This is a result of merely using
document IDs as features to represent a word (As shown in Figure 2.2, in the data matrix X , each
row represents a word, hence the columns, which are document IDs, can be seen as features for the
word).
It is worth noting that in the text modeling community, using document IDs alone to represent a word is prevalent, since these dimension reduction techniques are usually applied to documents, where abundant words exist for extracting document level semantics. Nonetheless, we believe
this simple assumption is harmful in the short text setting. Given the limited number of words in
a text, it is crucial to make good use of each word; if one word is not modeled accurately, the
corresponding topics might not appear in the latent vector of the short text.
In this chapter, we focus on creating more features to induce quality latent semantic vectors
for words. This is motivated by the belief that a reasonable word generation story will encourage
robust lexical semantics, which can further boost the short text semantics. The features that we are interested in belong to two very different categories: the first is bigrams, which are purely corpus-based lexical semantic evidence; the second is similar word pairs, extracted from a human constructed knowledge base. These two kinds of lexical semantics are naturally different, and hence complementary to each other. We integrate both of them into the WMF model, resulting
in even better performance in the short text similarity task.
(Figure 3.1 comprises two panels: (a) matrix factorization, showing the word-document matrix X with words w1 ... wM as rows and documents d1 ... dN as columns; (b) topic models, the LDA plate diagram with variables θ, z, w and priors α, β over K topics φ.)
Figure 3.1: In current dimension reduction models (WMF/LSA and LDA), the features to represent a word are simply document IDs, which are denoted by the red circles.
3.2 Related Work
The modeling of bigrams is closely related to selectional preference. Selectional Preference de-
notes a word’s likelihood to co-occur with certain lexical sets, by “encoding the set of admissible
argument values for a relation” [Ritter et al., 2010]. For example, the word drink prefers drink-
able objects after it; names of people are more likely to appear in the argument of the verb meet.
Selectional preference proves to be helpful for a number of NLP applications, such as syntactic dis-
ambiguation [Hindle and Rooth, 1993], semantic role labeling [Gildea and Jurafsky, 2002], textual
inference [Pantel et al., 2007] and word sense disambiguation [Resnik, 1997], and many more.
Much previous work has proposed models to address selectional preference. Resnik [1996]
made use of the WordNet predefined noun word classes, and calculated the selectional preference
strength between the noun classes and observed verbs. Erk [2007] demonstrated that an approach
of computing similarity between arguments is able to provide better lexical coverage. Rooth et al.
[1999] studied the relations and arguments in a generative probabilistic model, which is extended
by Ritter et al. [2010] in the LDA framework.
In this chapter, we target modeling selectional preference to achieve a better latent representation for lexical items. As shown in the next section, we relax the traditional notion of selectional preference (modeling the association between nouns and verbs in [Resnik, 1996]), and model the association between the two words of a bigram, which is purely data driven and requires no human-constructed resources. Also, our approach has the benefit of learning co-occurrence tendencies for all words, compared to other work that targets a specific lexical type.
The other type of resource we exploit is knowledge based information: similar word pairs extracted from the dictionary WordNet [Fellbaum, 1998]. Human constructed knowledge is a great complement to corpus-based data. Because of its robustness, researchers have found the
knowledge based semantics extremely valuable in various NLP tasks such as paraphrasing [Barzilay
and Lee, 2003], lexical semantics [Yih et al., 2012], etc. In this chapter, we extract similar word
pairs from WordNet, and test its influence for short text similarity.
3.3 Incorporating Bigrams
The additional corpus-based information we exploit, other than word-document co-occurrence, is bigrams, a feature already present in the data yet ignored by most distributional similarity models. Bigrams encode the admissible arguments of a word, thus capturing more nuanced semantics than document IDs. Consider the following example (in our data set, a short text counts as a document):
Many analysts say the global Brent crude oil benchmark price, currently around $111 a barrel
By the nature of WMF/LSA/LDA, a word receives semantics from all the other words in a document; therefore the word oil, in the above example, will be assigned the incorrect finance topic, which is the dominant topic at the text level. Moreover, the problem worsens for adjectives, adverbs and verbs, which have a much narrower semantic scope than the whole sentence/short text/document. For example, the verb say should only be associated with analyst (only receiving semantics from analyst), as its semantics is not related to any other word in the sentence. In contrast, the word oil, according to its selectional preference, should only be associated with its modifier crude, which indicates the correct resource topic. We believe that modeling bigrams, which capture local evidence, completes the semantic picture for words, subsequently rendering better short text semantics. To the best of our knowledge, this is the first work to model bigrams for short text semantics.
If two words form a bigram, then the two words should share similar latent topics.1 In the previous example, crude and oil form a bigram, and they share the resource topic. In our framework, this is implemented by adding extra columns to X, so that each additional column corresponds to a bigram, treating each bigram as a pseudo-document that contains only those two words, as shown in Figure 3.2. The corresponding graphical model is illustrated in Figure 3.3b, where the extra b nodes stand for the bigrams. Therefore, oil will receive more of the resource topic from crude through the bigram crude oil, instead of only the finance topic from the sentence as a whole.
Each non-zero cell in the new columns of X, i.e., an observed token in a bigram (pseudo-
1Note this distinguishes our work from previous efforts that mainly work on noun-verb relations, e.g., admissible nouns for a verb. Since we target enhancing the latent representation of all words, our approach is very general and can be applied to any word.
Figure 3.2: Each bigram is integrated into the original corpus matrix X as an additional column (document columns d1 … dN followed by bigram pseudo-document columns b1, b2, …, such as analyst says and crude oil). From the model's perspective, a bigram is treated as a pseudo-text; accordingly, only two cells in a bigram column have non-zero values.
document), is given a different weight:

$$W_{i,j} = \begin{cases} 1, & \text{if } X_{ij} \neq 0 \text{ and } j \text{ is a document index,} \\ \gamma \cdot \mathrm{freq}(j), & \text{if } X_{ij} \neq 0 \text{ and } j \text{ is a bigram index,} \\ w_m, & \text{if } X_{ij} = 0. \end{cases} \quad (3.1)$$
freq(j) denotes the frequency of bigram j in the corpus; hence the strength of association is differentiated such that higher weights are assigned to more frequent bigrams. The coefficient γ, whose value is manually set, is a hyperparameter that controls the importance of the bigram evidence. Assigning a large γ value indicates that the bigram evidence is more trustworthy than the global textual semantics.
3.3.1 Incorporating Bigrams from Dependency Tree
An alternative way to incorporate the relation between two words is to extract bigrams from the syntactic dependency tree. The benefit of doing so is that we can extract long-range word relations. After parsing each sentence with the Stanford parser [Klein and Manning, 2003], we derive bigrams as tuples of modifier and head. However, we were not able to obtain better performance in the short text similarity task; the results are analyzed in the experiment section.
3.4 Incorporating Similar Word Pairs
We also integrate knowledge-based semantics in the WMF framework. Knowledge-based semantics, as a type of clean, human-annotated resource, is an important complement to noisy corpus-based co-occurrence information. In this section, the knowledge-based semantics we exploit is similar word pairs extracted from WordNet [Fellbaum, 1998].
These similar word pairs are very valuable for improving the quality of the latent profiles of infrequent words, because the model does not observe enough contexts to understand an infrequent word. Leveraging the pairs, an infrequent word such as purchase can "borrow" a relatively robust latent vector from a synonym such as buy, since buy appears much more frequently and the model is able to capture its semantics more accurately.
Similar word pairs can be seamlessly modeled in WMF, since in the matrix factorization framework a latent vector profile is explicitly created for each word, so we can directly operate on these word vectors. By contrast, in LDA all the data structures are designed for documents rather than words. To integrate the knowledge, we construct a graph that connects words according to the extracted similar word pairs, encouraging similar words to enjoy similar latent vector profiles, as the nodes w2 and w4 in Figure 3.3c show.
We first extract synonym pairs from WordNet: words associated with the same sense, aka synset. We further expand the set by exploiting the relations defined in WordNet. For each extracted word, we consider its first sense, and if that sense is connected to other senses by any of the WordNet-defined relations (such as hypernym, meronym, etc.), we treat the words associated with those other senses as similar words. In total, we discover more than 80,000 pairs of similar words for the 46,000 distinct words in our corpus.
Given a pair of similar words w_{i1} and w_{i2}, we want the two corresponding latent vectors P_{·,i1} and P_{·,i2} to be as close as possible, namely their cosine similarity to be close to 1. Accordingly, a term is added to equation 2.3 for each similar word pair w_{i1}, w_{i2}:

$$\delta \cdot \left( \frac{P_{\cdot,i_1}^{\top} P_{\cdot,i_2}}{|P_{\cdot,i_1}|\,|P_{\cdot,i_2}|} - 1 \right)^2 \quad (3.2)$$
|P_{·,i}| denotes the Euclidean length of the vector P_{·,i}. The coefficient δ, analogous to γ, denotes the importance of the knowledge-based evidence. Figure 3.3c shows the final WMF+BK model (WMF + corpus-based [B]igram semantics + [K]nowledge-based semantics), where the
extra link connecting w2 and w4 denotes the term in equation 3.2 that forces the two corresponding
word profile vectors to be similar.
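The penalty term of equation 3.2 for a single pair can be computed as in this small sketch (the function name and column-major layout of P are illustrative):

```python
import numpy as np

def pair_penalty(P, i1, i2, delta):
    """Penalty of equation 3.2 for one similar word pair (i1, i2):
    delta * (cos(P[:, i1], P[:, i2]) - 1)^2.
    The term vanishes when the two word profiles point in the same direction."""
    v1, v2 = P[:, i1], P[:, i2]
    cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return delta * (cos - 1.0) ** 2
```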
3.5 Experiments
3.5.1 Experiment Setting
The experiment setting is almost the same as in the last chapter (more details can be found in section 2.4).
Task and data sets: All the models are evaluated on the short text similarity task with data sets STS12, STS13 and STS14. The STS12 training data set is used as the tuning set.
Baselines: (a) TF-IDF: a surface-word TF-IDF weighting schema in the original high-dimensional space, (b) LSA, (c) LDA using Collapsed Gibbs Sampling for inference [Griffiths and Steyvers, 2004], and (d) WMF.
Corpora: The co-occurrence corpora are definitions from two dictionaries, WordNet and Wiktionary, plus the Brown corpus.
3.5.2 Results
Table 3.1 lists the results at dimension K = 100 (the dimension of latent topics). To remove randomness, each reported number is the average over 10 runs. Based on the STS12 tuning set, we experiment with different values for the bigram weight, γ ∈ {0, 1, 2}, and likewise for the similar word pairs weight, δ ∈ {0, 10, 30, 50}. The performance on STS12 tuning and STS12 test, STS13 and STS14 is illustrated in Figures 3.4 and 3.5. The parameters of model 7 in Table 3.1 (γ = 2, δ = 50) are the values chosen based on tuning set performance.
Table 3.1 shows that WMF is already a very strong baseline: it outperforms TF-IDF, LSA and LDA by a large margin. Using corpus-based bigram semantics alone (model 5, WMF+B in Table 3.1) boosts the performance of WMF by +0.4% to +0.7% on the test sets, while using knowledge-based semantics alone (model 6, WMF+K) improves over the WMF results by at most +1.1% absolute (on STS13). That the performance gain from similar word pairs is larger than from bigrams is expected, since the former is a cleaner source of semantics created by human annotators.
[Figure 3.3 graphical models: word nodes w1, w2, w3 attached to document d1 and w4, w5 attached to d2; in (b) and (c), bigram nodes b1 … b4 attach to pairs of word nodes.]
(a) WMF. (b) WMF with bigrams. (c) WMF with bigrams and similar words (full model).
Figure 3.3: WMF+BK model (WMF + corpus-based [B]igram semantics + [K]nowledge-based similar word pairs semantics): a w/d/b node represents a word/document/bigram, respectively; the extra link in Figure 3.3c denotes that w2 and w4 constitute a similar word pair.
Models | Parameters | STS12 tune | STS12 test | STS13 | STS14
1. TF-IDF | - | 72.8 | 66.2 | 58.4 | 70.2
2. LSA | - | 16.1 | 23.0 | 24.9 | 27.5
3. LDA | α = 0.05, β = 0.05 | 73.5 | 67.1 | 72.5 | 63.6
4. WMF | - | 74.3 | 71.7 | 71.8 | 71.7
5. WMF+B | γ = 2, δ = 0 | 74.5 | 72.2 | 72.6 | 72.5
6. WMF+K | γ = 0, δ = 50 | 74.6 | 72.7 | 72.9 | 72.3
7. WMF+BK | γ = 2, δ = 50 | 74.8 | 73.1 | 73.0 | 72.8
8. WMF+syn | γ = 2, δ = 0 | 73.2 | 71.6 | 71.3 | 70.5

Table 3.1: Pearson's correlation (in percentage) on the four data sets. Latent dimension K = 100 for LSA/LDA/WMF/WMF+BK. For matrix factorization based models, the regularization factor λ is fixed at 20. Model 5 is WMF with bigram semantics alone; model 6 is WMF with similar word pairs alone; model 7 is the final model with both sources incorporated; model 8 (WMF+syn) uses bigrams extracted from dependency trees.
Combining them (model 7, WMF+BK) yields the best results, with an absolute increase of +0.7% to +1.4%, which suggests that the two sources of semantic evidence are not only useful but, more importantly, complementary to each other.
Observing the performance under different weight values in Figure 3.4 (corpus-based semantics weight γ) and Figure 3.5 (knowledge-based semantics weight δ), we conclude that bigrams and similar word pairs yield very promising results; the trends hold across parameter conditions with a consistent improvement.
Finally, we present the WMF+syn setting, where the bigrams are extracted from the dependency trees of the sentences. The performance is worse than the baseline WMF on all data sets. The reason might be that the dependency parser is not robust enough for our corpus: we use sense definitions, which do not have the same structure as the natural language in the news-genre data sets.
3.6 Summary and Discussion
Motivated by the importance of recognizing the correct topics of words in a short text context, we incorporate corpus-based (bigram) and knowledge-based (similar word pair) lexical semantics into our matrix factorization model. Our system yields significant unsupervised performance gains on short text similarity data sets over the existing strong WMF baseline.
This method bridges the gap between lexical semantics and short text similarity by applying lexical semantics techniques to the P matrix of word latent vectors. Yet there is still room to benefit from knowledge-based semantics in the current framework. One direction is similar to the idea introduced in [Yih et al., 2012; Chang et al., 2013], where the proposed model is aware of the relations between sense pairs (synonym, antonym, meronym...). Intuitively, different word relations should have different impacts on the word profile vectors; e.g., in [Yih et al., 2012] two words that are antonyms should have a −1 cosine similarity between their latent vectors. In the current model, however, all relations are simply abstracted as word neighbors without further differentiation (all word pairs should have a cosine similarity close to 1). We expect that explicitly modeling the sense relations would yield better word latent profiles that encode more linguistic intuition.
[Figure 3.4 plots: panels (a) STS12 train, (b) STS12 test, (c) STS13, (d) STS14, each showing WMF-B's Pearson's correlation % (y-axis, 70 to 75) against the bigram weight γ (x-axis, 0 to 2).]
Figure 3.4: Pearson's correlation percentage scores of WMF-B (with corpus-based [B]igram semantics alone) on each data set: the corpus-based semantics weight γ is chosen from {0, 1, 2}; the dimension K is 100; the missing word weight wm is fixed at 0.01; the regularization factor λ is fixed at 20.
[Figure 3.5 plots: panels (a) STS12 train, (b) STS12 test, (c) STS13, (d) STS14, each showing WMF-K's Pearson's correlation % (y-axis, 70 to 75) against the similar word pair weight δ (x-axis, 0 to 50).]
Figure 3.5: Pearson's correlation percentage scores of WMF-K (with [K]nowledge-based similar word pairs semantics alone) on each data set: the knowledge-based semantics weight δ is chosen from {0, 10, 30, 50}; the dimension K is 100; the missing word weight wm is fixed at 0.01; the regularization factor λ is fixed at 20.
Chapter 4
Binary Coding for Large Scale Similarity Computing
In the previous two chapters, we presented matrix factorization models that convert text data into low-dimensional real-valued vectors, and demonstrated that the models are very effective at predicting semantic similarity for short text pairs.
We now turn our attention to computing similarity scores in a massive data set, specifically Twitter data, where millions of tweets are posted each day. One obvious issue is that massive data leads to time-consuming cosine similarity computation. To overcome this problem, we focus on exploiting binary bit representations, rather than the real-valued representations of the previous two chapters, for textual semantic similarity computation. We introduce a new model that removes potentially redundant information, and produces better performance on the tweet retrieval task and the short text similarity task.
4.1 Introduction
Twitter is rapidly gaining worldwide popularity, with 500 million active users generating more than 340 million tweets daily1. Massive-scale tweet data is freely available on the Web and contains rich linguistic phenomena and valuable information, making it one of the most popular data
1http://en.wikipedia.org/wiki/Twitter
sources used by a variety of Natural Language Processing (NLP) applications. Successful examples
include first story detection [Petrovic et al., 2010], local event detection [Agarwal et al., 2012],
Twitter event discovery [Benson et al., 2011], extraction [Ritter et al., 2012] and summarization
[Chakrabarti and Punera, 2011], etc.
In these NLP applications, one of the core technical components is tweet similarity computation: searching for desired tweets with respect to some sample tweets. For example, in first story detection [Petrovic et al., 2010], the purpose is to find an incoming tweet that reports a novel event not revealed by previous tweets. This is done by measuring the cosine similarity between the incoming tweet and each previous tweet.
One obvious issue is that cosine similarity computation over Twitter data becomes very slow once the scale of the data grows drastically. In this chapter, we investigate the problem of computing similarity scores in a large-scale data set. We evaluate the similarity scores on the task of tweet retrieval, where a system searches for the most similar tweets given a query tweet.2 Specifically, we propose a binary coding approach to render computationally efficient tweet comparisons that should benefit practical NLP applications, especially in massive data scenarios. Using the proposed approach, each tweet is compressed into short-length binary bits (i.e., a compact binary code), so that tweet comparisons can be performed substantially faster by measuring Hamming distances between the generated compact codes. Crucially, Hamming distance computation only involves very cheap XOR and popcount operations instead of the floating-point operations needed by cosine similarity computation.
Since Twitter messages contain very few words, we can naturally apply the WMF model on Twitter data to obtain quality latent vectors, and then convert the real-valued vectors to a binarized version. Intuitively, the binary bits lose a lot of information compared to the real-valued vectors. Therefore, we focus on improving the WMF model to preserve as much information as possible in the binary strings for tweets, and to reduce any redundant information in the model.
Looking at the objective function, we find that the WMF model solely focuses on exhaustively encoding the local context, i.e., whether a word appears in a short text. One issue caused by this local approach is that it introduces some overlapping information, which is reflected in the associated projections (the P matrix in Figure 2.2). In order to remove the redundant information and meanwhile
2We also evaluate the model on the short text similarity task, using the real-valued latent vectors rather than binary bits.
Symbol | Definition
N | Number of tweets in the corpus.
M | Dimension of a tweet vector, i.e., the vocabulary size.
x_i | The sparse TF-IDF weighted vector corresponding to the i-th tweet in the corpus.
x̄_i | The vector subtracted by the mean µ of the tweet corpus: x̄_i = x_i − µ.
X, X̄ | The tweet corpus in matrix format, and the zero-centered tweet data.
K | The number of binary coding functions, i.e., the number of latent topics.
f_k | The k-th binary coding function.

Table 4.1: Symbols used in binary coding.
discover more distinct topics, we employ a gradient descent method to make the projection directions nearly orthogonal. We name the improved model Orthogonal Matrix Factorization (OrMF) [Guo et al., 2014].
In our experiments, we evaluate the quality of similarity/dissimilarity scores by searching for the most similar tweets given a query tweet. We use Twitter hashtags to create the gold (i.e., groundtruth) labels, where tweets with the same hashtag are considered semantically related, hence relevant. We collect a tweet data set consisting of 1.35 million tweets over 3 months, where each tweet has exactly one hashtag. The experimental results show that our proposed model OrMF significantly outperforms competing binary coding methods.
4.2 Background and Related Work
4.2.1 Preliminaries
We first introduce some notation used in this chapter to formulate our problem. Suppose that we are given a data set of N tweets and the size of the vocabulary is M. A tweet is represented by all the words it contains. We use x ∈ R^M to denote the sparse M-dimensional TF-IDF weighted vector corresponding to a tweet, where each word stands for a dimension. For ease of notation, we represent all N tweets in a matrix X = [x_1, x_2, · · · , x_N] ∈ R^{M×N}. For binary coding, we seek K binarization functions {f_k : R^M → {1, −1}}_{k=1}^{K} so that a tweet x_i is encoded into a K-bit
binary code (i.e., a string of K binary bits). Table 4.1 illustrates the symbols used in this chapter for
notation.
Hamming Ranking: In this chapter we evaluate the quality of binary codes in terms of Hamming ranking. Given a query tweet, all data items are ranked in ascending order of the Hamming distances between their binary codes and the query's binary code, where a Hamming distance is the number of bit positions in which two codes differ. Compared with cosine similarity, computing Hamming distance is substantially more efficient: fixed-length binary bits enable very cheap logic operations, whereas real-valued vectors require floating-point operations for cosine similarity computation. Since logic operations are much faster than floating-point operations, Hamming distance computation is typically significantly faster than cosine similarity computation.
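A minimal sketch of Hamming ranking, assuming each binary code is packed into a machine integer (the function names are illustrative):

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two binary codes stored as Python ints:
    XOR marks the differing bit positions, popcount counts them."""
    return bin(a ^ b).count("1")

def hamming_rank(query_code, codes):
    """Return item indices ranked by ascending Hamming distance to the query."""
    return sorted(range(len(codes)), key=lambda i: hamming(query_code, codes[i]))
```

On modern CPUs the popcount would be a single instruction per machine word, which is where the speedup over floating-point cosine similarity comes from.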
4.2.2 Binary Coding
Early explorations of binary coding focused on using random permutations or random projections to obtain binary coding functions (aka hash functions), such as Min-wise Hashing (MinHash) [Broder et al., 1998] and Locality-Sensitive Hashing (LSH) [Indyk and Motwani, 1998]. MinHash and LSH are generally considered data-independent approaches, as their coding functions are generated in a randomized fashion. In the context of Twitter, the simple LSH scheme proposed in [Charikar, 2002] is of particular interest. Charikar proved that the probability that the bits of two data points differ is proportional to the angle between them, and employed a random projection w ∈ R^M to construct a binary coding function:

$$f(x) = \mathrm{sgn}(w^{\top}x) = \begin{cases} 1, & \text{if } w^{\top}x > 0, \\ -1, & \text{otherwise.} \end{cases} \quad (4.1)$$
The currently held view is that data-dependent binary coding can lead to better performance. A data-dependent coding scheme typically includes two steps: 1) learning a series of binary coding functions from a small amount of training data; 2) applying the learned functions to larger-scale data to produce binary codes.
In the context of tweet data, Latent Semantic Analysis (LSA) [Deerwester et al., 1990] can directly be used for data-dependent binary coding. Let X̄ be the zero-centered data matrix, where each tweet vector x_i is subtracted by the mean vector µ, resulting in x̄_i = x_i − µ. LSA reduces the dimensionality of the data by performing singular value decomposition (SVD): X̄ = UΣV^⊤. The K coding functions are then constructed using the K left singular vectors u_1, u_2, · · · , u_K associated with the K largest singular values, that is, f_k(x) = sgn(u_k^⊤ x̄) = sgn(u_k^⊤(x − µ)), for k = 1, · · · , K.
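The LSA coding construction above might look like this in practice (a sketch; the helper name and the use of numpy's SVD are our assumptions):

```python
import numpy as np

def lsa_coding_functions(X, K):
    """Derive K binary coding functions from LSA: zero-center the tweet
    matrix, take the top-K left singular vectors U, and code a new tweet x
    as sgn(U^T (x - mu))."""
    mu = X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(X - mu, full_matrices=False)
    Uk = U[:, :K]                      # top-K left singular vectors
    def encode(x):
        return np.where(Uk.T @ (x - mu.ravel()) > 0, 1, -1)
    return encode
```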
Iterative Quantization (ITQ) [Gong and Lazebnik, 2011] is another popular unsupervised binary coding approach. ITQ attempts to find an orthogonal rotation matrix R ∈ R^{K×K} to minimize the squared quantization error ‖B − RV‖²_F, where B ∈ {1, −1}^{K×N} contains the binary codes of all data, V ∈ R^{K×N} contains the LSA-projected and zero-centered vectors, and ‖·‖_F denotes the Frobenius norm. After R is optimized, the binary codes are simply obtained by B = sgn(RV).
Much recent work learns nonlinear binary coding functions, including Spectral Hashing [Weiss et al., 2008], Anchor Graph Hashing [Liu et al., 2011a], Bilinear Hashing [Liu et al., 2012b], and Kernelized LSH [Kulis and Grauman, 2012]. Concurrently, supervised information defined among training samples has been incorporated into coding function learning, as in Minimal Loss Hashing [Norouzi and Fleet, 2011] and Kernel-Based Supervised Hashing [Liu et al., 2012a]. Our proposed method falls into the category of unsupervised, linear, data-dependent binary coding.
4.2.3 Applications in NLP
The NLP community has successfully applied LSH in several tasks such as first story detection [Petrovic et al., 2010] and paraphrase retrieval for relation extraction [Bhagat and Ravichandran, 2008]. This chapter shows that our proposed data-dependent binary coding approach is superior to data-independent LSH in terms of the quality of the generated binary codes.
Subercaze et al. [2013] proposed a binary coding approach to encode user profiles for recommendations. Compared to [Subercaze et al., 2013], in which a data unit is a whole user profile consisting of all of a user's Twitter posts, we tackle a more challenging problem, since our data units are extremely short: a single tweet.
4.3 The Proposed Approach
4.3.1 Binarized version of WMF
Our approach is based on the WMF model. Adapting WMF to binary coding is straightforward. Following LSA (section 4.2.2), we use the matrix P to linearly project tweets into low-dimensional vectors, and then apply the sign function. The k-th binarization function uses the k-th row of the P matrix (P_{k,·}) as follows:

$$f_k(x) = \mathrm{sgn}(P_{k,\cdot}\,\bar{x}) = \begin{cases} 1, & \text{if } P_{k,\cdot}\,\bar{x} > 0, \\ -1, & \text{otherwise.} \end{cases} \quad (4.2)$$

Note that we use the zero-centered version x̄, which is the original data vector x minus the mean of all tweets µ: x̄ = x − µ. The goal of using the zero-centered data X̄ is to obtain a balanced number of 1 bits and −1 bits in the data set.
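Equation 4.2 applied to a whole corpus can be sketched as follows (illustrative names; P is assumed to be the learned K × M projection matrix):

```python
import numpy as np

def wmf_binarize(P, X):
    """Binarize WMF latent representations per equation 4.2: each row P_k
    of the projection matrix P gives one bit, f_k(x) = sgn(P_k . xbar),
    where xbar is the tweet vector minus the corpus mean."""
    mu = X.mean(axis=1, keepdims=True)
    return np.where(P @ (X - mu) > 0, 1, -1)  # K x N matrix of +/-1 bits
```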
4.3.2 Removing Redundant Information
Transforming a real-valued vector to binary bits loses a lot of information. Therefore, in this section
we aim to preserve as much original information as possible, and reduce redundant information in
the model.
We now elaborate how to remove redundant information from the word semantics matrix P. Firstly, it is worth noting that there are two readings of the K × M matrix P, as in Figure 4.1. The columns of P (Figure 4.1a), denoted by P_{·,i}, may be viewed as the collection of K-dimensional latent profiles of words, which we observe frequently in the WMF model. On the other hand, the rows of P (Figure 4.1b) can be seen as projection vectors, denoted by P_{k,·}, which are analogous to the eigenvectors U obtained by LSA. The projection vector P_{k,·} is multiplied with a zero-centered data vector x̄ to generate one bit of the text's binary code: sgn(P_{k,·} x̄). In this section, we focus on the properties of the rows of P.
To compute the optimal P and Q, each column of P and Q is iteratively optimized to approximate the data, P_{·,i}^⊤ Q_{·,j} ≈ X_{ij}, as shown in lines 6-9 of Algorithm 2 (which is essentially equation 2.4). While this does a good job of preserving the existence/relevance of each word in a short text, it might encode repetitive information in the projection vectors P_{k,·} (the rows of P).
(a) Each column P_{·,i} represents a word profile. (b) Each row P_{k,·} is a projection vector.
Figure 4.1: Two views of the P matrix: K is the number of dimensions, and M is the number of distinct words. The first view, the columns of P, is frequently used in the WMF model (Algorithm 1). We now apply the second view, the rows of P as projections, to improve the WMF model.
Figure 4.2 illustrates the redundant information (noisiness) in the P matrix. With the local approach adopted in WMF, it is very likely to produce the topics in Figure 4.2a (which contain some redundant information): the first projection vector P_{1,·} may be 90% about the politics topic and 10% about the war topic, while the second projection vector P_{2,·} is 95% about war and 5% about food. An extreme case of such redundancy is presented in Figure 4.2b, where the second topic is exactly the same as the first.3
Ideally, we would like the dimensions to be uncorrelated, so that more distinct topics of the data could be captured, as in Figure 4.2c, where the first dimension is only about the politics topic and the second dimension is only about the war topic. We believe such a model is able to encode richer information by removing the repetitive information.
Inspired by LSA, one way to ensure uncorrelatedness is to force P to be orthogonal, i.e., PP^⊤ = I, which implies P_{j,·} P_{k,·}^⊤ = 0 for j ≠ k.
3This would not happen in a real-world setting; we use this example only to illustrate the noisiness.
(a) A noisy case: the politics topic (obama, congress, budget, government, war, army) contains some war words; the war topic (war, soldier, food, water, iraq, weapon) contains some food words.
(b) The extreme case: the two topics are exactly the same (both: obama, congress, budget, government, war, army).
(c) The perfect case: each topic only contains relevant words (politics: obama, congress, budget, government, election, policy; war: war, soldier, injure, peace, iraq, weapon).
Figure 4.2: Three examples illustrating the noisiness in the P matrix. In general, we would like to remove as much noise as possible.
Algorithm 2: OrMF
1  Procedure P = OrMF(X, W, λ, n_itr, α)
2      n_words, n_docs ← size(X)
3      randomly initialize P, Q
4      itr ← 1
5      while itr < n_itr do
6          for j ← 1 to n_docs do
7              Q_{·,j} = (P W^{(j)} P^⊤ + λI)^{-1} P W^{(j)} X_{·,j}
8          for i ← 1 to n_words do
9              P_{·,i} = (Q W^{(i)} Q^⊤ + λI)^{-1} Q W^{(i)} X_{i,·}^⊤
10         c ← mean(diag(P P^⊤))
11         P ← P − α(P P^⊤ − cI)P
12         itr ← itr + 1
4.3.3 Implementation of Orthogonal Projections
To produce nearly orthogonal projections in the current framework, we could add a regularizer β‖PP^⊤ − I‖² with weight β to the objective function of the WMF model (equation 2.4). In practice, however, this method does not lead to the convergence of P, mainly because any word profile P_{·,i} becomes dependent on all other word profiles after an iteration.
Therefore, we adopt a simpler method, gradient descent, in which P is updated by taking a small step in the direction of the negative gradient of ‖PP^⊤ − I‖². It should be noted that this term requires each projection P_{k,·} to be a unit vector, since it forces P_{k,·} P_{k,·}^⊤ = 1, which is infeasible when the nonzero values in X are large. Therefore, we multiply the matrix I by a coefficient c, calculated as the mean of the diagonal of PP^⊤ in the current iteration. The following two lines are added at the end of each iteration:

$$c \leftarrow \mathrm{mean}(\mathrm{diag}(PP^{\top})), \qquad P \leftarrow P - \alpha\,(PP^{\top} - cI)\,P. \quad (4.3)$$

Using the coefficient c, the magnitude of P is not affected. The step size α is fixed to 0.0001.
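The two update lines of equation 4.3 can be sketched and iterated as follows (the larger step size and iteration count used in the usage note below are chosen for a tiny example, not the thesis setting of α = 0.0001):

```python
import numpy as np

def orthogonalize_step(P, alpha=1e-4, n_steps=1):
    """Gradient steps of equation 4.3: nudge the rows of P toward
    orthogonality without shrinking their overall magnitude, via
    P <- P - alpha * (P P^T - c I) P with c = mean(diag(P P^T))."""
    K = P.shape[0]
    for _ in range(n_steps):
        G = P @ P.T
        c = np.mean(np.diag(G))
        P = P - alpha * (G - c * np.eye(K)) @ P
    return P
```

For example, starting from two correlated rows such as (1, 1, 0) and (1, 0, 1) and iterating with a larger step size drives the off-diagonal entries of PP^⊤ toward zero while the diagonal entries approach the common value c.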
This procedure is presented in Algorithm 2. We refer to this new model as Orthogonal Matrix
Factorization (OrMF).
4.4 Experiments on Twitter Data
4.4.1 Experiment Setting
Twitter data: We crawled English tweets spanning three months, from October 5th 2013 to January 5th 2014, using the Twitter API.4 We cleaned the data such that each hashtag appears at least 100 times in the corpus, and each word appears at least 10 times. The resulting collection consists of 1,350,159 tweets, 15 million word tokens, 30,608 unique words, and 3,214 unique hashtags.
One of the main reasons to use hashtags is that they enhance access to topically similar tweets [Efron, 2010]. In a large-scale data setting, it is impossible to manually identify relevant tweets for a query tweet. Therefore, we use Twitter hashtags to create groundtruth labels: tweets marked by the same hashtag as the query tweet are considered relevant. Accordingly, all hashtags are removed from the original data corpus in our experiments. We choose a subset of the most frequent hashtags to create groundtruth labels: we manually remove tags that are not topic-related (e.g., #truth, #lol) or are ambiguous; we also remove all tags referring to TV series (their relevant tweets can be trivially obtained by named entity matching). The resulting subset contains 18 hashtags.5
For each of the 18 hashtags, 100 tweets are randomly selected as queries (test data). The median number of relevant tweets per query is 5,621. The small proportion of relevant tweets makes the task relatively challenging: we need to identify 5,621 tweets (0.42% of the whole data set) out of 1.35 million.
200,000 tweets are randomly selected (not including the 1,800 queries) as training data for
the data dependent models (LSAH, ITQ, SH, WMF, OrMF) to learn binarization functions.6 The
functions are subsequently applied on all the 1.35 million tweets, including the 1,800 query tweets.
4. https://dev.twitter.com
5. The tweet data set and their associated list of hashtags will be available upon request.
6. Although we use the word "training", the hashtags are never seen by the models. The training data is used for the models to learn word co-occurrences and to construct the binary coding functions.
Evaluation metric: We evaluate a model by its search quality: given a tweet as query, we would like to rank the relevant tweets (the tweets sharing the same hashtag as the query tweet) as high as possible. Following previous work [Weiss et al., 2008; Liu et al., 2011a], we use mean precision among the top 1000 returned list (MP@1000) to measure the ranking quality. Let pre@k be the precision among the top k returned data; then MP@1000 is the average value of pre@1, pre@2, ..., pre@1000. MP rewards systems that rank relevant data in the top places, e.g., if the highest ranked tweet is relevant, then all the precision values (pre@1, pre@2, pre@3, ...) are increased. We also calculate precision and recall curves at varying values of the top k returned list.
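As a concrete sketch (our own helper, not code from the thesis), MP@n over a ranked 0/1 relevance list is:

```python
def mean_precision_at_n(ranked_relevance, n=1000):
    """Average of pre@1 .. pre@n, where pre@k is the fraction of
    relevant items among the top k of the ranked list."""
    hits, total = 0, 0.0
    for k, rel in enumerate(ranked_relevance[:n], start=1):
        hits += rel
        total += hits / k
    return total / n

# A relevant item at rank 1 lifts every subsequent pre@k, which is
# why MP rewards placing relevant tweets at the very top.
score = mean_precision_at_n([1, 0, 1, 0], n=4)
```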
Baselines: We evaluate the proposed unsupervised binary coding model OrMF against 5 other unsupervised methods: LSH, SH, LSA, ITQ, and WMF. All the binary coding functions except LSH are learned on the 200,000 tweet set. All the methods have the same form of binary coding function, sgn(P_·,k^T x); they differ only in the projection vector P_·,k. The retrieved tweets are ranked by their Hamming distance to the query, i.e., the number of differing bit positions between the binary codes of a tweet and the query.
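The encode-and-rank scheme can be sketched as follows (hypothetical helper names; we assume the K projection vectors are stored as the rows of P):

```python
import numpy as np

def binarize(X, P):
    """Encode row vectors X (n x V) into K-bit codes: bit k of a
    vector x is 1 iff the projection of x onto row k of P is positive."""
    return (X @ P.T) > 0                     # n x K boolean codes

def hamming_rank(query_code, codes):
    """Rank items by Hamming distance to the query code, i.e. by the
    number of differing bit positions."""
    dists = np.count_nonzero(codes != query_code, axis=1)
    return np.argsort(dists, kind="stable"), dists

codes = np.array([[1, 0, 1], [0, 0, 0], [1, 0, 0]], dtype=bool)
order, dists = hamming_rank(np.array([1, 0, 1], dtype=bool), codes)
```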
For ITQ and SH, we use the code provided by the authors. Note that the matrix XX^T is infeasible to compute as a dense matrix due to the large vocabulary; therefore we compute it with sparse matrix operations. For the two matrix factorization based methods (WMF, OrMF) we run 10 iterations. The regularizer λ in equation 2.3 is fixed at 20 as in our previous experiments [Guo and Diab, 2012b]. A small set of 500 tweets is selected from the training set as a tuning set to choose the missing word weight wm in the baseline WMF; its value is then fixed for OrMF. In fact, WMF/OrMF are very stable, consistently outperforming the baselines regardless of the value of wm, as shown later in Figure 4.5.
We also present the results of cosine similarity in the original word space (TF-IDF) as an upper bound for the binary coding methods. We implemented an efficient algorithm for TF-IDF, Algorithm 1 in [Petrovic et al., 2010]: it first normalizes each data point to a unit vector, then computes cosine similarity by traversing the tweets only once via an inverted word index.
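A loose, stdlib-only re-implementation of that single-pass scheme (our own function names, not the code of Petrovic et al.):

```python
import math
from collections import defaultdict

def rank_by_cosine(tweets, query):
    """tweets: list of {word: tfidf} dicts; query: a {word: tfidf} dict.
    Every vector is normalized to unit length, so accumulating products
    of weights through an inverted word index yields cosine similarity,
    touching each posting list only once."""
    def unit(v):
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        return {t: w / norm for t, w in v.items()}

    index = defaultdict(list)                # word -> [(tweet_id, weight)]
    for i, v in enumerate(tweets):
        for t, w in unit(v).items():
            index[t].append((i, w))

    scores = defaultdict(float)
    for t, qw in unit(query).items():        # only the query's words are visited
        for i, w in index[t]:
            scores[i] += qw * w              # dot product of unit vectors
    return sorted(scores.items(), key=lambda kv: -kv[1])

ranked = rank_by_cosine([{'a': 1.0, 'b': 1.0}, {'c': 1.0}], {'a': 1.0})
```

Tweets sharing no word with the query are never touched, which is what makes the exact TF-IDF baseline tractable at this scale.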
[Figure: three precision curves, panels (a) K = 64, (b) K = 96, (c) K = 128; x-axis: number of returned samples (0-1000), y-axis: precision (0.15-0.5); one curve per model: OrMF, WMF, ITQ, LSA, SH, LSH.]
Figure 4.3: Hamming ranking on tweet retrieval data set: precision curve under top 1000 returned list of all 6 binary coding models, with dimension K = {64, 96, 128}.
[Figure: three recall curves, panels (a) K = 64, (b) K = 96, (c) K = 128; x-axis: number of returned samples (0-100,000), y-axis: recall (0-0.35); one curve per model: OrMF, WMF, ITQ, LSA, SH, LSH.]
Figure 4.4: Hamming ranking on tweet retrieval data set: recall curve under top 100,000 returned list of all 6 binary coding models, with dimension K = {64, 96, 128}.
[Figure: three panels, (a) K = 64, (b) K = 96, (c) K = 128; x-axis: wm ∈ {0.05, 0.08, 0.1, 0.15, 0.2}, y-axis: MP@1000 (26-32); one curve each for OrMF and WMF.]
Figure 4.5: Impact of the missing word weight wm on the MP@1000 performance for OrMF and WMF models: wm is chosen from 0.05 to 0.2; regularization factor λ is fixed as 20.
Models Parameters K = 64 K = 96 K = 128
LSH – 19.21% 21.84% 23.75%
SH – 18.29% 19.32% 19.95%
LSA – 21.04% 22.07% 22.67%
ITQ – 20.8% 22.06% 22.86%
WMF wm = 0.1 26.64% 29.39% 30.38%
OrMF wm = 0.1 27.7% 30.48% 31.26%
TF-IDF – 33.68%
Table 4.2: Mean precision among top 1000 returned list (MP@1000) on the tweet retrieval data set.
TF-IDF is the only system that does not use binary encoding, and serves as the upper bound of the
task.
4.4.2 Results
Table 4.2 presents the ranking performance measured by MP@1000 (the mean precision among the top 1000 returned list). Figures 4.3 and 4.4 illustrate the corresponding precision and recall curves for the Hamming distance ranking. The number K of binary coding functions corresponds to the number of dimensions in the 5 data-dependent models LSA, SH, ITQ, WMF, OrMF. The missing word weight wm is fixed at 0.1, based on the tuning set, for the two weighted matrix factorization based models WMF and OrMF. Later, in Figure 4.5, we experiment with different values of wm.
As the number of bits increases, all binary coding models yield better results. This is understandable, since each binary bit records only a tiny amount of information about a tweet; with more bits, the codes can capture more semantic information.
SH has the worst MP@1000 performance. The reason might be that it is designed for vision data, where the data vectors are relatively dense. ITQ yields comparable results to LSA in terms of MP@1000, yet the recall curves in Figures 4.4b (K = 96) and 4.4c (K = 128) clearly show the superiority of ITQ over LSA.
WMF outperforms LSA by a large margin (around 5% to 7%) through properly modeling missing words, which was also observed in Chapters 2 and 3. Although WMF already reaches a very high MP@1000 performance level, OrMF still achieves around a 1% improvement over WMF, which can be attributed to the orthogonal projections that capture more distinct topics. The trend holds consistently across all conditions. The precision and recall curves in Figures 4.3 and 4.4 confirm the trend shown in Table 4.2 as well.
All the binary coding models yield worse performance than the TF-IDF baseline. This is expected, as the binary bits trade accuracy for efficiency: 128 bits significantly compress the data, losing a lot of nuanced information, whereas in the high dimensional word space 128 bits can only record two words (32 bits for two word indices and 32 bits for two TF-IDF values). We manually examined the ranking lists and found that the binary coding models produce many ties (128-bit codes allow only 129 possible Hamming distance values), whereas the TF-IDF baseline can correctly rank such cases by detecting the subtle differences signaled by the real-valued TF-IDF scores.
4.4.3 Analysis
We are interested in whether other values of missing word weight wm can generate good results –
in other words, whether the performance is robust to the parameter value. Accordingly, we present
the influence of wm on MP@1000 in Figure 4.5, where the missing word weight wm is chosen
from {0.05, 0.08, 0.1, 0.15, 0.2}. The figure indicates we can achieve even better MP@1000 around
33.2% when selecting the optimal wm = 0.05. In general, the curves for all the code length are
very smooth; the chosen value of wm does not have a negative impact, i.e., the gain from OrMF
over WMF is always positive.
4.5 Experiments on STS Data
We repeat the experiments on STS data for the OrMF model. As in the previous two chapters, OrMF is evaluated on the short text similarity task, on the data sets STS12, STS13 and STS14. Since there is no parameter in OrMF to tune, the STS12 training set can be treated as a test set. OrMF is trained on a corpus consisting of sense definitions from two dictionaries, WordNet and Wiktionary, and the Brown corpus.
The baselines are: (a) TF-IDF: a surface word based TF-IDF weighting schema in the origi-
nal high dimensional space, (b) LSA, (c) LDA that uses Collapsed Gibbs Sampling for inference
Models Parameters STS12 tune STS12 test STS13 STS14
1. TF-IDF - 72.8 66.2 58.4 70.2
2. LSA - 16.1 23.0 24.9 27.5
3. LDA α = 0.05, β = 0.05 73.5 67.1 72.5 63.6
4. WMF wm = 0.01, λ = 20 74.3 71.7 71.8 71.7
5. WMF+BK γ = 2, δ = 50 74.8 73.1 73.0 72.8
6. OrMF wm = 0.01, λ = 20 76.7 72.6 74.1 71.9
Table 4.3: Pearson’s correlation (in percentage) on the data sets. Latent dimension K = 100 for
LSA/LDA/WMF/OrMF. We use the real-valued vectors produced by OrMF for short text similarity
evaluation.
[Griffiths and Steyvers, 2004], (d) WMF, and (e) WMF+BK.
In these experiments, we are not evaluating the binary coding performance of OrMF. Therefore, the output of OrMF is the real-valued low-dimensional vector for a short text. The missing word weight wm and regularization factor λ are set to the optimal values for WMF: wm = 0.01, λ = 20.
4.5.1 Results
Table 4.3 summarizes the Pearson's correlation values for the six models. It is clear that OrMF is the strongest model among those using no additional knowledge, consistently yielding better scores than WMF on all 4 data sets.
It is interesting to observe that the improvement of OrMF over WMF on STS14 is the smallest (+0.2%), compared to an average improvement of +1.7% on the other 3 data sets. Recall that STS14 is the easiest data set in the sense that its pairs share many common surface words; this suggests that OrMF performs even better when the task is more challenging.
As usual, we also present the Pearson's correlation scores obtained by varying the number of dimensions K for OrMF/WMF/LDA on the four data sets in Figure 4.6, where K = {50, 100, 150, 200}. OrMF consistently outperforms WMF on the first three data sets by 1%–2%, with almost the same performance on STS14. This is notable given that OrMF does not use any additional features compared to WMF.
4.6 Summary and Discussion
In this chapter, we propose a novel unsupervised binary coding model that provides efficient similarity search in massive tweet data. The resulting model, Orthogonal Matrix Factorization (OrMF), improves an existing matrix factorization model by learning nearly orthogonal projection directions. We collect a data set whose groundtruth labels are created from Twitter hashtags. Our first experiment, conducted on this data set, shows significant performance gains of OrMF over the competing methods. We also evaluate on short text similarity tasks, where OrMF consistently outperforms WMF on all 4 short text similarity data sets.
To further enhance the accuracy of the tweet retrieval task, we can introduce supervised labels
to make hashtags visible to the models. Previous work on supervised hashing [Liu et al., 2012a]
already demonstrated significant improvement. In our task, we want to learn binary bits that are
similar among those tweets tagged by the same hashtag and hence triggered by the same event.
Another promising direction is to model the timestamp as a feature of the tweet, motivated by the observation that many tweets describing the same event are posted within a short period of time. We believe that two tweets with close timestamps should be more likely to be similar. Our preliminary
approach is to build a model for each time span, following the idea in [Blei and Lafferty, 2006], so
that tweets within the same timestamp are generated from the same time specific model.
[Figure: four panels, (a) STS12 train, (b) STS12 test, (c) STS13, (d) STS14; x-axis: K ∈ {50, 75, 100, 150, 200}, y-axis: Pearson's correlation % (60-78); one curve per model: OrMF, WMF, LDA.]
Figure 4.6: Pearson's correlation percentage scores of OrMF, WMF and LDA on each data set: the dimension K varies from 50 to 200; missing word weight wm is fixed as 0.01; regularization factor λ is fixed as 20.
Part II
Applications
CHAPTER 5. AUTOMATED PYRAMID METHOD FOR SUMMARIES 65
Chapter 5
Automated Pyramid Method for
Summaries
Short text similarity has a wide range of applications in NLP tasks. The first task studied in this thesis is automated pyramid evaluation for text summarization. Text summarization is the process of compressing a text document into a short summary that retains the most important points of the original document.
The pyramid method is an evaluation method for assessing the quality of summaries. Essentially, it assigns a summary a score that is high if the summary covers many of the facts in the original documents. The score is computed mainly by manually identifying text snippets in summaries that cover the key concepts in the source documents; hence, pyramid evaluation requires manual human annotation. In previous work, Harnly et al. [2005] proposed a dynamic programming approach to automatically compute pyramid scores relying on bag-of-words matching.
In this chapter, we propose to use the Weighted Matrix Factorization (WMF) model to determine whether the important facts are included in a summary. We believe the current surface word matching based approach can be improved by a latent semantic approach, as there are many different ways to express the same fact in natural language. Our experiments show that our approach identifies the facts covered in summaries with greater precision and recall, which leads to better correlation with human judgments of summary quality.
Index 105
label matter is what makes up all objects or substances
contributor 1 matter is what makes up all objects or substances
contributor 2 matter as the stuff that all objects and substances in the universe are made of
contributor 3 matter is identified as being present everywhere and in all substances
contributor 4 matter is all the objects and substances around us
weight 4
Table 5.1: An example of a summary content unit (SCU) created from five model summaries. The concept has four contributors, all expressing the same meaning with different wording; accordingly, this SCU has a weight of 4.
5.1 Introduction
The pyramid method [Nenkova and Passonneau, 2004] is an annotation and scoring procedure to
measure how much content is covered by a summary. It is designed in an attempt to address a key
problem in summarization – namely the fact that different humans choose different content when
writing summaries. It has been shown to yield reliable rankings of text summarization systems on
multiple summarization tasks.
The pyramid method consists of two phases of manual annotation: (1) identifying content units in the model summaries, which are written by humans and serve as gold standard summaries; (2) identifying which content units are included in a system summary, and accordingly assigning it a score. The procedure is illustrated in Figure 5.1.
The first annotation phase yields Summary Content Units (SCUs), sets of text segments that express the same basic content in the model summaries. Each SCU is weighted by the number of model summaries it occurs in; accordingly, more frequent SCUs have larger weights. Intuitively, an SCU that appears in all model summaries is a more important fact, hence the higher weight. After manual annotation, the set of SCUs extracted from the same set of model summaries is referred to as a pyramid.
Table 5.1 demonstrates an example of an SCU extracted from five model summaries. The elements of an SCU are its index, a label, its contributors, and its weight. In this example, (1) the index is 105. (2)
[Figure: flowchart from original documents to model summaries and student summaries; the first annotation of the pyramid method turns model summaries into pyramids; the second annotation identifies SCUs in student summaries.]
Figure 5.1: The pipeline of the pyramid method for evaluating student summaries: the first annotation creates pyramids from the model summaries; the second annotation finds the SCUs in target summaries. After the procedure, we can score a target summary based on how many SCUs it contains.
The label is a sentence describing the content unit, written by the annotators. (3) Each contributor is a text snippet in one distinct model summary that refers to the SCU. In this example, four out of five model summaries (hence 4 contributors) express the SCU Matter is what makes up
all objects or substances; therefore, (4) this SCU has a weight of 4. The weight of an SCU ranges from 1 to M, where M is the number of model summaries. In sum, the first phase of manual annotation in the pyramid method consists of identifying the contributors and writing the corresponding label.
The procedure of scoring a target summary basically consists of identifying those SCUs that are expressed in the summary. Because each summary uses paraphrases, with different words referring to the same concepts, identifying the SCUs again requires human effort; this is the second phase of manual annotation. As shown in Table 5.1, the contributors have lexical items in common (matter, objects, substances), but also many differences (stuff, present, around).
In this chapter, we aim to complete the second phase without human annotation, employing latent representations of the labels/contributors of SCUs to automatically identify SCUs in a system summary and accordingly score the system summary. This is an ideal application of the WMF model, which is good at capturing the semantic similarity between two text snippets expressed with different words. The general procedure is to run the WMF model on the SCU labels and contributors, as well as on the ngrams in the system summaries, and then score the system summaries. Generally, if the similarity between a label/contributor and an ngram exceeds a threshold, the summary is considered to potentially match the SCU.
We evaluate our automated pyramid method on an assessment task of student reading comprehension. Previous work on the automated pyramid method performed well at ranking systems over many document sets, but is not precise enough on a single document (a student summary of reading materials). For evaluation, we produced manual pyramid scores for 20 student summaries, which serve as gold standard scores. We tested three automated pyramid scoring procedures, and the one based on WMF correlates best with the manual pyramid scores. It also has the best precision and recall for matching SCUs in the student summaries.
5.2 Related Work
ROUGE [Lin and Hovy, 2003; Lin, 2004] is the most popular automated evaluation method for text summarization; it originates from the BLEU score in machine translation [Papineni et al., 2002]. ROUGE contains a set of metrics that compare the summary against references (model summaries) based on ngram matching. Because it relies on string matching, it performs better with large sets of model summaries. Compared to ROUGE, the pyramid method is more robust, as it requires as few as four model summaries.
Nenkova and Passonneau [2004] proposed the pyramid method. It is based on the idea that no single model summary is perfect, and hence assigns differential weights to content units based on their frequency across all the model summaries. Essentially, a pyramid is a weighted inventory of SCUs, created for each document (set) to be summarized. The weight attached to each SCU differentiates the importance of the content units, which yields more reliable and stable scores. The pyramid method has been shown to perform well at ranking summarization systems.
Harnly et al. [2005] proposed the first automated summary evaluation to replace the second manual annotation phase, making use of the labels and contributors of SCUs. They reduced the problem to similarity computation between summary ngrams and SCU labels/contributors based on unigram overlap. The automated method yielded higher correlation with the manual pyramid method than the ngram overlap based ROUGE systems. Our method is an extension of this framework, and the experimental results show the superiority of our method in terms of ranking summaries, as well as identifying the correct SCUs.
On the other hand, distributional similarity models have been applied to reading comprehension. Foltz et al. [2000] found that LSA [Deerwester et al., 1990] correlates well with reading comprehension. More recently, LSA has been combined with word matching to assess students' reading comprehension skills [Boonthum-Denecke et al., 2011]. The resulting tool, and similar assessment tools such as Coh-Metrix, assess aspects of the readability of texts, such as coherence, but do not assess students' comprehension through their writing [Graesser et al., 2004; Graesser et al., 2011].
5.3 A Scoring Approach based on Distributional Similarity
In this section, we first introduce how we create the student summary corpus, and what are the
criteria for a good automated scoring schema. Then we explain the details of our approach based
on dynamic programming that is designed to match the criteria.
5.3.1 A Student Summary Corpus
Pyramid scores of student summaries correlate well with a manual main ideas score developed for
an intervention study with community college freshmen who attended remedial classes [Perin et al.,
2013]. Twenty student (target) summaries by students who attended the same college and took the
same remedial course were selected from a larger set of 322 that summarized an elementary physics
text. All were native speakers of English, and scored within 5 points of the mean reading score
for the larger sample. For the intervention study, student summaries had been assigned a score to
represent how many main ideas from the source text were covered [Perin et al., 2013]. Interrater
reliability of the main ideas score, as given by the Pearson correlation coefficient, was 0.92.
We first collected model summaries written by proficient Masters of Education students. Then, Perin created a model pyramid from the model summaries, annotated the 20 target (student) summaries against this pyramid, and scored the results. There are several ways to score a target summary: (1) the raw score of a target summary is simply the sum of the weights of its identified SCUs; (2) pyramid scores normalize the raw score by the number of SCUs in the target summary (analogous to precision), (3) or by the average number of SCUs in the model summaries (analogous to recall); (4) in this chapter, we normalize raw scores as the average of the two previous normalizations (analogous to F-measure). The resulting pyramid scores have a high Pearson's correlation of 0.85 with the main idea score [Perin et al., 2013] that was manually and directly assigned to each student summary.
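The four scores can be sketched as follows (a simplified reading of the normalizations above; the function and argument names are ours):

```python
def pyramid_scores(matched_weights, n_target_scus, avg_model_scus):
    """raw: sum of matched SCU weights; p: raw normalized by the number
    of SCUs in the target summary (precision-like); r: raw normalized by
    the average SCU count of the model summaries (recall-like); f: the
    average of the two normalizations (F-measure-like)."""
    raw = float(sum(matched_weights))
    p = raw / n_target_scus
    r = raw / avg_model_scus
    return raw, p, r, (p + r) / 2

# A summary matching SCUs of weight 4, 2 and 1, with 3 annotated SCUs,
# against model summaries averaging 7 SCUs each:
raw, p, r, f = pyramid_scores([4, 2, 1], n_target_scus=3, avg_model_scus=7)
```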
5.3.2 Criteria for Automated Scoring of Student Summaries
To be pedagogically useful, an automated method for assigning pyramid scores to student summaries should meet the following two criteria: 1) reliably rank student summaries of a source text, i.e., preserve the ranking generated by the manual pyramid scores, and 2) identify the correct SCUs
in each student summary. A method should clearly do well on criterion 1, as we want to find the best summaries. Criterion 2 matters as well: since each weight partition contains more than one SCU and a score is a sum of SCU weights, it is possible to produce the correct numeric final score by matching incorrect SCUs that happen to have the correct weights. Compared to previous methods, our method meets the first criterion and has superior performance on the second.
5.3.3 A Dynamic Programming Approach
Harnly et al. [2005] observed that the assignment of SCUs to a target summary can be cast as a dynamic programming problem. The method presented there relied on unigram overlap to score the closeness of the match between each eligible substring in a target summary and each SCU in the pyramid, and returned the set of matches that yielded the highest score for the summary. It produced good rankings across summarization tasks, but assigned scores much lower than those assigned by humans. This is because the surface word matching is so strict that many SCUs worded differently from the summary string are not discovered by the algorithm. Therefore, in this section we extend the dynamic programming approach in two ways: we test two new semantic text similarities, a string comparison method and a distributional semantic method, and we present a general mechanism to set a threshold value for an arbitrary text similarity computation, below which a match between a summary substring and an SCU is not considered.
Unigram overlap ignores word order and cannot capture the latent semantic content of a string. To take word order into account, we use Ratcliff/Obershelp (R/O), which measures the overlap of common subsequences [Ratcliff and Metzener, 1988]. To take the underlying semantics into account, we use the cosine similarity of 100-dimensional latent vectors of the candidate strings (ngrams from the target summary) and of the textual components of the SCU (label and contributors). Because the algorithm maximizes the total sum over all SCUs, many false matches occur when there is no similarity threshold for counting a match. Therefore, we add a threshold to the algorithm, below which matches are not considered. Because each similarity metric has different properties and distributions, a single absolute threshold value is not comparable across metrics; we present a method to set comparable thresholds across metrics.
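For the R/O comparison, Python's standard library is enough as a stand-in: difflib.SequenceMatcher implements a close variant of Ratcliff/Obershelp (gestalt pattern matching), so a word-level similarity can be sketched as:

```python
from difflib import SequenceMatcher

def ro_similarity(a, b):
    """Ratcliff/Obershelp-style similarity over word sequences; long
    common subsequences score highly, so word order matters."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()

same = ro_similarity("matter makes up all objects",
                     "matter makes up all objects")
reordered = ro_similarity("makes matter objects up all",
                          "matter makes up all objects")
```

Identical word sequences score 1.0; reordering the same words lowers the score, which is exactly the property unigram overlap lacks.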
Latent Representation: To represent the latent semantics of SCUs and candidate substrings of
target summaries, we apply the weighted matrix factorization model (WMF) [Guo and Diab, 2012b].
Comparing summary substrings with SCUs is an ideal setting for WMF, since both kinds of text data are at the phrase level, and WMF is able to learn a robust latent representation for short texts using missing words, as introduced in the previous chapters.
A 100-dimensional latent vector representation is learned for every span of contiguous words within sentence bounds in a target summary, for all 20 summaries. The training data is selected to be domain independent, so that our model can be used for summaries across domains. Thus we prepare a corpus that is balanced across topics and genres, drawn from WordNet sense definitions, Wiktionary sense definitions, and the Brown corpus. It yields a co-occurrence matrix X of unique words by sentences of size 46,619 × 393,666, where Xij holds the TF-IDF value of word wi in sentence sj. Similarly, the contributors to and the label for an SCU are given a 100-dimensional latent vector representation. These representations are then used to compare candidates from a summary to SCUs in the pyramid.
Three Comparison Methods: An SCU consists of at least two text strings: the SCU label and its contributors. As in Harnly et al. [2005], we use three similarity comparisons scusim(ngram, SCU), where ngram is the target summary string. When the comparison parameter is set to min (max, or mean), the similarity of the ngram to each SCU contributor and to the label is computed in turn, and the minimum (maximum, or mean) similarity value is returned.
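A sketch of this wrapper (hypothetical helper; `sim` can be any of the three similarity functions):

```python
def scusim(ngram, scu_texts, sim, mode="max"):
    """Compare a candidate ngram against every textual component of an
    SCU (label plus contributors) and reduce with min, max or mean."""
    scores = [sim(ngram, text) for text in scu_texts]
    if mode == "min":
        return min(scores)
    if mode == "max":
        return max(scores)
    return sum(scores) / len(scores)

exact = lambda a, b: float(a == b)       # toy similarity for illustration
scu = ["matter is what makes up all objects", "matter is everywhere"]
```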
Similarity Thresholds: We define a threshold parameter for a candidate match between a summary substring and a pyramid SCU, based on the distribution of scores each similarity method gives to the target SCUs identified by the human annotator. Annotation of the target summaries yields 204 SCUs in total. The similarity score being a continuous random variable, the empirical sample of 204 scores is very sparse. Hence, we use a Gaussian kernel density estimator to provide a non-parametric estimate of the probability density of the scores assigned by each of the similarity methods to the manually identified SCUs. We then select five threshold values corresponding to those for which the inverse cumulative density function (icdf) equals 0.05, 0.10, 0.15, 0.20 and 0.25. Each threshold represents the probability that a manually identified SCU will be missed.
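A stdlib-only sketch of this threshold selection (a hand-rolled Gaussian KDE with a rule-of-thumb bandwidth; all names and the bandwidth choice are ours, and the thesis presumably used a standard estimator):

```python
import math
import random

def kde_icdf_thresholds(scores, probs=(0.05, 0.10, 0.15, 0.20, 0.25), n_grid=2000):
    """Fit a Gaussian kernel density estimate to the similarity scores of
    manually identified SCUs, then read thresholds off its inverse CDF:
    a threshold at icdf(p) misses a true SCU with probability p."""
    n = len(scores)
    mean = sum(scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    h = 1.06 * std * n ** -0.2               # Silverman's rule of thumb
    lo, hi = min(scores) - 3 * h, max(scores) + 3 * h
    grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    pdf = [sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in scores) for x in grid]
    total = sum(pdf)
    thresholds, acc, targets = [], 0.0, sorted(probs)
    for x, d in zip(grid, pdf):
        acc += d / total
        while targets and acc >= targets[0]:
            thresholds.append(x)
            targets.pop(0)
    return thresholds

random.seed(0)
sample = [random.gauss(0.5, 0.1) for _ in range(204)]   # stand-in for the 204 SCU scores
ts = kde_icdf_thresholds(sample)
```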
5.4 Experiments on Student Summaries
5.4.1 Experiment Setting
The three similarity computations (Uni, R/O, WMF), three methods of comparing against SCUs (max, min, mean), and five icdf thresholds yield 45 variants, as shown in Figure 5.2. Each variant was evaluated by comparing its unnormalized automated score, e.g., WMF, max, 0.64 (its 0.15 icdf), to the human gold standard scores, using each of the evaluation metrics described in the next subsection. To compute confidence intervals for the evaluation metrics for each variant, we use bootstrapping with 1000 samples [Efron and Tibshirani, 1986].
(3 similarities) × (3 comparisons) × (5 thresholds) = 45
(Uni, R/O, WMF) × (max, min, mean) × (0.05, ..., 0.25)
Figure 5.2: Notation used for the 45 variants of automated pyramid methods. The 5 thresholds correspond to inverse cumulative density function values.
To assess the 45 variants, we compare their automated pyramid scores to the manual scores. By
our criterion 1, an automated score that correlates well with manual scores for summaries of a given
text could be used to indicate how well students rank against other students. We report several types
of correlation tests. Pearson's coefficient tests the strength of a linear correlation between the two
sets of scores; it will be high if the same order is produced, with the same distances between pairs of
scores. The Spearman rank correlation is said to be preferable for ordinal comparisons, where the
absolute unit interval is less relevant. Kendall's tau, an alternative rank correlation, is less sensitive
to outliers and more intuitive: it is the proportion of concordant pairs (pairs in the same order) minus
the proportion of discordant pairs. Since correlations can be high even when differences are uniform,
we use Student's t-test to check whether the differences in score means are statistically significant.
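For concreteness, the three correlation statistics can be written as tie-free sketches in plain Python (a statistics package would normally be used, and ties handled properly):

```python
def pearson(x, y):
    """Linear correlation: covariance normalized by the standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(v):
    """Rank transform (1-based); this sketch does not handle ties."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    for rank, i in enumerate(order):
        r[i] = float(rank + 1)
    return r

def spearman(x, y):
    """Pearson correlation of the rank-transformed scores."""
    return pearson(ranks(x), ranks(y))

def kendall_tau(x, y):
    """Proportion of concordant pairs minus proportion of discordant pairs."""
    conc = disc = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
    return (conc - disc) / (n * (n - 1) / 2)
```

All three return 1 for identically ordered, equally spaced scores and -1 for fully reversed orders.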
We also evaluate at the SCU level, as another set of experiments to assess the 45 variants. In
Perin’s annotation, the correct SCUs mentioned in each student summary are manually identified.
According to criterion 2, the best variant would be able to retrieve the correct SCUs. Therefore, we
use precision, recall and F-score to measure the performance.
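Criterion 2 reduces to set-level precision and recall against the manual SCU annotation; a minimal sketch with a hypothetical `scu_prf` helper:

```python
def scu_prf(predicted, gold):
    """Precision, recall and F-score of the automatically selected SCUs
    against the manually identified ones, both given as sets of SCU ids."""
    tp = len(predicted & gold)  # correctly retrieved SCUs
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# toy example: 3 of 4 selected SCUs are correct, out of 5 gold SCUs
p, r, f = scu_prf({1, 2, 3, 5}, {2, 3, 4, 5, 6})
```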
Variant (with icdf) P (95% conf.), rank S (95% conf.), rank K (95% conf.), rank
WMF, max, 0.64(0.15) 0.93(0.92, 0.94), 1 0.94(0.93, 0.97), 1 0.88(0.85, 0.91), 1
R/O, mean, 0.23(0.15) 0.92(0.91, 0.93), 3 0.93(0.91, 0.95), 2 0.83(0.80, 0.86), 3
R/O, mean, 0.26(0.20) 0.92(0.90, 0.93), 4 0.92(0.90, 0.94), 4 0.80(0.78, 0.83), 5
WMF, max, 0.59(0.10) 0.91(0.89, 0.92), 8 0.93(0.91, 0.95), 3 0.83(0.80, 0.87), 2
WMF, min, 0.40(0.20) 0.92(0.90, 0.93), 2 0.87(0.84, 0.91), 11 0.74(0.69, 0.79), 11
Table 5.2: Five top performing variants out of 45 variants ranked by correlation scores, with confidence interval and rank (P=Pearson's, S=Spearman's, K=Kendall's tau)
5.4.2 Results
The correlation tests indicate that several variants achieve sufficiently high correlations between
scores for students' summaries and the manual gold scores (criterion 1). On all correlation tests, the
highest ranking automated method is WMF, max, 0.64; this similarity threshold corresponds to the
0.15 icdf. As shown in Table 5.2, it has the best Pearson correlation (0.93), Spearman's correlation
(0.94) and Kendall's tau (0.88). This can be attributed to WMF's ability to go beyond the surface
words and extract more accurate SCU matches. We also observe that R/O achieves better results
than Uni, thanks to R/O's capture of word order.
The differences between the unnormalized scores computed by the automated systems and the
scores assigned by human annotation are consistently positive. Inspection of the SCUs retrieved by
each automated variant reveals that the automated systems tend to identify false positives (to match
more SCUs even when the summary does not cover the SCU). This may result from the dynamic
programming implementation decision to maximize the score. To measure the degree of overlap
between the SCUs that were selected automatically versus manually, we computed recall and
precision for the various methods.
Table 5.3 shows the mean recall, precision (with standard deviations) and F-measure scores
across all five thresholds for each combination of similarity method and method of comparison to
the SCU. The low standard deviations show that recall and precision are relatively similar across
thresholds for each variant. The WMF methods (shown as LCv, for latent vectors, in Table 5.3)
outperform the R/O and unigram overlap methods, indicating that the use of distributional semantics
is a superior approach for pyramid summary scoring than
Variant recall (std) precision (std) F-measure
Uni, min 0.69(0.08) 0.35(0.02) 0.52
Uni, max 0.70(0.03) 0.35(0.04) 0.53
Uni, mean 0.69(0.02) 0.39(0.04) 0.54
R/O, min 0.69(0.08) 0.34(0.01) 0.51
R/O, max 0.72(0.03) 0.33(0.04) 0.52
R/O, mean 0.71(0.06) 0.38(0.02) 0.54
LCv, min 0.61(0.03) 0.38(0.04) 0.49
LCv, max 0.74(0.06) 0.48(0.01) 0.61
LCv, mean 0.75(0.06) 0.50(0.02) 0.62
Table 5.3: SCU selection results: averaged recall, precision and F-measure over the 20 student
summaries, for each combination of similarity method and method of comparison to the SCU (9
categories). The numbers in parentheses are the standard deviations for recall and precision.
methods based on string matching. It is worth noting that the high F-measure scores achieved by
WMF mainly come from precision, which confirms our hypothesis that the unigram and R/O methods
produce too many false positive SCU matches.
In Table 5.4, we also collect the SCU selection performance of the top five variants from Table 5.2.
Generally, the variants achieving better correlation scores for the summaries also perform well on
selecting the SCUs. The table also reveals an interesting observation about the best WMF and R/O
models: WMF beats R/O because it is able to find more SCUs, increasing recall while maintaining
precision.
5.5 Experiments on TAC 2011
We are also interested in the performance of our evaluation method on machine-generated summaries.
Therefore, we apply it to the data set of the traditional summarization task in the Text Analysis
Conference (TAC) 2011. TAC 2011 contains 44 topics. Each topic falls into one of 5 predefined
event categories and contains 10 related news documents. TAC recruited four writers to produce
model summaries for each topic.
Variant recall precision F-measure
WMF, max, 0.64(0.15) 0.78 0.51 0.61
R/O, mean, 0.23(0.15) 0.71 0.48 0.56
R/O, mean, 0.26(0.20) 0.70 0.50 0.57
WMF, max, 0.59(0.10) 0.82 0.51 0.62
WMF, min, 0.40(0.20) 0.54 0.49 0.51
Table 5.4: SCU selection results: averaged recall, precision and F-measure over the 20 student
summaries, for the top five variants in Table 5.2.
There are 50 team submissions in TAC 2011, for which TAC manually calculated the Pyramid
scores. To evaluate the performance of our automated pyramid method, we compute the correlation
between our pyramid scores and the gold standard manual scores.
We test the WMF, max variant with similarity threshold values of 0.59 and 0.64, the best two
variants shown in Table 5.2. The Pearson's correlation is 0.93 and 0.92 for the similarity thresholds
of 0.59 and 0.64, respectively. This demonstrates that the automated pyramid evaluation is also
reliable for differentiating the performance of different methods on machine-generated summaries.
5.6 Summary and Discussion
We extend a dynamic programming approach [Harnly et al., 2005] to automate pyramid scores more
accurately by applying our WMF model to phrase level data. Our contribution mainly results from
principled thresholds for similarity scores, and from extracting latent vector representations for the
short spans of text. We propose two criteria for a good automated pyramid method, and accordingly
design two experiments: evaluation at the summary level (the correlation with the final gold manual
pyramid scores) and at the SCU level (identifying the correct SCUs). We find that the latent
semantics based methods perform best on both criteria for a pedagogically useful automatic metric.
For future work, we are interested in applying our approach to text summarization systems
in an attempt to improve summarization quality. Since our approach is able to identify with
higher precision and recall whether a text snippet contains the key concepts, it could also be helpful
for choosing which n-grams to include in the summary. We hope that by incorporating our model,
the resulting fixed-length summary can convey maximum information from the source documents.
CHAPTER 6. UNSUPERVISED WORD SENSE DISAMBIGUATION 78
Chapter 6
Unsupervised Word Sense
Disambiguation
In this chapter, we study the impact of short text similarity on a lexical semantics task – word sense
disambiguation (WSD). WSD is the task of identifying which sense of a word is used in a given
context. Usually the sense inventory is obtained from a lexicon such as WordNet [Fellbaum, 1998].
In many unsupervised WSD systems, the most important component is a sense similarity measure
that returns a similarity score given two sense IDs. Previous work adopted very simple approaches
to compute the similarity score: most similarity measures use the taxonomy structure of WordNet,
such as jcn [Jiang and Conrath, 1997], while Extended Lesk (elesk) [Banerjee and Pedersen, 2003]
computes the number of overlapping words/phrases between the two sense definitions. The latter
has gained much wider popularity: since many other similarity measures rely on taxonomies, they
can only compute similarity between noun or verb pairs, while adjectives and adverbs do not have
a taxonomic representation structure in WordNet.
Because of the short nature of sense definitions, we believe that exploiting our WMF model can
yield more meaningful sense similarity scores given two sense definitions. We first apply the WMF
model to the sense definition data sets to get a low dimensional representation of the data, based on
which we construct a new sense similarity measure, wmfvec [Guo and Diab, 2012a]. We make some
crucial adjustments to the procedure of sense similarity computation, inspired by notable traits of
the Extended Lesk (elesk) measure. To the best of our knowledge, wmfvec is the first sense
[Figure 6.1 depicts a sense graph for the sentence "I am walking on the bank with my friend":
candidate sense nodes (walk by foot, escort; riverbank, financial bank; friend) are connected by
weighted edges (e.g., 1.2, 0.5, 0.7).]
Figure 6.1: Unsupervised graph-based word sense disambiguation system: several sense nodes
are created for each word; the weights on edges are similarity scores between the two senses; for
simplicity, the edges between the walk senses and the friend sense are not shown. The final
disambiguation decision for each word is the sense node that achieves the maximum indegree value.
similarity measure calculated on low dimensional representations of sense definitions. Extensive
WSD experiments performed on four standard benchmarks demonstrate that our proposed sense
similarity measure outperforms the baselines by a large margin.
6.1 Introduction
To date, many unsupervised WSD systems rely heavily on a sense similarity module that returns
a similarity score given two senses. For example, graph-based WSD systems [Mihalcea et
al., 2006; Guo and Diab, 2010] build a graph where nodes are senses of content words, and the
weight on an edge denotes the sense similarity score between the two senses (Figure 6.1).
Disambiguation is performed by choosing the sense node with the maximum indegree value (the
sum of the weights of the edges associated with the node), since such nodes are perceived to have
the maximum relatedness with the context words.
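A minimal sketch of this indegree scheme follows (`indegree_wsd` is a hypothetical name; `sim` can be any sense similarity measure, and senses of the same word are not connected):

```python
def indegree_wsd(words, senses, sim):
    """senses maps each word to its candidate sense ids. Every sense is
    scored by the summed similarity (indegree) to the senses of all other
    context words, and the highest-scoring sense per word is returned."""
    indegree = {}
    for w1 in words:
        for s1 in senses[w1]:
            indegree[(w1, s1)] = sum(
                sim(s1, s2)
                for w2 in words if w2 != w1   # no edges within one word
                for s2 in senses[w2])
    return {w: max(senses[w], key=lambda s: indegree[(w, s)]) for w in words}

# toy graph: the riverbank sense sits closer to the walking context
pair_sim = {frozenset(("riverbank", "walk_foot")): 1.0,
            frozenset(("financial", "walk_foot")): 0.2}
sim = lambda a, b: pair_sim.get(frozenset((a, b)), 0.0)
choice = indegree_wsd(["bank", "walk"],
                      {"bank": ["riverbank", "financial"],
                       "walk": ["walk_foot"]}, sim)
```

On this toy input, riverbank wins because its edge to the walking sense carries the larger weight, mirroring Figure 6.1.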
Because the sense similarity measure is the most crucial component in many unsupervised WSD
systems, much effort in the lexical semantics community has been devoted to developing useful
sense similarity measures based on knowledge base lexicons such as WordNet [Fellbaum, 1998].
For example, many similarity measures take advantage of the taxonomy structure of WordNet,
which is constructed from "is-a" relations. The sense similarity value is computed based on the
positions of the two senses and their least common subsumer in the noun/verb hierarchy. However,
this only allows noun-noun and verb-verb pair similarity computation, as the other parts of speech
(adjectives and adverbs) do not have a taxonomic representation structure.
The most popular sense similarity measure is the Extended Lesk (elesk) measure [Banerjee
and Pedersen, 2003]. In elesk, a similarity score is computed based on the length of overlapping
words/phrases between two extended dictionary definitions (hence it works for all part-of-speech
types). The definitions are extended by definitions of neighbor senses to discover more overlapping
words. However, exact word matching is lossy. Below are two definitions from WordNet:
• bank#n#1: a financial institution that accepts deposits and channels the money into lending
activities
• stock#n#1: the capital raised by a corporation through the issue of shares entitling holders
to an ownership interest (equity)
Despite the high semantic relatedness of the two senses, the only overlapping words in the two
definitions are a and the, yielding a very low sense similarity score.
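This lossiness is easy to verify directly on the two definitions above (a sketch using whitespace tokenization, with the parenthesized "(equity)" written without parentheses):

```python
bank_def = ("a financial institution that accepts deposits and channels "
            "the money into lending activities")
stock_def = ("the capital raised by a corporation through the issue of "
             "shares entitling holders to an ownership interest equity")

# bag-of-words overlap between the two sense definitions
overlap = set(bank_def.split()) & set(stock_def.split())
```

Only the two function words survive the intersection, so any overlap-based score is near zero despite the topical relatedness.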
Accordingly, we are interested in extracting latent semantics from sense definitions to improve on
elesk. However, the challenge lies in the fact that sense definitions are typically too short/sparse for
latent variable models to learn accurate semantics, since these models are designed for long
documents. For example, topic models such as LDA [Blei et al., 2003] can only find the dominant
topic (the finance topic in bank#n#1 and stock#n#1) without further discernibility. In this case,
many senses will share the same latent semantics profile as long as they are in the same topic/domain,
which results in a cosine similarity of 1, or a cosine similarity of 0 if their dominant topics differ.
To obtain quality latent vector representations for senses and enable meaningful textual similarity,
we apply the WMF model to the WordNet sense definitions. We then show how to use WordNet
neighbor sense definitions to construct a more nuanced sense similarity measure, wmfvec, relying
on the inferred latent semantic vectors of senses. The WordNet neighbor senses are induced by
the sense relations defined in WordNet. We show that wmfvec is superior to elesk and LDA based
approaches on four all-words WSD data sets. To the best of our knowledge, wmfvec is the first sense
similarity measure based on the latent semantics of sense definitions.
6.2 Related Work
Many systems have been proposed for the WSD task over the years. A thorough review of the state
of the art through the late 1990s is presented in [Ide and Veronis, 1998], and more recently in
[Navigli, 2009]. Several techniques have been used to address the problem, ranging from rule
based/knowledge based approaches to unsupervised and supervised machine learning techniques.
In this chapter, we focus on the unsupervised all-words task, where systems are required to disam-
biguate all the content words (nouns, adjectives, adverbs and verbs) in documents.
Sense similarity measures have been core components in many unsupervised WSD systems and
in lexical semantics research and applications. Among these sense similarity measures, elesk is the
most widely used; jcn is sometimes used to obtain the similarity of noun-noun pairs. McCarthy
et al. [2004] tested elesk and jcn for finding the predominant word sense, where elesk produced better
performance. Patwardhan et al. [2005] built a WSD system that integrated elesk similarity values
between target words and their neighbor words. Mihalcea [2005] constructed a graph where
nodes are senses of context words and edges are sense similarity values returned by elesk. Following
the graph framework of [Mihalcea, 2005], researchers [Sinha and Mihalcea, 2007; Guo and Diab,
2010] replaced elesk with jcn for noun-noun and verb-verb pairs and obtained better WSD results.
Sense similarity measures can be broken into three categories on the basis of the resources they
depend on: (1) WordNet relations, (2) the WordNet noun/verb taxonomy plus information content,
and (3) WordNet relations plus sense definitions.
lch [Leacock and Chodorow, 1998], wup [Wu and Palmer, 1994] and path [Pedersen et al., 2004]
are instances of the first group; e.g., path returns the inverse of the shortest path length between two
senses in the WordNet sense graph, where senses are connected by relations. However, all of them
simply use WordNet relations to create a graph connecting the senses, without further exploiting
any information about the senses.
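For instance, the path measure can be sketched as a breadth-first search over the relation graph (hypothetical `path_similarity` helper; adjacency lists stand in for the WordNet sense graph):

```python
from collections import deque

def path_similarity(graph, s1, s2):
    """Inverse of the shortest path length between two senses in an
    undirected sense graph given as adjacency lists (0.0 if unreachable)."""
    if s1 == s2:
        return 1.0
    seen = {s1}
    frontier = deque([(s1, 0)])
    while frontier:
        node, depth = frontier.popleft()
        for nb in graph.get(node, ()):
            if nb == s2:
                return 1.0 / (depth + 1)
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, depth + 1))
    return 0.0

# toy relation graph: a -- b -- c
g = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
```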
The second category, first proposed by Resnik [1995] in the similarity measure res and then
followed by lin [Lin, 1998] and jcn, combines information content with the noun/verb taxonomy
(the hypernym/hyponym relations defined in WordNet). Information content is the frequency
information of a sense. The ideal way to obtain information content is to use a sense-annotated
corpus, but such corpora are expensive to build. Hence the frequency information is estimated in
alternative ways.
in alternative ways. To illustrate how information content (IC) is incorporated into sense similarity
computation, we present the formula of jcn to calculate sense similarity:
sim(s1, s2) = IC(s1) + IC(s2)− 2× IC(LCS(s1, s2)) (6.1)
where LCS(s1, s2) is the least common subsumer of senses s1 and s2 in the taxonomy. It should be
noted that all three measures (res, lin and jcn) use the IC of the least common subsumer of senses
s1 and s2. The disadvantage is clear: they require a taxonomy, so only noun-noun and verb-verb
pair similarities can be computed.
Finally, elesk and Gloss Vector (glsvec) [Patwardhan and Pedersen, 2006] are sense-definition
based similarity measures. Where WordNet relations are not available, elesk and glsvec can still
return similarity values, while the sense similarity measures in the previous two categories fail to do
so. This feature enables them to be applied to other dictionaries, as long as sense definitions are
available. glsvec is similar to elesk except that it converts the definitions into high-dimensional word
space vector representations and returns the cosine similarity of the two sense vectors. Therefore
glsvec also does not extract the latent semantics of definitions.
Our similarity measure wmfvec exploits the same information (sense definitions and WordNet
relations) that elesk uses, and outperforms it significantly on four data sets. Therefore, we believe
wmfvec will be a useful contribution to the lexical semantics community. To the best of our
knowledge, we are the first to construct a sense similarity measure from the latent semantics of
sense definitions.
6.3 A New Sense Similarity Measure – wmfvec
We first run WMF on the WordNet sense definition data sets. Thus, a sense is represented by the
K-dimensional vector induced from its sense definition. A natural way to obtain sense similarity is
to calculate the cosine similarity of the two corresponding K-dimensional vectors. Inspired by
elesk, we make some crucial changes when constructing the sense similarity measure, as explained
in detail in the following.
After applying WMF to the WordNet sense definitions, we can further use the features of WordNet
to construct a better low dimensional representation for senses. The most important feature of
WordNet is that senses are connected to each other through relations such as hypernymy, meronymy,
holonymy, similar attributes, etc. In our experiments, we use all 28 relations defined in WordNet
3.0. We observe that neighbor senses are semantically similar in most cases; for example, air bag is
a meronym of car. Hence the semantics of the neighboring senses can be a good indicator of the
latent semantics of the target sense.
We use these WordNet neighbors in a manner similar to elesk. As shown in section 6.1, sense
definitions are fairly short and do not provide sufficient vocabulary to capture relatedness. To
address this issue, elesk augments a definition with the definitions of its neighbor senses, in order
to yield more overlapping words/phrases. Accordingly, in our method, a sense is represented by the
sum of its original latent vector and its neighbors' latent vectors. Let N(j) be the set of neighbor
senses of sense j; then the new latent vector becomes:
Qnew·,j = Q·,j + ∑k∈N(j) Q·,k (6.2)
It is also worth noting that the similarity score of elesk is not normalized by the length of the sense
definitions. This is understandable, since normalization would give an unfair advantage to short
definitions. Hence we adopt a similar idea: the inner product (instead of the cosine similarity) of the
two resulting low dimensional sense vectors is used to calculate the sense pair similarity. We refer
to our sense similarity measure as wmfvec.
6.4 Experiments
6.4.1 Experiment Setting
Task and data sets: We choose the fine-grained all-words sense disambiguation task for evaluation.
The data sets we use are the all-words tasks in SensEval2 [Palmer et al., 2001], SensEval3
[Snyder and Palmer, 2004], SemEval-2007 [Pradhan et al., 2007], and Semcor [Miller et al., 1993].
Statistics of the annotated senses in the four data sets are listed in Table 6.1. We tune the parameters
of all models based on their performance on SensEval2, and then directly apply the tuned models
to the other three data sets.
data set docs noun adj adv verb
SensEval2 3 1064 465 301 554
SensEval3 3 902 358 - 732
SemEval-2007 3 159 - - 296
Semcor 381 86994 31706 18947 88320
Table 6.1: The statistics of annotated senses in the four WSD data sets, as well as the distribution
per part-of-speech.
Data: The sense inventory is WordNet 3.0 for the four WSD data sets. WMF and LDA are built on
the corpus of sense definitions of two dictionaries: WordNet and Wiktionary.1 We do not link the
senses across dictionaries; hence Wiktionary is only used as additional data for the distributional
models to better learn word latent profiles. All data is tokenized, POS tagged with the Stanford POS
Tagger [Toutanova et al., 2003] and lemmatized,2 resulting in 341,557 sense definitions and
3,563,649 words.
WSD Algorithm: To perform WSD we need two components: (1) a sense similarity measure that
returns a similarity score given two senses, which will be the baselines in the next paragraph; (2)
a disambiguation algorithm that determines which senses to choose as final answers based on the
sense pair similarity scores. We choose the Indegree algorithm used in [Sinha and Mihalcea, 2007;
Guo and Diab, 2010] as our disambiguation algorithm.
Baselines: We compare with (1) elesk, the most widely used sense similarity measure, and (2)
glsvec, which is similar to elesk except that it converts the definitions into high-dimensional word
space vector representations and returns the cosine similarity of the vectors. We use the
implementation of elesk and glsvec in [Pedersen et al., 2004].
The third baseline is (3) ldavec, LDA using Gibbs sampling [Griffiths and Steyvers, 2004].
We calculate the latent vector of a sense definition by summing up P(z|w) of all constituent
words, weighted by Xij (more details can be found in section 2). (4) Finally, we compare wmfvec
1http://en.wiktionary.org/
2The lemmatization is conducted with the WordNet::QueryData package
with a sense similarity combination from a very mature WSD system, jcn+elesk, introduced in
[Sinha and Mihalcea, 2007], where the authors evaluated six sense similarity measures, selected the
best of them and combined them into one system. Specifically, in their implementation they use jcn
[Jiang and Conrath, 1997] for noun-noun and verb-verb pairs, and elesk for other pairs. jcn+elesk
with the Indegree algorithm [Sinha and Mihalcea, 2007] used to be the state-of-the-art system on
SensEval2 and SensEval3.
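The ldavec baseline's definition vector can be sketched as follows, under the assumption that Xij is the word's TF-IDF weight in the definition (all names here are illustrative):

```python
def ldavec(definition_words, p_z_given_w, tfidf):
    """Sum each constituent word's topic distribution P(z|w), weighted by
    the word's weight Xij in the definition (assumed to be TF-IDF)."""
    K = len(next(iter(p_z_given_w.values())))  # number of topics
    vec = [0.0] * K
    for w in definition_words:
        weight = tfidf.get(w, 0.0)
        for z, p in enumerate(p_z_given_w.get(w, [0.0] * K)):
            vec[z] += weight * p
    return vec

# toy topic posteriors over K = 2 topics and toy TF-IDF weights
p_z = {"bank": [0.9, 0.1], "money": [0.8, 0.2]}
weights = {"bank": 2.0, "money": 1.0}
vec = ldavec(["bank", "money"], p_z, weights)
```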
6.4.2 Results
The disambiguation results (K = 100) are summarized in Table 6.2. We also present in Figure
6.2 results using other values of the dimension K for wmfvec and ldavec. Very few words are not
covered (due to failure of lemmatization or POS tag mismatches), and therefore F-measure is
reported.
Based on SensEval2, wmfvec's parameters are tuned as λ = 20, wm = 0.01; ldavec's parameters
are tuned as α = 0.05, β = 0.05. We run WMF on WordNet+Wiktionary for 30 iterations,
and LDA for 2000 iterations. For LDA, a more robust P(w|z) is generated by averaging over the
last 10 sampling iterations. We also set a threshold on elesk similarity values, which yields better
performance: as in [Sinha and Mihalcea, 2007], values of elesk larger than 240 are set to 1, and
the rest are mapped to [0,1].
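The elesk normalization step is a simple capped linear map (sketch):

```python
def normalize_elesk(score, cap=240.0):
    """Map raw elesk scores to [0, 1]: values above the cap become 1,
    the rest are scaled linearly (as in Sinha and Mihalcea, 2007)."""
    return 1.0 if score > cap else score / cap
```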
glsvec vs. elesk: Both glsvec and elesk compute similarity based on surface word matching, yet
glsvec produces much worse WSD results than elesk. The reason may be that glsvec uses cosine
similarity: even when two definitions have many words in common, the similarity value is still
small if one definition is lengthy. Another disadvantage compared to elesk is that glsvec cannot
capture phrases.
elesk vs. wmfvec: wmfvec outperforms elesk consistently in all POS cases (noun, adjective, adverb
and verb) on the four data sets by a large margin (2.9%−4.5% in the total case). Observing the
results per POS, we find that a large improvement comes from nouns. The same trend has been
reported for other distributional methods based on word co-occurrence [Cai et al., 2007; Li et al., 2010;
Guo and Diab, 2011]. More interestingly, wmfvec also improves verb accuracy significantly, which
is notable since verbs are a harder POS to disambiguate.
Data Model Total Noun Adj Adv Verb
SensEval2 random 40.7 43.9 43.6 58.2 21.6
glsvec 49.1 51.8 55.7 66.4 28.5
elesk 56.0 63.5 63.9 62.1 30.8
ldavec 58.6 68.6 60.2 66.1 33.2
wmfvec 60.5 69.7 64.5 67.1 34.9
jcn+elesk 60.1 69.3 63.9 62.8 37.1
jcn+wmfvec 62.1 70.8 64.5 67.1 39.9
SensEval3 random 33.5 39.9 44.1 - 33.5
glsvec 39.8 45.6 54.0 - 24.7
elesk 52.3 58.5 57.7 - 41.4
ldavec 53.5 58.1 60.8 - 43.7
wmfvec 55.8 61.5 64.4 - 43.9
jcn+elesk 55.4 60.5 57.7 - 47.4
jcn+wmfvec 57.4 61.2 64.4 - 48.8
SemEval-2007 random 25.6 27.4 - - 24.6
glsvec 31.6 33.3 - - 30.7
elesk 42.2 47.2 - - 39.5
ldavec 43.7 49.7 - - 40.5
wmfvec 45.1 52.2 - - 41.2
jcn+elesk 44.5 52.8 - - 40.0
jcn+wmfvec 45.5 53.5 - - 41.2
Semcor random 35.26 40.13 50.02 58.90 20.08
glsvec 39.1 42.2 57.2 67.6 23.5
elesk 55.43 61.04 69.30 62.85 43.36
ldavec 58.17 63.15 70.08 67.97 46.91
wmfvec 59.10 64.64 71.44 67.05 47.52
jcn+elesk 61.61 69.61 69.30 62.85 50.72
jcn+wmfvec 63.05 70.64 71.45 67.05 51.72
Table 6.2: The WSD performance, measured by F-measure, of 7 models on each data set, as well as
the performance per part-of-speech. The models ldavec and wmfvec are trained with latent
dimension K = 100.
[Figure 6.2 contains four line plots, (a) SensEval2, (b) SensEval3, (c) SemEval-2007 and (d) Semcor,
showing F-measure (%) on the y-axis (45 to 60) against the latent dimension K (50 to 150) for
wmfvec and ldavec.]
Figure 6.2: The WSD performance, measured by F-measure, of ldavec and wmfvec on each data
set. The latent dimension K varies from 50 to 150.
ldavec vs. wmfvec: ldavec also performs very well, again proving the superiority of latent semantics
over surface word matching. However, wmfvec outperforms ldavec in every POS case (by at least
+1% in the total case) except Semcor adverbs. The results in Figure 6.2, where different dimensions
are used for ldavec and wmfvec, verify that the trend is consistent. These results confirm our
argument that, given the same text data, WMF outperforms LDA at modeling the latent semantics of
senses by exploiting missing words. Another interesting observation is that the number of
dimensions does not have a large impact on performance when K ≥ 100.
jcn+elesk vs. jcn+wmfvec: jcn+elesk is a very mature sense similarity combination that takes
advantage of the strong performance of jcn on noun-noun and verb-verb pairs. Although wmfvec
does much better than elesk, wmfvec alone is sometimes outperformed by jcn+elesk on nouns and
verbs. Therefore, to beat jcn+elesk, we replace the elesk in jcn+elesk with wmfvec (hence
jcn+wmfvec). Similar to [Sinha and Mihalcea, 2007], we normalize the similarity values of wmfvec
such that values greater than 400 are set to 1, and the rest are mapped to [0,1]. We choose the value
400 based on the WSD performance on the tuning set SensEval2. As expected, the resulting
jcn+wmfvec further improves on jcn+elesk in all cases. Moreover, jcn+wmfvec produces results
comparable to state-of-the-art unsupervised systems on SensEval2 (61.92% F-measure in [Guo and
Diab, 2010]) and SensEval3 (57.4% in [Agirre and Soroa, 2009]). This shows that wmfvec is robust:
it not only performs very well individually, but can also be easily combined with existing evidence
such as that represented by jcn.
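The combination scheme can be sketched as follows (hypothetical `combined_sim`; `jcn` and `wmfvec` are any callables returning raw scores):

```python
def combined_sim(s1, s2, pos1, pos2, jcn, wmfvec):
    """jcn for noun-noun and verb-verb pairs, normalized wmfvec otherwise;
    raw wmfvec values above 400 map to 1, the rest linearly to [0, 1]."""
    if pos1 == pos2 and pos1 in ("n", "v"):
        return jcn(s1, s2)
    w = wmfvec(s1, s2)
    return 1.0 if w > 400 else w / 400.0
```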
6.4.3 Analysis
We look closely into the WSD results to obtain an intuitive sense of what is and is not captured by
wmfvec. We mainly compare wmfvec with the surface word matching approach, elesk. The
different behaviors, and hence the different performance, of wmfvec and elesk are exhibited by the
sense similarity scores listed in Table 6.3. The first example involves the target word mouse in the
following context:
• ... in experiments with mice that a gene called p53 could transform normal cells into cancerous ones...
sense similarity target senses gene#n#1 cell#n#2
wmfvec animal mouse 27.00 16.57
computer mouse 7.14 0.01
elesk animal mouse 68 78
computer mouse 80 68
sense similarity target senses stop#v#1 chat#v#1
wmfvec church place 3.44 4.89
church service 10.53 6.26
elesk church place 48 17
church service 12 6
Table 6.3: The similarity values of wmfvec and elesk in two examples. The first example is the
target word mouse in a biology context that contains words gene, cell, etc. The second example is
the word church in a context that involves stop, chat.
elesk returns the wrong sense computer mouse, due to the lack of overlapping words between the
sense definition of animal mouse and the context words. However, wmfvec chooses the correct sense
animal mouse, by recognizing the biology dimension of the animal mouse sense and the related
context words gene, cell, cancerous.
We also perform some basic analysis of the items that wmfvec is not capable of capturing. A
negative example shows a deficiency of distributional similarity models:
• ... stop to chat at the church door...
Here church clearly refers to the meaning "a place for public (especially Christian) worship".
wmfvec chooses a similar sense, "a service conducted in a house of worship". wmfvec may not have
a specific latent dimension for the concept place, hence it cannot differentiate place from service.
In contrast, elesk can distinguish place from service via surface word matching. The exact sense
similarity scores in Table 6.3 support our hypothesis.
6.5 Summary and Discussion
We construct a sense similarity measure, wmfvec, based on the latent semantics of WordNet sense
definitions, by explicitly modeling missing words in the weighted matrix factorization framework.
To the best of our knowledge, we are the first to construct a sense similarity measure based on the
latent semantics of sense definitions. Experimental results on four fine-grained all-words WSD data
sets show that wmfvec significantly outperforms the previous definition-based similarity measures
elesk and glsvec, as well as LDA based vectors. Moreover, jcn+wmfvec produces results comparable
to state-of-the-art systems on the SensEval2 and SensEval3 data sets.
Although only WSD experiments are conducted in this chapter, our method is applicable to many
other sense related tasks. For example, it could be applied to first sense acquisition [McCarthy et
al., 2007], which aims to find the most frequent sense of a word. Given the word embeddings from
the WMF model and the sense embeddings from our method, the first sense could be chosen as the
one with the maximum cosine similarity to the target word. The wide applicability of our method
stems from the fact that we learn embeddings for senses, which is quite unique.
In future work, we look forward to further exploiting the features of WordNet to bring more
semantics into the sense representations and strengthen the quality of the sense vectors. For example,
the part-of-speech and super tags of the senses can enrich the syntactic information of senses,
whereas the current model only captures semantic relatedness. We believe this will result in a more
robust sense similarity measure, as the current framework has very little information about a sense
entry other than its definition. In addition, modeling another WordNet feature, antonymy, in wmfvec
is very challenging yet quite useful, since such a similarity measure, with sentiment polarity
incorporated, would be beneficial for many sentiment related tasks.
Chapter 7
Linking Tweets to News
In this chapter, we focus on applying our model to social media data. A common observation on Twitter data is that the short nature of tweets makes it very hard for NLP tools to understand the data. Sentiment analysis is one example: a bag-of-words SVM model performs much better on paragraph-level product reviews than on sentence-level reviews, as shown in [Wang and Manning, 2012; Li et al., 2012]. Therefore, we propose the Linking Tweets to News task, which aims to find the most relevant news article for a tweet if the tweet discusses a newsworthy event. We believe the news article serves as a much larger context for the tweet, so that NLP tools can better understand Twitter data; in the sentiment analysis case, for example, a significant amount of sentiment clues can be supplied by the news article.
A straightforward solution to the linking tweets to news task is: (1) first apply our WMF model to both the Twitter data and the news data; (2) for each tweet, choose the news article with the maximum similarity score to the tweet as the most relevant one, as shown in Figure 7.1.
However, this simple approach ignores a distinct characteristic of Twitter data: due to the length constraint, a tweet does not retain all the information of a news event; in most cases it covers only one aspect of the event. This can lead to inaccurate linking. To this end, we propose to search for the missing information in other tweets on the same topic as the target tweet, in an attempt to complete its full semantic picture. This is motivated by the observation that many tweets are triggered by the same event and thereby become dependent on each other.
To find relevant tweets for a target tweet, we mainly exploit three features: hashtags, named
[Figure 7.1 graphic: a tweet ("Pray for Mali...") with its latent vector, alongside candidate news titles ("French troops attack rebels in Mali", "With California Rebounding, Governor Pushes Big Projects", "Pakistani province in mourning after blasts kill scores"), each with its own latent vector]
Figure 7.1: The general framework for linking a tweet to its most relevant news article: first transform the textual data into the latent representation, then choose the article with the maximum cosine similarity score.
entities and timestamps. We extend the original WMF model and incorporate correlation between
short texts, such as the target tweet and relevant tweets [Guo et al., 2013]. Our experiments analyze
the impact of the three individual features, and demonstrate significant improvement of the new
model over the baselines.
7.1 Introduction
Recently there has been increasing interest in language understanding of Twitter messages. Some researchers [Speriosu et al., 2011; Brody and Diakopoulos, 2011] focused on sentiment analysis of Twitter feeds and opinion mining towards targets such as political issues or politicians [Tumasjan et al., 2010; Conover et al., 2011; Jiang et al., 2011]. Others [Ramage et al., 2010; Jin et al., 2011] summarized tweets using topic models. Although these NLP techniques are mature, their performance on tweets inevitably degrades, mainly due to the inherent sparsity in short
texts.1 In the case of sentiment analysis, many previous efforts have reported an accuracy drop from around 87% on a paragraph-level movie review dataset released in [Pang and Lee, 2004] to around 75% [Wang and Manning, 2012] on a sentence-level movie review dataset released in [Pang and Lee, 2005]. The problem worsens when existing NLP systems can hardly produce any results given such short texts. Consider the following tweet:
Pray for Mali...
As shown in [Benson et al., 2011; Ritter et al., 2012], a typical event extraction/discovery system [Ji
and Grishman, 2008] would likely be unable to discover the war event due to the lack of contextual
clues, and thus fails to shed light on the user’s focus/interests.
To enable the NLP tools to better understand Twitter feeds, we propose the task of linking a
tweet to a news article that is relevant to the tweet, thereby augmenting the context of the tweet. For
example, we want to supplement the implicit context of the above tweet with a news article such as one entitled:
State of emergency declared in Mali
To address the Linking-Tweets-to-News task, we face two main challenges: (1) Tweets are too short. In our Twitter data set, a tweet contains only 14 words on average. It is very hard to pinpoint the relevant news article based on so little information. (2) Tweets are incomplete, in the sense that usually only one aspect of the event is covered. In the Pray for Mali example, the tweet only contains the location Mali, while the event is about the French army's participation in the Mali war. In this scenario, we would like to find the missing dimensions of the tweet, such as French and war, from other complementary short texts, to complete the semantic picture of the Pray for Mali tweet.
For the first challenge, we can directly apply our WMF model to the tweets to generate low-dimensional representations, since WMF handles short text contexts very well by modeling missing words. After that, we compute cosine similarities and choose the most relevant news document according to the similarity values.
For the second issue, we extend the WMF model and incorporate the inter short text correlations
1Apart from the short context issue, tweets exhibit other irregularities of social media data, such as slang, disfluency,
ungrammaticality, informality [Eisenstein, 2013].
(relevance between two texts) into the dimension reduction model. We show that using a tweet-specific feature (hashtags) and a news-specific feature (named entities), as well as temporal constraints, we are able to extract relevant texts that may be complementary to the target tweet. We focus on explicitly integrating these text relevance relations into the matrix factorization framework; accordingly, the semantic picture of a tweet is completed by receiving semantics from its related tweets.
We created a data set of news and tweets, where the ground truth (the most relevant news article for a tweet) is automatically obtained by extracting the URL in the tweet. Our experiments show significant improvement of our new model over the baselines under three different evaluation metrics.
7.2 Related Work
We target a new task, linking a tweet to a news article, which is related to several existing natural language processing tasks. In the remainder of this section, we briefly introduce the related tasks and highlight the differences among them.
Modeling Tweets in a Latent Space: Ramage et al. [2010] leveraged hashtags to improve the latent representation of tweets in an LDA framework, Labeled-LDA [Ramage et al., 2009], treating each hashtag as a label. Jin et al. [2011] proposed an LDA-based model for Twitter data by incorporating the documents referred to by URLs in tweets. The semantics of the long documents were transferred to the topic
distribution of tweets. Evaluated on tweet clustering, the new model increased the purity score from
0.28 to 0.39.
News recommendation: A news recommender system [Claypool et al., 1999; Corso et al., 2005; Lee and Park, 2007] recommends news articles to a user based on features (e.g., keywords, tags, categories) of the documents the user likes, which form a training set. Our work resembles news recommendation in searching for a related news article. However, we aim at "recommending" news articles based only on a tweet, which is a much smaller context than the set of favorite documents chosen by a user.
Linking on Tweets: In tweet ranking [Duan et al., 2010], the availability of a URL is an important feature. However, one possible bottleneck preventing their approach from broader application is that
the number of tweets with an explicit URL is very limited. Similarly, Huang et al. [2012] proposed a graph-based framework to propagate tweet ranking scores, in which relevant web documents were found helpful for discovering informative tweets. Both works could take advantage of ours to either extract potential URL features or retrieve topically similar web documents.
Sankaranarayanan et al. [2009] aimed at capturing tweets that correspond to late breaking news. They adopted a simple approach: clustering tweets and choosing a URL-referred news article in those tweets as the related news for the whole cluster (the URLs are visible to their system). Compared to our work, their approach lacks variety, since the whole cluster of tweets is assigned the same news URL. The work presented in [Abel et al., 2011] is the most closely related to ours; however, their focus is user profiling, so they did not provide a paired tweet/news data set and had to conduct manual evaluation.
7.3 Searching Complementary Texts via Twitter/News Features
WMF exploits the text-to-word information in a very nuanced way, whereas the dependency between texts is ignored (Figure 7.2a). However, in the social media context, many tweets and news articles are in fact dependent on or complementary to each other, as they are triggered by the same event. In this section, we introduce how to extract similar tweets to find the missing elements of a given tweet. We exploit three features: hashtags, named entities and timestamps. These features help induce better latent representations for tweets/news.
7.3.1 Hashtags and Named Entities
Hashtags highlight the topics of a tweet, e.g., The #flu season has started. We believe two tweets sharing the same hashtag should be related, hence we place a link between the two tweet nodes to explicitly inform the model that they should be similar (Figure 7.2b).
We find that only 8,701 of the 34,888 tweets in our collected data set include hashtags. In fact, we observe that many hashtag words are mentioned in tweets without explicitly being tagged with #. Hence, we adopt a simple but effective approach to overcome the hashtag sparseness issue: we collect all the hashtags in the dataset, and automatically hashtag any word in a tweet if that word appears hashtagged in any other tweet. After the automatic hashtag discovery, we start extracting
relevant tweets: for each tweet, and for each hashtag it contains, we extract k tweets that contain
this hashtag, assuming they are complementary to the target tweet, and place a link between the k
tweets and the target tweet, as in Figure 7.2b.2
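The hashtag auto-tagging and link extraction steps can be sketched as below; the tweet tuple format and all names are illustrative assumptions, not the exact implementation:

```python
import re
from collections import defaultdict

def build_hashtag_links(tweets, k=4):
    """tweets: list of (tweet_id, text, timestamp) tuples (hypothetical format).
    Returns a set of undirected (id1, id2) links between tweets sharing a hashtag."""
    # Step 1: collect every explicit hashtag in the collection.
    known = {m.lower() for _, text, _ in tweets
             for m in re.findall(r"#(\w+)", text)}
    # Step 2: auto-tag -- a known hashtag word counts even without '#'.
    by_tag = defaultdict(list)
    for tid, text, ts in tweets:
        words = {w.lower() for w in re.findall(r"\w+", text)}
        for tag in words & known:
            by_tag[tag].append((ts, tid))
    # Step 3: for each tweet and tag, link to the k chronologically
    # closest tweets carrying the same tag (cf. footnote 2).
    links = set()
    for tag, entries in by_tag.items():
        for ts, tid in entries:
            others = sorted((e for e in entries if e[1] != tid),
                            key=lambda e: abs(e[0] - ts))[:k]
            links.update((min(tid, o[1]), max(tid, o[1])) for o in others)
    return links
```

Storing links as sorted id pairs deduplicates the symmetric links that arise when two tweets each select the other as a neighbor.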
Named entities are among the most salient features in event-based text data. Directly applying Named Entity Recognition (NER) tools to news titles or tweets results in many errors [Liu et al., 2011b], due to the noisy nature of the data, such as slang in tweets and inconsistent capitalization in news titles. Accordingly, we first apply the NER tool to news summaries, then label named entities in the tweets in the same manner as the hashtags: if a string in the tweet matches a named entity from the summaries, the string is labeled as a named entity in the tweet.3 To create the similar tweet set, we find k tweets that also contain the named entity.
7.3.2 Temporal Relations
Intuitively, tweets published in the same time interval have a larger chance of being on the same topic than those that are not chronologically close [Wang and McCallum, 2006]. However, we cannot simply assume any two tweets are similar based only on the timestamp. Therefore, for each tweet we link it to the k most similar tweets whose publication time is within 24 hours of the target tweet's timestamp. To find the most similar ones, we use the latent representations returned by the WMF model to measure the similarity of two tweets.
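The temporal linking step can be sketched as follows, under the assumptions that timestamps are unix seconds and the WMF vectors are already available (the names are ours):

```python
import numpy as np

DAY = 24 * 3600  # 24 hours in seconds

def temporal_links(ids, times, vecs, k=4):
    """Link each tweet to its k most similar tweets (by cosine over the WMF
    latent vectors) among those published within 24 hours of it.

    ids   : list of tweet ids
    times : parallel list of unix timestamps
    vecs  : (n, K) array of WMF latent vectors, row-aligned with ids
    """
    norm = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    links = set()
    for i in range(len(ids)):
        # Candidates inside the 24-hour window, excluding the tweet itself.
        cand = [j for j in range(len(ids))
                if j != i and abs(times[j] - times[i]) <= DAY]
        cand.sort(key=lambda j: -float(norm[i] @ norm[j]))
        for j in cand[:k]:
            links.add((min(ids[i], ids[j]), max(ids[i], ids[j])))
    return links
```

The time window first restricts the candidate set, and only then is latent-space similarity used to rank candidates, matching the two-stage filtering described above.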
7.3.3 Authorship
We also experimented with other features such as authorship, and found that it does not contribute positively to this problem. While authorship information helps in news/tweet recommendation for a user [Corso et al., 2005; Yan et al., 2012], it is too general for this task, where we aim at "recommending" a news article for a single tweet. The results of using the author subgraph can be found in Table 7.2.
2If there are more than k tweets found, we choose the top k ones whose publishing timestamps are most chronologi-
cally close to that of the target tweet.
3Note that there are some false positive named entities detected such as apple. We plan to address removing noisy
named entities and hashtags in our future work.
7.3.4 Creating Relations on News
We can also extract the three subgraphs (based on hashtags, named entities and time) for news articles. However, automatically tagging hashtags or named entities leads to much worse performance (around 93% ATOP, a 3% decrease from the baseline WMF). This is because a news article is long enough to contain many hashtag words and named entities, some of which are not very relevant to the theme of the event, resulting in noisy matching. Therefore we only extract temporal relations for news articles.
7.4 WMF on Graphs
We now focus on incorporating the links generated in the previous section into the WMF model.
If two texts are connected by a link, they should be semantically similar, i.e., share a similar latent profile. In the matrix factorization framework, we would like the latent vectors of two linked text nodes Q·,j1, Q·,j2 to be as similar as possible, i.e., their cosine similarity should be close to 1. To implement this, we add a regularization term to the objective function of WMF (equation 2.3) for each linked pair Q·,j1, Q·,j2 in Figure 7.2b:
\[
\delta \cdot \left( \frac{Q_{\cdot,j_1} \cdot Q_{\cdot,j_2}}{|Q_{\cdot,j_1}|\,|Q_{\cdot,j_2}|} - 1 \right)^2 \tag{7.1}
\]
where |Q·,j| denotes the length of the vector Q·,j. The coefficient δ controls the importance of the text-to-text links: a larger δ puts more weight on the text-to-text links and less on the text-to-word links. We refer to this model as WMF-G (WMF on graphs); its graphical model is illustrated in Figure 7.2b.
Alternating Least Squares [Srebro and Jaakkola, 2003] is used for inference in weighted matrix factorization. However, alternating least squares is no longer directly applicable with the new regularization term (equation 7.1), which involves the lengths of the text vectors |Q·,j| and hence is not in quadratic form. We therefore approximate the objective function by treating the vector lengths |Q·,j| as fixed values
[Figure 7.2 graphic: (a) Applying WMF on tweets and news data sets: tweet nodes t1–t3 and news nodes n1–n2 connected only to their word nodes w1–w8. (b) The WMF-G model: the same nodes, with additional text-to-text edges, e.g., a #healthcare hashtag edge, an Obama named entity edge, and a temporal edge]
Figure 7.2: The tweet nodes t and news nodes n are connected by hashtags, named entities or
temporal edges. For simplicity, the missing tokens are not shown in the figure. All the grey nodes
are observed information, such as TF-IDF values, while white nodes are latent vectors to be inferred.
during the alternating least squares iterations:
\[
\begin{aligned}
P_{\cdot,i} &= \left( Q W^{(i)} Q^{\top} + \lambda I \right)^{-1} Q W^{(i)} X_{i,\cdot}^{\top} \\
Q_{\cdot,j} &= \left( P W^{(j)} P^{\top} + \lambda I + \delta L_j^2\, Q_{\cdot,n(j)}\, \mathrm{diag}\!\left(L_{n(j)}\right)^2 Q_{\cdot,n(j)}^{\top} \right)^{-1}
\left( P W^{(j)} X_{\cdot,j} + \delta L_j\, Q_{\cdot,n(j)} L_{n(j)} \right)
\end{aligned} \tag{7.2}
\]
We define n(j) as the set of linked neighbors of short text j, and Q·,n(j) as the latent vectors of j's neighbors. The reciprocals of the lengths of these vectors in the current iteration are stored in L_n(j); similarly, the reciprocal of the length of the text vector Q·,j is L_j. W(i) = diag(W_i,·) is a diagonal matrix containing the ith row of the weight matrix W, and W(j) = diag(W·,j) is defined analogously from the jth column.
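The approximate ALS updates above can be sketched with dense NumPy operations as follows. This is an illustrative implementation of equation 7.2, not the exact thesis code: it loops over rows and columns explicitly, whereas a practical implementation would exploit the structure of W (a constant small weight for all missing words) for efficiency.

```python
import numpy as np

def als_sweep(X, W, P, Q, neighbors, lam=20.0, delta=3.0):
    """One alternating-least-squares sweep for WMF-G (illustrating eq. 7.2).

    X : (M, N) TF-IDF matrix (words x texts)
    W : (M, N) weight matrix (small constant for missing words)
    P : (K, M) word latent vectors,  Q : (K, N) text latent vectors
    neighbors : list of lists; neighbors[j] holds indices of texts linked to j
    Vector lengths |Q.,j| are treated as constants from the previous sweep.
    """
    K = P.shape[0]
    I = np.eye(K)
    # Word vector updates P.,i (the link regularizer does not involve P).
    for i in range(X.shape[0]):
        Wi = np.diag(W[i, :])
        P[:, i] = np.linalg.solve(Q @ Wi @ Q.T + lam * I, Q @ Wi @ X[i, :])
    # Reciprocal vector lengths L_j, held fixed during the Q updates.
    L = 1.0 / np.linalg.norm(Q, axis=0)
    for j in range(X.shape[1]):
        Wj = np.diag(W[:, j])
        A = P @ Wj @ P.T + lam * I
        b = P @ Wj @ X[:, j]
        n = neighbors[j]
        if n:
            Qn = Q[:, n]               # (K, |n|) neighbor latent vectors
            Ln = L[n]                  # their reciprocal lengths
            A += delta * L[j] ** 2 * (Qn * Ln ** 2) @ Qn.T
            b += delta * L[j] * Qn @ Ln
        Q[:, j] = np.linalg.solve(A, b)
    return P, Q
```

The column-scaling trick `(Qn * Ln ** 2) @ Qn.T` computes Q·,n(j) diag(L_n(j))² Q·,n(j)ᵀ without materializing the diagonal matrix.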
7.5 Experiments
7.5.1 Experiment Setting
Task and Data: Given the text of a tweet, a system aims to find the most relevant news article. For gold standard annotation, we harvest all tweets that have a single URL link to a CNN or NYTIMES news article, dated from the 11th to the 27th of January, 2013. In evaluation, we consider this URL-referred news article as the gold standard, i.e., the most relevant document for the tweet. We remove the URL from the text of the tweet so that URLs are invisible to the algorithms. We also collect all news articles from both the CNN and NYTIMES RSS feeds during the same timeframe. Each tweet entry has a published time, author, text and URL; each news entry contains a published time, title, news summary and URL. The tweet/news pairs are extracted by matching URLs. We manually filter "trivial" tweets whose content is simply the news title or the news summary. The final data set has 34,888 tweets and 12,704 news articles.
For our task evaluation, ideally we would like the system to identify exactly the news article referred to by the URL within each tweet in the gold standard. However, this is very difficult given the large number of potential news article candidates, especially news documents with slight variations. Therefore, the systems are measured by the ranking performance of the URL-referred news document.
We use three metrics for evaluating the ranking of the correct news article:
• Area under the top-k recall curve (ATOP), as used in the concept definition retrieval task in
[Guo and Diab, 2012b]. Basically, it is the normalized ranking ∈ [0, 1] of the correct news article among all candidate articles: ATOP = 1 means the URL-referred news article has the highest similarity value with the tweet among all news candidates; ATOP = 0.95 means its similarity value is larger than that of 95% of the candidates, i.e., it is within the top 5% of the candidates. ATOP is calculated as follows:
\[
\mathrm{ATOP} = \int_{0}^{1} \mathrm{TOPK}(k)\, dk \tag{7.3}
\]
where TOPK(k) = 1 if the URL-referred news article is in the "top k" list, and TOPK(k) = 0 otherwise. Here k ∈ [0, 1] is the relative position (when k = 1, the "top k" list contains all the candidates).
• Reciprocal Rank (RR), which is the reciprocal of the rank of the correct news article, e.g.,
RR = 1/3 if the correct news article is ranked at the 3rd highest place in the returned list.
• Top 10 recall rate (TOP10), e.g., TOP10 = 1 if the correct news article is among the top 10 of the returned list, otherwise TOP10 = 0.
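For a single tweet, the three metrics can be computed from the similarity scores as sketched below; the tie-breaking rule (a tied candidate counts as outranking the correct article) is our assumption:

```python
def ranking_metrics(correct_score, candidate_scores):
    """ATOP, RR and TOP10 for one tweet.

    correct_score    : similarity of the URL-referred article with the tweet
    candidate_scores : similarities of the other candidate articles
    """
    # ATOP: fraction of candidates the correct article outranks.
    beaten = sum(s < correct_score for s in candidate_scores)
    atop = beaten / len(candidate_scores)
    # Rank of the correct article (ties counted pessimistically).
    rank = 1 + sum(s >= correct_score for s in candidate_scores)
    rr = 1.0 / rank
    top10 = 1.0 if rank <= 10 else 0.0
    return atop, rr, top10
```

The reported numbers in Table 7.1 would then be averages of these per-tweet scores over the whole test set.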
Similar to [Guo and Diab, 2012b], for each tweet we collected as the candidate set the 1,000 news articles published prior to the tweet whose dates of publication are closest to that of the tweet. The cosine similarity score between the URL-referred news article and the tweet is compared against the scores of these 1,000 news articles to calculate the three metric scores. 10% of the gold standard tweet/news pairs are used as a development set, on which all model parameters are tuned.
Corpora: We use the same corpora as in [Guo and Diab, 2012b]: the Brown corpus (each sentence treated as a document) and the sense definitions of Wiktionary and WordNet [Fellbaum, 1998]. The tweets and news articles are included in the corpus as well, yielding a total of 441,258 short texts and 5,149,122 tokens.
Baseline: We present 4 baselines: 1. An Information Retrieval model (IR), which simply treats a tweet as a document and performs traditional surface word matching. 2. LDA-θ with Gibbs sampling as the inference method; we use the inferred topic distribution θ as the latent vector representing the tweet/news. 3. LDA-wvec, where the latent vector is the average of the word latent vectors P(z|w)
Models     Parameters          ATOP                  TOP10                 RR
                               dev       test        dev       test       dev       test
IR         -                   90.795%   90.743%     73.478%   74.103%    46.024%   46.281%
LDA-θ      α = 0.05, β = 0.05  81.368%   81.251%     32.328%   31.207%    13.134%   12.469%
LDA-wvec   α = 0.05, β = 0.05  94.148%   94.196%     53.500%   53.952%    28.743%   27.904%
WMF        -                   95.964%   96.092%     75.327%   76.411%    45.310%   46.270%
WMF-G      k = 3, δ = 3        96.450%   96.543%     76.485%   77.479%    47.516%   48.665%
WMF-G      k = 5, δ = 3        96.613%   96.701%     76.029%   77.176%    47.197%   48.189%
WMF-G      k = 4, δ = 3        96.510%   96.610%     77.782%   77.782%    47.917%   48.997%
Table 7.1: Performance for Linking-Tweets-to-News under three evaluation metrics (latent dimen-
sion K = 100 for LDA/WMF/WMF-G)
weighted by TF-IDF. 4. WMF. In these baselines, hashtags and named entities are simply treated as
words.
To curtail variation in results due to randomness, each reported number is the average of 10 runs. For WMF and WMF-G, we assign the same initial random values and run 20 iterations. In both systems we fix the missing-word weight at wm = 0.01 and the regularization coefficient at λ = 20, the best configuration of WMF found in Chapter 2. For LDA-θ and LDA-wvec, we run Gibbs-sampling-based LDA for 2000 iterations and average the estimated variables over the last 10 iterations.
7.5.2 Results
Table 7.1 summarizes the performance of the baselines and WMF-G at latent dimension K = 100.
All the parameters are chosen based on the development set. For WMF-G, we try different values
of k (the number of neighbors linked to a tweet/news for a hashtag/NE/time constraint) and δ (the
weight of link information). We decided to integrate the links in four subgraphs: (a) hashtags in
tweets; (b) named entities in tweets; (c) timestamp in tweets; (d) timestamp in news articles. For
LDA we tune the hyperparameters α (the Dirichlet prior on the topic distribution of a document) and β (the Dirichlet prior on the word distribution of a topic). It is worth noting that ATOP measures the overall ranking among 1,000 samples, whereas TOP10/RR focus more on whether the ground truth news article is among the first few returned results.
[Figure 7.3 plots: panels (a) ATOP, (b) TOP10 and (c) RR, each showing dev and test curves as δ ranges over 0–4]
Figure 7.3: Impact of the link weight δ of model WMF-G on the development and test sets evaluated by the three evaluation metrics; latent dimension K = 100, and the neighbor number is k = 4.
[Figure 7.4 plots: panels (a) ATOP, (b) TOP10 and (c) RR, comparing WMF and WMF-G as the dimension K ranges over 50–150]
Figure 7.4: Impact of the latent dimension K of model WMF-G on the test set evaluated by the three metrics; the neighbor number is fixed at k = 4, and K varies from 50 to 150.
Conditions                        Links     ATOP                TOP10               RR
                                            dev       test      dev       test      dev       test
hashtag tweets                    375,371   +0.397%   +0.379%   +1.015%   +1.021%   +0.504%   +0.641%
NE tweets                         164,412   +0.141%   +0.130%   +0.598%   +0.479%   +0.278%   +0.294%
time tweets                       139,488   +0.126%   +0.136%   +0.512%   +0.503%   +0.241%   +0.327%
time news                          50,008   +0.036%   +0.026%   +0.156%   +0.256%   +1.890%   +1.924%
full model (all 4 subgraphs)      573,999   +0.546%   +0.518%   +1.556%   +1.371%   +2.607%   +2.727%
full model minus hashtag tweets   336,963   +0.288%   +0.276%   +1.129%   +1.037%   +2.488%   +2.541%
full model minus NE tweets        536,333   +0.528%   +0.503%   +1.518%   +1.393%   +2.580%   +2.680%
full model minus time tweets      466,207   +0.457%   +0.426%   +1.281%   +1.145%   +2.449%   +2.554%
full model minus time news        523,991   +0.508%   +0.490%   +1.300%   +1.190%   +0.632%   +0.785%
author tweets                      21,318   +0.043%   +0.042%   +0.028%   +0.057%   −0.003%   −0.017%
full model plus author tweets     593,483   +0.575%   +0.545%   +1.465%   +1.336%   +2.415%   +2.547%
Table 7.2: Contribution of the hashtag/named entity/temporal/author subgraphs, with K = 100, k = 4, δ = 3, measured by the gain over the baseline WMF.
As reported in [Guo and Diab, 2012b], LDA-θ has the worst results, a consequence of directly using the inferred topic distribution θ of a text: the inferred topic vector has only a few non-zero values, hence a lot of information is missing. LDA-wvec preserves more information by creating a dense latent vector from the topic distribution of each word P(z|w), and thus does much better in ATOP.
It is interesting that the IR model has a very low ATOP (90.795%) but an acceptable RR (46.281%), in contrast to LDA-wvec with a high ATOP (94.148%) and a low RR (27.904%). This is caused by the nature of the two models. LDA-wvec identifies global coarse-grained topic information (such as politics vs. economics), hence it achieves a high ATOP by excluding the most irrelevant news articles; however, it does not distinguish fine-grained differences such as Hillary vs. Obama. The IR model exerts the opposite influence via word matching: it ranks a correct news article very high if overlapping words exist (leading to a high RR), but very low if no words overlap (hence a low ATOP).
We can conclude that WMF is a very strong baseline given that it achieves high scores on all three metrics. As a dimension reduction model, it captures global topics (+1.89% ATOP over LDA-wvec); moreover, by explicitly modeling missing words, the existence of a word is also encoded in the latent vector (+2.31% TOP10 and −0.011% RR relative to the IR model).
Even with WMF being a very challenging baseline, WMF-G still significantly improves all 3
metrics. In the case k = 4, δ = 3, compared to WMF, WMF-G gains +1.371% TOP10, +2.727% RR, and +0.518% ATOP (a significant improvement of the ATOP value considering that it is averaged over roughly 30,000 data points at an already high level of 96%, reducing the error rate by 13%). All improvements of WMF-G over WMF are statistically significant at the 99% confidence level under a two-tailed paired t-test.
We also present results using different numbers of links k in WMF-G in Table 7.1. We experimented with k = {3, 4, 5}; k = 4 is found to be the overall optimal value (although k = 5 has a better ATOP). Figure 7.3 shows the influence of δ = {0, 1, 2, 3, 4} on each metric when k = 4. Note that when δ = 0 no links are used, which reduces to the baseline WMF. We can see that using links is always helpful. When δ = 4, we obtain a higher ATOP value but lower TOP10 and RR.
Figure 7.4 illustrates the impact of the dimension K = {50, 75, 100, 125, 150} on WMF and WMF-G (k = 4) over the test set. The trends hold across K values with a consistent improvement; generally a larger K leads to better performance, and in all conditions WMF-G outperforms WMF.
7.5.2.1 Contribution of Subgraphs
We are interested in the contribution of each feature subgraph, hence we list the impact of the individual components in Table 7.2. The impact of each subgraph is evaluated in two conditions: (a) the subgraph only; (b) the full model minus the subgraph. The full model is the combination of the four subgraphs (also the best model, k = 4, in Table 7.1). In the last two rows of Table 7.2 we also present the results of using authorship only and of the full model plus authorship. The 2nd column lists the number of links in each subgraph. To highlight the differences, we report the gain of each model over the baseline WMF.
We make several interesting observations from Table 7.2. It is clear that the hashtag subgraph on tweets is the most useful: it has the best ATOP and TOP10 values in the subgraph-only condition (ATOP: +0.379% vs. the 2nd best +0.136%; TOP10: +1.021% vs. the 2nd best +0.503%), while in the full-model-minus condition, removing hashtags yields the lowest ATOP and TOP10. Observing that it also contains the most links, we believe coverage is another important reason for its strong performance.
The named entity subgraph seems to help the least. Looking into the extracted named entities and hashtags, we found that many popular named entities are already covered by hashtags. That said,
adding the named entity subgraph to the full model still has a positive contribution.
It is worth noting that the time-news subgraph has the most positive influence on RR. This is because temporal information is very salient in the news domain: usually several reports describe an event within a short period, so a news latent vector is strengthened by receiving semantics from its neighbors.
Finally, we analyze the influence of the authorship of tweets. Adding authorship to the full model greatly hurts TOP10 and RR, whereas it helps ATOP. This is understandable: by introducing author links between tweets, to some degree we average the latent vectors of tweets written by the same person. Therefore, a tweet whose topic is vague and hard to detect gains some prior knowledge of topics through the author links (hence the increased ATOP), whereas this prior knowledge becomes noise for tweets that are already handled very well by the WMF-G model (hence the decreased TOP10 and RR).
7.6 Summary and Discussion
Motivated by the difficulty of understanding tweets, we propose the Linking-Tweets-to-News task, which potentially benefits many NLP applications in which off-the-shelf NLP tools can be applied to the most relevant news. We also collect a gold standard dataset by crawling tweets, each with a URL referring to a news article. We formalize the linking task as a short text modeling problem, and extract Twitter/news-specific features to derive text-to-text relations, which are then incorporated into the matrix factorization framework. The new model achieves significant improvement over the baselines.
Aiming at increasing the accuracy of the linking, it is worth investigating a supervised setting for this task, which can be cast as a classic ranking problem. With the ground truth available, we can extract plenty of interesting features that disclose the relatedness between a tweet and a news article, such as surface word similarity, whether a named entity appears on both sides, and many more.
More importantly, since the goal of the Linking-Tweets-to-News task is to provide more context for understanding tweets, it would be valuable if the predictions of our model could actually improve the performance of other NLP tasks focused on tweets. In the future, we would like to collaborate with researchers
working on tasks such as tweet summarization and event extraction on tweets, to test this influence.
Part III
Conclusions
Chapter 8
Conclusions
Nowadays the internet generates massive amounts of short text data, yet limited progress has been made toward computing meaningful similarity beyond surface word overlap or word-level semantic comparison, which are typically ineffective given the short context, or too time-consuming. To this end, the first part of this thesis focuses on developing dimension reduction models to address this issue. This thesis has been an exploration of unsupervised methods for modeling short text representations in a latent space. We use the task of calculating semantic textual similarity to illustrate the efficacy of our approach.
In the second part of the thesis we further exemplify the impact our models have on several
NLP application tasks. We address adapting short text similarity models within the context of
several semantics based NLP tasks: word sense disambiguation, automated pyramid evaluation,
and linking tweets to news.
In the following we discuss some new challenges to the short text similarity task and some
potential future work. Firstly, focusing on the current matrix factorization framework, we note
that it only exploits bag-of-words features, and overlooks the structural information in the text,
such as word order, syntax. We believe the structural features convey more subtle and nuanced
semantics that cannot be covered by individual words. Secondly, given that our current models learn
a latent representation for a text, it would be very interesting to see the impact of neural networks
on this task, since neural networks are well known for learning semantic embeddings of words
or texts. In addition, neural network based models are flexible enough to incorporate
syntactic information. Another direction for enhancing short text modeling performance
is by adding supervised information.
8.1 Summary
Our work on modeling short text data is motivated by the nature of the texts, which can be character-
ized by two features/challenges: (1) very few words per text; (2) large scale of data,
especially in the online-generated portion. Our first two models concentrate on the first trait by (a)
integrating more features for texts and (b) integrating more features for words; the third model (c)
targets the second trait by exploiting binary coding. From the perspective of the matrix factorization
model, Chapter 2 is devoted to improving the Q matrix (textual latent profiles) in Figure 2.2, while
Chapters 3 and 4 work on modeling a better P matrix (lexical latent profiles).
In the first half of the thesis, we begin our investigation of the word sparsity characteristic of
short texts in Chapter 2, by modeling missing words for short text data. The bottleneck for short text
similarity is that the number of words in a text is very small. Using missing words adds thousands
more features, thereby alleviating the data sparsity issue. We analyze two classic models, LSA
and LDA, and provide some important insight on how they handle missing words. Accordingly,
we design a mechanism that sits between the two methods and models the missing words at the
appropriate level of granularity: weighted matrix factorization uses all the missing words but
gives them a small weight, so that the impact of the observed words is not diminished
by the sheer overwhelming number of missing words.
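The weighting mechanism above can be sketched as follows; the toy vocabulary, the counts, and the missing-word weight of 0.01 are illustrative assumptions, not the exact settings used in Chapter 2.

```python
import numpy as np

def build_weight_matrix(term_doc, w_missing=0.01):
    """Weight matrix for weighted matrix factorization: observed words get
    weight 1.0, missing words get a small weight, so the overwhelming number
    of missing cells does not drown out the few observed words."""
    return np.where(term_doc > 0, 1.0, w_missing)

# Toy term-document matrix: 4 vocabulary words x 2 short texts.
X = np.array([[2, 0],
              [0, 1],
              [1, 0],
              [0, 0]], dtype=float)
W = build_weight_matrix(X)
# Observed cells carry weight 1.0; the many missing cells carry 0.01.
```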
We extend our effort in Chapter 3, where we tackle the same challenge from another direction
by robustly modeling lexical semantics, which translates into improving the P matrix in Figure 2.2.
The intuition is that because there are only around 10 observed words in a short text, we need to
make very good use of each word, lest some important topics be missing from the text
semantics. We integrate corpus-based semantics (bigrams) and knowledge-based semantics (similar
word pairs) in the weighted matrix factorization framework. Because they are very different kinds
of lexical semantics, they are complementary to each other. It is worth noting that this approach is
able to improve the WMF model, which already performs significantly above LSA/LDA.
We then move to applying our model to massive data sets such as Twitter data. In the online
data scenario, the new challenge is the sheer size of the data: each day 500 million tweets are
generated. To this end, we convert our model into a binarized version, which produces a binary bit
string for each tweet. The binary strings allow Hamming distances to be computed directly in
hardware, which is much faster than cosine similarity on real-valued vectors. Since the succinct
binary bits lose nuanced semantics, we also propose a method to reduce redundancy in the
P matrix in order to store as much information as possible.
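The speed advantage of binary codes can be illustrated with a minimal sketch: Hamming distance reduces to an XOR followed by a population count of the differing bits. The 8-bit codes below are toy values, far shorter than a realistic tweet code.

```python
def hamming_distance(a: int, b: int) -> int:
    """Hamming distance between two binary codes stored as integers:
    XOR sets a bit wherever the codes disagree; count the set bits."""
    return bin(a ^ b).count("1")

# Two toy 8-bit tweet codes.
code1 = 0b10110100
code2 = 0b10011100
d = hamming_distance(code1, code2)  # -> 2 (bits 3 and 5 differ)
```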
We developed several models to improve Pearson's correlation coefficient on predicting short
text similarity. Yet this is not adequate: we need to know whether the improvement is substantial
enough to boost other tasks that come with short text similarity computation components. There-
fore, we take a step forward and evaluate the performance on other tasks as an extrinsic evaluation.
In theory, any task that involves similarity computation should benefit from our work. In this
thesis, we select several NLP tasks that are strongly associated with semantics.
The first task is the automated pyramid method for summarization evaluation. The similarity computation
happens in the process of determining which key concepts from the original documents are men-
tioned in the ngrams of summaries. The primary difference from the short text
similarity task is the text granularity: here the texts are ngram phrases. Evaluated on student summaries, the
dimension reduction based method is able to extract key concepts with higher precision and recall,
and hence achieves a higher correlation with manual scores, than previous methods.
Another NLP task that involves heavy similarity computation is unsupervised word sense disam-
biguation (WSD). We note that WSD systems rely heavily on sense similarity measures. Moreover,
sense definitions are usually very short, rendering this task an ideal test bed for our models.
By exploiting the sense relations in WordNet, we construct a new sense similarity measure, wmfvec,
where each sense is represented by a latent vector learned from its WordNet definition. In WSD
experiments, wmfvec significantly outperforms the LDA based similarity measure and the surface word
comparison based elesk measure [Banerjee and Pedersen, 2003].
We then apply our model to social media and news data. Here, we show that our model, WMF,
can be easily extended to adapt to a new task. We identify the key challenge of modeling tweets
for events: a tweet is fragmented, usually covering only one aspect of the event. We integrate the
tweet-specific feature (hashtags) and the news-specific feature (named entities) to find the complementary
tweets that contain the missing aspects for the target tweet. The resulting model achieves even better
performance than the WMF model.
8.2 Limitations and Future work
Despite the progress presented in this thesis, there remain some interesting and exciting challenges
for the short text similarity task. In the following, we discuss some limitations of the currently proposed
models, and several promising topics that we will explore in our future research. In general, we
intend to continue working on this task from four aspects: (1) adding new features; (2) exploiting
new embedding techniques; (3) providing supervised labels; (4) producing new properties for text
embeddings.
1. Adding new features – Incorporating Syntax: In our current models, one major impediment
is that we mainly use bag-of-words features. Intuitively, individual words are not capable of expressing
the subtle semantics that phrases can, and breaking texts into individual words loses a lot of
information such as word order. An example from [Feng et al., 2011]: the meaning of prevent cancer
is completely reversed from that of cancer; yet the textual similarity score, computed by either a
lexical semantics based method or a dimension reduction model, is relatively high. This may be
why short text similarity is rarely applied in sentiment analysis.
To overcome this issue, modeling a new feature, syntactic structure, could be helpful. Previous
work [Severyn et al., 2013] showed that tree kernels boost similarity scores, where the
similarity of two texts is the sum of common subtrees extracted from constituent and dependency
trees.
Considering that our model is a dimension reduction model, one preliminary idea is to integrate
vector-based compositional semantics [Mitchell and Lapata, 2008] into the factorization model.
Compositional semantics studies how the meanings of individual text units can be combined to pro-
duce the meaning of bigger units such as phrases or sentences [Hodges, 1997]. Mitchell and Lapata
[2008] investigated constructing the vector representations of phrases from words. In our case, we
can generate the phrase vectors following the constituent tree structure of the short text: the vectors
of two nodes are merged into a new vector following the compositional operations. We hope that
compositional semantics is able to correctly model phrases such as not bad.
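The additive and multiplicative composition operations of Mitchell and Lapata [2008] are simple to state; the 4-dimensional latent vectors below are hypothetical values used purely for illustration.

```python
import numpy as np

def compose_additive(u, v):
    """Additive composition: the phrase vector is the element-wise sum."""
    return u + v

def compose_multiplicative(u, v):
    """Multiplicative composition: the element-wise product, which
    emphasizes latent dimensions (topics) shared by both words."""
    return u * v

# Toy latent word vectors (hypothetical values).
not_vec = np.array([0.1, 0.9, 0.2, 0.0])
bad_vec = np.array([0.0, 0.8, 0.1, 0.3])
phrase_add = compose_additive(not_vec, bad_vec)
phrase_mul = compose_multiplicative(not_vec, bad_vec)
```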
2. Exploiting word embeddings – Neural Networks as a new method to produce textual em-
beddings: Nowadays neural network techniques have proven to be very powerful models
for learning word and text embeddings, such as recurrent neural networks [Mikolov et al., 2010],
recursive autoencoders [Socher et al., 2011b]. They have been found successful in a variety of
NLP tasks, including paraphrase detection [Socher et al., 2011a], sentiment analysis [Socher et al.,
2011b], and language modeling [Mnih and Hinton, 2007], by employing automatic feature extraction.
Given the great performance on these NLP tasks, this direction is worth exploring in the context of short text
similarity. Moreover, because of their recursive property, such models provide a natural way to model the syntax
of sentences, as shown in [Socher et al., 2011b]. The first step could be to apply recurrent neural
networks [Mikolov et al., 2010] to learn a short text embedding, which preserves word order by
treating the text as a word sequence. Then we can move on to applying recursive autoencoders [Socher et
al., 2011a] to the syntactic tree of the text data. Note that these methods already model compositional
semantics.
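As a concrete starting point, a vanilla recurrent cell that folds a word sequence into a single vector can be sketched as below; the dimensions, the random untrained weights, and the use of the final hidden state as the text embedding are illustrative simplifications, not the actual models of Mikolov et al. or Socher et al.

```python
import numpy as np

def rnn_text_embedding(word_vecs, W_h, W_x):
    """Run a vanilla recurrent cell over a word sequence and return the
    final hidden state as the text embedding; word order matters because
    each state depends on the previous one."""
    h = np.zeros(W_h.shape[0])
    for x in word_vecs:
        h = np.tanh(W_h @ h + W_x @ x)
    return h

rng = np.random.default_rng(0)
d_hidden, d_word = 5, 3
W_h = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
W_x = rng.normal(scale=0.1, size=(d_hidden, d_word))
# A toy 4-word "sentence" of random word vectors.
sentence = [rng.normal(size=d_word) for _ in range(4)]
emb = rnn_text_embedding(sentence, W_h, W_x)  # shape (5,)
```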
3. Providing Supervised Labels – Supervised Tweet Retrieval: Another direction for enhancing
short text similarity performance is to add supervised labels. We want to test the performance
on the tweet retrieval task, since a lot of noisy labels, hashtags, are already available. By observing the
hashtag labels, the model will learn a more informative binary string for each tweet, where Ham-
ming distances are minimized among tweets with the same labels, and simultaneously maximized
among tweets with different labels. This direction is worth exploring since the labels,
being hashtags, can be easily obtained without expensive manual annotation.
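A minimal sketch of the quantity such a supervised model would optimize: the mean Hamming distance within same-hashtag pairs should be small, and the mean across different hashtags large. The 4-bit codes and the labels are hypothetical toy values.

```python
from itertools import combinations

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def label_separation(codes, labels):
    """Mean Hamming distance over same-label pairs vs. different-label
    pairs; a supervised model would minimize the former and maximize
    the latter."""
    within, across = [], []
    for i, j in combinations(range(len(codes)), 2):
        (within if labels[i] == labels[j] else across).append(
            hamming(codes[i], codes[j]))
    return sum(within) / len(within), sum(across) / len(across)

# Toy 4-bit tweet codes with hashtag labels.
codes = [0b1100, 0b1101, 0b0010, 0b0011]
labels = ["#Oscar2015", "#Oscar2015", "#SB49", "#SB49"]
w, a = label_separation(codes, labels)  # within=1.0, across=3.5
```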
Meanwhile, this raises new challenges in the Twitter context. Because tweets are short and
hence so fragmented that a tweet reports only one aspect of an event, two tweets sharing the same
hashtag label are not necessarily talking about the same aspect of the event. Consider the two following
tweets,
• my favorite on #Oscar2015 Red Carpet. @ladygaga agreed, saying she looked ”beautiful.”
• JulieAndrews presents Oscar for Best Film Score. #Oscar2015
Both of them are Oscar 2015 related; however, one focuses on the red carpet and the other talks
about the best film score. To address this special issue, we may need to relax the strong assumption that
all tweets sharing the same labels should have similar binary bits.
4. Producing new properties for text embeddings – Sentiment-aware short text embeddings:
The previous three topics focus on improving the quality of embeddings so that the text embeddings
encode more similarity information. Now we consider another aspect of embeddings: augmenting
them with new properties such as sentiment. Our idea is inspired by the work in [Yih
et al., 2012], where the induced word embeddings are able to distinguish antonyms. In their model,
ideally the cosine similarity between hot and cold should be close to −1, which implies the two
words are negatively correlated. We would like our model to enjoy a similar effect – two sentences
that express opposite semantics should have a cosine similarity value of −1. Such an embedding
would have significant impact on research in areas such as sarcasm detection [Gonzalez-Ibanez et
al., 2011; Riloff et al., 2013], where the contradiction of two text segments is an underlying attribute.
To accomplish this goal, we will need to annotate new data, and dramatically change the current
similarity annotation schema. A preliminary annotation schema should take into consideration both
topical similarity as well as sentiment polarity.
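The desired geometry can be illustrated with a toy example: if opposite-polarity texts received sign-flipped embeddings, their cosine similarity would be exactly −1. The vectors below are hypothetical, not learned by any model.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings where opposite polarity flips the sign.
hot = np.array([0.6, -0.2, 0.8])
cold = -hot                      # idealized antonym embedding
sim = cosine(hot, cold)          # -> -1.0
```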
In terms of applications, it is interesting to build joint models to bridge the gap between the short
text similarity task and the application tasks. Examples of such tasks include paraphrase detection
and textual entailment. Here, we briefly discuss the application and challenges of applying our
models to novelty detection for events.
Novelty detection aims to find an event that has not been covered by the news media. Thus,
short text similarity is a great baseline for the task: if a text has very low similarity scores
with all the previous news articles, it is considered a potential novel event. However, there is one
major challenge we need to face, which is named entities. Identifying a novel event relies heavily
on the named entities involved in the event, but our models do not handle new named entities
very well: any distributional model needs to see a string multiple times before it can learn the correct
semantics of the string. In this case, it makes more sense to combine our model with the surface
word matching method to get a robust similarity score for the novelty detection task.
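One simple way to realize this combination is a linear interpolation of the latent similarity with a surface overlap measure; the Jaccard overlap, the mixing weight alpha, and the example texts are illustrative assumptions, not a tuned configuration.

```python
def jaccard(text1: str, text2: str) -> float:
    """Surface word overlap, which still fires on unseen named entities."""
    s1, s2 = set(text1.lower().split()), set(text2.lower().split())
    return len(s1 & s2) / len(s1 | s2)

def combined_similarity(latent_sim: float, text1: str, text2: str,
                        alpha: float = 0.7) -> float:
    """Interpolate the latent-space similarity with surface overlap;
    alpha is an illustrative mixing weight."""
    return alpha * latent_sim + (1 - alpha) * jaccard(text1, text2)

# A low latent score can be corrected by entity overlap on the surface.
score = combined_similarity(0.2, "earthquake hits Nepal",
                            "Nepal earthquake today")
```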
Part IV
Bibliography
BIBLIOGRAPHY 116
Bibliography
[Abel et al., 2011] Fabian Abel, Qi Gao, Geert-Jan Houben, and Ke Tao. Semantic enrichment of
twitter posts for user profile construction on the social web. In Proceedings of the 49th Annual
Meeting of the Association for Computational Linguistics, 2011.
[Agarwal et al., 2012] Puneet Agarwal, Rajgopal Vaithiyanathan, Saurabh Sharma, and Gautam
Shroff. Catching the long-tail: Extracting local news events from twitter. In Proceedings of the
Sixth International AAAI Conference on Weblogs and Social Media, 2012.
[Agirre and Soroa, 2009] Eneko Agirre and Aitor Soroa. Personalizing pagerank for word sense
disambiguation. In Proceedings of the 12th Conference of the European Chapter of the ACL,
2009.
[Agirre et al., 2012] Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. Semeval-
2012 task 6: A pilot on semantic textual similarity. In First Joint Conference on Lexical and
Computational Semantics (*SEM), 2012.
[Agirre et al., 2013] Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei
Guo. *sem 2013 shared task: Semantic textual similarity. In Second Joint Conference on Lexical
and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the
Shared Task: Semantic Textual Similarity, 2013.
[Agirre et al., 2014] Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor
Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. Semeval-
2014 task 10: Multilingual semantic textual similarity. In SemEval 2014, 2014.
[Bach et al., 2013] Stephen H. Bach, Bert Huang, Ben London, and Lise Getoor. Hinge-loss markov
random fields: Convex inference for structured prediction. In Uncertainty in Artificial Intelli-
gence, 2013.
[Baker et al., 1998] Collin F Baker, Charles J Fillmore, and John B Lowe. The berkeley framenet
project. In Proceedings of the 36th Annual Meeting of the Association for Computational Lin-
guistics and 17th International Conference on Computational Linguistics-Volume 1, 1998.
[Banerjee and Pedersen, 2003] Satanjeev Banerjee and Ted Pedersen. Extended gloss overlaps as
a measure of semantic relatedness. In Proceedings of the 18th International Joint Conference on
Artificial Intelligence, pages 805–810, 2003.
[Bar et al., 2013] Daniel Bar, Torsten Zesch, and Iryna Gurevych. Dkpro similarity: An open
source framework for text similarity. In Proceedings of the 51st Annual Meeting of the Asso-
ciation for Computational Linguistics: System Demonstrations, 2013.
[Barzilay and Lee, 2003] Regina Barzilay and Lillian Lee. Learning to paraphrase: an unsuper-
vised approach using multiple-sequence alignment. In Proceedings of the 2003 Conference of
the North American Chapter of the Association for Computational Linguistics on Human Lan-
guage Technology-Volume 1, 2003.
[Beltagy et al., 2014] Islam Beltagy, Katrin Erk, and Raymond Mooney. Probabilistic soft logic
for semantic textual similarity. Proceedings of the 52nd Annual Meeting of the Association for
Computational Linguistics, 2014.
[Benson et al., 2011] Edward Benson, Aria Haghighi, and Regina Barzilay. Event discovery in
social media feeds. In Proceedings of the 49th Annual Meeting of the Association for Computa-
tional Linguistics: Human Language Technologies, 2011.
[Bhagat and Ravichandran, 2008] Rahul Bhagat and Deepak Ravichandran. Large scale acquisition
of paraphrases for learning surface patterns. In Proceedings of ACL-08: HLT, 2008.
[Blei and Lafferty, 2006] David M Blei and John D Lafferty. Dynamic topic models. In Proceed-
ings of the 23rd international conference on Machine learning, 2006.
[Blei et al., 2003] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation.
Journal of Machine Learning Research, 3, 2003.
[Boonthum-Denecke et al., 2011] Chutima Boonthum-Denecke, Philip M McCarthy, Travis Alan
Lamkin, G Tanner Jackson, Joseph Magliano, and Danielle S McNamara. Automatic natural lan-
guage processing and the detection of reading skills and reading comprehension. In Proceedings
of the Twenty-Fourth International Florida Artificial Intelligence Research Society Conference,
2011.
[Broder et al., 1998] Andrei Z Broder, Moses Charikar, Alan M Frieze, and Michael Mitzenmacher.
Min-wise independent permutations. In Proceedings of the Thirtieth Annual ACM Symposium
on Theory of Computing, 1998.
[Brody and Diakopoulos, 2011] Samuel Brody and Nicholas Diakopoulos. Coooooooooooooooll-
llllllllllll!!!!!!!!!!!!!! using word lengthening to detect sentiment in microblogs. In Proceedings
of the 2011 Conference on Empirical Methods in Natural Language Processing, 2011.
[Cai et al., 2007] Jun Fu Cai, Wee Sun Lee, and Yee Whye Teh. Improving word sense disambigua-
tion using topic features. In Proceedings of the 2007 Joint Conference on Empirical Methods in
Natural Language Processing and Computational Natural Language Learning, 2007.
[Callison-Burch et al., 2007] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof
Monz, and Josh Schroeder. (meta-) evaluation of machine translation. In Proceedings of the
Second Workshop on Statistical Machine Translation, 2007.
[Callison-Burch et al., 2008] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof
Monz, and Josh Schroeder. Further meta-evaluation of machine translation. In Proceedings
of the Third Workshop on Statistical Machine Translation, 2008.
[Chakrabarti and Punera, 2011] Deepayan Chakrabarti and Kunal Punera. Event summarization
using tweets. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social
Media, 2011.
[Chang et al., 2013] Kai-Wei Chang, Wen-tau Yih, and Christopher Meek. Multi-relational latent
semantic analysis. In Proceedings of the 2013 Conference on Empirical Methods in Natural
Language Processing, 2013.
[Charikar, 2002] Moses S. Charikar. Similarity estimation techniques from rounding algorithms.
In Proceedings of the Thiry-fourth Annual ACM Symposium on Theory of Computing, 2002.
[Chen and Dolan, 2011] David L. Chen and William B. Dolan. Collecting highly parallel data
for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for
Computational Linguistics, 2011.
[Claypool et al., 1999] Mark Claypool, Anuja Gokhale, Tim Miranda, Pavel Murnikov, Dmitry
Netes, and Matthew Sartin. Combining content-based and collaborative filters in an online news-
paper. In Proceedings of the ACM SIGIR Workshop on Recommender Systems, 1999.
[Clive et al., 2005] Clive Best, Erik van der Goot, Ken Blackler, Teofilo Garcia, and David Horby.
Europe media monitor – system description. EUR Report, 2005.
[Conover et al., 2011] Michael Conover, Jacob Ratkiewicz, Matthew Francisco, Bruno Goncalves,
Filippo Menczer, and Alessandro Flammini. Political polarization on twitter. In Proceedings of
the Fifth International AAAI Conference on Weblogs and Social Media, 2011.
[Corso et al., 2005] Gianna M. Del Corso, Antonio Gulli, and Francesco Romani. Ranking a stream
of news. In Proceedings of the 14th international conference on World Wide Web, 2005.
[Dagan et al., 2006] Ido Dagan, Oren Glickman, and Bernardo Magnini. The pascal recognising
textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncer-
tainty, Visual Object Classification, and Recognising Tectual Entailment. Springer, 2006.
[Deerwester et al., 1990] Scott C. Deerwester, Susan T Dumais, Thomas K. Landauer, George W.
Furnas, and Richard A. Harshman. Indexing by latent semantic analysis. Journal of the American
Society for Information Science, 1990.
[Dolan et al., 2004] William Dolan, Chris Quirk, and Chris Brockett. Unsupervised construction
of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the
20th International Conference on Computational Linguistics, 2004.
[Duan et al., 2010] Yajuan Duan, Long Jiang, Tao Qin, Ming Zhou, and Heung-Yeung Shum. An
empirical study on learning to rank of tweets. In COLING, 2010.
[Efron and Tibshirani, 1986] Bradley Efron and Robert Tibshirani. Bootstrap methods for standard
errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, 1986.
[Efron, 2010] Miles Efron. Information search and retrieval in microblogs. In Journal of the Amer-
ican Society for Information Science and Technology, 2010.
[Eisenstein, 2013] Jacob Eisenstein. What to do about bad language on the internet. In Proceedings
of NAACL-HLT 2013, pages 359–369, 2013.
[Erk, 2007] Katrin Erk. A simple, similarity-based model for selectional preferences. In Proceed-
ings of the 45th Annual Meeting of the Association for Computational Linguistics, 2007.
[Fellbaum, 1998] Christiane Fellbaum. WordNet: An Electronic Lexical Database. MIT Press,
1998.
[Feng et al., 2008] Jin Feng, Yi-Ming Zhou, and Trevor Martin. Sentence similarity based on rele-
vance. In Proceedings of IPMU, 2008.
[Feng et al., 2011] Song Feng, Ritwik Bose, and Yejin Choi. Learning general connotation of
words using graph-based algorithms. In Proceedings of the Conference on Empirical Methods in
Natural Language Processing, 2011.
[Foltz et al., 2000] Peter W Foltz, Sara Gilliam, and Scott Kendall. Supporting content-based feed-
back in on-line writing evaluation with lsa. Interactive Learning Environments, pages 111–127,
2000.
[Gildea and Jurafsky, 2002] Daniel Gildea and Daniel Jurafsky. Automatic labeling of semantic
roles. Computational linguistics, 2002.
[Gong and Lazebnik, 2011] Yunchao Gong and Svetlana Lazebnik. Iterative quantization: A pro-
crustean approach to learning binary codes. In Proceedings of IEEE Conference on Computer
Vision and Pattern Recognition, 2011.
[Gonzalez-Ibanez et al., 2011] Roberto Gonzalez-Ibanez, Smaranda Muresan, and Nina Wa-
cholder. Identifying sarcasm in twitter: a closer look. In Proceedings of the 49th Annual Meeting
of the Association for Computational Linguistics: Human Language Technologies, 2011.
[Graesser et al., 2004] Arthur C Graesser, Danielle S McNamara, Max M Louwerse, and Zhiqiang
Cai. Coh-metrix: Analysis of text on cohesion and language. Behavior Research Methods,
Instruments, & Computers, 2004.
[Graesser et al., 2011] Arthur C Graesser, Danielle S McNamara, and Jonna M Kulikowich. Coh-
metrix providing multilevel analyses of text characteristics. Educational Researcher, 2011.
[Griffiths and Steyvers, 2004] Thomas L. Griffiths and Mark Steyvers. Finding scientific topics.
Proceedings of the National Academy of Sciences, 101, 2004.
[Guo and Diab, 2010] Weiwei Guo and Mona Diab. Combining orthogonal monolingual and mul-
tilingual sources of evidence for all words wsd. In Proceedings of the 48th Annual Meeting of
the Association for Computational Linguistics, 2010.
[Guo and Diab, 2011] Weiwei Guo and Mona Diab. Semantic topic models: Combining word
distributional statistics and dictionary definitions. In Proceedings of the 2011 Conference on
Empirical Methods in Natural Language Processing, 2011.
[Guo and Diab, 2012a] Weiwei Guo and Mona Diab. Learning the latent semantics of a concept by
its definition. In Proceedings of the 50th Annual Meeting of the Association for Computational
Linguistics, 2012.
[Guo and Diab, 2012b] Weiwei Guo and Mona Diab. Modeling sentences in the latent space. In
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 2012.
[Guo and Diab, 2013] Weiwei Guo and Mona Diab. Improving lexical semantics for sentential
semantics: Modeling selectional preference and similar words in a latent variable model. In
Proceedings of the 2013 Conference of the North American Chapter of the Association for Com-
putational Linguistics: Human Language Technologies, 2013.
[Guo et al., 2013] Weiwei Guo, Hao Li, Heng Ji, and Mona Diab. Linking tweets to news: A
framework to enrich online short text data in social media. In Proceedings of the 51th Annual
Meeting of the Association for Computational Linguistics, 2013.
[Guo et al., 2014] Weiwei Guo, Wei Liu, and Mona Diab. Fast tweet retrieval with compact binary
codes. In Proceedings of COLING 2014, the 25th International Conference on Computational
Linguistics, 2014.
[Han et al., 2013] Lushan Han, Abhay Kashyap, Tim Finin, James Mayfield, and Jonathan Weese.
Umbc ebiquity-core: Semantic textual similarity systems. In Second Joint Conference on Lexical
and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the
Shared Task, 2013.
[Harnly et al., 2005] Aaron Harnly, Ani Nenkova, Rebecca Passonneau, and Owen Rambow. Au-
tomation of summary evaluation by the pyramid method. In Recent Advances in Natural Lan-
guage Processing, 2005.
[Hindle and Rooth, 1993] Donald Hindle and Mats Rooth. Structural ambiguity and lexical rela-
tions. Computational linguistics, 1993.
[Ho et al., 2010] Chukfong Ho, Masrah Azrifah Azmi Murad, Rabiah Abdul Kadir, and Shya-
mala C. Doraisamy. Word sense disambiguation-based sentence similarity. In Proceedings of
the 23rd International Conference on Computational Linguistics, 2010.
[Hodges, 1997] Wilfrid Hodges. Compositional semantics for a language of imperfect information.
Logic Journal of IGPL, 1997.
[Hofmann, 1999] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the
22nd annual international ACM SIGIR conference on Research and development in information
retrieval, 1999.
[Hovy et al., 2006] Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph
Weischedel. Ontonotes: The 90% solution. In Proceedings of the Human Language Technology
Conference of the North American Chapter of the ACL, 2006.
[Huang et al., 2012] Hongzhao Huang, Arkaitz Zubiaga, Heng Ji, Hongbo Deng, Dong Wang, Hieu
Le, Tarek Abdelzather, Jiawei Han, Alice Leung, John Hancock, and Clare Voss. Tweet rank-
ing based on heterogeneous networks. In Proceedings of the 24th International Conference on
Computational Linguistics, 2012.
[Ide and Veronis, 1998] Nancy Ide and Jean Veronis. Introduction to the special issue on word
sense disambiguation: the state of the art. Computational linguistics, 1998.
[Indyk and Motwani, 1998] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: to-
wards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM sympo-
sium on Theory of computing, 1998.
[Islam and Inkpen, 2008] Aminul Islam and Diana Inkpen. Semantic text similarity using corpus-
based word similarity and string similarity. ACM Transactions on Knowledge Discovery from
Data, 2, 2008.
[Ji and Eisenstein, 2013] Yangfeng Ji and Jacob Eisenstein. Discriminative improvements to dis-
tributional sentence similarity. In Proceedings of the 2013 Conference on Empirical Methods in
Natural Language Processing, 2013.
[Ji and Grishman, 2008] Heng Ji and Ralph Grishman. Refining event extraction through cross-
document inference. In Proceedings of ACL-08: HLT, 2008.
[Jiang and Conrath, 1997] Jay J. Jiang and David W. Conrath. Semantic similarity based on cor-
pus statistics and lexical taxonomy. In Proceedings of International Conference Research on
Computational Linguistics, 1997.
[Jiang et al., 2011] Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. Target-
dependent twitter sentiment classification. In Proceedings of the 49th Annual Meeting of As-
sociation for Computational Linguistics, 2011.
[Jin et al., 2011] Ou Jin, Nathan N. Liu, Kai Zhao, Yong Yu, and Qiang Yang. Transferring topical
knowledge from auxiliary long texts for short text clustering. In Proceedings of the 20th ACM
international conference on Information and knowledge management, 2011.
[Kauchak and Barzilay, 2006] David Kauchak and Regina Barzilay. Paraphrasing for automatic
evaluation. In Proceedings of the Human Language Technology Conference of the North Ameri-
can Chapter of the ACL, 2006.
[Kimmig et al., 2012] Angelika Kimmig, Stephen Bach, Matthias Broecheler, Bert Huang, and
Lise Getoor. A short introduction to probabilistic soft logic. In Proceedings of the NIPS Work-
shop on Probabilistic Programming: Foundations and Applications, 2012.
[Klein and Manning, 2003] Dan Klein and Christopher D Manning. Accurate unlexicalized pars-
ing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-
Volume 1, pages 423–430. Association for Computational Linguistics, 2003.
[Kulis and Grauman, 2012] Brian Kulis and Kristen Grauman. Kernelized locality-sensitive hash-
ing. IEEE Transactions On Pattern Analysis and Machine Intelligence, 34(6):1092–1104, 2012.
[Lapata and Barzilay, 2005] Mirella Lapata and Regina Barzilay. Automatic evaluation of text co-
herence: Models and representations. In Proceedings of the 19th International Joint Conference
on Artificial Intelligence, 2005.
[Leacock and Chodorow, 1998] Claudia Leacock and Martin Chodorow. Combining local context
and wordnet similarity for word sense identification. Fellbaum, C., ed., WordNet: An electronic
lexical database, 1998.
[Lee and Park, 2007] H. J. Lee and Sung Joo Park. Moners: A news recommender for the mobile
web. Expert Systems with Applications, 2007.
[Lee et al., 2005] Michael David Lee, BM Pincombe, and Matthew Brian Welsh. An empirical
evaluation of models of text document similarity. In Proceedings of the 27th Annual Conference
of the Cognitive Science Society, 2005.
[Lesk, 1986] Michael Lesk. Automatic sense disambiguation using machine readable dictionaries:
How to tell a pine cone from an ice cream cone. In Proceedings of the ACM SIGDOC Conference,
pages 24–26, 1986.
[Li et al., 2006] Yuhua Li, David McLean, Zuhair A. Bandar, James D. O’Shea, and Keeley Crock-
ett. Sentence similarity based on semantic nets and corpus statistics. IEEE Transaction on
Knowledge and Data Engineering, 18, 2006.
[Li et al., 2010] Linlin Li, Benjamin Roth, and Caroline Sporleder. Topic models for word sense
disambiguation and token-based idiom detection. In Proceedings of the 48th Annual Meeting of
the Association for Computational Linguistics, 2010.
[Li et al., 2012] Hao Li, Yu Chen, Heng Ji, Smaranda Muresan, and Dequan Zheng. Combining
social cognitive theories with linguistic features for multi-genre sentiment analysis. In Proceedings of the 26th Pacific
Asia Conference on Language, Information and Computation, 2012.
[Lin and Hovy, 2003] Chin-Yew Lin and Eduard Hovy. Automatic evaluation of summaries using
n-gram co-occurrence statistics. In Proceedings of the 2003 Conference of the North Ameri-
can Chapter of the Association for Computational Linguistics on Human Language Technology-
Volume 1, 2003.
[Lin, 1998] Dekang Lin. An information-theoretic definition of similarity. In Proceedings of the 15th Interna-
tional Conference on Machine Learning, 1998.
[Lin, 2004] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text
Summarization Branches Out: Proceedings of the ACL-04 Workshop, 2004.
[Liu et al., 2011a] Wei Liu, Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Hashing with graphs.
In Proceedings of the 28th International Conference on Machine Learning, 2011.
[Liu et al., 2011b] Xiaohua Liu, Shaodian Zhang, Furu Wei, and Ming Zhou. Recognizing named entities in tweets. In The Semantic Web: Research and Applications, 2011.
[Liu et al., 2012a] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, and Shih-Fu Chang. Super-
vised hashing with kernels. In Proceedings of IEEE Conference on Computer Vision and Pattern
Recognition, 2012.
[Liu et al., 2012b] Wei Liu, Jun Wang, Yadong Mu, Sanjiv Kumar, and Shih-Fu Chang. Compact
hyperplane hashing with bilinear functions. In Proceedings of the 29th International Conference
on Machine Learning, 2012.
[McCarthy et al., 2004] Diana McCarthy, Rob Koeling, Julie Weeds, and John Carroll. Finding
predominant word senses in untagged text. In Proceedings of the 42nd Meeting of the Association
for Computational Linguistics, 2004.
[McCarthy et al., 2007] Diana McCarthy, Rob Koeling, Julie Weeds, and John Carroll. Unsuper-
vised acquisition of predominant word senses. Computational Linguistics, 2007.
[Mihalcea et al., 2006] Rada Mihalcea, Courtney Corley, and Carlo Strapparava. Corpus-based
and knowledge-based measures of text semantic similarity. In Proceedings of the 21st National
Conference on Artificial Intelligence, 2006.
[Mihalcea, 2005] Rada Mihalcea. Unsupervised large-vocabulary word sense disambiguation with
graph-based algorithms for sequence data labeling. In Proceedings of the Joint Conference on
Human Language Technology and Empirical Methods in Natural Language Processing, pages
411–418, 2005.
[Mikolov et al., 2010] Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev
Khudanpur. Recurrent neural network based language model. In INTERSPEECH 2010, 11th
Annual Conference of the International Speech Communication Association, 2010.
[Miller et al., 1993] George Miller, Claudia Leacock, Randee Tengi, and Ross Bunker. A semantic
concordance. In 3rd DARPA Workshop on Human Language Technology, 1993.
[Mitchell and Lapata, 2008] Jeff Mitchell and Mirella Lapata. Vector-based models of semantic composition. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, 2008.
[Mnih and Hinton, 2007] Andriy Mnih and Geoffrey Hinton. Three new graphical models for sta-
tistical language modelling. In Proceedings of the 24th international conference on Machine
learning, 2007.
[Navigli, 2009] Roberto Navigli. Word sense disambiguation: A survey. ACM Computing Surveys
(CSUR), 2009.
[Nenkova and Passonneau, 2004] Ani Nenkova and Rebecca Passonneau. Evaluating content selection in summarization: The pyramid method. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, 2004.
[Norouzi and Fleet, 2011] Mohammad Norouzi and David J. Fleet. Minimal loss hashing for com-
pact binary codes. In Proceedings of the 28th International Conference on Machine Learning,
2011.
[O’Shea et al., 2008] James O’Shea, Zuhair Bandar, Keeley Crockett, and David McLean. A com-
parative study of two short text semantic similarity measures. In Proceedings of the Agent
and Multi-Agent Systems: Technologies and Applications, Second KES International Symposium
(KES-AMSTA), 2008.
[Palmer et al., 2001] Martha Palmer, Christiane Fellbaum, Scott Cotton, Lauren Delfs, and
Hoa Trang Dang. English tasks: All-words and verb lexical sample. In Proceedings of
SENSEVAL-2: Second International Workshop on Evaluating Word Sense Disambiguation Sys-
tems, 2001.
[Pang and Lee, 2004] Bo Pang and Lillian Lee. A sentimental education: Sentiment analysis using
subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Meeting of the
Association for Computational Linguistics, 2004.
[Pang and Lee, 2005] Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for
sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting
of the Association for Computational Linguistics, 2005.
[Pantel et al., 2007] Patrick Pantel, Rahul Bhagat, Bonaventura Coppola, Timothy Chklovski, and Eduard H. Hovy. Isp: Learning inferential selectional preferences. In Proceedings of Human Language Technologies: The Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2007.
[Papineni et al., 2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a
method for automatic evaluation of machine translation. In Proceedings of the 40th annual
meeting on association for computational linguistics, 2002.
[Passonneau et al., 2013] Rebecca J. Passonneau, Emily Chen, Weiwei Guo, and Dolores Perin.
Automated pyramid scoring of summaries using distributional semantics. In Proceedings of the
51st Annual Meeting of the Association for Computational Linguistics, 2013.
[Patwardhan and Pedersen, 2006] Siddharth Patwardhan and Ted Pedersen. Using wordnet-based
context vectors to estimate the semantic relatedness of concepts. In Proceedings of the EACL
2006 Workshop Making Sense of Sense - Bringing Computational Linguistics and Psycholin-
guistics Together, 2006.
[Patwardhan et al., 2005] Siddharth Patwardhan, Satanjeev Banerjee, and Ted Pedersen. Senserelate::targetword - a generalized framework for word sense disambiguation. In Proceedings of the Demonstration and Interactive Poster Session of the 43rd Annual Meeting of the Association for Computational Linguistics, 2005.
[Pedersen et al., 2004] Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. Wordnet::similarity - measuring the relatedness of concepts. In Proceedings of the Fifth Annual Meeting of the North American Chapter of the Association for Computational Linguistics, 2004.
[Perin et al., 2013] Dolores Perin, Rachel Hare Bork, Stephen T. Peverly, and Linda H. Mason. A
contextualized curricular supplement for developmental reading and writing. Journal of College
Reading and Learning, 2013.
[Petrovic et al., 2010] Sasa Petrovic, Miles Osborne, and Victor Lavrenko. Streaming first story
detection with application to twitter. In Human Language Technologies: The 2010 Annual Con-
ference of the North American Chapter of the Association for Computational Linguistics, 2010.
[Porter, 2001] Martin Porter. Snowball: A language for stemming algorithms, 2001.
[Pradhan et al., 2007] Sameer S. Pradhan, Edward Loper, Dmitriy Dligach, and Martha Palmer.
Semeval-2007 task 17: English lexical sample, srl and all words. In Proceedings of the 4th
International Workshop on Semantic Evaluations (SemEval-2007), 2007.
[Ramage et al., 2009] Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Man-
ning. Labeled lda: A supervised topic model for credit attribution in multi-labeled corpora. In
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing,
2009.
[Ramage et al., 2010] Daniel Ramage, Susan Dumais, and Dan Liebling. Characterizing mi-
croblogs with topic models. In Proceedings of the Fourth International AAAI Conference on
Weblogs and Social Media, 2010.
[Rashtchian et al., 2010] Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier.
Collecting image annotations using amazon’s mechanical turk. In Proceedings of the NAACL
HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk,
2010.
[Ratcliff and Metzener, 1988] John W. Ratcliff and David E. Metzener. Pattern matching: the gestalt approach. Dr. Dobb's Journal, 1988.
[Resnik, 1995] Philip Resnik. Using information content to evaluate semantic similarity in a tax-
onomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence,
1995.
[Resnik, 1996] Philip Resnik. Selectional constraints: An information-theoretic model and its com-
putational realization. Cognition, 1996.
[Resnik, 1997] Philip Resnik. Selectional preference and sense disambiguation. In Proceedings
of the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How?,
1997.
[Riloff et al., 2013] Ellen Riloff, Ashequl Qadir, Prafulla Surve, Lalindra De Silva, Nathan Gilbert,
and Ruihong Huang. Sarcasm as contrast between a positive sentiment and negative situation. In
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013.
[Ritter et al., 2010] Alan Ritter, Oren Etzioni, et al. A latent dirichlet allocation method for selec-
tional preferences. In Proceedings of the 48th Annual Meeting of the Association for Computa-
tional Linguistics, 2010.
[Ritter et al., 2012] Alan Ritter, Oren Etzioni, Sam Clark, et al. Open domain event extraction
from twitter. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge
discovery and data mining, 2012.
[Rooth et al., 1999] Mats Rooth, Stefan Riezler, Detlef Prescher, Glenn Carroll, and Franz Beil.
Inducing a semantically annotated lexicon via em-based clustering. In Proceedings of the 37th
annual meeting of the Association for Computational Linguistics on Computational Linguistics,
1999.
[Sankaranarayanan et al., 2009] Jagan Sankaranarayanan, Hanan Samet, Benjamin E. Teitler,
Michael D. Lieberman, and Jon Sperling. Twitterstand: news in tweets. In Proceedings of
the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information
Systems, 2009.
[Severyn et al., 2013] Aliaksei Severyn, Massimo Nicosia, and Alessandro Moschitti. Learning
semantic textual similarity with structural representations. In Proceedings of the 51st Annual
Meeting of the Association for Computational Linguistics, 2013.
[Shen et al., 2013] Chao Shen, Fei Liu, Fuliang Weng, and Tao Li. A participant-based approach
for event summarization using twitter streams. In Proceedings of NAACL-HLT 2013, 2013.
[Sinclair, 2001] John McHardy Sinclair. Collins COBUILD English dictionary for advanced learn-
ers. HarperCollins, 2001.
[Sinha and Mihalcea, 2007] Ravi Sinha and Rada Mihalcea. Unsupervised graph-based word sense
disambiguation using measures of word semantic similarity. In Proceedings of the IEEE Inter-
national Conference on Semantic Computing, pages 363–369, 2007.
[Snyder and Palmer, 2004] Benjamin Snyder and Martha Palmer. The english all-words task. In
Proceedings of the 3rd International Workshop on the Evaluation of Systems for the Semantic
Analysis of Text (Senseval-3), pages 41–43, 2004.
[Socher et al., 2011a] Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, and
Christopher D. Manning. Dynamic pooling and unfolding recursive autoencoders for paraphrase
detection. In Proceedings of Advances in Neural Information Processing Systems, 2011.
[Socher et al., 2011b] Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2011.
[Speriosui et al., 2011] Michael Speriosui, Nikita Sudan, Sid Upadhyay, and Jason Baldridge.
Twitter polarity classification with label propagation over lexical links and the follower graph.
In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing,
2011.
[Srebro and Jaakkola, 2003] Nathan Srebro and Tommi Jaakkola. Weighted low-rank approxima-
tions. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.
[Steck, 2010] Harald Steck. Training and testing of recommender systems on data missing not
at random. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 2010.
[Subercaze et al., 2013] Julien Subercaze, Christophe Gravier, and Frederique Laforest. Towards
an expressive and scalable twitter’s users profiles. In IEEE/WIC/ACM International Joint Con-
ferences on Web Intelligence and Intelligent Agent Technologies, 2013.
[Toutanova et al., 2003] Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer.
Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the
2003 Conference of the North American Chapter of the Association for Computational Linguis-
tics on Human Language Technology, 2003.
[Tsatsaronis et al., 2010] George Tsatsaronis, Iraklis Varlamis, and Michalis Vazirgiannis. Text
relatedness based on a word thesaurus. Journal of Artificial Intelligence Research, 37, 2010.
[Tumasjan et al., 2010] Andranik Tumasjan, Timm Oliver Sprenger, Philipp G. Sandner, and Is-
abell M. Welpe. Predicting elections with twitter: What 140 characters reveal about political
sentiment. In Fourth International AAAI Conference on Weblogs and Social Media, 2010.
[Wang and Manning, 2012] Sida Wang and Christopher D Manning. Baselines and bigrams: Sim-
ple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the
Association for Computational Linguistics: Short Papers-Volume 2, 2012.
[Wang and McCallum, 2006] Xuerui Wang and Andrew McCallum. Topics over time: a non-Markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 2006.
[Weiss et al., 2008] Yair Weiss, Antonio Torralba, and Rob Fergus. Spectral hashing. In Advances
in Neural Information Processing Systems, 2008.
[Wu and Palmer, 1994] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 1994.
[Xu et al., 2014] Wei Xu, Alan Ritter, Chris Callison-Burch, William B Dolan, and Yangfeng Ji.
Extracting lexically divergent paraphrases from twitter. Transactions of the Association for Com-
putational Linguistics, 2014.
[Yan et al., 2012] Rui Yan, Mirella Lapata, and Xiaoming Li. Tweet recommendation with graph
co-ranking. In Proceedings of the 24th International Conference on Computational Linguistics,
2012.
[Yih et al., 2012] Wen-tau Yih, Geoffrey Zweig, and John C. Platt. Polarity inducing latent seman-
tic analysis. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural
Language Processing and Computational Natural Language Learning, 2012.
[Zhou et al., 2006] Liang Zhou, Chin-Yew Lin, Dragos Stefan Munteanu, and Eduard Hovy. Paraeval: Using paraphrases to evaluate summaries automatically. In Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, 2006.