Distributional approaches to word meanings
Chris Potts, Ling 236/Psych 236c: Representations of meaning, Spring 2013
May 9
Overview
1 Foundational assumptions
2 Matrix designs
3 Distance measures
4 Weighting/normalization
5 Dimensionality reduction with Latent Semantic Analysis
6 Clustering and other induced structure
7 (Semi-)supervision
8 Tools
Great power, a great many design choices:
tokenization, annotation, tagging, parsing, feature selection, ...; cluster texts by date/author/discourse context/...
↓
Matrix type: word × document, word × word, word × search proximity, adj. × modified noun, word × dependency rel., verb × arguments, ...
× Weighting: probabilities, length normalization, TF-IDF, PMI, Positive PMI, PPMI with discounting, ...
× Dimensionality reduction: LSA, PLSA, LDA, PCA, IS, DCA, ...
× Vector comparison: Euclidean, Cosine, Dice, Jaccard, KL, KL with skew, ...

(Nearly the full cross-product to explore; only a handful of the combinations are ruled out mathematically, and the literature contains relatively little guidance.)
Ling 236/Psych 236c, Stanford (Potts)
1 Foundational assumptions
Firth (1935:37) on context dependence (cited by Stubbs 1993):
the complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously.
Firth (1957:11) (required quotation for VSM lectures):
You shall know a word by the company it keeps . . .
Harris (1954:34):
All elements in a language can be grouped into classes whose relative occurrence can be stated exactly. However, for the occurrence of a particular member of one class relative to a particular member of another class, it would be necessary to speak in terms of probability, based on the frequency of that occurrence in a sample.
Harris (1954:34):
[I]t is possible to state the occurrence of any element relative to any other element, to the degree of exactness indicated above, so that distributional statements can cover all of the material of a language without requiring support from other types of information.
Harris (1954:34) (anticipating deep learning?):
[T]he restrictions on relative occurrence of each element are described most simply by a network of interrelated statements, certain of them being put in terms of the results of certain others, rather than by a simple measure of the total restriction on each element separately.
Harris (1954:36) on levels of analysis:
Some question has been raised as to the reality of this structure. Does it really exist, or is it just a mathematical creation of the investigator’s? Skirting the philosophical difficulties of this problem, we should, in any case, realize that there are two quite different questions here. One: Does the structure really exist in language? The answer is yes, as much as any scientific structure really obtains in the data which it describes — the scientific structure states a network of relations, and these relations really hold in the data investigated.

Two: Does the structure really exist in speakers? Here we are faced with a question of fact which is not directly or fully investigated in the process of determining the distributional structure. Clearly, certain behaviors of the speakers indicate perception along the lines of the distributional structure, for example, the fact that while people imitate nonlinguistic or foreign-language sounds, they repeat utterances of their own language.
Harris (1954:39) on meaning and context-dependence:
All this is not to say that there is not a great interconnection between language and meaning, in whatever sense it may be possible to use this word. But it is not a one-to-one relation between morphological structure and anything else. There is not even a one-to-one relation between vocabulary and any independent classification of meaning; we cannot say that each morpheme or word has a single central meaning or even that it has a continuous or coherent range of meanings.

[. . . ]

The correlation between language and meaning is much greater when we consider connected discourse.
Harris (1954:43), stating a core assumption of VSMs:
The fact that, for example, not every adjective occurs with every noun can be used as a measure of meaning difference. For it is not merely that different members of the one class have different selections of members of the other class with which they are actually found. More than that: if we consider words or morphemes A and B to be more different than A and C, then we will often find that the distributions of A and B are more different than the distributions of A and C. In other words, difference in meaning correlates with difference in distribution.
Turney & Pantel (2010:153):
Statistical semantics hypothesis: Statistical patterns of human word usage can be used to figure out what people mean (Weaver, 1955; Furnas et al., 1983). – If units of text have similar vectors in a text frequency matrix, then they tend to have similar meanings. (We take this to be a general hypothesis that subsumes the four more specific hypotheses that follow.)

Bag of words hypothesis: The frequencies of words in a document tend to indicate the relevance of the document to a query (Salton et al., 1975). – If documents and pseudo-documents (queries) have similar column vectors in a term–document matrix, then they tend to have similar meanings.

Distributional hypothesis: Words that occur in similar contexts tend to have similar meanings (Harris, 1954; Firth, 1957; Deerwester et al., 1990). – If words have similar row vectors in a word–context matrix, then they tend to have similar meanings.

Extended distributional hypothesis: Patterns that co-occur with similar pairs tend to have similar meanings (Lin & Pantel, 2001). – If patterns have similar column vectors in a pair–pattern matrix, then they tend to express similar semantic relations.

Latent relation hypothesis: Pairs of words that co-occur in similar patterns tend to have similar semantic relations (Turney et al., 2003). – If word pairs have similar row vectors in a pair–pattern matrix, then they tend to have similar semantic relations.
2 Matrix designs
2.1 Word × document
Very sparse. Each column gives the bag-of-words representation of a document. This is the standard design from Web search: after suitable reweighting, the basic idea is to rank documents (columns) according to their values for a given query (set of rows).
(1) Upper left corner of a matrix derived from the training portion of http://ai.stanford.edu/~amaas/data/sentiment/:
2.2 Word × word

Dense. No difference between rows and columns. Diagonal gives word counts. I find these are good for building rich word representations. Derivable from the word × document design as $W = DD^{T}$, where D is the (m × n)-dimensional word × document matrix and W is the (m × m)-dimensional word × word result.
(2) Excerpt of the matrix derived from (1), with columns for against, age, agent, ages, ago, agree, ahead, ain.t, air, aka, al.
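The $W = DD^{T}$ construction above is a one-liner in NumPy. A minimal sketch with invented toy counts (not the movie-review data from (1)); the words and numbers are purely illustrative:

```python
import numpy as np

# Toy word x document count matrix D (m x n): rows are words, columns are
# documents. All counts are invented for illustration.
D = np.array([
    [2, 0, 1],   # e.g. "gnarly"
    [1, 3, 0],   # e.g. "wicked"
    [0, 1, 4],   # e.g. "terrible"
])

# Word x word co-occurrence matrix: W = D D^T. Cell W[i, j] sums, over
# documents, the products of word i's and word j's counts, so words that
# show up in the same documents get large off-diagonal values.
W = D @ D.T

print(W.shape)           # (3, 3)
print((W == W.T).all())  # True: the word x word matrix is symmetric
```

Note that with raw counts, the diagonal cell W[i, i] is the sum of word i's squared per-document counts, so it tracks (but is not identical to) the word's overall frequency.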
(3) Full word × word matrix visualized with t-SNE (van der Maaten & Hinton 2008):
a. Detail of the above:
b. Detail of the above:
2.3 Modified × adverb
Derived from the advmod() pairs in the dependency-parsed version of the NYT section of the Gigaword corpus. Dimensions: (3000 × 3000), chosen by frequency.
(5) Adverb details of a column-wise t-SNE representation
(6) Modified details of a row-wise t-SNE representation
Could be relevant to the generalizations proposed in Kennedy & McNally 2005; Kennedy 2007; Syrett et al. 2009; Syrett & Lidz 2010.
2.4 Interjection × dialog-act
Derived from the Switchboard Dialog Act Corpus. Rows are words tagged as interjections (‘UH’). Columns are DAMSL dialog-act tags. Additional details: http://compprag.christopherpotts.net/swda-clustering.html. Dimensions: (50 × 39).
• A VSM is really just a multidimensional array of reals. The ‘M(a)trix’ suggests a limitation to 2d, but higher-dimensional VSMs have been explored (Van de Cruys 2009; Turney 2007).
• VSMs are invariant under row and column permutations.
• VSMs are insensitive to row/column labeling.
3 Distance measures
Definition 1 (Euclidean distance). Between vectors x and y of dimension n:

$\sqrt{\sum_{i=1}^{n} |x_i - y_i|^2}$

Definition 2 (Vector length). For a vector x of dimension n:

$\|x\| = \sqrt{\sum_{i=1}^{n} x_i^2}$
Definition 3 (Length normalization). For a vector x of dimension n, the length normalization of x, written $\hat{x}$, is obtained by dividing each element of x by $\|x\|$.
Definition 4 (Cosine distance). Between vectors x and y of dimension n:

$1 - \frac{\sum_{i=1}^{n} x_i \cdot y_i}{\|x\| \cdot \|y\|}$
Definition 5 (KL divergence). Between probability distributions p and q:

$D(p \,\|\, q) = \sum_{i=1}^{n} p_i \cdot \log \frac{p_i}{q_i}$
p is the reference distribution. Before calculation, map all 0s to ε.
Definition 6 (Symmetric KL divergence). Between probability distributions p and q:
D(p ‖ q) + D(q ‖ p)
Others (see van Rijsbergen 1979):
• Manhattan distance
• Matching coefficient
• Jaccard distance
• Dice (Dice 1945)
• Jensen–Shannon
• KL with skew (Lee 1999)
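Definitions 1–6 translate directly into NumPy. A minimal sketch; the points A, B, and C are the ones used in Figure 1, so the printed values can be checked against the plots there:

```python
import numpy as np

def euclidean(x, y):
    """Definition 1: square root of the summed squared differences."""
    return np.sqrt(np.sum((x - y) ** 2))

def length_norm(x):
    """Definition 3: divide each element of x by the vector length ||x||."""
    return x / np.sqrt(np.sum(x ** 2))

def cosine_distance(x, y):
    """Definition 4: 1 minus the normalized dot product."""
    return 1.0 - np.dot(x, y) / (np.sqrt(np.sum(x ** 2)) * np.sqrt(np.sum(y ** 2)))

def kl(p, q, eps=1e-10):
    """Definition 5: D(p || q), mapping 0s to a small epsilon first."""
    p = np.where(p == 0, eps, p)
    q = np.where(q == 0, eps, q)
    return np.sum(p * np.log(p / q))

def symmetric_kl(p, q):
    """Definition 6: D(p || q) + D(q || p)."""
    return kl(p, q) + kl(q, p)

# The three points from Figure 1:
A, B, C = np.array([2., 4.]), np.array([10., 15.]), np.array([14., 10.])

print(round(euclidean(B, C), 1))         # 6.4
print(round(euclidean(A, B), 1))         # 13.6
print(round(cosine_distance(B, C), 3))   # 0.065
print(round(cosine_distance(A, B), 3))   # 0.008
```

Note how the two measures disagree: A and B are the most distant pair by Euclidean distance but nearly identical by cosine distance, because cosine ignores vector length.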
(a) VSM:
       dx    dy
  A     2     4
  B    10    15
  C    14    10

(b) Length normed:
       dx    dy
  A  0.45  0.89
  B  0.55  0.83
  C  0.81  0.58

(c) Probabilities:
       dx    dy
  A  0.33  0.67
  B  0.40  0.60
  C  0.58  0.42

(d) Euclidean distance: $\sqrt{(10-14)^2 + (15-10)^2} = 6.4$ for B and C; $\sqrt{(2-10)^2 + (4-15)^2} = 13.6$ for A and B.

(e) Cosine distance: $1 - \frac{(10 \times 14) + (15 \times 10)}{\|(10,15)\| \times \|(14,10)\|} = 0.065$ for B and C; $1 - \frac{(2 \times 10) + (4 \times 15)}{\|(2,4)\| \times \|(10,15)\|} = 0.008$ for A and B.

(f) Euclidean distance, length normed vectors: $\sqrt{(0.55-0.81)^2 + (0.83-0.58)^2} = 0.36$ for B and C; $\sqrt{(0.45-0.55)^2 + (0.89-0.83)^2} = 0.12$ for A and B.

(g) Symmetric KL divergence: 0.1 for B and C; 0.001 for A and B.

Figure 1: Euclidean distance and cosine distance compared.
4 Weighting/normalization
4.1 Length-norms and probability distributions
Length-norming vectors (row- or column-wise) is a kind of reweighting scheme, as is turning them into probability distributions. Both of these methods exaggerate estimates for small counts, because they ignore differences in magnitude. In a perfect world, this might be what we want. In the highly imperfect world we live in, small counts are often misleading.
(a) VSM:
       d1    d2    d3
  A     1     2     3
  B    10    20    30
  C   100   200   300

(b) P(d|w):
       d1    d2    d3
  A  0.17  0.33  0.50
  B  0.17  0.33  0.50
  C  0.17  0.33  0.50

(c) Length norm by row:
       d1    d2    d3
  A  0.27  0.53  0.80
  B  0.27  0.53  0.80
  C  0.27  0.53  0.80

Figure 2: Norming as weighting.
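Figure 2’s point is easy to verify numerically: under either scheme, the three rows become indistinguishable even though their raw counts differ by two orders of magnitude. A small sketch:

```python
import numpy as np

# The count matrix from Figure 2 (a).
M = np.array([[1., 2., 3.],
              [10., 20., 30.],
              [100., 200., 300.]])

# Row-wise probabilities P(d | w): each row divided by its sum.
probs = M / M.sum(axis=1, keepdims=True)

# Row-wise length norming: each row divided by its Euclidean length.
normed = M / np.sqrt((M ** 2).sum(axis=1, keepdims=True))

# Rows A and C now look identical under both reweightings.
print(np.allclose(probs[0], probs[2]))    # True
print(np.allclose(normed[0], normed[2]))  # True
```

This is exactly the magnitude-blindness the section warns about: if row A’s counts were tiny and unreliable, both schemes would still treat it exactly like the well-attested row C.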
4.2 Term Frequency–Inverse Document Frequency (TF-IDF)
Definition 7 (TF-IDF). For a corpus of documents D: tf-idf(w, d) = tf(w, d) · log(|D| / df(w)), where tf(w, d) is the frequency of w in d and df(w) is the number of documents in D that contain w.

Definition 10 (Positive PMI). Set all PMI values < 0 to 0. (Effectively combined with discounting.)
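Both weighting schemes are straightforward to implement. A minimal sketch assuming the standard TF-IDF formulation and the usual PMI definition, $\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)P(c)}$ (the contextual-discounting refinement mentioned elsewhere in the handout is omitted here):

```python
import numpy as np

def tfidf(M):
    """TF-IDF reweighting of a word x document count matrix.
    tf-idf(w, d) = count(w, d) * log(n_docs / df(w))."""
    n_docs = M.shape[1]
    df = (M > 0).sum(axis=1, keepdims=True)  # documents containing each word
    return M * np.log(n_docs / df)

def ppmi(M, eps=1e-12):
    """Positive PMI (Definition 10): PMI from joint and marginal
    probabilities, with all negative values set to 0."""
    joint = M / M.sum()
    p_rows = joint.sum(axis=1, keepdims=True)
    p_cols = joint.sum(axis=0, keepdims=True)
    pmi = np.log((joint + eps) / (p_rows * p_cols + eps))
    return np.maximum(pmi, 0.0)

# Toy counts, invented for illustration.
counts = np.array([[10., 0., 2.],
                   [1., 5., 1.],
                   [0., 3., 6.]])
weighted = ppmi(counts)
print((weighted >= 0).all())  # True: negative PMI values are clipped to 0
```

One property worth noticing: a word that occurs in every document gets IDF log(1) = 0, so TF-IDF zeroes out its entire row, which is the intended treatment of uninformative high-frequency words.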
4.4 Others
• t-test: $\frac{p(w,d) - p(w)p(d)}{\sqrt{p(w)p(d)}}$
• Observed/Expected and associated χ² or G statistics.

• TF-IDF variants that seek to be sensitive to the empirical distribution of words (Church & Gale 1995; Manning & Schütze 1999:553; Baayen 2001)
5 Dimensionality reduction with Latent Semantic Analysis
The goal of dimensionality reduction is to eliminate rows/columns that are highly correlated while bringing similar things together and pushing dissimilar things apart. Latent Semantic Analysis (LSA) is a relatively simple method for doing this that generally yields high-quality word and document representations.1
LSA (Deerwester et al. 1990) is built upon singular value decomposition:
Theorem 1 (Singular value decomposition). For any matrix of real numbers A of dimension (m × n) there exists a factorization into matrices T, S, D such that

$A_{m \times n} = T_{m \times n}\, S_{n \times n}\, D^{T}_{n \times n}$
The matrices T and D are orthonormal: their columns are length-normalized and orthogonal to one another (cosine distance of 1). The singular-value matrix S is a diagonal matrix arranged by size, with S[1,1] corresponding to the dimension of greatest variability, S[2,2] to the next greatest, and so forth. The algorithm for finding this factorization uses some tools from matrix algebra that I think we won’t cover here. Baker 2005 is an example-driven review of the methods.
LSA (truncated SVD) derives a k-dimensional approximation of the original matrix A:
Definition 11 (LSA). Let $T_{m \times n}$, $S_{n \times n}$, and $D_{n \times n}$ be an SVD factorization. The row-wise approximation to dimension k is

$(TS)[1{:}m, 1{:}k]$

and the column-wise approximation to dimension k is

$(SD^{T})[1{:}k, 1{:}n]$
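Definition 11 can be carried out with NumPy’s built-in SVD. A minimal sketch on an invented toy matrix (NumPy names the three factors U, s, and V^T rather than T, S, and D^T):

```python
import numpy as np

# Toy word x document matrix A; the counts are invented for illustration.
A = np.array([[2., 0., 1., 0.],
              [1., 3., 0., 1.],
              [0., 1., 4., 2.]])

# SVD as in Theorem 1. With full_matrices=False the shapes match the
# handout's: T is (m x n), S is (n x n), Dt is D^T. The singular values
# come back sorted from largest to smallest.
T, s, Dt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)

# Definition 11 with k = 2: keep only the first k latent dimensions.
k = 2
word_vecs = (T @ S)[:, :k]   # row-wise (word) approximation, (TS)[1:m, 1:k]
doc_vecs = (S @ Dt)[:k, :]   # column-wise (document) approximation, (S D^T)[1:k, 1:n]

# Sanity check: the full factorization reconstructs A exactly.
print(np.allclose(T @ S @ Dt, A))  # True
```

Truncating to k dimensions gives the best rank-k approximation of A in the least-squares sense, which is the formal version of “least information loss for a given k” below.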
Fig. 7 uses a simple linear regression to show how dimensionality reduction can move points that are far apart in high-dimensional space close together in the lower-dimensional space (Manning & Schütze 1999:§15.4). In the 2d space, points B and D are far apart (dissimilar). When we project from the 2d space onto the 1d line, B and D are close together (similar). LSA is able to do this as well, which is its greatest strength for VSMs.
Additionally, the regression line captures the direction of greatest variability for the 2d data. Equivalently, the vector of adjustments (residuals), given by the red arrows, is orthogonal to the regression vector. For LSA, this same notion of orthogonality ensures the least information loss possible for a given k.
Like least-squares regression, LSA should be used only for normally distributed data. Count VSMs will not be normally distributed, but VSMs reweighted by PMI are generally roughly normally distributed.
1 Other methods include Principal Components Analysis (PCA; very similar to LSA), Latent Dirichlet Allocation (LDA), which can derive a probabilistic word × topic VSM (Blei et al. 2003; Steyvers & Griffiths 2006; Blei 2012), labeled LDA (Ramage et al. 2009, 2010), and t-Distributed Stochastic Neighbor Embedding (t-SNE), a PCA-like method oriented towards projecting into 2d or 3d space (van der Maaten & Hinton 2008). See also Turney & Pantel 2010:160.
Figure 7: Least-squares regression as dimensionality reduction. B and D are far apart in the 2d space but close when the data are projected onto a 1d space.
Figure 8: In this example, LSA captures the similarity between gnarly and wicked, even though the two never occur together in a document.
6 Clustering and other induced structure
Clustering algorithms partition the data, which can be valuable, but it is inevitably a compromise, since it imposes boundaries on a continuous space. See Manning et al. 2009:§16–17 for an overview of clustering methods for VSMs.
6.1 Centroid/mean-based methods
K-means clustering works by randomly picking k mean values, assigning every item to the closest of those means, and then recalculating the means for those new clusters. This process repeats iteratively until the mean values stop changing.
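The loop just described fits in a few lines of NumPy. A minimal sketch (random initialization by sampling k data points; real implementations add smarter seeding and restarts):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means: pick k initial means at random from the data, assign
    each point to its closest mean, recompute the means, and repeat until
    the means stop changing."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each row of X to the nearest mean (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each cluster's mean; keep the old mean if a cluster empties.
        new_means = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                              else means[j] for j in range(k)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels, means

# Two well-separated blobs of points; k-means should split them cleanly.
X = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 10])
labels, _ = kmeans(X, k=2)
print(len(set(labels[:5])) == 1 and len(set(labels[5:])) == 1)  # True
```

Note that k-means requires k to be fixed in advance and finds only a local optimum, which is why the model-based alternatives mentioned below are sometimes preferred.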
Figure 9: t-SNE visualization (same as (8)), with k-means clusters (k = 10) indicated by color.
6.2 Hierarchical clustering
Figure 10: Hierarchical clustering of the phonological data, with ten clusters imposed (red boxes).
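For a sense of the bottom-up procedure behind dendrograms like Figure 10, here is a deliberately naive agglomerative sketch with single linkage (every point starts as its own cluster; the two closest clusters merge until the requested number remains). Production work would use an optimized library routine rather than this O(n³) loop:

```python
import numpy as np

def single_linkage(X, n_clusters):
    """Naive agglomerative clustering: repeatedly merge the two clusters
    whose closest members are nearest (single linkage) until only
    n_clusters remain."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Two tight pairs and one outlier; invented points for illustration.
X = np.array([[0., 0.], [0.1, 0.], [5., 5.], [5.1, 5.], [9., 0.]])
print(sorted(map(sorted, single_linkage(X, 3))))  # [[0, 1], [2, 3], [4]]
```

Unlike k-means, the full merge history yields a tree, so a single run supports any number of clusters: cutting the tree at different heights gives different partitions, as the red boxes in Figure 10 illustrate for ten.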
6.3 Others
• With a distance matrix derived from a VSM, one can use network methods to discover sub-communities and identify nodes with special network properties (centrality, in-degree, out-degree, PageRank, etc.).

• Model-based clustering methods typically do not require a cluster count to be specified; rather, they assume a specific model and then seek the clustering that best matches that assumption.
7 (Semi-)supervision
The VSMs we’ve seen so far are not grounded. However, where information about the world is available, we can bring it in through a mix of unsupervised and supervised methods.
7.1 The semantic-orientation method (Turney & Littman 2003)
i. Get your VSM into shape by weighting and/or dimensionality reduction.
ii. Define two seed-sets S1 and S2 of words (they should be opposing in some way that is appropriate for your matrix).
iii. For a given distance metric dist and word w, compute:

$\sum_{w' \in S_1} \mathrm{dist}(w, w') \;-\; \sum_{w' \in S_2} \mathrm{dist}(w, w')$
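Step iii is just two summed distances. A minimal sketch with cosine distance and invented 2d stand-ins for the seed vectors (a real run would use rows of the weighted VSM for the actual seed words):

```python
import numpy as np

def cosine_distance(x, y):
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def semantic_orientation(w, s1_vecs, s2_vecs, dist=cosine_distance):
    """Step iii: total distance to the S1 seeds minus total distance to the
    S2 seeds. Words that sit closer to S2 than to S1 get high scores."""
    return (sum(dist(w, s) for s in s1_vecs) -
            sum(dist(w, s) for s in s2_vecs))

# Invented toy vectors: two "S1" seeds pointing one way, two "S2" seeds
# pointing the other, and a target word resembling the S2 seeds.
s1 = [np.array([1., 0.]), np.array([0.9, 0.1])]
s2 = [np.array([0., 1.]), np.array([0.1, 0.9])]
w = np.array([0.05, 1.])

print(semantic_orientation(w, s1, s2) > 0)  # True: w patterns with S2
```

Ranking the whole vocabulary by this score, with sentiment seed sets, is what produces lists like the one in Figure 11.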
Turney & Littman’s (2003:343) hypothesis
The ideas in SO-A can likely be extended to many other semantic aspects of words. The General Inquirer lexicon has 182 categories of word tags [Stone et al. 1966] and this paper has only used two of them, so there is no shortage of future work.
For related ideas, see Blair-Goldensohn et al. 2008 and Velikovich et al. 2010.
Figure 11: Top negative and positive words using the semantic orientation method. Seed sets from Turney & Littman (2003): S1 = {bad, nasty, poor, negative, unfortunate, wrong, inferior} and S2 = {good, nice, excellent, positive, fortunate, correct, superior}. The VSM from sec. 2.2, weighted with positive PMI with contextual discounting (sec. 4.3).
7.2 Word vectors as classifier features
In the following simple example, I use word-level pos/neg labels as an approximation of the context, combining it with a VSM to make predictions about unlabeled words. This is a glimpse of what’s to come (in our final meeting; Collobert & Weston 2008; Martineau & Finin 2009; Turian et al. 2010; Maas & Ng 2010; Maas et al. 2011; Socher et al. 2011, 2012).
Figure 13: Values derived from a PMI-weighted word × word matrix and then used as features in a logistic regression fit on just the training set. The identities of the test examples are, from top to bottom, bad, horrible, great, and best.
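The setup behind Figure 13 can be sketched end to end: treat each word’s vector as a feature vector, fit a logistic regression on the labeled training words, and score unlabeled words. The 2d vectors and labels below are invented stand-ins for rows of the PMI-weighted matrix, and the classifier is fit with plain batch gradient descent rather than a library routine:

```python
import numpy as np

# Invented 2d "word vectors" for labeled training words.
train_X = np.array([[1., 0.], [0.9, 0.2], [0., 1.], [0.1, 0.9]])
train_y = np.array([0, 0, 1, 1])  # 0 = negative, 1 = positive

# Logistic regression fit by batch gradient descent on the cross-entropy loss.
w = np.zeros(2)
b = 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(train_X @ w + b)))   # predicted P(positive)
    grad_w = train_X.T @ (p - train_y) / len(train_y)
    grad_b = (p - train_y).mean()
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

# Score an unlabeled test word whose vector resembles the positive words.
test_vec = np.array([0.05, 0.95])
prob_pos = 1.0 / (1.0 + np.exp(-(test_vec @ w + b)))
print(prob_pos > 0.5)  # True
```

The point of the exercise is that the classifier generalizes to unlabeled words purely through their distributional vectors: nothing about the test word is used except where it sits in the space.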
8 Tools
• R has everything you need for matrices that will fit into memory.
• See Turney & Pantel 2010:§5 for lots of open-source projects.
• Python NLTK’s text and cluster: http://www.nltk.org/
• Python’s gensim is excellent for massive VSMs: http://radimrehurek.com/gensim/
• MALLET and FACTORIE: http://people.cs.umass.edu/~mccallum/code.html
References

Baayen, R. Harald. 2001. Word frequency distributions. Dordrecht: Kluwer Academic Publishers.
Baker, Kirk. 2005. Singular value decomposition tutorial. Ms., The Ohio State University. http://www.ling.ohio-state.edu/~kbaker/pubs/Singular_Value_Decomposition_Tutorial.pdf.
Blair-Goldensohn, Sasha, Kerry Hannan, Ryan McDonald, Tyler Neylon, George A. Reis & Jeff Reynar. 2008. Building a sentiment summarizer for local service reviews. In WWW workshop on NLP in the information explosion era (NLPIX), Beijing, China.
Blei, David M. 2012. Probabilistic topic models. Communications of the ACM 55(4). 77–84.
Blei, David M., Andrew Y. Ng & Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3. 993–1022.
Church, Kenneth Ward & William Gale. 1995. Inverse document frequency (IDF): A measure of deviations from Poisson. In David Yarowsky & Kenneth Church (eds.), Proceedings of the third ACL workshop on very large corpora, 121–130. The Association for Computational Linguistics.
Collobert, Ronan & Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on machine learning (ICML ’08), 160–167. New York: ACM. doi:10.1145/1390156.1390177.
Van de Cruys, Tim. 2009. A non-negative tensor factorization model for selectional preference induction. In Proceedings of the workshop on geometrical models of natural language semantics, 83–90. Athens, Greece: ACL.
Deerwester, S., S. T. Dumais, G. W. Furnas, T. K. Landauer & R. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6). 391–407. doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9.
Dice, Lee R. 1945. Measures of the amount of ecologic association between species. Ecology 26(3). 267–302.
Firth, John R. 1935. The technique of semantics. Transactions of the Philological Society 34(1). 36–73.
Firth, John R. 1957. A synopsis of linguistic theory 1930–1955. In Studies in linguistic analysis, 1–32. Oxford: Blackwell.
Furnas, G. W., Thomas K. Landauer, L. M. Gomez & S. T. Dumais. 1983. Statistical semantics: Analysis of the potential performance of keyword information systems. Bell System Technical Journal 62(6). 1753–1806.
Harris, Zellig. 1954. Distributional structure. Word 10(23). 146–162.
Kennedy, Christopher. 2007. Vagueness and grammar: The semantics of relative and absolute gradable adjectives. Linguistics and Philosophy 30(1). 1–45.
Kennedy, Christopher & Louise McNally. 2005. Scale structure and the semantic typology of gradable predicates. Language 81(2). 345–381.
Lee, Lillian. 1999. Measures of distributional similarity. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics, 25–32. College Park, MD: ACL. doi:10.3115/1034678.1034693.
Lin, Dekang & Patrick Pantel. 2001. DIRT – discovery of inference rules from text. In Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining, 323–328. New York: ACM.
Maas, Andrew, Andrew Ng & Christopher Potts. 2011. Multi-dimensional sentiment analysis with learned representations. Ms., Stanford University.
Maas, Andrew L. & Andrew Y. Ng. 2010. A probabilistic model for semantic word vectors. In NIPS 2010 workshop on deep learning and unsupervised feature learning.
van der Maaten, Laurens & Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9. 2579–2605.
Manning, Christopher D., Prabhakar Raghavan & Hinrich Schütze. 2009. An introduction to information retrieval. Cambridge University Press.
Manning, Christopher D. & Hinrich Schütze. 1999. Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
Martineau, Justin & Tim Finin. 2009. Delta TFIDF: An improved feature space for sentiment analysis. In Third international AAAI conference on weblogs and social media, 258–261. AAAI Press.
Pantel, Patrick & Dekang Lin. 2002. Discovering word senses from text. In Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining (KDD ’02), 613–619. New York: ACM. doi:10.1145/775047.775138.
Ramage, Daniel, Susan Dumais & Dan Liebling. 2010. Characterizing microblogs with topic models. In Proceedings of the international AAAI conference on weblogs and social media, 130–137. Washington, D.C.: Association for the Advancement of Artificial Intelligence.
Ramage, Daniel, David Hall, Ramesh Nallapati & Christopher D. Manning. 2009. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 conference on empirical methods in natural language processing, 248–256. Singapore: ACL.
van Rijsbergen, Cornelis Joost. 1979. Information retrieval. London: Butterworths.
Salton, Gerald, Andrew Wong & Chung-Shu Yang. 1975. A vector space model for automatic indexing. Communications of the ACM 18(11). 613–620.
Socher, Richard, Brody Huval, Christopher D. Manning & Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 conference on empirical methods in natural language processing, 1201–1211. Stroudsburg, PA: ACL.
Socher, Richard, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng & Christopher D. Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the 2011 conference on empirical methods in natural language processing, 151–161. Edinburgh, Scotland, UK: ACL.
Steyvers, Mark & Tom Griffiths. 2006. Probabilistic topic models. In Thomas K. Landauer, D. McNamara, S. Dennis & W. Kintsch (eds.), Latent semantic analysis: A road to meaning. Lawrence Erlbaum Associates.
Stone, Philip J., Dexter C. Dunphry, Marshall S. Smith & Daniel M. Ogilvie. 1966. The General Inquirer: A computer approach to content analysis. Cambridge, MA: MIT Press.
Stubbs, Michael. 1993. British traditions in text analysis — from Firth to Sinclair. In Mona Baker, Gill Francis & Elena Tognini-Bonelli (eds.), Text and technology: In honour of John Sinclair, 1–33. John Benjamins.
Syrett, Kristen, Christopher Kennedy & Jeffrey Lidz. 2009. Meaning and context in children’s understanding of gradable adjectives. Journal of Semantics 27(1). 1–35.
Syrett, Kristen & Jeffrey Lidz. 2010. 30-month-olds use the distribution and meaning of adverbs to interpret novel adjectives. Language Learning and Development 6(4). 258–282.
Turian, Joseph, Lev-Arie Ratinov & Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the Association for Computational Linguistics, 384–394. Uppsala, Sweden: ACL.
Turney, Peter D. 2007. Empirical evaluation of four tensor decomposition algorithms. Tech. Rep. ERB-1152, Institute for Information Technology, National Research Council of Canada.
Turney, Peter D. & Michael L. Littman. 2003. Measuring praise and criticism: Inference of semantic orientation from association. ACM Transactions on Information Systems (TOIS) 21. 315–346. doi:10.1145/944012.944013.
Turney, Peter D., Michael L. Littman, Jeffrey Bigham & Victor Shnayder. 2003. Combining independent modules to solve multiple-choice synonym and analogy problems. In Proceedings of the international conference on recent advances in natural language processing, 482–489. Borovets, Bulgaria.
Turney, Peter D. & Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research 37. 141–188.
Velikovich, Leonid, Sasha Blair-Goldensohn, Kerry Hannan & Ryan McDonald. 2010. The viability of web-derived polarity lexicons. In Human language technologies: The 2010 annual conference of the North American chapter of the Association for Computational Linguistics, 777–785. Los Angeles: ACL.
Weaver, Warren. 1955. Translation. In William N. Locke & A. Donald Booth (eds.), Machine translation of languages: Fourteen essays, 15–23. Cambridge, MA: MIT Press.