Similarity of Semantic Relations
Peter D. Turney National Research Council Canada
Presented by: Jennifer Lee
November 14, 2008
CSI 5386
Attributional Similarity
Two words, A and B, with a high degree of attributional similarity are called synonyms.
An example of a typical synonym question from the TOEFL exam:
Stem: levied
Choices: (a) imposed
(b) believed
(c) requested
(d) correlated
Solution: (a) imposed
Attributional Similarity
A measure of attributional similarity: sim_a(A, B) ∈ R.
Semantic relatedness vs. semantic distance:
Semantic relatedness is a more general concept than similarity, and is the same as attributional similarity; semantic distance is its inverse.
Attributional Similarity
Example of semantic relatedness:
Similar entities: (bank, trust company)
Dissimilar but related entities:
Meronymy: (car, wheel)
Antonymy: (hot, cold)
Any functional relationship/frequent association: (pencil, paper), (penguin, Antarctica)
Attributional Similarity
Types of attributional similarity:
Semantically associated: (bee, honey)
Semantically similar: (deer, pony)
Both: (doctor, nurse)
The term semantic similarity is misleading: it refers to a type of attributional similarity, yet relational similarity is no less semantic than attributional similarity. Hence, we use the term taxonomical similarity.
Relational Similarity
Relational similarity: when two pairs of words have a high degree of relational similarity, we say they are analogous.
Measured by: sim_r(A:B, C:D) ∈ R
A:B::C:D means "A is to B as C is to D".
Verbal Analogy
Examples: traffic:street::water:riverbed mason:stone::carpenter:wood
It seems that, in the second example, the relational similarity can be reduced to attributional similarity.
Verbal Analogy
A typical analogy question from the SAT:
Stem: mason:stone
Choices: (a) teacher:chalk
(b) carpenter:wood
(c) soldier:gun
(d) photograph:camera
(e) book:word
Solution: (b) carpenter:wood
Near Analogy
Near analogy: when there is a high degree of relational similarity between two word pairs, A:B and C:D, and there is also a high degree of attributional similarity between A and C, and between B and D.
Otherwise, it is a far analogy.
Which of these pairs is a near analogy?
(mason:stone::carpenter:wood)
(traffic:street::water:riverbed)
Measures of Attributional Similarity
Many algorithms have been proposed; measures of attributional similarity have been studied extensively.
Applications: recognizing synonyms, information retrieval, determining semantic orientation, grading student essays, measuring textual cohesion, and word sense disambiguation.
Measuring Attributional Similarity
Algorithms: lexicon-based, corpus-based, or a hybrid of the two.
We might expect lexicon-based algorithms to be better at capturing synonymy than corpus-based algorithms, but this is not the case.
Measuring Attributional Similarity
Performance of attributional similarity measures on the 80 TOEFL questions:

Reference                      Description                   Percent correct
Jarmasz and Szpakowicz (2003)  Best lexicon-based algorithm  78.75
Terra and Clarke (2003)        Best corpus-based algorithm   81.25
Turney et al. (2003)           Best hybrid algorithm         97.5
Landauer and Dumais (1997)     Average human score           64.5
Measures of Relational Similarity
Measures of relational similarity are not well developed, and their potential applications are not as well known.
Many problems that involve semantic relations would benefit from an algorithm for measuring relational similarity: NLP, information retrieval, and information extraction.
Using Attributional Similarity to Solve Analogies
We could score each candidate analogy by the average of the attributional similarity, sim_a, between A and C and between B and D:

score(A:B::C:D) = (1/2) (sim_a(A, C) + sim_a(B, D))

Performance of algorithms was measured by precision, recall, and F.
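As a sketch, the scoring rule above is a one-liner; the attributional similarity measure sim_a is supplied by the caller, and the toy_sim function here is purely illustrative, not a real measure:

```python
def analogy_score(sim_a, a, b, c, d):
    """Score the candidate analogy A:B::C:D as the average of the
    attributional similarities sim_a(A, C) and sim_a(B, D)."""
    return 0.5 * (sim_a(a, c) + sim_a(b, d))

# Toy similarity function for illustration only (not a real measure):
toy_sim = lambda x, y: 1.0 if x == y else 0.5

print(analogy_score(toy_sim, "mason", "stone", "carpenter", "wood"))  # 0.5
```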
Using Attributional Similarity to Solve Analogies
precision = (number of correct guesses) / (total number of guesses made)

recall = (number of correct guesses) / (maximum possible number correct)

F = (2 × precision × recall) / (precision + recall)
Using Attributional Similarity to Solve Analogies
For example, using the algorithm of Hirst and St-Onge (1998), out of 374 SAT analogy questions, 120 questions were answered correctly, 224 incorrectly, and 30 questions were skipped. Precision was 120/(120 + 224) = 34.9% and recall was 120/(120 + 224 + 30) = 32.1%.
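The worked example above can be checked directly (a minimal sketch using the counts from the Hirst and St-Onge result):

```python
def precision_recall_f(correct, incorrect, skipped):
    """Precision, recall, and F measure for an analogy solver that may
    skip questions: skipped questions hurt recall but not precision."""
    precision = correct / (correct + incorrect)
    recall = correct / (correct + incorrect + skipped)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Hirst and St-Onge (1998) on the 374 SAT questions:
p, r, f = precision_recall_f(120, 224, 30)
print(round(100 * p, 1), round(100 * r, 1), round(100 * f, 1))  # 34.9 32.1 33.4
```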
Performance of attributional similarity measures on the 374 SAT questions. The bottom two rows are included for comparison.
Using Attributional Similarity to Solve Analogies
Algorithm                    Type              Precision  Recall  F
Hirst and St-Onge (1998)     Lexicon-based     34.9       32.1    33.4
Jiang and Conrath (1997)     Hybrid            29.8       27.3    28.5
Leacock and Chodorow (1998)  Lexicon-based     32.8       31.3    32.0
Lin (1998b)                  Hybrid            31.2       27.3    29.1
Resnik (1995)                Hybrid            35.7       33.2    34.4
Turney (2001)                Corpus-based      35.0       35.0    35.0
Turney and Littman (2005)    Relational (VSM)  47.7       47.1    47.4
Random                       Random            20.0       20.0    20.0
Using Attributional Similarity to Solve Analogies
We conclude that there are enough near analogies in the 374 SAT questions for attributional similarity to perform better than random guessing.
But not enough near analogies for attributional similarity to perform as well as relational similarity.
Recognizing Word Analogies
First attempted by a system called Argus, using a small hand-built semantic network.
Argus was based on a spreading activation model and did not explicitly attempt to measure relational similarity; therefore, it could only solve a limited set of analogy questions.
Recognizing Word Analogies
Turney et al. (2003) combined 13 independent modules to answer SAT questions. The VSM module was the best of the 13, achieving a score of 47%.
Veale (2004) applied a lexicon-based approach to the same 374 SAT questions, attaining a score of 43%. WordNet was used to compute a quality measure based on the similarity between the A:B paths and the C:D paths.
Latent Relational Analysis
Turney (2005) introduced Latent Relational Analysis (LRA), an enhanced version of the VSM approach to measure relational similarity.
LRA has potential in many areas, including information extraction, word sense disambiguation, and information retrieval.
LRA relies on three resources: a search engine with a large corpus of text, a thesaurus of synonyms, and an efficient implementation of SVD.
Structure Mapping Theory
The most influential work on modeling of analogy-making, implemented in the Structure Mapping Engine (SME).
Produces an analogical mapping between the source and target domain. Uses predicate logic.
Example analogy:
Source domain: the solar system (basic objects are the sun and planets)
Target domain: Rutherford's model of the atom (basic objects are the nucleus and electrons)
Structure Mapping Theory
Each individual connection in an analogical mapping implies that the connected relations are similar.
Later versions of SME allowed similar, nonidentical relations to match.
Although SME focuses on the mapping process as a whole rather than on measuring similarity between any two particular relations, LRA could enhance the performance of SME, and vice versa.
Metaphor
Novel metaphors can be understood through analogy, but conventional metaphors are simply recalled from memory.
It may be fruitful to combine an algorithm such as Dolan's (1995) for handling conventional metaphors with LRA and SME for handling novel metaphors.
Metaphor
Lakoff and Johnson (1980):
Metaphorical sentence SATstyle verbal analogy
He shot down all of my arguments. aircraft:shoot down::argument:refute
I demolished his argument. building:demolish::argument:refute
You need to budget your time. money:budget::time:schedule
I’ve invested a lot of time in her. money:invest::time:allocate
My mind just isn’t operating today. machine:operate::mind:think
Life has cheated me. charlatan:cheat::life:disappoint
Inflation is eating up our profits. animal:eat::inflation:reduce
Classifying Semantic Relations
The problem is to classify a noun-modifier pair according to the semantic relation between the head noun and the modifier.
Example: laser printer.
Rosario and Hearst (2001) trained a neural network to distinguish 13 classes of semantic relations in the medical domain.
Lexical resources used: MeSH and UMLS.
Each noun-modifier pair is represented by a feature vector.
Classifying Semantic Relations
Nastase and Szpakowicz (2003) classified 600 general noun-modifier pairs using WordNet and Roget's Thesaurus as lexical resources.
Vanderwende (2004) used handbuilt rules, together with a lexical knowledge base.
Any classification of semantic relations employs some implicit notion of relational similarity.
Classifying Semantic Relations
Barker and Szpakowicz (1998) tried a corpus-based approach that explicitly uses a measure of relational similarity.
Moldovan et al. (2004) also used a measure of relational similarity to map each noun and modifier into semantic classes in WordNet, taken from a corpus. The surrounding context in the corpus is used in a word sense disambiguation algorithm to improve the mapping.
Classifying Semantic Relations
Turney and Littman (2005) used the VSM (as the component in a single nearest-neighbour learning algorithm) to measure relational similarity. This paper focuses on LRA.
Lauer (1995) used a corpus-based approach to paraphrase noun-modifier pairs by inserting prepositions. Example: reptile haven → haven for reptiles.
Lapata and Keller (2004) improved the result by using the database of Alta Vista as a corpus.
Word Sense Disambiguation, Information Extraction
If we can identify the relations between a given word and its context, then we can disambiguate the given word.
For example, consider the word plant. Suppose plant appears in some text near food; the semantic relations between plant and food can indicate which sense of plant is intended.
Information extraction: given an input document and a specific relation R, extract all pairs of entities (if any) that have the relation R in the document.
Example: John Smith and Hardcom Corporation.
Information Extraction
With the VSM approach, there would be a training set of labeled examples of the relation. Each example would be represented by a vector of pattern frequencies.
Given two entities, we could construct a vector representing their relation, then measure the relational similarity between the unlabeled vector and each of the labeled training vectors.
Information Extraction andQuestion Answering
This looks like a problem: the training vectors would be relatively dense, while the new unlabeled vector for the two entities would be sparse.
Moldovan et al. (2004) propose to map a given question to a semantic relation, and then search for that relation in a corpus of semantically tagged text.
Automatic Thesaurus Generation
Hearst (1992) presents an algorithm that can automatically generate a thesaurus or dictionary by learning hyponym relations and more.
Hearst (1992) and Berland and Charniak (1999) use manually generated rules to mine text for semantic relations.
Turney and Littman (2005) also use a manually generated set of 64 patterns.
Automatic Thesaurus Generation
Instead of manually generating new rules or patterns for each semantic relation, LRA can automatically learn patterns from a large corpus.
Girju, Badulescu, and Moldovan (2003) present an algorithm for learning meronyms from a corpus. They supplemented manual rules with automatically learned constraints.
Information Retrieval
Veale (2003) proposes to use an algorithm for solving word analogies, based on WordNet, for information retrieval.
Example: Hindu bible → the Vedas.
Focus on the analogy form adjective:noun::adjective:noun. Example: Muslim:mosque::Christian:church.
An unsupervised algorithm for discovering analogies by clustering words from two different corpora has been developed (Marx et al., 2002).
Identifying Semantic Roles
Semantic roles are merely a special case of semantic relations (Moldovan et al.).
Example: semantic frame: statement; semantic roles: speaker, addressee, and message.
It is helpful to view semantic frames and their semantic roles as sets of semantic relations.
Measuring Attributional Similarity with the Vector Space Model
In the VSM approach to information retrieval, queries and documents are represented by vectors.
Elements in these vectors are the frequencies of words in the corresponding queries and documents.
The attributional similarity between a query and a document is measured by the cosine of the angle between their corresponding vectors.
Singular Value Decomposition
LRA enhances the VSM by using SVD to smooth the vectors.
SVD improves the document-query attributional similarity measures.
Measuring Relational Similarity with VSM
Given two unknown relations, R1 (between a pair of words A and B) and R2 (between C and D), we wish to measure the relational similarity between R1 and R2.
First, we need to create vectors: R1 = < r1,1, ...., r1,n > R2 = < r2,1, ...., r2,n >
Measuring Relational Similarity with VSM
The measure of similarity of R1 and R2 is given by the cosine of the angle θ between r1 and r2.
A vector r indicates the relationship between two words X and Y; it is created by counting the frequencies of short phrases containing X and Y.

cosine θ = (Σ_i r1,i · r2,i) / ( √(Σ_i r1,i²) · √(Σ_i r2,i²) ) = (r1 · r2) / ( √(r1 · r1) · √(r2 · r2) ) = (r1 · r2) / (|r1| · |r2|)
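The cosine measure is a few lines of code (a minimal sketch, no external dependencies):

```python
import math

def cosine(r1, r2):
    """Cosine of the angle between two frequency vectors:
    dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(r1, r2))
    norm1 = math.sqrt(sum(a * a for a in r1))
    norm2 = math.sqrt(sum(b * b for b in r2))
    return dot / (norm1 * norm2)

print(cosine([1, 0, 2], [2, 0, 4]))  # parallel vectors -> 1.0
print(cosine([1, 0], [0, 1]))        # orthogonal vectors -> 0.0
```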
Measuring Relational Similarity with the VSM
If the number of hits for a query is x, then the corresponding element in the vector r is log(x + 1).
To answer multiplechoice analogy questions, vectors are created for the stem pair and each choice pair. Then cosines are calculated for the angles between stem pair and each choice pair.
Sample Multiple Choice
This SAT question:
Stem: quart:volume
Choices:
(a) day:night
(b) mile:distance
(c) decade:century
(d) friction:heat
(e) part:whole
Solution: (b) mile:distance
Measuring Relational Similarity with VSM
Turney and Littman (2005) used the Alta Vista search engine to obtain the frequency information needed to build vectors for the VSM. But Alta Vista later changed its policy toward automated searching.
The VSM used the hit count, but LRA uses the number of passages (strings) matching the query.
Measuring Relational Similarity with VSM
For the experiments, the Waterloo MultiText System (WMTS) is used; its corpus has about 5 × 10^10 English words.
Lin's (1998a) automatically generated thesaurus, available online, is used by querying for a word and fetching the resulting list of synonyms.
Lin's thesaurus was generated by parsing a corpus of about 5 × 10^7 words.
Measuring Relational Similarity with VSM
Lin's thesaurus provides a list of words sorted in decreasing order of similarity, which is convenient for LRA.
WordNet, in contrast, provides a list of words grouped by possible senses, with the groups sorted by frequency of senses.
Steps of LRA
Suppose we want to calculate the relational similarity between the pair quart:volume and the pair mile:distance.
LRA consists of 12 steps.
Step 1: Find alternates
For each word pair A:B in the input set, look in Lin's thesaurus for the top num_sim words most similar to A, forming alternate pairs A':B; do the same for B, forming A:B'.
Alternate forms of the original pair quart:volume:

Word pair          Similarity  Frequency  Filtering step
quart:volume       NA          632        Accept (original pair)
pint:volume        0.210       372
gallon:volume      0.159       1500       Accept (top alternate)
liter:volume       0.122       3323       Accept (top alternate)
squirt:volume      0.084       54
pail:volume        0.084       28
vial:volume        0.084       373
pumping:volume     0.073       1386       Accept (top alternate)
ounce:volume       0.071       430
spoonful:volume    0.070       42
tablespoon:volume  0.069       96
quart:turnover     0.229       0
quart:output       0.225       34
quart:export       0.206       7
quart:value        0.203       266
quart:import       0.186       16
quart:revenue      0.185       0
quart:sale         0.169       119
quart:investment   0.161       11
quart:earnings     0.156       0
quart:profit       0.156       24
Steps of LRA
Step 2: Filter alternates
For each alternate pair, send a query to the WMTS to find the frequency of phrases that begin with one member of the pair and end with the other. The phrases cannot have more than max_phrase words (in this case, 5). Select the top num_filter most frequent alternates and discard the remainder.
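Steps 1 and 2 can be sketched as follows; the similar_words and phrase_freq functions are stand-ins for Lin's thesaurus and the WMTS, and the toy data below uses frequencies from the quart:volume example:

```python
def find_and_filter_alternates(pair, similar_words, phrase_freq,
                               num_sim=10, num_filter=3):
    """Step 1: build alternate pairs by swapping in thesaurus neighbours
    for each member.  Step 2: keep the num_filter most frequent
    alternates, plus the original pair."""
    a, b = pair
    alternates = [(a2, b) for a2 in similar_words(a)[:num_sim]]
    alternates += [(a, b2) for b2 in similar_words(b)[:num_sim]]
    alternates.sort(key=phrase_freq, reverse=True)
    return [pair] + alternates[:num_filter]

# Toy stand-ins for Lin's thesaurus and the WMTS phrase frequencies:
thesaurus = {"quart": ["pint", "gallon", "liter"], "volume": ["output"]}
freqs = {("pint", "volume"): 372, ("gallon", "volume"): 1500,
         ("liter", "volume"): 3323, ("quart", "output"): 34}
result = find_and_filter_alternates(
    ("quart", "volume"),
    lambda w: thesaurus.get(w, []),
    lambda p: freqs.get(p, 0))
print(result)
```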
Steps of LRA
Step 3: Find phrases
For each pair, make a list of phrases in the corpus that contain the pair. Query the WMTS for all phrases that begin with one member of the pair and end with the other (in either order); suffixes are ignored.
The phrases cannot have more than max_phrase words, and there must be at least one word between the two members of the pair.
Examples of phrases that contain quart and volume:
quarts liquid volume
quarts of volume
quarts in volume
quart total volume
quart of spray volume
volume in quarts
volume capacity quarts
volume being about two quarts
volume of milk in quarts
volume include measures like quart
Steps of LRA
Step 4: Find patterns
For each phrase found in step 3, build patterns from the intervening words. A pattern is constructed by replacing any, all, or none of the intervening words with wild cards, so a phrase with n words generates 2^(n-2) patterns.
For each pattern, count the number of pairs (original and alternates) with phrases that match the pattern. Keep the top num_patterns (4000 here) most frequent patterns and discard the rest.
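The pattern-generation step can be sketched with itertools (a phrase with n words has n - 2 intervening words, hence 2^(n-2) patterns):

```python
from itertools import product

def patterns(phrase):
    """Generate all patterns for a phrase by replacing any subset of the
    intervening words with the wild card '*'."""
    first, *middle, last = phrase.split()
    result = []
    for mask in product([False, True], repeat=len(middle)):
        inner = ["*" if wild else word
                 for word, wild in zip(middle, mask)]
        result.append(" ".join([first] + inner + [last]))
    return result

print(patterns("quarts in volume"))               # 2 patterns
print(len(patterns("volume of milk in quarts")))  # 2**3 = 8 patterns
```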
Steps of LRA
Step 5: Map pairs to rows
To build matrix X, create a mapping of word pairs to row numbers: for each pair A:B, create one row for A:B and another row for B:A.

Step 6: Map patterns to columns
Create a mapping of the top num_patterns patterns to column numbers: for each pattern P, create one column for "word1 P word2" and another column for "word2 P word1".
Steps of LRA
Step 7: Generate a sparse matrix
Frequencies of various patterns for quart:volume:

                        P = "in"  P = "* of"  P = "of *"  P = "* *"
freq("quart P volume")  4         1           5           19
freq("volume P quart")  10        0           2           16
Steps of LRA
Step 8: Calculate entropy
Let m be the number of rows in matrix X and n be the number of columns. To calculate the entropy of a column, we first convert the column into a vector of probabilities. Let p_ij be the probability of x_ij:

p_ij = x_ij / Σ_k x_kj,   where k runs from 1 to m.
Step 8: cont
The entropy of the jth column is:

H_j = - Σ_k p_kj · log(p_kj)

We give more weight to columns (patterns) with frequencies that vary substantially from one row to the next. Therefore we weight the cell x_ij by

w_j = 1 - H_j / log(m),

which varies from 0 when p_ij is uniform to 1 when the entropy is minimal. We also apply the log transformation to the frequencies: log(x_ij + 1).
Step 8: cont, Step 9
Step 8 (cont): For all i and j, replace the original value x_ij in X by the new value w_j · log(x_ij + 1).
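Step 8 can be sketched directly from the formulas above (plain lists stand in for the sparse matrix):

```python
import math

def entropy_weights(X):
    """Column weights w_j = 1 - H_j / log(m) for an m-by-n matrix X,
    given as a list of rows."""
    m = len(X)
    weights = []
    for j in range(len(X[0])):
        col = [X[i][j] for i in range(m)]
        total = sum(col)
        probs = [x / total for x in col]
        h = -sum(p * math.log(p) for p in probs if p > 0)
        weights.append(1 - h / math.log(m))
    return weights

def transform(X):
    """Replace each x_ij by w_j * log(x_ij + 1)."""
    w = entropy_weights(X)
    return [[w[j] * math.log(x + 1) for j, x in enumerate(row)]
            for row in X]

# A uniform column gets weight 0; a concentrated column gets weight 1.
w = entropy_weights([[2, 5], [2, 0]])
print(w)  # [0.0, 1.0]
```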
Step 9: Apply SVD
SVD decomposes a matrix into a product of three matrices, X = U Σ V^T, where U and V have orthonormal columns and Σ is a diagonal matrix of singular values.
If Σ_k keeps only the top k singular values, then the matrix U_k Σ_k V_k^T is the matrix of rank k that best approximates the original matrix X.
Step 9 and 10
Step 9 (cont): Cosines are computed from dot products, and
X X^T = (U Σ V^T)(U Σ V^T)^T = U Σ V^T V Σ U^T = U Σ (U Σ)^T,
which means we can calculate the cosines of the row vectors with the smaller matrix U Σ.
Step 10: Projection
Project the row vector for each word pair from the original 8,000 dimensions down to k = 300 dimensions by calculating U_k Σ_k.
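Steps 9 and 10 can be sketched with NumPy on a toy matrix (assuming NumPy is available; a rank-1 matrix is captured exactly with k = 1, and row norms are preserved because V^T is orthonormal):

```python
import numpy as np

def project(X, k):
    """Truncated SVD: return U_k * Sigma_k, the row vectors of X
    projected down to k dimensions."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]

# Toy example: a rank-1 matrix, projected to k = 1 dimension.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
P = project(X, 1)
print(P.shape)  # (3, 1)
```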
Step 11
Step 11: Evaluate alternates
Let A:B and C:D be any two word pairs in the input set. From step 2, we have (num_filter + 1)^2 ways to compare a version of A:B with a version of C:D.
Look up the row vectors in U_k Σ_k that correspond to each version and calculate the (num_filter + 1)^2 cosines.
The 16 combinations and their cosines:

Word pairs                     Cosine  Cosine >= original pair
quart:volume::mile:distance    0.525   Yes (original pair)
quart:volume::feet:distance 0.464
quart:volume::mile:length 0.634 Yes
quart:volume::length:distance 0.499
liter:volume::mile:distance 0.736 Yes
liter:volume::feet:distance 0.687 Yes
liter:volume::mile:length 0.745 Yes
liter:volume::length:distance 0.576 Yes
gallon:volume::mile:distance 0.763 Yes
gallon:volume::feet:distance 0.710 Yes
gallon:volume::mile:length 0.781 Yes (highest cosine)
gallon:volume::length:distance 0.615 Yes
pumping:volume::mile:distance 0.412
pumping:volume::feet:distance 0.439
pumping:volume::mile:length 0.446
pumping:volume::length:distance 0.491
Step 12
Step 12: Calculate relational similarity
Find the cosines from step 11 that are greater than or equal to the cosine for the original pairs, and average them.
This filters out poor analogies, which may have slipped through the filtering in step 2.
Averaging the cosines, as opposed to taking the maximum, is intended to provide some resistance to noise.
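Step 12 is a short computation; the cosines below are the 16 values for quart:volume::mile:distance from the table of combinations above, and the result matches the 0.677 average reported for choice (b):

```python
def relational_similarity(original_cosine, all_cosines):
    """Step 12 of LRA: average of the cosines that are greater than or
    equal to the cosine for the original pair of pairs."""
    kept = [c for c in all_cosines if c >= original_cosine]
    return sum(kept) / len(kept)

# The 16 cosines for quart:volume::mile:distance:
cosines = [0.525, 0.464, 0.634, 0.499, 0.736, 0.687, 0.745, 0.576,
           0.763, 0.710, 0.781, 0.615, 0.412, 0.439, 0.446, 0.491]
print(round(relational_similarity(0.525, cosines), 3))  # 0.677
```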
Cosines for the sample SAT question (stem: quart:volume):

                             Average cosines  Original cosines  Highest cosines
Choices: (a) day:night       0.374            0.327             0.443
(b) mile:distance            0.677            0.525             0.781
(c) decade:century           0.389            0.327             0.470
(d) friction:heat            0.428            0.336             0.552
(e) part:whole               0.370            0.330             0.408
Solution: (b) mile:distance  0.677            0.525             0.781
Gap: (b) - (d)               0.249            0.189             0.229
Performance of LRA on the 374 SAT questions:
Algorithm Precision Recall F
LRA 56.8 56.1 56.5
Veale (2004) 42.8 42.8 42.8
Best attributional similarity 35.0 35.0 35.0
Random guessing 20.0 20.0 20.0
Lowest cooccurrence frequency 16.8 16.8 16.8
Highest cooccurrence frequency 11.8 11.8 11.8
Baseline LRA System
Performance of the baseline LRA system on the 374 SAT questions: 210 questions were answered correctly, 160 incorrectly, and 4 questions were skipped because their stem pairs and alternates were represented by zero vectors.
The performance of LRA is better than the lexicon-based approach of Veale (2004) and the best performance using attributional similarity, with 95% confidence.
LRA versus VSM
LRA performs better than VSM-AV.
With smaller corpus, many more of the input word pairs simply do not appear together in short phrases in the corpus.
Algorithm  Correct  Incorrect  Skipped  Precision  Recall  F
VSM-AV     176      193        5        47.7       47.1    47.4
VSM-WMTS   144      196        34       42.4       38.5    40.3
LRA        210      160        4        56.8       56.1    56.5
LRA versus VSM
LRA is able to answer as many questions as VSM-AV, although it uses the same (smaller) corpus as VSM-WMTS.
Human performance on 78 verbal SAT questions: 57% recall.
The experiment did not attempt to tune the parameter values (k, num_sim, ...) to maximize precision and recall on the 374 SAT questions.
Ablation Experiments
Results of the ablation experiments:

           Baseline LRA  LRA       LRA            LRA (no SVD,
           system        (no SVD)  (no synonyms)  no synonyms)  VSM-WMTS
Correct    210           198       185            178           144
Incorrect  160           172       167            173           196
Skipped    4             4         22             23            34
Precision  56.8          53.5      52.6           50.7          42.4
Recall     56.1          52.9      49.5           47.6          38.5
F          56.5          53.2      51.0           49.1          40.3
Ablation Experiments
Without SVD, performance dropped, but the drop is not statistically significant at 95% confidence.
With more word pairs, SVD would likely make a significant contribution; more data would also give SVD more leverage.
Dropping synonyms raises the number of skipped questions. Recall drops significantly, but the drop in precision is not significant.
Ablation Experiments
When both SVD and synonyms are dropped, the decrease in recall is significant, but the larger decrease in precision is not significant.
The remaining difference between LRA and VSM-WMTS is the patterns.
The contribution of SVD has not been proven.
Matrix Symmetry and Vector Interpretations
A good measure of relational similarity, sim_r, should satisfy:

sim_r(A:B, C:D) = sim_r(B:A, D:C)

This helps prevent drops in recall and precision.
Choosing better alternates, rather than using all alternates, would also help.
The semantic content of a vector is distributed over the whole vector.
Manual Patterns versus Automatic Patterns
LRA uses 4000 automatically generated patterns, whereas Turney and Littman (2005) used 64 manually generated patterns.
The improvement in performance with automated patterns is due to the increased quantity of patterns.
The manually generated patterns were not used to mine text for instances of word pairs that fit the patterns.
Classes of Relations
An experiment was performed using the 600 labeled noun-modifier pairs of Nastase and Szpakowicz (2003).
Single nearest-neighbour classification with leave-one-out cross-validation is used: the data set is split 600 times.
There were originally six groups of semantic relations.
Classes of semantic relations from Nastase and Szpakowicz
Relation  Abbr.  Example phrase  Description
CAUSALITY
cause     cs     flu virus (*)   H makes M occur or exist; H is necessary and sufficient.
effect    eff    exam anxiety    M makes H occur or exist; M is necessary and sufficient.
Classes of Relations
Answering the 374 SAT questions requires calculating 374 × 5 × 16 = 29,920 cosines.
With leave-one-out cross-validation, each test pair has 599 choices, so it would require calculating 600 × 599 × 16 cosines.
To reduce the amount of computation, we first ignore alternate pairs to find the 30 nearest neighbours (600 × 599 = 359,400 cosines), then apply the full LRA to just those 30 neighbours (600 × 30 × 16 = 288,000 cosines). Total: 647,400 cosines.
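The cosine counts above can be checked with a few lines:

```python
# Cosines for the SAT questions: 374 questions, 5 choices each,
# 16 version combinations per choice.
sat = 374 * 5 * 16
# Leave-one-out over the 600 noun-modifier pairs: a cheap pass without
# alternates to find the 30 nearest neighbours, then full LRA on them.
cheap_pass = 600 * 599
full_pass = 600 * 30 * 16
print(sat, cheap_pass + full_pass)  # 29920 647400
```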
Limitations of LRA
Although LRA performs significantly better than the VSM, its accuracy might not yet be adequate for practical applications.
It is possible to adjust the tradeoff between precision and recall.
Speed: it took 9 days to answer the 374 analogy questions.
Conclusions
LRA extends the VSM approach of Turney and Littman (2005) by:
Exploring variations on the analogies by replacing words with synonyms (step 1).
Automatically generating connecting patterns (step 4).
Smoothing the data with SVD (step 9).
The accuracy of LRA is significantly higher than the accuracies of VSM-AV and VSM-WMTS.
Conclusions
The difference between VSM-AV and VSM-WMTS shows that the VSM is sensitive to the size of the corpus.
LRA may perform better with a larger corpus.
A hybrid approach will surpass any purebred approach.
The pattern selection algorithm has little impact on performance.