Similarity of Semantic Relations
Peter D. Turney National Research Council Canada
Presented by: Jennifer Lee
November 14, 2008
CSI 5386
Attributional Similarity
Two words, A and B, with a high degree of attributional similarity are called synonyms.
An example of a typical synonym question from the TOEFL exam:
Stem: levied
Choices: (a) imposed
(b) believed
(c) requested
(d) correlated
Solution: (a) imposed
Attributional Similarity
A measure of attributional similarity: sim_a(A, B) ∈ R.
Semantic relatedness vs. semantic distance:
Semantic relatedness is a more general concept than similarity, and is the same as attributional similarity; semantic distance is its inverse.
Attributional Similarity
Example of semantic relatedness:
Similar entities: (bank, trust company)
Dissimilar but related entities:
Meronymy: (car, wheel)
Antonymy: (hot, cold)
Any functional relationship/frequent association: (pencil, paper), (penguin, Antarctica)
Attributional Similarity
Types of attributional similarity:
Semantically associated: (bee, honey)
Semantically similar: (deer, pony)
Both: (doctor, nurse)
The term semantic similarity is misleading: it refers to a type of attributional similarity, yet relational similarity is no less semantic than attributional similarity. Hence, we use the term taxonomical similarity.
Relational Similarity
Relational similarity: when two pairs of words have a high degree of relational similarity, we say they are analogous.
Measured by: sim_r(A:B, C:D) ∈ R
A:B::C:D means "A is to B as C is to D".
Verbal Analogy
Examples: traffic:street::water:riverbed mason:stone::carpenter:wood
It seems that, in the second example, the relational similarity can be reduced to attributional similarity.
Verbal Analogy
A typical analogy question from the SAT:
Stem: mason:stone
Choices: (a) teacher:chalk
(b) carpenter:wood
(c) soldier:gun
(d) photograph:camera
(e) book:word
Solution: (b) carpenter:wood
Near Analogy
Near analogy: when there is a high degree of relational similarity between two word pairs, A:B and C:D, and there is also a high degree of attributional similarity between A and C, and between B and D.
Otherwise, it is a far analogy.
Which of these pairs is a near analogy?
(mason:stone::carpenter:wood)
(traffic:street::water:riverbed)
Measures of Attributional Similarity
Many algorithms have been proposed; measures of attributional similarity have been studied extensively.
Applications: recognizing synonyms, information retrieval, determining semantic orientation, grading student essays, measuring textual cohesion, and word sense disambiguation.
Measuring Attributional Similarity
Algorithms: lexicon-based, corpus-based, or a hybrid of the two.
We might expect lexicon-based algorithms to be better at capturing synonymy than corpus-based algorithms, but this is not the case.
Measuring Attributional Similarity
Performance of attributional similarity measures on the 80 TOEFL questions:

Reference                      Description                   Percent correct
Jarmasz and Szpakowicz (2003)  Best lexicon-based algorithm  78.75
Terra and Clarke (2003)        Best corpus-based algorithm   81.25
Turney et al. (2003)           Best hybrid algorithm         97.5
Landauer and Dumais (1997)     Average human score           64.5
Measures of Relational Similarity
Measures of relational similarity are not well developed, and their potential applications are not as well known.
Many problems that involve semantic relations would benefit from an algorithm for measuring relational similarity: NLP, information retrieval, and information extraction.
Using Attributional Similarity to Solve Analogies
We could score each candidate analogy by the average of the attributional similarity, sim_a, between A and C and between B and D:

score(A:B::C:D) = (1/2) (sim_a(A, C) + sim_a(B, D))

Performance of algorithms was measured by precision, recall, and F.
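As a sketch, the scoring rule above is a one-liner; the attributional similarity measure sim_a is supplied by the caller, and the toy_sim function here is purely illustrative, not a real measure:

```python
def analogy_score(sim_a, a, b, c, d):
    """Score the candidate analogy A:B::C:D as the average of the
    attributional similarities sim_a(A, C) and sim_a(B, D)."""
    return 0.5 * (sim_a(a, c) + sim_a(b, d))

# Toy similarity function for illustration only (not a real measure):
toy_sim = lambda x, y: 1.0 if x == y else 0.5

print(analogy_score(toy_sim, "mason", "stone", "carpenter", "wood"))  # 0.5
```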
Using Attributional Similarity to Solve Analogies
precision = (number of correct guesses) / (total number of guesses made)

recall = (number of correct guesses) / (maximum possible number correct)

F = (2 × precision × recall) / (precision + recall)
Using Attributional Similarity to Solve Analogies
For example, using the algorithm of Hirst and St-Onge (1998), out of 374 SAT analogy questions, 120 questions were answered correctly, 224 incorrectly, and 30 questions were skipped. Precision was 120/(120 + 224) = 34.9% and recall was 120/(120 + 224 + 30) = 32.1%.
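The worked example above can be checked directly (a minimal sketch using the counts from the Hirst and St-Onge result):

```python
def precision_recall_f(correct, incorrect, skipped):
    """Precision, recall, and F measure for an analogy solver that may
    skip questions: skipped questions hurt recall but not precision."""
    precision = correct / (correct + incorrect)
    recall = correct / (correct + incorrect + skipped)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Hirst and St-Onge (1998) on the 374 SAT questions:
p, r, f = precision_recall_f(120, 224, 30)
print(round(100 * p, 1), round(100 * r, 1), round(100 * f, 1))  # 34.9 32.1 33.4
```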
Performance of attributional similarity measures on the 374 SAT questions. The bottom two rows are included for comparison.
Using Attributional Similarity to Solve Analogies
Algorithm                    Type              Precision  Recall  F
Hirst and St-Onge (1998)     Lexicon-based     34.9       32.1    33.4
Jiang and Conrath (1997)     Hybrid            29.8       27.3    28.5
Leacock and Chodorow (1998)  Lexicon-based     32.8       31.3    32.0
Lin (1998b)                  Hybrid            31.2       27.3    29.1
Resnik (1995)                Hybrid            35.7       33.2    34.4
Turney (2001)                Corpus-based      35.0       35.0    35.0
Turney and Littman (2005)    Relational (VSM)  47.7       47.1    47.4
Random                       Random            20.0       20.0    20.0
Using Attributional Similarity to Solve Analogies
We conclude that there are enough near analogies in the 374 SAT questions for attributional similarity to perform better than random guessing.
But not enough near analogies for attributional similarity to perform as well as relational similarity.
Recognizing Word Analogies
First attempted by a system called Argus, using a small hand-built semantic network.
Argus was based on a spreading activation model and did not explicitly attempt to measure relational similarity; therefore, it could only solve a limited set of analogy questions.
Recognizing Word Analogies
Turney et al. (2003) combined 13 independent modules to answer SAT questions. The VSM module was the best of the 13, achieving a score of 47%.
Veale (2004) applied a lexicon-based approach to the same 374 SAT questions, attaining a score of 43%. WordNet was used to compute a quality measure based on the similarity between the A:B paths and the C:D paths.
Latent Relational Analysis
Turney (2005) introduced Latent Relational Analysis (LRA), an enhanced version of the VSM approach to measure relational similarity.
LRA has potential in many areas, including information extraction, word sense disambiguation, and information retrieval.
LRA relies on three resources: a search engine with a large corpus of text, a thesaurus of synonyms, and an efficient implementation of SVD.
Structure Mapping Theory
The most influential work on modeling of analogy-making, implemented in the Structure Mapping Engine (SME).
Produces an analogical mapping between the source and target domain. Uses predicate logic.
Example analogy:
Source domain: the solar system (basic objects are the sun and planets)
Target domain: Rutherford's model of the atom (basic objects are the nucleus and electrons)
Structure Mapping Theory
Each individual connection in an analogical mapping implies that the connected relations are similar.
Later versions of SME allowed similar, nonidentical relations to match.
Although SME focuses on the mapping process as a whole rather than on measuring similarity between any two particular relations, LRA could enhance the performance of SME, and vice versa.
Metaphor
Novel metaphors can be understood through analogy, but conventional metaphors are simply recalled from memory.
It may be fruitful to combine an algorithm such as Dolan's (1995) for handling conventional metaphors with LRA and SME for handling novel metaphors.
Metaphor
Lakoff and Johnson (1980):
Metaphorical sentence SATstyle verbal analogy
He shot down all of my arguments. aircraft:shoot down::argument:refute
I demolished his argument. building:demolish::argument:refute
You need to budget your time. money:budget::time:schedule
I’ve invested a lot of time in her. money:invest::time:allocate
My mind just isn’t operating today. machine:operate::mind:think
Life has cheated me. charlatan:cheat::life:disappoint
Inflation is eating up our profits. animal:eat::inflation:reduce
Classifying Semantic Relations
The problem is to classify a noun-modifier pair according to the semantic relation between the head noun and the modifier.
Example: laser printer.
Rosario and Hearst (2001) trained a neural network to distinguish 13 classes of semantic relations in the medical domain.
Lexical resources used: MeSH and UMLS.
Each noun-modifier pair is represented by a feature vector.
Classifying Semantic Relations
Nastase and Szpakowicz (2003) classified 600 general noun-modifier pairs using WordNet and Roget's Thesaurus as lexical resources.
Vanderwende (2004) used handbuilt rules, together with a lexical knowledge base.
Any classification of semantic relations employs some implicit notion of relational similarity.
Classifying Semantic Relations
Barker and Szpakowicz (1998) tried a corpus-based approach that explicitly uses a measure of relational similarity.
Moldovan et al. (2004) also used a measure of relational similarity to map each noun and modifier into semantic classes in WordNet, taken from a corpus. The surrounding context in the corpus is used in a word sense disambiguation algorithm to improve the mapping.
Classifying Semantic Relations
Turney and Littman (2005) used the VSM (as the component in a single nearest-neighbour learning algorithm) to measure relational similarity. This paper focuses on LRA.
Lauer (1995) used a corpus-based approach to paraphrase noun-modifier pairs by inserting prepositions. Example: reptile haven → haven for reptiles.
Lapata and Keller (2004) improved the result by using the database of Alta Vista as a corpus.
Word Sense Disambiguation, Information Extraction
If we can identify the relations between a given word and its context, then we can disambiguate the given word.
For example, consider the word plant. Suppose plant appears in some text near food; the semantic relations between plant and food can indicate which sense of plant is intended.
Information extraction: given an input document and a specific relation R, extract all pairs of entities (if any) that have the relation R in the document.
Example: John Smith and Hardcom Corporation.
Information Extraction
With the VSM approach, there would be a training set of labeled examples of the relation. Each example would be represented by a vector of pattern frequencies.
Given two entities, we could construct a vector representing their relation, then measure the relational similarity between the unlabeled vector and each of the labeled training vectors.
Information Extraction andQuestion Answering
This looks like a problem: the training vectors would be relatively dense, while the new unlabeled vector for the two entities would be sparse.
Moldovan et al. (2004) propose to map a given question to a semantic relation, and then search for that relation in a corpus of semantically tagged text.
Automatic Thesaurus Generation
Hearst (1992) presents an algorithm that can automatically generate a thesaurus or dictionary by learning hyponym relations and more.
Hearst (1992) and Berland and Charniak (1999) use manually generated rules to mine text for semantic relations.
Turney and Littman (2005) also use a manually generated set of 64 patterns.
Automatic Thesaurus Generation
Instead of manually generating new rules or patterns for each semantic relation, LRA can automatically learn patterns from a large corpus.
Girju, Badulescu, and Moldovan (2003) present an algorithm for learning meronyms from a corpus. They supplemented manual rules with automatically learned constraints.
Information Retrieval
Veale (2003) proposes to use an algorithm for solving word analogies, based on WordNet, for information retrieval.
Example: Hindu bible → the Vedas.
Focus on the analogy form adjective:noun::adjective:noun. Example: Muslim:mosque::Christian:church.
An unsupervised algorithm for discovering analogies by clustering words from two different corpora has been developed (Marx et al., 2002).
Identifying Semantic Roles
Semantic roles are merely a special case of semantic relations (Moldovan et al.).
Example: semantic frame: statement; semantic roles: speaker, addressee, and message.
It is helpful to view semantic frames and their semantic roles as sets of semantic relations.
Measuring Attributional Similarity with the Vector Space Model
In the VSM approach to information retrieval, queries and documents are represented by vectors.
Elements in these vectors are the frequencies of words in the corresponding queries and documents.
The attributional similarity between a query and a document is measured by the cosine of the angle between their corresponding vectors.
Singular Value Decomposition
LRA enhances the VSM by using SVD to smooth the vectors.
SVD improves the document-query attributional similarity measures.
Measuring Relational Similarity with VSM
Given two unknown relations, R1 (between a pair of words A and B) and R2 (between C and D), we wish to measure the relational similarity between R1 and R2.
First, we need to create vectors: R1 = < r1,1, ...., r1,n > R2 = < r2,1, ...., r2,n >
Measuring Relational Similarity with VSM
The measure of similarity of R1 and R2 is given by the cosine of the angle θ between r1 and r2.
A vector r indicates the relationship between two words X and Y; it is created by counting the frequencies of short phrases containing X and Y.

cosine θ = (Σ_i r1,i · r2,i) / ( √(Σ_i r1,i²) · √(Σ_i r2,i²) ) = (r1 · r2) / ( √(r1 · r1) · √(r2 · r2) ) = (r1 · r2) / (|r1| · |r2|)
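The cosine measure is a few lines of code (a minimal sketch, no external dependencies):

```python
import math

def cosine(r1, r2):
    """Cosine of the angle between two frequency vectors:
    dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(r1, r2))
    norm1 = math.sqrt(sum(a * a for a in r1))
    norm2 = math.sqrt(sum(b * b for b in r2))
    return dot / (norm1 * norm2)

print(cosine([1, 0, 2], [2, 0, 4]))  # parallel vectors -> 1.0
print(cosine([1, 0], [0, 1]))        # orthogonal vectors -> 0.0
```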
Measuring Relational Similarity with the VSM
If the number of hits for a query is x, then the corresponding element in the vector r is log(x + 1).
To answer multiplechoice analogy questions, vectors are created for the stem pair and each choice pair. Then cosines are calculated for the angles between stem pair and each choice pair.
Sample Multiple Choice
This SAT question:
Stem: quart:volume
Choices:
(a) day:night
(b) mile:distance
(c) decade:century
(d) friction:heat
(e) part:whole
Solution: (b) mile:distance
Measuring Relational Similarity with VSM
Turney and Littman (2005) used the Alta Vista search engine to obtain the frequency information needed to build vectors for the VSM. But Alta Vista later changed its policy toward automated searching.
The VSM used the hit count, but LRA uses the number of passages (strings) matching the query.
Measuring Relational Similarity with VSM
For the experiments, the Waterloo MultiText System (WMTS) is used; its corpus has about 5 × 10^10 English words.
Lin's (1998a) automatically generated thesaurus, available online, is used by querying for a word and fetching the resulting list of synonyms.
Lin's thesaurus was generated by parsing a corpus of about 5 × 10^7 words.
Measuring Relational Similarity with VSM
Lin's thesaurus provides a list of words sorted in decreasing order of similarity, which is convenient for LRA.
WordNet, in contrast, provides a list of words grouped by possible senses, with the groups sorted by frequency of senses.
Steps of LRA
Suppose we want to calculate the relational similarity between the pair quart:volume and the pair mile:distance.
LRA consists of 12 steps.
Step 1: Find alternates
For each word pair A:B in the input set, look in Lin's thesaurus for the top num_sim words most similar to A, forming alternate pairs A':B; do the same for B, forming A:B'.
Alternate forms of the original pair quart:volume:

Word pair          Similarity  Frequency  Filtering step
quart:volume       NA          632        Accept (original pair)
pint:volume        0.210       372
gallon:volume      0.159       1500       Accept (top alternate)
liter:volume       0.122       3323       Accept (top alternate)
squirt:volume      0.084       54
pail:volume        0.084       28
vial:volume        0.084       373
pumping:volume     0.073       1386       Accept (top alternate)
ounce:volume       0.071       430
spoonful:volume    0.070       42
tablespoon:volume  0.069       96
quart:turnover     0.229       0
quart:output       0.225       34
quart:export       0.206       7
quart:value        0.203       266
quart:import       0.186       16
quart:revenue      0.185       0
quart:sale         0.169       119
quart:investment   0.161       11
quart:earnings     0.156       0
quart:profit       0.156       24
Steps of LRA
Step 2: Filter alternates
For each alternate pair, send a query to the WMTS to find the frequency of phrases that begin with one member of the pair and end with the other. The phrases cannot have more than max_phrase words (in this case, 5). Select the top num_filter most frequent alternates and discard the remainder.
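Steps 1 and 2 can be sketched as follows; the similar_words and phrase_freq functions are stand-ins for Lin's thesaurus and the WMTS, and the toy data below uses frequencies from the quart:volume example:

```python
def find_and_filter_alternates(pair, similar_words, phrase_freq,
                               num_sim=10, num_filter=3):
    """Step 1: build alternate pairs by swapping in thesaurus neighbours
    for each member.  Step 2: keep the num_filter most frequent
    alternates, plus the original pair."""
    a, b = pair
    alternates = [(a2, b) for a2 in similar_words(a)[:num_sim]]
    alternates += [(a, b2) for b2 in similar_words(b)[:num_sim]]
    alternates.sort(key=phrase_freq, reverse=True)
    return [pair] + alternates[:num_filter]

# Toy stand-ins for Lin's thesaurus and the WMTS phrase frequencies:
thesaurus = {"quart": ["pint", "gallon", "liter"], "volume": ["output"]}
freqs = {("pint", "volume"): 372, ("gallon", "volume"): 1500,
         ("liter", "volume"): 3323, ("quart", "output"): 34}
result = find_and_filter_alternates(
    ("quart", "volume"),
    lambda w: thesaurus.get(w, []),
    lambda p: freqs.get(p, 0))
print(result)
```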
Steps of LRA
Step 3: Find phrases
For each pair, make a list of phrases in the corpus that contain the pair. Query the WMTS for all phrases that begin with one member of the pair and end with the other (in either order); suffixes are ignored.
The phrases cannot have more than max_phrase words, and there must be at least one word between the two members of the pair.
Examples of phrases that contain quart and volume:
quarts liquid volume
quarts of volume
quarts in volume
quart total volume
quart of spray volume
volume in quarts
volume capacity quarts
volume being about two quarts
volume of milk in quarts
volume include measures like quart
Steps of LRA
Step 4: Find patterns
For each phrase found in step 3, build patterns from the intervening words. A pattern is constructed by replacing any, all, or none of the intervening words with wild cards, so a phrase with n words generates 2^(n-2) patterns.
For each pattern, count the number of pairs (original and alternates) with phrases that match the pattern. Keep the top num_patterns (4000 here) most frequent patterns and discard the rest.
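The pattern-generation step can be sketched with itertools (a phrase with n words has n - 2 intervening words, hence 2^(n-2) patterns):

```python
from itertools import product

def patterns(phrase):
    """Generate all patterns for a phrase by replacing any subset of the
    intervening words with the wild card '*'."""
    first, *middle, last = phrase.split()
    result = []
    for mask in product([False, True], repeat=len(middle)):
        inner = ["*" if wild else word
                 for word, wild in zip(middle, mask)]
        result.append(" ".join([first] + inner + [last]))
    return result

print(patterns("quarts in volume"))               # 2 patterns
print(len(patterns("volume of milk in quarts")))  # 2**3 = 8 patterns
```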
Steps of LRA
Step 5: Map pairs to rows
To build matrix X, create a mapping of word pairs to row numbers: for each pair A:B, create one row for A:B and another row for B:A.

Step 6: Map patterns to columns
Create a mapping of the top num_patterns patterns to column numbers: for each pattern P, create one column for "word1 P word2" and another column for "word2 P word1".
Steps of LRA
Step 7: Generate a sparse matrix
Frequencies of various patterns for quart:volume:

                        P = "in"  P = "* of"  P = "of *"  P = "* *"
freq("quart P volume")  4         1           5           19
freq("volume P quart")  10        0           2           16
Steps of LRA
Step 8: Calculate entropy
Let m be the number of rows in matrix X and n be the number of columns. To calculate the entropy of a column, we first convert the column into a vector of probabilities. Let p_ij be the probability of x_ij:

p_ij = x_ij / Σ_k x_kj,   where k runs from 1 to m.
Step 8: cont
The entropy of the jth column is:

H_j = - Σ_k p_kj · log(p_kj)

We give more weight to columns (patterns) with frequencies that vary substantially from one row to the next. Therefore we weight the cell x_ij by

w_j = 1 - H_j / log(m),

which varies from 0 when p_ij is uniform to 1 when the entropy is minimal. We also apply the log transformation to the frequencies: log(x_ij + 1).
Step 8: cont, Step 9
Step 8 (cont): For all i and j, replace the original value x_ij in X by the new value w_j · log(x_ij + 1).
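Step 8 can be sketched directly from the formulas above (plain lists stand in for the sparse matrix):

```python
import math

def entropy_weights(X):
    """Column weights w_j = 1 - H_j / log(m) for an m-by-n matrix X,
    given as a list of rows."""
    m = len(X)
    weights = []
    for j in range(len(X[0])):
        col = [X[i][j] for i in range(m)]
        total = sum(col)
        probs = [x / total for x in col]
        h = -sum(p * math.log(p) for p in probs if p > 0)
        weights.append(1 - h / math.log(m))
    return weights

def transform(X):
    """Replace each x_ij by w_j * log(x_ij + 1)."""
    w = entropy_weights(X)
    return [[w[j] * math.log(x + 1) for j, x in enumerate(row)]
            for row in X]

# A uniform column gets weight 0; a concentrated column gets weight 1.
w = entropy_weights([[2, 5], [2, 0]])
print(w)  # [0.0, 1.0]
```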
Step 9: Apply SVD
SVD decomposes a matrix into a product of three matrices, X = U Σ V^T, where U and V have orthonormal columns and Σ is a diagonal matrix of singular values.
If Σ_k keeps only the top k singular values, then the matrix U_k Σ_k V_k^T is the matrix of rank k that best approximates the original matrix X.
Step 9 and 10
Step 9 (cont): Cosines are computed from dot products, and
X X^T = (U Σ V^T)(U Σ V^T)^T = U Σ V^T V Σ U^T = U Σ (U Σ)^T,
which means we can calculate the cosines of the row vectors with the smaller matrix U Σ.
Step 10: Projection
Project the row vector for each word pair from the original 8,000 dimensions down to k = 300 dimensions by calculating U_k Σ_k.
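Steps 9 and 10 can be sketched with NumPy on a toy matrix (assuming NumPy is available; a rank-1 matrix is captured exactly with k = 1, and row norms are preserved because V^T is orthonormal):

```python
import numpy as np

def project(X, k):
    """Truncated SVD: return U_k * Sigma_k, the row vectors of X
    projected down to k dimensions."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]

# Toy example: a rank-1 matrix, projected to k = 1 dimension.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
P = project(X, 1)
print(P.shape)  # (3, 1)
```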
Step 11
Step 11: Evaluate alternates
Let A:B and C:D be any two word pairs in the input set. From step 2, we have (num_filter + 1)^2 ways to compare a version of A:B with a version of C:D.
Look up the row vectors in U_k Σ_k that correspond to each version and calculate the (num_filter + 1)^2 cosines.
The 16 combinations and their cosines:

Word pairs                     Cosine  Cosine >= original pair
quart:volume::mile:distance    0.525   Yes (original pair)
quart:volume::feet:distance 0.464
quart:volume::mile:length 0.634 Yes
quart:volume::length:distance 0.499
liter:volume::mile:distance 0.736 Yes
liter:volume::feet:distance 0.687 Yes
liter:volume::mile:length 0.745 Yes
liter:volume::length:distance 0.576 Yes
gallon:volume::mile:distance 0.763 Yes
gallon:volume::feet:distance 0.710 Yes
gallon:volume::mile:length 0.781 Yes (highest cosine)
gallon:volume::length:distance 0.615 Yes
pumping:volume::mile:distance 0.412
pumping:volume::feet:distance 0.439
pumping:volume::mile:length 0.446
pumping:volume::length:distance 0.491
Step 12
Step 12: Calculate relational similarity
Find the cosines from step 11 that are greater than or equal to the cosine for the original pairs, and average them.
This filters out poor analogies, which may have slipped through the filtering in step 2.
Averaging the cosines, as opposed to taking the maximum, is intended to provide some resistance to noise.
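Step 12 is a short computation; the cosines below are the 16 values for quart:volume::mile:distance from the table of combinations above, and the result matches the 0.677 average reported for choice (b):

```python
def relational_similarity(original_cosine, all_cosines):
    """Step 12 of LRA: average of the cosines that are greater than or
    equal to the cosine for the original pair of pairs."""
    kept = [c for c in all_cosines if c >= original_cosine]
    return sum(kept) / len(kept)

# The 16 cosines for quart:volume::mile:distance:
cosines = [0.525, 0.464, 0.634, 0.499, 0.736, 0.687, 0.745, 0.576,
           0.763, 0.710, 0.781, 0.615, 0.412, 0.439, 0.446, 0.491]
print(round(relational_similarity(0.525, cosines), 3))  # 0.677
```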
Cosines for the sample SAT question (stem: quart:volume):

                             Average cosines  Original cosines  Highest cosines
Choices: (a) day:night       0.374            0.327             0.443
(b) mile:distance            0.677            0.525             0.781
(c) decade:century           0.389            0.327             0.470
(d) friction:heat            0.428            0.336             0.552
(e) part:whole               0.370            0.330             0.408
Solution: (b) mile:distance  0.677            0.525             0.781
Gap: (b) - (d)               0.249            0.189             0.229
Performance of LRA on the 374 SAT questions:
Algorithm Precision Recall F
LRA 56.8 56.1 56.5
Veale (2004) 42.8 42.8 42.8
Best attributional similarity 35.0 35.0 35.0
Random guessing 20.0 20.0 20.0
Lowest cooccurrence frequency 16.8 16.8 16.8
Highest cooccurrence frequency 11.8 11.8 11.8
Baseline LRA System
Performance of the baseline LRA system on the 374 SAT questions: 210 questions were answered correctly, 160 incorrectly, and 4 questions were skipped because their stem pairs and alternates were represented by zero vectors.
The performance of LRA is better than the lexicon-based approach of Veale (2004) and the best performance using attributional similarity, with 95% confidence.
LRA versus VSM
LRA performs better than VSM-AV.
With smaller corpus, many more of the input word pairs simply do not appear together in short phrases in the corpus.
Algorithm  Correct  Incorrect  Skipped  Precision  Recall  F
VSM-AV     176      193        5        47.7       47.1    47.4
VSM-WMTS   144      196        34       42.4       38.5    40.3
LRA        210      160        4        56.8       56.1    56.5
LRA versus VSM
LRA is able to answer as many questions as VSM-AV, although it uses the same (smaller) corpus as VSM-WMTS.
Human performance on 78 verbal SAT questions: 57% recall.
The experiment did not attempt to tune the parameter values (k, num_sim, ...) to maximize precision and recall on the 374 SAT questions.
Ablation Experiments
Results of the ablation experiments:

           Baseline LRA  LRA       LRA            LRA (no SVD,
           system        (no SVD)  (no synonyms)  no synonyms)  VSM-WMTS
Correct    210           198       185            178           144
Incorrect  160           172       167            173           196
Skipped    4             4         22             23            34
Precision  56.8          53.5      52.6           50.7          42.4
Recall     56.1          52.9      49.5           47.6          38.5
F          56.5          53.2      51.0           49.1          40.3
Ablation Experiments
Without SVD, performance dropped, but the drop is not statistically significant at 95% confidence.
With more word pairs, SVD would likely make a significant contribution; more data would also give SVD more leverage.
Dropping synonyms raises the number of skipped questions. Recall drops significantly, but the drop in precision is not significant.
Ablation Experiments
When both SVD and synonyms are dropped, the decrease in recall is significant, but the larger decrease in precision is not significant.
The remaining difference between LRA and VSM-WMTS is the patterns.
The contribution of SVD has not been proven.
Matrix Symmetry and Vector Interpretations
A good measure of relational similarity, sim_r, should satisfy:

sim_r(A:B, C:D) = sim_r(B:A, D:C)

This helps prevent drops in recall and precision.
Choosing better alternates, rather than using all alternates, would also help.
The semantic content of a vector is distributed over the whole vector.
Manual Patterns versus Automatic Patterns
LRA uses 4000 automatically generated patterns, whereas Turney and Littman (2005) used 64 manually generated patterns.
The improvement in performance with automated patterns is due to the increased quantity of patterns.
The manually generated patterns were not used to mine text for instances of word pairs that fit the patterns.
Classes of Relations
An experiment was performed using the 600 labeled noun-modifier pairs of Nastase and Szpakowicz (2003).
Single nearest-neighbour classification with leave-one-out cross-validation is used: the data set is split 600 times.
There were originally six groups of semantic relations.
Classes of semantic relations from Nastase and Szpakowicz
Relation  Abbr.  Example phrase  Description
CAUSALITY
cause     cs     flu virus (*)   H makes M occur or exist; H is necessary and sufficient.
effect    eff    exam anxiety    M makes H occur or exist; M is necessary and sufficient.
Classes of Relations
Answering the 374 SAT questions requires calculating 374 × 5 × 16 = 29,920 cosines.
With leave-one-out cross-validation, each test pair has 599 choices, so it would require calculating 600 × 599 × 16 cosines.
To reduce the amount of computation, we first ignore alternate pairs to find the 30 nearest neighbours (600 × 599 = 359,400 cosines), then apply the full LRA to just those 30 neighbours (600 × 30 × 16 = 288,000 cosines). Total: 647,400 cosines.
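The cosine counts above can be checked with a few lines:

```python
# Cosines for the SAT questions: 374 questions, 5 choices each,
# 16 version combinations per choice.
sat = 374 * 5 * 16
# Leave-one-out over the 600 noun-modifier pairs: a cheap pass without
# alternates to find the 30 nearest neighbours, then full LRA on them.
cheap_pass = 600 * 599
full_pass = 600 * 30 * 16
print(sat, cheap_pass + full_pass)  # 29920 647400
```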
Limitations of LRA
Although LRA performs significantly better than the VSM, its accuracy might not yet be adequate for practical applications.
It is possible to adjust the tradeoff between precision and recall.
Speed: it took 9 days to answer the 374 analogy questions.
Conclusions
LRA extends the VSM approach of Turney and Littman (2005) by:
Exploring variations on the analogies by replacing words with synonyms (step 1).
Automatically generating connecting patterns (step 4).
Smoothing the data with SVD (step 9).
The accuracy of LRA is significantly higher than the accuracies of VSM-AV and VSM-WMTS.
Conclusions
The difference between VSM-AV and VSM-WMTS shows that the VSM is sensitive to the size of the corpus.
LRA may perform better with a larger corpus.
A hybrid approach will surpass any purebred approach.
The pattern selection algorithm has little impact on performance.