Transcript
Slide 1
Semantic similarity, vector space models and word-sense disambiguation
Corpora and Statistical Methods, Lecture 6
Slide 2
Semantic similarity Part 1
Slide 3
Synonymy
Different phonological/orthographic words with highly related meanings: sofa / couch, boy / lad.
Traditional definition: w1 is synonymous with w2 if w1 can replace w2 in a sentence, salva veritate.
Is this ever the case? Can we replace one word with another and keep the meaning of our sentence identical?
Slide 4
The importance of text genre & register
With near-synonyms, there are often register-governed conditions of use, e.g. naive vs. gullible vs. ingenuous:
"You're so bloody gullible [...]"
"[...] outside on the pavement trying to entice gullible idiots in [...]"
"You're so ingenuous. You tackle things the wrong way."
"The commentator's ingenuous query could just as well have been prompted [...]"
"However, it is ingenuous to suppose that peace process [...]"
(source: BNC)
Slide 5
Synonymy vs. Similarity
The contextual theory of synonymy: based on the work of Wittgenstein (1953) and Firth (1957).
"You shall know a word by the company it keeps" (Firth 1957)
Under this view, perfect synonyms might not exist. But words can be judged as highly similar if people put them into the same linguistic contexts, and judge the change to be slight.
Slide 6
Synonymy vs. similarity: example
Miller & Charles (1991), the weak contextual hypothesis: the similarity of the contexts in which two words appear contributes to the semantic similarity of those words.
E.g. snake is similar to [resp. a synonym of] serpent to the extent that we find snake and serpent in the same linguistic contexts. It is much more likely that snake/serpent will occur in similar contexts than snake/toad.
NB: this is not a discrete notion of synonymy, but a continuous definition of similarity.
Slide 7
The Miller/Charles experiment
Subjects were given sentences with missing words, and asked to place words they felt were acceptable in each context.
Method to compare words A and B:
find sentences containing A
find sentences containing B
delete A and B from the sentences and shuffle them
ask people to choose which sentences to place A and B in
Results: people tend to put similar words in the same contexts, and this is highly correlated with occurrence in similar contexts in corpora.
Slide 8
Issues with similarity
Similar is a much broader concept than synonymous:
contextually related, though differing in meaning: man / woman, boy / girl, master / pupil
contextually related, but with opposite meanings: big / small, clever / stupid
Slide 9
Uses of similarity
Assumption: semantically similar words behave in similar ways.
Information retrieval: query expansion with related terms.
K nearest neighbours, e.g.:
given: a set of elements, each assigned to some topic
task: classify an unknown word w by topic
method: find the topic that is most prevalent among w's semantic neighbours (see the sketch below)
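A minimal sketch of the k-nearest-neighbours step, assuming we already have some word-word similarity function sim (any of the measures introduced later would do); labelled_words and knn_topic are hypothetical names:

```python
from collections import Counter

def knn_topic(w, labelled_words, sim, k=5):
    """Classify w by the most prevalent topic among its k most
    similar labelled words.

    labelled_words: dict mapping word -> topic
    sim: any word-word similarity function
    """
    # Rank labelled words by similarity to w and keep the top k
    neighbours = sorted(labelled_words, key=lambda v: sim(w, v), reverse=True)[:k]
    # Majority vote over the neighbours' topics
    return Counter(labelled_words[v] for v in neighbours).most_common(1)[0][0]
```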
Slide 10
Common approaches
Vector-space approaches: represent a word w as a vector containing the words (or other features) in the context of w; compare the vectors of w1 and w2; various vector-distance measures are available.
Information-theoretic measures: w1 is similar to w2 to the extent that knowing about w1 increases my knowledge about (decreases my uncertainty about) w2.
Slide 11
Vector-space models
Slide 12
Basic data structure
Matrix M, where M_ij = number of times w_i co-occurs with w_j (in some window).
We can also have a document × word matrix.
We can treat matrix cells as boolean: if M_ij > 0, then w_i co-occurs with w_j; else it does not.
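A minimal sketch of this data structure, built as a sparse dict-of-dicts over a token list (the function names are my own):

```python
from collections import defaultdict

def cooccurrence_matrix(tokens, window=2):
    """Build M, where M[wi][wj] = number of times wi co-occurs
    with wj within `window` words (a sparse dict-of-dicts)."""
    M = defaultdict(lambda: defaultdict(int))
    for i, wi in enumerate(tokens):
        lo = max(0, i - window)
        for j in range(lo, min(len(tokens), i + window + 1)):
            if i != j:
                M[wi][tokens[j]] += 1
    return M

def boolean_row(M, w):
    """Boolean view of a row: the set of words co-occurring with w."""
    return {wj for wj, count in M[w].items() if count > 0}
```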
Slide 13
Distance measures
Many measures take a set-theoretic perspective. Vectors can be:
binary (indicating co-occurrence or not)
real-valued (indicating frequency, or probability)
Similarity is a function of what two vectors have in common.

Dice vs. Jaccard
Dice: dice(X, Y) = 2|X ∩ Y| / (|X| + |Y|). E.g. dice(car, truck) on the boolean matrix: (2 * 2) / (4 + 2) = 0.66
Jaccard: jaccard(X, Y) = |X ∩ Y| / |X ∪ Y|. On the boolean matrix: 2/4 = 0.5
Dice is more generous; Jaccard penalises lack of overlap more.
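Both measures in code, operating on boolean rows (sets of co-occurring words); the two example sets are invented so that the counts match the slide's figures:

```python
def dice(x, y):
    """Dice coefficient on boolean rows (sets of co-occurring words)."""
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x, y):
    """Jaccard coefficient: intersection over union."""
    return len(x & y) / len(x | y)

# Invented sets with |x & y| = 2, |x| = 4, |y| = 2, as in the example
x = {"a", "b", "c", "d"}
y = {"a", "b"}
print(dice(x, y))     # (2 * 2) / (4 + 2) = 0.66...
print(jaccard(x, y))  # 2 / 4 = 0.5
```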
Turning counts into probabilities
Each count row is normalised by its row total: P(w_j | w_i) = M_ij / Σ_k M_ik. E.g.:
P(spacewalking | cosmonaut) = 0.5
P(red | car) = 0.25
NB: this transforms each row into a probability distribution corresponding to a word.
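A sketch of the row normalisation; the toy counts below are invented, chosen only to reproduce the 0.5 above:

```python
def row_to_distribution(M, w):
    """Turn the count row for w into the distribution P(. | w)
    by dividing each cell by the row total."""
    total = sum(M[w].values())
    return {wj: count / total for wj, count in M[w].items()}

# Invented counts, for illustration only
M = {"cosmonaut": {"spacewalking": 1, "orbit": 1}}
print(row_to_distribution(M, "cosmonaut"))  # {'spacewalking': 0.5, 'orbit': 0.5}
```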
Slide 19
Probabilistic measures of distance
KL-Divergence: treat w1 as an approximation of w2:
D(p || q) = Σ_x p(x) log( p(x) / q(x) )
Problems:
asymmetric: D(p || q) ≠ D(q || p), so not so useful for word-word similarity
if the denominator q(x) = 0, then D(p || q) is undefined
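A direct implementation over the row distributions above, making the zero-denominator problem explicit (here reported as infinity):

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log2(p(x) / q(x)).

    p, q: dicts mapping outcomes to probabilities.
    Undefined as soon as q(x) = 0 for some x with p(x) > 0 --
    exactly the problem noted on the slide.
    """
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue                 # 0 * log(0 / q) is taken as 0
        qx = q.get(x, 0.0)
        if qx == 0:
            return float("inf")      # divergence is undefined/infinite
        total += px * math.log(px / qx, 2)
    return total
```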
Slide 20
Probabilistic measures of distance
Information radius (aka Jensen-Shannon divergence): compares the total divergence between p and q to the average of p and q:
IRad(p, q) = D(p || (p+q)/2) + D(q || (p+q)/2)
Symmetric! Dagan et al. (1997) showed this measure to be superior to KL-Divergence when applied to a word sense disambiguation task.
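A sketch reusing kl_divergence from the previous block; note that the Jensen-Shannon divergence is sometimes defined as half this sum:

```python
def information_radius(p, q):
    """IRad(p, q) = D(p || m) + D(q || m), where m = (p + q) / 2.

    Symmetric, and always defined: m(x) > 0 wherever p(x) > 0
    or q(x) > 0, so the zero-denominator problem disappears.
    """
    support = set(p) | set(q)
    m = {x: (p.get(x, 0.0) + q.get(x, 0.0)) / 2 for x in support}
    return kl_divergence(p, m) + kl_divergence(q, m)
```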
Slide 21
Some characteristics of vector-space measures
1. Very simple conceptually;
2. Flexible: can represent similarity based on document co-occurrence, word co-occurrence, etc.;
3. Vectors can be arbitrarily large, representing wide context windows;
4. Can be expanded to take into account grammatical relations (e.g. head-modifier, verb-argument, etc.).
Slide 22
Grammar-informed methods: Lin (1998)
Intuition: the similarity of any two things (words, documents, people, plants) is a function of the information gained by having a joint description of a and b, in terms of what they have in common, compared to describing a and b separately.
E.g. do we gain more by a joint description of:
apple and chair (both THINGS)?
apple and banana (both FRUIT: more specific)?
Slide 23
Lin's definition, cont'd
Essentially, we compare the information content of the common description to the information content of the separate descriptions.
NB: essentially mutual information!
Slide 24
An application to corpora
From a corpus-based point of view, what do words have in common? Context, obviously. But how should context be defined? As just a bag of words (typical of vector-space models), or in a more grammatically sophisticated way?
Slide 25
Kilgarriff's (2003) application
Definition of the notion of context, following Lin: define F(w) as the set of grammatical contexts in which w occurs.
A context is a triple ⟨rel, w, w′⟩, where:
rel is a grammatical relation
w is the word of interest
w′ is the other word in rel
Grammatical relations can be obtained using a dependency parser.
Slide 26
Grammatical co-occurrence matrix for cell
[Matrix not reproduced in the transcript.] Source: Jurafsky & Martin (2009), after Lin (1998)
Slide 27
Example with w = cell
Example triples: [not reproduced in the transcript]
Observe that each triple f consists of the relation r, the second word in the relation w′, and the word of interest w.
We can now compute the level of association between the word w and each of its triples f, using an information-theoretic measure that was proposed as a generalisation of the idea of pointwise mutual information (see the sketch below).
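For illustration, here is plain pointwise mutual information as the association measure between a word and its grammatical contexts; Lin's actual measure generalises this, so treat the sketch as indicative rather than a faithful reproduction:

```python
import math
from collections import Counter

def association_scores(triples):
    """Score each (w, f) pair with pointwise mutual information,
    where f = (rel, w2) is a grammatical context of w.

    triples: a list of (rel, w, w2) tuples from a dependency parser.
    PMI(w, f) = log2( P(w, f) / (P(w) * P(f)) ).
    """
    n = len(triples)
    pair_counts = Counter((w, (rel, w2)) for rel, w, w2 in triples)
    w_counts = Counter(w for _, w, _ in triples)
    f_counts = Counter((rel, w2) for rel, _, w2 in triples)
    return {
        (w, f): math.log2(n * c / (w_counts[w] * f_counts[f]))
        for (w, f), c in pair_counts.items()
    }
```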
Slide 28
Calculating similarity
Given that we have grammatical triples for our words of interest, the similarity of w1 and w2 is a function of:
the triples they have in common
the triples that are unique to each
I.e. the mutual information of what the two words have in common, divided by the sum of the mutual information of what each word has:
sim(w1, w2) = Σ_{f ∈ F(w1) ∩ F(w2)} [I(w1, f) + I(w2, f)] / [Σ_{f ∈ F(w1)} I(w1, f) + Σ_{f ∈ F(w2)} I(w2, f)]
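A sketch of this ratio, building on the assoc dictionary produced by the previous block:

```python
def lin_similarity(w1, w2, assoc):
    """sim(w1, w2): information in the triples the two words share,
    divided by the total information in each word's own triples.

    assoc: dict mapping (w, f) -> association score, as above.
    """
    f1 = {f for (w, f) in assoc if w == w1}
    f2 = {f for (w, f) in assoc if w == w2}
    shared = f1 & f2
    common = sum(assoc[(w1, f)] + assoc[(w2, f)] for f in shared)
    separate = sum(assoc[(w1, f)] for f in f1) + sum(assoc[(w2, f)] for f in f2)
    return common / separate if separate else 0.0
```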
Slide 29
Sample results: master & pupil
common:
Subject-of: read, sit, know
Modifier: good, form
Possession: interest
master only:
Subject-of: ask
Modifier: past (cf. past master)
pupil only:
Subject-of: make, find
PP_at-p: school
Slide 30
Concrete implementation
The online SketchEngine gives the grammatical relations of words, plus a thesaurus which ranks words by similarity to a headword. This is based on the Lin (1998) model.
Slide 31
Limitations (or characteristics)
Only applicable as a measure of similarity between words of the same category: it makes no sense to compare the grammatical relations of words of different categories.
Does not distinguish between near-synonyms and similar words: student ~ pupil, master ~ pupil.
MI is sensitive to low frequency: a relation which occurs only once in the corpus can come out as highly significant.