Corpora and Statistical Methods
Lecture 12
Albert Gatt

Part 1: Semantic similarity and the vector space model

Synonymy
Different phonological/orthographic words with highly related meanings:
- sofa / couch
- boy / lad

Traditional definition: w1 is synonymous with w2 if w1 can replace w2 in a sentence, salva veritate.
Is this ever the case? Can we replace one word with another and keep our sentence identical?

The importance of text genre & register
With near-synonyms, there are often register-governed conditions of use.

E.g. naive vs gullible vs ingenuous:
- "You're so bloody gullible [...]"
- "[...] outside on the pavement trying to entice gullible idiots in [...]"
- "You're so ingenuous. You tackle things the wrong way."
- "The commentator's ingenuous query could just as well have been prompted [...]"
- "However, it is ingenuous to suppose that peace process [...]"
(source: BNC)

Synonymy vs. similarity
The contextual theory of synonymy is based on the work of Wittgenstein (1953) and Firth (1957): "You shall know a word by the company it keeps" (Firth 1957).

Under this view, perfect synonyms might not exist.

But words can be judged as highly similar if people put them into the same linguistic contexts, and judge the change to be slight.

Synonymy vs. similarity: example
Miller & Charles (1991):

Weak contextual hypothesis: the similarity of the contexts in which two words appear contributes to the semantic similarity of those words.
E.g. snake is similar to [resp. a synonym of] serpent to the extent that we find snake and serpent in the same linguistic contexts. It is much more likely that snake/serpent will occur in similar contexts than snake/toad.

NB: this is not a discrete notion of synonymy, but a continuous definition of similarity.

The Miller/Charles experiment
Subjects were given sentences with missing words, and asked to place words they felt were OK in each context.

Method to compare words A and B:
- find sentences containing A
- find sentences containing B
- delete A and B from the sentences and shuffle them
- ask people to choose which sentences to place A and B in

Results: people tend to put similar words in the same context, and this is highly correlated with occurrence in similar contexts in corpora.

Issues with similarity
"Similar" is a much broader concept than "synonymous":

Contextually related, though differing in meaning:
- man / woman
- boy / girl
- master / pupil

Contextually related, but with opposite meanings:
- big / small
- clever / stupid

Uses of similarity
Assumption: semantically similar words behave in similar ways.

Information retrieval: query expansion with related terms

K nearest neighbours, e.g.:
- given: a set of elements, each assigned to some topic
- task: classify an unknown word w by topic
- method: find the topic that is most prevalent among w's semantic neighbours (see the sketch below)
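As an illustration of this scheme, here is a minimal Python sketch; the function `knn_topic`, the toy context sets and the topic labels are invented for illustration, and any of the word-similarity measures discussed later in this lecture could be plugged in as the `similarity` argument:

```python
from collections import Counter

def knn_topic(word, similarity, labelled_words, k=3):
    """Assign `word` the topic that is most prevalent among its k most similar
    labelled neighbours. `similarity(w1, w2)` can be any word-similarity function."""
    neighbours = sorted(labelled_words, key=lambda w: similarity(word, w), reverse=True)[:k]
    topics = Counter(labelled_words[w] for w in neighbours)
    return topics.most_common(1)[0][0]

# Toy setup: similarity computed from hand-built context sets, topic labels for known words.
contexts = {
    "banana": {"eat", "ripe", "fruit"}, "apple": {"eat", "ripe", "tree"},
    "sofa": {"sit", "room", "soft"}, "chair": {"sit", "room", "leg"},
    "mango": {"eat", "ripe", "tree"},
}
labels = {"banana": "FOOD", "apple": "FOOD", "sofa": "FURNITURE", "chair": "FURNITURE"}
jaccard = lambda a, b: len(contexts[a] & contexts[b]) / len(contexts[a] | contexts[b])

print(knn_topic("mango", jaccard, labels, k=3))   # -> FOOD
```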

Common approaches
Vector-space approaches:
- represent word w as a vector containing the words (or other features) in the context of w
- compare the vectors of w1 and w2: various vector-distance measures are available

Information-theoretic measures:
- w1 is similar to w2 to the extent that knowing about w1 increases my knowledge (decreases my uncertainty) about w2

Part 2: Vector-space models

Basic data structure
Matrix M, where Mij = no. of times wi co-occurs with wj (in some window). We can also have a document * word matrix.
We can treat matrix cells as boolean: if Mij > 0, then wi co-occurs with wj, else it does not.
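A minimal sketch of this data structure in Python, assuming the corpus is just a list of tokens and using a symmetric co-occurrence window (the function name and example sentence are illustrative, not from the lecture):

```python
from collections import defaultdict

def cooccurrence_matrix(tokens, window=2):
    """Matrix M as nested dicts: M[wi][wj] = no. of times wj occurs within
    +/- `window` tokens of wi."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[w][tokens[j]] += 1
    return counts

tokens = "the cosmonaut walked in space while the astronaut walked on the moon".split()
M = cooccurrence_matrix(tokens, window=2)
print(dict(M["cosmonaut"]))                        # raw co-occurrence counts
print({w: c > 0 for w, c in M["walked"].items()})  # boolean view: co-occurs or not
```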

Distance measures
Many measures take a set-theoretic perspective. Vectors can be:
- binary (indicate co-occurrence or not)
- real-valued (indicate frequency, or probability)
Similarity is a function of what the two vectors have in common.

Classic similarity/distance measures
For boolean vectors, i.e. sets of context words X and Y:
- Dice coefficient: 2|X ∩ Y| / (|X| + |Y|)
- Jaccard coefficient: |X ∩ Y| / |X ∪ Y|
(Real-valued analogues of both measures exist, as does cosine similarity, below.)

Dice vs. Jaccard
Dice(car, truck) on the boolean matrix: (2 * 2) / (4 + 2) = 0.66
Jaccard(car, truck) on the boolean matrix: 2 / 4 = 0.5
Dice is more generous; Jaccard penalises lack of overlap more.

Cosine similarity (= the angle between 2 vectors), for real-valued vectors x and y:
cos(x, y) = Σ xi·yi / (|x| |y|), where |x| is the vector's Euclidean length.
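These measures translate directly into code. In the sketch below the car/truck context sets are hypothetical, chosen only so that the numbers reproduce the worked example above (2 shared contexts, set sizes 4 and 2):

```python
import math

def dice(x, y):
    """Dice coefficient on two sets of context words."""
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x, y):
    """Jaccard coefficient on two sets of context words."""
    return len(x & y) / len(x | y)

def cosine(v, w):
    """Cosine similarity on two real-valued vectors (dicts: feature -> weight)."""
    dot = sum(v[f] * w.get(f, 0) for f in v)
    norm = lambda u: math.sqrt(sum(val * val for val in u.values()))
    return dot / (norm(v) * norm(w))

# Hypothetical context sets, sized to match the lecture's example.
car = {"engine", "road", "red", "wheel"}
truck = {"road", "wheel"}
print(dice(car, truck))      # (2 * 2) / (4 + 2) = 0.666...
print(jaccard(car, truck))   # 2 / 4 = 0.5
print(cosine({"road": 2, "wheel": 1}, {"road": 1, "wheel": 3}))  # real-valued example
```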

Part 3: Probabilistic approaches

Turning counts into probabilities
P(spacewalking | cosmonaut) = 1/2 = 0.5
P(red | car) = 1/4 = 0.25

NB: this transforms each row into a probability distribution corresponding to a word
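A small sketch of this row normalisation; the counts are invented, but chosen to reproduce the two probabilities above:

```python
def row_to_distribution(row):
    """Turn a row of co-occurrence counts (dict: context word -> count)
    into the conditional distribution P(context | word)."""
    total = sum(row.values())
    return {ctx: n / total for ctx, n in row.items()}

# Hypothetical counts chosen to reproduce the probabilities above:
cosmonaut = {"Soviet": 1, "spacewalking": 1}
car = {"Soviet": 1, "American": 1, "red": 1, "truck": 1}
print(row_to_distribution(cosmonaut)["spacewalking"])  # 0.5
print(row_to_distribution(car)["red"])                 # 0.25
```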

Probabilistic measures of distance
KL-Divergence: treat w1 as an approximation of w2:
D(p || q) = Σx p(x) log( p(x) / q(x) )

Problems:
- asymmetric: D(p || q) ≠ D(q || p), so not so useful for word-word similarity
- if the denominator q(x) = 0 for some x with p(x) > 0, then D(p || q) is undefined

Probabilistic measures of distance
Information radius (aka Jensen-Shannon divergence):
- compares the total divergence between p and q to the average of p and q
- symmetric!

Dagan et al (1997) showed this measure to be superior to KL-Divergence, when applied to a word sense disambiguation task.
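A sketch of both measures over distributions stored as dictionaries. Note that some formulations of the Jensen-Shannon divergence include a factor of 1/2; following the description above, this version is the plain sum of the two divergences from the average distribution:

```python
import math

def kl(p, q):
    """KL divergence D(p || q). Assumes q[x] > 0 wherever p[x] > 0; otherwise
    the divergence is undefined (the zero-denominator problem noted above)."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

def information_radius(p, q):
    """Information radius (Jensen-Shannon divergence): total divergence of p and q
    from their average distribution. Symmetric and always defined."""
    vocab = set(p) | set(q)
    avg = {x: 0.5 * (p.get(x, 0) + q.get(x, 0)) for x in vocab}
    return kl(p, avg) + kl(q, avg)

p = {"Soviet": 0.5, "spacewalking": 0.5}
q = {"American": 0.5, "spacewalking": 0.5}
print(information_radius(p, q))   # same value if p and q are swapped
```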

Some characteristics of vector-space measures
Very simple conceptually;

Flexible: can represent similarity based on document co-occurrence, word co-occurrence etc;

Vectors can be arbitrarily large, representing wide context windows;

Can be expanded to take into account grammatical relations (e.g. head-modifier, verb-argument, etc.).

Grammar-informed methods: Lin (1998)
Intuition: the similarity of any two things (words, documents, people, plants) is a function of the information gained by having:
- a joint description of a and b in terms of what they have in common
- compared to describing a and b separately

E.g. do we gain more by a joint description of:
- apple and chair (both THINGS)?
- apple and banana (both FRUIT: more specific)?

Lin's definition (cont'd)
Essentially, we compare the info content of the common definition to the info content of the separate definitions.

NB: essentially mutual information!

An application to corpora
From a corpus-based point of view, what do words have in common? Context, obviously.

How to define context?
- just bag-of-words (typical of vector-space models)
- or something more grammatically sophisticated

Kilgarriff's (2003) application
Definition of the notion of context, following Lin:

define F(w) as the set of grammatical contexts in which w occurs

a context is a triple <rel, w, w'>:
- rel is a grammatical relation
- w is the word of interest
- w' is the other word in rel

Grammatical relations can be obtained using a dependency parser.

(Grammatical co-occurrence matrix for "cell": source Jurafsky & Martin 2009, after Lin 1998.)

Example with w = cell
Example triples:

Observe that each triple f consists of the relation r, the second word in the relation w', and the word of interest w.

We can now compute the level of association between the word w and each of its triples f:

An information-theoretic measure that was proposed as a generalisation of the idea of pointwise mutual information.

Calculating similarity
Given that we have grammatical triples for our words of interest, the similarity of w1 and w2 is a function of:
- the triples they have in common
- the triples that are unique to each

I.e.: mutual info of what the two words have in common, divided by sum of mutual info of what each word has
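A hedged sketch of that combination: the association scores below are invented placeholders (in practice they would be information-theoretic association values estimated from dependency-parsed corpus triples), and the function only illustrates the "shared association mass over total association mass" idea, not Lin's exact formula:

```python
def lin_similarity(feats1, feats2):
    """Shared association mass over total association mass (cf. Lin 1998).
    feats1/feats2 map each grammatical-triple feature (rel, other word)
    to its association score with the word of interest."""
    common = set(feats1) & set(feats2)
    shared = sum(feats1[f] + feats2[f] for f in common)
    total = sum(feats1.values()) + sum(feats2.values())
    return shared / total if total else 0.0

# Hypothetical association scores for illustration only.
master = {("subj-of", "read"): 1.2, ("subj-of", "ask"): 0.8, ("mod", "good"): 0.9}
pupil = {("subj-of", "read"): 1.0, ("pp-at", "school"): 1.5, ("mod", "good"): 0.7}
print(lin_similarity(master, pupil))   # fraction of association mass the two words share
```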

Sample results: master & pupil
common:
- Subject-of: read, sit, know
- Modifier: good, form
- Possession: interest
master only:
- Subject-of: ask
- Modifier: past (cf. past master)
pupil only:
- Subject-of: make, find
- PP_at-p: school

Concrete implementation
The online SketchEngine gives the grammatical relations of words, plus a thesaurus which rates words by similarity to a head word.

This is based on the Lin 1998 model.

Limitations (or characteristics)
Only applicable as a measure of similarity between words of the same category:
- it makes no sense to compare grammatical relations of different-category words

Does not distinguish between near-synonyms and other similar words:
- student ~ pupil
- master ~ pupil

MI is sensitive to low-frequency events: a relation which occurs only once in the corpus can come out as highly significant.

Part 4: Applications of vector-space models to information retrieval

Information retrieval
Basic problem definition:
- store a (very large) collection of documents (newspaper articles, encyclopedia entries, Medline abstracts, HTML pages...)
- given a user query (some set of words), retrieve the documents that are most relevant to that query

Most IR systems take a bag-of-words approach: a document = the words it contains, with no syntactic information or higher-order semantic information.

IR architecture

Basic representation
Same as for semantic similarity, except that we use a document-by-term (= word) matrix.

A document d is represented as a vector whose cells contain term weights.

Example document representation
Fried eggplant recipe:
Place the flour, egg, and bread crumbs each in 3 small bowls. Add the 1/2 teaspoon of salt to the egg and whisk to combine. Season the bread crumbs with the tablespoon of Essence and stir with a fork or your hands to thoroughly combine. Dredge each piece of eggplant in the flour, coating thoroughly and then shaking to remove any excess flour. Coat each piece with the egg, then dredge in the bread crumb mixture, pressing to make the bread crumbs adhere. Transfer the eggplant pieces to a rack or to paper towels to let them dry slightly before frying. In a deep, heavy skillet heat 1/2-inch of vegetable oil to 375 degrees F. Fry the eggplant pieces, in batches if necessary, for about 1 minute on each side, or until golden brown. Transfer with tongs to paper towels to drain. Sprinkle lightly with salt before serving. Serve with marinara sauce, or as desired.

Document representation:

         flour  egg  bread crumb  eggplant
dj       3      3    4            3

The term weights are just the raw frequencies of terms in the document (for now).

Example query representation
User query: suppose the user types egg and breadcrumb.

Query rep could be:

         flour  egg  bread crumb  eggplant
q        0      1    1            0

More generally
Let d1 be the eggplant recipe, and d2 be a fried chicken recipe.

User query: egg and breadcrumb
Note: intuitively, this query should match both docs (both contain egg and breadcrumb).
Which doc would the query fried chicken match?

         flour  egg  bread crumb  eggplant  chicken
d1       3      3    4            3         0
d2       2      3    3            0         2

Query processing
We can use the same model as we used for computing word similarity to compute the degree of match between a query and a doc.

E.g. compute the cosine similarity between the query and each document vector. Documents can then be ranked by their similarity to the query, as in the sketch below.
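A small sketch of this step, treating the raw term-frequency rows for d1 and d2 (from the table above) as document vectors and the two-word query as the query vector:

```python
import math

def cosine(v, w):
    """Cosine similarity on term-weight dictionaries."""
    dot = sum(weight * w.get(term, 0) for term, weight in v.items())
    norm = lambda u: math.sqrt(sum(x * x for x in u.values()))
    return dot / (norm(v) * norm(w))

docs = {
    "d1 (eggplant recipe)": {"flour": 3, "egg": 3, "bread crumb": 4, "eggplant": 3},
    "d2 (chicken recipe)": {"flour": 2, "egg": 3, "bread crumb": 3, "chicken": 2},
}
query = {"egg": 1, "bread crumb": 1}

# Rank documents by their similarity to the query; both recipes match, as expected.
for doc in sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True):
    print(doc, round(cosine(query, docs[doc]), 3))
```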

Term weighting
So far, the intuition has been that frequent terms in a document capture the basic meaning of the document.
Another intuition: terms that crop up in only a few documents are more discriminative.

Inverse document frequency (IDF)
A way of giving a higher weight to more discriminative words:

    idf_i = log10(N / n_i)

where N = no. of docs in the collection and n_i = number of documents containing term i.

We combine IDF with TF (the term frequency): the weight of term i in document j is tf_ij * idf_i.

TF/IDF

         flour  egg  bread crumb  eggplant  chicken
tf (d1)  3      3    4            3         0
tf (d2)  2      3    3            0         2
idf      0      0    0            0.30      0.30

(With N = 2 docs, terms occurring in both documents get idf = log10(2/2) = 0; terms occurring in only one get idf = log10(2/1) ≈ 0.30.)

TF-IDF weighting
Properties: weights terms higher if they are:
- frequent in a document, AND
- rare in the document collection as a whole.
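A sketch of the weighting scheme applied to the same two-document example; base-10 logarithms are assumed, since that is what yields the 0.30 values above:

```python
import math

def tf_idf(doc_term_counts):
    """tf-idf weights: tf_ij * log10(N / n_i), where n_i is the number of
    documents in which term i occurs (its document frequency)."""
    n_docs = len(doc_term_counts)
    df = {}
    for counts in doc_term_counts.values():
        for term, c in counts.items():
            if c > 0:
                df[term] = df.get(term, 0) + 1
    idf = {term: math.log10(n_docs / n) for term, n in df.items()}
    weights = {doc: {term: c * idf[term] for term, c in counts.items()}
               for doc, counts in doc_term_counts.items()}
    return weights, idf

docs = {
    "d1": {"flour": 3, "egg": 3, "bread crumb": 4, "eggplant": 3, "chicken": 0},
    "d2": {"flour": 2, "egg": 3, "bread crumb": 3, "eggplant": 0, "chicken": 2},
}
weights, idf = tf_idf(docs)
print(idf)             # eggplant, chicken: log10(2/1) ~ 0.30; the rest: log10(2/2) = 0
print(weights["d1"])   # e.g. eggplant in d1: 3 * 0.30 ~ 0.90
```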

Modified similarity for query/document retrieval: we only take into account the words actually in the query.

Part 5: Evaluation of IR

Evaluation of IR systems
As with most NLP systems, we require some function that our system should maximise.

A lot of NLP evaluations rely on precision and recall.

Basic rationale
For a given classification problem, we have:
- a gold standard against which to compare
- our system's results, compared to the target gold standard: false positives (fp), false negatives (fn), true positives (tp), true negatives (tn)

Performance is typically measured in terms of precision and recall.

Precision
Definition: the proportion of items that are correctly classified, i.e. the proportion of true positives out of all the system's classifications:
P = tp / (tp + fp)

Recall
Definition: the proportion of the actual target (gold standard) items that our system classifies correctly:
R = tp / (tp + fn)
(The denominator is the total no. of items that should be correctly classified, including those the system doesn't get.)

Combining precision and recall
Typically we use the F-measure as a global estimate of performance against the gold standard:
F = 1 / (α/P + (1 − α)/R)
We need some factor α (alpha) to weight precision and recall; α = 0.5 gives them equal weighting.
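These three definitions translate directly into code; the tp/fp/fn counts in the example call are made up for illustration:

```python
def precision(tp, fp):
    """Proportion of the system's positive decisions that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Proportion of the gold-standard positives that the system actually finds."""
    return tp / (tp + fn)

def f_measure(p, r, alpha=0.5):
    """Weighted harmonic mean of precision and recall; alpha = 0.5 weights them equally."""
    return 1 / (alpha / p + (1 - alpha) / r)

# Made-up counts for illustration:
p, r = precision(tp=8, fp=2), recall(tp=8, fn=4)
print(p, r, f_measure(p, r))   # 0.8, 0.666..., ~0.727
```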

Precision/Recall in IR
We assume that the results returned by the IR system can be divided into:
- relevant docs (tp)
- non-relevant docs (fp)

Precision = the fraction of docs that are relevant out of the set of returned docs

Recall: fraction of docs that are relevant out of the whole set of relevant docs

Problem: IR systems tend to rank documents by relevance.

Method 1: interpolated precision and recall
We can split documents into equivalence classes (those at a given rank) and compute P and R at each rank.

(Worked example from J&M 2009, p. 807, based on a collection of 9 docs.)

Recall increases when relevant items are encountered; precision is very variable!

Method 1: interpolated precision and recall (cont'd)
Plot of max precision at different recall intervals.

Method 2: Mean average precision
Here, we just compute the average precision at or above a given rank.

Rr = set of relevant docs at or above rank r
Precision_r(d) = the precision at the rank at which document d was found
Average precision = (1 / |Rr|) * Σ over d in Rr of Precision_r(d)

NB: this metric favours systems that return relevant documents at high ranks. Is this justified?
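A sketch of average precision and its mean over a set of queries, assuming each system output is given as a rank-ordered list of relevance judgements (the two example result lists are invented):

```python
def average_precision(ranked_relevance):
    """Average precision for one query: mean of the precision values measured at
    each rank where a relevant document was returned.
    `ranked_relevance` is a list of booleans, one per returned doc, in rank order."""
    hits, precisions = 0, []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)   # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries):
    """Mean of the per-query average precisions."""
    return sum(average_precision(q) for q in queries) / len(queries)

# Two invented result lists (True = relevant doc at that rank):
print(mean_average_precision([
    [True, False, True, False, False],   # AP = (1/1 + 2/3) / 2 ~ 0.83
    [False, True, False, True, False],   # AP = (1/2 + 2/4) / 2 = 0.5
]))
```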

Part 6: Improving IR

Improving IR: simple techniques
A lot of queries will contain:
- morphologically inflected words (beans, etc.)
- function words (fried and breadcrumb)

Performance usually improves if:
- we perform some kind of stemming
- we use a stop list
(a toy sketch of both steps follows below)
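In the sketch below the stop list and the suffix list are deliberately tiny; a real system would use a full stop list and a proper stemmer (e.g. Porter's algorithm):

```python
STOP_WORDS = {"and", "the", "of", "a", "in", "to"}   # a tiny illustrative stop list

def normalise_query(query):
    """Lowercase, drop stop words, and apply a crude suffix-stripping 'stemmer'."""
    terms = []
    for word in query.lower().split():
        if word in STOP_WORDS:
            continue
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        terms.append(word)
    return terms

print(normalise_query("fried beans and breadcrumbs"))   # ['fri', 'bean', 'breadcrumb']
```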

Function words aren't very informative (cf. Zipf's law), while stemming allows us to identify query-doc relatedness in a more general fashion.

Using semantic information
Homonymy and polysemy can reduce precision:
- a query containing bank will match docs containing both senses of the word
- but the user will generally only be interested in one of the senses

Perhaps Word Sense Disambiguation can help improve IR? More on WSD next week.

Query expansion
One way to try to improve IR is to expand a query by finding similar words to the words in the query (using a thesaurus).
E.g. q = (Steve Jobs); q_exp = (Steve Jobs, Mac)

But this will depend on the quality of our thesaurus.

E.g. is it useful to expand the query dog with the related words cat, animal and horse? (These are actual examples from the BNC SketchEngine thesaurus.)

Or is it more useful to restrict our thesaurus to only synonyms (dog, canine etc)?