EE3J2 Data Mining Slide 1 EE3J2 Data Mining Lecture 7 Topic Spotting & Query Expansion Martin Russell

Dec 20, 2015

EE3J2 Data Mining Slide 1

EE3J2 Data Mining

Lecture 7: Topic Spotting & Query Expansion

Martin Russell

EE3J2 Data Mining Slide 2

Objectives

To introduce Topic Spotting
– Salience and Usefulness
– Example: The AT&T “How May I Help You?” system

Text retrieval revisited
– Query expansion
– Synonyms & hyponyms
– WordNet

EE3J2 Data Mining Slide 3

Topic Spotting

Type of dedicated IR system
– Always looking for the same type of documents
– Documents about a particular topic
– Corpus from which data is retrieved is dynamic

Examples
– Detect all weather forecasts in BBC Radio 4 broadcasts
– Find all documents written by Charlotte Brontë
– …

EE3J2 Data Mining Slide 4

Topic Spotting Queries

Examples rather than explicit queries
– ‘Find more like this’
– May have a substantial quantity of example data

But why is Topic Spotting different to IR?
– Can exploit the large amount of query data
– Can calculate better measures of the usefulness of a term than simple IDF

EE3J2 Data Mining Slide 5

IDF Revisited

Recall the definition of IDF:

$$IDF(t) = \log\left(\frac{ND}{ND_t}\right)$$

where ND is the total number of documents and ND_t is the number of documents that contain the term t.

IDF is concerned only with whether or not a term is contained in a document

But there is also information in the number of times the term occurs in documents
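
The limitation noted above can be seen in a small sketch. It assumes the usual definition IDF(t) = log(ND / ND_t) with natural logarithms (the slide does not state a base); the toy corpus and the function name `idf` are invented for illustration.

```python
import math

def idf(term, documents):
    """IDF(t) = log(ND / ND_t): ND documents in total, ND_t of them contain t."""
    nd = len(documents)
    nd_t = sum(1 for doc in documents if term in doc)
    if nd_t == 0:
        raise ValueError(f"term {term!r} occurs in no document")
    return math.log(nd / nd_t)

# Toy corpus (hypothetical): each document is a list of terms.
docs = [
    ["rain", "rain", "rain", "wind"],
    ["rain", "sun"],
    ["sun", "wind"],
    ["goal", "match"],
]

# "rain" and "sun" each occur in 2 of the 4 documents, so their IDF is equal,
# even though "rain" occurs four times in total and "sun" only twice.
print(idf("rain", docs), idf("sun", docs))
```

Because only document membership is counted, the extra occurrences of "rain" have no effect on its IDF.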

EE3J2 Data Mining Slide 6

Salience

Suppose we have a set of ‘training’ documents
– Some ‘on topic’
– Some ‘off topic’

Then, for a term t, we can estimate the probabilities:
– P(t | Topic)
– P(t | Not Topic)
– P(t)

EE3J2 Data Mining Slide 7

‘Salience’ and ‘Usefulness’

Given a term t and a topic T, define the salience of t (relative to T) by:

$$S(t) = P(T \mid t) \log \frac{P(T \mid t)}{P(T)}$$

Similarly, the usefulness of t (relative to T) is given by:

$$U(t) = P(t \mid T) \log \frac{P(t \mid T)}{P(t)}$$
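
As a minimal sketch of the two definitions, the functions below evaluate S(t) = P(T|t) log(P(T|t)/P(T)) and U(t) = P(t|T) log(P(t|T)/P(t)) directly. The probability values are hypothetical, and natural logarithms are an assumption (the slide does not fix a base).

```python
import math

def salience(p_T_given_t, p_T):
    """S(t) = P(T|t) * log(P(T|t) / P(T))."""
    return p_T_given_t * math.log(p_T_given_t / p_T)

def usefulness(p_t_given_T, p_t):
    """U(t) = P(t|T) * log(P(t|T) / P(t))."""
    return p_t_given_T * math.log(p_t_given_T / p_t)

# Hypothetical estimates for one term t and one topic T.
p_T = 0.1           # P(T): prior probability of the topic
p_T_given_t = 0.4   # P(T|t): probability of the topic once t has been seen
p_t = 0.02          # P(t): probability of the term overall
p_t_given_T = 0.08  # P(t|T): probability of the term in on-topic text

print("S(t) =", salience(p_T_given_t, p_T))
print("U(t) =", usefulness(p_t_given_T, p_t))
```

Both measures are positive when seeing the term makes the topic more likely than its prior, and negative when it makes the topic less likely.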

EE3J2 Data Mining Slide 8

Bayes’ Theorem

Remember Bayes’ Theorem?

$$P(T \mid t) = \frac{p(t \mid T)\, P(T)}{p(t)}$$

EE3J2 Data Mining Slide 9

Salience and Usefulness

$$
\begin{aligned}
S(t) &= P(T \mid t) \log \frac{P(T \mid t)}{P(T)} \\
     &= \frac{p(t \mid T)\, P(T)}{p(t)} \log \frac{p(t \mid T)\, P(T)}{p(t)\, P(T)} \\
     &= \frac{P(T)}{p(t)}\, p(t \mid T) \log \frac{p(t \mid T)}{p(t)} \\
     &= \frac{P(T)}{p(t)}\, U(t)
\end{aligned}
$$

EE3J2 Data Mining Slide 10

Salience and Usefulness

Now, T is the topic, so P(T) is fixed. Therefore

$$S(t) = \frac{P(T)}{p(t)}\, U(t)$$

$$S(t) \propto \frac{1}{p(t)}\, U(t)$$

EE3J2 Data Mining Slide 11

Salience and Usefulness

So the main difference between Salience and Usefulness is that, to have high usefulness, a term must occur frequently

$$S(t) \propto \frac{1}{p(t)}\, U(t)$$
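
A small numerical check can make this concrete. The sketch below compares two hypothetical terms that are equally indicative of the topic (each is four times more likely on topic than overall) but differ in frequency by a factor of 100. All probabilities are invented, and natural logarithms are assumed.

```python
import math

def usefulness(p_t_given_T, p_t):
    return p_t_given_T * math.log(p_t_given_T / p_t)

def salience(p_t_given_T, p_t, p_T):
    p_T_given_t = p_t_given_T * p_T / p_t   # Bayes' theorem
    return p_T_given_t * math.log(p_T_given_t / p_T)

p_T = 0.1  # hypothetical topic prior

# Two hypothetical terms, both 4x more likely on topic than overall,
# but the second is 100x rarer than the first.
frequent = dict(p_t_given_T=0.08, p_t=0.02)
rare = dict(p_t_given_T=0.0008, p_t=0.0002)

for name, term in (("frequent", frequent), ("rare", rare)):
    S = salience(p_T=p_T, **term)
    U = usefulness(**term)
    # S(t) = (P(T) / p(t)) * U(t) holds for both terms.
    assert math.isclose(S, p_T / term["p_t"] * U)
    print(name, "S =", round(S, 4), "U =", round(U, 6))
```

The two terms have the same salience, but the rare term's usefulness is 100 times smaller, which is the sense in which usefulness rewards frequent terms.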

EE3J2 Data Mining Slide 12

Example

A term w occurs:
– t1 times in documents about topic T
– t2 times in documents which are not about topic T

Total number of terms:
– in documents about topic T is N1
– in documents not about topic T is N2

Then: P(w|T) = t1/N1, P(w) = (t1+t2)/(N1+N2)

So:

$$U(w) = \frac{t_1}{N_1} \log \left( \frac{t_1 / N_1}{(t_1 + t_2) / (N_1 + N_2)} \right)$$

EE3J2 Data Mining Slide 13

Example

A term w occurs:
– 150 times in documents about topic T
– 230 times in documents which are not about topic T

Total number of terms:
– in documents about topic T is 12,500
– in documents not about topic T is 23,100

So:
– p(w|T) = 0.012, p(w) = 0.0107, log(p(w|T)/p(w)) = 0.051 (base-10 logarithm)
– U(w) = p(w|T) × log(p(w|T)/p(w)) = 0.012 × 0.051 = 0.00061
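
The arithmetic can be reproduced with a few lines of Python. This is only a sketch: it assumes base-10 logarithms (the base that yields the quoted 0.051) and uses the definition U(w) = p(w|T) log(p(w|T)/p(w)); the variable names are illustrative.

```python
import math

t1, t2 = 150, 230        # occurrences of w on topic / off topic
N1, N2 = 12_500, 23_100  # total number of terms on topic / off topic

p_w_given_T = t1 / N1                      # 0.012
p_w = (t1 + t2) / (N1 + N2)                # about 0.0107
log_ratio = math.log10(p_w_given_T / p_w)  # about 0.051 (base-10 log assumed)

U = p_w_given_T * log_ratio
print(round(p_w_given_T, 4), round(p_w, 4), round(log_ratio, 3), round(U, 5))
# prints: 0.012 0.0107 0.051 0.00061
```

Multiplying the log ratio by p(w|T) = 0.012, as the definition requires, gives roughly 0.00061.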

EE3J2 Data Mining Slide 14

Example (continued)

P(T) = NT/N, where NT and N are the total number of documents about topic T and the total number of documents, respectively

So if, say, 1 document in 100 is about the topic, then P(T) = 0.01

Then S(w) = (P(T)/p(w)) × U(w)
          = (0.01/0.0107) × U(w)
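
To carry the example through to a number, the sketch below evaluates S(w) from the same counts. It assumes base-10 logarithms and U(w) = p(w|T) log(p(w|T)/p(w)); the variable names are illustrative.

```python
import math

p_T = 0.01                             # 1 document in 100 is about the topic
p_w_given_T = 150 / 12_500             # 0.012
p_w = (150 + 230) / (12_500 + 23_100)  # about 0.0107

U = p_w_given_T * math.log10(p_w_given_T / p_w)  # base-10 log assumed
S = (p_T / p_w) * U

print(round(p_T / p_w, 3), round(S, 5))
# prints: 0.937 0.00057
```

Here the scaling factor P(T)/p(w) is close to 1, so salience and usefulness are similar in size; for a much rarer term the factor would be far larger.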

EE3J2 Data Mining Slide 15

Topic spotter

[Figure: block diagram with the labels ‘Data stream’ and ‘Topic Spotter’]

EE3J2 Data Mining Slide 16

Example

The AT&T “How May I Help You?” system
– Task: to understand what AT&T customers say to operators
– Look up HMIHY? on the web

EE3J2 Data Mining Slide 17

AT&T How May I Help You?

[Figure: block diagram with the labels ‘Speech Recognition’, ‘Language Processing’, ‘Salient word list’, and ‘Service 1’, ‘Service 2’, ‘Service 3’, ‘Service 15’]

EE3J2 Data Mining Slide 18

AT&T How May I Help You?

HMIHY? treats telephone network services as topics or documents, to be detected or retrieved

Example salient words:

Word        Salience    Word         Salience
Difference  4.04        Dialled      1.29
Cost        3.39        Area         1.28
Rate        3.37        Time         1.23
Much        3.24        Person       1.23
Emergency   2.23        Charge       1.22
Misdialed   1.43        Home         1.13
Wrong       1.37        Information  1.11
code        1.36        credit       1.11

Allen Gorin, “Processing of semantic information in fluent spoken language”, Proc. ICSLP 1996

EE3J2 Data Mining Slide 19

HMIHY Demonstrations.

See http://www.research.att.com/~algor/hmihy/samples.html

EE3J2 Data Mining Slide 20

Query Processing

Remember how we previously processed a query:

Example:
– “I need information on distance running”

Stop word removal
– information, distance, running

Stemming
– information, distance, run

But what about:
– “The London marathon will take place…”
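
The processing steps above can be sketched as follows. The stop-word list and the `toy_stem` function are deliberately tiny, hypothetical stand-ins chosen to reproduce this example; a real system would use a full stop list and a proper stemmer such as Porter's.

```python
# Hypothetical stop list, just large enough for the two example sentences.
STOP_WORDS = {"i", "need", "on", "the", "will", "take", "place"}

def toy_stem(word):
    """Very crude stemmer: strips a final 'ing' and undoubles the last letter.
    A stand-in for a real stemming algorithm."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) >= 2 and word[-1] == word[-2]:
            word = word[:-1]
    return word

def process_query(text):
    """Lower-case, strip punctuation, remove stop words, then stem."""
    words = [w.strip('.,?!…"“”').lower() for w in text.split()]
    content = [w for w in words if w and w not in STOP_WORDS]
    return [toy_stem(w) for w in content]

query = process_query("I need information on distance running")
document = process_query("The London marathon will take place…")
print(query)     # ['information', 'distance', 'run']
print(document)  # ['london', 'marathon']
print(set(query) & set(document))  # set(): no terms in common
```

The query and the sentence share no terms after processing, so a purely term-matching retrieval system would not connect them.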

EE3J2 Data Mining Slide 21

Synonyms, hyponyms, hypernyms & antonyms

We know there is a relationship between
– run, distance, and
– marathon

We know that a ‘marathon’ is a ‘long distance run’

Words with the same meaning are synonyms

If a query q contains a word w1 and w2 is a synonym of w1, then w2 should be added to q

This is an example of query expansion

EE3J2 Data Mining Slide 22

Thesaurus

A thesaurus is a ‘dictionary’ of synonyms and semantically related words and phrases

E.g. Roget’s Thesaurus

Example:

physician
syn: croaker, doc, doctor, MD, medical, mediciner, medico
rel: medic, general practitioner, surgeon
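
Synonym-based query expansion with such an entry might look like the sketch below. The dictionary holds only the ‘physician’ synonyms listed above, and the function name `expand_with_synonyms` is invented for illustration; a real system would load a complete thesaurus.

```python
# Toy thesaurus containing only the 'physician' synonyms from the example.
THESAURUS = {
    "physician": ["croaker", "doc", "doctor", "MD", "medical", "mediciner", "medico"],
}

def expand_with_synonyms(query_terms, thesaurus):
    """Return the query terms followed by any synonyms not already present."""
    expanded = list(query_terms)
    for term in query_terms:
        for synonym in thesaurus.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded

print(expand_with_synonyms(["find", "physician"], THESAURUS))
```

Terms without a thesaurus entry, such as "find" here, pass through unchanged.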

EE3J2 Data Mining Slide 23

Hyponyms

Not only synonyms are useful for query expansion

Query q = “Tell me about England”

Document d = “A visit to London should be on everyone’s itinerary”

‘London’ is a hyponym of ‘England’

Hyponym ~ subordinate ~ subset

If a query q contains a word w1 and w2 is a hyponym of w1, then w2 should be added to q

EE3J2 Data Mining Slide 24

Hypernyms

Hypernyms are also useful for query expansion

Query q = “Tell me about England”

Document d = “Places to visit in the British Isles”

‘British Isles’ is a hypernym of ‘England’

Hypernym ~ generalisation ~ superset

If a query q contains a word w1 and w2 is a hypernym of w1, then w2 should be added to q
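
Expansion with hyponyms and hypernyms can be sketched with a small hand-built hierarchy. The child-to-parent map below contains only the London / England / British Isles examples from the slides; the function names are hypothetical.

```python
# Toy hierarchy taken from the slide examples: child -> parent (hypernym).
HYPERNYM_OF = {
    "London": "England",
    "England": "British Isles",
}

def hypernyms(word):
    """All ancestors of word, nearest first."""
    result = []
    while word in HYPERNYM_OF:
        word = HYPERNYM_OF[word]
        result.append(word)
    return result

def hyponyms(word):
    """All descendants of word, found by inverting the child -> parent map."""
    children = [c for c, p in HYPERNYM_OF.items() if p == word]
    result = list(children)
    for child in children:
        result.extend(hyponyms(child))
    return result

def expand(query_terms):
    """Add every hyponym and hypernym of each query term."""
    expanded = list(query_terms)
    for term in query_terms:
        for related in hyponyms(term) + hypernyms(term):
            if related not in expanded:
                expanded.append(related)
    return expanded

print(expand(["England"]))  # ['England', 'London', 'British Isles']
```

With the expanded query, both example documents (the one mentioning London and the one mentioning the British Isles) now share a term with the query.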

EE3J2 Data Mining Slide 25

Antonyms

An antonym is a word which is opposite in meaning to another (e.g. bad and good)

The occurrence of an antonym can also be relevant

EE3J2 Data Mining Slide 26

WordNet

Online lexical database for the English language

http://www.cogsci.princeton.edu/~wn

Category    Forms   Meanings (syn sets)
Nouns       57,000  48,800
Adjectives  19,500  10,000
Verbs       21,000  8,400

See Belew, chapter 6

EE3J2 Data Mining Slide 27

WordNet

WordNet is organised as a set of hierarchical trees

For nouns, there are 25 trees

Children of a node correspond to hyponyms

Words become more specific as you move deeper into the tree

EE3J2 Data Mining Slide 28

Noun Categories

act, action, activity     natural object
animal, fauna             natural phenomenon
artefact                  person, human being
attribute, property       plant, flora
body, corpus              possession
cognition, knowledge      process
communication             quantity, amount
event, happening          relation
feeling, emotion          shape
food                      state, condition
group, collection         substance
location, place           time
motive

EE3J2 Data Mining Slide 29

Query-document scoring

A query q is expanded to include hyponyms and synonyms

Recall that for a document d

$$w_{td} = f_{td} \cdot IDF(t)$$

$$Sim(q, d) = \frac{\sum_{t \in q \cap d} w_{td}\, w_{tq}}{\|q\| \, \|d\|}$$
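
A minimal sketch of this scoring follows, assuming term weights of the form w = f × IDF(t) and a similarity that sums the products of shared-term weights and divides by the lengths of the two weight vectors. The IDF values and texts are hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(terms, idf):
    """w_t = f_t * IDF(t) for every term in a document or query."""
    counts = Counter(terms)
    return {t: f * idf.get(t, 0.0) for t, f in counts.items()}

def norm(weights):
    """Euclidean length of a weight vector."""
    return math.sqrt(sum(w * w for w in weights.values()))

def sim(query_terms, doc_terms, idf):
    """Sum of w_td * w_tq over shared terms, divided by the two vector lengths."""
    wq = tf_idf_weights(query_terms, idf)
    wd = tf_idf_weights(doc_terms, idf)
    denom = norm(wq) * norm(wd)
    if denom == 0.0:
        return 0.0
    shared = set(wq) & set(wd)
    return sum(wq[t] * wd[t] for t in shared) / denom

# Hypothetical IDF values and already-processed texts.
IDF = {"distance": 1.2, "run": 0.9, "marathon": 2.0, "london": 1.5}
query = ["distance", "run"]
doc_a = ["distance", "run", "run"]
doc_b = ["london", "marathon"]

print(sim(query, doc_a, IDF), sim(query, doc_b, IDF))
```

Without expansion, `doc_b` scores zero against the query because it shares no terms with it, which is the gap that query expansion is meant to close.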

EE3J2 Data Mining Slide 30

Query expansion

Suppose:
– t is the original term in the query,
– t’ is a synonym or hyponym of t which occurs in d

Then we could define:

$$w_{t'd} = \lambda_{tt'} \cdot f_{t'd} \cdot IDF(t), \qquad 0 \le \lambda_{tt'} \le 1$$

where $\lambda_{tt'}$ is a weighting that depends on how ‘far’ apart t and t’ are according to WordNet
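
One way such a weighting could behave is sketched below. The slides do not say how the weighting is computed; the geometric decay with the number of WordNet links between t and t’, the `decay` parameter and both function names are assumptions made purely for illustration.

```python
def expansion_weight(path_length, decay=0.5):
    """Hypothetical weighting in [0, 1]: 1 for the term itself (distance 0),
    shrinking geometrically with the number of links between t and t'."""
    return decay ** path_length

def expanded_term_weight(freq_in_doc, idf_t, path_length):
    """Weight of t' in d: weighting(t, t') * frequency of t' in d * IDF(t)."""
    return expansion_weight(path_length) * freq_in_doc * idf_t

# Hypothetical numbers: t' occurs 3 times in d and IDF(t) = 2.0.
for links in (0, 1, 2):
    print(links, expanded_term_weight(3, 2.0, links))
```

With this choice, a term that matches the query directly keeps its full weight, while more distant relatives contribute progressively less to the score.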

EE3J2 Data Mining Slide 31

Summary

Topic spotting:
– Salience and usefulness
– How May I Help You?

Query expansion

WordNet