A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets
Mehran Sahami, Timothy D. Heilman

Mehran Sahami

Feb 01, 2016


A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets. Timothy D. Heilman. Mehran Sahami. Introduction. Wish to determine how similar two short text snippets are. High degree of semantic similarity: United Nations Secretary General vs Kofi Annan.
Transcript
Page 1: Mehran Sahami

Mehran Sahami, Timothy D. Heilman

A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets

Page 2: Mehran Sahami

Introduction

Wish to determine how similar two short text snippets are.

High degree of semantic similarity:
  United Nations Secretary General vs Kofi Annan
  AI vs Artificial Intelligence

Share terms:
  graphical models vs graphical interface

Page 3: Mehran Sahami

Related Work

Query expansion techniques
Other means of determining query similarity
Set overlap (intersection)
SVM for text classification
Latent Semantic Kernels (LSK)
Semantic Proximity Matrix
Cross-lingual techniques

Page 4: Mehran Sahami

A New Similarity Function

Let $x$ represent a short text snippet (query) to a search engine S

Let $R(x) = \{d_1, d_2, \ldots, d_n\}$ be the set of n retrieved documents

Compute the TFIDF term vector $v_i$ for each document $d_i \in R(x)$

Truncate each vector $v_i$ to include its m highest weighted terms
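The truncation step can be illustrated with a minimal Python sketch. It assumes each TFIDF vector is stored as a dictionary from term to weight; the function name `truncate_top_m`, the tie-breaking rule, and the sample weights are illustrative assumptions, not taken from the slides.

```python
from typing import Dict


def truncate_top_m(vector: Dict[str, float], m: int) -> Dict[str, float]:
    """Keep only the m highest-weighted terms of a sparse term vector.

    Ties are broken alphabetically so the result is deterministic
    (an assumption; the slides do not say how ties are handled).
    """
    ranked = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:m])


# Hypothetical weights, for illustration only.
v = {"kernel": 0.9, "similarity": 0.7, "web": 0.4, "the": 0.05}
print(truncate_top_m(v, 2))
```

Storing the vectors sparsely keeps the later centroid and dot-product steps proportional to the number of retained terms rather than to the vocabulary size.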


Page 5: Mehran Sahami

Normalize

Let $C(x)$ be the centroid of the L2 normalized vectors $v_i$:

$$C(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{v_i}{\|v_i\|_2}$$

Let $QE(x)$ be the L2 normalization of the centroid $C(x)$:

$$QE(x) = \frac{C(x)}{\|C(x)\|_2}$$

Page 6: Mehran Sahami

Kernel Function

$$K(x, y) = QE(x) \cdot QE(y)$$
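The centroid, query expansion, and kernel definitions can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the document vectors are already retrieved, TFIDF-weighted, and truncated, and they are stored as term-to-weight dictionaries. The names `l2_normalize`, `query_expansion`, and `kernel` are illustrative, and no search engine is actually called.

```python
import math
from collections import defaultdict
from typing import Dict, List

Vector = Dict[str, float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a sparse vector to unit L2 length (empty if it is all zeros)."""
    norm = math.sqrt(sum(w * w for w in v.values()))
    if norm == 0.0:
        return {}
    return {t: w / norm for t, w in v.items()}


def query_expansion(doc_vectors: List[Vector]) -> Vector:
    """QE(x): the L2-normalized centroid C(x) of the L2-normalized v_i."""
    if not doc_vectors:
        return {}
    n = len(doc_vectors)
    centroid: Dict[str, float] = defaultdict(float)
    for v in doc_vectors:
        for term, weight in l2_normalize(v).items():
            centroid[term] += weight / n
    return l2_normalize(dict(centroid))


def kernel(qe_x: Vector, qe_y: Vector) -> float:
    """K(x, y): dot product of the two query expansions."""
    return sum(w * qe_y.get(t, 0.0) for t, w in qe_x.items())
```

Because both expansions have unit L2 length, the dot product is the cosine of the angle between them, so for non-negative term weights the score falls between 0 and 1.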


Page 7: Mehran Sahami

Initial Results with Kernel

Three genres of text snippet matching:
  Acronyms
  Individuals and their positions
  Multi-faceted terms

Page 8: Mehran Sahami

Acronyms

Text1                                      Text2   Kernel  Cosine  Set Overlap
Support vector machine                     SVM     0.812   0.0     0.110
Portable document format                   PDF     0.732   0.0     0.060
Artificial intelligence                    AI      0.831   0.0     0.255
Artificial insemination                    AI      0.391   0.0     0.000
term frequency inverse document frequency  tf idf  0.831   0.0     0.125
term frequency inverse document frequency  tfidf   0.507   0.0     0.060
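The zero entries in the Cosine column are what one would expect if cosine is computed directly on the words of the two snippets, since an acronym and its expansion share no terms. The sketch below illustrates that reading; the whitespace tokenization and raw term counts are assumptions, as the slides do not say how the Cosine baseline was computed.

```python
import math
from collections import Counter


def cosine(text1: str, text2: str) -> float:
    """Cosine similarity of raw term-count vectors (lowercased, split on whitespace)."""
    a = Counter(text1.lower().split())
    b = Counter(text2.lower().split())
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# No shared terms, so the score is zero.
print(cosine("Support vector machine", "SVM"))
# One shared term out of two on each side, so the score is about 0.5.
print(cosine("graphical models", "graphical interface"))
```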


Page 9: Mehran Sahami

Individuals and their positions

Page 10: Mehran Sahami

Multi-faceted terms

Page 11: Mehran Sahami

Related Query Suggestion

Kernel function $K(u, q_i)$ for $q_i \in Q$
  u is any newly issued user query
  A repository Q of approximately 116 million popular user queries issued in 2003, determined by sampling anonymized web search logs from the Google search engine

Page 12: Mehran Sahami

Algorithm

Given: user query $u$ and list of matched queries from repository
Output: list $Z$ of queries to suggest

Initialize suggestion list $Z$
Sort kernel scores $K(u, q_i)$ in descending order to produce an ordered list $L = (q_1, q_2, \ldots, q_k)$ of corresponding queries $q_i$
MAX is set to the maximum number of suggestions
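The ranking part of this procedure can be sketched as follows. The kernel scores are assumed to be precomputed and passed in as a dictionary. The slide's filtering rule is not reproduced in the transcript, so it appears here only as a caller-supplied placeholder predicate `keep` (a hypothetical parameter that accepts everything by default). The function name `suggest_queries` is also illustrative.

```python
from typing import Callable, Dict, List


def suggest_queries(
    u: str,
    scores: Dict[str, float],
    max_suggestions: int,
    keep: Callable[[str, List[str], str], bool] = lambda q, z, u: True,
) -> List[str]:
    """Return up to max_suggestions queries, highest kernel score first.

    scores maps each matched query q_i to K(u, q_i).
    keep(q, Z, u) is a placeholder for the post-filter, which is not
    given in the transcript; by default every candidate is accepted.
    """
    # Ordered list L: descending score, ties broken alphabetically.
    ranked = sorted(scores, key=lambda q: (-scores[q], q))
    suggestions: List[str] = []  # the suggestion list Z
    for q in ranked:
        if len(suggestions) >= max_suggestions:
            break
        if keep(q, suggestions, u):
            suggestions.append(q)
    return suggestions


# Scores taken from the "california lottery" rows of the Evaluations slide.
scores = {
    "california lottery super lotto plus": 0.778,
    "california lotto home": 0.812,
    "winning lotto numbers in california": 0.792,
}
print(suggest_queries("california lottery", scores, max_suggestions=2))
```

Passing the filter in as a function keeps the ranking logic independent of whichever filtering rule is eventually used.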


Page 13: Mehran Sahami

Post-Filter

|q| denotes the number of terms in query q

Page 14: Mehran Sahami

Evaluation of Query Suggestion System

1. suggestion is totally off topic.
2. suggestion is not as good as original query.
3. suggestion is basically same as original query.
4. suggestion is potentially better than original query.
5. suggestion is fantastic - should suggest this query since it might help a user find what they're looking for if they issued it instead of the original query.

Page 15: Mehran Sahami

Evaluations

Original Query      Suggested Queries                    Kernel Score  Human Rating
california lottery  california lotto home                0.812         3
                    winning lotto numbers in california  0.792         5
                    california lottery super lotto plus  0.778         3
valentines day      2003 valentine's day                 0.832         3
                    valentine day card                   0.822         4
                    valentines day greeting cards        0.758         4
                    I love you valentine                 0.736         2
                    new valentine one                    0.671         1

Page 16: Mehran Sahami

Average ratings at various kernel thresholds

Page 17: Mehran Sahami

Average ratings versus average number of query suggestions

Page 18: Mehran Sahami

Application in QA

K("Who shot Abraham Lincoln", "John Wilkes Booth") = 0.730

K("Who shot Abraham Lincoln", "Abraham Lincoln") = 0.597


Page 19: Mehran Sahami

Conclusion

A new kernel function for measuring the semantic similarity between pairs of short text snippets

The first area for improvement is the generation of query expansions, with the goal of improving the match score for the kernel function

Page 20: Mehran Sahami

Term Weighting Scheme

The weight $w_{i,j}$ associated with the term $t_i$ in document $d_j$ is defined to be:

$$w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right)$$

where $tf_{i,j}$ is the frequency of $t_i$ in $d_j$, $N$ is the total number of documents, and $df_i$ is the total number of documents that contain $t_i$.
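As a small worked example of this weighting, the sketch below computes a single weight from a term frequency, a corpus size, and a document frequency. The function name `tfidf_weight` and the sample numbers are illustrative, and the natural logarithm is an assumption since the slide does not state the base.

```python
import math


def tfidf_weight(tf: float, n_docs: int, df: int) -> float:
    """w_ij = tf_ij * log(N / df_i), using the natural log (an assumption)."""
    if n_docs <= 0 or df <= 0:
        raise ValueError("n_docs and df must be positive")
    return tf * math.log(n_docs / df)


# A term that occurs twice in a document and appears in 10 of 100 documents.
print(tfidf_weight(tf=2, n_docs=100, df=10))
# A term that appears in every document carries no weight.
print(tfidf_weight(tf=5, n_docs=100, df=100))
```

A different log base only rescales every weight by the same constant, which the later L2 normalization removes.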

Page 21: Mehran Sahami

Lp Norm

Given by:

$$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$$

Most common cases:
  p = 1: this is the L1 norm, which is also called the Manhattan distance
  p = 2: this is the L2 norm, which is also called the Euclidean distance
  p = ∞: this is the L∞ norm, also called the infinity norm or the Chebyshev norm
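The three common cases can be checked with a short Python sketch using the standard definition of the Lp norm. The function name `lp_norm` and the sample vector are illustrative assumptions.

```python
import math
from typing import Sequence


def lp_norm(x: Sequence[float], p: float) -> float:
    """Lp norm of x: (sum |x_i|^p)^(1/p), or max |x_i| when p is infinity."""
    if p == math.inf:
        return max((abs(v) for v in x), default=0.0)
    if p < 1:
        raise ValueError("p must be >= 1")
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


x = [3.0, -4.0]
print(lp_norm(x, 1))         # L1 (Manhattan): |3| + |-4|
print(lp_norm(x, 2))         # L2 (Euclidean): sqrt(9 + 16)
print(lp_norm(x, math.inf))  # L-infinity (Chebyshev): max(|3|, |-4|)
```

The L2 case is the one used to normalize the document vectors and the centroid in the kernel definition.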