Page 1: Artificial Intelligence

CORPUS STRUCTURE, LANGUAGE MODELS, AND AD HOC

INFORMATION RETRIEVAL

Oren Kurland and Lillian Lee

Department of Computer Science

Cornell University, Ithaca, NY

Page 2: Artificial Intelligence

INFORMATION RETRIEVAL

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).


Page 3: Artificial Intelligence

INFORMATION RETRIEVAL (CONTD.)


Page 4: Artificial Intelligence

CLUSTERING

Clustering: the process of grouping a set of objects into classes of similar objects.

Documents within a cluster should be similar.

Documents from different clusters should be dissimilar.

Clustering is the commonest form of unsupervised learning: learning from raw data, as opposed to supervised learning, where a classification of the examples is given.

A common and important task that finds many applications in IR and elsewhere.


Page 5: Artificial Intelligence

A DATA SET WITH CLEAR CLUSTER STRUCTURE

Ch. 16


Page 6: Artificial Intelligence

CLUSTERING (CONTD.)


Page 7: Artificial Intelligence

THE BIG PICTURE


Page 8: Artificial Intelligence

TERMINOLOGY

An IR system looks for data matching some criteria defined by the users in their queries.

The language used to ask a question is called the query language.

These queries use keywords (atomic items characterizing some data).

The basic unit of data is a document (can be a file, an article, a paragraph, etc.).

A document corresponds to free text (may be unstructured).

All the documents are gathered into a collection (or corpus).


Page 9: Artificial Intelligence

TERM FREQUENCY

A document is treated as a set of words.

Each word characterizes that document to some extent.

Once stop words have been eliminated, the most frequent words tend to indicate what the document is about.

Therefore f_{k,d}, the number of occurrences of word k in document d, will be an important measure.

It is also called the term frequency (tf).

Page 10: Artificial Intelligence

DOCUMENT FREQUENCY

What makes this document distinct from the others in the corpus?

The terms which discriminate best are not those which occur with high document frequency!

Therefore d_k, the number of documents in which word k occurs, will also be an important measure.

It is also called the document frequency (df); its inverse is the basis of the idf score below.


Page 11: Artificial Intelligence

TF.IDF

This can all be summarized as: words are the best discriminators when they

occur often in this document (term frequency), and

do not occur in a lot of documents (document frequency).

One very common measure of the importance of a word to a document is TF.IDF: term frequency x inverse document frequency.

There are multiple formulas for actually computing it. The underlying concept is the same in all of them.

Page 12: Artificial Intelligence

TERM WEIGHTS

tf-score: tf_{i,j} = frequency of term i in document j.

idf-score: idf_i = inverse document frequency of term i, idf_i = log(N / n_i), with

N, the size of the document collection (number of documents), and

n_i, the number of documents in which term i occurs.

idf_i measures the rarity of term i in the document collection: n_i / N is the proportion of the collection in which term i occurs, and idf_i is the log of its inverse.

Term weight of term i in document j (TF-IDF): w_{i,j} = tf_{i,j} x idf_i.
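
To make the formulas concrete, here is a minimal Python sketch (not from the slides) computing these weights for a toy corpus, using raw counts for tf and log(N/n_i) for idf, one of the several equivalent formulations mentioned above:

import math
from collections import Counter

def tf_idf(docs):
    # docs: list of documents, each a list of tokens
    N = len(docs)
    df = Counter()                 # n_i: number of documents containing term i
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)          # tf_{i,j}: count of term i in document j
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

docs = ["the cat sat on the mat".split(),
        "the dog chased the cat".split(),
        "dogs and cats are pets".split()]
print(tf_idf(docs)[0])   # "mat" (rare) outweighs "the" (common)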


Page 13: Artificial Intelligence

LANGUAGE MODELS FOR IR

Language modeling approaches attempt to model the query generation process.

Documents are ranked by the probability that the query would be observed as a random sample from the respective document model.

A multinomial approach.


Page 14: Artificial Intelligence

LANGUAGE MODELS

A probability distribution over word sequences:

p("Today is Wednesday") ≈ 0.001
p("Today Wednesday is") ≈ 0.0000000000001
p("The eigenvalue is positive") ≈ 0.00001

Context/topic dependent!

Treat each document as the basis for a model (e.g., unigram sufficient statistics).

Rank document d based on P(d | q). By Bayes' rule, P(d | q) = P(q | d) x P(d) / P(q).

P(q) is the same for all documents, so it can be ignored.

P(d), the prior, is often treated as the same for all d, but we could use criteria like authority, length, or genre.

P(q | d) is the probability of q given d's model.

A very general formal approach.


Page 15: Artificial Intelligence


THE SIMPLEST LANGUAGE MODEL (UNIGRAM MODEL)

Generate a piece of text by generating each word independently.

Thus, p(w_1 w_2 ... w_n) = p(w_1) p(w_2) ... p(w_n).

Parameters: {p(w_i)}, with p(w_1) + ... + p(w_N) = 1 (N is the vocabulary size).

A piece of text can be regarded as a sample drawn according to this word distribution.
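
As a small illustration (not from the slides), this sketch evaluates the product for an assumed toy distribution, summing log-probabilities to avoid numerical underflow:

import math

model = {"today": 0.2, "is": 0.3, "wednesday": 0.1,
         "the": 0.35, "eigenvalue": 0.05}    # toy {p(w_i)}, summing to 1

def log_prob(words, p):
    # log p(w_1 ... w_n) = sum_i log p(w_i), by the independence assumption
    return sum(math.log(p[w]) for w in words)

print(math.exp(log_prob("today is wednesday".split(), model)))   # 0.2*0.3*0.1 = 0.006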

Page 16: Artificial Intelligence

SMOOTHING

• Smoothing is an important issue, and distinguishes different approaches.

• Many smoothing methods are available. Which works best depends on the data and the task! Cross-validation is generally used to choose the best method and/or to set the smoothing parameters.

• For retrieval, the Dirichlet prior performs well. Backoff smoothing [Katz 87] doesn't work well, due to a lack of 2nd-stage smoothing.
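
The slides do not spell out the Dirichlet-prior formula; the standard form (e.g., Zhai & Lafferty 01) is p(w|d) = (c(w,d) + μ p(w|REF)) / (|d| + μ). A minimal sketch, with μ and the reference-model probability as assumed inputs:

def dirichlet_smooth(count_w_d, doc_len, p_w_ref, mu=2000.0):
    # p(w|d) = (c(w,d) + mu * p(w|REF)) / (|d| + mu);
    # mu is typically set by cross-validation, as noted above
    return (count_w_d + mu * p_w_ref) / (doc_len + mu)

# a word seen 5 times in a 100-word document, with collection probability 0.001:
print(dirichlet_smooth(5, 100, 0.001))   # (5 + 2) / 2100 ≈ 0.00333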

Page 17: Artificial Intelligence

Another Reason for Smoothing

Query = "the algorithms for data mining"

Maximum-likelihood document models (w = the, algorithms, for, data, mining):

p_DML(w|d1): 0.04   0.001    0.02   0.002    0.003
p_DML(w|d2): 0.02   0.001    0.01   0.003    0.004

On the content words, d2 matches at least as well as d1:
p("algorithms"|d1) = p("algorithms"|d2), p("data"|d1) < p("data"|d2), p("mining"|d1) < p("mining"|d2).

Intuitively, d2 should have a higher score, but p(q|d1) > p(q|d2)...

So we should make p("the") and p("for") less different for all docs, and smoothing helps achieve this goal:

p(w|REF):          0.2    0.00001   0.2    0.00001   0.00001
Smoothed p(w|d1):  0.184  0.000109  0.182  0.000209  0.000309
Smoothed p(w|d2):  0.182  0.000109  0.181  0.000309  0.000409

After smoothing with p(w|d) = 0.1 p_DML(w|d) + 0.9 p(w|REF), we get p(q|d1) < p(q|d2)!
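
A quick sketch (not from the slides) that reproduces the numbers above via Jelinek-Mercer-style interpolation, with weight 0.1 on the document model:

import math

#                the    algorithms  for   data     mining
p_dml = {"d1": [0.04,  0.001,     0.02,  0.002,   0.003],
         "d2": [0.02,  0.001,     0.01,  0.003,   0.004]}
p_ref =        [0.2,   0.00001,   0.2,   0.00001, 0.00001]

def query_log_prob(p_doc, lam=0.1):
    # log p(q|d) with p(w|d) = lam * p_DML(w|d) + (1 - lam) * p(w|REF)
    return sum(math.log(lam * pd + (1 - lam) * pr) for pd, pr in zip(p_doc, p_ref))

for d in p_dml:
    print(d, query_log_prob(p_dml[d]))   # d2 now scores higher than d1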


Page 18: Artificial Intelligence

RETRIEVAL FRAMEWORK

When we rank documents with respect to a query, we desire per-document scores that rely both on information drawn from the particular document's contents and on how the document is situated within the similarity structure of the ambient corpus.

Structure representation via overlapping clusters. Clusters can be represented as facets of the corpus that users might be interested in. Employing intersecting clusters may reduce information loss due to the generalization that clustering can introduce.

Information representation. Motivated by the empirical successes of language-modeling-based approaches, we use language models induced from documents and clusters as our information representation. Thus, p_d(q) and p_c(q) specify our initial knowledge of the relation between the query q and a particular document d or cluster c.

Information integration. To assign a ranking to the documents in a corpus C with respect to q, we want to score each d Є C against q in a way that incorporates information from query-relevant corpus facets to which d belongs.


Page 19: Artificial Intelligence

CLUSTER-BASED SMOOTHING/SCORING

Cluster-based query likelihood: similar to the translation model, but "translate" the whole document to the query through a set of clusters:

p(Q | D, R) = Σ_{C ∈ Clusters} p(Q | C) p(C | D)

where p(Q | C) is the likelihood of Q given C, and p(C | D) reflects how likely it is that document D belongs to cluster C.

Only effective when interpolated with the basic LM scores.
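
A minimal sketch of this sum (hypothetical toy inputs, not the authors' code):

def cluster_query_likelihood(p_q_given_c, p_c_given_d):
    # p(Q|D,R) = sum over clusters C of p(Q|C) * p(C|D)
    return sum(p_q_given_c[c] * p_c_given_d[c] for c in p_c_given_d)

# two hypothetical clusters; document D leans toward cluster c1:
print(cluster_query_likelihood({"c1": 0.03, "c2": 0.001},
                               {"c1": 0.7, "c2": 0.3}))   # 0.03*0.7 + 0.001*0.3 = 0.0213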


Page 20: Artificial Intelligence

RETRIEVAL ALGORITHM

Baseline method: the documents are simply ranked by probabilistic functions based on the frequency of the query words they contain.


Page 21: Artificial Intelligence

Probabilistic IR

[Diagram: an information need is expressed as a query, which is matched against documents d1, d2, ..., dn in the document collection; each document d is scored by P(R | Q, d), the probability of relevance given the query and the document.]

Page 22: Artificial Intelligence

BASIS SELECT

This algorithm uses the pooling of statistics from documents simply to decide whether a document is worth ranking at all. Only the basis documents, those meeting some minimum threshold frequency, are allowed to appear in the final output list.


Page 23: Artificial Intelligence

IR based on LM

[Diagram: each document d1, d2, ..., dn in the collection induces a language model M_d1, M_d2, ..., M_dn; the information need is expressed as a query Q, and each document is scored by the generation probability P(Q | M_d).]


Page 24: Artificial Intelligence

SET SELECT ALGORITHM

In this case all the documents may appear in the final output list. The idea is that any document in the "best" cluster, basis or not, is potentially relevant.

BAG SELECT

Documents appearing in more than one cluster should get extra consideration. The name refers to counting each document with its multiplicity in the bag formed by the multiset union of the selected clusters.


Page 25: Artificial Intelligence

ASPECT-X RATIO

The degree of relevance for a particular probability is based on the strength of association between d and c, where d is a document and c is a cluster.

The uniform aspect-x ratio assumes that every d Є c has the same degree of association.


Page 26: Artificial Intelligence

A HYBRID ALGORITHM

An interpolation algorithm combines the advantages of both the selection-only algorithms and the aspect-x algorithm.

The algorithm can be derived by dropping the original aspect model's conditional independence assumption, namely that p(q|d, c) = p(q|c), and instead setting p(q|d, c) in Equation 1 to λ p(q|d) + (1 − λ) p(q|c), where λ indicates the degree of emphasis on individual-document information. If we do so, then via some algebra we get

p(q|d) = λ p(q|d) + (1 − λ) Σ_c p(q|c) p(c|d).

Finally, applying the same assumptions as described in our discussion of the aspect-x algorithm yields a score function that is the linear interpolation of the score of the standard LM approach and the score of the aspect-x algorithm.
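
A sketch of the resulting score function (hypothetical toy inputs; λ as described above):

def interpolated_score(p_q_d, p_q_given_c, p_c_given_d, lam=0.5):
    # lam * p(q|d) + (1 - lam) * sum_c p(q|c) * p(c|d):
    # the standard LM score interpolated with the aspect-x score
    aspect = sum(p_q_given_c[c] * p_c_given_d[c] for c in p_c_given_d)
    return lam * p_q_d + (1 - lam) * aspect

print(interpolated_score(0.01, {"c1": 0.03, "c2": 0.001}, {"c1": 0.7, "c2": 0.3}))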

Page 27: Artificial Intelligence


TEXT GENERATION WITH UNIGRAM LM

[Diagram: a (unigram) language model p(w|θ) generates a document d by sampling words.]

Topic 1 (text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, ..., food 0.00001, ...; sampling from this model yields a text mining paper.

Topic 2 (health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, ...; sampling from this model yields a food nutrition paper.

Given θ, p(d|θ) varies according to d.

Page 28: Artificial Intelligence

ESTIMATION OF UNIGRAM LM

(Unigram) language model p(w|θ) = ?

Document (total # words = 100): text 10, mining 5, association 3, database 3, algorithm 2, ..., query 1, efficient 1.

Maximum-likelihood estimation: p(text|θ) = 10/100, p(mining|θ) = 5/100, p(association|θ) = 3/100, p(database|θ) = 3/100, ..., p(query|θ) = 1/100.

How good is the estimated model? It gives our document sample the highest probability, but it doesn't generalize well. More about this later...
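
A minimal sketch of this maximum-likelihood estimator (not from the slides):

from collections import Counter

def mle_unigram(doc_words):
    # maximum-likelihood estimate: p(w|theta) = count(w, d) / |d|
    counts = Counter(doc_words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# a 100-word toy document matching the slide's leading counts:
doc = ["text"] * 10 + ["mining"] * 5 + ["association"] * 3 + ["other"] * 82
print(mle_unigram(doc)["text"])   # 0.1, i.e. 10/100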


Page 29: Artificial Intelligence

THE BASIC LM APPROACH [PONTE & CROFT 98]

[Diagram: each document (e.g., a text mining paper, a food nutrition paper) induces a language model with estimated probabilities for words such as text, mining, association, clustering, food, nutrition, healthy, diet, ...]

Query = "data mining algorithms"

Which model would most likely have generated this query?
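
A minimal sketch of this ranking rule (not the authors' code; Jelinek-Mercer smoothing against an assumed reference model keeps unseen query words from zeroing out a score):

import math

def score(query, doc_model, ref_model, lam=0.1):
    # log p(q|d) under a smoothed document model
    return sum(math.log(lam * doc_model.get(w, 0.0) + (1 - lam) * ref_model[w])
               for w in query)

def rank(query, doc_models, ref_model):
    # sort document ids by descending query likelihood
    return sorted(doc_models,
                  key=lambda d: score(query, doc_models[d], ref_model),
                  reverse=True)

query = "data mining algorithms".split()
ref = {"data": 0.01, "mining": 0.005, "algorithms": 0.005}
docs = {"text_mining_paper": {"data": 0.04, "mining": 0.06, "algorithms": 0.02},
        "nutrition_paper": {"data": 0.001}}
print(rank(query, docs, ref))   # the text mining paper ranks first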


Page 30: Artificial Intelligence

Experimental Setup: Data. The experiments were conducted on TREC data; titles (rather than full descriptions) were used as queries, resulting in an average length of 2-5 terms. Some characteristics of the three corpora were summarized in a table. [Table not captured in the transcript.]


Page 31: Artificial Intelligence

EXPERIMENTAL RESULTS

[Results tables/figures not captured in the transcript.]

Page 34: Artificial Intelligence

Thank You
