CORPUS STRUCTURE, LANGUAGE MODELS, AND AD HOC INFORMATION RETRIEVAL
Oren Kurland and Lillian Lee
Department of Computer Science, Cornell University, Ithaca, NY
Sep 22, 2014
INFORMATION RETRIEVAL
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
CLUSTERING
Clustering: the process of grouping a set of objects into classes of similar objects.
- Documents within a cluster should be similar.
- Documents from different clusters should be dissimilar.
- The commonest form of unsupervised learning: learning from raw data, as opposed to supervised learning, where a classification of the examples is given.
- A common and important task that finds many applications in IR and other places.
A DATA SET WITH CLEAR CLUSTER STRUCTURE (Ch. 16)
[figure omitted]
TERMINOLOGY
- An IR system looks for data matching some criteria defined by the users in their queries.
- The language used to ask a question is called the query language.
- These queries use keywords (atomic items characterizing some data).
- The basic unit of data is a document (which can be a file, an article, a paragraph, etc.).
- A document corresponds to free text (and may be unstructured).
- All the documents are gathered into a collection (or corpus).
TERM FREQUENCY
A document is treated as a set of words Each word characterizes that document to
some extent When we have eliminated stop words, the
most frequent words tend to be what the document is about
Therefore: fkd (Nb of occurrences of word k in document d) will be an important measure.
Also called the term frequency (tf)9
DOCUMENT FREQUENCY
- What makes this document distinct from others in the corpus?
- The terms which discriminate best are not those which occur with high document frequency!
- Therefore d_k (the number of documents in which word k occurs) will also be an important measure.
- Also called the document frequency (df); its inverse is the basis of the idf score.
TF.IDF
This can all be summarized as: words are the best discriminators when they
- occur often in this document (term frequency), and
- do not occur in a lot of documents (document frequency).
One very common measure of the importance of a word to a document is TF.IDF: term frequency x inverse document frequency. There are multiple formulas for actually computing this; the underlying concept is the same in all of them.
TERM WEIGHTS
- tf-score: tf_{i,j} = frequency of term i in document j.
- idf-score: idf_i = inverse document frequency of term i: idf_i = log(N / n_i), with
  N, the size of the document collection (number of documents), and
  n_i, the number of documents in which term i occurs.
- idf_i thus measures the rarity of term i in the document collection: the log of the inverse of the proportion of documents in which term i occurs.
- Term weight of term i in document j (TF-IDF): tf_{i,j} x idf_i.
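The weight above can be sketched in a few lines; this is a minimal illustration of tf x log(N / n_i), not any particular system's weighting scheme, and the corpus statistics are made up:

```python
import math

def tf_idf(tf, df, n_docs):
    """Weight of term i in document j: tf_{i,j} * log(N / n_i)."""
    return tf * math.log(n_docs / df)

# Hypothetical statistics for a 1000-document collection:
N = 1000
# "the" occurs 20 times in doc j but appears in 990 documents -> low weight
w_common = tf_idf(tf=20, df=990, n_docs=N)
# "eigenvalue" occurs only 3 times but appears in just 5 documents -> high weight
w_rare = tf_idf(tf=3, df=5, n_docs=N)
# The rare term outweighs the frequent-but-common one.
```

Note how a term occurring in every document gets weight zero (log(N/N) = 0), matching the intuition that such terms discriminate nothing.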
LANGUAGE MODELS FOR IR
Language Modeling Approaches Attempt to model query generation process Documents are ranked by the probability that a
query would be observed as a random sample from the respective document model
Multinomial approach
13
LANGUAGE MODELS
A language model is a probability distribution over word sequences, e.g.:
- p("Today is Wednesday") ≈ 0.001
- p("Today Wednesday is") ≈ 0.0000000000001
- p("The eigenvalue is positive") ≈ 0.00001
It is context/topic dependent!
Treat each document as the basis for a model (e.g., unigram sufficient statistics), and rank document d based on p(d | q):
p(d | q) = p(q | d) p(d) / p(q)
- p(q) is the same for all documents, so ignore it.
- p(d) (the prior) is often treated as the same for all d, but we could use criteria like authority, length, or genre.
- p(q | d) is the probability of q given d's model.
A very general formal approach.
THE SIMPLEST LANGUAGE MODEL (UNIGRAM MODEL)
- Generate a piece of text by generating each word independently.
- Thus, p(w1 w2 ... wn) = p(w1) p(w2) ... p(wn).
- Parameters: {p(wi)}, with p(w1) + ... + p(wN) = 1 (N is the vocabulary size).
- A piece of text can be regarded as a sample drawn according to this word distribution.
SMOOTHING
- Smoothing is an important issue, and distinguishes different approaches.
- Many smoothing methods are available; which works best depends on the data and the task. Cross-validation is generally used to choose the best method and/or set the smoothing parameters.
- For retrieval, the Dirichlet prior performs well; backoff smoothing [Katz 87] doesn't work well, due to a lack of second-stage smoothing.
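The Dirichlet-prior method mentioned above can be sketched as follows; this is a minimal illustration of the standard formulation p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu), and the counts and mu value are made up:

```python
def dirichlet_smooth(count_wd, doc_len, p_w_coll, mu=2000):
    """Dirichlet-prior smoothing:
    p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu),
    where p(w|C) is the collection (reference) language model."""
    return (count_wd + mu * p_w_coll) / (doc_len + mu)

# A word unseen in a 4-word document still gets nonzero probability,
# proportional to its collection probability:
p_unseen = dirichlet_smooth(count_wd=0, doc_len=4, p_w_coll=0.001)
```

Since the numerators add mu * p(w|C) across the vocabulary (which sums to mu), the smoothed probabilities still sum to one.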
ANOTHER REASON FOR SMOOTHING
Query = "the algorithms for data mining"
p_ML(w|d1):        0.04   0.001    0.02   0.002    0.003
p_ML(w|d2):        0.02   0.001    0.01   0.003    0.004
Intuitively, d2 should have a higher score: p("algorithms"|d1) = p("algorithms"|d2), while p("data"|d1) < p("data"|d2) and p("mining"|d1) < p("mining"|d2) for the content words. Yet p(q|d1) > p(q|d2).
So we should make p("the") and p("for") less different across documents, and smoothing helps achieve this goal:
p(w|REF):          0.2    0.00001  0.2    0.00001  0.00001
Smoothed p(w|d1):  0.184  0.000109 0.182  0.000209 0.000309
Smoothed p(w|d2):  0.182  0.000109 0.181  0.000309 0.000409
After smoothing with p(w|d) = 0.1 p_ML(w|d) + 0.9 p(w|REF), we get p(q|d1) < p(q|d2)!
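The ranking reversal on this slide can be checked numerically; a minimal sketch with the slide's probabilities hard-coded:

```python
# Per-term probabilities for the query "the algorithms for data mining"
P_ML_D1 = [0.04, 0.001, 0.02, 0.002, 0.003]      # p_ML(w|d1)
P_ML_D2 = [0.02, 0.001, 0.01, 0.003, 0.004]      # p_ML(w|d2)
P_REF   = [0.2, 0.00001, 0.2, 0.00001, 0.00001]  # p(w|REF)

def query_likelihood(p_ml, p_ref, lam):
    """Product over query terms of lam * p_ML(w|d) + (1 - lam) * p(w|REF)."""
    score = 1.0
    for p_d, p_r in zip(p_ml, p_ref):
        score *= lam * p_d + (1 - lam) * p_r
    return score

raw_d1 = query_likelihood(P_ML_D1, P_REF, 1.0)       # lam = 1: no smoothing
raw_d2 = query_likelihood(P_ML_D2, P_REF, 1.0)
smooth_d1 = query_likelihood(P_ML_D1, P_REF, 0.1)    # 0.1 p_ML + 0.9 p_REF
smooth_d2 = query_likelihood(P_ML_D2, P_REF, 0.1)
# Without smoothing d1 outranks d2; with smoothing the order flips.
```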
RETRIEVAL FRAMEWORK
When we rank documents with respect to a query, we desire per-document scores that rely both on information drawn from the particular document's contents and on how the document is situated within the similarity structure of the ambient corpus.
Structure representation via overlapping clusters. Clusters can be viewed as facets of the corpus that users might be interested in. Employing intersecting clusters may reduce the information loss due to the generalization that clustering can introduce.
Information representation. Motivated by the empirical successes of language-modeling-based approaches, we use language models induced from documents and clusters as our information representation. Thus, p_d(q) and p_c(q) specify our initial knowledge of the relation between the query q and a particular document d or cluster c.
Information integration. To assign a ranking to the documents in a corpus C with respect to q, we want to score each d ∈ C against q in a way that incorporates information from query-relevant corpus facets to which d belongs.
CLUSTER-BASED SMOOTHING/SCORING
Cluster-based query likelihood: similar to the translation model, but "translate" the whole document to the query through a set of clusters:
p(Q | D, R) = Σ_{C ∈ Clusters} p(Q | C) p(C | D)
where p(Q | C) is the likelihood of Q given C, and p(C | D) captures how likely it is that document D belongs to cluster C.
Only effective when interpolated with the basic LM scores.
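A minimal sketch of this sum over clusters; the cluster names and probabilities are hypothetical toy numbers:

```python
def cluster_based_likelihood(p_q_given_c, p_c_given_d):
    """p(Q|D,R) = sum over clusters C of p(Q|C) * p(C|D)."""
    return sum(p_q * p_c_given_d.get(c, 0.0)
               for c, p_q in p_q_given_c.items())

# Toy example with two (hypothetical) clusters:
p_q_given_c = {"c1": 0.02, "c2": 0.001}  # likelihood of Q given each cluster
p_c_given_d = {"c1": 0.7, "c2": 0.3}     # how likely D belongs to each cluster
score = cluster_based_likelihood(p_q_given_c, p_c_given_d)
```

A document strongly associated with a cluster that explains the query well gets a high score even if the query terms are sparse in the document itself.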
RETRIEVAL ALGORITHM
Baseline method: the documents are simply ranked by probabilistic functions based on the frequency of query words encountered in each document.
PROBABILISTIC IR
[figure: an information need is expressed as a query, which is matched against the documents d1, d2, ..., dn of the collection; documents are ranked by P(R | Q, d)]
BASIS SELECT
This algorithm uses the pooled statistics from documents simply to decide whether a document is worth ranking or not. Only the basis documents, those meeting some minimum threshold frequency, are allowed to appear in the final output list.
IR BASED ON LM
[figure: an information need is expressed as a query; each document di in the collection induces a language model M_di, and documents are ranked by the generation probability P(Q | M_d)]
SET SELECT ALGORITHM
In this case all the documents may appear in the final output list. The idea is that any document in the "best" cluster, basis or not, is potentially relevant.
BAG SELECT
Documents appearing in more than one cluster should get extra consideration. The name refers to the incorporation of a document's multiplicity in the bag formed from the "multiset union" of the selected clusters.
ASPECT-X RATIO
The degree of relevance under a particular probability is based on the strength of association between d and c, where d is a document and c is a cluster.
The uniform aspect-x ratio assumes that every d ∈ c has the same degree of association.
A HYBRID ALGORITHM
An interpolation algorithm combines the advantages of both the selection-only algorithms and the aspect-x algorithm. It can be derived by dropping the original aspect model's conditional-independence assumption, namely that p(q|d, c) = p(q|c), and instead setting p(q|d, c) in Equation 1 to λ p(q|d) + (1 − λ) p(q|c), where λ indicates the degree of emphasis on individual-document information. If we do so, then via some algebra we get
p(q|d) = λ p(q|d) + (1 − λ) Σ_c p(q|c) p(c|d).
Finally, applying the same assumptions as described in our discussion of the aspect-x algorithm yields a score function that is the linear interpolation of the score of the standard LM approach and the score of the aspect-x algorithm.
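The interpolated score can be sketched as below; a minimal illustration under toy, hypothetical probabilities, not the paper's full estimation procedure:

```python
def interpolation_score(p_q_d, p_q_given_c, p_c_given_d, lam):
    """score(d) = lam * p(q|d) + (1 - lam) * sum_c p(q|c) p(c|d)."""
    aspect = sum(p_q * p_c_given_d.get(c, 0.0)
                 for c, p_q in p_q_given_c.items())
    return lam * p_q_d + (1 - lam) * aspect

# Toy example (hypothetical numbers): lam = 1 recovers the standard LM
# score p(q|d); lam = 0 recovers the aspect-x score.
p_q_given_c = {"c1": 0.02, "c2": 0.001}
p_c_given_d = {"c1": 0.7, "c2": 0.3}
score = interpolation_score(0.05, p_q_given_c, p_c_given_d, lam=0.5)
```

Sweeping lam between 0 and 1 trades off individual-document evidence against cluster (facet) evidence.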
TEXT GENERATION WITH UNIGRAM LM
A (unigram) language model p(w|θ) generates a document d by sampling words; given θ, p(d|θ) varies according to d.
- Topic 1 (text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, ..., food 0.00001, ... → generates a text-mining paper.
- Topic 2 (health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, ... → generates a food-nutrition paper.
ESTIMATION OF UNIGRAM LM
How do we estimate the (unigram) language model p(w|θ) of a document?
- Document word counts (total #words = 100): text 10, mining 5, association 3, database 3, algorithm 2, ..., query 1, efficient 1, ...
- Maximum-likelihood estimates: p(text|θ) = 10/100, p(mining|θ) = 5/100, p(association|θ) = 3/100, p(database|θ) = 3/100, ..., p(query|θ) = 1/100, ...
How good is the estimated model? It gives our document sample the highest probability, but it doesn't generalize well. More about this later...
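The maximum-likelihood estimate above is just count / total; a minimal sketch (the "filler" token is a made-up placeholder padding the document to the slide's 100 words):

```python
from collections import Counter

def mle_unigram(tokens):
    """Maximum-likelihood unigram model: p(w|theta) = c(w, d) / |d|."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

# Reproduce the slide's leading estimates:
doc = ["text"] * 10 + ["mining"] * 5 + ["association"] * 3 + ["filler"] * 82
model = mle_unigram(doc)
# model["text"] is 10/100, model["mining"] is 5/100, and so on.
```

Any word absent from the document gets probability zero under this estimate, which is exactly why the smoothing discussed earlier is needed.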
THE BASIC LM APPROACH [PONTE & CROFT 98]
Each document induces its own language model (for a text-mining paper: text ?, mining ?, association ?, clustering ?, ..., food ?, ...; for a food-nutrition paper: food ?, nutrition ?, healthy ?, diet ?, ...).
Query = "data mining algorithms": which model would most likely have generated this query?
EXPERIMENTAL SETUP
Data: the experiments were conducted on TREC data. Titles (rather than full descriptions) were used as queries, resulting in an average length of 2-5 terms. Some characteristics of the three corpora are summarized in the following table.
EXPERIMENTAL RESULTS
Thank You