CORPUS STRUCTURE, LANGUAGE MODELS, AND AD HOC INFORMATION RETRIEVAL
Oren Kurland and Lillian Lee
Department of Computer Science, Cornell University, Ithaca, NY
Sep 22, 2014
INFORMATION RETRIEVAL
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
CLUSTERING
Clustering: the process of grouping a set of objects into classes of similar objects.
- Documents within a cluster should be similar.
- Documents from different clusters should be dissimilar.
- The commonest form of unsupervised learning: learning from raw data, as opposed to supervised learning, where a classification of the examples is given.
- A common and important task that finds many applications in IR and other places.
A DATA SET WITH CLEAR CLUSTER STRUCTURE (Ch. 16)
[figure omitted]
TERMINOLOGY
- An IR system looks for data matching some criteria defined by the users in their queries.
- The language used to ask a question is called the query language.
- These queries use keywords (atomic items characterizing some data).
- The basic unit of data is a document (which can be a file, an article, a paragraph, etc.).
- A document corresponds to free text (and may be unstructured).
- All the documents are gathered into a collection (or corpus).
TERM FREQUENCY
A document is treated as a set of words Each word characterizes that document to
some extent When we have eliminated stop words, the
most frequent words tend to be what the document is about
Therefore: fkd (Nb of occurrences of word k in document d) will be an important measure.
Also called the term frequency (tf)9
DOCUMENT FREQUENCY
- What makes this document distinct from others in the corpus?
- The terms which discriminate best are not those which occur with high document frequency!
- Therefore d_k (the number of documents in which word k occurs) will also be an important measure.
- Also called the document frequency (df); its inverse is the basis of the idf score.
TF.IDF
This can all be summarized as: words are the best discriminators when they
- occur often in this document (term frequency), and
- do not occur in a lot of documents (document frequency).
One very common measure of the importance of a word to a document is TF.IDF: term frequency x inverse document frequency. There are multiple formulas for actually computing this; the underlying concept is the same in all of them.
TERM WEIGHTS
- tf-score: tf_{i,j} = frequency of term i in document j.
- idf-score: idf_i = inverse document frequency of term i: idf_i = log(N / n_i), with
  N, the size of the document collection (number of documents), and
  n_i, the number of documents in which term i occurs.
- idf_i thus measures the rarity of term i in the document collection: the log of the inverse of the proportion of documents in which term i occurs.
- Term weight of term i in document j (TF-IDF): tf_{i,j} x idf_i.
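The weight above can be sketched in a few lines; this is a minimal illustration of tf x log(N / n_i), not any particular system's weighting scheme, and the corpus statistics are made up:

```python
import math

def tf_idf(tf, df, n_docs):
    """Weight of term i in document j: tf_{i,j} * log(N / n_i)."""
    return tf * math.log(n_docs / df)

# Hypothetical statistics for a 1000-document collection:
N = 1000
# "the" occurs 20 times in doc j but appears in 990 documents -> low weight
w_common = tf_idf(tf=20, df=990, n_docs=N)
# "eigenvalue" occurs only 3 times but appears in just 5 documents -> high weight
w_rare = tf_idf(tf=3, df=5, n_docs=N)
# The rare term outweighs the frequent-but-common one.
```

Note how a term occurring in every document gets weight zero (log(N/N) = 0), matching the intuition that such terms discriminate nothing.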
LANGUAGE MODELS FOR IR
Language Modeling Approaches Attempt to model query generation process Documents are ranked by the probability that a
query would be observed as a random sample from the respective document model
Multinomial approach
13
LANGUAGE MODELS
A language model is a probability distribution over word sequences, e.g.:
- p("Today is Wednesday") ≈ 0.001
- p("Today Wednesday is") ≈ 0.0000000000001
- p("The eigenvalue is positive") ≈ 0.00001
It is context/topic dependent!
Treat each document as the basis for a model (e.g., unigram sufficient statistics), and rank document d based on p(d | q):
p(d | q) = p(q | d) p(d) / p(q)
- p(q) is the same for all documents, so ignore it.
- p(d) (the prior) is often treated as the same for all d, but we could use criteria like authority, length, or genre.
- p(q | d) is the probability of q given d's model.
A very general formal approach.
THE SIMPLEST LANGUAGE MODEL (UNIGRAM MODEL)
- Generate a piece of text by generating each word independently.
- Thus, p(w1 w2 ... wn) = p(w1) p(w2) ... p(wn).
- Parameters: {p(wi)}, with p(w1) + ... + p(wN) = 1 (N is the vocabulary size).
- A piece of text can be regarded as a sample drawn according to this word distribution.
SMOOTHING
- Smoothing is an important issue, and distinguishes different approaches.
- Many smoothing methods are available; which works best depends on the data and the task. Cross-validation is generally used to choose the best method and/or set the smoothing parameters.
- For retrieval, the Dirichlet prior performs well; backoff smoothing [Katz 87] doesn't work well, due to a lack of second-stage smoothing.
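The Dirichlet-prior method mentioned above can be sketched as follows; this is a minimal illustration of the standard formulation p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu), and the counts and mu value are made up:

```python
def dirichlet_smooth(count_wd, doc_len, p_w_coll, mu=2000):
    """Dirichlet-prior smoothing:
    p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu),
    where p(w|C) is the collection (reference) language model."""
    return (count_wd + mu * p_w_coll) / (doc_len + mu)

# A word unseen in a 4-word document still gets nonzero probability,
# proportional to its collection probability:
p_unseen = dirichlet_smooth(count_wd=0, doc_len=4, p_w_coll=0.001)
```

Since the numerators add mu * p(w|C) across the vocabulary (which sums to mu), the smoothed probabilities still sum to one.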
ANOTHER REASON FOR SMOOTHING
Query = "the algorithms for data mining"
p_ML(w|d1):        0.04   0.001    0.02   0.002    0.003
p_ML(w|d2):        0.02   0.001    0.01   0.003    0.004
Intuitively, d2 should have a higher score: p("algorithms"|d1) = p("algorithms"|d2), while p("data"|d1) < p("data"|d2) and p("mining"|d1) < p("mining"|d2) for the content words. Yet p(q|d1) > p(q|d2).
So we should make p("the") and p("for") less different across documents, and smoothing helps achieve this goal:
p(w|REF):          0.2    0.00001  0.2    0.00001  0.00001
Smoothed p(w|d1):  0.184  0.000109 0.182  0.000209 0.000309
Smoothed p(w|d2):  0.182  0.000109 0.181  0.000309 0.000409
After smoothing with p(w|d) = 0.1 p_ML(w|d) + 0.9 p(w|REF), we get p(q|d1) < p(q|d2)!
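The ranking reversal on this slide can be checked numerically; a minimal sketch with the slide's probabilities hard-coded:

```python
# Per-term probabilities for the query "the algorithms for data mining"
P_ML_D1 = [0.04, 0.001, 0.02, 0.002, 0.003]      # p_ML(w|d1)
P_ML_D2 = [0.02, 0.001, 0.01, 0.003, 0.004]      # p_ML(w|d2)
P_REF   = [0.2, 0.00001, 0.2, 0.00001, 0.00001]  # p(w|REF)

def query_likelihood(p_ml, p_ref, lam):
    """Product over query terms of lam * p_ML(w|d) + (1 - lam) * p(w|REF)."""
    score = 1.0
    for p_d, p_r in zip(p_ml, p_ref):
        score *= lam * p_d + (1 - lam) * p_r
    return score

raw_d1 = query_likelihood(P_ML_D1, P_REF, 1.0)       # lam = 1: no smoothing
raw_d2 = query_likelihood(P_ML_D2, P_REF, 1.0)
smooth_d1 = query_likelihood(P_ML_D1, P_REF, 0.1)    # 0.1 p_ML + 0.9 p_REF
smooth_d2 = query_likelihood(P_ML_D2, P_REF, 0.1)
# Without smoothing d1 outranks d2; with smoothing the order flips.
```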
RETRIEVAL FRAMEWORK
When we rank documents with respect to a query, we desire per-document scores that rely both on information drawn from the particular document's contents and on how the document is situated within the similarity structure of the ambient corpus.
Structure representation via overlapping clusters. Clusters can be viewed as facets of the corpus that users might be interested in. Employing intersecting clusters may reduce the information loss due to the generalization that clustering can introduce.
Information representation. Motivated by the empirical successes of language-modeling-based approaches, we use language models induced from documents and clusters as our information representation. Thus, p_d(q) and p_c(q) specify our initial knowledge of the relation between the query q and a particular document d or cluster c.
Information integration. To assign a ranking to the documents in a corpus C with respect to q, we want to score each d ∈ C against q in a way that incorporates information from query-relevant corpus facets to which d belongs.
CLUSTER-BASED SMOOTHING/SCORING
Cluster-based query likelihood: similar to the translation model, but "translate" the whole document to the query through a set of clusters:
p(Q | D, R) = Σ_{C ∈ Clusters} p(Q | C) p(C | D)
where p(Q | C) is the likelihood of Q given C, and p(C | D) captures how likely it is that document D belongs to cluster C.
Only effective when interpolated with the basic LM scores.
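A minimal sketch of this sum over clusters; the cluster names and probabilities are hypothetical toy numbers:

```python
def cluster_based_likelihood(p_q_given_c, p_c_given_d):
    """p(Q|D,R) = sum over clusters C of p(Q|C) * p(C|D)."""
    return sum(p_q * p_c_given_d.get(c, 0.0)
               for c, p_q in p_q_given_c.items())

# Toy example with two (hypothetical) clusters:
p_q_given_c = {"c1": 0.02, "c2": 0.001}  # likelihood of Q given each cluster
p_c_given_d = {"c1": 0.7, "c2": 0.3}     # how likely D belongs to each cluster
score = cluster_based_likelihood(p_q_given_c, p_c_given_d)
```

A document strongly associated with a cluster that explains the query well gets a high score even if the query terms are sparse in the document itself.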
RETRIEVAL ALGORITHM
Baseline method: the documents are simply ranked by probabilistic functions based on the frequency of query words encountered in each document.
PROBABILISTIC IR
[figure: an information need is expressed as a query, which is matched against the documents d1, d2, ..., dn of the collection; documents are ranked by P(R | Q, d)]
BASIS SELECT
This algorithm uses the pooled statistics from documents simply to decide whether a document is worth ranking or not. Only the basis documents, those meeting some minimum threshold frequency, are allowed to appear in the final output list.
IR BASED ON LM
[figure: an information need is expressed as a query; each document di in the collection induces a language model M_di, and documents are ranked by the generation probability P(Q | M_d)]
SET SELECT ALGORITHM
In this case all the documents may appear in the final output list. The idea is that any document in the "best" cluster, basis or not, is potentially relevant.
BAG SELECT
Documents appearing in more than one cluster should get extra consideration. The name refers to the incorporation of a document's multiplicity in the bag formed from the "multiset union" of the selected clusters.
ASPECT-X RATIO
The degree of relevance under a particular probability is based on the strength of association between d and c, where d is a document and c is a cluster.
The uniform aspect-x ratio assumes that every d ∈ c has the same degree of association.
A HYBRID ALGORITHM
An interpolation algorithm combines the advantages of both the selection-only algorithms and the aspect-x algorithm. It can be derived by dropping the original aspect model's conditional-independence assumption, namely that p(q|d, c) = p(q|c), and instead setting p(q|d, c) in Equation 1 to λ p(q|d) + (1 − λ) p(q|c), where λ indicates the degree of emphasis on individual-document information. If we do so, then via some algebra we get
p(q|d) = λ p(q|d) + (1 − λ) Σ_c p(q|c) p(c|d).
Finally, applying the same assumptions as described in our discussion of the aspect-x algorithm yields a score function that is the linear interpolation of the score of the standard LM approach and the score of the aspect-x algorithm.
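The interpolated score can be sketched as below; a minimal illustration under toy, hypothetical probabilities, not the paper's full estimation procedure:

```python
def interpolation_score(p_q_d, p_q_given_c, p_c_given_d, lam):
    """score(d) = lam * p(q|d) + (1 - lam) * sum_c p(q|c) p(c|d)."""
    aspect = sum(p_q * p_c_given_d.get(c, 0.0)
                 for c, p_q in p_q_given_c.items())
    return lam * p_q_d + (1 - lam) * aspect

# Toy example (hypothetical numbers): lam = 1 recovers the standard LM
# score p(q|d); lam = 0 recovers the aspect-x score.
p_q_given_c = {"c1": 0.02, "c2": 0.001}
p_c_given_d = {"c1": 0.7, "c2": 0.3}
score = interpolation_score(0.05, p_q_given_c, p_c_given_d, lam=0.5)
```

Sweeping lam between 0 and 1 trades off individual-document evidence against cluster (facet) evidence.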
TEXT GENERATION WITH UNIGRAM LM
A (unigram) language model p(w|θ) generates a document d by sampling words; given θ, p(d|θ) varies according to d.
- Topic 1 (text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, ..., food 0.00001, ... → generates a text-mining paper.
- Topic 2 (health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, ... → generates a food-nutrition paper.
ESTIMATION OF UNIGRAM LM
How do we estimate the (unigram) language model p(w|θ) of a document?
- Document word counts (total #words = 100): text 10, mining 5, association 3, database 3, algorithm 2, ..., query 1, efficient 1, ...
- Maximum-likelihood estimates: p(text|θ) = 10/100, p(mining|θ) = 5/100, p(association|θ) = 3/100, p(database|θ) = 3/100, ..., p(query|θ) = 1/100, ...
How good is the estimated model? It gives our document sample the highest probability, but it doesn't generalize well. More about this later...
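The maximum-likelihood estimate above is just count / total; a minimal sketch (the "filler" token is a made-up placeholder padding the document to the slide's 100 words):

```python
from collections import Counter

def mle_unigram(tokens):
    """Maximum-likelihood unigram model: p(w|theta) = c(w, d) / |d|."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

# Reproduce the slide's leading estimates:
doc = ["text"] * 10 + ["mining"] * 5 + ["association"] * 3 + ["filler"] * 82
model = mle_unigram(doc)
# model["text"] is 10/100, model["mining"] is 5/100, and so on.
```

Any word absent from the document gets probability zero under this estimate, which is exactly why the smoothing discussed earlier is needed.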
THE BASIC LM APPROACH [PONTE & CROFT 98]
Each document induces its own language model (for a text-mining paper: text ?, mining ?, association ?, clustering ?, ..., food ?, ...; for a food-nutrition paper: food ?, nutrition ?, healthy ?, diet ?, ...).
Query = "data mining algorithms": which model would most likely have generated this query?
EXPERIMENTAL SETUP
Data: the experiments were conducted on TREC data. Titles (rather than full descriptions) were used as queries, resulting in an average length of 2-5 terms. Some characteristics of the three corpora are summarized in the following table.
EXPERIMENTAL RESULTS
Thank You