Ch 4: Information Retrieval and Text Mining
Hakam Alomari
Feb 06, 2016
4.1: Is Information Retrieval a Form of Text Mining?
What is the principal computer specialty for processing documents and text?
Information Retrieval (IR). The task of IR is to retrieve relevant documents in response to a query.
The fundamental technique of IR is measuring similarity: a query is examined and transformed into a vector of values to be compared with stored documents.
Cont. 4.1
In the prediction problem, similar documents are retrieved and their properties measured, e.g. counting the class labels to see which label should be assigned to a new document.
The objectives of prediction can be posed in the form of an IR model: documents relevant to a query are retrieved, where the query is the new document.
Cont. 4.1
Figure 4.1. Key steps in information retrieval: specify query → search document collection → return subset of relevant documents.
Figure 4.2. Key steps in predictive text mining: examine document collection → learn classification criteria → apply criteria to new documents.
Figure 4.3. Predicting from retrieved documents: specify query vector → match document collection → get subset of relevant documents → examine document properties (simple criteria such as the documents' labels).
4.2 Key Word Search
The technical goal for prediction is to classify new, unseen documents
Prediction and IR are unified by the computation of document similarity.
IR is based on traditional keyword search through a search engine.
So we should recognize that using a search engine is a special instance of the prediction concept.
We enter keywords into a search engine and expect relevant documents to be returned.
These keywords are words in a dictionary created from the document collection, and can themselves be viewed as a small document.
So we want to measure how similar the new document (the query) is to the documents in the collection.
The notion of similarity is thus reduced to finding documents with the same keywords as those posed to the search engine.
But, the objective of the search engine is to rank the documents, not to assign a label
So we need additional techniques to break the expected ties (all retrieved documents match the search criteria)
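As a minimal sketch of this idea (the collection and query below are invented for illustration, not from the book), keyword search treats the query as a small document and retrieves exactly those documents containing all of its words:

def keyword_search(query, collection):
    """Return the documents that contain every query keyword (exact match)."""
    query_words = set(query.lower().split())
    return [doc for doc in collection
            if query_words <= set(doc.lower().split())]

docs = ["the hardware user guide",
        "software index and user information",
        "hardware information for the user"]
print(keyword_search("hardware user", docs))
# -> ['the hardware user guide', 'hardware information for the user']

Note that both matching documents are returned with equal standing, which is exactly the tie-breaking problem mentioned above.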
4.3 Nearest-Neighbor Methods
Nearest-neighbor methods compare vectors and measure similarity.
In prediction, they collect the k most similar documents and then look at their labels.
In IR, they determine whether a satisfactory response to the search query has been found.
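A minimal k-nearest-neighbor sketch (the function names, the (document, label) pairing, and the majority-vote rule are illustrative assumptions, not the book's code): rank labeled documents by similarity to the new document, keep the top k, and vote on their labels.

from collections import Counter

def knn_predict(new_doc, labeled_docs, similarity, k=3):
    """Rank (document, label) pairs by similarity to new_doc,
    then return the majority label among the k most similar."""
    ranked = sorted(labeled_docs,
                    key=lambda pair: similarity(new_doc, pair[0]),
                    reverse=True)
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]

docs = [("spam offer now", "spam"), ("meeting at noon", "ham"),
        ("free spam offer", "spam")]
overlap = lambda a, b: len(set(a.split()) & set(b.split()))
print(knn_predict("free offer now", docs, overlap, k=3))  # -> 'spam'

Any of the similarity measures from Section 4.4 can be plugged in as the similarity function.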
4.4 Measuring Similarity
These measures examine how similar documents are; the output is a numerical measure of similarity.
Three increasingly complex measures: shared word count, word count and bonus, and cosine similarity.
4.4.1 Shared Word Count
Counts the shared words between documents
The words: in IR we have a global dictionary where all potential words are included, with the exception of stopwords.
In prediction it is better to preselect the dictionary relative to the label.
Computing similarity by Shared words
Look at all words in the new document. For each document in the collection, count how many of these words appear.
No weighting is used, just a simple count.
The dictionary has true keywords (weakly predictive words removed).
The results of this measure are clearly intuitive: no one will question why a document was retrieved.
Computing similarity by Shared words
Each document is represented as a vector of keywords (zeros and ones).
The similarity of two documents is the inner product of the two vectors.
If two documents share a keyword, that word is counted (1*1).
The performance of this measure depends mainly on the dictionary used
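A sketch of this computation (the dictionary and documents are invented for illustration): each document becomes a binary vector over the dictionary, and the shared word count is the dot product of two such vectors.

def binary_vector(doc, dictionary):
    """Zero/one vector: 1 where a dictionary term occurs in the document."""
    words = set(doc.lower().split())
    return [1 if term in words else 0 for term in dictionary]

def shared_word_count(vec_a, vec_b):
    # dot product of two binary vectors = number of shared keywords
    return sum(a * b for a, b in zip(vec_a, vec_b))

dictionary = ["hardware", "software", "user", "information", "index"]
d1 = binary_vector("hardware user index", dictionary)         # [1, 0, 1, 0, 1]
d2 = binary_vector("hardware information index", dictionary)  # [1, 0, 0, 1, 1]
print(shared_word_count(d1, d2))  # -> 2 (hardware, index)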
Computing similarity by Shared words
Shared word count is an exact search: a document is either retrieved or not.
No weighting can be applied to query terms; for terms A and B, you cannot specify that A is more important than B.
Every retrieved document is treated equally.
4.4.2 Word Count and Bonus 1/4
TF (term frequency): the number of times a term occurs in a document.
DF (document frequency): the number of documents that contain the term.
IDF (inverse document frequency): log(N/df), where N is the total number of documents.
A vector is a numerical representation of a point in a multi-dimensional space: (x1, x2, ..., xn). The dimensions of the space and a measure on the space need to be defined.
4.4.2 Word Count and Bonus 2/4
Each indexing term is a dimension; each document is a vector: Di = (ti1, ti2, ti3, ti4, ..., tik).
Document similarity is defined as

$$\mathrm{Sim}(D_1, D_2) = \sum_{j=1}^{K} w(j), \qquad w(j) = \begin{cases} 1 + \dfrac{1}{df(j)} & \text{if word } j \text{ occurs in both documents} \\ 0 & \text{otherwise} \end{cases}$$

where K is the number of words in the dictionary.
4.4.2 Word Count and Bonus 3/4
The bonus 1/df(j) is a variant of idf. Thus, if the word occurs in many documents, the bonus is small.
This measure is better than the shared word count because it discriminates between weakly and strongly predictive words.
4.4.2 Word Count and Bonus 4/4
A document space is defined by five terms: hardware, software, user, information, index. The query (the new document) is "hardware, user, information", i.e. the vector 10110. Comparing it against the labeled spreadsheet of stored document vectors gives the similarity scores with bonus:

Document vector   Similarity score
10101             2.83
00011             1.33
01000             0
10001             1.33
00100             1.50
01010             1.33
10011             2.67

Figure 4.4. Computing similarity scores with bonus.
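These scores can be reproduced with a short sketch (a reconstruction for illustration, not the book's code): for each stored document, sum 1 + 1/df(j) over the words it shares with the new document.

docs = [  # binary vectors over (hardware, software, user, information, index)
    [1,0,1,0,1], [0,0,0,1,1], [0,1,0,0,0], [1,0,0,0,1],
    [0,0,1,0,0], [0,1,0,1,0], [1,0,0,1,1],
]
query = [1, 0, 1, 1, 0]  # "hardware, user, information"

# df(j): number of stored documents containing word j
df = [sum(doc[j] for doc in docs) for j in range(len(query))]

def bonus_similarity(query, doc, df):
    # each shared word contributes 1 + 1/df(j); non-shared words contribute 0
    return sum(1 + 1/df[j]
               for j in range(len(query)) if query[j] and doc[j])

print([round(bonus_similarity(query, d, df), 2) for d in docs])
# -> [2.83, 1.33, 0, 1.33, 1.5, 1.33, 2.67], matching Figure 4.4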
4.4.3 Cosine Similarity: The Vector Space
A document is represented as a vector: (W1, W2, ..., Wn).
Binary: Wi = 1 if the corresponding term is in the document; Wi = 0 if it is not.
TF (term frequency): Wi = tfi, where tfi is the number of times the term occurred in the document.
TF*IDF (inverse document frequency): Wi = tfi*idfi = tfi*(1 + log(N/dfi)), where dfi is the number of documents containing term i, and N is the total number of documents in the collection.
4.4.3 Cosine Similarity: The Vector Space
vec(d) = (w1, w2, ..., wt)

$$\mathrm{sim}(d_1, d_2) = \cos\theta = \frac{\vec{d_1} \cdot \vec{d_2}}{|d_1|\,|d_2|} = \frac{\sum_j w_{d_1}(j)\, w_{d_2}(j)}{|d_1|\,|d_2|}$$

Since w(j) > 0 whenever word j occurs in di, we have 0 <= sim(d1, d2) <= 1.
A document is retrieved even if it matches the query terms only partially
4.4.3 Cosine Similarity
How do we compute the weight wj? A good weight must take into account two effects:
quantification of intra-document content (similarity): the tf factor, the term frequency within a document;
quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency.
wj = tf(j) * idf(j)
4.4.3 Cosine Similarity
TF in the given document shows how important the term is in this document (makes the frequent words for the document more important)
IDF makes rare words across all documents more important.
A high weight in a tf-idf ranking scheme is therefore reached by a high term frequency in the given document and a low term frequency in all other documents.
Term weights in a document affect the position of the document vector di = (wi,1, wi,2, ..., wi,t).
4.4.3 Cosine Similarity
TF-IDF definitions:
fik: the number of occurrences of term ti in document Dk
tfik = fik / max(fik): normalized term frequency
dfi: the number of documents that contain term ti
idfi = log(N/dfi), where N is the total number of documents
wik = tfik * idfi: term weight
Intuition: rare words get more weight, common words less weight
Example TF-IDF
Given a document containing terms with frequencies Kent = 3, Ohio = 2, University = 1, assume a collection of 10,000 documents in which the document frequencies of these terms are Kent = 50, Ohio = 1300, University = 250.
Then (using the natural log):
Kent: tf = 3/3; idf = log(10000/50) = 5.3; tf-idf = 5.3
Ohio: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3
University: tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2
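The example can be checked with a few lines (a sketch; as the numbers above show, the slide's log is the natural log, and idf is rounded before multiplying):

import math

N = 10_000                                      # documents in the collection
freq = {"Kent": 3, "Ohio": 2, "University": 1}  # term frequencies in the document
df   = {"Kent": 50, "Ohio": 1300, "University": 250}
max_f = max(freq.values())

for term, f in freq.items():
    tf  = f / max_f                  # normalized term frequency
    idf = math.log(N / df[term])     # natural log, as in the example
    print(term, round(tf, 2), round(idf, 1), round(tf * idf, 2))
# Kent 1.0 5.3 5.3 / Ohio 0.67 2.0 1.36 / University 0.33 3.7 1.23
# (the example rounds idf first, giving 1.3 and 1.2 for the last two)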
4.4.3 Cosine Similarity
Cosine similarity with tf-idf weights, where w(j) = tf(j) * idf(j) and idf(j) = log(N/df(j)):

$$\mathrm{sim}(d_1, d_2) = \frac{\sum_j w_{d_1}(j)\, w_{d_2}(j)}{\sqrt{\sum_j w_{d_1}(j)^2 \,\sum_j w_{d_2}(j)^2}}$$
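A minimal sketch of this formula (the weight values below are invented for illustration; in practice each vector entry would come from the tf*idf computation above):

import math

def cosine_similarity(d1, d2):
    """Cosine of the angle between two tf-idf weight vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm = math.sqrt(sum(a * a for a in d1) * sum(b * b for b in d2))
    return dot / norm if norm else 0.0

d1 = [5.3, 0.0, 1.4]  # tf-idf weights of document 1
d2 = [2.6, 1.2, 0.0]  # tf-idf weights of document 2
print(round(cosine_similarity(d1, d2), 3))  # -> 0.878

Because the normalization divides out vector length, a long document and a short one with the same term proportions get the same score, which is why a document can rank highly even on a partial match.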