Ch 4: Information Retrieval and Text Mining
Hakam Alomari

Feb 06, 2016

Transcript
Page 1: Ch 4: Information Retrieval and Text Mining

Ch 4: Information Retrieval and Text Mining

Hakam Alomari

Page 2: Ch 4: Information Retrieval and Text Mining

4.1: Is Information Retrieval a Form of Text Mining?

What is the principal computer specialty for processing documents and text? Information Retrieval (IR).

The task of IR is to retrieve relevant documents in response to a query.

The fundamental technique of IR is measuring similarity: a query is examined and transformed into a vector of values to be compared with the stored documents.

Page 3: Ch 4: Information Retrieval and Text Mining

Cont. 4.1

In the prediction problem, similar documents are retrieved and their properties are then measured; e.g., counting the class labels among them indicates which label should be assigned to a new document.

The objective of prediction can thus be posed in the form of an IR model: documents relevant to a query are retrieved, where the query is a new document.

Page 4: Ch 4: Information Retrieval and Text Mining

Cont. 4.1

Figure 4.1. Key Steps in Information Retrieval:
Specify Query -> Search Document Collection -> Return Subset of Relevant Documents

Figure 4.2. Key Steps in Predictive Text Mining:
Examine Document Collection -> Learn Classification Criteria -> Apply Criteria to New Documents

Page 5: Ch 4: Information Retrieval and Text Mining

Figure 4.3. Predicting from Retrieved Documents:
Specify Query Vector -> Match Document Collection -> Get Subset of Relevant Documents -> Examine Document Properties (using simple criteria such as the documents' labels)

Page 6: Ch 4: Information Retrieval and Text Mining

4.2 Key Word Search

The technical goal of prediction is to classify new, unseen documents.

Prediction and IR are unified by the computation of document similarity.

IR is based on traditional keyword search through a search engine.

So we should recognize that using a search engine is a special instance of the prediction concept.

Page 7: Ch 4: Information Retrieval and Text Mining

We enter keywords into a search engine and expect relevant documents to be returned.

These keywords are words in a dictionary created from the document collection, and together they can be viewed as a small document.

So we want to measure how similar this new document (the query) is to the documents in the collection.

Page 8: Ch 4: Information Retrieval and Text Mining

So the notion of similarity is reduced to finding documents with the same keywords as those posed to the search engine.

But the objective of a search engine is to rank documents, not to assign labels.

So we need additional techniques to break the expected ties (all retrieved documents match the search criteria).

Page 9: Ch 4: Information Retrieval and Text Mining

4.3 Nearest-Neighbor Methods

A nearest-neighbor method compares vectors and measures their similarity.

In prediction: the method collects the k most similar documents and then looks at their labels.

In IR: the method determines whether a satisfactory response to the search query has been found.
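The prediction use of nearest neighbors described above can be sketched as follows. This is a minimal illustration, not the chapter's code: the documents, labels, and the use of shared word count as the similarity measure are assumptions made for the example.

```python
from collections import Counter

def knn_predict(new_doc_words, collection, k=3):
    """Predict a label for a new document by majority vote among the
    k most similar labeled documents (similarity = shared word count)."""
    scored = []
    for words, label in collection:
        shared = len(new_doc_words & words)  # words in common with the query
        scored.append((shared, label))
    scored.sort(key=lambda t: t[0], reverse=True)  # most similar first
    top_labels = [label for _, label in scored[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# hypothetical labeled collection
docs = [
    ({"hardware", "index"}, "tech"),
    ({"user", "information"}, "info"),
    ({"hardware", "user"}, "tech"),
    ({"information", "index"}, "info"),
]
print(knn_predict({"hardware", "user", "index"}, docs, k=3))  # -> tech
```

The new document plays the role of the query: its k nearest neighbors are retrieved, and their labels are counted to make the prediction.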

Page 10: Ch 4: Information Retrieval and Text Mining

4.4 Measuring Similarity

These measures examine how similar documents are; the output is a numerical measure of similarity.

Three increasingly complex measures:
- Shared Word Count
- Word Count and Bonus
- Cosine Similarity

Page 11: Ch 4: Information Retrieval and Text Mining

4.4.1 Shared Word Count

Counts the words shared between documents.

The words: in IR we have a global dictionary in which all potential words are included, with the exception of stopwords.

In prediction it is better to preselect the dictionary relative to the label.

Page 12: Ch 4: Information Retrieval and Text Mining

Computing similarity by shared words:
- Look at all words in the new document.
- For each document in the collection, count how many of these words appear.
- No weighting is used, just a simple count.
- The dictionary has true keywords (weak words removed).
- The results of this measure are clearly intuitive: no one will question why a document was retrieved.

Page 13: Ch 4: Information Retrieval and Text Mining

Computing similarity by Shared words

Each document is represented as a vector of keywords (zeros and ones).

The similarity of two documents is the product of the two vectors.

If two documents share a keyword, that word is counted (1 * 1).

The performance of this measure depends mainly on the dictionary used.
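The vector-product computation just described can be sketched directly; the five-term dictionary and the sample vectors here are assumptions borrowed from the chapter's later example.

```python
def shared_word_count(a, b):
    """Similarity = dot product of two binary keyword vectors:
    each 1*1 position is a keyword the two documents share."""
    return sum(x * y for x, y in zip(a, b))

# dictionary: (hardware, software, user, information, index)
query = [1, 0, 1, 1, 0]   # "hardware, user, information"
doc   = [1, 0, 1, 0, 1]
print(shared_word_count(query, doc))  # hardware and user are shared -> 2
```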

Page 14: Ch 4: Information Retrieval and Text Mining

Computing similarity by Shared words

Shared word count is an exact search: a document is either retrieved or not retrieved.

No weighting can be placed on the terms in the query; for terms A and B, you cannot specify that A is more important than B.

Every retrieved document is treated equally.

Page 15: Ch 4: Information Retrieval and Text Mining

4.4.2 Word Count and Bonus 1/4

TF (term frequency): the number of times a term occurs in a document.

DF (document frequency): the number of documents that contain the term.

IDF (inverse document frequency): log(N / df), where N is the total number of documents.

A vector is a numerical representation of a point in a multi-dimensional space: (x1, x2, ..., xn). The dimensions of the space need to be defined, and a measure on the space needs to be defined.

Page 16: Ch 4: Information Retrieval and Text Mining

4.4.2 Word Count and Bonus 2/4

Each indexing term is a dimension; each document is a vector:

Di = (ti1, ti2, ti3, ..., tik)

Document similarity is defined as

Sim(Di, D) = sum over j = 1..K of w(j)

where

w(j) = 1 + 1/df(j)   if word j occurs in both documents
w(j) = 0             otherwise

K = number of words in the dictionary.
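The word-count-and-bonus formula above can be sketched as a short function; the sample vectors and the document frequencies here are assumptions for illustration.

```python
def bonus_similarity(query, doc, df):
    """Word count with bonus: each shared word j contributes 1 + 1/df(j),
    so words that are rare in the collection count for more."""
    return sum(1 + 1 / df[j]
               for j in range(len(query))
               if query[j] == 1 and doc[j] == 1)

df = [3, 2, 2, 3, 4]   # document frequency of each dictionary term
print(round(bonus_similarity([1, 0, 1, 1, 0], [1, 0, 1, 0, 1], df), 2))  # -> 2.83
```

The two shared words contribute 1 + 1/3 and 1 + 1/2, giving 2.83 rather than the plain shared-word count of 2.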

Page 17: Ch 4: Information Retrieval and Text Mining

4.4.2 Word Count and Bonus 3/4

The bonus 1/df(j) is a variant of idf; if a word occurs in many documents, its bonus is small.

This measure is better than shared word count because it discriminates between weakly and strongly predictive words.

Page 18: Ch 4: Information Retrieval and Text Mining

4.4.2 Word Count and Bonus 4/4

Figure 4.4. Computing Similarity Scores with Bonus

A document space is defined by five terms: hardware, software, user, information, index. The query is "hardware, user, information", i.e. the new document vector 10110.

Labeled spreadsheet vector   Similarity score (with bonus)
10101                        2.83
00011                        1.33
01000                        0
10001                        1.33
00100                        1.5
01010                        1.33
10011                        2.67
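The scores in Figure 4.4 can be reproduced with a few lines of Python; the only assumption is that df(j) is computed over the seven documents in the labeled collection itself.

```python
# the seven labeled documents over (hardware, software, user, information, index)
docs = [
    [1, 0, 1, 0, 1], [0, 0, 0, 1, 1], [0, 1, 0, 0, 0], [1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0], [0, 1, 0, 1, 0], [1, 0, 0, 1, 1],
]
query = [1, 0, 1, 1, 0]   # "hardware, user, information"

# document frequency of each term across the collection
df = [sum(d[j] for d in docs) for j in range(5)]

def sim(q, d):
    """Word count with bonus: 1 + 1/df(j) for each shared term j."""
    return sum(1 + 1 / df[j] for j in range(5) if q[j] and d[j])

scores = [round(sim(query, d), 2) for d in docs]
print(scores)  # -> [2.83, 1.33, 0, 1.33, 1.5, 1.33, 2.67]
```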

Page 19: Ch 4: Information Retrieval and Text Mining

4.4.3 Cosine Similarity The Vector Space

A document is represented as a vector: (W1, W2, ..., Wn). Three weighting schemes:

Binary: Wi = 1 if the corresponding term is in the document; Wi = 0 if it is not.

TF (term frequency): Wi = tfi, where tfi is the number of times the term occurs in the document.

TF*IDF (term frequency times inverse document frequency): Wi = tfi * idfi = tfi * (1 + log(N / dfi)), where dfi is the number of documents containing term i and N is the total number of documents in the collection.

Page 20: Ch 4: Information Retrieval and Text Mining

4.4.3 Cosine Similarity The Vector Space

vec(d) = (w1, w2, ..., wt)

Sim(d1, d2) = cos(theta)
            = (vec(d1) . vec(d2)) / (|d1| * |d2|)
            = (sum over j of wd1(j) * wd2(j)) / (|d1| * |d2|)

w(j) > 0 whenever term j occurs in di, so 0 <= sim(d1, d2) <= 1.

A document is retrieved even if it matches the query terms only partially.
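The cosine formula above can be sketched directly; the binary sample vectors are an assumption for illustration (in practice the components would be tf-idf weights).

```python
import math

def cosine_sim(d1, d2):
    """cos(theta) between two weight vectors: dot product divided by
    the product of the Euclidean norms |d1| and |d2|."""
    dot = sum(a * b for a, b in zip(d1, d2))
    n1 = math.sqrt(sum(a * a for a in d1))
    n2 = math.sqrt(sum(b * b for b in d2))
    return dot / (n1 * n2)

print(round(cosine_sim([1, 0, 1, 1, 0], [1, 0, 1, 0, 1]), 3))  # -> 0.667
```

Because the normalization divides out the vector lengths, a short document can still score highly against a long one when they share proportionally many terms.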

Page 21: Ch 4: Information Retrieval and Text Mining

4.4.3 Cosine Similarity

How do we compute the weight wj? A good weight must take two effects into account:

- quantification of intra-document content (similarity): the tf factor, the term frequency within a document;
- quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency.

wj = tf(j) * idf(j)

Page 22: Ch 4: Information Retrieval and Text Mining

4.4.3 Cosine Similarity

TF in the given document shows how important the term is in that document (it makes the words frequent in the document more important).

IDF makes words that are rare across all documents more important.

A high weight in a tf-idf ranking scheme is therefore reached by a high term frequency in the given document and a low frequency of the term in all other documents.

The term weights in a document affect the position of its document vector: di = (wi,1, wi,2, ..., wi,t).

Page 23: Ch 4: Information Retrieval and Text Mining

4.4.3 Cosine Similarity

TF-IDF definitions:

fik: number of occurrences of term ti in document Dk
tfik = fik / max(fik): normalized term frequency
dfk: number of documents that contain tk
idfk = log(N / dfk), where N is the total number of documents
wik = tfik * idfk: term weight

Intuition: rare words get more weight, common words less weight.

Page 24: Ch 4: Information Retrieval and Text Mining

Example TF-IDF

Given a document containing terms with the following frequencies: Kent = 3, Ohio = 2, University = 1. Assume a collection of 10,000 documents in which the document frequencies of these terms are: Kent = 50, Ohio = 1300, University = 250.

Then (using the natural logarithm):

Kent:       tf = 3/3; idf = log(10000/50)   = 5.3; tf-idf = 5.3
Ohio:       tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3
University: tf = 1/3; idf = log(10000/250)  = 3.7; tf-idf = 1.2
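The arithmetic of this example can be checked in a few lines. Two assumptions reproduce the slide's figures exactly: the logarithm is the natural log, and idf is rounded to one decimal before being multiplied by tf.

```python
import math

N = 10000  # documents in the collection
# term -> (normalized tf, document frequency)
terms = {"Kent": (3/3, 50), "Ohio": (2/3, 1300), "University": (1/3, 250)}

results = {}
for term, (tf, df) in terms.items():
    idf = round(math.log(N / df), 1)   # natural log, rounded as on the slide
    results[term] = (idf, round(tf * idf, 1))

print(results)  # -> Kent: (5.3, 5.3), Ohio: (2.0, 1.3), University: (3.7, 1.2)
```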

Page 25: Ch 4: Information Retrieval and Text Mining

4.4.3 Cosine Similarity

Cosine, with weights

w(j) = tf(j) * idf(j),  idf(j) = log(N / df(j)):

Sim(d1, d2) = (sum over j of wd1(j) * wd2(j)) / sqrt((sum over j of wd1(j)^2) * (sum over j of wd2(j)^2))