Ch 4: Information Retrieval and Text Mining
Hakam Alomari
Feb 06, 2016
4.1: Is Information Retrieval a Form of Text Mining?
What is the principal computer specialty for processing documents and text?
Information Retrieval (IR). The task of IR is to retrieve relevant documents in response to a query.
The fundamental technique of IR is measuring similarity: a query is examined and transformed into a vector of values to be compared with stored documents.
Cont. 4.1
In the prediction problem, similar documents are retrieved and their properties measured, e.g. counting the class labels to see which label should be assigned to a new document.
The objectives of prediction can be posed in the form of an IR model: documents relevant to a query are retrieved, where the query is the new document.
Cont. 4.1
Figure 4.1. Key steps in information retrieval: specify query → search document collection → return subset of relevant documents.
Figure 4.2. Key steps in predictive text mining: examine document collection → learn classification criteria → apply criteria to new documents.
Figure 4.3. Predicting from retrieved documents: specify query vector → match document collection → get subset of relevant documents → examine document properties (simple criteria such as the documents' labels).
4.2 Key Word Search
The technical goal for prediction is to classify new, unseen documents
Prediction and IR are unified by the computation of document similarity.
IR is based on traditional keyword search through a search engine.
So we should recognize that using a search engine is a special instance of the prediction concept.
We enter keywords into a search engine and expect relevant documents to be returned.
These keywords are words in a dictionary created from the document collection, and can themselves be viewed as a small document.
So we want to measure how similar the new document (the query) is to the documents in the collection.
The notion of similarity is thus reduced to finding documents with the same keywords as those posed to the search engine.
But, the objective of the search engine is to rank the documents, not to assign a label
So we need additional techniques to break the expected ties (all retrieved documents match the search criteria)
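As a minimal sketch of this idea (the collection and query below are invented for illustration, not from the book), keyword search treats the query as a small document and retrieves exactly those documents containing all of its words:

def keyword_search(query, collection):
    """Return the documents that contain every query keyword (exact match)."""
    query_words = set(query.lower().split())
    return [doc for doc in collection
            if query_words <= set(doc.lower().split())]

docs = ["the hardware user guide",
        "software index and user information",
        "hardware information for the user"]
print(keyword_search("hardware user", docs))
# -> ['the hardware user guide', 'hardware information for the user']

Note that both matching documents are returned with equal standing, which is exactly the tie-breaking problem mentioned above.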
4.3 Nearest-Neighbor Methods
Nearest-neighbor methods compare vectors and measure similarity.
In prediction, they collect the k most similar documents and then look at their labels.
In IR, they determine whether a satisfactory response to the search query has been found.
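A minimal k-nearest-neighbor sketch (the function names, the (document, label) pairing, and the majority-vote rule are illustrative assumptions, not the book's code): rank labeled documents by similarity to the new document, keep the top k, and vote on their labels.

from collections import Counter

def knn_predict(new_doc, labeled_docs, similarity, k=3):
    """Rank (document, label) pairs by similarity to new_doc,
    then return the majority label among the k most similar."""
    ranked = sorted(labeled_docs,
                    key=lambda pair: similarity(new_doc, pair[0]),
                    reverse=True)
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]

docs = [("spam offer now", "spam"), ("meeting at noon", "ham"),
        ("free spam offer", "spam")]
overlap = lambda a, b: len(set(a.split()) & set(b.split()))
print(knn_predict("free offer now", docs, overlap, k=3))  # -> 'spam'

Any of the similarity measures from Section 4.4 can be plugged in as the similarity function.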
4.4 Measuring Similarity
These measures examine how similar documents are; the output is a numerical measure of similarity.
Three increasingly complex measures: shared word count, word count and bonus, and cosine similarity.
4.4.1 Shared Word Count
Counts the shared words between documents
The words: in IR we have a global dictionary where all potential words are included, with the exception of stopwords.
In prediction it is better to preselect the dictionary relative to the label.
Computing similarity by Shared words
Look at all words in the new document. For each document in the collection, count how many of these words appear.
No weighting is used, just a simple count.
The dictionary has true keywords (weakly predictive words removed).
The results of this measure are clearly intuitive: no one will question why a document was retrieved.
Computing similarity by Shared words
Each document is represented as a vector of keywords (zeros and ones).
The similarity of two documents is the inner product of the two vectors.
If two documents share a keyword, that word is counted (1*1).
The performance of this measure depends mainly on the dictionary used
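A sketch of this computation (the dictionary and documents are invented for illustration): each document becomes a binary vector over the dictionary, and the shared word count is the dot product of two such vectors.

def binary_vector(doc, dictionary):
    """Zero/one vector: 1 where a dictionary term occurs in the document."""
    words = set(doc.lower().split())
    return [1 if term in words else 0 for term in dictionary]

def shared_word_count(vec_a, vec_b):
    # dot product of two binary vectors = number of shared keywords
    return sum(a * b for a, b in zip(vec_a, vec_b))

dictionary = ["hardware", "software", "user", "information", "index"]
d1 = binary_vector("hardware user index", dictionary)         # [1, 0, 1, 0, 1]
d2 = binary_vector("hardware information index", dictionary)  # [1, 0, 0, 1, 1]
print(shared_word_count(d1, d2))  # -> 2 (hardware, index)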
Computing similarity by Shared words
Shared word count is an exact search: a document is either retrieved or not.
No weighting can be applied to query terms; for terms A and B, you cannot specify that A is more important than B.
Every retrieved document is treated equally.
4.4.2 Word Count and Bonus 1/4
TF (term frequency): the number of times a term occurs in a document.
DF (document frequency): the number of documents that contain the term.
IDF (inverse document frequency): log(N/df), where N is the total number of documents.
A vector is a numerical representation of a point in a multi-dimensional space: (x1, x2, ..., xn). The dimensions of the space and a measure on the space need to be defined.
4.4.2 Word Count and Bonus 2/4
Each indexing term is a dimension; each document is a vector: Di = (ti1, ti2, ti3, ti4, ..., tik).
Document similarity is defined as

$$\mathrm{Sim}(D_1, D_2) = \sum_{j=1}^{K} w(j), \qquad w(j) = \begin{cases} 1 + \dfrac{1}{df(j)} & \text{if word } j \text{ occurs in both documents} \\ 0 & \text{otherwise} \end{cases}$$

where K is the number of words in the dictionary.
4.4.2 Word Count and Bonus 3/4
The bonus 1/df(j) is a variant of idf. Thus, if the word occurs in many documents, the bonus is small.
This measure is better than the shared word count because it discriminates between weakly and strongly predictive words.
4.4.2 Word Count and Bonus 4/4
A document space is defined by five terms: hardware, software, user, information, index. The query (the new document) is "hardware, user, information", i.e. the vector 10110. Comparing it against the labeled spreadsheet of stored document vectors gives the similarity scores with bonus:

Document vector   Similarity score
10101             2.83
00011             1.33
01000             0
10001             1.33
00100             1.50
01010             1.33
10011             2.67

Figure 4.4. Computing similarity scores with bonus.
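These scores can be reproduced with a short sketch (a reconstruction for illustration, not the book's code): for each stored document, sum 1 + 1/df(j) over the words it shares with the new document.

docs = [  # binary vectors over (hardware, software, user, information, index)
    [1,0,1,0,1], [0,0,0,1,1], [0,1,0,0,0], [1,0,0,0,1],
    [0,0,1,0,0], [0,1,0,1,0], [1,0,0,1,1],
]
query = [1, 0, 1, 1, 0]  # "hardware, user, information"

# df(j): number of stored documents containing word j
df = [sum(doc[j] for doc in docs) for j in range(len(query))]

def bonus_similarity(query, doc, df):
    # each shared word contributes 1 + 1/df(j); non-shared words contribute 0
    return sum(1 + 1/df[j]
               for j in range(len(query)) if query[j] and doc[j])

print([round(bonus_similarity(query, d, df), 2) for d in docs])
# -> [2.83, 1.33, 0, 1.33, 1.5, 1.33, 2.67], matching Figure 4.4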
4.4.3 Cosine Similarity: The Vector Space
A document is represented as a vector: (W1, W2, ..., Wn).
Binary: Wi = 1 if the corresponding term is in the document; Wi = 0 if it is not.
TF (term frequency): Wi = tfi, where tfi is the number of times the term occurred in the document.
TF*IDF (inverse document frequency): Wi = tfi*idfi = tfi*(1 + log(N/dfi)), where dfi is the number of documents containing term i, and N is the total number of documents in the collection.
4.4.3 Cosine Similarity: The Vector Space
vec(d) = (w1, w2, ..., wt)

$$\mathrm{sim}(d_1, d_2) = \cos\theta = \frac{\vec{d_1} \cdot \vec{d_2}}{|d_1|\,|d_2|} = \frac{\sum_j w_{d_1}(j)\, w_{d_2}(j)}{|d_1|\,|d_2|}$$

Since w(j) > 0 whenever word j occurs in di, we have 0 <= sim(d1, d2) <= 1.
A document is retrieved even if it matches the query terms only partially
4.4.3 Cosine Similarity
How do we compute the weight wj? A good weight must take into account two effects:
quantification of intra-document content (similarity): the tf factor, the term frequency within a document;
quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency.
wj = tf(j) * idf(j)
4.4.3 Cosine Similarity
TF in the given document shows how important the term is in this document (makes the frequent words for the document more important)
IDF makes rare words across all documents more important.
A high weight in a tf-idf ranking scheme is therefore reached by a high term frequency in the given document and a low term frequency in all other documents.
Term weights in a document affect the position of the document vector di = (wi,1, wi,2, ..., wi,t).
4.4.3 Cosine Similarity
TF-IDF definitions:
fik: the number of occurrences of term ti in document Dk
tfik = fik / max(fik): normalized term frequency
dfi: the number of documents that contain term ti
idfi = log(N/dfi), where N is the total number of documents
wik = tfik * idfi: term weight
Intuition: rare words get more weight, common words less weight
Example TF-IDF
Given a document containing terms with frequencies Kent = 3, Ohio = 2, University = 1, assume a collection of 10,000 documents in which the document frequencies of these terms are Kent = 50, Ohio = 1300, University = 250.
Then (using the natural log):
Kent: tf = 3/3; idf = log(10000/50) = 5.3; tf-idf = 5.3
Ohio: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3
University: tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2
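The example can be checked with a few lines (a sketch; as the numbers above show, the slide's log is the natural log, and idf is rounded before multiplying):

import math

N = 10_000                                      # documents in the collection
freq = {"Kent": 3, "Ohio": 2, "University": 1}  # term frequencies in the document
df   = {"Kent": 50, "Ohio": 1300, "University": 250}
max_f = max(freq.values())

for term, f in freq.items():
    tf  = f / max_f                  # normalized term frequency
    idf = math.log(N / df[term])     # natural log, as in the example
    print(term, round(tf, 2), round(idf, 1), round(tf * idf, 2))
# Kent 1.0 5.3 5.3 / Ohio 0.67 2.0 1.36 / University 0.33 3.7 1.23
# (the example rounds idf first, giving 1.3 and 1.2 for the last two)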
4.4.3 Cosine Similarity
Cosine similarity with tf-idf weights, where w(j) = tf(j) * idf(j) and idf(j) = log(N/df(j)):

$$\mathrm{sim}(d_1, d_2) = \frac{\sum_j w_{d_1}(j)\, w_{d_2}(j)}{\sqrt{\sum_j w_{d_1}(j)^2 \,\sum_j w_{d_2}(j)^2}}$$
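A minimal sketch of this formula (the weight values below are invented for illustration; in practice each vector entry would come from the tf*idf computation above):

import math

def cosine_similarity(d1, d2):
    """Cosine of the angle between two tf-idf weight vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm = math.sqrt(sum(a * a for a in d1) * sum(b * b for b in d2))
    return dot / norm if norm else 0.0

d1 = [5.3, 0.0, 1.4]  # tf-idf weights of document 1
d2 = [2.6, 1.2, 0.0]  # tf-idf weights of document 2
print(round(cosine_similarity(d1, d2), 3))  # -> 0.878

Because the normalization divides out vector length, a long document and a short one with the same term proportions get the same score, which is why a document can rank highly even on a partial match.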