Page 1
Python & Web Mining
Old Dominion University, Department of Computer Science
Hany SalahEldeen CS495 – Python & Web Mining Fall 2012
Lecture 5
CS 495 Fall 2012
Hany SalahEldeen Khalil [email protected]
10-03-12
Presented & Prepared by: Justin F. [email protected]
Page 2
Chapter 6: "Document Filtering"
Page 3
Document Filtering
In a nutshell: classifying documents based on their content. The classification can be binary (good/bad, spam/not-spam) or n-ary (school-related emails, work-related, commercials, etc.).
Page 4
Why do we need Document filtering?
• Eliminating spam.
• Removing unrelated comments in forums and public message boards.
• Classifying social/work-related emails automatically.
• Forwarding information-request emails to the expert who is most capable of answering them.
Page 5
Spam Filtering
• First came rule-based classifiers, flagging:
• Overuse of capital letters
• Words related to pharmaceutical products
• Garish HTML colors
Page 6
Cons of using Rule-based classifiers
• Easy to trick by simply avoiding the patterns (capital letters, etc.).
• What is considered spam varies from person to person.
• Ex: the inbox of a medical rep vs. the email of a housewife.
Page 7
Solution
• Develop programs that learn.
• Teach them the differences and how to recognize each class by providing examples of each class.
Page 8
Features
• We need to extract features from documents in order to classify them.
• Feature: anything that can be determined as being either present or absent in the item.
Page 9
Definitions
• item = document
• feature = word
• classification = {good|bad}
Page 10
Dictionary Building
Page 11
Dictionary Building
• Remember:
• Lowercasing (removing capital letters) reduces the total number of features by removing the SHOUTING style.
• The granularity of features is also crucial (using the entire email as one feature vs. each letter as a feature).
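The feature-extraction step above can be sketched as a small tokenizer in the spirit of the chapter's getwords(); the length bounds here are an illustrative assumption:

```python
import re

def getwords(doc):
    # Split on non-word characters and lowercase everything,
    # so 'Money' and 'MONEY' collapse into one feature
    words = [w.lower() for w in re.split(r'\W+', doc)
             if 2 < len(w) < 20]   # drop very short/long tokens (assumed bounds)
    # A feature is simply present or absent in the item
    return dict.fromkeys(words, 1)

features = getwords('Make QUICK money NOW')
```

Lowercasing means 'QUICK' and 'quick' become the same feature, shrinking the dictionary.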
Page 12
Classifier Training
• The classifier is designed to start off very uncertain.
• It increases in certainty as it learns features.
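A minimal sketch of such a trainable classifier, modeled loosely on the chapter's docclass.classifier (the class and attribute names are assumptions):

```python
class Classifier:
    """Counts feature/category pairs so certainty grows with training."""
    def __init__(self, getfeatures):
        self.fc = {}           # feature -> {category: count}
        self.cc = {}           # category -> number of documents seen
        self.getfeatures = getfeatures

    def train(self, item, cat):
        # Increment the count of every feature for this category
        for f in self.getfeatures(item):
            self.fc.setdefault(f, {}).setdefault(cat, 0)
            self.fc[f][cat] += 1
        # And count the document itself
        self.cc[cat] = self.cc.get(cat, 0) + 1

cl = Classifier(lambda d: set(d.lower().split()))
cl.train('the quick rabbit jumps fences', 'good')
cl.train('make quick money at the online casino', 'bad')
```

All later probabilities are derived from these two count tables, so every training example directly sharpens the classifier.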
Page 13
Classifier Training
Page 14
Probabilities
• A probability is a number between 0 and 1 indicating how likely an event is.
Page 15
Probabilities
• 'quick' appeared in 2 documents classified as good, and the total number of good documents is 3.
Page 16
Conditional Probabilities
Pr(A|B) = “probability of A given B”
fprob(quick|good) = “probability of quick given good”
= (quick classified as good) / (total good items) = 2 / 3
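As a sketch, the conditional probability above is just a ratio of counts (the parameter names fcount/catcount are assumptions):

```python
def fprob(fcount, catcount):
    # Pr(feature | category): documents in the category containing
    # the feature, divided by all documents in the category
    if catcount == 0:
        return 0.0
    return fcount / catcount

# 'quick' appears in 2 of the 3 good documents
p = fprob(2, 3)
```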
Page 17
Starting with a Reasonable Guess
• Using only the information seen so far makes the classifier extremely sensitive during the early training stages.
• Ex: "money"
• "money" appeared in a casino training document as bad.
• It therefore appears with probability 0 for good, which is not right!
Page 18
Solution: Start with assumed probability
• Start, for instance, with a probability of 0.5 for each feature.
• Also decide the weight to give to this assumed probability.
Page 19
Assumed Probability
>>> cl.fprob('money','bad')
0.5
>>> cl.fprob('money','good')
0.0
We have data for bad, but should we start with a 0 probability for 'money' given good?
>>> cl.weightedprob('money','good',cl.fprob)
0.25
>>> docclass.sampletrain(cl)
Nobody owns the water.
the quick rabbit jumps fences
buy pharmaceuticals now
make quick money at the online casino
the quick brown fox jumps
>>> cl.weightedprob('money','good',cl.fprob)
0.16666666666666666
>>> cl.fcount('money','bad')
3.0
>>> cl.weightedprob('money','bad',cl.fprob)
0.5
Define an assumed probability of 0.5; weightedprob() then returns the weighted mean of fprob() and the assumed probability.
weightedprob(money,good) = (weight * assumedprob + count * fprob()) / (count + weight)
= (1*0.5 + 1*0) / (1+1) = 0.5 / 2 = 0.25
(double the training data)
= (1*0.5 + 2*0) / (2+1) = 0.5 / 3 = 0.166
Pr(money|bad) remains: (1*0.5 + 3*0.5) / (3+1) = 0.5
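The weighted-probability formula translates directly into code; the defaults weight=1.0 and ap=0.5 follow this slide:

```python
def weightedprob(basicprob, count, weight=1.0, ap=0.5):
    # Weighted mean of the assumed probability (ap) and the observed
    # fprob; the more evidence (count), the less the assumption matters
    return (weight * ap + count * basicprob) / (count + weight)

a = weightedprob(0.0, 1)   # money|good after one training pass
b = weightedprob(0.0, 2)   # after doubling the training data
c = weightedprob(0.5, 3)   # money|bad stays put
```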
Page 20
Naïve Bayesian Classifier
• Move from terms to documents:
Pr(document) = Pr(term1) * Pr(term2) * … * Pr(termN)
• Naïve because we assume all terms occur independently.
• We know this is a simplifying assumption; it is naïve to think all terms have equal probability of completing this phrase: "Shave and a hair cut ___ ____"
• Bayesian because we use Bayes' Theorem to invert the conditional probabilities.
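The term-to-document step is a plain product of per-term probabilities. A sketch, using weighted term probabilities I computed from the sample training data for the good category (these specific numbers are my own working, shown for illustration):

```python
def docprob(features, termprob):
    # Naive independence assumption: Pr(doc | cat) is the product
    # of the individual term probabilities
    p = 1.0
    for f in features:
        p *= termprob[f]
    return p

# weightedprob values for the sample training data, good category
wp = {'quick': 0.625, 'rabbit': 0.41666666666666663}
p = docprob(['quick', 'rabbit'], wp) * 3 / 5   # times Pr(good) = 3/5
```

Multiplying by Pr(good) = 3/5 reproduces the 0.15625 that cl.prob('quick rabbit','good') prints on a later slide.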
Page 21
Bayes Theorem
• Given our training data, we know: Pr(feature|classification)
• What we really want to know is: Pr(classification|feature)
• Bayes' Theorem*: Pr(A|B) = Pr(B|A) Pr(A) / Pr(B)
• Or: Pr(good|doc) = Pr(doc|good) Pr(good) / Pr(doc)
• Pr(doc|good): we know how to calculate this.
• Pr(good): #good / #total.
• Pr(doc): we skip this, since it is the same for each classification.
* http://en.wikipedia.org/wiki/Bayes%27_theorem
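Dropping Pr(doc) leaves a score merely proportional to the posterior, which is enough to compare categories. A sketch, with the 'quick rabbit' intermediate values from the sample data (the specific Pr(doc|cat) numbers are my own computation):

```python
def score(pr_doc_given_cat, catcount, totalcount):
    # Bayes' theorem with the shared Pr(doc) denominator dropped:
    # proportional to Pr(cat | doc), good enough for comparison
    return pr_doc_given_cat * catcount / totalcount

good = score(0.2604166666666666, 3, 5)   # Pr(doc|good) * Pr(good)
bad  = score(0.125, 2, 5)                # Pr(doc|bad)  * Pr(bad)
```

These scores match the 0.15625 and 0.05 that the classifier prints for 'quick rabbit' on the next slides, so 'good' wins.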
Page 22
Our Bayesian Classifier
>>> import docclass
>>> cl=docclass.naivebayes(docclass.getwords)
>>> docclass.sampletrain(cl)
Nobody owns the water.
the quick rabbit jumps fences
buy pharmaceuticals now
make quick money at the online casino
the quick brown fox jumps
>>> cl.prob('quick rabbit','good')
quick rabbit
0.15624999999999997
>>> cl.prob('quick rabbit','bad')
quick rabbit
0.050000000000000003
>>> cl.prob('quick rabbit jumps','good')
quick rabbit jumps
0.095486111111111091
>>> cl.prob('quick rabbit jumps','bad')
quick rabbit jumps
0.0083333333333333332
We use these values only for comparison, not as "real" probabilities.
Page 23
Bayesian Classifier
• http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Testing
Page 24
Classification Thresholds
>>> cl.prob('quick rabbit','good')
quick rabbit
0.15624999999999997
>>> cl.prob('quick rabbit','bad')
quick rabbit
0.050000000000000003
>>> cl.classify('quick rabbit',default='unknown')
quick rabbit
u'good'
>>> cl.prob('quick money','good')
quick money
0.09375
>>> cl.prob('quick money','bad')
quick money
0.10000000000000001
>>> cl.classify('quick money',default='unknown')
quick money
u'bad'
>>> cl.setthreshold('bad',3.0)
>>> cl.classify('quick money',default='unknown')
quick money
'unknown'
>>> cl.classify('quick rabbit',default='unknown')
quick rabbit
u'good'
Only classify something as bad if it is 3X more likely to be bad than good.
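The threshold logic can be sketched like this, as a simplified stand-in for the chapter's classify()/setthreshold() (the dict-based interface is an assumption):

```python
def classify(scores, thresholds, default='unknown'):
    # Pick the highest-scoring category, but only if it beats every
    # other category by its threshold factor; otherwise fall back
    best = max(scores, key=scores.get)
    for cat, p in scores.items():
        if cat != best and p * thresholds.get(best, 1.0) > scores[best]:
            return default
    return best

r1 = classify({'good': 0.09375, 'bad': 0.1}, {'bad': 3.0})
r2 = classify({'good': 0.15625, 'bad': 0.05}, {'bad': 3.0})
```

With a threshold of 3.0 on 'bad', the narrow win for 'quick money' collapses to 'unknown' (r1), while 'quick rabbit' is still confidently 'good' (r2).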
Page 25
Classification Thresholds (cont.)
>>> for i in range(10): docclass.sampletrain(cl)
...
>>> cl.prob('quick money','good')
quick money
0.016544117647058824
>>> cl.prob('quick money','bad')
quick money
0.10000000000000001
>>> cl.classify('quick money',default='unknown')
quick money
u'bad'
>>> cl.prob('quick rabbit','good')
quick rabbit
0.13786764705882351
>>> cl.prob('quick rabbit','bad')
quick rabbit
0.0083333333333333332
>>> cl.classify('quick rabbit',default='unknown')
quick rabbit
u'good'
Page 26
Fisher Method
• Normalize the frequencies for each category.
• e.g., we might have far more "bad" training data than good, so the net cast by the bad data will be "wider" than we'd like.
• Calculate the normalized Bayesian probability, then fit the result to an inverse chi-square function to see what the probability is that a random document of that classification would have those features (i.e., terms).
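A sketch of the Fisher combination: multiply the per-term category probabilities, then feed -2·ln(product) into an inverse chi-square function. This mirrors the shape of the book's invchi2()/fisherprob(), simplified to take the term probabilities directly:

```python
import math

def invchi2(chi, df):
    # Probability of a chi-square value at least this large arising
    # by chance, for an even number of degrees of freedom
    m = chi / 2.0
    term = s = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        s += term
    return min(s, 1.0)

def fisherprob(probs):
    # Fisher's method over the per-term category probabilities
    product = 1.0
    for p in probs:
        product *= p
    return invchi2(-2.0 * math.log(product), len(probs) * 2)

# single term: 'quick' with weighted cprob 0.5535714...
p1 = fisherprob([0.5535714285714286])
# two terms: 'quick' (0.5535714...) and 'rabbit' (0.75)
p2 = fisherprob([0.5535714285714286, 0.75])
```

Fed the sample data's weighted cprob values, this reproduces the fisherprob outputs shown on the next slides (0.5535714... for 'quick', 0.78013... for 'quick rabbit').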
Page 27
Fisher Example
>>> import docclass
>>> cl=docclass.fisherclassifier(docclass.getwords)
>>> cl.setdb('mln.db')
>>> docclass.sampletrain(cl)
>>> cl.cprob('quick','good')
0.57142857142857151
>>> cl.fisherprob('quick','good')
quick
0.5535714285714286
>>> cl.fisherprob('quick rabbit','good')
quick rabbit
0.78013986588957995
>>> cl.cprob('rabbit','good')
1.0
>>> cl.fisherprob('rabbit','good')
rabbit
0.75
>>> cl.cprob('quick','good')
0.57142857142857151
>>> cl.cprob('quick','bad')
0.4285714285714286
Page 28
Fisher Example
>>> cl.cprob('money','good')
0
>>> cl.cprob('money','bad')
1.0
>>> cl.cprob('buy','bad')
1.0
>>> cl.cprob('buy','good')
0
>>> cl.fisherprob('money buy','good')
money buy
0.23578679513998632
>>> cl.fisherprob('money buy','bad')
money buy
0.8861423315082535
>>> cl.fisherprob('money quick','good')
money quick
0.41208671548422637
>>> cl.fisherprob('money quick','bad')
money quick
0.70116895256207468
Page 29
Classification with Inverse Chi-Square
>>> cl.fisherprob('quick rabbit','good')
quick rabbit
0.78013986588957995
>>> cl.classify('quick rabbit')
quick rabbit
u'good'
>>> cl.fisherprob('quick money','good')
quick money
0.41208671548422637
>>> cl.classify('quick money')
quick money
u'bad'
>>> cl.setminimum('bad',0.8)
>>> cl.classify('quick money')
quick money
u'good'
>>> cl.setminimum('good',0.4)
>>> cl.classify('quick money')
quick money
u'good'
>>> cl.setminimum('good',0.42)
>>> cl.classify('quick money')
quick money
This version of the classifier does not print "unknown" as a classification.
In practice, we'll tolerate false positives for "good" more than false negatives for "good" -- we'd rather see a message that is spam than lose a message that is not.
Page 30
Fisher -- Simplified
• Reduces the signal-to-noise ratio.
• Assumes documents occur with a normal distribution.
• Estimates differences in corpus size with chi-squared.
• Chi-squared is a "goodness-of-fit" test between an observed distribution and a theoretical distribution.
• Utilizes confidence interval & standard deviation estimations for a corpus.
• http://en.wikipedia.org/w/index.php?title=File:Chi-square_pdf.svg&page=1
Page 31
Assignment 4
• Pick one question from the end of the chapter.
• Implement the function and briefly state the differences.
• Utilize the python files associated with the class if needed.
• Deadline: Next week