Page 1
Python & Web Mining
Old Dominion University, Department of Computer Science
Hany SalahEldeen CS495 – Python & Web Mining Fall 2012
Lecture 5
CS 495 Fall 2012
Hany SalahEldeen Khalil [email protected]
10-03-12
Presented & Prepared by: Justin F. [email protected]
Page 2
Chapter 6: "Document Filtering"
Page 3
Document Filtering
In a nutshell: classifying documents based on their content. The classification can be binary (good/bad, spam/not-spam) or n-ary (school-related emails, work-related, commercials, etc.).
Page 4
Why do we need Document filtering?
• Eliminating spam.
• Removing unrelated comments in forums and public message boards.
• Classifying social/work-related emails automatically.
• Forwarding information-request emails to the expert who is most capable of answering them.
Page 5
Spam Filtering
• First came rule-based classifiers, flagging:
• Overuse of capital letters
• Words related to pharmaceutical products
• Garish HTML colors
Page 6
Cons of using Rule-based classifiers
• Easy to trick by simply avoiding the patterns (capital letters, etc.).
• What is considered spam varies from person to person.
• Ex: the inbox of a medical rep vs. the email of a housewife.
Page 7
Solution
• Develop programs that learn.
• Teach them the differences and how to recognize each class by providing examples of each class.
Page 8
Features
• We need to extract features from documents in order to classify them.
• Feature: anything that can be determined as being either present or absent in the item.
Page 9
Definitions
• item = document
• feature = word
• classification = {good|bad}
Page 10
Dictionary Building
Page 11
Dictionary Building
• Remember:
• Lowercasing (removing capital letters) reduces the total number of features by removing the SHOUTING style.
• The granularity of features is also crucial (using the entire email as one feature vs. each letter as a feature).
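The feature-extraction step above can be sketched as a small tokenizer in the spirit of the chapter's getwords(); the length bounds here are an illustrative assumption:

```python
import re

def getwords(doc):
    # Split on non-word characters and lowercase everything,
    # so 'Money' and 'MONEY' collapse into one feature
    words = [w.lower() for w in re.split(r'\W+', doc)
             if 2 < len(w) < 20]   # drop very short/long tokens (assumed bounds)
    # A feature is simply present or absent in the item
    return dict.fromkeys(words, 1)

features = getwords('Make QUICK money NOW')
```

Lowercasing means 'QUICK' and 'quick' become the same feature, shrinking the dictionary.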
Page 12
Classifier Training
• The classifier is designed to start off very uncertain.
• It increases in certainty as it learns features.
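A minimal sketch of such a trainable classifier, modeled loosely on the chapter's docclass.classifier (the class and attribute names are assumptions):

```python
class Classifier:
    """Counts feature/category pairs so certainty grows with training."""
    def __init__(self, getfeatures):
        self.fc = {}           # feature -> {category: count}
        self.cc = {}           # category -> number of documents seen
        self.getfeatures = getfeatures

    def train(self, item, cat):
        # Increment the count of every feature for this category
        for f in self.getfeatures(item):
            self.fc.setdefault(f, {}).setdefault(cat, 0)
            self.fc[f][cat] += 1
        # And count the document itself
        self.cc[cat] = self.cc.get(cat, 0) + 1

cl = Classifier(lambda d: set(d.lower().split()))
cl.train('the quick rabbit jumps fences', 'good')
cl.train('make quick money at the online casino', 'bad')
```

All later probabilities are derived from these two count tables, so every training example directly sharpens the classifier.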
Page 13
Classifier Training
Page 14
Probabilities
• A probability is a number between 0 and 1 indicating how likely an event is.
Page 15
Probabilities
• 'quick' appeared in 2 documents classified as good, and the total number of good documents is 3.
Page 16
Conditional Probabilities
Pr(A|B) = “probability of A given B”
fprob(quick|good) = “probability of quick given good”
= (quick classified as good) / (total good items) = 2 / 3
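As a sketch, the conditional probability above is just a ratio of counts (the parameter names fcount/catcount are assumptions):

```python
def fprob(fcount, catcount):
    # Pr(feature | category): documents in the category containing
    # the feature, divided by all documents in the category
    if catcount == 0:
        return 0.0
    return fcount / catcount

# 'quick' appears in 2 of the 3 good documents
p = fprob(2, 3)
```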
Page 17
Starting with a Reasonable Guess
• Using only the information seen so far makes the classifier extremely sensitive during the early training stages.
• Ex: "money"
• "money" appeared in a casino training document as bad.
• It therefore appears with probability 0 for good, which is not right!
Page 18
Solution: Start with assumed probability
• Start, for instance, with a probability of 0.5 for each feature.
• Also decide the weight to give to this assumed probability.
Page 19
Assumed Probability
>>> cl.fprob('money','bad')
0.5
>>> cl.fprob('money','good')
0.0
We have data for bad, but should we start with a 0 probability for 'money' given good?
>>> cl.weightedprob('money','good',cl.fprob)
0.25
>>> docclass.sampletrain(cl)
Nobody owns the water.
the quick rabbit jumps fences
buy pharmaceuticals now
make quick money at the online casino
the quick brown fox jumps
>>> cl.weightedprob('money','good',cl.fprob)
0.16666666666666666
>>> cl.fcount('money','bad')
3.0
>>> cl.weightedprob('money','bad',cl.fprob)
0.5
Define an assumed probability of 0.5; weightedprob() then returns the weighted mean of fprob() and the assumed probability.
weightedprob(money,good) = (weight * assumedprob + count * fprob()) / (count + weight)
= (1*0.5 + 1*0) / (1+1) = 0.5 / 2 = 0.25
(double the training data)
= (1*0.5 + 2*0) / (2+1) = 0.5 / 3 = 0.166
Pr(money|bad) remains: (1*0.5 + 3*0.5) / (3+1) = 0.5
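The weighted-probability formula translates directly into code; the defaults weight=1.0 and ap=0.5 follow this slide:

```python
def weightedprob(basicprob, count, weight=1.0, ap=0.5):
    # Weighted mean of the assumed probability (ap) and the observed
    # fprob; the more evidence (count), the less the assumption matters
    return (weight * ap + count * basicprob) / (count + weight)

a = weightedprob(0.0, 1)   # money|good after one training pass
b = weightedprob(0.0, 2)   # after doubling the training data
c = weightedprob(0.5, 3)   # money|bad stays put
```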
Page 20
Naïve Bayesian Classifier
• Move from terms to documents:
Pr(document) = Pr(term1) * Pr(term2) * … * Pr(termN)
• Naïve because we assume all terms occur independently.
• We know this is a simplifying assumption; it is naïve to think all terms have equal probability of completing this phrase: "Shave and a hair cut ___ ____"
• Bayesian because we use Bayes' Theorem to invert the conditional probabilities.
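The term-to-document step is a plain product of per-term probabilities. A sketch, using weighted term probabilities I computed from the sample training data for the good category (these specific numbers are my own working, shown for illustration):

```python
def docprob(features, termprob):
    # Naive independence assumption: Pr(doc | cat) is the product
    # of the individual term probabilities
    p = 1.0
    for f in features:
        p *= termprob[f]
    return p

# weightedprob values for the sample training data, good category
wp = {'quick': 0.625, 'rabbit': 0.41666666666666663}
p = docprob(['quick', 'rabbit'], wp) * 3 / 5   # times Pr(good) = 3/5
```

Multiplying by Pr(good) = 3/5 reproduces the 0.15625 that cl.prob('quick rabbit','good') prints on a later slide.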
Page 21
Bayes Theorem
• Given our training data, we know: Pr(feature|classification)
• What we really want to know is: Pr(classification|feature)
• Bayes' Theorem*: Pr(A|B) = Pr(B|A) Pr(A) / Pr(B)
• Or: Pr(good|doc) = Pr(doc|good) Pr(good) / Pr(doc)
• Pr(doc|good): we know how to calculate this.
• Pr(good): #good / #total.
• Pr(doc): we skip this, since it is the same for each classification.
* http://en.wikipedia.org/wiki/Bayes%27_theorem
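Dropping Pr(doc) leaves a score merely proportional to the posterior, which is enough to compare categories. A sketch, with the 'quick rabbit' intermediate values from the sample data (the specific Pr(doc|cat) numbers are my own computation):

```python
def score(pr_doc_given_cat, catcount, totalcount):
    # Bayes' theorem with the shared Pr(doc) denominator dropped:
    # proportional to Pr(cat | doc), good enough for comparison
    return pr_doc_given_cat * catcount / totalcount

good = score(0.2604166666666666, 3, 5)   # Pr(doc|good) * Pr(good)
bad  = score(0.125, 2, 5)                # Pr(doc|bad)  * Pr(bad)
```

These scores match the 0.15625 and 0.05 that the classifier prints for 'quick rabbit' on the next slides, so 'good' wins.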
Page 22
Our Bayesian Classifier
>>> import docclass
>>> cl=docclass.naivebayes(docclass.getwords)
>>> docclass.sampletrain(cl)
Nobody owns the water.
the quick rabbit jumps fences
buy pharmaceuticals now
make quick money at the online casino
the quick brown fox jumps
>>> cl.prob('quick rabbit','good')
quick rabbit
0.15624999999999997
>>> cl.prob('quick rabbit','bad')
quick rabbit
0.050000000000000003
>>> cl.prob('quick rabbit jumps','good')
quick rabbit jumps
0.095486111111111091
>>> cl.prob('quick rabbit jumps','bad')
quick rabbit jumps
0.0083333333333333332
We use these values only for comparison, not as "real" probabilities.
Page 23
Bayesian Classifier
• http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Testing
Page 24
Classification Thresholds
>>> cl.prob('quick rabbit','good')
quick rabbit
0.15624999999999997
>>> cl.prob('quick rabbit','bad')
quick rabbit
0.050000000000000003
>>> cl.classify('quick rabbit',default='unknown')
quick rabbit
u'good'
>>> cl.prob('quick money','good')
quick money
0.09375
>>> cl.prob('quick money','bad')
quick money
0.10000000000000001
>>> cl.classify('quick money',default='unknown')
quick money
u'bad'
>>> cl.setthreshold('bad',3.0)
>>> cl.classify('quick money',default='unknown')
quick money
'unknown'
>>> cl.classify('quick rabbit',default='unknown')
quick rabbit
u'good'
Only classify something as bad if it is 3X more likely to be bad than good.
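The threshold logic can be sketched like this, as a simplified stand-in for the chapter's classify()/setthreshold() (the dict-based interface is an assumption):

```python
def classify(scores, thresholds, default='unknown'):
    # Pick the highest-scoring category, but only if it beats every
    # other category by its threshold factor; otherwise fall back
    best = max(scores, key=scores.get)
    for cat, p in scores.items():
        if cat != best and p * thresholds.get(best, 1.0) > scores[best]:
            return default
    return best

r1 = classify({'good': 0.09375, 'bad': 0.1}, {'bad': 3.0})
r2 = classify({'good': 0.15625, 'bad': 0.05}, {'bad': 3.0})
```

With a threshold of 3.0 on 'bad', the narrow win for 'quick money' collapses to 'unknown' (r1), while 'quick rabbit' is still confidently 'good' (r2).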
Page 25
Classification Thresholds (cont.)
>>> for i in range(10): docclass.sampletrain(cl)
...
>>> cl.prob('quick money','good')
quick money
0.016544117647058824
>>> cl.prob('quick money','bad')
quick money
0.10000000000000001
>>> cl.classify('quick money',default='unknown')
quick money
u'bad'
>>> cl.prob('quick rabbit','good')
quick rabbit
0.13786764705882351
>>> cl.prob('quick rabbit','bad')
quick rabbit
0.0083333333333333332
>>> cl.classify('quick rabbit',default='unknown')
quick rabbit
u'good'
Page 26
Fisher Method
• Normalize the frequencies for each category.
• e.g., we might have far more "bad" training data than good, so the net cast by the bad data will be "wider" than we'd like.
• Calculate the normalized Bayesian probability, then fit the result to an inverse chi-square function to see what the probability is that a random document of that classification would have those features (i.e., terms).
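A sketch of the Fisher combination: multiply the per-term category probabilities, then feed -2·ln(product) into an inverse chi-square function. This mirrors the shape of the book's invchi2()/fisherprob(), simplified to take the term probabilities directly:

```python
import math

def invchi2(chi, df):
    # Probability of a chi-square value at least this large arising
    # by chance, for an even number of degrees of freedom
    m = chi / 2.0
    term = s = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        s += term
    return min(s, 1.0)

def fisherprob(probs):
    # Fisher's method over the per-term category probabilities
    product = 1.0
    for p in probs:
        product *= p
    return invchi2(-2.0 * math.log(product), len(probs) * 2)

# single term: 'quick' with weighted cprob 0.5535714...
p1 = fisherprob([0.5535714285714286])
# two terms: 'quick' (0.5535714...) and 'rabbit' (0.75)
p2 = fisherprob([0.5535714285714286, 0.75])
```

Fed the sample data's weighted cprob values, this reproduces the fisherprob outputs shown on the next slides (0.5535714... for 'quick', 0.78013... for 'quick rabbit').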
Page 27
Fisher Example
>>> import docclass
>>> cl=docclass.fisherclassifier(docclass.getwords)
>>> cl.setdb('mln.db')
>>> docclass.sampletrain(cl)
>>> cl.cprob('quick','good')
0.57142857142857151
>>> cl.fisherprob('quick','good')
quick
0.5535714285714286
>>> cl.fisherprob('quick rabbit','good')
quick rabbit
0.78013986588957995
>>> cl.cprob('rabbit','good')
1.0
>>> cl.fisherprob('rabbit','good')
rabbit
0.75
>>> cl.cprob('quick','good')
0.57142857142857151
>>> cl.cprob('quick','bad')
0.4285714285714286
Page 28
Fisher Example
>>> cl.cprob('money','good')
0
>>> cl.cprob('money','bad')
1.0
>>> cl.cprob('buy','bad')
1.0
>>> cl.cprob('buy','good')
0
>>> cl.fisherprob('money buy','good')
money buy
0.23578679513998632
>>> cl.fisherprob('money buy','bad')
money buy
0.8861423315082535
>>> cl.fisherprob('money quick','good')
money quick
0.41208671548422637
>>> cl.fisherprob('money quick','bad')
money quick
0.70116895256207468
Page 29
Classification with Inverse Chi-Square
>>> cl.fisherprob('quick rabbit','good')
quick rabbit
0.78013986588957995
>>> cl.classify('quick rabbit')
quick rabbit
u'good'
>>> cl.fisherprob('quick money','good')
quick money
0.41208671548422637
>>> cl.classify('quick money')
quick money
u'bad'
>>> cl.setminimum('bad',0.8)
>>> cl.classify('quick money')
quick money
u'good'
>>> cl.setminimum('good',0.4)
>>> cl.classify('quick money')
quick money
u'good'
>>> cl.setminimum('good',0.42)
>>> cl.classify('quick money')
quick money
This version of the classifier does not print "unknown" as a classification.
In practice, we'll tolerate false positives for "good" more than false negatives for "good" -- we'd rather see a message that is spam than lose a message that is not.
Page 30
Fisher -- Simplified
• Reduces the signal-to-noise ratio.
• Assumes documents occur with a normal distribution.
• Estimates differences in corpus size with chi-squared.
• Chi-squared is a "goodness-of-fit" test between an observed distribution and a theoretical distribution.
• Utilizes confidence interval & standard deviation estimations for a corpus.
• http://en.wikipedia.org/w/index.php?title=File:Chi-square_pdf.svg&page=1
Page 31
Assignment 4
• Pick one question from the end of the chapter.
• Implement the function and briefly state the differences.
• Utilize the python files associated with the class if needed.
• Deadline: Next week