Page 1:

Python & Web Mining

Old Dominion University
Department of Computer Science

Hany SalahEldeen CS495 – Python & Web Mining Fall 2012

Lecture 5

CS 495 Fall 2012

Hany SalahEldeen Khalil [email protected]

10-03-12

Presented & Prepared by: Justin F. Brunelle, [email protected]

Page 2:

Chapter 6: "Document Filtering"

Page 3:

Document Filtering

In a nutshell: document filtering is classifying documents based on their content. The classification can be binary (good/bad, spam/not-spam) or n-ary (school-related emails, work-related, commercials, etc.).

Page 4:

Why do we need Document filtering?

• Eliminating spam.
• Removing unrelated comments in forums and public message boards.
• Classifying social and work-related emails automatically.
• Forwarding information-request emails to the expert best suited to answer them.

Page 5:

Spam Filtering

• Spam filtering began with rule-based classifiers, flagging messages that:
• Overuse capital letters
• Contain words related to pharmaceutical products
• Use garish HTML colors

Page 6:

Cons of using Rule-based classifiers

• Easy to trick by simply avoiding the known patterns (capital letters, etc.).
• What is considered spam varies from one person to another.
• Ex: the inbox of a medical rep vs. the email of a housewife.

Page 7:

Solution

• Develop programs that learn.
• Teach them the differences, and how to recognize each class, by providing examples of each class.

Page 8:

Features

• We need to extract features from documents in order to classify them.

• A feature is anything that can be determined as being either present or absent in the item.

Page 9:

Definitions

• item = document
• feature = word
• classification = {good|bad}

Page 10:

Dictionary Building

Page 11:

Dictionary Building

• Remember:
• Lowercasing the text reduces the total number of features by removing the SHOUTING style.
• Feature granularity is also crucial (using the entire email as one feature vs. each letter as a feature). A concrete extractor is sketched below.

Page 12:

Classifier Training

• The classifier is designed to start off very uncertain.
• It increases in certainty as it learns features (see the sketch below).
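A sketch of the bookkeeping such a classifier might use, in the spirit of the chapter's docclass module (in-memory dictionaries shown; the class and method names follow the chapter but are assumptions here):

class classifier:
    def __init__(self, getfeatures):
        self.fc = {}            # feature counts: {feature: {category: count}}
        self.cc = {}            # document counts: {category: count}
        self.getfeatures = getfeatures

    def incf(self, f, cat):     # bump a feature/category pair
        self.fc.setdefault(f, {})
        self.fc[f].setdefault(cat, 0)
        self.fc[f][cat] += 1

    def incc(self, cat):        # bump a category's document count
        self.cc.setdefault(cat, 0)
        self.cc[cat] += 1

    def fcount(self, f, cat):
        return float(self.fc.get(f, {}).get(cat, 0))

    def catcount(self, cat):
        return float(self.cc.get(cat, 0))

    def totalcount(self):
        return sum(self.cc.values())

    def categories(self):
        return self.cc.keys()

    def train(self, item, cat):
        # every feature seen in this item counts toward this category
        for f in self.getfeatures(item):
            self.incf(f, cat)
        self.incc(cat)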

Page 13:

Classifier Training

Page 14:

Probabilities

• A probability is a number between 0 and 1 indicating how likely an event is.

Page 15:

Probabilities

• Example: 'quick' appeared in 2 documents classified as good, and the total number of good documents is 3.

Page 16:

Conditional Probabilities

Pr(A|B) = "probability of A given B"

fprob(quick|good) = "probability of quick given good"
= (quick classified as good) / (total good items)
= 2 / 3
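Continuing the classifier sketch above, fprob() is a direct translation of this ratio:

    def fprob(self, f, cat):
        if self.catcount(cat) == 0:
            return 0.0
        # fraction of the documents in this category containing the feature
        return self.fcount(f, cat) / self.catcount(cat)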

Page 17:

Starting with a Reasonable Guess

• Using only the information seen so far makes the classifier extremely sensitive during early training.
• Ex: "money"
• "money" appeared in a casino training document classified as bad.
• It therefore appears with probability 0 for good, which is not right!

Page 18:

Solution: Start with an Assumed Probability

• Start, for instance, with a probability of 0.5 for each feature.
• Also decide how much weight to give that assumed probability relative to the observed counts.

Page 19:

Assumed Probability

>>> cl.fprob('money','bad')
0.5
>>> cl.fprob('money','good')
0.0

We have data for bad, but should we really start with 0 probability for money given good?

>>> cl.weightedprob('money','good',cl.fprob)
0.25
>>> docclass.sampletrain(cl)
Nobody owns the water.
the quick rabbit jumps fences
buy pharmaceuticals now
make quick money at the online casino
the quick brown fox jumps
>>> cl.weightedprob('money','good',cl.fprob)
0.16666666666666666
>>> cl.fcount('money','bad')
3.0
>>> cl.weightedprob('money','bad',cl.fprob)
0.5

Define an assumed probability of 0.5; weightedprob() then returns the weighted mean of fprob() and the assumed probability:

weightedprob(money, good) = (weight * assumed + count * fprob()) / (count + weight)
= (1*0.5 + 1*0) / (1+1) = 0.5 / 2 = 0.25
(after running sampletrain() a second time)
= (1*0.5 + 2*0) / (2+1) = 0.5 / 3 = 0.166

Pr(money|bad) remains = (1*0.5 + 3*0.5) / (3+1) = 0.5
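A sketch of weightedprob() reproducing the arithmetic above, as another method on the classifier sketch (weight=1 and an assumed probability of 0.5 as defaults, matching the session):

    def weightedprob(self, f, cat, prf, weight=1.0, ap=0.5):
        basicprob = prf(f, cat)       # current estimate, e.g. fprob
        # how often this feature has appeared across all categories
        totals = sum(self.fcount(f, c) for c in self.categories())
        # weighted average of the assumed probability and the estimate
        return ((weight * ap) + (totals * basicprob)) / (weight + totals)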

Page 20:

Naïve Bayesian Classifier

• Move from terms to documents:
Pr(document | class) = Pr(term1 | class) * Pr(term2 | class) * … * Pr(termn | class)
• Naïve because we assume all terms occur independently.
• We know this is a simplifying assumption; it is naïve to think all terms are equally likely to complete this phrase:
• "Shave and a hair cut ___ ____"
• Bayesian because we use Bayes' Theorem to invert the conditional probabilities.
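A sketch of this product over features, mirroring the chapter's naivebayes subclass (the thresholds dictionary set up here is used on the classification-threshold slides below):

class naivebayes(classifier):
    def __init__(self, getfeatures):
        classifier.__init__(self, getfeatures)
        self.thresholds = {}    # per-category thresholds, used later

    def docprob(self, item, cat):
        # naive independence assumption: multiply the per-feature
        # weighted probabilities together
        p = 1.0
        for f in self.getfeatures(item):
            p *= self.weightedprob(f, cat, self.fprob)
        return p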

Page 21:

Bayes Theorem

• Given our training data, we know: Pr(feature|classification)

• What we really want to know is: Pr(classification|feature)

• Bayes' Theorem*: Pr(A|B) = Pr(B|A) Pr(A) / Pr(B)

Pr(good|doc) = Pr(doc|good) Pr(good) / Pr(doc)

We know how to calculate Pr(doc|good) from the training data; Pr(good) is simply #good / #total; and we skip Pr(doc) since it is the same for each classification.

* http://en.wikipedia.org/wiki/Bayes%27_theorem
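Continuing the naivebayes sketch, prob() applies exactly this simplification:

    def prob(self, item, cat):
        catprob = self.catcount(cat) / self.totalcount()  # Pr(category)
        docprob = self.docprob(item, cat)                 # Pr(document|category)
        # Bayes' Theorem with Pr(document) dropped, since it is the
        # same for every category being compared
        return docprob * catprob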

Page 22:

Our Bayesian Classifier

>>> import docclass
>>> cl=docclass.naivebayes(docclass.getwords)
>>> docclass.sampletrain(cl)
Nobody owns the water.
the quick rabbit jumps fences
buy pharmaceuticals now
make quick money at the online casino
the quick brown fox jumps
>>> cl.prob('quick rabbit','good')
quick rabbit
0.15624999999999997
>>> cl.prob('quick rabbit','bad')
quick rabbit
0.050000000000000003
>>> cl.prob('quick rabbit jumps','good')
quick rabbit jumps
0.095486111111111091
>>> cl.prob('quick rabbit jumps','bad')
quick rabbit jumps
0.0083333333333333332

We use these values only for comparison, not as "real" probabilities.

Page 23:

Bayesian Classifier

• http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Testing

Page 24:

Classification Thresholds

>>> cl.prob('quick rabbit','good')
quick rabbit
0.15624999999999997
>>> cl.prob('quick rabbit','bad')
quick rabbit
0.050000000000000003
>>> cl.classify('quick rabbit',default='unknown')
quick rabbit
u'good'
>>> cl.prob('quick money','good')
quick money
0.09375
>>> cl.prob('quick money','bad')
quick money
0.10000000000000001
>>> cl.classify('quick money',default='unknown')
quick money
u'bad'
>>> cl.setthreshold('bad',3.0)
>>> cl.classify('quick money',default='unknown')
quick money
'unknown'
>>> cl.classify('quick rabbit',default='unknown')
quick rabbit
u'good'

Only classify something as bad if it is 3x more likely to be bad than good.
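A sketch of threshold-aware classification on the naivebayes sketch above (getthreshold() defaulting to 1.0 is an assumption consistent with this session):

    def setthreshold(self, cat, t):
        self.thresholds[cat] = t

    def getthreshold(self, cat):
        return self.thresholds.get(cat, 1.0)   # default: no handicap

    def classify(self, item, default=None):
        probs = {}
        best, maxprob = default, 0.0
        for cat in self.categories():          # find the most likely category
            probs[cat] = self.prob(item, cat)
            if probs[cat] > maxprob:
                maxprob, best = probs[cat], cat
        if best not in probs:
            return default
        # the winner must beat every rival by its threshold factor
        for cat in probs:
            if cat == best:
                continue
            if probs[cat] * self.getthreshold(best) > probs[best]:
                return default
        return best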

Page 25:

Classification Thresholds…cont

>>> for i in range(10): docclass.sampletrain(cl)
>>> cl.prob('quick money','good')
quick money
0.016544117647058824
>>> cl.prob('quick money','bad')
quick money
0.10000000000000001
>>> cl.classify('quick money',default='unknown')
quick money
u'bad'
>>> cl.prob('quick rabbit','good')
quick rabbit
0.13786764705882351
>>> cl.prob('quick rabbit','bad')
quick rabbit
0.0083333333333333332
>>> cl.classify('quick rabbit',default='unknown')
quick rabbit
u'good'

Page 26:

Fisher Method

• Normalize the frequencies for each category.
• E.g., we might have far more "bad" training data than good, so the net cast by the bad data will be "wider" than we'd like.

• Calculate the normalized Bayesian probability, then fit the result to an inverse chi-square function to see what the probability is that a random document of that classification would have those features (i.e., terms).
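A sketch of both steps, mirroring the chapter's fisherclassifier (the minimums dictionary set up here is used on the inverse chi-square slides below; invchi2() assumes an even number of degrees of freedom):

import math

class fisherclassifier(classifier):
    def __init__(self, getfeatures):
        classifier.__init__(self, getfeatures)
        self.minimums = {}      # per-category minimums, used later

    def cprob(self, f, cat):
        clf = self.fprob(f, cat)
        if clf == 0:
            return 0.0
        # normalize against the feature's frequency in all categories
        freqsum = sum(self.fprob(f, c) for c in self.categories())
        return clf / freqsum

    def fisherprob(self, item, cat):
        # multiply the normalized probabilities together
        p = 1.0
        features = self.getfeatures(item)
        for f in features:
            p *= self.weightedprob(f, cat, self.cprob)
        # Fisher's method: -2 ln(p) is chi-square distributed
        fscore = -2 * math.log(p)
        return self.invchi2(fscore, len(features) * 2)

    def invchi2(self, chi, df):
        # inverse chi-square: probability of seeing a score this
        # extreme by chance (df assumed even)
        m = chi / 2.0
        total = term = math.exp(-m)
        for i in range(1, df // 2):
            term *= m / i
            total += term
        return min(total, 1.0)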

Page 27:

Fisher Example

>>> import docclass
>>> cl=docclass.fisherclassifier(docclass.getwords)
>>> cl.setdb('mln.db')
>>> docclass.sampletrain(cl)
>>> cl.cprob('quick','good')
0.57142857142857151
>>> cl.fisherprob('quick','good')
quick
0.5535714285714286
>>> cl.fisherprob('quick rabbit','good')
quick rabbit
0.78013986588957995
>>> cl.cprob('rabbit','good')
1.0
>>> cl.fisherprob('rabbit','good')
rabbit
0.75
>>> cl.cprob('quick','good')
0.57142857142857151
>>> cl.cprob('quick','bad')
0.4285714285714286

Page 28:

Fisher Example

>>> cl.cprob('money','good')
0
>>> cl.cprob('money','bad')
1.0
>>> cl.cprob('buy','bad')
1.0
>>> cl.cprob('buy','good')
0
>>> cl.fisherprob('money buy','good')
money buy
0.23578679513998632
>>> cl.fisherprob('money buy','bad')
money buy
0.8861423315082535
>>> cl.fisherprob('money quick','good')
money quick
0.41208671548422637
>>> cl.fisherprob('money quick','bad')
money quick
0.70116895256207468

Page 29:

Classification with Inverse Chi-Square

>>> cl.fisherprob('quick rabbit','good')
quick rabbit
0.78013986588957995
>>> cl.classify('quick rabbit')
quick rabbit
u'good'
>>> cl.fisherprob('quick money','good')
quick money
0.41208671548422637
>>> cl.classify('quick money')
quick money
u'bad'
>>> cl.setminimum('bad',0.8)
>>> cl.classify('quick money')
quick money
u'good'
>>> cl.setminimum('good',0.4)
>>> cl.classify('quick money')
quick money
u'good'
>>> cl.setminimum('good',0.42)
>>> cl.classify('quick money')
quick money

This version of the classifier does not print "unknown" as a classification.

In practice, we'll tolerate false positives for "good" more than false negatives for "good": we'd rather see a message that is spam than lose a message that is not spam.
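Continuing the fisherclassifier sketch, a per-category minimum gates each candidate, which reproduces the session above (including the final call that returns the default):

    def setminimum(self, cat, minval):
        self.minimums[cat] = minval

    def getminimum(self, cat):
        return self.minimums.get(cat, 0.0)   # default: no minimum

    def classify(self, item, default=None):
        # pick the category with the highest fisherprob that also
        # clears its minimum; otherwise fall back to the default
        best, maxprob = default, 0.0
        for cat in self.categories():
            p = self.fisherprob(item, cat)
            if p > self.getminimum(cat) and p > maxprob:
                best, maxprob = cat, p
        return best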

Page 30:

Fisher -- Simplified

• Reduces the signal-to-noise ratio.
• Assumes documents occur with a normal distribution.
• Estimates differences in corpus size with chi-squared.
• Chi-squared is a "goodness-of-fit" test between an observed distribution and a theoretical distribution.
• Utilizes confidence-interval and standard-deviation estimations for a corpus.
• http://en.wikipedia.org/w/index.php?title=File:Chi-square_pdf.svg&page=1

Page 31:

Assignment 4

• Pick one question from the end of the chapter.

• Implement the function and briefly state the differences.

• Utilize the python files associated with the class if needed.

• Deadline: Next week