Countering Spam Using Classification Techniques
Steve
[email protected]
Data Mining Guest Lecture
February 21, 2008
Overview
Introduction
Countering Email Spam
Problem Description
Classification History
Ongoing Research
Countering Web Spam
Problem Description
Classification History
Ongoing Research
Conclusions
Introduction
The Internet has spawned numerous information-rich environments
Email SystemsWorld Wide WebSocial Networking Communities
Openness facilitates information sharing, but it also makes these environments vulnerable…
Denial of Information (DoI) Attacks
Deliberate insertion of low quality information (or noise) into information-rich environments
The information analog of Denial of Service (DoS) attacks
Two goals
Promotion of ideals by means of deception
Denial of access to high quality information
Spam is currently the most prominent example of a DoI attack
Overview
Introduction
Countering Email Spam
Problem Description
Classification History
Ongoing Research
Countering Web Spam
Problem Description
Classification History
Ongoing Research
Conclusions
Countering Email Spam
Close to 200 billion (yes, billion) emails are sent each day
Spam accounts for around 90% of that email traffic
~2 million spam messages every second
Old Email Spam Examples
Problem Description
Email spam detection can be modeled as a binary text classification problem
Two classes: spam and legitimate (non-spam)
Example of supervised learning
Build a model (classifier) based on training data to approximate the target function
Construct a function φ: M → {spam, legitimate} such that it agrees with the target function Φ: M → {spam, legitimate} as much as possible
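The supervised setup above can be sketched with a minimal Naïve Bayes classifier in pure Python. The training messages and word-level model here are toy assumptions for illustration, not the lecture's actual data.

```python
import math
from collections import Counter

# Minimal Naive Bayes spam classifier sketch (toy data).
# Approximates the target function phi: M -> {spam, legitimate}.

def train(messages):
    """messages: list of (text, label) pairs; returns per-class word counts and priors."""
    counts = {"spam": Counter(), "legitimate": Counter()}
    priors = Counter()
    for text, label in messages:
        priors[label] += 1
        counts[label].update(text.lower().split())
    return counts, priors

def classify(text, counts, priors):
    """Pick the class maximizing log P(class) + sum of log P(word | class)."""
    vocab = set(counts["spam"]) | set(counts["legitimate"])
    best, best_score = None, float("-inf")
    for label in counts:
        score = math.log(priors[label] / sum(priors.values()))
        denom = sum(counts[label].values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((counts[label][word] + 1) / denom)  # Laplace smoothing
        if score > best_score:
            best, best_score = label, score
    return best

training = [
    ("buy cheap pills now", "spam"),
    ("win money fast", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("lunch plans this week", "legitimate"),
]
counts, priors = train(training)
print(classify("cheap money now", counts, priors))  # spam
```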
Problem Description (cont.)
How do we represent a message?
How do we generate features?
How do we process features?
How do we evaluate performance?
How do we represent a message?
Classification algorithms require a consistent format
Salton’s vector space model (“bag of words”) is the most popular representation
Each message m is represented as a feature vector f of n features: <f1, f2, …, fn>
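The bag-of-words representation can be sketched directly: build a vocabulary from the corpus, then map each message to a count vector <f1, f2, …, fn> over that vocabulary. The messages below are hypothetical.

```python
# Bag-of-words sketch: map each message to a fixed-length feature vector
# over a shared vocabulary (toy messages).

messages = ["buy cheap pills", "meeting agenda monday", "cheap meeting pills"]

# Build the vocabulary from all messages; each distinct word becomes one feature.
vocab = sorted({word for m in messages for word in m.split()})

def to_vector(message):
    """Return <f1, ..., fn>: term counts in vocabulary order."""
    words = message.split()
    return [words.count(term) for term in vocab]

print(vocab)                           # ['agenda', 'buy', 'cheap', 'meeting', 'monday', 'pills']
print(to_vector("cheap cheap pills"))  # [0, 0, 2, 0, 0, 1]
```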
How do we generate features?
Sources of information
SMTP connections
Network properties
Email headers
Social networks
Email body
Textual parts
URLs
Attachments
How do we process features?
Feature Tokenization
Alphanumeric tokens
N-grams
Phrases
Feature Scrubbing
Stemming
Stop word removal
Feature Selection
Simple feature removal
Information-theoretic algorithms
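N-gram tokenization, mentioned above, can be sketched in a few lines. One reason character n-grams are popular for spam: they survive the character-substitution obfuscations that defeat whole-word tokens (the example strings are illustrative).

```python
# Character n-gram tokenization sketch: robust to simple obfuscations
# (e.g., "v1agra") that break alphanumeric whole-word tokens.

def char_ngrams(text, n=3):
    """Slide a window of n characters over the text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("viagra"))  # ['via', 'iag', 'agr', 'gra']
print(char_ngrams("v1agra"))  # ['v1a', '1ag', 'agr', 'gra'] -- still shares 'agr', 'gra'
```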
With a = legitimate messages classified as legitimate, b = legitimate messages classified as spam, c = spam messages classified as legitimate, and d = spam messages classified as spam:
FN = c / (c + d)
FP = b / (a + b)
R = d / (c + d)
P = d / (b + d)
How do we evaluate performance?
Traditional IR metrics
Precision vs. Recall
False positives vs. False negatives
Imbalanced error costs
ROC curves
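These evaluation metrics follow directly from the four confusion-matrix counts. A minimal sketch, assuming a = legitimate→legitimate, b = legitimate→spam (false positives), c = spam→legitimate (false negatives), d = spam→spam:

```python
# Evaluation sketch: precision, recall, and error rates from confusion counts.
# a = legit->legit, b = legit->spam (FP), c = spam->legit (FN), d = spam->spam.

def metrics(a, b, c, d):
    return {
        "precision": d / (b + d),
        "recall": d / (c + d),
        "fp_rate": b / (a + b),
        "fn_rate": c / (c + d),
    }

m = metrics(a=90, b=10, c=20, d=80)
print(m)  # precision 0.888..., recall 0.8, fp_rate 0.1, fn_rate 0.2
```

Imbalanced error costs mean these numbers are not interchangeable: a false positive (lost legitimate mail) is typically far more expensive than a false negative.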
Classification History
Sahami et al. (1998)
Used a Naïve Bayes classifier
Were the first to apply text classification research to the spam problem
Pantel and Lin (1998)
Also used a Naïve Bayes classifier
Found that Naïve Bayes outperforms RIPPER
Classification History (cont.)
Drucker et al. (1999)
Evaluated Support Vector Machines as a solution to spam
Found that SVM is more effective than RIPPER and Rocchio
Hidalgo and Lopez (2000)
Found that decision trees (C4.5) outperform Naïve Bayes and k-NN
Classification History (cont.)
Up to this point, private corpora were used exclusively in email spam research
Androutsopoulos et al. (2000a)
Created the first publicly available email spam corpus (Ling-spam)
Performed various feature set size, training set size, stemming, and stop-list experiments with a Naïve Bayes classifier
Classification History (cont.)
Androutsopoulos et al. (2000b)
Created another publicly available email spam corpus (PU1)
Confirmed previous research that Naïve Bayes outperforms a keyword-based filter
Carreras and Marquez (2001)
Used PU1 to show that AdaBoost is more effective than decision trees and Naïve Bayes
Classification History (cont.)
Androutsopoulos et al. (2004)
Created 3 more publicly available corpora (PU2, PU3, and PUA)
Compared Naïve Bayes, Flexible Bayes, Support Vector Machines, and LogitBoost: FB, SVM, and LB outperform NB
Zhang et al. (2004)
Used Ling-spam, PU1, and the SpamAssassin corpora
Compared Naïve Bayes, Support Vector Machines, and AdaBoost: SVM and AB outperform NB
Classification History (cont.)
CEAS (2004 – present)
Focuses solely on email and anti-spam research
Generates a significant amount of academic and industry anti-spam research
Klimt and Yang (2004)
Published the Enron Corpus – the first large-scale corpus of legitimate email messages
TREC Spam Track (2005 – present)
Produces new corpora every year
Provides a standardized platform to evaluate classification algorithms
Ongoing Research
Concept Drift
New Classification Approaches
Adversarial Classification
Image Spam
Concept Drift
Spam content is extremely dynamic
Topic drift (e.g., specific scams)
Technique drift (e.g., obfuscations)
How do we keep up with the Joneses?
Batch vs. Online Learning
[Figure: Percentage of spam messages per month (01/03 – 01/06) triggering the OBFUSCATING_COMMENT, INTERRUPTUS, HTML_FONT_LOW_CONTRAST, and HTML_TINY_FONT rules]
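The batch-vs-online distinction can be sketched concretely: an online learner folds each newly labeled message into its model immediately, so drifting scam vocabulary is absorbed without a full retrain. This is a hypothetical minimal model (class-conditional word counts only), not a production filter.

```python
from collections import Counter

# Online learning sketch: update per-class word counts one message at a
# time instead of retraining in batch, to track concept drift.

class OnlineCounts:
    def __init__(self):
        self.counts = {"spam": Counter(), "legitimate": Counter()}

    def update(self, text, label):
        """Fold a single labeled message into the model immediately."""
        self.counts[label].update(text.lower().split())

model = OnlineCounts()
model.update("win money fast", "spam")
model.update("meeting agenda", "legitimate")
model.update("fast money scheme", "spam")  # drift: new scam vocabulary absorbed at once
print(model.counts["spam"]["money"])  # 2
```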
New Classification Approaches
Filter Fusion
Compression-based Filtering
Network behavioral clustering
Adversarial Classification
Classifiers assume a clear distinction between spam and legitimate features
Camouflaged messages
Mask spam content with legitimate content
Disrupt decision boundaries for classifiers
[Figure: Weighted accuracy (λ = 9) vs. number of retained features (640 down to 10) for Naive Bayes, SVM, and LogitBoost, shown in three panels]
Camouflage Attacks
Baseline performance
Accuracies consistently higher than 98%
Classifiers under attack
Accuracies degrade to between 50% and 70%
Retrained classifiers
Accuracies climb back to between 91% and 99%
Camouflage Attacks (cont.)
Retraining postpones the problem, but it doesn’t solve it
We can identify features that are less susceptible to attack, but that’s simply another stalling technique
[Figure: Fraction of false negatives by round number (A denotes attack rounds) for Naive Bayes, SVM, and LogitBoost]
Image Spam
What happens when an email does not contain textual features?
OCR is easily defeated
Classification using image properties
Overview
Introduction
Countering Email Spam
Problem Description
Classification History
Ongoing Research
Countering Web Spam
Problem Description
Classification History
Ongoing Research
Conclusions
Countering Web Spam
What is web spam?
Traditional definition
Our definition
Between 13.8% and 22.1% of all web pages
Ad Farms
Only contain advertising links (usually ad listings)
Elaborate entry pages used to deceive visitors
Ad Farms (cont.)
Clicking on an entry page link leads to an ad listing
Ad syndicators provide the content
Web spammers create the HTML structures
Parked Domains
Domain parking services
Provide place holders for newly registered domains
Allow ad listings to be used as place holders to monetize a domain
Inevitably, web spammers abused these services
Parked Domains (cont.)
Functionally equivalent to Ad Farms
Both rely on ad syndicators for content
Both provide little to no value to their visitors
Unique Characteristics
Reliance on domain parking services (e.g., apps5.oingo.com, searchportal.information.com, etc.)
Typically for sale by owner (“Offer To Buy This Domain”)
Parked Domains (cont.)
Advertisements
Pages advertising specific products or services
Examples of the kinds of pages being advertised in Ad Farms and Parked Domains
Problem Description
Web spam detection can also be modeled as a binary text classification problem
Salton’s vector space model is quite common
Feature processing and performance evaluation are also quite similar
But what about feature generation…
How do we generate features?
Sources of information
HTTP connections
Hosting IP addresses
Session headers
HTML content
Textual properties
Structural properties
URL linkage structure
PageRank scores
Neighbor properties
Classification History
Davison (2000)
Was the first to investigate link-based web spam
Built decision trees to successfully identify “nepotistic links”
Becchetti et al. (2005)
Revisited the use of decision trees to identify link-based web spam
Used link-based features such as PageRank and TrustRank scores
Classification History
Drost and Scheffer (2005)
Used Support Vector Machines to classify web spam pages
Relied on content-based features as well as link-based features
Ntoulas et al. (2006)
Built decision trees to classify web spam
Used content-based features (e.g., fraction of visible content, compressibility, etc.)
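Compressibility, one of the content-based features above, is cheap to compute: keyword-stuffed spam pages are highly redundant, so they compress much better than natural text. A minimal sketch using zlib (the example strings are hypothetical page bodies):

```python
import zlib

# Content-feature sketch: compressibility -- keyword-stuffed spam pages
# compress much better than natural text.

def compressibility(text):
    """Ratio of original size to zlib-compressed size; higher = more redundant."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))

natural = "The quick brown fox jumps over the lazy dog near the river bank."
stuffed = "cheap pills " * 50  # keyword-stuffed page body
print(compressibility(natural) < compressibility(stuffed))  # True
```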
Classification History
Up to this point, web spam research was limited to small (on the order of a few thousand pages), private data sets
Webb et al. (2006)
Presented the Webb Spam Corpus – a first-of-its-kind large-scale, publicly available web spam corpus (almost 350K web spam pages)
http://www.webbspamcorpus.org
Castillo et al. (2006)
Presented the WEBSPAM-UK2006 corpus – a publicly available web spam corpus (only contains 1,924 web spam pages)
Classification History
Castillo et al. (2007)
Created a cost-sensitive decision tree to identify web spam in the WEBSPAM-UK2006 data set
Used link-based features from [Becchetti et al. (2005)] and content-based features from [Ntoulas et al. (2006)]
Webb et al. (2008)
Compared various classifiers (e.g., SVM, decision trees, etc.) using HTTP session information exclusively
Used the Webb Spam Corpus, WebBase data, and the WEBSPAM-UK2006 data set
Found that these classifiers are comparable to (and in many cases, better than) existing approaches
Ongoing Research
Redirection
Phishing
Social Spam
Redirection
144,801 unique redirect chains (1.54 average HTTP redirects)
43.9% of web spam pages use some form of HTML or JavaScript redirection
302 HTTP redirect: 49%
frame redirect: 14%
301 HTTP redirect: 11%
iframe redirect: 8%
meta refresh and location.replace(): 7%
meta refresh: 5%
meta refresh and location: 3%
location*: 2%
Other: 1%
Phishing
Interesting form of deception that affects email and web users
Another form of adversarial classification
Social Spam
Comment spam
Bulletin spam
Message spam
Conclusions
Email and web spam are currently two of the largest information security problems
Classification techniques offer an effective way to filter this low quality information
Spammers are extremely dynamic, which opens up several important areas of future research…
Questions