Countering Spam Using Classification Techniques
Steve
[email protected]
Data Mining Guest Lecture
February 21, 2008
Overview
Introduction
Countering Email Spam
Problem Description
Classification History
Ongoing Research
Countering Web Spam
Problem Description
Classification History
Ongoing Research
Conclusions
Introduction
The Internet has spawned numerous information-rich environments
Email SystemsWorld Wide WebSocial Networking Communities
Openness facilitates information sharing, but it also makes these environments vulnerable…
Denial of Information (DoI) Attacks
Deliberate insertion of low quality information (or noise) into information-rich environments
The information analog of Denial of Service (DoS) attacks
Two goals
Promotion of ideals by means of deception
Denial of access to high quality information
Spam is currently the most prominent example of a DoI attack
Overview
Introduction
Countering Email Spam
Problem Description
Classification History
Ongoing Research
Countering Web Spam
Problem Description
Classification History
Ongoing Research
Conclusions
Countering Email Spam
Close to 200 billion (yes, billion) emails are sent each day
Spam accounts for around 90% of that email traffic
~2 million spam messages every second
Old Email Spam Examples
Problem Description
Email spam detection can be modeled as a binary text classification problem
Two classes: spam and legitimate (non-spam)
Example of supervised learning
Build a model (classifier) based on training data to approximate the target function
Construct a function φ: M → {spam, legitimate} such that it agrees with the target function Φ: M → {spam, legitimate} as much as possible
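The supervised setup above can be sketched with a minimal Naïve Bayes classifier in pure Python. The training messages and word-level model here are toy assumptions for illustration, not the lecture's actual data.

```python
import math
from collections import Counter

# Minimal Naive Bayes spam classifier sketch (toy data).
# Approximates the target function phi: M -> {spam, legitimate}.

def train(messages):
    """messages: list of (text, label) pairs; returns per-class word counts and priors."""
    counts = {"spam": Counter(), "legitimate": Counter()}
    priors = Counter()
    for text, label in messages:
        priors[label] += 1
        counts[label].update(text.lower().split())
    return counts, priors

def classify(text, counts, priors):
    """Pick the class maximizing log P(class) + sum of log P(word | class)."""
    vocab = set(counts["spam"]) | set(counts["legitimate"])
    best, best_score = None, float("-inf")
    for label in counts:
        score = math.log(priors[label] / sum(priors.values()))
        denom = sum(counts[label].values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((counts[label][word] + 1) / denom)  # Laplace smoothing
        if score > best_score:
            best, best_score = label, score
    return best

training = [
    ("buy cheap pills now", "spam"),
    ("win money fast", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("lunch plans this week", "legitimate"),
]
counts, priors = train(training)
print(classify("cheap money now", counts, priors))  # spam
```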
Problem Description (cont.)
How do we represent a message?
How do we generate features?
How do we process features?
How do we evaluate performance?
How do we represent a message?
Classification algorithms require a consistent format
Salton’s vector space model (“bag of words”) is the most popular representation
Each message m is represented as a feature vector f of n features: <f1, f2, …, fn>
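The bag-of-words representation can be sketched directly: build a vocabulary from the corpus, then map each message to a count vector <f1, f2, …, fn> over that vocabulary. The messages below are hypothetical.

```python
# Bag-of-words sketch: map each message to a fixed-length feature vector
# over a shared vocabulary (toy messages).

messages = ["buy cheap pills", "meeting agenda monday", "cheap meeting pills"]

# Build the vocabulary from all messages; each distinct word becomes one feature.
vocab = sorted({word for m in messages for word in m.split()})

def to_vector(message):
    """Return <f1, ..., fn>: term counts in vocabulary order."""
    words = message.split()
    return [words.count(term) for term in vocab]

print(vocab)                           # ['agenda', 'buy', 'cheap', 'meeting', 'monday', 'pills']
print(to_vector("cheap cheap pills"))  # [0, 0, 2, 0, 0, 1]
```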
How do we generate features?
Sources of information
SMTP connections
Network properties
Email headers
Social networks
Email body
Textual parts
URLs
Attachments
How do we process features?
Feature Tokenization
Alphanumeric tokens
N-grams
Phrases
Feature Scrubbing
Stemming
Stop word removal
Feature Selection
Simple feature removal
Information-theoretic algorithms
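N-gram tokenization, mentioned above, can be sketched in a few lines. One reason character n-grams are popular for spam: they survive the character-substitution obfuscations that defeat whole-word tokens (the example strings are illustrative).

```python
# Character n-gram tokenization sketch: robust to simple obfuscations
# (e.g., "v1agra") that break alphanumeric whole-word tokens.

def char_ngrams(text, n=3):
    """Slide a window of n characters over the text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("viagra"))  # ['via', 'iag', 'agr', 'gra']
print(char_ngrams("v1agra"))  # ['v1a', '1ag', 'agr', 'gra'] -- still shares 'agr', 'gra'
```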
With a = legitimate messages classified as legitimate, b = legitimate messages classified as spam, c = spam messages classified as legitimate, and d = spam messages classified as spam:
FN = c / (c + d)
FP = b / (a + b)
R = d / (c + d)
P = d / (b + d)
How do we evaluate performance?
Traditional IR metrics
Precision vs. Recall
False positives vs. False negatives
Imbalanced error costs
ROC curves
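These evaluation metrics follow directly from the four confusion-matrix counts. A minimal sketch, assuming a = legitimate→legitimate, b = legitimate→spam (false positives), c = spam→legitimate (false negatives), d = spam→spam:

```python
# Evaluation sketch: precision, recall, and error rates from confusion counts.
# a = legit->legit, b = legit->spam (FP), c = spam->legit (FN), d = spam->spam.

def metrics(a, b, c, d):
    return {
        "precision": d / (b + d),
        "recall": d / (c + d),
        "fp_rate": b / (a + b),
        "fn_rate": c / (c + d),
    }

m = metrics(a=90, b=10, c=20, d=80)
print(m)  # precision 0.888..., recall 0.8, fp_rate 0.1, fn_rate 0.2
```

Imbalanced error costs mean these numbers are not interchangeable: a false positive (lost legitimate mail) is typically far more expensive than a false negative.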
Classification History
Sahami et al. (1998)
Used a Naïve Bayes classifier
Were the first to apply text classification research to the spam problem
Pantel and Lin (1998)
Also used a Naïve Bayes classifier
Found that Naïve Bayes outperforms RIPPER
Classification History (cont.)
Drucker et al. (1999)
Evaluated Support Vector Machines as a solution to spam
Found that SVM is more effective than RIPPER and Rocchio
Hidalgo and Lopez (2000)
Found that decision trees (C4.5) outperform Naïve Bayes and k-NN
Classification History (cont.)
Up to this point, private corpora were used exclusively in email spam research
Androutsopoulos et al. (2000a)
Created the first publicly available email spam corpus (Ling-spam)
Performed various feature set size, training set size, stemming, and stop-list experiments with a Naïve Bayes classifier
Classification History (cont.)
Androutsopoulos et al. (2000b)
Created another publicly available email spam corpus (PU1)
Confirmed previous research that Naïve Bayes outperforms a keyword-based filter
Carreras and Marquez (2001)
Used PU1 to show that AdaBoost is more effective than decision trees and Naïve Bayes
Classification History (cont.)
Androutsopoulos et al. (2004)
Created 3 more publicly available corpora (PU2, PU3, and PUA)
Compared Naïve Bayes, Flexible Bayes, Support Vector Machines, and LogitBoost: FB, SVM, and LB outperform NB
Zhang et al. (2004)
Used Ling-spam, PU1, and the SpamAssassin corpora
Compared Naïve Bayes, Support Vector Machines, and AdaBoost: SVM and AB outperform NB
Classification History (cont.)
CEAS (2004 – present)
Focuses solely on email and anti-spam research
Generates a significant amount of academic and industry anti-spam research
Klimt and Yang (2004)
Published the Enron Corpus – the first large-scale corpus of legitimate email messages
TREC Spam Track (2005 – present)
Produces new corpora every year
Provides a standardized platform to evaluate classification algorithms
Ongoing Research
Concept Drift
New Classification Approaches
Adversarial Classification
Image Spam
Concept Drift
Spam content is extremely dynamic
Topic drift (e.g., specific scams)
Technique drift (e.g., obfuscations)
How do we keep up with the Joneses?
Batch vs. Online Learning
[Figure: Percentage of spam messages per month (01/03 – 01/06) triggering the OBFUSCATING_COMMENT, INTERRUPTUS, HTML_FONT_LOW_CONTRAST, and HTML_TINY_FONT rules]
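The batch-vs-online distinction can be sketched concretely: an online learner folds each newly labeled message into its model immediately, so drifting scam vocabulary is absorbed without a full retrain. This is a hypothetical minimal model (class-conditional word counts only), not a production filter.

```python
from collections import Counter

# Online learning sketch: update per-class word counts one message at a
# time instead of retraining in batch, to track concept drift.

class OnlineCounts:
    def __init__(self):
        self.counts = {"spam": Counter(), "legitimate": Counter()}

    def update(self, text, label):
        """Fold a single labeled message into the model immediately."""
        self.counts[label].update(text.lower().split())

model = OnlineCounts()
model.update("win money fast", "spam")
model.update("meeting agenda", "legitimate")
model.update("fast money scheme", "spam")  # drift: new scam vocabulary absorbed at once
print(model.counts["spam"]["money"])  # 2
```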
New Classification Approaches
Filter Fusion
Compression-based Filtering
Network behavioral clustering
Adversarial Classification
Classifiers assume a clear distinction between spam and legitimate features
Camouflaged messages
Mask spam content with legitimate content
Disrupt decision boundaries for classifiers
[Figure: Weighted accuracy (λ = 9) vs. number of retained features (640 down to 10) for Naive Bayes, SVM, and LogitBoost, shown in three panels]
Camouflage Attacks
Baseline performance
Accuracies consistently higher than 98%
Classifiers under attack
Accuracies degrade to between 50% and 70%
Retrained classifiers
Accuracies climb back to between 91% and 99%
Camouflage Attacks (cont.)
Retraining postpones the problem, but it doesn’t solve it
We can identify features that are less susceptible to attack, but that’s simply another stalling technique
[Figure: Fraction of false negatives by round number (A denotes attack rounds) for Naive Bayes, SVM, and LogitBoost]
Image Spam
What happens when an email does not contain textual features?
OCR is easily defeated
Classification using image properties
Overview
Introduction
Countering Email Spam
Problem Description
Classification History
Ongoing Research
Countering Web Spam
Problem Description
Classification History
Ongoing Research
Conclusions
Countering Web Spam
What is web spam?
Traditional definition
Our definition
Between 13.8% and 22.1% of all web pages
Ad Farms
Only contain advertising links (usually ad listings)
Elaborate entry pages used to deceive visitors
Ad Farms (cont.)
Clicking on an entry page link leads to an ad listing
Ad syndicators provide the content
Web spammers create the HTML structures
Parked Domains
Domain parking services
Provide place holders for newly registered domains
Allow ad listings to be used as place holders to monetize a domain
Inevitably, web spammers abused these services
Parked Domains (cont.)
Functionally equivalent to Ad Farms
Both rely on ad syndicators for content
Both provide little to no value to their visitors
Unique Characteristics
Reliance on domain parking services (e.g., apps5.oingo.com, searchportal.information.com, etc.)
Typically for sale by owner (“Offer To Buy This Domain”)
Parked Domains (cont.)
Advertisements
Pages advertising specific products or services
Examples of the kinds of pages being advertised in Ad Farms and Parked Domains
Problem Description
Web spam detection can also be modeled as a binary text classification problem
Salton’s vector space model is quite common
Feature processing and performance evaluation are also quite similar
But what about feature generation…
How do we generate features?
Sources of information
HTTP connections
Hosting IP addresses
Session headers
HTML content
Textual properties
Structural properties
URL linkage structure
PageRank scores
Neighbor properties
Classification History
Davison (2000)
Was the first to investigate link-based web spam
Built decision trees to successfully identify “nepotistic links”
Becchetti et al. (2005)
Revisited the use of decision trees to identify link-based web spam
Used link-based features such as PageRank and TrustRank scores
Classification History
Drost and Scheffer (2005)
Used Support Vector Machines to classify web spam pages
Relied on content-based features as well as link-based features
Ntoulas et al. (2006)
Built decision trees to classify web spam
Used content-based features (e.g., fraction of visible content, compressibility, etc.)
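Compressibility, one of the content-based features above, is cheap to compute: keyword-stuffed spam pages are highly redundant, so they compress much better than natural text. A minimal sketch using zlib (the example strings are hypothetical page bodies):

```python
import zlib

# Content-feature sketch: compressibility -- keyword-stuffed spam pages
# compress much better than natural text.

def compressibility(text):
    """Ratio of original size to zlib-compressed size; higher = more redundant."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))

natural = "The quick brown fox jumps over the lazy dog near the river bank."
stuffed = "cheap pills " * 50  # keyword-stuffed page body
print(compressibility(natural) < compressibility(stuffed))  # True
```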
Classification History
Up to this point, web spam research was limited to small (on the order of a few thousand pages), private data sets
Webb et al. (2006)
Presented the Webb Spam Corpus – a first-of-its-kind large-scale, publicly available web spam corpus (almost 350K web spam pages)
http://www.webbspamcorpus.org
Castillo et al. (2006)
Presented the WEBSPAM-UK2006 corpus – a publicly available web spam corpus (only contains 1,924 web spam pages)
Classification History
Castillo et al. (2007)
Created a cost-sensitive decision tree to identify web spam in the WEBSPAM-UK2006 data set
Used link-based features from [Becchetti et al. (2005)] and content-based features from [Ntoulas et al. (2006)]
Webb et al. (2008)
Compared various classifiers (e.g., SVM, decision trees, etc.) using HTTP session information exclusively
Used the Webb Spam Corpus, WebBase data, and the WEBSPAM-UK2006 data set
Found that these classifiers are comparable to (and in many cases, better than) existing approaches
Ongoing Research
Redirection
Phishing
Social Spam
Redirection
144,801 unique redirect chains (1.54 average HTTP redirects)
43.9% of web spam pages use some form of HTML or JavaScript redirection
302 HTTP redirect: 49%
frame redirect: 14%
301 HTTP redirect: 11%
iframe redirect: 8%
meta refresh and location.replace(): 7%
meta refresh: 5%
meta refresh and location: 3%
location*: 2%
Other: 1%
Phishing
Interesting form of deception that affects email and web users
Another form of adversarial classification
Social Spam
Comment spam
Bulletin spam
Message spam
Conclusions
Email and web spam are currently two of the largest information security problems
Classification techniques offer an effective way to filter this low quality information
Spammers are extremely dynamic, which opens up several important areas of future research…
Questions