Filtron: A Learning-Based Anti-Spam Filter
Mountain View, CA, July 30th and 31st, 2004
First Conference on Email and Anti-Spam (CEAS)
Outline
  Spam Filtering: past, present and future
  Anti-spam filtering with Filtron
  In Vitro Evaluation
  In Vivo Evaluation
  Conclusions
Spam Filtering: past, present and future
Past:
  Black-lists and white-lists of e-mail addresses
  Handcrafted rules looking for suspicious keywords and patterns in headers
Present:
  Signature-based filtering (Vipul’s Razor)
Future:
  Combination of several techniques (SpamAssassin)
Filtron: An overview
A multi-platform learning-based anti-spam filter.
Features for the simple user:
  Personalized: based on her legitimate messages
  Automatically updating black/white lists
  Efficient: server-side filtering and interception rules
Features for the advanced user and the researcher:
  Customizable learning component
    – Through the Weka open source machine learning platform
  Support for creating publicly available message collections
    – Privacy-preserving encoding of messages and user profiles
  Portable: implemented in Java and Tcl/Tk
  Currently supported under POSIX-compatible mail servers (MS Exchange Server port efforts under way)
Filtron’s Architecture
[Figure: Filtron’s architecture. Components shown: legitimate folders, spam folders, Preprocessor, Vectorizer, Attribute Selector, Learner. Labels shown: attribute set, training vectors, user model, induced classifier, black list, white list.]
Preprocessing
1. Break down mailbox(es) into distinct messages
2. Remove from every message:
   mail headers
   HTML tags
   attached files
3. Remove messages with no textual content
4. Store 5 messages per sender
   Avoids bias towards regular correspondents.
5. Remove duplicates
6. Encode messages (optional)
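Steps 3 to 5 of the list above can be sketched in a few lines of Python. This is only an illustration, not Filtron's actual code (which the slides say is written in Java and Tcl/Tk): the function name `preprocess`, the `(sender, body)` input format, and the use of whitespace-normalised text to detect duplicates are all assumptions, and the bodies are assumed to have already had headers, HTML tags and attachments removed.

```python
from collections import defaultdict


def preprocess(messages, max_per_sender=5):
    """Apply steps 3-5 to an iterable of (sender, body) pairs.

    Hypothetical helper: bodies are assumed to be plain text already
    stripped of headers, HTML tags and attachments (step 2).
    """
    # Step 3: normalise whitespace and drop messages with no textual content.
    non_empty = []
    for sender, body in messages:
        text = " ".join(body.split())
        if text:
            non_empty.append((sender, text))

    # Step 4: keep at most `max_per_sender` messages from each sender.
    counts = defaultdict(int)
    capped = []
    for sender, text in non_empty:
        if counts[sender] < max_per_sender:
            counts[sender] += 1
            capped.append((sender, text))

    # Step 5: remove duplicates (assumed here to mean identical bodies).
    seen = set()
    unique = []
    for sender, text in capped:
        if text not in seen:
            seen.add(text)
            unique.append((sender, text))
    return unique
```

Running the steps as separate passes mirrors the order of the list; a single-pass version would need to decide whether a duplicate counts towards a sender's quota, which the slides do not specify.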
Message Classification
[Figure: message classification. Elements shown: incoming e-mail, Unix mail server, Procmail, Filtron, classification, user’s profile, address book, black list, classifier, classified e-mail, user’s mailbox. Sample message: From: sender@provider; "Dear Fred, Thanks for the immediate reply. I am glad to hear..."; Attachments: 1. File.zip]
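The elements of the figure (address book, black list, classifier) suggest a decision flow along the following lines. This is a hedged sketch only: the order of the checks, the function name `classify_message`, and the representation of the classifier as a callable returning `True` for spam are assumptions, not details given in the slides.

```python
def classify_message(sender, body, address_book, black_list, classifier):
    """Return "legitimate" or "spam" for one incoming message.

    Hypothetical flow: trusted senders pass, black-listed senders are
    blocked, and everything else is decided by the learned classifier.
    """
    if sender in address_book:
        return "legitimate"
    if sender in black_list:
        return "spam"
    return "spam" if classifier(body) else "legitimate"


# Usage with a stand-in classifier (a real one would be the induced model).
verdict = classify_message(
    "sender@provider",
    "Dear Fred, Thanks for the immediate reply.",
    address_book={"sender@provider"},
    black_list=set(),
    classifier=lambda text: False,
)
```

In a Procmail-based setup, a verdict like this would typically be used to route the message to a folder; the exact mechanism Filtron uses is not described in this transcript.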
In Vitro Evaluation
We investigated the effect of:
  Single-token versus multi-token attributes (n-grams for n=1,2,3)
  Number of attributes (40-3000)
  Learning algorithm (Naïve Bayes, Flexible Bayes, SVMs, LogitBoost)
  Training corpus size (~10%-100% of the full training corpus)
Misclassifying a legitimate message as spam (L→S) is λ times more serious an error than misclassifying a spam message as legitimate (S→L).
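One way to act on this cost ratio, assuming the classifier outputs a probability that a message is spam, is to block a message only when the expected cost of letting it through exceeds the expected cost of blocking it: p > λ(1 − p), which gives the threshold p > λ / (1 + λ). The sketch below illustrates that derivation; the function names are hypothetical and the slides do not say how Filtron itself applies λ.

```python
def spam_threshold(lam):
    """Probability above which blocking has the lower expected cost.

    Derived from p * 1 > (1 - p) * lam, where an L->S error costs
    `lam` and an S->L error costs 1.
    """
    return lam / (1.0 + lam)


def is_spam(p_spam, lam):
    """Label a message as spam only if p_spam clears the threshold."""
    return p_spam > spam_threshold(lam)
```

With λ = 1 the threshold is 0.5 (both errors cost the same); with λ = 9 it rises to 0.9, so the filter blocks only messages it is quite sure about.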
Results:
  No clear winner among learning algorithms with respect to accuracy
    – Efficiency (or other criteria) more important for real usage.
    – Nevertheless, SVMs consistently among the two best
  No substantial improvement with n-grams (for n>1)
Refer to the TR for more details: Learning to filter unsolicited commercial e-mail, TRN 2004/2
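To make the single-token versus multi-token comparison concrete, the sketch below generates the candidate attributes for n = 1, 2, 3 from a message body. It assumes lower-casing and plain whitespace tokenization, which may differ from Filtron's own tokenizer; `token_ngrams` is a hypothetical name.

```python
def token_ngrams(text, max_n=3):
    """Return all token n-grams of `text` for n = 1..max_n.

    Tokenization is a simplifying assumption: lower-case the text and
    split on whitespace.
    """
    tokens = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


grams = token_ngrams("Buy cheap pills now")
```

For a four-token message this yields 4 single tokens, 3 bigrams and 2 trigrams, which shows how quickly the candidate attribute set grows with n and why an attribute selection step is needed.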
In Vivo Evaluation
52 false positives (out of 6732)
  52%: Automatically generated messages
    – Subscription verifications, virus warnings, etc.
  22%: Very short messages
    – 3-5 words in message body
    – Along with attachments and hyperlinks
  26%: Short messages
    – 1-2 lines
    – Written in casual style, often exploited by spammers
    – With no attachments or hyperlinks
173 false negatives (out of 6732)
  30%: “Hard Spam”
    – Little textual information, avoiding common suspicious word patterns
    – Many images and hyperlinks
    – Tricks to confuse tokenizers
  8%: Advertisements of pornographic sites with very casual and well-chosen vocabulary
  23%: Non-English messages
    – Under-represented in the training corpus
  30%: Encoded messages
    – BASE64 format; Filtron could not process it at that time
  6%: Hoax letters
    – Long formal letters (“tremendous business opportunity!”)
    – Many occurrences of the receiver’s full name
  3%: Short messages with unusual content
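The figures above can be turned into rates and approximate per-category counts with a little arithmetic. The sketch assumes that 6732 is the denominator for both error types, as the slides state "out of 6732"; because the slide percentages are rounded, the per-category counts are estimates rather than reported values. All names are hypothetical.

```python
TOTAL = 6732
FALSE_POSITIVES = 52
FALSE_NEGATIVES = 173

FP_SHARES = {"automatically generated": 52, "very short": 22, "short": 26}
FN_SHARES = {
    "hard spam": 30,
    "pornographic site ads": 8,
    "non-English": 23,
    "encoded (BASE64)": 30,
    "hoax letters": 6,
    "short, unusual content": 3,
}


def rate(errors, total=TOTAL):
    """Error count as a percentage of `total`."""
    return 100.0 * errors / total


def approx_counts(total_errors, shares):
    """Estimate per-category counts from rounded percentage shares."""
    return {label: round(total_errors * pct / 100) for label, pct in shares.items()}


if __name__ == "__main__":
    print(f"false positive rate: {rate(FALSE_POSITIVES):.2f}%")
    print(f"false negative rate: {rate(FALSE_NEGATIVES):.2f}%")
    print(approx_counts(FALSE_POSITIVES, FP_SHARES))
    print(approx_counts(FALSE_NEGATIVES, FN_SHARES))
```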
Conclusions
Signs of an arms race between spammers and content-based filters
Filtron’s performance deemed satisfactory, though it can be improved with:
  More elaborate preprocessing to tackle usual countermeasures of spammers (misspellings, uncommon words, text on images)
  Regular retraining
Currently most promising approach: combination of different filtering approaches along with Machine Learning
  Collaborative filtering
  Filtering at the transport layer level
  …