YOKAI VERSUS THE ELEPHANT: HADOOP AND THE FIGHT AGAINST SHAPE-SHIFTING SPAM
Vishwanath Ramarao & Mark Risher
Yahoo! Mail
© SHMorgan - www.obakemono.com
AGENDA
Shape-shifting spam
Antispam Origins
Hadoop Algorithms
Applications to Security
Resources for Implementers
http:/<!--gmail.com-->/f915fde2cf53df18<!--uc22wddprm-->.li<!--cf997b28e-->gh<!--PdNKLr-->tt<!---kxnd2itipuvd.yahoo.com-->o<!--ju1j8V-->p<!--vrgxetdcnubslgacvc-->b<!--OsLaWIv-->o<[email protected]>dy<!--in7oouvxfrg7ax-->.com]*!}v}]along especially consecutive important dmvfu
<!--gmail.com-->
1,300,925,111,156,286,160,896
Viagorea ViagDrHa V l a g r a VyAGRA via---gra viagrga
via-gra 'V 1 @ G' Ra Viagzra viagdra via_gra ViaZUgra
Viargvra ViagrYa Vii-agra ViagWra vi(@)gr@ Viagvra
V-I-A-G-R-A Vi-ag.ra vigra Vkiagra via.gra v-ii-a=g-ra
V l A G R A VIA7GRA V/i/a/g/r/a VIxAGRA Viaggra vi@gr|@|
ViaTagra ViaVErga Viagr(a Viagr^a Viágrá Viagara
Viag@ra Viag&ra vi@g*r@ V-i.a-g*r-a V1@grA ViaaPrga
Vi$agra ViaJ1gra Viag$ra via---gra Vi.ag.ra Viaoygra
Vi/agra Viag%ra Viarga V|i|a|g|r|a Viag)ra vi@|g|r@
Viag&ra vi**agra vi@gr*@ vi-@gr@ V iagr a V&iagra
(http://bit.ly/cpOyLi)
TYPICAL ATTACK/RESPONSE PROFILE
[Chart: Connections from IP:64.21.48.67, plotted by date from 1/17/06 through 1/30/06; y-axis 0 to 35,000; annotation: Rule change (1/23 @ 01:15)]
MORE YOKAI - TARGETED ATTACKS
<style>mechanic CC0066 getimage 3A00 lectroniques repertoires spiel proscribing ammonoid 10110 radiobutton telefoons Jermaine ie saporito roshan 3026 janata trennung palillos toughest n capitole calzado 20200 Omnimedia collective saudade dizaines 205px hardener elongating Invasionofyourprivacy Personnal ftsbedingungen Montaner prozac Serpell fcard bvh capacitate 12502 courtship kiranji utroligt transducer tyee Delhaize clueless toffee nnio Zoa pochino sterns 622 Verordnung carbons waterresistant assessing footerText perrine url0 potatoes 999933 Rightmove positively thmb closer secures Amarillo suffer 314992 32599 8849 GJ initialling cockleshell JTA Justi aguardo jibes Chubb inflammatory iteration gran fald asseoir considerations 692px treasured Allotransplantation twoyears appx Bowers doorgeven 1487 bigpicture repeatedly Popp MPEG4 webbsida liefde Voeding Elena Kernighan sternway laggardly Zwischendurch commons equis sewing f17 apadrina sarei niques lugo quotedbl bayr 3500 CI addressee optatively gazzetta 616px mingus 23238 PhotoLink desuetude tofu keychains molding redevelopment stucco deltage astrology2 thumbscrews probablemente 700g rns fuseaction repris taires restraint manchettes trendlines effectue despatch Minsky estadual doses danbrown Muenster jind7n7 smashes gourmandes ashanti sentants rows kyk coated Incontournables coinciden jspa stalker CDS contienen expletives s8 eof replenishing puyallup prato sondra validar orientale sonnets steamer Niwango acrocentric dozens elr tempting poing jails ingredi Sep3 misdirection vested tecnici conciertos dear martini 3D35 MBR DNAME 2650 violation Egyptiin NCR sposoriss hl 12450 connectors circumcision transform CFA employeur 153 comunicazioni miner 19905 citronella Plissier Hellmich Randall Caradonna springa registrada haupt Entran 3060 Rochin capacitor sotol 3413 smirk interdite ServicePoint capabilities bouncefee Linkov 3Dg auntie OSP Caecilia Platzierung wrangler pisos banlieue Daniella enderle israel 
professionnelles susto 39800 Espana plena radian antic!...........................200KB……….
</style>
<center><a href="http://ivywhere.info/52210088504303.hrmj.1/285/1000/1006/1000/1237976a102c0176c7b3fb3164f83590.html">Please Click Here if You Can't See Images<br><img src="http://ivywhere.info/images/usacpm1.jpg" border="0"></a><br><a href="http://ivywhere.info/52210088504303.hrmj.1/40106/1000/1000/1000/a.html"><img src="http://ivywhere.info/images/usacpm2.jpg" border="0"></a><br><a href="http://ivywhere.info/gp.html"><img src="http://ivywhere.info/images/please2.jpg" border="0"></a><br>
[400kb…]<center><a href="http://corfair.info/52210088504303.hrmj.1/129286/1000/1006/1000/d1c7b1fa06980b08bf9b3a9c14844623.html">Please Click Here if You Can't See Images<br><img src="http://corfair.info/images/ivblg1.jpg" border="0"></a><br><a href="http://corfair.info/52210088504303.hrmj.1/40126/1000/1000/1000/a.html"><img src="http://corfair.info/images/ivblg2.jpg" border="0"></a><br><a href="http://corfair.info/gp.html"><img src="http://corfair.info/images/please2.jpg" border="0"></a><br>
WHY IS THE ANTISPAM PROBLEM HARD?
• Scale of the problem: 25B connections, 5B deliveries, 450M mailboxes
• User feedback is often late, noisy and not always actionable
• Large, diverse stream of legitimate traffic that looks like spam
• Slow adoption of authentication technologies like DKIM and SPF
• Spammers are clever; target and specialize attacks
• Rapidly changing spam campaigns with a large bot-controlled IP base; large variations even within a single campaign
• A significant percentage of spam comes from large ESPs like Hotmail, Google and Yahoo
GENERATION 1: MANUAL MANAGEMENT LAYER
• Heuristics, blocks, blacklists
– Provide attack mitigation and operational flexibility, highly explainable
– Not durable, expensive to keep pace with fast morphing spam
• Ad hoc queries
– Proprietary implementations, not very scalable, steep learning curve
– Reactive and usually late
GENERATION 2: MACHINE MANAGEMENT LAYER
• Online reputation models
– Simple, mostly scoring/counter/ratio based models
– Highly scalable due to the absence of any state/memory
– Generalize too broadly, lack expressive power
• Batch trained reputation models
– Typically digested memory based hashing or machine learning models
– Difficult to implement and, due to the need for labeled examples, scale only moderately well
– Slow to update and learn, lack explainability, limited operational control
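To make the "scoring/counter/ratio" idea concrete, here is a minimal sketch of a per-sender ratio model. The class name, the smoothing priors and the sample IP are assumptions for illustration, not part of the system described in these slides. It shows why such models are cheap to run (two counters per sender) and why they generalize broadly (every message from a sender receives the same score).

```python
from collections import defaultdict


class RatioReputation:
    """Hypothetical counter/ratio model: score = smoothed spam fraction per sender."""

    def __init__(self, prior_spam=1.0, prior_total=2.0):
        # Priors are an assumption; they keep unseen senders at a neutral 0.5.
        self.prior_spam = prior_spam
        self.prior_total = prior_total
        self.spam = defaultdict(int)
        self.total = defaultdict(int)

    def observe(self, sender_ip, is_spam):
        """Update the two counters for one delivered message."""
        self.total[sender_ip] += 1
        if is_spam:
            self.spam[sender_ip] += 1

    def score(self, sender_ip):
        """Return the smoothed spam ratio; the message content is never consulted."""
        spam = self.spam.get(sender_ip, 0)
        total = self.total.get(sender_ip, 0)
        return (spam + self.prior_spam) / (total + self.prior_total)


rep = RatioReputation()
for is_spam in [True] * 8 + [False] * 2:
    rep.observe("192.0.2.10", is_spam)  # documentation-range IP, purely illustrative
print(rep.score("192.0.2.10"), rep.score("198.51.100.7"))
```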
DISTRIBUTED COMPUTING PARADIGM
Map:Reduce + distributed storage:
• Simplicity of online, stateless models
• Expressiveness of offline analysis
• Ease of management
THE MAP:REDUCE PARADIGM
• Input data format is application-specific, specified by the user
• Output is a set of <key,value> pairs
• User expresses algorithm using two functions
– Map is applied on the input data and produces a list of intermediate <key,value> pairs
– Reduce is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs
• Finally, output pairs are sorted by their key value
THE MAP:REDUCE PARADIGM
[Diagram: three Mappers emit <k1,v1>, <k2,v2>, <k1,v3>; the pairs are grouped by key into <k1,{v1,v3}> and <k2,v2>; the Reducer emits <k1,W1>]
A SIMPLE MAP:REDUCE EXAMPLE
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World Bye World
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop
# Split up input files (MAP), iterate over chunks, reassemble results (REDUCE)
$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
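The same word count can be traced on a single machine to see the three phases separately. The sketch below is an in-memory illustration only, not Hadoop code; the function names are hypothetical. It maps each line to (word, 1) pairs, groups the pairs by key, reduces each group by summing, and returns the output sorted by key.

```python
from collections import defaultdict


def map_words(line):
    """Map: emit one intermediate (word, 1) pair per token."""
    return [(word, 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reduce: merge all values that share a key into a single output pair."""
    return (word, sum(counts))


def run_word_count(lines):
    # Map phase: every input record is processed independently.
    intermediate = []
    for line in lines:
        intermediate.extend(map_words(line))
    # Shuffle phase: group intermediate values by key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)
    # Reduce phase, with output sorted by key.
    return [reduce_counts(key, grouped[key]) for key in sorted(grouped)]


files = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
for word, count in run_word_count(files):
    print(word, count)
```

Run on the contents of file01 and file02, this prints the same five lines as the part-00000 output above.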
A SIMPLE MAP:REDUCE EXAMPLE (bit.ly/bdyi0l)
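// Nested inside the WordCount class; "one" and "word" are reused for every emitted pair.
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emit <word, 1> for every whitespace-separated token in the input line.
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}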
A SIMPLE MAP:REDUCE EXAMPLE (bit.ly/bdyi0l)
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    // Sum the counts emitted for each word and output <word, total>.
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
APPLICATIONS & OUTCOMES
LET'S REVIEW OUR DESIGN GOALS AGAIN
• Classifiers are notorious for lack of explainability
– Engineers and analysts need to know what the classifier is missing
– Engineers and analysts need to know about emerging threats
– Analysts need “canned” reports along interesting dimensions
– Machines need smart feature engineering
• Develop a scalable system to provide deep insight into spammer campaigns
– Double up as a platform for standard reporting
– Also double up as a platform for ad hoc analysis and data probing
– Signal amplification and smart feature extraction platform
OUR ANTISPAM ANALYTIC PLATFORM
• Hadoop: Implements map reduce, written in Java but supports many other languages including Perl and C++ using the streaming interface
• Feature engineering with small simple Perl programs for data extraction and transformation
• SQL-like “Pig” programming language for data analysis and management
• Mahout: data mining libraries that provide shrink-wrapped, scalable, sophisticated algorithms
• Other proprietary algorithms and frameworks for specialized tasks
VARIOUS ASPECTS OF A GRID DRIVEN SOLUTION
• Standard reporting
• Ad hoc querying
• Campaign discovery from spam feedback using frequent item set mining
• “Gaming” detection in notspam feedback using connected components
TOP SPAMMY DOMAINS REPORT FOR 01/15/2010
key:noreply.amateurmatch.com|value:1164
key:goodmere.info|value:896
key:marketing.meredith.com|value:1078
key:verizon.net|value:822
key:reply.mb00.net|value:980
key:insideapple.apple.com|value:1094
key:facebookappmail.com|value:882
key:mydailymoment.com|value:849
key:thetwilightsaga.com|value:4671
key:adknowledgemailer6.com|value:859
key:freedollarspro.info|value:1164
key:smartreachmedia.com|value:1074
key:yahoo.es|value:877
key:ecomasher.com|value:1197
key:leasetrade-statusupdates.com|value:951
AD HOC QUERIES FOR ANTISPAM RESEARCH
• Identify domains that had few spam votes in the previous time window but have a high number of spam votes today
• All IPs in the last hour that sent a particular URL pattern…or that sent any unknown URL >500 times
• Which domains/IPs suddenly increased their sending volume after a positive reputation change
• Which FROM addresses exhibit low message size entropy
• All messages that had nothing but a URL and the domain of the URL had low page rank
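As one illustration of the queries listed above, the "low message size entropy" question can be sketched on a single machine. Everything here is an assumption made for the example: the 100-byte bucketing, the thresholds, the function names and the sample addresses. The idea is that a sender whose messages are nearly all the same size (templated bulk mail) has a size distribution with entropy near zero, while a person writing varied mail does not.

```python
import math
from collections import Counter, defaultdict


def size_entropy(sizes, bucket=100):
    """Shannon entropy, in bits, of message sizes bucketed to `bucket` bytes."""
    buckets = Counter(size // bucket for size in sizes)
    n = len(sizes)
    return -sum((c / n) * math.log2(c / n) for c in buckets.values())


def low_entropy_senders(messages, max_entropy=1.0, min_messages=4):
    """messages: iterable of (from_address, size_in_bytes) pairs."""
    sizes_by_sender = defaultdict(list)
    for sender, size in messages:
        sizes_by_sender[sender].append(size)
    return sorted(
        sender
        for sender, sizes in sizes_by_sender.items()
        if len(sizes) >= min_messages and size_entropy(sizes) <= max_entropy
    )


# Hypothetical sample: one templated sender, one ordinary correspondent.
messages = [("bulk@example.com", s) for s in (5120, 5130, 5125, 5110)]
messages += [("friend@example.org", s) for s in (300, 4200, 15000, 820)]
print(low_entropy_senders(messages))
```

On a grid, the grouping step would be the map/shuffle and the entropy computation the reduce; the thresholds would need tuning against real traffic.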
AD HOC QUERIES - ANATOMY OF A PIG QUERY
--- This includes some basic string functions, including splitting a string on the '@' character
register /homes/jpujara/pig_scripts/string.jar;
define splitEmail string.Tokenize('2','@');
--- Load up some data - incoming messages at a date and time, and our trusted user database
MESSAGES = load '/projects/antispam/mta_feature_logs/$date*/*/*-$time*' using com.yahoo.ymail.pigfunctions.AsStorage('__record_key__,firstrcpt,mailfrom') as (mid:chararray,to:chararray,from:chararray);
USERS = load '/projects/antispam/TrustedUser.bz2' using com.yahoo.ymail.pigfunctions.AsStorage('user,t') as (user:chararray,trusted:int);
--- Split the e-mail addresses into user+domain and generate the appropriate user-id for yahoo users and partners
EXPLODED_MESSAGES = FOREACH MESSAGES GENERATE to,FLATTEN(splitEmail(to)) as (user,udomain),FLATTEN(splitEmail(from)) as (sender,sdomain);
YAHOO_MESSAGES = FOREACH EXPLODED_MESSAGES GENERATE (udomain MATCHES '.*yahoo.*' ? user : to ) as yuser,sdomain;
--- Combine the message and sender domains with the trusted user data and select only trusted messages
YAHOO_MESSAGES_TRUST = JOIN YAHOO_MESSAGES by yuser, USERS by user;
TRUSTED_MESSAGES = FILTER YAHOO_MESSAGES_TRUST by trusted > 0;
--- Group by domain, and generate a count, order by descending count
DOMAIN_GROUPS = GROUP TRUSTED_MESSAGES by sdomain;
DOMAIN_GROUPS_COUNT = FOREACH DOMAIN_GROUPS GENERATE group,COUNT(TRUSTED_MESSAGES) as count;
DOMAIN_GROUPS_ORDER = ORDER DOMAIN_GROUPS_COUNT by count DESC;
--- Output the results
STORE DOMAIN_GROUPS_ORDER into '$targetdir/topDomains';
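For readers new to Pig, the dataflow of the query above can be mirrored step by step in plain Python on in-memory data. This is a rough sketch for reading the script, not a replacement for it: the function names and the sample records are invented, and the join assumes each user appears at most once in the trusted user table.

```python
from collections import Counter


def split_email(address):
    """Split 'user@domain' into (user, domain), like the splitEmail UDF."""
    user, _, domain = address.partition("@")
    return user, domain


def top_trusted_sender_domains(messages, trusted_users):
    """messages: (mid, to, from) tuples; trusted_users: dict of user -> trust flag."""
    counts = Counter()
    for _mid, to, sender in messages:
        user, udomain = split_email(to)          # EXPLODED_MESSAGES
        _, sdomain = split_email(sender)
        # YAHOO_MESSAGES: bare user id for yahoo recipients, full address otherwise.
        yuser = user if "yahoo" in udomain else to
        # JOIN ... by yuser, then FILTER by trusted > 0.
        if trusted_users.get(yuser, 0) > 0:
            counts[sdomain] += 1                 # GROUP by sdomain + COUNT
    return counts.most_common()                  # ORDER by count DESC


# Hypothetical sample data.
messages = [
    ("m1", "alice@yahoo.com", "news@shop.example"),
    ("m2", "alice@yahoo.com", "deals@shop.example"),
    ("m3", "bob@yahoo.com", "x@spam.example"),
    ("m4", "carol@partner.example", "info@bank.example"),
]
trusted_users = {"alice": 1, "bob": 0, "carol@partner.example": 1}
print(top_trusted_sender_domains(messages, trusted_users))
```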
CAMPAIGN DISCOVERY IN SPAM FEEDBACK
• Frequent Itemset Mining
– Classical method
– Research interesting relationships between variables in a large database
– Primarily applied for market basket analysis
• Many good implementations
– APRIORI
  • Easy to implement
  • Parallelizes moderately well but bottlenecks for extremely large data sets
  • Not very efficient with the number of scans
– ECLAT
  • Parallelizes easily
  • Amenable to a good grid implementation
  • Fewer scans of the dataset
– Parallel FP GROWTH
  • Designed explicitly for systems like Hadoop
  • Implemented in Mahout 0.2
FREQUENT ITEM SET – EXAMPLE DATASET
Item sets database - D
I1, I2, I5
I2, I4
I2, I4
I1, I2, I4
I1, I3
I2, I3
I1, I3
I1, I2, I3, I5
I1, I2, I3
FREQUENT ITEMSET MINING
Slide courtesy: dortmund.de
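A minimal Apriori sketch over the item sets database D from the previous slide shows the level-wise idea: count candidates with one scan per level, keep those that meet a minimum support, and build the next level only from survivors. The minimum support of 2 is an assumption chosen for the example, and the code is a single-machine illustration rather than the Mahout Parallel FP-Growth implementation used on the grid.

```python
from itertools import combinations

# Transactions exactly as listed in the example dataset D.
D = [
    {"I1", "I2", "I5"},
    {"I2", "I4"},
    {"I2", "I4"},
    {"I1", "I2", "I4"},
    {"I1", "I3"},
    {"I2", "I3"},
    {"I1", "I3"},
    {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for every frequent itemset."""
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # One scan of the database per level to count candidate support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


frequent = apriori(D, min_support=2)
for itemset, support in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

The repeated database scan inside the loop is the cost the APRIORI bullets refer to; in the spam setting, each "item" would be a feature such as a subject, sender domain, URL or IP.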
FREQUENT ITEMSET MINING ON ONE DAY’S SPAM REPORTS
9 2595 (IPTYPE:none,FROMUSER:sales,SUBJ:It's Important You Know,FROMDOM:dappercom.info,URL:dappercom.info,ip_D:66.206.14.77,)
9 2457 (IPTYPE:none,FROMUSER:sales,SUBJ:Save On Costly Repairs,FROMDOM:aftermoon.info,URL:aftermoon.info,ip_D:66.206.14.78,)
9 2447 (IPTYPE:none,FROMUSER:sales,SUBJ:Car-Dealers-Compete-On-New-Vehicles,FROMDOM:sherge.info,URL:sherge.info,ip_D:66.206.25.227,)
9 2432 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:zaninte.info,URL:zaninte.info,ip_D:66.206.25.227,)
9 2376 (IPTYPE:none,FROMUSER:health,SUBJ:Finally. Coverage for the whole family,FROMDOM:fiatchimera.com,URL:articulatedispirit.com,ip_D:216.218.201.149,)
9 2184 (IPTYPE:none,FROMUSER:health,SUBJ:Finally. Coverage for the whole family,FROMDOM:fiatchimera.com,URL:stratagemnepheligenous.com,ip_D:216.218.201.149,)
9 1990 (IPTYPE:none,FROMUSER:sales,SUBJ:Closeout 2008-2009-2010 New Cars,FROMDOM:sastlg.info,URL:sastlg.info,ip_D:66.206.25.227,)
9 1899 (IPTYPE:none,FROMUSER:sales,FROMDOM:brunhil.info,SUBJ:700-CreditScore-What-Is-Yours?,URL:brunhil.info,ip_D:66.206.25.227,)
9 1743 (IPTYPE:none,FROMUSER:sales,SUBJ:Now exercise can be fun,FROMDOM:accordpac.info,URL:accordpac.info,ip_D:66.206.14.78,)
9 1706 (IPTYPE:none,FROMUSER:sales,SUBJ:Closeout 2008-2009-2010 New Cars,FROMDOM:rionel.info,URL:rionel.info,ip_D:66.206.25.227,)
9 1693 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:astroom.info,URL:astroom.info,ip_D:66.206.25.227,)
9 1689 (IPTYPE:none,FROMUSER:sales,SUBJ:eBay: Work@Home w/Solid-Income-Strategies,FROMDOM:stamine.info,URL:stamine.info,ip_D:66.165.232.203,)
GAMING DETECTION IN NOTSPAM FEEDBACK
• Spammers instrument accounts to vote “not spam” on emails that they send
– Delays classification of spamming IP addresses
– Throws off the classifiers if the feedback is not filtered well
• Model the problem as a bipartite graph
– Well known model for matching algorithms
– Broadly applied in various fields like coding theory
– A graph whose vertices form two disjoint sets U, V
– Every edge connects a vertex in U to a vertex in V
CONNECTED COMPONENTS - EXPLAINED
Y1 = Yahoo user 1, Y2 = Yahoo user 2
IP1 = IP address of the host Y1 “voted” notspam from
[Figure: bipartite graph of Yahoo users and the IPs they voted from (IP1, IP2), and the result of SQUARING it: an edge with weight = 2]
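One plausible reading of the "squaring" step, offered here as an assumption since only the figure labels survive, is the usual projection of a bipartite graph onto one side: two IPs become linked, with a weight equal to the number of users who voted from both. This is what the off-diagonal entries of the squared adjacency matrix count. The sketch below uses hypothetical names and a two-user example.

```python
from collections import defaultdict
from itertools import combinations


def square_bipartite(votes):
    """Project (user, ip) edges onto IPs.

    Returns {(ip_a, ip_b): weight}, where weight is the number of users
    that voted from both IPs.
    """
    ips_by_user = defaultdict(set)
    for user, ip in votes:
        ips_by_user[user].add(ip)
    weights = defaultdict(int)
    for ips in ips_by_user.values():
        for a, b in combinations(sorted(ips), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Two users that each voted notspam from both IP1 and IP2.
votes = [("y1", "IP1"), ("y1", "IP2"), ("y2", "IP1"), ("y2", "IP2")]
print(square_bipartite(votes))
```

With these four votes, IP1 and IP2 end up joined by a single edge of weight 2.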
CONNECTED COMPONENTS FOR “GAMING” DETECTION
[Figure: graph with three groups of nodes: the set of Yahoo IDs voting notspam (y1, y2, y3), the set of “voted from” IPs, and the set of “voted on” IPs (IP nodes labeled IP1 to IP4). Callouts: set of IPs/YIDs used exclusively for voting notspam; set of (likely new) spamming IPs which are “worth” voting for]
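The grouping itself can be sketched with a union-find structure: every notspam vote is an edge between a Yahoo ID and an IP, and accounts and IPs that end up in the same component form one candidate "gaming" cluster. This is a single-machine sketch under assumed names and sample data, not the grid implementation; the "yid:" and "ip:" prefixes are only there to keep the two node types from colliding.

```python
def connected_components(edges):
    """Group the nodes of an undirected graph into components using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    # Sort for deterministic output.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical notspam votes: (voting account, IP the vote came from).
votes = [
    ("yid:y1", "ip:IP1"),
    ("yid:y2", "ip:IP1"),
    ("yid:y2", "ip:IP2"),
    ("yid:y3", "ip:IP9"),
]
for component in connected_components(votes):
    print(component)
```

Here y1 and y2 fall into one component through the shared IP1, while y3 stays on its own; large components made of accounts that do nothing but vote notspam are the ones worth inspecting.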
CONNECTED COMPONENTS - RESULTS
- Connected components for IPs notspam was voted from
CONNECTED COMPONENTS - RESULTS
- Connected components for IPs notspam was voted on
CONCLUSIONS
• We have had success leveraging parallel, stateful algorithms on grid systems to keep pace with polymorphic spam that evades traditional analysis and algorithms
• Frequent Itemset Mining rapidly identifies cohesive campaigns in ISSPAM feedback
• Connected Components amplifies weak signals in gamed NOTSPAM feedback and helps separate signal from noise in the feedback
• Grid system based analysis platforms may be broadly applicable across the security domain
APPLY SLIDE
• Download Hadoop distribution
– http://hadoop.apache.org
– Try out Pig on standalone, single Linux box
• Identify source data to aggregate
– Start simple: IP patterns across web access logs
– Begin with offline aggregation; yesterday’s attacks still interesting
• Read Connected Components and Frequent Itemset Mining papers
– Stop looking for a single, invariant “tell”: far too costly
– Start thinking about co-occurrence of innocuous features
RESOURCES FOR IMPLEMENTERS
• Hadoop setup, documentation and resources
– http://hadoop.apache.org/
• Pig documentation and resources
– http://hadoop.apache.org/pig/
• Mahout documentation and resources
– http://lucene.apache.org/mahout/
• Frequent itemset mining implementation repository
– http://fimi.cs.helsinki.fi/src/
• Connected components description
– [link not yet live]
• Ranger, Raghuraman, Penmetsa, Bradski, and Kozyrakis. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In HPCA 2007
CONNECTED COMPONENTS
[Figure: seven account nodes, each labeled with the attributes Reg IP, Cookie, Username, Birthday]