Network-Level Spam and Scam Defenses Nick Feamster Georgia Tech with Anirudh Ramachandran, Shuang Hao, Maria Konte Alex Gray, Sven Krasser, Santosh Vempala,

Mar 27, 2015

Transcript
Page 1:

Network-Level Spam and Scam Defenses

Nick Feamster
Georgia Tech

with Anirudh Ramachandran, Shuang Hao, Maria Konte, Alex Gray, Sven Krasser, Santosh Vempala, Jaeyeon Jung

Page 2:

Spam: More than Just a Nuisance

• 95% of all email traffic
– Image and PDF Spam (PDF spam ~12%)

• As of August 2007, one in every 87 emails was a phishing attack

• Targeted attacks on rise
– 20k-30k unique phishing attacks per month

Source: APWG

Page 3:

Approach: Filter

• Prevent unwanted traffic from reaching a user’s inbox by distinguishing spam from ham

• Question: What features best differentiate spam from legitimate mail?
– Content-based filtering: What is in the mail?
– IP address of sender: Who is the sender?
– Behavioral features: How is the mail sent?

Page 4:

Approach #1: Content Filters

[Figure: examples of spam delivered as images, PDFs, Excel sheets, and even mp3s]

Page 5:

Content Filtering: More Problems

• Customized emails are easy to generate: Content-based filters need fuzzy hashes over content, etc.

• Low cost of evasion: Spammers can easily adjust and change features of an email’s content

• High cost to filter maintainers: Filters must be continually updated as content-changing techniques become more sophisticated

Page 6:

Approach #2: IP Addresses

• Problem: IP addresses are ephemeral
• Every day, 10% of senders are from previously unseen IP addresses
• Possible causes
– Dynamic addressing
– New infections

Received: from mail-ew0-f217.google.com (mail-ew0-f217.google.com [209.85.219.217]) by mail.gtnoise.net (Postfix) with ESMTP id 2A6EBC94A1 for <[email protected]>; Fri, 23 Oct 2009 10:08:24 -0400 (EDT)
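
IP-based filtering starts from the sender address that the receiving mail server records in the Received header. A minimal sketch of pulling that address out, assuming the bracketed IPv4 layout shown in the header above; `extract_sender_ip` and `RECEIVED_RE` are illustrative names, not part of any tool mentioned in the talk.

```python
import re

# Matches "from <helo> (<reverse-dns> [<ipv4>])" as written in the header above.
RECEIVED_RE = re.compile(
    r"from\s+(\S+)\s+\((\S+)\s+\[(\d{1,3}(?:\.\d{1,3}){3})\]\)"
)

def extract_sender_ip(received_header):
    """Return (helo_name, reverse_dns_name, ip), or None if the layout differs."""
    match = RECEIVED_RE.search(received_header)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)

header = (
    "Received: from mail-ew0-f217.google.com "
    "(mail-ew0-f217.google.com [209.85.219.217]) "
    "by mail.gtnoise.net (Postfix) with ESMTP id 2A6EBC94A1"
)
print(extract_sender_ip(header))
```

Other mail servers format this header differently, so a real parser would need more than one pattern.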

Page 7:

Main Idea: Network-Based Filtering

• Filter email based on how it is sent, in addition to simply what is sent.

• Network-level properties: lightweight, less malleable
– Network/geographic location of sender and receiver
– Set of target recipients
– Hosting or upstream ISP (AS number)
– Membership in a botnet (spammer, hosting infrastructure)

Page 8:

Why Network-Level Features?

• Lightweight: Don’t require inspecting details of packet streams
– Can be done at high speeds
– Can be done in the middle of the network

• Less Malleable: Perhaps more difficult to change some network-level features than message contents

Page 9:

Challenges

• Understanding network-level behavior
– What network-level behaviors do spammers have?
– How well do existing techniques (e.g., DNS-based blacklists) work?

• Building classifiers using network-level features
– Key challenge: Which features to use?
– Two Algorithms: SNARE and SpamTracker

Anirudh Ramachandran and Nick Feamster, “Understanding the Network-Level Behavior of Spammers”, ACM SIGCOMM, 2006
Anirudh Ramachandran, Nick Feamster, and Santosh Vempala, “Filtering Spam with Behavioral Blacklisting”, ACM CCS, 2007
Shuang Hao, Nick Feamster, Alex Gray and Sven Krasser, “SNARE: Spatio-temporal Network-level Automatic Reputation Engine”, USENIX Security, August 2009

Page 10:

Data: Spam and BGP

• Spam Traps: Domains that receive only spam
• BGP Monitors: Watch network-level reachability

Domain 1

Domain 2

17-Month Study: August 2004 to December 2005

Page 11:

Data Collection: MailAvenger

• Configurable SMTP server
• Collects many useful statistics

Page 12:

Surprising: BGP “Spectrum Agility”

• Hijack IP address space using BGP
• Send spam
• Withdraw IP address

A small club of persistent players appears to be using this technique.

Common short-lived prefixes and ASes

Prefix        AS
61.0.0.0/8    4678
66.0.0.0/8    21562
82.0.0.0/8    8717

~ 10 minutes

Somewhere between 1-10% of all spam (some clearly intentional, others “flapping”)
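
The announce, send, withdraw pattern can be looked for in a BGP update feed by measuring how long each announcement lives. A minimal sketch, assuming updates have already been parsed into `(timestamp, prefix, kind)` tuples; that tuple format, the function name, and the 600-second threshold (chosen only to echo the "~ 10 minutes" note on the slide) are assumptions, and the sample events are made up.

```python
def short_lived_announcements(events, max_lifetime=600):
    """Return (prefix, announce_time, lifetime) for routes withdrawn quickly.

    events: iterable of (timestamp_seconds, prefix, kind), where kind is
    "A" for an announcement and "W" for a withdrawal (hypothetical format).
    """
    announced_at = {}
    short_lived = []
    for timestamp, prefix, kind in sorted(events):
        if kind == "A":
            # Keep the earliest announcement if the route is re-announced.
            announced_at.setdefault(prefix, timestamp)
        elif kind == "W" and prefix in announced_at:
            start = announced_at.pop(prefix)
            lifetime = timestamp - start
            if lifetime <= max_lifetime:
                short_lived.append((prefix, start, lifetime))
    return short_lived

# Made-up events: one route lives 9 minutes, the other about a day.
events = [
    (0, "61.0.0.0/8", "A"),
    (540, "61.0.0.0/8", "W"),
    (100, "82.0.0.0/8", "A"),
    (90000, "82.0.0.0/8", "W"),
]
print(short_lived_announcements(events))
```

A short lifetime alone does not separate intentional hijacks from ordinary route flapping; the slide makes the same distinction.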

Page 13:

Spectrum Agility: Big Prefixes?

• Flexibility: Client IPs can be scattered throughout dark space within a large /8
– Same sender usually returns with different IP addresses

• Visibility: Route typically won’t be filtered (nice and short)

Page 14:

Other “Basic” Findings

• Top senders: Korea, China, Japan
– Still about 40% of spam coming from U.S.

• More than half of sender IP addresses appear less than twice

• ~90% of spam sent to traps from Windows

Page 15:

Top ISPs Hosting Spam Senders

Page 16:

How Well do IP Blacklists Work?

• Completeness: The fraction of spamming IP addresses that are listed in the blacklist

• Responsiveness: The time for the blacklist to list the IP address after the first occurrence of spam
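
Both metrics follow directly from their definitions. A minimal sketch, assuming two lookup tables are available: the time each spamming IP was first seen and the time each IP was listed. The function names and the sample addresses (from the documentation range 192.0.2.0/24) are illustrative only.

```python
def completeness(spam_ips, listed_ips):
    """Fraction of spamming IP addresses that appear in the blacklist."""
    spam_ips = set(spam_ips)
    if not spam_ips:
        return 0.0
    return len(spam_ips & set(listed_ips)) / len(spam_ips)

def responsiveness(first_spam_time, listing_time):
    """Per-IP delay between first spam and listing; None if never listed."""
    return {
        ip: (listing_time[ip] - seen) if ip in listing_time else None
        for ip, seen in first_spam_time.items()
    }

# Made-up observations, times in seconds.
first_spam_time = {"192.0.2.1": 10, "192.0.2.2": 100, "192.0.2.3": 200, "192.0.2.4": 300}
listing_time = {"192.0.2.1": 50, "192.0.2.2": 400}

print(completeness(first_spam_time, listing_time))
print(responsiveness(first_spam_time, listing_time))
```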

Page 17:

Completeness and Responsiveness

• 10-35% of spam is unlisted at the time of receipt
• 8.5-20% of these IP addresses remain unlisted even after one month

Data: Spam trap data from March 2007, Spamhaus from March and April 2007

Page 18:

Why Do IP Blacklists Fall Short?

• Based on ephemeral identifier (IP address)
– More than 10% of all spam comes from IP addresses not seen within the past two months
    • Dynamic renumbering of IP addresses
    • Stealing of IP addresses and IP address space
    • Compromised machines

• Often require a human to notice/validate the behavior
– Spamming is compartmentalized by domain and not analyzed across domains

Page 19:

Other Possible Approaches

• Option 1: Stronger sender identity [AIP, Pedigree]

– Stronger sender identity/authentication may make reputation systems more effective

– May require changes to hosts, routers, etc.

• Option 2: Behavior-based filtering [SNARE, SpamTracker]

– Can be done on today’s network
– Identifying features may be tricky, and some may require network-wide monitoring capabilities

Page 20:

Outline

• Understanding the network-level behavior
– What behaviors do spammers have?
– How well do existing techniques work?

• Classifiers using network-level features
– Key challenge: Which features to use?
– Two algorithms: SNARE and SpamTracker

• Network-level Scam Defenses

Page 21:

Finding the Right Features

• Goal: Sender reputation from a single packet?
– Low overhead
– Fast classification
– In-network
– Perhaps more evasion-resistant

• Key challenge
– What features satisfy these properties and can distinguish spammers from legitimate senders?

Page 22:

Set of Network-Level Features

• Single-Packet
– Geodesic distance
– Distance to k nearest senders
– Time of day
– AS of sender’s IP
– Status of email service ports

• Single-Message
– Number of recipients
– Length of message

• Aggregate (Multiple Message/Recipient)

Page 23:

Sender-Receiver Geodesic Distance

90% of legitimate messages travel 2,200 miles or less
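
The distance feature can be computed with the haversine formula once both endpoints have coordinates. A minimal sketch, assuming latitude and longitude were already obtained from some IP geolocation source (not shown); `geodesic_miles` is an illustrative name and the coordinates below are approximate city centers used only as an example.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def geodesic_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to guard against rounding slightly above 1 for antipodal points.
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(min(1.0, a)))

# Example: sender near Atlanta, receiver near Seattle (approximate coordinates).
print(round(geodesic_miles(33.749, -84.388, 47.606, -122.332)))
```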

Page 24:

Density of Senders in IP Space

For spammers, k nearest senders are much closer in IP space
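
One way to turn "closeness in IP space" into a number is to treat IPv4 addresses as integers and average the distance to the k nearest other senders. A minimal sketch under that assumption; the function name, the choice of a mean as the aggregate, and the sample addresses are illustrative rather than taken from the talk.

```python
import bisect
import ipaddress

def avg_distance_to_k_nearest(sender_ip, other_senders, k=3):
    """Mean numerical distance from sender_ip to its k nearest senders."""
    target = int(ipaddress.IPv4Address(sender_ip))
    others = sorted(int(ipaddress.IPv4Address(ip)) for ip in other_senders)
    # The k nearest neighbours must lie within k positions of the insertion point.
    pos = bisect.bisect_left(others, target)
    window = others[max(0, pos - k):pos + k]
    nearest = sorted(abs(addr - target) for addr in window)[:k]
    return sum(nearest) / len(nearest) if nearest else None

others = ["10.0.0.1", "10.0.0.4", "10.0.0.9", "192.168.1.1"]
print(avg_distance_to_k_nearest("10.0.0.5", others, k=2))
```

A small value means the sender sits in a densely populated region of sender address space, which is what the slide reports for spammers.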

Page 25:

Local Time of Day at Sender

Spammers “peak” at different local times of day

Page 26:

Combining Features: RuleFit

• Put features into the RuleFit classifier
• 10-fold cross validation on one day of query logs from a large spam filtering appliance provider

• Comparable performance to SpamHaus
– Incorporating into the system can further reduce FPs

• Using only network-level features
• Completely automated

Page 27:

Ranking of Features

Page 28:

SNARE: Putting it Together

• Email arrival
• Whitelisting
• Greylisting
• Retraining

Page 29:

Benefits of Whitelisting

Whitelisting top 50 ASes: False positives reduced to 0.14%

Page 30:

Another Possible Feature: Coordination

• Idea: Blacklist sending behavior (“Behavioral Blacklisting”)
– Identify sending patterns commonly used by spammers

• Intuition: More difficult for a spammer to change the technique by which mail is sent than it is to change the content

Page 31:

SpamTracker: Clustering

• Construct a behavioral fingerprint for each sender

• Cluster senders with similar fingerprints

• Filter new senders that map to existing clusters

Page 32:

SpamTracker: Identify Invariant

[Figure: A known spammer at IP address 76.17.114.xxx sends spam to domain1.com, domain2.com, and domain3.com. After DHCP reassignment or a new infection, an unknown sender at IP address 24.99.146.xxx sends spam to the same three domains. Clustering on sending behavior shows that the two behavioral fingerprints are similar.]

Page 33:

Building the Classifier: Clustering

• Feature: Distribution of email sending volumes across recipient domains

• Clustering Approach
– Build initial seed list of bad IP addresses
– For each IP address, compute feature vector: volume per domain per time interval
– Collapse into a single IP x domain matrix:
– Compute clusters
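
The feature-construction steps above can be sketched as follows. This is an illustration under stated assumptions: the input is a list of `(ip, domain, time_interval, message_count)` records (a hypothetical log format), and the collapse over time is done by summing, which is only one possible choice since the slide's formula did not survive. The clustering step itself is not shown.

```python
from collections import defaultdict

def build_ip_domain_matrix(records):
    """Collapse per-interval volumes into a single IP x domain matrix.

    records: iterable of (ip, domain, time_interval, message_count).
    Returns (ips, domains, matrix) where matrix[i][j] is the total volume
    sent by ips[i] to domains[j].
    """
    volume = defaultdict(lambda: defaultdict(int))
    for ip, domain, _interval, count in records:
        # Assumption: collapse the time dimension by summing counts.
        volume[ip][domain] += count
    ips = sorted(volume)
    domains = sorted({d for row in volume.values() for d in row})
    matrix = [[volume[ip].get(d, 0) for d in domains] for ip in ips]
    return ips, domains, matrix

# Made-up records using documentation-range addresses.
records = [
    ("192.0.2.1", "domain1.com", 0, 2),
    ("192.0.2.1", "domain1.com", 1, 3),
    ("192.0.2.1", "domain2.com", 0, 1),
    ("192.0.2.2", "domain2.com", 1, 4),
]
ips, domains, matrix = build_ip_domain_matrix(records)
print(ips, domains, matrix)
```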

Page 34:

Clustering: Output and Fingerprint

• For each cluster, compute fingerprint vector:

• New IPs will be compared to this “fingerprint”

IP x IP Matrix: Intensity indicates pairwise similarity

Page 35:

Clustering Results

[Figure: SpamTracker score for ham and spam]

Separation may not be sufficient alone, but could be a useful feature

Page 36:

Deployment: SpamSpotter

• As mail arrives, lookups received at BL

• Queries provide proxy for sending behavior

• Train based on received data

• Return score

Approach

Page 37:

Challenges

• Scalability: How to collect and aggregate data, and form the signatures without imposing too much overhead?

• Dynamism: When to retrain the classifier, given that sender behavior changes?

• Reliability: How should the system be replicated to better defend against attack or failure?

• Evasion resistance: Can the system still detect spammers when they are actively trying to evade?

Page 38:

Latency

Performance overhead is small.

Page 39:

Sampling

Relatively small samples can achieve low false positive rates

Page 40:

Improvements

• Accuracy
– Synthesizing multiple classifiers
– Incorporating user feedback
– Learning algorithms with bounded false positives

• Performance
– Caching/Sharing
– Streaming

• Security
– Learning in adversarial environments

Page 41:

Summary

• Spam increasing, spammers becoming agile
– Content filters are falling behind
– IP-Based blacklists are evadable
    • Up to 30% of spam not listed in common blacklists at receipt. ~20% remains unlisted after a month

• Complementary approach: behavioral blacklisting based on network-level features
– Key idea: Blacklist based on how messages are sent
– SNARE: Automated sender reputation
    • ~90% accuracy of existing with lightweight features
– SpamTracker: Spectral clustering
    • catches significant amounts faster than existing blacklists
– SpamSpotter: Putting it together in an RBL system

Page 42:

Network-Level Scam Defenses

Page 43:

Network-Level Scam Defenses

• Scammers host Web sites on dynamic scam hosting infrastructure
– Use DNS to redirect users to different sites when the location of the sites moves

• State of the art: URL Blacklisting

• Our approach: Blacklist based on network-level fingerprints

Konte et al., “Dynamics of Online Scam Hosting Infrastructure”, PAM 2009

Page 44:

Online Scams

• Often advertised in spam messages
• URLs point to various point-of-sale sites
• These scams continue to be a menace
– As of August 2007, one in every 87 emails constituted a phishing attack

• Scams often hosted on bullet-proof domains

• Problem: Study the dynamics of online scams, as seen at a large spam sinkhole

Page 45:

Online Scam Hosting is Dynamic

• A URL received in an email message may point to different sites at different times

• Maintains agility as sites are shut down, blacklisted, etc.

• One mechanism for hosting sites: fast flux

Page 46:

Mechanism for Dynamics: “Fast Flux”

Source: HoneyNet Project

Page 47:

Summary of Findings

• What are the rates and extents of change?
– Different from legitimate load balancing
– Different across different scam campaigns

• How are dynamics implemented?
– Many scam campaigns change DNS mappings at all three locations in the DNS hierarchy
    • A, NS, IP address of NS record

• Conclusion: Might be able to detect based on monitoring the dynamic behavior of URLs

Page 48:

Data Collection Method

• Three months of spamtrap data
– 384 scam hosting domains
– 21 unique scam campaigns

• Baseline comparison: Alexa “top 500” Web sites

Page 49:

Time Between Record Changes

Fast-flux domains tend to change much more frequently than legitimately hosted sites
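
The time between record changes can be measured by polling a domain's records and noting when the answer set differs from the previous poll. A minimal sketch, assuming observations are already collected as `(timestamp, set_of_records)` pairs sorted by time; the polling itself is not shown, and the name `change_intervals` and the sample data are illustrative.

```python
def change_intervals(observations):
    """Seconds between consecutive changes of a domain's record set.

    observations: list of (timestamp_seconds, iterable_of_records),
    sorted by timestamp (hypothetical polling output).
    """
    change_times = []
    previous = None
    for timestamp, records in observations:
        current = frozenset(records)
        if previous is not None and current != previous:
            change_times.append(timestamp)
        previous = current
    return [later - earlier
            for earlier, later in zip(change_times, change_times[1:])]

# Made-up polls every 300 seconds.
observations = [
    (0, {"192.0.2.1"}),
    (300, {"192.0.2.1"}),
    (600, {"198.51.100.7"}),
    (900, {"198.51.100.7"}),
    (1200, {"203.0.113.9"}),
    (1500, {"192.0.2.1"}),
]
print(change_intervals(observations))
```

The polling period bounds the resolution: changes faster than the period are merged or missed.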

Page 50:

Location: Many Distinct Subnets

Scam sites appear in many more distinct networks than legitimate load-balanced sites.
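
Counting the distinct networks behind a domain is straightforward with Python's standard `ipaddress` module. A minimal sketch; grouping by /24 is an assumption made here for illustration (the slide does not state a prefix length), and the addresses are from documentation ranges.

```python
import ipaddress

def distinct_subnets(ip_addresses, prefix_len=24):
    """Set of distinct networks of the given prefix length covering the IPs."""
    return {
        ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        for ip in ip_addresses
    }

# Made-up addresses observed for one domain.
observed = ["192.0.2.10", "192.0.2.200", "198.51.100.7", "203.0.113.9"]
print(len(distinct_subnets(observed)))
```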

Page 51:

Summary

• Scam campaigns rely on a dynamic hosting infrastructure
• Studying the dynamics of that infrastructure may help us develop better detection methods

• Dynamics
– Rates of change differ from legitimate sites, and differ across campaigns
– Dynamics implemented at all levels of DNS hierarchy

• Location
– Scam sites distributed across distinct subnets

Data: http://www.gtnoise.net/scam/fast-flux.html
TR: http://www.cc.gatech.edu/research/reports/GT-CS-08-07.pdf

Page 52:

Final Thoughts and Next Steps

• Duality between host security and network security.

• Can programmable networks (e.g., OpenFlow, NetFPGA, etc.) offer a better refactoring?
– Resonance: Inference-based Dynamic Access Control for Enterprise Networks, A. Nayak, A. Reimers, N. Feamster, R. Clark. ACM SIGCOMM Workshop on Research on Enterprise Networks.

• Can better security primitives at the host help the network make better decisions about the security of network traffic?
– Securing Enterprise Networks with Traffic Tainting, A. Ramachandran, Y. Mundada, M. Tariq, N. Feamster. In submission.

Page 53:

References

• Anirudh Ramachandran and Nick Feamster, “Understanding the Network-Level Behavior of Spammers”, ACM SIGCOMM, 2006

• Anirudh Ramachandran, Nick Feamster, and Santosh Vempala, “Filtering Spam with Behavioral Blacklisting”, ACM CCS, 2007

• Shuang Hao, Nick Feamster, Alex Gray and Sven Krasser, “SNARE: Spatio-temporal Network-level Automatic Reputation Engine”, USENIX Security, August 2009

• Anirudh Ramachandran, Shuang Hao, Hitesh Khandelwal, Nick Feamster, Santosh Vempala, “A Dynamic Reputation Service for Spotting Spammers”, GT-CS-08-09

• Maria Konte, Nick Feamster, Jaeyeon Jung, “Dynamics of Online Scam Hosting Infrastructure”, Passive and Active Measurement Conference, April 2009.

Page 55:

Design Choice: Augment DNSBL

• Expressive queries
– SpamHaus: $ dig 55.102.90.62.zen.spamhaus.org
    • Ans: 127.0.0.3 (=> listed in exploits block list)
– SpamSpotter: $ dig receiver_ip.receiver_domain.sender_ip.rbl.gtnoise.net
    • e.g., dig 120.1.2.3.gmail.com.-.1.1.207.130.rbl.gtnoise.net
    • Ans: 127.1.3.97 (SpamSpotter score = -3.97)

• Also a source of data
– Unsupervised algorithms work with unlabeled data
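
A DNSBL lookup name is built by reversing the octets of the IP address and appending the list's zone, which is how 62.90.102.55 becomes the SpamHaus query above. The sketch below builds that name and, as a loudly labeled guess, decodes a SpamSpotter-style answer. The decoding rule (second octet 1 means negative, third octet is the integer part, fourth octet is hundredths) is inferred only from the single example on the slide and may not match the real service; both function names are illustrative. No DNS query is actually sent.

```python
def dnsbl_query_name(ip, zone="zen.spamhaus.org"):
    """Reverse the IPv4 octets and append the blacklist zone."""
    return ".".join(reversed(ip.split("."))) + "." + zone

def decode_spamspotter_answer(answer):
    """Decode a 127.s.w.h answer into a score.

    ASSUMPTION: inferred from one example (127.1.3.97 -> -3.97);
    s == 1 is read as a negative sign, w as the integer part,
    h as hundredths. The real encoding may differ.
    """
    _, sign, whole, hundredths = (int(part) for part in answer.split("."))
    score = (whole * 100 + hundredths) / 100
    return -score if sign == 1 else score

print(dnsbl_query_name("62.90.102.55"))
print(decode_spamspotter_answer("127.1.3.97"))
```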

Page 56:

Evaluation

• Emulate the performance of a system that could observe sending patterns across many domains
– Build clusters/train on given time interval

• Evaluate classification
– Relative to labeled logs
– Relative to IP addresses that were eventually listed

Page 57:

Data

• 30 days of Postfix logs from email hosting service
– Time, remote IP, receiving domain, accept/reject
– Allows us to observe sending behavior over a large number of domains
– Problem: About 15% of accepted mail is also spam
    • Creates problems with validating SpamTracker

• 30 days of SpamHaus database in the month following the Postfix logs
– Allows us to determine whether SpamTracker detects some sending IPs earlier than SpamHaus

Page 58:

Classifying IP Addresses

• Given “new” IP address, build a feature vector based on its sending pattern across domains

• Compute the similarity of this sending pattern to that of each known spam cluster
– Normalized dot product of the two feature vectors
– Spam score is maximum similarity to any cluster
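
The scoring rule described above can be sketched directly: a normalized dot product (cosine similarity) against each cluster fingerprint, keeping the maximum. The vectors are per-domain sending volumes; the function names and the toy vectors are illustrative, and how the fingerprints are produced is outside this sketch.

```python
import math

def normalized_dot(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def spam_score(sender_vector, cluster_fingerprints):
    """Maximum similarity between the sender and any known spam cluster."""
    return max(
        (normalized_dot(sender_vector, fp) for fp in cluster_fingerprints),
        default=0.0,
    )

# Toy example over three recipient domains.
fingerprints = [[1, 0, 0], [0, 1, 1]]
print(spam_score([2, 0, 0], fingerprints))
```

Because the dot product is normalized, a low-volume sender with the same spread across domains as a cluster scores as high as a high-volume one.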

Page 59:

Sampling: Training Time

Page 60:

Additional History: Message Size Variance

Senders of legitimate mail have a much higher variance in sizes of messages they send

[Figure: message size range for four classes of sender: certain spam, likely spam, likely ham, certain ham]

Surprising: Including this feature (and others with more history) can actually decrease the accuracy of the classifier
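
As a history-based feature, message size variance requires several messages per sender. A minimal sketch using the standard library's population variance; the input layout (sender mapped to a list of message sizes in bytes) and the sample values are assumptions for illustration.

```python
import statistics

def message_size_variance(sizes_by_sender):
    """Population variance of message sizes for each sender with any mail."""
    return {
        sender: statistics.pvariance(sizes)
        for sender, sizes in sizes_by_sender.items()
        if sizes
    }

# Made-up sizes in bytes: a template-like sender and a varied sender.
sizes_by_sender = {
    "192.0.2.1": [1000, 1000, 1000],
    "192.0.2.2": [100, 5000],
    "192.0.2.3": [],
}
print(message_size_variance(sizes_by_sender))
```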

Page 61:

Completeness of IP Blacklists

[Figure: fraction of all spam received vs. number of DNSBLs listing this spammer]

~80% listed on average

~95% of bots listed in one or more blacklists

Only about half of the IPs spamming from short-lived BGP are listed in any blacklist

Spam from IP-agile senders tends to be listed in fewer blacklists

Page 62:

Low Volume to Each Domain

[Figure: amount of spam vs. lifetime (seconds)]

Most spammers send very little spam, regardless of how long they have been spamming.

Page 63:

Some Patterns of Sending are Invariant

[Figure: IP address 76.17.114.xxx sends spam to domain1.com, domain2.com, and domain3.com. After DHCP reassignment, IP address 24.99.146.xxx sends spam to the same three domains.]

• Spammer's sending pattern has not changed
• IP Blacklists cannot make this connection

Page 64:

Characteristics of Agile Senders

• IP addresses are widely distributed across the /8 space

• IP addresses typically appear only once at our sinkhole

• Depending on which /8, 60-80% of these IP addresses were not reachable by traceroute when we spot-checked

• Some IP addresses were in allocated, albeit unannounced space

• Some AS paths associated with the routes contained reserved AS numbers

Page 65:

Early Detection Results

• Compare SpamTracker scores on “accepted” mail to the SpamHaus database
– About 15% of accepted mail was later determined to be spam
– Can SpamTracker catch this?

• Of 620 emails that were accepted, but sent from IPs that were blacklisted within one month
– 65 emails had a score larger than 5 (85th percentile)

Page 66:

Evasion

• Problem: Malicious senders could add noise
– Solution: Use smaller number of trusted domains

• Problem: Malicious senders could change sending behavior to emulate “normal” senders
– Need a more robust set of features…