Network-Level Spam and Scam Defenses Nick Feamster Georgia Tech with Anirudh Ramachandran, Shuang Hao, Maria Konte Alex Gray, Jaeyeon Jung, Santosh Vempala.

Network-Level Spam and Scam Defenses

Nick Feamster, Georgia Tech

with Anirudh Ramachandran, Shuang Hao, Maria Konte, Alex Gray, Jaeyeon Jung, Santosh Vempala

Spam: More than Just a Nuisance

• Spam accounts for 95% of all email traffic
  – Image and PDF spam (PDF spam ~12%)

• As of August 2007, one in every 87 emails constituted a phishing attack

• Targeted attacks on the rise
  – 20k-30k unique phishing attacks per month

Source: CNET (January 2008), APWG

Approach: Filter

• Prevent unwanted traffic from reaching a user’s inbox by distinguishing spam from ham

• Question: What features best differentiate spam from legitimate mail?
  – Content-based filtering: What is in the mail?
  – IP address of sender: Who is the sender?
  – Behavioral features: How is the mail sent?

Content Filters: Chasing a Moving Target

Spam formats: PDFs, Excel sheets, images ... and even mp3s!

Problems with Content Filtering

• Customized emails are easy to generate: Content-based filters need fuzzy hashes over content, etc.

• Low cost of evasion: Spammers can easily adjust and change features of an email’s content

• High cost to filter maintainers: Filters must be continually updated as content-changing techniques become more sophisticated

Another Approach: IP Addresses

• Problem: IP addresses are ephemeral

• Every day, 10% of senders are from previously unseen IP addresses

• Possible causes
  – Dynamic addressing
  – New infections
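
The churn figure above can be measured from a per-day log of sender IPs. The following is a minimal sketch, assuming a hypothetical dictionary that maps each day to the sender IPs observed on that day; the function name and the toy data are illustrative and not part of the original study.

```python
from typing import Dict, Iterable, Set


def unseen_fraction_by_day(daily_senders: Dict[str, Iterable[str]]) -> Dict[str, float]:
    """For each day (in sorted key order), return the fraction of that day's
    distinct sender IPs that did not appear on any earlier day."""
    seen: Set[str] = set()
    fractions: Dict[str, float] = {}
    for day in sorted(daily_senders):
        today = set(daily_senders[day])
        new = today - seen
        fractions[day] = len(new) / len(today) if today else 0.0
        seen |= today
    return fractions


# Hypothetical toy log; ISO dates sort chronologically as strings.
toy_log = {
    "2007-03-01": ["192.0.2.1", "192.0.2.2"],
    "2007-03-02": ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"],
}
print(unseen_fraction_by_day(toy_log))
```

The first day is always 100% new by construction, so a real measurement would discard a warm-up period before reporting the daily fraction.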

Our Idea: Network-Based Filtering

• Filter email based on how it is sent, in addition to simply what is sent.

• Network-level properties are less malleable
  – Network/geographic location of sender and receiver
  – Set of target recipients
  – Hosting or upstream ISP (AS number)
  – Membership in a botnet (spammer, hosting infrastructure)

Why Network-Level Features?

• Lightweight: Don’t require inspecting details of packet streams
  – Can be done at high speeds
  – Can be done in the middle of the network

• Robust: Perhaps more difficult to change some network-level features than message contents

Challenges

• Understanding network-level behavior
  – What network-level behaviors do spammers have?
  – How well do existing techniques work?

• Building classifiers using network-level features
  – Key challenge: Which features to use?
  – Two algorithms: SNARE and SpamTracker

• Building the system
  – Dynamism: Behavior itself can change
  – Scale: Lots of email messages (and spam!) out there

• Applications to phishing and scams

Data: Spam and BGP

• Spam Traps: Domains that receive only spam

• BGP Monitors: Watch network-level reachability

(Figure labels: Domain 1, Domain 2)

17-Month Study: August 2004 to December 2005

Finding: BGP “Spectrum Agility”

• Hijack IP address space using BGP

• Send spam

• Withdraw IP address

A small club of persistent players appears to be using this technique.

Common short-lived prefixes and ASes (announcements last ~10 minutes):

  Prefix        AS
  61.0.0.0/8    4678
  66.0.0.0/8    21562
  82.0.0.0/8    8717

Somewhere between 1-10% of all spam (some clearly intentional, others might be flapping)
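
The announce, send, withdraw pattern can be looked for in a BGP update feed. This is a minimal sketch, assuming a hypothetical event format of (timestamp, "A" or "W", prefix) and a 10-minute lifetime threshold taken from the slide; it is not the analysis code used in the study.

```python
from typing import Dict, List, Tuple

Event = Tuple[float, str, str]  # (unix_time, "A" announce or "W" withdraw, prefix)


def short_lived_announcements(
    events: List[Event], max_lifetime_s: float = 600.0
) -> List[Tuple[str, float]]:
    """Return (prefix, lifetime_seconds) for each announce/withdraw pair
    whose lifetime is at most max_lifetime_s."""
    announced_at: Dict[str, float] = {}
    short: List[Tuple[str, float]] = []
    for ts, kind, prefix in sorted(events):
        if kind == "A":
            # Keep the earliest announcement while the prefix stays announced.
            announced_at.setdefault(prefix, ts)
        elif kind == "W" and prefix in announced_at:
            lifetime = ts - announced_at.pop(prefix)
            if lifetime <= max_lifetime_s:
                short.append((prefix, lifetime))
    return short


# Hypothetical events: one prefix lives 9 minutes, the other a full day.
events = [
    (0.0, "A", "61.0.0.0/8"),
    (540.0, "W", "61.0.0.0/8"),
    (0.0, "A", "192.0.2.0/24"),
    (86400.0, "W", "192.0.2.0/24"),
]
print(short_lived_announcements(events))
```

A short lifetime alone does not prove intent, since ordinary route flapping also produces brief announcements; correlating the flagged windows with spam arrival times is what ties a prefix to this technique.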

Spectrum Agility: Big Prefixes?

• Flexibility: Client IPs can be scattered throughout dark space within a large /8
  – Same sender usually returns with different IP addresses

• Visibility: Route typically won’t be filtered (nice and short)

How Well do IP Blacklists Work?

• Completeness: The fraction of spamming IP addresses that are listed in the blacklist

• Responsiveness: The time for the blacklist to list the IP address after the first occurrence of spam
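
Both metrics follow directly from two timestamps per IP address. The sketch below assumes hypothetical dictionaries mapping each IP to the time of its first observed spam and, if it was ever listed, to its listing time; the names are illustrative.

```python
from typing import Dict


def completeness(first_spam: Dict[str, float], listed_at: Dict[str, float]) -> float:
    """Fraction of spamming IPs that appear in the blacklist at all."""
    if not first_spam:
        return 0.0
    return sum(1 for ip in first_spam if ip in listed_at) / len(first_spam)


def responsiveness(first_spam: Dict[str, float], listed_at: Dict[str, float]) -> Dict[str, float]:
    """Seconds from first spam to listing, for each listed IP.
    An IP that was listed before it first spammed gets a delay of 0."""
    return {
        ip: max(0.0, listed_at[ip] - t)
        for ip, t in first_spam.items()
        if ip in listed_at
    }


# Hypothetical timestamps in seconds.
first_spam = {"192.0.2.1": 100.0, "192.0.2.2": 200.0, "192.0.2.3": 300.0, "192.0.2.4": 50.0}
listed_at = {"192.0.2.1": 160.0, "192.0.2.2": 150.0}
print(completeness(first_spam, listed_at))
print(responsiveness(first_spam, listed_at))
```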

Completeness and Responsiveness

• 10-35% of spam is unlisted at the time of receipt

• 8.5-20% of these IP addresses remain unlisted even after one month

Data: Trap data from March 2007, Spamhaus from March and April 2007

Problems with IP Blacklists

• IP addresses of senders have considerable churn

• Based on ephemeral identifier (IP address)
  – More than 10% of all spam comes from IP addresses not seen within the past two months
    • Dynamic renumbering of IP addresses
    • Stealing of IP addresses and IP address space
    • Compromised machines

• Often require a human to notice/validate the behavior
  – Spamming is compartmentalized by domain and not analyzed across domains

Outline

• Understanding the network-level behavior
  – What behaviors do spammers have?
  – How well do existing techniques work?

• Classifiers using network-level features
  – Key challenge: Which features to use?
  – Two algorithms: SNARE and SpamTracker

• System: SpamSpotter
  – Dynamism: Behavior itself can change
  – Scale: Lots of email messages (and spam!) out there

• Application to phishing and scams

Finding the Right Features

• Goal: Sender reputation from a single packet?
  – Low overhead
  – Fast classification
  – In-network
  – Perhaps more evasion resistant

• Key challenge
  – What features satisfy these properties and can distinguish spammers from legitimate senders?

Set of Network-Level Features

• Single-Packet
  – Geodesic distance
  – Distance to k nearest senders
  – Time of day
  – AS of sender’s IP
  – Status of email service ports

• Single-Message
  – Number of recipients
  – Length of message

• Aggregate (Multiple Message/Recipient)

Sender-Receiver Geodesic Distance

90% of legitimate messages travel 2,200 miles or less
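
Given geolocated coordinates for sender and receiver, the geodesic distance feature can be approximated with the haversine formula. This is a minimal sketch, assuming the coordinates have already been obtained from some IP geolocation source (not shown) and treating the Earth as a sphere; the slides do not say how the distance was actually computed.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; spherical approximation


def geodesic_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to guard against rounding slightly above 1 for antipodal points.
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(min(1.0, a)))


# Illustrative coordinates: Atlanta and San Francisco.
print(round(geodesic_miles(33.749, -84.388, 37.7749, -122.4194)))
```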

Density of Senders in IP Space

For spammers, k nearest senders are much closer in IP space
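
One way to express this density feature is the average numeric distance from a sender to its k nearest neighbouring senders when IPv4 addresses are treated as 32-bit integers. The sketch below makes that assumption; the function name and the choice of the mean as the summary statistic are illustrative rather than taken from the slides.

```python
import bisect
import ipaddress
from typing import List


def mean_distance_to_k_nearest(sender: str, others: List[str], k: int = 3) -> float:
    """Average numeric distance from `sender` to its k nearest addresses in
    `others`, with IPv4 addresses mapped to integers. Returns infinity when
    `others` is empty."""
    target = int(ipaddress.IPv4Address(sender))
    values = sorted(int(ipaddress.IPv4Address(ip)) for ip in others)
    i = bisect.bisect_left(values, target)
    # The k nearest values lie within k positions on either side of i.
    window = values[max(0, i - k): i + k]
    nearest = sorted(abs(v - target) for v in window)[:k]
    return sum(nearest) / len(nearest) if nearest else float("inf")


# Hypothetical sender set.
others = ["10.0.0.8", "10.0.0.11", "10.0.0.20", "192.0.2.1"]
print(mean_distance_to_k_nearest("10.0.0.10", others, k=2))
```

A small value means the sender sits in a densely populated region of the address space, which the slide associates with spammers.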

Local Time of Day at Sender

Spammers “peak” at different local times of day

Combining Features: RuleFit

• Put features into the RuleFit classifier

• 10-fold cross validation on one day of query logs from a large spam filtering appliance provider

• Comparable performance to Spamhaus
  – Incorporating into the system can further reduce FPs

• Using only network-level features

• Completely automated

Benefits of Whitelisting

Whitelisting top 50 ASes: False positives reduced to 0.14%
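
The effect of an AS whitelist on the false positive rate can be illustrated with a small calculation. This is a sketch under stated assumptions: messages are hypothetical (AS number, spam score, ground truth) tuples, the threshold is arbitrary, and the AS numbers come from the range reserved for documentation. It shows the mechanism only and does not reproduce the 0.14% figure.

```python
from typing import AbstractSet, List, Tuple

Message = Tuple[int, float, bool]  # (sender AS number, spam score, is_spam ground truth)


def false_positive_rate(
    messages: List[Message],
    threshold: float,
    whitelist: AbstractSet[int] = frozenset(),
) -> float:
    """Share of legitimate messages flagged as spam. Messages from
    whitelisted ASes are never flagged, whatever their score."""
    legit = [m for m in messages if not m[2]]
    if not legit:
        return 0.0
    flagged = sum(
        1 for asn, score, _ in legit if asn not in whitelist and score >= threshold
    )
    return flagged / len(legit)


# Hypothetical scored messages.
msgs = [
    (64500, 0.90, False),
    (64500, 0.80, False),
    (64501, 0.20, False),
    (64503, 0.10, False),
    (64502, 0.95, True),
]
print(false_positive_rate(msgs, 0.5))
print(false_positive_rate(msgs, 0.5, whitelist={64500}))
```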

Outline

• Understanding the network-level behavior
  – What behaviors do spammers have?
  – How well do existing techniques work?

• Building classifiers using network-level features
  – Key challenge: Which features to use?
  – Algorithms: SpamTracker and SNARE

• System (SpamSpotter)
  – Dynamism: Behavior itself can change
  – Scale: Lots of email messages (and spam!) out there

Deployment: Real-Time Blacklist

Approach:

• As mail arrives, lookups are received at the blacklist (BL)

• Queries provide a proxy for sending behavior

• Train based on received data

• Return score

Design Choice: Augment DNSBL

• Expressive queries
  – Spamhaus: $ dig 55.102.90.62.zen.spamhaus.org
    • Ans: 127.0.0.3 (=> listed in exploits block list)
  – SpamSpotter: $ dig receiver_ip.receiver_domain.sender_ip.rbl.gtnoise.net
    • e.g., dig 120.1.2.3.gmail.com.-.1.1.207.130.rbl.gtnoise.net
    • Ans: 127.1.3.97 (SpamSpotter score = -3.97)

• Also a source of data
  – Unsupervised algorithms work with unlabeled data
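
The query names in the examples above can be built with a few lines of string handling. Conventional DNSBLs take the sender IP with its octets reversed, followed by the zone name. For the SpamSpotter form, the sketch assumes (from the single example on the slide, not from a specification) that the receiver IP is written as given, that a "-" label separates receiver and sender, and that the sender IP is octet-reversed; the function names are hypothetical.

```python
import ipaddress


def reverse_octets(ip: str) -> str:
    """Return an IPv4 address with its octets in reverse order."""
    ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    return ".".join(reversed(ip.split(".")))


def dnsbl_name(sender_ip: str, zone: str) -> str:
    """Conventional DNSBL query name: reversed sender IP, then the zone."""
    return f"{reverse_octets(sender_ip)}.{zone}"


def spamspotter_name(
    receiver_ip: str, receiver_domain: str, sender_ip: str, zone: str = "rbl.gtnoise.net"
) -> str:
    """Query name in the style of the slide's example.
    Assumption: the layout is inferred from that one example only."""
    return f"{receiver_ip}.{receiver_domain}.-.{reverse_octets(sender_ip)}.{zone}"


print(dnsbl_name("62.90.102.55", "zen.spamhaus.org"))
print(spamspotter_name("120.1.2.3", "gmail.com", "130.207.1.1"))
```

Encoding the receiver in the query name is what makes the query expressive: the list operator learns who is receiving mail from whom, not just which sender is being checked.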

Challenges

• Scalability: How to collect and aggregate data, and form the signatures without imposing too much overhead?

• Dynamism: When to retrain the classifier, given that sender behavior changes?

• Reliability: How should the system be replicated to better defend against attack or failure?

• Evasion resistance: Can the system still detect spammers when they are actively trying to evade?

Latency

Performance overhead is negligible.

Sampling

Relatively small samples can achieve low false positive rates

Possible Improvements

• Accuracy
  – Synthesizing multiple classifiers
  – Incorporating user feedback
  – Learning algorithms with bounded false positives

• Performance
  – Caching/Sharing
  – Streaming

• Security
  – Learning in adversarial environments

Spam Filtering: Summary

• Spam increasing, spammers becoming agile
  – Content filters are falling behind
  – IP-based blacklists are evadable
    • Up to 30% of spam not listed in common blacklists at receipt. ~20% remains unlisted after a month

• Complementary approach: behavioral blacklisting based on network-level features
  – Key idea: Blacklist based on how messages are sent
  – SNARE: Automated sender reputation
    • ~90% of the accuracy of existing blacklists, using lightweight features
  – SpamSpotter: Putting it together in an RBL system
  – SpamTracker: Spectral clustering
    • Catches significant amounts of spam faster than existing blacklists

Phishing and Scams

• Scammers host Web sites on dynamic scam hosting infrastructure
  – Use DNS to redirect users to different sites when the location of the sites moves

• State of the art: Blacklist URL

• Our approach: Blacklist based on network-level fingerprints

Konte et al., “Dynamics of Online Scam Hosting Infrastructure”, PAM 2009

Online Scams

• Often advertised in spam messages

• URLs point to various point-of-sale sites

• These scams continue to be a menace
  – As of August 2007, one in every 87 emails constituted a phishing attack

• Scams often hosted on bullet-proof domains

• Problem: Study the dynamics of online scams, as seen at a large spam sinkhole

Online Scam Hosting is Dynamic

• A URL received in an email message may point to different sites over time

• Maintains agility as sites are shut down, blacklisted, etc.

• One mechanism for hosting sites: fast flux

Mechanism for Dynamics: “Fast Flux”

Source: HoneyNet Project

Summary of Findings

• What are the rates and extents of change?
  – Different from legitimate load balancing
  – Different across different scam campaigns

• How are dynamics implemented?
  – Many scam campaigns change DNS mappings at all three locations in the DNS hierarchy
    • A, NS, IP address of NS record

• Conclusion: Might be able to detect based on monitoring the dynamic behavior of URLs

Data Collection Method

• Three months of spamtrap data
  – 384 scam hosting domains
  – 21 unique scam campaigns

• Baseline comparison: Alexa “top 500” Web sites

Top 3 Spam Campaigns

• Some campaigns hosted by thousands of IPs

• Most scam domains exhibit some type of flux

• Sharing of IP addresses across different roles (authoritative NS and scam hosting)

Rates of Change

• How (and how quickly) do DNS-record mappings change?

• Rates of change are much faster than for legitimate load-balanced sites.
  – Scam domains change on shorter intervals than their TTL values.

• Domains for different scam campaigns exhibit different rates of change.
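
The time between record changes can be derived from repeated DNS lookups of the same name. This is a minimal sketch, assuming a hypothetical list of (timestamp, set of A-record IPs) observations gathered by periodic polling; the measured intervals can be no finer than the polling period, and the helper names are illustrative.

```python
from typing import FrozenSet, List, Tuple

Observation = Tuple[float, FrozenSet[str]]  # (unix_time, A-record IPs in that lookup)


def change_intervals(observations: List[Observation]) -> List[float]:
    """Seconds between successive changes of the record set. The first
    observation starts the clock; each later observation whose record set
    differs from the previous one closes an interval."""
    intervals: List[float] = []
    previous = None
    last_change = 0.0
    for ts, records in sorted(observations, key=lambda o: o[0]):
        if previous is None:
            last_change = ts
        elif records != previous:
            intervals.append(ts - last_change)
            last_change = ts
        previous = records
    return intervals


def fraction_shorter_than_ttl(intervals: List[float], ttl_s: float) -> float:
    """Share of change intervals that are shorter than the record's TTL."""
    if not intervals:
        return 0.0
    return sum(1 for i in intervals if i < ttl_s) / len(intervals)


# Hypothetical lookups taken every 300 seconds.
obs = [
    (0.0, frozenset({"192.0.2.1"})),
    (300.0, frozenset({"192.0.2.1"})),
    (600.0, frozenset({"198.51.100.2"})),
    (900.0, frozenset({"198.51.100.2", "203.0.113.3"})),
]
print(change_intervals(obs))
```

Running the same computation per campaign, and over a baseline of legitimate load-balanced sites, gives the comparison described in the bullets above.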

Rates of Change

• Domains that exhibit fast flux change more rapidly than legitimate domains

• Rates of change are inconsistent with actual TTL values

Rates of change are much faster than for legitimate load-balanced sites.

Time Between Record Changes

Fast-flux domains tend to change much more frequently than legitimately hosted sites

Rates of Change by Campaign

Domains for different scam campaigns exhibit different rates of change.

Rates of Accumulation

• How quickly do scams accumulate new IP addresses?

• Rates of accumulation differ across campaigns

• Some scams only begin accumulating IP addresses after some time
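
An accumulation curve is the running count of distinct IP addresses seen for a domain or campaign. The sketch below assumes a hypothetical list of (timestamp, IPs returned) observations; the function name is illustrative.

```python
from typing import Iterable, List, Set, Tuple


def cumulative_unique_ips(
    observations: List[Tuple[float, Iterable[str]]]
) -> List[Tuple[float, int]]:
    """Return (time, number of distinct IPs seen so far) after each
    observation, in time order."""
    seen: Set[str] = set()
    curve: List[Tuple[float, int]] = []
    for ts, ips in sorted(observations, key=lambda o: o[0]):
        seen.update(ips)
        curve.append((ts, len(seen)))
    return curve


# Hypothetical observations for one scam domain.
obs = [
    (0.0, ["192.0.2.1"]),
    (1.0, ["192.0.2.1", "192.0.2.2"]),
    (2.0, ["192.0.2.2"]),
    (3.0, ["198.51.100.1", "203.0.113.1"]),
]
print(cumulative_unique_ips(obs))
```

A flat stretch followed by a steep rise in this curve corresponds to a scam that only begins accumulating addresses after some time.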

Rates of Accumulation

Location

• Where in IP address space do hosts for scam sites operate?

• Scam networks use a different portion of the IP address space than legitimate sites
  – 30/8 – 60/8: lots of legitimate sites, no scam sites

• Sites that host scam domains (both sites and authoritative DNS) are more widely distributed than those for legitimate sites

Location: Many Distinct Subnets

Scam sites appear in many more distinct networks than legitimate load-balanced sites.
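
Counting distinct networks for a set of hosting IPs is straightforward with Python's standard `ipaddress` module. This is a minimal sketch, assuming that "distinct network" means a distinct /24 prefix; the slides do not state the granularity used, so the prefix length is a parameter.

```python
import ipaddress
from typing import Iterable, Set


def distinct_subnets(ips: Iterable[str], prefix_len: int = 24) -> Set[ipaddress.IPv4Network]:
    """Return the set of distinct IPv4 networks of the given prefix length
    that contain at least one of the addresses."""
    return {
        ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        for ip in ips
    }


# Hypothetical hosting IPs for one domain.
hosts = ["192.0.2.1", "192.0.2.200", "198.51.100.7", "203.0.113.9"]
print(len(distinct_subnets(hosts)))
```

Comparing this count between scam domains and a baseline of load-balanced legitimate sites gives the kind of contrast described above.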

Conclusion

• Scam campaigns rely on a dynamic hosting infrastructure

• Studying the dynamics of that infrastructure may help us develop better detection methods

• Dynamics
  – Rates of change differ from legitimate sites, and differ across campaigns
  – Dynamics implemented at all levels of DNS hierarchy

• Location
  – Scam sites distributed across distinct subnets

Data: http://www.gtnoise.net/scam/fast-flux.html

TR: http://www.cc.gatech.edu/research/reports/GT-CS-08-07.pdf

References

• Anirudh Ramachandran and Nick Feamster, “Understanding the Network-Level Behavior of Spammers”, ACM SIGCOMM, August 2006

• Anirudh Ramachandran, Nick Feamster, and Santosh Vempala, “Filtering Spam with Behavioral Blacklisting”, ACM CCS, November 2007

• Shuang Hao, Nick Feamster, Alex Gray, and Sven Krasser, “SNARE: Spatio-temporal Network-level Automatic Reputation Engine”, USENIX Security, August 2009

• Maria Konte, Nick Feamster, and Jaeyeon Jung, “Dynamics of Online Scam Hosting Infrastructure”, Passive and Active Measurement, April 2009
