Spam and Botnets: Characterization and Mitigation

Nick Feamster

Anirudh Ramachandran, David Dagon

Georgia Tech

Talk Overview

• Network-level behavior of spammers
  – Ultimate goal: Construct spam filters based on network-level properties, rather than content
  – Content-based properties are malleable
    • Low cost to evasion: Spammers can easily alter content
    • High admin cost: Filters must be continually updated
  – Content-based filters are applied at the destination
    • Too little, too late: Wasted network bandwidth, storage, etc.

• Study of DNS-based blacklists

• “Discovery”: One of the most telling network-level properties is botnet membership
  – DNSBL Counter-Intelligence
  – Network monitoring

Network-Level Behavior of Spammers: Major Findings

• Where does spam come from?
  – Most received from few regions of IP address space
  – Insight about spammier prefixes could improve filters

• Do spammers hijack routes?
  – A small set of spammers continually advertise short-lived routes
  – Traceability is not guaranteed

• How is spam sent?
  – Most coming from Windows hosts (likely, bots)
  – Identification of spamming groups (e.g., botnets) could help

Data Collection

• Two domains instrumented with MailAvenger (both on same network)
  – Sinkhole domain #1
    • Continuous spam collection since Aug 2004
    • No real email addresses: sink everything
    • 10 million+ pieces of spam

• Legitimate mail corpus from a large email provider (40 million inboxes)

• Monitoring BGP route advertisements from same network

Mail Collection: MailAvenger

• Highly configurable SMTP server that collects many useful statistics

BGP Data Collection

[Figure: BGP data collection diagram; recoverable labels: MX 1, MX 2]

Spam Study: Major Findings



• How is spam sent?
  – Most coming from Windows hosts (likely, bots)
  – Identification of spamming groups (e.g., botnets) could help

What IP ranges does spam come from?

[Figure: fraction (y-axis) vs. /24 prefix (x-axis)]

Spam comes from a few concentrated regions of IP address space
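The concentration claim above can be checked by bucketing sender addresses by their /24 prefix. The following is a minimal sketch, not the authors' analysis code; the function name `spam_fraction_by_slash24` and the sample addresses (taken from documentation ranges) are assumptions made for illustration.

```python
from collections import Counter
import ipaddress


def spam_fraction_by_slash24(sender_ips):
    """Return {/24 prefix: fraction of messages} for a list of IPv4 strings."""
    counts = Counter(
        # strict=False lets us pass a host address and get its enclosing /24.
        str(ipaddress.ip_network(f"{ip}/24", strict=False))
        for ip in sender_ips
    )
    total = sum(counts.values())
    return {prefix: n / total for prefix, n in counts.items()}


# Illustrative addresses only (documentation ranges), not measured data.
sample = ["192.0.2.10", "192.0.2.77", "192.0.2.200", "198.51.100.5"]
fractions = spam_fraction_by_slash24(sample)
```

Sorting the resulting fractions in descending order would show how much of the total a handful of prefixes accounts for.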

Distribution across ASes

[Tables: Top 10 ASes by Spam Count; Top 10 ASes by Legit Email Count]

Points to note

• Top two spamming ASes: 10% of received spam
• ASes in the US: most spam
• Top ASes for legitimate email are different

BGP Spectrum Agility

• Log IP addresses of SMTP relays
• Join with BGP route advertisements seen at network where spam trap is co-located

A small club of persistent players appears to be using this technique.

Common short-lived prefixes and ASes

Prefix        AS
61.0.0.0/8    4678
66.0.0.0/8    21562
82.0.0.0/8    8717

~ 10 minutes

Somewhere between 1-10% of all spam (some clearly intentional, others might be flapping)
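The join described on this slide (SMTP relay addresses against route advertisements) can be sketched as follows. This is an illustration under stated assumptions, not the authors' tooling: the `Announcement` record, the function name, the 600-second threshold (chosen to mirror the "~ 10 minutes" annotation), and all timestamps and sender addresses are hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Announcement:
    prefix: str       # announced prefix, e.g. "61.0.0.0/8"
    announced: float  # time the route appeared (seconds)
    withdrawn: float  # time the route was withdrawn (seconds)


def spam_from_short_lived_routes(spam_log, announcements, max_lifetime=600.0):
    """Return (ip, time, prefix) for spam whose sender fell inside a prefix
    that was announced for at most max_lifetime seconds and was live when
    the message arrived."""
    short = [
        (ipaddress.ip_network(a.prefix), a)
        for a in announcements
        if a.withdrawn - a.announced <= max_lifetime
    ]
    hits = []
    for ip, received in spam_log:
        addr = ipaddress.ip_address(ip)
        for net, a in short:
            if addr in net and a.announced <= received <= a.withdrawn:
                hits.append((ip, received, a.prefix))
                break
    return hits


# Hypothetical timestamps and senders; only the prefixes come from the slide.
routes = [
    Announcement("61.0.0.0/8", 1000.0, 1500.0),   # lives 500 s: short-lived
    Announcement("66.0.0.0/8", 0.0, 100000.0),    # long-lived in this example
]
spam = [("61.10.20.30", 1200.0), ("61.10.20.30", 5000.0), ("66.1.1.1", 50.0)]
hits = spam_from_short_lived_routes(spam, routes)
```

A linear scan over announcements is fine for a sketch; a real trace would likely need a prefix trie or interval index.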

Why Such Big Prefixes?

• Flexibility: Client IPs can be scattered throughout dark space within a large /8
  – Same sender usually returns with different IP addresses

• Visibility: Route typically won’t be filtered (nice and short)

Characteristics of spamming bots

• Distribution across IP space for bots
  – Similar to IP space distribution for all spam
  – Lower bot activity in ranges where spam also comes from hijacked routes

• Operating Systems of Spamming Hosts
  – ~95% run Windows
  – The 4% of hosts that are Unix-based send up to 8% of the spam

Most Bots Send Low Volumes of Spam

[Figure: amount of spam vs. lifetime (seconds); annotation: 99% of bots]

Most bot IP addresses send very little spam, regardless of how long they have been spamming…

Most Bot IP addresses are quiet

65% of bots only send mail to a domain once over 18 months

Blacklists may want to target IP ranges, rather than individual IPs

[Figure: percentage of bots vs. lifetime (seconds)]
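To illustrate why listing IP ranges can catch senders that a per-IP list misses, here is a minimal sketch. The class name `PrefixBlacklist` and the addresses are hypothetical, and the slides do not prescribe any particular data structure.

```python
import ipaddress


class PrefixBlacklist:
    """Blacklist keyed on IP ranges rather than individual addresses."""

    def __init__(self, prefixes):
        self.networks = [ipaddress.ip_network(p) for p in prefixes]

    def is_listed(self, ip):
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in self.networks)


# A bot that sends once and never returns defeats a per-IP list:
per_ip_list = {"192.0.2.10"}                  # learned from one past message
range_list = PrefixBlacklist(["192.0.2.0/24"])

new_sender = "192.0.2.99"                     # never seen before, same /24
caught_by_ip = new_sender in per_ip_list
caught_by_range = range_list.is_listed(new_sender)
```

The trade-off, not shown here, is that a range listing can also block legitimate hosts in the same prefix.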

Take-Away Lessons

• Network-level properties are less malleable, and are observable closer to the source of spam

• Aggregate properties (e.g., IP prefix, ASN, route used, etc.) may be more effective

• Some network-level properties can be incorporated into spam filters
  – Could be used as a first-pass filter

• Spam filtering requires a better notion of end-host identity

• Securing the Internet routing infrastructure is key to traceability

Network-Level Spam Filtering

Redefining End-Host Identifiers

What about DNS-Based Blacklists?

• DNS-based Blacklists (DNSBLs)
  – The most prevalent network-level spam filtering mechanism today
  – Various criteria: open relays/proxies, virus senders, bad/unused address spaces, etc.
  – Hundreds of DNSBLs of all sizes

• How to measure the effectiveness of DNSBLs?
  – Completeness
  – Responsiveness
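For readers unfamiliar with the mechanism, a DNSBL lookup is an ordinary DNS query: the IPv4 octets are reversed and prepended to the blacklist's zone, and an answer in 127.0.0.0/8 conventionally means "listed". The sketch below shows that convention; the zone `dnsbl.example.org` is a placeholder rather than a real blacklist, and the helper names are assumptions.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name used to look up an IPv4 address in a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone):
    """Resolve the query name; an answer in 127.0.0.0/8 means 'listed'.

    Any resolution failure is treated as 'not listed' here, which also
    hides transient DNS errors; a production client would distinguish them.
    """
    try:
        answer = socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False
    return answer.startswith("127.")


# Placeholder zone; substitute a real DNSBL zone to perform live lookups.
name = dnsbl_query_name("192.0.2.99", "dnsbl.example.org")
```

Because every lookup is a DNS query, the blacklist operator can observe who is asking about which address.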

Questions about DNSBLs

• What is the completeness of the DNSBL?

• What is the responsiveness of the DNSBL?
  – How many distinct domains are targeted by a spamming host before it is blacklisted?

• Does frequency of spam from a host change after it is blacklisted?

Blacklisting: Completeness

[Figure: fraction of all spam received vs. number of DNSBLs listing this spammer]

~80% listed on average

~95% of bots listed in one or more blacklists

Only about half of the IPs spamming from short-lived BGP routes are listed in any blacklist

Spam from IP-agile senders tends to be listed in fewer blacklists
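Completeness, as used on this slide, can be computed as the fraction of spamming IPs that appear in at least some number of blacklists. A minimal sketch follows; the blacklist names `bl-a` and `bl-b`, the addresses, and the function name are invented for illustration and do not reproduce the measured figures.

```python
def completeness(listings, min_lists=1):
    """Fraction of spammer IPs listed in at least `min_lists` blacklists.

    listings maps each spammer IP to the set of DNSBL names that list it.
    """
    if not listings:
        return 0.0
    listed = sum(1 for lists in listings.values() if len(lists) >= min_lists)
    return listed / len(listings)


# Invented data for illustration only.
observed = {
    "192.0.2.1": {"bl-a", "bl-b"},
    "192.0.2.2": {"bl-a"},
    "192.0.2.3": set(),          # spamming, but listed nowhere
    "192.0.2.4": {"bl-b"},
}
in_any = completeness(observed, min_lists=1)
in_two = completeness(observed, min_lists=2)
```

Sweeping `min_lists` from 1 to the number of blacklists consulted gives the kind of curve the figure plots.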

Are IP-Based Blacklists Enough?

• Mail Avenger is very aggressive
  – Eight different blacklists

• Cloaking techniques complicate detection
  – For example, what if a bot could change IP addresses and remain reachable?
    • LAN agility
    • BGP agility

A Model of Responsiveness

• Response Time
  – Difficult to calculate without “ground truth”
  – Can still estimate lower bound

[Figure: Lifecycle of a spamming host. Timeline labels along the Time axis: Infection, S-Day, Possible Detection Opportunity, RBL Listing, with the Response Time interval marked]

Measuring Responsiveness

• Data
  – 1.5 days' worth of packet captures of DNSBL queries from a mirror of Spamhaus
  – 46 days of pcaps from a hijacked C&C for a Bobax botnet; overlaps with DNSBL queries

• Method
  – Monitor DNSBL for lookups for known Bobax hosts
    • Look for first query
    • Look for the first time a query response had a ‘listed’ status
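The two-step method above (first query, then first 'listed' response) can be sketched as a single pass over a query log. The record layout `(timestamp, ip, listed)` and the function name are assumptions; the actual traces were packet captures that would first need parsing. Without ground truth about infection time, the result is only a rough estimate of responsiveness.

```python
def observed_listing_delay(queries):
    """Seconds between the first observed DNSBL query for an IP and the
    first query whose response said 'listed', for IPs that became listed.

    queries: iterable of (timestamp, ip, listed) tuples.
    """
    first_seen = {}
    first_listed = {}
    for ts, ip, listed in sorted(queries):   # process in time order
        first_seen.setdefault(ip, ts)
        if listed:
            first_listed.setdefault(ip, ts)
    return {ip: first_listed[ip] - first_seen[ip] for ip in first_listed}


# Invented log entries for illustration.
log = [
    (10.0, "192.0.2.1", False),
    (50.0, "192.0.2.1", True),
    (20.0, "192.0.2.2", False),   # never listed during the trace
    (30.0, "192.0.2.1", False),
]
delays = observed_listing_delay(log)
```

IPs absent from the result were queried but never seen as listed within the trace window.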

Responsiveness: Preliminary Results

• Observed 81,950 DNSBL queries for 4,295 (out of over 2 million) Bobax IPs

• Only 255 (6%) of these Bobax IPs were blacklisted through the end of the Bobax trace (46 days)
  – 88 IPs became listed during the 1.5 day DNSBL trace
  – 34 of these were listed after a single detection opportunity

Both responsiveness and completeness appear to be low. Much room for improvement.

Domains Performing Lookups

• Over 60% are queried by just one IP/AS
  – Hypothesis: Decreased chances of being reported

So…What can be Done?

• Network-level behavior of spammers
  – Ultimate goal: Construct spam filters based on network-level properties, rather than content
  – Content-based properties are malleable
    • Low cost to evasion: Spammers can easily alter content
    • High admin cost: Filters must be continually updated
  – Content-based filters are applied at the destination
    • Too little, too late: Wasted network bandwidth, storage, etc.

• Study of DNS-based blacklists

• “Discovery”: One of the most telling network-level properties is botnet membership
  – DNSBL Counter-Intelligence
  – Network monitoring

Mitigation #1: Counter-Intelligence

• Botmasters advertise spamming bots that are not listed in any blacklist.

• Insight: Someone must be looking up the bots!

• Can we fish out these DNSBL “reconnaissance” queries and identify subjects/targets as suspect?

Legit Queries vs. Reconnaissance

• Legitimate queriers are also the targets of queries

• Reconnaissance queriers are usually not queried themselves

[Figure: two diagrams. First: Legit Mail Server A (mx.a.com) and Legit Mail Server B (mx.b.com) send email to each other, and the DNS-based blacklist receives lookups for both mx.a.com and mx.b.com. Second: a reconnaissance host queries the DNS-based blacklist.]
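The distinction drawn here suggests a simple first-cut heuristic: flag queriers that never appear as the subject of anyone else's lookup. The sketch below assumes a log of `(querier_ip, queried_ip)` pairs; the function name and addresses are hypothetical, and small or new legitimate mail servers could also match, so the output is a candidate set only.

```python
def reconnaissance_candidates(lookups):
    """Return queriers that are never themselves looked up by anyone."""
    lookups = list(lookups)                 # allow any iterable, scan twice
    queriers = {querier for querier, _ in lookups}
    queried = {target for _, target in lookups}
    return queriers - queried


# Invented example: two mail servers exchange mail and look each other up;
# a third host only asks about other addresses.
pairs = [
    ("192.0.2.1", "192.0.2.2"),       # server A looks up server B
    ("192.0.2.2", "192.0.2.1"),       # server B looks up server A
    ("203.0.113.9", "198.51.100.1"),  # one-way lookups
    ("203.0.113.9", "198.51.100.2"),
]
candidates = reconnaissance_candidates(pairs)
```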

Measurement Approach

• Log Spamhaus queries

• Construct querier/queried graph

• Prune graph: only nodes in the Bobax trace

• Examine nodes with high out-degree
  – Hypothesis: targets of nodes with high out-degree likely bots
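The steps above can be sketched as a small graph computation. This is one possible reading of the pruning step, not the authors' implementation: a querier is kept if it is a known bot or looked up at least one known bot, and its remaining targets are reported as new suspects. The function name, the `min_out_degree` threshold, and all addresses are assumptions.

```python
from collections import defaultdict


def suspect_targets(lookups, known_bots, min_out_degree=3):
    """Map each high-out-degree querier tied to the seed set to the targets
    it looked up that are not already known bots."""
    graph = defaultdict(set)                 # querier -> set of queried IPs
    for querier, queried in lookups:
        graph[querier].add(queried)

    suspects = {}
    for querier, targets in graph.items():
        touches_seed = querier in known_bots or bool(targets & known_bots)
        if touches_seed and len(targets) >= min_out_degree:
            suspects[querier] = targets - known_bots
    return suspects


# Invented data: the seed set stands in for IPs from a botnet trace.
seed = {"198.51.100.1", "198.51.100.2"}
edges = [
    ("203.0.113.9", "198.51.100.1"),
    ("203.0.113.9", "198.51.100.2"),
    ("203.0.113.9", "198.51.100.3"),   # not in the seed: a new suspect
    ("192.0.2.1", "192.0.2.2"),        # unrelated low-degree lookup
]
found = suspect_targets(edges, seed)
```

Holding the whole graph in memory is only workable for small traces, which matches the cost concern raised later in the deck.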

Who’s Doing the Lookups?

• The botmaster, on behalf of the bots
• The bots, on behalf of themselves
• The bots, on behalf of each other

[Figure: lookup graph; recoverable labels: Spam Sinkhole, Known Bobax drone!]

Implication: Use a “seed” set to bootstrap?

Some Problems with Counter-Intel

• Constructing the query graph is intensive
  – Computationally
  – Storage-wise

• Initially pruning the graph with IP addresses of known suspects (e.g., spammers) could help

Mitigation: Network Monitoring

• In-network filtering
  – Requires the ability to detect botnets

• Question: Can we detect botnets by observing communication structure among hosts?

Example: Migration between command and control hosts

New type of problem: essentially coupon collection

How good are current traffic sampling techniques at exposing these patterns?
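To make the coupon-collection analogy concrete, the sketch below treats each distinct conversation (host pair) as a coupon. Both the mapping and the modelling choices are assumptions, not taken from the slides: the first function assumes every draw is equally likely to hit any of the n conversations, and the second assumes independent per-packet sampling.

```python
def expected_draws_to_see_all(n):
    """Classic coupon-collector expectation: n * (1 + 1/2 + ... + 1/n)."""
    return n * sum(1.0 / k for k in range(1, n + 1))


def prob_conversation_seen(packets, sampling_rate):
    """Probability that independent per-packet sampling at `sampling_rate`
    captures at least one packet of a conversation of `packets` packets."""
    return 1.0 - (1.0 - sampling_rate) ** packets


# Seeing all of 3 equally likely conversations takes 5.5 draws on average.
draws_for_three = expected_draws_to_see_all(3)

# A 1-packet conversation under 1% sampling is seen with probability 0.01,
# which is why short control exchanges are easy to miss.
short_flow = prob_conversation_seen(1, 0.01)
```

Real traffic is far from uniform, so rare conversations take much longer to observe than the uniform expectation suggests.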

Experimental Setup

(Preliminary) Results

Feasible sampling rates

Conventional sampling techniques are not well-suited to collecting conversations

Summary Lessons

• Network-level spam filtering holds promise
  – Potentially a useful complement to content-based filters
  – Today’s DNSBLs aren’t doing the trick

• Two critical pieces
  – Monitoring techniques (which might be used together)
    • In-network (e.g., with better traffic monitoring techniques)
    • At the edge (e.g., DNSBL reconnaissance)
  – Routing security

• “Clean-slate” wish list
  – Better notions of identity
  – More agile monitoring/sampling techniques
