Spam and Botnets: Characterization and Mitigation

Post on 01-Jan-2016










Spam and Botnets: Characterization and Mitigation

Nick Feamster
Anirudh Ramachandran
David Dagon
Georgia Tech


Talk Overview

• Network-level behavior of spammers
  – Ultimate goal: Construct spam filters based on network-level properties, rather than content
  – Content-based properties are malleable
    • Low cost to evasion: Spammers can easily alter content
    • High admin cost: Filters must be continually updated
  – Content-based filters are applied at the destination
    • Too little, too late: Wasted network bandwidth, storage, etc.

• Study of DNS-based blacklists

• “Discovery”: One of the most telling network-level properties is botnet membership
  – DNSBL Counter-Intelligence
  – Network monitoring


Network-Level Behavior of Spammers: Major Findings

• Where does spam come from?
  – Most received from few regions of IP address space
  – Insight about spammier prefixes could improve filters

• Do spammers hijack routes?
  – A small set of spammers continually advertise short-lived routes
  – Traceability is not guaranteed

• How is spam sent?
  – Most coming from Windows hosts (likely, bots)
  – Identification of spamming groups (e.g., botnets) could help


Data Collection

• Two domains instrumented with MailAvenger (both on same network)
  – Sinkhole domain #1
    • Continuous spam collection since Aug 2004
    • No real email addresses: sink everything
    • 10 million+ pieces of spam

• Legitimate mail corpus from a large email provider (40 million inboxes)

• Monitoring BGP route advertisements from same network


Mail Collection: MailAvenger

• Highly configurable SMTP server that collects many useful statistics


BGP Data Collection

[Figure: collection setup; labels: MX 1, MX 2]




What IP ranges does spam come from?

[Figure: x-axis: /24 prefix]

Spam comes from a few concentrated regions of IP address space
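The per-prefix view on this slide can be sketched in a few lines. This is a minimal illustration and not the authors' tooling: the function name `spam_by_slash24` and the sample addresses (taken from documentation ranges) are assumptions made here.

```python
from collections import Counter
import ipaddress

def spam_by_slash24(sender_ips):
    """Count received messages per /24 prefix of the sending IP."""
    counts = Counter()
    for ip in sender_ips:
        # strict=False masks the host bits, giving the enclosing /24.
        prefix = ipaddress.ip_network(f"{ip}/24", strict=False)
        counts[str(prefix)] += 1
    return counts

# Hypothetical sender addresses, one entry per received message.
senders = ["192.0.2.10", "192.0.2.77", "192.0.2.200", "198.51.100.5"]
counts = spam_by_slash24(senders)
for prefix, n in counts.most_common():
    print(prefix, n)
```

Sorting the resulting counts by prefix rather than by volume would give the kind of address-space distribution the figure plots.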


Distribution across ASes

Points to note:

• Top two spamming ASes: 10% of received spam
• ASes in the US: most spam
• Top ASes for legitimate email are different

[Tables: Top 10 ASes by Spam Count; Top 10 ASes by Legit Email Count]




BGP Spectrum Agility

• Log IP addresses of SMTP relays
• Join with BGP route advertisements seen at network where spam trap is co-located

A small club of persistent players appears to be using this technique.

[Table: Common short-lived prefixes and ASes; recoverable entries: AS 4678, AS 21562, 82.0.0.0/8 (AS 8717)]

[Figure annotation: ~ 10 minutes]

Somewhere between 1-10% of all spam (some clearly intentional, others might be flapping)
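The join between the SMTP relay log and the BGP advertisements might look like the sketch below. It rests on several assumptions: the `Route` record, the function name, and the sample data are hypothetical, and the 600-second cutoff only mirrors the "~ 10 minutes" annotation on the slide rather than a threshold stated by the authors.

```python
import ipaddress
from dataclasses import dataclass

@dataclass
class Route:
    prefix: str        # e.g. "203.0.113.0/24"
    announced: float   # announcement time, seconds
    withdrawn: float   # withdrawal time, seconds

def spam_on_short_lived_routes(spam_log, routes, max_lifetime=600.0):
    """Return the (ip, timestamp) spam records that arrived while a covering
    route with lifetime <= max_lifetime seconds was being announced."""
    short = [
        (ipaddress.ip_network(r.prefix), r)
        for r in routes
        if r.withdrawn - r.announced <= max_lifetime
    ]
    hits = []
    for ip, ts in spam_log:
        addr = ipaddress.ip_address(ip)
        for net, r in short:
            if addr in net and r.announced <= ts <= r.withdrawn:
                hits.append((ip, ts))
                break  # one matching route is enough for this record
    return hits

# Hypothetical data: one short-lived route and one long-lived route.
routes = [
    Route("203.0.113.0/24", 1000.0, 1500.0),
    Route("198.51.100.0/24", 0.0, 100000.0),
]
spam_log = [
    ("203.0.113.9", 1200.0),   # inside the short-lived window
    ("203.0.113.9", 2000.0),   # after the route was withdrawn
    ("198.51.100.7", 1200.0),  # covered only by a long-lived route
]
hits = spam_on_short_lived_routes(spam_log, routes)
print(hits)
```

A lifetime cutoff alone cannot separate intentional short-lived announcements from route flapping, which is the caveat the slide raises.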


Why Such Big Prefixes?

• Flexibility: Client IPs can be scattered throughout dark space within a large /8
  – Same sender usually returns with different IP

• Visibility: Route typically won’t be filtered (nice and short)




Characteristics of spamming bots

• Distribution across IP space for bots
  – Similar to IP space distribution for all spam
  – Lower bot activity in ranges where spam also comes from hijacked routes

• Operating systems of spamming hosts
  – ~95% run Windows
  – The ~4% of hosts that are Unix-based send up to 8% of the spam


Most Bots Send Low Volumes of Spam

[Figure: x-axis: Lifetime (seconds); annotation: 99% of bots]

Most bot IP addresses send very little spam, regardless of how long they have been spamming…


Most Bot IP addresses are quiet

65% of bots only send mail to a domain once over 18 months

Blacklists may want to target IP ranges, rather than individual IPs

[Figure: x-axis: Lifetime (seconds)]

Take-Away Lessons

• Network-level properties are less malleable, and are observable closer to the source of spam

• Aggregate properties (e.g., IP prefix, ASN, route used) may be more effective than individual IP addresses

• Some network-level properties can be incorporated into spam filters
  – Could be used as a first-pass filter

• Spam filtering requires a better notion of end-host identity

• Securing the Internet routing infrastructure is key to traceability

Network-Level Spam Filtering

Redefining End-Host Identifiers

• DNS-based Blacklists (DNSBLs)
  – The most prevalent network-level spam filtering mechanism today
  – Various criteria: open relays/proxies, virus senders, bad/unused address spaces, etc.
  – Hundreds of DNSBLs of all sizes

• How to measure the effectiveness of DNSBLs?
  – Completeness
  – Responsiveness
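For readers unfamiliar with the mechanism, a DNSBL is queried by reversing the octets of the IPv4 address and appending the list's DNS zone. The sketch below builds that query name and computes one plausible reading of "completeness" (the fraction of known spamming IPs that the list contains). The zone `dnsbl.example.org`, the function names, and this definition of completeness are assumptions for illustration, not taken from the talk.

```python
import ipaddress

def dnsbl_query_name(ip, zone):
    """Build the DNS name used to look up an IPv4 address in a DNSBL:
    the address octets in reverse order, followed by the list's zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone

def completeness(spammer_ips, listed_ips):
    """Fraction of known spamming IPs that appear in the blacklist."""
    spammers = set(spammer_ips)
    if not spammers:
        return 0.0
    return len(spammers & set(listed_ips)) / len(spammers)

name = dnsbl_query_name("192.0.2.99", "dnsbl.example.org")
print(name)  # 99.2.0.192.dnsbl.example.org

# Hypothetical sets: four spamming IPs, three of which are listed.
spammers = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]
listed = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
ratio = completeness(spammers, listed)
print(ratio)  # 0.75
```

An actual lookup would resolve the generated name as an A record and treat a successful answer as "listed"; that network step is left out so the sketch stays self-contained.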

What about DNS-Based Blacklists?

• What is the completeness of the DNSBL?

• What is the responsiveness of the DNSBL?
  – How many distinct domains are targeted by a spamming host before it is blacklisted?

• Does frequency of spam from a host change after it is blacklisted?

Questions about DNSBLs


Blacklisting: Completeness

~80% listed on average

~95% of bots listed in one or more blacklists

[Figure: x-axis: Number of DNSBLs listing this spammer]

Only about half of the IPs spamming from short-lived BGP routes are listed in any blacklist

Spam from IP-agile senders tends to be listed in fewer blacklists


Are IP-Based Blacklists Enough?

• Mail Avenger is very aggressive
  – Eight different blacklists

• Cloaking techniques complicate detection
  – For example, what if a bot could change IP addresses and remain reachable?
    • LAN agility
    • BGP agility

• Response Time
  – Difficult to calculate without “ground truth”
  – Can still estimate lower bound



[Figure: Lifecycle of a spamming host; labels: Possible Detection Opportunity, RBL Listing, Response Time]

A Model of Responsiveness

• Data
  – 1.5 days’ worth of packet captures of DNSBL queries from a mirror of Spamhaus
  – 46 days of pcaps from a hijacked C&C for a Bobax botnet; overlaps with DNSBL queries

• Method
  – Monitor DNSBL for lookups for known Bobax hosts
    • Look for first query
    • Look for the first time a query response had a ‘listed’ status
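The two method steps (first query, first ‘listed’ response) can be sketched as below. The log format `(timestamp, ip, listed)` and the function name are assumptions; the real traces were packet captures. The gap computed here is only a lower bound on response time, since a host may have been spamming before the first observed query.

```python
def response_time_lower_bounds(query_log):
    """query_log: iterable of (timestamp, ip, listed) tuples, where `listed`
    is True if the DNSBL answered that the IP was listed.
    Returns {ip: seconds from the first query to the first 'listed'
    response} for every IP that was ever reported as listed
    (0 if the IP was already listed when first queried)."""
    first_query = {}
    first_listed = {}
    for ts, ip, listed in sorted(query_log):
        first_query.setdefault(ip, ts)       # earliest query for this IP
        if listed:
            first_listed.setdefault(ip, ts)  # earliest 'listed' answer
    return {ip: first_listed[ip] - first_query[ip] for ip in first_listed}

# Hypothetical log: one IP becomes listed, another never does.
log = [
    (100.0, "192.0.2.1", False),
    (400.0, "192.0.2.1", False),
    (900.0, "192.0.2.1", True),
    (200.0, "192.0.2.2", False),
]
bounds = response_time_lower_bounds(log)
print(bounds)  # {'192.0.2.1': 800.0}
```

IPs that never receive a ‘listed’ answer are absent from the result, which matches the slide's point that response time cannot be calculated for them without ground truth.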

Measuring Responsiveness

• Observed 81,950 DNSBL queries for 4,295 (out of over 2 million) Bobax IPs

• Only 255 (6%) Bobax IPs were blacklisted through the end of the Bobax trace (46 days)
  – 88 IPs became listed during the 1.5 day DNSBL trace
  – 34 of these were listed after a single detection opportunity

Both responsiveness and completeness appear to be low. Much room for improvement.
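The percentages follow directly from the counts on the slide. The short check below assumes that the 6% is taken over the 4,295 queried Bobax IPs and that the 34 is a subset of the 88, which is how the bullets read; the variable names are illustrative.

```python
queried_ips = 4295        # Bobax IPs seen in DNSBL queries
blacklisted = 255         # of those, listed by the end of the 46-day trace
listed_in_trace = 88      # became listed during the 1.5-day DNSBL trace
single_opportunity = 34   # of the 88, listed after one detection opportunity

frac_blacklisted = blacklisted / queried_ips
frac_single = single_opportunity / listed_in_trace
print(f"{frac_blacklisted:.1%} of queried Bobax IPs were blacklisted")       # 5.9%
print(f"{frac_single:.1%} of newly listed IPs took a single opportunity")  # 38.6%
```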

Responsiveness: Preliminary Results


• Over 60% are queried by just one IP/AS
  – Hypothesis: Decreased chances of being reported

Domains Performing Lookups


So…What can be Done?



Mitigation #1: Counter-Intelligence

• Botmasters advertise spamming bots that are not listed in any blacklist.

• Insight: Someone must be looking up the bots!

• Can we fish out these DNSBL “reconnaissance” queries and identify subjects/targets as suspect?


Legit Queries vs. Reconnaissance

• Legitimate queriers are also the targets of queries

• Reconnaissance queriers are usually not queried themselves

[Figure: querying patterns; labels: Legit Mail Server, Legit Mail Server B, Reconnaissance host, “email to”]


Measurement Approach

• Log Spamhaus queries

• Construct querier/queried graph

• Prune graph: only nodes in the Bobax trace

• Examine nodes with high out-degree
  – Hypothesis: targets of nodes with high out-degree are likely bots
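The measurement steps above can be sketched as a small graph computation. Several choices here are assumptions rather than the authors' method: the pruning rule (keep an edge if either endpoint is a known bot), the suspect rule (out-degree at least 2 and never queried), the function name, and the sample addresses.

```python
from collections import defaultdict

def degree_profile(queries, known_bots):
    """queries: iterable of (querier_ip, queried_ip) pairs from a DNSBL log.
    Keeps only edges that touch a known bot (the pruning step) and returns
    {node: (out_degree, in_degree)} counted over distinct neighbours."""
    out_nbrs = defaultdict(set)
    in_nbrs = defaultdict(set)
    for querier, queried in queries:
        if querier in known_bots or queried in known_bots:
            out_nbrs[querier].add(queried)
            in_nbrs[queried].add(querier)
    nodes = set(out_nbrs) | set(in_nbrs)
    return {n: (len(out_nbrs[n]), len(in_nbrs[n])) for n in nodes}

# Hypothetical log: one host looks up three known bots; two mail servers
# look each other up, and one of them also looks up a bot.
known_bots = {"192.0.2.10", "192.0.2.11", "192.0.2.12"}
queries = [
    ("203.0.113.1", "192.0.2.10"),
    ("203.0.113.1", "192.0.2.11"),
    ("203.0.113.1", "192.0.2.12"),
    ("198.51.100.1", "198.51.100.2"),
    ("198.51.100.2", "198.51.100.1"),
    ("198.51.100.1", "192.0.2.10"),
]
profile = degree_profile(queries, known_bots)

# Suspected reconnaissance hosts: many lookups issued, none received.
suspects = sorted(n for n, (out, inn) in profile.items() if out >= 2 and inn == 0)
print(suspects)  # ['203.0.113.1']
```

Under the slide's hypothesis, the addresses looked up by a suspect are themselves candidate bots, so a known seed set can be used to grow the list.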


Who’s Doing the Lookups?

• The botmaster, on behalf of the bots
• The bots, on behalf of themselves
• The bots, on behalf of each other

[Figure: labels: Spam Sinkhole, Known Bobax drone!]

Implication: Use a “seed” set to bootstrap?


Some Problems with Counter-Intel

• Constructing the query graph is intensive
  – Computationally
  – Storage-wise

• Initially pruning the graph with IP addresses of known suspects (e.g., spammers) could help


Mitigation: Network Monitoring

• In-network filtering
  – Requires the ability to detect botnets

• Question: Can we detect botnets by observing communication structure among hosts?

Example: Migration between command and control hosts

New type of problem: essentially coupon collection

How good are current traffic sampling techniques at exposing these patterns?
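To make the coupon-collection analogy concrete: if each sampled packet reveals one of n distinct conversations uniformly at random, the expected number of samples needed to see all n at least once is n times the n-th harmonic number. Mapping conversations to coupons and assuming uniform sampling are simplifications made here; skewed real traffic would need more samples.

```python
from fractions import Fraction

def expected_draws(n):
    """Expected number of uniform random draws needed to see all n
    distinct items at least once: n * H_n (classic coupon collector)."""
    harmonic = sum(Fraction(1, k) for k in range(1, n + 1))
    return float(n * harmonic)

for n in (10, 100, 1000):
    print(n, round(expected_draws(n), 1))
```

The growth is roughly n ln n, so the number of samples needed rises faster than the number of conversations to be observed.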


Experimental Setup


(Preliminary) Results

Feasible sampling rates

Conventional sampling techniques are not well-suited to collecting conversations


Summary Lessons

• Network-level spam filtering holds promise
  – Potentially a useful complement to content-based filters
  – Today’s DNSBLs aren’t doing the trick

• Two critical pieces
  – Monitoring techniques (which might be used together)
    • In-network (e.g., with better traffic monitoring techniques)
    • At the edge (e.g., DNSBL reconnaissance)
  – Routing security

• “Clean-slate” wish list
  – Better notions of identity
  – More agile monitoring/sampling techniques
