Spam and Botnets: Characterization and Mitigation Nick Feamster Anirudh Ramachandran David Dagon Georgia Tech
Jan 01, 2016
Spam and Botnets: Characterization and Mitigation
Nick Feamster, Anirudh Ramachandran, David Dagon
Georgia Tech
Talk Overview
• Network-level behavior of spammers
  – Ultimate goal: Construct spam filters based on network-level properties, rather than content
  – Content-based properties are malleable
    • Low cost to evasion: Spammers can easily alter content
    • High admin cost: Filters must be continually updated
  – Content-based filters are applied at the destination
    • Too little, too late: Wasted network bandwidth, storage, etc.
• Study of DNS-based blacklists
• “Discovery”: One of the most telling network-level properties is botnet membership
  – DNSBL Counter-Intelligence
  – Network monitoring
Network-Level Behavior of Spammers: Major Findings
• Where does spam come from?
  – Most received from few regions of IP address space
  – Insight about spammier prefixes could improve filters
• Do spammers hijack routes?
  – A small set of spammers continually advertise short-lived routes
  – Traceability is not guaranteed
• How is spam sent?
  – Most coming from Windows hosts (likely, bots)
  – Identification of spamming groups (e.g., botnets) could help
Data Collection
• Two domains instrumented with MailAvenger (both on same network)
  – Sinkhole domain #1
    • Continuous spam collection since Aug 2004
    • No real email addresses: sink everything
    • 10 million+ pieces of spam
• Legitimate mail corpus from a large email provider (40 million inboxes)
• Monitoring BGP route advertisements from same network
Mail Collection: MailAvenger
• Highly configurable SMTP server that collects many useful statistics
BGP Data Collection
[Figure: collection setup diagram; recoverable labels: MX 1, MX 2]
What IP ranges does spam come from?
[Figure: x-axis: /24 prefix; y-axis: Fraction]
Spam comes from a few concentrated regions of IP address space
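One way such a per-prefix view could be computed from a sinkhole log is sketched below. This is a minimal illustration, not the authors' tooling: the function name and the sample addresses (taken from documentation ranges) are assumptions.

```python
from collections import Counter
import ipaddress


def spam_fraction_by_slash24(sender_ips):
    """Map each /24 prefix to its share of messages.

    sender_ips holds one sender IP per received message (hypothetical input format).
    """
    counts = Counter(
        str(ipaddress.ip_network(f"{ip}/24", strict=False)) for ip in sender_ips
    )
    total = sum(counts.values())
    return {prefix: n / total for prefix, n in counts.items()}


# Made-up sample: three messages from one /24, one from another.
sample = ["192.0.2.10", "192.0.2.77", "192.0.2.200", "198.51.100.5"]
fractions = spam_fraction_by_slash24(sample)
```

Sorting the resulting fractions by prefix would give a curve of the kind the slide plots; a few prefixes holding most of the mass is the concentration the slide describes.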
Distribution across ASes
[Tables: Top 10 ASes by Spam Count; Top 10 ASes by Legit Email Count]
Points to note
• Top two spamming ASes: 10% of received spam
• ASes in the US: most spam
• Top ASes for legitimate email are different
BGP Spectrum Agility
• Log IP addresses of SMTP relays
• Join with BGP route advertisements seen at network where spam trap is co-located
A small club of persistent players appears to be using this technique.
Common short-lived prefixes and ASes
  61.0.0.0/8   AS 4678
  66.0.0.0/8   AS 21562
  82.0.0.0/8   AS 8717
[Figure annotation: ~10 minutes]
Somewhere between 1-10% of all spam (some clearly intentional, others might be flapping)
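The join described on this slide could look roughly like the sketch below. It assumes announcements are available as (prefix, origin AS, announce time, withdraw time) tuples and spam as (sender IP, receive time) pairs. These formats, the function name, the timestamps, and the 600-second threshold (borrowed from the "~10 minutes" annotation) are all assumptions for illustration.

```python
import ipaddress


def spam_on_short_lived_routes(spam_log, announcements, max_lifetime=600):
    """Return spam records received while a short-lived route covered the sender.

    spam_log: iterable of (sender_ip, timestamp)
    announcements: iterable of (prefix, origin_as, announced_at, withdrawn_at)
    max_lifetime: routes lasting longer than this many seconds are ignored
    """
    short_lived = [
        (ipaddress.ip_network(prefix), prefix, origin_as, start, end)
        for prefix, origin_as, start, end in announcements
        if end - start <= max_lifetime
    ]
    hits = []
    for ip, ts in spam_log:
        addr = ipaddress.ip_address(ip)
        for net, prefix, origin_as, start, end in short_lived:
            # The sender must fall inside the prefix while the route was announced.
            if addr in net and start <= ts <= end:
                hits.append((ip, ts, prefix, origin_as))
    return hits


# Prefixes and ASes are from the slide; the timestamps are made up.
announcements = [
    ("61.0.0.0/8", 4678, 1000.0, 1500.0),    # lasts 500 s: short-lived
    ("66.0.0.0/8", 21562, 0.0, 100000.0),    # long-lived: ignored
]
spam_log = [("61.1.2.3", 1200.0), ("61.1.2.3", 5000.0), ("66.9.9.9", 50.0)]
hits = spam_on_short_lived_routes(spam_log, announcements)
```

Only the first record matches here: the second arrives after the route is withdrawn, and the third is covered by a long-lived route. A sketch like this cannot by itself tell intentional short-lived announcements from flapping, which the slide flags as an open ambiguity.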
Why Such Big Prefixes?
• Flexibility: Client IPs can be scattered throughout dark space within a large /8
  – Same sender usually returns with different IP addresses
• Visibility: Route typically won’t be filtered (nice and short)
Characteristics of spamming bots
• Distribution across IP space for bots
  – Similar to IP space distribution for all spam
  – Lower bot activity in ranges where spam also comes from hijacked routes
• Operating Systems of Spamming Hosts
  – ~95% run Windows
  – The 4% of hosts that are Unix-based send up to 8% of the spam
Most Bots Send Low Volumes of Spam
[Figure: x-axis: Lifetime (seconds); y-axis: Amount of Spam; annotation: 99% of bots]
Most bot IP addresses send very little spam, regardless of how long they have been spamming…
Most Bot IP addresses are quiet
[Figure: x-axis: Lifetime (seconds); y-axis: Percentage of bots]
65% of bots only send mail to a domain once over 18 months
Blacklists may want to target IP ranges, rather than individual IPs
Take-Away Lessons
• Network-level properties are less malleable, and are observable closer to the source of spam
• Aggregate properties (e.g., IP prefix, ASN, route used, etc.) may be more effective
• Some network-level properties can be incorporated into spam filters
  – Could be used as a first-pass filter
• Spam filtering requires a better notion of end-host identity
• Securing the Internet routing infrastructure is key to traceability
Network-Level Spam Filtering
Redefining End-Host Identifiers
What about DNS-Based Blacklists?
• DNS-based Blacklists (DNSBLs)
  – The most prevalent network-level spam filtering mechanism today
  – Various criteria: open relays/proxies, virus senders, bad/unused address spaces, etc.
  – Hundreds of DNSBLs of all sizes
• How to measure the effectiveness of DNSBLs?
  – Completeness
  – Responsiveness
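For readers unfamiliar with the mechanism: a DNSBL is queried by reversing the octets of the sender's IPv4 address and looking that name up under the blacklist's zone. An address record in the answer means "listed"; a name-not-found answer means "not listed". The sketch below only builds the query name and performs no lookup. The function name and the zone `dnsbl.example.org` are placeholders, not a real service.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Build the DNS name a mail server would look up to check `ip` against `zone`."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


# Placeholder zone; a real deployment would use the zone of its chosen DNSBL.
name = dnsbl_query_name("192.0.2.99", "dnsbl.example.org")
```

Because each check is an ordinary DNS query, whoever operates or mirrors the blacklist can observe who is looking up whom.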
Questions about DNSBLs
• What is the completeness of the DNSBL?
• What is the responsiveness of the DNSBL?
  – How many distinct domains are targeted by a spamming host before it is blacklisted?
• Does frequency of spam from a host change after it is blacklisted?
Blacklisting: Completeness
[Figure: x-axis: Number of DNSBLs listing this spammer; y-axis: Fraction of all spam received]
~80% listed on average
~95% of bots listed in one or more blacklists
Only about half of the IPs spamming from short-lived BGP routes are listed in any blacklist
Spam from IP-agile senders tends to be listed in fewer blacklists
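A completeness curve of this kind could be derived by counting, for each received message, how many blacklists list its sender. The sketch below assumes blacklist membership is already known as plain sets of IPs; the function name, blacklist names, and addresses are made up for illustration.

```python
from collections import Counter


def listing_count_distribution(spam_senders, blacklists):
    """Return {k: fraction of messages whose sender appears in exactly k blacklists}.

    spam_senders: one sender IP per received message
    blacklists: dict mapping a blacklist name to the set of IPs it lists
    """
    per_message = Counter(
        sum(ip in listed for listed in blacklists.values()) for ip in spam_senders
    )
    total = len(spam_senders)
    return {k: n / total for k, n in sorted(per_message.items())}


# Hypothetical data: two blacklists, four messages from three senders.
blacklists = {
    "bl-one": {"192.0.2.1", "192.0.2.2"},
    "bl-two": {"192.0.2.1"},
}
senders = ["192.0.2.1", "192.0.2.1", "192.0.2.2", "192.0.2.3"]
distribution = listing_count_distribution(senders, blacklists)
```

The value at k = 0 is the share of spam that no blacklist would have caught, which is the quantity of interest for senders on short-lived routes.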
Are IP-Based Blacklists Enough?
• Mail Avenger is very aggressive
  – Eight different blacklists
• Cloaking techniques complicate detection
  – For example, what if a bot could change IP addresses and remain reachable?
    • LAN agility
    • BGP agility
A Model of Responsiveness
• Response Time
  – Difficult to calculate without “ground truth”
  – Can still estimate lower bound
[Figure: Lifecycle of a spamming host. Time axis marking Infection, S-Day, Possible Detection Opportunity, and RBL Listing, with Response Time indicated]
Measuring Responsiveness
• Data
  – 1.5 days' worth of packet captures of DNSBL queries from a mirror of Spamhaus
  – 46 days of pcaps from a hijacked C&C for a Bobax botnet; overlaps with DNSBL queries
• Method
  – Monitor DNSBL for lookups for known Bobax hosts
    • Look for first query
    • Look for the first time a query response had a ‘listed’ status
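The method above can be sketched as follows, assuming the DNSBL trace has already been parsed into (timestamp, queried IP, listed?) records. That record format, the function name, and the sample values are assumptions. The result is a lower bound because the host may have started spamming before its first observed query.

```python
def response_time_lower_bounds(queries, known_bots):
    """For each known bot seen as listed, return seconds from its first observed
    query to the first response with a 'listed' status.

    queries: iterable of (timestamp, queried_ip, listed) records
    known_bots: set of IPs known to belong to the botnet
    """
    first_query = {}
    first_listed = {}
    for ts, ip, listed in sorted(queries):  # process in time order
        if ip not in known_bots:
            continue
        first_query.setdefault(ip, ts)      # keep only the earliest query
        if listed:
            first_listed.setdefault(ip, ts)  # keep only the earliest listing
    return {ip: first_listed[ip] - first_query[ip] for ip in first_listed}


# Made-up trace: one bot becomes listed, one never does, one IP is not a bot.
trace = [
    (400, "192.0.2.1", True),
    (100, "192.0.2.1", False),
    (50, "192.0.2.2", False),
    (10, "203.0.113.9", True),
]
bounds = response_time_lower_bounds(trace, {"192.0.2.1", "192.0.2.2"})
```

Bots that never appear as listed are omitted from the result; counting them separately gives the completeness side of the picture.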
Responsiveness: Preliminary Results
• Observed 81,950 DNSBL queries for 4,295 (out of over 2 million) Bobax IPs
• Only 255 (6%) Bobax IPs were blacklisted through the end of the Bobax trace (46 days)
  – 88 IPs became listed during the 1.5 day DNSBL trace
  – 34 of these were listed after a single detection opportunity
Both responsiveness and completeness appear to be low. Much room for improvement.
Domains Performing Lookups
• Over 60% are queried by just one IP/AS
  – Hypothesis: Decreased chances of being reported
So…What can be Done?
• Network-level behavior of spammers
  – Ultimate goal: Construct spam filters based on network-level properties, rather than content
  – Content-based properties are malleable
    • Low cost to evasion: Spammers can easily alter content
    • High admin cost: Filters must be continually updated
  – Content-based filters are applied at the destination
    • Too little, too late: Wasted network bandwidth, storage, etc.
• Study of DNS-based blacklists
• “Discovery”: One of the most telling network-level properties is botnet membership
  – DNSBL Counter-Intelligence
  – Network monitoring
Mitigation #1: Counter-Intelligence
• Botmasters advertise spamming bots that are not listed in any blacklist.
• Insight: Someone must be looking up the bots!
• Can we fish out these DNSBL “reconnaissance” queries and identify subjects/targets as suspect?
Legit Queries vs. Reconnaissance
• Legitimate queriers are also the targets of queries
• Reconnaissance queriers are usually not queried themselves
[Figure, legitimate mail: Legit Mail Server A (mx.a.com) and Legit Mail Server B (mx.b.com) send email to each other ("email to mx.a.com", "email to mx.b.com"); the DNS-Based Blacklist receives "lookup mx.a.com" and "lookup mx.b.com"]
[Figure, reconnaissance: a Reconnaissance host queries the DNS-Based Blacklist]
Measurement Approach
• Log Spamhaus queries
• Construct querier/queried graph
• Prune graph: only nodes in the Bobax trace
• Examine nodes with high out-degree
  – Hypothesis: targets of nodes with high out-degree likely bots
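The steps above can be sketched as a small graph computation. This assumes the query log is a list of (querier, queried) pairs and interprets "prune" as keeping only edges whose target is a known bot; both the input format and that interpretation are assumptions, as are the function and host names.

```python
from collections import defaultdict


def rank_queriers(query_log, known_bots):
    """Rank queriers by how many distinct known bots they looked up.

    Returns (querier, out_degree, in_degree) tuples, highest out-degree first.
    in_degree counts how many distinct hosts looked up the querier itself;
    a reconnaissance host is expected to have a high out-degree and a low in-degree.
    """
    out_edges = defaultdict(set)  # querier -> known bots it queried (pruned graph)
    in_edges = defaultdict(set)   # host -> queriers that looked it up (full log)
    for querier, target in query_log:
        in_edges[target].add(querier)
        if target in known_bots:
            out_edges[querier].add(target)
    ranked = [
        (q, len(targets), len(in_edges.get(q, ())))
        for q, targets in out_edges.items()
    ]
    return sorted(ranked, key=lambda r: (-r[1], r[2]))


# Made-up log: "recon" looks up three bots and is never queried itself;
# mail servers "mx-a" and "mx-b" look each other up.
log = [
    ("recon", "bot1"), ("recon", "bot2"), ("recon", "bot3"),
    ("mx-a", "mx-b"), ("mx-b", "mx-a"), ("mx-a", "bot1"),
]
ranking = rank_queriers(log, {"bot1", "bot2", "bot3"})
```

Holding one set per node is what makes the full graph expensive in memory; pruning to known suspects early keeps the sets small.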
Who’s Doing the Lookups?
• The botmaster, on behalf of the bots
• The bots, on behalf of themselves
• The bots, on behalf of each other
[Figure labels: Spam Sinkhole; Known Bobax drone!]
Implication: Use a “seed” set to bootstrap?
Some Problems with Counter-Intel
• Constructing the query graph is intensive
  – Computationally
  – Storage-wise
• Initially pruning the graph with IP addresses of known suspects (e.g., spammers) could help
Mitigation: Network Monitoring
• In-network filtering
  – Requires the ability to detect botnets
• Question: Can we detect botnets by observing communication structure among hosts?
Example: Migration between command and control hosts
New type of problem: essentially coupon collection
How good are current traffic sampling techniques at exposing these patterns?
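The coupon-collection framing can be made concrete with two standard formulas. The sketch assumes uniform, independent per-packet sampling and equally likely items; neither assumption comes from the talk, and the function names are illustrative.

```python
def prob_conversation_sampled(n_packets, rate):
    """Probability that at least one of n_packets is captured when each packet
    is sampled independently with probability `rate`."""
    return 1.0 - (1.0 - rate) ** n_packets


def expected_draws_to_see_all(n_items):
    """Expected number of uniform random draws needed to observe every one of
    n_items distinct items at least once (classic coupon collector: n * H_n)."""
    return n_items * sum(1.0 / k for k in range(1, n_items + 1))


# A 10-packet conversation under 1-in-100 sampling is usually missed.
p_short = prob_conversation_sampled(10, 0.01)
```

Under these assumptions, short conversations such as a bot briefly contacting a new command-and-control host are rarely captured at low sampling rates, and seeing every distinct conversation takes many more samples than there are conversations.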
Experimental Setup
(Preliminary) Results
Feasible sampling rates
Conventional sampling techniques are not well-suited to collecting conversations
Summary Lessons
• Network-level spam filtering holds promise
  – Potentially a useful complement to content-based filters
  – Today’s DNSBLs aren’t doing the trick
• Two critical pieces
  – Monitoring techniques (which might be used together)
    • In-network (e.g., with better traffic monitoring techniques)
    • At the edge (e.g., DNSBL reconnaissance)
  – Routing security
• “Clean-slate” wish list
  – Better notions of identity
  – More agile monitoring/sampling techniques