Clustering Spam MIT Spam Conference 2008 Phil Tom


Clustering Spam

MIT Spam Conference 2008

Phil Tom


Simple Clustering Algorithm

Expand clusters to include similar messages:

1. Identical originating IP addresses.

2. Identical subject lines.

3. Identical message bodies.

for each cluster in clusters
    expand cluster
for each message in unclustered messages
    create a new cluster
    add message to cluster
    expand cluster

Clustering pseudocode
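
The pseudocode above can be sketched in memory as follows. This is a minimal illustration, not the implementation used for the talk: the `Message` record and its `ip`, `subject`, and `body` fields are hypothetical names, and "expand cluster" is assumed to mean "keep absorbing messages that share an IP, subject, or body with the cluster until it stops growing".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    # Hypothetical record; field names are assumptions for this sketch.
    ip: str
    subject: str
    body: str
    cluster_id: Optional[int] = None


def expand_cluster(cluster_id, messages):
    """Absorb messages sharing an IP, subject, or body until the cluster stops growing."""
    members = [m for m in messages if m.cluster_id == cluster_id]
    ips = {m.ip for m in members}
    subjects = {m.subject for m in members}
    bodies = {m.body for m in members}
    grew = True
    while grew:
        grew = False
        for m in messages:
            if m.cluster_id == cluster_id:
                continue
            if m.ip in ips or m.subject in subjects or m.body in bodies:
                m.cluster_id = cluster_id  # may pull a message out of another cluster
                ips.add(m.ip)
                subjects.add(m.subject)
                bodies.add(m.body)
                grew = True


def cluster_messages(messages):
    """Expand existing clusters first, then seed a new cluster from each leftover message."""
    existing = sorted({m.cluster_id for m in messages if m.cluster_id is not None})
    for cluster_id in existing:
        expand_cluster(cluster_id, messages)
    next_id = max(existing, default=0) + 1
    for m in messages:
        if m.cluster_id is None:
            m.cluster_id = next_id
            expand_cluster(next_id, messages)
            next_id += 1
    return messages
```

Note that the sketch treats any identical subject or body as a match; it does not apply minimum-length filters.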


Dimensional Model


update sdbf_message
   set cluster_id = ?
 where (cluster_id <> ? or cluster_id is null)
   and sender_ip_id in (select sender_ip_id
                          from sdbf_message
                         where cluster_id = ?)

Expand Cluster By IP


update sdbf_message m
   set cluster_id = ?
  from sdbd_body b
 where (m.cluster_id <> ? or m.cluster_id is null)
   and m.body_id in (select body_id
                       from sdbf_message
                      where cluster_id = ?)
   and m.body_id = b.body_id
   and b.size_in_bytes > 25

Expand Cluster By Body


update sdbf_message m
   set cluster_id = ?
  from sdbd_subject s
 where (m.cluster_id <> ? or m.cluster_id is null)
   and m.subject_id in (select subject_id
                          from sdbf_message
                         where cluster_id = ?)
   and m.subject_id = s.subject_id
   and (s.word_count > 1 or length(s.subject) > 10)

Expand Cluster By Subject
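
One way the three updates might be combined into a single "expand cluster" step is to repeat them until a full pass changes no rows. The sketch below assumes this fixed-point loop, which the slides do not spell out. It runs on SQLite through Python's `sqlite3` module, so the `update ... from` joins are rewritten as equivalent `in (select ...)` filters. The `ASSUMED_SCHEMA` columns are inferred from the queries, with `message_id` added as a hypothetical key.

```python
import sqlite3

# Assumed minimal schema, inferred from the column names in the queries.
ASSUMED_SCHEMA = """
create table sdbd_body (body_id integer primary key, size_in_bytes integer);
create table sdbd_subject (subject_id integer primary key, subject text, word_count integer);
create table sdbf_message (
    message_id integer primary key,
    cluster_id integer,
    sender_ip_id integer,
    body_id integer,
    subject_id integer
);
"""

EXPAND_BY_IP = """
update sdbf_message set cluster_id = ?
 where (cluster_id <> ? or cluster_id is null)
   and sender_ip_id in (select sender_ip_id from sdbf_message where cluster_id = ?)
"""

EXPAND_BY_BODY = """
update sdbf_message set cluster_id = ?
 where (cluster_id <> ? or cluster_id is null)
   and body_id in (select body_id from sdbf_message where cluster_id = ?)
   and body_id in (select body_id from sdbd_body where size_in_bytes > 25)
"""

EXPAND_BY_SUBJECT = """
update sdbf_message set cluster_id = ?
 where (cluster_id <> ? or cluster_id is null)
   and subject_id in (select subject_id from sdbf_message where cluster_id = ?)
   and subject_id in (select subject_id from sdbd_subject
                       where word_count > 1 or length(subject) > 10)
"""


def expand_cluster(conn, cluster_id):
    """Run the three updates repeatedly until a pass moves no rows; return rows moved."""
    total = 0
    while True:
        changed = 0
        for sql in (EXPAND_BY_IP, EXPAND_BY_BODY, EXPAND_BY_SUBJECT):
            cur = conn.execute(sql, (cluster_id, cluster_id, cluster_id))
            changed += cur.rowcount
        if changed == 0:
            return total
        total += changed
```

The size and length filters keep very short bodies and one-word subjects from gluing unrelated messages together, which is why a message that shares only a subject like "hi" stays unclustered.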


Test Data Set

• Dec 22, 2007 - Dec 29, 2007

• Single “Received:” header tag only

• No multi-part messages

• 1.7 million messages

• Roughly 20%
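
The two structural filters in the list above could be applied with Python's standard `email` package. This is a sketch under the assumption that "single Received: header" means exactly one `Received:` header in the message; the function name `in_test_set` is hypothetical, and the date-range filter is left out.

```python
from email import message_from_string


def in_test_set(raw_message: str) -> bool:
    """Keep messages with exactly one Received: header and a non-multipart body."""
    msg = message_from_string(raw_message)
    received = msg.get_all("Received") or []  # get_all returns None when the header is absent
    return len(received) == 1 and not msg.is_multipart()
```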


Cluster Results

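 Min Cluster Size | Max Cluster Size | Clusters | Messages | % of Messages
------------------+------------------+----------+----------+---------------
                1 |               10 |    26610 |    64510 |          3.7%
               11 |              100 |     3221 |    79218 |          4.6%
              101 |             1000 |      156 |    39413 |          2.3%
             1001 |            10000 |       26 |    72786 |          4.2%
            10001 |           100000 |        2 |    37945 |          2.2%
           100001 |                  |        1 |  1436206 |         83.0%
           Totals |                  |    30016 |  1730078 |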
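
A breakdown like the table above can be produced from a list of cluster sizes by bucketing on powers of ten. The sketch below is illustrative only; `size_bucket` and `summarize` are hypothetical helper names, and the open-ended top bucket starting at 100001 mirrors the last row of the table.

```python
from collections import defaultdict

TOP_BUCKET_START = 100001  # the table's last row has no upper bound


def size_bucket(size):
    """Return (min, max) of the power-of-ten bucket for a size; max is None for the top bucket."""
    if size >= TOP_BUCKET_START:
        return (TOP_BUCKET_START, None)
    upper = 10
    while size > upper:
        upper *= 10
    lower = 1 if upper == 10 else upper // 10 + 1
    return (lower, upper)


def summarize(cluster_sizes):
    """Map each bucket to (clusters, messages, percent of all messages)."""
    counts = defaultdict(lambda: [0, 0])
    for size in cluster_sizes:
        entry = counts[size_bucket(size)]
        entry[0] += 1      # one more cluster in this bucket
        entry[1] += size   # its messages
    total = sum(cluster_sizes)
    return {
        bucket: (counts[bucket][0], counts[bucket][1],
                 round(100.0 * counts[bucket][1] / total, 1))
        for bucket in sorted(counts, key=lambda b: b[0])
    }
```

For example, `summarize([5, 5, 50, 40])` places two clusters with 10 messages in the 1 to 10 bucket and two clusters with 90 messages in the 11 to 100 bucket.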


Messages per Cluster Size*

[Chart. X axis: Cluster Size; Y axis: Sum of Messages.]

*Not including the big cluster


Top Clusters by IPs

 cluster_id | messages | subject | bodies |  ips   | networks | countries
------------+----------+---------+--------+--------+----------+-----------
          1 |  1436206 |   99836 | 330852 | 325660 |     8940 |       177
         62 |    26623 |     451 |  25992 |   1313 |       57 |         2
         59 |    11322 |      19 |     15 |    962 |        4 |         1
         68 |     1065 |       2 |   1065 |    609 |       12 |         4
         69 |     4476 |      59 |     85 |    514 |       17 |         1
      10477 |     5521 |       5 |      9 |    283 |        4 |         1
        953 |      722 |     149 |    333 |    275 |       16 |         1
        175 |      307 |       2 |    306 |    208 |      179 |        26
        379 |      240 |       7 |      9 |    184 |        4 |         1
      18219 |     5581 |      15 |   5212 |    153 |      119 |        26
       3924 |     2934 |      20 |   2934 |    150 |        1 |         1
        144 |      377 |      22 |    377 |    125 |        3 |         1
        242 |      307 |       4 |      3 |    124 |        5 |         1
        134 |     3399 |      48 |    169 |    114 |       17 |         1
        209 |      156 |       4 |    155 |    105 |       96 |        19
        198 |     1117 |     174 |   1100 |    101 |        4 |         1


The Big One

Cluster 1 summary

 messages | subject | bodies |  ips   | networks | countries
----------+---------+--------+--------+----------+-----------
  1436206 |   99836 | 330852 | 325660 |     8940 |       177

Top 10 countries by IP count

 messages | subjects | bodies |  ips  | networks | country_name
----------+----------+--------+-------+----------+----------------
   254948 |    30854 |  62772 | 27464 |     1453 | United States
    75969 |     5110 |  27366 | 27446 |      170 | Germany
   114328 |     6558 |  39312 | 26758 |      147 | Spain
    78378 |     4705 |  29291 | 25263 |       48 | Turkey
    91527 |     4624 |  29926 | 20930 |      209 | United Kingdom
    51708 |     3194 |  19983 | 16842 |       42 | Peru
    52652 |     2848 |  19644 | 15533 |      148 | Colombia
    39475 |     3059 |  13344 | 10129 |      152 | Chile
    34827 |     5063 |  12790 |  9664 |       12 | Brazil
    40144 |     4381 |  13368 |  9372 |      126 | Italy


Clustering the Big One

• Create clusters on subject and body

 messages | cluster_id |  ips   | subjects | bodies | category
----------+------------+--------+----------+--------+-----------------------
   740447 |      34641 | 131024 |       34 |    136 | fake watches
   111122 |      34643 |  79419 |      330 |  59166 | penis enlargement
    76521 |      34642 |  59112 |       27 |  55129 | online casino
    55421 |      34644 |  44772 |       55 |  25023 | fake name brand goods
    27789 |      34653 |   7190 |       81 |  16225 | viagra
    26815 |      34646 |  11099 |       20 |  19680 | valium
    25679 |      34656 |   5990 |    14846 |  25644 | online pharmacy
    12953 |      34649 |   3391 |       45 |      5 | stock investment
    12924 |      34645 |   4149 |        3 |      5 | porn
    12919 |      34648 |   3483 |        9 |  12332 | software
    10071 |      34650 |   9240 |       17 |   9273 | russian dating

1099737 messages, 284493 unique IPs


Clustering the Big One (cont)

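             | rolex | gambling | enlargement | knockoffs | porn | valium | software | stocks | dating | viagra
-------------+-------+----------+-------------+-----------+------+--------+----------+--------+--------+--------
 gambling    | 11820 |          |             |           |      |        |          |        |        |
 enlargement | 14869 |    20514 |             |           |      |        |          |        |        |
 knockoffs   |  9316 |    13173 |       14885 |           |      |        |          |        |        |
 porn        |  1779 |      873 |         925 |       705 |      |        |          |        |        |
 valium      |   245 |       67 |          94 |        57 |   14 |        |          |        |        |
 software    |   308 |       10 |          14 |         7 |    2 |      9 |          |        |        |
 stocks      |   719 |      783 |         895 |       641 |   63 |      3 |        0 |        |        |
 dating      |  2182 |     3058 |        3412 |      2106 |  189 |     14 |        0 |    175 |        |
 viagra      |    96 |       13 |           8 |         6 |    1 |     92 |        4 |      2 |      1 |
 pharmacy    |   123 |       30 |          35 |        17 |    1 |     89 |        4 |      2 |      8 |     52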

Number of overlapping IPs between clusters
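
An overlap matrix of this kind reduces to pairwise set intersections. The sketch below assumes the sender IPs of each cluster have already been collected into a set keyed by a cluster label; the function name `ip_overlap` and the input shape are hypothetical.

```python
from itertools import combinations


def ip_overlap(cluster_ips):
    """Count the IPs shared by each pair of clusters.

    cluster_ips maps a cluster label to the set of sender IPs seen in that cluster.
    Returns {(label_a, label_b): shared_ip_count} with labels in sorted order.
    """
    return {
        (a, b): len(cluster_ips[a] & cluster_ips[b])
        for a, b in combinations(sorted(cluster_ips), 2)
    }
```

Each unordered pair appears once, which corresponds to the lower triangle shown on the slide.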


Am I Bot or Not?

 cluster_id | messages | subjects | bodies |  ips  | networks | countries
------------+----------+----------+--------+-------+----------+-----------
         62 |    26623 |      451 |  25992 |  1313 |       57 |         2

• Subject content widely varied

• Many blocks of consecutive IPs

• Some blocks cover all or most of a /24

 messages | subjects | bodies |  ips  | networks | country_name
----------+----------+--------+-------+----------+---------------
     1246 |       87 |   1246 |     5 |        3 | Canada
    25377 |      443 |  24746 |  1308 |       54 | United States
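
The two IP observations for this cluster (runs of consecutive addresses, and /24 blocks that are mostly or entirely present) can be checked with Python's standard `ipaddress` module. This is a sketch for IPv4 addresses given as dotted strings; the helper names are hypothetical and this is not necessarily how the slide's observations were made.

```python
import ipaddress
from collections import Counter


def consecutive_runs(ips):
    """Return (first_ip, length) for each run of numerically adjacent IPv4 addresses."""
    nums = sorted({int(ipaddress.IPv4Address(ip)) for ip in ips})
    runs = []
    for n in nums:
        if runs and n == runs[-1][0] + runs[-1][1]:
            runs[-1][1] += 1          # extends the current run
        else:
            runs.append([n, 1])       # starts a new run
    return [(str(ipaddress.IPv4Address(start)), length) for start, length in runs]


def slash24_coverage(ips):
    """Map each /24 network to the fraction of its 256 addresses present in ips."""
    counts = Counter(
        ipaddress.ip_network(f"{ip}/24", strict=False) for ip in set(ips)
    )
    return {net: n / 256 for net, n in counts.items()}
```

A coverage value near 1.0 for a network would correspond to the "entire or most of a /24" case.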


Failure is Success

Delivery Notification cluster:

 cluster_id | messages | subject | bodies |  ips   | networks | countries
------------+----------+---------+--------+--------+----------+-----------
         68 |     1065 |       2 |   1065 |    609 |       12 |         4

Subject Detail

 messages | subject
----------+------------------
      613 | Delivery failure
      452 | failure delivery

• Delivery notification from legitimate mail servers

• Not clustered with spam or sources of spam


Chinese Spam

All Chinese messages

 messages | ips  | networks | clusters | country_name
----------+------+----------+----------+---------------
    92235 | 5179 |      197 |      922 | China
      139 |    2 |        1 |        2 | Thailand
       78 |   12 |        3 |        4 | United States
        5 |    4 |        1 |        2 | Germany

Top 10 Chinese Clusters

 cluster_id | messages | subject | bodies |  ips   | networks | countries
------------+----------+---------+--------+--------+----------+-----------
         59 |    11322 |      19 |     15 |    962 |        4 |         1
       3534 |     9987 |    1803 |      8 |     19 |        3 |         1
         12 |     8054 |       9 |      8 |     26 |        1 |         1
      10477 |     5521 |       5 |      9 |    283 |        4 |         1
         69 |     4476 |      59 |     85 |    514 |       17 |         1
        134 |     3399 |      48 |    169 |    114 |       17 |         1
        121 |     2347 |      10 |     10 |      1 |        1 |         1
        456 |     2187 |      21 |     73 |     41 |        6 |         1
         56 |     2047 |      29 |     45 |     61 |       14 |         1
       4621 |     1944 |       3 |      4 |      5 |        1 |         1


Small Clusters

• Varied subjects and bodies.

• Manual clustering of “online pharmacy” spam

Coalesced clusters:

 messages | ips  | subjects | bodies | clusters
----------+------+----------+--------+----------
    30333 | 9685 |    19453 |  30298 |     3651

Example subjects:

Buy sugar pills online cheap!!!!11one
Buy sugar pills online cheap!!!1cos(0)
Buy sugar pills online cheap!111pi^0
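
Subjects like these defeat exact matching because each carries a different trailing suffix. As an illustration only (the coalescing reported here was manual), a fuzzy comparison such as `difflib.SequenceMatcher` from the Python standard library could group them. The helper names and the 0.8 threshold are assumptions, not values from the talk.

```python
from difflib import SequenceMatcher


def subject_similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two subject lines, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_subjects(subjects, threshold=0.8):
    """Greedily put each subject in the first group whose first member is similar enough."""
    groups = []
    for s in subjects:
        for g in groups:
            if subject_similarity(s, g[0]) >= threshold:
                g.append(s)
                break
        else:
            groups.append([s])  # no existing group matched
    return groups
```

With this threshold the three example subjects fall into one group, while an unrelated subject such as "Delivery failure" starts its own. Comparing every subject against every group is quadratic in the worst case, so this would need a cheaper pre-filter at the scale of the full data set.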


What’s Next?

• Improve the similarity metrics

• Cluster a population or random sample

• Add time to the analysis