A False Positive Safe Neural Network for Spam Detection Alexandru Catalin Cosoi [email protected]


Jan 21, 2016


Page 1: A False Positive Safe Neural Network for Spam Detection

A False Positive Safe Neural Network for Spam Detection

Alexandru Catalin Cosoi

[email protected]

Page 2: A False Positive Safe Neural Network for Spam Detection

Does this look familiar?

Page 3: A False Positive Safe Neural Network for Spam Detection

Anatrim

Page 4: A False Positive Safe Neural Network for Spam Detection

Oh boy, it’s getting worse!!!

Page 5: A False Positive Safe Neural Network for Spam Detection

Oh boy, it’s getting worse!!!

Page 6: A False Positive Safe Neural Network for Spam Detection

Bad Bad Spammer!!!

• Databases:

– D: Random legitimate text

– D1: Different rephrasings of a certain spam phrase

– D2: Different rephrasings of another spam phrase

– …………………

– Dn: Different rephrasings of another spam phrase

• Create spam message script:

– Choose a random phrase from D1

– Choose random text from D

– Choose a random phrase from D2

– Choose random text from D

– …………….

– Choose a random phrase from Dn

• Send message.

• 40 samples of different subjects

• 50 samples of different titles

• 30 samples of different titles (part II)

• 60000 different combinations

Appeared as a consequence of botnets
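
The script on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not the spammers' actual tooling: the function names `make_spam_message` and `count_combinations`, and the way the databases are passed in as lists, are assumptions made for the example. The last function checks the slide's arithmetic: 40 x 50 x 30 = 60000.

```python
import math
import random


def make_spam_message(legit_texts, phrase_dbs, rng):
    """Interleave spam phrases (D1..Dn) with random legitimate text (D).

    legit_texts -- list of strings, plays the role of database D
    phrase_dbs  -- list of lists, plays the role of D1..Dn
    rng         -- a random.Random instance (hypothetical parameter)
    """
    parts = []
    for i, db in enumerate(phrase_dbs):
        # Choose a random rephrasing from the current spam database.
        parts.append(rng.choice(db))
        # Between two spam phrases, pad with random legitimate text.
        if i < len(phrase_dbs) - 1:
            parts.append(rng.choice(legit_texts))
    return " ".join(parts)


def count_combinations(sample_counts):
    """Number of distinct messages if one sample is picked per slot."""
    return math.prod(sample_counts)


# The slide's numbers: 40 subjects, 50 titles, 30 titles (part II).
total = count_combinations([40, 50, 30])  # 60000
```

Each generated message differs in wording while carrying the same payload, which is why a filter keyed on exact content struggles against this kind of wave.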

Page 7: A False Positive Safe Neural Network for Spam Detection

Features

• Larger time frame – KeyWord!!!!

• Weak features

– Words like “Anatrim”, “Viagra”, “Xanax”, “Stock”

– Simple word combinations like “Stock alert”, “Strong buy”

– Simple header heuristics (for both spam and ham) like: valid reply, weird message ID, forged headers

• Example:

– Top 500 spammy words from a Bayesian dictionary

– Some simple header heuristics from SpamAssassin’s SARE Ninjas

– Trainer’s personal flavour
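
A minimal sketch of how such weak features might be turned into a binary input vector for the network. The word and phrase lists come from the slide; the header flag names (`valid_reply`, `weird_message_id`, `forged_headers`) and the function `extract_features` are hypothetical, and the header checks themselves are assumed to be computed elsewhere.

```python
import re

WEAK_WORDS = ["anatrim", "viagra", "xanax", "stock"]
WEAK_PHRASES = ["stock alert", "strong buy"]
# Hypothetical names for the header heuristics mentioned on the slide.
HEADER_HEURISTICS = ["valid_reply", "weird_message_id", "forged_headers"]


def extract_features(subject, body, header_flags):
    """Return a list of 0/1 features: words, phrases, then header flags."""
    tokens = re.findall(r"[a-z0-9]+", f"{subject} {body}".lower())
    token_set = set(tokens)
    # Pad with spaces so phrases only match on whole-word boundaries.
    normalized = " " + " ".join(tokens) + " "

    features = [int(word in token_set) for word in WEAK_WORDS]
    features += [int(f" {phrase} " in normalized) for phrase in WEAK_PHRASES]
    features += [int(bool(header_flags.get(name))) for name in HEADER_HEURISTICS]
    return features
```

No single feature here is decisive on its own; the point of the slide is that many weak signals are combined by the network.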

Page 8: A False Positive Safe Neural Network for Spam Detection

Why ART?

• Training occurs by modifying the weights of each neuron

• With large amounts of data, important details may actually be forgotten

• Solves the stability-plasticity dilemma

• Based on template detection

• Unlimited number of templates involves unlimited number of patterns

• 2 self-organizing neural networks + a mapping module = supervised organizing neural network

Page 9: A False Positive Safe Neural Network for Spam Detection

Adaptive Resonance Theory

• Similar to a clustering algorithm (as many clusters as needed)

• ARTMAP = ARTa + ARTb + MapField
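
To make the ARTMAP composition concrete, here is a deliberately simplified sketch, not the presenter's implementation. It assumes binary inputs with ART1-style matching, collapses ARTb to plain class labels, and reduces the map field to a list that maps each ARTa category to a label. The class name `SimplifiedArtmap` and its parameters are hypothetical.

```python
def overlap(a, b):
    """Number of positions where both binary vectors are 1."""
    return sum(x & y for x, y in zip(a, b))


class SimplifiedArtmap:
    def __init__(self, baseline_vigilance=0.5, beta=1.0):
        self.rho = baseline_vigilance
        self.beta = beta
        self.templates = []  # ARTa: one binary template per category
        self.labels = []     # map field: category index -> class label

    def _ranked(self, pattern):
        # Order categories by the ART1-style choice function, best first.
        return sorted(
            range(len(self.templates)),
            key=lambda j: -overlap(pattern, self.templates[j])
            / (self.beta + sum(self.templates[j])),
        )

    def train_one(self, pattern, label):
        rho = self.rho
        size = sum(pattern) or 1
        for j in self._ranked(pattern):
            match = overlap(pattern, self.templates[j]) / size
            if match < rho:
                continue
            if self.labels[j] == label:
                # Resonance: shrink the template toward the shared bits.
                self.templates[j] = [x & y for x, y in zip(pattern, self.templates[j])]
                return j
            # Wrong class: raise vigilance just above this match (match tracking).
            rho = match + 1e-6
        # No acceptable category: create a new one for this pattern.
        self.templates.append(list(pattern))
        self.labels.append(label)
        return len(self.templates) - 1

    def predict(self, pattern):
        size = sum(pattern) or 1
        for j in self._ranked(pattern):
            if overlap(pattern, self.templates[j]) / size >= self.rho:
                return self.labels[j]
        return None  # nothing resonates: the network has no opinion
```

Because a new category is created whenever an existing one would predict the wrong class, earlier templates are not overwritten by later data, which is the stability-plasticity property the previous slide refers to.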

Page 10: A False Positive Safe Neural Network for Spam Detection

ART Vigilance

Small value: imprecise. Big value: fragmented.

• A big value: accepts only small errors; many small clusters; high precision

• A small value: accepts large errors; a few big clusters; errors can appear
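
The effect of vigilance on the number of clusters can be seen in a small sketch. It assumes binary inputs and a simplified ART1-style rule (match = shared bits / bits set in the input, fast learning by bitwise AND); the function name `art1_cluster` and the `beta` tie-breaking parameter are illustrative choices, not taken from the slides.

```python
def art1_cluster(patterns, vigilance, beta=1.0):
    """Cluster binary tuples; returns (templates, labels)."""
    templates = []
    labels = []
    for pattern in patterns:
        size = sum(pattern) or 1

        def shared(j):
            return sum(x & y for x, y in zip(pattern, templates[j]))

        # Try existing templates, most promising first.
        order = sorted(
            range(len(templates)),
            key=lambda j: -shared(j) / (beta + sum(templates[j])),
        )
        chosen = None
        for j in order:
            # Vigilance test: is the template close enough to the input?
            if shared(j) / size >= vigilance:
                templates[j] = [x & y for x, y in zip(pattern, templates[j])]
                chosen = j
                break
        if chosen is None:
            # Nothing passed the test: open a new cluster.
            templates.append(list(pattern))
            chosen = len(templates) - 1
        labels.append(chosen)
    return templates, labels


patterns = [
    (1, 1, 1, 0, 0, 0),
    (1, 1, 0, 1, 0, 0),
    (0, 0, 0, 1, 1, 1),
    (0, 0, 1, 0, 1, 1),
]
_, loose = art1_cluster(patterns, vigilance=0.5)   # labels [0, 0, 1, 1]
_, strict = art1_cluster(patterns, vigilance=0.9)  # labels [0, 1, 2, 3]
```

With the low setting the four patterns fall into two broad clusters; with the high setting each pattern gets its own cluster, which is the fragmentation the slide warns about.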

Page 11: A False Positive Safe Neural Network for Spam Detection

ART ++

Page 12: A False Positive Safe Neural Network for Spam Detection

Algorithm

Page 13: A False Positive Safe Neural Network for Spam Detection

Corpus

• 2.5 million spam messages (sampled from waves with a high degree of variation) and around 1000 simple low-relevance text heuristics (not counting the standard header heuristics).

• The first 1000 words (ordered by discrimination, but with a minimum of 10-30 hundred occurrences) from a Bayesian dictionary trained on this corpus, plus the standard header heuristics.

• Almost 1 million legitimate email messages

• 75% of the message corpus was used for training the neural network, and

• 25% was used for testing the neural network.

• 1.5 days to train!!!!
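
The word-selection step can be sketched as follows. The slides do not define "discrimination", so this example assumes it means the distance of a word's spam probability from 0.5, estimated from per-class document frequencies; the function name `top_discriminative_words` and its parameters are likewise hypothetical.

```python
from collections import Counter


def top_discriminative_words(spam_docs, ham_docs, n_words, min_count):
    """Pick the n_words most one-sided words seen at least min_count times."""
    # Count each word once per message (document frequency).
    spam_counts = Counter(w for doc in spam_docs for w in set(doc.lower().split()))
    ham_counts = Counter(w for doc in ham_docs for w in set(doc.lower().split()))
    n_spam = max(len(spam_docs), 1)
    n_ham = max(len(ham_docs), 1)

    scored = []
    for word in set(spam_counts) | set(ham_counts):
        s, h = spam_counts[word], ham_counts[word]
        if s + h < min_count:
            continue  # too rare to be a reliable feature
        spam_rate, ham_rate = s / n_spam, h / n_ham
        p_spam = spam_rate / (spam_rate + ham_rate)
        # Assumed discrimination score: distance from the neutral 0.5.
        scored.append((abs(p_spam - 0.5), word))

    # Highest score first; break ties alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [word for _, word in scored[:n_words]]
```

The minimum-occurrence threshold plays the role described on the slide: a word that appears only a handful of times may look perfectly discriminating by chance.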

Page 14: A False Positive Safe Neural Network for Spam Detection

Results

• FP: 1% → 0.0001%

• FN: 4% → 20%

• On some corpora (TREC 2006) we had … not so great results (but current heuristics)

• FN: 35%

• FP: 2 email messages!

• At least, just a few false positives!
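
For reference, a short sketch of how false positive and false negative rates are commonly computed. The slides do not state their denominators, so this assumes FP is measured against all legitimate messages and FN against all spam messages; the function name `error_rates` and the "ham"/"spam" label strings are illustrative.

```python
def error_rates(true_labels, predicted_labels):
    """Return (fp_rate, fn_rate) for labels given as "ham" or "spam"."""
    pairs = list(zip(true_labels, predicted_labels))
    ham_total = sum(1 for truth, _ in pairs if truth == "ham")
    spam_total = sum(1 for truth, _ in pairs if truth == "spam")
    # False positive: legitimate mail flagged as spam.
    fp = sum(1 for truth, pred in pairs if truth == "ham" and pred == "spam")
    # False negative: spam that slipped through as legitimate.
    fn = sum(1 for truth, pred in pairs if truth == "spam" and pred == "ham")
    fp_rate = fp / ham_total if ham_total else 0.0
    fn_rate = fn / spam_total if spam_total else 0.0
    return fp_rate, fn_rate
```

Trading a higher FN rate for a much lower FP rate is the design goal named in the title: a missed spam is an annoyance, while a lost legitimate message is a failure.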

Page 15: A False Positive Safe Neural Network for Spam Detection

Conclusions

• ART + Simple Features + Spam = Love

• ART + False Positives + Spam = OMG!!!

• (ART++) = Heuristic Filter + ARTMAP

• Must use a lot of email messages. It is highly difficult to find representative samples for individual waves.

• Can also be applied to other neural networks

• Interesting PowerPoint template…

Page 16: A False Positive Safe Neural Network for Spam Detection

Thanks

QUESTIONS?