Internal Presentation by: Lei Wang Pervasive and Artificial Intelligenge research group http://diuf.unifr.ch/pai On: An Artificial Immune System for E-mail Classification Andy Secker, Alex Freitas, Jon Timmis Computing Laboratory, University of Kent Canterbury, Kent, UK http://www.cs.kent.ac.uk/~ads3 19/02/2004
22
Embed
Internal Presentation by : Lei Wang Pervasive and Artificial Intelligenge research group On: An Artificial Immune System for E-mail.
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Internal Presentation by: Lei Wang
Pervasive and Artificial Intelligenge research grouphttp://diuf.unifr.ch/pai
On: An Artificial Immune System for E-mail Classification
Andy Secker, Alex Freitas, Jon Timmis
Computing Laboratory, University of KentCanterbury, Kent, UK
http://www.cs.kent.ac.uk/~ads3
19/02/2004
An Artificial Immune System for E-mail Classification
Andy Secker, Alex Freitas, Jon Timmis
Computing Laboratory, University of KentCanterbury, Kent, UK
http://www.cs.kent.ac.uk/~ads3
19/02/2004
19/02/2004
Significance
With the increase in information on the Internet, the strive to find more effective tools for distinguishing between interesting and non-interesting material is increasing.
This paper provides an immune-inspired algorithm called AISEC that is capable of continuously classifying electronic mail as interesting and non-interesting without the need for re-training.
Comparing with a naïve Bayesian classifier, the system proposed in this paper performs as well as the naïve Bayesian system and has a great potential for augmentation.
19/02/2004
AISEC, immunity-inspired system
Immune system Human body constantly under attack. Immune system must adapt and
respond The (natural) immune system is:
1. Dynamic
2. Adaptive
3. Robust
4. Etc.
Artificial Immune Systems (AIS) use principles and process from observed and theoretical immunology to solve problems
19/02/2004
Artificial Immune Systems
Engineering framework Representation of individual immune cells Affinity measures
• Evaluate interaction of individuals with environment and/or each other
Algorithms
• Procedures of adaptation manipulate populations of immune cells
AIS as a classifier AIRS
• A successful supervised AIS algorithm for classification
19/02/2004
AIS for Web Mining
Web mining, an umbrella term used to describe three quite different types of data mining: Content mining
• A process of extracting useful information from the text, images and other forms of content that make up the pages
• The mining of textual data is a common task, often for the purposes of information retrieval
Usage mining Structure mining
AISEC research goal To develop a highly adaptive system capable of retrieving interesting
information from the internet based on user’s current interests The authors believe AIS may offer a number of advantages
19/02/2004
What is AISEC ?
AISEC isn’t a spam filter It has no methods to penalize false positives (loss of important
e-mail) Without a very low false positive rate, a spam filter would not be
trusted
19/02/2004
What is AISEC ?
AISEC is A first step towards an AIS for web mining.
• A study of performance and characteristics of an AIS applied to text mining in a dynamic domain
A text classification algorithm capable of continuous adaptation, which may yield a classification accuracy comparable to a Bayesian approach.
• User behaviour and interaction with e-mail can be similar to web pages• Supervised classification algorithm
E-mail classified as interesting and uninteresting Uses constant(ish) feedback from user Capable of continuous adaptation
• This tracks concept drift and can also handle concept shift• A specialised AIS algorithm based in part on the immune principle of
clonal selection No previously documented algorithm was suited for use in this
situation without extensive changes
19/02/2004
Representation
Each cell contains 3 sets of words (+ state) Punctuation is removed from fields Research literature has suggested header information is
enough to accurately classify e-mail*
A = [<free,DVD> , <sales,com> , < canterbury,UK>]
Subject field
Title of the E-mail
Sender field
Sender’s name
Return field
(Sender’s address)
* Diao, Lu & Wu (2000). A Comparative Study of Classification Based Personal E-mail Filtering, PAKDD 2000
19/02/2004
Affinity
Affinity value is proportion of words in one cell found in another More features would require a less naïve distance measure Cosine distance is an obvious choice Resultant value always between 0 and 1
A = [<free,DVDs> , <offers,DVD,com> , <offers,DVD,com>]
B = [<half,price,sale>,<sales,DVD,com>,<sales,DVD,com>]
affinity(A,B) = 4/9
PROCEDURE affinity (bc1, bc2)IF(bc1 has a shorter feature vector than bc2)
bshort ← bc1, blong ← bc2ELSE
bshort ← bc2, blong ← bc1count ← the number of words in bshort present in
blongbs_len ← the length of bshort’s feature vector
RETURN count/bs_len
19/02/2004
Clone-Mutation
One mutation takes a word previously used in subject or address and replaces single location Subject, sender and return
PROCEDURE clone_mutate(bc1,bc2) aff ← affinity(bc1,bc2) clones ← ∅ num_clones ← | aff * Kl | num_mutate ← | (1-aff) * bc’s feature vector length * Km | DO(num_clones)TIMES bcx ← a copy of bc1 DO(num_mutate)TIMES p ← a random point in bcx’s feature vector w ← a random word from the appropriate gene library replace word in bcx’s feature vector at location p with w bcx’s stimulation level ← Ksb clones ← clones ∪ {bcx}RETURN clones
19/02/2004
The algorithm - classification
1. System is initialised with known uninteresting e-mail
2. E-mail presented for classification. Classified as uninteresting as it stimulates close cells
Memory cells
Naive cells
19/02/2004
The algorithm – correct classification
3. Highly stimulated cell reproduces 7 times. Less stimulated cell produces only 2 clones but with higher mutation rate
4. Cell with highest affinity is known to be useful therefore rewarded by becoming memory cell.
Classification Region
Stimulation Region
19/02/2004
The algorithm cont…
Cell removal
• Aged naïve cells deleted. Memory cells placed in already covered areas also deleted.
Incorrect classification
5. Any cell responsible for incorrect classification is removed (memory or otherwise)
19/02/2004
Results – Classification accuracy
2268 e-mails (742 uninteresting) received over 6 months E-mails presented in the order of date received Feedback given after EVERY classification AISEC run 10 times, results show mean C5.0, neural network and C&R tree all run in “Clementine” data
mining package Bayesian algorithm used feedback to update like AISEC