Machine Learning Basics with Applications to Email Spam Detection UGR PROJECT - HAOYU LI, BRITTANY EDWARDS, WEI ZHANG UNDER XIAOXIAO XU AND ARYE NEHORAI
Mar 29, 2015
General background information about the process of machine learning
The process of email detection
⦿ Motivation of this project
⦿ Pre-processing of data
⦿ Classifier Models
● Evaluation of classifiers
Motivation of this project
⦿ Spam email annoys every personal email account
● 60% of January 2004 emails were spam
● Fraud & Phishing
⦿ Spam vs. Ham email
Our Goal
Spam Email example
Ham Email example
The process of email detection
⦿ Motivation of this project
⦿ Pre-processing of data
⦿ Classifier Models
● Evaluation of classifiers
Pre-processing of data
⦿ Convert capital letters to lowercase
⦿ Remove numbers and extra white space
⦿ Remove punctuation
⦿ Remove stop-words
⦿ Delete terms with length greater than 20
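The pre-processing steps above could be sketched in Python as follows; the stop-word list here is a small illustrative subset, not the one used in the project:

```python
import re

# A small illustrative subset of English stop-words (assumption, not the project's list).
STOP_WORDS = {"the", "a", "an", "and", "or", "is", "are", "to", "of", "in", "for"}

def preprocess(text):
    """Apply the pre-processing steps from the slides to one email body."""
    text = text.lower()                         # convert capital letters to lowercase
    text = re.sub(r"\d+", " ", text)            # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)        # remove punctuation
    terms = text.split()                        # splitting also removes extra white space
    terms = [t for t in terms if t not in STOP_WORDS]
    terms = [t for t in terms if len(t) <= 20]  # delete terms with length greater than 20
    return terms
```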
Pre-processing of data
⦿ Original Email
Pre-processing of data
⦿ After pre-processing
Pre-processing of data
⦿ Extract Terms
Pre-processing of data
⦿ Reduce Terms
● Keep word length < 20
The process of email detection
⦿ Motivation of this project
⦿ Pre-processing of data
⦿ Classifier Models
● Evaluation of classifiers
Different classification methods
⦿ K Nearest Neighbor (KNN)
⦿ Naive Bayes Classifier
⦿ Logistic Regression
⦿ Decision Tree Analysis
What is K Nearest Neighbor
⦿ Use the k "closest" samples (nearest neighbors) to perform classification
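A minimal KNN sketch of this idea; the Euclidean distance metric and the default k are assumptions for illustration, not taken from the slides:

```python
import math
from collections import Counter

def knn_classify(sample, train, k=3):
    """Label a sample by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda p: math.dist(sample, p[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```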
What is K Nearest Neighbor
Initial outcome and strategies for improvement
⦿ KNN accuracy was ~64%, which is very low
⦿ KNN classifier does not fit our project
⦿ Term-list is still too large
⦿ Try different method to classify and see if evaluation results are better than KNN results
⦿ Continue to reduce size of term list by removing terms that are not meaningful
Steps for improvement
⦿ Remove sparsity
⦿ Reduce length threshold
⦿ Create hashtable
⦿ Use alternative classifier
● Naive Bayes Classifier
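The sparsity-removal step could be sketched as a document-frequency filter like the one below; the threshold parameter and its exact form are assumptions:

```python
from collections import Counter

def remove_sparse_terms(term_lists, min_doc_frac=0.01):
    """Drop terms appearing in fewer than `min_doc_frac` of all emails.

    `term_lists` holds one list of terms per email; the 1% default is an assumption.
    """
    doc_freq = Counter()
    for terms in term_lists:
        doc_freq.update(set(terms))  # count each term once per email
    min_docs = min_doc_frac * len(term_lists)
    keep = {t for t, n in doc_freq.items() if n >= min_docs}
    return [[t for t in terms if t in keep] for terms in term_lists]
```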
⦿ Calculate a hash key for each term in the term-list
⦿ When a collision occurs, use separate chaining
Hashtable
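One way to sketch the hashtable described above, resolving collisions by separate chaining; the hash function here is illustrative, not necessarily the one used in the project:

```python
class ChainedHashTable:
    """Hash table of term counts with separate chaining for collisions."""

    def __init__(self, n_buckets=101):
        self.buckets = [[] for _ in range(n_buckets)]

    def _key(self, term):
        # Illustrative hash key: sum of character codes modulo the bucket count.
        return sum(ord(c) for c in term) % len(self.buckets)

    def add(self, term):
        chain = self.buckets[self._key(term)]
        for entry in chain:
            if entry[0] == term:
                entry[1] += 1        # term already stored: bump its count
                return
        chain.append([term, 1])      # empty slot or collision: append to the chain

    def count(self, term):
        for stored, n in self.buckets[self._key(term)]:
            if stored == term:
                return n
        return 0
```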
Naive Bayes classifier
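A minimal multinomial Naive Bayes sketch for spam/ham; the add-one (Laplace) smoothing is an assumption, not confirmed by the slides:

```python
import math
from collections import Counter

def train_nb(emails):
    """Fit per-class term counts from (terms, label) pairs, labels 'spam'/'ham'."""
    counts = {"spam": Counter(), "ham": Counter()}
    docs = Counter()
    for terms, label in emails:
        counts[label].update(terms)
        docs[label] += 1
    vocab = set(counts["spam"]) | set(counts["ham"])
    return counts, docs, vocab

def classify_nb(terms, counts, docs, vocab):
    """Pick the label maximizing log P(label) + sum of log P(term | label)."""
    total = sum(docs.values())
    best, best_score = None, -math.inf
    for label in counts:
        score = math.log(docs[label] / total)          # class prior
        denom = sum(counts[label].values()) + len(vocab)
        for t in terms:
            score += math.log((counts[label][t] + 1) / denom)  # add-one smoothing
        if score > best_score:
            best, best_score = label, score
    return best
```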
Secondary Results
⦿ Accuracy increases from 62% to 82.36%
Suggestions for further improvement
⦿ Revise pre-processing
⦿ Apply additional classifiers
Thank you
⦿ Questions?