Machine Learning Basics with Applications to Email Spam Detection UGR PROJECT - HAOYU LI, BRITTANY EDWARDS, WEI ZHANG UNDER XIAOXIAO XU AND ARYE NEHORAI
Mar 29, 2015
General background information about the process of machine learning
The process of email detection
⦿ Motivation of this project
⦿ Pre-processing of data
⦿ Classifier Models
● Evaluation of classifiers
Motivation of this project
⦿ Spam email annoys every personal email account
● 60% of January 2004 emails were spam
● Fraud & Phishing
⦿ Spam vs. Ham email
Our Goal
Spam Email example
Ham Email example
The process of email detection
⦿ Motivation of this project
⦿ Pre-processing of data
⦿ Classifier Models
● Evaluation of classifiers
Pre-processing of data
⦿ Convert capital letters to lowercase
⦿ Remove numbers and extra white space
⦿ Remove punctuation
⦿ Remove stop-words
⦿ Delete terms with length greater than 20
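The pre-processing steps above could be sketched in Python as follows; the stop-word list here is a small illustrative subset, not the one used in the project:

```python
import re

# A small illustrative subset of English stop-words (assumption, not the project's list).
STOP_WORDS = {"the", "a", "an", "and", "or", "is", "are", "to", "of", "in", "for"}

def preprocess(text):
    """Apply the pre-processing steps from the slides to one email body."""
    text = text.lower()                         # convert capital letters to lowercase
    text = re.sub(r"\d+", " ", text)            # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)        # remove punctuation
    terms = text.split()                        # splitting also removes extra white space
    terms = [t for t in terms if t not in STOP_WORDS]
    terms = [t for t in terms if len(t) <= 20]  # delete terms with length greater than 20
    return terms
```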
Pre-processing of data
⦿ Original Email
Pre-processing of data
⦿ After pre-processing
Pre-processing of data
⦿ Extract Terms
Pre-processing of data
⦿ Reduce Terms
● Keep word length < 20
The process of email detection
⦿ Motivation of this project
⦿ Pre-processing of data
⦿ Classifier Models
● Evaluation of classifiers
Different classification methods
⦿ K Nearest Neighbor (KNN)
⦿ Naive Bayes Classifier
⦿ Logistic Regression
⦿ Decision Tree Analysis
What is K Nearest Neighbor
⦿ Use the k "closest" samples (nearest neighbors) to perform classification
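A minimal KNN sketch of this idea; the Euclidean distance metric and the default k are assumptions for illustration, not taken from the slides:

```python
import math
from collections import Counter

def knn_classify(sample, train, k=3):
    """Label a sample by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda p: math.dist(sample, p[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```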
What is K Nearest Neighbor
Initial outcome and strategies for improvement
⦿ KNN accuracy was ~64%, which is very low
⦿ KNN classifier does not fit our project
⦿ Term-list is still too large
⦿ Try different method to classify and see if evaluation results are better than KNN results
⦿ Continue to reduce size of term list by removing terms that are not meaningful
Steps for improvement
⦿ Remove sparsity
⦿ Reduce length threshold
⦿ Create hashtable
⦿ Use alternative classifier
● Naive Bayes Classifier
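The sparsity-removal step could be sketched as a document-frequency filter like the one below; the threshold parameter and its exact form are assumptions:

```python
from collections import Counter

def remove_sparse_terms(term_lists, min_doc_frac=0.01):
    """Drop terms appearing in fewer than `min_doc_frac` of all emails.

    `term_lists` holds one list of terms per email; the 1% default is an assumption.
    """
    doc_freq = Counter()
    for terms in term_lists:
        doc_freq.update(set(terms))  # count each term once per email
    min_docs = min_doc_frac * len(term_lists)
    keep = {t for t, n in doc_freq.items() if n >= min_docs}
    return [[t for t in terms if t in keep] for terms in term_lists]
```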
⦿ Calculate a hash key for each term in the term-list
⦿ When a collision occurs, use separate chaining
Hashtable
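One way to sketch the hashtable described above, resolving collisions by separate chaining; the hash function here is illustrative, not necessarily the one used in the project:

```python
class ChainedHashTable:
    """Hash table of term counts with separate chaining for collisions."""

    def __init__(self, n_buckets=101):
        self.buckets = [[] for _ in range(n_buckets)]

    def _key(self, term):
        # Illustrative hash key: sum of character codes modulo the bucket count.
        return sum(ord(c) for c in term) % len(self.buckets)

    def add(self, term):
        chain = self.buckets[self._key(term)]
        for entry in chain:
            if entry[0] == term:
                entry[1] += 1        # term already stored: bump its count
                return
        chain.append([term, 1])      # empty slot or collision: append to the chain

    def count(self, term):
        for stored, n in self.buckets[self._key(term)]:
            if stored == term:
                return n
        return 0
```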
Naive Bayes classifier
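A minimal multinomial Naive Bayes sketch for spam/ham; the add-one (Laplace) smoothing is an assumption, not confirmed by the slides:

```python
import math
from collections import Counter

def train_nb(emails):
    """Fit per-class term counts from (terms, label) pairs, labels 'spam'/'ham'."""
    counts = {"spam": Counter(), "ham": Counter()}
    docs = Counter()
    for terms, label in emails:
        counts[label].update(terms)
        docs[label] += 1
    vocab = set(counts["spam"]) | set(counts["ham"])
    return counts, docs, vocab

def classify_nb(terms, counts, docs, vocab):
    """Pick the label maximizing log P(label) + sum of log P(term | label)."""
    total = sum(docs.values())
    best, best_score = None, -math.inf
    for label in counts:
        score = math.log(docs[label] / total)          # class prior
        denom = sum(counts[label].values()) + len(vocab)
        for t in terms:
            score += math.log((counts[label][t] + 1) / denom)  # add-one smoothing
        if score > best_score:
            best, best_score = label, score
    return best
```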
Secondary Results
⦿ Accuracy increases from 62% to 82.36%
Suggestions for further improvement
⦿ Revise pre-processing
⦿ Apply additional classifiers
Thank you
⦿ Questions?