Machine Learning Basics with Applications to Email Spam Detection UGR PROJECT - HAOYU LI, BRITTANY EDWARDS, WEI ZHANG UNDER XIAOXIAO XU AND ARYE NEHORAI

Mar 29, 2015

Page 1

Machine Learning Basics with Applications to Email Spam Detection

UGR PROJECT - HAOYU LI, BRITTANY EDWARDS, WEI ZHANG UNDER XIAOXIAO XU AND ARYE NEHORAI

Page 2

General background information about the process of machine learning

Page 3

The process of email detection

⦿ Motivation of this project

⦿ Pre-processing of data

⦿ Classifier Models

● Evaluation of classifiers

Page 4

Motivation of this project

⦿ Spam email is a nuisance for every personal email account

● 60% of emails in January 2004 were spam

● Fraud & phishing

⦿ Spam vs. ham email

Page 5

Our Goal

Page 6

Spam Email example

Page 7

Ham Email example

Page 8

The process of email detection

⦿ Motivation of this project

⦿ Pre-processing of data

⦿ Classifier Models

● Evaluation of classifiers

Page 9

Pre-processing of data

⦿ Convert capital letters to lowercase 

⦿ Remove numbers and extra white space

⦿ Remove punctuation

⦿ Remove stop-words

⦿ Delete terms with length greater than 20. 
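
The steps listed on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: the function name `preprocess`, the tiny `STOP_WORDS` set, and the choice to replace punctuation with spaces are all hypothetical, and the slides do not say which stop-word list was used. A later slide states the length rule as "< 20", so the exact boundary is ambiguous; the sketch follows this slide and drops only terms longer than 20 characters.

```python
import re
import string

# Hypothetical, deliberately small stop-word list (the slides do not name one).
STOP_WORDS = {"a", "an", "and", "are", "is", "in", "of", "the", "to", "you", "your"}
MAX_TERM_LENGTH = 20  # terms longer than this are deleted, as on this slide

# Translation table that turns every ASCII punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Return the list of cleaned terms for one email body."""
    text = text.lower()                      # capital letters -> lowercase
    text = re.sub(r"\d+", " ", text)         # remove numbers
    text = text.translate(_PUNCT_TO_SPACE)   # remove punctuation
    terms = text.split()                     # split() also discards extra white space
    return [
        t for t in terms
        if t not in STOP_WORDS and len(t) <= MAX_TERM_LENGTH
    ]


if __name__ == "__main__":
    print(preprocess("Claim your FREE prize NOW!!! Call 555-1234 today."))
```

Replacing punctuation with a space rather than deleting it keeps words such as "prize,now" from being fused into a single term.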

Page 10

Pre-processing of data

⦿Original Email

Page 11

Pre-processing of data

⦿After pre-processing

Page 12

Pre-processing of data

⦿Extract Terms

Page 13

Pre-processing of data

⦿ Reduce Terms

● Keep word length < 20

Page 14

The process of email detection

⦿ Motivation of this project

⦿ Pre-processing of data

⦿ Classifier Models

● Evaluation of classifiers

Page 15

Different classification methods

⦿ K Nearest Neighbor (KNN)

⦿ Naive Bayes Classifier

⦿ Logistic Regression

⦿ Decision Tree Analysis

Page 16

What is K Nearest Neighbor

⦿ Use the k "closest" samples (nearest neighbors) to perform classification
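
A minimal sketch of the idea, assuming each email has already been turned into a vector of term counts. The Euclidean distance, the majority vote, the default `k=3`, and the names `euclidean` and `knn_predict` are assumptions made for illustration; the slides do not state which distance measure or value of k the project used.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label `query` by majority vote among its k closest training samples."""
    # Pair every training sample's distance to the query with its label,
    # then sort so the closest samples come first.
    distances = sorted(
        (euclidean(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


if __name__ == "__main__":
    # Toy vectors: counts of the (hypothetical) terms "free" and "meeting".
    vectors = [[3, 0], [2, 0], [4, 1], [0, 2], [0, 3], [1, 2]]
    labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
    print(knn_predict(vectors, labels, [3, 1]))
```

Because every prediction compares the query against all training samples in the full term space, a very large term list makes both the distance computation and the neighbor structure less reliable.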

Page 17

What is K Nearest Neighbor

Page 18

Initial outcome and strategies for improvement

⦿ KNN accuracy was ~64%, which is very low

⦿ The KNN classifier does not fit our project

⦿ The term list is still too large

⦿ Try a different classification method and see whether its evaluation results are better than the KNN results

⦿ Continue to reduce the size of the term list by removing terms that are not meaningful

Page 19

Steps for improvement

⦿ Remove sparsity

⦿ Reduced length threshold

⦿ Created hashtable

⦿ Used alternative classifier

● Naive Bayes classifier
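
One way to read "remove sparsity" is to drop terms that occur in only a small share of the emails. The sketch below illustrates that reading only; the function name `remove_sparse_terms`, the `min_doc_fraction` parameter, and its default value are hypothetical, and the slides do not give the threshold the project used.

```python
def remove_sparse_terms(documents, min_doc_fraction=0.05):
    """Drop terms that appear in fewer than `min_doc_fraction` of the documents.

    `documents` is a list of term lists, one list per email.
    """
    n_docs = len(documents)

    # Document frequency: in how many emails does each term appear at least once?
    doc_freq = {}
    for terms in documents:
        for term in set(terms):
            doc_freq[term] = doc_freq.get(term, 0) + 1

    keep = {t for t, df in doc_freq.items() if df / n_docs >= min_doc_fraction}
    return [[t for t in terms if t in keep] for terms in documents]


if __name__ == "__main__":
    emails = [["free", "prize"], ["free", "meeting"], ["free", "xqzt"], ["prize", "free"]]
    print(remove_sparse_terms(emails, min_doc_fraction=0.5))
```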

Page 20

Hashtable

⦿ Calculate a hash key for each term in the term list

⦿ When a collision occurs, resolve it with separate chaining
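
A hash table with separate chaining, as described on this slide, might look like the sketch below. The class name `ChainedHashTable`, the bucket count, and the polynomial string hash are assumptions for illustration; the slides do not specify the hash function or what value was stored per term (a term count is assumed here).

```python
class ChainedHashTable:
    """Term -> count table; colliding terms share a bucket (separate chaining)."""

    def __init__(self, num_buckets=1024):
        # Each bucket is a list ("chain") of [term, count] entries.
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, term):
        # Simple polynomial string hash (an assumption, chosen for clarity).
        h = 0
        for ch in term:
            h = (h * 31 + ord(ch)) % len(self.buckets)
        return h

    def add(self, term, count=1):
        """Increase the stored count for `term`, inserting it if absent."""
        bucket = self.buckets[self._index(term)]
        for entry in bucket:
            if entry[0] == term:
                entry[1] += count
                return
        # Either the bucket was empty or a different term hashed here:
        # append to the chain instead of overwriting.
        bucket.append([term, count])

    def get(self, term):
        """Return the stored count for `term`, or 0 if it was never added."""
        for key, value in self.buckets[self._index(term)]:
            if key == term:
                return value
        return 0


if __name__ == "__main__":
    table = ChainedHashTable(num_buckets=8)
    for word in ["free", "prize", "free"]:
        table.add(word)
    print(table.get("free"), table.get("prize"), table.get("meeting"))
```

Lookups touch only one bucket, so finding a term's entry no longer requires scanning the whole term list.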

Page 21

Naive Bayes classifier
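
The slide content itself was not captured in this transcript, so the following is only a generic sketch of a Naive Bayes text classifier, not the project's implementation. It assumes the multinomial variant with add-one (Laplace) smoothing over lists of terms; the function names `train_naive_bayes` and `predict_naive_bayes` are hypothetical.

```python
import math
from collections import Counter


def train_naive_bayes(documents, labels):
    """Count documents per class and term occurrences per class."""
    class_docs = Counter(labels)
    term_counts = {label: Counter() for label in class_docs}
    for terms, label in zip(documents, labels):
        term_counts[label].update(terms)
    vocab = {t for counts in term_counts.values() for t in counts}
    return class_docs, term_counts, vocab


def predict_naive_bayes(model, terms):
    """Return the class with the highest log-posterior for `terms`."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        # log P(class) + sum of log P(term | class), treating terms as independent.
        score = math.log(class_docs[label] / total_docs)
        total_terms = sum(term_counts[label].values())
        for t in terms:
            if t in vocab:  # terms never seen in training are ignored
                score += math.log(
                    (term_counts[label][t] + 1) / (total_terms + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    docs = [["free", "prize", "claim"], ["free", "offer"],
            ["meeting", "agenda"], ["project", "meeting", "notes"]]
    model = train_naive_bayes(docs, ["spam", "spam", "ham", "ham"])
    print(predict_naive_bayes(model, ["free", "prize"]))
```

Working with log-probabilities avoids numeric underflow when an email contains many terms.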

Page 22

Secondary Results

⦿ Accuracy increased from 62% to 82.36%

Page 23

Suggestions for further improvement

⦿ Revise pre-processing

⦿ Apply additional classifiers

Page 24

Thank you

⦿Questions?