GPGPU based Parallel Spam Filter
Prachi Goyal Juneja
M.Tech Scholar
Maulana Azad National Institute of Technology
Bhopal(M.P) India-462003
R.K.Pateriya
Associate Professor
Maulana Azad National Institute of Technology
Bhopal(M.P) India-462003
Abstract- Spam refers to the unwanted emails that arrive
in our mailboxes each day. These emails include
promotional messages from companies, viruses, lucrative
offers of extra income and many more. They are sent in
bulk from unknown sources to flood our mailboxes.
Various ways have been devised to deal with spam; these
are known as spam filtering techniques. Spam filtering
can be based on many parameters, such as keywords, URLs
and content. Content-based spam filtering is becoming
popular because it examines the content of an email and
then classifies it as spam or ham. As the volume of data
grows and electronic data takes over most communication
media, faster processing and computing devices are
needed. GPGPUs have emerged as an effective way to share
the CPU's workload and make parallel processing possible.
Keywords- Spam, Bayesian Spam Filtering, Serial Spam
Filter, Parallel Spam Filter, Spamicity.
1. INTRODUCTION
The growth of internet communication has led to an
enormous increase in spam. Spam is unwanted data sent
to users against their wishes, i.e., data that they
neither requested nor wanted to receive [1]. The
increase in spam causes many problems, such as slower
access to emails, increased network traffic and wasted
storage space [2, 3].
To deal with spam, two spam filters are proposed:
1. Serial Bayesian spam filter
2. Parallel Bayesian spam filter using GPGPU
(general-purpose computation on GPU)
The serial spam filter is designed first and is then
parallelized using the mail list division approach to
obtain the parallel spam filter. Both filters are
designed using the Bayesian approach [4, 5, 6, 7]. The
filters proposed in this paper consist of two phases:
the Training Phase and the Filtering Phase. In the
training phase three databases are created:
Keyword Database: Keywords taken from the ham and
spam mails
Ham Database: Database of ham mails
Spam Database: Database of spam mails
The databases are built from the Enron and Snort
datasets. Once the databases are created, the spam
probability of every keyword is calculated using
Bayesian statistics.
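As an illustration of the training phase described above, the following minimal Python sketch computes a spam probability (spamicity) for each keyword. It is not the implementation used in this paper: the function names, the whitespace tokenizer, the tiny in-memory databases, and the choice of equal prior probabilities for spam and ham are all assumptions made for the example. Under those assumptions the spamicity of a keyword w is P(w|spam) / (P(w|spam) + P(w|ham)).

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def keyword_spamicity(spam_mails, ham_mails, keywords):
    """Return {keyword: P(spam | keyword)}, assuming equal priors for spam and ham.

    Both mail lists are assumed to be non-empty.
    """
    # For each word, count how many mails of each class contain it at least once.
    spam_counts = Counter(w for mail in spam_mails for w in set(tokenize(mail)))
    ham_counts = Counter(w for mail in ham_mails for w in set(tokenize(mail)))

    spamicity = {}
    for word in keywords:
        p_word_given_spam = spam_counts[word] / len(spam_mails)
        p_word_given_ham = ham_counts[word] / len(ham_mails)
        total = p_word_given_spam + p_word_given_ham
        # A keyword seen in neither class carries no evidence: use the neutral value 0.5.
        spamicity[word] = p_word_given_spam / total if total > 0 else 0.5
    return spamicity


# Hypothetical toy databases standing in for the spam, ham and keyword databases.
spam_db = ["win money now", "cheap money offer"]
ham_db = ["meeting at noon", "send money report now"]
keyword_db = ["money", "meeting", "offer", "lottery"]

scores = keyword_spamicity(spam_db, ham_db, keyword_db)
for word in keyword_db:
    print(word, round(scores[word], 3))
```

In this toy example a keyword that occurs only in spam mails receives a spamicity of 1.0, a keyword that occurs only in ham mails receives 0.0, and a keyword that occurs in both classes falls in between.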
The rest of this paper is organized as follows. Section
2 describes the Bayesian spam filtering method and
briefly discusses the training phase and the filtering
phase. Section 3 covers the basics of a serial spam
filter along with its design and algorithm. Section 4
presents the basic parallel spam filter model and
discusses the complete parallelization process using
GPGPU. Section 5 summarizes the serial and parallel
spam filters and the Bayesian spam filtering steps. We
discuss related work in Section 6 and conclude the
paper in Section 7.
2. BACKGROUND
New filtering techniques appear regularly, and a great
deal of work is under way in this field to devise new
ways to fight spam.
In the work of Hu Yin and Zhang Chaoyang [8], a
Bayesian serial spam filtering algorithm is implemented
in which the email content is tokenized and these
tokens are searched for in the mail. A brute-force
search is applied here, which is time consuming because
each word is searched for sequentially and preprocessed
separately.
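To make the cost of sequential searching concrete, the sketch below contrasts a brute-force keyword search with a hash-set lookup. It is an illustration only and does not reproduce the algorithm of [8]: the function names and the sample mail are hypothetical, and the set-based variant is an assumption added here for comparison, not a technique taken from the cited work.

```python
def brute_force_hits(mail_tokens, keywords):
    """Scan the token list once per keyword: O(len(keywords) * len(mail_tokens))."""
    hits = []
    for keyword in keywords:
        for token in mail_tokens:  # sequential scan over the whole mail
            if token == keyword:
                hits.append(keyword)
                break  # stop at the first occurrence of this keyword
    return hits


def set_lookup_hits(mail_tokens, keywords):
    """Build a hash set once, then test each keyword in O(1) time on average."""
    present = set(mail_tokens)
    return [keyword for keyword in keywords if keyword in present]


# Hypothetical sample mail and keyword list.
mail = "Claim your free prize now".lower().split()
keywords = ["free", "prize", "meeting"]

print(brute_force_hits(mail, keywords))
print(set_lookup_hits(mail, keywords))
```

Both functions return the same keywords in the same order; only the amount of work differs, and the difference grows with the number of keywords and the length of the mail.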
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 12, No. 7, July 2014