Page 1: Emad E. Abdallah*, Alaa E. Abdallah, Mohammad Bsoul and ... · Essam Al-Daoud Faculty of Science and Information Technology, Zarka University, Zarka, Jordan Email: essamdz@zpu.edu.jo

72 Int. J. Security and Networks, Vol. 8, No. 2, 2013

Copyright © 2013 Inderscience Enterprises Ltd.

Simplified features for email authorship identification

Emad E. Abdallah*, Alaa E. Abdallah, Mohammad Bsoul and Ahmed F. Otoom Faculty of Prince Al-Hussein Bin Abdallah II For Information Technology, Hashemite University, Zarqa, Jordan Email: [email protected] Email: [email protected] Email: [email protected] Email: [email protected] *Corresponding author

Essam Al-Daoud Faculty of Science and Information Technology, Zarka University, Zarka, Jordan Email: essamdz@zpu.edu.jo

Abstract: We present an investigation analysis approach for mining anonymous email content. The core idea behind our approach is to collect various effective features from the previous emails of all the possible suspects. The extracted features are then used with several machine learning algorithms to extract a unique writing style for each suspect. A sophisticated comparison between the investigated anonymous email and the suspects' writing styles is employed to extract evidence of the possible email sender. Extensive experimental results on real data sets show the improved performance of the proposed method with a very limited number of features.

Keywords: digital forensics; cyber crimes; email forensics; email misuse; authorship analysis; stylometric features.

Reference to this paper should be made as follows: Abdallah, E.E., Abdallah, A.E., Bsoul, M., Otoom, A.F. and Al-Daoud, E. (2013) ‘Simplified features for email authorship identification’, Int. J. Security and Networks, Vol. 8, No. 2, pp.72–81.

Biographical notes: Emad E. Abdallah is currently an Assistant Professor in the Department of Computer Information Systems at the Hashemite University (HU), Jordan. He received his PhD in Computer Science from Concordia University in 2008, where he worked on multimedia security, pattern recognition and 3D object recognition. He received his BS in Computer Science from Yarmouk University, Jordan, and MS in Computer Science from the University of Jordan in 2000 and 2004, respectively. Prior to joining HU, he was a Software Developer at SAP Labs Montreal. His current research interests include computer graphics, multimedia security, pattern recognition and computer networks.

Alaa E. Abdallah has been an Assistant Professor in the Department of Computer Science of Hashemite University since 2011. He obtained his BSc in Computer Science from Yarmouk University in 2000, MSc in Computer Science from University of Jordan in 2003, and PhD in Computer Science from Concordia University, Montreal, Canada, in 2008. Prior to joining Hashemite University, he was a Network Researcher at a private consulting company in Montreal (2008–2011). His research interests include routing protocols for ad hoc networks, parallel and distributed systems, and multimedia security.

Mohammad Bsoul is an Assistant Professor in the Computer Science Department at the Hashemite University. He received his BSc in Computer Science from Jordan University of Science and Technology, Jordan, Master from University of Western Sydney, Australia, and PhD from Loughborough University, UK. His research interests include wireless sensor networks, grid computing, distributed systems and performance evaluation.


Ahmed F. Otoom is currently working as an assistant dean at the Faculty of Prince Al-Hussein bin Abdullah II for Information Technology. He is also an assistant professor at the Software Engineering department at Hashemite University, Jordan. He received his PhD degree in computer science from the University of Technology, Sydney (UTS), Australia, in 2010. In 2003, he received his master's degree in software engineering from the University of Western Sydney, Australia. In 2002, he received his bachelor's degree in computer science from Jordan University of Science and Technology, Jordan. He worked as a lecturer at Jerash Private University, Jordan, between 2003 and 2005. His main research interests include computer vision and pattern recognition techniques for image and video analysis with a focus on realistic scenarios within video surveillance.

Essam Al-Daoud received his BSc from Mu’tah University, MSc from Al al-bayt University and his PhD in computer science from University Putra Malaysia in 2002. He is currently an associate professor and the chairman of computer science department at Zarka University. His research interests include Machine Learning, Soft-computing, Cryptography, Bioinformatics, Quantum Cryptography, Quantum Computation, DNA Computing, Nano-Technology.

1 Introduction

Information security is attracting growing attention from experts in the community, especially with the large-scale worldwide penetration of e-based systems and e-information (Casey, 2010). Most information is now collected, processed and stored on electronic computers and transmitted across networks to other computers. Unfortunately, along with these advantages, the power of the internet is often exploited in various ways for illegal purposes. Clearly, email is the most common way to transmit information over the internet without authentication. Therefore, criminals, attackers and terrorists often use emails for their furtive communication. Spam emails, for instance, are no longer just unsolicited emails. Cyber criminals use spam to spread malware over the internet, to infect other people's computers and to entice people to phishing sites that steal vital personal information (Wei et al., 2008). Moreover, attacks may take the form of email abuse such as sexual harassment, phishing, transmitting worms, hoaxes, child pornography, forgery, email bombing and email viruses (Wong and Tian, 2012).

Email can be easily hacked, and the data about the sender enclosed in the header can be changed or forged. Moreover, the attacker or the real sender may modify the trail through which the email has passed; this defeats any effort to identify the location of the real sender. Furthermore, the sender may attempt to hide his/her true identity in order to avoid detection (De Vel et al., 2001). An email can be routed through several unidentified servers to bury any useful information about the sender of the email or its origin. In addition, an email could be sent from a public internet café to hide the location of the sender. Thus, criminals and terrorists often use emails for their communication. In this situation, email forensic analysis that examines the features of a malicious email is the only option for drawing conclusions on its authorship from a list of suspects.

Authorship identification is the process of examining the features of a malicious email in order to draw conclusions on its authorship from a list of suspects. Clearly, the digital investigator needs to gather several convincing clues from the malicious email and compare them with the possible suspects' writing styles. The extracted evidence could be used in the future to determine the likelihood of a specific suspect (McElroy and Seta, 2007; Okolica et al., 2008). One of the major difficulties facing email forensics is the large number of emails that need to be inspected (Iqbal et al., 2008). However, most of the time, the email structure is the only way to identify the author.

Early work on the problem of authorship in the context of email forensics was introduced by Gray et al. (1997), where a toolkit called IDENTIFIED was developed to assist with the automatic extraction of a wide variety of metrics. The metrics are used for software forensics and authorship analysis. In the work of De Vel (2000) and De Vel et al. (2001), different email features are derived, including structural characteristics and linguistic patterns. The Support Vector Machine (SVM) is employed as a learning engine. The main limitation of the above-mentioned algorithms is that the categorisation performance is not steady across all suspects. To overcome this limitation, a combination of features such as relative function word frequencies should be considered.

Determining the gender of a document's author is presented in the work of Koppel et al. (2002). A set of training documents and combinations of lexical and syntactic features are used to draw a linear separator between male and female writing styles. In the work of Abbasi and Chen (2008), a rich set of features is incorporated to develop a transform-based write-prints technique for similarity detection. The feature set was applied to multiple domains, including asynchronous and synchronous computer-mediated communication. Koppel et al. (2009) achieved trustworthy results with a large amount of training data, where each test document should be at least 5000–10,000 words long. One limitation of this technique from the viewpoint of email authorship content mining is that, in general, emails and online documents do not contain that much training text. A framework for identity tracking of online messages is developed by Zheng et al. (2006), where the experiments on English and Chinese languages showed that the SVM performs better than the decision tree and the neural network.


The latest research in the area of authorship attribution is introduced by Iqbal et al. (2008, 2010) and Abdallah et al. (2012). The write-print of every suspect is collected as combinations of features that occur frequently. The common frequent features are then filtered out to obtain a unique pattern for every suspect. The unknown email is converted into a feature vector and compared with the set of unique patterns to discover which one matches most closely. In the work of Iqbal et al. (2010), all the training emails are clustered by a set of stylometric features. Then, a unique writing style is extracted from each cluster. This technique is useful when no suspect list or training examples are known to the cybercrime investigator. Stylometry clustering is applied to categorise the main groups of stylistics belonging to different suspects. The problem with stylometry clustering is that the accuracy rate decreases as the number of candidate suspects increases.

Motivated by the need for better categorisation performance with an enhanced accuracy rate, we propose new features that demonstrate great improvement on the authorship verification problem. The comprehensive set of extracted features includes lexical, syntactic, structural and content-specific characteristics, and the author's positive/negative emotions. Our approach employs decision trees, SVM, random forest, functional tree, logistic, naive Bayes and AdaBoost as learning engines (Domingos and Pazzani, 1997; Hand and Yu, 2001; Diederich et al., 2000; Quinlan, 1986; Breiman, 2001; Tweedie et al., 1996; Witten and Frank, 2000). Extensive numerical experiments are performed to demonstrate the much improved performance of the proposed approach.

The remainder of this paper is organised as follows. In Section 2, we briefly review some background material and describe the authorship verification problem. In Section 3, we introduce the proposed approach and describe in detail the feature extraction and classification algorithms, present some experimental results, and show the robustness of the proposed approach against the number of suspects and the number of email messages per suspect.

2 Background

In this section, we formally define the authorship verification problem.

2.1 Authorship verification problem statement

The authorship verification problem is summarised in Figure 1, where the cyber forensics investigators receive a case from the court involving a malicious email with an unknown sender. The investigator needs to identify the author of the email from a group of suspects, where each suspect has a set of previous emails to be used in the training stage. The investigator captures the writing style of every suspect by extracting the frequent features of his/her writings (see Section 3.1). The extracted features are then used to generate a classification model (see Section 3.2). Finally, investigators extract all the possible features from the malicious email and feed them to the classifier model to identify the suspect whose writing style most closely matches the email under investigation.

Figure 1 Authorship features extraction process (see online version for colours)


3 Proposed approach

Authorship analysis includes authorship attribution, verification, profiling, and/or similarity detection. In the proposed approach, we extracted different types of features and employed several machine learning algorithms. Previous studies show that there is no predefined special feature set that can be used to determine the writing style (Iqbal et al., 2008). The writing style contains (a) lexical features, (b) syntactic features, (c) content-specific features, (d) structural features, and (e) idiosyncratic features. Selected combinations of these features need to be identified and extracted to verify the writing style. One of the main contributions of the paper is to extract a small number of effective features, which significantly decreases the training and classification processing time.

3.1 Features extraction

In this section, we describe the extraction process of the features that have the most impact on the email content mining problem. Word usage, selection of special characters, composition of sentences and the organisation of sentences into paragraphs are used in the work of Corney et al. (2002) and Iqbal et al. (2010) for defining the writing style. The total number of stylometric features used in the work of Abbasi and Chen (2008) exceeded 1000, and a total of 419 features are used in the work of Iqbal et al. (2010). Our experimental results show that only 16 features (see Table 1) need to be considered, which reduces the training time. The emotions that authors express in their emails and their distinct way of writing some phrases are the most important features for recognising the unique or near-unique writing style.

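Table 1 Summary of the most effective features for email authorship identification

Feature category | Feature | Feature type
Content features | Self reference | Words count
Content features | Social words | Words count
Content features | Overall-cognitive words | Words count
Content features | Articles (a, an, the) | Words count
Content features | Big words (more than 6 letters) | Words count
Lexical features | Maximum sentence length | Sentences length
Lexical features | Minimum sentence length | Sentences length
Lexical features | Average sentence length | Sentences length
Syntactic features | Ratio of dots | Occurrences of punctuation marks
Syntactic features | Ratio of commas | Occurrences of punctuation marks
Syntactic features | Ratio of other special characters | Occurrences of punctuation marks
Linguistic features | The complexity of the text | Linguistic complexity and readability
Linguistic features | Two word phrases frequency | Linguistic complexity and readability
Linguistic features | The readability of the text | Linguistic complexity and readability
Content features/Expressed emotions | Positive emotions | Words meanings
Content features/Expressed emotions | Negative emotions | Words meanings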
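The sentence-length and punctuation rows of Table 1 can be illustrated with a minimal Python sketch. It is not the authors' tool: the function names, the sentence-splitting rule and the choice of total character count as the denominator of each ratio are all assumptions made for illustration.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (a crude, assumed rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def lexical_syntactic_features(text):
    """Sentence-length statistics (in words) and punctuation ratios (per character)."""
    lengths = [len(s.split()) for s in split_sentences(text)] or [0]
    n_chars = max(len(text), 1)  # avoid division by zero on empty input
    # "Other special characters": anything that is not a letter, digit, space, dot or comma.
    other = sum(1 for c in text if not c.isalnum() and not c.isspace() and c not in ".,")
    return {
        "max_sentence_length": max(lengths),
        "min_sentence_length": min(lengths),
        "avg_sentence_length": sum(lengths) / len(lengths),
        "ratio_dots": text.count(".") / n_chars,
        "ratio_commas": text.count(",") / n_chars,
        "ratio_other_special": other / n_chars,
    }
```

For the text "Hi Bob, thanks. See you at noon!" the sketch finds two sentences of three and four words, one dot, one comma and one other special character.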

Our proposed text analysis tool extracts several effective features, as shown in Table 1. Self-reference is the way the subject speaks of himself/herself or itself (e.g. I, she, our and they). Social words show manners (please and thank you are examples of a polite tone). The emotions expressed by humans can be divided into two categories. Positive emotions state an objective to include: taking the whole into consideration, learning more viewpoints, cooperating more with others and making things better. Interest, laughter, empathy, enthusiasm and curiosity are examples of positive emotions. Negative emotions, on the other hand, state an objective to exclude: fearing the unknown and the actions of others, and controlling others or preventing them from hurting oneself. Fear, hatred, shame, blame, regret and resentment are examples of negative emotions. The experiments show that the way an author uses cognitive words (could, know, consider, cause) is one of the most effective features in detecting the writing style. Such words usually reflect the processes of awareness, judgement, thinking and reasoning. The lexical density is used to compute the text complexity, which captures how easy or difficult a text is to read or understand. The complexity factor is an important feature for recognising an author's writing. It measures the number of unique words within a given text using the following formula:

\[
\text{ComplexityFactor} = \frac{\text{Number of different words}}{\text{Total number of words}} \times 100
\]
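As a small illustration, the complexity factor can be computed as follows. The tokenisation rule (lower-cased runs of letters and apostrophes) is an assumption; the paper does not state how words are delimited.

```python
import re

def complexity_factor(text):
    """Percentage of distinct words among all words in the text."""
    words = re.findall(r"[a-z']+", text.lower())  # assumed tokenisation
    if not words:
        return 0.0
    return 100 * len(set(words)) / len(words)
```

For example, "the cat saw the dog" has five words, four of them distinct, which gives a complexity factor of 80.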

An email with a high complexity factor implies text that is harder to understand. Several measures can be used for the readability feature, including the Flesch, Dale-Chall and Gunning-fog formulas, or the Fry graph. The Gunning-fog formula is used to verify that an email can be read easily by the intended readers. The Gunning-fog index (Gunning, 1952) is calculated using:

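\[
\text{Readability} = 0.4 \times \left( \frac{\text{number of words}}{\text{number of sentences}} + 100 \times \frac{\text{complex words}}{\text{number of words}} \right)
\]

where the first ratio is the average sentence length.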

Complex words are words with three or more syllables. A Gunning-fog index of less than 12 is needed for an email to be considered easy to read. Figure 2 shows a snapshot of the feature extraction process of our text analysis tool. The email is selected from the Enron email corpus (Kaelbling, 2009), which is widely recognised in email forensics research.
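The readability feature can be sketched in Python as below. The sketch uses the standard form of the Gunning-fog index, 0.4 × (average sentence length + 100 × complex words / words). The syllable counter is a crude vowel-group heuristic introduced here as an assumption; a dictionary-based counter would be more accurate.

```python
import re

def count_syllables(word):
    """Crude estimate: one syllable per group of consecutive vowels (assumption)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text):
    """Gunning-fog index with 'complex' meaning three or more estimated syllables."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    complex_words = [w for w in words if count_syllables(w) >= 3]
    average_sentence_length = len(words) / len(sentences)
    return 0.4 * (average_sentence_length + 100 * len(complex_words) / len(words))
```

Because the heuristic can miscount syllables (silent vowels, for instance), the resulting index should be read as an approximation.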

As the features have different types and weights, we applied a normalisation formula to convert all the extracted features to balanced numerical values. Figure 3 depicts the numerical vectors for the extracted features shown in Figure 2.
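The normalisation formula itself is not given in the text. One common choice, shown here purely as an assumption, is min-max scaling of each feature to the range [0, 1] across all emails.

```python
def min_max_normalise(vectors):
    """Scale every feature column to [0, 1]; constant columns are mapped to 0.0."""
    columns = list(zip(*vectors))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in vectors
    ]
```

After such scaling, a feature measured in words (sentence length) and a feature measured as a ratio (commas per character) contribute on a comparable scale.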

The experimental results show that complexity, positive emotions, negative emotions, minimum sentence length and two-word phrases have a major impact on uniquely identifying the author's writing style.
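The word-count and emotion features can be sketched as simple lexicon lookups. The tiny word lists below contain only the examples mentioned in this section and are hypothetical stand-ins; a real tool would rely on much larger lexicons. The names LEXICONS and content_features are likewise introduced only for illustration.

```python
import re
from collections import Counter

# Hypothetical mini-lexicons built from the examples given in the text.
LEXICONS = {
    "self_reference": {"i", "she", "our", "they"},
    "social": {"please", "thank", "thanks"},
    "cognitive": {"could", "know", "consider", "cause"},
    "articles": {"a", "an", "the"},
    "positive_emotion": {"interest", "laughter", "empathy", "enthusiasm", "curiosity"},
    "negative_emotion": {"fear", "hatred", "shame", "blame", "regret", "resentment"},
}

def content_features(text):
    """Count lexicon hits, big words, and find the most frequent two-word phrase."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = {name: sum(w in lexicon for w in words) for name, lexicon in LEXICONS.items()}
    counts["big_words"] = sum(len(w) > 6 for w in words)  # more than 6 letters
    # Adjacent word pairs; pairs may span sentence boundaries in this simple sketch.
    bigrams = Counter(zip(words, words[1:]))
    counts["top_two_word_phrase"] = bigrams.most_common(1)[0][0] if bigrams else None
    return counts
```

For the text "Thank you. I know they could consider it, thank you." the sketch reports two self-reference words, two social words, three cognitive words, one big word, and ("thank", "you") as the most frequent two-word phrase.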


Figure 2 Features viewer editor (see online version for colours)

Figure 3 Numerical features vectors

3.2 Email classification

In this section, we provide the main steps of the proposed authorship identification algorithm. The email classification process starts by extracting several features from the training and testing email data sets. All samples of the two sets are labelled with a predefined known author. The extracted features of each email are represented as a normalised feature vector. The classification model is built using the vectors extracted from the training samples. Several machine learning classifiers are employed, including Decision trees, SVM, Random forest, Functional tree, Logistic, Naive Bayes and AdaBoost. The feature vectors of the testing emails are used to evaluate the accuracy of the proposed algorithm. Each email is assigned to one of the suspected authors using the machine learning classifiers.

The feature extraction and the email identification steps are shown in Algorithm 1. The algorithm presents our proposed authorship identification technique for attributing a malicious email to one of the predefined suspects. A set of training emails for all the possible suspects S1, S2, ..., Sk is necessary for the training process. The algorithm starts by extracting frequent effective features for each suspect. Then, a single normalised feature vector V is extracted for each suspect. The extracted Vi uniquely identifies the writing style of Si. All the extracted feature vectors are then fed to several supervised machine learning algorithms to create sophisticated classification models. Part 2 of Algorithm 1 is used to identify the author of the malicious email from S1, S2, ..., Sk. The algorithm extracts from the malicious email the same set of features that was extracted from the training emails. The features are then fed to the created classification models to identify the closest suspect writing style. The algorithm declines to classify if the identification score is less than a predefined threshold. The threshold is chosen carefully to minimise false alarms.

Figure 5 illustrates the identification process of the proposed technique in a block diagram. The correctly and incorrectly classified emails are used to compute the classification accuracy rate and the false acceptance rate. Figure 4 shows a simplified example of the developed J48 decision-tree classifier model.

3.3 Simulation set-up and experimental evaluation

To evaluate the authorship identification accuracy of our proposed algorithm, we performed several experiments on the Enron email data set (Kaelbling, 2009), which contains 150 users and roughly 500,000 emails. In our simulations, we used the Weka (Hall et al., 2009) data mining software, which is a collection of machine learning algorithms for data mining tasks. Weka contains tools for analysing data using several machine learning schemes. In order to determine the best performing classifier, two criteria are used to measure the performance of the extracted features with nine different classifiers. The first is to maximise the classification accuracy rate and minimise the number of misclassified instances. The second is to extract the most effective features and reduce the training time.

Figure 4 J48 Decision-tree classifier model, where X1, X2, X3, and X4 are potential suspect

Figure 5 Authorship identification process (see online version for colours)


Algorithm 1 Proposed authorship identification algorithm

Input: Sets of training emails {E_11, E_12, ..., E_1n}, {E_21, E_22, ..., E_2n}, ..., {E_k1, E_k2, ..., E_kn} previously written by potential suspects S_1, S_2, ..., S_k, respectively.
       A malicious email with an unknown sender.
Output: Identify the sender of the malicious email.

/* Training */
for each suspect S_i do
    for each email E_ij, j = 1, ..., n do
        Extract frequent effective features (Content, Lexical, Syntactic, and Linguistic features).
        Convert E_ij to a feature vector V_ij.
    end for
    Extract for S_i a single normalised feature vector V_i that uniquely identifies his/her writings.
    Feed the extracted feature vectors to several supervised machine learning algorithms.
end for
Several classification models are created.

/* Identify the author of the malicious email from S_1, S_2, ..., S_k */
Extract several effective features (Content, Lexical, Syntactic, and Linguistic features) from the malicious email.
Feed the extracted feature vector to the created classification model.
if identification score ≥ predefined threshold then
    Identify the suspect with the highest score.
else
    No classification: 'The email matches none of the suspects' writing styles'.
end if
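The two parts of Algorithm 1 can be sketched in Python as follows. This is a deliberately simplified stand-in, not the method evaluated in the paper: instead of the Weka classifiers, it builds one mean feature vector per suspect and scores the malicious email by cosine similarity. The function names, the similarity measure and the threshold value of 0.9 are all assumptions.

```python
import math

def mean_vector(vectors):
    """Average the per-email feature vectors of one suspect into a single profile."""
    return [sum(column) / len(column) for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity, used here as the identification score (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def train(training_vectors):
    """Training part: map {suspect: [feature vector per email]} to {suspect: profile}."""
    return {suspect: mean_vector(vs) for suspect, vs in training_vectors.items()}

def identify(profiles, email_vector, threshold=0.9):
    """Identification part: best-scoring suspect, or None when below the threshold."""
    scores = {s: cosine(p, email_vector) for s, p in profiles.items()}
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return best, scores[best]
    return None, scores[best]  # the email matches none of the suspects' styles
```

A call such as identify(train({"S1": [...], "S2": [...]}), vector) returns the closest suspect, while a high threshold makes the sketch refuse to classify, mirroring the final branch of the algorithm.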

A performance comparison between different classifiers is made on two different levels of tests. In the first test, we applied our feature extraction tool to ten emails randomly selected for two suspects to show the way we extracted the suspects' writing styles. The second test is used to analyse the performance of the proposed email authorship identification system. Subsets of four senders are randomly selected from the original data set. We show our experiments on the selected senders, where no restrictions on the recipients have been considered. Although in most of the research in the literature the emails are manually filtered to have a common format, we reduced the manual filtering to a minimum. In order to increase the reliability of the presented results, each experiment is repeated five times independently and the average values are reported. Moreover, we used several testing options, including the percentage split, the supplied test set and cross-validation with different numbers of folds.

In the first experiment, we applied our feature extraction tool to ten random emails selected from two arbitrary suspects. Figures 6–9 depict the extracted results of four different features. Figure 6 shows clearly that positive emotion appears in nine samples written by suspect 1. However, it appears only twice in the emails of suspect 2. Hence, the positive emotions feature is set for suspect 1 and becomes part of his writing style. Similarly, from Figure 7, the negative emotions feature is set for suspect 2. From Figures 8 and 9, the complexity and the two-word phrases features are set for suspect 2 and suspect 1, respectively. Obviously, it is possible to set the same feature for several suspects.

Figure 6 Positive emotions feature vs. ten random emails for two suspects from Enron email data set

Figure 7 Negative emotions feature vs. ten random emails for two suspects from Enron email data set

Figure 8 Email complexity feature vs. ten random emails for two suspects from Enron email data set


Figure 9 Two words phrases feature vs. ten random emails for two suspects from Enron email data set

The second experiment is used to analyse the performance of the proposed email authorship identification system using nine different classifiers over four randomly selected suspects (X1, X2, X3 and X4) of the Enron email data set. For each suspect Xi, we applied our feature extraction process to his/her emails. With the percentage split testing option, the emails are randomly partitioned into 60% for training and 40% for testing. In the training phase, the training emails are fed to the machine learning algorithms to extract a unique writing style for each suspect. In the testing phase, the testing emails are fed to the classification model to identify the sender of each email in the testing set. The authorship classification accuracy is calculated as the percentage of correctly classified emails in the testing set. The obtained results for X1 are shown in Figure 10a. Figures 10b–10d show the calculated results for suspects X2, X3 and X4, respectively. Clearly, the results indicate that the Simple Logistic and Functional Tree classifiers achieve the best classification accuracy for all suspects.
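The percentage split and the accuracy measure described above can be sketched in a few lines of Python. The experiments themselves were run with Weka's percentage split option; the helper names and the fixed random seed here are assumptions for illustration.

```python
import random

def percentage_split(samples, train_fraction=0.6, seed=0):
    """Randomly partition samples into a training part and a testing part."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(true_authors, predicted_authors):
    """Percentage of test emails whose predicted author equals the true author."""
    correct = sum(t == p for t, p in zip(true_authors, predicted_authors))
    return 100.0 * correct / len(true_authors)
```

With ten emails the split yields six for training and four for testing, and three correct predictions out of four test emails give an accuracy of 75%.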

The third experiment is used to analyse the performance of the proposed authorship identification system over the same four randomly selected suspects used in the second experiment. However, in this experiment we used the training set as the testing option. The obtained results for X1 are shown in Figure 11a. Figures 11b–11d depict the calculated results for suspects X2, X3 and X4, respectively. Clearly, a 100% accuracy rate is achieved for most classification models and for all suspects. These results are expected because of the testing method: all emails are used for training and the same set of emails is then used for testing, so the reported accuracy is optimistic.

The fourth experiment is used to analyse the performance using stratified cross-validation with a fixed number of folds as the testing option. In order to get statistically meaningful results, we used tenfold cross-validation. In tenfold cross-validation the emails are partitioned into ten folds, and in each round the classifier is trained on nine folds and tested against the remaining fold. The obtained results are shown in Figures 12a–12d for suspects X1, X2, X3 and X4, respectively. The results indicate that the classification accuracy obtained using Simple Logistic and AdaBoost is comparatively better than the recognition accuracy achieved using the Naive Bayes, Random Forest and Decision Tree classifiers.
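Stratified k-fold cross-validation, in its usual definition, can be sketched as below: the emails of every author are spread as evenly as possible over k folds, and each fold serves once as the test set while the remaining folds are used for training. The function names and the seed are assumptions; the experiments themselves relied on Weka's built-in cross-validation.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Return k lists of sample indices with each author spread evenly across folds."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        for index in indices:  # deal indices round-robin over the folds
            folds[position % k].append(index)
            position += 1
    return folds

def cross_validation_rounds(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) once per fold."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test
```

The accuracy reported for a classifier is then the average of its accuracy over the k rounds.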

Figure 10 Accuracy vs. classification model using the percentage-split testing option: (a) suspect 1, (b) suspect 2, (c) suspect 3, and (d) suspect 4. Emails are randomly partitioned into 60% for training and 40% for testing


Figure 11 Accuracy vs. classification model using the training-set testing option: (a) suspect 1, (b) suspect 2, (c) suspect 3, and (d) suspect 4

Figure 12 Accuracy vs. classification model using the tenfold cross-validation testing option: (a) suspect 1, (b) suspect 2, (c) suspect 3, and (d) suspect 4


The new features that we employed in the proposed scheme improve the average authorship identification accuracy. We obtained an accuracy rate of 80% to 90% for all four randomly selected suspects, as depicted in the figures. Hence, the new features appear very promising for the email identification problem. The results remain consistent even when the number of candidate senders is increased to 50. Moreover, the experiments show that the Simple Logistic classifier provides comparatively better results than the other classifiers. Simple Logistic performs well on the authorship identification problem because of the statistical procedures it uses to estimate the associations among variables. It belongs to the family of regression techniques for modelling and analysing variables, where the focus is on the relationship between a dependent variable and the other variables; it helps to determine how the value of the dependent variable changes when any one of the other variables is varied. Simple Logistic offers better results when it chooses from a limited number of possible classes, which is the case for the authorship identification problem. Suspect 4 sometimes shows lower classification results than the other suspects; this is due to the limited number of training emails available for this suspect. Nevertheless, even with only five training emails for suspect 4, our method was able to extract a unique writing style from his/her emails.
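For reference, the relationship between the features and the predicted sender in a logistic model can be written in the generic multinomial form below. This is the standard textbook form, given here as an illustration; the symbols (J candidate senders, m features x_1, ..., x_m and coefficients β) are assumptions and are not notation defined in this paper.

```latex
P(y = j \mid \mathbf{x}) = \frac{\exp\bigl(F_j(\mathbf{x})\bigr)}{\sum_{k=1}^{J} \exp\bigl(F_k(\mathbf{x})\bigr)},
\qquad
F_j(\mathbf{x}) = \beta_{j0} + \sum_{i=1}^{m} \beta_{ji}\, x_i
```

The email is attributed to the sender j with the largest posterior probability, and each coefficient β_{ji} indicates how the score of sender j changes when feature x_i is varied.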

4 Conclusion

In this paper, we presented a computationally inexpensive investigation analysis tool for authorship identification of anonymous emails. The key idea is to extract selected effective features from the suspects’ previous writings and use them in the learning process. Different email features are derived, including structural characteristics and linguistic patterns. Several learning algorithms are employed to build classification models. To evaluate the effectiveness of the proposed analysis tool, we conducted several experiments on a real email data set. The results showed that the authors can be identified with a very limited number of features. For future work, we plan to analyse how the number of features used in the extraction process affects performance, to identify optimal two-word phrases, and to modify the learning engine, with the aim of further improving the classification performance in the context of email forensics.

References

Abbasi, A. and Chen, H. (2008) ‘Writeprints: a stylometric approach to identity-level identification and similarity detection in cyberspace’, ACM Transactions on Information Systems, Vol. 26, No. 2, pp.1–29.

Abdallah, E., Otoom, A., Saqer, A., Aisheh, O., Omari, D. and Salem, G. (2012) ‘Detecting email forgery using random forests and naive Bayes classifiers’, Proceeding of International Conference on Computer and Software Engineering (ICCSE), Spain.

Breiman, L. (2001) ‘Random forests’, Machine Learning, Vol. 45, No. 1, pp.5–32.

Casey, E. (2010) Handbook of Digital Forensics and Investigation, Elsevier.

Corney, M., De Vel, O., Anderson, A. and Mohay, G. (2002) ‘Gender-preferential text mining of e-mail discourse’, Proceeding of 18th Annual Computer Security Applications Conference, pp.21–27.

De Vel, O. (2000) ‘Mining email authorship’, Proceeding of Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining, Boston, Massachusetts, USA.

De Vel, O., Anderson, A., Corney, M. and Mohay, G. (2001) ‘Mining email content for author identification forensics’, SIGMOD Record, Vol. 30, No. 4, pp.55–64.

Diederich, J., Kindermann, J., Leopold, E. and Paass, G. (2000) ‘Authorship attribution with support vector machines’, Applied Intelligence, Vol. 19, No. 1, pp.109–123.

Domingos, P. and Pazzani, M. (1997) ‘On the optimality of the simple Bayesian classifier under zero-one loss’, Machine Learning, Vol. 29, Nos. 2/3, pp.103–130.

Gray, A., Sallis, P. and MacDonell, S. (1997) ‘Software forensics: extending authorship analysis techniques to computer programs’, Proceeding of 3rd Biannual Conference International Association of Forensic Linguists, Durham, NC.

Gunning, R. (1952) The Technique of Clear Writing, New York, McGraw-Hill International Book.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. and Witten, I.H. (2009) ‘The WEKA data mining software: an update’, SIGKDD Explorations, Vol. 11, No. 1, pp.10–18.

Hand, D. and Yu, K. (2001) ‘Idiot’s Bayes – not so stupid after all?’, International Statistical Review, Vol. 69, No. 3, pp.385–399.

Iqbal, F., Binsalleeh, H., Fung, B. and Debbabi, M. (2010) ‘Mining writeprints from anonymous emails for forensic investigation’, Digital Investigation, Vol. 7, Nos. 1/2, pp.56–64.

Iqbal, F., Hadjidj, F., Fung, B. and Debbabi, M. (2008) ‘A novel approach of mining write-prints for authorship attribution in email forensics’, Digital Investigation, Vol. 5, No. 1, pp.42–51.

Kaelbling, L. (2009) Enron email dataset, CALO Project. Available online at: http://www.cs.cmu.edu/enron/

Koppel, M., Argamon, S. and Shimoni, A. (2002) ‘Automatically categorizing written texts by author gender’, Literary and Linguistic Computing, Vol. 17, No. 4, pp.401–412.

Koppel, M., Schler, J. and Argamon, S. (2009) ‘Computational methods in authorship attribution’, Journal of the American Society for Information Science and Technology, Vol. 60, No. 1, pp.9–26.

McElroy, T. and Seta, J. (2007) ‘Framing the frame: how task goals determine the likelihood and direction of framing effects’, Judgment and Decision Making, Vol. 2, No. 4, pp.251–256.

Okolica, J., Peterson, G. and Mills, R. (2008) ‘Using PLSI-U to detect insider threats by data mining e-mail’, International Journal of Security and Networks, Vol. 3, No. 2, pp.114–121.

Quinlan, J. (1986) ‘Induction of decision trees’, Machine Learning, Vol. 1, No. 1, pp.81–106.

Tweedie, F., Singh, S. and Holmes, D. (1996) ‘Neural network applications in stylometry: the federalist papers’, Computers and the Humanities, Vol. 30, No. 1, pp.1–10.

Wei, C., Sprague, A., Skjellum, A. and Warner, G. (2008) ‘Mining spam email to identify common origins for forensic application’, Proceeding of ACM Symposium on Applied Computing, pp.1433–1437.

Witten, I. and Frank, E. (2000) Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, San Francisco, CA.

Wong, D. and Tian, X. (2012) ‘E-mail protocols with perfect forward secrecy’, International Journal of Security and Networks, Vol. 7, No. 1, pp.1–5.

Zheng, R., Li, J., Chen, H. and Huang, Z. (2006) ‘A framework for authorship identification of online messages: writing-style features and classification techniques’, Journal of the American Society for Information Science and Technology, Vol. 57, No. 3, pp.378–393.