
Spam Campaign Detection, Analysis, and Formalization

Thèse

Mina Sheikhalishahi

Doctorat en informatique
Philosophiæ doctor (Ph.D.)

Québec, Canada

© Mina Sheikhalishahi, 2016


Spam Campaign Detection, Analysis, and Formalization

Thèse

Mina Sheikhalishahi

Sous la direction de:

Directeur de recherche: Mohamed Mejri
Codirectrice de recherche: Nadia Tawbi


Résumé

Les courriels spams (courriels indésirables ou pourriels) imposent des coûts annuels extrêmement lourds en termes de temps, d’espace de stockage et d’argent aux utilisateurs privés et aux entreprises. Afin de lutter efficacement contre le problème des spams, il ne suffit pas d’arrêter les messages de spam qui sont livrés à la boîte de réception de l’utilisateur. Il est obligatoire soit d’essayer de trouver et de poursuivre les spammeurs, qui généralement se cachent derrière des réseaux complexes de dispositifs infectés, soit d’analyser le comportement des spammeurs afin de trouver des stratégies de défense appropriées. Cependant, une telle tâche est difficile en raison des techniques de camouflage, ce qui nécessite une analyse manuelle des spams corrélés pour trouver les spammeurs.

Pour faciliter une telle analyse, qui doit être effectuée sur de grandes quantités de courriels non classés, nous proposons une méthodologie de regroupement catégorique, nommée CCTree, permettant de diviser un grand volume de spams en campagnes, et ce, en se basant sur leur similarité structurale. Nous montrons l’efficacité et l’efficience de l’algorithme de clustering proposé à travers plusieurs expériences. Ensuite, une approche d’auto-apprentissage est proposée pour étiqueter les campagnes de spam en se basant sur le but du spammeur, par exemple le phishing. Les campagnes de spam étiquetées sont utilisées afin d’entraîner un classificateur, qui peut être appliqué à la classification des nouveaux courriels de spam. En outre, les campagnes étiquetées, avec un ensemble de quatre autres critères de classement, sont ordonnées selon les priorités des enquêteurs.

Finalement, une structure basée sur les semirings est proposée pour la représentation abstraite du CCTree. Le schéma abstrait du CCTree, nommé terme CCTree, est appliqué pour formaliser la parallélisation du CCTree. Grâce à un certain nombre d’analyses mathématiques et de résultats expérimentaux, nous montrons l’efficience et l’efficacité du cadre proposé.


Abstract

Spam emails yearly impose extremely heavy costs, in terms of time, storage space, and money, on both private users and companies. To effectively fight the problem of spam, it is not enough to stop spam messages from being delivered to the end user's inbox or collected in the spam box. It is mandatory either to try to find and prosecute the spammers, who generally hide behind complex networks of infected devices that send spam emails against their users' will, i.e. botnets; or to analyze the spammers' behavior in order to find appropriate counter-strategies. However, such a task is difficult due to camouflage techniques, which make a manual analysis of correlated spam emails necessary to find the spammers.

To facilitate such an analysis, which must be performed on large amounts of unclassified raw emails, we propose a categorical clustering methodology, named CCTree, to divide large amounts of spam emails into spam campaigns by structural similarity. We show the effectiveness and efficiency of our proposed clustering algorithm through several experiments. Afterwards, a self-learning approach is proposed to label spam campaigns based on the goal of the spammer, e.g. phishing. The labeled spam campaigns are used to train a classifier, which can be applied to classify new spam emails. Furthermore, the labeled campaigns, together with a set of four more ranking features, are ordered according to the investigators' priorities.

A semiring-based structure is proposed to abstract the CCTree representation. Through several theorems we show that, under some conditions, the proposed approach fully abstracts the tree representation. The abstract schema of a CCTree, named CCTree term, is applied to formalize CCTree parallelism. Through a number of mathematical analyses and experimental results, we show the efficiency and effectiveness of our proposed framework as an automatic tool for spam campaign detection, labeling, ranking, and formalization.


Table des matières

Résumé

Abstract

Table des matières

Liste des tableaux

Liste des figures

Remerciements

1 Introduction
    1.1 Motivation
    1.2 Main Contributions
    1.3 Thesis Outline

2 State of the Art
    2.1 Spam Emails Issues
    2.2 Clustering Spam Emails into Campaigns
    2.3 Labeling and Ranking Spam Campaigns
    2.4 On the Formalization of Clustering and its Applications

3 Spam Campaign Detection
    3.1 Introduction
    3.2 Preliminary Notions
    3.3 Related Works
    3.4 Categorical Clustering Tree (CCTree)
    3.5 Time Complexity
    3.6 Conclusion

4 Effectiveness and Efficiency of CCTree in Spam Campaign Detection
    4.1 Introduction
    4.2 Framework
    4.3 Evaluation and Results
    4.4 Discussion and Comparisons
    4.5 Related Work
    4.6 Conclusion

5 Labeling and Ranking Spam Campaigns
    5.1 Introduction
    5.2 Related Work
    5.3 Digital Waste Sorting
    5.4 Results
    5.5 Ranking Spam Campaigns
    5.6 Conclusion

6 Algebraic Formalization of CCTree
    6.1 Introduction
    6.2 Related Work
    6.3 Feature-Cluster Algebra
    6.4 Feature-Cluster (Family) Term Abstraction
    6.5 Relations on Feature-Cluster Algebra
    6.6 CCTrees Parallelism
    6.7 Conclusion

7 Conclusions and Future Work
    7.1 Thesis Summary
    7.2 Future Work

A Appendix
    A.1 Source Codes of Proposed Approach
    A.2 Tables of Attributes

Bibliographie


Liste des tableaux

4.1 Features extracted from each email.
4.2 CCTree internal evaluation with fixed number of elements.
4.3 Internal evaluation results of CCTree, COBWEB and CLOPE.
4.4 Silhouette values and number of clusters in function of µ for four email datasets.
4.5 Silhouette result, Hamming distance, ε = 0.001, and µ changes.
4.6 Number of clusters, ε = 0.001, and µ changes.
4.7 External evaluation results of CCTree, COBWEB and CLOPE.
4.8 Campaigns on the February 2015 dataset from five clustering methodologies.

5.1 Features extracted from each email.
5.2 Feature vectors of a spam email for each class.
5.3 Classification results evaluated with K-fold validation on training set.
5.4 Classification results evaluated on test set.
5.5 Training set generated from small knowledge.
5.6 DWS classification results for the labeled spam campaigns.
5.7 Set of ranking features.
5.8 Normalized score of spam campaign labels.
5.9 Three first ranked campaigns.

6.1 CCTree Rewriting System.
6.2 Composition Rewriting System.

7.1 Table of Notations.

A.1 Language of spam message and subject.
A.2 Type of attachment.
A.3 Attachment size.
A.4 Number of attachments.
A.5 Average size of attachments.
A.6 Type of message.
A.7 Length of message.
A.8 IP-based links verification.
A.9 Mismatch links.
A.10 Number of links.
A.11 Number of domains.
A.12 Average number of dots in links.
A.13 Hex characters in links.
A.14 Words in subject.


A.15 Characters in subject.
A.16 Non-ASCII characters in subject.
A.17 Recipients of spam email.
A.18 Images in spam messages.


Liste des figures

1.1 Steady volume of spam.
1.2 McAfee Report 2015.
1.3 The framework of the thesis.

3.1 dataset 1
3.2 dataset 2
3.3 Spam 1
3.4 Spam 2
3.5 A Small CCTree

4.1 CCTree(0.001,1)
4.2 CCTree(0.01,1)
4.3 CCTree(0.1,1)
4.4 CCTree(0.5,1)
4.5 Internal evaluation at the variation of the ε parameter.
4.6 COBWEB
4.7 CCTree(0.001,1)
4.8 CCTree(0.001,10)
4.9 CCTree(0.001,100)
4.10 CCTree(0.001,1000)
4.11 CLOPE
4.12 Silhouette in function of the number of clusters for different values of µ.
4.13 Silhouette (Hamming).
4.14 Generated Clusters.
4.15 Silhouette (Hamming).
4.16 Generated Clusters.

5.1 Advertisement
5.2 Portal
5.3 Fraud
5.4 Malware
5.5 Crypto ransomware volume
5.6 Phishing
5.7 DWS Workflow.
5.8 Insert new instance X in a CCTree
5.9 ROC curve / Advertisement
5.10 ROC curve / Portal Redirection
5.11 ROC curve / Fraud


5.12 ROC curve / Malware
5.13 ROC curve / Phishing

6.1 A Small CCTree
6.2 Parallel Clustering Workflow.


To my love, my family
and
To anyone who looks for
worldwide peace and happiness


Remerciements

Though only my name appears on the cover of this dissertation, a great many people have contributed to its production. I owe my gratitude to all those people who have made this dissertation possible.

First and foremost, I want to thank my supervisor, Professor Mohamed Mejri, for accepting me in his research group, which improved my view of life. I appreciate all his contributions of time, ideas, patience, and funding to make my Ph.D. experience productive and stimulating. Thanks for allowing me to grow as a research scientist, and for all his patience and support. I would also like to express my deep thanks to my co-advisor, Professor Nadia Tawbi, who has always been there to listen and give advice. Thanks to her for all her kind moral and financial support and for helpful discussions at different stages of my Ph.D. studies. I gratefully acknowledge her support for my cooperation with the IIT-CNR research group, which changed my life. I really appreciate the insightful comments and constructive criticisms of my advisor and co-advisor at different stages of my research, and their encouragement of correct grammar and consistent notation in my writing.

Besides my advisors, I would like to thank the rest of my thesis committee: Prof. Fabio Martinelli, Prof. Raphael Khoury, and Dr. Ilaria Matteucci, for their insightful comments and encouragement. Special thanks to Professor Fabio Martinelli for accepting me to join his research group at IIT-CNR, Italy, which enriched my research experience.

My time in Quebec was made enjoyable in large part due to the many friends that became part of my life. I am grateful to my dearest Shadi, who supported me continuously during the three years of my stay in Quebec. With her presence in Quebec, I always felt I had a family member who took care of me. To my kind friend Bahareh, whom I bothered several times from Italy to do something in Quebec in my place. Thanks to my other kind friends in Quebec: Elaheh, Afrooz, Sheyda, Soamyeh.

I am especially grateful to my best friend Sara, who was always available from Iran, in the very difficult moments of my Ph.D., to send me messages, to support, encourage, and motivate me. She was always there to hear me, despite the different time zones of Iran and Canada. I will


always appreciate all her kind, continuous support.

Many thanks to my other friends from Iran: Mahboobeh, for continuously remembering and praying for me, and Mahmoud, for always following my weblog and motivating me.

I would like to deeply thank my family for all their love and encouragement. To my father, who always motivated us to read, to know, to follow our dreams, and who always loves us as we are. To my mother, who finally accepted my travel to Canada although she was never convinced, for all the worries she endured during my Ph.D., and for all her patience when I was following my dreams, even against her own. Thanks to my dearest sister, Mojgan, who was my link to Iran. She always followed up on whatever I needed done in Iran, and always motivated me with her typically sweet words. Many thanks to my brother, Mohammad, who always supported me in all my pursuits, and of whom we are always proud. I am also grateful to Hamed, my brother-in-law, who called me many times from Iran to tell me we all love you and miss you. To my kindest aunt, Azra, who always teaches us that you can still smile when life is passing through its most difficult and challenging stage.

Most of all, I would like to give my deep gratitude to my colleague, my friend, my love, Andrea, who cleared many of the obstacles that I faced along my Ph.D. path, and who generously, from the first moments of my arrival in Italy, taught me from his experience of research. Many thanks for all his faithful support, patience, and encouragement during the difficult stages of my Ph.D. thesis. Thanks for his presence in my life, for all the happiness he brought with him, and for making me feel that I am able to make all my dreams come true.

Mina Sheikhalishahi
Laval University
Quebec, Canada


Chapitre 1

Introduction

The term spam became well known from a sketch of the comedy program “Monty Python’s Flying Circus”, in which a waitress proposes dishes containing an ingredient called spam, the brand name of a canned meat produced by the American Hormel Foods Corporation. In the sketch, all the dishes in the restaurant are served with lots of spam, and the waitress repeats the word spam several times in describing how much spam is in the plates. A group of “Vikings” in the corner then starts a song: “Spam, spam, spam, spam, spam, spam, spam, spam, lovely spam! Wonderful spam!” Hence, the term came to refer to something that keeps repeating and repeating to great annoyance 1. Due to the success of this program, and probably since the canned meat constituted the only nutritious food available in England during the Second World War, the term “SPAM” came to indicate something inevitably omnipresent.

The name was later applied to unwanted electronic messages. The first spam email is believed to have been sent on 1 May 1978 by Digital Equipment Corporation to advertise a new product; it was sent to all ARPAnet users of the West Coast of the United States, a few hundred people 2.

Many years later, dating back to January 1994 3, the first large-scale unwanted commercial message was distributed across USENET, titled “Global Alert for All: Jesus is Coming Soon”. It was posted to every newsgroup, exemplifying unwanted messages sent massively to unwilling recipients.

More precise definitions of spam email were introduced later in the literature. [8] define spam email, also known as junk email or unsolicited bulk email, as an electronic message, sent in bulk, against the will of the receiver. [83] define spam email as an unwanted email, sent indiscriminately by a sender who has no current relationship with the receiver.

Nowadays, spam emails are not just undesired advertisements. The problem of unsolicited

1. http://www.internetsociety.org/
2. www.templetons.com/brad/spamreact.html
3. www.wired.com


emails causes incredibly huge costs to companies and private users [113], [83], [84]. Current approaches [30], [46], [123], though quite effective in stopping spam emails from being delivered to end users' inboxes [21], [89], do not propose a methodology to organize huge amounts of messages so as to be able to fight the root of the problem, i.e. the spammer.

Any effort in this regard requires a first analysis of large amounts of spam emails, mostly collected in honeypots. This first analysis demands grouping a huge amount of data into smaller groups, named spam campaigns, which are supposed to originate from the same source (spammer). Then, it is required to train a classifier to label and group new spam emails. Furthermore, the large set of detected spam campaigns should be ordered automatically, based on the investigators' priorities.

Figure 1.1 – Steady volume of spam.

To this end, in the present thesis, we first propose a fast and effective categorical clustering algorithm, named CCTree, to detect spam campaigns on the basis of the structural similarity of messages. Afterwards, we propose a self-learning methodology to automatically label detected spam campaigns based on the goal of the spammer. The labeled campaigns are ranked automatically considering a set of ranking priorities. A semiring-based formal method is proposed to abstract the CCTree representation. The abstract form is used to formalize the process of clustering spam emails on parallel computers, which may help to speed up the process of spam campaign detection.


1.1 Motivation

Being incredibly cheap to send, spam messages are vastly used by adversaries to steal money, distribute malware, advertise goods and/or services, etc. The Cisco 2015 Report [36] shows that although adversaries develop more sophisticated techniques to breach network defenses, spam emails still play a major role in these attacks, and the worldwide volume of spam has remained relatively consistent (Figure 1.1). Furthermore, it has been shown [36] that 4.5 billion emails get blocked every day. The Internet Threats Trend Report [114] estimates that 54 billion spam emails were sent per day in 2014. According to the McAfee 2015 Report [100], unsolicited emails constituted more than 70 percent of the total amount of email messages in 2014 (Figure 1.2).

Figure 1.2 – McAfee Report 2015.

Microsoft and Google [113] estimate that spam emails cost American firms and consumers up to 20 billion dollars per year. Ferris Research estimated the worldwide cost of spam in 2005 at $50 billion, and raised its estimate to $100 billion in 2007 and $130 billion in 2009 4, [112]. [83] report that 382 million mailing attempts resulted in 28 sales. Yahoo! data on similar “high ticket” items, sold with a marginal profit of more than $50, shows conversion rates of about 1 in 25,000 [112].

4. www.email-museum.com/


The problem of undesired electronic messages has become a serious issue, due to the many troubles spam causes to the Internet community. [5] categorize spam losses into three groups, named direct losses, indirect losses, and defense costs, and call the sum of these losses the society losses of spam. In what follows, the sets of society losses proposed in [5] are listed:

Direct losses by spam:

• “Money withdrawn from victim accounts
• Time and effort to reset account credentials (for banks and consumers)
• Distress suffered by victims
• Secondary costs of overdrawn accounts: deferred purchases, inconvenience of not having access to money when needed
• Lost attention and bandwidth caused by spam messages, even if they are not reacted to.”

Indirect losses by spam:

• “Loss of trust in online banking, leading to reduced revenues from electronic transaction fees, and higher costs for maintaining branch staff and cheque clearing facilities
• Missed business opportunity for banks to communicate with their customers by email
• Reduced uptake by citizens of electronic services as a result of lessened trust in online transactions
• Efforts to clean up PCs infected with malware for a spam-sending botnet”

Defense costs of spam:

• “Security products such as spam filters, antivirus, and browser extensions to protect users
• Security services provided to individuals, such as training and awareness measures
• Security services provided to industry, such as website take-down services
• Fraud detection, tracking, and recuperation efforts
• Law enforcement
• The inconvenience of missing messages falsely classified as spam”

Considering that the large amount of spam traffic among servers delays the delivery of legitimate emails, that sorting out unsolicited messages takes time, and that the process of classifying messages into spam and legitimate carries the risk of deleting an important email by mistake, the problems resulting from spam emails create an unbearable situation for everyone who uses the Internet.

To get better insight into the direct and indirect losses of spam, we briefly present some reports here. Microsoft and Google [113] estimate that spam emails cost American firms and consumers up to 20 billion dollars per year, whilst [83], [84] show that a successful spam campaign can earn revenues between $400k and $1000k. [133] estimated that the Cutwail botnet, by providing spam


services, earns around $1.7 million to $4.2 million in one year. It has been calculated that a company with 1000 employees loses $500,000 per year in productivity costs resulting from spam messages 5.

The most popular solution to the problem of spam is filtering [21]. Spam filtering can be defined as a methodology to divide messages into spam and legitimate [21]. Currently, the most used approach for fighting spam emails consists in identifying and blocking them on the recipient machine through filters [30], [46], [123], which are generally based on machine learning techniques or content features [22], [138], [139].
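A content-based filter of the kind cited above can be sketched, in toy form, as a naive Bayes token classifier. The sketch below is purely illustrative (a minimal stand-in with add-one smoothing and a uniform class prior, not the implementation of any referenced filter):

```python
# Toy content-based spam filter: naive Bayes over word tokens.
# Illustrative only -- not the implementation of any filter cited above.
from collections import Counter
import math

def train(spam_docs, ham_docs):
    """Count word occurrences in spam and legitimate (ham) training sets."""
    spam = Counter(w for d in spam_docs for w in d.lower().split())
    ham = Counter(w for d in ham_docs for w in d.lower().split())
    vocab = set(spam) | set(ham)
    return spam, ham, vocab

def is_spam(text, spam, ham, vocab):
    """Classify by the sign of the smoothed log-odds (uniform class prior)."""
    n_spam, n_ham = sum(spam.values()), sum(ham.values())
    score = 0.0
    for w in text.lower().split():
        p_spam = (spam[w] + 1) / (n_spam + len(vocab))  # Laplace smoothing
        p_ham = (ham[w] + 1) / (n_ham + len(vocab))
        score += math.log(p_spam / p_ham)
    return score > 0
```

Trained on a handful of toy spam and legitimate messages, the classifier scores a new message by how much its words favor the spam corpus over the ham corpus.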

Although existing filtering algorithms often show an accuracy of more than 90% in experimental evaluations [21], [89], this does not stop spammers from imposing considerable costs on users and companies [113]. We believe the reason could be that the spammer, the root of the problem, feels minimal risk of being caught or followed.

To effectively fight the problem of spam emails, it is mandatory to find and prosecute the spammers, who generally hide behind complex networks of infected devices that send spam emails against their users' will, i.e. botnets. Due to botnets, identifying the spammer is a difficult, though possible, task [142], [149], [45].

To simplify this analysis, first of all, the huge amount of spam emails needs to be divided into spam campaigns. A spam campaign is the set of messages spread by a spammer with a specific purpose [27], like advertising a specific product, spreading ideas, or criminal intents, e.g. phishing. Grouping spam messages into spam campaigns reveals behaviors that may be difficult to infer when we look at a large collection of spam emails as a whole [132]. It is noteworthy that the problem of grouping a large amount of spam emails into smaller groups is an unsupervised learning task, since no labeled data is available for training a classifier in the beginning. The proposed approach for clustering spam messages is based on the premise that the general appearance of messages belonging to the same spam campaign mainly remains unchanged, even though spammers usually insert random text or links [27]. The rationale behind this approach is that two messages in the same format, i.e. similar language, size, number of attachments, amount of links, etc., are more likely to originate from the same source (spammer), thus belonging to the same campaign. Hence, the discriminative structural features of messages must be selected correctly. Furthermore, the clustering algorithm should be quite fast and effective in grouping junk emails into spam campaigns.
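As a rough illustration of the kind of structural features discussed above, the sketch below parses a raw email and derives a few categorical attributes (message length, number of links, number of attachments, subject length). The feature names and bucket boundaries are hypothetical stand-ins, not the features actually used by the thesis framework:

```python
# Illustrative extraction of categorical structural features from a raw
# email. Feature names and bucket boundaries are hypothetical, not the
# feature set used in the thesis.
import email
import re
from email import policy

def structural_features(raw: str) -> dict:
    msg = email.message_from_string(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("html", "plain"))
    text = body.get_content() if body is not None else ""
    links = re.findall(r"https?://[^\s\"'<>]+", text)
    n_attach = sum(1 for _ in msg.iter_attachments())
    # Categorical clustering works on categorical attributes, so numeric
    # measures are discretized into coarse buckets.
    return {
        "message_length": "short" if len(text) < 500 else "long",
        "num_links": "none" if not links else ("few" if len(links) <= 3 else "many"),
        "num_attachments": str(n_attach),
        "subject_words": str(len((msg["subject"] or "").split())),
    }
```

Two spam messages mapping to the same feature vector are candidates for the same campaign, which is exactly the similarity notion the clustering operates on.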

Afterwards, each campaign should be assigned a label describing the purpose of the spammer. This goal-based labeling facilitates, for investigators, the analysis of spam campaigns eventually directed toward a specific cybercrime. Moreover, labeling spam campaigns based on the goal of the spammer can help to rank them.

5. http://www.fixedbyvonnie.com/2013/08/what-is-spam-and-how-you-get-junk-email/


Ranking spam campaigns based on the investigator's priorities provides an ordered set of spam campaigns, on the basis of which the investigator decides which spam campaigns must be analyzed first; this is a difficult task when we look at a large number of detected spam campaigns as a whole.
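The ranking idea can be sketched as a weighted sum over normalized ranking features, sorted in decreasing order of score. The feature names and weights below are hypothetical placeholders, not the thesis' actual ranking criteria:

```python
# Sketch of priority-driven campaign ranking: weighted sum over
# normalized features in [0, 1]. Feature names and weights are
# hypothetical placeholders.
def rank_campaigns(campaigns, weights):
    def score(c):
        return sum(weights[f] * c.get(f, 0.0) for f in weights)
    return sorted(campaigns, key=score, reverse=True)

# An investigator prioritizing the campaign label over size and recency:
weights = {"label_severity": 0.5, "size": 0.3, "recency": 0.2}
campaigns = [
    {"id": "A", "label_severity": 0.9, "size": 0.2, "recency": 0.5},  # score 0.61
    {"id": "B", "label_severity": 0.3, "size": 0.9, "recency": 0.5},  # score 0.52
]
top = rank_campaigns(campaigns, weights)  # campaign "A" ranks first
```

Changing the weight vector reorders the same set of campaigns, which is how different investigator priorities yield different analysis orders.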

It is not uncommon for a data mining process to require several days or weeks to complete. Parallel computing systems bring significant benefits, namely high performance, to processing massive databases [33]. Parallel clustering is a methodology proposed to alleviate the problems of time and memory usage in clustering large amounts of data [94], [18]. Because of the huge amount of received spam emails, which vastly increases every hour (8 billion per hour) [110], [101], and because of the high variance that related emails may show due to the use of obfuscation techniques [108], it would be helpful to parallelize the clustering process over several parallel computers. Parallel clustering speeds up the process of grouping unwanted messages into spam campaigns.
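The map-then-merge pattern behind parallel clustering can be sketched as follows: each worker clusters its own partition of the emails, and the per-partition clusters are then merged by cluster key. In this sketch a thread pool stands in for the parallel machines, and the toy grouping key is a hypothetical stand-in for a CCTree's split attributes:

```python
# Map-then-merge sketch of parallel clustering. A thread pool stands in
# for parallel machines; the grouping key is a hypothetical stand-in for
# CCTree split attributes.
from concurrent.futures import ThreadPoolExecutor

def cluster_partition(emails):
    # Toy "clustering": group emails by their categorical attributes.
    clusters = {}
    for e in emails:
        key = (e["language"], e["has_links"])
        clusters.setdefault(key, []).append(e)
    return clusters

def parallel_cluster(emails, n_workers=4):
    # Map: split the input into partitions and cluster each in a worker.
    chunks = [emails[i::n_workers] for i in range(n_workers)]
    merged = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Merge: union per-partition clusters that share the same key.
        for partial in pool.map(cluster_partition, chunks):
            for key, members in partial.items():
                merged.setdefault(key, []).extend(members)
    return merged
```

Because clusters are identified by a categorical key, merging is a simple key-wise union; this is the property that makes the clustering amenable to parallelization.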

In the present thesis, we address all the aforementioned issues related to spam campaign detection, analysis, and labeling, and we speed up the process through parallelism with the use of formal methods. In what follows, the contributions of the thesis are explained in detail.

1.2 Main Contributions

The main contributions of this thesis can be summarized as follows:

— We propose a categorical clustering algorithm, named CCTree, designed to divide spam emails into smaller groups, named spam campaigns, based on their structural similarity. The main hypothesis is that some parts of spam emails belonging to the same spam campaign remain unchanged. The CCTree has a tree-like structure, where the leaves of the tree represent the desired spam campaigns ([126]).

— A set of 21 categorical features is presented to characterize the structure of spam emails. An extensible and portable framework is provided to automatically extract the proposed features from raw emails. These features represent the structure of an email well, and some of them hardly change when a spammer creates his own spam campaign ([129]).

— We propose, and validate through the analysis of 200k spam emails, a methodology to choose the optimal CCTree configuration parameters. The proposed technique shows that once the input parameters of CCTree are chosen for a dataset, they can be reused for similar datasets of comparable size ([129]).

— We show the effectiveness and efficiency of CCTree in clustering emails into campaigns through two well-known evaluation indexes: internal evaluation, i.e. the ability of CCTree to obtain homogeneous clusters, and external evaluation, i.e. the ability to effectively classify similar elements (emails) when classes are known beforehand ([129]).


— We propose a framework, named Digital Waste Sorter (DWS), which exploits a self-learning, spammer-goal-based approach for spam email classification. The proposed approach aims at automatically classifying large amounts of raw unclassified spam emails by dividing them into campaigns and labeling each campaign with its spammer's goal. To this end, we propose five class labels to group spammer goals into five macro-groups, namely Advertisement, Portal Redirection, Advanced Fee Fraud, Malware Distribution and Phishing ([128]).

— A ranking methodology is proposed to order sets of spam campaigns on the basis of investigator priorities. The proposed approach extracts five ranking features from each discovered spam campaign, according to investigator priorities. Together with the spammer-goal label of the spam campaign, these features are used to automatically attribute a grade to each spam campaign. The set of spam campaigns is then ordered by grade.

— A semiring-based formal method, named Feature-Cluster Algebra, is proposed to abstract the representation of CCTree. The resulting term equivalent to a CCTree is called a CCTree term. Through several theorems we prove that the proposed algebraic structure, under some conditions, fully abstracts the tree representation. A rewriting system is proposed to automatically verify whether a term is a CCTree term or not ([127]).

— The abstract schema of CCTree is applied to formalize CCTree parallelism. The parallelism approach can be applied to speed up the clustering process on parallel computers. To formalize CCTree parallelism, a set of rewriting rules is provided to obtain a final CCTree from the CCTrees resulting from the parallel computers. Through a set of examples and theorems, we show how the proposed approach works.

1.3 Thesis Outline

The present thesis is structured as follows. First, we synthesize related work on spam campaign detection, labeling, and formalization in Chapter 2. In Chapter 3, we propose a categorical clustering algorithm, named CCTree, to cluster spam emails based on structural similarity (step 1 in Figure 1.3); the result of this step is a set of spam campaigns, which are the leaves of the CCTree (step 2 of Figure 1.3). The effectiveness and efficiency of CCTree in spam email campaign detection is presented in Chapter 4. We propose a self-learning approach to label spam campaigns on the basis of the goal of the spammers (steps 3 and 4 of Figure 1.3), and rank the labeled spam campaigns (step 5 of Figure 1.3) in Chapter 5. The aforementioned steps suffice to divide a large amount of spam emails into spam campaigns. On the other side, to speed up clustering algorithms, one well-known technique is parallel clustering. In the rest of the thesis, we formalize CCTree parallelism. Hence, the whole dataset can be divided among parallel computers


Figure 1.3 – The framework of the thesis.

(steps 6 and 7 of Figure 1.3). In Chapter 6, we abstract the CCTree representation with the use of a well-known algebraic structure, named semiring. We prove that the proposed algebra-based technique abstracts the tree representation. The formal representation of a CCTree is named a CCTree term. We propose a rewriting system to verify whether a term is a CCTree term or not. The CCTree term is used to formalize CCTree parallelism with the use of a rewriting system (step 8 of Figure 1.3). The result of the final CCTree is the set of spam campaigns (step 10 of Figure 1.3), which can be delivered to the previously explained parts of the framework to be labeled and ranked. We conclude with future directions of the present thesis in Chapter 7.


Chapter 2

State of the Art

In line with the growing concerns regarding spam messages, an increasing number of works has been dedicated to the problem, studying the issue from different aspects. In this chapter, we present a comprehensive literature review on the problem of spam emails, directly or indirectly related to our work. At the end of the chapter, we present the studies related to formal methods applied to the presentation of feature models. We discuss how these formal approaches are similar to, and different from, our proposed semiring-based formalization technique for abstracting a feature-based categorical clustering algorithm, and finally for speeding up the clustering process through parallelism.

2.1 Spam Emails Issues

In this section we explain different problems of spam emails discussed in the literature.

Botnets are one of the main topics related to spam emails, which has vastly come under consideration in recent years. [76] report that more than 85% of worldwide spam is sent by botnets 1. The term botnet refers to a group of compromised host computers that are controlled by a small number of commander hosts referred to as command and control (C&C) servers. Compromised machines on the Internet are generally referred to as bots, and the set of bots controlled by a single entity is called a botnet [153]. In other words, a botnet is a network of “zombie” computers infected by malicious software (or “malware”) designed to enslave them to a master computer. The malware is installed in a variety of ways, such as downloading an attachment received in a spam email [25], [78], [35].

[146] perform a large-scale analysis of spamming botnet characteristics and identify trends that can benefit future botnet detection and defense mechanisms. The proposed framework is based on the premise that botnet spam emails are mostly sent in an aggregate fashion, resulting in content prevalence similar to worm propagation. The focus of the research is on URLs

1. www.symantec.com


embedded in email content. Using three months of spam emails collected from Hotmail, the proposed framework, named AutoRE, found several interesting results regarding the degree of email obfuscation, the properties of botnet IP addresses, sending patterns, and their correlation with network scanning traffic [146].

[79] present a platform, named Botlab, which continually monitors and analyzes the behavior of spam botnets. The results of this study show that six botnets are responsible for 79% of the spam messages arriving at the University of Washington campus.

[96] first discuss the fundamental concepts of botnets, including formation and exploitation, the lifecycle, and two major kinds of topologies. Several related attacks, detection and tracing techniques, and countermeasures are introduced afterwards.

[47] propose a spam zombie detection system, named SPOT (Sequential Probability Ratio Test), which monitors the outgoing messages of a network. Through a two-month email trace collected in a large US campus network, they show that SPOT is an effective and efficient technique for automatically detecting compromised machines in a network.

[52] apply the PageRank approach, with an additional clustering algorithm, to efficiently detect stealthy botnets communicating peer-to-peer.

[133] provide interesting statistics about botnets: after two hours about 29.6% of bots are blacklisted, 46.4% after three hours, roughly 75.3% by six hours, and the rate reaches 90% after a period of about 18 hours.

[142], [149], [45] propose several approaches to find the botmaster through stepping stones.

[13], [122], [116], [107] provide a brief look at existing botnet research, the evolution and future of botnets, as well as the goals and visibility of today's networks, in order to inform the field of botnet technology and defense.

Another topic related to the problem of spam emails concerns the cost of spam messages and the revenue of spammers. [119] observe that any marketing based on spam emails has the advantage of costing the sender little; hence, the sender sends a large number of messages to maximize the return. Several researches focus on what spammers get back from spam campaigns. The conversion rate of spam marketing is discussed in [83], while in [133], [112], and [134] the underground economy of spam is analyzed. [133] show that spam-as-a-service can be purchased for approximately $100–$500 per million emails sent. Botnets can also be rented to groups interested in sending out larger amounts of designed spam emails, and are capable of sending 100 million emails per day for $10,000 per month. Considering in their own study that Cutwail operators may have paid between $1,500 and $15,000 on a recurring basis to grow and maintain their botnet, and estimating the value of the largest email address list (containing more than 1,596,093,833 unique addresses) from advertised prices, it is worth approximately $10,000–$20,000. Finally, the Cutwail gang's profit for providing spam services is estimated at around $1.7 million to $4.2 million since June 2009. They also observed that several individuals offer 10,000 malware installations for approximately $300–$800, and rates for one million email


addresses ranging from $25 to $50, with discounted prices for bulk purchases. [84] show that a successful spam campaign can earn revenues between $400k and $1000k. The other side of the spammers' cost effect has been evaluated as productivity cost 2. To measure the cost of spam emails in terms of productivity, suppose that the average money an employee makes per year equals $80k, while working 220 days per year. Say he receives 100 messages per day, of which 40 are spam, and the average time to read and delete a message is 5 seconds. Then, he earns $45 per hour and needs 3 minutes per day just to delete spam emails, losing $2.25 per day on checking spam messages. This means that a company with 1000 employees loses about $500,000 per year in productivity cost due to spam messages.
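The back-of-envelope computation above can be reproduced directly. The figures (salary, working days, message counts) are the ones assumed in the text; the quoted $2.25/day and $500,000/year follow from rounding the hourly wage to $45 and the lost time to 3 minutes.

```python
# Back-of-envelope productivity cost of spam, using the figures assumed in the
# text: $80k/year salary, 220 working days, 8-hour days, 40 spam messages a day,
# 5 seconds to read and delete each one.
ANNUAL_SALARY = 80_000
WORK_DAYS = 220
HOURS_PER_DAY = 8
SPAM_PER_DAY = 40
SECONDS_PER_SPAM = 5

hourly_wage = ANNUAL_SALARY / WORK_DAYS / HOURS_PER_DAY   # about $45.45/hour
minutes_lost = SPAM_PER_DAY * SECONDS_PER_SPAM / 60       # about 3.3 minutes/day
daily_cost = hourly_wage / 60 * minutes_lost              # about $2.53/day

# Yearly loss for a 1000-employee company; the text's $2.25/day and $500,000/year
# come from rounding the hourly wage to $45 and the lost time to 3 minutes.
company_cost = daily_cost * WORK_DAYS * 1000
print(round(hourly_wage, 2), round(daily_cost, 2), round(company_cost))
```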

Another main focus of research related to the problem of spam emails is spam filtering methods. Spam filtering is based on the analysis of message contents and additional information, trying to distinguish spam messages from legitimate ones [143], [21]. Generally, a spam filter is an application which implements a function of the following form:

f(m, θ) = C(spam)  if the message m is spam
          C(leg)   if the message m is legitimate

where m is a message to be classified, θ is a vector of parameters, and C(spam) and C(leg) are the labels assigned to the message.

Spam filtering is mostly performed with the use of machine learning algorithms, e.g. applying Naive Bayesian approaches [9], [8] and other classifiers [75], [151], [90], [22], [138], [139]. The approaches proposed in the literature for filtering spam emails span a variety of topics. [29] presents an overview of approaches aimed at spam filtering. Text analysis, characterizing spam emails by the use of special words, was another applicable approach in the field of spam filtering. To this end, [48] apply lazy learning algorithms to tackle concept drift in spam filtering, while [80] use n-grams in a word-based anti-spam approach. Spammers started to obfuscate the text of spam messages, or to embed the text in images, to avoid being identified through text filtering techniques. Image spam filtering methodologies [10], [20] came under consideration to block these kinds of spam messages.
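As a minimal illustration of the filter function f(m, θ), the sketch below instantiates it with a Laplace-smoothed Naive Bayes classifier, one of the common machine-learning choices cited above. The tiny training corpus is invented for the example and does not come from any cited dataset.

```python
# Minimal instantiation of the filter f(m, θ): a Laplace-smoothed Naive Bayes
# text classifier trained on a toy labeled corpus (invented data).
from collections import Counter
import math

train = [
    ("buy cheap viagra now", "spam"),
    ("limited offer click here", "spam"),
    ("meeting agenda for tomorrow", "leg"),
    ("please review the attached report", "leg"),
]

# θ: per-class word counts and class priors estimated from the training set
word_counts = {"spam": Counter(), "leg": Counter()}
class_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    class_counts[label] += 1

vocab = {w for counts in word_counts.values() for w in counts}

def f(message):
    """Return "spam" or "leg" for message m under the learned parameters θ."""
    scores = {}
    for label in ("spam", "leg"):
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / len(train))   # class prior
        for w in message.split():
            # Laplace smoothing avoids zero probabilities for unseen words
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(f("cheap viagra offer"), f("please review agenda"))
```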

Nevertheless, despite the growing research on spam filtering, often reporting accuracy above 90% [21], the evolution of spam messages is still considerable. A filter prevents end-users from wasting their time on junk messages, but it does not stop the misuse of resources, since the messages are delivered anyway [21]. We believe the reason could be that the spammer, the root of the problem, feels that there is minimal risk of being caught.

2. http://www.fixedbyvonnie.com/2013/08/what-is-spam-and-how-you-get-junk-email/


To effectively fight the problem of spam emails, it is mandatory to find and prosecute the spammers, who generally hide behind complex networks of infected devices which send spam emails against their users' will, i.e. botnets. Due to botnets, identifying the spammer is a difficult, although possible, task [142], [149], [45]. To this end, it is first required to efficiently and effectively divide the huge amount of spam emails in a way that helps to catch the spammer.

2.2 Clustering Spam Emails into Campaigns

Detecting a spammer, analyzing his behavior, and deciding which spammers have priority to be pursued constitute an extremely challenging task, due to the huge amount of spam emails, which vastly increases every hour (8 billion per hour) [110], [101], and to the high variance that related emails may show, due to the use of obfuscation techniques [108]. To this end:

• First of all, a fast and effective clustering algorithm is required to divide the huge amount of spam messages into smaller groups, each representing a spam campaign originating from the same source (spammer).

In the research field of spam emails, several works exist which cluster spam emails into spamcampaigns.

The basic idea in [87] for identifying spam campaigns is based on keywords or strings standing for specific types of campaigns. For example, all templates containing the string linksh are defined as a type of self-propagation campaign. Several campaign types, related to the same spammer purpose, constitute a campaign class. The purpose of a spam campaign is identified on the basis of keywords in the text or subject. The set of messages containing no text, and just the feature, belong to the image campaign. Finally, 10 spam campaign classes are presented: 1) Image spam; 2) Job ads; 3) Other ads; 4) Personal ads, containing fake dating/matchmaking advance money scams; 5) Pharma, containing pointers to web sites selling Viagra, Cialis, etc.; 6) Phishing, which forces victims to enter sensitive information; 7) Political campaigning; 8) Self-prop, i.e. spam messages which trick victims into executing Storm binaries; 9) Stock scam, which tricks victims into buying a particular penny stock; 10) Other. Manual selection of keywords requires too much iterative effort, while spammers rapidly change the keywords they use. Moreover, spammers continuously fight keyword-based approaches by means of obfuscation techniques.

[87] infer that 65 percent of campaign instances last less than two hours; the longest-lived ones are pharmaceutical campaigns, which were available for months, while the crucial self-propagation campaign worked for 12 days. Three large campaigns, named Pharma, self-propagation, and stock storm, have a large number of unique headers in their templates, but Pharma and self-propagation actually have few different bodies. The authors suggest that it may be better to focus clustering on headers to identify these three campaigns, and then to identify other campaigns using other techniques.

In [54], although the authors focus on the analysis of spam URLs on Facebook, the study of URLs and the clustering of spam messages are similar to our goal concerning spam emails. First, all wall posts that share the same URLs are clustered together; then the descriptions of the wall posts are analyzed, and if two wall posts have the same description their clusters are merged. In this study, factors like bursty activity and distributed communication also come under consideration. The distributed property of sending spam emails refers to the number of users who send spam messages in the cluster; for emails it is usually computed from the IP addresses of the senders, while for Facebook spam messages it refers to users' unique IDs. The bursty property comes from the rationale that most spam campaigns carry out their actions within a short period of time. The threshold values for the distributed and bursty properties in this study were identified as 5 and 1.5 hours, respectively. This means that if a spammer sends spam messages to fewer than 5 different accounts, or the interval between messages is greater than 1.5 hours, he is considered a person with no important effect on the system.
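The two thresholds can be expressed as a simple predicate over a sender's activity. In the sketch below, the event format and function name are hypothetical; only the two thresholds (5 distinct accounts, 1.5 hours) come from the study.

```python
# Sketch of the "distributed" and "bursty" checks described above: a sender is
# significant only if it reaches at least 5 distinct accounts and the gaps
# between its consecutive messages stay within 1.5 hours.
DISTRIBUTED_MIN = 5       # minimum number of distinct target accounts
BURSTY_MAX_GAP_H = 1.5    # maximum gap between consecutive messages, in hours

def is_significant(events):
    """events: list of (timestamp_in_hours, target_account_id) for one sender."""
    targets = {account for _, account in events}
    if len(targets) < DISTRIBUTED_MIN:
        return False                      # not distributed enough
    times = sorted(t for t, _ in events)
    gaps = [b - a for a, b in zip(times, times[1:])]
    return all(g <= BURSTY_MAX_GAP_H for g in gaps)   # bursty enough

burst = [(0.0, 1), (0.5, 2), (1.0, 3), (1.2, 4), (2.0, 5)]
slow = [(0.0, 1), (3.0, 2), (6.0, 3), (9.0, 4), (12.0, 5)]
print(is_significant(burst), is_significant(slow))   # True False
```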

Furthermore, the authors found that the spammers' techniques for attracting people's attention can mostly (88.2%) be classified into three types: 1) they promise free gifts; 2) they use phrases that trigger curiosity, e.g. that someone likes them; 3) they describe a product for sale.

It was discovered that approximately 80 percent of malicious accounts are active for less than one hour and about 10 percent are active for longer than one day. In each time zone, most malicious wall posts were sent around 3 am to avoid detection, and among 187 million wall posts of 3.5 million Facebook users, 200,000 malicious wall posts were attributed to 57,000 malicious accounts.

[92] observe that spam emails with identical URLs are highly clusterable and mostly sent in bursts. In their method, if the same URL exists in spam emails from source A and source B, each having a unique IP address, the two sources are connected with an edge, and the connected components are the desired clusters. It is also observed that if a spammer is associated with multiple groups, he has a higher probability of sending more spam mails in the near future. Furthermore, the authors found that a very small fraction of the active spammers actually accounted for a large portion of the total spam mails, and they inferred that spam emails from the same group of spammers are sent in bursts.
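The edge-and-connected-components construction above can be sketched with a union-find structure; the sources and URLs below are invented for illustration.

```python
# Sketch of the URL-based grouping in [92]: spam sources sharing a URL are
# linked by an edge, and the connected components of the resulting graph are
# the clusters. Implemented with union-find.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving keeps trees shallow
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

emails = {                      # source -> URLs seen in its spam emails
    "A": {"http://x.example/promo"},
    "B": {"http://x.example/promo", "http://y.example/buy"},
    "C": {"http://y.example/buy"},
    "D": {"http://z.example/other"},
}

url_owner = {}                  # first source seen with each URL
for src, urls in emails.items():
    for u in urls:
        if u in url_owner:
            union(src, url_owner[u])   # shared URL -> same component
        else:
            url_owner[u] = src

components = {}
for src in emails:
    components.setdefault(find(src), set()).add(src)
result = sorted(sorted(c) for c in components.values())
print(result)   # [['A', 'B', 'C'], ['D']]
```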

Spamscatter [4] is a method that automatically clusters the destination web sites extracted from URLs in spam emails with the use of image shingling. In image shingling, images are divided into blocks and the blocks are hashed; two images are considered similar if 70


percent of the hashed blocks are the same. The lifetime of each detected spam campaign is computed by finding the first and last spam message (in terms of time) in the spam campaign. The results show that over 40% of the malicious scams persist for less than 120 hours, whereas the lifetime for the same percentage of shopping scams is 180 hours, and the median over all scams is 155 hours.
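A minimal sketch of image shingling, under the assumption that "images" are toy grids of characters (real images would be pixel blocks): the block splitting, the hashing, and the 70% block-overlap test follow the description above, while everything else is simplified.

```python
# Sketch of image shingling as used by Spamscatter [4]: split each image into
# fixed-size blocks, hash the blocks, and declare two images similar when at
# least 70% of the blocks have matching hashes.
import hashlib

def shingles(image, block=2):
    """image: list of equal-length strings (rows); returns the set of block hashes."""
    hashes = set()
    for r in range(0, len(image) - block + 1, block):
        for c in range(0, len(image[0]) - block + 1, block):
            data = "".join(image[r + i][c:c + block] for i in range(block))
            hashes.add(hashlib.md5(data.encode()).hexdigest())
    return hashes

def similar(img_a, img_b, threshold=0.70):
    a, b = shingles(img_a), shingles(img_b)
    return len(a & b) / max(len(a), len(b), 1) >= threshold

img1 = ["abcd", "efgh", "ijkl", "mnop"]
img2 = ["abcd", "efgh", "ijkl", "mnoX"]   # only one block differs
print(similar(img1, img2))   # True: 3 of 4 blocks match
```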

[150] cluster spam messages based on the images they contain, in order to trace the origins of spam emails. To this end, spam images are divided into two parts: the foreground, comprising the text and/or illustrations, and the background, consisting of colors and/or textures. Spam emails are visually similar if their illustrations, text, layouts, and/or background textures are similar. A two-stage clustering is applied: first, Optical Character Recognition recognizes texts whose bounding boxes represent the text layout; afterwards, the illustrations are separated from the background by detecting the background. The authors mention that the proposed approach needs to be combined with other methods to obtain better results.

[130] focus on clustering spam emails based on the IP addresses resolved from URLs inside the body of these emails. The rationale is that, in the authors' view, in many cases it is not easy to change the IP addresses, since doing so requires compromising many computers. In this study, two emails belong to the same cluster if the IP addresses resolved from their URLs are exactly the same. Afterwards, the relationship between the spam sending system and the malicious Web servers behind the URLs, as well as information such as the number of unique URLs, unique domain names, etc., is provided.

By examining three weeks of spam messages gathered on the SMTP server they used, the authors conclude that the proposed methodology outperforms clustering techniques based on domain names and URLs. The claim is justified by the facts that domain names associated with a scam change frequently, that the period during which a URL is active is too short to perform the investigation, and that most of the time the URLs used in spam emails are unique.

All the aforementioned works for clustering spam emails into campaigns require the pairwise comparison of each pair of emails, whose time complexity is quadratic. Furthermore, the spam campaign detection is limited to one or two features of the spam emails; if the spam messages do not contain the related feature, the methodology fails at clustering them. For example, for emails without URLs or without images, the approaches of [130] and [150], respectively, fail.

Other limitations of the former approaches have been identified in [132], which shows how considering only IP addresses resolved from URLs is insufficient for dividing emails into spam campaigns. More precisely, since web servers host many domains with the same IP address, every spam campaign identified by such means (such as [130]) is instead made of a large


amount of spam emails sent by different controlling entities.

Thus, [130] propose a new technique for spam campaign detection, named O-means clustering, which is based on the K-means clustering algorithm. The distance between two spam messages is calculated from 12 numeric features extracted from the emails, using the Euclidean measure. The 12 features are: 1) size of email; 2) number of lines; 3) number of unique URLs; 4) average length of unique URLs; 5) average length of domain names; 6) average length of queries; 7) average number of key-value pairs; 8) average length of paths; 9) average length of keys; 10) average length of values; 11) average number of dots in domains; 12) number of global top-100 URLs.
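The O-means distance reduces to a Euclidean distance over a 12-dimensional numeric vector; a sketch follows, with hypothetical feature names and values.

```python
# Sketch of the O-means distance: each email maps to 12 numeric features and
# two emails are compared by Euclidean distance. Values below are invented.
import math

FEATURES = [
    "size", "lines", "unique_urls", "avg_url_len", "avg_domain_len",
    "avg_query_len", "avg_kv_pairs", "avg_path_len", "avg_key_len",
    "avg_value_len", "avg_domain_dots", "top100_urls",
]

def distance(email_a, email_b):
    return math.sqrt(sum((email_a[f] - email_b[f]) ** 2 for f in FEATURES))

a = dict(zip(FEATURES, [2048, 40, 3, 25, 12, 8, 2, 10, 4, 6, 2, 1]))
b = dict(zip(FEATURES, [2050, 41, 3, 25, 12, 8, 2, 10, 4, 6, 2, 1]))   # near-duplicate
c = dict(zip(FEATURES, [512, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))       # no URLs at all

print(distance(a, b), distance(a, c))   # a is far closer to b than to c
```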

The limitation of O-means is that it requires the number of clusters to be known from the beginning, which is generally not a working hypothesis. Moreover, the features are treated as numerical values, which does not represent reality well, especially when the distance between two emails is based numerically on the number of links: an email with 10 links is considered closer to one with 11 links than to one with a single link, which may not reflect campaign membership.

After clustering spam emails with the O-means method, [131] found that the 10 largest clusters had sent about 90 percent of all spam emails. Hence, the authors investigated these 10 clusters with a heuristic analysis to select significant features among the 12 features used in the previous work. As a result they selected the four most important features, which could effectively separate these 10 clusters from each other: “Size of emails”, “Number of lines”, “Length of URLs” and “Number of dots”. The authors note, however, that this is not the best method for selecting the most significant features, since it was based on the analysis of the top 10 clusters only. Nevertheless, it yields almost the same clustering accuracy as the previous method using 12 features: the accuracy decreases negligibly from 86.63 percent to 86.33 percent, while the execution time drops from 28,772 seconds to 6,124 seconds.

[144] first extract eleven features from each spam email: “Message Id”, “Sender IP address”, “Sender Email”, “Subject”, “Body Length”, “Word Count”, “Attachment File Name”, “Attachment MD5”, “Attachment Size”, “Body URL”, and “Body URL Domain”; some attributes are broken down into two sub-attributes, for example “Body URL” into “Machine Name” and “Path”. Afterwards, two clustering algorithms are applied to divide the spam messages. First, an agglomerative hierarchical algorithm [66] clusters the whole dataset based on the comparison of message subjects: at the beginning, each email is a cluster by itself, and clusters sharing a common subject are merged. The distance D(i, j) between two clusters i and j equals 0 if they share a common feature of an attribute and 1 otherwise; thus, when the distance between two clusters is 0, the two clusters are merged. With this first merge based on the subject, 67% of the messages were attributed to one cluster. To solve the


problem of the false positive rate for big clusters, a connected-components algorithm with weighted edges is applied. A connected component [12] is an undirected graph on a set A of vertexes such that for each vertex v ∈ A, the set of vertexes for which there exists a path from v

to them is exactly the set A. The weight on an edge represents the strength of the connection between two vertexes. In this approach, edges connect two spam emails based on the eleven attributes, and the desired clusters are the connected components of this graph whose weights are above a specified threshold.
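The first, subject-based merging stage amounts to grouping emails by identical subject, since D(i, j) = 0 merges and D(i, j) = 1 does not; a sketch with invented messages:

```python
# Sketch of the first clustering stage of [144]: agglomerative merging where
# D(i, j) = 0 for clusters sharing the subject and 1 otherwise, so merging at
# distance 0 reduces to grouping emails by identical subject.
from collections import defaultdict

emails = [
    {"id": 1, "subject": "Cheap meds"},
    {"id": 2, "subject": "Cheap meds"},
    {"id": 3, "subject": "You won!"},
    {"id": 4, "subject": "Cheap meds"},
]

clusters = defaultdict(list)
for e in emails:
    clusters[e["subject"]].append(e["id"])

result = sorted(clusters.values())
print(result)   # [[1, 2, 4], [3]]
```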

The main drawback of this methodology is that it cannot be applied to large datasets, since pairwise comparisons are performed several times over each pair of emails in the dataset.

The basic hypothesis in [27] for clustering spam emails is that some parts of spam messages are static from the point of view of recognizing a spam campaign. This work improves on [92] by not relying on URLs alone for clustering: to identify spam campaigns, a set of features is extracted from the spam emails, namely the “language of the email”, “message layout”, “type of message”, “URLs” and “subject”. Afterwards, the frequencies of the proposed features over a large dataset are computed in order to cluster spam messages with the use of an FP-Tree. The Frequent Pattern Tree (FP-Tree), proposed by [67], is a signature-based method in which each node below the root depicts a feature extracted from the spam messages that is shared by the subtrees beneath it. Thus, each path in this tree shows a set of features that co-occur in messages, with the property of non-increasing frequency of occurrence.

Applying the FP-Tree to spam campaign detection, as in [27] and [44], has several limitations. First, on the side of URL similarity, since each token of a URL is considered as a feature, the method fails to recognize dynamic URLs in emails belonging to the same campaign [27]. Moreover, considering URL tokens as features causes a spam email containing several URLs to be directed to several campaigns.

On the side of layout detection, the FP-Tree is too sensitive to very small changes in the layout. More precisely, the FP-Tree reads each message line by line, and the layout is then represented as a string of letters, e.g. UTBUUB, where the i-th letter of the string represents the i-th line of the spam message; e.g., if U occurs as the first letter of the layout string, the first line of the message contains a URL. Considering that spammers use several techniques for random text and URL obfuscation, two very similar emails belonging to the same spam campaign may be considered as having two different layouts in the FP-Tree, just because the random text reaches the next line in one email but not in the other.
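The layout encoding described above can be sketched as follows. The letter scheme (U for a line containing a URL, B for a blank line, T for other text) is an assumption consistent with the UTBUUB example, not the cited implementation.

```python
# Sketch of the layout string: each message line becomes one letter
# (U = line contains a URL, B = blank line, T = other text -- assumed scheme).
import re

URL_RE = re.compile(r"https?://\S+")

def layout(message):
    letters = []
    for line in message.splitlines():
        if not line.strip():
            letters.append("B")
        elif URL_RE.search(line):
            letters.append("U")
        else:
            letters.append("T")
    return "".join(letters)

msg = "http://a.example/win\nGreat deal today\n\nhttp://a.example/buy"
print(layout(msg))   # "UTBU"
```

Appending a single random word on its own line (e.g. `msg + "\nbonus"`) changes the string to "UTBUT", which illustrates the sensitivity discussed above.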

In summary, the previous works on clustering spam emails can be divided into two main categories: the first group focuses on the pairwise comparison of each pair of emails, for example by URL comparison, while in the second group a clustering algorithm is used, for example O-means clustering. In general, the aforementioned works suffer from one of the following problems: 1) they consider only one or two features for grouping spam


messages, which decreases the accuracy; 2) pairwise comparison is used, with quadratic time complexity; 3) the number of clusters is required as prior knowledge; 4) the features which create a pure cluster are not the focus. In our proposed methodology for clustering spam emails into campaigns, we address the aforementioned problems.

2.3 Labeling and Ranking Spam Campaigns

• In the next step, to address the spam message problem, an approach is required to label the detected spam campaigns in order to train a classifier with the labeled set of messages, and then to establish an order among the detected spam campaigns according to investigator priorities.

In the literature, spam campaigns are usually labeled based on characteristic strings (keywords) representing individual campaign types, as in [44], [88] and [55]. As explained, in these works the occurrence of some specific string in a spam message means that the spam is labeled as a pre-identified type of spam campaign. For example, all templates containing the string linksh are defined as a type of self-propagation campaign. First of all, manual string selection requires a lot of time, while spammers rapidly change the set of words in the body of messages by applying obfuscation techniques. Moreover, it is worth noticing that many spammers use the same words, like “viagra”, to deceive their victims. Hence, training a classifier based on word labels is not helpful for spam campaign detection, since a spam campaign is defined according to our need, i.e. as originating from the same source.

[106] label spam campaigns on the basis of the contact information in the body of the messages. To this end, URLs, phone numbers, Skype IDs, and Mail IDs used as contact information are considered for clustering spam emails into similar groups, and the contact information is taken as the label of the detected spam campaign. This methodology is effective only against emails reporting contacts, which are only a subset of all the spam emails found in the wild.

There are several approaches in the literature in which the spammer's goal is considered. However, these approaches mainly focus on detecting phishing emails, without considering other spammer purposes. The phishing email [3], a special type of spam message, has become an enormous threat for all Internet-based commercial operations, causing non-negligible financial losses to organizations and individual users. A phisher attempts to redirect users to fake websites designed to illegally obtain financial data such as usernames, passwords, credit card details, etc., in an electronic communication [3]. In this line of work, a set of features representing the structure of a phishing email is usually proposed, and a machine learning algorithm is then used to classify a set of emails as phishing or legitimate.

[50] applied 10 email features to discern phishing emails from ham (good) emails. These 10 features include: 1) IP-based URLs, 2) age of linked-to domain names, 3) nonmatching URLs, 4) “Here” links to non-modal domain, 5) HTML emails, 6) number of links, 7) number of domains, 8) number of dots, 9) containing JavaScript, 10) spam-filter output.
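As an illustration of how a few of these structural features can be computed from a raw message body, the sketch below derives some of them with regular expressions. The function name and the exact feature definitions are our simplifications for illustration, not the extractor of [50]:

```python
import re

def phishing_features(html_body):
    """Compute a few of the 10 structural features above (simplified sketch)."""
    urls = re.findall(r'href="([^"]+)"', html_body)
    return {
        # 1) links whose host is a raw IP address
        "ip_based_urls": sum(1 for u in urls
                             if re.match(r'https?://\d{1,3}(\.\d{1,3}){3}', u)),
        # 6) total number of links
        "num_links": len(urls),
        # 7) number of distinct linked domains
        "num_domains": len({re.sub(r'https?://([^/]+).*', r'\1', u) for u in urls}),
        # 8) maximum number of dots in any linked URL
        "max_dots": max((u.count(".") for u in urls), default=0),
        # 9) presence of JavaScript
        "has_javascript": "<script" in html_body.lower(),
    }
```

A real extractor would also parse the DOM and query WHOIS for domain ages; the point here is only that each listed feature reduces to a simple, mechanical test on the message.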

[17] propose a similar methodology with additional features to train a classifier in order to filter phishing emails. Advanced email features are generated by adaptively trained Dynamic Markov Chains and latent Class-Topic Models. The set of features is divided into three main groups, namely basic features, dynamic Markov chain features, and latent topic model features. The basic features by themselves contain several features, e.g. structural features and link features.

[34] propose a methodology to detect phishing emails based on both machine learning and heuristics. The proposed novel heuristic anti-phishing system employs Gestalt and decision theory concepts in modeling the similarity. [3] provide a survey on different techniques for filtering phishing emails, while Gansterer et al. [53] compare different machine learning algorithms in phishing detection. Furthermore, the authors propose a technique which refines the previous phishing filtering approaches. In this work, three types of messages, namely ham, spam and phishing, are distinguished automatically. Nevertheless, the category of emails containing spam is not precisely characterized.

There are a number of works discussing different aspects of spam email attacks, spanning from the network of malware distribution [104] and PageRank spam analysis [1] to the total revenues of a range of spam advertised campaigns [84], [83]. However, in these works as well, some specific aspects of one type of spam attack are analyzed, while the detection of different types of spam attacks is not discussed. On the side of ranking spam campaigns, [44] consider Canadian law enforcement elements, e.g. Canadian IP addresses, “.ca” top-level domain names, and IP ranges of Canadian IP addresses.

To the best of our knowledge, the present work is the first effort in labeling spam campaigns according to the different goals of the spammer, inferred from the structural features of messages, whereas the goal-based label of each campaign is applied to order the set of detected labeled spam campaigns.

2.4 On the Formalization of Clustering and its Applications

As the next step, we formalize CCTree, the proposed effective and efficient categorical clustering algorithm. The formal schema is used to formalize CCTree parallelism with the use of a rewriting system.

It is hard to find studies in the literature on the formalization of different concepts related toclustering algorithms.

[58] formalize hierarchical clustering as an Integer Linear Programming (ILP) problem with a natural objective function. The dendrogram properties of hierarchical clustering are enforced as linear constraints. The proposed formalization technique has the benefit that relaxing the constraints may provide novel problem variations, like overlapping clusterings.

[103] formally define the problem of clustering in a Multi-Criteria Decision Aid (MCDA) system. As in most MCDA approaches, the preferences of a decision maker are modeled based on a set of decision alternatives. To find the optimal solution, the authors propose a heuristic approach, which is validated through tests on a large set of artificially generated benchmarks.

[2] propose an approach to formalize the problem of data streams in clustering algorithms, based on set theory. A data stream refers to an infinite sequence of data. The formalization scheme makes it possible to identify and propose basic properties for the design and comparison of data stream clustering algorithms. To this end, they extended Kleinberg’s properties [86] to represent clustering partitions evolving according to the data stream behavior. They found that it is difficult to find an algorithm complying with the expressiveness property in a data stream context.

[41] apply a predicate logic language in terms of sets of if-then rules to formalize heuristic rules in clustering algorithms. In this approach, it is possible to describe traditional clustering algorithms, like k-means. However, in none of the few works on formalizing clustering algorithms is an algebraic methodology used to abstract a clustering algorithm representation. In what follows we present several techniques and methodologies used to formalize feature models.

Feature models are information models in which a set of products, e.g. software products or DVD player products, is represented as a hierarchical arrangement of features, with different relationships among features [15]. Feature models are used in many applications as a result of being able to model complex systems, being interpretable, and being able to handle both ordered and unordered features [105]. Benavides et al. [15] believe that designing a family of software systems in terms of features makes it easy to be understood by all stakeholders, rather than when they are expressed in terms of objects or classes. Representing feature models as a tree of features was first introduced by Kang et al. in [82], to be used in software product lines. Some studies [31], [32] show that tree models combined with ensemble techniques lead to an accurate performance on a variety of domains. In a feature model tree, differently from a CCTree, the root is the desired product, the nodes are the features, and different representations of edges demonstrate the mandatory or optional presence of features. Höfner et al. [73], [74] were the first to apply an idempotent semiring as the basis for the formalization of tree models of products, and they called it feature algebra. The concept of semiring is used to answer the needs of product families for an abstract form of expression, refinements, multi-view reconciliation, and product development and classification. The elements of the semiring in the proposed methodology are sets of products, or product families.


To get better insight into how feature algebra works, we present a brief history of product families from definition to formalization. Furthermore, we explain that despite our inspiration from the concept of feature algebra in formalizing a tree model system, our proposed approach is different in several aspects.

FODA used feature models as the means to express the mandatory, optional and alternative concepts within a domain [81], [115]. For example, in a car, the transmission system is a mandatory feature, and air conditioning is an optional feature, whilst the transmission system can either be manual or automatic. The part of the FODA feature model most related to formalization works is the proposed feature diagram. It builds a tree of features and captures the mandatory, optional, and alternative relationships among features.

[82] perform an analysis of commonalities among applications in a particular domain in terms of services, operating environments, domain technologies and implementation techniques. Afterwards, they construct a model named feature model to capture commonalities as an AND/OR graph. The AND nodes in this graph demonstrate mandatory features and the OR nodes show alternative features chosen from different applications.

[39] proposed a feature model represented by a hierarchically arranged diagram where a parent feature is composed of a combination of some or all of its children. A parent feature vertex and its children in this diagram can have one of the following relationships:
– And relationship, which indicates that all children must be considered in the composition of the parent feature;
– Alternative relationship, which indicates that only one child forms the parent feature;
– Or relationship, which shows that one or more child features can be involved in the composition of the parent feature;
– Mandatory relationship, which indicates that child features are required;
– Optional relationship, which shows that child features are optional.
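These relationships can be made concrete with a small data structure. The sketch below (class and function names are ours, purely illustrative) encodes a feature diagram and checks whether a selection of features respects it, using the car example mentioned earlier: a mandatory transmission that is either manual or automatic, plus optional air conditioning:

```python
from dataclasses import dataclass, field
from typing import List, Set

AND, ALTERNATIVE, OR = "and", "alternative", "or"  # relationship kinds

@dataclass
class Feature:
    name: str
    relation: str = AND            # how the children compose this feature
    mandatory: bool = True         # is this child required by its parent?
    children: List["Feature"] = field(default_factory=list)

def names(f: Feature) -> Set[str]:
    """All feature names in the subtree rooted at f."""
    s = {f.name}
    for c in f.children:
        s |= names(c)
    return s

def valid(f: Feature, selected: Set[str]) -> bool:
    """Does a selection of feature names respect the diagram rooted at f?"""
    if f.name not in selected:
        # an unselected feature must have no selected descendants
        return not (names(f) & selected)
    chosen = [c for c in f.children if c.name in selected]
    if f.relation == ALTERNATIVE and f.children and len(chosen) != 1:
        return False               # exactly one child allowed
    if f.relation == OR and f.children and not chosen:
        return False               # at least one child required
    if f.relation == AND and any(c.mandatory and c.name not in selected
                                 for c in f.children):
        return False               # a mandatory child is missing
    return all(valid(c, selected) for c in f.children)

car = Feature("car", AND, children=[
    Feature("transmission", ALTERNATIVE, True,
            [Feature("manual"), Feature("automatic")]),
    Feature("air_conditioning", mandatory=False),
])
```

With this diagram, a car with a manual transmission is a valid product, while a car with both transmissions, or with none, is rejected.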

Lopez-Herrejon, Batory, and Lengauer model features as functions and feature composition as function composition [97], [95].

To get better insight into how feature algebra works, we refer to an example of a product line, provided in [24]. Suppose that an electronics company has a family of three product lines: MP3 Players, DVD Players and Hard Disk Recorders. All members share the set of features given in the commonalities. A member can contain some mandatory features and might contain some optional features that another member of the same product line does not have. For instance, a product could be a DVD player that is able to play music CDs, whilst another one does not have this feature. However, all the DVD players of the DVD Player product line must contain the Play DVD feature. Furthermore, it is possible to have a DVD player that is able to play several DVDs at the same time.


Different researchers have proposed different views of what a feature is or should be. A definition that is common to most (if not all) of them in Feature-Oriented Software Development (FOSD) is that “a feature is a structure that extends and modifies the structure of a given program in order to satisfy a stakeholder’s requirement, to implement a design decision, and to offer a configuration option” [72].

Mostly, a set of features is composed to create a final program, which is itself considered as a feature. Under this assumption, a feature is either a complete program which can be executed or a program increment that requires further features to lead to a complete program. The structure of a basic feature is modeled as a tree, called a feature structure tree (FST), which organizes the feature’s structural elements, e.g., classes, fields, or methods, hierarchically. A specified name and type information is assigned to each node of an FST, which helps to prevent the composition of incompatible nodes during feature composition [72].

The concept of product families entered the software development process from the hardware industry [72]. The reason was that software developers also prefer not to build just a single product but a family of similar products, sharing some functionalities, whilst having some well-identified variabilities. These elements, known as features, can be characterized in a software family as requirements, architectural properties, components, middleware, or code. Due to the fact that the systems are characterized by their features, in [72] the authors call their proposed methodology feature algebra. Idempotent semirings are the basis of feature algebra, which allows a formal treatment of the aforementioned elements as well as calculations with them. Sets of products are particular models of the proposed feature algebra, which in its extended form covers product lines, refinement, product development and product classification.
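The semiring view can be illustrated in a few lines. In the set-based model of feature algebra, a product is a set of features and a product family is a set of products; family addition offers a choice between families (set union), and multiplication composes every product of one family with every product of the other. The sketch below (helper names are ours) checks two of the semiring laws on this model, using the optional CD-playing feature of the DVD player example:

```python
def fam(*products):
    """A product family: a set of products, each a set of features."""
    return frozenset(frozenset(p) for p in products)

def add(a, b):
    """a + b: choice between two families (set union)."""
    return a | b

def mul(a, b):
    """a . b: mandatory combination, pairing every product with every other."""
    return frozenset(p | q for p in a for q in b)

common = fam({"play_dvd"})                   # shared by every DVD player
opt_cd = add(fam(set()), fam({"play_cd"}))   # optional feature: absent or present
dvd_line = mul(common, opt_cd)               # products with and without CD support
```

Addition is idempotent (offering the same choice twice adds nothing) and multiplication distributes over addition, which is exactly what makes this structure an idempotent semiring.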

The tree-like structure formalized in product family problems differs from the CCTree. In the product family structure, in contrast to the CCTree, the edges of the tree have no labels; only the nodes have them. Furthermore, different representations of edges convey different concepts, whilst in the CCTree we do not have different possible edge representations.

To the best of our knowledge, we are the first to apply an algebraic structure to abstract a categorical clustering algorithm representation and formalize the interesting concepts related to it, i.e. clustering parallelism. To this end, we attribute an algebraic representation to a tree structure, and then, through several theorems and examples, we show that the proposed algebraic term fully abstracts the tree representation. Calling the term resulting from a CCTree a CCTree term, a rewriting system is proposed to automatically verify whether a term represents a CCTree structure or not. Furthermore, a set of rewriting rules is provided to formalize CCTree parallelization.


Chapter 3

Spam Campaign Detection

Spam emails constitute a fast growing and costly problem associated with the Internet today. To fight effectively against spammers, it is not enough to block spam messages. Instead, it is necessary to analyze the behavior of spammers and, where possible, catch them. This analysis is extremely difficult if the huge amount of spam messages is considered as a whole. Clustering spam emails into smaller groups according to their inherent similarity facilitates discovering the spam campaigns sent by a spammer, in order to analyze the spammer's behavior. In this chapter, we propose a methodology to group a large amount of spam emails into spam campaigns, on the basis of categorical attributes of spam messages. A new informative clustering algorithm, named Categorical Clustering Tree (CCTree), is introduced to cluster and characterize spam campaigns. The complexity of the algorithm is also analyzed and its efficiency is proved ([126]).

3.1 Introduction

Nowadays, the problem of receiving spam messages leaves no one untouched. According to a McAfee report [100], out of the 191.4 billion emails sent worldwide daily on average [110], more than 70% are spam emails. Microsoft and Google [113] estimate that spam emails cost American firms and consumers up to 20 billion dollars per year. Moreover, a Cisco report [136] shows that spam volume increased 250 percent from January 2014 to November 2014. Spam emails cause problems ranging from direct financial losses to misuse of traffic, storage space and computational power.

Given the relevance of the problem, several approaches have already been proposed to tackle this issue. Currently, the most widely used approach for fighting spam emails consists in identifying and blocking them [30], [46], [123] on the recipient machine through filters, which are generally based on machine learning techniques or content features [22], [138], [139]. Alternative approaches are based on the analysis of spam botnets [79], [91], [146], [152].

Though some mechanisms to block spam emails already exist, spammers still impose non-negligible costs on users and companies [113]. Thus, the analysis of spammer behavior and the identification of spam sending infrastructures is of capital importance in the effort of defining a definitive solution to the problem of spam emails.

Such an analysis, which is based on the structural dissection of raw emails, constitutes an extremely challenging task, due to the following factors:

— The amount of data to be analyzed is huge and grows fast every single hour.
— New attack strategies are constantly designed, and the immediate understanding of such strategies is paramount in fighting criminal attacks brought through spam emails (e.g. phishing).

To simplify this analysis, the huge amount of spam emails should be divided into spam campaigns. A spam campaign is the set of messages spread by a spammer with a specific purpose [27], like advertising a specific product, spreading ideas, or for criminal intents, e.g. phishing. Grouping spam messages into spam campaigns reveals behaviors that may be difficult to infer when we look at a large collection of spam emails as a whole [132]. According to [27], in order to characterize the strategies and traffic generated by different spammers, it is necessary to identify groups of messages that are generated following the same procedure and that are part of the same campaign. It is noteworthy that the problem of grouping a large amount of spam emails into smaller groups is an unsupervised learning task. The reason is that there is no labeled data for training a classifier in the beginning. More specifically, supervised learning requires classes to be defined in advance and the availability of a training set with elements for each class. In several classification problems, this knowledge is not available and unsupervised learning is used instead. The problem of unsupervised learning refers to trying to find hidden structure in unlabeled data [57]. The best known unsupervised learning methodology is clustering. Clustering is an unsupervised learning methodology that divides data into groups (clusters) of objects, such that objects in the same group are more similar to each other than to those in other groups [77].

However, dividing spam messages into spam campaigns is not a trivial task due to the followingreasons :

— Spam campaign classes are not known beforehand, which means we need an unsupervised machine learning technique.

— Feature extraction is difficult. Finding the elements that best characterize an email is an open problem addressed differently in various research works [50], [17], [150], [132].

For these reasons, the most used approach to classify spam emails is to cluster them on the basis of their similarities [4], [111], [132].

However, the accuracy of current solutions is still somehow limited and further improvements are needed. While some categorical attributes, for example the language of a spam message, are primary, discriminative and outstanding characteristics to specify a spam campaign, nevertheless in previous works [87], [92], [4], [130], [131], [144], [28], these categorical features are not considered, or the homogeneity of the resulting campaigns is not based on these features.

In this chapter, after a thorough literature review on the clustering and classification of spam emails, we propose a preliminary work on the design of a categorical clustering algorithm for grouping spam emails, which is based on structural features of emails like language, number of links, email size, etc. The rationale behind this approach is that two messages in the same format, i.e. similar language, size, same number of attachments, same amount of links, etc., are more likely to originate from the same source, thus belonging to the same campaign. To this aim, we extract categorical features (attributes) from spam emails, which are representative of their structure and should clearly shape the differences between emails belonging to different campaigns.

The proposed clustering algorithm, named Categorical Clustering Tree (CCTree), builds a tree starting from the whole set of spam messages. At the beginning, the root node of the tree contains all data points, which constitutes a skewed dataset where non-related data are mixed together. Then, the proposed clustering algorithm divides the data points step by step, clustering together data that are similar and obtaining homogeneous subsets of data points. The measure of similarity of clustered data points at each step of the algorithm is given by an index called node purity. If the level of purity is not sufficient, it means that the data points belonging to this node are not sufficiently homogeneous and they should be divided into different subsets (nodes) based on the characteristic (attribute) that yields the highest value of entropy. The rationale behind this choice is that dividing data on the basis of the attribute which yields the greatest entropy helps in creating more homogeneous subsets, where the overall value of entropy is consistently reduced. This approach aims at reducing the time needed to obtain homogeneous subsets. This division process of non-homogeneous sets of data points is repeated iteratively until all sets are sufficiently pure or the number of elements belonging to a node is less than a specific threshold set in advance. These pure sets are the leaves of the tree and represent the different spam campaigns. The usage of categorical attributes is crucial for the proposed approach, which exploits the Shannon entropy [125], a measure that yields good results on nominal attributes. After detailing the CCTree algorithm and briefly presenting categorical features for categorizing spam emails, we discuss the algorithm's efficiency, proving its linear complexity.

The rest of this chapter is structured as follows. Section 3.2 provides some preliminary notionsof the topic. Section 3.3 reports a literature review concerning the previous techniques used forclustering spam emails into campaigns. In Section 3.4, we describe the proposed categoricalclustering algorithm for clustering spam messages. In Section 3.5 the analysis of the proposedmethodology is discussed. Finally, Section 3.6 is a brief conclusion and a sketch of some futuredirections.


3.2 Preliminary Notions

In this section we briefly present some preliminary notions required in our proposed process for clustering spam emails into campaigns.

Clustering Let X be a dataset which consists of data points (or objects, instances, cases, patterns, tuples, transactions, elements) xi = (xi1, xi2, . . . , xid) in an attribute space A, i.e. each xij ∈ A, 1 ≤ i ≤ n, 1 ≤ j ≤ d, where n is the number of points belonging to X and d is the number of attributes. Furthermore, each xij is a numerical or categorical attribute (or feature, value, component). Such a point-by-attribute data representation conceptually corresponds to a matrix. The ultimate goal of clustering [18] is to assign points to a finite set of k subsets C1, C2, . . . , Ck, named clusters. Usually the subsets do not intersect (though this assumption is sometimes violated), and their union is equal to the full dataset, with the possible exception of outliers:

X = C1 ∪ C2 ∪ . . . ∪ Ck ∪ Coutlier , Ci ∩ Cj = ∅, ∀ 1 ≤ i ≠ j ≤ k

Clustering groups data points into subsets in such a manner that similar instances are grouped together, while different points belong to different groups [117]. Due to the fact that clustering is grouping similar instances, some sort of measure that can determine whether two objects are similar or dissimilar is required. Many clustering techniques use distance measures to determine the similarity or dissimilarity between any pair of objects. The distance between two points xi and xj is usually denoted d(xi, xj). A valid distance measure should be symmetric and reach its minimum value (usually zero) in the case of identical vectors. The distance measure is called a metric distance measure if it also satisfies the following properties:

d(xi, xk) ≤ d(xi, xj) + d(xj , xk) ∀xi, xj , xk ∈ X

d(xi, xj) = 0 ⇔ xi = xj ∀xi, xj ∈ X
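For categorical data points such as ours, a simple example of a metric distance is the Hamming distance, the number of attributes on which two points disagree. The sketch below only illustrates the axioms above; the CCTree itself relies on an entropy-based purity measure rather than pairwise distances:

```python
from itertools import product

def hamming(x, y):
    """Number of attributes on which two categorical points disagree."""
    return sum(1 for a, b in zip(x, y) if a != b)

points = [("English", "image", "small"),
          ("French",  "image", "large"),
          ("English", "text",  "large")]

# The metric axioms hold for every combination of points:
for x, y in product(points, repeat=2):
    assert hamming(x, y) == hamming(y, x)                  # symmetry
    assert (hamming(x, y) == 0) == (x == y)                # zero iff identical
for x, y, z in product(points, repeat=3):
    assert hamming(x, z) <= hamming(x, y) + hamming(y, z)  # triangle inequality
```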

Shannon Entropy In information theory, entropy is a measure of the uncertainty of a random variable. More specifically, the Shannon entropy [125], as a measure of uncertainty, for a random variable X with N outcomes x1, x2, . . . , xN is defined as follows:

H(X) = − ∑_{i=1}^{N} p(xi) log(p(xi))

where p(xi) = Ni / N, Ni is the number of occurrences of the outcome xi, and N is the total number of elements of X. The amount of Shannon entropy is maximal when all outcomes are equally likely, i.e. the number of elements for each value is almost the same, and it reaches its minimum, i.e. zero, when all data belonging to a set are identical. Thus, the closer to zero, the purer the dataset is.


To get better insight into how Shannon entropy works in returning the purity of a dataset, Figures 3.1 and 3.2 are provided. At first glance, it is clear that dataset 2 is more pure, or homogeneous, than dataset 1. In the following two equations, we can see that Shannon entropy returns the minimum possible amount, i.e. zero, for the completely pure dataset 2.

Figure 3.1 – dataset 1

Figure 3.1: H(dataset 1) = −(0.4 log(0.4) + 0.3 log(0.3) + 0.3 log(0.3)) = 0.4729

Figure 3.2: H(dataset 2) = −(10/10 log(10/10)) = 0

[38] and [93] show that entropy works well as a distance measure in clustering algorithms.

Spam Campaign A spam campaign is the set of messages spread by a spammer with a specific purpose [27], like advertising a specific product, spreading ideas, or for criminal intents, e.g. phishing. The premise of our spam campaign detection, as in [27], is based on the fact that spammers generally keep some parts of the message static, whilst some other parts are changed systematically with automated text, image, or dynamic link generation.

To get better insight into how two spam emails belong to the same campaign, we refer the reader to Figures 3.3 and 3.4. Although in these two emails the text, images, and dynamic links are different, it is obvious that both are generated from the same source, or designed by the same spammer. The rationale behind our spam campaign detection is focusing on features that remain almost unchanged when a spam campaign is created, e.g. the language of the message, the number of images, etc.

Figure 3.2 – dataset 2


Figure 3.3 – Spam 1

Figure 3.4 – Spam 2


3.3 Related Works

To the best of our knowledge, only a few works exist related to the problem of clustering spam emails into campaigns.

In [87], the basic idea for identifying campaigns is the keywords standing for specific types of campaigns. In this study, campaigns are first found manually based on keywords, and then some interesting results are extracted from the groups of campaigns. As a result of needing manual scanning of spam, this approach is not suitable for large datasets. In [54], although the authors focus on the analysis of spam URLs in Facebook, the study of URLs and the clustering of spam messages is similar to our goal concerning spam emails. First, all wall posts that share the same URLs are clustered together; then the descriptions of the wall posts are analyzed, and if two wall posts have the same description, their clusters are merged. In [92], the authors believe that spam emails with identical URLs are highly clusterable and mostly sent in bursts. In their method, if the same URL exists in spam emails from source A and source B, each having a unique IP address, the sources are connected with an edge, and the connected components are the desired clusters. Spamscatter [4] is a method that automatically clusters destination web sites extracted from URLs in spam emails with the use of image shingling. In image shingling, images are divided into blocks and the blocks are hashed; then two images are considered similar if 70 percent of the hashed blocks are the same. In [150], the spam emails are clustered based on their images to trace the origins of spam emails. Images are considered visually resembling if their illustrations, text, layouts, and/or background textures are similar. J. Song et al. [130] focus on clustering spam emails based on the IP addresses resolved from the URLs inside the body of these emails. Two emails belong to the same cluster if their sets of IP addresses resolved from URLs are exactly the same. In the previous works, a pairwise comparison of each pair of emails is required for finding the clusters. This kind of comparison has two problems: the time complexity is quadratic, which is not suitable for big data clustering, and furthermore finding clusters is based on just one or two features of the messages, which decreases precision. In the works that follow, spam emails are instead grouped with the use of clustering algorithms.

In [132], the same authors as [130] mention that considering only IP addresses resolved from URLs is insufficient for clustering. Since web servers contain lots of web sites with the same IP address, each IP cluster in [130] consists of a large amount of spam emails sent by different controlling entities. Thus, in their new method, called O-means clustering, which is based on the K-means clustering method, the authors cluster spam emails using a distance based on 12 features of the body of an email which are expressed by numbers, and the Euclidean distance is used to measure the distance between two emails. In [131], after clustering spam emails according to the O-means method, the authors found that the 10 largest clusters had sent about 90 percent of all spam emails in their dataset. Hence, the authors investigated these 10 clusters to implement a heuristic analysis for selecting significant features among the 12 features used in the previous work. As a result, they selected the four most important features which could effectively separate these 10 clusters from each other. Since the idea for clustering is based on k-means, the approach is computationally NP-hard. It also requires the number of clusters to be known from the beginning.

In [144] the authors focus on a set of eleven attributes extracted from messages to cluster spam emails. Two clustering methods are used: first, the agglomerative hierarchical algorithm clusters the whole dataset; next, for some clusters containing too many emails, the connected components with weighted edges algorithm is used to solve the problem of the false positive rate. With the use of agglomerative clustering [66], a global clustering is done based on common features of email attributes. In the beginning, each email is a cluster by itself and then clusters sharing common features are merged. In this model, edges connect two nodes (spam emails) based on the eleven attributes. The desired clusters are the connected components of this graph with a weight above a specified threshold. This method suffers from not being usable for large datasets, since the pairwise comparison requires quadratic time complexity. The basic hypothesis of the FP-Tree method [27] for clustering spam emails is that some parts of spam messages are static from the point of view of recognizing a spam campaign. In this work, as an improvement over [92], not only URLs are considered for clustering.

For identifying spam campaigns, a Frequent Pattern Tree (FP-Tree), as a signature-based method, is constructed from some features extracted from spam emails. These features are: language of the email, message layout, type of message, URL and subject. In this tree, each node after the root depicts a feature extracted from the spam message that is shared by the subtrees beneath. Thus, each path in this tree shows sets of features that co-occur in messages, with the property of non-increasing order of frequency of occurrence. The problem of the FP-Tree is that it is based on the frequency of features rather than creating pure clusters in terms of homogeneity. The redundant features are also removed for specifying a campaign according to the frequency property, while in our method redundant features are characterized based on the purity or homogeneity of campaigns. However, the greatest problem results from the sensitivity of the FP-Tree to dynamic URL and text generation in layout detection. The reason is that the layout is extracted line by line, which means two very similar emails with a one-line difference will be attributed to two different layouts.

In summary, the previous works on clustering spam emails can be mainly divided into two categories: the first group focuses on the pairwise comparison of each pair of emails, for example URL comparison, while the second group consists of those in which a clustering algorithm is used, for example O-means clustering. In general, the aforementioned previous works suffer from one of the following problems: 1) they consider one or two features for grouping spam messages, which decreases the accuracy; 2) pairwise comparison is used, with quadratic time complexity; 3) the number of clusters is required as prior knowledge; 4) the features which create a pure cluster are not focused on. In our proposed algorithm, we try to solve these problems.


3.4 Categorical Clustering Tree (CCTree)

The general idea for the construction comes from a supervised learning algorithm called Induction Decision Tree (ID3) [109]. To create the CCTree, a set of objects is given in which each data point is described in terms of a set of categorical attributes, e.g. the language of a message. Each attribute represents the value of an important feature of the data and is limited to assume a set of discrete, mutually exclusive values, e.g. the attribute Language can take its values, or features, as English or French. Then, a tree is constructed in which the leaves are the desired clusters, while the other nodes contain non-pure data needing an attribute-based test to separate them. The separation is shown with a branch for each possible outcome of the specific attribute values. Each branch or edge extracted from that parent node is labeled with the selected value which directs data to the child node. The attribute for which the Shannon entropy is maximal is selected to divide the data. A purity function on a node, based on Shannon entropy, is defined. The purity function represents how homogeneous the data belonging to a node are. A required threshold of node purity is specified. When the node purity is equal to or better than this threshold, or the number of elements in a node is less than a threshold, the node is labeled as a leaf or terminal node.

The precise process of CCTree construction can be formalized as follows:

— Input : Let D be a set of data points, containing N tuples on a set A of d attributes, and a set of stop conditions S.

Attributes An ordered set of d attributes A = {A_1, A_2, . . . , A_d} is given, where each attribute is an ordered set of mutually exclusive values. Thus, the j'th attribute can be written as A_j = {v_1j, v_2j, . . . , v_(r_j)j}, where r_j is the number of features of attribute A_j. For example, A_i could be the Language of a spam email, with the set of possible values {English, French, Spanish}.

Data Points A set D of N data points is given, where each data point is a vector whose elements are the features of attributes, i.e. D_i = (v_1i₁, v_2i₂, . . . , v_di_d), where v_ki_k ∈ A_k is the i_k'th feature of the k'th attribute. For example: spam_1 = (English, excel attachment, image based).

Stop Conditions A set of stop conditions S = (µ, ε) is given. µ is the "minimum number of elements in a node", i.e. when the number of elements in a node is less than µ, the node is not divided even if it is not pure enough. ε represents the "minimum desired purity" of each cluster, i.e. when the purity of a node is better than or equal to ε, it is considered as a leaf.

To calculate the node purity, a function based on Shannon entropy is defined as follows. Let N_kji be the number of elements having the k'th value of the j'th attribute in node i, and N_i the number of elements in node i. Thus, considering p(v_kji) = N_kji / N_i, the purity of node i, denoted by ρ(i), is defined as:

ρ(i) = − ∑_{j=1}^{d} ∑_{k=1}^{r_j} p(v_kji) log(p(v_kji))

where d is the number of attributes, and r_j is the number of features of the j'th attribute.

— Output : A set of clusters which are the leaves of the categorical clustering tree.
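The purity function above can be sketched in a few lines of Python. This is an illustrative sketch, not the thesis implementation; the function name and the choice of log base 2 are our assumptions, since the text leaves the base unspecified:

```python
from collections import Counter
from math import log2

def node_purity(points):
    """Purity rho(i) of a node: the summed Shannon entropy of every
    attribute's value distribution over the points in the node
    (0 = perfectly pure). Each point is a tuple of categorical
    values, one entry per attribute."""
    n = len(points)
    rho = 0.0
    for j in range(len(points[0])):              # j'th attribute
        counts = Counter(p[j] for p in points)   # N_kji for each value k
        for c in counts.values():
            p_kj = c / n                         # p(v_kji) = N_kji / N_i
            rho -= p_kj * log2(p_kj)
    return rho

# A node whose points are all identical is perfectly pure:
print(node_purity([("English", "excel", "image"),
                   ("English", "excel", "image")]))  # 0.0
```

A node whose attribute values are evenly mixed scores a high entropy, which is exactly the situation in which the CCTree keeps splitting.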

Figure 3.5 – A Small CCTree. (The figure shows a root node S split into S_r and S_b by branches labeled red and blue; S_b is further split into S_b.s and S_b.l by branches labeled small and large.)

We report in the following the process of creating the CCTree :

At the beginning, all data points, as a set of N tuples, are assigned to the root of the tree. The root is the first new node. The clustering process is applied iteratively to each newly created node. For each new node of the tree, the algorithm checks whether the stop conditions are verified, i.e. whether the number of data points is less than the threshold µ, or the purity is less than or equal to ε. In this case, the node is labeled as a leaf; otherwise the node is split.

In order to find the best attribute for dividing the cluster, the Shannon entropy of the distribution of each attribute's values is calculated. The attribute for which the Shannon entropy is maximal is selected, since the attribute with the most equiprobable distribution of values generates the highest amount of chaos (non-homogeneity) in a node. For each possible value of the selected attribute, a branch labeled with that value is extracted from the node, directing the data with that value to the corresponding child node. The process is iterated until each node is either a parent node or is labeled as a leaf. At the last step, the final nodes (leaves) of the tree are the set of desired clusters, named C_1, C_2, . . . , C_k. Figure 3.5 depicts an example of a small CCTree, whilst a formal description of the algorithm is given in Algorithm 1.

The source code is provided in A.1.


Algorithm 1 : Categorical Clustering Tree (CCTree) algorithm
Input : data points D_k, attributes A_l, attribute values V_m, node_purity_threshold (ε), min_num_elem (µ)
Output : clusters C_k

1  root node N_0 takes all data points D_k
2  for each node N_i that is not a leaf node do
3      if node_purity(N_i) < node_purity_threshold or
4         num_elem(N_i) < min_num_elem then
5          label N_i as a leaf
6      else
7          select the attribute A_j yielding the maximum Shannon entropy
8          use A_j to divide the data of N_i
9          generate new nodes N_i1, . . . , N_it, with t = the number of values of attribute A_j
10     end
11 end
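Algorithm 1 can be condensed into a short recursive procedure. The following is an illustrative Python sketch under our own naming (the thesis implementation is in MATLAB); ties between attribute entropies are broken arbitrarily here:

```python
from collections import Counter
from math import log2

def attribute_entropy(points, j):
    """Shannon entropy of the value distribution of attribute j."""
    n = len(points)
    return -sum((c / n) * log2(c / n)
                for c in Counter(p[j] for p in points).values())

def cctree(points, epsilon, mu):
    """Recursive CCTree sketch: a node becomes a leaf when its purity
    (summed attribute entropy) drops to epsilon or it holds fewer than
    mu points; otherwise it is split on the max-entropy attribute,
    one branch per attribute value. Returns the leaves as clusters."""
    purity = sum(attribute_entropy(points, j) for j in range(len(points[0])))
    if purity <= epsilon or len(points) < mu:
        return [points]                                    # leaf: one cluster
    j = max(range(len(points[0])), key=lambda a: attribute_entropy(points, a))
    branches = {}
    for p in points:                                       # one branch per value of A_j
        branches.setdefault(p[j], []).append(p)
    clusters = []
    for child in branches.values():
        clusters.extend(cctree(child, epsilon, mu))
    return clusters

emails = [("English", "pdf"), ("English", "zip"), ("French", "zip")]
print(len(cctree(emails, 0.001, 1)))  # 3
```

With a strict purity requirement and µ = 1, every mixed node keeps splitting until each leaf is internally identical; raising µ stops the division early, exactly as the stop conditions prescribe.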

3.5 Time Complexity

The proposed structure-based methodology for clustering spam emails into campaigns, respecting the aforementioned requirements of our problem, is linear in time complexity. This property is all the more remarkable when compared with the complexity of previous works for grouping spam emails into campaigns, which are mostly based on pairwise comparison of spam messages and thus suffer from the quadratic time complexity resulting from this kind of comparison.

Here, we briefly discuss the precise time complexity of the proposed methodology. Let n be the number of elements in the whole data set, n_i the number of elements in node i, m the total number of features, v_l the number of features of attribute A_l, d the number of attributes, and v_max = max(v_l), l = 1, 2, . . . , d.

For constructing a CCTree, an n_i × m matrix must be created from the data belonging to each non-leaf node i, which takes O(m × n_i) time. Finding the appropriate attribute on which to divide the data requires constant time. Dividing the n_i points on the v_l features of the selected attribute A_l needs O(n_i × v_l) time. This process is repeated in each non-leaf node. Thus, if K = m + 1 is the maximum number of non-leaf nodes, which arises in a complete tree, the maximum time required for constructing a CCTree with n elements equals O(K × (n × m + n × v_max)). Recalling that the number of features m, and consequently K = m + 1, are constant, we conclude that the result is linear in the number of data points.
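To make the arithmetic of the bound concrete, the following sketch (our own illustration, with invented sample figures) evaluates K × (n × m + n × v_max) and checks its linearity in n:

```python
def cctree_bound(n, m, v_max):
    """Operation-count bound for building a CCTree: K * (n*m + n*v_max),
    with K = m + 1 the maximum number of non-leaf nodes."""
    K = m + 1
    return K * (n * m + n * v_max)

# With m and v_max fixed (say 21 features with at most 10 values each),
# doubling the number of emails exactly doubles the bound:
print(cctree_bound(20_000, 21, 10) / cctree_bound(10_000, 21, 10))  # 2.0
```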


3.6 Conclusion

Spam emails impose a non-negligible cost, damaging users and companies for several millions of dollars each year. To fight spammers effectively, catch them, or analyze their behavior, it is not sufficient to stop spam messages from being delivered to the final recipient. Characterizing the spam campaign sent by a specific spammer, instead, is necessary to analyze the spammer's behavior. Such an analysis can be used to tailor a more specific prevention strategy, which could be more effective in tackling the issue of spam emails. Considering a large set of spam emails as a whole makes the definition of spam campaigns an extremely challenging task. Thus, we argue that a clustering algorithm is required to group this huge amount of data based on message similarities.

In this chapter we proposed a new categorical clustering algorithm, named CCTree, which we argue to be useful for the problem of clustering spam emails. This algorithm allows an easy analysis of data based on an informative structure. The CCTree algorithm introduces an easy-to-understand representation, from which it is possible to infer at first glance the criteria used to group spam emails into clusters. This information can be used, for example, by officers to track and prosecute a specific subset of spam emails, which may be related to an important crime. Here, we have mainly presented the theoretical results of our approach; the implementation of the CCTree algorithm and its usage in clustering spam emails is presented in the following chapter.


Chapter 4

Effectiveness and Efficiency of CCTree in Spam Campaign Detection

Spam emails yearly impose extremely heavy costs in terms of time, storage space and money on both private users and companies. Finding and prosecuting spammers and eventual spam email stakeholders should allow to directly tackle the root of the problem. To facilitate such a difficult analysis, which should be performed on large amounts of unclassified raw emails, in this chapter we propose a framework to quickly and effectively divide large amounts of spam emails into homogeneous campaigns through structural similarity. The framework exploits a set of 21 features representative of the email structure and a novel categorical clustering algorithm named Categorical Clustering Tree (CCTree). The methodology is evaluated and validated through standard tests performed on three datasets accounting for more than 200k real recent spam emails ([129]).

4.1 Introduction

Spam emails constitute a notorious and persistent problem still far from being solved. In the last year, out of the 191.4 billion emails sent worldwide daily on average, more than 70% were spam emails [110]. Spam emails cause several problems, spanning from direct financial losses to misuses of Internet traffic, storage space and computational power [113]. Moreover, spam emails are becoming a tool to perpetrate different cybercrimes, such as phishing, malware distribution, or social engineering-based frauds.

Given the relevance of the problem, several approaches have already been proposed to tackle the spam email issue. Currently, the most used approach for fighting spam emails consists in identifying and blocking them on the recipient machine through filters, which are generally based on machine learning techniques or content features, such as keywords or non-ASCII characters [30] [46] [123] [22]. Unfortunately, these countermeasures only slightly mitigate the


problem, which still imposes non-negligible costs on users and companies [113].

To effectively fight the problem of spam emails, it is mandatory to find and prosecute the spammers, who generally hide behind complex networks of infected devices which send spam emails against their users' will, i.e. botnets. Thus, information useful for finding the spammers should be inferred by analyzing the text, attachments and other elements of the emails, such as links. Therefore, the early analysis of correlated spam emails is vital [44] [4]. However, such an analysis constitutes an extremely challenging task, due to the huge amount of spam emails, which increases vastly hour by hour (8 billion per hour) [110], and to the high variance that related emails may show, due to the use of obfuscation techniques [108]. To simplify this analysis, the huge amount of spam emails, generally collected through honey-pots, should be divided into spam campaigns [132]. A spam campaign is the set of messages spread by a spammer with a specific purpose [27], like advertising a product, spreading ideas, or criminal intents.

In this chapter, we propose to use the algorithm presented in Chapter 3, on a set of 21 attributes, to quickly and effectively group large amounts of spam emails by structural similarity. A set of 21 discriminative structural features is considered to obtain homogeneous email groups, which identify different spam campaigns. Grouping spam emails on the basis of their similarities is a known approach. However, previous works mainly focus on the analysis of a few specific parameters [4] [111] [132] [139], showing results whose accuracy is still somewhat limited. The approach is based on applying CCTree, a tree-like structure whose leaves represent the various spam campaigns. The algorithm clusters (groups) emails through structural similarity, verifying at each step the homogeneity of the obtained clusters and dividing the groups that are not homogeneous (pure) enough on the basis of the attribute which yields the greatest variance (entropy). The effectiveness of the proposed approach has been tested against 10k spam emails extracted from a real recent dataset 1, and compared with other well-known categorical clustering algorithms, reporting the best results in terms of clustering quality (i.e. purity and accuracy) and time performance.

The contribution of the present chapter can be summarized as follows:
— We introduce a set of 21 categorical features representative of email structure, briefly discussing the discretization procedure for numerical features, which are used for applying CCTree.
— The performance of CCTree has been thoroughly evaluated through internal evaluation, to estimate the ability to obtain homogeneous clusters, and external evaluation, for the ability to effectively classify similar elements (emails) when classes are known beforehand. Internal and external evaluation have been performed respectively on a dataset of 10k unclassified spam emails and on 276 emails manually divided into classes.
— We propose, and validate through analysis of 200k spam emails, a methodology to choose the optimal CCTree configuration parameters, based on the detection of the maximum curvature point (knee) on a homogeneity versus number-of-clusters graph.
— We compare the proposed methodology with two general categorical clustering algorithms, and with other methodologies specific to clustering spam emails.

1. http://untroubled.org/spam

The rest of this chapter is structured as follows. Section 4.2 describes the proposed framework, detailing the extracted features and reporting implementation details. Section 4.3 reports the experiments to evaluate the ability of CCTree in clustering spam emails, comparing the results with those of two well-known categorical clustering algorithms; the methodology to set the CCTree parameters is also reported and validated. Section 4.4 discusses limitations and advantages of the proposed approach, reporting a comparison of results with some related work. Other related work on clustering spam emails is presented in Section 4.5. Finally, Section 4.6 briefly concludes, proposing future directions.

4.2 Framework

The presented framework acts in two steps. First, raw emails are analyzed by a parser to extract vectors of structural features. Afterward, the collected vectors (elements) are clustered through the introduced CCTree algorithm. This section reports details of the proposed framework for analyzing and clustering spam emails, and of the extracted features.

4.2.1 Feature Extraction and Definition

To describe spam emails, we have selected a set of 21 categorical attributes which are representative of the structural properties of emails. The reason is that the general appearance of messages belonging to the same spam campaign mainly remains unchanged, although spammers usually insert random text or links [27]. The selected attributes extend the set of structural features proposed in [99] to label emails as spam or ham.

The attributes and a brief description are presented in Table 4.1.

Since the clustering algorithm is categorical, all selected features are categorical as well. It is worth noting that some features, e.g. AttachmentSize, naturally represent numerical values rather than categorical ones. However, it is always possible to turn these features from numerical into categorical, by defining intervals and assigning a feature value to each interval so defined. We chose these intervals on the basis of the ChiMerge discretization method [85], which returns outstanding results for discretization in decision-tree-like problems [56].
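Once the interval boundaries have been fixed, mapping a numeric feature to its categorical value is a simple lookup. The sketch below illustrates only this mapping step; in the thesis the cut points come from ChiMerge, while the thresholds and labels here are invented placeholders:

```python
import bisect

def discretize(value, cut_points, labels):
    """Map a numeric feature to the categorical label of the interval
    it falls into. The cut points would be produced by ChiMerge; the
    ones used below are hypothetical examples."""
    return labels[bisect.bisect_right(cut_points, value)]

size_cuts   = [10, 100, 1000]                       # hypothetical KB thresholds
size_labels = ["tiny", "small", "medium", "large"]  # one label per interval
print(discretize(250, size_cuts, size_labels))      # medium
```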

The details of the discretization results are provided in Tables A.1 through A.18.

Features of particular interest are the ones that report the number of pictures in the email (ImagesNumber), the presence of HTML tags (IsHTML), or the number of links (NumberOfLinks). Through these features, in fact, it is possible to determine whether the email


Attribute                    Description
RecipientNumber              Number of recipient addresses.
NumberOfLinks                Total links in email text.
NumberOfIPBasedLinks         Links shown as an IP address.
NumberOfMismatchingLinks     Links whose text differs from the real link.
NumberOfDomainsInLinks       Number of domains in links.
AvgDotsPerLink               Average number of dots per link in text.
NumberOfLinksWithAt          Number of links containing "@".
NumberOfLinksWithHex         Number of links containing hex chars.
SubjectLanguage              Language of the subject.
NumberOfNonAsciiLinks        Number of links with non-ASCII chars.
IsHtml                       True if the mail contains HTML tags.
EmailSize                    The email size, including attachments.
Language                     Email language.
AttachmentNumber             Number of attachments.
AttachmentSize               Total size of email attachments.
AttachmentType               File type of the biggest attachment.
WordsInSubject               Number of words in the subject.
CharsInSubject               Number of chars in the subject.
ReOrFwdInSubject             True if the subject contains "Re" or "Fwd".
NonAsciiCharsInSubject       Number of non-ASCII chars in the subject.
ImagesNumber                 Number of images in the email text.

Table 4.1 – Features extracted from each email.

is raw text, contains several images, or is presented in the form of a web page, which mostly remains unchanged when a spammer designs a spam campaign to be sent in bursts.
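A few of the features of Table 4.1 can be extracted with a handful of header and body inspections. The sketch below is our own Python illustration (the thesis parser is Java-based and uses JSoup); the link counting via substring matching is a deliberate simplification of real HTML parsing:

```python
from email import message_from_string
from email.policy import default

def extract_features(raw_email):
    """Illustrative sketch of a few of the 21 structural features of
    Table 4.1; function and key names are our own assumptions."""
    msg = message_from_string(raw_email, policy=default)
    subject = msg.get("Subject") or ""
    body = msg.get_body(preferencelist=("html", "plain"))
    text = body.get_content() if body else ""
    return {
        "RecipientNumber": len((msg.get("To") or "").split(",")),
        "WordsInSubject": len(subject.split()),
        "CharsInSubject": len(subject),
        "ReOrFwdInSubject": subject.lower().startswith(("re:", "fwd:")),
        "IsHtml": "<html" in text.lower(),
        # crude stand-in for real link extraction with an HTML parser
        "NumberOfLinks": text.lower().count("href="),
    }

raw = (
    "From: a@x.com\n"
    "To: b@y.com\n"
    "Subject: Re: cheap meds\n"
    "\n"
    "<html><a href='http://spam.example'>buy</a></html>\n"
)
print(extract_features(raw)["NumberOfLinks"])  # 1
```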

4.2.2 Implementation Details

On the implementation side, an email parser has been developed in Java to automatically analyze raw email text and extract the features in the form of vectors. The software exploits JSoup [69] for HTML parsing, and the LID 2 Python tool for language recognition. The LID software exploits the technique of n-grams to recognize the language of a text. For each language that LID has to recognize, a database of words must be provided to the software, in order to extract n-grams. The languages on which LID has been trained are the following: English, Italian, French, German, Spanish, Portuguese, Chinese, Japanese, Persian, Arabic, Croatian. We have implemented the CCTree algorithm using the MATLAB 3 software, which takes as input the matrix of email features extracted by the parser.

It is worth noting that the complete framework, i.e. the feature extraction and clustering modules, is totally portable across different operating systems. In fact, both the feature extraction module

2. http://www.cavar.me/damir/LID/
3. http://mathworks.com


and the clustering module (i.e. MATLAB) are Java-based and executable on the vast majority of general-purpose operating systems (UNIX, iOS, etc.). The Python module for language analysis is also portable. Moreover, LID has been made a disposable component, i.e. if the Python interpreter is missing, the analysis is not stopped: for the emails whose language is not inferable, the value UNKNOWN_LANGUAGE is used for the attribute instead.

4.3 Evaluation and Results

This section reports the experimental results that evaluate the quality of the CCTree algorithm on the problem of clustering spam emails. A first set of experiments has been performed on a dataset of 10k recent spam emails (February 2015), to estimate the capability of the CCTree algorithm to obtain homogeneous clusters. This evaluation is known as internal evaluation and estimates the quality of the clustering algorithm, measuring how much each element of a resulting cluster is similar to the elements of the same cluster and dissimilar from the elements of other clusters. A second set of experiments aims at assessing the capability of CCTree to correctly classify data, using a small dataset with benchmark classes known beforehand. This evaluation is named external evaluation and measures the similarity between the resulting clusters of a specific algorithm and the desired clusters (classes) of the pre-classified dataset. For the external evaluation, CCTree has been tested against a dataset of 276 emails, manually labeled in 29 classes 4. The emails have been manually divided by looking both at the structure and at the semantics of the messages. Thus, emails belonging to one class can be considered part of a single spam campaign.

The results of CCTree are compared with those of two categorical clustering algorithms, namely COBWEB and CLOPE, well known for being an accurate and a fast clustering algorithm, respectively. The comparison has been done both for internal and external evaluation on the same aforementioned datasets. A time performance analysis is also reported. It is worth noting that the three algorithms are all implemented on Java-based tools, hence the validity of the time comparison.

In what follows, we briefly introduce these two algorithms :

COBWEB COBWEB, proposed by [51], is a categorical clustering algorithm which builds a dendrogram where each node is associated with a conditional probability summarizing the attribute-value distributions of the objects belonging to that node. Differently from the CCTree algorithm, it also includes a merging operation to join two separate nodes into a single one. COBWEB is computationally demanding and time consuming, since it re-analyzes every single data point at each step. COBWEB employs the following four operations:
• merging two nodes: the two nodes are replaced by a node

4. Available at: http://security.iit.cnr.it/images/Mails/cctreesamples.zip


whose children are the original nodes' children, and which summarizes the attribute-value distribution of the elements classified under them.
• splitting a node: a node is split by replacing it with its own children.
• inserting a new node: a new node is created for a new data point inserted into the tree.
• passing a datum down the tree: the datum is placed in the node it fits best.

However, the COBWEB algorithm is used in several fields for its good accuracy, to the point that its similarity measure, named Category Utility, is used to evaluate categorical clustering accuracy [7]. It is formally defined as follows.

Definition 4.1. Category Utility (CU): The category utility [60] is defined as the difference between the expected number of attribute values that can be guessed correctly given the clustering, and the expected number of correct guesses without this knowledge. Let C_1, . . . , C_k be the set of clusters, and let v_ij (for all possible j) be the values of attribute A_i; then CU is defined as:

CU = ∑_{l=1}^{k} (|C_l| / k) ∑_i ∑_j [ P(A_i = v_ij | C_l)² − P(A_i = v_ij)² ]

The WEKA [65] implementation of COBWEB has been used for our experiments.
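Definition 4.1 can be sketched directly in Python. This is an illustrative sketch, not WEKA's implementation; we follow the |C_l|/k weighting as printed in the text, noting that other presentations weight each cluster by its probability instead:

```python
from collections import Counter

def category_utility(clusters):
    """Category Utility of Definition 4.1. `clusters` is a list of
    clusters, each a list of tuples of categorical attribute values."""
    k = len(clusters)
    everything = [x for c in clusters for x in c]
    n_attrs = len(everything[0])

    def guess_score(elements):
        # sum over attributes i and values v of P(A_i = v)^2
        total = 0.0
        for a in range(n_attrs):
            counts = Counter(x[a] for x in elements)
            total += sum((c / len(elements)) ** 2 for c in counts.values())
        return total

    base = guess_score(everything)  # guessing without cluster knowledge
    return sum((len(c) / k) * (guess_score(c) - base) for c in clusters)

# Two perfectly separated single-attribute clusters:
print(category_utility([[("a",), ("a",)], [("b",), ("b",)]]))  # 1.0
```

A clustering that separates the values perfectly scores higher than the trivial single-cluster solution, whose CU is zero.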

CLOPE : CLOPE [148] is a fast categorical clustering algorithm which maximizes the number of elements with the same value for a subset of attributes, attempting to increase the homogeneity of each obtained cluster. In this algorithm, a global criterion function is proposed to increase the intra-cluster overlap by increasing the height-to-width ratio of the cluster histogram. The clustering with the maximum height-to-width ratio over all cluster histograms is the optimal result. Formally, the CLOPE clustering is defined as follows.

Let X = {x_1, x_2, . . . , x_n} be the set of n tuples, where all the features of each data point x_i, 1 ≤ i ≤ n, are categorical. Suppose C = {C_1, C_2, . . . , C_k} represents the division of X into k clusters, and H_i is the histogram of C_i with respect to the categorical attributes. Two measure functions are introduced in this method:

S(C_i) = ∑_{x_j ∈ C_i} |x_j|

where |x_j| is the dimensionality of x_j, and

W(C_i) = |H_i|

where |H_i| is the number of bins in histogram H_i. The criterion function of CLOPE is then defined as:

max Profit(C) = (1/n) ∑_{i=1}^{k} ( S(C_i) / W(C_i)² ) × |C_i|

where |C_i| is the number of elements in cluster C_i.

Also for CLOPE we have used the WEKA [65] implementation for the performed experiments.
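The CLOPE criterion above can be sketched in a few lines. This is our own illustrative Python sketch of the profit function (with the repulsion exponent fixed to 2, as in the formula), not WEKA's implementation:

```python
from collections import Counter

def profit(clusters, r=2):
    """CLOPE criterion function: (1/n) * sum_i S(C_i)/W(C_i)^r * |C_i|,
    where each element is a tuple of categorical values."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for cluster in clusters:
        s = sum(len(x) for x in cluster)                 # S(C_i): sum of tuple sizes
        histogram = Counter(v for x in cluster for v in x)
        w = len(histogram)                               # W(C_i): number of bins
        total += s / (w ** r) * len(cluster)
    return total / n

# Keeping two identical tuples together scores higher than splitting them:
together = [[("a", "b"), ("a", "b")]]
apart    = [[("a", "b")], [("a", "b")]]
print(profit(together), profit(apart))  # 1.0 0.5
```

The example shows the intended behavior of the criterion: overlapping elements piled into the same cluster make the histogram taller without making it wider, which raises the profit.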


4.3.1 Internal Evaluation

When the result of a clustering algorithm is evaluated on the very data that was clustered, the evaluation is called internal. Internal evaluation measures the ability of a clustering algorithm to obtain homogeneous clusters. A high score on internal evaluation is given to clustering algorithms which maximize the intra-cluster similarity, i.e. elements within the same cluster are similar, and minimize the inter-cluster similarity, i.e. elements from different clusters are dissimilar. Cluster dissimilarity is measured by computing the distances between elements (data points) in the various clusters. The distance function used varies with the specific problem. In particular, for elements described by categorical attributes, the common geometric distances, e.g. the Euclidean distance, cannot be used. Hence, in this work the Hamming and Jaccard distance measures [66] are applied. The Hamming distance considers two elements closer when they have the same value for a higher number of attributes. The Jaccard distance, on the other hand, is based on the size of the intersection of the attribute values of two elements divided by the size of their union. Internal evaluation can be performed directly on the dataset the clustering algorithm operates on, i.e. knowledge of the classes (desired clusters) is not a prerequisite. The indexes used for internal evaluation are the Dunn index [19] and the Silhouette [118], which are defined as follows:

Dunn index : Let ∆_i be the diameter of cluster C_i, defined as the maximum distance between elements of C_i:

∆_i = max_{x, y ∈ C_i, x ≠ y} d(x, y)

where d(x, y) measures the distance between the pair x and y, and can be any distance specified by the user, e.g. the Hamming distance. Also, let δ(C_i, C_j) be the inter-cluster distance between clusters C_i and C_j, calculated as the pairwise distance between elements of the two clusters. Then, on a set of k clusters, the Dunn index [64] is defined as:

DI_k = min_{1 ≤ i ≤ k} min_{1 ≤ j ≤ k, j ≠ i} [ δ(C_i, C_j) / max_{1 ≤ t ≤ k} ∆_t ]

A higher Dunn index value means a better cluster quality. It is worth noting that the value of the Dunn index is negatively affected by the greatest diameter among all generated clusters (max_{1 ≤ t ≤ k} ∆_t). Hence, even a single resulting cluster with poor quality (non-homogeneous) will cause a low value of the Dunn index. On the other hand, a higher value of this index means that the overall homogeneity of the resulting clusters is noticeable.
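The Dunn index computation can be sketched as follows. This is our own illustration: the text leaves δ(C_i, C_j) as "the pairwise distance between elements of two clusters", so we assume the minimum pairwise cross-cluster distance here:

```python
from itertools import combinations

def hamming(x, y):
    """Hamming distance for categorical tuples: the fraction of
    attributes on which the two elements disagree."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def dunn_index(clusters, d=hamming):
    """Dunn index: smallest inter-cluster distance divided by the
    largest cluster diameter. Singleton-only clusterings have zero
    diameter, for which we return infinity."""
    diameter = max((d(x, y)
                    for c in clusters for x, y in combinations(c, 2)),
                   default=0.0)
    inter = min(d(x, y)
                for ci, cj in combinations(clusters, 2)
                for x in ci for y in cj)
    return inter / diameter if diameter else float("inf")
```

For instance, two clusters that each vary on one attribute but differ from each other on at least one attribute give a Dunn index of 1, since the smallest cross-cluster distance equals the largest diameter.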

Silhouette : The dissimilarity of a point x_i from a cluster C is the average distance from x_i to the points of C. Dissimilarity mostly refers to a distance measure; for categorical attributes, the Hamming distance can be used. Let d(x_i) be the average dissimilarity of data point x_i with the other data points within the same cluster. Also, let d′(x_i) be the lowest average dissimilarity of x_i to any other cluster, except the cluster x_i belongs to. Then, the silhouette [118] s(i) of x_i is defined as:

s(i) = (d′(i) − d(i)) / max{d(i), d′(i)}

which equals 1 − d(i)/d′(i) if d(i) < d′(i); 0 if d(i) = d′(i); and d′(i)/d(i) − 1 if d(i) > d′(i).

By definition, s(i) ∈ [−1, 1]. The closer s(i) is to 1, the more appropriately the data point x_i is clustered. The average value of s(i) over all data points of a cluster shows how tightly related the data within the cluster are. Hence, the closer the average value of s(i) is to 1, the better the clustering result. For easy interpretation, the silhouette of all clustered points can also be represented through a silhouette plot.
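The per-point silhouette can be sketched as below. This is our own illustrative Python sketch using the Hamming distance; the element is addressed by its index in its cluster, and the cluster must contain at least two elements for d(i) to be defined:

```python
def hamming(x, y):
    """Fraction of attributes on which two categorical tuples disagree."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def silhouette(i, cluster, other_clusters, d=hamming):
    """Silhouette s(i) of the i'th element of `cluster`: d_in is its
    mean distance to the rest of its own cluster, d_out the lowest
    mean distance to any other cluster."""
    x = cluster[i]
    rest = cluster[:i] + cluster[i + 1:]
    d_in = sum(d(x, y) for y in rest) / len(rest)
    d_out = min(sum(d(x, y) for y in c) / len(c) for c in other_clusters)
    if d_in == d_out:
        return 0.0
    return (d_out - d_in) / max(d_in, d_out)

# An element identical to its cluster mate and far from the other cluster:
print(silhouette(0, [("a", "a"), ("a", "a")], [[("b", "b")]]))  # 1.0
```

Averaging this value over all points, as done in the tables below, summarizes how tightly each clustering packs similar emails together.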

Performance Comparison

As discussed in Chapter 3, the CCTree algorithm requires two stop conditions as input, i.e. the minimum number of elements in a node to be split (µ), and the minimum purity of a cluster (ε). Henceforth, the notation CCTree(ε, µ) will be used to refer to a specific parameterization of the CCTree algorithm. To choose the stop conditions, we first fix the minimum number of elements to µ = 1, and then vary the node purity to see how the internal indexes are affected. It is worth noticing that when µ is fixed to 1, the only stop condition affecting the result is ε. Table 4.2 shows the results of the internal evaluations when µ is fixed to 1 and ε takes the five values 0.0001, 0.001, 0.01, 0.1 and 0.5.

Table 4.2 – CCTree Internal evaluation with fixed number of elements.

Algorithm: CCTree, µ = 1

                       ε = 0.0001   ε = 0.001   ε = 0.01   ε = 0.1   ε = 0.5
Silhouette (Hamming)   0.9772       0.9772      0.9642     0.7124    0.1040
Silhouette (Jaccard)   0.9777       0.9777      0.9650     0.7110    0.0820
Dunn (Hamming)         0.5          0.5         0.2        0.1111    0.0714
Dunn (Jaccard)         0.25         0.25        0.1571     0.1032    0.0704

For further insight, we report in Figures 4.1, 4.2, 4.3, and 4.4 the Hamming-distance silhouette plots for the CCTrees with the same parameters as in Table 4.2. The graphs are horizontal histograms in which every bar represents the silhouette result, s(i) ∈ [−1, 1], for each data point x_i, as per the aforementioned definition. It can be seen that both CCTree(0.001,1) (Fig. 4.1) and CCTree(0.01,1) (Fig. 4.2) do not show negative values for any data point, hence the high silhouette value close to 1. Indeed, the first row of Table 4.2 shows the average silhouette over all the points the CCTree is constructed on, with the identified input


Figure 4.1 – CCTree(0.001,1)

Figure 4.2 – CCTree (0.01,1)


Figure 4.3 – CCTree(0.1,1)

Figure 4.4 – CCTree(0.5,1)


parameters. The white spaces in the plots indicate the points for which the silhouette result equals one.

Figure 4.5 graphs the internal evaluation measurements of the CCTree for five different values of ε, when the minimum number of elements µ is set to 1. It is worth noting that if µ = 1, the only stop condition affecting the result is the node purity. This is why we first fix µ = 1, to find the best value of required node purity for our dataset.

Figure 4.5 – Internal evaluation at the variation of the ε parameter.

As shown in Figure 4.5, the evaluation indexes reach their maximum and stabilize when ε = 0.001. Stricter purity requirements (i.e., ε < 0.001) do not further increase the precision. This value of ε is therefore fixed for the following evaluations. More precisely, we first fix one of the input parameters so that it does not affect the result, i.e. µ = 1, and assign different values to the other parameter; at one point, a stricter parameter no longer affects the overall homogeneity result.

Fixing the node purity to ε = 0.001, we then look for the best value of the µ parameter, in order to compare the CCTree performance with the accurate COBWEB and the fast CLOPE. To this end, we test four different values for the minimum number of elements in a cluster. Table 4.3 presents the Silhouette and Dunn index results for the proposed values of µ, namely 1, 10, 100, and 1000. In addition, the last two rows of Table 4.3 report the resulting number of clusters and the time required to generate the clusters.

Table 4.3 also reports the comparison with the two categorical clustering algorithms COBWEB


Table 4.3 – Internal evaluation results of CCTree, COBWEB and CLOPE.

Algorithm              COBWEB    CCTree (ε = 0.001)                         CLOPE
                                 µ = 1    µ = 10   µ = 100   µ = 1000
Silhouette (Hamming)   0.9922    0.9772   0.9264   0.7934    0.5931    0.2801
Silhouette (Jaccard)   0.9922    0.9777   0.9290   0.8021    0.6074    0.2791
Dunn (Hamming)         0.1429    0.5      0.1      0.0769    0.0769    0
Dunn (Jaccard)         0.1327    0.25     0.1      0.0879    0.0857    0
Clusters               1118      619      392      154       59        55
Time (s)               17.81     0.6027   0.3887   0.1760    0.08610   3.02

and CLOPE, previously described. The two leftmost columns show comparable results in terms of clustering precision for the silhouette index. In fact, COBWEB and CCTree both have good precision when the CCTree purity is set to ε = 0.001 and the minimum number of elements to µ = 1 (CCTree(0.001,1)). COBWEB performs slightly better on the silhouette index, for both distance measures. However, the difference (less than 2 percent) is negligible considering that COBWEB creates almost twice as many clusters as CCTree(0.001,1).

It can be inferred that a higher number of small clusters improves internal homogeneity (e.g., a cluster with one element is totally homogeneous). However, as will be detailed in the following, a number of clusters strongly greater than the expected number of groups is not desirable. In fact, it follows from the Silhouette definition that, in case every element x_i is unique, the maximum theoretical value is achieved when each cluster contains exactly one element.

Moreover, CCTree(0.001,1) returns a better result for the Dunn index with respect to COBWEB. We recall that the value of the Dunn index is strongly affected by the homogeneity of the worst resulting cluster. The value returned for CCTree(0.001,1) shows that all the returned clusters globally have a good homogeneity compared to COBWEB, i.e., the worst cluster for CCTree(0.001,1) is much more homogeneous than the worst cluster for COBWEB.

The rightmost column of Table 4.3 reports the results for the CLOPE clustering algorithm. CLOPE is a categorical clustering algorithm, known to quickly create clusters that are as pure as possible. The accuracy of CLOPE is quite limited for the Silhouette, and zero for the Dunn index, whilst CCTree(0.001,1000), with almost the same number of clusters, is 35 times faster than CLOPE.

A graphical description of the accuracy difference between the clusterings of Table 4.3 can be inferred from the Hamming Silhouette plots of Figures 4.6, 4.7, 4.8, 4.9, 4.10, and 4.11. The plots are horizontal histograms in which every bar represents the silhouette result, s(i) ∈ [−1, 1], for each data point xi, as from the aforementioned definition.
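To make the per-point score concrete, the following minimal Python sketch computes s(i) from pairwise Hamming distances over categorical feature vectors. The function names and the convention s(i) = 0 for singleton clusters are our own assumptions for illustration, not the evaluation toolchain used in this chapter.

```python
from collections import defaultdict

def hamming(x, y):
    """Normalized Hamming distance between two categorical feature vectors."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def silhouette_scores(points, labels):
    """Per-point silhouette s(i) = (b - a) / max(a, b), where a is the mean
    distance from point i to its own cluster and b is the smallest mean
    distance to any other cluster."""
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)
    scores = []
    for i, x in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:  # common convention: s(i) = 0 for singleton clusters
            scores.append(0.0)
            continue
        a = sum(hamming(x, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(sum(hamming(x, points[j]) for j in idxs) / len(idxs)
                for lab, idxs in clusters.items() if lab != labels[i])
        m = max(a, b)
        scores.append((b - a) / m if m > 0 else 0.0)
    return scores
```

A perfectly homogeneous, well-separated clustering yields s(i) = 1 for every point, which is exactly the shape of the best bars in the plots below.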


Figure 4.6 – COBWEB

Both COBWEB and CCTree(0.001,1) show no negative values, with the majority of data points scoring s(i) = 1. For CCTree(0.001,1000) the worst data points do not score less than −0.5, whilst for CLOPE some data points have a silhouette of −0.8, which causes a strong non-homogeneity in their clusters. Also, the number of data points with positive values is much higher for CCTree(0.001,1000) than for CLOPE, even if CLOPE returns some points whose Silhouette value is 1. This also justifies the better value of the Dunn index for CCTree(0.001,1000), which we recall to be affected by the non-homogeneity of the worst cluster. Overall, CCTree(0.001,1000) returns a better Silhouette than CLOPE.

The outstanding point is that the runtime of CCTree(0.001,1000) is almost 30 times lower than that of CLOPE, making CCTree applicable as a fast categorical clustering algorithm.

Finally, Table 4.3 also reports the time elapsed for the clustering performed by the algorithms. It can be observed that COBWEB pays for its accuracy with an elapsed time of 17 seconds on the dataset of 10k emails, against the 3 seconds of the much more inaccurate CLOPE. The CCTree algorithm outperforms both COBWEB and CLOPE, requiring only 0.6 seconds in the most accurate configuration (CCTree(0.001,1)).

From the internal evaluation we can thus conclude that the CCTree algorithm obtains clusters whose quality is comparable with those of COBWEB, while requiring even less computational time than the fast but inaccurate CLOPE algorithm.


Figure 4.7 – CCTree(0.001,1)

Figure 4.8 – CCTree(0.001,10)


Figure 4.9 – CCTree(0.001,100)

Figure 4.10 – CCTree(0.001,1000)


Figure 4.11 – CLOPE

4.3.2 CCTree Parameters Selection

Through the internal evaluation and the results reported in Table 4.3 and Figures 4.6, 4.7, 4.8, 4.9, 4.10, and 4.11, we showed the dependence of the internal evaluation indexes and of the number of clusters on the values of the µ and ε parameters. We briefly discuss here some guidelines to correctly choose the CCTree parameters in order to maximize the clustering effectiveness.

Concerning the ε parameter, we showed in Section 4.3 that it is possible to find the optimal value of ε by setting µ = 1 and varying ε to find the fixed point in terms of accuracy, i.e., the optimal ε is the one for which smaller values of ε no longer improve the accuracy.

Once the parameter ε is fixed, the parameter µ must be selected to balance the accuracy with the number of generated clusters. As the number of clusters is affected by the µ parameter, it is possible to choose the optimal value of µ knowing the optimal number of clusters. The problem of estimating the optimal number of clusters for hierarchical clustering algorithms has been solved in [120] by determining the point of maximum curvature (knee) on a graph showing the inter-cluster distance as a function of the number of clusters.

Recalling that the silhouette index is inversely related to the inter-cluster distance, it is sound to find the knee on the graph of Figure 4.12, computed with the silhouette (Hamming) on the dataset used for internal evaluation, with seven different values of µ. For the sake of representation, we do not show in this graph the plots for µ greater than 100.


Figure 4.12 – Silhouette in function of the number of clusters for different values of µ.

Figure 4.13 – Silhouette (Hamming).

Applying the L-method described in [120], it is possible to find that the knee is located at µ = 10. It is worth recalling from Table 4.3 that the knee value for µ gives a silhouette value higher than 90%, while keeping the number of generated clusters much lower than that obtained with µ = 1.
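A minimal sketch of the L-method idea follows, assuming a simple two-segment least-squares fit: every candidate split of the evaluation curve is tried, a straight line is fitted to each side, and the split minimizing the size-weighted sum of the two fit errors marks the knee. The function name and the RMSE weighting reflect our own reading of [120], not a reference implementation.

```python
import numpy as np

def l_method_knee(x, y):
    """Locate the knee of an evaluation graph with the L-method: try every
    split point, fit a straight line to each side by least squares, and keep
    the split minimizing the size-weighted sum of the two root-mean-square
    fit errors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    best_c, best_err = None, float("inf")
    for c in range(2, n - 1):           # each side needs at least two points
        err = 0.0
        for xs, ys in ((x[:c], y[:c]), (x[c:], y[c:])):
            coef = np.polyfit(xs, ys, 1)
            resid = ys - np.polyval(coef, xs)
            err += len(xs) / n * float(np.sqrt(np.mean(resid ** 2)))
        if err < best_err:
            best_c, best_err = c, err
    return x[best_c]                    # x-position where the knee lies
```

On a curve that drops steeply and then flattens, the returned split sits at the transition between the two regimes.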

Table 4.4 – Silhouette values and number of clusters in function of µ for four email datasets.

Data        µ = 1                 µ = 10                µ = 100               µ = 1000
            Silhouette  Clusters  Silhouette  Clusters  Silhouette  Clusters  Silhouette  Clusters
February    0.9772      610       0.9264      385       0.7934      154       0.5931      59
March I     0.9629      635       0.9095      389       0.7752      149       0.6454      73
March II    0.9385      504       0.8941      306       0.8127      145       0.6608      74
March III   0.9397      493       0.8926      296       0.8102      131       0.6156      44

A further insight can be gained from the results of Table 4.4 and Figures 4.13 and 4.14, reporting


Figure 4.14 – Generated Clusters.

the analysis on three additional datasets of spam emails coming from three different weeks of March 2015, from the spam honeypot 5. The sets have a size comparable with that of the dataset used for internal evaluation (first week of February 2015), with respectively 10k, 10k and 9k spam emails.

From both the table and the graph it is possible to infer that the same trend for both the silhouette value and the number of clusters holds for all the tested datasets. Hence, we verify (i) the validity of the knee methodology, and (ii) the possibility of using the same CCTree parameters for datasets with the same data type and comparable size. Finally, for further insight, we graphically report the results of Table 4.4 in Figures 4.13 and 4.14.

From both figures it is visible how the four datasets follow the same trends for the internal evaluation indexes and the number of clusters with the same values of the µ parameter.

To give statistical validity to the performed analysis on parameter determinacy, we have analyzed a dataset of more than 200k emails collected from October 2014 to May 2015 from the honeypot 6. The emails have been divided into 21 datasets, each containing 10k spam emails. Each set represents one week of spam emails.

Tables 4.5 and 4.6 report the results of the silhouette index and the number of clusters for 6 months, from October 2014 to May 2015, except February and March, which were already reported in Tables 4.4 and 4.3.

To show that the silhouette value and the number of clusters of the spam campaigns of Tables 4.5, 4.6, 4.4 and 4.3 keep the trends of datasets with comparable size, in what follows we first briefly

5. http://untroubled.org/spam
6. http://untroubled.org/spam


Table 4.5 – Silhouette results, Hamming distance, ε = 0.001, for varying µ.

Month   µ = 1    µ = 10   µ = 100   µ = 1000
Oct1    0.9264   0.88     0.7936    0.5738
Oct2    0.9223   0.8625   0.7299    0.5557
Oct3    0.9071   0.8555   0.7474    0.6623
Nov1    0.9228   0.8706   0.7616    0.5593
Nov2    0.9655   0.9083   0.7873    0.5054
Nov3    0.9702   0.9064   0.7951    0.5078
Dec1    0.9566   0.9012   0.7736    0.6264
Dec2    0.9626   0.9108   0.7784    0.651
Dec3    0.9787   0.942    0.8451    0.6739
Jan1    0.9697   0.9387   0.8876    0.6675
Jan2    0.9683   0.9369   0.8776    0.7407
Jan3    0.9739   0.9441   0.8923    0.662
Apr1    0.9706   0.9161   0.7894    0.6894
Apr2    0.9694   0.9174   0.8234    0.6378
Apr3    0.9738   0.9361   0.8344    0.6866
May1    0.9675   0.9184   0.7679    0.5553
May2    0.964    0.9128   0.7712    0.4434
May3    0.9668   0.9299   0.819     0.5068

Table 4.6 – Number of clusters, ε = 0.001, for varying µ.

Month   µ = 1   µ = 10   µ = 100   µ = 1000
Oct1    507     310      141       31
Oct2    652     376      143       61
Oct3    562     333      124       64
Nov1    564     312      128       50
Nov2    689     399      161       56
Nov3    685     391      172       61
Dec1    586     359      135       66
Dec2    583     343      127       64
Dec3    437     273      118       50
Jan1    366     237      132       44
Jan2    366     216      110       43
Jan3    344     208      118       37
Apr1    574     341      127       54
Apr2    528     304      131       53
Apr3    408     242      101       47
May1    622     419      159       73
May2    578     372      133       60
May3    474     313      131       48


present what standard deviation means.

Standard Deviation: In statistics, the standard deviation (usually denoted σ) [11] is a measure quantifying the amount of variation of a set of values. A standard deviation close to 0 indicates that the data points tend to be very close to the mean, whilst a high standard deviation indicates that the data points are spread out over a wide range of values. Formally, let X be a random variable with mean value E(X); the standard deviation of X equals:

σ = √( E(X²) − (E(X))² )

That is, the standard deviation σ is the square root of the variance of X.
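The formula above can be computed directly from the two moments; a short Python illustration (the function name is ours) follows:

```python
import math

def std_dev(values):
    """Standard deviation computed as sqrt(E[X^2] - (E[X])^2)."""
    n = len(values)
    mean = sum(values) / n                          # E[X]
    mean_of_squares = sum(v * v for v in values) / n  # E[X^2]
    return math.sqrt(mean_of_squares - mean * mean)
```

For instance, the sample [2, 4, 4, 4, 5, 5, 7, 9] has mean 5, E[X²] = 29, and thus σ = √(29 − 25) = 2.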

Figures 4.15 and 4.16 show the average values for the number of clusters and the silhouette computed on the 21 datasets, varying the value of the µ parameter over the values of the former experiments (i.e., 1, 10, 100, 1000). The standard deviation (defined above) is also reported as error bars. It is worth noting that the standard deviation for the values µ = 1, 10 on the analyzed datasets is slightly higher than 2%, while it reaches 4% for µ = 100 and 8% for µ = 1000, which is in line with the results of Table 4.4. Comparable results are also obtained for the number of clusters, where the highest standard deviation is, as expected, for µ = 1, amounting to 108, which again is in line with the results of Table 4.4. Thus, for all the 21 analyzed datasets, spanning eight months of spam emails, we can always locate the knee for the silhouette and the number of clusters at µ = 10.

Figure 4.15 – Silhouette (Hamming).

4.3.3 External Evaluation

The external evaluation is a standard technique to measure the capability of a clustering algorithm to correctly classify data. To this end, the external evaluation is performed on a small dataset whose classes, i.e., the desired clusters, are known beforehand. This small dataset must


Figure 4.16 – Generated Clusters.

be representative of the operative reality, and it is generally separated from the dataset used for clustering.

A common index used for external evaluation is the F-measure [98], which coalesces into a single index the performance on correctly classified elements and misclassified ones. In the external evaluation, the result of the clustering algorithm is evaluated based on data which were not used for clustering, and whose classes are known beforehand. This set of pre-classified items is often labeled by human experts. External measures for clustering evaluate how close the result of the clustering algorithm is to the predetermined labeled data.

Formally, let the sets C1, C2, . . . , Ck be the desired clusters (classes) for the dataset D, and let C′1, C′2, . . . , C′l be the set of clusters returned by applying a clustering algorithm on D. Then, the F-measure F(i) for each class Ci is defined as follows:

F(i) = max_j ( 2 |Ci ∩ C′j| / (|Ci| + |C′j|) )

To compute the overall F-measure on all desired clusters, as an aggregated index, the weighted average of the F-measures of all predetermined clusters is computed as:


Fc = Σ_{i=1..k} F(i) · |Ci| / |C1 ∪ . . . ∪ Ck|

The F-measure result is returned in the range [0,1], where 1 represents the ideal situation in which the class Ci is exactly equal to one of the resulting clusters. More precisely, in the F-measure the elements of each predetermined class are compared with the elements of all resulting clusters, and the maximum similarity, equal to 1, is returned when one of the resulting clusters is identical to the predetermined class.
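The two formulas above can be sketched in a few lines of Python; the function name and the set-based representation of classes and clusters are our own illustrative choices:

```python
def overall_f_measure(classes, clusters):
    """Weighted F-measure of a clustering against known classes.
    `classes` and `clusters` are lists of sets of element identifiers."""
    def f(ci):
        # best overlap of class ci with any returned cluster
        return max(2 * len(ci & cj) / (len(ci) + len(cj)) for cj in clusters)
    total = len(set().union(*classes))   # |C1 ∪ ... ∪ Ck|
    return sum(len(ci) / total * f(ci) for ci in classes)
```

When the clustering reproduces the classes exactly, the index is 1; merging two classes into one cluster lowers it, since each class then overlaps only partially with its best-matching cluster.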

Experimental Results

For the sake of external evaluation, 276 spam emails collected from different spam folders of different mailboxes have been manually analyzed and classified. The emails have been divided into 29 groups (classes) according to the structural similarity of the raw email message. The external evaluation set has no intersection with the one used for internal evaluation.

Table 4.7 – External evaluation results of CCTree, COBWEB and CLOPE.

Algorithm   COBWEB   CCTree (ε = 0.001)                   CLOPE
                     µ = 1    µ = 5    µ = 10   µ = 50
F-Measure   0.3582   0.62     0.6331   0.6330   0.6       0.0076
Clusters    194      102      73       50       26        15

The results of the external evaluation are reported in Table 4.7. Building on the results of the internal evaluation, the value of node purity has been set to ε = 0.001 to obtain homogeneous clusters. The values of µ have been chosen according to the following rationale: µ = 1 represents a CCTree instantiation in which the µ parameter does not affect the result; on the other hand, µ = 50 returns a number of clusters comparable with the 29 classes manually collected. Higher values of µ do not modify the result for a dataset of this size.

The best results are returned for µ = 5, 10. The F-measure for these two values is higher than 0.63, with a negligible difference between them, even if the number of generated clusters is higher than the expected one. The F-measure, in fact, considers as correctly classified the elements of an original class C even if they are divided into several clusters C′1, . . . , C′n, provided these clusters do not contain data from classes different from C.

Table 4.7 also reports the comparison with the COBWEB and CLOPE algorithms. As shown, the CCTree algorithm outperforms COBWEB and CLOPE on the F-measure index, showing thus a higher capability of correctly classifying spam emails. We recall that for the internal evaluation,


COBWEB returned slightly better results than CCTree. This difference results from the number of generated clusters. COBWEB, in fact, always returns a high number of clusters (Table 4.7). This generally yields a high cluster homogeneity (several small and homogeneous clusters). However, it does not necessarily imply a good classification capability. In fact, as shown in Table 4.7, COBWEB returns almost 200 clusters on a dataset of 276 emails, which is six times the expected number of clusters (i.e., 29 clusters). This motivates the lower F-measure score for the COBWEB algorithm. It is worth noting that CCTree outperforms COBWEB even when the minimum number of elements per node is not considered (i.e., µ = 1). On the other hand, CLOPE also performs poorly on the F-measure for the 276 emails dataset. The CLOPE algorithm, in fact, produced only 15 clusters, i.e., less than half of the expected ones, with an F-measure score of 0.0076.

4.4 Discussion and Comparisons

Kleinberg [86] introduced three properties related to clustering algorithms, named scale invariance (requiring that the output of a clustering be invariant to uniform scaling of the distance function), consistency (requiring that if within-cluster distances are decreased and between-cluster distances are increased, then the output of a clustering function does not change) and richness (requiring that, by modifying the distance function, any partition of the underlying data set can be obtained). In his famous theorem, Kleinberg proved that, “independent of any particular algorithm, objective function, or generative data model”, there is no clustering function that simultaneously satisfies the three proposed properties. The Kleinberg theorem is referred to in the literature [147], [102], [137], [124] to justify that no clustering algorithm stands as a perfect function for a specific problem; instead, it is required to respect, as much as possible, the specified and desired properties of the associated problem.

The presented methodology, based on a set of 21 categorical features and a novel categorical clustering algorithm named CCTree, shows the capability of dividing spam emails into very homogeneous clusters, with a good accuracy in discerning different campaigns. The comparison with other categorical clustering algorithms showed the effectiveness and efficiency of CCTree when applied to the same set of features. We recall that CCTree is an unsupervised machine learning technique. Unsupervised learning algorithms do not provide results with the same accuracy as supervised learning techniques (i.e., trained classifiers). However, they have the advantage of not requiring any training procedure; thus, they can be applied also on datasets for which no previous knowledge is available. This often represents the reality in the analysis of spam emails, where it is necessary to cope with the large amount of emails daily produced and collected by honeypots.

Combining the analysis of 21 features, the proposed methodology becomes suitable to analyze almost any kind of spam email. This is one of its main advantages with respect to other


approaches, which mainly exploit one or two features to cluster spam emails into campaigns. These features are alternatively links [4], [92], keywords [22], [30], [46], [123], or images [150]. The analysis of these methodologies remains limited to those spam emails that effectively contain the attributed features. However, emails without links and/or images constitute a consistent percentage of spam emails.

In fact, from the analysis of the dataset used for internal evaluation, 4561 emails out of 10165 do not contain any link. Furthermore, only 810 emails contain images. To verify the clustering capability of these methodologies, we have implemented three programs to cluster the emails of the internal evaluation dataset on the basis of the contained URLs, the reported domains and the links of remote images. The emails without links and pictures have not been considered.
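A link-based grouping of this kind can be sketched as follows; this is a simplified reconstruction for illustration, not the actual program used in the experiments. Emails sharing the exact same set of URLs fall into one campaign, and emails without links are skipped, mirroring the limitation discussed above:

```python
from collections import defaultdict

def link_based_campaigns(emails):
    """Group emails into campaigns by their exact set of contained URLs.
    `emails` maps an email id to the set of URLs found in its body;
    emails without links are simply ignored."""
    groups = defaultdict(list)
    for email_id, urls in emails.items():
        if urls:
            groups[frozenset(urls)].append(email_id)
    return list(groups.values())
```

With dynamically generated links, almost no two emails share the same URL set, which explains why this kind of clustering tends to produce nearly one campaign per email.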

Table 4.8 – Campaigns on the February 2015 dataset from five clustering methodologies.

Cluster Methodology       Analyzed emails   Generated Campaigns
Link Based Clustering     4561              4449
Domain Based Clustering   4561              4547
Image Based Clustering    810               807
COBWEB (21 Features)      10165             1118
CCTree(0.001,10)          10165             392

Table 4.8 reports the number of campaigns generated by each methodology. It is worth noting that on a large dataset these clustering methodologies are highly inaccurate, generating a number of campaigns close to the number of analyzed elements; hence, almost every cluster has a single element.

For comparison purposes we reported the results of the most accurate configuration of CCTree and of COBWEB, which we recall being able to produce extremely homogeneous clusters, reporting a Silhouette value of 99%. We point out that comparing Silhouette values is meaningless here, due to the different sets of used features. Comparisons with other methodologies, such as the FP-Tree-based approaches [27], [44], which require the extraction and analysis of a different set of features, are left as future work.

4.5 Related Work

As discussed in Section 4.4, there are several works in the literature which cluster spam emails through pairwise comparisons of URLs, IP addresses resolved from URLs, domains and image links. The main weakness of these approaches is that they cannot be applied to emails not presenting the required features. Moreover, the pairwise comparison imposes a quadratic complexity, against the linear complexity of CCTree. Another clustering approach, exploiting pairwise comparisons of email subjects, is presented by Wei et al. [144]. The proposed methodology introduces a set of eleven features to cluster spam emails through two clustering algorithms. At first, an


agglomerative hierarchical algorithm is used to cluster the whole dataset based on pairwise subject comparison. Afterward, a connected-component graph algorithm is used to improve the performance.

The authors of [132] applied a methodology based on the k-means algorithm, named O-means clustering, which exploits twelve features extracted from each email. The O-means algorithm works under the hypothesis that the number of clusters is known beforehand, which is not always a realistic assumption. Furthermore, the authors use the Euclidean distance, which does not carry meaningful information for several of the features they apply. Differently from this approach, CCTree exploits a more general measure, i.e., Shannon entropy. Moreover, CCTree does not require the desired number of clusters as an input parameter.
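The entropy-based homogeneity measure used in CCTree can be illustrated on a single categorical attribute as follows; the function name and the example attribute values are hypothetical, and this is only a minimal sketch of the split criterion, not the thesis implementation:

```python
import math
from collections import Counter

def attribute_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute over a cluster:
    0 means the attribute is pure, higher values mean more heterogeneity."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Unlike a Euclidean distance over arbitrarily encoded categories, this measure depends only on the frequency distribution of the attribute values, which is what makes it meaningful for categorical features.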

Frequent Pattern Tree (FP-Tree) is another technique applied to detect spam campaigns in large datasets. The authors of [27], [44] extract a set of features from each spam message, and the FP-Tree is built based on the frequency of these features. The sensitive representation of both message layout and URL features causes two spam emails with small differences to be assigned to different campaigns. For this reason, the FP-Tree approach becomes prone to text obfuscation techniques [108], used to deceive anti-spam filters, and to emails with dynamically generated links. Our methodology, based on categorical features which do not consider text and link semantics, is more robust against these deceptions.

4.6 Conclusion

In this chapter, we proposed a methodology based on a categorical clustering algorithm, named CCTree, to cluster large amounts of spam emails into campaigns, grouping them by structural similarity. To this end, the set of features representing the message structure has been carefully chosen, and the intervals for each feature have been found through a discretization approach. The CCTree algorithm has been extensively tested on two datasets of spam emails, to measure both the capability of generating homogeneous clusters and the specificity in recognizing predefined groups of spam emails. A guideline for selecting the CCTree parameters is provided, whilst the determinacy of the selected parameters for similar datasets of the same size has been statistically verified. To this end, several datasets of spam emails, each containing a large amount of spam messages (almost 10k spam emails per set), gathered from the same untroubled honey-pot, have been clustered with the CCTree algorithm, applying the same stopping criteria in each experiment. Through tables and diagrams we showed that CCTree preserves the same trend when the datasets have (almost) the same size.

Considering the proven accuracy and efficiency, the proposed methodology may stand as a valuable tool to help authorities rapidly analyze large amounts of spam emails, with the purpose of finding and prosecuting spammers.

To the best of our knowledge, we are the first to show the effectiveness and efficiency of


the proposed algorithm for clustering spam emails into campaigns, whilst the techniques proposed in previous works were not evaluated.


Chapter 5

Labeling and Ranking Spam Campaigns

Fast analysis of correlated spam emails may be vital in the effort of finding and prosecuting spammers performing cybercrimes such as phishing and online frauds. In this chapter we present a self-learning framework to automatically divide and classify large amounts of spam emails into correlated labeled groups. Building on large datasets daily collected through honey-pots, the emails are first divided into homogeneous groups of similar messages (campaigns), each of which can be related to a specific spammer. Each campaign is then associated with a class which specifies the goal of the attacker, e.g., phishing, advertisement, etc. The proposed framework exploits a categorical clustering algorithm to group similar emails, and a classifier to subsequently label each email group. The main advantage of the proposed framework is that it can be used on large spam email datasets for which no prior knowledge is provided. The approach has been tested on more than 3200 real and recent spam emails, divided into more than 60 campaigns, reporting a classification accuracy of 97% on the classified data. Afterwards, a ranking approach is proposed to automatically rank spam campaigns according to investigator priorities ([128]).

5.1 Introduction

At the end of 2014, email is still one of the most common forms of communication on the Internet. Unfortunately, emails are also the main vector for sending unsolicited bulks of messages, generally for commercial purposes, commonly known as spam. The research community has investigated the problem for several years, proposing tools and methodologies to mitigate this issue. However, a definitive solution to the problem of spam emails has still to be found. Unfortunately, the problem of spam emails is not only related to unsolicited advertisement. Spam emails have become a vector to perform different kinds of cybercrimes, including phishing, cyber-frauds and the spreading of malware.


Motivation: Filtering spam emails at the user end is actually not enough to fight this kind of attack, which moves the effect of unsolicited spam emails from illicit behavior to real crime. Finding the spammers becomes important not only to tackle the problem of spam emails at its source, but also to legally prosecute those responsible for the cybercrimes brought by spam emails other than undesired advertisement. To identify spammers, the early analysis of huge amounts of messages, to find spam emails correlated with a specific spammer purpose, is vital. Several papers in the literature observed that forensic analysis, which plays a major role in finding and prosecuting spammers for cybercrimes, needs a proactive mechanism or tool able to perform a fast, multi-staged analysis of emails in a timely fashion [63], [44], [144], [40]. To this end, large amounts of spam emails, generally collected through honey-pots, should first be divided into similar groups, which can be related to the same spammer (i.e., spam campaigns). Afterward, each campaign should be assigned a label describing the purpose of the spammer. This goal-based labeling facilitates for investigators the analysis of spam campaigns, eventually directed toward a specific cybercrime. Moreover, labeling spam campaigns based on the goal of the spammer can help to rank them. However, this analysis generally appears to be a challenging task. In fact, considering the number of produced spam emails and their variance, spam email datasets are huge and very difficult to handle. In particular, human analysis is almost impossible, considering the amount of spam emails daily caught by a spam honey-pot [141], [144]. On the other hand, an automated and accurate analysis requires the usage of correctly trained computational intelligence tools, i.e., classifiers, whose training requires accurately chosen datasets, which present to the classifier a good description of the reality in which it will be employed.
Moreover, due to the high variance of spam emails, a valid training set may become obsolete in a few weeks, and a new up-to-date training set must be generated in a short period of time.

Though previous work largely improved the state of the art in the analysis of spam emails for forensic purposes, further improvement is still needed. In particular, previous works either focus on a specific cybercrime only, especially phishing [50], or exploit in the analysis a small set of features that is not effective in identifying some cybercrime emails. For example, the analysis of email text words [63] or link domains [44] is not effective in identifying emails used to distribute malware, which often do not contain text, or spam emails with dynamic links [16].

Contribution: In this chapter, we propose Digital Waste Sorter (DWS), a framework which exploits a self-learning, spammer-goal-based approach for spam email classification. The proposed approach aims at automatically classifying large amounts of raw unclassified spam emails, dividing them into campaigns and labeling each campaign with its spammer goal. To this end, we propose five class labels to group spammer goals into five macro-groups, namely Advertisement, Portal Redirection, Advanced Fee Fraud, Malware Distribution and Phishing. Moreover, a set of 21 categorical features representative of the email structure is proposed to perform a multi-feature analysis aimed at identifying emails related to a large range of cybercrimes.

DWS is based on the cooperation of unsupervised and supervised learning algorithms. Given a set of classes describing different spammer goals and a dataset of unclassified spam emails, the proposed approach at first automatically creates a valid training set for a classifier, exploiting a categorical clustering algorithm named CCTree (Categorical Clustering Tree). In more detail, this clustering algorithm divides the dataset into structurally similar groups of emails, named spam campaigns [26]. DWS is built on the results of CCTree, which is effective in dividing spam emails into homogeneous clusters. Afterward, significant spam campaigns, useful in the generation of the training set, are selected through similarity with a small set of known emails representative of each spam class. Hence, a classifier is trained using the selected campaigns as training set, and is used to classify the remaining unclassified emails of the dataset.
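The campaign-selection step just described can be sketched as follows. The per-feature mode as a campaign representative, the distance threshold and all names below are our own simplifying assumptions for illustration, not the actual DWS implementation:

```python
from collections import Counter

def hamming(x, y):
    """Normalized Hamming distance between two categorical feature vectors."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def campaign_representative(campaign):
    """Summarize a campaign by the most frequent value of each feature."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*campaign))

def label_campaigns(campaigns, seeds, threshold=0.3):
    """Assign each campaign the class of the nearest seed email, if it lies
    within `threshold`; the rest stay unlabeled and would later be classified
    by the trained classifier."""
    labeled, unlabeled = {}, []
    for cid, emails in campaigns.items():
        rep = campaign_representative(emails)
        cls, dist = min(((c, hamming(rep, s))
                         for c, seed_set in seeds.items() for s in seed_set),
                        key=lambda t: t[1])
        if dist <= threshold:
            labeled.setdefault(cls, []).append(cid)
        else:
            unlabeled.append(cid)
    return labeled, unlabeled
```

The labeled campaigns would then serve as the automatically generated training set, while the unlabeled ones are left for the trained classifier.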

To further meet the needs of forensic investigators, who have limited time and resources to perform email examinations [40], the DWS methodology does not require any prior knowledge of the dataset, except the desired classes (i.e., the spammer goals) and a small set of emails representative of each class. It is worth noting that this email set cannot be used to train the classifier. In fact, this set contains a small number of emails not belonging to the dataset to be classified, and it is thus not necessarily descriptive of the reality in which the classifier will operate.

In the following we describe the DWS framework in detail, explaining the process of division into campaigns, training set generation and campaign classification. The framework effectiveness has been evaluated against a set of 3200 recent raw spam emails extracted by a honey-pot; DWS reported a classification accuracy of 97.8% on this preliminary dataset. Furthermore, to justify the classifier selection, an analysis of the performances of different classifiers is presented. Finally, we propose five features, including the label of the campaigns discovered with DWS, to automatically rank a set of spam campaigns according to investigator priorities.

The rest of the chapter is organized as follows. Section 5.2 reports related work on email classification. Section 5.3 presents the DWS framework and work-flow in detail, and gives brief background information on the clustering algorithm. Section 5.4 presents the results of the analysis on a real dataset of spam emails, as well as a comparison of the classification results of four different classifiers. In Section 5.7 a technique is proposed to rank spam campaigns. Finally, Section 5.6 briefly concludes, reporting planned future extensions.

5.2 Related Work

In the literature, spam campaigns are usually labeled based on characteristic strings (keywords) representing individual campaign types, as in [44], [88] and [55]. These approaches are weak against spam emails which do not contain keywords or which use word obfuscation techniques. The authors of [106] label spam campaigns on the basis of URLs, phone numbers, Skype IDs, and Mail IDs used as contact information. This methodology is effective only against emails reporting contacts, which are only a subset of all the spam emails found in the wild. This means that the proposed approach fails to detect spam campaigns not containing the aforementioned contact information.

There are several approaches in the literature in which the spammer goal is considered. However, these approaches are mainly focused on detecting phishing emails, without considering other spammer purposes. Fette et al. [50] applied 10 email features to discern phishing emails from ham (good) emails. Bergholz et al. [17] propose a similar methodology with additional features to train a classifier to filter phishing emails. Almomani et al. [3] provide a survey on different techniques for filtering phishing emails, while Gansterer et al. [53] compare different machine learning algorithms for phishing detection; furthermore, the authors propose a technique which refines previous phishing filtering approaches. In that work, three types of messages, named ham, spam and phishing, are distinguished automatically. Nevertheless, the category of emails containing spam is not precisely characterized. In [34] a methodology to detect phishing emails based on both machine learning and heuristics is proposed. These approaches report accuracy ranging from 92% to 96%, where the classifiers have been trained on labeled datasets. On the contrary, DWS generates the training set on the fly, without requiring a pre-trained classifier. Notwithstanding, in the performed experiments DWS shows comparable accuracy.

5.3 Digital Waste Sorting

DWS is a Java-based framework which takes as input a dataset of unclassified spam emails. DWS divides the emails into campaigns by means of a hierarchical clustering algorithm, then labels each campaign through a classifier. The classifier is trained on the fly, on a training set generated by DWS directly from the unlabeled input dataset, exploiting the knowledge produced by the clustering algorithm.

This section describes the DWS framework and methodology in detail. First we present the classes used to label each spam campaign. Then, we present the feature extraction process from raw emails, discussing the relevance of the features in describing structural elements of an email and their relation to each spam class. The framework is then presented, briefly introducing the clustering algorithm and the methodology for the generation of the training set. Finally the classification process is presented.

5.3.1 Definition of Classes

As anticipated, spam emails can be sent with different intentions, spanning from common advertisement to vectors of different cybercrimes. We argue that spam emails can be divided into five well-known macro-groups which represent the main targets of spammers, and can thus be used to label spam campaigns.


Figure 5.1 – Advertisement

Advertisement : The advertisement class contains those emails whose target is convincing a user to buy a specific product [84]. Advertisement emails embody the most typical idea of spam messages, advertising any kind of product which could be of interest to companies or private users. Generally these emails only constitute a hindrance to the users, who have to spend time removing them from the inbox. Notwithstanding, some campaigns provide revenues of up to 1M US dollars per month to spammers [84]. The main requirements for a commercial email to be legal according to the Federal Trade Commission [49] are that it uses no deceptive subject lines, provides correct and complete header information and the real physical location of the business, offers an opt-out choice, and honors opt-out requests within 10 business days. In the present work, we consider as advertisement emails both those which comply with the legal requirements and those which do not, given that their purpose is clearly advertising a product.

Spam first came under consideration as a business in April 1994. Two lawyers from Phoenix, named Canter and Siegel, hired a programmer to distribute their “Green Card Lottery Final One!” message to as many newsgroups as possible. The interesting point in this act was that they did not hide the fact that they were the spammers; instead, they were even proud of it. Canter and Siegel went on to write a book with the title “How to Make a Fortune on the Information Superhighway: Everyone's Guerrilla Guide to Marketing on the Internet and Other On-Line Services”. Moreover, they planned to open a consulting company to teach others, and help them, to post their own advertisements, which never took off. However, still in 2015, spam email remains one of the most popular tools for advertising goods.

Figure 5.1 shows a typical sample of an advertisement spam email, containing several photos and prices; clicking on each photo directs the user to the spammer website, convincing him to buy a product or service.

Portal Redirection : Portal redirection spam emails are the enablers of an evolved advertisement methodology. These spam emails are characterized by a minimal structure generally reporting one or more links to one or more websites. Once the user clicks on a link, she is redirected several times to different pages whose addresses are dynamically generated. The final target page is mostly an advertisement portal with several links divided by categories, generally related to common user needs (e.g., medical insurance). This strategy is useful in reducing the legal responsibility for spam emails of the companies which are advertising a product. The rationale is that the advertised company cannot be sued because another website, i.e. the portal, links to it. As an example, the opt-out clause of advertisement emails [83] does not apply. Moreover, the multi-redirection strategy with dynamic links makes it difficult to track the responsible websites. It is worth noting that the strategy of portal redirection emails is also used to redirect users to websites with the intention of defrauding them, to distribute malicious code, and to increase the visits of a web page.

The first redirect service 1, in 1999, took advantage of top-level domains (TLDs) like ".to" (Tonga), ".at" (Austria) and ".is" (Iceland). The aim was to make memorable URLs. The first redirect service dates back to V3.com, which redirected 4 million users at its peak in 2000. The success of V3.com resulted from the fact that it offered a wide variety of short memorable domains, including “r.im”, “go.to”, “i.am”, “come.to” and “start.at”. When the sales price of top-level domains started falling from $70.00 per year to less than $10.00, the use of redirection services declined.

Spamdexing is an attack whose purpose is to manipulate the indexing of a web page. The goal of a web designer is to create a page that will obtain favorable rankings in the search engines, and designers create their pages on the basis of the standards they believe will succeed. Spam emails are a good place to embed the links for which a higher visit score is desired. To this end, the portal redirection technique can be applied to redirect users to several desired web pages.

Figure 5.2 demonstrates a typical form of portal redirection spam email, containing several hyperlinks hidden under a luring text, deceiving the user into clicking.

Advanced Fee Fraud : An advanced fee fraud or confidence trick spam email (synonyms include confidence scheme or scam) attempts to defraud a person after first gaining their confidence, used in the classical sense of trust [71]. Confidence trick spam exploits social engineering to trick the user into paying, by her own will, a certain amount of money to the attacker. Scammers may use several techniques to deceive the user into paying money, generally exploiting sentimental relations or promising a large amount of money in return. Confidence trick emails are mostly written as a friendly long text, to draw the victim into the interaction. These kinds of emails usually do not redirect the users to other web pages, and mainly contain an email address.

1. http://news.bbc.co.uk/2/hi/technology/6991719.stm

Figure 5.2 – Portal

Figure 5.3 – Fraud

419 scams [61] are one of the most common types of confidence tricks, dating back to the late 18th century. This confidence trick typically promises the victim a significant share of a large sum of money, which the spammer requires a small in-advance payment to obtain. If a victim provides the payment, the fraudster either asks further fees from the victim, or simply disappears. In these cases, the email's subject line often contains something like “From the desk of barrister [Name]”, “Your assistance is needed”, and so on. When the victim's confidence has been achieved, the scammer then introduces a problem that prevents the deal from occurring as planned, such as “To transmit the money, we need to bribe a bank official. Could you help us with a loan?” or similar. Although it is difficult to evaluate the success rate of fraud spam campaigns, one individual estimated that he sent 500 emails per day and received about seven replies, mentioning that when he received a reply, he was 70 percent certain that he would get the money 2.

The lottery scam is another well-known example of confidence trick spam. It contains a fake notice of a lottery win. The winner is usually asked to send sensitive information such as name, residential address, occupation/position, lottery number, etc. to a free email address. Then the spammer informs the victim that releasing the money requires some small fee (insurance, registration, or shipping). When the victim sends the fee, the scammer asks for another fee to be paid 3. In the UK, lottery scams became such a big problem that many legitimate lottery sites dedicated pages to the subject to address the issue 4.

Figure 5.3 represents a typical confidence trick spam email, which through a long, friendly text tries to earn the victim's confidence.

Malware : Emails are an important vector for spreading malicious software, or malware. Generally the malware is sent as an email attachment, while the email structure is very simple, with a small text which encourages the reader to open the attachment, or no text at all 5. Once opened, the malware infects the user device, showing different possible malicious behaviors. Commonly, the malware transforms the victim's device into a bot which is used to send spam messages to other prospective victims, who can be chosen by the spammer or even be part of the victim's contact list. To this category belong Command and Control malware and worms [104]. Often the malicious file is camouflaged, inserted in a zip file or given a modified extension, which allows it to deceive the basic anti-virus controls implemented by some spam filters.

Figure 5.4 shows a typical representation of a malware spam email, which mostly contains an attachment, convincing the user with luring sentences to open it. Notice that it is also possible for a malware campaign to be designed in the format of portal redirection. Here, when we talk about a malware campaign, we mean that from the layout of the spam messages we are almost sure that the campaign has been designed for malware distribution.

One very well-known malware spam campaign is titled “Melissa.A”, a virus with a woman's name, appearing on 26 March 1999 in the United States. Melissa.A came with the message “Here is the document you asked me for . . . do not show it to anyone”. The virus came through email including an MS Word attachment. Once opened, it emailed itself to the first 50 people in the MS Outlook contact list. In a few days, it became one of the most important cases of massive infection in history, causing damage of more than 80 million dollars to American companies, such that companies like Microsoft, Intel and Lucent Technologies blocked their Internet connections to protect themselves from Melissa.A. The virus infected up to 20% of computers worldwide.

2. www.articles.latimes.com
3. www.snopes.com
4. www.lottery.co.uk/scams
5. http://www.symantec.com

Figure 5.4 – Malware

The ILOVEYOU virus is another malware spam campaign, which many consider to be the most damaging virus ever written. It distributed itself by email in 2000 through an attachment to the message. When opened, it loaded itself into memory, infecting executable files. Once a user received and opened the email containing the attachment “LOVE-LETTER-FOR-YOU.txt.vbs”, the computer became automatically infected. It then infected executable files, image files, audio files, etc. Afterwards, it sent itself to others by looking up the addresses contained in the MS Outlook contact list. It caused billions of dollars in damages.

CryptoLocker 6 is a newer malware campaign, a ransomware trojan which targeted computers running Microsoft Windows, first distributed on the Internet on 5 September 2013. When activated, the malware encrypts several types of files stored on local and mounted network drives using RSA public-key cryptography, with the private key stored only on the malware's control server. Afterwards, the malware shows a message which offers to decrypt the data if a payment (through either Bitcoin or a pre-paid cash voucher) is made by a stated deadline, and threatens the victim with deletion of the private key if the deadline passes. Figure 5.5 shows the increase of crypto ransomware from 2013 to 2014 7.

6. www.arstechnica.com
7. www.symantec.com

Figure 5.5 – Crypto Ransomware volume

Phishing : Phishing emails attempt to redirect users to websites designed to illegally obtain credentials or financial data such as usernames, passwords, and credit card details [3]. Generally, these emails pretend to be sent by a banking organization, or to come from a service accessible through username and password, e.g. social networks, instant messaging, etc., reporting fake security issues that require the user to confirm her data to access the service again. To this end, phishing emails are mostly very well presented, with a well-organized structure, even reporting contact information such as phone numbers and email addresses. The representative structure of the phishing emails we consider in this research contains a short, well-written text providing the victim with some important news. Mostly there is one link, which directs the user to a very well designed fake website of a bank, which directly asks the victim to provide her credit card information.

On 26 January 2004, the U.S. Federal Trade Commission filed the first lawsuit against a suspected phisher. It started with a Californian teenager who created a webpage designed to look like the America Online website, and used it to steal credit card information 8. Other countries have followed this lead by tracing and arresting phishers. A phishing kingpin, Valdir Paulo de Almeida, was arrested in Brazil for leading one of the largest phishing crime rings, which in two years stole between $18 million and $37 million 9. Phishing, still in 2015, remains one of the most dangerous and effective kinds of spam email, requiring extensive efforts to fight against.

Figure 5.6 demonstrates a typical sample of a phishing spam email, mostly well designed to seem as real as possible with respect to the organization it pretends to come from.

8. edition.cnn.com/2003/techinternet/07/21/phishing.scam
9. www.channelregister.co.uk

Figure 5.6 – Phishing

5.3.2 Feature Extraction

DWS parses raw spam emails (eml files), extracting a set of 21 categorical features and building a numerical vector readable by clustering and classification algorithms. The extracted features are reported in Table 5.1. It is worth noticing that Table 5.1 and Table 4.1 are identical; the table is reported again here to relate the spammer goals to the set of features, as follows.

The “number of recipients” in the To and Cc fields of the email differentiates between emails which should look strictly personal, e.g. communications from a bank (phishing), and those that pretend to be sent to several recipients, such as some kinds of fraud or advertisement. The structure of links in the email text gives several pieces of information useful in determining the email goal. Portal redirection and advertisement emails generally show a high “Number of links”, in the first case to redirect the user to different portal websites, in the second to redirect the user to the website where she can buy the products. Generally, fraud emails do not report links except for “IP based links”. These links are expressed through IP addresses, without reporting domain names, to reduce the likelihood of being tracked or to make the email text, generally discussing a secret money transaction, look more legitimate. The “number of domains in links” represents the number of different domains globally found in all the links in the email text. Phishing and advertisement emails generally have just a single domain, respectively of the website where to buy the advertised product and of the website of the authority which the message pretends to be sent from. On the other hand, portal redirection emails may contain several domains to redirect the reader to different portal websites. Moreover, links in portal redirection emails generally have a high “average number of dots in links” (i.e. sub-domains) and, being dynamically generated, are likely to contain hexadecimal or non-ASCII characters. Non-ASCII characters in the links are also typical of some advertisement emails redirecting to foreign websites. It is worth noting that all these link-based features consider the real destination address, not the clickable text shown to the user. If the clickable text (hyper-link) shows an address different from the destination address (“click here”-like text is not considered), the link is considered mismatching and counted through the feature “mismatching links”. Phishing and portal redirection emails make extensive use of mismatching links to deceive the user.

Table 5.1 – Features extracted from each email.

RecipientNumber            Number of recipient addresses.
NumberOfLinks              Total links in email text.
NumberOfIPBasedLinks       Links shown as an IP address.
NumberOfMismatchingLinks   Links with a text different from the real link.
NumberOfDomainsInLinks     Number of domains in links.
AvgDotsPerLink             Average number of dots in links in text.
NumberOfLinksWithAt        Number of links containing “@”.
NumberOfLinksWithHex       Number of links containing hex chars.
NumberOfNonAsciiLinks      Number of links with non-ASCII chars.
IsHtml                     True if the mail contains HTML tags.
EmailSize                  The email size, including attachments.
Language                   Email language.
AttachmentNumber           Number of attachments.
AttachmentSize             Total size of email attachments.
AttachmentType             File type of the biggest attachment.
WordsInSubject             Number of words in subject.
CharsInSubject             Number of chars in subject.
ReOrFwdInSubject           True if subject contains “Re” or “Fwd”.
SubjectLanguage            Language of the subject.
NonAsciiCharsInSubject     Number of non-ASCII chars in subject.
ImagesNumber               Number of images in the email text.
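To make the link-based features concrete, the sketch below extracts a handful of the Table 5.1 features from a raw HTML email using only Python's standard library. It is an illustrative approximation of the (Java-based) DWS parser, not the thesis implementation; the function and constant names are ours.

```python
import re
from email import message_from_string
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML email body."""
    def __init__(self):
        super().__init__()
        self.links = []          # list of [href, clickable text]
        self._in_a = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append([dict(attrs).get("href", ""), ""])
            self._in_a = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_a = False

    def handle_data(self, data):
        if self._in_a and self.links:
            self.links[-1][1] += data

# Links whose host part is a dotted-quad IP address ("IP based links").
IP_RE = re.compile(r"^https?://\d{1,3}(?:\.\d{1,3}){3}")

def link_features(raw_email: str) -> dict:
    """Extract a few Table 5.1 link features from a raw (single-part) email."""
    body = message_from_string(raw_email).get_payload()
    parser = LinkCollector()
    parser.feed(body)
    hrefs = [h for h, _ in parser.links]
    domains = {re.sub(r"^https?://", "", h).split("/")[0] for h in hrefs}
    # A link is mismatching when its clickable text is itself an address
    # different from the real destination ("click here" text is ignored).
    mismatching = sum(1 for h, t in parser.links
                      if t.strip().startswith("http") and t.strip() != h)
    return {
        "NumberOfLinks": len(hrefs),
        "NumberOfIPBasedLinks": sum(1 for h in hrefs if IP_RE.match(h)),
        "NumberOfMismatchingLinks": mismatching,
        "NumberOfDomainsInLinks": len(domains),
        "AvgDotsPerLink": (sum(h.count(".") for h in hrefs) / len(hrefs))
                          if hrefs else 0,
        "IsHtml": "<html" in body.lower(),   # crude heuristic, assumption
    }
```

The remaining features (attachments, subject, language) would be extracted analogously from the MIME parts and headers.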

Advertisement and phishing emails may appear like a web page. In this case, the email contains HTML tags. On the other hand, fraud, malware and portal emails are rarely presented in HTML format. The size of an email is another important structural feature. Confidence trick and portal redirection emails are generally quite small in size, considering they are raw text. Advertisement, malware and some kinds of phishing emails generally have a more complex structure, including images and/or attachments, which makes the message size grow noticeably. “Attachment Number”, “Attachment Size” and “Attachment Type” are structural features mainly used to distinguish between the attachments of malware emails and those of advertisement and phishing emails, which attach images to the email for a correct visualization. The “Number of Images” in an email determines the global look of the message. Images are typical of some advertisement and phishing emails. Finally, three features are used for the analysis of the subject. For example, some advertisement emails use several one-character words or non-ASCII characters to deceive typical keyword-based spam detection techniques [123]. It is worth noting that non-ASCII characters are rarely used in phishing emails, to make them look more legitimate. Moreover, some fraud and phishing emails use a deceiving mail subject with the “Re:” or “Fwd:” keyword to look like part of a conversation triggered by the victim. Furthermore, some fraud emails are characterized by a difference between the email “Language” and the “Subject Language”. Many scam emails are, in fact, translated through automatic software which ignores the subject, causing this language duality.

Table 5.2 – Feature vectors of a spam email for each class.

Class      NumAttach  TypeAttach  NumLinks  NumImages  NumDomains  EmailSize  SubjLang  CharsInSubj  Lang
Advert.    0          0           11        12         2           14         10        3            10
Portal     0          0           10        0          1           10         1         3            1
Fraud      0          0           0         0          0           10         1         1            1
Malware    1          5           0         0          0           21         1         2            1
Phishing   0          0           2         0          2           9          1         3            1

For further insight, Table 5.2 shows the vectors of some selected features extracted from the five emails of Figures 5.1, 5.2, 5.3, 5.6 and 5.4.
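The integer values in Table 5.2 suggest that numeric features are mapped to categorical bin indices before clustering, since CCTree operates on categorical attributes. The bin boundaries are not specified in this section, so the edges below are purely illustrative assumptions, not the thesis's actual boundaries.

```python
# Illustrative discretization of numeric email features into categorical
# bin indices. ASSUMED_BINS is a made-up mapping for demonstration only.
ASSUMED_BINS = {
    "NumberOfLinks": [0, 1, 3, 5, 10],
    "EmailSize":     [0, 1024, 10240, 102400],   # bytes
    "ImagesNumber":  [0, 1, 5, 10],
}

def to_bin(feature: str, value: float) -> int:
    """Return the index of the largest bin edge that is <= value."""
    idx = 0
    for i, edge in enumerate(ASSUMED_BINS[feature]):
        if value >= edge:
            idx = i
    return idx
```

With such a mapping, an email with 4 links falls in bin 2 of "NumberOfLinks", and all features become small categorical values comparable to those in Table 5.2.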

5.3.3 DWS Classification Workflow

After the email features have been extracted, the resulting feature vectors are given as input to the DWS classification workflow. This process aims at dividing the unclassified spam emails into campaigns and labeling them through a classifier trained on the fly. The classifier can then be applied to label new spam emails. The workflow of the proposed approach is depicted in Figure 5.7.

The main part of the workflow is aimed at generating a valid training set from the dataset of unclassified emails, applying a hierarchical clustering algorithm to divide the emails into campaigns (step 1 in Figure 5.7). The chosen algorithm, named Categorical Clustering Tree (CCTree), generates a tree-like structure (step 2) which is exploited to associate a campaign to each email coming from a small dataset of labeled emails. The campaign receives the label of the email associated to it (step 3). Thus, this set of campaigns is used as training set for a classifier (step 4), successively used to label all the remaining campaigns (steps 5 and 6).

The framework is based on a clustering algorithm (CCTree) and a classifier. As discussed, classifiers are generally more accurate than clustering algorithms, due to the supervised learning approach. However, the major drawback of classifiers is that a valid training set, with enough elements and representative of the reality, is not always available. We argue that it is possible to create such a training set by exploiting the CCTree algorithm and a small set of classified emails (C). The C set contains emails representative of each class; however, the number of its elements is not enough to constitute a valid training set for a classifier.
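For concreteness, the kind of classifier trained on the campaign-derived training set can be sketched as a minimal categorical naive Bayes with Laplace smoothing. DWS actually uses WEKA classifiers (Section 5.4); the Python class below is only an illustrative stand-in, and its name is ours.

```python
import math
from collections import Counter, defaultdict

class CategoricalNB:
    """Minimal categorical naive Bayes with Laplace smoothing.

    An illustrative stand-in for the WEKA classifiers used by DWS, only
    to make the train-on-campaigns / classify-the-rest loop concrete.
    """

    def fit(self, X, y):
        # X: list of rows of categorical values; y: list of class labels.
        self.classes = sorted(set(y))
        self.prior = {c: math.log(y.count(c) / len(y)) for c in self.classes}
        self.counts = defaultdict(Counter)   # (class, feature) -> value counts
        self.n_feats = len(X[0])
        for row, c in zip(X, y):
            for j, v in enumerate(row):
                self.counts[(c, j)][v] += 1
        # Observed value set per feature, for the smoothing denominator.
        self.values = [set(r[j] for r in X) for j in range(self.n_feats)]
        return self

    def predict(self, row):
        def loglik(c):
            s = self.prior[c]
            for j, v in enumerate(row):
                cnt = self.counts[(c, j)]
                # Laplace smoothing: unseen values get a non-zero probability.
                s += math.log((cnt[v] + 1) /
                              (sum(cnt.values()) + len(self.values[j])))
            return s
        return max(self.classes, key=loglik)
```

Trained on the emails of the labeled campaigns, such a classifier can then label the campaigns that no email of C reached.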


Figure 5.7 – DWS Workflow.

The CCTree algorithm, starting from the dataset D, generates a decision-tree-like structure whose leaves are the final, unlabeled clusters. Following the CCTree structure it is possible to place the emails of the set C in the unlabeled clusters of the set D, to find similarly structured emails. In fact, in the problem of clustering spam emails, each cluster represents a set of homogeneous, similar spam emails, i.e. a spam campaign. Thus, for the purpose of goal-based labeling, all emails belonging to the same cluster receive the same label. Finally, the emails of these homogeneous clusters can be used as training set for the supervised learning classifier. After the classifier has been trained, it is used to classify the remaining leaves of the CCTree that were not reached by any email of the set C.
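The placement of the emails of C into the CCTree leaves can be sketched as walking a decision tree whose internal nodes each split on one categorical attribute. The node layout below is an assumption made for illustration; the actual CCTree construction is defined elsewhere in the thesis.

```python
# Minimal sketch of routing a labeled email down a CCTree-like structure.
# Assumed layout: an internal node stores the splitting attribute and
# children keyed by attribute value; a leaf stores a campaign id.

class Node:
    def __init__(self, attribute=None, campaign=None):
        self.attribute = attribute   # splitting attribute (internal nodes)
        self.children = {}           # attribute value -> child Node
        self.campaign = campaign     # campaign id (leaves only)

def route(tree: Node, email: dict):
    """Follow the CCTree splits; return the campaign the email falls into,
    or None if some attribute value has no branch (handled by the
    purity-based fallback of Phase 2)."""
    node = tree
    while node.campaign is None:
        value = email[node.attribute]
        if value not in node.children:
            return None
        node = node.children[value]
    return node.campaign
```

Each email c_i of C that reaches a leaf transfers its class label to that leaf's campaign.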

Figure 5.7 schematically depicts the typical operative workflow of the proposed framework. In the following, the six steps of the DWS workflow are described in detail.

Phase 1 : Clustering Spam Emails into Campaigns

The first step performed by the DWS framework is to divide large amounts of unclassified spam emails (constituting the set D) into smaller groups of similar messages (steps 1 and 2 in Figure 5.7). Emails are clustered by structural similarity exploiting the CCTree algorithm.


Figure 5.8 – Insert new instance X in a CCTree

Phase 2 : Training Set Generation

In order to label the campaigns, it is necessary to train a classifier to recognize emails coming from the five predefined spam classes (steps 3 and 4 in Figure 5.7). To this end, it is necessary to provide the classifier with a good training set, which has to be representative of the reality in which the classifier will operate. For this reason the training set is extracted from the unclassified email dataset D itself. More specifically, the CCTree structure generated in the previous step is exploited to label a small number of the generated spam campaigns, with the use of a small set of labeled emails C. This set contains a small number of manually selected spam emails, equally distributed over the five classes, all structurally different. These spam emails do not come from the D set. The emails in the C dataset have to be accurately chosen on the basis of the emails the investigators are interested in. For example, Italian police investigators interested in following a phishing case should put in the C dataset some emails with Italian text and bank names. After extracting the feature values from the emails in C, they are fed one by one to the CCTree generated on D. Following the CCTree structure, each email ci is eventually inserted in a campaign Cj (Figure 5.8). Thus the campaign Cj is labeled with the class of ci and all its emails are added to the training set.

If the same spam campaign is reached by two or more emails of different classes, the campaign is discarded and the emails are re-evaluated to be sent to other campaigns. It is worth noting that such an event is unlikely, due to the high homogeneity of the clusters generated through CCTree. Furthermore, in the event that an email in C does not reach any campaign, i.e. a specific attribute value of the email is not present in the CCTree, the email is inserted in the most similar campaign. To this end, the node purity of each campaign is calculated before and after the insertion of the email ci. The email is thus assigned to the campaign in which the difference between the two purities, weighted by the number of elements, is smallest.
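A possible reading of this fallback, assuming a Shannon-entropy-based node purity in the spirit of CCTree, is sketched below; the exact purity definition and weighting used in the thesis may differ.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def campaign_purity(campaign):
    """Average per-attribute entropy of a campaign (list of feature dicts).
    Lower means purer; an illustrative proxy for the CCTree node purity."""
    attrs = campaign[0].keys()
    return sum(entropy([e[a] for e in campaign]) for a in attrs) / len(attrs)

def assign_to_most_similar(email, campaigns):
    """Assign `email` to the campaign whose purity, weighted by its size
    after insertion, degrades the least (Phase 2 fallback, our reading)."""
    def cost(camp):
        before = campaign_purity(camp)
        after = campaign_purity(camp + [email])
        return (after - before) * (len(camp) + 1)
    return min(range(len(campaigns)), key=lambda i: cost(campaigns[i]))
```

An email identical to the members of a perfectly pure campaign leaves its entropy unchanged, so that campaign wins with cost zero.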

Phase 3 : Labeling Spam Campaigns


Feeding the training set to the classifier, we are able to classify all remaining campaigns generated by the CCTree (steps 5 and 6 in Figure 5.7). To this end, each campaign resulting from the CCTree is given to the classifier, which labels each email of the received campaign on the basis of the spammer purpose. Under two conditions DWS considers a spam campaign as non-classified.

Firstly, it is possible that emails belonging to the same campaign receive different labels, e.g. phishing and portal redirection. In such a case, calling “majority class” the label with the most emails in the cluster, the campaign is considered non-classified if the emails of the majority class amount to less than 90% of all the emails in the campaign.

The second condition is instead related to the prediction error reported by the classifier on each element of a campaign. The prediction error is computed as 1 − P(ei ∈ Ωj), where P(ei ∈ Ωj) is the probability that the element ei belongs to the class Ωj, i.e. the label assigned to the element ei. The DWS framework considers a campaign as non-classified if the average prediction error is more than 30%. If the non-classified campaigns and/or elements are a consistent percentage, it is possible to restart the classification process, running the CCTree algorithm with tighter criteria for node purity.
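The two rejection rules can be summarized in a few lines. The thresholds follow the text (90% majority share, 30% average prediction error); the function name is ours.

```python
def campaign_status(labels, pred_errors,
                    majority_threshold=0.90, error_threshold=0.30):
    """Apply the two DWS rejection rules to a classified campaign.

    labels: class predicted for each email in the campaign
    pred_errors: per-email prediction error, 1 - P(e_i in Omega_j)
    Returns the campaign label, or "non-classified".
    """
    majority = max(set(labels), key=labels.count)
    # Rule 1: the majority class must cover at least 90% of the emails.
    if labels.count(majority) / len(labels) < majority_threshold:
        return "non-classified"
    # Rule 2: the average prediction error must not exceed 30%.
    if sum(pred_errors) / len(pred_errors) > error_threshold:
        return "non-classified"
    return majority
```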

5.4 Results

This section presents the experimental results of the DWS framework. First we discuss the classifier selection process, exploiting two small datasets of manually labeled spam emails. Afterward, we present the results for a real use case of the DWS framework on a recent dataset of spam emails.

5.4.1 Classifier Selection

In this first set of experiments we compare the performance of three different classifiers. To this end, two sets of real spam emails are used as training and test sets. These two datasets are extracted from emails collected by the untroubled honeypot 10 in January and February 2015. The emails have been manually analyzed and labeled for standard supervised learning classification and performance evaluation. The manual analysis and labeling process has been performed by rigorously analyzing text and images, and following the links in each email. Only the emails for which the discovered class was certain have been inserted into the datasets. For a spam email, the label is certain if it matches the label description given in Section 5.3.1 and the label is verified through manual analysis. For example, portal redirection emails are certainly labeled if the links really redirect to a portal website. The first dataset, used as training set, is made of 160 spam emails; the second one, used as test set, is made of 80 emails.

10. http://untroubled.org/spam


Experiments have been run on all the classifiers offered by the WEKA library able to classify categorical data. For the sake of brevity and clarity we only report the classifier with the best results from each classifier group. More specifically, the chosen classifiers are K-Star from the Lazy group, Random Forest from the Tree group and Bayes Network from the Bayes group. Among these three classifiers, the best one has been used by the DWS framework in its operative phase.

Dataset Dimensioning

The process of manual analysis and labeling is time consuming. However, it is necessary to have a dataset that is well balanced, without duplicates and representative of the five classes, in order to correctly assess classifier performance. Given the complexity of the manual analysis procedure, it is not possible to choose training and test sets of extremely large dimension. Thus, standard dimensioning techniques have been used for both the training and the test set. A general rule to assess the minimum size for a training set is to dimension it as six times the number of used features [140]. It is worth noting that the training set of 160 elements already matches this condition (6 × 21 = 126 < 160). However, in a multi-class problem, the dimension of the data should also yield good results in terms of sensitivity and specificity, i.e. true positive rate (TPR) and (1 − false positive rate (FPR)) respectively, when K-fold validation is applied [14]. This must be done keeping the relative frequencies of data in the various classes balanced. As shown in the following, the provided dataset returns, for K-fold validation, a value of the Receiver Operating Characteristic's Area Under Curve (ROC-AUC, or AUC for short) higher than 90% for all tested classifiers.

Concerning the test set, what matters is a null intersection with the training set and balanced relative frequencies among the various classes. In [14], the minimum size for a test set to provide meaningful results in a five-class classification problem is estimated to be 75, which is smaller than the provided test set of 80 spam emails.

Classification Results

We now report the classification results for the three tested classifiers on the two aforementioned datasets. The first set has been used as the training set for the classifiers. According to the methodology in [14], a first performance evaluation has been done through the K-fold (K = 5) validation method, classifying the data K times, each time using (K − 1)/K of the dataset as the training set and the remaining elements as the test set. The evaluation indexes used are the True Positive Rate (TPR), False Positive Rate (FPR) and Receiver Operating Characteristic Area Under Curve (ROC-AUC, or simply AUC). The AUC is defined in the interval [0, 1] and measures the performance of a classifier under the variation of a threshold parameter T, proper to the classifier itself, according to the following formula :

AUC = ∫_{−∞}^{+∞} TPR(T) · FPR′(T) dT


Table 5.3 – Classification results evaluated with K-fold validation on training set.

Algorithm             K-star   RandomForest   BayesNet
True Positive Rate    0.956    0.937          0.95
False Positive Rate   0.01     0.019          0.013
Area Under Curve      0.996    0.992          0.996

where FPR′ = 1 − FPR. When the value of AUC is equal to 1, the classifier is considered “perfect” for the classification problem.
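The K-fold procedure and the AUC integral above can be sketched in a few lines. This is an illustrative, self-contained version (the thesis itself uses WEKA); the integral is replaced by the usual trapezoidal approximation over discrete ROC points, and the function names are ours:

```python
def kfold_splits(items, k=5):
    # K-fold validation: each fold is held out once as the test set while
    # the remaining (K-1)/K of the data serves as the training set.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def auc_trapezoid(fpr, tpr):
    # Discrete counterpart of AUC = integral of TPR(T) * FPR'(T) dT:
    # the area under the TPR-vs-FPR curve, by the trapezoidal rule.
    points = sorted(zip(fpr, tpr))
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Perfect classifier (TPR reaches 1 at FPR 0) and random classifier:
print(auc_trapezoid([0.0, 0.0, 1.0], [0.0, 1.0, 1.0]))  # -> 1.0
print(auc_trapezoid([0.0, 0.5, 1.0], [0.0, 0.5, 1.0]))  # -> 0.5
```

The closer the ROC curve hugs the top-left corner, the closer this area gets to 1, matching the “perfect” case described above.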

Table 5.3 reports the TPR, FPR and AUC of the three classifiers over the five classes for the K-fold test on the first dataset (160 spam emails). As shown, all the classifiers return an accuracy higher than 90%.

Table 5.4 – Classification results evaluated on test set.

Algorithm        K-star                  RandomForest            BayesNet
Measure          TPR     FPR     AUC     TPR     FPR     AUC     TPR     FPR     AUC
Advertisement    1.000   0.031   0.998   1.000   0.000   1.000   1.000   0.031   0.967
Portal           0.786   0.000   0.996   0.786   0.016   0.985   0.929   0.000   0.998
Fraud            1.000   0.016   0.992   1.000   0.016   0.951   1.000   0.016   0.928
Malware          0.938   0.016   0.995   0.938   0.016   0.908   0.938   0.016   0.957
Phishing         0.947   0.017   0.977   0.947   0.051   0.963   0.842   0.017   0.907
Average          0.9342  0.016   0.9916  0.9342  0.019   0.9614  0.9418  0.016   0.9514

Afterward, the whole first dataset has been used to train the three classifiers, whilst the second dataset has been used as the test set. Table 5.4 reports the detailed classification results, where the classifiers are trained with the training set (160 spam emails) and evaluated with the test set (80 spam emails). The results are reported for the five classes with TPR, FPR and AUC.

For further insight, we report in Figures 5.9, 5.10, 5.11, 5.12, and 5.13 the comparison of the ROC curves of the three classifiers for the five classes, measured on the test set.

It is worth noting that in all cases the area under the ROC curve is close to 1; hence, in general, the classifiers show good performance on the test set for each class.

As can be observed in Table 5.3, on average the K-star and Bayes Net classifiers give slightly better K-fold results. However, the K-star classifier yields the best average AUC when evaluated on the test set (Table 5.4). Therefore, K-star is the classifier we implement in the DWS framework.


Figure 5.9 – ROC curve / Advertisement

Figure 5.10 – ROC curve / Portal Redirection


Figure 5.11 – ROC curve / Fraud

Figure 5.12 – ROC curve / Malware


Figure 5.13 – ROC curve / Phishing

5.4.2 DWS Application

The second set of experiments aims at assessing the capability of the framework to cluster and label large amounts of spam emails. To this end, the DWS framework has been tested on a set of 3230 recent spam emails. The spam emails have been extracted from the collection of the honeypot 11, related to the first week of March 2015. The emails have been manually analyzed and labeled for performance analysis.

Phase 1 : Clustering with CCTree

In the first step, CCTree has been used to divide the emails into campaigns. The CCTree parameters have been chosen by finding the optimal values for the number of generated clusters and homogeneity, using the knee method described in Chapter 4. Applying CCTree, 135 clusters have been generated, of which 73 contain only one element. Clusters with a single element have not been considered: these emails are, in fact, outliers which do not belong to any spam campaign. The remaining 3149 emails, divided into 62 clusters, have been used for the following steps.
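The singleton-filtering step above can be sketched as follows (the `drop_singletons` helper and the cluster contents are illustrative, not part of the original implementation):

```python
def drop_singletons(clusters):
    # CCTree leaves with a single element are treated as outliers that do
    # not belong to any spam campaign and are removed before labeling.
    kept = {cid: emails for cid, emails in clusters.items() if len(emails) > 1}
    outliers = [e for emails in clusters.values() if len(emails) == 1
                for e in emails]
    return kept, outliers

# Toy clustering (cluster ids and contents are invented for the example):
clusters = {"c1": ["a", "b", "c"], "c2": ["d"], "c3": ["e", "f"]}
kept, outliers = drop_singletons(clusters)
print(sorted(kept), outliers)  # -> ['c1', 'c3'] ['d']
```

In the experiment this step reduced the 135 generated clusters to the 62 campaigns used in the following phases.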

Phase 2 : Training Set Generation

To generate the training set we used a small dataset made of three representative emails for each of the five classes. These 15 emails were manually selected from different datasets of real spam emails, including the authors' personal spam inboxes. To facilitate the manual analysis of the classified spam emails, the 15 emails of the set C are written in English. Each email has been assigned to one of the 62 spam campaigns, following the CCTree structure, as described in Section 5.3.3. The campaigns associated with these emails are used as the training set.

11. http://untroubled.org/spam

Table 5.5 – Training set generated from small knowledge.

Class      Number of Emails   Number of Campaigns
Advert.    29                 2
Portal     66                 3
Fraud      113                3
Malware    27                 1
Phishing   17                 1
Total      252                10

The generated training set (Table 5.5) is composed of 252 emails, contained in 10 campaigns. It is worth noting that the 15 emails have not been added to the associated clusters after the CCTree classification, so as not to alter the decision on the following emails.

Phase 3 : Labeling Spam Campaigns

After training the classifier with the generated training set, we label the remaining (52 out of 62) unlabeled spam campaigns of the CCTree. The classification results are reported in Table 5.6.

Table 5.6 – DWS classification results for the labeled spam campaigns.

Class      Campaigns          Emails             TPR     FPR     Accuracy
           Correct   Wrong    Correct   Wrong
Advert.    5         0        137       0        1       0       1
Portal     26        0        1331      0        1       0.03    0.9935
Fraud      10        2        1032      43       0.96    0.01    0.9788
Malware    3         0        31        0        1       1       1
Phishing   7         1        213       18       0.915   0       0.994
Total      51        3        2744      61       0.975   0.008   0.9782

The table reports, for each class, the number of campaigns and corresponding emails classified correctly or incorrectly. Moreover, we report for the emails the statistics on TPR, FPR and Accuracy (i.e., the ratio of correctly classified elements). The global accuracy (last row of the table) is 97.82%. However, we point out that, due to the conditions on predicted error reported in Subsection 5.3.3, 8 campaigns out of 62, containing 344 spam emails, are considered unclassified. For the sake of accuracy, considering these 8 campaigns as misclassified, the total accuracy for emails on the dataset is 87.14%. This accuracy result is in line with previous works on classifying emails as phishing or ham [34], [50], [17].
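The two accuracy figures follow directly from the email counts of Table 5.6 and the text; a quick arithmetic check:

```python
# Email counts from Table 5.6 and the text: 8 campaigns with 344 emails
# were left unclassified by the predicted-error condition.
correct, wrong, unclassified = 2744, 61, 344

acc_classified = correct / (correct + wrong)          # accuracy on classified emails
acc_all = correct / (correct + wrong + unclassified)  # unclassified counted as errors

print(acc_classified)  # about 0.978, reported as 97.82%
print(acc_all)         # about 0.871, reported as 87.14%
```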

Concerning the 8 unclassified campaigns, 3 campaigns containing 68 spam emails were correctly labeled as portal; however, they are considered unclassified since the average predicted error is higher than 30% in all 3 campaigns. 4 campaigns containing 258 spam emails have been classified as phishing: 2 of them, with 116 messages, were correctly identified but did not match the predicted-error condition, while the other 2 were incorrectly classified as fraud but are likewise considered unclassified due to their high predicted error. The last campaign, with 18 elements, belongs to the advertisement class but was incorrectly classified as fraud; again, the predicted-error condition is not matched. It is worth noting how the condition on predicted error helps increase the overall accuracy on classified data.

From Table 5.6 it is possible to infer that a large portion of spam messages belongs to the portal and fraud classes. Even if these preliminary results are related to a relatively small dataset, they are indicative of the current trend in the distribution of spam emails, which may provide the spammer the greatest return with the smallest risk.

5.5 Ranking Spam Campaigns

Since the number of spam emails collected daily is enormous, even after clustering spam emails into smaller similar groups (spam campaigns), a methodology is still required to automatically order spam campaigns according to investigator priorities. To this end, in this section, we provide several features (including the label of a campaign) and feature weights to attribute a grade to each spam campaign. The set of campaigns is then ordered based on these grades. More features can be added to the provided set depending on the case study.

Ranking spam campaigns helps the investigator decide which set of spam messages, assigned to a specific spammer, needs to be analyzed and prosecuted first. Furthermore, if the investigator pursues a specific goal, for example dangerous spam campaigns directed toward Canada, our proposed ranking methodology can be applied.

5.5.1 Ranking Features

In this section, we propose a set of five ranking features to order spam campaigns. The ranking features are presented in Table 5.7. Afterwards, we explain in detail what each ranking feature means and how it is normalized to the interval [0, 1].

— Number of Data belonging to a Spam Campaign (N)
— Domain of URLs (U)
— Language of the spam message (L)
— Burst Property, Analysis of the Distribution of Data in a Period of Time (B)
— Class (Label) of the Campaign (C)

Table 5.7 – Set of ranking features


Number of Data (N) : The number of data in a campaign refers to the number of spam emails belonging to that campaign. It is normalized by the number of elements in the largest campaign. More precisely, suppose the campaign containing the maximum number of elements contains n_max spam emails. The number of data of the i-th campaign, containing n_i elements, is normalized as N_i = n_i / n_max. Hence, N_i ∈ [0, 1].

URL Domain of Campaign (U) : The URL domain of a spam message is a boolean feature, which equals 1 if one of the desired domains occurs among the URLs in the body of the message, and 0 otherwise. The URL domain of a spam campaign equals the fraction of spam messages in the campaign for which the URL domain equals 1.

For example, consider that the investigator is interested in emails oriented to Canada. In this case, the appearance of URLs with domains like “.ca” in the bodies of the messages of a campaign makes it more interesting than other campaigns. To this end, a set of interesting domains X = {X1, . . . , Xk} is provided; then, for each message in the spam campaign containing one of the provided interesting domains, the URL domain of the message equals 1. The URL domain of the spam campaign equals the number of messages for which the result is 1, divided by the whole number of messages in the campaign. From the definition, the feature U is normalized to [0, 1].
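The feature U can be sketched as follows, assuming each message has been reduced to the list of domains of its URLs (the helper name is illustrative):

```python
def url_domain_feature(campaign, interesting):
    # U: fraction of messages in the campaign with at least one URL whose
    # domain ends with one of the interesting suffixes (e.g. ".ca").
    def hit(urls):
        return any(u.endswith(suffix) for u in urls for suffix in interesting)
    return sum(1 for urls in campaign if hit(urls)) / len(campaign)

# Toy campaign of four messages, each reduced to its list of URL domains:
messages = [["example.ca"], ["example.com"], ["example.org"], ["shop.ca", "x.net"]]
print(url_domain_feature(messages, [".ca"]))  # -> 0.5
```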

Language (L) : The language of the message is another criterion which helps an investigator interested in spam campaigns oriented to a specific region, e.g. Canada. To this end, a set of desired languages is provided; e.g. for Canada the set may contain English and French. Then, the language feature of a message equals 1 if it has been written in one of the desired languages, and 0 otherwise. The language of a campaign (L) equals the fraction of messages for which the language feature equals 1. From the definition, the criterion L is normalized to [0, 1].

Burst Property (B) : A spam campaign in which the number of spam messages decreases as time passes is less dangerous than one in which the number of produced spam emails is increasing. We call this criterion the burst property of the campaign, and we compute it by dividing the time span between the first email and the last email of the campaign into two halves. If the number of emails in the second half is larger than in the first half, we say that the spam campaign respects the burst property and attribute 1 to B; otherwise, it does not respect the burst property and we attribute 0 to B.
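The burst computation can be sketched as follows (timestamps are assumed to be numeric, e.g. hours since the first collected email; we choose to count an email arriving exactly at the midpoint in the first half):

```python
def burst_property(timestamps):
    # B = 1 if the second half of the campaign's time span holds more emails
    # than the first half (the campaign is still growing), else B = 0.
    # Emails arriving exactly at the midpoint count in the first half.
    first, last = min(timestamps), max(timestamps)
    mid = first + (last - first) / 2
    second = sum(1 for t in timestamps if t > mid)
    return 1 if second > len(timestamps) - second else 0

# Arrival times in hours since the first collected email (illustrative):
print(burst_property([0, 6, 7, 8, 9, 10]))  # -> 1 (growing campaign)
print(burst_property([0, 1, 2, 3, 10]))     # -> 0 (fading campaign)
```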

Class (label) of Campaign (C) : The label of a campaign (C) is returned by the DWS framework. Here, we propose an approach to attribute a score to each label. It is worth noticing that the proposed score for each label can be modified according to the investigator's priorities.

Phishing spam messages are the most dangerous kind of spam emails, stealing important information from the victim in a very well presented format. After phishing, malware spam messages are the most dangerous, in the sense that the computer of the end user is usually infected without his awareness. Fraud emails, while dangerous enough, are less dangerous than phishing and malware. The reason is that fraud spam messages mostly reach their goal through several rounds of communication, and during this interaction it is possible that the victim becomes aware of the risk of continuing the communication, or that a filtering service stops it before the required money is transferred. Portal emails, mostly not well presented, are generally recognized by the user as spam, and hence are not as dangerous as the previous groups. Finally, advertisement spam emails, which mostly propose a real product, are the least dangerous spam campaigns. Considering that campaigns with unknown label are not considered really dangerous campaigns, we score the phishing, malware, fraud, portal, advertisement and unknown campaigns as 6, 5, 4, 3, 2, 1, respectively. The score of the campaign label is normalized by dividing each score by 6 (Table 5.8).

Table 5.8 – Normalized score of spam campaigns label

Label              Phishing   Malware   Fraud   Portal   Advert.   Unknown
Normalized score   1          0.83      0.66    0.5      0.33      0.16

5.5.2 Spam Campaign Grade

To attribute a grade to each spam campaign after extracting its ranking features, it is required to provide a weight for each ranking feature. The weight of a feature is specified by an expert, and may vary from one case to another. The weights of the features should be normalized so that they sum to 1, which can simply be achieved by dividing each weight by the sum of the weights. The weighted features show the importance of each feature in ranking spam campaigns.

We define the grade of campaign C, written grade(C), as follows :

grade(C) = ω1 · C + ω2 · N + ω3 · U + ω4 · L+ ω5 · B

where C, N, U, L and B are the extracted ranking features of campaign C, and ωi ∈ [0, 1] for 1 ≤ i ≤ 5 with ω1 + · · · + ω5 = 1. From the definition, grade(C) ∈ [0, 1].
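The grade formula, combined with the label scores of Table 5.8, can be sketched as follows (the feature order, dictionary keys and helper names are illustrative):

```python
# Normalized label scores from Table 5.8.
LABEL_SCORE = {"phishing": 1.0, "malware": 0.83, "fraud": 0.66,
               "portal": 0.5, "advertisement": 0.33, "unknown": 0.16}

def grade(features, weights):
    # features and weights follow the order (C, N, U, L, B); the weights
    # are assumed to sum to 1, which keeps the grade inside [0, 1].
    return sum(w * f for w, f in zip(weights, features))

# Campaign 2 of Table 5.9 with equal weights, i.e. each ω_i = 0.2:
c = [LABEL_SCORE["fraud"], 0.78, 0.86, 0.97, 1.0]
print(round(grade(c, [0.2] * 5), 3))  # -> 0.854
```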

5.5.3 Ranking Application

In this section, we propose an approach to order a set of spam campaigns according to their grades. To this end, we first provide a simple ranking methodology, named dense ranking, in which objects having the same score receive the same rank. Afterwards, we describe the experiment of ranking the spam campaigns resulting from Section 5.4.2.

Table 5.9 – Three first ranked campaigns

             Number of Data   URL Domain   Language   Burst   Label   Grade
Campaign 1   1                0.96         0.98       1       0.5     0.88
Campaign 2   0.78             0.86         0.97       1       0.66    0.854
Campaign 3   0.15             0.91         1          1       1       0.812

Definition 5.1 (Dense Ranking (1223 ranking)). In dense ranking, objects having the same score receive the same ranking number, and the next object(s) receive the immediately following ranking number. Hence, each object's ranking number is 1 plus the number of distinct scores ranked above it. For example, if A ranks ahead of B and C, where B and C rank equally and both rank ahead of D, then A gets ranking number 1, B and C each get ranking number 2, and D gets ranking number 3, i.e. A = 1, B = 2, C = 2, D = 3.
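A minimal sketch of dense ranking over campaign grades (higher grade, smaller rank number; the function name is ours):

```python
def dense_rank(grades):
    # Dense ("1223") ranking: equal grades share a rank number, and the
    # next distinct grade gets the immediately following rank number.
    distinct = sorted(set(grades), reverse=True)  # higher grade -> rank 1
    rank_of = {g: i + 1 for i, g in enumerate(distinct)}
    return [rank_of[g] for g in grades]

# A ahead of B and C (tied), both ahead of D:
print(dense_rank([0.9, 0.7, 0.7, 0.5]))  # -> [1, 2, 2, 3]
```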

To apply dense ranking to a set of spam campaigns, it is enough to first compute the grade of each spam campaign. Afterwards, the campaigns are ranked according to their grades: the greater the grade, the smaller the rank number.

To order the 62 spam campaigns labeled in Section 5.4.2, we first extract for each campaign the four other ranking features explained in Section 5.5.1.

Concerning the features U and L, we consider the sets of interesting domains and languages to be {.ca, .com} and {English, French}, respectively. Considering an equal weight for each feature, i.e. ωi = 0.2 for 1 ≤ i ≤ 5, we calculate the grade of each campaign. The maximum number of elements among the 62 campaigns belongs to a portal campaign containing 407 spam emails. Hence, the number of elements of each campaign is normalized by dividing it by 407.

In Table 5.9, we report the properties of the first three ranked campaigns, where the grade of each campaign is calculated as follows :

grade(campaign1) = 0.2 · (1 + 0.96 + 0.98 + 1 + 0.5) = 0.88

grade(campaign2) = 0.2 · (0.78 + 0.86 + 0.97 + 1 + 0.66) = 0.854

grade(campaign3) = 0.2 · (0.15 + 0.91 + 1 + 1 + 1) = 0.812

The set of first-ranked campaigns identifies the campaigns that should be analyzed and pursued first by the investigators. The process is performed automatically; hence, in a short period of time, vital information is provided which would be almost impossible to obtain by considering a huge amount of spam emails as a whole.


5.6 Conclusion

Spam emails constitute a constant threat to both companies and private users. Not only are these emails unwanted, occupying storage space and requiring time to be deleted, they have also become vectors of security threats, used to perform cybercrimes such as phishing and malware distribution. In this chapter, we have presented a framework, named DWS, for the analysis of large amounts of spam emails collected through honeypots. We argue that DWS can provide a helpful tool for police and investigators in the forensic analysis of spam emails. In fact, DWS automatically clusters and classifies large amounts of spam emails into labeled campaigns, to eventually help the investigator focus on the campaigns of a specific cybercrime, filtering out the non-interesting spam emails. Moreover, DWS is self-learning, not requiring any preexistent knowledge of the dataset to analyze; instead, a small set of data, named the small knowledge, is provided. To update the small knowledge, the investigators can add newly discovered templates to the previous set.
Preliminary tests performed on a first dataset of more than 3200 emails showed a good accuracy of the DWS framework.
Furthermore, a ranking methodology is proposed to order a set of spam campaigns based on the investigator's priorities. The first-ranked campaigns are the ones which should be analyzed first.


Chapitre 6

Algebraic Formalization of CCTree

Despite clustering being one of the most common approaches in unsupervised data analysis, very little literature exists on the formalization of clustering algorithms. In this chapter we propose a semiring-based methodology, named Feature-Cluster Algebra, which abstracts the representation of the labeled tree structure produced by a hierarchical categorical clustering algorithm, named CCTree ([127]). Through several theorems and examples we show that the abstract schema fully abstracts the tree structure. Full abstraction provides the interesting property that an algebraic term and a tree structure can be used one instead of the other, when needed. This means that it is possible to use well-established concepts in the algebraic form of the clustering algorithm to get the equivalent result in the semantic form. We apply the abstract schema of CCTree to formalize CCTree parallelism with the use of a rewriting system. To this end, a set of functions and relations is defined on the feature-cluster algebra. Then, we first propose a rewriting system to automatically identify whether a term represents a CCTree term or not. Afterwards, a rewriting system is proposed to automatically obtain a final CCTree term from the addition of two (or more) CCTree terms. The final CCTree term is used to homogenize the structure of all CCTrees on parallel devices.

6.1 Introduction

Clustering is a very well-known tool in unsupervised data analysis, which has been the focus of significant research in different domains of computer security, spanning from intrusion detection [145] and spam campaign detection, as explained in previous chapters, to clustering Android malware [121]. The problem of clustering becomes more challenging when data are described in terms of categorical attributes, for which, differently from numerical attributes, it is hard to establish an ordering relationship [6]. The difficulty arises from the fact that the similarity of elements cannot be computed with the use of well-known geometric distances, e.g. the Euclidean distance. In categorical clustering, each attribute contains a domain of discrete, mutually exclusive features, where each feature represents a value of an element. For example, the attribute color may contain the features red and blue.

Clustering algorithms are widely applied to real-world problems, including security problems; in the present thesis they have already been applied to spam campaign detection. Notwithstanding, very few works exist that express and solve the problems of clustering algorithms in terms of formal methods. Formal methods are mathematically based languages, techniques, and tools to specify general rules on a system, where the desired properties of the system can be verified easily on the basis of the identified rules [37].
In the present work, we argue that using formal methods on the CCTree, as a specific form of categorical clustering algorithm, provides an abstract representation of clusters, which facilitates the analysis of cluster properties while avoiding confronting the large amount of data in each cluster. The proposed formal scheme is used to formalize a challenging task in categorical clustering algorithms, named parallel clustering.
CCTree (Categorical Clustering Tree) has a decision-tree-like structure, which iteratively divides the data of a node on the basis of the attribute, or domain of features, yielding the greatest entropy. The division of data is shown with edges coming out from a parent node to its children, where the edges are labeled with the associated features. A node respecting the identified stop conditions is considered a leaf. The leaves of the tree are the desired clusters. Since the features, i.e. the edge labels, are notably significant in CCTree construction, a CCTree has a feature-based structure.
Feature algebra [74] is a semiring-based formal method proposed to formalize feature-based product lines, e.g. software products. We import the idea of feature algebra to formalize the feature-based CCTree structure and call our proposed semiring-based algebraic structure “Feature-Cluster Algebra”. The notion of feature-cluster algebra is used to abstract the CCTree representation as a term. The CCTree term is applied to formalize CCTree parallelism on the basis of a rewriting system. Parallel clustering is a methodology proposed to alleviate the problems of time and memory usage in clustering large datasets [42].

The contributions of this chapter can be summarized as follows :
— A semiring-based formal method, named Feature-Cluster Algebra, is proposed to abstract the representation of a categorical clustering algorithm, named CCTree ([127]). Abstraction is a delightful mathematical concept, which constructs a brief sketch of the original representation of a problem so as to deal with it more easily. More precisely, abstraction is the process of mapping a representation of a problem, called the ground (semantic) representation, onto a new representation, called the abstract (syntax) representation, in such a way that it is possible to solve the problem in the original space while preserving certain desirable properties, and in a simpler way, since the abstract representation is obtained from the ground representation by removing unwanted detail [59].

— Through several theorems and examples we show that the proposed approach fully abstracts the CCTree representation under some conditions. Full abstraction is an interesting property of an abstract mapping, which guarantees that we can use the ground (semantic) representation and the abstract (syntax) representation of a problem interchangeably.

— A rewriting system is proposed to automatically verify whether a term is a CCTree term or not. A rewriting system is a set of directed equations on a set of objects. Usually the objects in a rewriting system are called terms and the directed equations are called rewriting rules. The rewriting rules are applied to compute new terms by repeatedly replacing subterms of a given term until the simplest possible form is obtained. A rewriting system is an interesting mathematical concept which automatically creates a new desired final term by applying the correctly specified rewriting rules [43].

— The abstract form of the CCTree is applied to formalize the process of parallelizing CCTree clustering on parallel computers with the use of a rewriting system. The proposed rewriting system contains a set of rewriting rules which direct us to obtain, from a non-CCTree term, a CCTree term representing a CCTree into which all CCTrees on parallel devices can be merged.

— We prove that the proposed rewriting systems are confluent. Termination and confluence are two interesting properties of a rewriting system. The termination of a rewriting system guarantees that the system does not contain a loop of rules, which would cause a non-terminating process of applying the rewriting rules. The confluence property of a rewriting system guarantees that applying the rewriting rules to a given term results in a unique term.

This chapter is organized as follows. In Section 6.2, we present a review of the literature about formalization methods applied to feature-based problems. In Section 6.3, the process of transforming a CCTree to its equivalent algebraic expression is explained in terms of a semiring. In Section 6.4, we show that the proposed algebraic structure fully abstracts the tree representation. The relations on the feature-cluster algebra are introduced in Section 6.5. In Section 6.6, we apply the abstract CCTree representation to formalize CCTree parallel clustering in terms of a rewriting system. We conclude and point to future directions in Section 6.7.

6.2 Related work

Feature models are information models in which a set of products, e.g. software products or DVD player products, is represented as a hierarchical arrangement of features, with different relationships among the features [15]. Feature models are used in many applications because they can model complex systems, are interpretable, and can handle both ordered and unordered features [105]. The authors of [15] argue that designing a family of software systems in terms of features makes it easier to understand by all stakeholders than other forms of representation. Representing feature models as a tree of features was first introduced in [82], to be used in software product lines. Some studies [31], [32] show that tree models combined with ensemble techniques lead to accurate performance on a variety of domains. In a feature model tree, differently from a CCTree, the root is the desired product, the nodes are the features, and different representations of the edges demonstrate the mandatory or optional presence of features.
The authors of [73], [74] were the first to apply an idempotent semiring as the basis for the formalization of tree models of products, calling it feature algebra. The concept of semiring is used to answer the needs of product families: abstract forms of expressions, refinements, multi-view reconciliations, product development, and classification. The elements of the semiring in their methodology are sets of products, or product families.
To the best of our knowledge, we are the first to apply an algebraic structure to abstract the representation of a categorical clustering algorithm and formalize the associated issues.

6.3 Feature-Cluster Algebra

In this section, we introduce our proposed semiring-based formal method, named feature-cluster algebra, to abstract the CCTree representation. To this end, we first explain what precisely a semiring is. Then, the process of transforming a tree structure to its equivalent term is presented.

6.3.1 Semiring

In abstract algebra, the term algebraic structure generally refers to a set of elements together with one or more finitary operations respecting specified properties [68]. In particular, a semiring is an algebraic structure with two binary operations on a set of elements. More precisely, a semiring is defined as follows.

Definition 6.1 (Semiring). A semiring is a set S with two binary operations “+” and “·”, called addition and multiplication, respectively, such that (S, +) is a commutative monoid with identity element 0, and (S, ·) is a monoid with identity element 1. Multiplication distributes left and right over addition, and multiplication by 0 annihilates elements of S. A semiring whose multiplication is commutative is called a commutative semiring [68].
More precisely, S equipped with two binary operations “+” and “·”, such that 0 and 1 are the identity elements of “+” and “·” respectively, is a semiring if for all a, b, c ∈ S the following laws


are satisfied :

(a+ b) + c = a+ (b+ c)

0 + a = a+ 0 = a

a+ b = b+ a

(a · b) · c = a · (b · c)

1 · a = a · 1 = a

a · (b+ c) = (a · b) + (a · c)

(a+ b) · c = (a · c) + (b · c)

0 · a = a · 0 = 0

Briefly, we write that (S, +, ·, 0, 1) is a semiring.
A semiring (S, +, ·, 0, 1) is called an idempotent semiring if for any a ∈ S we have :

a+ a = a

Semiring of Features

Let a set of disjoint sorts, denoted A, be given, where the carrier set of each sort Ai ∈ A is denoted by VAi. In our context, we call the given set of sorts the set of attributes, and we call the union of the carrier sets, denoted V = ⋃_{Ai ∈ A} VAi, the set of values or features.

Example 6.1. We may consider the set of attributes A = {color, size}, where the carrier set of each attribute can be considered as Vcolor = {red, blue} and Vsize = {small, large}. In this case, we have V = {red, blue, small, large}.

Definition 6.2 (Sort). We define the sort function, which takes a set of features and returns the set of associated sorts of the received features, as follows :

sort : P(V) → P(V)

sort({f}) = VA for f ∈ VA

sort(V1 ∪ V2) = sort(V1) ∪ sort(V2)

Example 6.2. In the following, we present the application of the sort function on sets of features from Example 6.1 :

sort({red}) = {red, blue}

sort({red, small}) = sort({red}) ∪ sort({small}) = {red, blue, small, large}
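A small sketch of the sort function, hard-coding the attributes of Example 6.1 (the `CARRIERS` table and function name are illustrative):

```python
# Attributes and their carrier sets from Example 6.1.
CARRIERS = {"color": {"red", "blue"}, "size": {"small", "large"}}

def sort_of(features):
    # sort maps each feature to the carrier set of its attribute and
    # returns the union over the given set of features.
    result = set()
    for f in features:
        for carrier in CARRIERS.values():
            if f in carrier:
                result |= carrier
    return result

print(sorted(sort_of({"red"})))           # -> ['blue', 'red']
print(sorted(sort_of({"red", "small"})))  # -> ['blue', 'large', 'red', 'small']
```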

Let F = P(P(V)) be the power set of the power set of V, and denote 1 = {∅} and 0 = ∅. We define the operations “+” and “·” as choice and composition operators on F as follows :

· : F × F → F

F1 · F2 = {X ∪ Y : X ∈ F1, Y ∈ F2}

+ : F × F → F

F1 + F2 = F1 ∪ F2

We say that F belongs to the power set of features F if it respects one of the following syntax forms :

F := 0 | f | F · F | F + F | 1 (6.2)

where f ∈ V.

Example 6.3. In the following, some elements of F on V = {red, blue, small, large} are presented:

F1 = {{red, large}, {blue}}

F2 = {{small}}

F1 · F2 = {{red, large, small}, {blue, small}}

F1 + F2 = {{red, large}, {blue}, {small}}

In the problem of formalizing categorical clustering, the set {{red, large}, {small}} may represent two clusters, where the elements of the cluster {red, large} have the features red and large in common, and the elements of the cluster {small} are all small. This means that we use addition to separate clusters, and we use multiplication to add features identifying a cluster. Clearly, not every combination of sets of features represents a clustering.
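The choice and composition operators above can be sketched directly in code. The following is a hypothetical Python model, not part of the thesis: an element of F is a frozenset of frozensets of feature names, and the function names `add` and `mul` are ours.

```python
# Sketch of the feature semiring F = P(P(V)): an element of F is modelled
# as a frozenset of frozensets of feature names (assumption: our encoding).
from itertools import product

ZERO = frozenset()               # 0 = {}  : annihilator of "."
ONE = frozenset({frozenset()})   # 1 = {{}}: identity of "."

def add(F1, F2):
    """Choice: F1 + F2 = F1 U F2 (separates clusters)."""
    return F1 | F2

def mul(F1, F2):
    """Composition: F1 . F2 = {X U Y : X in F1, Y in F2}."""
    return frozenset(X | Y for X, Y in product(F1, F2))

F1 = frozenset({frozenset({"red", "large"}), frozenset({"blue"})})
F2 = frozenset({frozenset({"small"})})

# F1 . F2 = {{red, large, small}, {blue, small}}  (Example 6.3)
assert mul(F1, F2) == frozenset({frozenset({"red", "large", "small"}),
                                 frozenset({"blue", "small"})})
# spot-check of the semiring laws: identities, annihilation, idempotence
assert mul(ONE, F1) == F1 and mul(ZERO, F1) == ZERO
assert add(F1, F1) == F1
```

The asserts reproduce Example 6.3 and a few laws of Proposition 6.4 under this encoding.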

Proposition 6.4. It is easy to verify that the two operations “+” and “·” respect the following


properties for every F1, F2, F3 ∈ F :

(F1 + F2) + F3 = F1 + (F2 + F3) (6.3)

F1 + F2 = F2 + F1 (6.4)

F1 · F2 = F2 · F1 (6.5)

(F1 · F2) · F3 = F1 · (F2 · F3) (6.6)

F1 · (F2 + F3) = (F1 · F2) + (F1 · F3) (6.7)

(F1 + F2) · F3 = (F1 · F3) + (F2 · F3) (6.8)

1 · F1 = F1 · 1 = F1 (6.9)

0 · F1 = F1 · 0 = 0 (6.10)

0 + F1 = F1 + 0 = F1 (6.11)

F1 + F1 = F1 (6.12)

Theorem 6.5. The quintuple (F,+, ·, 0, 1) constitutes an idempotent commutative semiring.

Proof. The proof is straightforward from Proposition 6.4.

Definition 6.3. Let |·| denote the number of elements of a set. We say F ∈ F belongs to the set Fn if |F| = n. Under this definition, F1, i.e. the subset of F whose elements each contain just one set of features, is the desired one for our problem. In this case, for F ∈ F1 we remove the brackets and separate the features belonging to the same set by multiplication. Hence, F ∈ F1 if it can be written in one of the syntax forms 0 | f | F1 · F2 | 1, for f ∈ V.

Note that when two elements of F1 are added or multiplied, they follow the same properties as in the main semiring defined on F. In the following example, we show how this simpler representation is used in the rest of the chapter.

Example 6.6. We simplify the elements of Example 6.3 according to Definition 6.3, as follows:

F1 = {{red, large}, {blue}} = {{red, large}} + {{blue}} = red · large + blue

F2 = {{small}} = small

F1 · F2 = {{red, large, small}, {blue, small}} = {{red, large, small}} + {{blue, small}}

= red · large · small + blue · small

F1 + F2 = {{red, large}, {blue}, {small}} = {{red, large}} + {{blue}} + {{small}}

= red · large + blue + small

The semiring of features can be used to represent different feature-based clustering algorithms. In our context, aiming to address parallel clustering, we also need to discuss the different


datasets that the clusters originate from. To this end, in the upcoming subsection we present the semiring of elements.

Semiring of Elements

Let us consider that the set of sorts, or the set of attributes A with an order among the attributes, is given. Suppose |A| = k and, without loss of generality, that A1, A2, . . . , Ak are the ordered sorts ranging over A. We say s belongs to the set of elements S if s ∈ VA1 × VA2 × . . . × VAk × N, where the carriers of the attributes are arbitrarily ordered (then fixed) for each problem, and N is the set of natural numbers. Hence S ⊆ VA1 × VA2 × . . . × VAk × N, i.e. s ∈ S can be written as s = (x1, x2, · · · , xk, n), where xi ∈ VAi for 1 ≤ i ≤ k, and n ∈ N is a natural number representing the ID of the element. For the sake of simplicity, we may use the alternative representation xi ∈ Ai instead of xi ∈ VAi. In our problem, S is the set of all elements to be clustered. As a result of having different sets of elements to be clustered in the problem of parallel clustering, we define a semiring on the power set of all elements. In this case, if we have for example two datasets of elements, say S1 and S2, then S = S1 ∪ S2.

Example 6.7. Consider that in Example 6.3 we have the Cartesian product of the carriers of the attributes as "color × size"; then S = {(red, small, 1), (blue, small, 2), (red, large, 3)} is a set of elements on V to be clustered in a specific problem.

We formally define the two operations "+" and "·" as union and intersection of elements of P(S)

(the power set of S), as follows:

· : P(S)× P(S)→ P(S)

S1 · S2 = S1 ∩ S2

+ : P(S)× P(S)→ P(S)

S1 + S2 = S1 ∪ S2

Formally, we say S belongs to the set of elements P(S) if it respects one of the following forms:

S := ∅ | S′ | S + S | S · S | S (6.13)

where S′ ⊆ S.
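The union/intersection operations of this semiring can be sketched directly. This is a hypothetical Python model, not the thesis's own code; `add` and `mul` are our names, and elements are tuples whose last coordinate is the ID, as in Example 6.7.

```python
# Sketch of the semiring of elements (P(S), +, ., {}, S): "+" is union,
# "." is intersection; S below is the full dataset of Example 6.7.
S = {("red", "small", 1), ("blue", "small", 2), ("red", "large", 3)}

def add(S1, S2):
    """S1 + S2 = S1 U S2."""
    return S1 | S2

def mul(S1, S2):
    """S1 . S2 = S1 n S2."""
    return S1 & S2

S1 = {("red", "small", 1)}
S2 = {("red", "small", 1), ("red", "large", 3)}

assert mul(S1, S2) == S1      # S1 . S2 = S1 when S1 is a subset of S2 (property 6.24)
assert add(set(), S1) == S1   # the empty set is the additive identity
assert mul(S, S1) == S1       # the full set S is the multiplicative identity
assert add(S1, S1) == S1      # idempotence of "+"
```

The asserts spot-check the identity, subset, and idempotence properties of Proposition 6.8 under this encoding.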


Proposition 6.8. It is easy to verify that operations “+” and “·” on every S1, S2, S3 ∈ P(S)

respect the following properties :

(S1 + S2) + S3 = S1 + (S2 + S3) (6.14)

∅+ S1 = S1 + ∅ = S1 (6.15)

S1 + S2 = S2 + S1 (6.16)

(S1 · S2) · S3 = S1 · (S2 · S3) (6.17)

S · S1 = S1 · S = S1 (6.18)

S1 · (S2 + S3) = (S1 · S2) + (S1 · S3) (6.19)

(S1 + S2) · S3 = (S1 · S3) + (S2 · S3) (6.20)

∅ · S1 = S1 · ∅ = ∅ (6.21)

S1 + S1 = S1 (6.22)

S1 · S1 = S1 (6.23)

S1 · S2 = S1 if S1 ⊆ S2 (6.24)

Theorem 6.9. The quintuple (P(S), +, ·, ∅, S) is an idempotent commutative semiring.

Proof. The proof is straightforward from Proposition 6.8.

Note: It should be noted that the operations "+" and "·" are overloaded according to the kind of elements they are applied to. This means that if the operation "+" is used between two sets of elements, it refers to the addition operation in the semiring of elements, and when "+" is applied between two sets of features, it refers to the addition operation in the semiring of features. The same holds for the multiplication operation "·".

Semiring of Terms

In the two previous subsections we introduced two semirings, on the set of features and on the set of elements, respectively. The reason underlying this choice is that, in our context, 1) categorical clusters are generally specified by a set of features, and 2) in formalizing parallel clustering we have several datasets and we must clearly specify which dataset of elements we refer to. In what follows, we construct the semiring of terms using the previous semirings; it will be used to abstract the tree structure and to formalize parallel clustering. In the rest of the chapter, we use the same notions and symbols introduced above.

Recall that a cluster in a CCTree can be uniquely identified by a set of elements respecting a set of features. We define the satisfaction relation to formally express the concept of cluster.

Definition 6.4 (Satisfaction Relation). Recall that when an element of F contains just one set of features we remove the brackets (Definition 6.3). We define the satisfaction relation, denoted ⊩, as follows:


⊩ : F × P(S) → P(S)

⊩(f, {(x1, x2, · · · , xk, n)}) = {(x1, x2, · · · , xk, n)} if ∃i, 1 ≤ i ≤ k, s.t. xi = f

⊩(f, {(x1, x2, · · · , xk, n)}) = ∅ if ∄i, 1 ≤ i ≤ k, s.t. xi = f

⊩(f, S1 ∪ S2) = ⊩(f, S1) ∪ ⊩(f, S2)

⊩(F1 · F2, S) = ⊩(F1, S) ∩ ⊩(F2, S)

When ⊩(F, S) ≠ ∅, we say that S satisfies F. For the sake of simplicity, we use the alternative representation F ⊩ S instead of ⊩(F, S) when ⊩(F, S) ≠ ∅.

We consider that the multiplication "·" and the addition "+" over ⊩ respect the following properties:

(F1 ⊩ S1) · (F2 ⊩ S2) = (F1 · F2) ⊩ S2 if S1 · S2 = S2 (6.25)

(F1 ⊩ S1) + (F2 ⊩ S2) = (F1 + F2) ⊩ S2 if S1 + S2 = S2 (6.26)

where S1 · S2 = S2 means S2 ⊆ S1, and S1 + S2 = S2 means S1 ⊆ S2. In the case where neither set is a subset of the other, the multiplication and addition return the received elements unchanged. It should be noted that "·" and "+" are overloaded to their own definitions for the semiring of features and the semiring of elements when they are applied between two sets of features or two sets of elements, respectively.

Roughly speaking, these properties can be interpreted as follows. The multiplication returns the tuples in the intersection of two clusters whose element sets are one a subset of the other; the addition refers to the union of two clusters, where one is a subset of the other. In our context, property 6.25 is applied to address the division of a cluster into new smaller clusters: each new small cluster satisfies the features of the main cluster, plus more restrictive features. Moreover, property 6.26 is used to get the simpler form of clusters according to Definition 6.3.

Example 6.10. Let the set of elements S = {(red, small, 1), (blue, small, 2), (red, large, 3)} on the set of features V = {red, blue, small, large} be given. The following examples represent different clusters on this dataset in terms of the satisfaction relation:

⊩(red, {(red, small, 1)}) = {(red, small, 1)}

⊩(red, {(blue, small, 2)}) = ∅

⊩(red, {(red, small, 1), (blue, small, 2)}) = ⊩(red, {(red, small, 1)})

∪ ⊩(red, {(blue, small, 2)}) = {(red, small, 1)} ∪ ∅ = {(red, small, 1)}

⊩(red · small, {(red, small, 1)}) = ⊩(red, {(red, small, 1)})

∩ ⊩(small, {(red, small, 1)}) = {(red, small, 1)} ∩ {(red, small, 1)} = {(red, small, 1)}
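The satisfaction relation admits a direct set-theoretic reading, sketched below in Python. The encoding is ours, not the thesis's: an element is a tuple whose last coordinate is the ID, and a product of features F = f1 · f2 · ... is given as an iterable of feature names.

```python
def sat_feature(f, elems):
    """sat(f, S): keep the tuples of S whose coordinates contain feature f
    (the last coordinate is the ID and is excluded)."""
    return {e for e in elems if f in e[:-1]}

def sat(features, elems):
    """sat(F, S) for a product of features: intersection of the
    per-feature satisfactions, per the last clause of Definition 6.4."""
    result = set(elems)
    for f in features:
        result &= sat_feature(f, elems)
    return result

S = {("red", "small", 1), ("blue", "small", 2), ("red", "large", 3)}

assert sat(["red"], S) == {("red", "small", 1), ("red", "large", 3)}
assert sat(["red", "small"], S) == {("red", "small", 1)}   # red . small sat S
assert sat(["blue", "large"], S) == set()                   # no element satisfies it
```

The second assert reproduces the last computation of Example 6.10 on the full dataset restricted to element 1.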


Proposition 6.11. For F1, F2 ∈ F and S ∈ P(S), the symbol "⊩" satisfies the following properties with respect to "+" and "·":

(F1 · F2) ⊩ S = (F1 ⊩ S) · (F2 ⊩ S) (6.27)

(F1 + F2) ⊩ S = (F1 ⊩ S) + (F2 ⊩ S) (6.28)

Proof. The proof is straightforward from properties 6.25 and 6.26, since we have S · S = S and S + S = S.

Equations 6.27 and 6.28 express how we can transform the different forms of F ∈ F into the form F ∈ F1.

Example 6.12. The following equation shows how transformations 6.27 and 6.28 turn a set of features into the form F ∈ F1 defined in Definition 6.3:

{{f1, f2}, {f3}} ⊩ S = {{f1, f2}} ⊩ S + {{f3}} ⊩ S = f1 · f2 ⊩ S + f3 ⊩ S

The form F ∈ F1 is a particularly desirable representation of a set of features, which will be used in our context. Hence, we give it a specific name, as follows.

Definition 6.5 (Feature-Cluster (Family) Term). The set of feature-cluster family terms on V and S, denoted FCV,S (or simply FC if it is clear from the context), is the smallest set containing the elements satisfying the following conditions:

if S ⊆ S then S ∈ FC

if F ∈ F1, S ⊆ S then F ⊩ S ∈ FC

if τ1 ∈ FC, τ2 ∈ FC then τ1 + τ2 ∈ FC

In this case, we call S and F ⊩ S feature-cluster terms, and an addition of one or more feature-cluster terms is called a feature-cluster family term. We may simply write FC-term for a feature-cluster family term. We define the block function, which receives a feature-cluster family term and returns the set of its blocks. Formally, we have:

block : FC → P(FC)

block(S) = {S}

block(F ⊩ S) = {F ⊩ S}

block(τ1 + τ2) = block(τ1) ∪ block(τ2)

In the case where no feature directly specifies S, S is called an atomic term. The set of all atomic terms is denoted A.


Example 6.13. In the following, some examples of FC-terms are presented :

S ∈ FC

red · small ⊩ S ∈ FC

red · small ⊩ S + blue ⊩ S ∈ FC

Example 6.14. Suppose that the term τ = red ⊩ S + blue ⊩ S is given. Applying the block function on τ results in:

block(red ⊩ S + blue ⊩ S) = {red ⊩ S, blue ⊩ S}
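The block function can be sketched in a few lines of Python. The pair encoding of feature-cluster terms below is our own assumption: a family term F1 ⊩ S1 + ... + Fn ⊩ Sn is a list of (features, dataset) pairs.

```python
def block(term):
    """Return the set of summands (blocks) of a feature-cluster family term,
    each summand encoded as a (features, dataset) pair."""
    return {(frozenset(F), frozenset(S)) for F, S in term}

S = frozenset({("red", "small", 1), ("blue", "small", 2)})

tau = [({"red"}, S), ({"blue"}, S)]    # red sat S + blue sat S
assert block(tau) == {(frozenset({"red"}), S), (frozenset({"blue"}), S)}

# since block returns a set, comparison of terms up to reordering of
# summands reduces to equality of block sets
tau2 = [({"blue"}, S), ({"red"}, S)]
assert block(tau) == block(tau2)
```

The second assert illustrates why comparing block sets makes the order of summands irrelevant, which is the role block plays in the term comparison defined next.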

Definition 6.6 (FC-Term Comparison). We say two FC-terms τ1 and τ2 are equal, denoted τ1 ≡ τ2, if their representations satisfy the following relations:

S1 ≡ S2 ⇔ S1 = S2

F1 ⊩ S1 ≡ F2 ⊩ S2 ⇔ S1 = S2, F1 = F2

τ ≡ τ ′ ⇔ block(τ) = block(τ ′)

Example 6.15. The following examples show two simple equivalences of FC-terms:

red · small ⊩ S ≡ small · red ⊩ S

red · small ⊩ S + blue ⊩ S ≡ blue ⊩ S + small · red ⊩ S

Definition 6.7 (Term). We call τ a term if it has one of the following forms:

τ := S | F ⊩ S | τ + τ | τ · τ (6.29)

where

S := ∅ | S′ | S + S | S · S | S (6.30)

F := 0 | f | F + F | F · F | 1 (6.31)

in which 6.30 and 6.31 satisfy the properties specified in the two previous subsections, respectively.

The set of terms on S and F is denoted CS,F, or abbreviated C, when it is known beforehand on which datasets it has been constructed. As previously discussed, when an element of F contains just one set of features, we remove the brackets and use "·" to separate the features belonging to the associated set.

Example 6.16. In the following, some examples of terms on V = {red, blue, small, large} and dataset S are presented:

red · small ⊩ S

red · small ⊩ S + blue ⊩ S′

(red · small ⊩ S) · (blue ⊩ S′)

{{red, large}, {blue}} ⊩ S


Theorem 6.17. The two identity elements of C with respect to "+" and "·" are 0 ⊩ ∅ and 1 ⊩ S, respectively.

Proof. From properties 6.25 and 6.26, and the term definition (Definition 6.7), which considers the commutativity of multiplication and addition among terms, we have:

(1 ⊩ S) · (F ⊩ S) = (1 · F) ⊩ S = F ⊩ S (6.32)

(0 ⊩ ∅) · (F ⊩ S) = (0 · F) ⊩ ∅ = 0 ⊩ ∅ (6.33)

(0 ⊩ ∅) + (F ⊩ S) = (0 + F) ⊩ S = F ⊩ S (6.34)

For the other elements of C, the proof is straightforward from the above equations and properties 6.25 and 6.26.

Theorem 6.18. The quintuple (C, "+", "·", 0 ⊩ ∅, 1 ⊩ S) is an idempotent commutative semiring.

Proof. The proof is straightforward from the semiring definition (Definition 6.1), the two semirings constructed above, and the properties of the satisfaction relation.

Definition 6.8 (Feature-Cluster Algebra). The semiring (C, "+", "·", 0 ⊩ ∅, 1 ⊩ S) is called a feature-cluster algebra.

Note that in the present work the terms used in the following sections mostly belong to the set of feature-cluster family terms FC ⊆ C. This means that, as elements of the semiring (C, "+", "·", 0 ⊩ ∅, 1 ⊩ S), they follow the same operations and properties as the elements of the proposed feature-cluster algebra.

6.4 Feature-Cluster (Family) Term Abstraction

In this section, we relate the concept of feature-cluster algebra to the tree structure. To this end, we first present some preliminary notions related to graphs, abstraction, and rewriting systems. Graph theory notions are used to formally represent the tree structure. Abstraction theory is used to prove that the syntactic form of trees (under some conditions) is equivalent to the semantic form of the tree structure. This property is desirable in the sense that we are able to apply several interesting algebraic calculations on syntactic forms, whilst, whenever required, it is possible to transform a term into its equivalent semantic structure, preserving the same properties as applying the calculations on semantic forms. Moreover, the rewriting system is applied to automatically verify whether a term represents a CCTree or not, and to automatically obtain a homogenized CCTree term resulting from the addition of several CCTree terms.


6.4.1 Preliminary Notions

Graph Theory Preliminaries In graph theory [62], a tree is an undirected graph in which any two vertices are connected by exactly one path. A forest is a disjoint union of trees. A tree is called a rooted tree if one vertex has been designated the root, which means that the edges have a natural orientation, towards or away from the root [62]. A node directly connected to another node when moving away from the root is a child node. In a rooted tree, every node except the root has one parent node, called its predecessor; a child node in a rooted tree is called a successor. A node without successors in a rooted tree is called a leaf. A tree is a labeled tree if its edges are labeled. A branch of a tree refers to the path between the root and a leaf in a rooted tree [62]. The descendant tree of an edge f in a rooted tree T is the subtree of T following edge f.

Definition 6.9 (Graph Homomorphism, Graph Isomorphism). A graph homomorphism from a graph G = (V, E) to a graph G′ = (V′, E′), written ζ : G → G′, is a mapping ζ : V → V′ from the vertex set of G to the vertex set of G′ such that {u, v} ∈ E implies {ζ(u), ζ(v)} ∈ E′ [70]. If the homomorphism ζ : G → G′ is a bijection whose inverse function is also a graph homomorphism, then ζ is a graph isomorphism. In our context it is important that {u, v} ∈ E and {ζ(u), ζ(v)} ∈ E′ have the same edge label. Under this condition, we say that two graphs G = (V, E) and G′ = (V′, E′) are equivalent, denoted G ≈ G′, if V = V′, E = E′, each {u, v} ∈ E and its image {ζ(u), ζ(v)} ∈ E′ carry the same label, and G and G′ are isomorphic.

Definition 6.10 (Tree Structure). In our context, a graph structure is a triple (F, Q, ω) where F represents the set of edge labels; Q is the set of states or nodes; and ω : Q × F → Q is the transition function. A graph structure is a tree structure if there is no cycle in the transitions. In this case, the transitions are written such that each parent node is connected to its children moving from the root. We write a transition ω(s1, f) = s2 as a triple (s1, f, s2). Hence, the set of transitions in our context is a set of triples, where the first component is a parent node (predecessor), the last component is one of its children (successors), and the middle component is the edge label (feature) leading from the parent node to that child.

Note: It is worth noticing that a CCTree is a tree structure, which in our context can be formally presented as a triple whose first component (F) contains the set of edge labels, whose second component (Q) contains the nodes of the CCTree, and whose last component is the set of transitions from a parent node through edge labels to its children. We label the root node with the main dataset to be clustered.

Abstraction Theory Preliminaries What does abstraction mean in general? Some synonyms of the word "abstract" are "brief", "synopsis" and "sketch"; some synonyms


of the verb "to abstract" are "to detach" and "to separate". The intuition that comes out of this list of synonyms is that the process of abstraction is related to the process of separating, of extracting from a representation of an object or subject an "abstract" representation that consists of a brief sketch of the original representation [59].

More precisely, abstraction is the process of mapping a representation of a problem, called the "ground" representation, onto a new representation, called the "abstract" representation, which helps to deal with the problem in the original search space by preserving certain desirable properties, and which is simpler to handle as it is constructed from the ground representation by "not considering the details" [59]. The most common use of abstraction is in theorem proving: one abstracts the goal, proves its abstracted version, and then uses the structure of the resulting proof to help construct the proof of the original goal. This is based on the assumption that the structure of the abstract proof is equivalent to the structure of the proof of the goal. The other main use of abstraction theory has been to study the formal properties of abstractions and the operations, such as composition and ordering, which can be defined upon them [59].

An abstraction can formally be written as a function [[.]] : X → Y from the ground representation (semantic form) of an object to its abstract form (syntactic form). We say [[.]] adequately abstracts X if the equivalence of two abstract forms implies the equivalence of the corresponding semantic forms. Formally, if the equivalence of elements of X is denoted ≃ and the equivalence of elements of Y is denoted ≅, then:

[[X1]] ≅ [[X2]] ⇒ X1 ≃ X2 (6.35)

we say [[.]] abstracts X if we have :

X1 ≃ X2 ⇒ [[X1]] ≅ [[X2]] (6.36)

When 6.35 and 6.36 are both satisfied, we say [[.]] fully abstracts X, i.e. we have:

[[X1]] ≅ [[X2]] ⇔ X1 ≃ X2

Rewriting System Terminology A rewriting system is given by a set of directed equations on a set of objects. Usually the objects in a rewriting system are called terms and the directed equations are called rewriting rules. The rewriting rules are applied to compute new terms by repeatedly replacing subterms of a given term until the simplest possible form is obtained [43]. More precisely, a rewriting rule is an ordered pair of terms x and y, written x → y. As with equations, rules are applied to replace instances of x by corresponding instances of y. Unlike equations, rules are not applied to replace instances of the right-hand side y [43]. A term over symbols G, constants K, and variables X is either a variable x ∈ X, a constant k ∈ K, or


an expression of the form g(t1, t2, . . . , tn), where g ∈ G is a function symbol of n arguments and the ti are terms [43]. A derivation for a rule → is a sequence of the form t0 → t1 → . . .. A term t is reducible with respect to the rule → if there is a term u such that t → u; otherwise it is irreducible. A rewrite system R is a set of rewrite rules x → y, where x and y are terms. The term u is a →-normal form of t if t →* u and u is irreducible via →, where →* means that the rule → is applied n times (n ∈ N). A relation → is terminating if there are no infinite derivations t0 → t1 → . . ., which means that every derivation reaches a normal form. A relation → is confluent if there is an element v such that s →* v and t →* v whenever u →* s and u →* t for some elements s, t and u. A relation → is convergent if it is terminating and confluent. Convergent rewriting systems are interesting because all derivations lead to a unique normal form [43]. A conditional rule is an equational implication in which the term in the conclusion is reached only if the conditions are satisfied. We write x1 = u1 ∧ . . . ∧ xn = un | x → y to express that under the conditions x1 = u1 ∧ . . . ∧ xn = un we have x → y.

6.4.2 Graph Structure and Feature-Cluster Family Terms

In this subsection, we explain how a graph structure and a feature-cluster family term can be transformed into each other. To this end, we first present the "meaning" relation, which transforms a feature-cluster family term into a labeled graph structure. Afterwards, we present a function which obtains a feature-cluster family term from a labeled tree structure. We then prove in a theorem that if two labeled trees are equivalent, they return equal terms. However, we show that two equal feature-cluster family terms do not necessarily return two equivalent graph structures. We prove that under the condition of a fixed order among the features, the latter requirement is also respected.

In the provided examples, the attributes Color = {r(ed), b(lue)}, Size = {s(mall), l(arge)}, and Shape = {c(ircle), t(riangle)} are used to describe the terms.

To avoid confusion between the different representations of an FC-term, in what follows we present the definitions of factorized and non-factorized terms.

Definition 6.11 (Factorized Term). We define the factorization rewriting rule through an attribute A ∈ A, denoted →A, from an FC-term to its factorized form as follows:

f · τ1 + f · τ2 →A f · (τ1 + τ2) for f ∈ A

We denote the normal form of applying the factorization rewriting rule through attribute A on a term τ by τ↓A, and the set of factorized forms of the terms of FC is denoted FC↓. A term after factorization is called a factorized term.

Definition 6.12 (Defactorization). We define the defactorization rewriting rule on an FC-term


as follows:

f · (τ1 + τ2)→d f · τ1 + f · τ2

A normal term resulting from the defactorization rewriting rule is called a non-factorized term. The non-factorized form of a term τ is denoted τ↑. The set of non-factorized terms of FC is denoted FC↑.

Example 6.19. In what follows, we show how factorization and defactorization perform. For factorization we have:

(r · s ⊩ S + r · c ⊩ S + b · s ⊩ S) ↓color = r · (s ⊩ S + c ⊩ S) + b · s ⊩ S

and for defactorization :

r · (s ⊩ S + c ⊩ S) + b · s ⊩ S →d r · s ⊩ S + r · c ⊩ S + b · s ⊩ S
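The factorization step can be sketched in Python. This is our own simplified modelling, not the thesis's notation: a non-factorized FC-term over a single dataset is a list of feature tuples read left to right, and we group summands by their leading feature (a special case of the rule, where the leading feature is the one factored out).

```python
from collections import defaultdict

def factorize(term):
    """Group the summands of a non-factorized term by their leading feature:
    f.t1 + f.t2 -> f.(t1 + t2). A summand is a tuple of features."""
    groups = defaultdict(set)
    for features in term:
        groups[features[0]].add(features[1:])
    return dict(groups)

# r.s sat S + r.c sat S + b.s sat S  ->  r.(s + c) sat S + b.s sat S  (Example 6.19)
tau = [("r", "s"), ("r", "c"), ("b", "s")]
assert factorize(tau) == {"r": {("s",), ("c",)}, "b": {("s",)}}
```

Defactorization is the inverse: re-expanding each group `f: {t1, t2}` back into the flat list of summands `f·t1 + f·t2`.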

From Feature-Cluster Family Term to Tree Structure Using the same notions presented in the previous sections, in what follows we define three functions, which return the set of edge labels, the set of nodes, and the set of transitions of a received FC-term, respectively. These three functions are used in our context to obtain a forest structure from an FC-term. We define the feature function, denoted Θ, which takes a non-factorized FC-term and returns a set of features, as follows:

Θ : FC↑ → P(V)

Θ(S) = ∅

Θ(f ⊩ S) = {f}

Θ(f · F ⊩ S) = {f} ∪ Θ(F ⊩ S)

Θ(τ1 + τ2) = Θ(τ1) ∪ Θ(τ2)

We define the states function, denoted Φ, which takes a non-factorized FC-term and returns a set of FC-terms, as follows:

Φ : FC↑ → P(FC↑)

Φ(S) = {S}

Φ(f ⊩ S) = {f ⊩ S, S}

Φ(f · F ⊩ S) = {f · F ⊩ S} ∪ Φ(F ⊩ S)

Φ(τ1 + τ2) = Φ(τ1) ∪ Φ(τ2)


Moreover, we define the transition function, denoted Ω, which takes a non-factorized FC-term and returns the set of transitions between the associated nodes, as follows:

Ω : FC↑ → P(FC↑ × V × FC↑)

Ω(S) = ∅

Ω(f ⊩ S) = {(S, f, f ⊩ S)}

Ω(f · F ⊩ S) = {(F ⊩ S, f, f · F ⊩ S)} ∪ Ω(F ⊩ S)

Ω(τ1 + τ2) = Ω(τ1) ∪ Ω(τ2)

Now we are ready to introduce the meaning relation, which takes a non-factorized FC-term and returns a forest structure.

Definition 6.13. The meaning relation, denoted [[.]], takes a non-factorized FC-term and returns a triple, representing a forest (or tree) structure, as follows:

[[.]] : FC↑ → GV,FC

[[τ]] = (Θ(τ), Φ(τ), Ω(τ))

where GV,FC is the set of all possible forest structures with edge labels in V and node labels in FC.

Example 6.20. In what follows, we show how a feature-cluster family term is transformed toits equivalent graph structure according to the above rules :

[[r ⊩ S + b · l ⊩ S + b · s ⊩ S]] =

({b, r, l, s},

{S, r ⊩ S, b ⊩ S, b·l ⊩ S, b·s ⊩ S},

{(S, r, r ⊩ S), (b ⊩ S, l, b·l ⊩ S), (S, b, b ⊩ S), (b ⊩ S, s, b·s ⊩ S)})
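The meaning relation can be sketched in Python, assuming the convention of Example 6.20 (the leftmost feature of a product labels the edge leaving the root). The node-naming scheme and function name below are our own.

```python
def meaning(term, root="S"):
    """[[tau]]: build (edge labels, nodes, transitions) from a sum of
    feature-cluster terms, each given as a list of features read root-to-leaf."""
    labels, nodes, trans = set(), {root}, set()
    for features in term:
        parent = root
        prefix = []
        for f in features:
            prefix.append(f)
            child = "*".join(prefix) + "|" + root   # stands for F sat S
            labels.add(f)
            nodes.add(child)
            trans.add((parent, f, child))
            parent = child
    return labels, nodes, trans

# [[r sat S + b.l sat S + b.s sat S]]  (Example 6.20)
labels, nodes, trans = meaning([["r"], ["b", "l"], ["b", "s"]])
assert labels == {"r", "b", "l", "s"}
assert len(nodes) == 5                               # S, r|S, b|S, b*l|S, b*s|S
assert ("S", "b", "b|S") in trans
assert ("b|S", "l", "b*l|S") in trans
```

Shared prefixes (here the feature b) produce a single shared node, which is what makes the image of a family term a tree rather than a disjoint set of paths.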

From Tree Structure to a Feature-Cluster Family Term We define the root function, denoted r, which takes a tree and returns its root. Formally:

r : TV,FC → Q

r(T) = s s.t. {(si, f, s) ∈ ω | si ∈ Q} = ∅

where TV,FC is the set of rooted trees on V and FC. We define the set of edge labels of the children of r(T) as follows:

δ(T) = {f | ∃ s′ ∈ Q s.t. (r(T), f, s′) ∈ ω}

Moreover, in a tree T, the descendant tree directly after edge f, called the derivative tree of T following edge f, is denoted ∂f(T). We define the Ψ function, which takes a tree structure


T and returns a feature term as follows:

Ψ(T) = Σf∈δ(T) f · Ψ(∂f(T)) (6.37)

where Ψ(T) = 1 when δ(T) = ∅. We write f · 1 as f. We define the transform function, denoted ψ, which takes a set of k labeled trees (a forest) and returns an FC-term, as follows:

ψ : GV,FC → FC

ψ(∅) = 0

ψ(T1 ∪ T2) = Ψ(T1) ⊩ r(T1) + Ψ(T2) ⊩ r(T2)

Example 6.21. Suppose the following tree is given :

M = ({f1, f2}, {s, s1, s2}, {(s, f1, s1), (s, f2, s2)})

then the only state to which there is no incoming transition is node s. Hence, we have:

Ψ(M) = f1 · Ψ((∅, {s1}, ∅)) + f2 · Ψ((∅, {s2}, ∅)) = f1 · 1 + f2 · 1 = f1 + f2

and the resulting term equals:

ψ(M) = Ψ(M) ⊩ s
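The recursive reading Ψ can be sketched as follows. The encoding is ours: a tree is given by its transition set, and the returned sum-of-products is a set of root-to-leaf feature tuples, with the empty tuple playing the role of 1.

```python
def psi_tree(trans, root):
    """Psi(T): read a term back from a tree given as (parent, f, child)
    transitions; returns the sum-of-products as a set of feature tuples."""
    children = [(f, c) for (p, f, c) in trans if p == root]
    if not children:          # leaf: Psi = 1, encoded as the empty product
        return {()}
    paths = set()
    for f, c in children:
        for rest in psi_tree(trans, c):
            paths.add((f,) + rest)
    return paths

# the tree M of Example 6.21: root s with edges f1, f2 to two leaves
M = {("s", "f1", "s1"), ("s", "f2", "s2")}
assert psi_tree(M, "s") == {("f1",), ("f2",)}   # Psi(M) = f1 + f2
```

Attaching the root dataset to the result, as in ψ(M) = Ψ(M) ⊩ s, then yields the FC-term of the example.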

Definition 6.14. A term resulting from a CCTree structure, or equivalently transformable to a tree structure representing a CCTree, is called a CCTree term.

Example 6.22. Suppose the CCTree of Figure 6.1 is given. The tree structure of this CCTree can be written as follows:

({red, blue, small, large}, {S, Sr, Sb, Sb·s, Sb·l},

{(S, red, Sr), (S, blue, Sb), (Sb, small, Sb·s), (Sb, large, Sb·l)})

The CCTree term resulting from this CCTree is equal to:

red ⊩ S + blue · small ⊩ S + blue · large ⊩ S

Proposition 6.23. For each non-factorized FC-term τ, there exists at least one forest structure in GV,FC that represents τ. Moreover, for each labeled forest structure T in GV,FC, there exists a unique term that represents T.

Proof. The proof is straightforward from the proposed methodology of transforming a forest structure into a term and vice versa.


[Figure: root S with an edge labeled red to node Sr and an edge labeled blue to node Sb; Sb has an edge labeled small to Sb·s and an edge labeled large to Sb·l.]

Figure 6.1 – A Small CCTree

Theorem 6.24. The meaning relation [[.]] adequately abstracts the graph structure resulting from a feature-cluster (family) term on V and the same fixed dataset of elements S ⊆ S. This means that for two non-factorized FC-terms τ and τ′ we have:

[[τ ]] ≈ [[τ ′]] ⇒ τ ≡ τ ′ (6.38)

Intuitively, relation 6.38 expresses that if two forest structures resulting from two FC-terms are equal, we can conclude with certainty that the original terms were equal as well. In other words, if τ ≢ τ′ then we can conclude that [[τ]] ≉ [[τ′]].

Proof. From Proposition 6.23, each non-factorized FC-term is represented by at least one forest structure, so [[τ]] and [[τ′]] are well defined. Now, suppose that the left-hand side of 6.38 is satisfied. Hence, we have:

[[τ]] ≈ [[τ′]] ⇒ Θ(τ) = Θ(τ′), Φ(τ) = Φ(τ′), Ω(τ) = Ω(τ′) (6.39)

⇒ block(τ) = block(τ′) ⇒ τ ≡ τ′ (6.40)

where 6.39 results from the equivalent graph structures of τ and τ′, and 6.40 follows from Φ(τ) = Φ(τ′) and the fact that the two terms originated from the same dataset.

The following example shows that relation 6.38 does not hold from right to left.

Example 6.25. The two following feature-cluster family terms are equivalent in terms of the term comparison of Definition 6.6, i.e. we have:

f1 · f2 ⊩ S + f1 · f3 ⊩ S ≡ f2 · f1 ⊩ S + f3 · f1 ⊩ S


but their tree representations are not equivalent, since we have:

[[f1 · f2 ⊩ S + f1 · f3 ⊩ S]]

= ({f1, f2, f3}, {S, f1 ⊩ S, f1·f2 ⊩ S, f1·f3 ⊩ S},

{(f1 ⊩ S, f2, f1·f2 ⊩ S), (f1 ⊩ S, f3, f1·f3 ⊩ S), (S, f1, f1 ⊩ S)})

[[f2 · f1 ⊩ S + f3 · f1 ⊩ S]]

= ({f1, f2, f3}, {S, f2 ⊩ S, f3 ⊩ S, f2·f1 ⊩ S, f3·f1 ⊩ S},

{(f2 ⊩ S, f1, f2·f1 ⊩ S), (S, f2, f2 ⊩ S), (f3 ⊩ S, f1, f3·f1 ⊩ S), (S, f3, f3 ⊩ S)})

where the first one contains four nodes, whilst the second one contains five. This means that they are not isomorphic graphs.

This example shows that the commutativity of "·" is not an appropriate property for full abstraction. In what follows, we show that the reverse of 6.38 is satisfied if an order is fixed on the set of features, which solves the problem of the commutativity of the multiplication "·".

Definition 6.15 (Ordered Features). We say that the set of features V is an ordered set of features if there is an order relation "<" on V such that (V, <) is a totally ordered set. This means that for any f1, f2 ∈ V we either have f1 < f2 or f2 < f1. We say F1 is exactly equal to F2, denoted F1 ≅ F2, if they are equal considering the order of the features.

Definition 6.16 (Order Rewriting Rule). Let an ordered set of features (V, <) be given. We say an FC-term is an ordered FC-term on (V, <) if it is the normal form of applying the following rewriting rule:

f1 · f2 ⊩ S →O f2 · f1 ⊩ S if f1 < f2, ∀ f1, f2 ∈ V

Moreover, we define a rewriting rule which orders the features of an FC-term based on an attribute A ∈ A as follows:

f2 · f1 ⊩ S →O,A f1 · f2 ⊩ S if f1 ∈ A

We denote the normal form of applying the above rewriting rule on a term τ, based on attribute A, by τ ⇓A.

Example 6.26. Suppose that the set of features V1 = {red, blue, small, large} is given. Without loss of generality, fixing a strict order "<" among them as "red < blue < small < large" results in (V1, <) being a totally ordered set. The following examples show how ordered FC-terms on V1 are obtained by applying the order rewriting rule:

red · small ⊩ S → small · red ⊩ S

red · small ⊩ S + blue · large ⊩ S → small · red ⊩ S + large · blue ⊩ S

Moreover, red · small ≇ small · red, whilst red · small ≅ red · small.
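Normalizing products under the order rewriting rule can be sketched as follows, assuming the order of Example 6.26 (under rule →O, smaller features move right, so the normal form lists features in descending order). The encoding and names are ours.

```python
# fixed total order of Example 6.26: red < blue < small < large (our assumption)
ORDER = {"red": 0, "blue": 1, "small": 2, "large": 3}

def normalize(features):
    """Normal form of ->O on a product of features: sort the factors in
    descending order, so the largest feature comes first."""
    return tuple(sorted(features, key=ORDER.__getitem__, reverse=True))

assert normalize(("red", "small")) == ("small", "red")      # red.small -> small.red
assert normalize(("blue", "large")) == ("large", "blue")
# 'exactly equal' comparison of ordered products reduces to tuple equality
assert normalize(("red", "small")) == normalize(("small", "red"))
```

Putting every product into this normal form is what makes the "exactly equal" comparison of the next definition decidable by plain syntactic equality.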


Definition 6.17 (Ordered FC-term Comparison). We say two ordered FC-terms on (V, <) are exactly equal, denoted by ∼, defined as the smallest relation for which the terms respect one of the following relations:

1. if S1 = S2 then S1 ∼ S2

2. if S1 = S2 ∧ F1 ∼= F2 then F1 S1 ∼ F2 S2

3. if ∀τi ∈ block(τ) ∃τj ∈ block(τ′) s.t. τi ∼ τj and ∀τj ∈ block(τ′) ∃τi ∈ block(τ) s.t. τj ∼ τi then τ ∼ τ′

Example 6.27. Let us consider the ordered set of features of Example 6.26. The following examples show how two ordered FC-terms are compared:

red · small S ≁ small · red S

red · small S ∼ red · small S

red · small S + blue S ∼ blue S + red · small S

Theorem 6.28. Let (V, <) be a totally ordered set of features and S ⊆ S. The meaning relation [[.]] abstracts the forest (tree) structure resulting from the ordered non factorized FC-terms on V and S. This means that, considering τ and τ′ to be two arbitrary ordered non factorized FC-terms on (V, <) and S ⊆ S, we have:

τ ∼ τ ′ ⇒ [[τ ]] ≈ [[τ ′]] (6.41)

Proof. Suppose the left side of 6.41 holds. This means that for each feature-cluster term τi ∈ τ there exists a feature-cluster term τj ∈ τ′ such that τi and τj are exactly equal. This property causes the set of transitions of [[τi]] to be equal to the set of transitions of [[τj]]. Consequently, 6.41 follows. More precisely, we have:

τ ∼ τ′ ⇒ ∀τi ∈ block(τ) ∃τj ∈ block(τ′) s.t. τi ∼ τj (⇒ [[τi]] ≈ [[τj]]), (6.42)

∀τj ∈ block(τ′) ∃τi ∈ block(τ) s.t. τj ∼ τi (⇒ [[τi]] ≈ [[τj]])

⇒ [[τ]] ≈ [[τ′]] (6.43)

Now we are ready to present the main theorem of this section, which provides the conditions of full abstraction.

Theorem 6.29 (Main Theorem). Let the ordered set of features (V, <) and the set of elements S ⊆ S be given. The meaning function [[.]] fully abstracts the ordered feature-cluster family terms on (V, <) and S. This means that for two arbitrary ordered feature-cluster family terms τ and τ′ on V and S, we have:

[[τ ]] ≈ [[τ ′]]⇔ τ ∼ τ ′ (6.44)

Proof. The proof is straightforward from the proofs of Theorems 6.24 and 6.28.


6.5 Relations on Feature-Cluster Algebra

In this section, we define several relations on feature-cluster algebra and discuss the properties of the proposed relations. Here, we will use the same notions and symbols introduced in 6.3.1.

Definition 6.18 (Attribute Division). Attribute division (DA) is a function from A × FC↑ to {True, False}, which gets an attribute and a non factorized FC-term as input; it returns True or False as follows:

DA : A × FC↑ → {True, False}


DA(A,S) = False

DA(A, f S) = True if f ∈ A

DA(A, f S) = False if f /∈ A

DA(A, f · F S) = DA(A, f S) ∨DA(A,F S)

DA(A, τ1 + τ2) = DA(A, τ1) ∧DA(A, τ2)

The concept of attribute division is used to order the attributes present in a term, which will be discussed later.

Example 6.30. In the following we show how attribute division performs :

DA(color, r · s S + r · c S + b · s S)

= DA(color, r · s S) ∧DA(color, r · c S) ∧DA(color, b · s S) = True
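The recursive definition above can be sketched in a few lines of Python. The encoding (attributes as sets of features, terms as lists of blocks) is our own assumption for illustration:

```python
# Illustrative sketch (hypothetical encoding): attribute division D_A returns
# True iff every block of the non-factorized term contains at least one
# feature of the attribute; a bare cluster S contributes False.
def divides(attribute, term):
    # term: list of (features, cluster) blocks; attribute: set of features
    return all(any(f in attribute for f in fs) for fs, _ in term)

# Example 6.30: color = {r, b} divides every block of r·s S + r·c S + b·s S
tau = [(("r", "s"), "S"), (("r", "c"), "S"), (("b", "s"), "S")]
print(divides({"r", "b"}, tau))  # True
print(divides({"c", "t"}, tau))  # shape misses two blocks: False
```

Note that a bare cluster block yields False, matching DA(A, S) = False in the definition.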

Definition 6.19 (Initial). We define the initial (δ) function from P(FC↑) to P(F), which gets a set of ordered non factorized terms on (V, <) and returns the set of the first features of each term as follows:

δ : P(FC ↑)→ P(F)

δ(∅) = 0

δ(S) = 1

δ(f · F S) = f

δ(τ1 + τ2) = δ(τ1) ∪ δ(τ2)

δ(τ1, τ2) = δ(τ1) ∪ δ(τ2)

with the following property :

δ(X,Y ) = δ(X) ∪ δ(Y )


where X, Y ∈ P(FC↑). In the case that the input set contains just one term, we remove the brackets, i.e. δ({τ}) = δ(τ) when |{τ}| = 1. Moreover, when the output set also contains just one element, for the sake of simplicity we remove the brackets, i.e. δ(X) = {f} = f for X ∈ P(FC↑).

Example 6.31. In the following we show the result of the initial function on a pair of terms:

δ({S, r · s S}) = {1, r}

Definition 6.20 (Derivative). The Brzozowski derivative [23], denoted as u−1S, of a set S of strings and a string u is defined as the set of all suffixes obtainable from a string in S by cutting off its prefix u. In our context, importing the idea of Brzozowski, we define the derivative, denoted by ∂, as a function which gets an ordered non factorized FC-term on (V, <) and returns the term (set of terms) obtained by cutting off the first features, as follows:

∂ : FC ↑→ P(FC)

∂(S) = ∅

∂(f S) = S

∂(f · F S) = F S

∂(τ1 + τ2) = ∂(τ1) ∪ ∂(τ2)

Note: the functions initial (δ) and derivative (∂) are overloaded, depending on whether the input is a tree or a term.
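Under the same hypothetical list-of-blocks encoding used earlier, the two functions can be sketched together:

```python
# Illustrative sketch (hypothetical encoding) of initial (δ) and derivative
# (∂).  δ collects the first feature of every block (a bare cluster
# contributes the unit "1"); ∂ cuts off the first feature of every block,
# Brzozowski-style, and drops blocks that were already bare clusters.
def initial(term):
    return {fs[0] if fs else "1" for fs, _ in term}

def derivative(term):
    return [(fs[1:], s) for fs, s in term if fs]

term = [((), "S"), (("r", "s"), "S")]   # {S, r·s S}
print(sorted(initial(term)))            # ['1', 'r']  (cf. Example 6.31)
print(derivative(term))                 # [(('s',), 'S')]
```

The bare cluster is dropped by `derivative`, matching ∂(S) = ∅ in the definition.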

Definition 6.21 (Order of Attributes). We say attribute B is smaller than or equal to attribute A on the non factorized term τ ∈ FC↑, denoted as B ≤τ A, if the number of blocks of τ that B divides is less than or equal to the number of blocks that A divides. Formally, B ≤τ A implies that:

|{τi ∈ block(τ) | DA(B, τi) = True}| ≤ |{τi ∈ block(τ) | DA(A, τi) = True}|

Given a set of attributes A and a term τ, the set (A, ≤τ) is a lattice. We denote the upper bound of this set as uA,τ. This means that we have: ∀A ∈ A ⇒ A ≤τ uA,τ.

Example 6.32. In the following we show how the order of attributes of a term is identified. Suppose the term τ = r · s S + r · c S + b · s S is given. We have:

block(τ) = {r · s S, r · c S, b · s S}

consequently,

|{τi ∈ block(τ) | DA(shape, τi) = True}| = 1

≤ |{τi ∈ block(τ) | DA(size, τi) = True}| = 2

≤ |{τi ∈ block(τ) | DA(color, τi) = True}| = 3


which means that we have :

shape ≤τ size ≤τ color
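The counting in Example 6.32 can be sketched directly. The attribute sets below (e.g. `{"c", "t"}` for shape) are hypothetical fillers consistent with the example, not values stated in the text:

```python
# Illustrative sketch (hypothetical encoding) of Definition 6.21: attributes
# are ranked by how many blocks of τ they divide; the upper bound u_{A,τ} is
# the attribute dividing the most blocks.
def blocks_divided(attribute, term):
    return sum(1 for fs, _ in term if any(f in attribute for f in fs))

tau = [(("r", "s"), "S"), (("r", "c"), "S"), (("b", "s"), "S")]
attrs = {"shape": {"c", "t"}, "size": {"s", "l"}, "color": {"r", "b"}}
ranked = sorted(attrs, key=lambda a: blocks_divided(attrs[a], tau))
print(ranked)   # ['shape', 'size', 'color'] -> the upper bound is 'color'
```

This reproduces shape ≤τ size ≤τ color, with color as uA,τ.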

Recall that not having a predefined order among features creates a problem in the full abstraction of terms. To this end, here we propose a way to order the set of features which is appropriate to our problem. First of all, given a feature-cluster family term τ, we find the order of attributes according to Definition 6.21; if for two arbitrary attributes A and A′ we have A =τ A′, without loss of generality we choose a strict order among them, say A ≺ A′. Then within each attribute we arbitrarily order the features. It is important that the features of a smaller attribute always be smaller than the features of a greater attribute. For example, if size ≺ color, we consider the order of features as small < large < blue < red, whilst all the features of color are greater than all the features of size.

Definition 6.22 (Ordered Unification). Ordered unification (F) is a partial function from P(A) × FC↑ to FC, which gets a set of attributes and a non factorized term; it returns the normal form of applying the rewriting rule A−→O introduced in Definition 6.16, iteratively, based on the order of attributes on the received term, as follows:

F : P(A)× FC ↑→ FC

F(∅, τ↑) = τ

F({A}, τ↑) = τ ⇓A

F(A, τ↑) = F({uA,τ}, F(A − {uA,τ}, τ↑))

The normal form of ordered unification is called a unified term. By F∗(τ) we mean that F is performed iteratively on the set of ordered attributes on τ to get the unified term.

Example 6.33. To find the unified form of τ1 = r · s S + r · c S + b · s S, we have:

F∗(τ1) = F({shape, color, size}, τ1↑)

= F({color}, F({size}, F({shape}, τ1))) = r · s S + r · c S + b · s S

Definition 6.23 (Component Relation). Given two ordered non factorized FC-terms τ1 and τ2 on (V, <), we define the component relation, denoted by ∼1, as the first-level comparison of terms, as follows:

τ1 ∼1 τ2 ⇔ δ(τ1) = δ(τ2)

Proposition 6.34. The component relation is an equivalence relation on the set of ordered non factorized FC-terms.


Proof. For ordered non factorized FC-terms τ1, τ2 and τ3, we have:

reflexivity: τ1 ∼1 τ1, since δ(τ1) = δ(τ1);

symmetry: if τ1 ∼1 τ2 then τ2 ∼1 τ1, since δ(τ1) = δ(τ2) implies δ(τ2) = δ(τ1);

transitivity: if τ1 ∼1 τ2 and τ2 ∼1 τ3 then τ1 ∼1 τ3, since δ(τ1) = δ(τ2) and δ(τ2) = δ(τ3) imply δ(τ1) = δ(τ3).

Definition 6.24 (Component). Let us consider that the ordered term τ ∈ FC↑ on (V, <) is given. The equivalence class of τ′ ∈ block(τ) is called a component of τ, and it is formally defined as:

[τ′]τ = {τi ∈ block(τ) | τ′ ∼1 τi}

The set of all components of the term τ through the equivalence relation ∼1 is denoted by block(τ)/∼1, or simply τ/∼1, i.e. we have:

τ/∼1 = {[τi]τ | τi ∈ block(τ)}
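Under the hypothetical encoding used in earlier sketches, computing the components of a term amounts to grouping its blocks by first feature:

```python
# Illustrative sketch (hypothetical encoding) of Definitions 6.23-6.24:
# two blocks are ∼1-related iff they share the same first feature, so the
# components of an ordered term are obtained by grouping blocks on it.
def components(term):
    comps = {}
    for fs, s in term:
        comps.setdefault(fs[0] if fs else "1", []).append((fs, s))
    return comps

tau = [(("r", "s"), "S"), (("r", "c"), "S"), (("b", "s"), "S")]
print(components(tau))
# {'r': [(('r', 's'), 'S'), (('r', 'c'), 'S')], 'b': [(('b', 's'), 'S')]}
```

For the running example this yields the two components [r · s S] = {r · s S, r · c S} and [b · s S] = {b · s S}.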

Definition 6.25 (Component Order). Let X and Y be two sets of ordered non factorized FC-terms on (V, <). We say X is smaller than Y, denoted as X < Y, if:

X < Y ⇔ ∀f ′ ∈ δ(X),∀f ′′ ∈ δ(Y ) f ′ < f ′′

Specifically, let τ be an ordered non factorized FC-term on (V, <). We order the components of τ according to the order of features in V as follows:

[τ ′]τ < [τ ′′]τ ⇔ ∀f ′ ∈ δ([τ ′]),∀f ′′ ∈ δ([τ ′′]) f ′ < f ′′

It is noticeable that |δ([τ′])| = |δ([τ′′])| = 1 for all τ′, τ′′ ∈ block(τ), since the first features of all elements in a component are equal. We denote the i'th component of τ/∼1 as [τ]i. Due to the fact that the features are strictly ordered, the term components are also strictly ordered.

Definition 6.26 (Well Formed Term). The well formed function, denoted as W, is a binary function from FC↑ to {True, False}, which gets a unified non factorized FC-term; it returns True if the set of first features of its components is equal to a sort of A to which these features belong; it returns False otherwise. Formally:

W (τ) =

True if δ(τ/ ∼1) = sort(δ([τi]τ )) ∀τi ∈ block(τ)

False otherwise

where δ(τ/∼1) = sort(δ([τ1]τ)) means that the set of the first features of the components of the term τ is equal to the attribute that the first feature belongs to. A unified term τ is called a well formed term if W(τ) = True. An atomic term is a well formed term.


Example 6.35. The unified term of Example 6.33, τ = r · s S + r · c S + b · s S, is a well formed term, since we have:

δ(τ/∼1) = δ({r · s S, r · c S, b · s S}) = {r, b}

sort(δ([r · s S])) = sort(r) = {r, b}

consequently W (τ) = True.
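A minimal sketch of the well formed check, again under our hypothetical encoding (attributes as a list of feature sets):

```python
# Illustrative sketch (hypothetical encoding) of W (Definition 6.26): a
# unified term is well formed when the first features of its blocks form
# exactly one whole attribute (the sort those features belong to); a bare
# cluster block has no first feature, so it makes the term not well formed.
def well_formed(term, attributes):
    if any(not fs for fs, _ in term):
        return False
    firsts = {fs[0] for fs, _ in term}
    return any(firsts == attr for attr in attributes)

tau = [(("r", "s"), "S"), (("r", "c"), "S"), (("b", "s"), "S")]
attrs = [{"r", "b"}, {"s", "l"}, {"c", "t"}]
print(well_formed(tau, attrs))                                 # True (Example 6.35)
print(well_formed([(("r", "s"), "S"), (("c",), "S")], attrs))  # False
```

The second call fails because its first features {r, c} span two different attributes.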

It is noticeable that in an ordered CCTree term all first features belong to the same attribute. Hence, in what follows we exploit the concept of well formed term to identify whether a term represents a CCTree term or not.

6.5.1 CCTree Term Schema

We know that each CCTree term is a feature-cluster family term. However, conversely, a feature-cluster family term does not necessarily represent a CCTree term. It would be interesting to know which feature-cluster family terms represent CCTree terms. This knowledge provides us with the opportunity to iteratively use the rules on CCTree terms.

Theorem 6.36. A unified term represents a CCTree term, or is transformable to a CCTree structure, if and only if it can be written in the following form:

F∗(τ) = Σi fi · τi (6.45)

such that W(F∗(τ)) = True, i.e. the unified form of the received term is a well formed term, and the unified form of each τi is a well formed term as well (W(τi) = True) which respects the above formula.

Proof. First we show that a unified term obtained from a CCTree structure satisfies equation 6.45. In a CCTree, the attribute used for division in the root has the greatest number of occurrences in the non factorized CCTree term (all blocks of the CCTree term contain one of the features of this attribute). According to 6.37, for transforming the tree to a term, the first features of components are specified from δ(T) = {f | ∃ s′ ∈ Q s.t. (sT, f, s′) ∈ ω}, where in a CCTree they all belong to the same sort, i.e. we have:

δ(T) = {f | ∃ s′ ∈ Q s.t. (sT, f, s′) ∈ ω} = sort(f) ⇒ W(ψ(T)) = True

We call the tree following a child of the root a new tree. It is noticeable that each new tree is a CCTree by itself; hence, it respects 6.45. By considering the trees following the new trees as new trees themselves, the aforementioned process is iteratively repeated for all new trees, due to the iterative structure of the CCTree, i.e. from 6.37, we have:

W (ψ(∂f (T ))) = True ∀ f ∈ δ(T )


This means that if the input tree structure is a CCTree, then the obtained term respects the above formula.

On the other hand, a unified term that respects equation 6.45 can be converted to a CCTree structure. To this end, the τi's are the components of τ after separating their first features (the fi's). The set of the first features of the components of the term constitutes the transitions of the first division from the root of the CCTree, i.e.:

Ω( Σ_{[τi] ∈ τ/∼1} δ([τi]) · Σ_{τk ∈ [τi]} ∂(τk) ) = ⋃_{[τi] ∈ τ/∼1} (S, δ([τi]), Σ_{τk ∈ [τi]} ∂(τk))

where S is the main dataset the term originated from. Since the term is well formed, it guarantees that the labels of children belong to the same sort, as required by a CCTree. Due to the iterative rule for successive components, the structure of the CCTree is constructed iteratively. Note that the condition that the first features of components be equal to a sort guarantees that, in the process of transforming the term to its equivalent tree structure, all the features of a selected attribute exist.

With the use of the above theorem, we propose a rewriting system which is applied to automatically check whether a term represents a CCTree term or not.

CCTree Rewriting System

To verify automatically whether a term is a CCTree term, a set of conditional rewriting rules is provided in Table 6.1. The term ∅ in this table refers to a null term. In this regard, the CCTree rewriting system is applied on a received term; the term is a CCTree term if the only irreducible term is ∅.

In this rewriting system, ⟦f(τ)⟧ means that f(τ) is replaced by its semantics, whilst the result is considered as one unique term, not several terms. Furthermore, τ1 : τ2 contains two terms τ1 and τ2, whilst each one is considered as a new term. Moreover, [τ]i refers to the i'th component of τ/∼1.

(1) (τ ∈ A) | τ → ∅
(2) (τ ≠ F∗(τ)) | τ → ⟦F∗(τ)⟧
(3) (τ = F∗(τ)) ∧ (W(τ)) ∧ (τ ∉ A) | τ → ⟦Σ_{τk ∈ [τ]1} ∂(τk)⟧ : . . . : ⟦Σ_{τk ∈ [τ]|τ/∼1|} ∂(τk)⟧

Table 6.1 – CCTree Rewriting System

The first rule of Table 6.1 specifies that if a term is an atomic term, it is directed to ∅. The second rule expresses that if a term is not in unified form, it is required to transform it to its unified representation. The third rule specifies that if a non-atomic unified term is well formed,


it is divided into the derivatives of its components. The last rule is used to verify whether the CCTree conditions are satisfied for the successive components or not. These rules follow the structure of Theorem 6.36 in identifying whether a term is a CCTree term or not.

Example 6.37. Suppose that the term τ1 = a1 S + b1 S, with the set of attributes A = {a1, a2}, B = {b1, b2}, is given. We apply the CCTree rewriting rules to automatically verify whether τ1 is a CCTree term or not. The term τ1 is not atomic. Moreover, we have τ1 = F∗(τ1) and W(τ1) = False. There is no CCTree rewriting rule which can be applied, whilst this term is not ∅. This means that the received term τ1 is not a CCTree term.

Example 6.38. With the use of the CCTree rewriting system, we show that the term τ2 = a1 S + a2 S, with the set of attributes A = {a1, a2}, B = {b1, b2}, is a CCTree term.

(τ2 = F∗(τ2)) ∧ (W(τ2)) | a1 S + a2 S  −(3)→  S : S  −(1)→  ∅ : ∅

There is no irreducible term except ∅, hence, τ2 is a CCTree term.
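The check carried out in Examples 6.37 and 6.38 can be sketched as a recursive Python function. We assume (our assumption, not the thesis') that the input term is already unified, so rule (2) never fires:

```python
# Illustrative sketch (hypothetical encoding) of the CCTree rewriting system
# of Table 6.1, for terms that are already unified: a term reduces to ∅ iff
# it is atomic (rule 1), or it is well formed and every derived component
# again reduces to ∅ (rule 3); otherwise no rule applies and it is stuck.
def well_formed(term, attributes):
    if any(not fs for fs, _ in term):
        return False
    return any({fs[0] for fs, _ in term} == attr for attr in attributes)

def is_cctree(term, attributes):
    if all(not fs for fs, _ in term):       # atomic term: rule (1)
        return True
    if not well_formed(term, attributes):   # no rule applies: not a CCTree term
        return False
    comps = {}                              # rule (3): derive each component
    for fs, s in term:
        comps.setdefault(fs[0], []).append((fs[1:], s))
    return all(is_cctree(sub, attributes) for sub in comps.values())

attrs = [{"a1", "a2"}, {"b1", "b2"}]
print(is_cctree([(("a1",), "S"), (("b1",), "S")], attrs))  # Example 6.37: False
print(is_cctree([(("a1",), "S"), (("a2",), "S")], attrs))  # Example 6.38: True
```

The two calls reproduce the verdicts of Examples 6.37 and 6.38.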

6.5.2 Termination and Confluence of the Rewriting System

In the present section, we first explain what termination and confluence of a rewriting system mean. Then, through several theorems, we prove that our proposed rewriting system is terminating and confluent. Termination and confluence are desirable properties of a rewriting system, which guarantee, first, that applying the rewriting rules of the proposed system will not result in an infinite loop of applications, and furthermore, that applying the rewriting rules always yields a unique result.

Termination and Confluence of a Rewriting System A rewriting system R is terminating if there is no infinite derivation a1 → a2 → a3 → . . . in R. This implies that every derivation eventually ends in a normal form [43]. Lankford's theorem claims that a rewriting system R is terminating if, for some reduction ordering >, we have x > y for all rules x → y ∈ R. An order is a reduction ordering if it is monotonic and fully invariant [43]. A relation is monotonic if it preserves the order when a term is added to or removed from both sides, and it is fully invariant if it preserves the order when a term is substituted in both sides of the relation [43]. An element a in the rewriting system R is locally confluent if for all b, c ∈ R such that a → b and a → c, there exists d ∈ R such that b →∗ d and c →∗ d. If every a ∈ R is locally confluent, then → is called locally confluent. Newman's lemma expresses that a terminating rewriting system is confluent if and only if it is locally confluent [43].

Theorem 6.39. The CCTree rewriting system is terminating.


Proof. To prove this theorem we first define a reduction order on the rules of the CCTree rewriting system. To this end, we define the size function, which gets an FC-term and returns the number of features appearing in the term, as follows:

size : FC→ N

size(S) = 1

size(f S) = 1

size(F · τ) = |F |+ size(τ)

size(τ1 + τ2) = size(τ1) + size(τ2)

where we consider size(∅) = 0 and size(τ1 : τ2) = size(τ1) + size(τ2). We say FC-term τ1 is less than FC-term τ2, denoted by τ1 ≤ τ2, if the number of features in τ1 is less than the number of features in τ2, or equivalently size(τ1) ≤ size(τ2). This partial ordering is well-founded, since there is no infinite descending chain (the number of features is finite). It is monotonic, because the relation between the numbers of features in two terms is preserved when a term is added to or removed from both sides. Furthermore, substitution on the left and right sides preserves the order on the number of features, i.e. it is fully invariant. Therefore, the proposed ordering is a reduction ordering.

Considering that ∅ is a null term containing no feature, in the first rule we have atomic term > ∅. In the second rule, the conditional rule is applied only when the term is not equal to its unified form, whilst the ordered unification function, if applied, does not change the number of features, i.e.

τ ≥ F∗(τ) for τ ≠ F∗(τ)

since size(τ) = size(F∗(τ)). It is worth noticing that this rule is a one-step rule, such that once the term is unified, the other rules are exploited. In the third rule, the first features of all components of the left term are removed, i.e. the size (number of features) of the left-hand term is greater than the size of the right-hand one. Hence, the proposed reduction ordering ≤ on the CCTree rewriting system shows that the system is terminating.

Theorem 6.40. The CCTree rewriting system is locally confluent.

Proof. In the CCTree rewriting system, all rules are conditional and there is no term for which two (or more) conditions are satisfied at the same time. This means that the possibility of having τ → τ1 and τ → τ2 where τ1 ≠ τ2 does not occur. Hence, the rewriting system is locally confluent.

Theorem 6.41. The CCTree rewriting system is confluent.


Proof. According to Newman's lemma, the CCTree rewriting system being terminating (Theorem 6.39) and locally confluent (Theorem 6.40), it is confluent.

6.6 CCTrees Parallelism

It is not uncommon that a data mining process requires several days or weeks to be completed. Parallel computing systems bring significant benefits, such as high performance, to the processing of massive databases [33]. Parallel clustering is a methodology proposed to alleviate the problem of time and memory usage in clustering large amounts of data [94], [18]. SPMD (Single Program Multiple Data) parallelism is the most common approach in parallel computation [135]. In an SPMD parallel algorithm, multiple computers run the same algorithm on different subsets and exchange the partial results to merge them into a final result.

In the present work, we propose SPMD parallelism of CCTrees in terms of a rewriting system. To this end, a large amount of data to be clustered is divided among two (or more) parallel computers, where each computer clusters the received dataset with the use of the CCTree algorithm. The result of each CCTree is transformed to its equivalent CCTree term. The resulting CCTree terms are reported to a master computer for composition. The CCTree terms are composed automatically based on our proposed composition rewriting rules (Table 6.2).

Figure 6.2 – Parallel Clustering Workflow.

The composition result is reported to each computer to homogenize all the CCTree terms, and consequently the structure of all CCTrees (Figure 6.2). Getting a CCTree term from the composition of the received terms provides us with two advantages: first, the process of parallelism can be continued iteratively; furthermore, it explains how the sets of clusters resulting from two (or more) CCTrees can be merged. To address the composition process, a set of composition rewriting rules (Table 6.2) is proposed to automatically get a CCTree term when a term is not a CCTree term.


The split relation, the fourth rule of Table 6.2, is added to the rules of Table 6.1 to obtain a CCTree term from a non-CCTree term.

Definition 6.27 (Split). Let a unified term τ ∈ FC↑ on (V, <) and the set of attributes A be given. Considering uA,τ as the upper bound attribute of τ, we define the split relation as follows:

split(τ) = τ, if W(τ) = True

split(τ) = Σ_{τi ∈ block(τ)} ζ(τi), if W(τ) = False

where:

ζ(τi) = τi, if DA(uA,τ, τi) = True

ζ(τi) = (Σ_{ai ∈ uA,τ} ai) · τi, if DA(uA,τ, τi) = False

This means that all blocks of τ which do not contain any feature of uA,τ are multiplied by the sum of the features of uA,τ.

In the following examples we show how the split relation is applied.

Example 6.42. Let us consider that τ1 = r · s S + r · c S + b · s S is given. We have W(r · s S + r · c S + b · s S) = True, i.e. τ1 is a well formed term, which results in:

split(r · s S + r · c S + b · s S) = r · s S + r · c S + b · s S

Example 6.43. Suppose the term τ2 = r · s S + c S + b S is given. We have :

W(r · s S + c S + b S) = False

hence, τ2 is not a well formed term. Considering uA,τ2 = color, we have:

DA(color, r · s S) = True

DA(color, c S) = False

DA(color, b S) = True

which results in :

split(r · s S + c S + b S) = r · s S + (r + b) · c S + b S

= r · s S + r · c S + b · c S + b S
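The split relation can be sketched directly on the hypothetical list-of-blocks encoding used in the earlier sketches:

```python
# Illustrative sketch (hypothetical encoding) of split (Definition 6.27):
# blocks containing no feature of the upper bound attribute are multiplied
# by the sum of its features; all other blocks are kept unchanged.
def split(term, upper):
    out = []
    for fs, s in term:
        if any(f in upper for f in fs):
            out.append((fs, s))
        else:
            out.extend(((a,) + fs, s) for a in sorted(upper))
    return out

# Example 6.43: split(r·s S + c S + b S) with u = color = {r, b}
tau2 = [(("r", "s"), "S"), (("c",), "S"), (("b",), "S")]
print(split(tau2, {"r", "b"}))
# [(('r', 's'), 'S'), (('b', 'c'), 'S'), (('r', 'c'), 'S'), (('b',), 'S')]
```

Up to the order of the added features, this is the term r · s S + r · c S + b · c S + b S of Example 6.43.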

It is worth noticing that when a term is not a CCTree term, it is possible to infer this from its unified form, when the first features of its components do not belong to the same attribute. Therefore, the split rule is proposed to create a well formed term from a non-CCTree term.

In what follows, we add the split rule to the previous rewriting system; it is used when a term is not a CCTree term, in order to obtain a CCTree term.


6.6.1 Composition Rules

The composition rewriting rules to get a CCTree term from a non-CCTree term are presented in Table 6.2. In the proposed rewriting system, ⟦f(τ)⟧ means that f(τ) is replaced by its semantics, whilst the result is considered as one unique term, not several terms. Furthermore, τ1 : τ2 contains two terms τ1 and τ2, whilst each one is considered as a new term. Moreover, [τ]i refers to the i'th component of τ/∼1.

(1) (τ ∈ A) | τ → ∅
(2) (τ ≠ F∗(τ)) | τ → ⟦F∗(τ)⟧
(3) (τ = F∗(τ)) ∧ (W(τ)) ∧ (τ ∉ A) | τ → ⟦Σ_{τk ∈ [τ]1} ∂(τk)⟧ : . . . : ⟦Σ_{τk ∈ [τ]|τ/∼1|} ∂(τk)⟧
(4) (τ = F∗(τ)) ∧ (¬W(τ)) | τ → ⟦split(τ)⟧

Table 6.2 – Composition Rewriting System

Compared to Table 6.1, only the fourth rule (the split rule) is added. This rule guarantees that if a term is not a CCTree term, by splitting the term based on the upper bound attribute we may obtain a CCTree term.

6.6.2 CCTree Term From Composition Rewriting Rules

Here we briefly explain how to find a CCTree term from a non-CCTree term with the use of the composition rewriting system. To this end, first of all, the set of attributes A describing the received term τ is provided. Note that in a categorical clustering algorithm, the set of attributes is known beforehand. The set of attributes and the non-CCTree term are given to the composition rewriting system. When the conditions of the rule (τ = F∗(τ)) ∧ (W(τ)) | τ → ⟦Σ_{τk ∈ [τ]1} ∂(τk)⟧ : . . . : ⟦Σ_{τk ∈ [τ]|τ/∼1|} ∂(τk)⟧ are respected for a term τ, we save τ. Then all ⟦Σ_{τk ∈ [τ]i} ∂(τk)⟧ of τ are replaced by their own successive terms respecting this rule. This process is repeated iteratively until reaching an atomic term in all components of the term. The result of this process is the desired CCTree term.

Example 6.44. Suppose that the addition of two CCTree terms is given as τ = a1 S + a2 S + b1 S′ + b2 S′, with the set of attributes A = {a1, a2}, B = {b1, b2}. It is easy to verify from the rules of Table 6.1 that τ is not a CCTree term. We are interested in finding a CCTree term from the received non-CCTree term τ, with the use of the composition rewriting system. To this end we have:

(i) (τ = F∗(τ)) ∧ (¬W(τ)) | τ −(4)→ ⟦split(τ)⟧

(ii) ⟦split(τ)⟧ = τ′ = a1 S + a2 S + (a1 + a2) · b1 S′ + (a1 + a2) · b2 S′

(iii) (τ′ ≠ F∗(τ′)) | τ′ −(2)→ ⟦F∗(τ′)⟧ = a1 · (S + b1 S′ + b2 S′) + a2 · (S + b1 S′ + b2 S′) = τ′′

(iv) (τ′′ = F∗(τ′′)) ∧ (W(τ′′)) | τ′′ −∗(3)∗→ S + b1 S′ + b2 S′ (I) : S + b1 S′ + b2 S′ (II)

(I) S + b1 S′ + b2 S′ −(4)→ (b1 + b2) · S + b1 S′ + b2 S′ −(2)→ b1 · (S + S′) + b2 · (S + S′) −∗(3)∗→ S + S′ : S + S′ −(1)→ ∅ : ∅

(II) S + b1 S′ + b2 S′ −(4)→ (b1 + b2) · S + b1 S′ + b2 S′ −(2)→ b1 · (S + S′) + b2 · (S + S′) −∗(3)∗→ S + S′ : S + S′ −(1)→ ∅ : ∅

To find the resulting CCTree term, we consider the terms respecting rule (3), shown with ∗(3)∗. Hence, we have them as follows:

(∗) a1 · (S + b1 S′ + b2 S′) + a2 · (S + b1 S′ + b2 S′)

(∗∗) b1 · (S + S′) + b2 · (S + S′)

(∗ ∗ ∗) b1 · (S + S′) + b2 · (S + S′)

Then, since (∗∗) results from the term S + b1 S′ + b2 S′ inside (∗), and (∗ ∗ ∗) from the second occurrence of S + b1 S′ + b2 S′ inside (∗), we substitute them back in their place:

a1 · (b1 · (S + S′) + b2 · (S + S′)) + a2 · (b1 · (S + S′) + b2 · (S + S′))

Since there is no further term respecting rule (3), the above term is the desired CCTree term. It is easy to automatically verify, according to Table 6.1, that the resulting term is a CCTree term.

6.6.3 CCTree Homogenization

After the final CCTree term, resulting from the composition of two (or more) CCTree terms, is returned to the parallel devices, the CCTree term of each computer has to be extended to the final CCTree term. The extension of each CCTree term to the final CCTree term homogenizes the structure of all CCTrees. To this end, it is enough to add a CCTree term to the final CCTree term. Then, all split rules applied on the CCTree term in the process of its composition with the final CCTree term show the required splits in the associated CCTree structure, following the procedure of transforming a term to a tree provided in 6.4.2.

Note It is worth noticing that after homogenizing all the CCTrees to the final CCTree, the data respecting the same set of features go to the same cluster of the final CCTree. However, merging many data points from different clusters of different CCTrees into one cluster may cause the final nodes not to respect the required purity. To solve this issue, after merging the data, the purity of each final node should be computed, and if a node is not pure enough, it should be split based on the CCTree construction rules.

Theorem 6.45. The composition rewriting system is terminating.

Proof. The only rule added to the composition rewriting system compared to the CCTree rewriting system is the split rule. We show that the split rule does not contradict the termination and confluence of the rewriting system. First of all, the split rule is a one-step rule, i.e. the result of the split rule, after one step of application, is considered as the premise of other rules (which decrease the term). On the other hand, on each term, the split rule is applied at most as many times as the number of attributes (finite). Hence, since split by itself is a one-step rule, and for each term it is called finitely many times, the composition rewriting system is terminating.

Theorem 6.46. The composition rewriting system is locally confluent.

Proof. There is no term respecting two (or more) conditions of the composition rewriting system at the same time, i.e. there is no term τ for which τ → τ1 and τ → τ2, where τ1 ≠ τ2. This means that the composition rewriting system is locally confluent.

Theorem 6.47. The composition rewriting system is confluent.

Proof. From Theorems 6.45 and 6.46, the composition rewriting system is terminating and locally confluent, respectively. Hence, from Newman's lemma, the composition rewriting system is confluent.

6.6.4 Time Complexity

Here we present a theorem which gives the time complexity of constructing several CCTrees in parallel devices.

Theorem 6.48. Let us consider n to be the total number of elements to be clustered, m the total number of features, r the number of attributes, vmax the maximum number of values of an attribute, and K the maximum number of non-leaf nodes. The time complexity of constructing CCTrees in t parallel devices equals:

(1/t) · O(K × (n × m + n × vmax))


Proof. In Section 3.5, we explained the calculation of the time complexity of constructing a CCTree. Recalling it here, consider n as the number of elements in the whole dataset, ni the number of elements in node i, m the total number of features, vl the number of features of attribute Al, r the number of attributes, and vmax = max{vl | 1 ≤ l ≤ r}. For constructing a CCTree, if K = m + 1 is the maximum number of non-leaf nodes, which arises in a complete tree, then the maximum time required for constructing a CCTree with n elements equals O(K × (n × m + n × vmax)). Now if we equally divide the dataset containing n points among t devices, it takes O(K × ((n/t) × m + (n/t) × vmax)) = (1/t) · O(K × (n × m + n × vmax)) to create the t CCTrees, i.e. the whole required time is divided by the number of devices. The other part, the algebraic calculations, requires constant time.
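As a back-of-the-envelope illustration of Theorem 6.48, with entirely hypothetical numbers, the bound shrinks by exactly the factor t:

```python
# Illustrative sketch (hypothetical numbers): the construction bound
# K·(n·m + n·v_max) of Theorem 6.48 is divided by t when the n elements
# are spread equally over t parallel devices.
def cctree_cost(n, m, v_max, K, t=1):
    return K * (n * m + n * v_max) / t

seq = cctree_cost(n=1_000_000, m=40, v_max=10, K=41, t=1)
par = cctree_cost(n=1_000_000, m=40, v_max=10, K=41, t=4)
print(seq / par)  # 4.0 -> four devices cut the bound fourfold
```

The constant overhead of the algebraic composition step, which the theorem treats as O(1), is ignored here.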

6.7 Conclusion

In this chapter, a semiring-based formal method, named Feature-Cluster Algebra, was proposed to abstract the representation of a categorical clustering algorithm, named CCTree. Abstraction is a delightful mathematical concept, which constructs a brief sketch of the original representation of a problem in order to deal with it more easily. More precisely, abstraction is the process of mapping a representation of a problem, called the ground (semantic) representation, onto a new representation, called the abstract (syntactic) representation, in such a way that it is possible to deal with the problem in the original space, preserving certain desirable properties, in a form that is simpler to handle, since it is constructed from the ground representation by removing unwanted detail. The abstraction process is performed with the use of a powerful algebraic structure named semiring. Through several theorems and examples, we showed that the proposed approach, under some conditions, fully abstracts the CCTree structure. The full abstraction property guarantees that the semantic and syntactic forms of a problem can be used interchangeably, whilst preserving the required properties.

Furthermore, we presented a set of functions and relations on feature-cluster algebra, which is used to present the CCTree schema in general. We provided a rewriting system which automatically identifies whether a term represents a CCTree or not. The CCTree abstract representation is used in CCTree parallel clustering. Generally, the process of clustering requires time and space, especially when a large amount of data is to be analyzed. The problem of time and precision in clustering becomes even more challenging in security settings, where fast and precise analysis is required to find strategies against intruders. We proposed a rewriting system which automatically returns a CCTree term to which all the CCTrees in parallel devices can be generalized. The termination and confluence of the proposed rewriting system have been proved, which guarantees, first of all, that we have no loop in applying the proposed rewriting systems, and moreover, that the resulting final term is unique.

To the best of our knowledge, the technique proposed in this chapter is a novel methodology for applying algebraic structures to formalize a clustering algorithm representation and to address the associated issues. The proposed approach can be extended to other feature-based clustering and classification algorithms.


Chapter 7

Conclusions and Future Work

In the present chapter, we first summarize what we presented in this work, and afterwards we present future directions for continuing the present study.

7.1 Thesis Summary

The current strategies to minimize the impact of spam messages mostly focus on stopping spam messages from being delivered to the end user's inbox. This kind of analysis, although quite effective in decreasing the cost of spam emails, does not stop spammers, who still impose non-negligible costs on users and companies. The reason could be that the spammer, the root of the problem, runs minimal risk of being pursued, whilst he has the possibility to send millions of messages in a short period of time at minimum expense. To this end, analyzing spammer behavior to find counter-strategies, and possibly to prosecute the spammer, becomes an important issue in spam forensics. However, such an effort requires a first analysis of the huge amount of spam messages collected in honey-pots over a short period of time, whose size is magnified after a few minutes.

To address this issue, in this thesis we first proposed a categorical clustering algorithm, named CCTree, to group a large amount of spam messages into smaller groups based on structural similarity. CCTree has a tree-like structure, where the root node of the tree contains all spam messages. The CCTree divides spam messages step by step, grouping similar data together and obtaining homogeneous subsets of data points. The measure of similarity of clustered data points at each step of the algorithm is given by an index called node purity. If the level of purity is not sufficient, the spam messages belonging to this node are not sufficiently homogeneous, and they should be divided into different subsets (nodes) based on the characteristic (attribute) that yields the highest value of entropy. The rationale behind this choice is that dividing data on the basis of the attribute which yields the greatest entropy helps in creating more homogeneous subsets, where the overall value of entropy is consistently reduced. This approach aims at reducing the time needed to obtain homogeneous subsets. The division process of non-homogeneous sets of data points is repeated iteratively until all sets are sufficiently pure or the number of elements belonging to a node is less than a specific threshold identified by the user. These pure sets are the leaves of the tree and represent the desired spam campaigns.
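The splitting procedure described above can be sketched compactly. This is a simplified, hypothetical Python rendering (the thesis implementation is in MATLAB; the name `cctree` and the parameters `eps` for node purity and `mu` for the minimum node size are illustrative):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def node_purity(rows, attributes):
    """Average attribute entropy of a node (lower means purer)."""
    return sum(entropy([r[a] for r in rows]) for a in attributes) / len(attributes)

def cctree(rows, attributes, eps=0.5, mu=2):
    """Recursively split on the max-entropy attribute until nodes are pure or small."""
    if len(rows) < mu or node_purity(rows, attributes) <= eps:
        return [rows]  # a leaf: one candidate spam campaign
    split_attr = max(attributes, key=lambda a: entropy([r[a] for r in rows]))
    leaves = []
    for val in set(r[split_attr] for r in rows):
        child = [r for r in rows if r[split_attr] == val]
        leaves.extend(cctree(child, attributes, eps, mu))
    return leaves
```

Each leaf returned by `cctree` corresponds to a homogeneous subset of emails, i.e. a detected campaign.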

To apply CCTree to clustering a large amount of spam emails into spam campaigns, we provided a set of 21 categorical features representative of email structure. Then, through an analysis of 200k spam emails, we proposed and validated a methodology to choose the optimal CCTree parameters based on detection of the point of maximum curvature (knee) on a homogeneity versus number-of-clusters graph. We proved the effectiveness of CCTree in spam campaign detection through internal evaluation, estimating its ability to obtain homogeneous clusters, and external evaluation, measuring its ability to effectively classify similar elements (emails) when classes are known beforehand. The efficiency of CCTree has been shown through a comparison with one of the fastest well-known categorical clustering algorithms.
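A common way to locate such a knee, and a plausible stand-in for the curvature detection mentioned here (not necessarily the thesis' exact procedure), is to take the point of maximum perpendicular distance from the chord joining the curve's endpoints:

```python
def knee_index(xs, ys):
    """Index of the point farthest from the chord joining the first and last points."""
    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
    dx, dy = x1 - x0, y1 - y0
    norm = (dx * dx + dy * dy) ** 0.5
    best_i, best_d = 0, -1.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        # perpendicular distance from (x, y) to the chord
        d = abs(dy * (x - x0) - dx * (y - y0)) / norm
        if d > best_d:
            best_i, best_d = i, d
    return best_i
```

Applied to a homogeneity versus number-of-clusters curve, the returned index marks the parameter setting past which adding clusters yields diminishing gains.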

We proposed a framework, named Digital Waste Sorter (DWS), which exploits a self-learning, spammer-goal-based approach for spam email classification. The proposed approach aims at automatically classifying a large amount of raw unclassified spam emails, dividing them into campaigns and labeling each campaign with its spammer goal. To this end, we proposed five class labels grouping spammer goals into five macro-groups, namely Advertisement, Portal Redirection, Advanced Fee Fraud, Malware Distribution and Phishing. Moreover, a set of 21 categorical features representative of email structure is proposed to perform a multi-feature analysis aimed at identifying emails related to a large range of cybercrimes. DWS is based on the cooperation of unsupervised and supervised learning algorithms, given a set of classes describing different spammer goals and a dataset of unclassified spam emails. First, the proposed approach automatically creates a valid training set for a classifier exploiting CCTree. DWS builds on the result of CCTree, which is effective in dividing spam emails into homogeneous clusters. Afterwards, significant spam campaigns useful for the generation of the training set are selected through similarity with a small set of known emails representative of each spam class. Hence, a classifier is trained using the selected campaigns as a training set, and it is used to classify the remaining unclassified emails of the dataset. Furthermore, we proposed six features, including the label of campaigns discovered with DWS, to automatically rank a set of spam campaigns according to investigator priorities.
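The DWS workflow (cluster, label campaigns by similarity to seed emails, classify the rest) might be sketched as follows. This is a deliberately simplified, hypothetical rendering: emails are categorical feature tuples, similarity is plain value overlap, and a nearest-campaign rule stands in for the trained classifier:

```python
from collections import Counter

def overlap(a, b):
    """Fraction of matching categorical feature values between two emails."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def label_campaigns(campaigns, seeds, threshold=0.6):
    """Label a campaign with a spammer goal when it is similar enough to a seed email."""
    labelled = []
    for camp in campaigns:
        # majority value per feature column acts as the campaign centroid
        centroid = [Counter(col).most_common(1)[0][0] for col in zip(*camp)]
        best = max(seeds, key=lambda s: overlap(centroid, s[0]))
        if overlap(centroid, best[0]) >= threshold:
            labelled.append((camp, best[1]))  # (campaign, goal label)
    return labelled

def classify(email, labelled):
    """Nearest-campaign classification of a leftover email (stand-in for the classifier)."""
    camp, label = max(labelled, key=lambda cl: max(overlap(email, e) for e in cl[0]))
    return label
```

The labelled campaigns play the role of the automatically generated training set; in the actual framework a supervised classifier is trained on them instead of the nearest-campaign rule used here.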

Finally, to abstract the CCTree representation, we proposed a semiring-based approach, named feature-cluster algebra. Several relations and functions are defined on the abstract schema of a CCTree, named a CCTree term. The concept of a CCTree term is applied in the formalization of CCTree parallelism, which is expressed in terms of a rewriting system. Clustering parallelism can be used to speed up the process of grouping a large amount of data on parallel devices.


To summarize, what we proposed in this thesis can be used as a tool for cybercrime investigators to automatically organize a huge amount of spam messages in a short period of time. This tool provides the investigator with the priority of the most dangerous spammers, through the best-ranked spam campaigns, which are the ones that should be pursued.

7.2 Future work

This thesis can be extended in several directions. In what follows, we present the extensions we plan to pursue.

The technique that we proposed in this thesis can be applied as a useful tool for the automatic, fast detection of the most dangerous spam campaigns. To show the efficiency and effectiveness of our proposed approach, we plan to apply it to a huge amount of spam messages containing one of the most dangerous current spam campaigns, e.g. the CryptoWall 3.0 malware, and to show that our approach detects it automatically among other campaigns.

To speed up the process of clustering spam emails into campaigns, we expect to apply several sampling algorithms. In statistics, sampling is concerned with the selection of a subset of elements that preserves the statistical properties of the dataset, and it is applied to estimate characteristics of the whole population. In the context of spam messages, since we always face a large amount of data, finding the best strategy for sampling data from the whole dataset, while preserving its main characteristics, may help to speed up the analysis.
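As an illustration of the kind of sampling strategy meant here, a stratified sample keeps each categorical stratum represented at the same rate (a hypothetical sketch, not part of the thesis):

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample each stratum (group sharing the same key value) at the same rate."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in rows:
        strata[key(r)].append(r)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample
```

Because each stratum is sampled at the same rate, the proportions of feature values in the sample match those of the whole dataset, which is exactly the property clustering on a sample relies on.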

Furthermore, we plan to apply the proposed methodology to detecting, labeling, and ranking social spam campaigns, e.g. on Facebook or Twitter. To this end, first of all, the representative features of social spam campaigns should be identified. Afterwards, the most popular cybercrimes in social networks should be characterized as the labels of discovered spam campaigns in order to train a classifier. Finally, the ranking features need to be identified to order the set of social spam campaigns.

Another area of research to which we are interested in applying our proposed methodology is botnet detection and finding the botmaster, the root of the problem. Although many efforts have been made in prosecuting botmasters through botnets, we expect our proposed approach to work well for botnet detection through precise spam campaign detection and, consequently, for catching the spammer. The reason is that we believe the proposed mechanism is able to precisely identify the zombies (bots) controlled by the same spammer (botmaster).

On the formalization side, there are many directions in which to extend our proposed approach, since it is among the very first efforts in applying formal methods to clustering algorithms. First, we plan to extend the idea of semirings to abstracting the representation of other well-known categorical clustering algorithms. Then, we will apply the abstract schema to concepts related to feature analysis, parallel clustering, etc. Furthermore, we plan to apply further properties of semirings to address more issues in categorical clustering algorithms. For example, semiring homomorphisms can be applied to automatically identify whether two categorical clusterings are identical or not.


Publications

• Sheikhalishahi, M., Mejri, M., and Tawbi, N. (2015). Clustering spam emails into campaigns. In Library, S. D., editor, 1st International Conference on Information Systems Security and Privacy [126].

• Sheikhalishahi, M., Saracino, A., Mejri, M., Tawbi, N., and Martinelli, F. (2015c). Fast and effective clustering of spam emails based on structural similarity. In 8th International Symposium on Foundations and Practice of Security [129].

• Sheikhalishahi, M., Saracino, A., Mejri, M., Tawbi, N., and Martinelli, F. (2015b). Digital waste sorting: A goal-based, self-learning approach to label spam email campaigns. In 11th International Workshop on Security and Trust Management [128].

• Sheikhalishahi, M., Mejri, M., and Tawbi, N. (2016). On the abstraction of a categorical clustering algorithm. In Machine Learning and Data Mining in Pattern Recognition - 12th International Conference [127].


Table 7.1 – Table of Notations

ε , Node purity
µ , Minimum number of elements in a node
A , The set of sorts (attributes)
VA , The carrier set of sort A
V , The union set of carrier sets of A
sort , A function which returns the set of carrier sets of received features
F , The power set of the power set of V
F1 , A subset of F in which each set contains just one element
S , The set of records (elements)
|= , Satisfaction relation
F |= S , The set of elements of S that satisfy the set of features F
FC , Set of feature-cluster terms
A , Atomic terms
block , A function which returns a set of feature-cluster terms
≡ , FC-term comparison
C , The set of terms
−→A , Factorization rewriting rule
−→d , Defactorization rewriting rule
FC ↓ , The set of factorized FC-terms
FC ↑ , The set of non-factorized FC-terms
(Σ, Q, δ) , Graph structure
[[.]] , A function which returns a tree from a received feature-cluster family term
Ψ , A function which returns a feature-cluster family term from a received forest (tree)
GV,FC , The set of all possible forests on the set of edge labels V and node labels FC
≈ , Ordered FC-terms comparison
DA , Attribute division function
δ , Initial function
B ≺τ A , Attribute B is smaller than attribute A
Fk , Ordered unification function
∂ , Derivative function
[τ]i , The i'th component of τ
W(τ) , Well-formed term
F∗(A, τ) , Unified term
split(τ) , Split function


Appendix A

A.1 Source Code of the Proposed Approach

In what follows, some of the important source code used in CCTree construction, labeling, etc. is provided.

Shannon entropy function

%INPUT:
%attribute_vals: [1*N] INTEGER
%The vector with the values for each attribute inside a cluster
%OUTPUT:
%entropy: [1*1] DOUBLE
%The entropy for the specific attribute.
function entropy = shannon_entropy(attribute_vals)
ordered_vect = sort(attribute_vals);
%Order the array to divide the different values of the attribute
vector_size = size(attribute_vals);
i = 0;
while isempty(ordered_vect) == 0
    %Find the number of elements for each attribute value in the vector
    i = i + 1;
    index = find(ordered_vect == ordered_vect(1));
    temp = size(index);
    dim(i) = temp(2);
    ordered_vect(index) = [];
end
entropy = 0;
counter = size(dim);
for j = 1:counter(2) %compute the entropy
    entropy = entropy - ((dim(j)/vector_size(2))*log2(dim(j)/vector_size(2)));
end

The Shannon entropy of a cluster:

function e = clustering_entropy(A, ci)
num_clusters = size(A);
num_clusters = num_clusters(2);
result = 0;
for i = 1:num_clusters
    if not(isempty(A{i}))
        num_cols_ai = size(A{i});
        num_cols_ai = num_cols_ai(2);
        vect_ai = A{i}(:, num_cols_ai)';
        num_cols_ci = size(ci);
        num_els_ci = num_cols_ci(1);
        num_cols_ci = num_cols_ci(2);
        vect_ci = ci(:, num_cols_ci-1)';
        intersection = intersect(vect_ai, vect_ci);
        dim = size(intersection);
        dim = dim(2);
        if (dim ~= 0)
            result = result + (dim/num_els_ci)*log(dim/num_els_ci);
        end
    end
end
e = -result;
end

Node purity

function [np, max_entropy_attribute] = node_purity(data, weight)
n_attr = size(data);
n_attr = n_attr(2) - 3;
if nargin < 2
    weight = ones(1, n_attr)*1/n_attr;
end
np = 0;
max_entropy = 0;
max_entropy_attribute = 1;
for i = 1:n_attr-1
    temp_entropy = shannon_entropy(data(:, i)');
    if temp_entropy > max_entropy
        max_entropy = temp_entropy;
        max_entropy_attribute = i;
    end
    np = np + weight(i)*temp_entropy;
end

CCTree function:

function [clusters, labels] = CCTree(data, node_purity_threshold, max_num_elem)
tic
num_elem = size(data);
num_elem = num_elem(1);
associate_vector = 1:num_elem;
associate_vector = associate_vector'; %count the email lines
data = [data, associate_vector];
level = 0; %initialize data structures
nodes_per_level = {};
nodes_next_level = {};
all_nodes = {};
leaves = {};

[current_node_purity, current_attribute] = node_purity(data);
%compute node purity of the whole dataset
num_elem_curr_node = size(data);
num_elem_curr_node = num_elem_curr_node(1);
%check number of elements
if current_node_purity > node_purity_threshold && num_elem_curr_node > max_num_elem
    %split if the set is NOT pure AND has too many elements
    [nodes_per_level, labels] = CCTreeSplit(data, current_attribute);
    %nodes_per_level contains the various clusters
    level = 1;
else
    clusters = data;
    labels = [];
    return;
end

while 1
    num_nodes_curr_level = size(nodes_per_level);
    num_nodes_curr_level = num_nodes_curr_level(2);
    new_level = 0;
    %boolean to check if there is a new level
    for i = 1:num_nodes_curr_level
        %for all nodes in this level
        temp_node = nodes_per_level{i}; %extract a cluster
        num_elem_curr_node = size(temp_node);
        num_elem_curr_node = num_elem_curr_node(1);
        [current_node_purity, current_attribute] = node_purity(temp_node);
        %compute purity
        if current_node_purity > node_purity_threshold && num_elem_curr_node > max_num_elem
            %split if the set is NOT pure AND has too many elements
            [temp_cell_array, temp_label] = CCTreeSplit(temp_node, current_attribute);
            %split and assign the new clusters to a temporary variable
            nodes_next_level = [nodes_next_level, temp_cell_array];
            %add the nodes to a deeper level
            new_level = 1;
        else
            leaves = [leaves; {temp_node}];
            %add it to the leaf collection
        end
    end
    clusters = leaves;
    %assign the leaves to the results
    all_nodes = [all_nodes, nodes_per_level];
    nodes_per_level = nodes_next_level;
    %the next level becomes the current level
    nodes_next_level = {};
    if new_level == 0
        %stop if all nodes are leaves
        break;
    end
    level = level + 1;
end
toc

CCTree labeling function:

function M = CreateCCTreeLabelledMatrix(c)
iter = size(c);
iter = iter(1);
M = [];
for i = 1:iter
    numofelements = size(c{i});
    numofelements = numofelements(1);
    vect = i*ones(numofelements, 1);
    tempmat = c{i};
    tempmat = [tempmat, vect];
    if (i == 1)
        M = tempmat;
    else
        M = [M; tempmat];
    end
end
end

Precision cluster:

function p = precision_cluster(Ai, Cj)
num_cols_ai = size(Ai);
num_cols_ai = num_cols_ai(2);
num_cols_cj = size(Cj);
num_el_cj = num_cols_cj(1);
num_cols_cj = num_cols_cj(2);
vect_ai = Ai(:, num_cols_ai)';
vect_cj = Cj(:, num_cols_cj-1)';
intersection = intersect(vect_ai, vect_cj);
result = size(intersection);
p = result(2)/num_el_cj;
end

Recall cluster:

function r = recall_cluster(Ai, Cj)
num_cols_ai = size(Ai);
num_el_ai = num_cols_ai(1);
num_cols_ai = num_cols_ai(2);
num_cols_cj = size(Cj);
num_cols_cj = num_cols_cj(2);
vect_ai = Ai(:, num_cols_ai)';
vect_cj = Cj(:, num_cols_cj-1)';
intersection = intersect(vect_ai, vect_cj);
result = size(intersection);
r = result(2)/num_el_ai;
end

Find clusters by purity:

function [index, purity] = FindClusterByPurity(data, leaves)
num_of_leaves = size(leaves);
num_of_leaves = num_of_leaves(1);
tot_el = size(cell2mat(leaves));
tot_el = tot_el(1);
min_purity = Inf;
index = -1;
nattr_leaf = size(leaves{1});
nattr_leaf = nattr_leaf(2);
nattr_data = size(data);
nattr_data = nattr_data(2);
size_diff = nattr_leaf - nattr_data;
data = [data, zeros(1, size_diff)];
%add zero values to match the size of a leaf
for i = 1:num_of_leaves
    num_of_elements = size(leaves{i});
    num_of_elements = num_of_elements(1);
    if (num_of_elements > 1)
        %do not consider nodes with a single element
        purity_old = node_purity_mod(leaves{i});
        purity_new = node_purity_mod([leaves{i}; data]);
        %add data and compute the new purity
        difference = (purity_new - purity_old);
        difference = difference*(num_of_elements);
        %do not consider nodes whose purity is increased
        if difference < min_purity
            min_purity = difference;
            index = i;
        end
    end
end
purity = min_purity;
end

F-Measure

function f = FMeasure_Clusters(Ai, c)
results = 0;
num_of_clusters = size(c);
num_of_clusters = num_of_clusters(1);
for i = 1:num_of_clusters
    op = 2*precision_cluster(Ai, c{i})*recall_cluster(Ai, c{i})/ ...
        (precision_cluster(Ai, c{i}) + recall_cluster(Ai, c{i}));
    results = max(results, op);
end
f = results;
end


A.2 Tables of Attributes

In what follows, the set of features of each attribute, and the range of each feature, applied in the CCTree algorithm are presented in tables. Each table represents one attribute: the first column of each table constitutes the set of features of that attribute, and the second column shows the number we assigned to each feature in the same row. The two binary attributes, Link with at (@) and Links with non-ASCII character, are not presented in tables. For these two attributes, if the body of the spam message contains no link with (@), or no link with a non-ASCII character, we assign the number 0 to this message; otherwise the assigned number equals 1.
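As an illustration of how these tables are used, a hypothetical encoder might map raw email properties to the attributed numbers; the value maps below follow Tables A.1 and A.2, while the function name and input format are illustrative:

```python
# Value maps taken from Table A.1 (language) and Table A.2 (attachment type).
LANGUAGE = {"unknown": 0, "english": 1, "italian": 2, "french": 3, "german": 4,
            "spanish": 5, "chinese": 6, "arabic": 7, "persian": 8, "japanese": 9,
            "russian": 10, "croatian": 11, "portuguese": 12, "indian": 13}
ATTACH_TYPE = {"none": 0, "pdf": 1, "exec": 2, "doc": 3, "pic": 4, "txt": 5, "zip": 6}

def encode(email):
    """Map a raw email description to (part of) the categorical vector used by CCTree."""
    return (LANGUAGE.get(email["language"], 0),          # unknown language -> 0
            ATTACH_TYPE.get(email["attachment_type"], 7))  # 7 = Other
```

A full encoder would produce all 21 categorical features in the same manner, one map per table.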

Table A.1 – Language of spam message and subject

Language               Attributed Number
Unknown language       0
English language       1
Italian language       2
French language        3
German language        4
Spanish language       5
Chinese language       6
Arabic language        7
Persian language       8
Japanese language      9
Russian language       10
Croatian language      11
Portuguese language    12
Indian language        13

Table A.2 – Type of Attachment

Attachment Type    Attributed Number
None               0
PDF                1
EXEC               2
DOC                3
PIC                4
TXT                5
ZIP                6
Other              7


Table A.3 – Attachment Size

Attachment Size                  Attributed Number
Attachment size 0 kb             0
Attachment size 1-100 kb         1
Attachment size 100-500 kb       2
Attachment size 500-1000 kb      3
Attachment size 1000 kb or more  4

Table A.4 – Number of attachment

Attachment Number        Attributed Number
No attachment            0
1 attachment             1
2 attachments            2
3 attachments            3
4 attachments and more   4

Table A.5 – Average size of attachments

Average Attachment Size                    Attributed Number
Average size of attachment 0               0
Average size of attachment 1-100           1
Average size of attachment 100-500         2
Average size of attachment 500-1000        3
Average size of attachment 1000 and more   4

Table A.6 – Type of Message

Message Type    Attributed Number
Plain Text      1
HTML based      2
Image based     3
Links Only      4
Others          5


Table A.7 – Length of Message

Message Size                     Attributed Number
Length class 0-100 kb            0
Length class 100-200 kb          1
Length class 200-300 kb          2
Length class 300-400 kb          3
Length class 400-500 kb          4
Length class 500-600 kb          5
Length class 600-700 kb          6
Length class 700-800 kb          7
Length class 800-900 kb          8
Length class 900-1000 kb         9
Length class 1000-5000 kb        10
Length class 5000-10000 kb       11
Length class 10000-20000 kb      12
Length class 20000-30000 kb      13
Length class 30000-40000 kb      14
Length class 40000-50000 kb      15
Length class 50000-60000 kb      16
Length class 60000-70000 kb      17
Length class 70000-80000 kb      18
Length class 80000-90000 kb      19
Length class 90000-100000 kb     20
Length class 100000 kb or more   21

Table A.8 – IP-based links verification

IP based Verification    Attributed Number
No IP based links        0
Contain IP based links   1

Table A.9 – Mismatch links

Mismatch Links              Attributed Number
No mismatch link            0
1 mismatch link             1
2 mismatch links            2
3 mismatch links and more   3


Table A.10 – Number of links

Number of Links       Attributed Number
No link               0
1 link                1
2 links               2
3 links               3
4 links               4
5 links               5
6 links               6
7 links               7
8 links               8
9 links               9
10-100 links          10
more than 100 links   11

Table A.11 – Number of Domains

Number of Domains               Attributed Number
No domain                       0
1 domain in links               1
2 domains in links              2
3 domains in links              3
4 domains in links              4
5 domains in links              5
6-10 domains in links           6
more than 10 domains in links   7

Table A.12 – Average number of dots in links

Average Number of Dots in Links   Attributed Number
0 dots per link                   0
1 dot per link                    1
2 dots per link                   2
3 dots per link                   3
more than 3 dots per link         4

Table A.13 – Hex character in links

Number of Links with Hex                Attributed Number
No link with Hex character              0
1 link with Hex character               1
2 links with Hex character              2
3 links with Hex character              3
4 links with Hex character              4
5 links with Hex character              5
6-10 links with Hex character           6
more than 10 links with Hex character   7


Table A.14 – Words in Subject

Number of Words in Subject      Attributed Number
No word in subject              0
1-5 words in subject            1
6-10 words in subject           2
more than 10 words in subject   3

Table A.15 – Characters in subject

Number of Characters in Subject      Attributed Number
No character in subject              0
1-10 characters in subject           1
10-20 characters in subject          2
more than 20 characters in subject   3

Table A.16 – Non ASCII characters in subject

Number of Non-ASCII Characters in Subject      Attributed Number
No non-ASCII character in subject              0
1 non-ASCII character in subject               1
2-5 non-ASCII characters in subject            2
6-10 non-ASCII characters in subject           3
more than 10 non-ASCII characters in subject   4

Table A.17 – Recipients of spam email

Number of Recipients    Attributed Number
No recipient            0
1 recipient             1
2 recipients and more   2


Table A.18 – Images in spam messages

Number of Images        Attributed Number
No image                0
1 image                 1
2 images                2
3 images                3
4 images                4
5 images                5
6 images                6
7 images                7
8 images                8
9 images                9
10-20 images            10
21-30 images            11
31-40 images            12
41-50 images            13
51-100 images           14
101-500 images          15
501-1000 images         16
more than 1000 images   17


Bibliographie

[1] A. A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. Spamrank – fully automatic link spam detection. In Adversarial Information Retrieval on the Web, 2005.

[2] M. K. Albertini and R. F. de Mello. Formalization of data stream clustering properties and analysis of algorithms. In The International Conference on Artificial Intelligence (ICAI). The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing, 2011.

[3] A. Almomani, B. B. Gupta, S. Atawneh, A. Meulenberg, and E. Almomani. A survey of phishing email filtering techniques. IEEE Communications Surveys and Tutorials, 15(4):2070–2090, 2013.

[4] D.S. Anderson, C. Fleizach, S. Savage, and G.M. Voelker. Spamscatter: Characterizing internet scam hosting infrastructure. In Proceedings of the 16th USENIX Security Symposium, 2007.

[5] R. Anderson, C. Barton, R. Böhme, R. Clayton, M. J.G. van Eeten, M. Levi, T. Moore, and S. Savage. Measuring the cost of cybercrime. In Rainer Böhme, editor, The Economics of Information Security and Privacy, pages 265–300. 2013.

[6] P. Andritsos, P. Tsaparas, R. Miller, and K.C. Sevcik. In Advances in Database Technology - EDBT 2004, volume 2992 of Lecture Notes in Computer Science, pages 123–146. 2004.

[7] P. Andritsos, P. Tsaparas, R. Miller, and K.C. Sevcik. Limbo: Scalable clustering of categorical data. In Advances in Database Technology - EDBT 2004, volume 2992 of Lecture Notes in Computer Science, pages 123–146. Springer Berlin Heidelberg, 2004.

[8] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, and C. D. Spyropoulos. An experimental comparison of naive bayesian and keyword-based anti-spam filtering with personal e-mail messages. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, pages 160–167, New York, NY, USA, 2000.


[9] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras, and C. D. Spyropoulos. An evaluation of naive bayesian anti-spam filtering. In Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning, ECML, pages 9–17, 2000.

[10] H.B. Aradhye, G.K. Myers, and J.A. Herson. Image analysis for efficient categorization of image-based spam e-mail. In Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on, pages 914–918 Vol. 2, Aug 2005.

[11] G. Atkinson and A.M. Nevill. Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Medicine, 26(4):217–238, 1998.

[12] S. Baase and A. V. Gelder. Computer Algorithms: Introduction to Design and Analysis. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 3rd edition, 1999.

[13] M. Bailey, E. Cooke, F. Jahanian, X. Yunjing, and M. Karir. A survey of botnet technology and defenses. In Proceedings Conference For Homeland Security, 2009. CATCH '09. Cybersecurity Applications Technology, pages 299–304, 2009.

[14] C. Beleites, U. Neugebauer, T. Bocklitz, C. Krafft, and J. Popp. Sample size planning for classification models. Analytica chimica acta, 760:25–33, 2013.

[15] D. Benavides, S. Segura, and A. Ruiz-Cortés. Automated analysis of feature models 20 years later: A literature review. Inf. Syst., 35(6):615–636, September 2010.

[16] A. A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. Spamrank – fully automatic link spam detection: work in progress. Proceedings of the first international workshop on adversarial information retrieval on the web, 2005.

[17] A. Bergholz, G. PaaB, F. Reichartz, S. Strobel, and S. Birlinghoven. Improved phishing detection using model-based features. In Fifth Conference on Email and Anti-Spam, CEAS, 2008.

[18] P. Berkhin. A survey of clustering data mining techniques. In Jacob Kogan, Charles Nicholas, and Marc Teboulle, editors, Grouping Multidimensional Data, pages 25–71. Springer Berlin Heidelberg, 2006.

[19] J.C. Bezdek and N.R. Pal. Cluster validation with generalized dunn's indices. In Proceedings of the Second New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems, pages 190–193, 1995.

[20] B. Biggio, G. Fumera, I. Pillai, and F. Roli. Image spam filtering using visual information. In Image Analysis and Processing (ICIAP), 14th International Conference on, pages 105–110, Sept 2007.


[21] E. Blanzieri and A. Bryl. A survey of learning-based techniques of email spam filtering.Artificial Intelligence Review, 29(1) :63–92, 2008.

[22] E. Blanzieri and A. Bryl. A survey of learning-based techniques of email spam filtering.Artif. Intell. Rev., 29(1) :63–92, March 2008.

[23] Janusz A. Brzozowski. Derivatives of regular expressions. Journal of the ACM,11(4) :481–494, October 1964.

[24] S. Buhne, K. Lauenroth, and K. Pohl. Modelling requirements variability across productlines. In Proceedings of the 13th IEEE International Conference on Requirements Engi-neering, RE ’05, pages 41–52, Washington, DC, USA, 2005. IEEE Computer Society.

[25] J. Caballero, P. Poosankam, D. Song, and C. Kreibich. Dispatcher: Enabling active botnet infiltration using automatic protocol reverse-engineering. In CCS '09: Proceedings of the 16th ACM Conference on Computer and Communications Security, pages 621–634. ACM, 2009.

[26] P. H. Calais, E. V. P. Douglas, O. G. Dorgival, M. Wagner, H. Cristine, and S. J. Klaus. A campaign-based characterization of spamming strategies. In Proceedings of the 5th Conference on Email and Anti-Spam (CEAS), 2008.

[27] P. H. Calais, D. E. V. Pires, D. O. Guedes, W. Meira, C. Hoepers, and K. Steding-Jessen. A campaign-based characterization of spamming strategies. In CEAS, 2008.

[28] P. H. Calais Guerra, D. E. V. Pires, M. T. C. Ribeiro, D. Guedes, W. Meira, C. Hoepers, M. H. P. C. Chaves, and K. Steding-Jessen. Spam Miner: A platform for detecting and characterizing spam campaigns. Information Systems Applications, 2009.

[29] J. Carpinter and R. Hunt. Tightening the net: A review of current and next generation spam filtering tools. Computers and Security, 25(8):566–578, 2006.

[30] X. Carreras, L. Marquez, and J. H. Salgado. Boosting trees for anti-spam email filtering. In Proceedings of RANLP-01, 4th International Conference on Recent Advances in Natural Language Processing, Tzigov Chark, BG, pages 58–64, 2001.

[31] R. Caruana, N. Karampatziakis, and A. Yessenalina. An empirical evaluation of supervised learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 96–103, New York, NY, USA, 2008.

[32] R. Caruana and A. Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 161–168, NY, USA, 2006. ACM.


[33] C. L. P. Chen and C. Y. Zhang. Data-intensive applications, challenges, techniques and technologies: A survey on big data. Information Sciences, 275:314–347, 2014.

[34] T. C. Chen, T. Stepan, S. Dick, and J. Miller. An anti-phishing system employing diffused information. ACM Trans. Inf. Syst. Secur., 16(4):16:1–16:31, April 2014.

[35] C. Cho, J. Caballero, C. Grier, V. Paxson, and D. Song. Insights from the inside: A view of botnet management from infiltration. In Proceedings of the 3rd USENIX Conference on Large-scale Exploits and Emergent Threats: Botnets, Spyware, Worms, and More, LEET '10, pages 2–2, Berkeley, CA, USA, 2010. USENIX Association.

[36] Cisco. Cisco 2015 Annual Security Report. www.cisco.com, 2015.

[37] E. M. Clarke and J. M. Wing. Formal methods: State of the art and future directions. ACM Comput. Surv., 28(4):626–643, December 1996.

[38] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, NY, USA, 1991.

[39] K. Czarnecki and U. W. Eisenecker. Generative Programming: Methods, Tools, and Applications. ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, 2000.

[40] L. F. Da Cruz Nassif and E. R. Hruschka. Document clustering for forensic analysis: An approach for improving computer inspection. Information Forensics and Security, IEEE Transactions on, 8(1):46–54, January 2013.

[41] J. Dan, Q. Jianlin, C. Yanyun, and C. Li. Clustering method and its formalization. In Information Technology and Artificial Intelligence Conference (ITAIC), 6th IEEE Joint International, volume 1, pages 57–61, August 2011.

[42] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, January 2008.

[43] N. Dershowitz and J. P. Jouannaud. Handbook of Theoretical Computer Science (Vol. B), chapter Rewrite Systems, pages 243–320. MIT Press, Cambridge, MA, USA, 1990.

[44] S. Dinh, T. Azeb, F. Fortin, D. Mouheb, and M. Debbabi. Spam campaign detection, analysis, and investigation. Digital Investigation, 12, Supplement 1:S12–S21, 2015.

[45] D. L. Donoho, A. Flesia, U. Shankar, V. Paxson, J. Coit, and S. Staniford. Multiscale stepping-stone detection: Detecting pairs of jittered interactive streams by exploiting maximum tolerable delay. In Recent Advances in Intrusion Detection, volume 2516 of Lecture Notes in Computer Science, pages 17–35. 2002.

[46] H. Drucker, D. Wu, and V. N. Vapnik. Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 10(5):1048–1054, 1999.


[47] Z. Duan, P. Chen, F. Sanchez, Y. Dong, M. Stephenson, and J. M. Barker. Detecting spam zombies by monitoring outgoing messages. IEEE Transactions on Dependable and Secure Computing, 9(2):198–210, March 2012.

[48] F. Fdez-Riverola, E. L. Iglesias, F. Díaz, J. R. Méndez, and J. M. Corchado. Applying lazy learning algorithms to tackle concept drift in spam filtering. Expert Syst. Appl., 33(1):36–48, July 2007.

[49] Federal Trade Commission. Report, www.consumer.ftc.gov, 2009.

[50] I. Fette, N. Sadeh, and A. Tomasic. Learning to detect phishing emails. In Proceedings of the 16th ACM International Conference on World Wide Web, pages 649–656, 2007.

[51] D. H. Fisher. Knowledge acquisition via incremental conceptual clustering. Mach. Learn., 2(2):139–172, 1987.

[52] J. François, S. Wang, R. State, and T. Engel. BotTrack: Tracking botnets using NetFlow and PageRank. In Jordi Domingo-Pascual, Pietro Manzoni, Sergio Palazzo, Ana Pont, and Caterina Scoglio, editors, NETWORKING 2011, volume 6640 of Lecture Notes in Computer Science, pages 1–14. Springer Berlin Heidelberg, 2011.

[53] W. N. Gansterer and D. Pölz. E-mail classification for phishing defense. In Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval, ECIR '09, pages 449–460, Berlin, Heidelberg, 2009. Springer-Verlag.

[54] H. Gao, J. Hu, C. Wilson, Z. Li, Y. Chen, and B. Y. Zhao. Detecting and characterizing social spam campaigns. In Proceedings of the 10th ACM Annual Conference on Internet Measurement, pages 35–47, 2010.

[55] H. Gao, J. Hu, C. Wilson, Z. Li, Y. Chen, and B. Y. Zhao. Detecting and characterizing social spam campaigns. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, IMC '10, pages 35–47, New York, NY, USA, 2010. ACM.

[56] S. Garcia, J. Luengo, J. A. Saez, V. Lopez, and F. Herrera. A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. IEEE Trans. on Knowl. and Data Eng., 25(4):734–750, April 2013.

[57] Z. Ghahramani. Unsupervised learning. In Advanced Lectures on Machine Learning, volume 3176 of Lecture Notes in Computer Science, pages 72–112. Springer Berlin Heidelberg, 2004.

[58] S. Gilpin, S. Nijssen, and I. Davidson. Formalizing hierarchical clustering as integer linear programming. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, 2013.


[59] F. Giunchiglia and T. Walsh. A theory of abstraction. Artif. Intell., 57(2-3):323–389, October 1992.

[60] M. A. Gluck and J. E. Corter. Information uncertainty and the utility of categories. In Proceedings of the Seventh Annual Conference of the Cognitive Science Society, pages 283–287, 1985.

[61] R. Grinker, S. Lubkemann, and C. B. Steiner. In Perspectives on Africa: A Reader in Culture, History and Representation, pages 618–621, 2012.

[62] J. L. Gross and J. Yellen. Graph Theory and Its Applications, Second Edition (Discrete Mathematics and Its Applications). Chapman & Hall/CRC, 2005.

[63] R. Hadjidj, M. Debbabi, H. Lounis, F. Iqbal, A. Szporer, and D. Benredjem. Towards an integrated e-mail forensic analysis framework. Digital Investigation, 5(3–4):124–137, 2009.

[64] M. Halkidi and M. Vazirgiannis. Clustering validity assessment: Finding the optimal partitioning of a data set. In Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on, pages 187–194, 2001.

[65] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. SIGKDD Explor. Newsl., 11(1):10–18, 2009.

[66] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition, 2011.

[67] J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery, 8(1):53–87, 2004.

[68] U. Hebisch and H. J. Weinert. Semirings: Algebraic Theory and Applications in Computer Science. World Scientific, 1998.

[69] J. Hedley. jsoup cookbook. http://jsoup.org/cookbook, 2009.

[70] P. Hell and J. Nesetril. Graphs and Homomorphisms. Oxford Lecture Series in Mathematics and Its Applications. Oxford University Press, Oxford, New York, 2004.

[71] L. Henderson. Crimes of Persuasion: Schemes, Scams, Frauds: How Con Artists Will Steal Your Savings and Inheritance Through Telemarketing Fraud, Investment Schemes and Consumer Scams. Coyoto Ridge Press, 2003.

[72] P. Höfner, R. Khedri, and B. Möller. Feature algebra. In Proceedings of the 14th International Conference on Formal Methods, FM '06, pages 300–315, Berlin, Heidelberg, 2006. Springer-Verlag.


[73] P. Höfner, R. Khédri, and B. Möller. An algebra of product families. Software and System Modeling, 10(2):161–182, 2011.

[74] P. Höfner, R. Khedri, and B. Möller. Feature algebra. In Jayadev Misra, Tobias Nipkow, and Emil Sekerinski, editors, FM 2006: Formal Methods, volume 4085 of Lecture Notes in Computer Science, pages 300–315. 2006.

[75] I. Idris, A. Selamat, N. Thanh Nguyen, S. Omatu, O. Krejcar, K. Kuca, and M. Penhaker. A combined negative selection algorithm–particle swarm optimization for an email spam detection system. Engineering Applications of Artificial Intelligence, 39:33–44, 2015.

[76] J. Iedemska, G. Stringhini, R. A. Kemmerer, C. Kruegel, and G. Vigna. The tricks of the trade: What makes spam campaigns successful? In 2014 IEEE Security and Privacy Workshops, SPW 2014, San Jose, CA, USA, May 17–18, 2014, pages 77–83, 2014.

[77] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Comput. Surv., 31(3):264–323, 1999.

[78] J. P. John, A. Moshchuk, S. D. Gribble, and A. Krishnamurthy. Studying spamming botnets using Botlab. In Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation, NSDI '09, pages 291–306, Berkeley, CA, USA, 2009. USENIX Association.

[79] J. P. John, A. Moshchuk, S. D. Gribble, and A. Krishnamurthy. Studying spamming botnets using Botlab. In Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation, NSDI '09, pages 291–306, Berkeley, CA, USA, 2009. USENIX Association.

[80] I. Kanaris, K. Kanaris, H. Houvardas, and E. Stamatatos. Words versus character n-grams for anti-spam filtering. International Journal on Artificial Intelligence Tools, 16:1047–1067, 2007.

[81] K. Kang, S. Cohen, J. Hess, W. Novak, and A. Peterson. Feature-Oriented Domain Analysis (FODA) feasibility study. Technical report, 1990.

[82] K. C. Kang, S. Kim, J. Lee, K. Kim, E. Shin, and M. Huh. FORM: A feature-oriented reuse method with domain-specific reference architectures. Ann. Softw. Eng., 5:143–168, January 1998.

[83] C. Kanich, C. Kreibich, K. Levchenko, B. Enright, G. M. Voelker, V. Paxson, and S. Savage. Spamalytics: An empirical analysis of spam marketing conversion. In Proceedings of the 15th ACM Conference on Computer and Communications Security, CCS '08, pages 3–14, New York, NY, USA, 2008. ACM.


[84] C. Kanich, N. Weaver, D. McCoy, T. Halvorson, C. Kreibich, K. Levchenko, V. Paxson, G. M. Voelker, and S. Savage. Show me the money: Characterizing spam-advertised revenue. In Proceedings of the 20th USENIX Conference on Security, SEC '11, Berkeley, CA, USA, 2011. USENIX Association.

[85] R. Kerber. ChiMerge: Discretization of numeric attributes. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 123–128, 1992.

[86] J. Kleinberg. An impossibility theorem for clustering. In Advances in Neural Information Processing Systems, pages 446–453. MIT Press, 2002.

[87] C. Kreibich, C. Kanich, K. Levchenko, B. Enright, G. M. Voelker, V. Paxson, and S. Savage. Spamcraft: An inside look at spam campaign orchestration. In Proceedings of the 2nd USENIX Conference on Large-scale Exploits and Emergent Threats: Botnets, Spyware, Worms, and More, LEET '09, 2009.

[88] C. Kreibich, C. Kanich, K. Levchenko, B. Enright, G. M. Voelker, V. Paxson, and S. Savage. Spamcraft: An inside look at spam campaign orchestration. In Proceedings of the 2nd USENIX Conference on Large-scale Exploits and Emergent Threats: Botnets, Spyware, Worms, and More, LEET '09, Berkeley, CA, USA, 2009. USENIX Association.

[89] C. C. Lai and M. C. Tsai. An empirical performance comparison of machine learning methods for spam e-mail categorization. In Hybrid Intelligent Systems, 2004. HIS '04. Fourth International Conference on, pages 44–48, 2004.

[90] C. Laorden, X. Ugarte-Pedrero, I. Santos, B. Sanz, J. Nieves, and P. G. Bringas. Study on the effectiveness of anomaly detection for spam filtering. Information Sciences, 277:421–444, 2014.

[91] N. Leontiadis. Measuring and analyzing search-redirection attacks in the illicit online prescription drug trade. In Proceedings of USENIX Security 2011, 2011.

[92] F. Li and M. H. Hsieh. An empirical study of clustering behavior of spammers and group-based anti-spam strategies. In CEAS 2006, Third Conference on Email and Anti-Spam, pages 27–28, 2006.

[93] H. Li. Minimum entropy clustering and applications to gene expression analysis. In Proceedings of the IEEE Computational Systems Bioinformatics Conference, pages 142–151, 2004.

[94] X. Li and Z. Fang. Parallel clustering algorithms. Parallel Computing, 11(3):275–290, 1989.


[95] J. Liu, D. Batory, and C. Lengauer. Feature oriented refactoring of legacy applications. In Proceedings of the 28th International Conference on Software Engineering, ICSE '06, pages 112–121, New York, NY, USA, 2006. ACM.

[96] J. Liu, Y. Xiao, K. Ghaboosi, H. Deng, and J. Zhang. Botnet: Classification, attacks, detection, tracing, and preventive measures. EURASIP J. Wirel. Commun. Netw., 2009:9:1–9:11, February 2009.

[97] R. Lopez-Herrejon, D. Batory, and C. Lengauer. A disciplined approach to aspect composition. In Proceedings of the 2006 ACM SIGPLAN Symposium on Partial Evaluation and Semantics-based Program Manipulation, PEPM '06, pages 68–77, New York, NY, USA, 2006. ACM.

[98] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.

[99] S. Martin, B. Nelson, A. Sewani, K. Chen, and A. D. Joseph. Analyzing behavioral features for email classification. In CEAS, 2005.

[100] McAfee Labs. McAfee threats report: 2015. www.mcafee.com, 2015.

[101] McAfee Avert Labs. McAfee threats report: Third quarter 2013, 2013.

[102] M. Meila. Comparing clusterings: An axiomatic view. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 577–584, New York, NY, USA, 2005. ACM.

[103] P. Meyer and A. L. Olteanu. Formalizing and solving the problem of clustering in MCDA. European Journal of Operational Research, 227(3):494–502, 2013.

[104] M. E. J. Newman, S. Forrest, and J. Balthrop. Email networks and the spread of computer viruses. Phys. Rev. E, 66:035101, 2002.

[105] B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo. PLANET: Massively parallel learning of tree ensembles with MapReduce. Proc. VLDB Endow., 2(2):1426–1437, August 2009.

[106] A. Pathak, F. Qian, Y. C. Hu, Z. M. Mao, and S. Ranjan. Botnet spam campaigns can be long lasting: Evidence, implications, and analysis. SIGMETRICS Perform. Eval. Rev., 37(1):13–24, June 2009.

[107] A. Pitsillidis, K. Levchenko, C. Kreibich, C. Kanich, G. M. Voelker, V. Paxson, N. Weaver, and S. Savage. Botnet Judo: Fighting spam with itself. 2010.

[108] C. Pu and S. Webb. Observed trends in spam construction techniques: A case study of spam evolution. In CEAS, pages 104–112, 2006.


[109] J. R. Quinlan. Induction of decision trees. Mach. Learn., 1(1):81–106, 1986.

[110] S. Radicati. Email statistics report 2013–2017. www.radicati.com, 2013.

[111] A. Ramachandran and N. Feamster. Understanding the network-level behavior of spammers. ACM SIGCOMM Computer Communication Review, 36(4):291–302, 2006.

[112] J. M. Rao and D. H. Reiley. The economics of spam. The Journal of Economic Perspectives, 26(3):87–110, 2012.

[113] J. M. Rao and D. H. Reiley. On the spam campaign trail. In The Economics of Spam, pages 87–110. Journal of Economic Perspectives, Volume 26, Number 3, 2012.

[114] Commtouch. Commtouch technical report. www.commtouch.com, 2015.

[115] S. Robak and A. Pieczyński. Employment of fuzzy logic in feature diagrams to model variability in software families. J. Integr. Des. Process Sci., 7(3):79–94, August 2003.

[116] R. A. Rodríguez-Gómez, G. Maciá-Fernández, and P. García-Teodoro. Survey and taxonomy of botnet research through life-cycle. ACM Comput. Surv., 45(4):45:1–45:33, August 2013.

[117] L. Rokach. A survey of clustering algorithms. In O. Maimon and L. Rokach, editors, Data Mining and Knowledge Discovery Handbook, pages 269–298. 2010.

[118] Peter J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.

[119] T. S. Guzella and W. M. Caminhas. A review of machine learning approaches to spam filtering. Expert Systems with Applications, 36(7):10206–10222, 2009.

[120] S. Salvador and P. Chan. Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. In Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence, ICTAI '04, pages 576–584, Washington, DC, USA, 2004. IEEE Computer Society.

[121] A. A. Abu Samra and O. A. Ghanem. Analysis of clustering technique in Android malware detection. In Proceedings of the 2013 Seventh International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing, IMIS '13, pages 729–733, Washington, DC, USA, 2013. IEEE Computer Society.

[122] S. S. C. Silva, R. M. P. Silva, R. C. G. Pinto, and R. M. Salles. Botnets: A survey. Computer Networks, 57(2):378–403, 2013. Botnet Activity: Analysis, Detection and Shutdown.


[123] A. K. Seewald. An evaluation of naive Bayes variants in content-based learning for spam filtering. Intell. Data Anal., 11(5):497–524, October 2007.

[124] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[125] C. E. Shannon. A mathematical theory of communication. SIGMOBILE Mob. Comput. Commun. Rev., 5(1):3–55, January 2001.

[126] M. Sheikhalishahi, M. Mejri, and N. Tawbi. Clustering spam emails into campaigns. In Olivier Camp, Edgar R. Weippl, Christophe Bidan, and Esma Aïmeur, editors, ICISSP 2015 – Proceedings of the 1st International Conference on Information Systems Security and Privacy, ESEO, Angers, Loire Valley, France, 9–11 February, 2015, pages 90–97, February 2015.

[127] M. Sheikhalishahi, M. Mejri, and N. Tawbi. On the abstraction of a categorical clustering algorithm. In Machine Learning and Data Mining in Pattern Recognition – 12th International Conference, MLDM 2016, New York, NY, USA, July 16–21, 2016, Proceedings, pages 659–675, 2016.

[128] M. Sheikhalishahi, A. Saracino, M. Mejri, N. Tawbi, and F. Martinelli. Digital waste sorting: A goal-based, self-learning approach to label spam email campaigns. In Sara Foresti, editor, Security and Trust Management – 11th International Workshop, STM 2015, Vienna, Austria, September 21–22, 2015, Proceedings, volume 9331 of Lecture Notes in Computer Science, pages 3–19. Springer, 2015.

[129] M. Sheikhalishahi, A. Saracino, M. Mejri, N. Tawbi, and F. Martinelli. Fast and effective clustering of spam emails based on structural similarity. In Foundations and Practice of Security – 8th International Symposium, FPS 2015, Clermont-Ferrand, France, October 26–28, 2015, Revised Selected Papers, pages 195–211, 2015.

[130] J. Song, D. Inoue, M. Eto, H. C. Kim, and K. Nakao. An empirical study of spam: Analyzing spam sending systems and malicious web servers. In Proceedings of the 2010 10th IEEE/IPSJ International Symposium on Applications and the Internet, SAINT '10, pages 257–260, Washington, DC, USA, 2010. IEEE Computer Society.

[131] J. Song, D. Inoue, M. Eto, H. C. Kim, and K. Nakao. A heuristic-based feature selection method for clustering spam emails. In Proceedings of the 17th International Conference on Neural Information Processing: Theory and Algorithms – Volume Part I, ICONIP '10, pages 290–297, Berlin, Heidelberg, 2010. Springer-Verlag.

[132] J. Song, D. Inoue, M. Eto, H. C. Kim, and K. Nakao. O-means: An optimized clustering method for analyzing spam based attacks. In IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, volume 94, pages 245–254, 2011.


[133] B. Stone-Gross, T. Holz, G. Stringhini, and G. Vigna. The underground economy of spam: A botmaster's perspective of coordinating large-scale spam campaigns. In Proceedings of the 4th USENIX Conference on Large-scale Exploits and Emergent Threats, LEET '11, Berkeley, CA, USA, 2011. USENIX Association.

[134] G. Stringhini, O. Hohlfeld, C. Kruegel, and G. Vigna. The harvester, the botmaster, and the spammer: On the relations between the different actors in the spam landscape. In Proceedings of the 9th ACM Symposium on Information, Computer and Communications Security, ASIA CCS '14, pages 353–364, New York, NY, USA, 2014. ACM.

[135] D. Talia. Parallelism in knowledge discovery techniques. In Proceedings of the 6th International Conference on Applied Parallel Computing, Advanced Scientific Computing, PARA '02, pages 127–138, London, UK, 2002.

[136] K. Tillman. How many Internet connections are in the world? Right now. blogs.cisco.com, 2013.

[137] A. Topchy, A. K. Jain, and W. Punch. Combining multiple weak clusterings. In Proceedings of the Third IEEE International Conference on Data Mining, ICDM '03, pages 331–338, November 2003.

[138] K. Tretyakov. Machine learning techniques in spam filtering. 2004.

[139] K. Tretyakov. Machine learning techniques in spam filtering. In Data Mining Problem-oriented Seminar, MTAT, volume 3, pages 60–79. Citeseer, 2004.

[140] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the CVPR IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 511–518, 2001.

[141] D. Wang, D. Irani, and C. Pu. A study on evolution of email spam over fifteen years. In Proceedings of the 9th International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), pages 1–10, October 2013.

[142] X. Wang, S. Chen, and S. Jajodia. Network flow watermarking attack on low-latency anonymous communication systems. In Security and Privacy, 2007. SP '07. IEEE Symposium on, pages 116–130, 2007.

[143] X. L. Wang and I. Cloete. Learning to classify email: A survey. In Proceedings of 2005 International Conference on Machine Learning and Cybernetics, volume 9, pages 5716–5719, August 2005.

[144] C. Wei, A. Sprague, G. Warner, and A. Skjellum. Mining spam email to identify common origins for forensic application. In Proceedings of the 2008 ACM Symposium on Applied Computing, SAC '08, pages 1433–1437, New York, NY, USA, 2008. ACM.


[145] F. Weng, Q. Jiang, L. Shi, and N. Wu. An intrusion detection system based on the clustering ensemble. In Anti-counterfeiting, Security, Identification, 2007 IEEE International Workshop on, pages 121–124, April 2007.

[146] Y. Xie, F. Yu, K. Achan, R. Panigrahy, G. Hulten, and I. Osipkov. Spamming botnets: Signatures and characteristics. 38(4):171–182, 2008.

[147] R. Xu and D. Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645–678, May 2005.

[148] Y. Yang, X. Guan, and J. You. CLOPE: A fast and effective clustering algorithm for transactional data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, pages 682–687, New York, NY, USA, 2002. ACM.

[149] K. Yoda and H. Etoh. Finding a connection chain for tracing intruders. In Proceedings of the 6th European Symposium on Research in Computer Security, ESORICS '00, pages 191–205, London, UK, 2000. Springer-Verlag.

[150] C. Zhang, W. B. Chen, X. Chen, and G. Warner. Revealing common sources of image spam by unsupervised clustering with visual features. In Proceedings of the 2009 ACM Symposium on Applied Computing, SAC '09, pages 891–892, New York, NY, USA, 2009. ACM.

[151] L. Zhang, J. Zhu, and T. Yao. An evaluation of statistical spam filtering techniques. ACM Transactions on Asian Language Information Processing (TALIP), 3(4):243–269, December 2004.

[152] Y. Zhao, Y. Xie, F. Yu, Q. Ke, Y. Yu, Y. Chen, and E. Gillum. BotGraph: Large scale spamming botnet detection. In Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation, NSDI '09, pages 321–334, Berkeley, CA, USA, 2009. USENIX Association.

[153] L. Zhuang, J. Dunagan, D. R. Simon, H. Wang, and J. D. Tygar. Characterizing botnets from email spam records. In Proceedings of the 1st USENIX Workshop on Large-Scale Exploits and Emergent Threats, LEET '08, pages 2:1–2:9, Berkeley, CA, USA, 2008. USENIX Association.
