IDENTIFYING APPLICATION PROTOCOLS IN COMPUTER NETWORKS USING VERTEX PROFILES
By
Edward G. Allan, Jr.
A Thesis Submitted to the Graduate Faculty of
WAKE FOREST UNIVERSITY
in Partial Fulfillment of the Requirements
for the Degree of
MASTER OF SCIENCE
in the Department of Computer Science
December 2008
Winston-Salem, North Carolina
Approved By:
Errin W. Fulp, Ph.D., Advisor
Examining Committee:
David J. John, Ph.D., Chairperson
William H. Turkett, Jr., Ph.D.
Acknowledgements
This thesis is the product of many people's labors, not just my own. The ideas contained in the pages that follow have been formulated and refined for over a year, with the guidance and support of several people, whose assistance I would be remiss not to mention. I would like to thank Wake Forest University and GreatWall Systems, Inc. for their support. This research was funded by GreatWall Systems, Inc. via the United States Department of Energy STTR grant DE-FG02-06ER86274. 1
I would also like to thank my parents for their support throughout my years at Wake Forest, both as an undergraduate and as a graduate student. Without their encouragement and financial assistance, none of this would have been possible. I also would not be where I am today without the help of my friends, who have made these past several years some of the most enjoyable and most memorable yet.
My thesis committee members, Dr. David John and Dr. William Turkett, Jr., were instrumental in providing me with feedback throughout the research and writing process. Their comments and criticism have undoubtedly enabled the success of this endeavor. I would especially like to thank Dr. Turkett for selflessly spending hours assisting me and stepping in as my "adopted advisor" during Dr. Errin Fulp's sabbatical.
Last, but certainly not least, I must thank my advisor, Dr. Errin Fulp. I have been fortunate to work with him in a variety of contexts for more than five years now, and he has been a tremendous influence on both my personal and academic development. His relaxed personality and great sense of humor kept me off-task just enough to save my sanity, while his insight and guidance allowed me to complete my studies and be ready to move on to the next chapter in my life. Many thanks again to all who have helped me along the way — you are much appreciated.
1 The views and conclusions contained herein are those of the author and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the DOE or the U.S. Government.
SANS - SysAdmin, Audit, Networking, and Security
SMTP - Simple Mail Transfer Protocol
SSH - Secure Shell
TCP - Transmission Control Protocol
UDP - User Datagram Protocol
VoIP - Voice over IP
Symbols
| V | is the number of vertices in a graph
e_ij is an edge from vertex i to vertex j
deg(v) is the degree of vertex v
id(v) is the indegree of vertex v
od(v) is the outdegree of vertex v
N(v) is the set of nodes in the neighborhood of vertex v
e(v) is the eccentricity of vertex v
rad(G) is the radius of graph G
diam(G) is the diameter of graph G
d(u, v) is the distance between vertex u and vertex v
C_D(v) is the degree centrality of vertex v
C_B(v) is the betweenness centrality of vertex v
C_C(v) is the closeness centrality of vertex v
x_i is the eigenvector centrality of vertex i
C(v) is the clustering coefficient of vertex v
φ is a port number associated with an application (e.g., 80 for HTTP)
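Several of the symbols above can be made concrete with a short sketch. The following Python snippet is illustrative only; the four-vertex directed graph is hypothetical and not drawn from this thesis. It computes degree counts, distances via breadth-first search, eccentricity, radius, and diameter.

```python
from collections import deque

# Hypothetical directed graph as an adjacency list: e_ij means j is in adj[i].
adj = {1: [2], 2: [3], 3: [1, 4], 4: [2]}
vertices = sorted(adj)

def od(v):
    return len(adj[v])                               # od(v): outdegree

def indeg(v):
    return sum(v in out for out in adj.values())     # id(v): indegree

def deg(v):
    return indeg(v) + od(v)                          # deg(v): total degree

def distances(u):
    """d(u, v) for every v reachable from u, via breadth-first search."""
    dist, queue = {u: 0}, deque([u])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist

def ecc(v):
    return max(distances(v).values())                # e(v): eccentricity

rad = min(ecc(v) for v in vertices)                  # rad(G): radius
diam = max(ecc(v) for v in vertices)                 # diam(G): diameter
```

For this graph, ecc(1) = 3 (the longest shortest path from vertex 1 reaches vertex 4 in three hops), so rad(G) = 2 and diam(G) = 3.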
Abstract
Edward G. Allan, Jr.
Identifying Application Protocols in Computer
Networks Using Vertex Profiles
Thesis under the direction of Errin W. Fulp, Ph.D., Associate Professor of Computer Science
Security and management of computer network resources exemplify two critical activities that challenge system administrators. They face potential threats from outside intruders as well as internal users who already have access to the organization's assets. It is imperative that administrators are aware of what applications are being executed, but the use of data encryption techniques and non-standard port numbers presents difficulties that must be overcome.
To that end, this thesis introduces a novel method to identify application protocols based on the analysis of application graphs, which model application-level communications between computers. The performance of two types of node descriptions, called vertex profiles, is compared. "Traditional" vertex profiles characterize each node using several well-studied graph measures. Furthermore, this work uniquely applies motif-based analysis, which has previously been used primarily in systems biology, to the study of application graphs by creating a second type of vertex profile based on a node's participation in statistically significant motifs. Machine learning techniques are employed to evaluate the importance of specific profile features. The experimental results, using a nearest-neighbor classifier, show that this type of analysis can correctly classify the applications observed with greater than 80% accuracy.
Chapter 1: Introduction
Managing and securing today’s critical data networks is a daunting and expensive
task. According to INPUT [5], demand for vendor-furnished information systems
and services by the U.S. government will increase from $71.9 billion in 2008 to $87.8
billion in 2013. This money funds such tasks as system modernization, information
sharing, IT management and information security. As computer networks increase in
size, speed and complexity, and malicious hackers develop more sophisticated attacks,
traditional methods of managing and securing these networks begin to break down.
This thesis proposes a novel approach to identifying the actions of hosts within a
network by examining the properties of application graphs, which model the social and
functional interactions of hosts with one another at the software application level (e.g.
HTTP, FTP, etc.). With the aid of machine learning techniques and algorithms, this
method exploits graph characteristics of each host in the application graph, such as its
connectedness, its position in the graph and the shapes of the subgraphs in which it is
found. One distinct advantage to this approach is that classification can be performed
“in the dark”, meaning that the packet payloads are either unavailable or have been
encrypted, rendering deep packet inspection futile. Knowing what activities users on
the network are participating in is crucial to network administrators who must manage
bandwidth allocations, network configurations, performance and security and access
policies. The following sections of this chapter provide background information and
motivation for the study.
1.1 Issues in Network Management and Security
To protect itself from litigation and to help ensure the integrity of its network, an
organization (such as a school, business, or government) will often develop an
Acceptable Use Policy, or AUP. An AUP defines what behaviors are acceptable for internet
browsing, what applications can be run by users and other relevant guidelines for
usage. The SANS Security Policy Project [6] provides several resources and tem-
plates for such policies. Take, for example, a policy that does not allow users to run
a personal web server using an organization’s computing resources. Identifying such
behavior can help to preserve network bandwidth that is otherwise used for legitimate
business activities.
Not only can failure to comply with an organization’s AUP waste computing
resources, it can also have serious security implications as well. Continuing with
the example above, running an improperly configured web server or hosting insecure
web application files gives an attacker an easy point of entry into the network. A
study performed by MITRE from 2001-2006 notes a sharp increase in the number of
public reports for vulnerabilities that are specific to web applications [7]. For several
years buffer overflow attacks had been the most common, but were overtaken in 2005
by web application vulnerabilities such as SQL injection, cross-site scripting (XSS)
and remote file inclusion. It is, therefore, in a network administrator’s best interest
to ensure that the network is properly utilized in accordance with the policies and
guidelines adopted by the organization.
1.2 Current Methods of Network Analysis
Several tools allow system administrators to determine which applications are being
used on a network. This information assists them in the maintenance and protection
of networked systems. Sophisticated users, however, are able to hide their activities,
which could potentially include actions that are against the organization’s AUP, or
worse yet, are illegal. This section examines a few of the tools used by administrators
and identifies some of their weaknesses.
1.2.1 Applications and Port Numbers
When data is sent to a computer over a network, the destination port number identifies
which application on the host computer should receive and process the data. Many
applications use port numbers specified by the Internet Assigned Numbers Authority
[8]. For example, FTP servers use ports 20 and 21, while web servers use port 80
by default. NetStat is a command line tool that shows information about network
connections, both incoming and outgoing [9]. Figure 1.1 demonstrates the output of
The attributes a_1 through a_d can be any numerical data type or numerical
representation of a data type. In the traditional graph analysis approach there are eleven
attributes (degree counts, centrality measures, etc.), so d = 11. These attributes
include integers, real numbers and boolean values represented as a 1 (true) or a 0
(false). The intent is to associate an application with a certain profile.
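To make this concrete, a traditional vertex profile can be flattened into a numeric vector. The sketch below is a hypothetical illustration: the eleven attribute names follow the measures discussed in this thesis, but their exact order and the sample values are assumptions, not taken from the original experiments.

```python
# A hypothetical traditional vertex profile with d = 11 attributes.
ATTRIBUTES = [
    "indegree", "outdegree", "total_degree",            # integer counts
    "eccentricity",                                     # integer
    "is_center", "is_periphery",                        # booleans stored as 1/0
    "degree_centrality", "closeness_centrality",        # real-valued measures
    "betweenness_centrality", "eigenvector_centrality",
    "clustering_coefficient",
]

def make_profile(**measures):
    """Flatten a host's graph measures into a numeric attribute vector."""
    return [float(measures[name]) for name in ATTRIBUTES]

profile = make_profile(
    indegree=3, outdegree=1, total_degree=4, eccentricity=2,
    is_center=1, is_periphery=0,
    degree_centrality=0.31, closeness_centrality=0.52,
    betweenness_centrality=0.08, eigenvector_centrality=0.44,
    clustering_coefficient=0.0,
)
assert len(profile) == 11   # d = 11 in the traditional approach
```

Casting every attribute to a float allows the mixed integer, real and boolean measures to share one feature space for the distance calculations that follow.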
The idea of vertex profiles based on graph characteristics is adapted to the motif-
based approach. Instead of considering the percentage of subgraphs a motif occurs
in, however, a binary attribute is created that describes whether or not the vertex
participates in the motif. One of the files output by FANMOD motif searches is a
comma separated file with the following format:
adjacency matrix, <participating vertices>
After the significant motifs have been determined, the script in Listing B.5 parses
these files and creates the profiles for each node based on its participation in significant
motifs. The dimensionality d of the motif profiles is 130: 42 of these are significant
order 3 motifs, while the remaining 88 are significant order 4 motifs. The motif profiles
were built putting both order 3 and order 4 motifs together because preliminary
investigations indicated that the combination is more successful in separating and
identifying protocols than either can do alone.
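The parsing step might be sketched as follows. Only the "adjacency matrix, <participating vertices>" layout comes from the FANMOD output described above; the dump contents, host names, and two-motif significant set are hypothetical stand-ins for the real 130-motif list built by the script in Listing B.5.

```python
import csv
import io

# Hypothetical excerpt of a FANMOD dump file: the first column is the motif's
# flattened adjacency matrix, the remaining columns are participating vertices.
dump = """\
011000100,10.0.0.1,10.0.0.2,10.0.0.3
011000100,10.0.0.2,10.0.0.4,10.0.0.5
010001100,10.0.0.1,10.0.0.4,10.0.0.5
"""

# Motifs previously found to be statistically significant; their order fixes
# the attribute order of the resulting profile (here d = 2 for brevity).
significant = ["011000100", "010001100"]

profiles = {}  # host -> binary motif-participation vector
for row in csv.reader(io.StringIO(dump)):
    motif, participants = row[0], row[1:]
    if motif not in significant:
        continue  # only significant motifs contribute attributes
    col = significant.index(motif)
    for host in participants:
        profiles.setdefault(host, [0] * len(significant))[col] = 1
```

Here 10.0.0.1 participates in both significant motifs while 10.0.0.3 appears only in the first, so their profiles differ in exactly one binary attribute.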
5.7 K-Nearest Neighbor Classification
The tasks of node classification and feature weighting (Section 5.8) are handled by
RapidMiner, an open source knowledge-discovery and data mining tool built on the
Java™ platform [61]. RapidMiner allows for data mining experiments to be quickly
constructed through the use of hundreds of modular operators that handle data pre-
processing and post-processing, creation and storage of models, clustering and classi-
fication tasks as well as statistical analysis.
The k-nearest neighbor (k-NN) classification algorithm is a simple machine learn-
ing algorithm for classifying objects based on the closest training examples in a feature
space. First, the data is broken into a training set and a test set. The proximity of a
test point z to every point in the example set is then calculated.
Algorithm 1 The k-Nearest Neighbor classification algorithm [62]
1: Let k be the number of nearest neighbors and D be the set of training examples.
2: for each test example z = (x′, y′) do
3:   Compute d(x′, x), the distance between z and every example (x, y) ∈ D.
4:   Select D_z ⊆ D, the set of k closest training examples to z.
5:   y′ = argmax_v Σ_{(x_i, y_i) ∈ D_z} I(v = y_i)
6: end for
After the nearest-neighbor list is obtained, the test example z is classified based
on a majority vote of the k nearest neighbors to z. In this study, k = 1, so a test
point z is given the same label as the label of its closest neighbor. In line 5 above,
yi is the class label for one of the nearest neighbors, and I() is an indicator function
that returns the value 1 if its argument is true and 0 otherwise.
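A minimal, self-contained version of this 1-NN procedure, using the Euclidean distance of Equation 5.1, might look like the following. The toy feature vectors and protocol labels are illustrative only, not data from this study.

```python
import math
from collections import Counter

def euclidean(x, y):
    # Equation 5.1: d(x, y) = sqrt(sum over k of (x_k - y_k)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(z, examples, k=1):
    """Label test point z by majority vote among its k nearest neighbors.

    examples is a list of (feature_vector, label) pairs (the training set D).
    """
    nearest = sorted(examples, key=lambda ex: euclidean(z, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training set with two labels; k = 1 as in this study, so a test point
# simply receives the label of its single closest neighbor.
D = [([0.0, 0.0], "HTTP"), ([0.1, 0.2], "HTTP"), ([5.0, 5.0], "SSH")]
assert knn_classify([0.2, 0.1], D) == "HTTP"
assert knn_classify([4.5, 5.5], D) == "SSH"
```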
5.7.1 Measuring Profile Separation
A number of similarity measures can be used to determine the distance from one point
to another (line 3 of Algorithm 1), the selection of which depends on the type of data
being examined and its application [62]. For example, there is Euclidean distance,
Jaccard coefficient, cosine similarity and simple matching coefficient. Euclidean dis-
tance is often chosen for instances of dense continuous data such as that found in
the profiles for traditional graph analysis. Although the simple matching coefficient
is often applied to binary data such as the motif profiles, the Euclidean distance is
also suitable, and is selected for use in this study. Equation 5.1 defines this distance,
where n is the number of dimensions and x_k and y_k are the kth attributes of x and y.

    d(x, y) = √( Σ_{k=1}^{n} (x_k − y_k)^2 )    (5.1)
5.7.2 Cross Validation of Classification Results
Cross validation is the process of partitioning a data set into n subsets, training a
classifier with n − 1 subsets and using the remaining subset to test. The process is
then repeated n times with a different subset left out each time. In 10-fold cross
validation, for example, ten subsets are created, each containing 10% of the original
data set. In each iteration, 90% of the data is used for training and 10% is used for
testing. To avoid the possibility of a particular subset not containing any instances
(or very few) of a particular label, stratified sampling is used so that each subset
contains roughly the same proportion of labels.
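The procedure can be sketched as follows. This is an illustrative stand-in for the RapidMiner operators actually used in the study; the round-robin dealing is just one simple way to achieve stratification.

```python
import random
from collections import defaultdict

def stratified_folds(examples, n=10, seed=0):
    """Split (features, label) pairs into n folds, keeping the label
    proportions roughly equal in every fold (stratified sampling)."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[1]].append(ex)
    folds = [[] for _ in range(n)]
    rng = random.Random(seed)
    for group in by_label.values():
        rng.shuffle(group)
        for i, ex in enumerate(group):   # deal each class round-robin
            folds[i % n].append(ex)
    return folds

def cross_validate(examples, classify, n=10):
    """Train on n - 1 folds and test on the held-out fold, n times over."""
    folds = stratified_folds(examples, n)
    correct = total = 0
    for i in range(n):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        for features, label in folds[i]:
            correct += classify(features, train) == label
            total += 1
    return correct / total

# Toy data: 20 examples of each label, so every fold holds 2 of each.
data = ([([float(i)], "A") for i in range(20)]
        + [([float(100 + i)], "B") for i in range(20)])
assert all(len(fold) == 4 for fold in stratified_folds(data))
```

A trivially separable classifier, such as `lambda x, train: "A" if x[0] < 50 else "B"`, scores 1.0 under this harness, which is a quick sanity check of the fold bookkeeping.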
5.8 Genetic Algorithm Feature Weighting
Genetic algorithms provide a unique way to investigate which attributes in the vertex
profiles more effectively classify application protocols, as well as increase the accuracy
of the nearest-neighbor classifier. This study utilizes a genetic algorithm to perform
evolutionary feature weighting, the results of which are applied to each profile and a
new classifier is built using the nearest neighbor algorithm as before. Alternatively, a
brute-force search of all attribute combinations (given by Equation 5.2) might be
possible for a small attribute set such as in the case of traditional graph analysis, but
is not feasible for motif analysis.
    c = Σ_{n=1}^{d} (d choose n)    (5.2)
Given that d = 11 for traditional graph analysis, applying the equation above
reveals that the number of possible attribute combinations c is 2,047. However, when
d = 130 for motif analysis, c = 1.36 × 10^39. Genetic algorithms present one possible
way to explore this problem space within a reasonable amount of time.
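Both counts are easy to verify numerically, since the sum in Equation 5.2 equals 2^d − 1 by the binomial theorem:

```python
from math import comb

def attribute_combinations(d):
    # Equation 5.2: c = sum over n = 1..d of (d choose n), i.e. 2^d - 1.
    return sum(comb(d, n) for n in range(1, d + 1))

assert attribute_combinations(11) == 2047             # traditional graph analysis
assert attribute_combinations(130) == 2 ** 130 - 1    # motif analysis, about 1.36e39
```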
5.8.1 Overview of Genetic Algorithms
Genetic algorithms view learning as a competition among a population of evolving
candidate problem solutions [63]. During each generation, a fitness function (line 4 of
Algorithm 2 below) assesses each candidate to determine if it will contribute to the
next generation of solutions. Those solutions found to be the most “fit” are selected
for mating and mutation and shape the following generation of potential solutions.
The algorithm repeats until some termination condition is met, such as convergence
to a solution or a predefined number of generations have been tested.
Algorithm 2 General form of a genetic algorithm [63]
1: Set time t = 0
2: Initialize the population P(t)
3: while the termination condition is not met do
4:   Evaluate fitness of each member of the population P(t).
5:   Select members from population P(t) based on fitness.
6:   Produce the offspring of these pairs using genetic operators.
7:   Replace, based on fitness, candidates of P(t), with these offspring.
8:   Set time t = t + 1
9: end while
Before the algorithm can begin, candidate solutions must be transformed into
an appropriate representation for the problem space. Examples include binary, real
value, and tree encoding, the simplest and most studied of which is binary encoding
[64]. Initial populations of candidate solutions are usually chosen at random. The
population size depends on the problem space, but studies have shown a population
size of 20-30 generally yields good results [65, 66]. At this point, the fitness function
evaluates each member of the population, and selects the best candidates for mating.
Figure 5.6 shows what a simple crossover of two binary strings might look like.
Input Bit Strings       Output Bit Strings
   0011|0001       =⇒      0011|1011
   0100|1011               0100|0001

Figure 5.6: Single-point crossover of two binary strings
Just like in evolutionary biology, there is a small chance for random genetic mu-
tation to occur. In a binary string, this would equate to one of the bits being flipped
from a 0 to a 1 or vice versa, allowing the algorithm to explore more of the problem
space and not settle on a local solution. Previous research suggests variable values
for mutation probability, such as 0.0001 [65] or 0.005 - 0.01 [66].
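The crossover of Figure 5.6 and the mutation step can be sketched as follows. This is an illustration only; the crossover point is fixed after the fourth bit to reproduce the figure, and the mutation probability is a parameter.

```python
import random

def crossover(a, b, point):
    """Single-point crossover: swap the tails of two equal-length bit strings."""
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(bits, p, rng):
    """Flip each bit independently with probability p."""
    return [bit ^ (rng.random() < p) for bit in bits]

# Reproducing Figure 5.6 with the crossover point after the fourth bit:
c1, c2 = crossover([0, 0, 1, 1, 0, 0, 0, 1],
                   [0, 1, 0, 0, 1, 0, 1, 1], point=4)
assert c1 == [0, 0, 1, 1, 1, 0, 1, 1]   # 0011|1011
assert c2 == [0, 1, 0, 0, 0, 0, 0, 1]   # 0100|0001

# A rare random bit flip keeps the search from settling on a local solution.
c1 = mutate(c1, p=0.01, rng=random.Random(0))
```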
5.8.2 Feature Weighting
The RapidMiner distribution contains a prewritten test for evolutionary feature weight-
ing using genetic algorithms. In the context of application identification, the function
used to determine the fitness of candidate solutions is based upon whether or not the
potential solution increases the overall accuracy of the 1-NN classifier. Solutions that
do not increase the performance of the classifier are not selected to contribute to the
following generation of candidate solutions. The algorithm is run for thirty genera-
tions, by which time the system should stabilize and begin to converge to a solution
set of attribute weights. The full test parameters, including crossover probabilities,
mutation rates and candidate selection can be found in Appendix C.
Chapter 6: Results and Analysis
To test the accuracy and performance of the proposed approaches, several exper-
iments were run using the method described in Chapter 5. In total, 65 application
graphs were examined: ten AIM, ten DNS, ten HTTP, five Kazaa, ten MSDS, ten
Netbios and ten SSH, with the discrepancy resulting from fewer examples of peer-to-
peer Kazaa traffic being located in the data traces that were downloaded. Profiles
were classified using both traditional graph attributes and motif-based attributes.
Afterwards, profile attributes were weighted using a genetic algorithm. This step
aims to provide two important functions: to increase the accuracy of the classifiers
and to provide insight into which attributes are more effective for identifying network
applications. Analysis of several key attributes is provided in this chapter, as well as
a direct comparison between traditional and motif-based profiles.
6.1 Preliminary Investigations
Because motifs have not been applied in the realm of application identification, some
preliminary classification work was required to vet this approach. Profiles for each of
the 65 application graphs were created using a combination of significant order 3 and
order 4 motifs, where each attribute represents the frequency of a particular motif
within that graph. The results provided in Table 6.1 were encouraging (for the full
classification results see Appendix D). Perhaps a more interesting question, however,
is not if an entire graph of communications can be correctly classified, but instead
if the activities of a particular host can be identified. It is on this question that the
remainder of the chapter is focused.
Protocol AIM DNS HTTP Kazaa MSDS Netbios SSH
Accuracy 80% 80% 90% 40% 60% 100% 80%
Table 6.1: Classification accuracy of 65 application graphs
6.2 Initial Results
Classification results are presented as confusion matrices; each row of the table repre-
sents a predicted class label (an application in this case), while the columns represent
the true class label. The boldface numbers along the diagonal indicate correct clas-
sifications. Confusion matrices also show false positives and false negatives. Data
points that are predicted to have a certain class label but are incorrect are known
as false positives, found in the rows of the matrices. False negatives are examples of
a particular class that are incorrectly labeled, shown in the columns. For example,
given a set of data that is predicted to be hosts sharing files via Kazaa, true positives
would be those hosts that are actually using the P2P application while false positives
would be those hosts that are not. Conversely, given a set of data that is known to
be file-sharing hosts, false negatives would be those that are not labeled as using the
Kazaa application.
           True A   True B   True C   Precision
Pred. A       5        2        0      71.4%
Pred. B       3        3        2      37.5%
Pred. C       0        1       11      91.7%
Recall:     62.5%    50.0%    84.6%    Overall: 70.4%

Table 6.2: An example confusion matrix with three classes
The performance of the nearest-neighbor classification models is described by
three different accuracy measures. The overall accuracy of a model (shown in the
bottom-right corner of the matrix) is simply the number of correct
classifications (true positives) over all classifications. Given a set of predictions of a
particular label, class precision is a measure of the accuracy of those predicted labels.
It is the ratio of correct predictions of label l to all predictions of label l. It can be
written:
    precision = true positives / (true positives + false positives)    (6.1)
Class recall (also called sensitivity) measures the accuracy of predicted labels if
provided a complete set of true labels. Recall is given by the following equation:
    recall = true positives / (true positives + false negatives)    (6.2)
Table 6.2 displays the results of an example classification experiment, as well as
the accuracy, precision and recall measures. This confusion matrix shows that while
the classifier has some trouble distinguishing between class A and class B, it can
effectively detect examples of class C.
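These measures can be checked directly against Table 6.2; the sketch below recomputes the precision, recall, and overall accuracy figures shown there.

```python
# Confusion matrix from Table 6.2: rows are predicted labels, columns true labels.
labels = ["A", "B", "C"]
matrix = [
    [5, 2, 0],   # predicted A
    [3, 3, 2],   # predicted B
    [0, 1, 11],  # predicted C
]

def precision(i):
    # Equation 6.1: correct predictions of a label over all its predictions (row i).
    return matrix[i][i] / sum(matrix[i])

def recall(j):
    # Equation 6.2: correct predictions of a label over all its true examples (column j).
    return matrix[j][j] / sum(row[j] for row in matrix)

# Overall accuracy: the diagonal (correct classifications) over all classifications.
overall = sum(matrix[i][i] for i in range(3)) / sum(map(sum, matrix))

assert round(precision(0) * 100, 1) == 71.4   # class A precision
assert round(recall(2) * 100, 1) == 84.6      # class C recall
assert round(overall * 100, 1) == 70.4        # bottom-right corner of Table 6.2
```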
6.2.1 Traditional Graph Measure Profiles
To remind the reader, traditional graph measure profiles have eleven attributes in-
cluding degree counts, centrality measures and clustering coefficient (see Section 5.4
for the full list). There are a total of 3,940 unique hosts found in the 65 application
graphs. Each line of the input file for the nearest neighbor algorithm contains the
true label assigned to the host and the eleven graph measures associated with that
host. Not all protocols have an equal number of training examples due to the popu-
larity and availability of certain applications in the trace files, but each protocol has
400–800 examples.
The computational load of a single test point for the nearest neighbor algorithm
is O(nd) where n is the number of training samples and d is the number of attributes.
When using 10-fold cross validation, the test set is n/10 and ten iterations are run,
making the overall complexity of this process O(n^2) if we absorb the constant d into
the expression. Although other methods exist to reduce the number of computations
necessary, RapidMiner is able to generate the model and accuracy measures for 3,940
data points in just a few seconds. Table 6.3 shows the resulting confusion matrix,
where each row is the predicted label and each column is the actual label.
Table 6.8: Attribute weights for traditional graph measures
The weights of the attributes reflect the interaction of all eleven graph measures
and are the values that maximize the accuracy of the classifier. They should there-
fore not be interpreted too literally in isolation. For example, degree centrality was
weighted with a 1.000, the highest possible weight. This does not mean that an accu-
rate classifier could be built on this attribute alone. Section 6.5 addresses this point
further. However, the table does still provide some insight as to which attributes
might be more useful when providing classification tasks. It is not surprising that the
degree counts are not weighted especially high, as they are a very generic measure.
The “periphery” attribute has a low weight because it is not a very unique measure;
out of the 3,940 profiles, 2,132 of them are periphery nodes. In contrast, only 773
nodes are central nodes, which has a higher attribute weight.
Figure 6.4 shows the per-protocol accuracy of both unweighted and weighted
profiles based on traditional graph measures.

Figure 6.4: Accuracy of unweighted vs. weighted traditional graph measure profiles

The confusion matrix for weighted attributes can be found in Appendix D. As one would hope, the weighted attribute
profiles perform slightly better than their unweighted counterpart for each protocol.
The class recall for SSH is again very low for the same reasons described previously.
6.3.2 Attribute Weights of Motif-based Measures
Because of the high dimensionality of motif-based profiles, it becomes more impor-
tant to take advantage of other methods such as genetic algorithms to explore the
attributes. Figure 6.5 depicts the ten most heavily weighted motifs and their corre-
sponding weighted values. In the figure, green nodes represent clients, black nodes
represent servers and red nodes represent peers, as specified in Definition 1. As with
the weights of traditional graph measures, these weights reflect the combined infor-
mation from all attributes.
Figure 6.5: The ten highest-weighted motifs and their corresponding weights:
(a) 1.000, (b) 0.662, (c) 0.650, (d) 0.632, (e) 0.585, (f) 0.545, (g) 0.537, (h) 0.503, (i) 0.503, (j) 0.502

Motif 6.5(a) is the most highly weighted of the significant motifs found in this study
and only occurs in two application graphs. There are 24 instances of it in an MSDS
graph and another 137 instances of it in a Netbios graph. Although weighted lower,
motif 6.5(b) occurs overwhelmingly more frequently in Netbios (1,007 instances) than
it does in MSDS (3 instances) or DNS (2 instances). If a node were to occur in these
two motifs, there would be a good chance that the host was using the Netbios
application.
Unfortunately, the weights do not indicate which particular application(s) a motif
helps to delineate, only which motifs successfully increase the overall accuracy of the
classifier. Perusing the profile data reveals that instances of many motifs are found
in several or all of the applications studied. This is not to say motif profiles are
unsuitable for describing computer networks (as they have shown a great deal of
promise already), rather that no single motif is indicative of a particular application.
Given the complexity of the highly dynamic interactions that occur in computer
networks, this is not entirely surprising. It is possible that different types of motifs
(described in Chapter 7) could be even more beneficial than the current generation
of motifs and motif profiles.
One final point to address before moving on to a comparison of traditional and
motif-based profiles is the performance of unweighted vs. weighted motif profiles,
shown in Figure 6.6. There is a slight increase in classification accuracy in each of
the protocols except for Kazaa, which sees no additional gain from attribute weight-
ing. The overall accuracy of the model increases to 85.70%, a difference of 1.63%.
Appendix D contains the confusion matrix for weighted motif profiles.
Figure 6.6: Accuracy of unweighted vs. weighted motif-based profiles
6.4 Comparison of Profile Types
This section compares the two profile types side-by-side and discusses some of the
advantages and disadvantages of each approach. The motif-based model generally
outperforms traditional graph measures, though this is not always the case as shown
in Figure 6.7. Notably, the traditional profiles significantly outperform motif-based
profiles for classifying AIM traffic, while the reverse is true for SSH (again, the SSH
results should be taken with a grain of salt because slightly less than 40% of SSH
traffic is classified by the second approach).
Weighting the profile attributes benefits traditional graph measures more than
motif descriptions.

Figure 6.7: Accuracy comparison of profile types: (a) unweighted, (b) weighted

One reason for this might be the type of data used to describe
each profile. Traditional profiles are comprised of a mixture of binary, real-valued
and integer data. In addition to being purely binary, motif profiles are also sparse;
most nodes only participate in very few of the 130 significant motifs. As a result,
many of the motif weights are multiplied by zero, resulting in no information gain.
Regardless, weighting the attributes does not change which type of classifier performs
better for a particular protocol with the exception of HTTP. Unweighted motifs have
a 4% accuracy advantage over traditional measures, but fall to a 1% disadvantage
when the profiles are weighted.
Advantages and Disadvantages of Profile Types
Motif-based profiles have a slight advantage over traditional measures in a few cate-
gories. The overall accuracy of the motif-based classifiers is higher than that of the
traditional classifiers, both unweighted and weighted. Also, motif profiles result in
more favorable overlap with other profiles. Only 10% of motif profiles do not match
another profile, and 61% match profiles of a single label (note that “match” means
a Euclidean distance of zero, not an identical profile). With traditional measures on
the other hand, 58% match a single label, and nearly 25% of profiles do not match
any other profile.
Traditional graph measures are less demanding to compute than their motif coun-
terparts. Even though some graph measures are O(n^3) where n is the order of the
graph, calculations can be performed extremely quickly because n is small in the
application graphs examined: 40 ≤ n ≤ 80. Motif searches are computationally
expensive and can be prohibitively so when searching for large motifs. This study
found that an exhaustive search of order 3 motifs could be completed in roughly 7-8
minutes, while an exhaustive search of order 4 motifs took 6-8 hours to complete.
6.5 Considerations for Optimizing Classifier Performance
There are several ways in which the performance of application classifiers may be
improved. An “on the fly” traffic classification system would need to be as fast as
possible so that network latency is minimized. One way to achieve increased classifier
speed is to reduce the dimensionality of the data. Already the evolutionary feature
weighting performed by the genetic algorithm has indicated which attributes are more
valuable to the classifier. Attributes below a certain threshold value could be ignored,
at the expense of a little bit of accuracy. Figure 6.8 demonstrates the accuracy of
models based on a single traditional graph measure.
By far, eigenvector centrality, closeness centrality and degree centrality provide
the most information to the classifier, each scoring better than 65% on its own. Most
of the attributes score no better than a random guess, which has a 1/7 chance of
being correct and is shown as a vertical dotted line in the graph. Recall that
eigenvector centrality assigns a centrality score to a vertex proportional to that of
its neighbors. This metric is more "social" in nature than some of the others in that
the centrality scores of neighboring vertices are considered in the calculation. The
idea of "distance" in an application
Figure 6.8: Accuracy of single attribute classification. (Bar chart, omitted: % correctly labeled for each attribute — eigenvector centrality, closeness centrality, degree centrality, betweenness centrality, clustering coefficient, outdegree, total degree, eccentricity, indegree, periphery, and center.)
graph is a bit tricky because it does not consider the number of hops data must go
through to reach its final destination nor the physical distance between hosts. There-
fore the “closeness” of closeness centrality describes the social usage of an application
and suggests that the average shortest path length between nodes differs somewhat
from application to application. The degree centrality is essentially a weighted degree
count, which again suggests that the size of connected components within application
graphs is important, influenced in part by the popularity of servers and services.
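The recursive definition of eigenvector centrality recalled above — a vertex's score is proportional to the sum of its neighbors' scores — can be sketched with a minimal power iteration (this mirrors the spirit of the Appendix B implementation, but the graph and code here are illustrative):

```python
import math

def eigenvector_centrality(adj):
    """Power iteration on an adjacency dict {node: [neighbors]} (undirected)."""
    nodes = list(adj)
    score = {n: 1.0 for n in nodes}
    for _ in range(100):
        # Each node accumulates its neighbors' current scores...
        update = {n: sum(score[m] for m in adj[n]) for n in nodes}
        # ...and the vector is renormalized to unit length.
        norm = math.sqrt(sum(v * v for v in update.values())) or 1.0
        score = {n: v / norm for n, v in update.items()}
    return score

graph = {
    "a": ["b", "c", "d"],  # well-connected node inside a triangle
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a"],            # pendant node
}
cent = eigenvector_centrality(graph)
```

The best-connected node "a" ends up with the highest score, and the symmetric nodes "b" and "c" score identically, which is exactly the "social" behavior the text describes.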
In addition to reducing the dimensionality of the attribute profiles, one can also
consider reducing the number of data points used in the training phase of the nearest
neighbor algorithm. Exploring the effectiveness of smaller classification models has
two important implications. First of all, it suggests that a more lightweight classifier
could be built when heading towards a real-time implementation. Secondly, it shows
that the methods proposed in this study can be used for smaller networks and not
just those containing thousands of nodes.
To test this hypothesis, several unweighted classifiers were built for each profile
type with an increasing number of nodes in each model. The data was selected at
random, while keeping the proportions of each class label the same as in the models
previously discussed. All of the test parameters are as they were before, including
the use of 10-fold cross validation to determine the accuracy. The results of this
experiment are illustrated in Figure 6.9.
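The class-proportional random selection described above can be sketched as a simple stratified sample (a sketch, not the thesis's actual sampling code; the data and labels are hypothetical):

```python
# Draw n points at random while preserving the class-label proportions.
import random
from collections import defaultdict

def stratified_sample(data, labels, n, seed=0):
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, y in zip(data, labels):
        by_label[y].append(x)
    sample = []
    for y, xs in by_label.items():
        k = round(n * len(xs) / len(data))  # proportional share for this class
        sample.extend((x, y) for x in rng.sample(xs, k))
    return sample

data = list(range(100))
labels = ["DNS"] * 60 + ["SSH"] * 40
sub = stratified_sample(data, labels, 10)  # 6 DNS points, 4 SSH points
```

Keeping the proportions fixed ensures that shrinking the training set does not also change the class balance the classifier sees.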
Figure 6.9: Comparison of profile types as the size of the training set increases. (Two line charts, omitted: (a) traditional graph measures and (b) motif-based profiles, each plotting % correctly labeled against the number of profiles, 500 to 4,000, for AIM, DNS, HTTP, Kazaa, Netbios, MSDS, and SSH.)
The classifiers tend to perform slightly better as the number of training data
points increases, but sometimes negligibly so. DNS, Kazaa and Netbios seem to
benefit the least from having additional training examples, while AIM and MSDS
fluctuate quite a bit more. It is interesting to note that the applications which were
previously classified more accurately also exhibit more stable behavior in Figure 6.9.
This is true for both profile types. For example, AIM and MSDS were by far the least
accurately classified protocols using a motif-based approach, and their trend lines
exhibit the most volatility in Figure 6.9(b). In contrast, DNS, Kazaa and Netbios were the
most accurately classified protocols, and their trend lines are nearly flat. This finding
suggests that the protocols which can be clearly described by a profile (traditional or
motif-based) can be learned with a relatively low number of training points. Further
investigation into the AIM and MSDS protocols is needed to understand why the
accuracy of AIM peaks at 2,500 nodes and then declines, while the accuracy of MSDS
peaks at 1,000 nodes and then drops significantly.
6.6 Limitations of Current Approach
This chapter has demonstrated the promise of using vertex profiles to identify appli-
cation usage across a computer network. A few of the shortcomings of the proposed
methodology have been touched upon already, but are summarized here. Graph size
is an important factor to consider, since more “interesting” vertex characterizations
arise from the complex interactions of hosts. Motif-based profiles become more de-
scriptive as hosts communicate with a larger number of other hosts. The current
generation of classification models suffers when there is heavy overlap among profiles,
that is, when many profiles lie at a Euclidean distance of zero from one another. A
more intelligent tie-breaking scheme could yield
better performance for those protocols that share application graph characteristics.
Currently, the motif-based approach only considers motifs of order 3 and order 4.
This causes a problem for protocols like SSH that tend to have a large number of
small connected components instead of fewer large connected components. Some of
the stages in the process are computationally expensive. The genetic algorithm used
for feature weighting is a very time-consuming endeavor and does not yield the de-
sired increase in performance. On the other hand, once a network is learned and a
classifier built, the attribute weights need only be computed once and can be applied
in O(n) time to the attributes collected for the test points. Additionally, the analysis
techniques put forth by this work require a view of the network that shows as many
of the interactions as possible.
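One tie-breaking scheme of the kind suggested above would be to take a majority vote over the labels of all training points tied at the minimum distance, rather than picking one arbitrarily. A minimal sketch (a hypothetical scheme, not the classifier actually used in this study):

```python
# 1-NN with a majority-vote tie-break among equally-near training points.
from collections import Counter
import math

def classify(test, train):
    """train is a list of (profile, label) pairs; ties at the minimum
    distance are resolved by the most common label among the tied points."""
    dists = [(math.dist(test, p), y) for p, y in train]
    dmin = min(d for d, _ in dists)
    tied = [y for d, y in dists if d == dmin]
    return Counter(tied).most_common(1)[0][0]

# Three training profiles are tied at distance zero; DNS wins the vote.
train = [([0, 1], "DNS"), ([0, 1], "DNS"), ([0, 1], "HTTP"), ([5, 5], "SSH")]
label = classify([0, 1], train)
```

Such a vote directly targets the zero-distance overlap cases that hurt the current models.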
Chapter 7: Conclusions and Future Work
The tasks of managing and securing computer networks are becoming increasingly
complicated due to the use of applications over non-standard port numbers as well as
the use of data encryption techniques. These practices subvert a network administrator's
ability to provide quality of service to legitimate users, to ensure compliance with
security policies, and to prevent outside intruders from gaining access to a system.
Intrusion detection systems and network monitoring tools that rely on deep packet
inspection are ineffective when data transfers are encrypted. Several previous studies
have attempted to classify network application usage by examining flow characteris-
tics pertaining to a particular series of communications between two hosts, examining
attributes such as the size of the data packets being sent, packet inter-arrival time
and session lengths.
This thesis has proposed an interdisciplinary approach to the study of networks
through the characterization of application graphs. It is an “in the dark” methodology
that relies on the communication patterns found in a network, rather than the contents
of packet payloads or port numbers used by the application. A wide variety of graph
measures heavily borrowed from social network analysis are used to create vertex
profiles that determine the application in which a host participates.
Furthermore, this work has uniquely applied motif-based analysis, used almost
exclusively in systems biology, to the study of application graphs. This method of
detecting significant subgraph patterns has shown a great deal of promise for modeling
and classifying application protocols. It has been shown that motifs can not only be
used to express communication patterns, but also to indicate the functional role of a
host. In this study, nodes were labeled as either a client, server, or peer based upon
their interactions at the transport layer. This information was used to generate motifs.
A second type of vertex profile was defined, based upon a node’s participation (or lack
thereof) in the motifs that were found to be significant across all of the application
protocols examined.
Through empirical testing, this study has shown that both types of profiles can
determine what application a host is using with a reasonable amount of accuracy.
Although some protocols like SSH and AIM present difficulties, many of the others
can be classified with greater than 80% accuracy, and in the case of weighted motif
profiles, as high as 96% for the peer-to-peer application Kazaa. In general, a motif-
based approach out-performs traditional graph measures and seems to have more
potential for related work in the future.
One issue to consider is how to best manage connected components in application
graphs that contain only two nodes. This phenomenon was found to occur frequently
in SSH, contributing to the fact that less than 40% of SSH hosts were classified by
the motif-based approach. Ignoring vertex colors, there are three possible order 2
motifs: A → B, A ← B, and A ↔ B. Unfortunately, the edge-switching operations
for creating random graphs will not provide sufficiently randomized graphs, so it is
unlikely that any particular order 2 pattern would be found statistically significant.
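Since there are only three possible order 2 patterns, a two-node component can be assigned its pattern directly, without any significance test. A minimal sketch (edge set and host names are illustrative):

```python
# Classify a two-node component into one of the three order 2 patterns.
def order2_pattern(edges, a, b):
    """edges is a set of directed (src, dst) pairs for the component."""
    ab = (a, b) in edges
    ba = (b, a) in edges
    if ab and ba:
        return "A<->B"
    return "A->B" if ab else "A<-B"

edges = {("client", "server")}
pattern = order2_pattern(edges, "client", "server")
```

Handling these tiny components explicitly would recover the SSH hosts that the order 3 and order 4 motif search currently misses.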
Currently, the only information utilized in the creation of application graphs is the
source and destination IP addresses, and the source and destination port numbers.
The motif-based approach provides some additional information by using vertex colors
to represent node types, but other information could also be exploited to color the
edges. For example, colors could be used to denote the amount of data transferred
between two nodes. This would help create more detailed profiles that might be able
to distinguish between applications that have similar connection patterns but use network
bandwidth in ways that are distinct from one another. Also related to the creation of
application graphs, it would be interesting to observe the data flow through all nodes
involved in a particular activity and not just the flow on a particular port number.
For example, a web server might request content from an application or database server
in response to a client's request for a web document; these back-end communications to
related services occur on ports other than 80, the usual HTTP port number.
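Edge coloring by traffic volume, as proposed above, could be as simple as binning the bytes observed on each edge. A sketch under assumed thresholds (the cutoffs and flow data are illustrative; real cutoffs would be chosen from the traffic distribution of the network being studied):

```python
# Bin the traffic volume between two hosts into a coarse edge color.
def edge_color(bytes_transferred):
    if bytes_transferred < 10_000:
        return "light"
    if bytes_transferred < 1_000_000:
        return "medium"
    return "heavy"

# Color each (src, dst) edge by the total bytes observed on it.
flows = {("10.0.0.1", "10.0.0.2"): 4_200,
         ("10.0.0.1", "10.0.0.3"): 8_500_000}
colors = {edge: edge_color(b) for edge, b in flows.items()}
```

Two applications with identical connection patterns but different transfer sizes would then produce differently colored motifs.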
Another area to explore is the different machine learning techniques that can be
applied to vertex profiles for classification and feature weighting; nearest-neighbor
and genetic algorithms are only two possibilities. The many parameters of these
algorithms require further tuning to optimize the classification accuracy of the models
built. This thesis describes a process which allows the substitution of particular
algorithms. For example, a Bayes classifier or support vector machine could be used
instead of nearest-neighbor, while principal component analysis could be used in place
of the genetic algorithm [67, 62].
Although not used in the current approach, temporal information could also prove
to be useful in classifying application protocols. One approach would be to encode
information such as session lengths or packet inter-arrival times into the edge colors.
Another use of time-based information would be to observe communication patterns
over a much smaller time window (on the order of seconds or minutes instead of hours)
and determine how a node’s participation in motifs changes over time.
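Observing motif participation over small time windows presupposes slicing the flow log into fixed intervals; a minimal sketch of that slicing step (the record layout `(timestamp, src, dst)` is an assumption, not the thesis's actual data format):

```python
# Group (timestamp, src, dst) flow records into fixed-width time windows.
def windows(flows, width):
    """Return {window_index: [(src, dst), ...]} for `width`-second windows."""
    out = {}
    for ts, src, dst in flows:
        out.setdefault(int(ts // width), []).append((src, dst))
    return out

flows = [(3, "a", "b"), (62, "a", "c"), (65, "b", "c"), (130, "c", "a")]
w = windows(flows, 60)  # 60-second windows
```

An application graph (and its motif counts) could then be rebuilt per window to track how a node's role drifts over time.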
Moving away from implementation details and algorithm decisions, this type of
research can be expanded outside of application identification. Assuming that the
process can be tweaked to allow a high accuracy in protocol recognition, this approach
could be used to detect anomalies in network behavior. Hosts that participate in
activities that look similar to a known application but differ more than an established
threshold value would be considered anomalous for that particular application and
trigger an alert. One final consideration is pushing this research further into the
realm of social network analysis, applying it to the detection of communities and
associations within a network, such as locating all hosts that are part of the same
online gaming community.
References
[1] P. Dyson, Dictionary of Networking. Sybex, 1999.
[2] A. S. Tanenbaum, Computer Networks. Prentice Hall, 2003.
[3] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, "Network motifs: simple building blocks of complex networks," Science, vol. 298, no. 5594, pp. 824-827, October 2002. [Online]. Available: http://dx.doi.org/10.1126/science.298.5594.824
[4] F. Rasche and S. Wernicke, "FANMOD manual," 2006.
[5] INPUT federal IT market forecast 2008-2013. [Online]. Available: http://www.input.com/corp/library/detail.cfm?itemid=5437&cmp=OTC-fedinfosecfcst08
[6] The SANS security policy project. [Online]. Available: http://www.sans.org/resources/policies/
[7] S. Christey and R. A. Martin, "CVE - vulnerability type distributions in CVE," 2007. Technical white paper on the distribution of vulnerabilities reported to CVE.
[8] Internet Assigned Numbers Authority: Assigned port numbers. [Online]. Available: http://iana.org/assignments/port-numbers
[10] Wireshark: A network protocol analyzer. [Online]. Available: http://www.wireshark.org/
[11] Snort - the de facto standard for intrusion detection/prevention. [Online]. Available: http://www.snort.org/
[12] M. E. J. Newman, "Coauthorship networks and patterns of scientific collaboration," in Proceedings of the National Academy of Science, 2004, pp. 5200-5205.
[13] S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.
[14] C. Yang and T. Ng, "Terrorism and crime related weblog social network: Link, content analysis and information visualization," Intelligence and Security Informatics, 2007 IEEE, pp. 55-58, May 2007.
[15] E. Yeger-Lotem, S. Sattath, N. Kashtan, S. Itzkovitz, R. Milo, R. Y. Pinter, U. Alon, and H. Margalit, "Network motifs in integrated cellular networks of transcription-regulation and protein-protein interaction," Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 16, pp. 5934-5939, April 2004. [Online]. Available: http://dx.doi.org/10.1073/pnas.0306752101
[16] S. S. Shen-Orr, R. Milo, S. Mangan, and U. Alon, "Network motifs in the transcriptional regulation network of Escherichia coli," Nat Genet, vol. 31, no. 1, pp. 64-68, May 2002. [Online]. Available: http://dx.doi.org/10.1038/ng881
[17] J. Grochow and M. Kellis, "Network motif discovery using subgraph enumeration and symmetry-breaking," 2007, pp. 92-106.
[18] U. Alon, "Network motifs: Theory and experimental approaches," Nature Reviews Genetics, vol. 8, no. 6, pp. 450-461, Jun. 2007.
[19] J. Day and H. Zimmermann, "The OSI reference model," Proceedings of the IEEE, vol. 71, no. 12, pp. 1334-1340, Dec. 1983.
[20] V. Cerf and R. Kahn, "A protocol for packet network intercommunication," Communications, IEEE Transactions on [legacy, pre-1988], vol. 22, no. 5, pp. 637-648, May 1974.
[21] L. Euler, "Solutio problematis ad geometriam situs pertinentis," in Commentarii academiae scientiarum imperialis Petropolitanae. St. Petersburg Academy, 1736, vol. 8.
[22] R. G. Busacker and T. L. Saaty, Finite Graphs and Networks, ser. International Series in Pure and Applied Mathematics. McGraw-Hill, 1965.
[23] G. Chartrand and P. Zhang, Introduction to Graph Theory. McGraw-Hill, 2005.
[24] A. A. Nanavati, R. Singh, D. Chakraborty, K. Dasgupta, S. Mukherjea, G. Gurumurthy, and A. Joshi, "Analyzing the Structure and Evolution of Massive Telecom Graphs," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 5, pp. 703-718, March 2008.
[25] E. W. Dijkstra, "A note on two problems in connexion with graphs," Numerische Mathematik, vol. 1, pp. 269-271, 1959.
[26] L. C. Freeman, "Centrality in social networks: conceptual clarification," Social Networks, vol. 1, no. 3, pp. 215-239.
[27] P. Bonacich, "Technique for analyzing overlapping memberships," Sociological Methodology, 1972.
[28] M. E. J. Newman, Mathematics of Networks. Palgrave Macmillan, 2008.
[29] D. J. Watts and S. H. Strogatz, "Collective dynamics of 'small-world' networks," Nature, vol. 393, 1998.
[30] S. Cheung, R. Crawford, M. Dilger, J. Frank, J. Hoagland, K. Levitt, J. Rowe, S. Staniford-Chen, R. Yip, and D. Zerkle, "GrIDS - a graph-based intrusion detection system for large networks," in Proceedings of the 19th National Information Systems Security Conference, 1996, pp. 361-370.
[31] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, "BLINC: multilevel traffic classification in the dark," in SIGCOMM '05: Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications. New York, NY, USA: ACM, 2005, pp. 229-240.
[32] S. Wernicke and F. Rasche, "FANMOD: a tool for fast network motif detection," Bioinformatics, vol. 22, no. 9, pp. 1152-1153, 2006.
[33] R. Itzhack, Y. Mogilevski, and Y. Louzoun, "An optimal algorithm for counting network motifs," Physica A, vol. 381, pp. 482-490, Jul. 2007.
[34] S. Mangan, A. Zaslaver, and U. Alon, "The coherent feedforward loop serves as a sign-sensitive delay element in transcription networks," Journal of Molecular Biology, vol. 334, no. 2, pp. 197-204, November 2003.
[35] Tcpdump/libpcap public repository. [Online]. Available: http://www.tcpdump.org/
[36] J. Postel, "Internet Protocol," RFC 791 (Standard), Sep. 1981, updated by RFC 1349. [Online]. Available: http://www.ietf.org/rfc/rfc791.txt
[37] J. Postel, "Transmission Control Protocol," RFC 793 (Standard), Sep. 1981, updated by RFC 3168. [Online]. Available: http://www.ietf.org/rfc/rfc793.txt
[38] J. Postel, "User Datagram Protocol," RFC 768 (Standard), Aug. 1980. [Online]. Available: http://www.ietf.org/rfc/rfc768.txt
[39] R. Pang, M. Allman, V. Paxson, and J. Lee, "The devil and packet trace anonymization," ACM Computer Communication Review, vol. 36, no. 1, pp. 29-38, January 2006. [Online]. Available: http://www.icir.org/mallman/papers/devil-ccr-jan06.pdf
[40] G. Iannaccone, C. Diot, I. Graham, and N. McKeown, "Monitoring very high speed links," in IMW '01: Proceedings of the 1st ACM SIGCOMM Workshop on Internet Measurement. New York, NY, USA: ACM, 2001, pp. 267-271.
[41] T. Henderson, D. Kotz, and I. Abyzov, "The changing usage of a mature campus-wide wireless network," Computer Networks, vol. In Press, Accepted Manuscript. [Online]. Available: http://dx.doi.org/10.1016/j.comnet.2008.05.003
[42] P. Gill, M. Arlitt, Z. Li, and A. Mahanti, "YouTube traffic characterization: a view from the edge," in IMC '07: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement. New York, NY, USA: ACM, 2007, pp. 15-28.
[43] E. Blanton. (2008, January) tcpurify. [Online]. Available: http://irg.cs.ohiou.edu/~eblanton/tcpurify/
[44] T. Gamer, C. P. Mayer, and M. Scholler, "PktAnon - A Generic Framework for Profile-based Traffic Anonymization," PIK Praxis der Informationsverarbeitung und Kommunikation, vol. 2, pp. 67-81, Jun. 2008.
[45] D. Koukis, S. Antonatos, D. Antoniades, E. P. Markatos, P. Trimintzios, and M. Fukarakis, "CRAWDAD tool tools/sanitize/generic/anontool (v. 2006-09-26)," Downloaded from http://crawdad.cs.dartmouth.edu/tools/sanitize/generic/AnonTool, Sep. 2006.
[47] MIT Lincoln Laboratory: 1999 DARPA Intrusion Detection Evaluation Data Set. [Online]. Available: http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/data/1999data.html
[48] D. Kotz, T. Henderson, and I. Abyzov, "CRAWDAD trace dartmouth/campus/tcpdump/fall03 (v. 2004-11-09)," Downloaded from http://crawdad.cs.dartmouth.edu/dartmouth/campus/tcpdump/fall03, Nov. 2004.
[49] R. Chandra, R. Mahajan, V. Padmanabhan, and M. Zhang, "CRAWDAD data set microsoft/osdi2006 (v. 2007-05-23)," Downloaded from http://crawdad.cs.dartmouth.edu/microsoft/osdi2006, May 2007.
[50] OSCAR protocol. [Online]. Available: http://dev.aol.com/aim/oscar/
[51] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee, "Hypertext Transfer Protocol - HTTP/1.1," RFC 2616 (Standard), Jun. 1999. [Online]. Available: http://www.ietf.org/rfc/rfc2616.txt
[52] P. V. Mockapetris, "Domain names - implementation and specification," RFC 1035 (Standard), United States, 1987. [Online]. Available: http://www.ietf.org/rfc/rfc1035.txt
[53] Active Directory. [Online]. Available: http://www.microsoft.com/windowsserver2008/en/us/active-directory.aspx
[54] R. Marty. AfterGlow. [Online]. Available: http://www.afterglow.sourceforge.net/
[55] A. A. Hagberg, D. A. Schult, and P. J. Swart, "Exploring network structure, dynamics, and function using NetworkX," in Proceedings of the 7th Python in Science Conference (SciPy2008), Pasadena, CA USA, Aug. 2008, pp. 11-15.
[56] N. Kashtan, S. Itzkovitz, R. Milo, and U. Alon, "Mfinder tool guide," 2002.
[57] F. Schreiber and H. Schwobbermeyer, "MAVisto: a tool for the exploration of network motifs," Bioinformatics applications note, structural bioinformatics, 2005.
[58] W. de Nooy, A. Mrvar, and V. Batagelj, Exploratory Social Network Analysis with Pajek. Cambridge University Press, 2005.
[59] S. Wernicke, "Efficient detection of network motifs," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 3, no. 4, pp. 347-359, 2006.
[60] W. Mendenhall and R. J. Beaver, Introduction to Probability and Statistics, 8th ed. PWS-Kent Publishing Company, 1991.
[61] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, "YALE: rapid prototyping for complex data mining tasks." New York, NY, USA: ACM, 2006, pp. 935-940.
[62] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison Wesley, 2006.
[63] G. F. Luger, Artificial Intelligence: Structures and Strategies for Complex Problem Solving, 5th ed. Addison Wesley, 2005.
[64] M. Mitchell, An Introduction to Genetic Algorithms. MIT Press, 1998.
[65] J. J. Grefenstette, "Optimization of control parameters for genetic algorithms," IEEE Transactions on Systems, Man and Cybernetics, vol. 16, no. 1, pp. 122-128, Jan. 1986.
[66] J. D. Schaffer, R. A. Caruana, L. J. Eshelman, and R. Das, "A study of control parameters affecting online performance of genetic algorithms for function optimization," in Proceedings of the Third International Conference on Genetic Algorithms. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1989, pp. 51-60.
[67] D. Lay, Linear Algebra and Its Applications, 2nd ed. Addison Wesley, 2000.
Listing B.1: tshark2mysql.py - stores pcap data into a MySQL database

#!/usr/bin/python
# This file parses tshark output from stdin and inserts it into a MySQL
# database. It assumes the database has already been created and
# will create the necessary table.
#
# The tshark command is:
#   tshark -t e -r <pcap file> tcp or udp

import sys
import MySQLdb

if len(sys.argv) != 2:
    sys.exit("Supply name of table to store data in\n")

try:
    conn = MySQLdb.connect(host="localhost",
                           user="root",
                           passwd="pass",
                           db="data")
except MySQLdb.Error, e:
    sys.exit("Error %d: %s" % (e.args[0], e.args[1]))

cursor = conn.cursor()
cursor.execute("DROP TABLE IF EXISTS %s" % sys.argv[1])
cursor.execute("""CREATE TABLE %s (
    id INT(11) NOT NULL AUTO_INCREMENT,
    ts DOUBLE NOT NULL DEFAULT '0.0',
    protocol VARCHAR(12) NOT NULL,
    sip VARCHAR(15) NOT NULL,
    sport INT(5) NOT NULL DEFAULT '0',
    dip VARCHAR(15) NOT NULL,
    dport INT(5) NOT NULL DEFAULT '0',
    length INT(11) NOT NULL DEFAULT '0',
    PRIMARY KEY id (id)
);""" % sys.argv[1])

rc = 0
while True:
    line = sys.stdin.readline()
    if not line: break
    v = line.split(' ')
    tmp = []
    for i in range(len(v)):
        if v[i] not in ('', '->', ' '):
            tmp.append(v[i])
    v = tmp
    if len(v) == 8:
        try:
            ts = float(v[1])
            sip = v[2]
            dip = v[3]
            sport = int(v[4])
            dport = int(v[5])
            proto = v[6]
            length = int(v[7][:-2])  # strip off the newline characters
            # VALUES clause reconstructed; it was truncated in the original listing
            sql = "INSERT INTO %s (ts, protocol, sip, sport, dip, dport, length) " \
                  "VALUES (%f, '%s', '%s', %d, '%s', %d, %d)" % \
                  (sys.argv[1], ts, proto, sip, sport, dip, dport, length)
            try:
                cursor.execute(sql)
                rc += cursor.rowcount
            except MySQLdb.Error, e:
                print "Error [%d]: %d: %s" % (rc, e.args[0], e.args[1])
        except Exception, e:
            print "ERROR: ", v

cursor.close()
conn.commit()
conn.close()

print("\n%d rows inserted into %s\n" % (rc, sys.argv[1]))
Listing B.2: graph_utils.py - implementation of adjacency matrix conversion and eigenvector centrality using the NetworkX API

import networkx as NX
import math

def adj_matrix(G):
    """
    Function takes a networkx.Graph as an argument (undirected)
    and returns a list of lists representing the corresponding
    adjacency matrix. It can be referenced as you would
    a normal 2D matrix A[i][j]

    node IDs must be [1..G.order()] (taken care of in eigenvector_centrality())
    """
    adj = []
    for n in G.nodes():
        row = []
        for m in range(len(G.nodes()) + 1): row.append(0)
        for m in NX.neighbors(G, n): row[m] = 1
        adj.append(row)
    # Get rid of first element of each row (nodes start at 1, adj is 0-based)
    for i in range(len(adj)): adj[i] = adj[i][1:]
    return adj

def eigenvector_centrality(G):
    """
    Function takes an undirected graph (Graph or XGraph) and
    returns a dictionary of eigenvector centralities, keyed
    by node ID (similar to centrality functions in networkx)

    Function will map node labels to integers [1..G.order()]

    Algorithm adapted from: http://www.analytictech.com/networks/centaids.htm
    """
    eigenvector_centralities = {}
    evCentrality = []
    evUpdate = []
    maxValue = -1.0

    for i in range(G.order()):
        evCentrality.append(1.0)
        evUpdate.append(0.0)

    H = NX.convert_node_labels_to_integers(G, first_label=1, discard_old_labels=False)
    labels = {}
    for k, v in H.node_labels.iteritems(): labels[v] = k

    A = adj_matrix(H)

    # 30 iterations should be enough to converge to a solution
    for x in range(30):
        for i in range(G.order()):
            evUpdate[i] = 0.0
            for j in range(G.order()):
                if (A[i][j] != 0): evUpdate[i] += evCentrality[j]
        maxValue = 0
        for i in range(G.order()):
            maxValue += evUpdate[i] * evUpdate[i]
        maxValue = math.sqrt(maxValue)
        for i in range(G.order()):
            evCentrality[i] = evUpdate[i] / maxValue
    for i in range(1, G.order() + 1):
        eigenvector_centralities[labels[i]] = evCentrality[i - 1]

    return eigenvector_centralities
Listing B.3: node_props_main.py - creates application graphs from MySQL database and computes traditional graph metrics using the NetworkX API

#!/usr/bin/python
"""
This program creates a DiGraph and calculates various graph metrics,
converting DiGraph to Graph as necessary for some metrics

Usage:
    arg1 = table name
    arg2 = port number
    arg3 = max # of nodes to consider
"""
import sys
import networkx as NX
import MySQLdb
from graph_utils import *

class Node:
    """Class to hold properties of nodes"""
    in_degree = 0
    out_degree = 0
    degree = 0
    clustering = 0
    betweenness_centrality = 0
    degree_centrality = 0
    closeness_centrality = 0
    eigenvector_centrality = 0
    eccentricity = 0
    is_center = 0
    is_periphery = 0

if len(sys.argv) != 4:
    sys.exit("Provide table name, port number, and # nodes at command line\n")

table = sys.argv[1]
port = sys.argv[2]
n_max = int(sys.argv[3])

# MySQL connection
try: conn = MySQLdb.connect(host="localhost", user="root", passwd="pass", db="data")
except MySQLdb.Error, e: sys.exit("Error %d: %s" % (e.args[0], e.args[1]))
cursor = conn.cursor()

sql = "SELECT sip, dip, sport, dport FROM %s WHERE sport=%s OR dport=%s" % (table, port, port)
try: cursor.execute(sql)
except MySQLdb.Error, e: sys.exit("Error %d: %s" % (e.args[0], e.args[1]))

# Create a directed graph from SQL results
G = NX.DiGraph(name="%s_%s" % (port, n_max))
for i in range(cursor.rowcount):
    r = cursor.fetchone()
    if r[0] in G and r[1] in G: G.add_edge(r[0], r[1])
    else:
        if G.order() < n_max: G.add_node(r[0])
        if G.order() < n_max: G.add_node(r[1])
        if r[0] in G and r[1] in G: G.add_edge(r[0], r[1])

# Calculate graph properties
myNodes = {}
for n in G.nodes():
    myN = Node()
    # Basic properties
    myN.degree = G.degree(n)
    myN.out_degree = G.out_degree(n)
    myN.in_degree = G.in_degree(n)
    myNodes[n] = myN

"""
The following measures are all based on undirected graphs, and are
computed separately for each of the connected components.
"""
H = G.to_undirected()
CCS = NX.connected_component_subgraphs(H)
for i in range(len(CCS)):
    if CCS[i].order() >= 2:
        cl = NX.clustering(CCS[i], with_labels=True)
        for k, v in cl.iteritems(): myNodes[k].clustering = v

        bc = NX.betweenness_centrality(CCS[i])
        for k, v in bc.iteritems(): myNodes[k].betweenness_centrality = v

        dc = NX.degree_centrality(CCS[i])
        for k, v in dc.iteritems(): myNodes[k].degree_centrality = v

        cc = NX.closeness_centrality(CCS[i])
        for k, v in cc.iteritems(): myNodes[k].closeness_centrality = v

        ec = eigenvector_centrality(CCS[i])
        for k, v in ec.iteritems(): myNodes[k].eigenvector_centrality = v

        d = NX.diameter(CCS[i])
        r = NX.radius(CCS[i])
        ecc = NX.eccentricity(CCS[i], with_labels=True)
        for k, v in ecc.iteritems():
            myNodes[k].eccentricity = v
            if v == d: myNodes[k].is_periphery = 1
            if v == r: myNodes[k].is_center = 1
    else: pass

# Print results
for k, v in myNodes.iteritems():
    s = ""
    s += "%s,%d,%d,%d,%f,%f,%f,%f,%f,%d,%d,%d" % (port, v.in_degree,
        v.out_degree, v.degree, v.clustering, v.betweenness_centrality,
        v.degree_centrality, v.closeness_centrality, v.eigenvector_centrality,
        v.eccentricity, v.is_periphery, v.is_center)
    print s
"""
This program reads FANMOD result files and looks for significant motifs.
It associates each motif with an identifying integer ID and pickles
the results for later use
"""

import pickle
import glob
import pprint

filedir = "/home/eddie/research/fanmod/res_csvs/"
files = glob.glob('/home/eddie/research/fanmod/res_csvs/*.txt')

size3 = {}      # mapping for size 3 motifs
id3 = 0         # first ID for size 3
size4 = {}      # mapping for size 4 motifs
id4 = 0         # first ID for size 4
p_thresh = 0.0  # get motifs with p-value <= p_thresh
pct_occ = 1.0   # get motifs with frequency >= pct_occ

# Iterate through files and make ID associations
for i in range(len(files)):
    inFile = files[i]
    msize = int(inFile[-14])  # Motif size is stored in filename
    f = open(inFile, 'r')
    file = []
    for l in f:
        l = l[:-1]
        if len(l) > 1: file.append(l)

    # Ignore stuff at top of file
    file = file[24:]

    for j in range(0, len(file), msize):
        adjMatrix = ""
        l1 = file[j].split(',')
        if (float(l1[6]) <= p_thresh) and (float(l1[2][:-1]) >= pct_occ):
            # If this is a significant motif...
            adjMatrix += l1[1]
            for k in range(1, msize):
                adjMatrix += file[j+k].split(',')[1]
            if msize == 3 and adjMatrix not in size3.values():
                size3[id3] = adjMatrix
                id3 += 1
            if msize == 4 and adjMatrix not in size4.values():
                size4[id4] = adjMatrix
                id4 += 1

    f.close()  # Close file handle

# Pickle the resulting dictionaries
s3 = open('s3map.pkl', 'w')
pickle.dump(size3, s3)
s3.close()
s4 = open('s4map.pkl', 'w')
pickle.dump(size4, s4)
s4.close()
"""
This file reads the pickled s3 and s4 maps and creates the binary
motif participation profiles for the NN clustering
"""

import sys
import pickle

from string import split
from pprint import pprint
from glob import glob

class profile:
    """Instances of profiles"""
    def __init__(self, id, l):
        self.ID = id
        self.label = l
        self.a = []
        for i in range(len(s3map) + len(s4map)):
            self.a.append(0)

    def mark(self, m):
        try: self.a[adjM[m]] = 1
        except KeyError: pass  # insignificant motif, not in our dict

# Unpickle adjMatrix mapping
s3 = open('s3map_1pct.pkl', 'r')
s3map = pickle.load(s3)
s3.close()
s4 = open('s4map_1pct.pkl', 'r')
s4map = pickle.load(s4)
s4.close()

# Create dictionary for adjMatrix mapping
adjM = {}
idx = 0
for k, v in s3map.iteritems():
    adjM[v] = idx
    idx += 1
for k, v in s4map.iteritems():
    adjM[v] = idx
    idx += 1

seen = {}  # dict for nodes
files = glob('/home/eddie/research/fanmod/data_new_lc/dumpfiles/*')
for i in range(len(files)):
#for i in range(3, 4):
    # Open file for reading
    f = open(files[i], 'r')
    # need a unique prefix since we will have multiple node 0, 1, 2, etc...
    prefix = (files[i].split("/")[7]).split("_")[:-2]
    tmp = ""
    for j in range(len(prefix)): tmp += prefix[j] + "_"
    prefix = tmp
    label = prefix.split("_")[-2]
    for lines in f:
        l = lines.split(",")
        # ignore header lines in dump files
        if len(l) > 2:
            for j in range(1, len(l)):
                myNode = prefix + str(int(l[j]))
                if myNode not in seen:
                    seen[myNode] = profile(myNode, label)
                seen[myNode].mark(str(l[0]))

    f.close()  # close file handle

for k, v in seen.iteritems():
    print v.ID, v.label,
    for i in range(len(v.a)):
        if i < len(v.a)-1: print v.a[i],
        else: print v.a[i]
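The profile construction above reduces to a small, self-contained idea: a binary vector with one slot per significant motif ID, where `mark()` sets the slot for each motif a node participates in and silently ignores motifs that were not significant. The sketch below uses made-up adjacency-string keys in place of the pickled maps:

```python
# Illustrative stand-in for the adjM mapping built from the
# pickled s3/s4 maps: adjacency string -> vector index.
adjM = {"011001000": 0, "010001100": 1, "0111000010001000": 2}

class Profile(object):
    """Binary motif participation profile for one vertex."""
    def __init__(self, node_id, label):
        self.ID = node_id
        self.label = label
        self.a = [0] * len(adjM)   # one slot per significant motif

    def mark(self, m):
        try:
            self.a[adjM[m]] = 1
        except KeyError:
            pass  # insignificant motif, not in our dictionary

# Hypothetical node name and protocol label.
p = Profile("80_node7", "http")
p.mark("011001000")   # significant motif: slot 0 set
p.mark("deadbeef")    # unknown motif: ignored
```

Each resulting vector, tagged with its protocol label, is what the nearest-neighbor clustering consumes.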
Appendix C: Test Parameters
Parameter [default] Value
subgraph (motif) size [default: 3] 3/4
# of samples used to determine approx. # of subgraphs [100000] 100000
full enumeration? 1(yes)/0(no) [1] 1 (yes)
directed? 1(yes)/0(no) [1] 1 (yes)
colored vertices? 1(yes)/0(no) [0] 1 (yes)
colored edges? 1(yes)/0(no) [0] 0 (no)
random type: 0(no regard)/1(global const)/2(local const) [2] 2
Listing C.1: GA_weights.xml – RapidMiner process parameters for genetic algorithm and 1-NN classification
<?xml version="1.0" encoding="UTF-8"?>
<process version="4.2">
  <operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSource" class="ExampleSource">
      <parameter key="attributes" value="/home/eddie/research/fanmod/motif_analysis/weighted/motifs.aml"/>
    </operator>
    <operator name="EvolutionaryWeighting" class="EvolutionaryWeighting" expanded="yes">
      <parameter key="crossover_type" value="shuffle"/>
      <parameter key="keep_best_individual" value="true"/>
      <parameter key="p_crossover" value="0.6"/>
      <parameter key="population_size" value="20"/>
      <parameter key="tournament_size" value="0.2"/>
      <operator name="WeightingChain" class="OperatorChain" expanded="yes">
        <operator name="XValidation" class="XValidation" expanded="yes">
          <parameter key="keep_example_set" value="true"/>
          <operator name="NearestNeighbors" class="NearestNeighbors">
            <parameter key="keep_example_set" value="true"/>
          </operator>
          <operator name="ApplierChain" class="OperatorChain" expanded="yes">
            <operator name="ModelApplier" class="ModelApplier">
              <list key="application_parameters">
              </list>
              <list key="prediction_parameters">
              </list>
            </operator>
            <operator name="Performance" class="Performance"></operator>
            <operator name="PerformanceWriter" class="PerformanceWriter">
              <parameter key="performance_file" value="/home/eddie/research/fanmod/motif_analysis/weighted/motifs.per"/>
            </operator>
          </operator>
        </operator>
      </operator>
    </operator>
  </operator>
</process>
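The evaluation this process encodes can be approximated in plain Python. The sketch below is a simplified leave-one-out cross-validation of a 1-nearest-neighbor classifier over weighted feature vectors; it is an illustration of the technique, not the RapidMiner implementation, and the data and weights are made up:

```python
def one_nn(train, query, weights):
    """Return the label of the training point nearest to query
    under a weighted squared Euclidean distance."""
    best, best_d = None, float('inf')
    for vec, label in train:
        d = sum(w * (a - b) ** 2 for w, a, b in zip(weights, vec, query))
        if d < best_d:
            best, best_d = label, d
    return best

# Hypothetical two-feature profiles with protocol labels.
data = [([0, 0], 'http'), ([0, 1], 'http'), ([5, 5], 'smtp'), ([5, 4], 'smtp')]
weights = [1.0, 1.0]   # feature weights the GA would evolve

# Leave-one-out accuracy: classify each point using the others.
correct = sum(one_nn(data[:i] + data[i+1:], vec, weights) == label
              for i, (vec, label) in enumerate(data))
accuracy = correct / float(len(data))
```

In the actual process, `EvolutionaryWeighting` searches over the weight vector and uses the cross-validated accuracy as its fitness function.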
• Allan, Edward G., Horvath, Michael R., Kopek, Christopher V., Lamb, Brian T., Whaples, Thomas S., and Berry, Michael W.: Anomaly Detection Using Nonnegative Matrix Factorization, Survey of Text Mining II, Springer, 203–217, 2008
Experience
• Research Assistant
Wake Forest University, Winston-Salem, NC
August 2007 – December 2008
Worked with Dr. Errin Fulp on various projects in computer security. Researched topics in computer networks leading to master's thesis. Assisted in classroom and lab duties for networking class.
• Software Development Intern
GreatWall Systems, Inc., Winston-Salem, NC
June 2007 – August 2007
Designed and programmed a testing platform for new high-speed firewall product.
Implemented portions of firewall software in the Python programming language toallow firewall policies to be swapped in place with no gap in coverage.
• Intern – R&D team
Tenable Network Security, Columbia, MD
June 2006 – August 2006
Developed, implemented, and tested Nessus vulnerability scanner plugins. Implemented code for the Tenable Log Correlation Engine product using a proprietary language. Analyzed, assessed, and scored software vulnerabilities according to the Common Vulnerability Scoring System for use with the Nessus vulnerability scanner.
Honors
• Inducted into the Upsilon Pi Epsilon honor society in 2005
• Graduated cum laude from Wake Forest University in 2006
• 2nd place in the 2007 SIAM text mining competition