IDENTIFYING APPLICATION PROTOCOLS IN COMPUTER NETWORKS USING VERTEX PROFILES
By
Edward G. Allan, Jr.
A Thesis Submitted to the Graduate Faculty of
WAKE FOREST UNIVERSITY
in Partial Fulfillment of the Requirements
for the Degree of
MASTER OF SCIENCE
in the Department of Computer Science
December 2008
Winston-Salem, North Carolina
Approved By:
Errin W. Fulp, Ph.D., Advisor
Examining Committee:
David J. John, Ph.D., Chairperson
William H. Turkett, Jr., Ph.D.
Acknowledgements
This thesis is the product of many people's labors, not just my own. The ideas contained in the pages that follow have been formulated and refined for over a year, with the guidance and support of several people, whose assistance I would be remiss not to mention. I would like to thank Wake Forest University and GreatWall Systems, Inc. for their support. This research was funded by GreatWall Systems, Inc. via the United States Department of Energy STTR grant DE-FG02-06ER86274. 1
I would also like to thank my parents for their support throughout my years at Wake Forest, both as an undergraduate and as a graduate student. Without their encouragement and financial assistance, none of this would have been possible. I also would not be where I am today without the help of my friends, who have made these past several years some of the most enjoyable and most memorable yet.
My thesis committee members, Dr. David John and Dr. William Turkett, Jr., were instrumental in providing me with feedback throughout the research and writing process. Their comments and criticism have undoubtedly enabled the success of this endeavor. I would especially like to thank Dr. Turkett for selflessly spending hours assisting me and stepping in as my "adopted advisor" during Dr. Errin Fulp's sabbatical.
Last, but certainly not least, I must thank my advisor, Dr. Errin Fulp. I have been fortunate to work with him in a variety of contexts for more than five years now, and he has been a tremendous influence on both my personal and academic development. His relaxed personality and great sense of humor kept me off-task just enough to save my sanity, while his insight and guidance allowed me to complete my studies and be ready to move on to the next chapter in my life. Many thanks again to all who have helped me along the way — you are much appreciated.
1 The views and conclusions contained herein are those of the author and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the DOE or the U.S. Government.
SANS - SysAdmin, Audit, Networking, and Security
SMTP - Simple Mail Transfer Protocol
SSH - Secure Shell
TCP - Transmission Control Protocol
UDP - User Datagram Protocol
VoIP - Voice over IP
Symbols
| V | is the number of vertices in a graph
e_ij is an edge from vertex i to vertex j
deg(v) is the degree of vertex v
id(v) is the indegree of vertex v
od(v) is the outdegree of vertex v
N(v) is the set of nodes in the neighborhood of vertex v
e(v) is the eccentricity of vertex v
rad(G) is the radius of graph G
diam(G) is the diameter of graph G
d(u, v) is the distance between vertex u and vertex v
C_D(v) is the degree centrality of vertex v
C_B(v) is the betweenness centrality of vertex v
C_C(v) is the closeness centrality of vertex v
x_i is the eigenvector centrality of vertex i
C(v) is the clustering coefficient of vertex v
φ is a port number associated with an application (e.g., 80 for HTTP)
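Several of the symbols above can be made concrete with a short sketch. The following Python snippet is illustrative only; the four-vertex directed graph is hypothetical and not drawn from this thesis. It computes degree counts, distances via breadth-first search, eccentricity, radius, and diameter.

```python
from collections import deque

# Hypothetical directed graph as an adjacency list: e_ij means j is in adj[i].
adj = {1: [2], 2: [3], 3: [1, 4], 4: [2]}
vertices = sorted(adj)

def od(v):
    return len(adj[v])                               # od(v): outdegree

def indeg(v):
    return sum(v in out for out in adj.values())     # id(v): indegree

def deg(v):
    return indeg(v) + od(v)                          # deg(v): total degree

def distances(u):
    """d(u, v) for every v reachable from u, via breadth-first search."""
    dist, queue = {u: 0}, deque([u])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist

def ecc(v):
    return max(distances(v).values())                # e(v): eccentricity

rad = min(ecc(v) for v in vertices)                  # rad(G): radius
diam = max(ecc(v) for v in vertices)                 # diam(G): diameter
```

For this graph, ecc(1) = 3 (the longest shortest path from vertex 1 reaches vertex 4 in three hops), so rad(G) = 2 and diam(G) = 3.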
Abstract
Edward G. Allan, Jr.
Identifying Application Protocols in Computer
Networks Using Vertex Profiles
Thesis under the direction of Errin W. Fulp, Ph.D., Associate Professor of Computer Science
Security and management of computer network resources exemplify two critical activities that challenge system administrators. They face potential threats from outside intruders as well as internal users who already have access to the organization's assets. It is imperative that administrators are aware of what applications are being executed, but the use of data encryption techniques and non-standard port numbers presents difficulties that must be overcome.
To that end, this thesis introduces a novel method to identify application protocols based on the analysis of application graphs, which model application-level communications between computers. The performance of two types of node descriptions, called vertex profiles, is compared. "Traditional" vertex profiles characterize each node using several well-studied graph measures. Furthermore, this work uniquely applies motif-based analysis, which has previously been used primarily in systems biology, to the study of application graphs by creating a second type of vertex profile based on a node's participation in statistically significant motifs. Machine learning techniques are employed to evaluate the importance of specific profile features. The experimental results, using a nearest-neighbor classifier, show that this type of analysis can correctly classify the applications observed with greater than 80% accuracy.
Chapter 1: Introduction
Managing and securing today’s critical data networks is a daunting and expensive
task. According to INPUT [5], demand for vendor-furnished information systems
and services by the U.S. government will increase from $71.9 billion in 2008 to $87.8
billion in 2013. This money funds such tasks as system modernization, information
sharing, IT management and information security. As computer networks increase in
size, speed and complexity, and malicious hackers develop more sophisticated attacks,
traditional methods of managing and securing these networks begin to break down.
This thesis proposes a novel approach to identifying the actions of hosts within a
network by examining the properties of application graphs, which model the social and
functional interactions of hosts with one another at the software application level (e.g.
HTTP, FTP, etc.). With the aid of machine learning techniques and algorithms, this
method exploits graph characteristics of each host in the application graph, such as its
connectedness, its position in the graph and the shapes of the subgraphs in which it is
found. One distinct advantage to this approach is that classification can be performed
“in the dark”, meaning that the packet payloads are either unavailable or have been
encrypted, rendering deep packet inspection futile. Knowing what activities users on
the network are participating in is crucial to network administrators who must manage
bandwidth allocations, network configurations, performance and security and access
policies. The following sections of this chapter provide background information and
motivation for the study.
1.1 Issues in Network Management and Security
To protect itself from litigation and to help ensure the integrity of its network, an
organization (such as a school, business, or government) will often develop an
Acceptable Use Policy, or AUP. An AUP defines what behaviors are acceptable for internet
browsing, what applications can be run by users and other relevant guidelines for
usage. The SANS Security Policy Project [6] provides several resources and tem-
plates for such policies. Take, for example, a policy that does not allow users to run
a personal web server using an organization’s computing resources. Identifying such
behavior can help to preserve network bandwidth that is otherwise used for legitimate
business activities.
Not only can failure to comply with an organization’s AUP waste computing
resources, it can also have serious security implications as well. Continuing with
the example above, running an improperly configured web server or hosting insecure
web application files gives an attacker an easy point of entry into the network. A
study performed by MITRE from 2001-2006 notes a sharp increase in the number of
public reports for vulnerabilities that are specific to web applications [7]. For several
years buffer overflow attacks had been the most common, but were overtaken in 2005
by web application vulnerabilities such as SQL injection, cross-site scripting (XSS)
and remote file inclusion. It is, therefore, in a network administrator’s best interest
to ensure that the network is properly utilized in accordance with the policies and
guidelines adopted by the organization.
1.2 Current Methods of Network Analysis
Several tools allow system administrators to determine which applications are being
used on a network. This information assists them in the maintenance and protection
of networked systems. Sophisticated users, however, are able to hide their activities,
which could potentially include actions that are against the organization’s AUP, or
worse yet, are illegal. This section examines a few of the tools used by administrators
and identifies some of their weaknesses.
1.2.1 Applications and Port Numbers
When data is sent to a computer over a network, the destination port number identifies
which application on the host computer should receive and process the data. Many
applications use port numbers specified by the Internet Assigned Numbers Authority
[8]. For example, FTP servers use ports 20 and 21, while web servers use port 80
by default. NetStat is a command line tool that shows information about network
connections, both incoming and outgoing [9]. Figure 1.1 demonstrates the output of
The attributes a_1 through a_d can be any numerical data type or numerical
representation of a data type. In the traditional graph analysis approach there are eleven
attributes (degree counts, centrality measures, etc.), so d = 11. These attributes
include integers, real numbers and boolean values represented as a 1 (true) or a 0
(false). The intent is to associate an application with a certain profile.
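To make this concrete, a traditional vertex profile can be flattened into a numeric vector. The sketch below is a hypothetical illustration: the eleven attribute names follow the measures discussed in this thesis, but their exact order and the sample values are assumptions, not taken from the original experiments.

```python
# A hypothetical traditional vertex profile with d = 11 attributes.
ATTRIBUTES = [
    "indegree", "outdegree", "total_degree",            # integer counts
    "eccentricity",                                     # integer
    "is_center", "is_periphery",                        # booleans stored as 1/0
    "degree_centrality", "closeness_centrality",        # real-valued measures
    "betweenness_centrality", "eigenvector_centrality",
    "clustering_coefficient",
]

def make_profile(**measures):
    """Flatten a host's graph measures into a numeric attribute vector."""
    return [float(measures[name]) for name in ATTRIBUTES]

profile = make_profile(
    indegree=3, outdegree=1, total_degree=4, eccentricity=2,
    is_center=1, is_periphery=0,
    degree_centrality=0.31, closeness_centrality=0.52,
    betweenness_centrality=0.08, eigenvector_centrality=0.44,
    clustering_coefficient=0.0,
)
assert len(profile) == 11   # d = 11 in the traditional approach
```

Casting every attribute to a float allows the mixed integer, real and boolean measures to share one feature space for the distance calculations that follow.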
The idea of vertex profiles based on graph characteristics is adapted to the motif-
based approach. Instead of considering the percentage of subgraphs a motif occurs
in, however, a binary attribute is created that describes whether or not the vertex
participates in the motif. One of the files output by FANMOD motif searches is a
comma separated file with the following format:
adjacency matrix, <participating vertices>
After the significant motifs have been determined, the script in Listing B.5 parses
these files and creates the profiles for each node based on its participation in significant
motifs. The dimensionality d of the motif profiles is 130: 42 of these are significant
order 3 motifs, while the remaining 88 are significant order 4 motifs. The motif profiles
were built putting both order 3 and order 4 motifs together because preliminary
investigations indicated that the combination is more successful in separating and
identifying protocols than either can do alone.
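The parsing step might be sketched as follows. Only the "adjacency matrix, <participating vertices>" layout comes from the FANMOD output described above; the dump contents, host names, and two-motif significant set are hypothetical stand-ins for the real 130-motif list built by the script in Listing B.5.

```python
import csv
import io

# Hypothetical excerpt of a FANMOD dump file: the first column is the motif's
# flattened adjacency matrix, the remaining columns are participating vertices.
dump = """\
011000100,10.0.0.1,10.0.0.2,10.0.0.3
011000100,10.0.0.2,10.0.0.4,10.0.0.5
010001100,10.0.0.1,10.0.0.4,10.0.0.5
"""

# Motifs previously found to be statistically significant; their order fixes
# the attribute order of the resulting profile (here d = 2 for brevity).
significant = ["011000100", "010001100"]

profiles = {}  # host -> binary motif-participation vector
for row in csv.reader(io.StringIO(dump)):
    motif, participants = row[0], row[1:]
    if motif not in significant:
        continue  # only significant motifs contribute attributes
    col = significant.index(motif)
    for host in participants:
        profiles.setdefault(host, [0] * len(significant))[col] = 1
```

Here 10.0.0.1 participates in both significant motifs while 10.0.0.3 appears only in the first, so their profiles differ in exactly one binary attribute.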
5.7 K-Nearest Neighbor Classification
The tasks of node classification and feature weighting (Section 5.8) are handled by
RapidMiner, an open source knowledge-discovery and data mining tool built on the
Java™ platform [61]. RapidMiner allows for data mining experiments to be quickly
constructed through the use of hundreds of modular operators that handle data pre-
processing and post-processing, creation and storage of models, clustering and classi-
fication tasks as well as statistical analysis.
The k-nearest neighbor (k-NN) classification algorithm is a simple machine learn-
ing algorithm for classifying objects based on the closest training examples in a feature
space. First, the data is broken into a training set and a test set. The proximity of a
test point z to every point in the example set is then calculated.
Algorithm 1 The k-Nearest Neighbor classification algorithm [62]
1: Let k be the number of nearest neighbors and D be the set of training examples.
2: for each test example z = (x′, y′) do
3:   Compute d(x′, x), the distance between z and every example (x, y) ∈ D.
4:   Select D_z ⊆ D, the set of k closest training examples to z.
5:   y′ = argmax_v Σ_{(x_i, y_i) ∈ D_z} I(v = y_i)
6: end for
After the nearest-neighbor list is obtained, the test example z is classified based
on a majority vote of the k nearest neighbors to z. In this study, k = 1, so a test
point z is given the same label as the label of its closest neighbor. In line 5 above,
yi is the class label for one of the nearest neighbors, and I() is an indicator function
that returns the value 1 if its argument is true and 0 otherwise.
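A minimal, self-contained version of this 1-NN procedure, using the Euclidean distance of Equation 5.1, might look like the following. The toy feature vectors and protocol labels are illustrative only, not data from this study.

```python
import math
from collections import Counter

def euclidean(x, y):
    # Equation 5.1: d(x, y) = sqrt(sum over k of (x_k - y_k)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(z, examples, k=1):
    """Label test point z by majority vote among its k nearest neighbors.

    examples is a list of (feature_vector, label) pairs (the training set D).
    """
    nearest = sorted(examples, key=lambda ex: euclidean(z, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training set with two labels; k = 1 as in this study, so a test point
# simply receives the label of its single closest neighbor.
D = [([0.0, 0.0], "HTTP"), ([0.1, 0.2], "HTTP"), ([5.0, 5.0], "SSH")]
assert knn_classify([0.2, 0.1], D) == "HTTP"
assert knn_classify([4.5, 5.5], D) == "SSH"
```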
5.7.1 Measuring Profile Separation
A number of similarity measures can be used to determine the distance from one point
to another (line 3 of Algorithm 1), the selection of which depends on the type of data
being examined and its application [62]. For example, there is Euclidean distance,
Jaccard coefficient, cosine similarity and simple matching coefficient. Euclidean dis-
tance is often chosen for instances of dense continuous data such as that found in
the profiles for traditional graph analysis. Although the simple matching coefficient
is often applied to binary data such as the motif profiles, the Euclidean distance is
also suitable, and is selected for use in this study. Equation 5.1 defines this distance,
where n is the number of dimensions and x_k and y_k are the kth attributes of x and y.

    d(x, y) = √( Σ_{k=1}^{n} (x_k − y_k)^2 )    (5.1)
5.7.2 Cross Validation of Classification Results
Cross validation is the process of partitioning a data set into n subsets, training a
classifier with n − 1 subsets and using the remaining subset to test. The process is
then repeated n times with a different subset left out each time. In 10-fold cross
validation, for example, ten subsets are created, each containing 10% of the original
data set. In each iteration, 90% of the data is used for training and 10% is used for
testing. To avoid the possibility of a particular subset not containing any instances
(or very few) of a particular label, stratified sampling is used so that each subset
contains roughly the same proportion of labels.
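The procedure can be sketched as follows. This is an illustrative stand-in for the RapidMiner operators actually used in the study; the round-robin dealing is just one simple way to achieve stratification.

```python
import random
from collections import defaultdict

def stratified_folds(examples, n=10, seed=0):
    """Split (features, label) pairs into n folds, keeping the label
    proportions roughly equal in every fold (stratified sampling)."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[1]].append(ex)
    folds = [[] for _ in range(n)]
    rng = random.Random(seed)
    for group in by_label.values():
        rng.shuffle(group)
        for i, ex in enumerate(group):   # deal each class round-robin
            folds[i % n].append(ex)
    return folds

def cross_validate(examples, classify, n=10):
    """Train on n - 1 folds and test on the held-out fold, n times over."""
    folds = stratified_folds(examples, n)
    correct = total = 0
    for i in range(n):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        for features, label in folds[i]:
            correct += classify(features, train) == label
            total += 1
    return correct / total

# Toy data: 20 examples of each label, so every fold holds 2 of each.
data = ([([float(i)], "A") for i in range(20)]
        + [([float(100 + i)], "B") for i in range(20)])
assert all(len(fold) == 4 for fold in stratified_folds(data))
```

A trivially separable classifier, such as `lambda x, train: "A" if x[0] < 50 else "B"`, scores 1.0 under this harness, which is a quick sanity check of the fold bookkeeping.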
5.8 Genetic Algorithm Feature Weighting
Genetic algorithms provide a unique way to investigate which attributes in the vertex
profiles more effectively classify application protocols, as well as increase the accuracy
of the nearest-neighbor classifier. This study utilizes a genetic algorithm to perform
evolutionary feature weighting, the results of which are applied to each profile and a
new classifier is built using the nearest neighbor algorithm as before. Alternatively, a
brute-force search of all attribute combinations (given by Equation 5.2) might be
possible for a small attribute set such as in the case of traditional graph analysis, but
is not feasible for motif analysis.
    c = Σ_{n=1}^{d} (d choose n)    (5.2)
Given that d = 11 for traditional graph analysis, applying the equation above
reveals that the number of possible attribute combinations c is 2,047. However, when
d = 130 for motif analysis, c = 1.36 × 10^39. Genetic algorithms present one possible
way to explore this problem space within a reasonable amount of time.
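Both counts are easy to verify numerically, since the sum in Equation 5.2 equals 2^d − 1 by the binomial theorem:

```python
from math import comb

def attribute_combinations(d):
    # Equation 5.2: c = sum over n = 1..d of (d choose n), i.e. 2^d - 1.
    return sum(comb(d, n) for n in range(1, d + 1))

assert attribute_combinations(11) == 2047             # traditional graph analysis
assert attribute_combinations(130) == 2 ** 130 - 1    # motif analysis, about 1.36e39
```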
5.8.1 Overview of Genetic Algorithms
Genetic algorithms view learning as a competition among a population of evolving
candidate problem solutions [63]. During each generation, a fitness function (line 4 of
Algorithm 2 below) assesses each candidate to determine if it will contribute to the
next generation of solutions. Those solutions found to be the most “fit” are selected
for mating and mutation and shape the following generation of potential solutions.
The algorithm repeats until some termination condition is met, such as convergence
to a solution or a predefined number of generations have been tested.
Algorithm 2 General form of a genetic algorithm [63]
1: Set time t = 0
2: Initialize the population P(t)
3: while the termination condition is not met do
4:   Evaluate fitness of each member of the population P(t).
5:   Select members from population P(t) based on fitness.
6:   Produce the offspring of these pairs using genetic operators.
7:   Replace, based on fitness, candidates of P(t), with these offspring.
8:   Set time t = t + 1
9: end while
Before the algorithm can begin, candidate solutions must be transformed into
an appropriate representation for the problem space. Examples include binary, real
value, and tree encoding, the simplest and most studied of which is binary encoding
[64]. Initial populations of candidate solutions are usually chosen at random. The
population size depends on the problem space, but studies have shown a population
size of 20-30 generally yields good results [65, 66]. At this point, the fitness function
evaluates each member of the population, and selects the best candidates for mating.
Figure 5.6 shows what a simple crossover of two binary strings might look like.
Input Bit Strings       Output Bit Strings
   0011|0001       =⇒      0011|1011
   0100|1011               0100|0001

Figure 5.6: Single-point crossover of two binary strings
Just like in evolutionary biology, there is a small chance for random genetic mu-
tation to occur. In a binary string, this would equate to one of the bits being flipped
from a 0 to a 1 or vice versa, allowing the algorithm to explore more of the problem
space and not settle on a local solution. Previous research suggests variable values
for mutation probability, such as 0.0001 [65] or 0.005 - 0.01 [66].
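The crossover of Figure 5.6 and the mutation step can be sketched as follows. This is an illustration only; the crossover point is fixed after the fourth bit to reproduce the figure, and the mutation probability is a parameter.

```python
import random

def crossover(a, b, point):
    """Single-point crossover: swap the tails of two equal-length bit strings."""
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(bits, p, rng):
    """Flip each bit independently with probability p."""
    return [bit ^ (rng.random() < p) for bit in bits]

# Reproducing Figure 5.6 with the crossover point after the fourth bit:
c1, c2 = crossover([0, 0, 1, 1, 0, 0, 0, 1],
                   [0, 1, 0, 0, 1, 0, 1, 1], point=4)
assert c1 == [0, 0, 1, 1, 1, 0, 1, 1]   # 0011|1011
assert c2 == [0, 1, 0, 0, 0, 0, 0, 1]   # 0100|0001

# A rare random bit flip keeps the search from settling on a local solution.
c1 = mutate(c1, p=0.01, rng=random.Random(0))
```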
5.8.2 Feature Weighting
The RapidMiner distribution contains a prewritten test for evolutionary feature weight-
ing using genetic algorithms. In the context of application identification, the function
used to determine the fitness of candidate solutions is based upon whether or not the
potential solution increases the overall accuracy of the 1-NN classifier. Solutions that
do not increase the performance of the classifier are not selected to contribute to the
following generation of candidate solutions. The algorithm is run for thirty genera-
tions, by which time the system should stabilize and begin to converge to a solution
set of attribute weights. The full test parameters, including crossover probabilities,
mutation rates and candidate selection can be found in Appendix C.
Chapter 6: Results and Analysis
To test the accuracy and performance of the proposed approaches, several exper-
iments were run using the method described in Chapter 5. In total, 65 application
graphs were examined: ten AIM, ten DNS, ten HTTP, five Kazaa, ten MSDS, ten
Netbios and ten SSH, with the discrepancy resulting from fewer examples of peer-to-
peer Kazaa traffic being located in the data traces that were downloaded. Profiles
were classified using both traditional graph attributes and motif-based attributes.
Afterwards, profile attributes were weighted using a genetic algorithm. This step
aims to provide two important functions: to increase the accuracy of the classifiers
and to provide insight into which attributes are more effective for identifying network
applications. Analysis of several key attributes is provided in this chapter, as well as
a direct comparison between traditional and motif-based profiles.
6.1 Preliminary Investigations
Because motifs have not been applied in the realm of application identification, some
preliminary classification work was required to vet this approach. Profiles for each of
the 65 application graphs were created using a combination of significant order 3 and
order 4 motifs, where each attribute represents the frequency of a particular motif
within that graph. The results provided in Table 6.1 were encouraging (for the full
classification results see Appendix D). Perhaps a more interesting question, however,
is not if an entire graph of communications can be correctly classified, but instead
if the activities of a particular host can be identified. It is on this question that the
remainder of the chapter is focused.
Protocol AIM DNS HTTP Kazaa MSDS Netbios SSH
Accuracy 80% 80% 90% 40% 60% 100% 80%
Table 6.1: Classification accuracy of 65 application graphs
6.2 Initial Results
Classification results are presented as confusion matrices; each row of the table repre-
sents a predicted class label (an application in this case), while the columns represent
the true class label. The boldface numbers along the diagonal indicate correct clas-
sifications. Confusion matrices also show false positives and false negatives. Data
points that are predicted to have a certain class label but are incorrect are known
as false positives, found in the rows of the matrices. False negatives are examples of
a particular class that are incorrectly labeled, shown in the columns. For example,
given a set of data that is predicted to be hosts sharing files via Kazaa, true positives
would be those hosts that are actually using the P2P application while false positives
would be those hosts that are not. Conversely, given a set of data that is known to
be file-sharing hosts, false negatives would be those that are not labeled as using the
Kazaa application.
           True A   True B   True C   Precision
Pred. A       5        2        0      71.4%
Pred. B       3        3        2      37.5%
Pred. C       0        1       11      91.7%
Recall:     62.5%    50.0%    84.6%    Overall: 70.4%

Table 6.2: An example confusion matrix with three classes
The performance of the nearest-neighbor classification models is described by
three different accuracy measures. The overall accuracy of a model (shown in the
bottom-right corner of the matrix) is simply the number of correct
classifications (true positives) over all classifications. Given a set of predictions of a
particular label, class precision is a measure of the accuracy of those predicted labels.
It is the ratio of correct predictions of label l to all predictions of label l. It can be
written:
    precision = true positives / (true positives + false positives)    (6.1)
Class recall (also called sensitivity) measures the accuracy of predicted labels if
provided a complete set of true labels. Recall is given by the following equation:
    recall = true positives / (true positives + false negatives)    (6.2)
Table 6.2 displays the results of an example classification experiment, as well as
the accuracy, precision and recall measures. This confusion matrix shows that while
the classifier has some trouble distinguishing between class A and class B, it can
effectively detect examples of class C.
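These measures can be checked directly against Table 6.2; the sketch below recomputes the precision, recall, and overall accuracy figures shown there.

```python
# Confusion matrix from Table 6.2: rows are predicted labels, columns true labels.
labels = ["A", "B", "C"]
matrix = [
    [5, 2, 0],   # predicted A
    [3, 3, 2],   # predicted B
    [0, 1, 11],  # predicted C
]

def precision(i):
    # Equation 6.1: correct predictions of a label over all its predictions (row i).
    return matrix[i][i] / sum(matrix[i])

def recall(j):
    # Equation 6.2: correct predictions of a label over all its true examples (column j).
    return matrix[j][j] / sum(row[j] for row in matrix)

# Overall accuracy: the diagonal (correct classifications) over all classifications.
overall = sum(matrix[i][i] for i in range(3)) / sum(map(sum, matrix))

assert round(precision(0) * 100, 1) == 71.4   # class A precision
assert round(recall(2) * 100, 1) == 84.6      # class C recall
assert round(overall * 100, 1) == 70.4        # bottom-right corner of Table 6.2
```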
6.2.1 Traditional Graph Measure Profiles
To remind the reader, traditional graph measure profiles have eleven attributes in-
cluding degree counts, centrality measures and clustering coefficient (see Section 5.4
for the full list). There are a total of 3,940 unique hosts found in the 65 application
graphs. Each line of the input file for the nearest neighbor algorithm contains the
true label assigned to the host and the eleven graph measures associated with that
host. Not all protocols have an equal number of training examples due to the popu-
larity and availability of certain applications in the trace files, but each protocol has
400–800 examples.
The computational load of a single test point for the nearest neighbor algorithm
is O(nd) where n is the number of training samples and d is the number of attributes.
When using 10-fold cross validation, the test set is n/10 and ten iterations are run,
making the overall complexity of this process O(n^2) if we absorb the constant d into
the expression. Although other methods exist to reduce the number of computations
necessary, RapidMiner is able to generate the model and accuracy measures for 3,940
data points in just a few seconds. Table 6.3 shows the resulting confusion matrix,
where each row is the predicted label and each column is the actual label.
Table 6.8: Attribute weights for traditional graph measures
The weights of the attributes reflect the interaction of all eleven graph measures
and are the values that maximize the accuracy of the classifier. They should there-
fore not be interpreted too literally in isolation. For example, degree centrality was
weighted with a 1.000, the highest possible weight. This does not mean that an accu-
rate classifier could be built on this attribute alone. Section 6.5 addresses this point
further. However, the table does still provide some insight as to which attributes
might be more useful when providing classification tasks. It is not surprising that the
degree counts are not weighted especially high, as they are a very generic measure.
The “periphery” attribute has a low weight because it is not a very unique measure;
out of the 3,940 profiles, 2,132 of them are periphery nodes. In contrast, only 773
nodes are central nodes, which has a higher attribute weight.
Figure 6.4 shows the per-protocol accuracy of both unweighted and weighted
profiles based on traditional graph measures.

Figure 6.4: Accuracy of unweighted vs. weighted traditional graph measure profiles

The confusion matrix for weighted attributes can be found in Appendix D. As one would hope, the weighted attribute
profiles perform slightly better than their unweighted counterpart for each protocol.
The class recall for SSH is again very low for the same reasons described previously.
6.3.2 Attribute Weights of Motif-based Measures
Because of the high dimensionality of motif-based profiles, it becomes more impor-
tant to take advantage of other methods such as genetic algorithms to explore the
attributes. Figure 6.5 depicts the ten most heavily weighted motifs and their corre-
sponding weighted values. In the figure, green nodes represent clients, black nodes
represent servers and red nodes represent peers, as specified in Definition 1. As with
the weights of traditional graph measures, these weights reflect the combined infor-
mation from all attributes.
Figure 6.5: The ten highest-weighted motifs and their corresponding weights:
(a) 1.000, (b) 0.662, (c) 0.650, (d) 0.632, (e) 0.585, (f) 0.545, (g) 0.537, (h) 0.503, (i) 0.503, (j) 0.502

Motif 6.5(a) is the most highly weighted of the significant motifs found in this study
and only occurs in two application graphs. There are 24 instances of it in an MSDS
graph and another 137 instances of it in a Netbios graph. Although weighted lower,
motif 6.5(b) occurs overwhelmingly more frequently in Netbios (1,007 instances) than
it does in MSDS (3 instances) or DNS (2 instances). If a node were to occur in these
two motifs, there would be a good chance that the host was using the Netbios
application.
Unfortunately, the weights do not indicate which particular application(s) a motif
helps to delineate, only which motifs successfully increase the overall accuracy of the
classifier. Perusing the profile data reveals that instances of many motifs are found
in several or all of the applications studied. This is not to say motif profiles are
unsuitable for describing computer networks (as they have shown a great deal of
promise already), rather that no single motif is indicative of a particular application.
Given the complexity of the highly dynamic interactions that occur in computer
networks, this is not entirely surprising. It is possible that different types of motifs
(described in Chapter 7) could be even more beneficial than the current generation
of motifs and motif profiles.
One final point to address before moving on to a comparison of traditional and
motif-based profiles is the performance of unweighted vs. weighted motif profiles,
shown in Figure 6.6. There is a slight increase in classification accuracy in each of
the protocols except for Kazaa, which sees no additional gain from attribute weight-
ing. The overall accuracy of the model increases to 85.70%, a difference of 1.63%.
Appendix D contains the confusion matrix for weighted motif profiles.
Figure 6.6: Accuracy of unweighted vs. weighted motif-based profiles
6.4 Comparison of Profile Types
This section compares the two profile types side-by-side and discusses some of the
advantages and disadvantages of each approach. The motif-based model generally
outperforms traditional graph measures, though this is not always the case as shown
in Figure 6.7. Notably, the traditional profiles significantly outperform motif-based
profiles for classifying AIM traffic, while the reverse is true for SSH (again, the SSH
results should be taken with a grain of salt because slightly less than 40% of SSH
traffic is classified by the second approach).
Weighting the profile attributes benefits traditional graph measures more than
motif descriptions.

Figure 6.7: Accuracy comparison of profile types: (a) unweighted, (b) weighted

One reason for this might be the type of data used to describe
each profile. Traditional profiles are comprised of a mixture of binary, real-valued
and integer data. In addition to being purely binary, motif profiles are also sparse;
most nodes only participate in very few of the 130 significant motifs. As a result,
many of the motif weights are multiplied by zero, resulting in no information gain.
Regardless, weighting the attributes does not change which type of classifier performs
better for a particular protocol with the exception of HTTP. Unweighted motifs have
a 4% accuracy advantage over traditional measures, but fall to a 1% disadvantage
when the profiles are weighted.
Advantages and Disadvantages of Profile Types
Motif-based profiles have a slight advantage over traditional measures in a few cate-
gories. The overall accuracy of the motif-based classifiers is higher than that of the
traditional classifiers, both unweighted and weighted. Also, motif profiles result in
more favorable overlap with other profiles. Only 10% of motif profiles do not match
another profile, and 61% match profiles of a single label (note that “match” means
a Euclidean distance of zero, not an identical profile). With traditional measures on
the other hand, 58% match a single label, and nearly 25% of profiles do not match
any other profile.
Traditional graph measures are less demanding to compute than their motif coun-
terparts. Even though some graph measures are O(n^3) where n is the order of the
graph, calculations can be performed extremely quickly because n is small in the
application graphs examined: 40 ≤ n ≤ 80. Motif searches are computationally
expensive and can be prohibitively so when searching for large motifs. This study
found that an exhaustive search of order 3 motifs could be completed in roughly 7-8
minutes, while an exhaustive search of order 4 motifs took 6-8 hours to complete.
6.5 Considerations for Optimizing Classifier Performance
There are several ways in which the performance of application classifiers may be
improved. An “on the fly” traffic classification system would need to be as fast as
possible so that network latency is minimized. One way to achieve increased classifier
speed is to reduce the dimensionality of the data. Already the evolutionary feature
weighting performed by the genetic algorithm has indicated which attributes are more
valuable to the classifier. Attributes below a certain threshold value could be ignored,
at the expense of a little bit of accuracy. Figure 6.8 demonstrates the accuracy of
models based on a single traditional graph measure.
By far, eigenvector centrality, closeness centrality and degree centrality provide
the most information to the classifier, each scoring better than 65% on its own. Most
of the attributes score no better than a random guess, which has a 1/7 chance of
being correct and is shown as a vertical dotted line in the graph. Recall that
eigenvector centrality assigns a centrality score to a vertex proportional to that of
its neighbors. This metric is more "social" in nature than some of the others in that
the centrality scores of neighboring vertices are considered in the calculation. The
idea of "distance" in an application
Figure 6.8: Accuracy of single attribute classification. (Bar chart, omitted: % correctly labeled for each attribute — eigenvector centrality, closeness centrality, degree centrality, betweenness centrality, clustering coefficient, outdegree, total degree, eccentricity, indegree, periphery, and center.)
graph is a bit tricky because it does not consider the number of hops data must go
through to reach its final destination nor the physical distance between hosts. There-
fore the “closeness” of closeness centrality describes the social usage of an application
and suggests that the average shortest path length between nodes differs somewhat
from application to application. The degree centrality is essentially a weighted degree
count, which again suggests that the size of connected components within application
graphs is important, influenced in part by the popularity of servers and services.
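The recursive definition of eigenvector centrality recalled above — a vertex's score is proportional to the sum of its neighbors' scores — can be sketched with a minimal power iteration (this mirrors the spirit of the Appendix B implementation, but the graph and code here are illustrative):

```python
import math

def eigenvector_centrality(adj):
    """Power iteration on an adjacency dict {node: [neighbors]} (undirected)."""
    nodes = list(adj)
    score = {n: 1.0 for n in nodes}
    for _ in range(100):
        # Each node accumulates its neighbors' current scores...
        update = {n: sum(score[m] for m in adj[n]) for n in nodes}
        # ...and the vector is renormalized to unit length.
        norm = math.sqrt(sum(v * v for v in update.values())) or 1.0
        score = {n: v / norm for n, v in update.items()}
    return score

graph = {
    "a": ["b", "c", "d"],  # well-connected node inside a triangle
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a"],            # pendant node
}
cent = eigenvector_centrality(graph)
```

The best-connected node "a" ends up with the highest score, and the symmetric nodes "b" and "c" score identically, which is exactly the "social" behavior the text describes.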
In addition to reducing the dimensionality of the attribute profiles, one can also
consider reducing the number of data points used in the training phase of the nearest
neighbor algorithm. Exploring the effectiveness of smaller classification models has
two important implications. First of all, it suggests that a more lightweight classifier
could be built when heading towards a real-time implementation. Secondly, it shows
that the methods proposed in this study can be used for smaller networks and not
just those containing thousands of nodes.
To test this hypothesis, several unweighted classifiers were built for each profile
type with an increasing number of nodes in each model. The data was selected at
random, while keeping the proportions of each class label the same as in the models
previously discussed. All of the test parameters are as they were before, including
the use of 10-fold cross validation to determine the accuracy. The results of this
experiment are illustrated in Figure 6.9.
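The class-proportional random selection described above can be sketched as a simple stratified sample (a sketch, not the thesis's actual sampling code; the data and labels are hypothetical):

```python
# Draw n points at random while preserving the class-label proportions.
import random
from collections import defaultdict

def stratified_sample(data, labels, n, seed=0):
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, y in zip(data, labels):
        by_label[y].append(x)
    sample = []
    for y, xs in by_label.items():
        k = round(n * len(xs) / len(data))  # proportional share for this class
        sample.extend((x, y) for x in rng.sample(xs, k))
    return sample

data = list(range(100))
labels = ["DNS"] * 60 + ["SSH"] * 40
sub = stratified_sample(data, labels, 10)  # 6 DNS points, 4 SSH points
```

Keeping the proportions fixed ensures that shrinking the training set does not also change the class balance the classifier sees.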
Figure 6.9: Comparison of profile types as the size of the training set increases. (Two line charts, omitted: (a) traditional graph measures and (b) motif-based profiles, each plotting % correctly labeled against the number of profiles, 500 to 4,000, for AIM, DNS, HTTP, Kazaa, Netbios, MSDS, and SSH.)
The classifiers tend to perform slightly better as the number of training data
points increases, but sometimes negligibly so. DNS, Kazaa and Netbios seem to
benefit the least from having additional training examples, while AIM and MSDS
fluctuate quite a bit more. It is interesting to note that the applications which were
previously classified more accurately also exhibit more stable behavior in Figure 6.9.
This is true for both profile types. For example, AIM and MSDS were by far the least
accurately classified protocols using a motif-based approach, and their trend lines
exhibit the most volatility in Figure 6.9(b). In contrast, DNS, Kazaa and Netbios were the
most accurately classified protocols, and their trend lines are nearly flat. This finding
suggests that the protocols which can be clearly described by a profile (traditional or
motif-based) can be learned with a relatively low number of training points. Further
investigation into the AIM and MSDS protocols is needed to understand why the
accuracy of AIM peaks at 2,500 nodes and then declines, while the accuracy of MSDS
peaks at 1,000 nodes and then drops significantly.
6.6 Limitations of Current Approach
This chapter has demonstrated the promise of using vertex profiles to identify appli-
cation usage across a computer network. A few of the shortcomings of the proposed
methodology have been touched upon already, but are summarized here. Graph size
is an important factor to consider, since more “interesting” vertex characterizations
arise from the complex interactions of hosts. Motif-based profiles become more de-
scriptive as hosts communicate with a larger number of other hosts. The current
generation of classification models suffers when there is heavy overlap among profiles,
that is, when many profiles lie at a Euclidean distance of zero from one another. A
more intelligent tie-breaking scheme could yield
better performance for those protocols that share application graph characteristics.
Currently, the motif-based approach only considers motifs of order 3 and order 4.
This causes a problem for protocols like SSH that tend to have a large number of
small connected components instead of fewer large connected components. Some of
the stages in the process are computationally expensive. The genetic algorithm used
for feature weighting is a very time-consuming endeavor and does not yield the de-
sired increase in performance. On the other hand, once a network is learned and a
classifier built, the attribute weights need only be computed once and can be applied
in O(n) time to the attributes collected for the test points. Additionally, the analysis
techniques put forth by this work require a view of the network that shows as many
of the interactions as possible.
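One tie-breaking scheme of the kind suggested above would be to take a majority vote over the labels of all training points tied at the minimum distance, rather than picking one arbitrarily. A minimal sketch (a hypothetical scheme, not the classifier actually used in this study):

```python
# 1-NN with a majority-vote tie-break among equally-near training points.
from collections import Counter
import math

def classify(test, train):
    """train is a list of (profile, label) pairs; ties at the minimum
    distance are resolved by the most common label among the tied points."""
    dists = [(math.dist(test, p), y) for p, y in train]
    dmin = min(d for d, _ in dists)
    tied = [y for d, y in dists if d == dmin]
    return Counter(tied).most_common(1)[0][0]

# Three training profiles are tied at distance zero; DNS wins the vote.
train = [([0, 1], "DNS"), ([0, 1], "DNS"), ([0, 1], "HTTP"), ([5, 5], "SSH")]
label = classify([0, 1], train)
```

Such a vote directly targets the zero-distance overlap cases that hurt the current models.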
Chapter 7: Conclusions and Future Work
The tasks of managing and securing computer networks are becoming increasingly
complicated due to the use of applications over non-standard port numbers as well as
the use of data encryption techniques. These practices subvert a network administrator's
ability to provide quality of service to legitimate users, to ensure compliance with
security policies, and to prevent outside intruders from gaining access to a system.
Intrusion detection systems and network monitoring tools that rely on deep packet
inspection are ineffective when data transfers are encrypted. Several previous studies
have attempted to classify network application usage by examining flow characteris-
tics pertaining to a particular series of communications between two hosts, examining
attributes such as the size of the data packets being sent, packet inter-arrival time
and session lengths.
This thesis has proposed an interdisciplinary approach to the study of networks
through the characterization of application graphs. It is an “in the dark” methodology
that relies on the communication patterns found in a network, rather than the contents
of packet payloads or port numbers used by the application. A wide variety of graph
measures heavily borrowed from social network analysis are used to create vertex
profiles that determine the application in which a host participates.
Furthermore, this work has uniquely applied motif-based analysis, used almost
exclusively in systems biology, to the study of application graphs. This method of
detecting significant subgraph patterns has shown a great deal of promise for modeling
and classifying application protocols. It has been shown that motifs can not only be
used to express communication patterns, but also to indicate the functional role of a
host. In this study, nodes were labeled as either a client, server, or peer based upon
their interactions at the transport layer. This information was used to generate motifs.
A second type of vertex profile was defined, based upon a node’s participation (or lack
thereof) in the motifs that were found to be significant across all of the application
protocols examined.
Through empirical testing, this study has shown that both types of profiles can
determine what application a host is using with a reasonable amount of accuracy.
Although some protocols like SSH and AIM present difficulties, many of the others
can be classified with greater than 80% accuracy, and in the case of weighted motif
profiles, as high as 96% for the peer-to-peer application Kazaa. In general, a motif-
based approach out-performs traditional graph measures and seems to have more
potential for related work in the future.
One issue to consider is how to best manage connected components in application
graphs that contain only two nodes. This phenomenon was found to occur frequently
in SSH, contributing to the fact that less than 40% of SSH hosts were classified by
the motif-based approach. Ignoring vertex colors, there are three possible order 2
motifs: A → B, A ← B, and A ↔ B. Unfortunately, the edge-switching operations
for creating random graphs will not provide sufficiently randomized graphs, so it is
unlikely that any particular order 2 pattern would be found statistically significant.
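Since there are only three possible order 2 patterns, a two-node component can be assigned its pattern directly, without any significance test. A minimal sketch (edge set and host names are illustrative):

```python
# Classify a two-node component into one of the three order 2 patterns.
def order2_pattern(edges, a, b):
    """edges is a set of directed (src, dst) pairs for the component."""
    ab = (a, b) in edges
    ba = (b, a) in edges
    if ab and ba:
        return "A<->B"
    return "A->B" if ab else "A<-B"

edges = {("client", "server")}
pattern = order2_pattern(edges, "client", "server")
```

Handling these tiny components explicitly would recover the SSH hosts that the order 3 and order 4 motif search currently misses.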
Currently, the only information utilized in the creation of application graphs is the
source and destination IP addresses, and the source and destination port numbers.
The motif-based approach provides some additional information by using vertex colors
to represent node types, but other information could also be exploited to color the
edges. For example, colors could be used to denote the amount of data transferred
between two nodes. This would help create more detailed profiles that might be able
to distinguish between applications that have similar connection patterns but use network
bandwidth in ways that are distinct from one another. Also related to the creation of
application graphs, it would be interesting to observe the data flow through all nodes
involved in a particular activity and not just the flow on a particular port number.
For example, a web server might request content from an application or database server
in response to a client's request for a web document; these back-end communications to
related services occur on ports other than 80, the usual HTTP port number.
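Edge coloring by traffic volume, as proposed above, could be as simple as binning the bytes observed on each edge. A sketch under assumed thresholds (the cutoffs and flow data are illustrative; real cutoffs would be chosen from the traffic distribution of the network being studied):

```python
# Bin the traffic volume between two hosts into a coarse edge color.
def edge_color(bytes_transferred):
    if bytes_transferred < 10_000:
        return "light"
    if bytes_transferred < 1_000_000:
        return "medium"
    return "heavy"

# Color each (src, dst) edge by the total bytes observed on it.
flows = {("10.0.0.1", "10.0.0.2"): 4_200,
         ("10.0.0.1", "10.0.0.3"): 8_500_000}
colors = {edge: edge_color(b) for edge, b in flows.items()}
```

Two applications with identical connection patterns but different transfer sizes would then produce differently colored motifs.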
Another area to explore is the different machine learning techniques that can be
applied to vertex profiles for classification and feature weighting; nearest-neighbor
and genetic algorithms are only two possibilities. The many parameters of these
algorithms require further tuning to optimize the classification accuracy of the models
built. This thesis describes a process which allows the substitution of particular
algorithms. For example, a Bayes classifier or support vector machine could be used
instead of nearest-neighbor, while principal component analysis could be used in place
of the genetic algorithm [67, 62].
Although not used in the current approach, temporal information could also prove
to be useful in classifying application protocols. One approach would be to encode
information such as session lengths or packet inter-arrival times into the edge colors.
Another use of time-based information would be to observe communication patterns
over a much smaller time window (on the order of seconds or minutes instead of hours)
and determine how a node’s participation in motifs changes over time.
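Observing motif participation over small time windows presupposes slicing the flow log into fixed intervals; a minimal sketch of that slicing step (the record layout `(timestamp, src, dst)` is an assumption, not the thesis's actual data format):

```python
# Group (timestamp, src, dst) flow records into fixed-width time windows.
def windows(flows, width):
    """Return {window_index: [(src, dst), ...]} for `width`-second windows."""
    out = {}
    for ts, src, dst in flows:
        out.setdefault(int(ts // width), []).append((src, dst))
    return out

flows = [(3, "a", "b"), (62, "a", "c"), (65, "b", "c"), (130, "c", "a")]
w = windows(flows, 60)  # 60-second windows
```

An application graph (and its motif counts) could then be rebuilt per window to track how a node's role drifts over time.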
Moving away from implementation details and algorithm decisions, this type of
research can be expanded outside of application identification. Assuming that the
process can be tweaked to allow a high accuracy in protocol recognition, this approach
could be used to detect anomalies in network behavior. Hosts that participate in
activities that look similar to a known application but differ more than an established
threshold value would be considered anomalous for that particular application and
trigger an alert. One final consideration is pushing this research further into the
realm of social network analysis, applying it to the detection of communities and
associations within a network, such as locating all hosts that are part of the same
online gaming community.
References
[1] P. Dyson, Dictionary of Networking. Sybex, 1999.
[2] A. S. Tanenbaum, Computer Networks. Prentice Hall, 2003.
[3] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, "Network motifs: simple building blocks of complex networks," Science, vol. 298, no. 5594, pp. 824-827, October 2002. [Online]. Available: http://dx.doi.org/10.1126/science.298.5594.824
[4] F. Rasche and S. Wernicke, "FANMOD manual," 2006.
[5] INPUT federal IT market forecast 2008-2013. [Online]. Available: http://www.input.com/corp/library/detail.cfm?itemid=5437&cmp=OTC-fedinfosecfcst08
[6] The SANS security policy project. [Online]. Available: http://www.sans.org/resources/policies/
[7] S. Christey and R. A. Martin, "CVE - vulnerability type distributions in CVE," 2007. Technical white paper on the distribution of vulnerabilities reported to CVE.
[8] Internet Assigned Numbers Authority: Assigned port numbers. [Online]. Available: http://iana.org/assignments/port-numbers
[10] Wireshark: A network protocol analyzer. [Online]. Available: http://www.wireshark.org/
[11] Snort - the de facto standard for intrusion detection/prevention. [Online]. Available: http://www.snort.org/
[12] M. E. J. Newman, "Coauthorship networks and patterns of scientific collaboration," in Proceedings of the National Academy of Science, 2004, pp. 5200-5205.
[13] S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.
[14] C. Yang and T. Ng, "Terrorism and crime related weblog social network: Link, content analysis and information visualization," Intelligence and Security Informatics, 2007 IEEE, pp. 55-58, May 2007.
[15] E. Yeger-Lotem, S. Sattath, N. Kashtan, S. Itzkovitz, R. Milo, R. Y. Pinter, U. Alon, and H. Margalit, "Network motifs in integrated cellular networks of transcription-regulation and protein-protein interaction," Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 16, pp. 5934-5939, April 2004. [Online]. Available: http://dx.doi.org/10.1073/pnas.0306752101
[16] S. S. Shen-Orr, R. Milo, S. Mangan, and U. Alon, "Network motifs in the transcriptional regulation network of Escherichia coli," Nat Genet, vol. 31, no. 1, pp. 64-68, May 2002. [Online]. Available: http://dx.doi.org/10.1038/ng881
[17] J. Grochow and M. Kellis, "Network motif discovery using subgraph enumeration and symmetry-breaking," 2007, pp. 92-106.
[18] U. Alon, "Network motifs: Theory and experimental approaches," Nature Reviews Genetics, vol. 8, no. 6, pp. 450-461, Jun. 2007.
[19] J. Day and H. Zimmermann, "The OSI reference model," Proceedings of the IEEE, vol. 71, no. 12, pp. 1334-1340, Dec. 1983.
[20] V. Cerf and R. Kahn, "A protocol for packet network intercommunication," Communications, IEEE Transactions on [legacy, pre-1988], vol. 22, no. 5, pp. 637-648, May 1974.
[21] L. Euler, "Solutio problematis ad geometriam situs pertinentis," in Commentarii academiae scientiarum imperialis Petropolitanae. St. Petersburg Academy, 1736, vol. 8.
[22] R. G. Busacker and T. L. Saaty, Finite Graphs and Networks, ser. International Series in Pure and Applied Mathematics. McGraw-Hill, 1965.
[23] G. Chartrand and P. Zhang, Introduction to Graph Theory. McGraw-Hill, 2005.
[24] A. A. Nanavati, R. Singh, D. Chakraborty, K. Dasgupta, S. Mukherjea, G. Gurumurthy, and A. Joshi, "Analyzing the Structure and Evolution of Massive Telecom Graphs," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 5, pp. 703-718, March 2008.
[25] E. W. Dijkstra, "A note on two problems in connexion with graphs," Numerische Mathematik, vol. 1, pp. 269-271, 1959.
[26] L. C. Freeman, "Centrality in social networks: conceptual clarification," Social Networks, vol. 1, no. 3, pp. 215-239.
[27] P. Bonacich, "Technique for analyzing overlapping memberships," Sociological Methodology, 1972.
[28] M. E. J. Newman, Mathematics of Networks. Palgrave Macmillan, 2008.
[29] D. J. Watts and S. H. Strogatz, "Collective dynamics of 'small-world' networks," Nature, vol. 393, 1998.
[30] S. Cheung, R. Crawford, M. Dilger, J. Frank, J. Hoagland, K. Levitt, J. Rowe, S. Staniford-Chen, R. Yip, and D. Zerkle, "GrIDS - a graph-based intrusion detection system for large networks," in Proceedings of the 19th National Information Systems Security Conference, 1996, pp. 361-370.
[31] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, "BLINC: multilevel traffic classification in the dark," in SIGCOMM '05: Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications. New York, NY, USA: ACM, 2005, pp. 229-240.
[32] S. Wernicke and F. Rasche, "FANMOD: a tool for fast network motif detection," Bioinformatics, vol. 22, no. 9, pp. 1152-1153, 2006.
[33] R. Itzhack, Y. Mogilevski, and Y. Louzoun, "An optimal algorithm for counting network motifs," Physica A, vol. 381, pp. 482-490, Jul. 2007.
[34] S. Mangan, A. Zaslaver, and U. Alon, "The coherent feedforward loop serves as a sign-sensitive delay element in transcription networks," Journal of Molecular Biology, vol. 334, no. 2, pp. 197-204, November 2003.
[35] Tcpdump/libpcap public repository. [Online]. Available: http://www.tcpdump.org/
[36] J. Postel, "Internet Protocol," RFC 791 (Standard), Sep. 1981, updated by RFC 1349. [Online]. Available: http://www.ietf.org/rfc/rfc791.txt
[37] J. Postel, "Transmission Control Protocol," RFC 793 (Standard), Sep. 1981, updated by RFC 3168. [Online]. Available: http://www.ietf.org/rfc/rfc793.txt
[38] J. Postel, "User Datagram Protocol," RFC 768 (Standard), Aug. 1980. [Online]. Available: http://www.ietf.org/rfc/rfc768.txt
[39] R. Pang, M. Allman, V. Paxson, and J. Lee, "The devil and packet trace anonymization," ACM Computer Communication Review, vol. 36, no. 1, pp. 29-38, January 2006. [Online]. Available: http://www.icir.org/mallman/papers/devil-ccr-jan06.pdf
[40] G. Iannaccone, C. Diot, I. Graham, and N. McKeown, "Monitoring very high speed links," in IMW '01: Proceedings of the 1st ACM SIGCOMM Workshop on Internet Measurement. New York, NY, USA: ACM, 2001, pp. 267-271.
[41] T. Henderson, D. Kotz, and I. Abyzov, "The changing usage of a mature campus-wide wireless network," Computer Networks, vol. In Press, Accepted Manuscript. [Online]. Available: http://dx.doi.org/10.1016/j.comnet.2008.05.003
[42] P. Gill, M. Arlitt, Z. Li, and A. Mahanti, "YouTube traffic characterization: a view from the edge," in IMC '07: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement. New York, NY, USA: ACM, 2007, pp. 15-28.
[43] E. Blanton. (2008, January) tcpurify. [Online]. Available: http://irg.cs.ohiou.edu/~eblanton/tcpurify/
[44] T. Gamer, C. P. Mayer, and M. Scholler, "PktAnon - A Generic Framework for Profile-based Traffic Anonymization," PIK Praxis der Informationsverarbeitung und Kommunikation, vol. 2, pp. 67-81, Jun. 2008.
[45] D. Koukis, S. Antonatos, D. Antoniades, E. P. Markatos, P. Trimintzios, and M. Fukarakis, "CRAWDAD tool tools/sanitize/generic/anontool (v. 2006-09-26)," Downloaded from http://crawdad.cs.dartmouth.edu/tools/sanitize/generic/AnonTool, Sep. 2006.
[47] MIT Lincoln Laboratory: 1999 DARPA Intrusion Detection Evaluation Data Set. [Online]. Available: http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/data/1999data.html
[48] D. Kotz, T. Henderson, and I. Abyzov, "CRAWDAD trace dartmouth/campus/tcpdump/fall03 (v. 2004-11-09)," Downloaded from http://crawdad.cs.dartmouth.edu/dartmouth/campus/tcpdump/fall03, Nov. 2004.
[49] R. Chandra, R. Mahajan, V. Padmanabhan, and M. Zhang, "CRAWDAD data set microsoft/osdi2006 (v. 2007-05-23)," Downloaded from http://crawdad.cs.dartmouth.edu/microsoft/osdi2006, May 2007.
[50] OSCAR protocol. [Online]. Available: http://dev.aol.com/aim/oscar/
[51] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee, "Hypertext Transfer Protocol - HTTP/1.1," RFC 2616 (Standard), Jun. 1999. [Online]. Available: http://www.ietf.org/rfc/rfc2616.txt
[52] P. V. Mockapetris, "Domain names - implementation and specification," RFC 1035 (Standard), United States, 1987. [Online]. Available: http://www.ietf.org/rfc/rfc1035.txt
[53] Active Directory. [Online]. Available: http://www.microsoft.com/windowsserver2008/en/us/active-directory.aspx
[54] R. Marty. AfterGlow. [Online]. Available: http://www.afterglow.sourceforge.net/
[55] A. A. Hagberg, D. A. Schult, and P. J. Swart, "Exploring network structure, dynamics, and function using NetworkX," in Proceedings of the 7th Python in Science Conference (SciPy2008), Pasadena, CA USA, Aug. 2008, pp. 11-15.
[56] N. Kashtan, S. Itzkovitz, R. Milo, and U. Alon, "Mfinder tool guide," 2002.
[57] F. Schreiber and H. Schwobbermeyer, "MAVisto: a tool for the exploration of network motifs," Bioinformatics applications note, structural bioinformatics, 2005.
[58] W. de Nooy, A. Mrvar, and V. Batagelj, Exploratory Social Network Analysis with Pajek. Cambridge University Press, 2005.
[59] S. Wernicke, "Efficient detection of network motifs," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 3, no. 4, pp. 347-359, 2006.
[60] W. Mendenhall and R. J. Beaver, Introduction to Probability and Statistics, 8th ed. PWS-Kent Publishing Company, 1991.
[61] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, "YALE: rapid prototyping for complex data mining tasks." New York, NY, USA: ACM, 2006, pp. 935-940.
[62] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison Wesley, 2006.
[63] G. F. Luger, Artificial Intelligence: Structures and Strategies for Complex Problem Solving, 5th ed. Addison Wesley, 2005.
[64] M. Mitchell, An Introduction to Genetic Algorithms. MIT Press, 1998.
[65] J. J. Grefenstette, "Optimization of control parameters for genetic algorithms," IEEE Transactions on Systems, Man and Cybernetics, vol. 16, no. 1, pp. 122-128, Jan. 1986.
[66] J. D. Schaffer, R. A. Caruana, L. J. Eshelman, and R. Das, "A study of control parameters affecting online performance of genetic algorithms for function optimization," in Proceedings of the Third International Conference on Genetic Algorithms. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1989, pp. 51-60.
[67] D. Lay, Linear Algebra and Its Applications, 2nd ed. Addison Wesley, 2000.
Listing B.1: tshark2mysql.py - stores pcap data into a MySQL database

#!/usr/bin/python
# This file parses tshark output from stdin and inserts it into a MySQL
# database. It assumes the database has already been created and
# will create the necessary table.
#
# The tshark command is:
#   tshark -t e -r <pcap file> tcp or udp

import sys
import MySQLdb

if len(sys.argv) != 2:
    sys.exit("Supply name of table to store data in\n")

try:
    conn = MySQLdb.connect(host="localhost",
                           user="root",
                           passwd="pass",
                           db="data")
except MySQLdb.Error, e:
    sys.exit("Error %d: %s" % (e.args[0], e.args[1]))

cursor = conn.cursor()
cursor.execute("DROP TABLE IF EXISTS %s" % sys.argv[1])
cursor.execute("""CREATE TABLE %s (
    id INT(11) NOT NULL AUTO_INCREMENT,
    ts DOUBLE NOT NULL DEFAULT '0.0',
    protocol VARCHAR(12) NOT NULL,
    sip VARCHAR(15) NOT NULL,
    sport INT(5) NOT NULL DEFAULT '0',
    dip VARCHAR(15) NOT NULL,
    dport INT(5) NOT NULL DEFAULT '0',
    length INT(11) NOT NULL DEFAULT '0',
    PRIMARY KEY id (id)
);""" % sys.argv[1])

rc = 0
while True:
    line = sys.stdin.readline()
    if not line: break
    v = line.split(' ')
    tmp = []
    for i in range(len(v)):
        if v[i] not in ('', '->', ' '):
            tmp.append(v[i])
    v = tmp
    if len(v) == 8:
        try:
            ts = float(v[1])
            sip = v[2]
            dip = v[3]
            sport = int(v[4])
            dport = int(v[5])
            proto = v[6]
            length = int(v[7][:-2])  # strip off the newline characters
            # VALUES clause reconstructed; it was truncated in the original listing
            sql = "INSERT INTO %s (ts, protocol, sip, sport, dip, dport, length) " \
                  "VALUES (%f, '%s', '%s', %d, '%s', %d, %d)" % \
                  (sys.argv[1], ts, proto, sip, sport, dip, dport, length)
            try:
                cursor.execute(sql)
                rc += cursor.rowcount
            except MySQLdb.Error, e:
                print "Error [%d]: %d: %s" % (rc, e.args[0], e.args[1])
        except Exception, e:
            print "ERROR: ", v

cursor.close()
conn.commit()
conn.close()

print("\n%d rows inserted into %s\n" % (rc, sys.argv[1]))
Listing B.2: graph_utils.py - implementation of adjacency matrix conversion and eigenvector centrality using the NetworkX API

import networkx as NX
import math

def adj_matrix(G):
    """
    Function takes a networkx.Graph as an argument (undirected)
    and returns a list of lists representing the corresponding
    adjacency matrix. It can be referenced as you would
    a normal 2D matrix A[i][j]

    node IDs must be [1..G.order()] (taken care of in eigenvector_centrality())
    """
    adj = []
    for n in G.nodes():
        row = []
        for m in range(len(G.nodes()) + 1): row.append(0)
        for m in NX.neighbors(G, n): row[m] = 1
        adj.append(row)
    # Get rid of first element of each row (nodes start at 1, adj is 0-based)
    for i in range(len(adj)): adj[i] = adj[i][1:]
    return adj

def eigenvector_centrality(G):
    """
    Function takes an undirected graph (Graph or XGraph) and
    returns a dictionary of eigenvector centralities, keyed
    by node ID (similar to centrality functions in networkx)

    Function will map node labels to integers [1..G.order()]

    Algorithm adapted from: http://www.analytictech.com/networks/centaids.htm
    """
    eigenvector_centralities = {}
    evCentrality = []
    evUpdate = []
    maxValue = -1.0

    for i in range(G.order()):
        evCentrality.append(1.0)
        evUpdate.append(0.0)

    H = NX.convert_node_labels_to_integers(G, first_label=1, discard_old_labels=False)
    labels = {}
    for k, v in H.node_labels.iteritems(): labels[v] = k

    A = adj_matrix(H)

    # 30 iterations should be enough to converge to a solution
    for x in range(30):
        for i in range(G.order()):
            evUpdate[i] = 0.0
            for j in range(G.order()):
                if (A[i][j] != 0): evUpdate[i] += evCentrality[j]
        maxValue = 0
        for i in range(G.order()):
            maxValue += evUpdate[i] * evUpdate[i]
        maxValue = math.sqrt(maxValue)
        for i in range(G.order()):
            evCentrality[i] = evUpdate[i] / maxValue
    for i in range(1, G.order() + 1):
        eigenvector_centralities[labels[i]] = evCentrality[i - 1]

    return eigenvector_centralities
Listing B.3: node_props_main.py - creates application graphs from MySQL database and computes traditional graph metrics using the NetworkX API

#!/usr/bin/python
"""
This program creates a DiGraph and calculates various graph metrics,
converting DiGraph to Graph as necessary for some metrics

Usage:
    arg1 = table name
    arg2 = port number
    arg3 = max # of nodes to consider
"""
import sys
import networkx as NX
import MySQLdb
from graph_utils import *

class Node:
    """Class to hold properties of nodes"""
    in_degree = 0
    out_degree = 0
    degree = 0
    clustering = 0
    betweenness_centrality = 0
    degree_centrality = 0
    closeness_centrality = 0
    eigenvector_centrality = 0
    eccentricity = 0
    is_center = 0
    is_periphery = 0

if len(sys.argv) != 4:
    sys.exit("Provide table name, port number, and # nodes at command line\n")

table = sys.argv[1]
port = sys.argv[2]
n_max = int(sys.argv[3])

# MySQL connection
try: conn = MySQLdb.connect(host="localhost", user="root", passwd="pass", db="data")
except MySQLdb.Error, e: sys.exit("Error %d: %s" % (e.args[0], e.args[1]))
cursor = conn.cursor()

sql = "SELECT sip, dip, sport, dport FROM %s WHERE sport=%s OR dport=%s" % (table, port, port)
try: cursor.execute(sql)
except MySQLdb.Error, e: sys.exit("Error %d: %s" % (e.args[0], e.args[1]))

# Create a directed graph from SQL results
G = NX.DiGraph(name="%s_%s" % (port, n_max))
for i in range(cursor.rowcount):
    r = cursor.fetchone()
    if r[0] in G and r[1] in G: G.add_edge(r[0], r[1])
    else:
        if G.order() < n_max: G.add_node(r[0])
        if G.order() < n_max: G.add_node(r[1])
        if r[0] in G and r[1] in G: G.add_edge(r[0], r[1])

# Calculate graph properties
myNodes = {}
for n in G.nodes():
    myN = Node()
    # Basic properties
    myN.degree = G.degree(n)
    myN.out_degree = G.out_degree(n)
    myN.in_degree = G.in_degree(n)
    myNodes[n] = myN

"""
The following measures are all based on undirected graphs, and are
computed separately for each of the connected components.
"""
H = G.to_undirected()
CCS = NX.connected_component_subgraphs(H)
for i in range(len(CCS)):
    if CCS[i].order() >= 2:
        cl = NX.clustering(CCS[i], with_labels=True)
        for k, v in cl.iteritems(): myNodes[k].clustering = v

        bc = NX.betweenness_centrality(CCS[i])
        for k, v in bc.iteritems(): myNodes[k].betweenness_centrality = v

        dc = NX.degree_centrality(CCS[i])
        for k, v in dc.iteritems(): myNodes[k].degree_centrality = v

        cc = NX.closeness_centrality(CCS[i])
        for k, v in cc.iteritems(): myNodes[k].closeness_centrality = v

        ec = eigenvector_centrality(CCS[i])
        for k, v in ec.iteritems(): myNodes[k].eigenvector_centrality = v

        d = NX.diameter(CCS[i])
        r = NX.radius(CCS[i])
        ecc = NX.eccentricity(CCS[i], with_labels=True)
        for k, v in ecc.iteritems():
            myNodes[k].eccentricity = v
            if v == d: myNodes[k].is_periphery = 1
            if v == r: myNodes[k].is_center = 1
    else: pass

# Print results
for k, v in myNodes.iteritems():
    s = ""
    s += "%s,%d,%d,%d,%f,%f,%f,%f,%f,%d,%d,%d" % (port, v.in_degree,
        v.out_degree, v.degree, v.clustering, v.betweenness_centrality,
        v.degree_centrality, v.closeness_centrality, v.eigenvector_centrality,
        v.eccentricity, v.is_periphery, v.is_center)
    print s
"""
This program reads FANMOD result files and looks for significant motifs.
It associates each motif with an identifying integer ID and pickles
the results for later use
"""

import pickle
import glob
import pprint

filedir = "/home/eddie/research/fanmod/res_csvs/"
files = glob.glob('/home/eddie/research/fanmod/res_csvs/*.txt')

size3 = {}      # mapping for size 3 motifs
id3 = 0         # first ID for size 3
size4 = {}      # mapping for size 4 motifs
id4 = 0         # first ID for size 4
p_thresh = 0.0  # get motifs with p-value <= p_thresh
pct_occ = 1.0   # get motifs with frequency >= pct_occ

# Iterate through files and make ID associations
for i in range(len(files)):
    inFile = files[i]
    msize = int(inFile[-14])  # Motif size is stored in filename
    f = open(inFile, 'r')
    file = []
    for l in f:
        l = l[:-1]
        if len(l) > 1: file.append(l)

    # Ignore stuff at top of file
    file = file[24:]

    for j in range(0, len(file), msize):
        adjMatrix = ""
        l1 = file[j].split(',')
        if (float(l1[6]) <= p_thresh) and (float(l1[2][:-1]) >= pct_occ):
            # If this is a significant motif...
            adjMatrix += l1[1]
            for k in range(1, msize):
                adjMatrix += file[j+k].split(',')[1]
            if msize == 3 and adjMatrix not in size3.values():
                size3[id3] = adjMatrix
                id3 += 1
            if msize == 4 and adjMatrix not in size4.values():
                size4[id4] = adjMatrix
                id4 += 1

    f.close()  # Close file handle

# Pickle the resulting dictionaries
s3 = open('s3map.pkl', 'w')
pickle.dump(size3, s3)
s3.close()
s4 = open('s4map.pkl', 'w')
pickle.dump(size4, s4)
s4.close()
"""
This file reads the pickled s3 and s4 maps and creates the binary
motif participation profiles for the NN clustering
"""

import sys
import pickle

from string import split
from pprint import pprint
from glob import glob

class profile:
    """Instances of profiles"""
    def __init__(self, id, l):
        self.ID = id
        self.label = l
        self.a = []
        for i in range(len(s3map) + len(s4map)):
            self.a.append(0)

    def mark(self, m):
        try: self.a[adjM[m]] = 1
        except KeyError: pass  # insignificant motif, not in our dict

# Unpickle adjMatrix mapping
s3 = open('s3map_1pct.pkl', 'r')
s3map = pickle.load(s3)
s3.close()
s4 = open('s4map_1pct.pkl', 'r')
s4map = pickle.load(s4)
s4.close()

# Create dictionary for adjMatrix mapping
adjM = {}
idx = 0
for k, v in s3map.iteritems():
    adjM[v] = idx
    idx += 1
for k, v in s4map.iteritems():
    adjM[v] = idx
    idx += 1

seen = {}  # dict for nodes
files = glob('/home/eddie/research/fanmod/data_new_lc/dumpfiles/*')
for i in range(len(files)):
#for i in range(3, 4):
    # Open file for reading
    f = open(files[i], 'r')
    # need a unique prefix since we will have multiple node 0, 1, 2, etc...
    prefix = (files[i].split("/")[7]).split("_")[:-2]
    tmp = ""
    for j in range(len(prefix)): tmp += prefix[j] + "_"
    prefix = tmp
    label = prefix.split("_")[-2]
    for lines in f:
        l = lines.split(",")
        # ignore header lines in dump files
        if len(l) > 2:
            for j in range(1, len(l)):
                myNode = prefix + str(int(l[j]))
                if myNode not in seen:
                    seen[myNode] = profile(myNode, label)
                seen[myNode].mark(str(l[0]))

    f.close()  # close file handle

for k, v in seen.iteritems():
    print v.ID, v.label,
    for i in range(len(v.a)):
        if i < len(v.a)-1: print v.a[i],
        else: print v.a[i]
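The profile construction above reduces to a small, self-contained idea: a binary vector with one slot per significant motif ID, where `mark()` sets the slot for each motif a node participates in and silently ignores motifs that were not significant. The sketch below uses made-up adjacency-string keys in place of the pickled maps:

```python
# Illustrative stand-in for the adjM mapping built from the
# pickled s3/s4 maps: adjacency string -> vector index.
adjM = {"011001000": 0, "010001100": 1, "0111000010001000": 2}

class Profile(object):
    """Binary motif participation profile for one vertex."""
    def __init__(self, node_id, label):
        self.ID = node_id
        self.label = label
        self.a = [0] * len(adjM)   # one slot per significant motif

    def mark(self, m):
        try:
            self.a[adjM[m]] = 1
        except KeyError:
            pass  # insignificant motif, not in our dictionary

# Hypothetical node name and protocol label.
p = Profile("80_node7", "http")
p.mark("011001000")   # significant motif: slot 0 set
p.mark("deadbeef")    # unknown motif: ignored
```

Each resulting vector, tagged with its protocol label, is what the nearest-neighbor clustering consumes.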
Appendix C: Test Parameters
Parameter [default] Value
subgraph (motif) size [default: 3] 3/4
# of samples used to determine approx. # of subgraphs [100000] 100000
full enumeration? 1(yes)/0(no) [1] 1 (yes)
directed? 1(yes)/0(no) [1] 1 (yes)
colored vertices? 1(yes)/0(no) [0] 1 (yes)
colored edges? 1(yes)/0(no) [0] 0 (no)
random type: 0(no regard)/1(global const)/2(local const) [2] 2
Listing C.1: GA_weights.xml – RapidMiner process parameters for genetic algorithm and 1-NN classification
<?xml version="1.0" encoding="UTF-8"?>
<process version="4.2">
  <operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSource" class="ExampleSource">
      <parameter key="attributes" value="/home/eddie/research/fanmod/motif_analysis/weighted/motifs.aml"/>
    </operator>
    <operator name="EvolutionaryWeighting" class="EvolutionaryWeighting" expanded="yes">
      <parameter key="crossover_type" value="shuffle"/>
      <parameter key="keep_best_individual" value="true"/>
      <parameter key="p_crossover" value="0.6"/>
      <parameter key="population_size" value="20"/>
      <parameter key="tournament_size" value="0.2"/>
      <operator name="WeightingChain" class="OperatorChain" expanded="yes">
        <operator name="XValidation" class="XValidation" expanded="yes">
          <parameter key="keep_example_set" value="true"/>
          <operator name="NearestNeighbors" class="NearestNeighbors">
            <parameter key="keep_example_set" value="true"/>
          </operator>
          <operator name="ApplierChain" class="OperatorChain" expanded="yes">
            <operator name="ModelApplier" class="ModelApplier">
              <list key="application_parameters">
              </list>
              <list key="prediction_parameters">
              </list>
            </operator>
            <operator name="Performance" class="Performance"></operator>
            <operator name="PerformanceWriter" class="PerformanceWriter">
              <parameter key="performance_file" value="/home/eddie/research/fanmod/motif_analysis/weighted/motifs.per"/>
            </operator>
          </operator>
        </operator>
      </operator>
    </operator>
  </operator>
</process>
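The evaluation this process encodes can be approximated in plain Python. The sketch below is a simplified leave-one-out cross-validation of a 1-nearest-neighbor classifier over weighted feature vectors; it is an illustration of the technique, not the RapidMiner implementation, and the data and weights are made up:

```python
def one_nn(train, query, weights):
    """Return the label of the training point nearest to query
    under a weighted squared Euclidean distance."""
    best, best_d = None, float('inf')
    for vec, label in train:
        d = sum(w * (a - b) ** 2 for w, a, b in zip(weights, vec, query))
        if d < best_d:
            best, best_d = label, d
    return best

# Hypothetical two-feature profiles with protocol labels.
data = [([0, 0], 'http'), ([0, 1], 'http'), ([5, 5], 'smtp'), ([5, 4], 'smtp')]
weights = [1.0, 1.0]   # feature weights the GA would evolve

# Leave-one-out accuracy: classify each point using the others.
correct = sum(one_nn(data[:i] + data[i+1:], vec, weights) == label
              for i, (vec, label) in enumerate(data))
accuracy = correct / float(len(data))
```

In the actual process, `EvolutionaryWeighting` searches over the weight vector and uses the cross-validated accuracy as its fitness function.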
• Allan, Edward G., Horvath, Michael R., Kopek, Christopher V., Lamb, Brian T., Whaples, Thomas S., and Berry, Michael W.: Anomaly Detection Using Nonnegative Matrix Factorization, Survey of Text Mining II, Springer, 203–217, 2008
Experience
• Research Assistant
Wake Forest University, Winston-Salem, NC
August 2007 – December 2008
Worked with Dr. Errin Fulp on various projects in computer security. Researched topics in computer networks leading to master's thesis. Assisted in classroom and lab duties for networking class.
• Software Development Intern
GreatWall Systems, Inc., Winston-Salem, NC
June 2007 – August 2007
Designed and programmed a testing platform for new high-speed firewall product.
Implemented portions of firewall software in the Python programming language toallow firewall policies to be swapped in place with no gap in coverage.
• Intern – R&D team
Tenable Network Security, Columbia, MD
June 2006 – August 2006
Developed, implemented, and tested Nessus vulnerability scanner plugins. Implemented code for the Tenable Log Correlation Engine product using a proprietary language. Analyzed, assessed, and scored software vulnerabilities according to the Common Vulnerability Scoring System for use with the Nessus vulnerability scanner.
Honors
• Inducted into the Upsilon Pi Epsilon honor society in 2005
• Graduated cum laude from Wake Forest University in 2006
• 2nd place in the 2007 SIAM text mining competition