Chapter 1
INTRODUCTION
1.1 Introduction
The security of computers and network systems has become increasingly
important as more computers are connected to each other and more applications are
implemented in this "virtual" world. Applications such as electronic commerce and
online banking require a strict sense of security. While security measures may not be able to
stop every kind of threat and attack, there is a need for a way to record the events of any
attack that does happen. This record of attacks can be used as a tool to strengthen
security and also as a forensic tool providing evidence of crime.
The tool for this purpose is known as an Intrusion Detection System (IDS). As the
name suggests, an IDS is built to detect intruders in the system. Knowledge of the
attacks taking place gives a better way of countering them. Intrusion
prevention techniques, such as user authentication (e.g. using passwords or biometrics),
avoiding programming errors, and information protection (e.g., encryption) have been
used to protect computer systems as a first line of defense. Intrusion prevention alone is
not sufficient because as systems become ever more complex, there are always
exploitable weaknesses in the systems due to design and programming errors, or
various socially engineered penetration techniques. Intrusion detection is therefore
needed as another wall to protect computer systems.
An intrusion into a computer system can be compared to a physical intrusion
into a building by a thief. It is an entity gaining unauthorized access to resources. The
unauthorized access is intended to steal or change information or to disrupt the valid
use of the resource by an authorized user. Intrusion detection is the ability to determine
that an intruder has gained, or is attempting to gain unauthorized access. An intrusion-
detection system is a tool used to make this determination. The goal of any intrusion-
detection system is to alert an authority of unauthorized access before the intruders can
cause any damage or take any information, much like a burglar alarm system in a
building. However, a digital computer system is far more vulnerable than a building
and much harder to protect. The intruder can be hundreds of miles away when the
attack is initiated, leaving behind very little evidence.
There are some basic definitions of the terms used in this system:
Security: Security consists of mechanisms for providing confidentiality, integrity,
and availability. Confidentiality means that only the individuals allowed access to
particular information should be able to access that information. Integrity refers to
those controls that prevent information from being altered in any unauthorized
manner. Availability controls are those that prevent the proper functioning of
computer systems from being interfered with.
Threat: A threat is any situation or event that has potential to harm a system.
Threats may be external or internal. Threats from users consist of masqueraders
(those who use credentials of others) and clandestine users (those who avoid
auditing and detection). Misfeasors are legitimate users who exceed their privileges.
Attack: An intentional attempt to bypass computer security measures in some
fashion.
Intrusion: A successful attack. An intrusion can be defined as any set of actions
that attempt to compromise the confidentiality, integrity or availability of a
resource.
Signature: A pattern that can be matched to identify a particular type of activity.
Detection rules: A rule typically consists of a signature and associated contextual
and response information.
1.2 Motivation
Our objective is to eliminate, as much as possible, the manual and ad-hoc
elements from the process of building an intrusion detection system. We take a data-
centric point of view and consider intrusion detection as a data analysis process.
Anomaly detection is about finding the normal usage patterns from the audit data,
whereas misuse detection is about encoding and matching the intrusion patterns using
the audit data. The central theme of our approach is to apply data mining techniques to
intrusion detection. Data Mining generally refers to the process of (automatically)
extracting models from large stores of data. The recent rapid development in data
mining has made available a wide variety of algorithms, drawn from the fields of
statistics, pattern recognition, machine learning, and databases. Several types of
algorithms are particularly relevant to our research.
1.3 Problem Definition
In this research work, we describe a data mining framework for adaptively
building Intrusion Detection (ID) models. Data mining refers to Knowledge Discovery
from Data (KDD): knowledge mining from data, that is, knowledge
extraction, data analysis, or pattern analysis. Data mining can be applied to any kind
of data repository. It makes it possible to find patterns for further reference, to perform
aggregation operations, and to compute interestingness measures. Through data mining, large data tombs
are turned into golden nuggets of knowledge.
The central idea is to utilize auditing programs to extract an extensive set of
features that describe each network connection or host session, and apply data mining
programs to learn rules that accurately capture the behavior of intrusions and normal
activities. These nuggets, or rules, are then used for misuse detection and anomaly detection.
The network-based approach relies on tcpdump data as input, which
gives per-packet information. This data is pre-processed by grouping records
according to their protocols and extracting features that are useful for training
the intrusion detection system.
After the features are extracted from the tcpdump data, a data mining algorithm is
run on it. Here the task is to derive association rules: the algorithm
takes the data, with its extensive set of features, and produces rules. The Apriori
association algorithm is chosen for generating the rules.
The rules generated by this algorithm are then used to detect intruders.
To perform this task the ID3 algorithm is used, which takes network data as input and
compares it with the association rules. The algorithm labels each event as either normal
or an attack.
1.4 The Challenges
Formulating the classification tasks, i.e., determining the class labels and the
set of features, from audit data is a very difficult and time-consuming task. Since
security is usually an afterthought of computer system design, there are no standard
auditing mechanisms or data formats designed specifically for intrusion analysis purposes.
A considerable amount of data pre-processing, which involves domain knowledge, is
required to convert raw "action"-level audit data into higher-level "session/event"
records with the set of intrinsic system features. Figure 1.1 shows an example of audit
data preprocessing.
Here, binary tcpdump data is first converted into ASCII packet level data, where
each line contains the information of one network packet. The data is ordered by the
timestamps of the packets. Therefore, packets belonging to different connections may
be interleaved. For example, the three packets shown in Figure 1.1 [16] are from
different connections. The packet data is then processed into connection records with a
number of features (i.e., attributes), e.g., time (the starting time of the connection, i.e.,
the timestamp of its first packet), dur (the duration of the connection), src and dst
(the source and destination hosts), bytes (the number of data bytes from source to destination),
srv (the service, i.e., port, on the destination), and flag (how the connection conforms to
the network protocols, e.g., SF is normal, REJ is "rejected"), etc.
Figure 1.1: Generation of audit data
These intrinsic features essentially summarize the packet level information within a
connection. There are commonly available programs that can process packet level data
into such connection records for network traffic analysis tasks. However, for intrusion
detection, the temporal and statistical characteristics of connections also need to be
considered because of the temporal nature of event sequences in network-based
computer systems.
For example, a large number of “rejected" connections, i.e., flag = REJ, within a short
time frame can be a strong indication of intrusions, because normal connections are
rejected rarely.
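To make the idea concrete, here is a minimal, self-contained sketch (illustrative Python, not the system's code; the record fields and the 2-second window are assumptions) that derives such a temporal feature by counting REJ connections inside a sliding time window:
_______________________________________________________
WINDOW = 2.0  # seconds; an assumed window size, for illustration only

def rej_counts(connections, window=WINDOW):
    """For each connection (sorted by start time), count how many
    connections within the preceding `window` seconds were rejected."""
    counts, start = [], 0
    for i, conn in enumerate(connections):
        # slide the left edge of the window forward
        while connections[start]["time"] < conn["time"] - window:
            start += 1
        counts.append(sum(1 for c in connections[start:i + 1]
                          if c["flag"] == "REJ"))
    return counts

records = [
    {"time": 0.1, "srv": "http",   "flag": "SF"},
    {"time": 0.3, "srv": "telnet", "flag": "REJ"},
    {"time": 0.5, "srv": "telnet", "flag": "REJ"},
    {"time": 3.0, "srv": "http",   "flag": "SF"},
]
print(rej_counts(records))  # [0, 1, 2, 0] -- a burst of REJs stands out
_______________________________________________________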
A critical requirement for using classification rules as an anomaly detector is
that we need to have “sufficient" training data that covers as much variation of the
normal behavior as possible, so that the false positive rate is kept low (i.e., we wish to
minimize detected “abnormal normal" behavior). With limited (“insufficient")
training data, we cannot simply formulate a classification model for the anomaly
detector and then incrementally update the classifier using on-line learning
algorithms, because the limited training data may not have covered all the class
labels, and on-line learning algorithms typically cannot introduce new class labels.
For example, in modeling daily network traffic, we use
the services, e.g., http, telnet etc., of the connections as the class labels in training
models. We may not have connection records of the infrequently used services with,
say, only one week's traffic data.
A formal audit data gathering process therefore needs to take place first. As we
collect audit data, we need an indicator that can tell us whether the new audit data
exhibits any “new" normal behavior, so that we can stop the process when there is no
more variation. This indicator should be simple to compute and must be incrementally
updated.
1.5 Conclusion
This chapter describes the background and motivation of the system. It also
introduces data mining and the need for it, and states the problem definition of the
research.
Chapter 2
LITERATURE SURVEY
2.1 Introduction
Our objective is to develop general rather than intrusion-specific tools in
response to the challenges discussed in the previous section. The idea is to first
compute the association rules from audit data, which (intuitively) capture the intra-
(temporal) audit record patterns.
These patterns are then utilized, with user participation, to guide the data
gathering and feature selection processes. Here we use the term “audit data” to refer to
general data streams that can be processed for detection purposes. Examples of such
data streams are the connection records extracted from the raw tcpdump output, and the
Web site visit records processed using the Web site logs.
We assume that audit data records are time stamped. As described in [10], the
main challenge in developing these data mining algorithms is to provide support
mechanisms for domain knowledge so that “useful” patterns are computed. We next
describe these basic data mining algorithms and our proposed extensions that allow the
introduction of domain knowledge in a convenient manner.
2.2 Intrusion Detection System
Intrusion detection is the process of monitoring the events occurring in a
computer system or network and analyzing them for signs of intrusions, defined as
attempts to compromise the confidentiality, integrity, availability, or to bypass the
security mechanisms of a computer or network. Intrusions are caused by attackers
accessing the systems from the Internet, authorized users of the systems who attempt to
gain additional privileges for which they are not authorized, and authorized users who
misuse the privileges given them. Intrusion Detection Systems (IDSs) are software or
hardware products that automate this monitoring and analysis process.
Intrusion detection allows organizations to protect their systems from the threats
that come with increasing network connectivity and reliance on information systems.
Given the level and nature of modern network security threats, the question for security
professionals should not be whether to use intrusion detection, but which intrusion
detection features and capabilities to use.
IDSs have gained acceptance as a necessary addition to every organization’s
security infrastructure. Despite the documented contributions intrusion detection
technologies make to system security, in many organizations one must still justify the
acquisition of IDSs.
There are several compelling reasons to acquire and use IDSs:
To prevent problem behaviors by increasing the perceived risk of discovery and
punishment for those who would attack or otherwise abuse the system,
To detect attacks and other security violations that are not prevented by other
security measures,
To detect and deal with the preambles to attacks (commonly experienced as
network probes and other “doorknob rattling” activities),
To document the existing threat to an organization
To act as quality control for security design and administration, especially of large
and complex enterprises
To provide useful information about intrusions that do take place, allowing
improved diagnosis, recovery, and correction of causative factors.
2.2.1 Intrusion Detection System Taxonomy
There are several attributes by which all intrusion detection technology can be
classified. These attributes are information sources, type of analysis, timing,
architecture, and activeness. They can be used to compare and categorize specific
intrusion detection solutions. Figure 2.1 illustrates the classification of
intrusion detection systems [18].
Figure 2.1 Classification of Intrusion Detection System
The main distinction among IDSs, on the basis of their source of information, is
between network-based and host-based intrusion detection systems.
( I ) Host-based intrusion detection systems (HIDS):
These are concerned with what is happening on each individual computer or
host. Operating system and computer system details, such as memory and processor
use, user activities, and applications running, are examined for indications of misuse.
They are able to detect such things as repeated failed access attempts or changes to
critical system files. The HIDS reside on a particular computer and provide protection
for a specific computer system[19].
The advantages of HIDS are:
More detailed logging: Because host-based intrusion detection runs on the
monitored host, it can collect much more detailed information regarding exactly
what occurs during the course of an attack.
Increased Recovery: Because of increased granularity of tracking events in the
monitored system, recovery from a successful incident is usually more complete.
Detects unknown attacks: Host-based intrusion detection is better at detecting
unknown attacks that affect the monitored host than network-based intrusion
detection is.
Fewer false positives: A side effect of the way host-based intrusion detection
works is that it produces substantially fewer false alerts than network-based
IDS.
HIDS products such as Snort, Dragon Squire, Emerald eXpert-BSM, NFR HID,
and Intruder Alert all perform this type of monitoring.
( II ) Network-based intrusion detection systems (NIDS):
These examine the individual packets flowing through a network. Unlike
firewalls, which typically only look at IP addresses, ports and ICMP types, network
based intrusion detection systems are able to understand all the different flags and
options that can exist within a network packet. A NIDS can therefore detect maliciously
crafted packets that are designed to be overlooked by a firewall's relatively simplistic
filtering rules. Hackers often craft such traffic in order to map out a network, as a form
of pre-attack reconnaissance[19].
Network-based intrusion detection is more popular than host-based intrusion
detection for several reasons:
Ease of deployment: Network-based IDS listens to activity on the network and
analyzes it. This model results in few performance or compatibility issues in the
monitored environment.
Cost: A handful of strategically placed sensors can be used to monitor a large
organizational environment. Host-based IDS requires software on each monitored
host.
Range of detection: The variety of malicious activities able to be detected through
the analysis of network traffic is wider than the variety able to be detected in host-
based IDS.
Forensics Integrity: If a host using host-based IDS is compromised, then all of the
intrusion detection activity logs become suspect because the attacker most likely
gained the ability to modify information on the host.
Detects all attempts, even failed ones: Network-based IDS analyzes activity
regardless of whether the activity is successful or unsuccessful attack. Host-based
IDS generally only detect successful attacks because most unsuccessful attacks
don’t affect the monitored host directly.
The network-based intrusion detection products include: Cisco Secure IDS
(formerly NetRanger), Hogwash, Dragon, and E-Trust IDS.
2.2.1.1 Analysis Strategy
Intrusion detection systems must be capable of distinguishing between normal
(not security-critical) and abnormal user activities, to discover malicious attempts in
time. However, translating user behaviors (or a complete user-system session) into a
consistent security-related decision is often not simple; many behavior patterns are
unpredictable and unclear [18].
( I ) Misuse detection system:
It is based on extensive knowledge of patterns associated with known attacks
provided by human experts. It attempts to recognize attacks that follow intrusion
patterns that have been recognized and reported by experts. Existing approaches to
implement misuse detection systems are signature matching, expert systems, state
transition analysis and heuristic approach. Typical structure of misuse detection system
is shown in Figure 2.2.
Misuse detection systems are vulnerable to intruders who use new patterns of
behavior or who mask their illegal behavior to deceive the detection system.
Figure 2.2 Misuse Detection System
( II ) Anomaly detection system:
Anomaly detection methods were developed to counter the problem of misuse
detection systems. With the anomaly detection approach, one represents patterns of
normal behavior, with the assumption that an intrusion can be identified based on some
deviation from this normal behavior. When such a deviation is observed, an intrusion
alarm is produced. The major benefit of anomaly detection system is that it is able to
recognize unforeseen attacks. But its major limitation is high false alarm rate, since
detected deviations do not necessarily represent actual attacks[18].
Figure 2.3 shows typical structure of anomaly detection system. The sensor is a
network interface which collects the packets. The activity normalizer performs analysis
of the data. Note the two-way interchange between the activity normalizer and the
“normal” activity database. The activity normalizer must constantly adjust the baseline
of normal activity to reflect the dynamic nature of the monitored computer systems and
network.
Figure 2.3 Anomaly Detection System
The common approaches used to develop anomaly detection systems are
statistical methods, expert systems, neural networks, data mining, and outlier detection
schemes.
2.2.1.2 Time Aspects
Detection systems work in either real-time or at given intervals. At first glance,
real time systems are more desirable, but certain types of activities can be detected over
larger ranges of time. Given the amount of analysis being performed and the amount of
data that most real-time systems must handle, real-time systems have practical
limitations on the size of the window of time that can be examined. Most commercial
products offering real-time analysis are limited to a 5 to 15 minute window of time.
Off-line analysis examines the data once information about the sessions has already been
collected. It is most useful for understanding attackers’ behavior. Most
commercial products today recognize this and use a combination of both types of
timing for the best effect.
2.2.1.3 Architecture
In the case of a centralized IDS, data analysis is performed at a fixed number of
locations, independent of how many hosts are being monitored.
In a distributed IDS, data analysis is performed at a number of locations proportional
to the number of hosts that are being monitored. The distributed intrusion detection
system is necessary for detection of distributed/coordinated attacks targeted at
multiple networks/machines.
Some of the products use combination of both of these architectures.
2.2.1.4 Activeness
The other method of categorizing intrusion detection systems is by their passive
or reactive nature. In a passive system, the IDS sensor detects a potential security
breach, logs the information and signals an alert on the console i.e. no countermeasure
is actively applied to thwart the attack. While in a reactive system, the IDS responds to
the suspicious activity by logging off a user or by reprogramming the firewall to block
network traffic from the suspected malicious source, either autonomously or at the
command of an operator.
2.2.2 Data Processing techniques used in Intrusion Detection
Systems
Depending on the type of approach taken in intrusion detection, various
processing mechanisms (techniques) are employed on the data that reaches the IDS. Below,
several such techniques are described briefly:
Expert systems: These work on a previously defined set of rules describing an
attack. All security-related events incorporated in an audit trail are translated into
if-then-else rules. Examples are Wisdom & Sense and ComputerWatch
(developed at AT&T)[24].
Signature analysis: Like the expert system approach, this method is based on
attack knowledge. It transforms the semantic description of an attack into the
appropriate audit trail format. Thus, attack signatures can be found in logs or input
data streams in a straightforward way. Detection is accomplished by using common
text string matching mechanisms. Typically, it is a very powerful technique and as
such very often employed in commercial systems[20].
Colored Petri nets: The Colored Petri Nets approach is often used to generalize
attacks from expert knowledge bases and to represent attacks graphically. With this
technique, it is easy for system administrators to add new signatures to the system.
However, matching a complex signature to the audit trail data may be time-
consuming. The technique is not used in commercial systems [3].
State-transition: Here, an attack is described with a set of goals and transitions
that must be achieved by an intruder to compromise a system. Transitions are
represented on state-transition diagrams.
Statistical analysis approach: This is a frequently used method (for example
SECURENET). The user or system behavior (set of attributes) is measured by a
number of variables over time. Examples of such variables are: user login, logout,
number of files accessed in a period of time, usage of disk space, memory, CPU etc.
Neural networks: Neural networks use their learning algorithms to learn about
the relationship between input and output vectors and to generalize them to extract
new input/output relationships. With the neural network approach to intrusion
detection, the main purpose is to learn the behavior of actors in the system (e.g.,
users, daemons) [4].
User intention identification: This technique models normal behavior of
users by the set of high-level tasks they have to perform on the system (in relation
to the users’ functions). These tasks are taken as a series of actions, which in turn are
matched to the appropriate audit data. The analyzer keeps a set of tasks that are
acceptable for each user. Whenever a mismatch is encountered, an alarm is
produced.
Computer immunology: Analogies with immunology have led to the
development of a technique that constructs a model of the normal behavior of UNIX
network services, rather than that of individual users. This model consists of short
sequences of system calls made by the processes. Attacks that exploit flaws in the
application code are very likely to take unusual execution paths. First, a set of
reference audit data is collected which represents the appropriate behavior of
services, and then all the known “good” sequences of system calls are added to the
knowledge base. These patterns are then used for continuous monitoring
of system calls, to check whether the sequence generated is listed in the knowledge
base; if not, an alarm is generated. This technique has a potentially very low false
alarm rate provided that the knowledge base is fairly complete. Its drawback is the
inability to detect errors in the configuration of network services. Whenever an
attacker uses legitimate actions on the system to gain unauthorized access, no alarm
is generated.
Machine learning: This is an artificial intelligence technique that stores the
user-input stream of commands in vectorial form and uses it as a reference profile of
normal user behavior. Profiles are then grouped in a library of user
commands having certain common characteristics [5].
Data mining: This refers to a set of techniques for extracting
previously unknown but potentially useful information from large stores of data. A typical
data mining task is finding association rules. It allows one to
extract previously unknown knowledge about new attacks or to build models of normal behavior
patterns. Anomaly detection often generates false alarms; with data mining it is
easy to correlate data related to alarms with mined audit data, thereby considerably
reducing the rate of false alarms. Data mining is also referred to as Knowledge Discovery
from Data (KDD): knowledge mining from data, that is, knowledge
extraction, data analysis, or pattern analysis. Data mining is applicable to
any kind of data repository. Its aim is to find patterns, to present the
knowledge in an integrated form, and to remove noise and unnecessary data from the
data source. Through data mining, it is possible to find patterns for further
reference, to perform aggregation operations, and to compute interestingness measures. We can view
data mining as an evolution of information technology: the widening gap
between data and information drove the development of data mining, which turns large data
repositories, or data tombs, into golden nuggets of knowledge, as huge
databases are mined by tools to gain meaningful knowledge. Data mining is the
process of automatically searching large volumes of data for patterns. Data mining
is a fairly recent and contemporary topic in computing. However, data mining
applies many older computational techniques from statistics, machine learning, and
pattern recognition. Data mining can be defined as the nontrivial extraction of
implicit, previously unknown, and potentially useful information from data. It is the
science of extracting useful information from large data sets or databases [6,24].
2.2.3 Requirements for Intrusion Detection System
The basic requirements for a good intrusion detection system are:
A system must recognize any suspect activity or triggering event that could
potentially be an attack.
Escalating behavior on the part of an intruder should be detected at the lowest level
possible.
Components on various hosts must communicate with each other regarding level of
alert and intrusions detected.
The system must respond appropriately to changing levels of alertness.
The detection system must have some manual control mechanisms to allow
administrators to control various functions and alert levels of the system.
The system must be able to adapt to changing methods of attack.
The system must be able to handle multiple concurrent attacks.
The system must be scalable and easily expandable as the network changes.
The system must be resistant to compromise, able to protect itself from intrusion.
The system must be efficient and reliable.
2.2.4 Shortfalls of current IDS
Despite 20 years of research, intrusion detection technology has quite a way to
go to achieve a perfect solution. There are still many challenges to achieve effective
intrusion detection, as discussed below:
Alert handling: Until an intrusion detection system is properly tuned to a specific
environment, there can be literally thousands of alerts generated on a daily basis.
The expertise and manpower required to handle alerts can be quite daunting.
Variants: As stated previously, signatures are developed in response to new
vulnerabilities or exploits that have been posted or released. Integral to the success
of a signature is that it be unique enough to alert only on malicious traffic and rarely
on valid network traffic. The difficulty here is that exploit code can often be easily
changed. It is not uncommon for an exploit tool to be released and then have its
defaults changed shortly thereafter by the hacker community.
False positives: A common complaint is the amount of false positives IDS will
generate. Developing unique signatures is a difficult task, and oftentimes
vendors will err on the side of alerting too often rather than not enough. This is
analogous to the story of the boy who cried wolf. It is much more difficult to pick
out a valid intrusion attempt if a signature also alerts regularly on valid network
activity. A difficult problem that arises from this is how much can be filtered out
without potentially missing an attack.
False negatives: This leads to the other concept of false negatives where an IDS
does not generate an alert when an intrusion is actually taking place. Simply put, if a
signature has not been written for a particular exploit, there is an extremely good
chance that the IDS will not detect it.
Evasion: An increasing number of attackers understand the shortcomings of
some intrusion detection technologies, such as signature-based IDS. As
attackers understand the weakness, their attacks are designed to bypass detection.
Architectural issues: Technology such as switches, Gigabit Ethernet, and
encryption make network-based intrusion detection much more challenging.
Data overload: Another aspect, which is extremely important, is how much data
an analyst can effectively and efficiently analyze. That being said, the amount of
data one needs to look at seems to be growing rapidly. Depending on the intrusion
detection tools employed by a company and its size, there is the possibility for logs
to reach millions of records per day.
2.3 Goals of Data Mining
Data mining is typically carried out with some end goals or applications.
Broadly speaking, these goals fall into the following classes: prediction, identification,
classification, and optimization [9]:
Prediction: Data mining can show how certain attributes within data will behave
in the future. Examples of predictive data mining include the analysis of buying
transactions to predict what consumers will buy under certain discounts, how much
sales volume a store would generate in a given period, and whether deleting a product
line would yield more profits. In such applications, business logic is coupled
with data mining.
Identification: Data patterns can be used to identify the existence of an item, an
event, or an activity. For example, in biological applications, existence of a gene
may be identified by certain sequences of nucleotide symbols in the DNA sequence.
Classification: Data mining can partition the data so that different classes or
categories can be identified based on combinations of parameters. For example,
customers in a supermarket can be categorized into discount-seeking shoppers,
shoppers in rush, loyal regular shoppers, shoppers attached to name brands, and
infrequent shoppers. Sometimes classification based on common domain
knowledge is used as an input to decompose the mining problem and make it
simpler.
Optimization: One eventual goal of data mining may be to optimize the use of
limited resources such as time, space, money, or materials and to maximize output
variables such as sales or profit under a given set of constraints. As such, this goal
of data mining resembles the objective function used in operations research problems
that deal with optimization under constraints.
2.3.1 Types of Knowledge Discovered During Data Mining
The term “knowledge” is very broadly interpreted as involving some degree of
intelligence. There is progression from raw data to information to knowledge as we go
through additional processing. Knowledge is often classified as inductive versus
deductive. Deductive knowledge deduces new information based on applying pre-
specified logical rules of deduction on the given data. Data mining addresses inductive
knowledge, which discovers new rules and patterns from supplied data.
It is common to describe the knowledge discovered during data mining in five
ways, as follows[24]:
Association rules: These find associations between attributes, i.e., how
one attribute acts on or relates to another. Such rules correlate the presence of a set
of items with a range of values for another set of variables [8].
Example: buys(X, “handbag”) → buys(X, “shoes”). When a female retail
shopper buys a handbag, she is likely to buy shoes.
Classification hierarchies: This is the procedure of describing and distinguishing
concepts so as to predict unknown classes from training data. The goal is
to work from an existing set of events or transactions to create a hierarchy of
classes.
Example: A population may be divided into five ranges of credit worthiness based on a
history of previous credit transactions.
Sequential patterns: These find patterns that occur frequently in the data; the
data is analyzed to find which patterns recur, and a sequence of actions or events
is sought [9].
Example: If a patient underwent cardiac bypass surgery for blocked arteries and later
developed high blood urea within a year of surgery, he or she is likely to suffer from
kidney failure within the next 18 months.
Detection of events is equivalent to detecting associations among events with certain
temporal relationships.
Patterns within time series: Similarities can be detected within positions of a
time series of data, which is a sequence of data taken at regular intervals, such as daily
sales [9].
Example: Two products show the same selling pattern in summer but a different one in
winter.
Clustering: This analyzes data with no known category or class label. The data is
partitioned such that items within the same cluster are similar to one another, beyond
any base class they belong to; every cluster is a collection of objects that share
common properties. A given population of events or items can thus be partitioned into sets of
“similar” elements [10].
Example: An entire population of patients treated for a disease may be divided into groups
based on the similarity of side effects produced.
A typical data mining technique is associated with finding association rules.
Association rules are used to gather necessary knowledge about the nature of audit data, on
the assumption that discovering patterns within the individual records of a trace can help
specify the correlations among different features.
2.3.2 Association Rules
Association rules were originally developed as a tool for analysis of retail sales.
A piece of sales data usually includes information about a transaction, such as
transaction date and items purchased. Association rules can be used to find the
correlation among different items in a transaction. For example, when a customer buys
item A, item B will also be purchased by the customer with the probability of 90%.
Agrawal and Srikant [11] have presented some fast algorithms to mine association
rules, including the algorithm Apriori. Using the notation of Agrawal and Srikant, let D =
{T1, T2, …, Tn} be the transaction database with n transactions in total and I = {i1, i2,
…, im} be the set of all the items, where each ij (1 ≤ j ≤ m) represents one kind of
item. Each transaction Tl (1 ≤ l ≤ n) in D records the items purchased, i.e., Tl ⊆ I.
Define an itemset as a nonempty subset of I. An association rule has the
form X → Y, with confidence c and support s, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅, i.e., X and Y are disjoint
itemsets. Here s represents the support of the association rule and c represents its
confidence.
Assume the number of transactions that contain both the itemset X and the
itemset Y is n′; then s = support(X ∪ Y) = n′/n, and c = support(X ∪ Y)/support(X).
Intuitively, support(X) can be viewed as the occurrence frequency of the itemset X in
the whole transaction database D, while c indicates that when X is satisfied, there is
a certainty of c that Y is also true. Two thresholds, minconfidence (the
minimum confidence) and minsupport (the minimum support), are used by the
mining algorithm to find all association rules X → Y such that c ≥ minconfidence
and s ≥ minsupport. An itemset X is called a large itemset if support(X) ≥
minsupport.
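As a quick illustration of these definitions, the following self-contained Python fragment (the five-transaction database is invented for this example) computes s and c for a rule X → Y:
_______________________________________________________
# Invented five-transaction database, for illustration only
D = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b"}]

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

X, Y = {"a"}, {"b"}
s = support(X | Y, D)                   # n'/n = 3/5 = 0.60
c = support(X | Y, D) / support(X, D)   # 0.60 / 0.80 = 0.75
print(f"a -> b holds with support {s:.2f} and confidence {c:.2f}")
_______________________________________________________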
2.3.3 The Apriori Algorithm
The basic Apriori algorithm finds frequent itemsets for Boolean association
rules, receiving as input a database T of transactions and the minimum support for the
rules. It uses the Apriori property: if an itemset I is not frequent, then the itemset I ∪ A
(where A is any other item) is also not frequent; i.e., “all nonempty subsets of a frequent itemset
must also be frequent”.
The Apriori algorithm [12] builds a set Ck (candidate itemsets of size k) and Lk
(frequent itemsets of size k) to create frequent itemsets of size k+1:
_____________________________________________________
Input: a database T of transactions and the minimum support for the rules
_______________________________________________________
Algorithm:
L1 = {frequent 1-itemsets};
k = 2;
while (Lk-1 ≠ ∅) {
    Ck = GenerateCandidates(Lk-1);
    for each transaction t in the database T
        increment the count of all candidates in Ck that are contained in t;
    Lk = candidates in Ck with enough support;
    k++;
}
return L = L1 ∪ L2 ∪ …;
_______________________________________________________
Output: frequent itemsets for Boolean association rules
_______________________________________________________
Figure 2.4: The Apriori algorithm
GenerateCandidates() returns a subset of the join of Lk-1 with itself,
pruning itemsets that do not satisfy the Apriori property. Computing the support and
confidence of all nonempty subsets of each frequent itemset then generates the set of
association rules.
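For concreteness, the loop of Figure 2.4 can be rendered in runnable form; the sketch below is one possible reading, not the thesis implementation, and it represents itemsets as frozensets and takes minsup as an absolute count:
_______________________________________________________
from itertools import combinations

def apriori(transactions, minsup):
    """Frequent itemsets of `transactions` (a list of frozensets);
    minsup is an absolute support count."""
    items = {i for t in transactions for i in t}

    def frequent(cands):
        return {c for c in cands
                if sum(1 for t in transactions if c <= t) >= minsup}

    L = [frequent({frozenset([i]) for i in items})]       # L1
    while L[-1]:
        prev = L[-1]
        # join: union pairs of frequent (k-1)-itemsets into k-itemsets ...
        cands = {a | b for a in prev for b in prev
                 if len(a | b) == len(a) + 1}
        # ... prune: every (k-1)-subset must itself be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in prev
                        for s in combinations(c, len(c) - 1))}
        L.append(frequent(cands))
    return [s for level in L for s in level]              # L1 U L2 U ...
_______________________________________________________
The two set comprehensions inside the loop play the role of GenerateCandidates(): the join step unions pairs of frequent (k−1)-itemsets, and the prune step discards any candidate with an infrequent (k−1)-subset, which is exactly the Apriori property.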
2.3.4 ID3 Decision Tree Algorithm
Decision trees are powerful and popular tools for classification and prediction.
The attractiveness of decision trees is due to the fact that, in contrast to neural
networks, decision trees represent rules. Rules can readily be expressed so that humans
can understand them or even directly used in a database access language like SQL so
that records falling into a particular category may be retrieved.
In some applications, the accuracy of a classification or prediction is the only
thing that matters. In such situations we do not necessarily care how or why the model
works. In other situations, the ability to explain the reason for a decision is crucial. In
marketing, one has to describe the customer segments to marketing professionals, so that
they can utilize this knowledge in launching a successful marketing campaign. These
domain experts must recognize and approve this discovered knowledge, and for this we
need good descriptions. There are a variety of algorithms for building decision trees
that share the desirable quality of interpretability.
Decision Tree: A decision tree is a classifier in the form of a tree structure, where
each node is either:
A leaf node - indicates the value of the target attribute (class) of examples, or
A decision node - specifies some test to be carried out on a single attribute-
value, with one branch and sub-tree for each possible outcome of the test.
A decision tree can be used to classify an example by starting at the root of the
tree and moving through it until a leaf node is reached, which provides the classification of the
instance.
Decision tree induction is a typical inductive approach to learn knowledge on
classification. The key requirements to do mining with decision trees are:
Attribute-value description: The object or case must be expressible in terms of a
fixed collection of properties or attributes. This means that we need to discretize
continuous attributes, unless this is already handled by the algorithm.
Predefined classes (target attribute values): The categories to which
examples are to be assigned must have been established beforehand (supervised
data).
Discrete classes: A case does or does not belong to a particular class, and there
must be more cases than classes.
Sufficient data: Usually hundreds or even thousands of training cases.
Constructing Decision Trees: Most algorithms that have been developed for
learning decision trees are variations on a core algorithm that employs a top-down,
greedy search through the space of possible decision trees. Decision tree programs
construct a decision tree T from a set of training cases[24].
__________________________________________________
Input: R: a set of non-target attributes,
C: the target attribute,
S: a training set
_______________________________________________________
Algorithm:
begin
- If S is empty, return a single node with value Failure;
- If S consists of records all with the same value for the target attribute, return a
single leaf node with that value;
- If R is empty, then return a single node with the value of the most frequent of
the values of the target attribute found in the records of S [in this case
there may be errors, i.e., examples that will be improperly classified];
- Let A be the attribute with the largest Gain(A, S) among the attributes in R;
- Let {aj | j = 1, 2, .., m} be the values of attribute A;
- Let {Sj | j = 1, 2, .., m} be the subsets of S consisting respectively of records
with value aj for A;
- Return a tree with root labeled A and arcs labeled a1, a2, .., am going
respectively to the trees ID3(R−{A}, C, S1), ID3(R−{A}, C, S2), ….., ID3(R−{A}, C, Sm);
- Recursively apply ID3 to the subsets {Sj | j = 1, 2, .., m} until they are empty
end
_______________________________________________________
Output: a decision tree
___________________________________________________
Figure 2.5: ID3 Decision Tree Algorithm
ID3 searches through the attributes of the training instances and extracts the
attribute that best separates the given examples. If the attribute perfectly classifies the
training sets then ID3 stops; otherwise it recursively operates on the m (where m =
number of possible values of an attribute) partitioned subsets to get their "best"
attribute.
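A compact Python rendering of Figure 2.5 may help; it is a sketch under the assumptions that each example is a dictionary of attribute values plus a target label, and that Gain is computed as defined in the next subsection:
_______________________________________________________
import math
from collections import Counter

def entropy(examples, target):
    """Entropy of the class distribution of `examples`, in bits."""
    counts = Counter(e[target] for e in examples)
    total = len(examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def gain(examples, attr, target):
    """Information gain Gain(S, A) of splitting `examples` on `attr`."""
    total = len(examples)
    remainder = 0.0
    for v in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == v]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attrs, target):
    classes = {e[target] for e in examples}
    if len(classes) == 1:          # all records share one label -> leaf
        return classes.pop()
    if not attrs:                  # no attributes left -> majority label
        return Counter(e[target] for e in examples).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(examples, a, target))
    # branch only on values actually observed, so the "Failure" case of
    # Figure 2.5 (an empty subset) cannot arise in this sketch
    return {best: {v: id3([e for e in examples if e[best] == v],
                          [a for a in attrs if a != best], target)
                   for v in {e[best] for e in examples}}}
_______________________________________________________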
The best classifier: The estimation criterion in the decision tree algorithm is the
selection of an attribute to test at each decision node in the tree. The goal is to select the
attribute that is most useful for classifying examples. A good quantitative measure of
the worth of an attribute is a statistical property called information gain that measures
how well a given attribute separates the training examples according to their target
classification. This measure is used to select among the candidate attributes at each step
while growing the tree.
Entropy: A measure of the homogeneity of a set of examples. It is also the most commonly
used discretization measure: it uses class distribution information to determine
split points, i.e., the data values for partitioning an attribute's range.
In order to define information gain precisely, we need to define a measure
commonly used in information theory, called entropy, that characterizes the (im)purity
of an arbitrary collection of examples. Given a set S, containing only positive and
negative examples of some target concept (a 2 class problem), the entropy of set S
relative to this simple, binary classification is defined as [24]:
Entropy(S) = − Pi log2 Pi − Pj log2 Pj
where Pi is the proportion of positive examples in S and Pj is the proportion of
negative examples in S. In all calculations involving entropy we define 0 log 0 to be 0.
If the target attribute takes on c different values, then the entropy of S relative to
this c-wise classification is defined as [24]:
Entropy(S) = − Σi=1..c pi log2 pi
where pi is the proportion of S belonging to class i. Note the logarithm is still
base 2 because entropy is a measure of the expected encoding length measured in bits.
Note also that if the target attribute can take on c possible values, the maximum
possible entropy is log2 c.
Information gain: Given entropy as a measure of the impurity in a collection of
training examples, we can now define a measure of the effectiveness of an attribute in
classifying the training data. The measure we will use, called information gain, is
simply the expected reduction in entropy caused by partitioning the examples according
to this attribute. More precisely, the information gain, Gain (S, A) of an attribute A,
relative to a collection of examples S, is defined as [24]:
Gain(S, A) = Entropy(S) − Σv∈Values(A) (|Sv|/|S|) Entropy(Sv)
where Values(A) is the set of all possible values for attribute A, and Sv is the
subset of S for which attribute A has value v (i.e., Sv = {s ∈ S | A(s) = v}). Note the first
term in the equation for Gain is just the entropy of the original collection S and the
second term is the expected value of the entropy after S is partitioned using attribute A.
The expected entropy described by this second term is simply the sum of the entropies
of each subset Sv, weighted by the fraction of examples |Sv|/|S| that belong to Sv. Gain
(S,A) is therefore the expected reduction in entropy caused by knowing the value of
attribute A. Put another way, Gain(S,A) is the information provided about the target
attribute value, given the value of some other attribute A. The value of Gain(S,A) is the
number of bits saved when encoding the target value of an arbitrary member of S, by
knowing the value of attribute A.
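Using the entropy() and gain() helpers from the ID3 sketch above, a toy computation over invented connection records shows the measure at work:
_______________________________________________________
examples = [
    {"srv": "http",   "flag": "SF",  "label": "normal"},
    {"srv": "http",   "flag": "SF",  "label": "normal"},
    {"srv": "telnet", "flag": "SF",  "label": "normal"},
    {"srv": "telnet", "flag": "REJ", "label": "attack"},
]
print(entropy(examples, "label"))       # 0.811 bits for a 3:1 class split
print(gain(examples, "flag", "label"))  # 0.811: flag alone separates the classes
_______________________________________________________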
The process of selecting a new attribute and partitioning the training examples
is now repeated for each non-terminal descendant node, this time using only the
training examples associated with that node. Attributes that have been incorporated
higher in the tree are excluded, so that any given attribute can appear at most once
along any path through the tree. This process continues for each new leaf node until
either of two conditions is met:
Every attribute has already been included along this path through the tree, or
The training examples associated with this leaf node all have the same target
attribute value (i.e., their entropy is zero).
2.4 Extensions of Apriori Algorithm
These basic algorithms do not consider any domain knowledge and as a result
they can generate many “irrelevant" (i.e., uninteresting) rules. We need to limit the
generation of these “uninteresting” rules. Also, the rules need to be generalized so that
they can be used in a generic manner across rule sets that are “situationally” similar [8].
These limitations are addressed, and this generalization achieved, by incorporating two
modifications into the basic Apriori algorithm. They are:
Axis attributes
Reference attributes
2.4.1 Interestingness measures based on attributes
We attempt to utilize the schema level information about audit records to direct
the pattern mining process. That is, although we cannot know in advance what patterns,
which involve actual attribute values, are interesting, we often know what attributes are
more important or useful given a data analysis task. By using the minimum support and
confidence values to output only the statistically significant patterns, the basic
algorithms implicitly measure the interestingness (i.e., relevancy) of patterns by their
support and confidence values, without regard to any available prior domain
knowledge. That is, assume I is the interestingness measure of a pattern p, then
I(p) = f(support(p); confidence(p))
Where f is some ranking function. We propose here to incorporate schema level
information into the interestingness measures. Assume IA is a measure on whether a
pattern p contains the specified important (i.e. “interesting") attributes, our extended
interestingness measure is
Ie(p) = fe(IA(p); f(support(p); confidence(p))) = fe(IA(p); I(p))
Where fe is a ranking function that first considers the attributes in the pattern,
then the support and confidence values. In the following sections, we describe several
schema-level characteristics of audit data, in the forms of “what attributes must be
considered", that can be used to guide the mining of relevant features. We do not use
these IA measures in post-processing to filter out irrelevant rules by rank ordering.
Rather, for efficiency, we use them as item constraints, i.e., conditions, during
candidate itemset generation.
2.4.2 Using the Axis Attribute(s)
There is a partial “order of importance" among the attributes of an audit record.
Some attributes are essential in describing the data, while others only provide auxiliary
information. Consider the audit data of network connections shown in Figure 2.6.
Figure 2.6 Network connection records
Here each record (row) describes a network connection. The continuous
attribute values, except the timestamps, are discretized into proper bins. A network
connection can be uniquely identified by
< timestamp; src host; src port; dst host; service >
That is, the combination of its start time, source host, source port, destination
host, and service (destination port). These are the essential attributes when describing
network data. We argue that the “relevant" association rules should describe patterns
related to the essential attributes. Patterns that include only the unessential attributes are
normally “irrelevant". For example, the basic association rules algorithm may generate
rules such as
(src bytes = 200) →(flag = SF)
These rules are not useful and to some degree are misleading. There is no
intuition for the association between the number of bytes from the source, src bytes,
and the normal status (flag = SF) of the connection, but rather it may just be a statistical
correlation evident from the dataset.
We call the essential attribute(s) axis attribute(s) when they are used as a form
of item constraints in the association rules algorithm. During candidate generation, an
item set must contain value(s) of the axis attribute(s). We consider the correlation
among non-axis attributes as not interesting. In other words,
IA(p) = 1 if p contains axis attribute(s), and 0 otherwise.
In practice, we need not designate all essential attributes as the axis attributes.
For example, some network analysis tasks require statistics about various network
services while others may require the patterns related to the hosts. We can use service
as the axis attribute to compute the association rules that describe the patterns related to
the services of the connections. It is even more important to use the axis attribute(s) to
constrain the item generation for frequent episodes. The basic algorithm can generate
serial episode rules that contain only the “unimportant" attribute values. For example
src bytes = 200; src bytes = 200 → dst bytes = 300; src bytes = 200
(We omit the support, confidence from the above rule.) Note that here each
attribute value, e.g., src bytes = 200, is from a different connection record.
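As an illustrative sketch (hypothetical item encoding, not the thesis code), the axis-attribute constraint can be written as a predicate over candidate itemsets, where each item is an (attribute, value) pair and service is the designated axis attribute; in practice it is applied during candidate generation rather than as a post-filter:
_______________________________________________________
AXIS = {"service"}  # the designated axis attribute(s)

def interesting(itemset, axis=AXIS):
    """I_A(p) = 1 iff pattern p contains at least one axis attribute."""
    return any(attr in axis for attr, _ in itemset)

p1 = frozenset([("service", "http"), ("flag", "SF")])
p2 = frozenset([("src_bytes", 200), ("flag", "SF")])
print(interesting(p1), interesting(p2))  # True False
_______________________________________________________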
2.4.3 Using the Reference Attribute(s)
Another interesting characteristic of system audit data is that some attributes
can be the references of other attributes. These reference attributes normally carry
information about some “subject", and other attributes describe the “actions" that refer
to the same “subject".
Consider the log of visits to a Web site, as shown in Figure 2.7. Here action and
request are the “actions" taken by the “subject", remote host. We see that a number
of remote hosts each make the same sequence of requests:
“/images", “/images" and “/shuttle/missions/sts-71".
Figure 2.7: Web log records
It is important to use the “subject" as a reference when finding such frequent
sequential “action" patterns because the “actions" from different “subjects" are
normally irrelevant. This kind of sequential pattern can be represented as:
(subject = X, action = a), (subject = X, action = b) → (subject = X, action = c)
Note that within each occurrence of the pattern, the action values refer to the
same subject, yet the actual subject value may not be given in the rule since any
particular subject value may not be frequent with regard to the entire dataset. In other
words, subject is simply a reference (or a variable).
In other words,
IA(p) = 1 if the itemsets of p refer to the same reference attribute value, and 0 otherwise.
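A small sketch (with hypothetical field names) of how the reference attribute is honored in practice: records are first grouped by the "subject", so that sequential "action" patterns are only mined within one subject's records:
_______________________________________________________
from collections import defaultdict

def group_by_reference(records, ref="remote_host"):
    """Group the action sequences of each subject (the reference value)."""
    groups = defaultdict(list)
    for r in records:
        groups[r[ref]].append(r["request"])
    return groups

log = [
    {"remote_host": "h1", "request": "/images"},
    {"remote_host": "h2", "request": "/images"},
    {"remote_host": "h1", "request": "/shuttle/missions/sts-71"},
]
print(dict(group_by_reference(log)))
# {'h1': ['/images', '/shuttle/missions/sts-71'], 'h2': ['/images']}
_______________________________________________________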
2.5 Level-Wise Approximate Mining
It is often necessary to include low frequency patterns. In daily network
traffic, some services, for example gopher, account for very low occurrences. Yet
we still need to include their patterns in the network traffic profile (so that we have
representative patterns for each supported service) [10]. If we used a very low support
value for the data mining algorithms, we would then get an unnecessarily large
number of patterns related to the high frequency services, for example smtp.
Procedure
_______________________________________________________
Input:
The terminating minimum support s0;
The initial minimum support si;
The axis attribute(s);
_______________________________________________________
Output:
Association rules Rules
_______________________________________________________
Begin
(1) Rrestricted = ∅;
(2) scan database to form L = {1-itemsets that meet s0};
(3) s = si;
(4) while (s ≥ s0) do begin
(5) compute frequent episodes from L: each episode must contain at least one
axis attribute value that is not in Rrestricted;
(6) append new axis attribute values to Rrestricted;
(7) append episode rules to the output rule set Rules;
(8) s = s/2; /* a smaller support value for the next iteration */
end while
End
_______________________________________________________
Figure 2.8: Level-wise Approximate Mining Procedure
Here the idea is to first find the episodes related to high frequency axis attribute
values. We then iteratively lower the support threshold to find the episodes related to
the low frequency axis values by restricting the participation of the “old" axis values
that already have output episodes. More specifically, when an episode is generated, it
must contain at least one “new" (low frequency) axis value.
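The following runnable sketch mirrors Figure 2.8 at a high level; the episode-mining step is deliberately reduced to a per-service support check (a stand-in for real episode mining), and the data and thresholds are invented:
_______________________________________________________
def level_wise(transactions, s_init, s_term, axis="srv"):
    """Iteratively halve the support threshold; output patterns only for
    axis values that have not produced patterns yet (Rrestricted)."""
    restricted, rules, s = set(), [], s_init
    while s >= s_term:
        # mining step, reduced here to a support check per "new" axis value
        for v in {t[axis] for t in transactions} - restricted:
            support = sum(t[axis] == v for t in transactions) / len(transactions)
            if support >= s:
                rules.append((v, support))  # stand-in for that value's episode rules
                restricted.add(v)
        s /= 2  # a smaller support value for the next iteration
    return rules

conns = [{"srv": "smtp"}] * 90 + [{"srv": "gopher"}] * 10
print(level_wise(conns, s_init=0.5, s_term=0.05))
# [('smtp', 0.9), ('gopher', 0.1)] -- the rare service is picked up later,
# at a lower threshold, without flooding the output with smtp patterns
_______________________________________________________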
2.6 The Basis of the Algorithm
Let I = {i1, i2, ……, im} be a set of items. Let D, the task-relevant data, be a set
of database transactions, where each transaction T is a set of items such that T ⊆ I. Each transaction is associated
with an identifier, called its TID. Let A be a set of items. A transaction T is said to
contain A if and only if A ⊆ T. An association rule is an implication of the form A → B,
where A ⊆ I, B ⊆ I, and A ∩ B = ∅. The rule A → B holds in the transaction set D with
support s, where s is the percentage of transactions in D that contain A ∪ B. This is taken
to be the probability P(A ∪ B). The rule A → B has confidence c in the transaction set
D if c is the percentage of transactions in D containing A that also contain B. This is
taken to be the conditional probability P(B|A). That is:
Support(A→B) = P(A∪B)
Confidence(A→B) = P(B|A)
Rules that satisfy both a minimum support threshold, min_sup, and a minimum
confidence threshold, min_conf, are called strong or interesting. By convention, we
write the confidence and support to lie between 0% and 100% rather than between 0.0
and 1.0.
A set of items is referred to as an itemset. An itemset that contains k items
is called a k-itemset. The occurrence frequency of an itemset is the number
of transactions that contain the itemset. This is also known as the frequency or support
count of the itemset. An itemset satisfies minimum support if the occurrence
frequency of the itemset is greater than or equal to the product of min_sup and the number of
transactions in D. The number of transactions required for an itemset to satisfy
minimum support is therefore referred to as minimum support count. The set of
frequent k-itemset is denoted by Lk.
2.6.1 Association rule mining is a two-step process:
Find all frequent Itemsets: By definition, each of these itemsets will occur at
least as frequently as a predetermined support count.
Generate Association Rules: By definition, these rules must satisfy minimum
support and minimum confidence.
For example, an association rule from the shell command history file (which is a stream
of commands and their arguments) of a user is
trn → rec:humor, 0.3; 0.1;
which indicates that 30% of the time when the user invokes trn, he or she is
reading the news in rec:humor, and reading this newsgroup accounts for 10% of the
activities recorded in his or her command history file.
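As a quick check on these definitions, the following Python sketch computes support and confidence for a candidate rule over a small, invented list of transactions:
_______________________________________________________
transactions = [
    {"trn", "rec.humor"},
    {"trn", "rec.humor"},
    {"trn", "comp.lang.c"},
    {"mail"},
    {"trn", "rec.humor"},
]

def support(itemset):
    # fraction of transactions containing every item in `itemset`
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b):
    # Confidence(A -> B) = Support(A ∪ B) / Support(A)
    return support(a | b) / support(a)

print(support({"trn", "rec.humor"}))       # 0.6
print(confidence({"trn"}, {"rec.humor"}))  # 0.75
_______________________________________________________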
Following is the Apriori algorithm as given in [10]. First the frequent itemsets are generated (lines 1–11), followed by the generation of rules (lines 12–15).
_______________________________________________________
Input: a database D of transactions and the minimum support and minimum confidence for the rules
_______________________________________________________
Algorithm:
Begin
(1) scan database D to form L1 = {frequent 1-itemsets};
(2) k = 2; /* k is the length of the itemsets */
(3) while Lk-1 ≠ ∅ do begin /* association generation */
(4) for each pair l1k-1, l2k-1 ∈ Lk-1 whose first k−2 items are the same do begin
(5) construct candidate itemset ck such that its first k−2 items are the same as those of l1k-1, and its last two items are the last item of l1k-1 and the last item of l2k-1;
(6) if there is a length k−1 subset sk-1 ⊆ ck with sk-1 ∉ Lk-1
(7) then remove ck; /* the prune step */
else
(8) add ck to Ck;
end for
(9) scan D and count the support of each ck ∈ Ck;
(10) Lk = {ck | support(ck) ≥ minimum support};
(11) k = k + 1;
end while
(12) for all lk, k ≥ 2 do begin /* rule generation */
(13) for all nonempty proper subsets am of lk do begin
(14) conf = support(lk)/support(am);
(15) if conf ≥ minimum confidence then begin
output rule am → (lk − am), with confidence = conf and support = support(lk);
end if
end for
end for
End
_______________________________________________________
Output: Association rules
_______________________________________________________
Figure 2.9: Apriori Association Rules Algorithm
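For reference, here is a compact Python rendering of Figure 2.9. It is a sketch, not the thesis implementation (asso.exe): the join of line (4) is replaced by unioning pairs of (k−1)-itemsets, which yields the same candidate set once the prune step has run, and min_sup is taken as an absolute count.
_______________________________________________________
from itertools import combinations

def apriori(transactions, min_sup, min_conf):
    # transactions: list of sets; min_sup: absolute support count;
    # min_conf: a fraction in [0, 1].
    def count(itemset):
        return sum(itemset <= t for t in transactions)

    # line (1): frequent 1-itemsets
    items = {i for t in transactions for i in t}
    L = {1: {frozenset([i]) for i in items if count({i}) >= min_sup}}
    k = 2
    while L[k - 1]:                                     # line (3)
        # lines (4)-(5): candidates as unions of two frequent (k-1)-itemsets
        cand = {a | b for a in L[k - 1] for b in L[k - 1] if len(a | b) == k}
        # lines (6)-(8): prune candidates with an infrequent (k-1)-subset
        cand = {c for c in cand
                if all(frozenset(s) in L[k - 1]
                       for s in combinations(c, k - 1))}
        # lines (9)-(10): count support, keep the frequent candidates
        L[k] = {c for c in cand if count(c) >= min_sup}
        k += 1
    # lines (12)-(15): rule generation
    rules = []
    for size, level in L.items():
        if size < 2:
            continue
        for lk in level:
            for m in range(1, size):
                for am in map(frozenset, combinations(lk, m)):
                    conf = count(lk) / count(am)
                    if conf >= min_conf:
                        rules.append((set(am), set(lk - am), conf))
    return L, rules
_______________________________________________________
On the example database D of Table 2.1 below, apriori(D, min_sup=2, min_conf=0.7) reproduces L1, L2 and L3 as derived in the walkthrough (its rule output also includes rules drawn from the 2-itemsets, which the walkthrough does not enumerate).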
Example
Consider an example database, D (Table 2.1), consisting of 10 transactions. Suppose the minimum support count required is 2 (i.e., min_sup = 2/10 = 20%) and let the minimum confidence required be 70%. We first have to find the frequent itemsets using the Apriori algorithm; then the association rules will be generated using minimum support and minimum confidence.
Table 2.1: Example database D
TID    List of Items
T1     A, B, D
T2     B, C
T3     A, E, F
T4     B, E, F
T5     A, C, D, F
T6     D, E, F
T7     A, B, C, D
T8     E, F
T9     A, D, E
T10    C, D, E

Generating 1-itemset Frequent Pattern
Scan D for count of each candidate

Table 2.2: 1-itemset C1
Itemset    Sup. Count
A          5
B          4
C          4
D          6
E          6
F          5
Compare candidate support count with minimum support count

Table 2.3: Final 1-itemset L1
Itemset    Sup. Count
A          5
B          4
C          4
D          6
E          6
F          5

The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support. In the first iteration of the algorithm, each item is a member of the set of candidates.

Generate C2 candidates from L1

Table 2.4: 2-itemset C2
ItemSet
A,B
A,C
A,D
A,E
A,F
B,C
B,D
B,E
B,F
C,D
C,E
C,F
D,E
D,F
E,F
Scan D for count of each candidate

Table 2.5: 2-itemset with support count C2
ItemSet    Sup. Count
A,B        2
A,C        2
A,D        4
A,E        2
A,F        2
B,C        2
B,D        2
B,E        1
B,F        1
C,D        3
C,E        1
C,F        1
D,E        3
D,F        2
E,F        4

Compare candidate support count with minimum support count

Table 2.6: Final 2-itemset L2
ItemSet    Sup. Count
A,B        2
A,C        2
A,D        4
A,E        2
A,F        2
B,C        2
B,D        2
C,D        3
D,E        3
D,F        2
E,F        4
Generating 2-itemset Frequent Pattern
To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to generate a candidate set of 2-itemsets, C2. Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated (as shown in Table 2.5). The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.

Generate C3 candidates from L2

Table 2.7: 3-itemset C3
ItemSet
A,B,C
A,B,D
A,C,D
A,D,E
A,D,F
A,E,F
B,C,D
B,D,E
B,D,F
D,E,F
Scan D for count of each candidate

Table 2.8: 3-itemset with support count C3
ItemSet    Sup. Count
A,B,C      1
A,B,D      2
A,C,D      2
A,D,E      1
A,D,F      1
A,E,F      1
B,C,D      1
B,D,E      0
B,D,F      0
D,E,F      1

Compare candidate support count with minimum support count

Table 2.9: Final 3-itemset L3
Itemset    Sup. Count
A,B,D      2
A,C,D      2
Generating 3-itemset Frequent Pattern
• The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori property.
• In order to find C3, we compute L2 Join L2.
• The join C3 = L2 Join L2 also yields sets such as {B,D,E}, {C,D,E}, {C,D,F} and {B,D,F}.
• The Join step is now complete, and the Prune step is used to reduce the size of C3. The Prune step helps to avoid heavy computation due to a large Ck.
• Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that these four latter candidates cannot possibly be frequent.
• For example, let us take {A,B,C}. Its 2-item subsets are {A,B}, {B,C} and {A,C}. Since all 2-item subsets of {A,B,C} are members of L2, we keep {A,B,C} in C3.
• Let us take another example, {B,D,F}, which shows how the pruning is performed. Its 2-item subsets are {B,D}, {D,F} and {B,F}.
• However, {B,F} is not a member of L2 and hence is not frequent, violating the Apriori property. Thus we have to remove {B,D,F} from C3.
• Now, the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
Generating 4-itemset Frequent Pattern
• The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {A,B,C,D}, this itemset is pruned since its subset {A,B,C} is not frequent.
• These frequent itemsets are then used to generate strong association rules (where strong association rules satisfy both minimum support and minimum confidence).
Association Rules from Frequent Itemsets
• For each frequent itemset l, generate all nonempty proper subsets of l.
• For every nonempty subset s of l, output the rule "s → (l − s)" if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.
• We had L = {{A}, {B}, {C}, {D}, {E}, {F}, {A,B}, {A,C}, {A,D}, {A,E}, {A,F}, {B,C}, {B,D}, {C,D}, {D,E}, {D,F}, {E,F}, {A,B,D}, {A,C,D}}.
– Let us take l = {A,B,D}.
– All its nonempty proper subsets are {A,B}, {A,D}, {B,D}, {A}, {B}, {D}.
• Let the minimum confidence threshold be, say, 70%.
• The resulting association rules are shown below, each listed with its confidence.
– R1: A ∧ B → D
• Confidence = sc{A,B,D}/sc{A,B} = 2/2 = 100%
• R1 is selected.
– R2: A ∧ D → B
• Confidence = sc{A,B,D}/sc{A,D} = 2/4 = 50%
• R2 is rejected.
– R3: B ∧ D → A
• Confidence = sc{A,B,D}/sc{B,D} = 2/2 = 100%
• R3 is selected.
– R4: A → B ∧ D
• Confidence = sc{A,B,D}/sc{A} = 2/5 = 40%
• R4 is rejected.
– R5: B → A ∧ D
• Confidence = sc{A,B,D}/sc{B} = 2/4 = 50%
• R5 is rejected.
– R6: D → A ∧ B
• Confidence = sc{A,B,D}/sc{D} = 2/6 = 33.3%
• R6 is rejected.
In this way, we have found two strong association rules, R1 and R3; the sketch below re-derives these confidences mechanically.
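A small self-contained Python check over the example database D:
_______________________________________________________
from itertools import combinations

D = [{"A","B","D"}, {"B","C"}, {"A","E","F"}, {"B","E","F"},
     {"A","C","D","F"}, {"D","E","F"}, {"A","B","C","D"},
     {"E","F"}, {"A","D","E"}, {"C","D","E"}]

def sc(itemset):
    # support count: number of transactions containing the itemset
    return sum(set(itemset) <= t for t in D)

l = {"A", "B", "D"}
for m in (2, 1):
    for am in combinations(sorted(l), m):
        conf = sc(l) / sc(am)
        verdict = "selected" if conf >= 0.7 else "rejected"
        print(set(am), "->", l - set(am), f"{conf:.0%}", verdict)
_______________________________________________________
Running this prints exactly the six verdicts R1–R6 above (100%, 50%, 100% for the two-item antecedents; 40%, 50%, 33% for the single-item antecedents).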
2.7 Protocols
The passing of the data and network information down through the layers of the
sending device and back up through the layers of the receiving device is made possible
by an interface between each pair of adjacent layers. Each interface defines what
information and services a layer must provide for the layer above it. Well-defined
interfaces and layer functions provide modularity to a network. As long as a layer
provides the expected services to the layer above it, the specific implementation of its
functions can be modified or replaced without requiring changes to the surrounding
layers. At each layer a header is added to the data unit; at the data link layer, a trailer is added as well [26].
2.7.1 IP (Internet Protocol)
In general, packet filters usually operate only on the header information found in the packet. Because there are several different protocol headers in each packet, we will look at the ones that are important to packet filtering. Header information found at
the Ethernet frame level is not used by most packet filters. The source address and other
similar information is of little use because it is either a local MAC hardware address for
a system on the local LAN or the same for a router responsible for the last leg of a
packet's journey through the Internet. Next up in the protocol stack would be the IP
packet header information.
There are three important pieces of information here:
IP Address: source and destination IP address
Protocols: such as TCP, UDP and ICMP
IP Options: such as source routing
( I ) IP Address
The most obvious pieces of information that can be used are the source and destination address fields. If one has only a limited number of host computers on the Internet that one wants to allow through the firewall, one can filter incoming packets based on their source address. The same thing works in reverse: one can filter packets coming from inside the network so that only certain destination addresses are allowed to get through the firewall and onto the Internet.
The IP address space has its own hierarchy and is divided into Class A, Class B, Class C, Class D and Class E. The address classes differ in size and number. Class A addresses are the largest, but there are few of them; Class C addresses are the smallest, but they are numerous. Classes D and E are also defined, but not used in normal operation.
( II ) Protocols
The next IP header field that can be useful is the Protocol field. This field
defines the protocol that the packet payload will be used for. There are two principal
protocols used:
TCP
UDP
The payload can also be something such as ICMP, which carries control messages, whereas IP itself merely directs data to its correct destination. As a rule, drop packets for any protocol that is not used on your network or that could allow someone outside of your network to reconfigure how your network operates. For example, some of the uses to which ICMP can be put include telling your routers that a destination is not reachable, or telling your router to reconfigure its tables to change the route to a particular network.
The individual bit information stored in IP Header is as follows:
Version: Identifies the version number of the protocol—for example, IPv4 or IPv6.
The receiving workstation looks at this field first to determine whether it can read the
incoming data. If it cannot, it will reject the packet. Rejection rarely occurs, however,
because most TCP/IP-based networks use IPv4. This field is 4 bits long.
Internet Header Length (IHL): Identifies the number of 4-byte (or 32-bit) blocks in the IP header. The most common header length comprises five groupings, as the minimum length of an IP header is 20 bytes (five 4-byte blocks). This field is important because it indicates to the receiving node where data will begin (immediately after the header ends). The IHL field is 4 bits long.
Differentiated Services (DiffServ) Field: Informs routers what level of
precedence they should apply when processing the incoming packet. This field is 8 bits
long. It used to be called the Type of Service (ToS) field, and its purpose was the same
as the re-defined Differentiated Services field. However, the ToS specification allowed
only eight different values regarding the precedence of a datagram, and the field was
rarely used. Differentiated Services allows for up to 64 values and a greater range of
priority handling options.
Total length: Identifies the total length of the IP datagram, including the header and
data, in bytes. An IP datagram, including its header and data, cannot exceed 65,535
bytes. The Total length field is 16 bits long.
Identification: Identifies the message to which a datagram belongs and enables the
receiving node to reassemble fragmented messages. This field and the following two
fields, Flags and Fragment offset, assist in reassembly of fragmented packets. The
Identification field is 16 bits long.
Flags: Indicates whether a message is fragmented and, if it is fragmented, whether this
datagram is the last in the fragment.
Fragment offset: Identifies where the datagram fragment belongs in the incoming
set of fragments. This field is 13 bits long.
Time to live (TTL): Indicates the maximum time that a datagram can remain on the
network before it is discarded. Although this field was originally meant to represent
units of time, on modern networks it represents the number of times a datagram has
been forwarded by a router, or the number of router hops it has endured. The TTL for
datagrams is variable and configurable, but is usually set at 32 or 64. Each time a datagram passes through a router, its TTL is reduced by 1. When a router receives a datagram with a TTL equal to 1, it discards that datagram (or more precisely, the frame to which it belongs). The TTL field in an IP datagram is 8 bits long.
Protocol: This 8-bit field defines the higher-level protocol that uses the services of
the IP layer. An IP datagram can encapsulate data from several higher level protocols
such as TCP, UDP, ICMP, and IGMP. This field specifies the final destination protocol
to which the IP datagram should be delivered. In other words, since the IP protocol
multiplexes and demultiplexes data from different higher-level protocols, the value of
this field helps in the demultiplexing process when the datagram arrives at its final
destination.
Header checksum: Allows the receiving node to determine whether the IP header has been corrupted during transmission. This field is 16 bits long.
Source IP address: Identifies the full IP address (or Network layer address) of the
source node. This field is 32 bits long.
Destination IP address: Indicates the full IP address (or Network layer address) of
the destination node. This field is 32 bits long.
Options: May contain optional routing and timing information. The Options field
varies in length.
Padding: Contains filler bits to ensure that the header is a multiple of 32 bits. The
length of this field varies.
Data: Includes the data originally sent by the source node, plus information added by
TCP in the Transport layer. The size of the Data field varies.
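To make the field layout concrete, here is a minimal Python sketch that unpacks the fixed 20-byte IPv4 header described above using the standard struct module (the function name and dictionary keys are our own):
_______________________________________________________
import struct

def parse_ipv4_header(raw):
    # Unpack the fixed 20-byte IPv4 header (the variable Options field
    # is not handled here).
    (ver_ihl, diffserv, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_ihl >> 4,            # high 4 bits
        "ihl": ver_ihl & 0x0F,              # header length in 4-byte blocks
        "diffserv": diffserv,
        "total_length": total_len,
        "identification": ident,
        "flags": flags_frag >> 13,          # top 3 bits
        "fragment_offset": flags_frag & 0x1FFF,
        "ttl": ttl,
        "protocol": proto,                  # 1 = ICMP, 6 = TCP, 17 = UDP
        "header_checksum": checksum,
        "source_ip": ".".join(str(b) for b in src),
        "destination_ip": ".".join(str(b) for b in dst),
    }
_______________________________________________________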
2.7.2 TCP (Transmission Control Protocol)
TCP lies between the application and the network layer of the TCP/IP protocol
suite and serves as the intermediary between the application programs and the network
operations. TCP operates in the Transport layer of the OSI Model and provides reliable
data delivery services. TCP is a connection-oriented sub-protocol, which means that a
connection must be established between communicating nodes before this protocol will
transmit data. TCP further ensures reliable data delivery through sequencing and
checksums.
IP is responsible for communication at the computer level (host-to-host
communication). As a network layer protocol, IP can deliver the message only to the
destination computer. However, this is an incomplete delivery. The message still needs
to be handed to the correct application program. TCP is responsible for delivery of the
message to the appropriate application program.
The local host and the remote host are identified using IP addresses. To identify the client and server programs, we need a second identifier, called a port number. In the TCP/IP protocol suite, port numbers are integers between 0 and 65,535.
2.7.3 UDP (User Datagram Protocol)
UDP lies between the application and the network layer of the TCP/IP protocol
suite and serves as the intermediary between the application programs and the network
operations. UDP operates in the Transport layer of the OSI Model. UDP is a
connectionless transport service. In other words, UDP offers no assurance that packets
will be received in the correct sequence. In fact, this protocol does not guarantee that
the packets will be received at all. Furthermore, it provides no error checking or
sequencing. Nevertheless, UDP’s lack of sophistication makes it more efficient than
TCP. It can be useful in situations where a great volume of data must be transferred
quickly, such as live audio or video transmissions over the Internet. In these cases, TCP
—with its acknowledgments, checksums, and flow control mechanisms—would only
add more overhead to the transmission.
UDP is also more efficient for carrying messages that fit within one data packet. In contrast to a TCP header's 10 fields, the UDP header contains only four fields: Source port, Destination port, Length, and Checksum. Use of the Checksum field in UDP is optional. UDP is a very simple protocol with a minimum of overhead.
If a process wants to send a small message and does not care much about reliability, it
can use UDP. Sending a small message using UDP takes much less interaction between
the sender and receiver than using TCP.
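Since the UDP header has only the four fields named above, parsing it is correspondingly simple; a sketch in the same style as the IP header example:
_______________________________________________________
import struct

def parse_udp_header(raw):
    # Unpack the fixed 8-byte UDP header: four 16-bit fields.
    src_port, dst_port, length, checksum = struct.unpack("!HHHH", raw[:8])
    return {"source_port": src_port,
            "destination_port": dst_port,
            "length": length,        # header + data, in bytes
            "checksum": checksum}    # optional in UDP; 0 means not used
_______________________________________________________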
2.7.4 ICMP (Internet Control Message Protocol)
ICMP is a network layer protocol. However, its messages are not passed
directly to the data link layer as would be expected. Instead, the messages are first
encapsulated inside IP datagrams before going to the lower layer. The value of the
protocol field in the IP datagrams is 1 to indicate that the IP data is an ICMP message.
ICMP reports on the success or failure of data delivery. It can indicate when part of the network is congested, when data fails to reach its destination, and when data has been discarded because the allotted time for its delivery (its TTL) expired. ICMP
announces these transmission failures to the sender, but ICMP cannot correct any of the
errors it detects; those functions are left to higher-layer protocols, such as TCP.
However, ICMP’s announcements provide critical information for troubleshooting
network problems.
2.8 TCPDump
TCPDump has a wide range of features and can be used in a number of ways. This section gives a brief introduction to the basic features of TCPDump. TCPDump can be used to capture some or all packets received by a network interface. The range of packets captured can be specified using a combination of logical operators and parameters such as source and destination MAC or IP addresses, protocol types (IP and Ethernet) and TCP/UDP port numbers [16].
The packets captured can either be written to a file as raw data for later processing by TCPDump, or directed to standard output where they can be displayed or processed using other tools and scripts. Data written to a file can be examined using TCPDump and the data directed to standard output.
It is quite common to use TCPDump to write a range of packets to a file and then read back just the packets required; this allows the dataset to be examined repeatedly while an expression is refined to extract exactly the packets required. It is quite frustrating to realize that you have captured only 98% of what you wanted; it is far better to capture 120% and then filter!
linux> tcpdump -i eth0 >> test.txt
After giving this command, the captured network data will be automatically stored into that file.
The content of this file may look as below; TCPDump output has the following format.
For UDP datagrams
15:22:41.400299 orac.erg.abdn.ac.uk.1052 > 224.2.156.220.57392: udp 110
Timestamp 15:22:41.400299
Source address orac.erg.abdn.ac.uk
Source port 1052
Destination address 224.2.156.220
Destination port 57392
Protocol udp
Size 110
For TCP datagrams
16:23:01.079553 churchward.erg.abdn.ac.uk.33635 > gordon.erg.abdn.ac.uk.32772: P
12765:12925(160) ack 19829 win 24820 (DF)
Timestamp 16:23:01.079553
Source address churchward.erg.abdn.ac.uk
Source port 33635
Destination address gordon.erg.abdn.ac.uk
Destination port 32772
PUSH flag set P
Sequence number (start byte) 12765:
Contained data bytes from the sequence number up to but not including 12925
Number of user data bytes in datagram (160)
Details of acknowledgement, window size and header flags: ack 19829 win 24820 (DF)
Decoding of the full TCP headers using TCPDump is not discussed here; however, this is a well-researched area, and a web search is a good starting point.
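For programmatic work with such traces (as the input pre-processing module in Chapter 3 requires), a line in the UDP format above can be split with a regular expression. A rough Python sketch, assuming the default timestamped host.port > host.port layout:
_______________________________________________________
import re

# Matches lines such as:
# 15:22:41.400299 orac.erg.abdn.ac.uk.1052 > 224.2.156.220.57392: udp 110
UDP_LINE = re.compile(
    r"(?P<ts>\d\d:\d\d:\d\d\.\d+)\s+"
    r"(?P<src>\S+)\.(?P<sport>\S+)\s+>\s+"
    r"(?P<dst>\S+)\.(?P<dport>\S+):\s+udp\s+(?P<size>\d+)"
)

line = "15:22:41.400299 orac.erg.abdn.ac.uk.1052 > 224.2.156.220.57392: udp 110"
m = UDP_LINE.match(line)
if m:
    print(m.groupdict())
    # {'ts': '15:22:41.400299', 'src': 'orac.erg.abdn.ac.uk',
    #  'sport': '1052', 'dst': '224.2.156.220', 'dport': '57392',
    #  'size': '110'}
_______________________________________________________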
To get TCPDump to display more information about each packet, use the verbose output mode:
tcpdump -v <expression>
tcpdump -vv <expression>
tcpdump -vvv <expression>
Time Stamps: TCPDump adds timestamps to packets by default. The timestamp is in the following format: hours:minutes:seconds.fraction
15:22:41.400299 orac.erg.abdn.ac.uk.1052 > 224.2.156.220.57392: udp 110
The following switches alter the timestamp format.
-t suppresses the timestamp output
orac.erg.abdn.ac.uk.1052 > 224.2.156.220.57392: udp 597
-tt gives an unformatted timestamp; this value is a count in seconds from the OS clock's initial value
1029507868.335134 orac.erg.abdn.ac.uk.1052 > 224.2.156.220.57392: udp 520
-ttt gives the interval between the received packet and the previous packet
358020 orac.erg.abdn.ac.uk.1052 > 224.2.156.220.57392: udp 586
328704 orac.erg.abdn.ac.uk.1052 > 224.2.156.220.57392: udp 893
391361 orac.erg.abdn.ac.uk.1052 > 224.2.156.220.57392: udp 491
Source and Destination addresses and Ports: To capture packets to or from particular groups or hosts, a range of expressions can be used; here are some examples.
To capture all traffic with host churchward as source or destination address
tcpdump host churchward
To capture all traffic with the tcp or udp, source or destination, port number 53
tcpdump port 53
To capture all traffic with the source address churchward
tcpdump src host churchward
To capture all traffic with the destination tcp or udp port 53
tcpdump dst port 53
To capture all TCP traffic with the source address churchward
tcpdump tcp and src host churchward
To capture all traffic with the destination udp port 53
tcpdump udp dst port 53
There is a huge range of options available; the examples above are intended to give an introduction to the structure and syntax.
Logical Operators: Expressions can be combined using AND and OR with the
additional use of NOT.
To capture all traffic with the source address churchward AND with the destination
udp port 53
tcpdump src host churchward and udp dst port 53
To capture all traffic with the destination address 224.2.127.254 OR with the
destination address 239.255.255.255
tcpdump dst 224.2.127.254 or dst 239.255.255.255
To capture all traffic with the destination address 224.2.127.254 AND NOT with the source address 139.133.204.110
tcpdump dst 224.2.127.254 and not src 139.133.204.110
Writing to and Reading from file
To write raw packets to a file for later processing, the syntax is as follows
tcpdump -w <filename>
This can be combined with an expression to only write some packets to the file.
tcpdump -w dns-file udp dst port 53
This would write all packets with destination udp port 53 to the file dns-file.
To read packets from a dump file
tcpdump -r <filename>
This can be combined with an expression to only read some packets from the file.
tcpdump -r dns-file src host churchward and udp
This would read any udp packets sent by host churchward from the file.
2.9 QT Designer
Qt Designer is a tool for designing and implementing user interfaces built with the Qt multiplatform GUI toolkit. Qt Designer makes it easy to experiment with user interface design. At any time you can generate the code required to reproduce the user interface from the files Qt Designer produces, changing your design as often as you like. Users of an earlier version will find themselves immediately productive in the new version, since the interface is very similar, and will also find new widgets and new and improved functionality developed as a result of user feedback.
Qt Designer helps you build user interfaces with layout tools that move and scale your widgets (controls, in Windows terminology) automatically at runtime. The resulting interfaces are both functional and attractive, comfortably suiting your users' operating environments and preferences. Qt Designer supports Qt's signals and slots mechanism for type-safe communication between widgets. Qt Designer includes a code editor which you can use to embed your own custom slots inside the generated code. Those who prefer to separate generated code from hand-crafted code can continue to use the subclassing approach pioneered in the first version of Qt Designer.
2.10 Conclusion
In this chapter, we have seen the various classes of intrusion detection systems. We studied the different types of detection mechanisms, the data mining algorithms for intrusion detection, the various protocols of the communication network, and the programming tools used.
Chapter 3
ANALYSIS AND DESIGN
3.1 Introduction
In order to develop efficient software it is necessary to undergo a system study.
Analysis of the system includes various stages. The important step in analysis is
requirement analysis. In requirement analysis, we consider User Perspective analysis,
Developer Perspective analysis and Functional Perspective analysis. The user
perspective analysis specifies the requirements on the behalf of the user. The developer
perspective analysis specifies the developing methodologies of the system i.e. the
facilities provided by the system.
The functional perspective analysis specifies the main functional tasks for the
system. The requirement analysis is followed by the data structure analysis where we
consider the Data Flow Diagrams for the system. Using DFDs, we can specify the
requirement of functions, processes and data at various levels.
In the system design for the IDS, we study the network policies which are applied for designing the IDS data structure, the major system components required, and their hardware and software configurations.
After this study, we design the software architecture of the IDS, which contains the various modules required for the implementation. The system architecture is divided into two parts: internal module communication and module specifications. The internal module communication process explains the interaction of one module with other modules. This interaction may be in the form of user data, system data, inputs for operation, or outputs of operation. The module specification gives information about the functional behavior and operation of each module.
3.2 Data Flow Diagram
Data Flow Diagrams (DFDs) show the transformation of data from input to output through processes. These DFDs describe and analyze the movement of data through the intrusion detection system. The main functions, processes, database and data structures can be studied using the DFDs for the intrusion detection system.
Any functional model represents three generic functions: INPUT, PROCESSING and OUTPUT. The functional model begins with a single context level, and over a series of iterations more and more detail is provided.
3.2.1 Level-0 DFD
As shown in the 0th-level data flow diagram in figure 3.1, the network traffic from the driver (the Network Interface Card driver) is the input information to the IDS. The network traffic includes packet information that is captured by tcpdump. The output from the IDS is intrusion information, which consists of connection time, IP address, port, and status.
Figure 3.1: 0th level DFD
3.2.2 Level-1 DFD
The level-1 data flow diagram is shown in figure 3.2.
Input pre-processing
The network-based approach relies on tcpdump data as input, which gives per-packet information. We used a huge dataset, collected over a long time span, as the normal data. This data was pre-processed to generate the itemset.txt file. The basic task is to extract an extensive set of features for the data mining algorithm. This task is performed on the basis of the protocol of the packet and the frequency of the item sets. The frequent item sets and their frequencies are stored in the item set file.
The training module
The IDS is trained using a data set that contains candidate item sets and their corresponding frequencies. The process in this module performs the Apriori algorithm to generate association rules. For this algorithm we also need to specify a minimum support and a minimum confidence. Both parts of the stream are necessary in this module to perform a conventional association rules discovery. The output of this module is a profile of rules that depicts the behaviour of the network together with the possible known attacks. This profile of rules is stored in the rules file.
Figure 3.2: Level-1 DFD
The detection module
Here the actual detection of intrusions is implemented. The profile of rules, along with the network data set, is fed to the detection module, which performs the ID3 decision tree algorithm to classify the incoming packets from the network as normal or attacks on the basis of the association rules defined in the training module. The output of this module contains the information about intruders such as connection time, IP address, port, and status.
3.2.3 Level-2 DFD
Input Pre-Processing:
The level-2 data flow diagram for the input pre-processing is shown in figure 3.4.
Here the dataset that we collected over a long time is given as input to the algorithm that generates the frequent itemsets for the further generation of the association rules. Following is the Apriori algorithm as given in [10], restricted to the generation of the frequent itemsets.
_______________________________________________________
Input: a database D of transactions and the minimum support for the rules
_______________________________________________________
Algorithm:
Begin
(1) scan database D to form L1 = {frequent 1-itemsets};
(2) k = 2; /* k is the length of the itemsets */
(3) while Lk-1 ≠ ∅ do begin /* association generation */
(4) for each pair l1k-1, l2k-1 ∈ Lk-1 whose first k−2 items are the same do begin
(5) construct candidate itemset ck such that its first k−2 items are the same as those of l1k-1, and its last two items are the last item of l1k-1 and the last item of l2k-1;
(6) if there is a length k−1 subset sk-1 ⊆ ck with sk-1 ∉ Lk-1
(7) then remove ck; /* the prune step */
else
(8) add ck to Ck;
end for
(9) scan D and count the support of each ck ∈ Ck;
(10) Lk = {ck | support(ck) ≥ minimum support};
(11) k = k + 1;
end while
End
_______________________________________________________
Output: Frequent itemsets.
_______________________________________________________
Figure 3.3: Apriori algorithm for frequent itemset
Figure 3.4: Level-2 DFD of Input Pre-Processing Module
In the input pre-processing module, the packet data in test.txt, collected from tcpdump, works as input. These data are processed to collect the candidate item sets and count their frequencies in each packet. The output, containing the candidate item sets and their frequencies, is stored in itemset.txt.
The Training Module:
The level-2 data flow diagram for the training module is shown in figure 3.6.
The IDS is trained using a data set that contains candidate item sets and their corresponding frequencies. The process in this module performs the Apriori algorithm to generate association rules. First we define the rule format, then generate the frequent item sets from the candidate item sets by checking them against minimum support and minimum confidence. These frequent item sets are then used to generate the association rules, which are stored in rules.txt. Following is the generation of rules.
_______________________________________________________
Input: ItemSet
_______________________________________________________
(1) for all lk, k ≥ 2 do begin
(2) for all nonempty proper subsets am of lk do begin
(3) conf = support(lk)/support(am);
(4) if conf ≥ minimum confidence then begin
output rule am → (lk − am), with confidence = conf and support = support(lk);
end if
end for
end for
_______________________________________________________
Output: Association rules
_______________________________________________________
Figure 3.5 : Association Rules Algorithm
Figure 3.6: Level-2 DFD of Training Module
The Detection Module:
The level-2 data flow diagram for the detection module is shown in figure 3.8.
The IDS detects intruders in the network data with the help of the association rules defined in the training module. To perform this task we implement the ID3 decision tree algorithm. In this algorithm, a decision tree is constructed from the rules and the packet data. While constructing the decision tree we calculate the entropy of each attribute and measure its information gain (a sketch of these computations follows Figure 3.7). The information regarding intruders is stored in the intrusioninfo.txt file. This information includes connection time, IP address, port, and status. Decision tree programs construct a decision tree T from a set of training cases, as follows.
_______________________________________________________
Input: R: a set of non-target attributes,
C: the target attribute,
S: a training set
_______________________________________________________
Algorithm:
begin
- If S is empty, return a single node with value Failure;
- If S consists of records all with the same value for the target attribute, return a single leaf node with that value;
- If R is empty, then return a single node with the value of the most frequent of the values of the target attribute found in the records of S [in that case there may be errors, i.e., examples that will be improperly classified];
- Let A be the attribute with the largest Gain(A, S) among the attributes in R;
- Let {aj | j = 1, 2, ..., m} be the values of attribute A;
- Let {Sj | j = 1, 2, ..., m} be the subsets of S consisting respectively of the records with value aj for A;
- Return a tree with root labeled A and arcs labeled a1, a2, ..., am going respectively to the trees ID3(R−{A}, C, S1), ID3(R−{A}, C, S2), ..., ID3(R−{A}, C, Sm);
- Recursively apply ID3 to the subsets {Sj | j = 1, 2, ..., m} until they are empty
end
_______________________________________________________
Output: A decision tree
_______________________________________________________
Figure 3.7: ID3 Decision Tree Algorithm
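The entropy and information gain used to choose the attribute A are not spelled out in Figure 3.7. A standard formulation, sketched in Python with our own function names and an invented toy record set:
_______________________________________________________
from collections import Counter
from math import log2

def entropy(labels):
    # H(S) = -sum over classes c of p_c * log2(p_c)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(records, attr, target):
    # Gain(A, S) = H(S) - sum_j (|Sj|/|S|) * H(Sj), splitting S on A
    labels = [r[target] for r in records]
    gain = entropy(labels)
    n = len(records)
    for value in {r[attr] for r in records}:
        subset = [r[target] for r in records if r[attr] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Toy usage: choose the attribute that best separates the classes.
records = [
    {"port": "telnet", "proto": "tcp", "status": "attack"},
    {"port": "smtp",   "proto": "tcp", "status": "normal"},
    {"port": "telnet", "proto": "udp", "status": "attack"},
    {"port": "domain", "proto": "udp", "status": "normal"},
]
best = max(["port", "proto"],
           key=lambda a: information_gain(records, a, "status"))
print(best)   # "port": it separates attack from normal perfectly here
_______________________________________________________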
Figure 3.8: Level-2 DFD of Detection Module
3.3 Data Structure Design
This section introduces the data structure design for input and output data.
The test.txt file contains packet data captured by tcpdump for the training module; the netstat.txt file also contains packet data, but for the detection module. The general data structure design of both files, test.txt and netstat.txt, is shown in the tables below.
For UDP datagram:
Packet Field Name (Timestamp, Packet type, Source address, Source port, Destination address, Destination port , Protocol, Size)
For TCP datagram:
Packet Field Name (Timestamp, Packet type, Source address, Source port, Destination address, Destination port, flag, Sequence number, Contained data up to, Number of user data, Acknowledgement, Window size, Option )
The interm.txt file, used as an intermediate file, has the same data structure design as test.txt and netstat.txt.
The itemset.txt file, which is used by the training module, contains the candidate item sets and their frequencies. The data structure design for this file is shown in the table below.
Candidate item set | Frequency
The rules.txt file, which is an output of the training module and one of the inputs of the detection module, contains the association rules generated by the Apriori algorithm. The data structure design for this file is shown in the table below.
Frequent item set -> frequent item set
The intrusioninfo.txt file, which is generated by the detection module, contains the information about intruders. The data structure design for this file is shown in the table below.
Connection time | IP address | Port | Status
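As an illustration of these record layouts, hypothetical Python dataclasses mirroring one row of rules.txt and of intrusioninfo.txt might look like:
_______________________________________________________
from dataclasses import dataclass

@dataclass
class Rule:
    # One line of rules.txt: antecedent itemset -> consequent itemset.
    antecedent: frozenset
    consequent: frozenset

@dataclass
class IntrusionRecord:
    # One line of intrusioninfo.txt.
    connection_time: str   # e.g. "22:11:58.533049"
    ip_address: str        # e.g. "192.168.3.158"
    port: str              # service name or number, e.g. "netbios-ns"
    status: str            # "Normal" or "Attack"
_______________________________________________________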
3.4 Software Architecture
Our basic idea is to generate association rules from the audit record set. We generate these rules using the Apriori algorithm. Its implementation involves the generation of the frequent itemsets of sizes 1..n, followed by the generation of rules. We intend to process the audit records, which reveal the transactions taking place in the network. The records contain various fields, and to generate frequent itemsets we need to combine the items in the different fields. The rules generator would then use these frequent itemsets to generate the association rules. The rules generator will store the generated association rules in an efficient data structure to facilitate efficient storage and retrieval. The stored association rules will then be used with the test data by the intrusion detector to flag intrusions. The component diagram is shown in figure 3.9.
Figure 3.9: IDS Component Diagram
Input pre-processing takes the test.txt file as input (the file of data that we collect over a long period) and writes the frequent itemsets into itemset.txt (through the Apriori algorithm). This itemset.txt file is then further processed in the training module to generate the association rules, which are written to rules.txt. Once the rules are generated, the actual data to be checked (netstat.txt) is sent as input, together with rules.txt, to the detection module (where the ID3 algorithm is used) to detect intrusions, and the results are written to intrusioninfo.txt. A schematic sketch of this pipeline follows.
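A schematic driver for this pipeline, with the three stages as placeholder functions (the real system implements them in asso.exe and main.exe; everything below is illustrative):
_______________________________________________________
def preprocess(trace_path, itemset_path):
    # Stub: extract candidate item sets and frequencies from a tcpdump trace.
    ...

def train(itemset_path, rules_path, min_sup, min_conf):
    # Stub: run Apriori over the item sets and write association rules.
    ...

def detect(rules_path, traffic_path, report_path):
    # Stub: classify packets with an ID3 tree built from the rules.
    ...

# The file names mirror the component diagram in figure 3.9.
preprocess("test.txt", "itemset.txt")
train("itemset.txt", "rules.txt", min_sup=0.2, min_conf=0.8)
detect("rules.txt", "netstat.txt", "intrusioninfo.txt")
_______________________________________________________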
3.5 Conclusion
This chapter has given a good feel for what it takes to implement the intrusion detection system. With the help of requirement analysis and requirement analysis tools such as data flow diagrams, we have decided the flow of information through the system. We have designed the data structure and the software architecture required for the implementation of the system.
Chapter 4
EXPERIMENTS AND RESULTS
4.1 Introduction
We now present details of the experiments, the results obtained and their interpretation, and demonstrate that intrusion detection is indeed possible as per our proposition. We have tried to include a variety of data corresponding to both the audit (valid) and the test data sets. The records represent different sessions, of varying sizes (i.e., number of transactions) and of different characteristics.
4.2 Environment setup
First we need to take input data to train IDS by following command:
#tcpdump –i eth0 >>test.txt
This command captures packets and stores them in the test.txt file, which is further used to train the IDS using asso.exe. The IDS is trained by asso.exe, which takes test.txt as input and generates itemset.txt and rules.txt:
o Input – test.txt
o Output – itemset.txt and rules.txt
o Intermediate file – interm.txt
In the training process, asso.exe first performs input pre-processing on test.txt and generates itemset.txt, which contains the frequent item sets and their frequencies. After input pre-processing, asso.exe performs the Apriori algorithm on itemset.txt and generates the rules file named rules.txt. Now the IDS is trained, and we can start the detection process. Network data that should be tested for intrusion can be fetched by the following command in a terminal:
#tcpdump –i eth0 >>netstat.txt
This command continuously stores data into netstat.txt until <Ctrl+Z> is pressed. The administrator then runs the IDS via login.exe, the login window through which one enters the IDS Control Center.
4.3 Screenshots
#tcpdump –i eth0 >>test.txt
This command captures packets and stores them in the test.txt file.
Fig-4.1: Test Data (test.txt)
The test.txt file contains all packets received by a network interface in the following format:
For UDP datagrams
14:55:42.124288 IP 192.168.3.17.netbios-dgm > 192.168.3.255.netbios-dgm: NBT
UDP PACKET(138)
Timestamp 14:55:42.124288,
Packet type IP
Source address 192.168.3.17
Source port netbios-dgm
Destination address 192.168.3.255
Destination port netbios-dgm
Protocol UDP PACKET
Size 138
For TCP datagrams
14:55:51.635185 IP 172.16.10.240.squid > 172.16.1.167.58526: P
1096064673:1096064695(22) ack 1112655718 win 7240 <nop,nop,timestamp
155446588 2091229>
Timestamp 14:55:51.635185
Packet type IP
Source address 172.16.10.240
Source port squid
Destination address 172.16.1.167
Destination port 58526
PUSH flag set P
Sequence number 1096064673:
Contained data up to 1096064695
Number of user data bytes (22)
Acknowledgement 1112655718
Window size 7240
Option <nop,nop,timestamp 155446588 2091229>
Fig-4.2: Association screen(Asso.exe)
The IDS is trained by asso.exe, which takes test.txt as input and generates itemset.txt and rules.txt. When the Training button is clicked, the training process starts.
In this process, asso.exe first performs input pre-processing on test.txt and generates itemset.txt, which contains the frequent item sets and their frequencies.
After input pre-processing, asso.exe performs the Apriori algorithm on itemset.txt and generates the rules file named rules.txt.
Fig-4.3: ItemSet Screen (itemset.txt)
itemset.txt contains the frequent items and their frequencies in the following format:
Frequent item 192.168.3.17
Frequencies 41
Fig-4.4:Rule screen(rules.txt)
rules.txt contains association rules in the following format:
172.16.10.240 -> 172.16.1.167
This rule says that if frequent item 172.16.10.240 appears in a packet, then frequent item 172.16.1.167 should also appear.
Now the IDS is trained, and we can start the detection process.
Fig-4.5: Data file Screen (netstat.txt)
The data file screen shows information about the network data. Network data that should be tested for intrusion is fetched by the following command in a terminal:
#tcpdump –i eth0 >>netstat.txt
Fig-4.6: login Screen
Now the administrator runs the IDS via login.exe, the login window through which one enters the IDS Control Center. In the login form the administrator enters the name and password.
Fig-4.7: Ids Control Center (main.exe)
Here the packet monitor contains progress bars for incoming packets, broken down by particular packet types. When the Start button is clicked, the detection process starts. In this process, the data in netstat.txt and rules.txt are put through the ID3 algorithm to generate intrusion information, which is stored in intrusioninfo.txt as shown in Fig-4.8. The rule file and the itemset file can also be viewed through the main form via the buttons provided for each. Clicking the Stop button stops the detection process; clicking the Exit button leaves the IDS Control Center.
Fig-4.8:Intrusion Information (Intrusioninfo.txt)
Intrusioninfo.txt contains information regarding intruders in the following format:
22:11:58.533049 192.168.3.158 netbios-ns Attack
Timestamp 22:11:58.533049
IP address 192.168.3.158
Port netbios-ns
Status Attack
The intrusioninfo.txt file, which is generated by the detection module, contains the information about intruders. This information includes connection time, IP address, port, and status. The following is an extract from the sample rules generated by our rules generator for the valid (normal) audit record set:
service: telnet, bytes(recv): 500 -> local: 12, [confidence = 0.980794, support = 1532.000000]
Here the first rule describes a regular telnet connection through a local (renumbered) host 12, during which 500 bytes of data were received by the local host. The rule can be interpreted as follows: if the service is telnet and 500 bytes are received during the connection, then this transaction takes place at local host 12. Intuitively, this is not a strong rule; however, it corresponds to the normal usage pattern. The fact that the rule is not strong is supported by the low confidence and support values. The other rules can be interpreted similarly. The second rule says that if the service is telnet and 500 bytes are received during the connection, then the data is received from a remote host with IP address 195.32.222.22. It is a stronger rule than the previous one, as can be seen from its confidence and support values. The third rule says that a normal (state = SF, i.e., SYN/FIN) smtp connection receives 1000 bytes of data from the network. Intuitively, this is a strong rule since it describes the characteristics of a normal smtp connection. These are only 3 of the 340-odd rules generated from the 1 MB (17,700 connection records) dataset. The rules describe the transactions taking place in our source network, which represent the normal state of the network transactions.
4.4 Conclusion
In input pre-processing, the network data was pre-processed to generate candidate items to train the training module of the IDS. The training module executes the Apriori algorithm to generate association rules. The output of this module is a profile of rules that depicts the behaviour of the network together with the possible known attacks. The profile of rules, along with the network data set, is fed to the detection module, which performs the ID3 decision tree algorithm to classify incoming packets from the network as either normal or attacks. The output of this module contains the information about intruders such as connection time, IP address, port, and status.
Chapter 5
CONCLUSION AND FUTURE
PERSPECTIVES
5.1 Conclusion
We have proposed a data mining technique for building intrusion detection
models. We demonstrated that association rules from the audit data could be used to
guide audit data gathering and feature selection, the critical steps in building effective
classification models. We incorporated domain knowledge into these basic algorithms
using the axis attribute(s), reference attribute(s), and a level-wise approximate mining
procedure. Our experiments on real world audit data showed that the algorithms are
very effective.
The experiments on network tcpdump data demonstrated the effectiveness of
classification models in detecting anomalies. The accuracy of the detection models
depends on sufficient training data and the right feature set. We suggested that the
association rule could be used to compute the consistent patterns from audit data. These
frequent patterns form an abstract summary of an audit trail, and therefore can be used
to: guide the audit data gathering process; provide help for feature selection; and
discover patterns of intrusions. Intrusion detection systems add an early warning
capability to a company’s defenses, alerting to the type of suspicious activity that
typically occurs before and during an attack.
Since most cannot stop an attack, intrusion detection systems should not be
considered an alternative to traditional good security practices. There is no substitute
for a carefully thought out corporate security policy, backed up by effective security
procedures which are carried out by skilled staff using the necessary tools. Instead,
intrusion detection systems should be viewed as an additional tool in the continuing
battle against hackers and crackers.
There is a large variety of intrusion detection systems available, suitable for almost any circumstance. They range from freeware versions that can be deployed on low-cost PCs, to commercial systems costing many thousands of pounds and requiring the latest and greatest hardware. Some are designed to monitor whole networks, whilst others are deployed on a per-machine basis. They all have their pros and cons, and there is a role for them all. However, our work aims to eliminate, as much as possible,
the manual and ad-hoc elements from the process of building an intrusion detection
system.
5.2 Future Work
The biggest challenge of using data mining approaches in intrusion detection is
that it requires a large amount of audit data in order to compute the profile rule sets.
And the fact that we may need to compute a detection model for each resource in a
target system makes the data mining task daunting. Moreover, this learning (mining)
process is an integral and continuous part of an intrusion detection system because the
rule sets used by the detection module may not be static over a long period of time.
For example, as a new version of system software arrives, we need to update the
“normal” profile rules. Given that data mining is an expensive process (in time and
storage), and real-time detection needs to be lightweight to be practical, we can’t afford
to have a monolithic intrusion detection system.
A system architecture has been proposed, as shown in Figure 5.1, that includes two kinds of intelligent agents: the learning agents and the detection agents. A learning agent, which may reside in a server machine for its computing power, is responsible for computing and maintaining the rule sets for programs and users. It produces both the base detection models and the meta detection models. The task of a learning agent, to compute accurate models from very large amounts of audit data, is an example of the "scale-up" problem in machine learning.
We expect that the research done in agent-based meta-learning systems [14] will contribute significantly to the implementation of the learning agents. Briefly, we are studying how to partition and dispatch data to a host of machines to compute classifiers in parallel, and how to re-import the remotely learned classifiers and combine them into an accurate (final) meta-classifier, a hierarchy of classifiers [12].
A detection agent is generic and extensible. It is equipped with a (learned and
periodically updated) rule set (i.e., a classifier) from the remote learning agent. Its
detection engine “executes” the classifier on the input audit data, and outputs evidence
of intrusions. The main difference between a base detection agent and the meta
detection agent is that the former uses preprocessed audit data as input, while the latter uses
the evidence from all the base detection agents. The base detection agents and the meta
detection agent need not be running on the same host. For example, in a network
environment, a meta agent can combine reports from (base) detection agents running on
each host, and make the final assertion on the state of the network.
Figure 5.1: Architecture for agent-based IDS
The main advantages of such a system architecture are:
It is easy to construct an intrusion detection system as a compositional hierarchy of generic detection agents. The detection agents are lightweight, since they can function independently of the heavyweight learning agents, in time and locale, so long as they are already equipped with the rule sets. A detection agent can report new instances of intrusions by transmitting the audit records to the learning agent, which can in turn compute an updated classifier to detect such intrusions and dispatch it to all detection agents. Interestingly, the capability to derive and disseminate anti-virus codes faster than a virus can spread is also considered a key requirement for anti-virus systems.
REFERENCES
[1] W. Lee, S. J. Stolfo, and K. W. Mok, "A Data Mining Framework for Adaptive Intrusion Detection", DARPA, 1998.
[2] H. Debar, M. Dacier, and A. Wespi, "Towards a taxonomy of intrusion-detection systems", Computer Networks, 31, 1999, pp. 805-822.
[3] Adhitya Chittur, "Model Generation for an Intrusion Detection System Using Genetic Algorithms", 2001.
[4] R. P. Lippmann and R. K. Cunningham, "Improving Intrusion Detection Performance Using Keyword Selection and Neural Networks".
[5] J. E. Dickerson and J. A. Dickerson, "Fuzzy network profiling for intrusion detection", IEEE, 2000.
[6] C. Warrender, S. Forrest, and B. Pearlmutter, "Detecting Intrusions Using System Calls: Alternative Data Models", IEEE Symposium on Security and Privacy, IEEE Computer Society, 1999, pp. 133-145.
[7] W. Lee, S. J. Stolfo, and K. W. Mok, "Adaptive Intrusion Detection: a Data Mining Approach", Artificial Intelligence Review, 14(6), December 2000, pp. 533-567, http://www.cc.gatech.edu/~wenke/papers/ai_review.ps
[8] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases", In Proceedings of the ACM SIGMOD Conference on Management of Data, 1993, pp. 207-216.
[9] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules", In Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.
[10] Wenke Lee and Salvatore J. Stolfo, "Algorithms for Mining System Audit Data", In Proceedings of the 7th USENIX Security Symposium, San Antonio, TX, January 2000.
[11] J. Marin, D. Ragsdale, and J. Surdu, "A Hybrid Approach to the Profile Creation and Intrusion Detection", Proceedings of the DARPA Information Survivability Conference and Exposition (DISCEX 2001), June 2001, http://www.itoc.usma.edu/Documents/Hybrid_DISCEX_AcceptedCopy.pdf
[12] W. Fan, M. Miller, S. Stolfo, W. Lee, and P. Chan, "Using Artificial Anomalies to Detect Unknown and Known Network Intrusions", In Proceedings of the First IEEE International Conference on Data Mining, San Jose, CA, November 2001, http://www.cc.gatech.edu/~wenke/papers/artificial_anomalies.ps
[13] W. Lee et al., "A data mining and CIDF based approach for detecting novel and distributed intrusions", Recent Advances in Intrusion Detection, Third International Workshop, RAID 2000, Toulouse, France, October 2-4, 2000, Proceedings, Lecture Notes in Computer Science 1907, Springer, 2000, pp. 49-65, http://www.cc.gatech.edu/~wenke/papers/lee_raid_00.ps
[14] T. Bass, "Intrusion Detection Systems & Multisensor Data Fusion: Creating Cyberspace Situational Awareness", Communications of the ACM, Vol. 43, No. 1, January 2000, pp. 99-105, http://www.silkroad.com/papers/acm.fusion.ids.ps
[15] S. Manganaris, M. Christensen, D. Zerkle, and K. Hermiz, "A data mining analysis of RTID alarms", Computer Networks, 34, 2000, pp. 571-577.
[16] V. Jacobson, C. Leres, and S. McCanne, "tcpdump", available via anonymous ftp from ftp.ee.lbl.gov, June 1989.
[17] W. Lee, S. J. Stolfo, and K. W. Mok, "Adaptive intrusion detection: a data mining approach", Artificial Intelligence Review, 1999.
[18] S. Axelsson, "Intrusion Detection Systems: A Taxonomy and Survey", Technical Report No. 99-15, Dept. of Computer Engineering, Chalmers University of Technology, Sweden, March 2000, http://www.ce.chalmers.se/staff/sax/taxonomy.ps
[19] D. Elson, "Intrusion Detection, Theory and Practice", March 27, 2000, http://online.securityfocus.com/infocus/1203
[20] K. K. Frederick, "Network Intrusion Detection Signatures", December 19, 2001, http://online.securityfocus.com/infocus/1524
[21] NSS Group, "Intrusion Detection Systems (IDS), Group Test (Edition 3)", July 2002, http://www.nss.co.uk/ids/edition3/index.htm
[22] A. K. Jones and R. S. Sielken, "Computer system intrusion detection: a survey", February 9, 2000, http://www.cs.virginia.edu/~jones/IDS-research/Documents/jones-sielken-survey-v11.pdf
[23] S. Kumar, "Classification and detection of computer intrusions", Ph.D. Thesis, Purdue University, 1995, http://ftp.cerias.purdue.edu/pub/papers/sandeep-kumar/kumar-intdet-phddiss.pdf
[24] Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2006.
[25] B. A. Forouzan, "TCP/IP Protocol Suite", Tata McGraw-Hill, Third Edition, 2006.
[26] A. S. Tanenbaum, "Computer Networks", Pearson Education, Third Edition, 2006.