Chapter 1
INTRODUCTION
1.1 Introduction
The security of computers and network systems has become increasingly
important as more computers are connected to each other and more applications are
implemented in this "virtual" world. Applications such as electronic commerce and
online banking require a strict sense of security. While security measures may not be able to
stop every kind of threat and attack, there is a need for a way to record the events of any
attack that does happen. This record of attacks can be used as a tool to strengthen
security and also as a forensic tool providing evidence of crime.
The tool for this purpose is known as an Intrusion Detection System (IDS). As the
name suggests, an IDS is built to detect intruders in the system. Knowledge of the
attacks taking place gives a better way of countering them. Intrusion
prevention techniques, such as user authentication (e.g. using passwords or biometrics),
avoiding programming errors, and information protection (e.g., encryption) have been
used to protect computer systems as a first line of defense. Intrusion prevention alone is
not sufficient because as systems become ever more complex, there are always
exploitable weaknesses in the systems due to design and programming errors, or
various socially engineered penetration techniques. Intrusion detection is therefore
needed as another wall to protect computer systems.
An intrusion into a computer system can be compared to a physical intrusion
into a building by a thief. It is an entity gaining unauthorized access to resources. The
unauthorized access is intended to steal or change information or to disrupt the valid
use of the resource by an authorized user. Intrusion detection is the ability to determine
that an intruder has gained, or is attempting to gain unauthorized access. An intrusion-
detection system is a tool used to make this determination. The goal of any intrusion-
detection system is to alert an authority of unauthorized access before the intruders can
cause any damage or take any information, much like a burglar alarm system in a
building. However, a digital computer system is far more vulnerable than a building
and much harder to protect. The intruder can be hundreds of miles away when the
attack is initiated, leaving behind very little evidence.
There are some basic definitions of the terms used in this system:
Security: Security consists of mechanisms for providing confidentiality, integrity,
and availability. Confidentiality means that only the individuals allowed access to
particular information should be able to access that information. Integrity refers to
those controls that prevent information from being altered in any unauthorized
manner. Availability controls are those that prevent the proper functioning of
computer systems from being interfered with.
Threat: A threat is any situation or event that has potential to harm a system.
Threats may be external or internal. Threats from users consist of masqueraders
(those who use credentials of others) and clandestine users (those who avoid
auditing and detection). Misfeasors are legitimate users who exceed their privileges.
Attack: An intentional attempt to bypass computer security measures in some
fashion.
Intrusion: A successful attack. An intrusion can be defined as any set of actions
that attempt to compromise the confidentiality, integrity or availability of a
resource.
Signature: A pattern that can be matched to identify a particular type of activity.
Detection rules: A rule typically consists of a signature and associated contextual
and response information.
1.2 Motivation
Our objective is to eliminate, as much as possible, the manual and ad-hoc
elements from the process of building an intrusion detection system. We take a data-
centric point of view and consider intrusion detection as a data analysis process.
Anomaly detection is about finding the normal usage patterns from the audit data,
whereas misuse detection is about encoding and matching the intrusion patterns using
the audit data. The central theme of our approach is to apply data mining techniques to
intrusion detection. Data Mining generally refers to the process of (automatically)
extracting models from large stores of data. The recent rapid development in data
mining has made available a wide variety of algorithms, drawn from the fields of
statistics, pattern recognition, machine learning, and databases. Several types of
algorithms are particularly relevant to our research.
1.3 Problem Definition
In this research work, we describe a data mining framework for adaptively
building Intrusion Detection (ID) models. Data mining refers to Knowledge Discovery
from Data (KDD): knowledge mining from data, that is, knowledge
extraction, data analysis, or pattern analysis. Data mining can be applied to any kind
of data repository. It makes it possible to find patterns for further reference, to perform
aggregation operations, and to compute interestingness measures. Through data mining, large data tombs
are turned into golden nuggets of knowledge.
The central idea is to utilize auditing programs to extract an extensive set of
features that describe each network connection or host session, and apply data mining
programs to learn rules that accurately capture the behavior of intrusions and normal
activities. These nuggets, or rules, are then used for misuse detection and anomaly detection.
The network-based approach relies on tcpdump data as input, which
gives per-packet information. This data is pre-processed by grouping records
according to their protocols and extracting features that are useful for training
the intrusion detection system.
After the features are extracted from the tcpdump data, a data mining algorithm is
run on it. Here the task is to derive association rules: the algorithm
takes the data, with its extensive set of features, and produces rules. The Apriori
association algorithm is chosen for generating the rules.
The rules generated by this algorithm are then used to detect intruders.
To perform this task the ID3 algorithm is used, which takes network data as input and
compares it with the association rules. The algorithm labels each event as either normal
or an attack.
1.4 The Challenges
Formulating the classification tasks, i.e., determining the class labels and the
set of features, from audit data is a very difficult and time-consuming task. Since
security is usually an afterthought of computer system design, there are no standard
auditing mechanisms or data formats designed specifically for intrusion analysis purposes.
A considerable amount of data pre-processing, which involves domain knowledge, is
required to convert raw "action"-level audit data into higher-level "session/event"
records with the set of intrinsic system features. Figure 1.1 shows an example of audit
data preprocessing.
Here, binary tcpdump data is first converted into ASCII packet level data, where
each line contains the information of one network packet. The data is ordered by the
timestamps of the packets. Therefore, packets belonging to different connections may
be interleaved. For example, the three packets shown in Figure 1.1 [16] are from
different connections. The packet data is then processed into connection records with a
number of features (i.e., attributes), e.g., time (the starting time of the connection, i.e.,
the timestamp of its first packet), dur (the duration of the connection), src and dst
(the source and destination hosts), bytes (the number of data bytes from source to destination),
srv (the service, i.e., port, on the destination), and flag (how the connection conforms to
the network protocols, e.g., SF is normal, REJ is "rejected"), etc.
Figure 1.1: Generation of audit data
These intrinsic features essentially summarize the packet level information within a
connection. There are commonly available programs that can process packet level data
into such connection records for network traffic analysis tasks. However, for intrusion
detection, the temporal and statistical characteristics of connections also need to be
considered because of the temporal nature of event sequences in network-based
computer systems.
For example, a large number of “rejected" connections, i.e., flag = REJ, within a short
time frame can be a strong indication of intrusions, because normal connections are
rejected rarely.
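To make the idea concrete, here is a minimal, self-contained sketch (illustrative Python, not the system's code; the record fields and the 2-second window are assumptions) that derives such a temporal feature by counting REJ connections inside a sliding time window:
_______________________________________________________
WINDOW = 2.0  # seconds; an assumed window size, for illustration only

def rej_counts(connections, window=WINDOW):
    """For each connection (sorted by start time), count how many
    connections within the preceding `window` seconds were rejected."""
    counts, start = [], 0
    for i, conn in enumerate(connections):
        # slide the left edge of the window forward
        while connections[start]["time"] < conn["time"] - window:
            start += 1
        counts.append(sum(1 for c in connections[start:i + 1]
                          if c["flag"] == "REJ"))
    return counts

records = [
    {"time": 0.1, "srv": "http",   "flag": "SF"},
    {"time": 0.3, "srv": "telnet", "flag": "REJ"},
    {"time": 0.5, "srv": "telnet", "flag": "REJ"},
    {"time": 3.0, "srv": "http",   "flag": "SF"},
]
print(rej_counts(records))  # [0, 1, 2, 0] -- a burst of REJs stands out
_______________________________________________________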
A critical requirement for using classification rules as an anomaly detector is
that we need to have “sufficient" training data that covers as much variation of the
normal behavior as possible, so that the false positive rate is kept low (i.e., we wish to
minimize detected “abnormal normal" behavior). With limited (“insufficient")
training data, we cannot simply formulate a classification model for the anomaly
detector and then incrementally update the classifier using on-line learning
algorithms, because the limited training data may not have covered all the class
labels, and on-line learning algorithms typically cannot introduce new class labels.
For example, in modeling daily network traffic, we use
the services, e.g., http, telnet etc., of the connections as the class labels in training
models. We may not have connection records of the infrequently used services with,
say, only one week's traffic data.
A formal audit data gathering process therefore needs to take place first. As we
collect audit data, we need an indicator that can tell us whether the new audit data
exhibits any “new" normal behavior, so that we can stop the process when there is no
more variation. This indicator should be simple to compute and must be incrementally
updated.
1.5 Conclusion
This chapter describes the background and motivation of the system. It also
introduces data mining and the need for it, and states the problem definition of the
research.
Chapter 2
LITERATURE SURVEY
2.1 Introduction
Our objective is to develop general rather than intrusion-specific tools in
response to the challenges discussed in the previous section. The idea is to first
compute the association rules from audit data, which (intuitively) capture the intra-
(temporal) audit record patterns.
These patterns are then utilized, with user participation, to guide the data
gathering and feature selection processes. Here we use the term “audit data” to refer to
general data streams that can be processed for detection purposes. Examples of such
data streams are the connection records extracted from the raw tcpdump output, and the
Web site visit records processed using the Web site logs.
We assume that audit data records are time stamped. As described in [10], the
main challenge in developing these data mining algorithms is to provide support
mechanisms for domain knowledge so that “useful” patterns are computed. We next
describe these basic data mining algorithms and our proposed extensions that allow the
introduction of domain knowledge in a convenient manner.
2.2 Intrusion Detection System
Intrusion detection is the process of monitoring the events occurring in a
computer system or network and analyzing them for signs of intrusions, defined as
attempts to compromise the confidentiality, integrity, availability, or to bypass the
security mechanisms of a computer or network. Intrusions are caused by attackers
accessing the systems from the Internet, authorized users of the systems who attempt to
gain additional privileges for which they are not authorized, and authorized users who
misuse the privileges given them. Intrusion Detection Systems (IDSs) are software or
hardware products that automate this monitoring and analysis process.
Intrusion detection allows organizations to protect their systems from the threats
that come with increasing network connectivity and reliance on information systems.
Given the level and nature of modern network security threats, the question for security
professionals should not be whether to use intrusion detection, but which intrusion
detection features and capabilities to use.
IDSs have gained acceptance as a necessary addition to every organization’s
security infrastructure. Despite the documented contributions intrusion detection
technologies make to system security, in many organizations one must still justify the
acquisition of IDSs.
There are several compelling reasons to acquire and use IDSs:
To prevent problem behaviors by increasing the perceived risk of discovery and
punishment for those who would attack or otherwise abuse the system,
To detect attacks and other security violations that are not prevented by other
security measures,
To detect and deal with the preambles to attacks (commonly experienced as
network probes and other “doorknob rattling” activities),
To document the existing threat to an organization
To act as quality control for security design and administration, especially of large
and complex enterprises
To provide useful information about intrusions that do take place, allowing
improved diagnosis, recovery, and correction of causative factors.
2.2.1 Intrusion Detection System Taxonomy
There are several attributes by which all intrusion detection technology can be
classified. These attributes are information sources, type of analysis, timing,
architecture, and activeness. They can be used to compare and categorize specific
intrusion detection solutions. Figure 2.1 illustrates the classification of
intrusion detection systems [18].
Figure 2.1 Classification of Intrusion Detection System
The main distinction among IDSs, on the basis of their source of information, is
between network-based and host-based intrusion detection systems.
( I ) Host-based intrusion detection systems (HIDS):
These are concerned with what is happening on each individual computer or
host. Operating system and computer system details, such as memory and processor
use, user activities, and applications running, are examined for indications of misuse.
They are able to detect such things as repeated failed access attempts or changes to
critical system files. The HIDS reside on a particular computer and provide protection
for a specific computer system[19].
The advantages of HIDS are:
More detailed logging: Because host-based intrusion detection runs on the
monitored host, it can collect much more detailed information regarding exactly
what occurs during the course of an attack.
Increased Recovery: Because of increased granularity of tracking events in the
monitored system, recovery from a successful incident is usually more complete.
Detects unknown attacks: Host-based intrusion detection is better at detecting
unknown attacks that affect the monitored host than network-based intrusion
detection is.
Fewer false positives: A side effect of the way host-based intrusion detection
works is that it produces substantially fewer false alerts than network-based
IDS.
HIDS products such as Snort, Dragon Squire, Emerald eXpert-BSM, NFR HID,
and Intruder Alert all perform this type of monitoring.
( II ) Network-based intrusion detection systems (NIDS):
These examine the individual packets flowing through a network. Unlike
firewalls, which typically only look at IP addresses, ports and ICMP types, network
based intrusion detection systems are able to understand all the different flags and
options that can exist within a network packet. A NIDS can therefore detect maliciously
crafted packets that are designed to be overlooked by a firewall's relatively simplistic
filtering rules. Hackers often craft such traffic in order to map out a network, as a form
of pre-attack reconnaissance[19].
Network-based intrusion detection is more popular than host-based intrusion
detection for several reasons:
Ease of deployment: Network-based IDS listens to activity on the network and
analyzes it. This model results in few performance or compatibility issues in the
monitored environment.
Cost: A handful of strategically placed sensors can be used to monitor a large
organizational environment. Host-based IDS requires software on each monitored
host.
Range of detection: The variety of malicious activities able to be detected through
the analysis of network traffic is wider than the variety able to be detected in host-
based IDS.
Forensics Integrity: If a host using host-based IDS is compromised, then all of the
intrusion detection activity logs become suspect because the attacker most likely
gained the ability to modify information on the host.
Detects all attempts, even failed ones: Network-based IDS analyzes activity
regardless of whether the activity is successful or unsuccessful attack. Host-based
IDS generally only detect successful attacks because most unsuccessful attacks
don’t affect the monitored host directly.
The network-based intrusion detection products include: Cisco Secure IDS
(formerly NetRanger), Hogwash, Dragon, and E-Trust IDS.
2.2.1.1 Analysis Strategy
Intrusion detection systems must be capable of distinguishing between normal
(not security-critical) and abnormal user activities, to discover malicious attempts in
time. However, translating user behaviors (or a complete user-system session) into a
consistent security-related decision is often not simple; many behavior patterns are
unpredictable and unclear [18].
( I ) Misuse detection system:
It is based on extensive knowledge of patterns associated with known attacks
provided by human experts. It attempts to recognize attacks that follow intrusion
patterns that have been recognized and reported by experts. Existing approaches to
implement misuse detection systems are signature matching, expert systems, state
transition analysis and heuristic approach. Typical structure of misuse detection system
is shown in Figure 2.2.
Misuse detection systems are vulnerable to intruders who use new patterns of
behavior or who mask their illegal behavior to deceive the detection system.
Figure 2.2 Misuse Detection System
( II ) Anomaly detection system:
Anomaly detection methods were developed to counter the problem of misuse
detection systems. With the anomaly detection approach, one represents patterns of
normal behavior, with the assumption that an intrusion can be identified based on some
deviation from this normal behavior. When such a deviation is observed, an intrusion
alarm is produced. The major benefit of anomaly detection system is that it is able to
recognize unforeseen attacks. But its major limitation is high false alarm rate, since
detected deviations do not necessarily represent actual attacks[18].
Figure 2.3 shows typical structure of anomaly detection system. The sensor is a
network interface which collects the packets. The activity normalizer performs analysis
of the data. Note the two-way interchange between the activity normalizer and the
“normal” activity database. The activity normalizer must constantly adjust the baseline
of normal activity to reflect the dynamic nature of the monitored computer systems and
network.
Figure 2.3 Anomaly Detection System
The common approaches used to develop anomaly detection systems are
statistical methods, expert systems, neural networks, data mining, and outlier detection
schemes.
2.2.1.2 Time Aspects
Detection systems work in either real-time or at given intervals. At first glance,
real time systems are more desirable, but certain types of activities can be detected over
larger ranges of time. Given the amount of analysis being performed and the amount of
data that most real-time systems must handle, real-time systems have practical
limitations on the size of the window of time that can be examined. Most commercial
products offering real-time analysis are limited to a 5 to 15 minute window of time.
Off-line analysis examines the data once information about the sessions has already been
collected. It is most useful for understanding attackers’ behavior. Most
commercial products today recognize this and use a combination of both types of
timing for the best effect.
2.2.1.3 Architecture
In the case of a centralized IDS, data analysis is performed at a fixed number of
locations, independent of how many hosts are being monitored.
In a distributed IDS, data analysis is performed at a number of locations proportional
to the number of hosts that are being monitored. The distributed intrusion detection
system is necessary for detection of distributed/coordinated attacks targeted at
multiple networks/machines.
Some of the products use combination of both of these architectures.
2.2.1.4 Activeness
The other method of categorizing intrusion detection systems is by their passive
or reactive nature. In a passive system, the IDS sensor detects a potential security
breach, logs the information and signals an alert on the console i.e. no countermeasure
is actively applied to thwart the attack. While in a reactive system, the IDS responds to
the suspicious activity by logging off a user or by reprogramming the firewall to block
network traffic from the suspected malicious source, either autonomously or at the
command of an operator.
2.2.2 Data Processing techniques used in Intrusion Detection
Systems
Depending on the type of approach taken in intrusion detection, various
processing mechanisms (techniques) are employed on the data that reaches the IDS. Below,
several such techniques are described briefly:
Expert systems: These work on a previously defined set of rules describing an
attack. All security-related events incorporated in an audit trail are translated into
if-then-else rules. Examples are Wisdom & Sense and ComputerWatch
(developed at AT&T)[24].
Signature analysis: Like the expert system approach, this method is based on
attack knowledge. It transforms the semantic description of an attack into the
appropriate audit trail format. Thus, attack signatures can be found in logs or input
data streams in a straightforward way. Detection is accomplished by using common
text string matching mechanisms. Typically, it is a very powerful technique and as
such very often employed in commercial systems[20].
Colored Petri nets: The Colored Petri Nets approach is often used to generalize
attacks from expert knowledge bases and to represent attacks graphically. With this
technique, it is easy for system administrators to add new signatures to the system.
However, matching a complex signature to the audit trail data may be time-
consuming. The technique is not used in commercial systems [3].
State-transition: Here, an attack is described with a set of goals and transitions
that must be achieved by an intruder to compromise a system. Transitions are
represented on state-transition diagrams.
Statistical analysis approach: This is a frequently used method (for example
SECURENET). The user or system behavior (set of attributes) is measured by a
number of variables over time. Examples of such variables are: user login, logout,
number of files accessed in a period of time, usage of disk space, memory, CPU etc.
Neural networks: Neural networks use their learning algorithms to learn about
the relationship between input and output vectors and to generalize them to extract
new input/output relationships. With the neural network approach to intrusion
detection, the main purpose is to learn the behavior of actors in the system (e.g.,
users, daemons) [4].
User intention identification: This technique models normal behavior of
users by the set of high-level tasks they have to perform on the system (in relation
to the users’ functions). These tasks are taken as a series of actions, which in turn are
matched to the appropriate audit data. The analyzer keeps a set of tasks that are
acceptable for each user. Whenever a mismatch is encountered, an alarm is
produced.
Computer immunology: Analogies with immunology have led to the
development of a technique that constructs a model of the normal behavior of UNIX
network services, rather than that of individual users. This model consists of short
sequences of system calls made by the processes. Attacks that exploit flaws in the
application code are very likely to take unusual execution paths. First, a set of
reference audit data is collected which represents the appropriate behavior of
services, and then all the known “good” sequences of system calls are added to the
knowledge base. These patterns are then used for continuous monitoring
of system calls, to check whether the sequence generated is listed in the knowledge
base; if not, an alarm is generated. This technique has a potentially very low false
alarm rate provided that the knowledge base is fairly complete. Its drawback is the
inability to detect errors in the configuration of network services. Whenever an
attacker uses legitimate actions on the system to gain unauthorized access, no alarm
is generated.
Machine learning: This is an artificial intelligence technique that stores the
user-input stream of commands in vectorial form and uses it as a reference profile of
normal user behavior. Profiles are then grouped in a library of user
commands having certain common characteristics [5].
Data mining: This refers to a set of techniques for extracting
previously unknown but potentially useful information from large stores of data. A typical
data mining task is finding association rules. It allows one to
extract previously unknown knowledge about new attacks or to build models of normal behavior
patterns. Anomaly detection often generates false alarms; with data mining it is
easy to correlate data related to alarms with mined audit data, thereby considerably
reducing the rate of false alarms. Data mining is also referred to as Knowledge Discovery
from Data (KDD): knowledge mining from data, that is, knowledge
extraction, data analysis, or pattern analysis. Data mining is applicable to
any kind of data repository. Its aim is to find patterns, to present the
knowledge in an integrated form, and to remove noise and unnecessary data from the
data source. Through data mining, it is possible to find patterns for further
reference, to perform aggregation operations, and to compute interestingness measures. We can view
data mining as an evolution of information technology: the widening gap
between data and information drove the development of data mining, which turns large data
repositories, or data tombs, into golden nuggets of knowledge, as huge
databases are mined by tools to gain meaningful knowledge. Data mining is the
process of automatically searching large volumes of data for patterns. Data mining
is a fairly recent and contemporary topic in computing. However, data mining
applies many older computational techniques from statistics, machine learning, and
pattern recognition. Data mining can be defined as the nontrivial extraction of
implicit, previously unknown, and potentially useful information from data. It is the
science of extracting useful information from large data sets or databases [6,24].
2.2.3 Requirements for Intrusion Detection System
The basic requirements for a good intrusion detection system are:
A system must recognize any suspect activity or triggering event that could
potentially be an attack.
Escalating behavior on the part of an intruder should be detected at the lowest level
possible.
Components on various hosts must communicate with each other regarding level of
alert and intrusions detected.
The system must respond appropriately to changing levels of alertness.
The detection system must have some manual control mechanisms to allow
administrators to control various functions and alert levels of the system.
The system must be able to adapt to changing methods of attack.
The system must be able to handle multiple concurrent attacks.
The system must be scalable and easily expandable as the network changes.
The system must be resistant to compromise, able to protect itself from intrusion.
The system must be efficient and reliable.
2.2.4 Shortfalls of current IDS
Despite 20 years of research, intrusion detection technology has quite a way to
go to achieve a perfect solution. There are still many challenges to achieve effective
intrusion detection, as discussed below:
Alert handling: Until an intrusion detection system is properly tuned to a specific
environment, there can be literally thousands of alerts generated on a daily basis.
The expertise and manpower required to handle alerts can be quite daunting.
Variants: As stated previously, signatures are developed in response to new
vulnerabilities or exploits that have been posted or released. Integral to the success
of a signature is that it be unique enough to alert only on malicious traffic and rarely
on valid network traffic. The difficulty here is that exploit code can often be easily
changed. It is not uncommon for an exploit tool to be released and then have its
defaults changed shortly thereafter by the hacker community.
False positives: A common complaint is the amount of false positives IDS will
generate. Developing unique signatures is a difficult task, and oftentimes
vendors will err on the side of alerting too often rather than not enough. This is
analogous to the story of the boy who cried wolf. It is much more difficult to pick
out a valid intrusion attempt if a signature also alerts regularly on valid network
activity. A difficult problem that arises from this is how much can be filtered out
without potentially missing an attack.
False negatives: This leads to the other concept of false negatives where an IDS
does not generate an alert when an intrusion is actually taking place. Simply put, if a
signature has not been written for a particular exploit, there is an extremely good
chance that the IDS will not detect it.
Evasion: An increasing number of attackers understand the shortcomings of
some intrusion detection technologies, such as signature-based IDS. As
attackers understand the weakness, their attacks are designed to bypass detection.
Architectural issues: Technology such as switches, Gigabit Ethernet, and
encryption make network-based intrusion detection much more challenging.
Data overload: Another aspect, which is extremely important, is how much data
an analyst can effectively and efficiently analyze. That being said, the amount of
data one needs to look at seems to be growing rapidly. Depending on the intrusion
detection tools employed by a company and its size, there is the possibility for logs
to reach millions of records per day.
2.3 Goals of Data Mining
Data mining is typically carried out with some end goals or applications.
Broadly speaking, these goals fall into the following classes: prediction, identification,
classification, and optimization [9]:
Prediction: Data mining can show how certain attributes within data will behave
in the future. Examples of predictive data mining include the analysis of buying
transactions to predict what consumers will buy under certain discounts, how much
sales volume a store would generate in a given period, and whether deleting a product
line would yield more profits. In such applications, business logic is coupled
with data mining.
Identification: Data patterns can be used to identify the existence of an item, an
event, or an activity. For example, in biological applications, existence of a gene
may be identified by certain sequences of nucleotide symbols in the DNA sequence.
Classification: Data mining can partition the data so that different classes or
categories can be identified based on combinations of parameters. For example,
customers in a supermarket can be categorized into discount-seeking shoppers,
shoppers in rush, loyal regular shoppers, shoppers attached to name brands, and
infrequent shoppers. Sometimes classification based on common domain
knowledge is used as an input to decompose the mining problem and make it
simpler.
Optimization: One eventual goal of data mining may be to optimize the use of
limited resources such as time, space, money, or materials and to maximize output
variables such as sales or profit under a given set of constraints. As such, this goal
of data mining resembles the objective function used in operations research problems
that deal with optimization under constraints.
2.3.1 Types of Knowledge Discovered During Data Mining
The term “knowledge” is very broadly interpreted as involving some degree of
intelligence. There is progression from raw data to information to knowledge as we go
through additional processing. Knowledge is often classified as inductive versus
deductive. Deductive knowledge deduces new information based on applying pre-
specified logical rules of deduction on the given data. Data mining addresses inductive
knowledge, which discovers new rules and patterns from supplied data.
It is common to describe the knowledge discovered during data mining in five
ways, as follows[24]:
Association rules: These find associations between attributes, i.e., how
one attribute acts on or relates to another. Such rules correlate the presence of a set
of items with a range of values for another set of variables [8].
Example: buys(X, “handbag”) → buys(X, “shoes”). When a female retail
shopper buys a handbag, she is likely to buy shoes.
Classification hierarchies: This is the procedure of describing and distinguishing
concepts so as to predict unknown classes from training data. The goal is
to work from an existing set of events or transactions to create a hierarchy of
classes.
Example: A population may be divided into five ranges of credit worthiness based on a
history of previous credit transactions.
Sequential patterns: These find patterns that occur frequently in the data; the
data is analyzed to find which patterns recur, and a sequence of actions or events
is sought [9].
Example: If a patient underwent cardiac bypass surgery for blocked arteries and later
developed high blood urea within a year of surgery, he or she is likely to suffer from
kidney failure within the next 18 months.
Detection of events is equivalent to detecting associations among events with certain
temporal relationships.
Patterns within time series: Similarities can be detected within positions of a
time series of data, which is a sequence of data taken at regular intervals, such as daily
sales [9].
Example: Two products show the same selling pattern in summer but a different one in
winter.
Clustering: This analyzes data with no known category or class label. The data is
partitioned such that items within the same cluster are similar to one another, beyond
any base class they belong to; every cluster is a collection of objects that share
common properties. A given population of events or items can thus be partitioned into sets of
“similar” elements [10].
Example: An entire population of patients treated for a disease may be divided into groups
based on the similarity of side effects produced.
A typical data mining technique is associated with finding association rules.
Association rules are used to gather necessary knowledge about the nature of audit data, on
the assumption that discovering patterns within the individual records of a trace can help
specify the correlations among different features.
2.3.2 Association Rules
Association rules were originally developed as a tool for analysis of retail sales.
A piece of sales data usually includes information about a transaction, such as
transaction date and items purchased. Association rules can be used to find the
correlation among different items in a transaction. For example, when a customer buys
item A, item B will also be purchased by the customer with the probability of 90%.
Agrawal and Srikant [11] have presented some fast algorithms to mine association
rules, including the algorithm Apriori. Using the notation of Agrawal and Srikant, let D =
{T1, T2, …, Tn} be the transaction database with n transactions in total and I = {i1, i2,
…, im} be the set of all the items, where each ij (1 ≤ j ≤ m) represents one kind of
item. Each transaction Tl (1 ≤ l ≤ n) in D records the items purchased, i.e., Tl ⊆ I.
Define an itemset as a nonempty subset of I. An association rule has the
form X → Y, with confidence c and support s, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅, i.e., X and Y are disjoint
itemsets. Here s represents the support of the association rule and c represents its
confidence.
Assume the number of transactions that contain both the itemset X and the
itemset Y is n′; then s = support(X ∪ Y) = n′/n, and c = support(X ∪ Y)/support(X).
Intuitively, support(X) can be viewed as the occurrence frequency of the itemset X in
the whole transaction database D, while c indicates that when X is satisfied, there is
a certainty of c that Y is also true. Two thresholds, minconfidence (the
minimum confidence) and minsupport (the minimum support), are used by the
mining algorithm to find all association rules X → Y such that c ≥ minconfidence
and s ≥ minsupport. An itemset X is called a large itemset if support(X) ≥
minsupport.
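As a quick illustration of these definitions, the following self-contained Python fragment (the five-transaction database is invented for this example) computes s and c for a rule X → Y:
_______________________________________________________
# Invented five-transaction database, for illustration only
D = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b"}]

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

X, Y = {"a"}, {"b"}
s = support(X | Y, D)                   # n'/n = 3/5 = 0.60
c = support(X | Y, D) / support(X, D)   # 0.60 / 0.80 = 0.75
print(f"a -> b holds with support {s:.2f} and confidence {c:.2f}")
_______________________________________________________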
2.3.3 The Apriori Algorithm
The basic Apriori algorithm finds frequent itemsets for Boolean association
rules, receiving as input a database T of transactions and the minimum support for the
rules. It uses the Apriori property: if an itemset I is not frequent, then the itemset I ∪ A
(where A is any other item) is also not frequent; i.e., “all nonempty subsets of a frequent itemset
must also be frequent”.
The Apriori algorithm [12] builds a set Ck (candidate itemsets of size k) and Lk
(frequent itemsets of size k) to create frequent itemsets of size k+1:
_____________________________________________________
Input: a database T of transactions and the minimum support for the rules
_______________________________________________________
Algorithm:
L1 = {frequent 1-itemsets};
k = 2;
while (Lk-1 ≠ ∅) {
    Ck = GenerateCandidates(Lk-1);
    for each transaction t in the database T
        increment the count of all candidates in Ck that are contained in t;
    Lk = candidates in Ck with enough support;
    k++;
}
return L = L1 ∪ L2 ∪ …;
_______________________________________________________
Output: frequent itemsets for Boolean association rules
_______________________________________________________
Figure 2.4: The Apriori algorithm
GenerateCandidates() returns a subset of the join of Lk-1 with itself,
pruning itemsets that do not satisfy the Apriori property. Computing the support and
confidence of all nonempty subsets of each frequent itemset then generates the set of
association rules.
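For concreteness, the loop of Figure 2.4 can be rendered in runnable form; the sketch below is one possible reading, not the thesis implementation, and it represents itemsets as frozensets and takes minsup as an absolute count:
_______________________________________________________
from itertools import combinations

def apriori(transactions, minsup):
    """Frequent itemsets of `transactions` (a list of frozensets);
    minsup is an absolute support count."""
    items = {i for t in transactions for i in t}

    def frequent(cands):
        return {c for c in cands
                if sum(1 for t in transactions if c <= t) >= minsup}

    L = [frequent({frozenset([i]) for i in items})]       # L1
    while L[-1]:
        prev = L[-1]
        # join: union pairs of frequent (k-1)-itemsets into k-itemsets ...
        cands = {a | b for a in prev for b in prev
                 if len(a | b) == len(a) + 1}
        # ... prune: every (k-1)-subset must itself be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in prev
                        for s in combinations(c, len(c) - 1))}
        L.append(frequent(cands))
    return [s for level in L for s in level]              # L1 U L2 U ...
_______________________________________________________
The two set comprehensions inside the loop play the role of GenerateCandidates(): the join step unions pairs of frequent (k−1)-itemsets, and the prune step discards any candidate with an infrequent (k−1)-subset, which is exactly the Apriori property.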
2.3.4 ID3 Decision Tree Algorithm
Decision trees are powerful and popular tools for classification and prediction.
The attractiveness of decision trees is due to the fact that, in contrast to neural
networks, decision trees represent rules. Rules can readily be expressed so that humans
can understand them or even directly used in a database access language like SQL so
that records falling into a particular category may be retrieved.
In some applications, the accuracy of a classification or prediction is the only
thing that matters. In such situations we do not necessarily care how or why the model
works. In other situations, the ability to explain the reason for a decision is crucial. In
marketing, one has to describe the customer segments to marketing professionals, so that
they can utilize this knowledge in launching a successful marketing campaign. These
domain experts must recognize and approve this discovered knowledge, and for this we
need good descriptions. There are a variety of algorithms for building decision trees
that share the desirable quality of interpretability.
Decision Tree: A decision tree is a classifier in the form of a tree structure, where
each node is either:
A leaf node - indicates the value of the target attribute (class) of examples, or
A decision node - specifies some test to be carried out on a single attribute-
value, with one branch and sub-tree for each possible outcome of the test.
A decision tree can be used to classify an example by starting at the root of the
tree and moving through it until a leaf node is reached, which provides the classification of the
instance.
Decision tree induction is a typical inductive approach to learn knowledge on
classification. The key requirements to do mining with decision trees are:
Attribute-value description: The object or case must be expressible in terms of a
fixed collection of properties or attributes. This means that we need to discretize
continuous attributes, unless this is already handled by the algorithm.
Predefined classes (target attribute values): The categories to which
examples are to be assigned must have been established beforehand (supervised
data).
Discrete classes: A case does or does not belong to a particular class, and there
must be more cases than classes.
Sufficient data: Usually hundreds or even thousands of training cases.
Constructing Decision Trees: Most algorithms that have been developed for
learning decision trees are variations on a core algorithm that employs a top-down,
greedy search through the space of possible decision trees. Decision tree programs
construct a decision tree T from a set of training cases[24].
__________________________________________________
Input: R: a set of non-target attributes,
C: the target attribute,
S: a training set
_______________________________________________________
Algorithm:
begin
- If S is empty, return a single node with value Failure;
- If S consists of records all with the same value for the target attribute, return a
single leaf node with that value;
- If R is empty, then return a single node with the value of the most frequent of
the values of the target attribute found in the records of S [in this case
there may be errors, i.e., examples that will be improperly classified];
- Let A be the attribute with the largest Gain(A, S) among the attributes in R;
- Let {aj | j = 1, 2, .., m} be the values of attribute A;
- Let {Sj | j = 1, 2, .., m} be the subsets of S consisting respectively of records
with value aj for A;
- Return a tree with root labeled A and arcs labeled a1, a2, .., am going
respectively to the trees ID3(R−{A}, C, S1), ID3(R−{A}, C, S2), ….., ID3(R−{A}, C, Sm);
- Recursively apply ID3 to the subsets {Sj | j = 1, 2, .., m} until they are empty
end
_______________________________________________________
Output: a decision tree
___________________________________________________
Figure 2.5: ID3 Decision Tree Algorithm
ID3 searches through the attributes of the training instances and extracts the
attribute that best separates the given examples. If the attribute perfectly classifies the
training sets then ID3 stops; otherwise it recursively operates on the m (where m =
number of possible values of an attribute) partitioned subsets to get their "best"
attribute.
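A compact Python rendering of Figure 2.5 may help; it is a sketch under the assumptions that each example is a dictionary of attribute values plus a target label, and that Gain is computed as defined in the next subsection:
_______________________________________________________
import math
from collections import Counter

def entropy(examples, target):
    """Entropy of the class distribution of `examples`, in bits."""
    counts = Counter(e[target] for e in examples)
    total = len(examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def gain(examples, attr, target):
    """Information gain Gain(S, A) of splitting `examples` on `attr`."""
    total = len(examples)
    remainder = 0.0
    for v in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == v]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attrs, target):
    classes = {e[target] for e in examples}
    if len(classes) == 1:          # all records share one label -> leaf
        return classes.pop()
    if not attrs:                  # no attributes left -> majority label
        return Counter(e[target] for e in examples).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(examples, a, target))
    # branch only on values actually observed, so the "Failure" case of
    # Figure 2.5 (an empty subset) cannot arise in this sketch
    return {best: {v: id3([e for e in examples if e[best] == v],
                          [a for a in attrs if a != best], target)
                   for v in {e[best] for e in examples}}}
_______________________________________________________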
The best classifier: The estimation criterion in the decision tree algorithm is the
selection of an attribute to test at each decision node in the tree. The goal is to select the
attribute that is most useful for classifying examples. A good quantitative measure of
the worth of an attribute is a statistical property called information gain that measures
how well a given attribute separates the training examples according to their target
classification. This measure is used to select among the candidate attributes at each step
while growing the tree.
Entropy: A measure of the homogeneity of a set of examples. It is also the most commonly
used discretization measure: it uses class distribution information to determine
split points, i.e., the data values for partitioning an attribute's range.
In order to define information gain precisely, we need to define a measure
commonly used in information theory, called entropy, that characterizes the (im)purity
of an arbitrary collection of examples. Given a set S, containing only positive and
negative examples of some target concept (a 2 class problem), the entropy of set S
relative to this simple, binary classification is defined as [24]:
Entropy(S) = − Pi log2 Pi − Pj log2 Pj
where Pi is the proportion of positive examples in S and Pj is the proportion of
negative examples in S. In all calculations involving entropy we define 0 log 0 to be 0.
If the target attribute takes on c different values, then the entropy of S relative to
this c-wise classification is defined as [24]:
Entropy(S) = − Σi=1..c pi log2 pi
where pi is the proportion of S belonging to class i. Note the logarithm is still
base 2 because entropy is a measure of the expected encoding length measured in bits.
Note also that if the target attribute can take on c possible values, the maximum
possible entropy is log2 c.
Information gain: Given entropy as a measure of the impurity in a collection of
training examples, we can now define a measure of the effectiveness of an attribute in
classifying the training data. The measure we will use, called information gain, is
simply the expected reduction in entropy caused by partitioning the examples according
to this attribute. More precisely, the information gain, Gain (S, A) of an attribute A,
relative to a collection of examples S, is defined as [24]:
Gain(S, A) = Entropy(S) − Σv∈Values(A) (|Sv|/|S|) Entropy(Sv)
where Values(A) is the set of all possible values for attribute A, and Sv is the
subset of S for which attribute A has value v (i.e., Sv = {s ∈ S | A(s) = v}). Note the first
term in the equation for Gain is just the entropy of the original collection S and the
second term is the expected value of the entropy after S is partitioned using attribute A.
The expected entropy described by this second term is simply the sum of the entropies
of each subset Sv, weighted by the fraction of examples |Sv|/|S| that belong to Sv. Gain
(S,A) is therefore the expected reduction in entropy caused by knowing the value of
attribute A. Put another way, Gain(S,A) is the information provided about the target
attribute value, given the value of some other attribute A. The value of Gain(S,A) is the
number of bits saved when encoding the target value of an arbitrary member of S, by
knowing the value of attribute A.
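Using the entropy() and gain() helpers from the ID3 sketch above, a toy computation over invented connection records shows the measure at work:
_______________________________________________________
examples = [
    {"srv": "http",   "flag": "SF",  "label": "normal"},
    {"srv": "http",   "flag": "SF",  "label": "normal"},
    {"srv": "telnet", "flag": "SF",  "label": "normal"},
    {"srv": "telnet", "flag": "REJ", "label": "attack"},
]
print(entropy(examples, "label"))       # 0.811 bits for a 3:1 class split
print(gain(examples, "flag", "label"))  # 0.811: flag alone separates the classes
_______________________________________________________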
The process of selecting a new attribute and partitioning the training examples
is now repeated for each non-terminal descendant node, this time using only the
training examples associated with that node. Attributes that have been incorporated
higher in the tree are excluded, so that any given attribute can appear at most once
along any path through the tree. This process continues for each new leaf node until
either of two conditions is met:
Every attribute has already been included along this path through the tree, or
The training examples associated with this leaf node all have the same target
attribute value (i.e., their entropy is zero).
2.4 Extensions of Apriori Algorithm
These basic algorithms do not consider any domain knowledge and as a result
they can generate many “irrelevant" (i.e., uninteresting) rules. We need to limit the
generation of these “uninteresting” rules. Also, the rules need to be generalized so that
they can be used in a generic manner across rule sets that are “situationally” similar [8].
These limitations are addressed, and this generalization achieved, by incorporating two
modifications into the basic Apriori algorithm. They are:
Axis attributes
Reference attributes
2.4.1 Interestingness measures based on attributes
We attempt to utilize the schema level information about audit records to direct
the pattern mining process. That is, although we cannot know in advance what patterns,
which involve actual attribute values, are interesting, we often know what attributes are
more important or useful given a data analysis task. By using the minimum support and
confidence values to output only the statistically significant patterns, the basic
algorithms implicitly measure the interestingness (i.e., relevancy) of patterns by their
support and confidence values, without regard to any available prior domain
knowledge. That is, assume I is the interestingness measure of a pattern p, then
I(p) = f(support(p); confidence(p))
Where f is some ranking function. We propose here to incorporate schema level
information into the interestingness measures. Assume IA is a measure on whether a
pattern p contains the specified important (i.e. “interesting") attributes, our extended
interestingness measure is
Ie(p) = fe(IA(p); f(support(p); confidence(p))) = fe(IA(p); I(p))
Where fe is a ranking function that first considers the attributes in the pattern,
then the support and confidence values. In the following sections, we describe several
schema-level characteristics of audit data, in the forms of “what attributes must be
considered", that can be used to guide the mining of relevant features. We do not use
these IA measures in post-processing to filter out irrelevant rules by rank ordering.
Rather, for efficiency, we use them as item constraints, i.e., conditions, during
candidate itemset generation.
2.4.2 Using the Axis Attribute(s)
There is a partial “order of importance" among the attributes of an audit record.
Some attributes are essential in describing the data, while others only provide auxiliary
information. Consider the audit data of network connections shown in Figure 2.6.
Figure 2.6 Network connection records
Here each record (row) describes a network connection. The continuous
attribute values, except the timestamps, are discretized into proper bins. A network
connection can be uniquely identified by
< timestamp; src host; src port; dst host; service >
That is, the combination of its start time, source host, source port, destination
host, and service (destination port). These are the essential attributes when describing
network data. We argue that the “relevant" association rules should describe patterns
related to the essential attributes. Patterns that include only the unessential attributes are
normally “irrelevant". For example, the basic association rules algorithm may generate
rules such as
(src bytes = 200) →(flag = SF)
These rules are not useful and to some degree are misleading. There is no
intuition for the association between the number of bytes from the source, src bytes,
and the normal status (flag = SF) of the connection, but rather it may just be a statistical
correlation evident from the dataset.
We call the essential attribute(s) axis attribute(s) when they are used as a form
of item constraints in the association rules algorithm. During candidate generation, an
item set must contain value(s) of the axis attribute(s). We consider the correlation
among non-axis attributes as not interesting. In other words,
IA(p) = 1 if p contains axis attribute(s), and 0 otherwise.
In practice, we need not designate all essential attributes as the axis attributes.
For example, some network analysis tasks require statistics about various network
services while others may require the patterns related to the hosts. We can use service
as the axis attribute to compute the association rules that describe the patterns related to
the services of the connections. It is even more important to use the axis attribute(s) to
constrain the item generation for frequent episodes. The basic algorithm can generate
serial episode rules that contain only the “unimportant" attribute values. For example
src bytes = 200; src bytes = 200 → dst bytes = 300; src bytes = 200
(We omit the support, confidence from the above rule.) Note that here each
attribute value, e.g., src bytes = 200, is from a different connection record.
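As an illustrative sketch (hypothetical item encoding, not the thesis code), the axis-attribute constraint can be written as a predicate over candidate itemsets, where each item is an (attribute, value) pair and service is the designated axis attribute; in practice it is applied during candidate generation rather than as a post-filter:
_______________________________________________________
AXIS = {"service"}  # the designated axis attribute(s)

def interesting(itemset, axis=AXIS):
    """I_A(p) = 1 iff pattern p contains at least one axis attribute."""
    return any(attr in axis for attr, _ in itemset)

p1 = frozenset([("service", "http"), ("flag", "SF")])
p2 = frozenset([("src_bytes", 200), ("flag", "SF")])
print(interesting(p1), interesting(p2))  # True False
_______________________________________________________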
2.4.3 Using the Reference Attribute(s)
Another interesting characteristic of system audit data is that some attributes
can be the references of other attributes. These reference attributes normally carry
information about some “subject", and other attributes describe the “actions" that refer
to the same “subject".
Consider the log of visits to a Web site, as shown in Figure 2.7. Here action and
request are the “actions" taken by the “subject", remote host. We see that a number
of remote hosts each make the same sequence of requests:
“/images", “/images" and “/shuttle/missions/sts-71".
Figure 2.7: Web log records
It is important to use the “subject" as a reference when finding such frequent
sequential “action" patterns because the “actions" from different “subjects" are
normally irrelevant. This kind of sequential pattern can be represented as:
(subject = X, action = a), (subject = X, action = b) → (subject = X, action = c)
Note that within each occurrence of the pattern, the action values refer to the
same subject, yet the actual subject value may not be given in the rule since any
particular subject value may not be frequent with regard to the entire dataset. In other
words, subject is simply a reference (or a variable).
In other words,
IA(p) = 1 if the itemsets of p refer to the same reference attribute value, and 0 otherwise.
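A small sketch (with hypothetical field names) of how the reference attribute is honored in practice: records are first grouped by the "subject", so that sequential "action" patterns are only mined within one subject's records:
_______________________________________________________
from collections import defaultdict

def group_by_reference(records, ref="remote_host"):
    """Group the action sequences of each subject (the reference value)."""
    groups = defaultdict(list)
    for r in records:
        groups[r[ref]].append(r["request"])
    return groups

log = [
    {"remote_host": "h1", "request": "/images"},
    {"remote_host": "h2", "request": "/images"},
    {"remote_host": "h1", "request": "/shuttle/missions/sts-71"},
]
print(dict(group_by_reference(log)))
# {'h1': ['/images', '/shuttle/missions/sts-71'], 'h2': ['/images']}
_______________________________________________________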
2.5 Level-Wise Approximate Mining
It is often necessary to include low frequency patterns. In daily network
traffic, some services, for example gopher, account for very low occurrences. Yet
we still need to include their patterns in the network traffic profile (so that we have
representative patterns for each supported service) [10]. If we used a very low support
value for the data mining algorithms, we would then get an unnecessarily large
number of patterns related to the high frequency services, for example smtp.
Procedure
_______________________________________________________
Input:
The terminating minimum support s0;
The initial minimum support si;
The axis attribute(s);
_______________________________________________________
Output:
Association rules Rules
_______________________________________________________
Begin
(1) Rrestricted = ∅;
(2) scan database to form L = {1-itemsets that meet s0};
(3) s = si;
(4) while (s ≥ s0) do begin
(5) compute frequent episodes from L: each episode must contain at least one
axis attribute value that is not in Rrestricted;
(6) append new axis attribute values to Rrestricted;
(7) append episode rules to the output rule set Rules;
(8) s = s/2; /* a smaller support value for the next iteration */
end while
End
_______________________________________________________
Figure 2.8: Level-wise Approximate Mining Procedure
Here the idea is to first find the episodes related to high frequency axis attribute
values. We then iteratively lower the support threshold to find the episodes related to
the low frequency axis values by restricting the participation of the “old" axis values
that already have output episodes. More specifically, when an episode is generated, it
must contain at least one “new" (low frequency) axis value.
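The following runnable sketch mirrors Figure 2.8 at a high level; the episode-mining step is deliberately reduced to a per-service support check (a stand-in for real episode mining), and the data and thresholds are invented:
_______________________________________________________
def level_wise(transactions, s_init, s_term, axis="srv"):
    """Iteratively halve the support threshold; output patterns only for
    axis values that have not produced patterns yet (Rrestricted)."""
    restricted, rules, s = set(), [], s_init
    while s >= s_term:
        # mining step, reduced here to a support check per "new" axis value
        for v in {t[axis] for t in transactions} - restricted:
            support = sum(t[axis] == v for t in transactions) / len(transactions)
            if support >= s:
                rules.append((v, support))  # stand-in for that value's episode rules
                restricted.add(v)
        s /= 2  # a smaller support value for the next iteration
    return rules

conns = [{"srv": "smtp"}] * 90 + [{"srv": "gopher"}] * 10
print(level_wise(conns, s_init=0.5, s_term=0.05))
# [('smtp', 0.9), ('gopher', 0.1)] -- the rare service is picked up later,
# at a lower threshold, without flooding the output with smtp patterns
_______________________________________________________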
2.6 The Basis of the Algorithm
Let I = {i1, i2, ……, im} be a set of items. Let D, the task-relevant data, be a set
of database transactions, where each transaction T is a set of items such that T ⊆ I. Each transaction is associated
with an identifier, called its TID. Let A be a set of items. A transaction T is said to
contain A if and only if A ⊆ T. An association rule is an implication of the form A → B,
where A ⊆ I, B ⊆ I, and A ∩ B = ∅. The rule A → B holds in the transaction set D with
support s, where s is the percentage of transactions in D that contain A ∪ B. This is taken
to be the probability P(A ∪ B). The rule A → B has confidence c in the transaction set
D if c is the percentage of transactions in D containing A that also contain B. This is
taken to be the conditional probability P(B|A). That is:
Support(A→B) = P(A∪B)
Confidence(A→B) = P(B|A)
Rules that satisfy both a minimum support threshold, min_sup, and a minimum
confidence threshold, min_conf, are called strong or interesting. By convention, we
write the confidence and support to lie between 0% and 100% rather than between 0.0
and 1.0.
A set of items is referred to as an itemset. An itemset that contains k items
is called a k-itemset. The occurrence frequency of an itemset is the number
of transactions that contain the itemset. This is also known as the frequency or support
count of the itemset. An itemset satisfies minimum support if the occurrence
frequency of the itemset is greater than or equal to the product of min_sup and the number of
transactions in D. The number of transactions required for an itemset to satisfy
minimum support is therefore referred to as minimum support count. The set of
frequent k-itemset is denoted by Lk.
2.6.1 Association rule mining is a two-step process:
Find all frequent Itemsets: By definition, each of these itemsets will occur at
least as frequently as a predetermined support count.
Generate Association Rules: By definition, these rules must satisfy minimum
support and minimum confidence.
For example, an association rule from the shell command history file (which is a stream
of commands and their arguments) of a user is
trn → rec:humor, 0.3; 0.1;
which indicates that 30% of the time when the user invokes trn, he or she is
reading the news in rec:humor, and reading this newsgroup accounts for 10% of the
activities recorded in his or her command history file.
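As a quick check on these definitions, the following Python sketch computes support and confidence for a candidate rule over a small, invented list of transactions:
_______________________________________________________
transactions = [
    {"trn", "rec.humor"},
    {"trn", "rec.humor"},
    {"trn", "comp.lang.c"},
    {"mail"},
    {"trn", "rec.humor"},
]

def support(itemset):
    # fraction of transactions containing every item in `itemset`
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b):
    # Confidence(A -> B) = Support(A ∪ B) / Support(A)
    return support(a | b) / support(a)

print(support({"trn", "rec.humor"}))       # 0.6
print(confidence({"trn"}, {"rec.humor"}))  # 0.75
_______________________________________________________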
Following is the Apriori algorithm as given in [10]. First the frequent itemsets are generated (lines 1–11), followed by the generation of rules (lines 12–15).
_______________________________________________________
Input: a database D of transactions and the minimum support and minimum confidence for the rules
_______________________________________________________
Algorithm:
Begin
(1) scan database D to form L1 = {frequent 1-itemsets};
(2) k = 2; /* k is the length of the itemsets */
(3) while Lk-1 ≠ ∅ do begin /* association generation */
(4) for each pair l1k-1, l2k-1 ∈ Lk-1 whose first k−2 items are the same do begin
(5) construct candidate itemset ck such that its first k−2 items are the same as those of l1k-1, and its last two items are the last item of l1k-1 and the last item of l2k-1;
(6) if there is a length k−1 subset sk-1 ⊆ ck with sk-1 ∉ Lk-1
(7) then remove ck; /* the prune step */
else
(8) add ck to Ck;
end for
(9) scan D and count the support of each ck ∈ Ck;
(10) Lk = {ck | support(ck) ≥ minimum support};
(11) k = k + 1;
end while
(12) for all lk, k ≥ 2 do begin /* rule generation */
(13) for all nonempty proper subsets am of lk do begin
(14) conf = support(lk)/support(am);
(15) if conf ≥ minimum confidence then begin
output rule am → (lk − am), with confidence = conf and support = support(lk);
end if
end for
end for
End
_______________________________________________________
Output: Association rules
_______________________________________________________
Figure 2.9: Apriori Association Rules Algorithm
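For reference, here is a compact Python rendering of Figure 2.9. It is a sketch, not the thesis implementation (asso.exe): the join of line (4) is replaced by unioning pairs of (k−1)-itemsets, which yields the same candidate set once the prune step has run, and min_sup is taken as an absolute count.
_______________________________________________________
from itertools import combinations

def apriori(transactions, min_sup, min_conf):
    # transactions: list of sets; min_sup: absolute support count;
    # min_conf: a fraction in [0, 1].
    def count(itemset):
        return sum(itemset <= t for t in transactions)

    # line (1): frequent 1-itemsets
    items = {i for t in transactions for i in t}
    L = {1: {frozenset([i]) for i in items if count({i}) >= min_sup}}
    k = 2
    while L[k - 1]:                                     # line (3)
        # lines (4)-(5): candidates as unions of two frequent (k-1)-itemsets
        cand = {a | b for a in L[k - 1] for b in L[k - 1] if len(a | b) == k}
        # lines (6)-(8): prune candidates with an infrequent (k-1)-subset
        cand = {c for c in cand
                if all(frozenset(s) in L[k - 1]
                       for s in combinations(c, k - 1))}
        # lines (9)-(10): count support, keep the frequent candidates
        L[k] = {c for c in cand if count(c) >= min_sup}
        k += 1
    # lines (12)-(15): rule generation
    rules = []
    for size, level in L.items():
        if size < 2:
            continue
        for lk in level:
            for m in range(1, size):
                for am in map(frozenset, combinations(lk, m)):
                    conf = count(lk) / count(am)
                    if conf >= min_conf:
                        rules.append((set(am), set(lk - am), conf))
    return L, rules
_______________________________________________________
On the example database D of Table 2.1 below, apriori(D, min_sup=2, min_conf=0.7) reproduces L1, L2 and L3 as derived in the walkthrough (its rule output also includes rules drawn from the 2-itemsets, which the walkthrough does not enumerate).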
Example
Consider an example database, D (Table 2.1), consisting of 10 transactions. Suppose the minimum support count required is 2 (i.e., min_sup = 2/10 = 20%) and let the minimum confidence required be 70%. We first have to find the frequent itemsets using the Apriori algorithm; then the association rules will be generated using minimum support and minimum confidence.
Table 2.1: Example database D
TID    List of Items
T1     A, B, D
T2     B, C
T3     A, E, F
T4     B, E, F
T5     A, C, D, F
T6     D, E, F
T7     A, B, C, D
T8     E, F
T9     A, D, E
T10    C, D, E

Generating 1-itemset Frequent Pattern
Scan D for count of each candidate

Table 2.2: 1-itemset C1
Itemset    Sup. Count
A          5
B          4
C          4
D          6
E          6
F          5
Compare candidate support count with minimum support count

Table 2.3: Final 1-itemset L1
Itemset    Sup. Count
A          5
B          4
C          4
D          6
E          6
F          5

The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support. In the first iteration of the algorithm, each item is a member of the set of candidates.

Generate C2 candidates from L1

Table 2.4: 2-itemset C2
ItemSet
A,B
A,C
A,D
A,E
A,F
B,C
B,D
B,E
B,F
C,D
C,E
C,F
D,E
D,F
E,F
Scan D for count of each candidate

Table 2.5: 2-itemset with support count C2
ItemSet    Sup. Count
A,B        2
A,C        2
A,D        4
A,E        2
A,F        2
B,C        2
B,D        2
B,E        1
B,F        1
C,D        3
C,E        1
C,F        1
D,E        3
D,F        2
E,F        4

Compare candidate support count with minimum support count

Table 2.6: Final 2-itemset L2
ItemSet    Sup. Count
A,B        2
A,C        2
A,D        4
A,E        2
A,F        2
B,C        2
B,D        2
C,D        3
D,E        3
D,F        2
E,F        4
Generating 2-itemset Frequent Pattern
To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to generate a candidate set of 2-itemsets, C2. Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated (as shown in Table 2.5). The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.

Generate C3 candidates from L2

Table 2.7: 3-itemset C3
ItemSet
A,B,C
A,B,D
A,C,D
A,D,E
A,D,F
A,E,F
B,C,D
B,D,E
B,D,F
D,E,F
Scan D for count of each candidate

Table 2.8: 3-itemset with support count C3
ItemSet    Sup. Count
A,B,C      1
A,B,D      2
A,C,D      2
A,D,E      1
A,D,F      1
A,E,F      1
B,C,D      1
B,D,E      0
B,D,F      0
D,E,F      1

Compare candidate support count with minimum support count

Table 2.9: Final 3-itemset L3
Itemset    Sup. Count
A,B,D      2
A,C,D      2
Generating 3-itemset Frequent Pattern
• The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori property.
• In order to find C3, we compute L2 Join L2.
• The join C3 = L2 Join L2 also yields sets such as {B,D,E}, {C,D,E}, {C,D,F} and {B,D,F}.
• The Join step is now complete, and the Prune step is used to reduce the size of C3. The Prune step helps to avoid heavy computation due to a large Ck.
• Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that these four latter candidates cannot possibly be frequent.
• For example, let us take {A,B,C}. Its 2-item subsets are {A,B}, {B,C} and {A,C}. Since all 2-item subsets of {A,B,C} are members of L2, we keep {A,B,C} in C3.
• Let us take another example, {B,D,F}, which shows how the pruning is performed. Its 2-item subsets are {B,D}, {D,F} and {B,F}.
• However, {B,F} is not a member of L2 and hence is not frequent, violating the Apriori property. Thus we have to remove {B,D,F} from C3.
• Now, the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
Generating 4-itemset Frequent Pattern
• The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {A,B,C,D}, this itemset is pruned since its subset {A,B,C} is not frequent.
• These frequent itemsets are then used to generate strong association rules (where strong association rules satisfy both minimum support and minimum confidence).
Association Rules from Frequent Itemsets
• For each frequent itemset l, generate all nonempty proper subsets of l.
• For every nonempty subset s of l, output the rule "s → (l − s)" if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.
• We had L = {{A}, {B}, {C}, {D}, {E}, {F}, {A,B}, {A,C}, {A,D}, {A,E}, {A,F}, {B,C}, {B,D}, {C,D}, {D,E}, {D,F}, {E,F}, {A,B,D}, {A,C,D}}.
– Let us take l = {A,B,D}.
– All its nonempty proper subsets are {A,B}, {A,D}, {B,D}, {A}, {B}, {D}.
• Let the minimum confidence threshold be, say, 70%.
• The resulting association rules are shown below, each listed with its confidence.
– R1: A ∧ B → D
• Confidence = sc{A,B,D}/sc{A,B} = 2/2 = 100%
• R1 is selected.
– R2: A ∧ D → B
• Confidence = sc{A,B,D}/sc{A,D} = 2/4 = 50%
• R2 is rejected.
– R3: B ∧ D → A
• Confidence = sc{A,B,D}/sc{B,D} = 2/2 = 100%
• R3 is selected.
– R4: A → B ∧ D
• Confidence = sc{A,B,D}/sc{A} = 2/5 = 40%
• R4 is rejected.
– R5: B → A ∧ D
• Confidence = sc{A,B,D}/sc{B} = 2/4 = 50%
• R5 is rejected.
– R6: D → A ∧ B
• Confidence = sc{A,B,D}/sc{D} = 2/6 = 33.3%
• R6 is rejected.
In this way, we have found two strong association rules, R1 and R3; the sketch below re-derives these confidences mechanically.
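A small self-contained Python check over the example database D:
_______________________________________________________
from itertools import combinations

D = [{"A","B","D"}, {"B","C"}, {"A","E","F"}, {"B","E","F"},
     {"A","C","D","F"}, {"D","E","F"}, {"A","B","C","D"},
     {"E","F"}, {"A","D","E"}, {"C","D","E"}]

def sc(itemset):
    # support count: number of transactions containing the itemset
    return sum(set(itemset) <= t for t in D)

l = {"A", "B", "D"}
for m in (2, 1):
    for am in combinations(sorted(l), m):
        conf = sc(l) / sc(am)
        verdict = "selected" if conf >= 0.7 else "rejected"
        print(set(am), "->", l - set(am), f"{conf:.0%}", verdict)
_______________________________________________________
Running this prints exactly the six verdicts R1–R6 above (100%, 50%, 100% for the two-item antecedents; 40%, 50%, 33% for the single-item antecedents).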
2.7 Protocols
The passing of the data and network information down through the layers of the
sending device and back up through the layers of the receiving device is made possible
by an interface between each pair of adjacent layers. Each interface defines what
information and services a layer must provide for the layer above it. Well-defined
interfaces and layer functions provide modularity to a network. As long as a layer
provides the expected services to the layer above it, the specific implementation of its
functions can be modified or replaced without requiring changes to the surrounding
layers. At each layer a header is added to the data unit; at the data link layer, a trailer is added as well [26].
2.7.1 IP (Internet Protocol)
In general, packet filters usually operate only on the header information found in the packet. Because there are several different protocol headers in each packet, we will look at the ones that are important to packet filtering. Header information found at
the Ethernet frame level is not used by most packet filters. The source address and other
similar information is of little use because it is either a local MAC hardware address for
a system on the local LAN or the same for a router responsible for the last leg of a
packet's journey through the Internet. Next up in the protocol stack would be the IP
packet header information.
There are three important pieces of information here:
IP Address: source and destination IP address
Protocols: such as TCP, UDP and ICMP
IP Options: such as source routing
( I ) IP Address
The most obvious pieces of information that can be used are the source and destination address fields. If one has only a limited number of host computers on the Internet that one wants to allow through the firewall, one can filter incoming packets based on their source address. The same thing works in reverse: one can filter packets coming from inside the network so that only certain destination addresses are allowed to get through the firewall and onto the Internet.
The IP address space has its own hierarchy and is divided into Class A, Class B, Class C, Class D and Class E. The address classes differ in size and number. Class A addresses are the largest, but there are few of them; Class C addresses are the smallest, but they are numerous. Classes D and E are also defined, but not used in normal operation.
( II ) Protocols
The next IP header field that can be useful is the Protocol field. This field
defines the protocol that the packet payload will be used for. There are two principal
protocols used:
TCP
UDP
The payload can also be something such as ICMP, which carries control messages, whereas IP itself merely directs data to its correct destination. As a rule, drop packets for any protocol that is not used on your network or that could allow someone outside of your network to reconfigure how your network operates. For example, some of the uses to which ICMP can be put include telling your routers that a destination is not reachable, or telling your router to reconfigure its tables to change the route to a particular network.
The individual bit information stored in IP Header is as follows:
Version: Identifies the version number of the protocol—for example, IPv4 or IPv6.
The receiving workstation looks at this field first to determine whether it can read the
incoming data. If it cannot, it will reject the packet. Rejection rarely occurs, however,
because most TCP/IP-based networks use IPv4. This field is 4 bits long.
Internet Header Length (IHL): Identifies the number of 4-byte (or 32-bit) blocks in the IP header. The most common header length comprises five groupings, as the minimum length of an IP header is 20 bytes (five 4-byte blocks). This field is important because it indicates to the receiving node where data will begin (immediately after the header ends). The IHL field is 4 bits long.
Differentiated Services (DiffServ) Field: Informs routers what level of
precedence they should apply when processing the incoming packet. This field is 8 bits
long. It used to be called the Type of Service (ToS) field, and its purpose was the same
as the re-defined Differentiated Services field. However, the ToS specification allowed
only eight different values regarding the precedence of a datagram, and the field was
rarely used. Differentiated Services allows for up to 64 values and a greater range of
priority handling options.
Total length: Identifies the total length of the IP datagram, including the header and
data, in bytes. An IP datagram, including its header and data, cannot exceed 65,535
bytes. The Total length field is 16 bits long.
Identification: Identifies the message to which a datagram belongs and enables the
receiving node to reassemble fragmented messages. This field and the following two
fields, Flags and Fragment offset, assist in reassembly of fragmented packets. The
Identification field is 16 bits long.
Flags: Indicates whether a message is fragmented and, if it is fragmented, whether this
datagram is the last in the fragment.
Fragment offset: Identifies where the datagram fragment belongs in the incoming
set of fragments. This field is 13 bits long.
Time to live (TTL): Indicates the maximum time that a datagram can remain on the
network before it is discarded. Although this field was originally meant to represent
units of time, on modern networks it represents the number of times a datagram has
been forwarded by a router, or the number of router hops it has endured. The TTL for
datagrams is variable and configurable, but is usually set at 32 or 64. Each time a datagram passes through a router, its TTL is reduced by 1. When a router receives a datagram with a TTL equal to 1, it discards that datagram (or more precisely, the frame to which it belongs). The TTL field in an IP datagram is 8 bits long.
Protocol: This 8-bit field defines the higher-level protocol that uses the services of
the IP layer. An IP datagram can encapsulate data from several higher level protocols
such as TCP, UDP, ICMP, and IGMP. This field specifies the final destination protocol
to which the IP datagram should be delivered. In other words, since the IP protocol
multiplexes and demultiplexes data from different higher-level protocols, the value of
this field helps in the demultiplexing process when the datagram arrives at its final
destination.
Header checksum: Allows the receiving node to determine whether the IP header has been corrupted during transmission. This field is 16 bits long.
Source IP address: Identifies the full IP address (or Network layer address) of the
source node. This field is 32 bits long.
Destination IP address: Indicates the full IP address (or Network layer address) of
the destination node. This field is 32 bits long.
Options: May contain optional routing and timing information. The Options field
varies in length.
Padding: Contains filler bits to ensure that the header is a multiple of 32 bits. The
length of this field varies.
Data: Includes the data originally sent by the source node, plus information added by
TCP in the Transport layer. The size of the Data field varies.
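To make the field layout concrete, here is a minimal Python sketch that unpacks the fixed 20-byte IPv4 header described above using the standard struct module (the function name and dictionary keys are our own):
_______________________________________________________
import struct

def parse_ipv4_header(raw):
    # Unpack the fixed 20-byte IPv4 header (the variable Options field
    # is not handled here).
    (ver_ihl, diffserv, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_ihl >> 4,            # high 4 bits
        "ihl": ver_ihl & 0x0F,              # header length in 4-byte blocks
        "diffserv": diffserv,
        "total_length": total_len,
        "identification": ident,
        "flags": flags_frag >> 13,          # top 3 bits
        "fragment_offset": flags_frag & 0x1FFF,
        "ttl": ttl,
        "protocol": proto,                  # 1 = ICMP, 6 = TCP, 17 = UDP
        "header_checksum": checksum,
        "source_ip": ".".join(str(b) for b in src),
        "destination_ip": ".".join(str(b) for b in dst),
    }
_______________________________________________________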
2.7.2 TCP (Transmission Control Protocol)
TCP lies between the application and the network layer of the TCP/IP protocol
suite and serves as the intermediary between the application programs and the network
operations. TCP operates in the Transport layer of the OSI Model and provides reliable
data delivery services. TCP is a connection-oriented sub-protocol, which means that a
connection must be established between communicating nodes before this protocol will
transmit data. TCP further ensures reliable data delivery through sequencing and
checksums.
IP is responsible for communication at the computer level (host-to-host
communication). As a network layer protocol, IP can deliver the message only to the
destination computer. However, this is an incomplete delivery. The message still needs
to be handed to the correct application program. TCP is responsible for delivery of the
message to the appropriate application program.
The local host and the remote host are identified using IP addresses. To identify the client and server programs, we need a second identifier, called a port number. In the TCP/IP protocol suite, port numbers are integers between 0 and 65,535.
2.7.3 UDP (User Datagram Protocol)
UDP lies between the application and the network layer of the TCP/IP protocol
suite and serves as the intermediary between the application programs and the network
operations. UDP operates in the Transport layer of the OSI Model. UDP is a
connectionless transport service. In other words, UDP offers no assurance that packets
will be received in the correct sequence. In fact, this protocol does not guarantee that
the packets will be received at all. Furthermore, it provides no error checking or
sequencing. Nevertheless, UDP’s lack of sophistication makes it more efficient than
TCP. It can be useful in situations where a great volume of data must be transferred
quickly, such as live audio or video transmissions over the Internet. In these cases, TCP
—with its acknowledgments, checksums, and flow control mechanisms—would only
add more overhead to the transmission.
UDP is also more efficient for carrying messages that fit within one data packet. In contrast to a TCP header's 10 fields, the UDP header contains only four fields: Source port, Destination port, Length, and Checksum. Use of the Checksum field in UDP is optional. UDP is a very simple protocol with a minimum of overhead.
If a process wants to send a small message and does not care much about reliability, it
can use UDP. Sending a small message using UDP takes much less interaction between
the sender and receiver than using TCP.
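Since the UDP header has only the four fields named above, parsing it is correspondingly simple; a sketch in the same style as the IP header example:
_______________________________________________________
import struct

def parse_udp_header(raw):
    # Unpack the fixed 8-byte UDP header: four 16-bit fields.
    src_port, dst_port, length, checksum = struct.unpack("!HHHH", raw[:8])
    return {"source_port": src_port,
            "destination_port": dst_port,
            "length": length,        # header + data, in bytes
            "checksum": checksum}    # optional in UDP; 0 means not used
_______________________________________________________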
2.7.4 ICMP (Internet Control Message Protocol)
ICMP is a network layer protocol. However, its messages are not passed
directly to the data link layer as would be expected. Instead, the messages are first
encapsulated inside IP datagrams before going to the lower layer. The value of the
protocol field in the IP datagrams is 1 to indicate that the IP data is an ICMP message.
ICMP reports on the success or failure of data delivery. It can indicate when part of the network is congested, when data fails to reach its destination, and when data has been discarded because the allotted time for its delivery (its TTL) expired. ICMP
announces these transmission failures to the sender, but ICMP cannot correct any of the
errors it detects; those functions are left to higher-layer protocols, such as TCP.
However, ICMP’s announcements provide critical information for troubleshooting
network problems.
2.8 TCPDump
TCPDump has a wide range of features and can be used in a number of ways. This section gives a brief introduction to the basic features of TCPDump. TCPDump can be used to capture some or all packets received by a network interface. The range of packets captured can be specified using a combination of logical operators and parameters such as source and destination MAC or IP addresses, protocol types (IP and Ethernet) and TCP/UDP port numbers [16].
The packets captured can either be written to a file as raw data for later processing by TCPDump, or directed to standard output where they can be displayed or processed using other tools and scripts. Data written to a file can be examined using TCPDump and the data directed to standard output.
It is quite common to use TCPDump to write a range of packets to a file and then read back just the packets required; this allows the dataset to be examined repeatedly while an expression is refined to extract exactly the packets required. It is quite frustrating to realize that you have captured only 98% of what you wanted; it is far better to capture 120% and then filter!
linux> tcpdump -i eth0 >> test.txt
After giving this command, the captured network data will be automatically stored into that file.
The content of this file may look as below; TCPDump output has the following format.
For UDP datagrams
15:22:41.400299 orac.erg.abdn.ac.uk.1052 > 224.2.156.220.57392: udp 110
Timestamp 15:22:41.400299
Source address orac.erg.abdn.ac.uk
Source port 1052
Destination address 224.2.156.220
Destination port 57392
Protocol udp
Size 110
For TCP datagrams
16:23:01.079553 churchward.erg.abdn.ac.uk.33635 > gordon.erg.abdn.ac.uk.32772: P
12765:12925(160) ack 19829 win 24820 (DF)
Timestamp 16:23:01.079553
Source address churchward.erg.abdn.ac.uk
Source port 33635
Destination address gordon.erg.abdn.ac.uk
Destination port 32772
PUSH flag set P
Sequence number (start byte) 12765:
Contained data bytes from the sequence number up to but not including 12925
Number of user data bytes in datagram (160)
Details of acknowledgement, window size and header flags: ack 19829 win 24820 (DF)
Decoding of the full TCP headers using TCPDump is not discussed here; however, this is a well-researched area, and a web search is a good starting point.
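For programmatic work with such traces (as the input pre-processing module in Chapter 3 requires), a line in the UDP format above can be split with a regular expression. A rough Python sketch, assuming the default timestamped host.port > host.port layout:
_______________________________________________________
import re

# Matches lines such as:
# 15:22:41.400299 orac.erg.abdn.ac.uk.1052 > 224.2.156.220.57392: udp 110
UDP_LINE = re.compile(
    r"(?P<ts>\d\d:\d\d:\d\d\.\d+)\s+"
    r"(?P<src>\S+)\.(?P<sport>\S+)\s+>\s+"
    r"(?P<dst>\S+)\.(?P<dport>\S+):\s+udp\s+(?P<size>\d+)"
)

line = "15:22:41.400299 orac.erg.abdn.ac.uk.1052 > 224.2.156.220.57392: udp 110"
m = UDP_LINE.match(line)
if m:
    print(m.groupdict())
    # {'ts': '15:22:41.400299', 'src': 'orac.erg.abdn.ac.uk',
    #  'sport': '1052', 'dst': '224.2.156.220', 'dport': '57392',
    #  'size': '110'}
_______________________________________________________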
To get TCPDump to display more information about each packet, use the verbose output mode:
tcpdump -v <expression>
tcpdump -vv <expression>
tcpdump -vvv <expression>
Time Stamps: TCPDump adds timestamps to packets by default. The timestamp is in the following format: hours:minutes:seconds.fraction
15:22:41.400299 orac.erg.abdn.ac.uk.1052 > 224.2.156.220.57392: udp 110
The following switches alter the timestamp format.
-t suppresses the timestamp output
orac.erg.abdn.ac.uk.1052 > 224.2.156.220.57392: udp 597
-tt gives an unformatted timestamp; this value is a count in seconds from the OS clock's initial value
1029507868.335134 orac.erg.abdn.ac.uk.1052 > 224.2.156.220.57392: udp 520
-ttt gives the interval between the received packet and the previous packet
358020 orac.erg.abdn.ac.uk.1052 > 224.2.156.220.57392: udp 586
328704 orac.erg.abdn.ac.uk.1052 > 224.2.156.220.57392: udp 893
391361 orac.erg.abdn.ac.uk.1052 > 224.2.156.220.57392: udp 491
Source and Destination addresses and Ports: To capture packets to or from particular groups or hosts, a range of expressions can be used; here are some examples.
To capture all traffic with host churchward as source or destination address
tcpdump host churchward
To capture all traffic with the tcp or udp, source or destination, port number 53
tcpdump port 53
To capture all traffic with the source address churchward
tcpdump src host churchward
To capture all traffic with the destination tcp or udp port 53
tcpdump dst port 53
To capture all TCP traffic with the source address churchward
tcpdump tcp and src host churchward
To capture all traffic with the destination udp port 53
tcpdump udp dst port 53
There is a huge range of options available; the examples above are intended to give an introduction to the structure and syntax.
Logical Operators: Expressions can be combined using AND and OR with the
additional use of NOT.
To capture all traffic with the source address churchward AND with the destination
udp port 53
tcpdump src host churchward and udp dst port 53
To capture all traffic with the destination address 224.2.127.254 OR with the
destination address 239.255.255.255
tcpdump dst 224.2.127.254 or dst 239.255.255.255
To capture all traffic with the destination address 224.2.127.254 AND NOT with the source address 139.133.204.110
tcpdump dst 224.2.127.254 and not src 139.133.204.110
Writing to and Reading from file
To write raw packets to a file for later processing, the syntax is as follows
tcpdump -w <filename>
This can be combined with an expression to only write some packets to the file.
tcpdump -w dns-file udp dst port 53
This would write all packets with destination udp port 53 to the file dns-file.
To read packets from a dump file
tcpdump -r <filename>
This can be combined with an expression to only read some packets from the file.
tcpdump -r dns-file src host churchward and udp
This would read any udp packets sent by host churchward from the file.
2.9 QT Designer
Qt Designer is a tool for designing and implementing user interfaces built with the Qt multiplatform GUI toolkit. Qt Designer makes it easy to experiment with user interface design. At any time you can generate the code required to reproduce the user interface from the files Qt Designer produces, changing your design as often as you like. Users of an earlier version will find themselves immediately productive in the new version, since the interface is very similar, and will also find new widgets and new and improved functionality developed as a result of user feedback.
Qt Designer helps you build user interfaces with layout tools that move and scale your widgets (controls, in Windows terminology) automatically at runtime. The resulting interfaces are both functional and attractive, comfortably suiting your users' operating environments and preferences. Qt Designer supports Qt's signals and slots mechanism for type-safe communication between widgets. Qt Designer includes a code editor which you can use to embed your own custom slots inside the generated code. Those who prefer to separate generated code from hand-crafted code can continue to use the subclassing approach pioneered in the first version of Qt Designer.
2.10 Conclusion
In this chapter, we have seen the various classes of intrusion detection systems. We studied the different types of detection mechanisms, the data mining algorithms for intrusion detection, the various protocols of the communication network, and the programming tools used.
Chapter 3
ANALYSIS AND DESIGN
3.1 Introduction
In order to develop efficient software it is necessary to undergo a system study.
Analysis of the system includes various stages. The important step in analysis is
requirement analysis. In requirement analysis, we consider User Perspective analysis,
Developer Perspective analysis and Functional Perspective analysis. The user
perspective analysis specifies the requirements on the behalf of the user. The developer
perspective analysis specifies the developing methodologies of the system i.e. the
facilities provided by the system.
The functional perspective analysis specifies the main functional tasks for the
system. The requirement analysis is followed by the data structure analysis where we
consider the Data Flow Diagrams for the system. Using DFDs, we can specify the
requirement of functions, processes and data at various levels.
In the system design for the IDS, we study the network policies which are applied for designing the IDS data structure, the major system components required, and their hardware and software configurations.
After this study, we design the software architecture of the IDS, which contains the various modules required for the implementation. The system architecture is divided into two parts: internal module communication and module specifications. The internal module communication process explains the interaction of one module with other modules. This interaction may be in the form of user data, system data, inputs for operation, or outputs of operation. The module specification gives information about the functional behavior and operation of each module.
3.2 Data Flow Diagram
Data Flow Diagrams (DFDs) show the transformation of data from input to output through processes. These DFDs describe and analyze the movement of data through the intrusion detection system. The main functions, processes, database and data structures can be studied using the DFDs for the intrusion detection system.
Any functional model represents three generic functions: INPUT, PROCESSING and OUTPUT. The functional model begins with a single context level, and over a series of iterations more and more detail is provided.
3.2.1 Level-0 DFD
As shown in the 0th-level data flow diagram in figure 3.1, the network traffic from the driver (the Network Interface Card driver) is the input information to the IDS. The network traffic includes packet information that is captured by tcpdump. The output from the IDS is intrusion information, which consists of connection time, IP address, port, and status.
Figure 3.1: 0th level DFD
3.2.2 Level-1 DFD
The level-1 data flow diagram is shown in figure 3.2.
Input pre-processing
The network-based approach relies on tcpdump data as input, which gives per-packet information. We used a huge dataset, collected over a long time span, as the normal data. This data was pre-processed to generate the itemset.txt file. The basic task is to extract an extensive set of features for the data mining algorithm. This task is performed on the basis of the protocol of the packet and the frequency of the item sets. The frequent item sets and their frequencies are stored in the item set file.
The training module
The IDS is trained using a data set that contains candidate item sets and their corresponding frequencies. The process in this module performs the Apriori algorithm to generate association rules. For this algorithm we also need to specify a minimum support and a minimum confidence. Both parts of the stream are necessary in this module to perform a conventional association rules discovery. The output of this module is a profile of rules that depicts the behaviour of the network together with the possible known attacks. This profile of rules is stored in the rules file.
Figure 3.2: Level-1 DFD
The detection module
Here the actual detection of intrusions is implemented. The profile of rules, along with the network data set, is fed to the detection module, which performs the ID3 decision tree algorithm to classify the incoming packets from the network as normal or attacks on the basis of the association rules defined in the training module. The output of this module contains the information about intruders such as connection time, IP address, port, and status.
3.2.3 Level-2 DFD
Input Pre-Processing:
The level-2 data flow diagram for the input pre-processing is shown in figure 3.4.
Here the dataset that we collected over a long time is given as input to the algorithm that generates the frequent itemsets for the further generation of the association rules. Following is the Apriori algorithm as given in [10], restricted to the generation of the frequent itemsets.
_______________________________________________________
Input: a database D of transactions and the minimum support for the rules
_______________________________________________________
Algorithm:
Begin
(1) scan database D to form L1 = {frequent 1-itemsets};
(2) k = 2; /* k is the length of the itemsets */
(3) while Lk-1 ≠ ∅ do begin /* association generation */
(4) for each pair l1k-1, l2k-1 ∈ Lk-1 whose first k−2 items are the same do begin
(5) construct candidate itemset ck such that its first k−2 items are the same as those of l1k-1, and its last two items are the last item of l1k-1 and the last item of l2k-1;
(6) if there is a length k−1 subset sk-1 ⊆ ck with sk-1 ∉ Lk-1
(7) then remove ck; /* the prune step */
else
(8) add ck to Ck;
end for
(9) scan D and count the support of each ck ∈ Ck;
(10) Lk = {ck | support(ck) ≥ minimum support};
(11) k = k + 1;
end while
End
_______________________________________________________
Output: Frequent itemsets.
_______________________________________________________
Figure 3.3: Apriori algorithm for frequent itemset
Figure 3.4: Level-2 DFD of Input Pre-Processing Module
In the input pre-processing module, the packet data in test.txt, collected from tcpdump, works as input. These data are processed to collect the candidate item sets and count their frequencies in each packet. The output, containing the candidate item sets and their frequencies, is stored in itemset.txt.
The Training Module:
The level-2 data flow diagram for the training module is shown in figure 3.6.
The IDS is trained using a data set that contains candidate item sets and their corresponding frequencies. The process in this module performs the Apriori algorithm to generate association rules. First we define the rule format, then generate the frequent item sets from the candidate item sets by checking them against minimum support and minimum confidence. These frequent item sets are then used to generate the association rules, which are stored in rules.txt. Following is the generation of rules.
_______________________________________________________
Input: ItemSet
_______________________________________________________
(1) for all lk, k ≥ 2 do begin
(2) for all nonempty proper subsets am of lk do begin
(3) conf = support(lk)/support(am);
(4) if conf ≥ minimum confidence then begin
output rule am → (lk − am), with confidence = conf and support = support(lk);
end if
end for
end for
_______________________________________________________
Output: Association rules
_______________________________________________________
Figure 3.5 : Association Rules Algorithm
Figure 3.6: Level-2 DFD of Training Module
The Detection Module:
The level-2 data flow diagram for the detection module is shown in figure 3.8.
The IDS detects intruders in the network data with the help of the association rules defined in the training module. To perform this task we implement the ID3 decision tree algorithm. In this algorithm, a decision tree is constructed from the rules and the packet data. While constructing the decision tree we calculate the entropy of each attribute and measure its information gain (a sketch of these computations follows Figure 3.7). The information regarding intruders is stored in the intrusioninfo.txt file. This information includes connection time, IP address, port, and status. Decision tree programs construct a decision tree T from a set of training cases, as follows.
_______________________________________________________
Input: R: a set of non-target attributes,
C: the target attribute,
S: a training set
_______________________________________________________
Algorithm:
begin
- If S is empty, return a single node with value Failure;
- If S consists of records all with the same value for the target attribute, return a single leaf node with that value;
- If R is empty, then return a single node with the value of the most frequent of the values of the target attribute found in the records of S [in that case there may be errors, i.e., examples that will be improperly classified];
- Let A be the attribute with the largest Gain(A, S) among the attributes in R;
- Let {aj | j = 1, 2, ..., m} be the values of attribute A;
- Let {Sj | j = 1, 2, ..., m} be the subsets of S consisting respectively of the records with value aj for A;
- Return a tree with root labeled A and arcs labeled a1, a2, ..., am going respectively to the trees ID3(R−{A}, C, S1), ID3(R−{A}, C, S2), ..., ID3(R−{A}, C, Sm);
- Recursively apply ID3 to the subsets {Sj | j = 1, 2, ..., m} until they are empty
end
_______________________________________________________
Output: A decision tree
_______________________________________________________
Figure 3.7: ID3 Decision Tree Algorithm
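The entropy and information gain used to choose the attribute A are not spelled out in Figure 3.7. A standard formulation, sketched in Python with our own function names and an invented toy record set:
_______________________________________________________
from collections import Counter
from math import log2

def entropy(labels):
    # H(S) = -sum over classes c of p_c * log2(p_c)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(records, attr, target):
    # Gain(A, S) = H(S) - sum_j (|Sj|/|S|) * H(Sj), splitting S on A
    labels = [r[target] for r in records]
    gain = entropy(labels)
    n = len(records)
    for value in {r[attr] for r in records}:
        subset = [r[target] for r in records if r[attr] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Toy usage: choose the attribute that best separates the classes.
records = [
    {"port": "telnet", "proto": "tcp", "status": "attack"},
    {"port": "smtp",   "proto": "tcp", "status": "normal"},
    {"port": "telnet", "proto": "udp", "status": "attack"},
    {"port": "domain", "proto": "udp", "status": "normal"},
]
best = max(["port", "proto"],
           key=lambda a: information_gain(records, a, "status"))
print(best)   # "port": it separates attack from normal perfectly here
_______________________________________________________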
Figure 3.8: Level-2 DFD of Detection Module
3.3 Data Structure Design
This section introduces the data structure design for input and output data.
The test.txt file contains packet data captured by tcpdump for the training module; the netstat.txt file also contains packet data, but for the detection module. The general data structure design of both files, test.txt and netstat.txt, is shown in the tables below.
For UDP datagram:
Packet Field Name (Timestamp, Packet type, Source address, Source port, Destination address, Destination port , Protocol, Size)
For TCP datagram:
Packet Field Name (Timestamp, Packet type, Source address, Source port, Destination address, Destination port, flag, Sequence number, Contained data up to, Number of user data, Acknowledgement, Window size, Option )
The interm.txt file, used as an intermediate file, has the same data structure design as test.txt and netstat.txt.
The itemset.txt file, which is used by the training module, contains the candidate item sets and their frequencies. The data structure design for this file is shown in the table below.
Candidate item set | Frequency
The rules.txt file, which is an output of the training module and one of the inputs of the detection module, contains the association rules generated by the Apriori algorithm. The data structure design for this file is shown in the table below.
Frequent item set -> frequent item set
The intrusioninfo.txt file, which is generated by the detection module, contains the information about intruders. The data structure design for this file is shown in the table below.
Connection time | IP address | Port | Status
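As an illustration of these record layouts, hypothetical Python dataclasses mirroring one row of rules.txt and of intrusioninfo.txt might look like:
_______________________________________________________
from dataclasses import dataclass

@dataclass
class Rule:
    # One line of rules.txt: antecedent itemset -> consequent itemset.
    antecedent: frozenset
    consequent: frozenset

@dataclass
class IntrusionRecord:
    # One line of intrusioninfo.txt.
    connection_time: str   # e.g. "22:11:58.533049"
    ip_address: str        # e.g. "192.168.3.158"
    port: str              # service name or number, e.g. "netbios-ns"
    status: str            # "Normal" or "Attack"
_______________________________________________________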
3.4 Software Architecture
Our basic idea is to generate association rules from the audit record set. We generate these rules using the Apriori algorithm. Its implementation involves the generation of the frequent itemsets of sizes 1..n, followed by the generation of rules. We intend to process the audit records, which reveal the transactions taking place in the network. The records contain various fields, and to generate frequent itemsets we need to combine the items in the different fields. The rules generator would then use these frequent itemsets to generate the association rules. The rules generator will store the generated association rules in an efficient data structure to facilitate efficient storage and retrieval. The stored association rules will then be used with the test data by the intrusion detector to flag intrusions. The component diagram is shown in figure 3.9.
Figure 3.9: IDS Component Diagram
Input pre-processing takes the test.txt file as input (the file of data that we collect over a long period) and writes the frequent itemsets into itemset.txt (through the Apriori algorithm). This itemset.txt file is then further processed in the training module to generate the association rules, which are written to rules.txt. Once the rules are generated, the actual data to be checked (netstat.txt) is sent as input, together with rules.txt, to the detection module (where the ID3 algorithm is used) to detect intrusions, and the results are written to intrusioninfo.txt. A schematic sketch of this pipeline follows.
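A schematic driver for this pipeline, with the three stages as placeholder functions (the real system implements them in asso.exe and main.exe; everything below is illustrative):
_______________________________________________________
def preprocess(trace_path, itemset_path):
    # Stub: extract candidate item sets and frequencies from a tcpdump trace.
    ...

def train(itemset_path, rules_path, min_sup, min_conf):
    # Stub: run Apriori over the item sets and write association rules.
    ...

def detect(rules_path, traffic_path, report_path):
    # Stub: classify packets with an ID3 tree built from the rules.
    ...

# The file names mirror the component diagram in figure 3.9.
preprocess("test.txt", "itemset.txt")
train("itemset.txt", "rules.txt", min_sup=0.2, min_conf=0.8)
detect("rules.txt", "netstat.txt", "intrusioninfo.txt")
_______________________________________________________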
3.5 Conclusion
This chapter has given a good feel for what it takes to implement the intrusion detection system. With the help of requirement analysis and requirement analysis tools such as data flow diagrams, we have decided the flow of information through the system. We have designed the data structure and the software architecture required for the implementation of the system.
Chapter 4
EXPERIMENTS AND RESULTS
4.1 Introduction
We now present details of the experiments, the results obtained and their interpretation, and demonstrate that intrusion detection is indeed possible as per our proposition. We have tried to include a variety of data corresponding to both the audit (valid) and the test data sets. The records represent different sessions, of varying sizes (i.e., number of transactions) and of different characteristics.
4.2 Environment setup
First we need to take input data to train IDS by following command:
#tcpdump –i eth0 >>test.txt
This command captures packets and stores them in the test.txt file, which is further used to train the IDS using asso.exe. The IDS is trained by asso.exe, which takes test.txt as input and generates itemset.txt and rules.txt:
o Input – test.txt
o Output – itemset.txt and rules.txt
o Intermediate file – interm.txt
In the training process, asso.exe first performs input pre-processing on test.txt and generates itemset.txt, which contains the frequent item sets and their frequencies. After input pre-processing, asso.exe performs the Apriori algorithm on itemset.txt and generates the rules file named rules.txt. Now the IDS is trained, and we can start the detection process. Network data that should be tested for intrusion can be fetched by the following command in a terminal:
#tcpdump –i eth0 >>netstat.txt
This command continuously stores data into netstat.txt until <Ctrl+Z> is pressed. The administrator then runs the IDS via login.exe, the login window through which one enters the IDS Control Center.
4.3 Screenshots
#tcpdump –i eth0 >>test.txt
This command captures packets and stores them in the test.txt file.
Fig-4.1: Test Data (test.txt)
The test.txt file contains all packets received by a network interface in the following format:
For UDP datagrams
14:55:42.124288 IP 192.168.3.17.netbios-dgm > 192.168.3.255.netbios-dgm: NBT
UDP PACKET(138)
Timestamp 14:55:42.124288,
Packet type IP
Source address 192.168.3.17
Source port netbios-dgm
Destination address 192.168.3.255
Destination port netbios-dgm
Protocol UDP PACKET
Size 138
For TCP datagrams
14:55:51.635185 IP 172.16.10.240.squid > 172.16.1.167.58526: P
1096064673:1096064695(22) ack 1112655718 win 7240 <nop,nop,timestamp
155446588 2091229>
Timestamp 14:55:51.635185
Packet type IP
Source address 172.16.10.240
Source port squid
Destination address 172.16.1.167
Destination port 58526
PUSH flag set P
Sequence number 1096064673:
Contained data up to 1096064695
Number of user data bytes (22)
Acknowledgement 1112655718
Window size 7240
Option <nop,nop,timestamp 155446588 2091229>
Fig-4.2: Association screen(Asso.exe)
The IDS is trained by asso.exe, which takes test.txt as input and generates itemset.txt and rules.txt. When the Training button is clicked, the training process starts.
In this process, asso.exe first performs input pre-processing on test.txt and generates itemset.txt, which contains the frequent item sets and their frequencies.
After input pre-processing, asso.exe performs the Apriori algorithm on itemset.txt and generates the rules file named rules.txt.
Fig-4.3: ItemSet Screen (itemset.txt)
itemset.txt contains the frequent items and their frequencies in the following format:
Frequent item 192.168.3.17
Frequencies 41
Fig-4.4:Rule screen(rules.txt)
rules.txt contains association rules in the following format:
172.16.10.240 -> 172.16.1.167
This rule says that if frequent item 172.16.10.240 appears in a packet, then frequent item 172.16.1.167 should also appear.
Now the IDS is trained, and we can start the detection process.
Fig-4.5: Data file Screen (netstat.txt)
The data file screen shows information about the network data. Network data that should be tested for intrusion is fetched by the following command in a terminal:
#tcpdump –i eth0 >>netstat.txt
Fig-4.6: login Screen
Now the administrator runs the IDS via login.exe, the login window through which one enters the IDS Control Center. In the login form the administrator enters the name and password.
Fig-4.7: Ids Control Center (main.exe)
Here the packet monitor contains progress bars for incoming packets, broken down by particular packet types. When the Start button is clicked, the detection process starts. In this process, the data in netstat.txt and rules.txt are put through the ID3 algorithm to generate intrusion information, which is stored in intrusioninfo.txt as shown in Fig-4.8. The rule file and the itemset file can also be viewed through the main form via the buttons provided for each. Clicking the Stop button stops the detection process; clicking the Exit button leaves the IDS Control Center.
Fig-4.8:Intrusion Information (Intrusioninfo.txt)
Intrusioninfo.txt contains information regarding intruders in the following format:
22:11:58.533049 192.168.3.158 netbios-ns Attack
Timestamp 22:11:58.533049
IP address 192.168.3.158
Port netbios-ns
Status Attack
The intrusioninfo.txt file, which is generated by the detection module, contains the information about intruders. This information includes connection time, IP address, port, and status. The following is an extract from the sample rules generated by our rules generator for the valid (normal) audit record set:
service: telnet, bytes(recv): 500 -> local: 12, [confidence = 0.980794, support = 1532.000000]
Here the first rule describes a regular telnet connection through a local (renumbered) host 12, during which 500 bytes of data were received by the local host. The rule can be interpreted as follows: if the service is telnet and 500 bytes are received during the connection, then this transaction takes place at local host 12. Intuitively, this is not a strong rule; however, it corresponds to the normal usage pattern. The fact that the rule is not strong is supported by the low confidence and support values. The other rules can be interpreted similarly. The second rule says that if the service is telnet and 500 bytes are received during the connection, then the data is received from a remote host with IP address 195.32.222.22. It is a stronger rule than the previous one, as can be seen from its confidence and support values. The third rule says that a normal (state = SF, i.e., SYN/FIN) smtp connection receives 1000 bytes of data from the network. Intuitively, this is a strong rule since it describes the characteristics of a normal smtp connection. These are only 3 of the 340-odd rules generated from the 1 MB (17,700 connection records) dataset. The rules describe the transactions taking place in our source network, which represent the normal state of the network transactions.
4.4 Conclusion
In input pre-processing, the network data was pre-processed to generate candidate items to train the training module of the IDS. The training module executes the Apriori algorithm to generate association rules. The output of this module is a profile of rules that depicts the behaviour of the network together with the possible known attacks. The profile of rules, along with the network data set, is fed to the detection module, which performs the ID3 decision tree algorithm to classify incoming packets from the network as either normal or attacks. The output of this module contains the information about intruders such as connection time, IP address, port, and status.
Chapter 5
CONCLUSION AND FUTURE
PERSPECTIVES
5.1 Conclusion
We have proposed a data mining technique for building intrusion detection
models. We demonstrated that association rules from the audit data could be used to
guide audit data gathering and feature selection, the critical steps in building effective
classification models. We incorporated domain knowledge into these basic algorithms
using the axis attribute(s), reference attribute(s), and a level-wise approximate mining
procedure. Our experiments on real world audit data showed that the algorithms are
very effective.
The experiments on network tcpdump data demonstrated the effectiveness of
classification models in detecting anomalies. The accuracy of the detection models
depends on sufficient training data and the right feature set. We suggested that the
association rule could be used to compute the consistent patterns from audit data. These
frequent patterns form an abstract summary of an audit trail, and therefore can be used
to: guide the audit data gathering process; provide help for feature selection; and
discover patterns of intrusions. Intrusion detection systems add an early warning
capability to a company’s defenses, alerting to the type of suspicious activity that
typically occurs before and during an attack.
Since most cannot stop an attack, intrusion detection systems should not be
considered an alternative to traditional good security practices. There is no substitute
for a carefully thought out corporate security policy, backed up by effective security
procedures which are carried out by skilled staff using the necessary tools. Instead,
intrusion detection systems should be viewed as an additional tool in the continuing
battle against hackers and crackers.
There is a large variety of intrusion detection systems available, suitable for almost any circumstance. They range from freeware versions that can be deployed on low-cost PCs, to commercial systems costing many thousands of pounds and requiring the latest and greatest hardware. Some are designed to monitor whole networks, whilst others are deployed on a per-machine basis. They all have their pros and cons, and there is a role for them all. However, our work aims to eliminate, as much as possible,
the manual and ad-hoc elements from the process of building an intrusion detection
system.
5.2 Future Work
The biggest challenge of using data mining approaches in intrusion detection is
that it requires a large amount of audit data in order to compute the profile rule sets.
And the fact that we may need to compute a detection model for each resource in a
target system makes the data mining task daunting. Moreover, this learning (mining)
process is an integral and continuous part of an intrusion detection system because the
rule sets used by the detection module may not be static over a long period of time.
For example, as a new version of system software arrives, we need to update the
“normal” profile rules. Given that data mining is an expensive process (in time and
storage), and real-time detection needs to be lightweight to be practical, we can’t afford
to have a monolithic intrusion detection system.
A system architecture has been proposed, as shown in Figure 5.1, that includes two kinds of intelligent agents: the learning agents and the detection agents. A learning agent, which may reside in a server machine for its computing power, is responsible for computing and maintaining the rule sets for programs and users. It produces both the base detection models and the meta detection models. The task of a learning agent, to compute accurate models from very large amounts of audit data, is an example of the "scale-up" problem in machine learning.
We expect that the research done in agent-based meta-learning systems [14] will contribute significantly to the implementation of the learning agents. Briefly, we are studying how to partition and dispatch data to a host of machines to compute classifiers in parallel, and how to re-import the remotely learned classifiers and combine them into an accurate (final) meta-classifier, a hierarchy of classifiers [12].
A detection agent is generic and extensible. It is equipped with a (learned and
periodically updated) rule set (i.e., a classifier) from the remote learning agent. Its
detection engine “executes” the classifier on the input audit data, and outputs evidence
of intrusions. The main difference between a base detection agent and the meta
detection agent is that the former uses preprocessed audit data as input, while the latter uses
the evidence from all the base detection agents. The base detection agents and the meta
detection agent need not be running on the same host. For example, in a network
environment, a meta agent can combine reports from (base) detection agents running on
each host, and make the final assertion on the state of the network.
Figure 5.1: Architecture for agent-based IDS
The main advantages of such a system architecture are:
It is easy to construct an intrusion detection system as a compositional hierarchy of generic detection agents. The detection agents are lightweight, since they can function independently of the heavyweight learning agents, in time and locale, so long as they are already equipped with the rule sets. A detection agent can report new instances of intrusions by transmitting the audit records to the learning agent, which can in turn compute an updated classifier to detect such intrusions and dispatch it to all detection agents. Interestingly, the capability to derive and disseminate anti-virus codes faster than a virus can spread is also considered a key requirement for anti-virus systems.
REFERENCES
[1] W. Lee, S. J. Stolfo, and K. W. Mok, "A Data Mining Framework for Adaptive Intrusion Detection", DARPA, 1998.
[2] H. Debar, M. Dacier, and A. Wespi, "Towards a taxonomy of intrusion-detection systems", Computer Networks, 31, 1999, pp. 805-822.
[3] Adhitya Chittur, "Model Generation for an Intrusion Detection System Using Genetic Algorithms", 2001.
[4] R. P. Lippmann and R. K. Cunningham, "Improving Intrusion Detection Performance Using Keyword Selection and Neural Networks".
[5] J. E. Dickerson and J. A. Dickerson, "Fuzzy network profiling for intrusion detection", IEEE, 2000.
[6] C. Warrender, S. Forrest, and B. Pearlmutter, "Detecting Intrusions Using System Calls: Alternative Data Models", IEEE Symposium on Security and Privacy, IEEE Computer Society, 1999, pp. 133-145.
[7] W. Lee, S. J. Stolfo, and K. W. Mok, "Adaptive Intrusion Detection: a Data Mining Approach", Artificial Intelligence Review, 14(6), December 2000, pp. 533-567, http://www.cc.gatech.edu/~wenke/papers/ai_review.ps
[8] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases", In Proceedings of the ACM SIGMOD Conference on Management of Data, 1993, pp. 207-216.
[9] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules", In Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.
[10] Wenke Lee and Salvatore J. Stolfo, "Algorithms for Mining System Audit Data", In Proceedings of the 7th USENIX Security Symposium, San Antonio, TX, January 2000.
[11] J. Marin, D. Ragsdale, and J. Surdu, "A Hybrid Approach to the Profile Creation and Intrusion Detection", Proceedings of the DARPA Information Survivability Conference and Exposition (DISCEX 2001), June 2001, http://www.itoc.usma.edu/Documents/Hybrid_DISCEX_AcceptedCopy.pdf
[12] W. Fan, M. Miller, S. Stolfo, W. Lee, and P. Chan, "Using Artificial Anomalies to Detect Unknown and Known Network Intrusions", In Proceedings of the First IEEE International Conference on Data Mining, San Jose, CA, November 2001, http://www.cc.gatech.edu/~wenke/papers/artificial_anomalies.ps
[13] W. Lee et al., "A data mining and CIDF based approach for detecting novel and distributed intrusions", Recent Advances in Intrusion Detection, Third International Workshop, RAID 2000, Toulouse, France, October 2-4, 2000, Proceedings, Lecture Notes in Computer Science 1907, Springer, 2000, pp. 49-65, http://www.cc.gatech.edu/~wenke/papers/lee_raid_00.ps
[14] T. Bass, "Intrusion Detection Systems & Multisensor Data Fusion: Creating Cyberspace Situational Awareness", Communications of the ACM, Vol. 43, No. 1, January 2000, pp. 99-105, http://www.silkroad.com/papers/acm.fusion.ids.ps
[15] S. Manganaris, M. Christensen, D. Zerkle, and K. Hermiz, "A data mining analysis of RTID alarms", Computer Networks, 34, 2000, pp. 571-577.
[16] V. Jacobson, C. Leres, and S. McCanne, "tcpdump", available via anonymous ftp from ftp.ee.lbl.gov, June 1989.
[17] W. Lee, S. J. Stolfo, and K. W. Mok, "Adaptive intrusion detection: a data mining approach", Artificial Intelligence Review, 1999.
[18] S. Axelsson, "Intrusion Detection Systems: A Taxonomy and Survey", Technical Report No. 99-15, Dept. of Computer Engineering, Chalmers University of Technology, Sweden, March 2000, http://www.ce.chalmers.se/staff/sax/taxonomy.ps
[19] D. Elson, "Intrusion Detection, Theory and Practice", March 27, 2000, http://online.securityfocus.com/infocus/1203
[20] K. K. Frederick, "Network Intrusion Detection Signatures", December 19, 2001, http://online.securityfocus.com/infocus/1524
[21] NSS Group, "Intrusion Detection Systems (IDS), Group Test (Edition 3)", July 2002, http://www.nss.co.uk/ids/edition3/index.htm
[22] A. K. Jones and R. S. Sielken, "Computer system intrusion detection: a survey", February 9, 2000, http://www.cs.virginia.edu/~jones/IDS-research/Documents/jones-sielken-survey-v11.pdf
[23] S. Kumar, "Classification and detection of computer intrusions", Ph.D. Thesis, Purdue University, 1995, http://ftp.cerias.purdue.edu/pub/papers/sandeep-kumar/kumar-intdet-phddiss.pdf
[24] Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2006.
[25] B. A. Forouzan, "TCP/IP Protocol Suite", Tata McGraw-Hill, Third Edition, 2006.
[26] A. S. Tanenbaum, "Computer Networks", Pearson Education, Third Edition, 2006.