Maggi Thesis

Politecnico di Milano, Dipartimento di Elettronica e Informazione

DOTTORATO DI RICERCA IN INGEGNERIA DELL'INFORMAZIONE

Integrated Detection of Anomalous Behavior of Computer Infrastructures

Doctoral Dissertation of: Federico Maggi

Advisor: Prof. Stefano Zanero

Tutor: Prof. Letizia Tanca

Supervisor of the Doctoral Program: Prof. Patrizio Colaneri

- XXII

Politecnico di Milano, Dipartimento di Elettronica e Informazione

Piazza Leonardo da Vinci , I- Milano

Preface

This thesis embraces all the efforts that I put in during the last three years as a PhD student at Politecnico di Milano. I have been working under the supervision of Prof. S. Zanero and Prof. G. Serazzi, who is also the leader of the research group I am part of. In this time frame I had the wonderful opportunity of being initiated to research, which radically changed the way I look at things: I found my natural "thinking outside the box" attitude, which was probably well hidden under a thick layer of lack of opportunities; I took part in very interesting joint works, among which the year I spent at the Computer Security Laboratory at UC Santa Barbara takes first place; and I discovered the Zen of my life.

My research is all about computers and every other technology possibly related to them. Clearly, the way I look at computers has changed a bit since I was seven. Still, I can remember myself typing on that Commodore in front of a tube TV screen, trying to get that dn routine written in Basic to work. I was just playing, obviously, but when I recently found a picture of me in front of that screen... it all became clear.

COMMODORE 64 BASIC X2 64K RAM SYSTEM 38128 BASIC BYTES FREE

READY.

BENVENUTO NEL COMPUTER DI FEDERICO.
SCRIVI LA PAROLA D'ORDINE:

That says "WELCOME TO FEDERICO'S COMPUTER. TYPE IN THE PASSWORD." So, although my attempt at writing a program to authenticate myself was a little bit naive (being limited to a print instruction up to that point, of course), I thought "maybe I am not in the wrong place, and the fact that my research is still about security is a good sign!"

Many years later, this work comes to life. There is a humongous number of people who, directly or indirectly, have contributed to my research and, in particular, to this work. Since my first step into the lab, I will not, ever, be thankful enough to Stefano, who, despite my skepticism, convinced me to submit that application for the PhD program. For trusting me since the very first moment I am thankful to Prof. G. Serazzi as well, who has always been supportive. For hosting and supporting my research abroad I thank Prof. G. Vigna, Prof. C. Kruegel, and Prof. R. Kemmerer. Also, I wish to thank Prof. M. Matteucci for the great collaboration, Prof. I. Epifani for her insightful suggestions, and Prof. H. Bos for the detailed review and the constructive comments.

On the colleagues' side of these acknowledgments I put all the fellows of Room , Guido, the crew of the seclab and, in particular, Wil, with whom I shared all the pain of paper writing between Sept and Jun .

On the friends' side of this list, Lorenzo and Simona go first, for being our family.

I have tried to translate into simple words the infinite gratitude I have, and will always have, to Valentina and my parents for being the fixed point in my life. Obviously, I failed.

Federico Maggi
Milano

September


Abstract

This dissertation details our research on anomaly detection techniques, which are central to several classic security-related tasks such as network monitoring, but also have broader applications such as program behavior characterization or malware classification. In particular, we worked on anomaly detection from three different perspectives, with the common goal of recognizing awkward activity on computer infrastructures. In fact, a computer system has several weak spots that must be protected to prevent attackers from taking advantage of them. We focused on protecting the operating system, central to any computer, to prevent malicious code from subverting its normal activity. Secondly, we concentrated on protecting web applications, which can be considered the modern, shared operating systems; because of their immense popularity, they have indeed become the most targeted entry point to violate a system. Last, we experimented with novel techniques with the aim of identifying related events (e.g., alerts reported by intrusion detection systems) to build new and more compact knowledge to detect malicious activity on large-scale systems.

Our contributions regarding host-based protection systems focus on characterizing a process' behavior through the system calls invoked into the kernel. In particular, we engineered and carefully tested different versions of a multi-model detection system using both stochastic and deterministic models to capture the features of the system calls during normal operation of the operating system. Besides demonstrating the effectiveness of our approaches, we confirmed that the use of finite-state, deterministic models allows detecting deviations from the process control flow with the highest accuracy; however, our contribution combines this effectiveness with advanced models for the system call arguments, resulting in a significantly decreased number of false alarms.

Our contributions regarding web-based protection systems focus on advanced training procedures to enable learning systems to perform well even in the presence of changes in the web application source code, which are particularly frequent in the Web 2.0 era. We also addressed data scarcity issues, which are a real problem when deploying an anomaly detector to protect a new, never-used-before application. Both these issues dramatically decrease the detection capabilities of an intrusion detection system but can be effectively mitigated by adopting the techniques we propose.

Last, we investigated the use of different stochastic and fuzzy models to perform automatic alert correlation, which is a post-processing step to intrusion detection. We proposed a fuzzy model that formally defines the errors that inevitably occur if time-based alert aggregation (i.e., two alerts are considered correlated if they are close in time) is used. This model allows accounting for measurement errors and avoiding false correlations due to delays, for instance, or incorrect parameter settings. In addition, we defined a model to describe alert generation as a stochastic process and experimented with non-parametric statistical tests to define robust, zero-configuration correlation systems.

The aforementioned tools have been tested over different datasets, which are thoroughly documented in this document, and led to interesting results.


Summary

This thesis describes in detail our research on anomaly detection techniques. Such techniques are fundamental to solving classic security-related problems, such as monitoring a network, but they also have broader applications, such as analyzing the behavior of a process in a system or classifying malware. In particular, our work focuses on three different perspectives, with the common goal of detecting suspicious activity in a computer system. Indeed, a computer system has several weak points that must be protected to prevent an attacker from taking advantage of them. We concentrated on protecting the operating system, present in any computer, to prevent a program from altering its operation. Secondly, we concentrated on protecting web applications, which can be considered the modern global operating system; in fact, their immense popularity has made them the preferred target for violating a system. Finally, we experimented with new techniques to identify relations between events (e.g., alerts reported by intrusion detection systems) with the goal of building new knowledge to detect suspicious activity on large-scale systems.

Regarding host-based anomaly detection systems, we focused on characterizing the behavior of processes based on the stream of system calls invoked into the kernel. In particular, we engineered and carefully evaluated several versions of a multi-model anomaly detection system that uses both stochastic and deterministic models to capture the characteristics of system calls during normal operation of the operating system. Besides demonstrating the effectiveness of our approaches, we confirmed that finite-state deterministic models make it possible to detect with extreme accuracy when a process deviates significantly from its normal control flow; however, the approach we propose combines this effectiveness with advanced stochastic models of the system call arguments to significantly reduce the number of false alarms.

Regarding the protection of web applications, we focused on advanced training procedures. The goal is to allow systems based on unsupervised learning to work correctly even in the presence of changes in the code of web applications, a phenomenon that is particularly frequent in the Web 2.0 era. We also addressed the problems caused by the scarcity of training data, a more than realistic obstacle, especially if the application to protect has never been used before. Both problems result in a dramatic drop in the detection capabilities of the tools, but can be effectively mitigated by adopting the techniques we propose.

Finally, we investigated the use of several models, both stochastic and fuzzy, for automatic alert correlation, the phase that follows intrusion detection. We proposed a fuzzy model that formally defines the errors that inevitably occur when correlation algorithms based on distance in time are adopted (i.e., two alerts are considered correlated if they were reported at more or less the same instant). This model also makes it possible to account for measurement errors and to avoid incorrect decisions in the case of propagation delays. In addition, we defined a model that describes alert generation as a stochastic process, and we experimented with non-parametric tests to define correlation criteria that are robust and require no configuration.

Contents

Introduction
    Today's Security Threats
        The Role of Intrusion Detection
    Original Contributions
        Host-based Anomaly Detection
        Web-based Anomaly Detection
        Alert Correlation
    Document Structure

Detecting Malicious Activity
    Intrusion Detection
        Evaluation
        Alert Correlation
        Taxonomic Dimensions
    Relevant Anomaly Detection Techniques
        Network-based techniques
        Host-based techniques
        Web-based techniques
    Relevant Alert Correlation Techniques
    Evaluation Issues and Challenges
        Regularities in audit data of IDEVAL
        The base-rate fallacy
    Concluding Remarks

Host-based Anomaly Detection
    Preliminaries
    Malicious System Calls Detection
        Analysis of SyscallAnomaly
        Improving SyscallAnomaly
        Capturing process behavior
        Prototype implementation
        Experimental Results
    Mixing Deterministic and Stochastic Models
        Enhanced Detection Models
        Experimental Results
    Forensics Use of Anomaly Detection Techniques
        Problem statement
        Experimental Results

Anomaly Detection of Web-based Attacks
    Preliminaries
        Anomaly Detectors of Web-based Attacks
        A Comprehensive Detection System to Mitigate Web-based Attacks
        Evaluation Data
    Training With Scarce Data
        Non-uniformly distributed training data
        Exploiting global knowledge
        Experimental Results
    Addressing Changes in Web Applications
        Web Application Concept drift
        Addressing concept drift
        Experimental Results

Network and Host Alert Correlation
    Fuzzy Models and Measures for Alert Fusion
        Time-based alert correlation
    Mitigating Evaluation Issues
        A common alert representation
        Proposed Evaluation Metrics
        Experimental Results
    Using Non-parametric Statistical Tests
        The Granger Causality Test
        Modeling alerts as stochastic processes

Conclusions

Bibliography

Index

List of Acronyms

List of Figures

List of Tables

Introduction

Network-connected devices such as personal computers, mobile phones, or gaming consoles are nowadays enjoying immense popularity. In parallel, the Web and the humongous amount of services it offers have certainly become the most ubiquitous tools of all time. Facebook counts more than millions active users, of which millions are using it on mobile devices; not to mention that more than billion photos are uploaded to the site each month [Facebook, ]. And this is just one popular website. One year ago, Google estimated that the approximate number of unique Uniform Resource Locators (URLs) is trillion [Alpert and Hajaj, ], while YouTube has stocked more than million videos as of March , with ,, views just on the most popular video as of January [Singer, ]. And people from all over the world inundate the Web with more than million tweets per day. Not only has the Web 2.0 become predominant; in fact, thinking that on December the Internet was made of one site and today it counts more than million sites is just astonishing [Zakon, ].

The Internet and the Web are huge [Miniwatts Marketing Grp., ]. The relevant fact, however, is that they have both become the most advanced workplace. Almost every industry has connected its own network to the Internet and relies on these infrastructures for a vast majority of transactions; most of the time, monetary transactions. As an example, every year Google loses approximately millions of US Dollars in ignored ads because of the "I'm feeling lucky" button. The scary part is that, during their daily work activities, people typically pay poor or no attention at all to the risks that derive from exchanging any kind of information over such a complex, interconnected infrastructure. This is demonstrated by the effectiveness of social engineering [Mitnick, ] scams carried over the Internet or the phone [Granger, ]. Recall that of the phishing is related to finance. Now, compare this landscape to what the most famous security quote states.

The only truly secure computer is one buried in concrete, with the power turned off and the network cable cut.

Anonymous

In fact, the Internet is anything but a safe place [Ofer Shezaf and Jeremiah Grossman and Robert Auger, ], with more than , known data breaches between and [Clearinghouse, ] and an estimate of ,, records stolen by intruders. One may wonder why the advance of research in computer security and the increased awareness of governments and public institutions are still not capable of avoiding such incidents. Besides the fact that the aforementioned numbers would be orders of magnitude higher in the absence of countermeasures, today's security issues are, basically, caused by the combination of two phenomena: the high number of software vulnerabilities and the effectiveness of today's exploitation strategy.

software flaws: (un)surprisingly, software is affected by vulnerabilities. Incidentally, tools that have to do with the Web, namely browsers, third-party extensions, and web applications, are the most vulnerable ones. For instance, in , Secunia reported around security vulnerabilities for Mozilla Firefox, for Internet Explorer's ActiveX [Secunia, ]. Office suites and e-mail clients, which are certainly the must-have-installed tools on every workstation, hold the second position [The SANS Institute, ].


massification of attacks: in parallel to the explosion of the Web 2.0, attackers and the underground economy have quickly learned that a sweep of exploits run against every reachable host has more chances of finding a vulnerable target and, thus, is much more profitable than a single effort to break into a high-value, well-protected machine.

These circumstances have initiated a vicious circle that provides the attackers with a very large pool of vulnerable targets. Vulnerable client hosts are compromised to ensure virtually unlimited bandwidth and computational resources to attackers, while server-side applications are violated to host malicious code used to infect client visitors. And so forth. An old-fashioned attacker would have violated a single site using all the resources available, stolen data, and sold it on the underground market. Instead, a modern attacker adopts a "vampire" approach and exploits client-side software vulnerabilities to take (remote) control of million hosts. In the past, the diffusion of malicious code such as viruses was sustained by the sharing of infected, cracked software through floppy or compact disks; nowadays, the Web offers unlimited, public storage to attackers, who deploy their exploits on compromised websites.

Thus, not only has the type of vulnerabilities changed, putting virtually every interconnected device at risk. The exploitation strategy has also created new types of threats that take advantage of classic malicious code patterns but in a new, extensive, and tremendously effective way.

Today's Security Threats

Every year, new threats are discovered and attackers take advantage of them until effective countermeasures are found. Then, new threats are discovered, and so forth. Symantec quantifies the amount of new malicious code threats to be ,, as of [Turner et al., ], , one year earlier and only , in . Thus, countermeasures must advance at least at the same growth rate. In addition:

[...] the current threat landscape such as the increasing complexity and sophistication of attacks, the evolution of attackers and attack patterns, and malicious activities being pushed to emerging countries show not just the benefits of, but also the need for increased cooperation among security companies, governments, academics, and other organizations and individuals to combat these changes [Turner et al., ].

Figure .: Illustration taken from [Holz, ] and © IEEE. Authorized license limited to Politecnico di Milano.

Today's underground economy runs a very proficient market: everyone can buy credit card information for as low as $.$, full identities for just $.$, or rent a scam hosting solution for $$ per week plus $-$ for the design [Turner et al., ].

The main underlying technology actually employs a classic type of software called bot (jargon for robot), which is not malicious per se, but is used to remotely control a network of compromised hosts, called botnet [Holz, ]. Remote commands can be of any type and typically include launching an attack, starting a phishing or spam campaign, or even updating to the latest version of the bot software by downloading the binary code from a host controlled by the attackers (usually called bot master) [Stone-Gross et al., ]. The exchange good has now become the botnet infrastructure itself rather than the data that can be stolen or the spam that can be sent. These are mere outputs of today's most popular service offered for rent by the underground economy.

The Role of Intrusion Detection

The aforementioned, dramatic big picture may lead one to think that malicious software will eventually proliferate on every host of the Internet and that no effective remediation exists. However, a more careful analysis reveals that, despite the complexity of this scenario, the problems that must be solved by a security infrastructure can be decomposed into relatively simple tasks that, surprisingly, may already have a solution. Let us look at an example.

Example. This is how a sample exploitation can be structured:

injection: a malicious request is sent to the vulnerable web application with the goal of corrupting all the responses sent to legitimate clients from that moment on. For instance, more than one release of the popular WordPress blog application is vulnerable to injection attacks that allow an attacker to permanently include arbitrary content in the pages. Typically, such arbitrary content is malicious code (e.g., JavaScript, VBScript, ActionScript, ActiveX) that, every time a legitimate user requests the infected page, executes on the client host.

infection: assuming that the compromised site is frequently accessed (this might be the realistic case of the WordPress-powered ZDNet news blog), a significant number of clients visit it. Due to the high popularity of vulnerable browsers and plug-ins, the client may run Internet Explorer, which is the most popular, or an outdated release of Firefox on Windows. This creates the perfect circumstances for the malicious page to successfully execute. In the best case, it may download a virus or generic malware from a website under control of the attacker, thus infecting the machine. In the worst case, this code may also exploit specific browser vulnerabilities and execute in privileged mode.

http://secunia.com/advisories/
http://wordpress.org/showcase/zdnet/

control & use: the malicious code just downloaded installs and hides itself on the victim's computer, which has just joined a botnet. As part of it, the client host can be remotely controlled by the attackers who can, for instance, rent it, or use its bandwidth and computational power along with other computers to run a distributed Denial of Service (DoS) attack. Also, the host can be used to automatically perform the same attacks described above against other vulnerable web applications. And so forth.

This simple yet quite realistic example shows the various kinds of malicious activity that are generated during a typical drive-by exploitation. It also shows the requirements and assumptions that must hold to guarantee its success. More precisely, we can recognize:

network activity: clearly, the whole interaction relies on a network connection over the Internet: the HyperText Transfer Protocol (HTTP) connections used, for instance, to download the malicious code as well as to launch the injection attack used to compromise the web server.

host activity: similarly to every other type of attack against an application, when the client-side code executes, the browser (or one of its extension plug-ins) is forced to behave improperly. If the malicious code executes to completion, the attack succeeds and the host is infected. This happens only if the platform, operating system, and browser all match the requirements assumed by the exploit designer. For instance, the attack may succeed on Windows and not on Mac OS X, although the vulnerable version of, say, Firefox is the same on both hosts.

HTTP traffic: in order to exploit the vulnerability of the web application, the attacking client must generate malicious HTTP requests. For instance, in the case of a Structured Query Language (SQL) injection (the second most common vulnerability in a web application), instead of a regular


GET /index.php?username=myuser

the web server might be forced to process a

GET /index.php?username= OR x=x--\&content=

that causes the index.php page to behave improperly.

It is now clear that protection mechanisms that analyze the network traffic, the activity of the client's operating system, the web server's HTTP logs, or any combination of the three, have chances of recognizing that something malicious is happening in the network. For instance, if the Internet Service Provider (ISP) network adopts Snort, a lightweight Intrusion Detection System (IDS) that analyzes the network traffic for known attack patterns, it could block all the packets marked as suspicious. This would prevent, for instance, the SQL injection from reaching the web application. A similar protection level can be achieved by using other tools such as ModSecurity [Ristic, ]. One of the problems that may arise with these classic, widely adopted solutions occurs when a zero-day attack is used. A zero-day attack or threat exploits a vulnerability that is unknown to the public, undisclosed to the software vendor, or for which a fix is not available; thus, protection mechanisms that merely blacklist known malicious activity immediately become ineffective. In a similar vein, if the client is protected by an anti-virus, the infection phase can be blocked. However, this countermeasure is once again successful only if the anti-virus is capable of recognizing the malicious code, which assumes that the code is known to be malicious.

Ideally, an effective and comprehensive countermeasure can be achieved if all the protection tools involved (e.g., client-side, server-side, network-side) can collaborate. For instance, if a website is publicly reported to be malicious, a client-side protection tool should block all the content downloaded from that particular website. This is only a simple example.

Thus, countermeasures against today's threats already exist but are subject to at least two drawbacks:

- they offer protection only against known threats. To be effective we must assume that all the hostile traffic can be enumerated, which is clearly an impossible task.

  Why is "Enumerating Badness" a dumb idea? It's a dumb idea because sometime around the amount of Badness in the Internet began to vastly outweigh the amount of Goodness. For every harmless, legitimate, application, there are dozens or hundreds of pieces of malware, worm tests, exploits, or viral code. Examine a typical antivirus package and you'll see it knows about ,+ viruses that might infect your machine. Compare that to the legitimate or so apps that I've installed on my machine, and you can see it's rather dumb to try to track , pieces of Badness when even a simpleton could track pieces of Goodness [Ranum, ].

- they lack cooperation, which is crucial to detect global and slow attacks.

That said, we conclude that classic approaches such as dynamic and static code analysis and IDS already offer good protection, but industry and research should move toward methods that require little or no knowledge. In this work, we indeed focus on the so-called anomaly-based approaches, i.e., those that attempt to recognize threats by detecting any variation from a system's normal operation, rather than looking for signs of known-to-be-malicious activity.
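As a rough illustration of this difference, the following Python sketch contrasts a misuse-based check with a toy anomaly model applied to the username parameter of the requests shown earlier. Everything in it is an assumption made for the example: the single signature, the training values, the injected payload, and the length-and-character-set model are hypothetical and far simpler than the models discussed in this work.

```python
import re

# Hypothetical blacklist of known-bad patterns (misuse-based detection).
SIGNATURES = [re.compile(r"union\s+select", re.IGNORECASE)]


def blacklist_flags(value: str) -> bool:
    """Return True if the value matches any known attack signature."""
    return any(sig.search(value) for sig in SIGNATURES)


class CharsetModel:
    """Toy anomaly model: remembers the characters and the maximum
    length observed in attack-free values of one parameter."""

    def __init__(self) -> None:
        self.allowed: set[str] = set()
        self.max_len = 0

    def train(self, values: list[str]) -> None:
        for value in values:
            self.allowed.update(value)
            self.max_len = max(self.max_len, len(value))

    def is_anomalous(self, value: str) -> bool:
        # Anything longer than, or using characters outside of, what was
        # seen during training is reported as a deviation.
        too_long = len(value) > self.max_len
        unseen_chars = not set(value) <= self.allowed
        return too_long or unseen_chars


model = CharsetModel()
model.train(["myuser", "alice", "bob42"])  # assumed attack-free training data

payload = "' OR 'x'='x'--"  # hypothetical injection payload

print("blacklist:", blacklist_flags(payload))
print("anomaly model:", model.is_anomalous(payload))
```

The blacklist misses the payload because no signature describes it, whereas the anomaly model reports it only because it differs from the values seen during training. The same property makes such models prone to false alarms when legitimate behavior changes.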

Original Contributions

Our main research area is Intrusion Detection (ID). In particular, we focus on anomaly-based approaches to detect malicious activities. Since today's threats are complex, a single point of inspection is not effective. A more comprehensive monitoring system is desirable to protect the network, the applications running on a certain host, and the web applications (which are particularly exposed due to the immense popularity of the Web). Our contributions focus on the mitigation of both host-based and web-based attacks, along with two techniques to correlate alerts from hybrid sensors.


Host-based Anomaly Detection

Typical malicious processes can be detected by modeling the characteristics (e.g., type of arguments, sequences) of the system calls executed by the kernel, and by flagging unexpected deviations as attacks. Regarding this type of approach, our contributions focus on hybrid models to accurately characterize the behavior of a binary application. In particular:

- We enhanced, re-engineered, and evaluated a novel tool for modeling the normal activity of the Linux . kernel. Compared to other existing solutions, our system shows better detection capabilities and good contextualization of the alerts reported. These results are detailed in Section ..

- We engineered and evaluated an IDS to demonstrate that the combined use of (1) deterministic models to characterize a process' control flow and (2) stochastic models to capture normal features of the data flow leads to better detection accuracy. Compared to the existing deterministic and stochastic approaches taken separately, our system shows better accuracy, with almost zero false positives. These results are detailed in Section ..

- We adapted our techniques for forensics investigation. By running experiments on real-world data and attacks, we show that our system is able to detect hidden tamper evidence even though sophisticated anti-forensics tools (e.g., userland process execution) have been used. These results are detailed in Section ..
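The idea of flagging deviations from the normal sequences of system calls can be sketched in a few lines of Python. This is a deliberately simplified stand-in (a set of n-grams observed over system call names) rather than the models developed in this work; the traces, the window size, and the class name are hypothetical, and the stochastic models for the arguments are not shown.

```python
def ngrams(trace, n):
    """Return all windows of n consecutive system calls in a trace."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]


class SyscallSequenceModel:
    """Deterministic toy model: the set of n-grams seen during training."""

    def __init__(self, n=3):
        self.n = n
        self.known = set()

    def train(self, traces):
        for trace in traces:
            self.known.update(ngrams(trace, self.n))

    def deviations(self, trace):
        """Return the windows of the trace never observed in training."""
        return [g for g in ngrams(trace, self.n) if g not in self.known]


# Hypothetical traces of a process during normal operation.
normal_traces = [
    ["open", "read", "write", "close"],
    ["open", "read", "read", "write", "close"],
]

model = SyscallSequenceModel(n=3)
model.train(normal_traces)

# Hypothetical trace in which an unexpected execve appears.
suspicious = ["open", "read", "execve", "write", "close"]

for window in model.deviations(suspicious):
    print("unseen sequence:", " -> ".join(window))
```

Every window that contains the unexpected call is reported, which also gives a crude indication of where in the trace the deviation occurred.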

Web-based Anomaly Detection

Attempts to compromise a web application can be detected by modeling the characteristics (e.g., parameter values, character distributions, session content) of the HTTP messages exchanged between servers and clients during normal operation. This approach can detect virtually any attempt at tampering with HTTP messages, which is assumed to be evidence of an attack. In this research field, our contributions focus on training data scarcity issues along with the problems that arise when an application changes its legitimate behavior. In particular:


- We contributed to the development of a system that learns the legitimate behavior of a web application. Such behavior is defined by means of features extracted from 1) HTTP requests, 2) HTTP responses, and 3) SQL queries to the underlying database, if any. Each feature is extracted and learned by using different models, some of which are improvements over well-known approaches while others are original. The main contribution of this work is the combination of database query models with HTTP-based models. The resulting system has been validated through preliminary experiments that showed very high accuracy. These results are detailed in Section ...

- We developed a technique to automatically detect legitimate changes in web applications with the goal of suppressing the large number of false detections due to code upgrades, which are frequent in today's web applications. We ran experiments on real-world data to show that our simple but very effective approach accurately predicts changes in web applications and can distinguish good vs. malicious changes (i.e., attacks). These results are detailed in Section ..

- We designed and evaluated a machine learning technique to aggregate IDS models with the goal of ensuring good detection accuracy even when scarce training data is available. Our approach relies on clustering techniques and nearest-neighbor search to look up well-trained models used to replace under-trained ones, which are prone to overfitting and thus false detections. Experiments on real-world data have shown that almost every false alert due to overfitting is avoided with as low as - training samples per model. These results are described in Section ..

Although these techniques have been developed on top of a web-based anomaly detector, they are sufficiently generic to be easily adapted to other systems using learning approaches.
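The replacement of under-trained models with well-trained ones can be pictured with the following Python sketch. It is only a sketch under strong assumptions and not the procedure evaluated in this work: each model is reduced to the mean and standard deviation of parameter lengths, the nearest neighbor is searched on the mean length alone, no clustering is performed, and the threshold MIN_SAMPLES, the parameter names, and the training lengths are all hypothetical.

```python
import math

MIN_SAMPLES = 30  # hypothetical threshold below which a model is under-trained


class LengthModel:
    """Toy model of one request parameter: mean and standard deviation
    of the lengths of the values observed during training."""

    def __init__(self, name, lengths):
        self.name = name
        self.n = len(lengths)
        self.mean = sum(lengths) / self.n
        variance = sum((x - self.mean) ** 2 for x in lengths) / self.n
        self.std = math.sqrt(variance)

    def well_trained(self):
        return self.n >= MIN_SAMPLES

    def is_anomalous(self, value, k=3.0):
        # A value is anomalous if its length is more than k standard
        # deviations away from the mean (with a floor of 1 on the deviation).
        return abs(len(value) - self.mean) > k * max(self.std, 1.0)


def pick_model(model, knowledge_base):
    """Use the model itself if well trained; otherwise fall back to the
    well-trained model whose mean length is closest (1-D nearest neighbor)."""
    if model.well_trained():
        return model
    candidates = [m for m in knowledge_base if m.well_trained()]
    if not candidates:
        return model
    return min(candidates, key=lambda m: abs(m.mean - model.mean))


# Hypothetical knowledge base built from other, well-exercised applications.
knowledge_base = [
    LengthModel("session_id", [32] * 50),
    LengthModel("username", [6, 7, 8, 5, 9, 7] * 10),
]

# A parameter of a newly deployed application, seen only twice so far.
new_model = LengthModel("login", [6, 8])

chosen = pick_model(new_model, knowledge_base)
print("model used for 'login':", chosen.name)
print("40-character value anomalous:", chosen.is_anomalous("A" * 40))
```

In this toy setting the two observations of the new parameter are not used for detection directly; they only serve to select a model trained on enough samples to be reliable.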

Alert Correlation

IDS alerts are usually post-processed to generate compact reports and eliminate redundant, meaningless, or false detections. In this research field, our contributions focus on unsupervised techniques applied to aggregate and correlate alert events with the goal of reducing the effort of the security officer. In particular:

We developed and tested an approach that accounts for thecommon measurement errors (e.g., delays and uncertainties)that occur in the alert generation process. Our approach ex-ploits fuzzy metrics both to model errors and to constructan alert aggregation criterion based on distance in time. istechnique has been show to be more robust compared to clas-sic time-distance based aggregationmetrics. ese results aredetailed in Section ..

We designed and tested a prototype that models the alert generation process as a stochastic process. This setting allowed us to construct a simple, non-parametric hypothesis test that can detect whether two alert streams are correlated or not. Besides its simplicity, the advantage of our approach is that it requires no parameters. These results are described in Section ..

The aforementioned results have been published in the proceedings of international conferences and in international journals.

. Document Structure

This document is structured as follows. Chapter introduces ID, which is the topic of our research. In particular, Chapter rigorously describes all the basic components that are necessary to define the ID task and an IDS. The reader with knowledge of this subject may skip the first part of the chapter and focus on Section . and ., which include a comprehensive review of the most relevant and influential modern approaches to network-, host-, and web-based ID techniques, along with a separate overview of the alert correlation approaches.

As described in Section ., the description of our contributions is structured into three chapters. Chapter focuses on host-based techniques, Chapter covers web-based anomaly detection, while Chapter describes two techniques that make it possible to recognize relations between alerts reported by network- and host-based systems. Reading Section .. is recommended before reading Chapter .


The reader interested in protection techniques for the operating system can skim through Section .. and then read Chapter . The reader interested in web-based protection techniques can read Section .. and then Chapter . Similarly, the reader interested in alert correlation systems can skim through Section .. and .. and then read Chapter .

Detecting Malicious Activity

Malicious activity is a generic term that refers to automated or manual attempts to compromise the security of a computer infrastructure. Examples of malicious activity include the output generated by (or the effect on a system of) the execution of script-kiddie tools, viruses, DoS attacks, exploits of Cross-Site Scripting (XSS) or SQL injection vulnerabilities, and so forth. This chapter describes the research tools and methodologies available to detect and mitigate malicious activity on a network, a single host, a web server, and the combination of the three.

First, the background concepts and the ID problem are described, along with a taxonomic description of the most relevant aspects of an IDS. Secondly, a detailed survey of selected state-of-the-art anomaly detection approaches is presented with the help of further classification dimensions. In addition, the problem of alert correlation, which is an orthogonal topic, is described and the most relevant recent research approaches are reviewed. Last but not least, the problem of evaluating an IDS is presented to provide the essential terminology and criteria needed to understand the effectiveness of both the reviewed literature and our original contributions.


Figure .: Generic and comprehensive deployment scenario for IDSs. (Diagram labels: Internet, clients, customers' clients, broadband network, ISP network, virtual hosting, customers' servers, DBs. Deployable protection mechanisms in the legend: anti-malware, host-based IDS, network-based IDS, web-based IDS.)

. Intrusion Detection

ID is the practice of analyzing a system to find malicious activity. This section defines this concept more precisely by means of simple but rigorous definitions. The context of such definitions is the generic scenario of an Internet site, for instance, the network of an ISP. An example is depicted in Figure .. An Internet site, and in general the Internet itself, is the state-of-the-art computer infrastructure. In fact, it is a network that adopts almost every kind of known computer technology (e.g., protocols, standards, containment measures), and it runs a rich variety of servers such as HTTP, File Transfer Protocol (FTP), Secure SHell (SSH), and Virtual Private Network (VPN) servers to support a broad spectrum of sophisticated applications and services (e.g., web applications, e-commerce applications, the Facebook Platform, Google Applications). In addition, a generic Internet site receives and processes traffic from virtually any user connected to the Net and thus represents the perfect research testbed for IDSs.

Definition .. (System) A system is the domain of interest for security.

Note that a system can be a physical or a logical one. For instance, a physical system could be a host (e.g., the DataBase (DB) server or one of the client machines shown in the figure) or a whole network (e.g., the ISP network shown in the figure); a logical system can be an application or a service, such as one of the web services run in a virtual machine deployed in the ISP network. While running, each system produces activity, which we define as follows.

Definition .. (System activity) A system activity is any data generated while the system is working. Such activity is formalized as a sequence of events $I = [I_1, I_2, \ldots, I_i, \ldots, I_N]$.

For instance, each of the clients of Figure . produces system logs: in this case $I$ would contain an $I_i$ for each log entry. A human-readable representation follows.

    chatWithContact:(null)] got a nil targetContact.
    Aug 18 00:29:40 [0x0-0x1b01b].org.mozilla.firefox[0]: NPP_Initialize called
    Aug 18 00:29:40 [0x0-0x1b01b].org.mozilla.firefox[0]: 2009-08-18 00:29:40.039 firefox-bin[254:10b] NPP_New(instance=0x169e0178,mode=2,argc=5)
    Aug 18 00:29:40 [0x0-0x1b01b].org.mozilla.firefox[0]: 2009-08-18 00:29:40.052 firefox-bin[254:10b] NPP_NewStream end=396239

Similarly, the web servers generate HTTP requests and responses: in this case $I$ would contain an $I_i$ for each HTTP message. Its human-readable representation follows.

    /media//images/favicon.ico HTTP/1.0 200 1150 - Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.10) Gecko/2009042315 Firefox/3.0.10 Ubiquity/0.1.4
    128.111.48.4 [20/May/2009:15:26:44 -0700] POST /report/ HTTP/1.0 200 19171 http://www.phonephishing.info/report/ Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.10) Gecko/2009042315 Firefox/3.0.10 Ubiquity/0.1.4
    128.111.48.4 [20/May/2009:15:26:44 -0700] GET /media//css/main.css HTTP/1.0 200 5525 http://www.phonephishing.info/report/ Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.10) Gecko/2009042315 Firefox/3.0.10 Ubiquity/0.1.4
    128.111.48.4 [20/May/2009:15:26:44 -0700] GET /media//css/roundbox.css HTTP/1.0 200 731 http://www.phonephishing.info/media//css/main.css Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.10) Gecko/2009042315 Firefox/3.0.10 Ubiquity/0.1.4

Other examples of system activity are described in Section ... In this document, we often use the term normal behavior to refer to a set of characteristics (e.g., the distribution of the characters of string parameters, the mean and standard deviation of the values of integer parameters) extracted from the system activity gathered during normal operation (i.e., without being compromised). Moreover, in the remainder of this document, we need other definitions.

Definition .. (Activity Profile) The activity profile (or activity model) $c_I$ is a set of models

$$c_I = \langle m^{(1)}, \ldots, m^{(u)}, \ldots, m^{(U)} \rangle$$

generated by extracting features from the system activity $I$.

This definition will be used in Section ..., and Example .. and .. describe instances of $m^{(u)}$. An example of a real-world profile is described in Example ... We can now define the system behavior.

Definition .. (System Behavior) The system behavior is the set of features (or models), along with their numeric values, extracted by (or contained in) the activity profile.

In particular, we will use this term as a synonym of normal system behavior, referring to the system behavior during normal operation.
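As an illustration of the definitions above, the following minimal sketch builds a toy activity profile made of two models: the mean and standard deviation of an integer parameter, and the character distribution of a string parameter. The event layout (a dict with `length` and `value` keys) and all class and function names are assumptions made for this example only; they are not part of the systems described in this document.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class IntegerModel:
    """Mean and standard deviation of an integer parameter."""
    mu: float
    sigma: float


@dataclass
class CharDistributionModel:
    """Relative frequency of each character seen in a string parameter."""
    freq: dict


def build_profile(events):
    """Build a toy profile c_I = <m(1), m(2)> from a list of events.

    Each event is a hypothetical dict with an integer 'length' and a
    string 'value'; real system activity is far richer than this.
    """
    lengths = [e["length"] for e in events]
    chars = Counter("".join(e["value"] for e in events))
    total = sum(chars.values())
    return (
        IntegerModel(mean(lengths), pstdev(lengths)),
        CharDistributionModel({c: n / total for c, n in chars.items()}),
    )


events = [
    {"length": 4, "value": "abba"},
    {"length": 6, "value": "abcabc"},
]
profile = build_profile(events)
print(profile[0])
```

In this sketch the system behavior is simply the pair of numeric summaries stored in `profile`.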

Given the high accessibility of the Internet, publicly available systems such as web servers, web applications, and DB servers are constantly at risk. In particular, they can be compromised with the goal of stealing valuable information, deploying infection kits, or running phishing and spam campaigns. These are all examples of intrusions. More formally:

Definition .. (Intrusion) An intrusion is the automated or manual act of violating one or more security paradigms (i.e., confidentiality, integrity, availability) of a system. Intrusions are formalized as a sequence of events:

$$O = [O_1, O_2, \ldots, O_i, \ldots, O_M] \subseteq I$$

Typically, when an intrusion takes place, a system behaves unexpectedly and, as a consequence, its activity differs from that observed during normal operation. This is because an attack or the execution of malware code often exploits vulnerabilities to bring the system into states it was not designed for. For this reason, the activity that the system generates is called malicious activity; often, this term is also used to indicate the attack or the malware code execution itself. An example of $O_i$ is the following log entry: it shows evidence of an XSS attack that makes a vulnerable page display the arbitrary content supplied as part of the GET request (while the page was not intentionally designed for this purpose).

/report/add/comment//

HTTP/1.0 200 731 http://www.phonephishing.info/report/add/Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.10)Gecko/2009042315 Firefox/3.0.10 Ubiquity/0.1.4

Note .. We adopt a simplified representation of intrusions with respect to a dataset $D$ (i.e., both normal and malicious): with $O \subseteq D$ we indicate that the activity $I$ contains malicious events $O$; however, strictly speaking, intrusion events and activity events can be of completely different types, thus the relation $\subseteq$ may not be defined in a strict mathematical sense.

If a system is well designed, any intrusion attempt always leaves some traces in the system's activity. These traces are called tamper evidence and are essential to perform intrusion detection.

Definition .. (Intrusion Detection) Intrusion detection is the separation of intrusions from normal activity through the analysis of the activity of a system, while the system is running. Intrusions are marked as alerts, which are formalized as a sequence of events $A = [A_1, A_2, \ldots, A_i, \ldots, A_L] \subseteq O$.

Similarly, $A \subseteq O$ must not be interpreted in a strict sense: it is just a notation to indicate that, for each intrusion, an alert may or may not exist. Note that ID can also be performed by manually inspecting a system's activity. This is clearly a tedious and inefficient task, thus research and industrial interests are focused on automatic approaches.

Definition .. (Intrusion Detection System) An intrusion detection system is an automatic tool that performs the ID task.

Figure .: Abstract I/O model of an IDS (input: $I$, $O$; block: IDS; output: $A$).

Given the above definitions, an abstract model of an IDS is shown in Figure ..

Although each IDS relies on its own data model and format to represent $A$, the Internet Engineering Task Force (IETF) proposed the Intrusion Detection Message Exchange Format (IDMEF) [Debar et al., ] as a common format for reporting alert streams generated by different IDSs.

.. Evaluation

Evaluating an IDS means running it on collected data $D = I \cup O$ that resembles real-world scenarios. This means that such data includes both intrusions $O$ and normal activity $I$, i.e., $|I|, |O| > 0$ and $I \cap O = \emptyset$; in other words, $I$ must include no malicious activity other than $O$. The system is run in a controlled environment to collect $A$ along with performance indicators for comparison with other systems. This section presents the basic criteria used to evaluate modern IDSs.

More precisely, to perform an evaluation experiment correctly a fundamental hypothesis must hold: the set $O$ must be known and perfectly distinguishable from $I$. In other words, $D$ must be labeled with a truth file, i.e., a list of all the events known to be malicious. This makes it possible to treat the ID problem as a classic classification problem, for which a set of well-established evaluation metrics is available. In particular, we are interested in calculating the following sets.

Definition .. (True Positives (TPs)) The set of true positives is

$$TP := \{A_i \in A \mid \exists O_j : f(A_i) = O_j\},$$

where $f : A \mapsto O$ is a generic function that, given an alert $A_i$, finds the corresponding intrusion $O_j$ by parsing the truth file. The TP set is basically the set of alerts that are fired because a real intrusion has taken place. The perfect IDS is such that $TP \equiv O$.


Definition .. (False Positives (FPs)) The set of false positives is

$$FP := \{A_i \in A \mid \nexists O_j : f(A_i) = O_j\}.$$

The alerts in FP, on the other hand, are incorrect because no real intrusion can be found in the observed activity. The perfect IDS is such that $FP = \emptyset$.

Definition .. (True Negatives (TNs)) The set of true negatives is

$$TN := \{I_j \in I \mid \nexists A_i : f(A_i) = I_j\}.$$

Note that the set of TNs does not contain alerts. Basically, it is the set of correctly unreported alerts. The perfect IDS is such that $TN \equiv I$.

Definition .. (False Negatives (FNs)) The set of false negatives is

$$FN := \{O_j \in O \mid \nexists A_i : f(A_i) = O_j\}.$$

Similarly to TN, FN does not contain alerts. Basically, it is the set of incorrectly unreported alerts. The perfect IDS is such that $FN = \emptyset$. Note that, when the four quantities are expressed as fractions of all the events in $D$, $TP + TN + FP + FN = 1$ must hold.

In this and other documents, the term false alert refers to $FN \cup FP$. Given the aforementioned sets, aggregated measures can be calculated.
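The four sets defined above can be sketched in a few lines. This is a simplified illustration, not the evaluation code used in this work: events are identified by integer ids, and an alert is identified with the id of the event it refers to, which plays the role of the function $f$. Both choices, as well as the function name, are assumptions of the example.

```python
def confusion_sets(activity, intrusions, alerts):
    """Split events into TP, FP, TN and FN given a truth file.

    activity   -- ids of all the observed events (the dataset D)
    intrusions -- ids listed in the truth file (the set O)
    alerts     -- ids of the events reported by the IDS (the set A)
    """
    activity, intrusions, alerts = set(activity), set(intrusions), set(alerts)
    normal = activity - intrusions      # the set I
    tp = alerts & intrusions            # alerts matching a real intrusion
    fp = alerts & normal                # alerts with no matching intrusion
    tn = normal - alerts                # normal events, correctly unreported
    fn = intrusions - alerts            # intrusions, incorrectly unreported
    return tp, fp, tn, fn


D = range(10)          # ten observed events
O = {2, 5, 7}          # truth file: three intrusions
A = {2, 3, 7}          # the IDS fired three alerts
tp, fp, tn, fn = confusion_sets(D, O, A)
print(sorted(tp), sorted(fp), sorted(tn), sorted(fn))
```

Since every event falls in exactly one set, the four cardinalities add up to the size of the dataset.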

Definition .. (Detection Rate (DR)) The detection rate, or true positive rate, is defined as:

$$DR := \frac{TP}{TP + FN}.$$

The perfect IDS is such that $DR = 1$. Thus, the DR measures the detection capabilities of the system, that is, the fraction of malicious events correctly classified and reported as alerts. On the other hand, the False Positive Rate (FPR) is defined as follows.

Definition .. (FPR) The false positive rate is defined as:

$$FPR := \frac{FP}{FP + TN}.$$

Figure .: The ROC space. (x axis: False Positive Rate (FPR), from 0 to 1; y axis: Detection Rate (DR), from 0 to 1; the plotted line is the random classifier.)

The perfect IDS is such that $FPR = 0$. Thus, the FPR measures the inaccuracies of the system, that is, the fraction of legitimate events incorrectly classified and reported as alerts. There are also other metrics, such as accuracy and precision, which are often used to evaluate information retrieval systems; however, these metrics are not popular for IDS evaluation.
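As a small worked example of the two definitions, the sketch below computes DR and FPR from hypothetical counts (3 intrusions and 7 normal events); the counts and function names are illustrative assumptions.

```python
def detection_rate(tp: int, fn: int) -> float:
    """DR = TP / (TP + FN): fraction of intrusions that raised an alert."""
    return tp / (tp + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """FPR = FP / (FP + TN): fraction of normal events reported as alerts."""
    return fp / (fp + tn)


# Hypothetical counts: 2 detected and 1 missed intrusion,
# 1 false alert and 6 normal events correctly ignored.
tp, fn, fp, tn = 2, 1, 1, 6
dr = detection_rate(tp, fn)
fpr = false_positive_rate(fp, tn)
print(f"DR = {dr:.3f}, FPR = {fpr:.3f}")  # prints "DR = 0.667, FPR = 0.143"
```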

The ROC analysis, originally adopted to measure the transmission quality of radar systems, is often used to produce a compact and easy-to-understand evaluation of classification systems. Even though there is no standardized procedure to evaluate an IDS, the research community agrees on the use of ROC curves to compare the detection capabilities and quality of IDSs. A ROC curve is the plot of $DR = DR(FPR)$ and is obtained by tuning the IDS to trade off False Positives (FPs) against true positives. Without going into the details, each point of a ROC curve corresponds to a fixed pair of DR and FPR values calculated under certain conditions (e.g., sensitivity parameters). By modifying the configuration of the IDS, the quality of the classification changes and other points of the ROC curve are determined. The ROC space is plotted in Figure . along with the performance of a random classifier, characterized by $DR = FPR$. The perfect, ideal IDS can increase its DR from 0 to 1 with $FPR = 0$; however, it must hold that $FPR \to 1 \Rightarrow DR \to 1$.
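The construction of a ROC curve can be sketched as follows, under the assumption that the IDS assigns each event an anomaly score and that its only sensitivity parameter is a threshold on that score. The scores, labels, and function name are hypothetical.

```python
def roc_points(scores, labels):
    """Return one (FPR, DR) pair per candidate threshold.

    scores -- anomaly score of each event (higher means more suspicious)
    labels -- True if the truth file marks the event as an intrusion
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    # Sweep from the strictest threshold (nothing is reported) to the
    # loosest one (everything is reported).
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        points.append((fp / negatives, tp / positives))
    return points


scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False]
curve = roc_points(scores, labels)
for fpr, dr in curve:
    print(f"FPR = {fpr:.2f}  DR = {dr:.2f}")
```

The first point is always (0, 0) and the last is always (1, 1), which matches the remark that DR tends to 1 as FPR tends to 1.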


.. Alert Correlation

The problem of intrusion detection is challenging in today's complex networks. In fact, it is common to have more than one IDS deployed, monitoring different segments and different aspects of the whole infrastructure (e.g., hosts, applications, network). The number of alerts reported by a network of IDSs running in a complex computer infrastructure is larger, by several orders of magnitude, than what was common in the smaller networks monitored years ago. In such a context, network administrators are overloaded with alerts and long security reports that often contain a non-negligible number of FPs. Thus, the creation of a clean, compact, and unified view of the security status of the network is needed. This process is commonly known as alert correlation [Valeur et al., ] and it is currently one of the most difficult challenges of this research field. More precisely:

Definition .. (Alert Correlation) Alert correlation is the identification of relations among alerts

$$A_1, A_2, \ldots, A_i, \ldots, A_L \in A^1 \cup A^2 \cup \cdots \cup A^K$$

to generate a unique, more compact and comprehensive sequence of alerts $A' = [A'_1, A'_2, \ldots, A'_P]$.

A desirable property is that $A'$ should be as complete as $A^1 \cup A^2 \cup \cdots \cup A^K$ without introducing errors such as alerts that do not correspond to a real intrusion. As for ID, alert correlation can be a manual, and tedious, task. Clearly, automatic alert correlation systems are more attractive and can be considered a complement of a modern IDS.
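To make the definition concrete, the sketch below fuses several alert streams into a single, more compact sequence by grouping alerts on the same target that are close in time. It is a deliberately naive criterion chosen for illustration, not one of the correlation techniques contributed by this work; the `Alert` fields and the 5-second window are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    time: float   # seconds since the start of the observation
    target: str   # attacked host or service (hypothetical field)
    sensor: str   # the IDS that raised the alert


def correlate(streams, window=5.0):
    """Fuse K alert streams into one sequence of alert groups.

    An alert joins the open group of its target when it arrives at most
    `window` seconds after the last alert of that group; otherwise it
    opens a new group.
    """
    merged = sorted((a for s in streams for a in s), key=lambda a: a.time)
    groups = []
    open_group = {}  # target -> most recent group for that target
    for alert in merged:
        group = open_group.get(alert.target)
        if group is not None and alert.time - group[-1].time <= window:
            group.append(alert)
        else:
            group = [alert]
            groups.append(group)
            open_group[alert.target] = group
    return groups


nids = [Alert(10.0, "web", "nids"), Alert(60.0, "db", "nids")]
hids = [Alert(12.5, "web", "hids"), Alert(100.0, "web", "hids")]
fused = correlate([nids, hids])
print(len(fused), "groups from", len(nids) + len(hids), "alerts")
```

Here four alerts from two sensors are reduced to three groups, and no alert is dropped, which reflects the completeness property stated above.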

Definition .. (Alert Correlation System) An alert correlation system is an automatic tool that performs the alert correlation task.

The overall IDS abstract model is complemented with an alert correlation engine, as shown in Figure ..

.. Taxonomic Dimensions

Figure .: Abstract I/O model of an IDS with an alert correlation system (input: $I$, $O$; IDS outputs: $A^1 \ldots A^n$; ACS output: $A'$).

The first comprehensive taxonomy of IDSs was proposed in [Debar et al., ] and revised in [Debar et al., ]. Another good survey appeared in [Axelsson, a].

Compared to the classic surveys found in the literature, this section complements the basic taxonomic dimensions by focusing on modern techniques. In particular, today's intrusion detection approaches can be categorized by means of the specific modeling techniques that have appeared in recent research.

It is important to note that the taxonomic dimensions suggested here are not exhaustive, thus certain IDSs may not fit into them (e.g., [Costa et al., ; Portokalidis et al., ; Newsome and Song, ]). On the other hand, an exhaustive and detailed taxonomy would be difficult to read. To overcome this difficulty, in this section we describe a high-level, technique-agnostic taxonomy based on the dimensions summarized in Table .; in each sub-section of Section ., which focuses on anomaly-based models, we expand the taxonomic dimensions by listing and accurately detailing further classification criteria.

... Type of model

IDSs must be divided into two opposite groups: misuse-based vs. anomaly-based. The former create models of malicious activity, while the latter create models of normal activity. Misuse-based models look for patterns of malicious activity; anomaly-based models look for unexpected activity. In some sense, IDSs can either blacklist or whitelist the observed activity.

Typically, the first type of model consists of a database of all the known attacks. Besides requiring frequent updates (which is just a technical difficulty and can be easily automated), misuse-based systems assume that it is feasible to enumerate all the malicious activity.

Despite the limitation of being inherently incomplete, misuse-based systems are widely adopted in the real world [Roesch, , ]. This is mainly due to their simplicity (i.e., attack models are triggered by means of pattern matching algorithms) and accuracy (i.e., they generate virtually no false alerts because an attack signature can either match or not).

Anomaly-based approaches are more complex because creating a specification of normal activity is obviously a difficult task. For this reason, there are no well-established and widely adopted techniques; misuse models, in contrast, are only as sophisticated as a pattern matching problem. In fact, research on anomaly-based systems is very active.

These systems are effective only under the assumption that malicious activity, such as an attack or malware being executed, always produces deviations from normal activity that are sufficient to trigger the models, i.e., anomalies. This clearly has the positive side effect of requiring zero knowledge of the malicious activity, which makes these systems particularly interesting. The negative side effect is their tendency to produce a significant number of false alerts (the notion of a significant number of false alerts will be discussed in detail in Section ..).

Obviously, an IDS can benefit from both techniques by, for instance, enumerating all the known attacks and using anomaly-based heuristics to flag suspicious activity. This practice is often adopted by modern anti-virus products.

It is important to remark that, differently from the other taxonomic dimensions, the distinction between misuse- and anomaly-based approaches is fundamental: they are based on opposite hypotheses and lead to completely different system designs and results. Their duality is highlighted in Table ..

Example .. (Misuse vs. Anomaly) A misuse-based system M and an anomaly-based system A process the same log containing a full dump of the system calls invoked by the kernel of an audited machine. Log entries are in the form:

(, , ...)

Table .: Duality between misuse- and anomaly-based intrusion detection techniques. Note that an anomaly-based IDS can detect "Any" threat, under the assumption that an attack always generates a deviation in the modeled activity.

    Feature             Misuse-based   Anomaly-based
    Modeled activity    Malicious      Normal
    Detection method    Matching       Deviation
    Threats detected    Known          Any
    False negatives     High           Low
    False positives     Low            High
    Maintenance cost    High           Low
    Attack desc.        Accurate       Absent
    System design       Easy           Difficult

The system M has the following simple attack model (generated from the real-world exploit http://milw0rm.com/exploits/9303):

    if (function_name == "read") {
        /* ... */
        if (match(decode(arg3_value),
                  "a{4}b{4}c{4}d{4}e{4}\
                   f{4}...x{4}3RH~TY7{33}QZjAXP0A0AkAAQ2AB2BB0BBAB\
                   XP8ABuJIXkweaHrJwpf02pQzePMhyzWwSuQnioXPOHuBxKn\
                   aQlkOjpJHIvKOYokObPPwRN1uqt5PA..."))
            fire_alert("VLC bug 35500 is being exploited!");
        /* ... */
    }

The simple attack signature looks for a pattern generated from the exploit. If the content of the buffer (i.e., arg3_value) that stores the malicious file matches the given pattern, then an alert is fired. On the other hand, the system A has the following model, based on the sample character distribution of each file's content. Such frequencies are calculated during normal operation of the application.

/* ... */ cd[


Obviously, more sophisticated models can be designed. The purpose of this example is to highlight the main differences between the two approaches.
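The contrast between the two approaches can also be sketched in a self-contained way. The snippet below is a toy analogue of the example, not the models of systems M and A: the misuse model is a single hypothetical regular expression for one known attack, while the anomaly model learns a character distribution from normal entries and scores new entries by their L1 distance from it. The signature, the training strings, and the distance measure are all assumptions.

```python
import re
from collections import Counter

# Misuse model: one signature for one known attack (hypothetical pattern).
SIGNATURE = re.compile(r"(?:%3C|<)script", re.IGNORECASE)


def misuse_detect(entry):
    """Fire only when the entry matches the known-attack signature."""
    return SIGNATURE.search(entry) is not None


def char_freq(text):
    """Relative frequency of each character in text."""
    counts = Counter(text)
    return {c: n / len(text) for c, n in counts.items()}


def train(samples):
    """Anomaly model: character distribution of the normal entries."""
    return char_freq("".join(samples))


def anomaly_score(model, entry):
    """L1 distance between distributions: 0 = identical, 2 = disjoint."""
    observed = char_freq(entry)
    chars = set(model) | set(observed)
    return sum(abs(model.get(c, 0.0) - observed.get(c, 0.0)) for c in chars)


model = train(["/report/add/", "/media/css/main.css", "/report/"])
known = "/report/add/comment/<script>alert(1)</script>"
unknown = "/report/add/%90%90%90%90%90%90%90%90"

print(misuse_detect(known), misuse_detect(unknown))
print(anomaly_score(model, "/report/add/"), anomaly_score(model, unknown))
```

The signature fires on the known attack but is blind to the unknown one, whereas the anomaly score of the unknown entry is clearly higher than that of a normal entry, at the price of having to choose a threshold.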

A generalization of the aforementioned examples allows us to better define an anomaly-based IDS.

Definition .. (Anomaly-based IDS) An anomaly-based IDS is a type of IDS that generates alerts $A$ by relying on normal activity profiles (Definition ..).

... System activity

IDSs can be classified based on the type of activity they monitor. The classic literature distinguishes between network-based and host-based systems: Network-based Intrusion Detection Systems (NIDSs) and Host-based Intrusion Detection Systems (HIDSs), respectively. The former inspect network traffic (i.e., raw bytes sniffed from the network adapters), and the latter inspect the activity of the operating system. The scope of a network-based system is as large as the broadcast domain of the monitored network. On the other hand, the scope of host-based systems is limited to the single host.

Network-based IDSs have the advantage of a large scope, while host-based ones have the advantage of being fed with detailed information about the host they run on (e.g., process information, Central Processing Unit (CPU) load, number of active processes, number of users). This information is often unavailable to network-based systems and can be useful to refine a decision regarding suspicious activity. For example, by inspecting both the network traffic and the kernel activity, an IDS can filter the alerts regarding the Apache web server version .. on all the hosts running version ... On the other hand, network-based systems are centralized and are much easier to manage and deploy. However, NIDSs are limited to the inspection of unencrypted payload, while HIDSs may have it decrypted by the application layer. For example, a NIDS cannot detect malicious HTTPS traffic.

The network stack is standardized (see Note ..), thus the definition of network-based IDS is precise. On the other hand, because of the immense variety of operating system implementations, a clear definition of host data is lacking. Some existing host-based IDSs analyze audit log files in several formats, while other systems keep track of the commands issued by the users through the console. Some efforts have been made to propose standard formats for host data: Basic Security Module (BSM) and its modern re-implementation called OpenBSM [Watson and Salamon, ; Watson, ] are probably the most used by the research community, as they allow developers to gather the full dump of the system calls before execution in the kernel.

Note .. Although the network stack implementation may vary from system to system (e.g., Windows and Cisco platforms have different implementations of the Transmission Control Protocol (TCP)), it is important to underline that the notion of an IP, TCP, or HTTP packet is well defined in a system-agnostic way, while the notion of operating system activity is rather vague and by no means standardized.

Example .. describes a sample host-based misuse detection system that inspects the arguments of the system calls. A similar, but more sophisticated, example based on network traffic is Snort [Roesch, , ].

Note .. (Inspection layer) Network-based systems can be further categorized by their protocol inspection capabilities, with respect to the network stack. webanomaly [Kruegel et al., ; Robertson, ] is network-based, in the sense that it runs in promiscuous mode and inspects network data. On the other hand, it is also HTTP-based, in the sense that it decodes the payload and reconstructs the HTTP messages (i.e., requests and responses) to detect attacks against web applications.

... Model construction method

Regardless of their type, misuse- or anomaly-based, models can be specified either manually or automatically. However, by their nature, misuse models are often manually written because they are based on the exhaustive enumeration of known malicious activity. Typically, these models are called attack signatures; the largest repository of manually generated misuse signatures is released by the Sourcefire Vulnerability Research Team [Sourcefire, ]. Misuse signatures can also be generated automatically: for instance, in [Singh et al., ] a method to build misuse models of worms is described. Similarly, low-interaction honeypots often use malware emulation to automatically generate signatures. Two recently proposed techniques are [Portokalidis et al., ; Portokalidis and Bos, ].

On the other hand, anomaly-based models are more suitable for automatic generation. Most of the anomaly-based approaches in the literature focus on unsupervised learning mechanisms to construct models that precisely capture the normal activity observed. Manually specified models are typically more accurate and less prone to FPs, although automatic techniques are clearly more desirable.

Example .. (Learning character distributions) In Example .., the described system A adopts a learning-based character distribution model for strings. Without going into the details, the model described in [Mutz et al., ] observes string arguments and estimates the character distribution over the American Standard Code for Information Interchange (ASCII) set. More practically, the model is a histogram $H(c), \forall c \in [0, 255]$, where $H(c)$ is the normalized frequency of character $c$. During detection, a $\chi^2$ test is used to decide whether or not a certain string is deemed anomalous.

Besides the obvious advantage of being resilient against evasion, this model requires no human intervention.
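A minimal sketch of such a histogram model follows. It is a simplification under stated assumptions, not the method of [Mutz et al., ]: the histogram is estimated over raw byte values with additive smoothing (so that every expected count is positive), and the Pearson $\chi^2$ statistic is returned as a score instead of being compared against a critical value. The training strings and function names are hypothetical.

```python
from collections import Counter

ALPHABET = 256  # byte values 0..255


def learn_histogram(samples, smoothing=1.0):
    """Estimate H(c) from training strings, with additive smoothing.

    Smoothing keeps every expected count positive so that the statistic
    below is always defined; it is an assumption of this sketch.
    """
    counts = Counter(b"".join(samples))
    total = sum(counts.values()) + smoothing * ALPHABET
    return [(counts.get(c, 0) + smoothing) / total for c in range(ALPHABET)]


def chi_square(hist, data):
    """Pearson statistic of the observed byte counts against H(c)."""
    observed = Counter(data)
    n = len(data)
    return sum(
        (observed.get(c, 0) - n * hist[c]) ** 2 / (n * hist[c])
        for c in range(ALPHABET)
    )


training = [b"alice", b"bob", b"carol", b"dave", b"eve"] * 20
hist = learn_histogram(training)

normal = chi_square(hist, b"carla")
attack = chi_square(hist, b"\x90" * 5)
print(f"normal = {normal:.1f}, attack = {attack:.1f}")
```

A string made of bytes never seen during training yields a statistic that is orders of magnitude larger than that of a string resembling the training data; a real detector would compare the statistic against a threshold chosen for the desired significance level.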


. Relevant Anomaly Detection Techniques

Our research focuses on anomaly detection. In this section, selected state-of-the-art approaches are reviewed, with particular attention to network-, host-, and web-based techniques, along with the most influential approaches in the recent literature on alert correlation. This section provides the reader with the basic concepts needed to understand our contributions.

.. Network-based techniques

Our research does not include network-based IDSs. Our contributions in alert correlation, however, leverage both network- and host-based techniques, thus a brief overview of the latest (i.e., proposed between and ) network-based detection approaches is provided in this section. We remark that all the techniques included in the following are based on TCP/IP, meaning that models of normal activity are constructed by inspecting the decoded network frames up to the TCP layer.

In Table . the selected approaches are marked with bullets to highlight their specific characteristics. Such characteristics are based on our analysis and experience, thus other classifications may be possible. They are defined as follows:

Time refers to the use of timestamp information, extracted from network packets, to model normal packets. For example, normal packets may be modeled by their minimum and maximum inter-arrival time.

Header means that the TCP header is decoded and the fields are modeled. For example, normal packets may be modeled by the observed port range.

Payload refers to the use of the payload, either at the Internet Protocol (IP) or TCP layer. For example, normal packets may be modeled by the most frequent byte in the observed payloads.

Stochastic means that stochastic techniques are exploited to create models. For example, the model of normal packets may be constructed by estimating the sample mean and variance of certain features (e.g., port number, content length).

Deterministic means that certain features are modeled following a deterministic approach. For example, normal packets may be only those containing a specified set of values for the Time To Live (TTL) field.

Clustering refers to the use of clustering (and subsequent classification) techniques. For instance, payload byte vectors may be compressed using a Self Organizing Map (SOM), where classes of different packets will stimulate neighboring nodes.

Note that, since recent research has experimented with several techniques and algorithms, mixed approaches exist and often lead to better results.
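The per-characteristic examples given above can be put together in a small sketch that learns one toy model for each of the Time, Header, Payload, Stochastic, and Deterministic dimensions (Clustering is left out for brevity). The `Packet` structure and the `learn` function are hypothetical and stand in for already-decoded traffic; none of this reproduces a specific system from the literature.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean, pvariance


@dataclass
class Packet:
    """Hypothetical, already-decoded packet."""
    timestamp: float
    dst_port: int
    ttl: int
    payload: bytes


def learn(packets):
    """Learn one toy model of normal traffic per characteristic."""
    gaps = [b.timestamp - a.timestamp for a, b in zip(packets, packets[1:])]
    ports = [p.dst_port for p in packets]
    lengths = [len(p.payload) for p in packets]
    byte_counts = Counter(b"".join(p.payload for p in packets))
    return {
        # Time: minimum and maximum inter-arrival time
        "time": (min(gaps), max(gaps)),
        # Header: observed port range
        "header": (min(ports), max(ports)),
        # Payload: most frequent byte value
        "payload": byte_counts.most_common(1)[0][0],
        # Stochastic: sample mean and variance of the payload length
        "stochastic": (mean(lengths), pvariance(lengths)),
        # Deterministic: set of TTL values considered normal
        "deterministic": {p.ttl for p in packets},
    }


packets = [
    Packet(0.0, 80, 64, b"aaab"),
    Packet(0.5, 443, 64, b"abc"),
    Packet(2.0, 80, 128, b"a"),
]
models = learn(packets)
print(models)
```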

In [Mahoney and Chan, ] a mostly deterministic, simple detection technique is presented. During training, each field of the header of the TCP packets is extracted and tokenized into bytes bins (for memory efficiency reasons). The tokenized values are clustered by means of their values, and every time a new value is observed the clustering is updated. The detection approach is deterministic, since a packet is classified as anomalous if the values of its header do not match any of the clusters. Besides being completely unsupervised, this technique detects between and of the probe and DoS attacks, respectively, in Intrusion Detection eVALuation (IDEVAL) [Lippmann et al., ]. Slight performance issues and a rate of FPs per day (i.e., roughly, false alert every hours) are the only disadvantages of the approach.

The approach described in [Kruegel et al., ] reconstructs the payload stream for each service (i.e., port); this prevents attackers from evading the detection mechanism by using packet fragmentation. In addition, a basic application inspection is performed to distinguish among the different types of request of the specific service, e.g., for HTTP the service type could be GET, POST, HEAD. The sample mean and variance of the content length are also calculated, and the distribution of the bytes (interpreted as ASCII characters) found in the payload is estimated using simple histograms. The anomaly score used for detection aggregates information regarding the type of service, expected content length, and payload distribution. With a low performance overhead and low FPR, this system is capable of detecting anomalous interactions with the application layer. One critique is that the system has not been tested on several applications other than Domain Name System (DNS).

Table .: Taxonomy of the selected state-of-the-art approaches for network-based anomaly detection. (Columns: Approach, Time, Header, Payload, Stochastic, Deterministic, Clustering. Rows: [Mahoney and Chan, ]; [Kruegel et al., ]; [Sekar et al., ]; [Ramadas, ]; [Mahoney and Chan, b]; [Zanero and Savaresi, ]; [Wang and Stolfo, ]; [Zanero, b]; [Bolzoni et al., ]; [Wang et al., ].)

Probably inspired by the misuse-based, finite-state technique described in [Vigna and Kemmerer, ], in [Sekar et al., ] the authors describe a system to learn the TCP specification. The basic finite state model is extended with a network of stochastic properties (e.g., the frequency of certain transitions, the most common value of a state attribute, the distribution of the values found in the fields of the IP packets) among states and transitions. Such properties are estimated during training and exploited at detection to implement smoother thresholds that ensure as low as . false alerts per day. On the other hand, the deterministic nature of the finite state machine detects attacks with a DR.

Learning Rules for Anomaly Detection (LERAD), the system described in [Mahoney and Chan, b], is an optimized rule mining algorithm that works well on data with tokenized domains such as the fields of TCP or HTTP packets. Although the idea implemented in LERAD can be applied at any protocol layer, it has been tested on TCP and HTTP, but no more than the of the attacks in the testing dataset were detected. Even if the FPR is acceptable (i.e., alerts per day), its limited detection capabilities worsen if real-world data is used instead of synthetic datasets such as IDEVAL. In [Tandon and Chan, ] the LERAD algorithm is used to mine rules expressing normal values of arguments, normal sequences of system calls, or both. No relationship is learned among the values of different arguments, and sequences and argument values are handled separately; moreover, the evaluation is quite poor and uses non-standard metrics.

Unsupervised learning techniques to mine patterns from the payload of packets have been shown to be an effective approach. Both the network-based approaches described so far and other proposals [Labib and Vemuri, ; Mahoney, ] had to cope with data represented using a high number of dimensions (e.g., a vector with dimensions, that is the maximum number of bytes in the TCP payload). While the majority of the proposals circumvent the issue by ignoring the payload, the aforementioned issue is brilliantly solved in [Ramadas, ] and extended in Unsupervised Learning IDS with -Stages Engine (ULISSE) [Zanero and Savaresi, ; Zanero, b] by exploiting the clustering capabilities of a SOM [Kohonen, ] with a faster algorithm [Zanero, a], specifically designed for high-dimensional data. The payload of TCP packets is indeed compressed into the bi-dimensional grid representing the SOM, organized in such a way that classes of packets can be quickly extracted. The approach relies on, and confirms, the assumption that the traffic belongs to a relatively small number of services and protocols that can be mapped onto a small number of clusters. Network packets are modeled as a multivariate time-series, where the variables include the packet class in the SOM plus some other features extracted from the header. At detection, a fast discounting learning algorithm for outlier detection [Yamanishi et al., ] is used to detect anomalous packets. Although the system has not been released yet, the prototype is able to reach a . DR with as few as . FPs. In comparison, for one of the prototypes dealing with payloads available in the literature [Wang et al., ], the best overall result leads to the detection of . of the attacks, with a FPR that is between . and . The main weakness of this approach is that it works at the granularity of the packets and thus might be prone to simple evasion attempts (e.g., by splitting an attack onto several malicious packets, interleaved with long sequences of legitimate packets). Inspired by [Wang et al., ] and [Zanero and Savaresi, ], [Bolzoni et al., ] has been proposed.

The approach presented in [Wang et al., ] differs from the one described in [Zanero and Savaresi, ; Zanero, b] even though the underlying key idea is rather similar: byte frequency distribution. Both approaches, and also [Kruegel et al., ], exploit the distribution of byte values found in the network packets to produce some sort of normality signatures. [Wang et al., ] utilizes a simple clustering algorithm to aggregate similar packets and produce a smoother and more abstract signature, whereas [Zanero and Savaresi, ; Zanero, b] introduces the use of a SOM to accomplish the task of finding classes of normal packets.
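The shared idea, a byte frequency distribution used as a normality signature, can be illustrated with a short sketch. The function names are ours, and the Manhattan distance is only a stand-in chosen for brevity: the cited systems each define their own distance or clustering step.

```python
from collections import Counter


def byte_distribution(payload: bytes):
    """Relative frequency of each of the 256 possible byte values."""
    counts = Counter(payload)
    total = len(payload) or 1  # avoid dividing by zero on empty payloads
    return [counts.get(value, 0) / total for value in range(256)]


def mean_profile(payloads):
    """Average the distributions of the training payloads into one profile."""
    distributions = [byte_distribution(p) for p in payloads]
    return [sum(column) / len(distributions) for column in zip(*distributions)]


def distance(profile, payload: bytes):
    """Manhattan distance between the profile and a payload's distribution
    (an illustrative choice, not the metric of the cited works)."""
    observed = byte_distribution(payload)
    return sum(abs(a - b) for a, b in zip(profile, observed))
```

A payload made of bytes never seen in training sits at the maximum distance from the profile, while a payload resembling the training traffic stays close to it.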

An extension to [Wang and Stolfo, ], which uses 1-grams, is described in [Wang et al., ], which uses higher-order, randomized n-grams to mitigate mimicry attacks. In addition, the newer approach does not estimate the frequency distribution of the n-grams, which causes many FPs if n increases. Instead, it adopts a filtering technique to compress the n-grams into memory-efficient arrays of bits. This decreased the FPR by about two orders of magnitude.
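A Bloom filter is one common structure that matches the description of n-grams compressed into an array of bits. The sketch below assumes that reading; the class name, the parameters `n`, `size_bits` and `num_hashes`, and the use of SHA-256 as hash family are our own illustrative choices rather than details taken from the cited work.

```python
import hashlib


class NgramBloomFilter:
    """Records which byte n-grams were seen in training, without counting them."""

    def __init__(self, n=3, size_bits=1 << 16, num_hashes=3):
        self.n = n
        self.size_bits = size_bits          # must be a multiple of 8
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _ngrams(self, payload: bytes):
        return (payload[i:i + self.n]
                for i in range(len(payload) - self.n + 1))

    def _positions(self, gram: bytes):
        # Derive several bit positions per n-gram by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + gram).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def train(self, payload: bytes):
        for gram in self._ngrams(payload):
            for pos in self._positions(gram):
                self.bits[pos // 8] |= 1 << (pos % 8)

    def _seen(self, gram: bytes):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(gram))

    def score(self, payload: bytes) -> float:
        """Fraction of the payload's n-grams never observed in training."""
        grams = list(self._ngrams(payload))
        if not grams:
            return 0.0
        return sum(1 for g in grams if not self._seen(g)) / len(grams)
```

Since only membership is stored, memory use is fixed by `size_bits` regardless of n, at the price of occasional false "seen" answers.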

.. Host-based techniques

A survey of the latest (i.e., proposed between and ) host-based detection approaches is provided in this section. Most of the techniques leverage the system calls that processes issue to the kernel to create models of normal behavior of processes.

In Table . the selected approaches are marked with bullets to highlight their specific characteristics. Such characteristics are based on our analysis and experience; thus, other classifications may be possible. They are defined as follows:

Syscall refers, in general, to the use of system calls to characterize normal host activity. For example, a process may be modeled by the stream of system calls invoked during normal operation.

Stochastic means that stochastic techniques are exploited to create models. For example, the model of normal processes may be constructed by estimating the sample mean and variance of certain features (e.g., length of the open's path argument).

Deterministic means that certain features are modeled following a deterministic approach. For example, normal processes may be only those that use a fixed number, say , of file descriptors.

Comprehensive approaches are those that have been extensively developed and, in general, incorporate a rich set of features, beyond the proof-of-concept.

Context refers to the use of context information in general. For example, the normal behavior of processes can be modeled also by means of the number of the environmental variables utilized or by the sequence of system calls invoked.

Data means that the data flow is taken into account. For example, normal processes may be modeled by the set of values of the system call arguments.

Forensics means that the approach has been also evaluated for off-line, forensics analysis.


Our contributions are included in Table . and are detailed in Chapter . Note that, since recent research has experimented with several techniques and algorithms, mixed approaches exist and often lead to better results.

Host-based anomaly detection has been part of intrusion detection since its very inception: it already appears in the seminal work [Anderson, ]. However, the definition of a set of statistical characterization techniques for events, variables and counters such as the CPU load and the usage of certain commands is due to [Denning, ]. The first mention of intrusion detection through the analysis of the sequence of syscalls from system processes is in [Forrest et al., ], where normal sequences of system calls are considered. A similar idea was presented earlier in [Ko et al., ], which proposes a misuse-based idea by manually describing the canonical sequence of calls of each and every program, something evidently impossible in practice.

In [Lee and Stolfo, ] a set of models based on data mining techniques is proposed. In principle, the models are agnostic with respect to the type of raw event collected, which can be user activity (e.g., login time, CPU load), network packets (e.g., data collected with tcpdump), or operating system activity (e.g., system call traces collected with OpenBSM). Events are processed using automatic classification algorithms to assign labels drawn from a finite set. In addition, frequent episodes are extracted and, finally, association rules among events are mined. These three algorithms are combined at detection time. Events marked with wrong labels, unexpectedly frequent episodes or rule violations will all trigger alerts.

[Table .: Taxonomy of the selected state of the art approaches for host-based anomaly detection. Our contributions are highlighted. Approaches listed: [Lee and Stolfo, ], [Sekar et al., ], [Wagner and Dean, ], [Tandon and Chan, ], [Kruegel et al., a], [Zanero, ], [Giffin et al., ], [Mutz et al., ], [Bhatkar et al., ], [Mutz et al., ], [Fetzer and Suesskraut, ], [Maggi et al., ], [Maggi et al., a], [Frossi et al., ].]

Alternatively, other authors proposed to use static analysis, as opposed to dynamic learning, to profile a program's normal behavior. The technique described in [Wagner and Dean, ] combines the benefits of dynamic and static analysis. In particular, three models are proposed to automatically derive a specification of the application behavior: call-graph, context-free grammars (or non-deterministic pushdown automata), and digraphs. All the models' building blocks are system calls. The call-graph is statically constructed and then simulated, while the program is running, to resolve non-deterministic paths. In some sense, the context-free grammar model, called abstract stack, is the evolution of the call-graph model as it allows keeping track of the state (i.e., call stack). The digraph model (actually, a k-graph) keeps track of k-long sequences of system calls from an arbitrary point of the execution. Despite its simplicity, which ensures a negligible performance overhead with respect to the others, this model achieves the best detection precision.
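The k-graph idea of tracking k-long sequences of system calls can be sketched in a few lines. Note one deliberate simplification: here the set of allowed sequences is collected from example traces, whereas the cited work derives it from the program by static analysis. Function names and the default `k` are hypothetical.

```python
def k_sequences(trace, k):
    """All k-long windows of consecutive system calls in a trace."""
    return {tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)}


def train(traces, k=3):
    """Union of the k-long sequences observed in the reference traces."""
    model = set()
    for trace in traces:
        model |= k_sequences(trace, k)
    return model


def mismatches(model, trace, k=3):
    """k-long sequences of the monitored trace that the model does not allow."""
    return [seq for seq in sorted(k_sequences(trace, k)) if seq not in model]
```

Because a window can start at any position, the check does not need to know where in the execution the monitored process currently is, which is what keeps the overhead low.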

In [Tandon and Chan, ] the LERAD algorithm (Learning Rules for Anomaly Detection) is described. Basically, it is a learning system to mine rules expressing normal values of arguments, normal sequences of system calls, or both. In particular, the basic algorithm learns rules in the form $A = a, B = b, \ldots \Rightarrow X \in \{x_1, x_2, \ldots\}$, where uppercase letters indicate parameter names (e.g., path, flags) while lowercase symbols indicate their corresponding values. For some reason, the rule-learning algorithm first extracts random pairs from the training set to generate a first set of rules. After this, two optimization steps are run to remove rules with low coverage and those prone to generate FPs (according to a validation dataset). A system call is deemed anomalous if no matching rule is found. A similar learning and detection algorithm is run among sequences of system calls. The main limitation of the described approach is that no relationship is learned among the values of different arguments of the same system call. Besides the unrealistic assumption regarding the availability of a labeled validation dataset, another side issue is that the evaluation is quite poor and uses non-standard metrics.
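To make the rule form concrete, the sketch below represents a single rule (antecedent, constrained parameter, allowed values) and shows how its allowed set could be filled from training events. It covers only this representation: candidate generation from random pairs and the two pruning steps are omitted, and the `Rule` class with its method names is our own assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    antecedent: dict                  # conditions A = a, B = b, ...
    target: str                       # name of the constrained parameter X
    allowed: set = field(default_factory=set)  # values x1, x2, ...

    def applies(self, event):
        """True if the event satisfies every condition of the antecedent."""
        return all(event.get(k) == v for k, v in self.antecedent.items())

    def fit(self, events):
        """Collect the target values seen whenever the antecedent holds."""
        for event in events:
            if self.applies(event) and self.target in event:
                self.allowed.add(event[self.target])
        return self

    def violated_by(self, event):
        """The antecedent holds but the target value was never seen."""
        return self.applies(event) and event.get(self.target) not in self.allowed
```

For example, a rule with antecedent `{"syscall": "open", "path": "/etc/passwd"}` and target `flags` would flag a write-mode open of that file if training only ever showed read-only access.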

LibAnomaly [Kruegel et al., a] is a library to implement stochastic, self-learning, host-based anomaly detection systems. A generic anomaly detection model is trained using a number of system calls from a training set. At detection time, a likelihood rating is returned by the model for each new, unseen system call (i.e., the probability of it being generated by the model). A confidence rating can be computed at training for any model, by determining how well it fits its training set; this value can be used at runtime to provide additional information on the reliability of the model. When data is available, by using cross-validation, an overfitting rating can also be optionally computed.

LibAnomaly includes four basic models. The string length model computes, from the strings seen in the training phase, the sample mean $\mu$ and variance $\sigma^2$ of their lengths. In the detection phase, given $l$, the length of the observed string, the likelihood $p$ of the input string length with respect to the values observed in training is equal to one if $l < \mu + \sigma$ and $\frac{\sigma^2}{(l - \mu)^2}$ otherwise. As mentioned in the Example .., the character distribution model analyzes the discrete probability distribution of characters in a string. At training time, the so-called ideal character distribution is estimated: each string is considered as a set of characters, which are inserted into a histogram, in decreasing order of occurrence, with a classical rank order/frequency representation. During the training phase, a compact representation of mean and variance of the frequency for each rank is computed. For detection, a $\chi^2$ Pearson test returns the likelihood that the observed string histogram comes from the learned model. The structural inference model encodes the syntax of strings. These are simplified before the analysis, using a set of reduction rules, and then used to generate a probabilistic grammar by means of a Markov model induced by exploiting a Bayesian merging procedure, as described in [Stolcke and Omohundro, c,b, b]. The token search model is applied to arguments which contain flags or modes. During detection, if the field has been flagged as a token, the input is compared against the stored values list. If it matches a former input, the model reports the input as not anomalous; otherwise, it reports it as anomalous.
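The string length and token search models are simple enough to sketch directly from their description. The code below is an illustration, not the LibAnomaly API: class and method names are hypothetical, and the population variance is used as an assumption about which variance estimator is meant.

```python
from statistics import fmean, pvariance


class StringLengthModel:
    """Likelihood of a string length given the lengths seen in training."""

    def fit(self, strings):
        lengths = [len(s) for s in strings]
        self.mu = fmean(lengths)
        self.var = pvariance(lengths, self.mu)  # assumed estimator
        return self

    def likelihood(self, s):
        l = len(s)
        # "<=" instead of "<" only changes the degenerate zero-variance case,
        # where it avoids a 0/0 division; at l = mu + sigma both branches give 1.
        if l <= self.mu + self.var ** 0.5:
            return 1.0
        return self.var / (l - self.mu) ** 2


class TokenModel:
    """Flags any value that was not among the tokens stored in training."""

    def fit(self, values):
        self.known = set(values)
        return self

    def is_anomalous(self, value):
        return value not in self.known
```

The second branch of `likelihood` never exceeds one, because it is only reached when $l - \mu \geq \sigma$; the value decays quadratically as the observed string grows longer than what was seen in training.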

In [Zanero, ] a general Bayesian framework for encoding the behavior of users is proposed. The approach is based on hints drawn from the quantitative methods of ethology and behavioral sciences. The behavior of a user interacting with a text-based console is encoded as a Hidden Markov Model (HMM). The observation set includes the commands (e.g., ls, vim, cd, du) encountered during training. Thus, the system learns the user behavior in terms of the model structure (e.g., number of states) and parameters (e.g., transition matrix, emission probabilities). At detection, unexpected or out-of-context commands are detected as violations (i.e., lower value) of the learned probabilities. One of the major drawbacks of this system is its applicability to real-world scenarios: in fact, today's host-based threats perform more sophisticated and stealthy operations than invoking commands.
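How an HMM assigns a lower probability to out-of-context commands can be shown with the standard forward algorithm. In this sketch the model parameters are passed in by hand as plain dictionaries, whereas the cited work learns both structure and parameters from training data; the function name and argument layout are our own.

```python
def sequence_likelihood(commands, start, trans, emit):
    """Forward algorithm: probability of a non-empty command sequence.

    start[s]    -- probability of starting in hidden state s
    trans[r][s] -- probability of moving from state r to state s
    emit[s][c]  -- probability of observing command c in state s
    """
    states = list(start)
    # Initialization with the first observed command.
    alpha = {s: start[s] * emit[s].get(commands[0], 0.0) for s in states}
    # Induction over the remaining commands.
    for command in commands[1:]:
        alpha = {
            s: emit[s].get(command, 0.0)
               * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())
```

A command that no state can emit drives the likelihood to zero, and a known command issued in an unusual context lowers it; a threshold on this value (or on its logarithm, for long sessions) would then separate expected from unexpected behavior.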

In [Giffin et al., ] an improved version of [Wagner and Dean, ] is presented. It is based on the analysis of the binaries and incorporates the execution environment as a model constraint. More precisely, the environment is defined as the portion of input known at process load time and fixed until it exits. In addition, the technique deals with dynamically-linked libraries and is capable of constructing the data-flow analysis even across different shared objects.

An extended version of LibAnomaly is described in [Mutz et al., ]. The basic detection models are essentially the same. In addition, the authors exploit Bayesian networks instead of naive thresholds to classify system calls according to each model output. This results in an improvement in the DRs. Some of our work described in Section is based upon this and the original version of LibAnomaly.

Data-flow analysis has also been recently exploited in [Bhatkar et al., ], where an anomaly detection framework is developed. Basically, it builds a Finite State Automaton (FSA) model of each monitored program, on top of which it creates a network of relations (called properties) among the system call arguments encountered during training. Such a network of properties is the main difference with respect to other FSA-based IDSs. Instead of a pure control flow check, which focuses on the behavior of the software in terms of sequences of system calls, it also performs a so-called data flow check on the internal variables of the program along their existing cycles. This approach has really interesting properties, among which the fact that, not being stochastic, useful properties can be demonstrated in terms of detection assurance. On the other hand, though, the set of relationships that can be learned is limited (whereas the relations encoded by means of the stochastic models we describe in Section .. are not decided a priori and thus virtually infinite). The relations are all deterministic, which leads to a brittle detection model potentially prone to FPs. Finally, it does not discover any type of relationship between different arguments of the same call.

This knowledge is exploited in terms of unary and binary relationships. For instance, if an open system call always uses the same filename at the same point, a unary property can be derived. Similarly, relationships among two arguments are supported, by inference over the observed sequences of system calls, creating constraints for the detection phase. Unary relationships include equal (the value of a given argument is always constant), elementOf (an argument can take a limited set of values), subsetOf (a generalization of elementOf, indicating that an argument can take multiple values, all of which drawn from a set), range (specifies boundaries for numeric arguments), isWithinDir (a file argument is always contained within a specified directory), hasExtension (file extensions). Binary relationships include: equal (equality between system call operands), isWithinDir (file located in a specified directory; contains is the opposite), hasSameDirAs, hasSameBaseAs, hasSameExtensionAs (two arguments have a common directory, base directory or extension, respectively).

Example program (pseudocode):

    source_dir = dir; target_file = file;
    out = open(target_file, WR);
    push(source_dir);
    while ((dir_name = pop()) != NULL) {
        d = opendir(dir_name);
        foreach (dir_entry ∈ d) {
            if (isdirectory(dir_entry))
                push(dir_entry);
            else {
                in = open(dir_entry, RD);
                read(in, buf);
                write(out, buf);
                close(in);
            }
        }
    }
    close(out);
    return 0;

Learned FSA (state labels: 1, 3, 6, 8, 11, 12, 13, 14, 18, 19, 20), with transitions and their properties:

    start(I, O)
    FD3 = open(F3, M3): M3 elementOf {WR}; M3 equal O
    opendir(F6): isWithinDir I
    F8 isWithinDir F6; isDirectory F8
    FD11 = open(F11, M11): F11 equal F8
    read(FD12): FD12 equal FD11
    write(FD13): FD13 equal FD3
    close(FD14): FD14 equal FD11
    close(FD18): FD18 equal FD3
    return(0)

Figure .: A data flow example with both unary and binary relations.
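A few of the relations named above can be written down as small predicates. The sketch is ours and only mirrors the textual definitions: function names are hypothetical, only equal and elementOf are learned, and paths are assumed to be either all absolute or all relative (a requirement of `os.path.commonpath`).

```python
import os.path


def learn_unary(values):
    """Derive 'equal' if the argument is constant, 'elementOf' otherwise."""
    values = list(values)
    distinct = set(values)
    if len(distinct) == 1:
        return ("equal", values[0])
    return ("elementOf", distinct)


def check_unary(relation, value):
    """Check an observed argument value against a learned unary relation."""
    kind, operand = relation
    if kind == "equal":
        return value == operand
    return value in operand


def is_within_dir(path, directory):
    """Binary relation: the file lies inside the given directory."""
    path = os.path.normpath(path)
    directory = os.path.normpath(directory)
    return os.path.commonpath([path, directory]) == directory


def has_same_extension_as(first, second):
    """Binary relation: the two file arguments share the same extension."""
    return os.path.splitext(first)[1] == os.path.splitext(second)[1]
```

Each predicate is deterministic: an argument either satisfies the relation or it does not, which is the source of both the detection assurance and the brittleness discussed in the text.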

The behavior of each application is logged by storing the Process IDentifier (PID), the Program Counter (PC), along with the system calls invoked, their arguments and returned value. The use of the PC to identify the states in the FSA stands out as an important difference from other approaches. The PC of each system call is determined through stack unwinding (i.e., going back through the activation records of the process stack until a valid PC is found). The technique obviously handles process cloning and forking.

The learning algorithm is rather simple: each time a new value is found, it is checked against all the known values of the same type. Relations are inferred for each execution of the monitored program and then pruned on a set intersection basis. For instance, if relations R1 and R2 are learned from an execution trace T1 but R1 only is satisfied in trace T2, the resulting model will not contain R2. Such a process is obviously prone to FPs if the training phase is not exhaustive, because invalid relations would be kept instead of being discarded. Figure . shows an example (due to [Bhatkar et al., ]) of the final result of this process. During detection, missing transitions or violations of properties are flagged as alerts. The detection engine keeps track of the execution over the learned FSA, comparing transitions and relations with what happens, and raising an alert if an edge is missing or a constraint is violated.
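The pruning on a set intersection basis amounts to the following sketch, where each relation is any hashable value (for instance a tuple such as `("equal", "FD12", "FD11")`) and each training execution contributes the set of relations it satisfies. The function names are hypothetical.

```python
def learn_model(relations_per_trace):
    """Keep only the relations satisfied in every training execution."""
    model = None
    for relations in relations_per_trace:
        relations = set(relations)
        model = relations if model is None else model & relations
    return model if model is not None else set()


def violations(model, observed_relations):
    """Relations of the model that the monitored execution does not satisfy."""
    return model - set(observed_relations)
```

With T1 satisfying {R1, R2} and T2 satisfying only {R1}, the model is {R1}, as in the example above. The same code also shows the weakness noted in the text: with T1 alone, R2 would survive and could later raise a false alert.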

This FSA approach is promising and has interesting features, especially in terms of detection capabilities. On the other hand, it only takes into account relationships between different types of arguments. Also, the set of properties is limited to pre-defined ones and totally deterministic. This leads to a possibly incomplete detection model potentially prone to false alerts. In Section . we detail how our approach improves the original implementation.

Another approach based on the analysis of system calls is [Fetzer and Suesskraut, ]. In principle, the system is similar to the behavior-based techniques we mentioned before. However, the authors have tried to overcome two limitations of the learning-based approaches which, typically, have high FPRs and require a quite ample training set. This last issue is mitigated by adopting a completely different approach: instead of requiring training, the system administrator is required to specify a set of small whitelist-like models of the desired behavior of a certain application. At runtime, these models are evolved and adapted to the particular context the protected application runs in; in particular, the system exploits taint analysis to update a system call model on demand. This system can offer very high levels of protection but the effort required to specify the initial model may not be trivial; however, the effort may be worthwhile for mission-critical applications on which customized hardening would be needed anyway.

.. Web-based techniques

A survey of the latest (i.e., proposed between and ) web-based detection approaches is provided in this section. All the techniques included in the following are based on HTTP, meaning that models of normal activity are constructed either by inspecting the decoded network frames up to the HTTP layer, or by acting as reverse HTTP proxies.

In Table . the selected approaches are marked with bullets to highlight their specific characteristics. Such characteristics are based on our analysis and experience; thus, other classifications may be possible. They are defined as follows:

Adaptive refers to the capability of self-adapting to variations inthe normal behavior.

Stochastic means that stochastic techniques are exploited to create models. For example, the model of normal HTTP requests may be constructed by estimating the sample mean and variance of certain features (e.g., length of the string parameters contained in a POST request).

Deterministic means that certain features are modeled following a deterministic approach. For example, normal HTTP sessions may be only those that are generated by a certain finite state ma
