MINDS – A High Performance Data Mining Based Intrusion Detection System
Vipin Kumar, University of Minnesota
http://www.cs.umn.edu/research/minds/
Team Members: Varun Chandola, Eric Eilertson, Benjamin Mayer, Gyorgy Simon, Mark Shaneck, Michael Steinbach, Vipin Kumar

Dec 21, 2015



Objectives

• Objectives
  – Develop innovative high-performance techniques for detecting sophisticated attacks in an on-line and real-time manner
    • Detect stealth attacks by sophisticated adversaries that are specifically designed to evade detection by known intrusion-detection system (IDS) tools
    • Track down the source of attacks and the scope of the compromise after the break-in is detected
• Goals of Current Research
  – Development of new, scalable algorithms for analyzing large amounts of network data
    • Scan Detection
    • Summarization of Network Traffic
    • Profiling
    • Sequential Pattern Analysis
    • Context Extraction
  – Incorporate these algorithms into MINDS and the ARL-CIMP’s Interrogator


Relevance to Army and Research Portfolio

• As the Army, and the DoD as a whole, shifts to network-centric warfare, protecting the network means more than just protecting sensitive information. It now also means protecting the lives of war fighters and innocent civilians.
• In a wartime situation, a breach of the DoD’s computer networks puts the lives and operations of all allies in grave danger, as the enemy may know about operations before the soldier on the ground does.
• This makes intrusion detection not only an important technology for ensuring cyber security, but also critical for ensuring military superiority.


Background of Problem

www.snort.org

Example of SNORT rule

(MS-SQL “Slammer” worm)

any -> udp port 1434 (content:"|81 F1 03 01 04 9B 81 F1 01|"; content:"sock"; content:"send")

• Traditional intrusion detection system (IDS) tools are based on signatures of known attacks and have well-known limitations
  – Signature database has to be manually revised for each new type of discovered intrusion
  – Substantial latency in deployment of newly created signatures across the computer system
  – Cannot detect emerging cyber threats
  – Not suitable for detecting policy violations and insider abuse
  – Do not provide understanding of network traffic
  – Generate too many false alarms
  – Not suited for detecting multi-step attacks

• Data Mining based techniques offer great promise for addressing these limitations

Spread of SQL Slammer worm 10 minutes after its deployment
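The signature-matching idea behind the Slammer rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not how SNORT is implemented: the `Signature` class and the `matches` function are hypothetical names, and only the protocol, the port and the three content patterns are taken from the rule shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    """Hypothetical container for one content-based signature."""
    name: str
    protocol: str
    dst_port: int
    contents: tuple  # byte patterns that must ALL appear in the payload


# Values copied from the Slammer rule on this slide.
SLAMMER = Signature(
    name="MS-SQL Slammer worm",
    protocol="udp",
    dst_port=1434,
    contents=(
        bytes.fromhex("81 F1 03 01 04 9B 81 F1 01"),
        b"sock",
        b"send",
    ),
)


def matches(sig, protocol, dst_port, payload):
    """Return True if the packet satisfies every condition of the signature."""
    if protocol != sig.protocol or dst_port != sig.dst_port:
        return False
    return all(pattern in payload for pattern in sig.contents)
```

A payload that lacks any one of the byte patterns is not flagged, which illustrates the limitation listed above: a new or modified attack goes undetected until someone writes a signature for it.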


Relevance to HPC

• Network traffic monitoring generates a large amount of data

• HPC is critical for on-line analysis and scalability
  – Parallel versions of anomaly detection algorithms are required for on-line and distributed anomaly detection
  – Scalable, parallel algorithms for clustering, association analysis, summarization and 2nd level analysis will enable the analysis of data over months/years to detect long-term patterns and trends in network traffic


Work done in the past 1 year

• Protocol Anomaly Detection
• Clustering Long Term Patterns
• Summarization of Network Traffic
• Scan Detection
  – Data Mining Approach
  – Automatic Labeling of Training Data
• 2nd Level Analysis Tools
  – Improving the Netflows Database Schema
• Profiling of Long-term Patterns
  – On-demand Profiling
• Privacy Preserving Data Mining
  – Distributed Outlier Detection, Clustering, Classification


Scan Detection - Introduction

• Scans are reconnaissance operations to map services and find vulnerabilities

• Administrators can take preventive measures to protect network assets targeted by scans

• Scanners hide their activity
  – Slow or distributed scans touch very few hosts during a time interval
• Current scan detection tools
  – For each source IP, count the number of destination IPs it connects to on each destination port. If this count exceeds a threshold, the source is scanning.
  – Improvement: distinguish whether service was offered or not (TRW – Jung et al., 2004)
  – Improvement: make use of frequency of service offered on (destination IP, port) combination (Ertoz, Eilertson, Kumar et al., 2004)
  – Low thresholds have high false alarm rate and/or low coverage
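The basic counting scheme described above can be sketched as follows. This is a minimal illustration of the threshold approach only (not TRW and not the MINDS detector); the function name and the flow tuple layout are assumptions made for the example.

```python
from collections import defaultdict


def threshold_scan_detector(flows, threshold):
    """Flag (source IP, destination port) pairs that touch too many hosts.

    flows: iterable of (src_ip, dst_ip, dst_port) tuples (assumed layout).
    Returns the set of (src_ip, dst_port) pairs whose number of distinct
    destination IPs exceeds the threshold.
    """
    touched = defaultdict(set)
    for src_ip, dst_ip, dst_port in flows:
        touched[(src_ip, dst_port)].add(dst_ip)
    return {key for key, dsts in touched.items() if len(dsts) > threshold}


flows = [
    ("10.0.0.9", "192.168.1.1", 22),
    ("10.0.0.9", "192.168.1.2", 22),
    ("10.0.0.9", "192.168.1.3", 22),
    ("10.0.0.9", "192.168.1.4", 22),
    ("10.0.0.7", "192.168.1.1", 80),  # repeated connection to one host
    ("10.0.0.7", "192.168.1.1", 80),
    ("10.0.0.5", "192.168.1.1", 23),  # slow scanner: two hosts in this window
    ("10.0.0.5", "192.168.1.2", 23),
]
flagged = threshold_scan_detector(flows, threshold=3)
```

With a threshold of 3, only the source touching four hosts on port 22 is flagged. The source touching two hosts on port 23 stays below the threshold, which is the weakness of this scheme against slow or distributed scans.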


Data Mining for Scan Detection

• Scanning behavior follows certain patterns that are difficult to capture manually
  – Numerous features -> exponentially many combinations
  – Too many potential patterns for a human to systematically explore all of them
• If we observe sources for a sufficiently long time, they can be labeled as scanner or normal with high confidence
  – Not useful for real time scan detection
  – Requires too much memory due to state explosion
• Data mining can help build models for these patterns if labeled data is available
• Key issues: (1) feature selection, (2) labeling and (3) building a classifier


Evaluation

• University of Minnesota traffic
• 13 observation periods between 03/21/2005.00:00 and 03/22/2005.12:00
• Each observation period 20 minutes (approximately 4M flows)


Comparison

• Model built on ID #1 and tested on the remaining 12 periods
• TRW (threshold of 2)
• Ripper shows outstanding and consistent performance


Claim 1

• Our data mining approach enables early detection of scanners.
  – In some cases, as early as the first connection attempt on a specific port
  – Out of 59,860 SIDPs (<source IP, destination port> pairs) encountered in data set ID #5, 37,475 made connection attempts to only one destination IP on each destination port.
• Performance on the portion of the data that contains source IPs making at most one connection attempt on each destination port
  – Model built on ID #1, tested on ID #5
  – TRW-1: TRW at threshold of 1
    • TRW at a threshold of 2 or higher will not find such scanners
  – RIPPER: Our proposed method


Claim 2

• Our data mining approach is capable of filtering out scanning-like benign traffic such as P2P or backscatter

• Performance on the portion of the data set (ID #5) that contains P2P and scanning traffic only
  – Model built on ID #1
  – TRW-P: an SIDP making connection attempts to a P2P host is declared non-scanning
  – TRW-1,2, Ripper

The experimental results for this table in the paper have similar qualitative behavior but are incorrect due to a bug in one of the scripts for producing the output.


Claim 3

• Our data mining approach successfully extracted the characteristics of scanning behavior from the long-term observation
• Rules (model built on ID #1) make sense
  – Rule #2 is the workhorse rule

ID   Coverage       Rules for SIDP (<sip,dprt>)
     TP      FP
1    1287    2      blk >= .5, nosrv >= .67, # dst IPs touched on dprt >= 2
2    2668    38     blk >= .5, nosrv >= .78
3    62      28     dprt blocked, # dst IPs >= 2, # dst ports <= 2
…    19      9      ….
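To make the rule table concrete, the three listed rules can be written as an ordered rule list and applied to one SIDP at a time. This is only a sketch: the slide does not define the features, so the field names in `SidpFeatures` are hypothetical, `blk` and `nosrv` are simply treated as numbers in [0, 1], and the first-match ordering is an assumption about how the rule list is evaluated.

```python
from dataclasses import dataclass


@dataclass
class SidpFeatures:
    """Hypothetical feature record for one <source IP, destination port> pair."""
    blk: float            # feature named "blk" on the slide (meaning assumed)
    nosrv: float          # feature named "nosrv" on the slide (meaning assumed)
    dst_ips_on_dprt: int  # "# dst IPs touched on dprt"
    dst_ips: int          # "# dst IPs"
    dst_ports: int        # "# dst ports"
    dprt_blocked: bool    # "dprt blocked"


# Thresholds copied from rules 1-3 in the table above.
RULES = [
    (1, lambda f: f.blk >= 0.5 and f.nosrv >= 0.67 and f.dst_ips_on_dprt >= 2),
    (2, lambda f: f.blk >= 0.5 and f.nosrv >= 0.78),
    (3, lambda f: f.dprt_blocked and f.dst_ips >= 2 and f.dst_ports <= 2),
]


def first_matching_rule(features):
    """Return the ID of the first rule that fires, or None if none does."""
    for rule_id, predicate in RULES:
        if predicate(features):
            return rule_id
    return None


single_attempt = SidpFeatures(0.6, 0.8, 1, 1, 1, False)
rule = first_matching_rule(single_attempt)
```

Rule 2 has no condition on the number of destinations touched, so it can fire for an SIDP with a single connection attempt, as `single_attempt` shows. That is consistent with the early-detection behavior described under Claim 1.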


Summarization of Large Data Sets

• Summarization is a technique to find a compact and meaningful representation for analyzing large data sets for which manual monitoring is not possible

• Clustering can be used to summarize large datasets but cannot handle categorical attributes

• In domains like network intrusion detection, the data is huge and has a mix of categorical and continuous attributes

A sample network data set with 17 records. Each record has 8 different features which are categorical or continuous.

A representative summary of the above data set.


Our Contributions

• Developed two approaches to solve this problem
  – Clustering based approach
    • Generate clusters from the data set and replace the members of each cluster with a feature-wise intersection of all members in that cluster.
  – Association Rule based multi-step approach
    • Generate frequent itemsets from the data in the first step and then select a subset of these frequent itemsets as the summary of the data in the second step.
    • The selection of the subset is done heuristically to optimize the information loss for a given compaction gain. We propose a suite of several heuristic based algorithms which can generate approximately good summaries.
• Formulated the problem of summarization of transactions that contain categorical data as a dual optimization problem, and characterized a good summary using two metrics
  – Compaction gain: Size of data / Size of summary
  – Information loss: Weighted sum of missing features

Rules Discovered: {Milk} --> {Coke}; {Diaper, Milk} --> {Beer}

For the dataset shown in last slide and above summary,

Compaction Gain = 17/3

Information Loss (if all features have weight = 1) = 19
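The clustering based approach and the two metrics can be illustrated on a toy data set. The sketch below is an assumption-laden reading of the slide, not the authors' implementation: records are plain dictionaries, the clusters are given by hand rather than computed, and information loss is taken as the per-record sum of weights of features that the cluster's summary drops. The 17-record data set and its figures (17/3 and 19) are not reproduced here because the records themselves are not in the transcript.

```python
def featurewise_intersection(cluster):
    """Keep only the feature/value pairs shared by every record in the cluster."""
    first, *rest = cluster
    return {f: v for f, v in first.items() if all(r.get(f) == v for r in rest)}


def compaction_gain(n_records, n_summary_elements):
    """Size of data divided by size of summary."""
    return n_records / n_summary_elements


def information_loss(clusters, weights=None):
    """Weighted count of features each record loses in its cluster summary."""
    weights = weights or {}  # missing weights default to 1
    loss = 0.0
    for cluster in clusters:
        kept = featurewise_intersection(cluster)
        for record in cluster:
            loss += sum(weights.get(f, 1.0) for f in record if f not in kept)
    return loss


# Toy data: four records grouped by hand into two clusters.
clusters = [
    [{"src": "A", "port": 80, "proto": "tcp"},
     {"src": "B", "port": 80, "proto": "tcp"}],
    [{"src": "C", "port": 53, "proto": "udp"},
     {"src": "C", "port": 123, "proto": "udp"}],
]
summary = [featurewise_intersection(c) for c in clusters]
n_records = sum(len(c) for c in clusters)
gain = compaction_gain(n_records, len(summary))
loss = information_loss(clusters)
```

Here the summary has two elements for four records, and each record loses exactly one feature. Merging both clusters into one would raise the compaction gain but also the information loss, which is the trade-off the dual optimization formulation captures.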

Ranked among the best 5 papers (with a student as main author) at the 5th ICDM Conference, 2005 (total submissions = 630)


Privacy Preserving Data Mining

• Goal: provide a comprehensive, cryptographic, privacy preserving solution for nearest neighbor search and its major applications in data mining

[Diagram labels: PP NN Search Primitive; Basic K-distance; Pre-clustering; Low Dimension; Approx. NN; Vertical Partitioning; Implementation; General Solution; Homomorphic Encryption; Oblivious Transfer; Multiplication Primitive; Comprehensive Distance Measure; More Secure Primitives; More Efficient Primitives; Dot Product; Division; Euclidean Distance; LOF; kNN; SNN]


2nd Level Analysis

• Detecting an attack and cleaning up the affected computers is not enough, and can even be harmful
  – Allows the attacker to determine what our detection capabilities are
  – Attacker can reorganize his attack to try and go undetected
  – Attacker may go after a different organization that is easier to break into
• Security analysts need to quickly determine
  – WHEN a compromise happened, HOW a compromise occurred, WHAT the attacker is after, WHERE the attacker came from, WHO the attacker is, HOW many computers are compromised
• The above questions are answered by 2nd level analysts
  – Currently done almost entirely manually
  – Takes days to months to answer some questions, if they are ever answered


Continuing Work on 2nd Level Analysis

• Developing algorithms and tools for automating much of the 2nd level analysis process
  – Algorithms for creating and operating on communication graphs
  – On-demand profiling used in pruning communication graphs
• Some building blocks for performing 2nd level analysis are already in use at the ARL-CIMP
  – High-speed data collection
  – Massive data storage
  – Quick information retrieval


On Demand Context Extraction

• Starting from a suspected bad computer, search for other computers communicating with it.
  – Currently in use at the ARL-CIMP in the form of Flowinator, part of the Interrogator architecture.
• Flowinator contains billions of network connections
  – Can answer in seconds questions which used to take hours to answer.
• Future work
  – Incorporate profiling to automate iterative extraction
  – Allow looking for multiple IPs at once
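The core query described on this slide, finding the hosts that communicated with a suspected host, can be sketched with an in-memory index. This is an illustration only and says nothing about how Flowinator is built: the function names and the (source, destination) flow layout are assumptions, and the `hops` parameter and multi-seed input are one possible reading of the future-work items.

```python
from collections import defaultdict


def build_index(flows):
    """Map each host to the set of hosts it exchanged traffic with.

    flows: iterable of (src_ip, dst_ip) pairs (assumed layout).
    """
    peers = defaultdict(set)
    for src, dst in flows:
        peers[src].add(dst)
        peers[dst].add(src)
    return peers


def extract_context(peers, seeds, hops=1):
    """Return hosts reachable from the seed hosts within `hops` steps."""
    seeds = set(seeds)
    seen = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        reached = set()
        for host in frontier:
            reached |= peers.get(host, set())
        frontier = reached - seen  # only expand hosts not visited before
        seen |= frontier
    return seen - seeds


flows = [
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.2", "10.0.0.3"),
    ("10.0.0.4", "10.0.0.5"),
]
peers = build_index(flows)
direct = extract_context(peers, ["10.0.0.1"])
two_hops = extract_context(peers, ["10.0.0.1"], hops=2)
```

Each additional hop widens the set of hosts under suspicion quickly on real traffic, which is presumably why the slide proposes profiling as a way to prune and automate the iteration.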


On Demand Host Profiling

• Developing techniques for profiling hosts on the fly to rank the computers returned based upon how anomalous the activity was.
  – Uses the Flowinator database
  – Preliminary versions of this have worked well at the CIMP, but have not been incorporated in Interrogator
• Future work
  – Determine if additional data needs to be captured for profiling
    • Portions of the payload, histograms of the payload, packet arrival times within a session
  – Develop a voting mechanism to increase accuracy
    • Potential voters are profiles of host, network, and computer class (e.g. workstation, server, programmer, secretary)
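One way a voting mechanism over several profiles could rank hosts is sketched below. The slide lists voting only as future work and gives no design, so everything here is hypothetical: the function name, the assumption that each profile yields an anomaly score in [0, 1], the 0.5 cut-off, and the tie-breaking by mean score.

```python
def rank_by_votes(host_scores, threshold=0.5):
    """Rank hosts by how many profiles consider them anomalous.

    host_scores: {host: {profile_name: anomaly score in [0, 1]}} (assumed).
    A profile "votes" for a host when its score exceeds the threshold.
    Ties are broken by mean score, then by host name for determinism.
    """
    ranked = []
    for host, scores in host_scores.items():
        votes = sum(1 for s in scores.values() if s > threshold)
        mean = sum(scores.values()) / len(scores)
        ranked.append((votes, mean, host))
    ranked.sort(key=lambda t: (-t[0], -t[1], t[2]))
    return [host for _, _, host in ranked]


scores = {
    "10.0.0.2": {"host": 0.9, "network": 0.8, "class": 0.2},
    "10.0.0.3": {"host": 0.6, "network": 0.1, "class": 0.1},
    "10.0.0.5": {"host": 0.1, "network": 0.2, "class": 0.1},
}
ranking = rank_by_votes(scores)
```

Requiring agreement between independent profiles is a common way to reduce false alarms from any single profile; whether it increases accuracy here would have to be evaluated on real traffic.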


Publications

• Journals
  – Varun Chandola and Vipin Kumar, “Summarization - Compressing Data into an Informative Representation”. To appear in Knowledge and Information Systems (KAIS), Springer, 2006.
  – Hui Xiong, Pang-Ning Tan, and Vipin Kumar, “Hyperclique Pattern Discovery”. Accepted for publication in Data Mining and Knowledge Discovery (DMKD), 2006.
  – Hui Xiong, Gaurav Pandey, Michael Steinbach, Vipin Kumar, “Enhancing Data Analysis with Noise Removal”, IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 18, no. 3, pp. 304-319, March 2006.
  – Michael Steinbach and Vipin Kumar, “Generalizing the Notion of Confidence”. Accepted for publication in Knowledge and Information Systems (KAIS), Springer, 2006.
  – Hui Xiong, Shashi Shekhar, Pang-Ning Tan, and Vipin Kumar, “TAPER: A Two-Step Approach for All-strong-pairs Correlation Query in Large Databases”, IEEE Transactions on Knowledge and Data Engineering (TKDE), accepted for publication as a regular paper, 2006.
  – Jieping Ye, Qi Li, Hui Xiong, Haesun Park, Ravi Janardan, Vipin Kumar, “IDR/QR: An Incremental Dimension Reduction Algorithm via QR Decomposition”, IEEE Transactions on Knowledge and Data Engineering, 17(9), pp. 1208-1222, Sept 2005.


Publications

• Books
  – Pang N. Tan, Michael Steinbach, Vipin Kumar, “Introduction to Data Mining”, Addison-Wesley (May, 2005)
  – Vipin Kumar, Jaideep Srivastava, Aleksander Lazarevic, Eds, “Managing Cyber Threats: Issues, Approaches and Challenges”, Kluwer, 2005
• Book Chapters
  – Varun Chandola, Eric Eilertson, Levent Ertoz, Gyorgy Simon and Vipin Kumar, “Data Mining for Cyber Security”. To appear in Data Warehousing and Data Mining Techniques for Computer Security, editor Anoop Singhal, Springer, 2006
• Conference Proceedings
  – Gyorgy Simon, Hui Xiong, Eric Eilertson and Vipin Kumar, “Scan Detection: A Data Mining Approach”. Proceedings of 6th SIAM International Conference on Data Mining (SDM), 2006.
  – Varun Chandola and Vipin Kumar, “Summarization - Compressing Data into an Informative Representation”. Proceedings of 5th International Conference on Data Mining (ICDM) 2005, TR-2005-037.
  – Michael Steinbach and Vipin Kumar, “Extending the Notion of Confidence”. Proceedings of 5th International Conference on Data Mining (ICDM) 2005, TR-2005-039.
• Technical Reports
  – Mark Shaneck, Varun Chandola, Haiyang Liu, Changho Choi, Gyorgy Simon, Eric Eilertson, Yongdae Kim, Zhi-li Zhang, Jaideep Srivastava, and Vipin Kumar, “A Multi-Step Framework for Detecting Attack Scenarios”, Technical Report 06-004, 2006, Computer Science Department, University of Minnesota
  – Mark Shaneck, Yongdae Kim, and Vipin Kumar, “Privacy Preserving Nearest Neighbor Search”, CS Technical Report 06-014, 2006, Computer Science Department, University of Minnesota


Participation in Government and DoD Forums & Army Interactions

• Vipin Kumar attended and gave a talk at Workshop on Edge Computing Using New Commodity Architectures (EDGE), organized by various funding agencies including ARO, DTO and NSF at UNC, May 23 - 24, 2006

• Eric Eilertson met with ARL personnel Jan 30th - Feb 4th, in Adelphi MD to incorporate updates to the MINDS software

• Eric Eilertson visited the ARL-CIMP July 12th – 16th, 2005, September 26th – 30th, 2005 and November 28th – December 3rd, 2005 to help with the continued design of 2nd level analysis tools and a framework for 2nd level analysis.

• Vipin Kumar served as the co-organizer for the AHPCRC PGAS workshop in September 2005

• Benjamin Mayer attended the AHPCRC PGAS Workshop in Sept 2005
• Benjamin Mayer visited the ARL-CIMP as a summer intern during May 17th to July 30th 2005
• Benjamin Mayer and Eric Eilertson attended the DREN Networkers Conference, October 2005


Invited Talks and Presentations

• Vipin Kumar, “Scalable Benchmarks and Kernels for Data Mining and Analytics”, Invited Talk, Workshop on Edge Computing Using New Commodity Architectures (EDGE), UNC, May 23 - 24, 2006

• Vipin Kumar, “High-Performance Data Mining for Cyber Security”. Invited Talk, Distinguished Speaker Series, University of California, Davis, Feb 16, 2006.

• Vipin Kumar, “High Performance Data Mining for Cyber Security”, invited talk at IIT Roorkee (Dec 27th, 2005).

• Vipin Kumar, “High Performance Data Mining for Cyber Security”, invited talk at IIT Delhi (December 22nd, 2005).

• Benjamin Mayer, Eric Eilertson and Vipin Kumar, “Analyzing Long Term Network Data for Cyber Attacks Using HPC. A Comparison of MPI and UPC Implementations.” Presented at DREN Networkers Conference, October 2005.

• Benjamin Mayer, Eric Eilertson, Kerry Long, Tony Pressley and Vipin Kumar, “NPADS – Network Protocol Anomaly Detection System”. Presented at DREN Networkers Conference, October 2005.

• Benjamin Mayer and A. Karl Keller, “High Productivity Parallel Programming with Objective C.” AHPCRC PGAS Workshop, September 2005.


Significant Professional Activities & Awards

• Vipin Kumar, “Scaling Data Analytics”, Tutorial at Supercomputing-2005, Seattle, 14th Nov., 2005

• Vipin Kumar, Elected ACM Fellow, Dec 2005
• Vipin Kumar, Technical Accomplishment Award, IEEE Computer Society, 2005
• Varun Chandola, IBM Research Student Travel Award for the paper titled “Summarization - Compressing Data into an Informative Representation”, at 5th ICDM Conference, 2005. Ranked as one of the top 5 student papers.