1 CISA Continually Improving Stream Analysis Nancy McMillan Doug Mooney Dave Burgoon March 14, 2003

CISA: Continually Improving Stream Analysis

Nancy McMillan, Doug Mooney, Dave Burgoon

March 14, 2003


Agenda

• Background and Overview
• Architecture
• Algorithms
• Results


MURALS: Multiple Use Real-time Analytics for Large Scale Data

Major information technology initiative
• Objective: Develop intellectual property addressing the challenges created by:
– Data generation/collection at previously unimaginable rates
– Growing expectation that real-time decision-making is feasible and necessary for competitive advantage
– Dramatic increase in the data to information ratio
– Compelling need for balance between result precision and timeliness

Sponsored development of two technologies
• InfoRes: Addresses IT issues associated with real-time querying of very large relational databases
• CISA: Addresses IT issues associated with real-time analysis of high-volume (varying arrival speed) stream data


Background: Our problem space

Many data sources supplying stream data

Stream data can be summarized by a set of features/summary statistics over some time window

Each data source needs to be continually classified or characterized

Classification/characterization of a single data source may depend on data from other data sources

Examples:
• Computers connecting to a firewall
• Sensor networks
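A minimal sketch of one such windowed feature, assuming the feature is the fraction of dropped connections for a single source over a sliding time window. The class name, the window length, and the choice of feature are illustrative assumptions, not taken from the slides.

```python
from collections import deque


class WindowedDropRate:
    """Fraction of dropped connections for one source over a sliding window.

    Hypothetical helper: the slides only say that stream data is summarized
    by features over some time window.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, was_dropped), oldest first
        self.drops = 0         # running count of drops inside the window

    def add(self, timestamp, was_dropped):
        # Update the running count with the new event.
        self.events.append((timestamp, was_dropped))
        self.drops += int(was_dropped)
        # Evict events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_dropped = self.events.popleft()
            self.drops -= int(old_dropped)

    def value(self):
        return self.drops / len(self.events) if self.events else 0.0
```

Each arrival touches only the new event and the expired ones, so the feature stays current without rescanning the whole window.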


Internet Security Example: Who is trying to inappropriately access a company’s network?

There are 19 firewalls recording connections in a log file
• Date/Time
• Source and Destination IP addresses
• Protocol
• Action (Accept/Drop/Decrypt/..)
• Service
• Rule

Inbound and outbound connections and warnings over a six-day period in July 2002 were logged
• but connections from site-to-site VPNs are not
• only externally initiated connections are being analyzed
• more data (6 days in September) were provided later


The Problem: The faster data arrives, the more processing power is required for real-time analysis.

Every data arrival initiates some tasks (store data, recalculate features, update decisions, etc.), each of which requires computational time
• Systems designed for gushing data waste resources when data trickles.
• Systems designed for slower data flow fail when data arrives too fast.

More sophisticated analysis techniques (better features, decision algorithms, etc.) require more computational time, but can provide better answers
• Analytics designed for gushing data don’t provide the best answer possible when data trickles.
• Analytics designed for slower data flow don’t provide timely answers when data arrives too fast.

For what data arrival rate should the system be designed?


The CISA Answer: A precision-speed trade-off

When the data arrives more slowly than the system design rate, the best possible answer is provided
• All data is considered.
• Best analysis techniques are used.

As the data flows faster than the system design rate, the accuracy and/or precision of the solution degrades smoothly.

System achieves precision-speed trade-off through:
• Architecture
– Answer not based on all current data
– Requires feedback from algorithm so most important data is considered
• Algorithms
– Partial/approximate solutions provided


Architecture and Algorithm Overview: How CISA achieves precision-speed trade-off

Architecture
• Assign analysis tasks to asynchronously operating objects
– storage, characterization, decision-making, and visualization
• Prioritize analysis tasks associated with each new piece of data
– Data likely to impact analysis is analyzed sooner

Algorithm
• Use incremental algorithms where possible
– Update previous answer with new data rather than re-analyze all data
• Stop or modify iterative or multi-step algorithms before completion when new data arrivals need to enter algorithm
– Partial/approximate solutions provided
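The prioritization idea can be illustrated with a small priority queue. This is a sketch under stated assumptions, not the CISA implementation (which the slides describe as Java objects communicating over JMS): the class name is hypothetical, and a task-count budget stands in for the time available before new data must be handled.

```python
import heapq
import itertools


class PrioritizedTasks:
    """Queue of analysis tasks; the highest-priority task runs first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def put(self, priority, task):
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), task))

    def run(self, max_tasks):
        """Run up to max_tasks tasks; lower-priority work stays queued."""
        results = []
        while self._heap and len(results) < max_tasks:
            _, _, task = heapq.heappop(self._heap)
            results.append(task())
        return results

    def __len__(self):
        return len(self._heap)
```

When data arrives faster than tasks can be processed, the budget is spent on the tasks most likely to change the answer, and the remainder waits.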


Agenda

• Background and Overview
• Architecture
• Algorithms
• Results


CISA Architectural Components

[Diagram. Components shown: Source Data Objects (Source 1, Source 2, . . .); Data Management Object (Database holding Source 1 and Source 2); Algorithm Objects (Algorithm, Visualization/Monitor); two PRIORITIZE stages. Legend: Raw Data, Summary Statistic/Feature, Algorithm, Direct Connection.]


Internet Security Example Architecture

[Diagram. Components shown:
• Source Data Objects: Firewall 1 and Firewall 2 (Publishers); Source 1 Data and Source 2 Data topics; Source 1 Feature/State and Source 2 Feature/State (Listener-Publishers); ...
• Data Management Object: Database Requester (Listener-Publisher); Database topic; Database holding Source 1 and Source 2
• Algorithm Object: Decision Maker (Listener-Publisher); Source 1 Feature and Source 2 Feature topics; Decision Update and Decision Made topics; Visualization/State Reporter (Listener)
• Two PRIORITIZE stages
• Legend: Log Data Message, Feature calculation Message, State Update Message, Direct Connection
• Implementation labels: Access database, JMS object communication, SAS Analytics, Java]


Advantages / Issues: Related to rapid prototyping decisions

JMS
Advantages
• Asynchronous
• Prioritized Lists
• Open Source / Off-the-shelf
• Platform Independent
Issues
• Slow: system resources, “thrashing”, db, (network speeds)
• JMS implementations vary slightly

Access
Advantages
• Easy communication with Java
• Easily and quickly developed
– data storage and
– feature calculation
Issues
• Slow
• Not available on many platforms


Agenda

• Background and Overview
• Architecture
• Algorithms
• Results


Candidate CISA Algorithms: A very broad group of statistical methods…

Feature characteristics
• Relies on more than one feature
• Some of the individual features take time to compute or measure
• Meaningful nested "sub-algorithms" can be built on increasing sets of features

Data source characteristics
• The algorithm can efficiently update its current solution when feature values for only a small group of source objects change
• There is a natural method for prioritizing objects


Construction Methodologies: General

Feature Priority
• Order features (statically)
• Create series of nested models that use an increasing number of features
• Develop a function to assign priorities based on feature order and current object classification

Data Source Priority
• Order data sources (dynamically)
• Assign priorities based on uncertainty of classification or cost of misclassification
• Incremental algorithms are usually essential

Combinations of Both
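One way to turn "uncertainty of classification or cost of misclassification" into a number is sketched below. The slides do not say how priorities were computed; using Shannon entropy of the class probabilities, scaled by a cost factor, is an assumption made here for illustration, and all function names are hypothetical.

```python
import math


def classification_entropy(probabilities):
    """Shannon entropy in bits; higher means a less certain classification."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def source_priority(probabilities, misclassification_cost=1.0):
    # Assumed scoring rule: uncertainty weighted by the cost of being wrong.
    return misclassification_cost * classification_entropy(probabilities)


def order_sources(sources):
    """sources maps name -> (class probabilities, cost).

    Returns source names ordered from highest to lowest priority.
    """
    return sorted(
        sources,
        key=lambda name: source_priority(*sources[name]),
        reverse=True,
    )
```

Because the class probabilities change as new data arrives, the ordering has to be recomputed dynamically, which is the contrast with the static feature ordering.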


Construction Methodologies: Examples

Feature Priority: Decompose an algorithm into subalgorithms that use subsets of features. Prioritize feature computation.
• Example: Decision tree using X1, X2, …, Xn
• Prioritize order of Xi computation based on tree structure
• Use pruned trees to classify: {X1}, {X1, X2}, {X1, X2, X3}, …, {X1, X2, …, Xn}

Data Source Priority:
• Example: Cluster analysis, where all features are needed
• Objects with incomplete feature sets get higher priority
• Objects with more uncertain classifications get higher priority
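The nested pruned-tree idea can be sketched as a family of models, each using one more feature than the last, with classification falling back to the deepest model whose features have been computed. The three-feature tree, its thresholds, and the labels G and B at each leaf are invented for illustration and are not the tree from the slides.

```python
FEATURE_ORDER = ["x1", "x2", "x3"]  # static feature order (assumed)


def model_1(f):
    # Tree pruned to its first split.
    return "G" if f["x1"] < 0.5 else "B"


def model_2(f):
    # Adds the second split below the x1 >= 0.5 branch.
    if f["x1"] < 0.5:
        return "G"
    return "G" if f["x2"] < 0.2 else "B"


def model_3(f):
    # Full (hypothetical) tree.
    if f["x1"] < 0.5:
        return "G"
    if f["x2"] < 0.2:
        return "G"
    return "G" if f["x3"] < 0.1 else "B"


NESTED_MODELS = [model_1, model_2, model_3]


def classify(features):
    """Return (label, level) using the longest available feature prefix."""
    level = 0
    for name in FEATURE_ORDER:
        if name not in features:
            break
        level += 1
    if level == 0:
        return None, 0
    return NESTED_MODELS[level - 1](features), level
```

A source with only `x1` computed still gets a provisional label, which is refined as the slower features arrive.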


Feature Priority Construction: Decision tree example

[Figure: decision tree with splits X1 < 0.00134771, X2 < 0.16844, X4 < 0.148293, X6 < 0.722813, X3 < 0.248832, and X5 < 34.5; leaves are labeled G or B.]


Agenda

• Background and Overview
• Architecture
• Algorithms
• Results


Internet Security Example: Who is trying to inappropriately access the company’s network?

There are 19 firewalls recording connections in a log file
• Date/Time
• Source and Destination IP addresses
• Protocol
• Action (Accept/Drop/Decrypt/..)
• Service
• Rule

Inbound and outbound connections and warnings over a six-day period in July 2002 were logged
• but connections from site-to-site VPNs are not
• only externally initiated connections are being analyzed
• more data (6 days in September) were provided later


External Network Connectors: Summary statistics/features

Quickly calculated features
• % Drop
• % Accept
• Hits/Sec
• # Hits

More time-consuming features
• # Different Services
• Different Services/Hit
• # Different IPs
• Different IPs/Hit
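As a rough illustration, the eight features listed above could be computed for one external source as follows. The record layout (keys `time`, `action`, `service`, `dest_ip`), the reading of "# Different IPs" as distinct destination addresses, and the handling of a zero-length time span are all assumptions; the slides do not define them.

```python
def connector_features(records):
    """Summary features for the log records of a single source IP.

    records: list of dicts with hypothetical keys 'time' (seconds),
    'action', 'service' and 'dest_ip'.
    """
    hits = len(records)
    if hits == 0:
        return {}
    drops = sum(1 for r in records if r["action"] == "Drop")
    accepts = sum(1 for r in records if r["action"] == "Accept")
    times = [r["time"] for r in records]
    duration = max(times) - min(times)
    # Distinct counts need a set per source, one reason they cost more
    # than the simple counters.
    services = {r["service"] for r in records}
    dest_ips = {r["dest_ip"] for r in records}
    return {
        "hits": hits,
        "pct_drop": 100.0 * drops / hits,
        "pct_accept": 100.0 * accepts / hits,
        # Assumption: with no elapsed time, report the raw hit count.
        "hits_per_sec": hits / duration if duration > 0 else float(hits),
        "num_services": len(services),
        "services_per_hit": len(services) / hits,
        "num_ips": len(dest_ips),
        "ips_per_hit": len(dest_ips) / hits,
    }
```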


[Figure: groups of external network connectors, with the following annotations.]

• Port Scans (N=36): High Services, Large Drop %
• Slow Port and IP Scans (N=3): High Services, High Number of IPs, High Number of Hits, Low Hits/Sec, Large Drop %
• Fast IP Address Scans (N=10): Low Services, High Number of Hits, High IP/Hit, High Number of Hits/Sec, Large Drop %, Mostly Foreign; Represent 40% of External Connections
• Suspicious (N=4636): Large Drop %, Medium IP/Hit, Low everything else
• Suspicious-Too Early to Tell (N=8055): Large Drop %, High IP/Hit, Few Hits
• Normal (N=7828): High Accept %

Dates: 7/21/02 - 7/27/02


External Network Connectors: Classifications

70%-80% of IPs stay in same group from day to day.

Class                       Sources   Connections   Percentage
Port Scans                       36       218,658       14.40%
Mostly Foreign IP Sweeps         10       602,438       39.68%
Port and IP Sweeps                3         9,165        0.60%
Normal                        7,828       205,990       13.57%
Suspicious                    4,636       455,687       30.02%
Few Connections               8,055        26,163        1.72%
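The Percentage column is each class's share of all logged connections. A short check of that arithmetic, using the connection counts from the table (the variable names are illustrative):

```python
connections = {
    "Port Scans": 218_658,
    "Mostly Foreign IP Sweeps": 602_438,
    "Port and IP Sweeps": 9_165,
    "Normal": 205_990,
    "Suspicious": 455_687,
    "Few Connections": 26_163,
}

total = sum(connections.values())  # 1,518,101 connections in all

# Share of total connections per class, in percent, rounded to 2 places.
shares = {name: round(100.0 * n / total, 2) for name, n in connections.items()}
```

The ten Mostly Foreign IP Sweeps sources alone account for about 40% of connections, while the 8,055 Few Connections sources contribute under 2%.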


External Network Connectors: Rule-based, feature priority classification algorithm

Level   Features Added          Classifications
0
1       Drop %                  Normal, Too Early to Tell
2       Ratio Measures          Normal, IP Scan Only, Port Scan Only, Both IP and Port Scan, Unknown, Too Early to Tell
3       Distinct Services       Normal, IP Scan Only, Port Scan Only, Both IP and Port Scan, Unknown, Too Early to Tell
4       Distinct IP Addresses   Normal, IP Scan Only, Port Scan Only, Both IP and Port Scan, Unknown, Too Early to Tell

Other labels on the slide: Suspicious; Too Early to Tell; Classification; Priority


Precision-Speed Trade-off: Expected results

[Figure: percentage (0 to 100) versus connections per second. Series: Correctly classified same level algorithm; Correctly classified different level algorithm; Consistently classified; Inconsistently classified.]


Precision-Speed Trade-off: Observed results

[Figure: percentage (0.0% to 100.0%) versus connections per second. Series: Correctly classified same level algorithm; Correctly classified different level algorithm; Consistently classified; Inconsistently classified.]


External Network Connectors: Dynamic, data source priority algorithm

Traditional cluster analysis (e.g., K-means) is time-consuming on large datasets

Incremental clustering algorithm required for reasonable performance

Our approach:
• After first cluster analysis, use centroid locations to seed the next analysis
• Used the SAS procedure FASTCLUS for proof-of-concept purposes
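The seeding step can be illustrated with a plain k-means loop that accepts starting centroids. This is a generic sketch, not the FASTCLUS procedure used in the proof of concept; the function names and the `max_iter` cap (which also shows how an iterative algorithm can be stopped early for a partial answer) are assumptions.

```python
def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels


def update(points, labels, centroids):
    """Recompute each centroid as the mean of its members."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            new_centroids.append(
                tuple(sum(col) / len(members) for col in zip(*members))
            )
        else:
            new_centroids.append(old)  # keep an empty cluster's centroid
    return new_centroids


def kmeans(points, seeds, max_iter=10):
    """K-means started from the given seeds; stops at max_iter or convergence."""
    centroids = list(seeds)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = update(points, labels, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)
```

Passing the centroids from one run as `seeds` for the next means a new batch of data usually needs only a few iterations, since most sources stay in the same group.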


Outlier: n = 1 (0.32% of connections)
• Extremely high services
• China

Dates: 8/11/02 - 8/17/02


[Figure: cluster plot with regions labeled Cluster 0 through Cluster 5.]

Cluster 0: n = 5207 (10.11% of connections)
• High Accept %; Mix Max Hits; Mix IP/Hit

Cluster 1: n = 2561 (17.16% of connections)
• High Drop %; Medium IP/Hit

Cluster 2: n = 7 (50.35% of connections)
• High Drop %; High Num Hits; High Num IPs; High Max Hits/Sec

Cluster 3: n = 180 (17.81% of connections)
• High Services and/or Max Hits/Sec; Mixed

Cluster 4: n = 4 (1.42% of connections)
• High Drop %; High Services; 94.5% of connections from Korea; 1 of 4 IPs from Korea; Average 23 sec between hits

Cluster 5: n = 5104 (2.82% of connections)
• High IP/Hit; High Drop %

Dates: 8/11/02 - 8/17/02


External Network Connector Classifications: Dashboard report

[Figure: dashboard with panels labeled Drop %, Service/Hit, IPs/Hit, Max Hit/Sec, IPs Scanned, Services Scanned, % of Sources, % Connections.]


External Network Connector Classifications: Outlier report

Src: 211.96.31.129
Country: CHINA
Org ID: SCH-CHENGDU-HUITEC

[Figure: outlier report charts. Recoverable labels: a 0.0 to 1.0 scale over categories 1 to 6; "40 Minutes"; a cluster legend (0 to 7); a plot with both axes running from -1.5 to 1.5; feature names Drop %, Service/Hit, IPs/Hit, Max Hit/Sec, IPs Scanned, Services Scanned.]