Reversible Sketches for Efficient and Accurate Change Detection over Network Data Streams. Robert Schweller, Ashish Gupta, Elliot Parsons, Yan Chen.

Post on 19-Dec-2015


1

Reversible Sketches for Efficient and Accurate Change Detection over Network Data Streams

Robert Schweller, Ashish Gupta, Elliot Parsons, Yan Chen

Computer Science Department, Northwestern University

2

Online Change Detection

• Network anomalies are common

– Flash crowds, failures, DoS, worms, …

Online Detection over Data Streams

• Data stream: key/update pairs (k, u)

– Heavy hitters (lots of prior work)

– Heavy changes

3

k-ary sketch [Krishnamurthy, Sen, Zhang, Chen, 2003]

– First to detect flow-level heavy changes in massive data streams at network traffic speeds

[Figure: H hash tables (rows 1, …, j, …, H), each with K buckets (columns 0, 1, …, K-1)]

4

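k-ary sketch [Krishnamurthy, Sen, Zhang, Chen, 2003]

[Figure: H hash tables (rows 1, …, j, …, H), each with K buckets (columns 0, 1, …, K-1); key k selects bucket h1(k), …, hj(k), …, hH(k) in the respective table]

Update (k, u): Tj[hj(k)] += u (for all j)

Estimate v(S, k): sum of updates for key k

$$v^{est}(S, k) = \operatorname{median}_{j} \left\{ \frac{T_j[h_j(k)] - \mathrm{sum}/K}{1 - 1/K} \right\}$$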
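The update and estimate rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `KarySketch`, its parameters, and the salted use of Python's built-in `hash` as a stand-in for the sketch's hash functions are all assumptions, and `total` plays the role of "sum" (the total of all updates recorded so far).

```python
import random
from statistics import median


class KarySketch:
    """Illustrative k-ary sketch: H hash tables with K buckets each."""

    def __init__(self, num_tables=5, num_buckets=4096, seed=0):
        rng = random.Random(seed)
        self.H, self.K = num_tables, num_buckets
        # One random salt per table; a salted built-in hash is only a
        # stand-in for the hash family a real deployment would use.
        self.salts = [rng.getrandbits(64) for _ in range(self.H)]
        self.tables = [[0.0] * self.K for _ in range(self.H)]
        self.total = 0.0  # "sum": total of all updates seen so far

    def _bucket(self, j, key):
        return hash((self.salts[j], key)) % self.K

    def update(self, key, u):
        # Update (k, u): T_j[h_j(k)] += u for all j
        for j in range(self.H):
            self.tables[j][self._bucket(j, key)] += u
        self.total += u

    def estimate(self, key):
        # median over j of (T_j[h_j(k)] - sum/K) / (1 - 1/K)
        K = self.K
        return median(
            (self.tables[j][self._bucket(j, key)] - self.total / K) / (1 - 1 / K)
            for j in range(self.H)
        )
```

Subtracting `total / K` removes the share of other keys' traffic expected to land in the same bucket, and the median across tables limits the effect of an unlucky collision in any single table.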

6

• Main problem

– Cannot efficiently report keys with heavy change

• Our contribution

– Determine the set of keys that have “large” estimates in the sketch

• Requires very little space:

– E.g. 5 hash tables with 16 K buckets = 80 KB

– Fits in high-speed memory

7

Reverse Sketch Problem

Input:

– Sketch

– Threshold

Output: Set of keys that hash to heavy buckets in a majority of (or all) hash tables

[Figure: hash tables 1 to 5, each with a bucket marked “Heavy”]
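To make the problem statement concrete, here is a brute-force version that simply tests every key. It is only a sketch under stated assumptions: the function names, the lookup-table hash functions, and the tiny key space are hypothetical choices for illustration, and scanning every key is exactly what becomes impractical for 32-bit IP addresses.

```python
import random


def make_hashes(num_tables, num_buckets, key_bits, seed=0):
    """One random lookup table per hash table (feasible only for small key spaces)."""
    rng = random.Random(seed)
    size = 1 << key_bits
    return [[rng.randrange(num_buckets) for _ in range(size)]
            for _ in range(num_tables)]


def reverse_sketch_bruteforce(heavy, hashes, key_bits, min_tables=None):
    """Return keys that hash to a heavy bucket in at least min_tables tables.

    heavy[j] is the set of heavy bucket indices in table j.
    By default a key must hit a heavy bucket in all tables.
    """
    needed = len(hashes) if min_tables is None else min_tables
    found = []
    for key in range(1 << key_bits):
        hits = sum(1 for j, h in enumerate(hashes) if h[key] in heavy[j])
        if hits >= needed:
            found.append(key)
    return found
```

Passing `min_tables=3` with five tables gives the "majority" variant of the output.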

8

Outline

[Figure: block diagram. Streaming data recording: key, value → k-ary sketch. Heavy change detection: k-ary sketch, change threshold → heavy change keys. Path labels: fast, slow.]

Algorithms to Improve Heavy Change Detection

• Modular hashing

• IP mangling

• Reverse hashing

9

Taking Intersections

• Intersect A1, A2, A3, A4, A5

H = 5, K = 2^12, #keys = 2^32 (IP addresses)

E[false positives] << 1

10

The problem with simple intersection

• Why is this difficult?

• Each set Ai can be very large!

H = 5, K = 2^12, #keys = 2^32 (IP addresses)

|A1| = 2^32 / 2^12 = 2^20
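The two figures quoted for this parameter setting can be checked with a short calculation. It assumes, for illustration only, that each hash function spreads keys uniformly and independently over the K buckets and that each table has a single heavy bucket.

```latex
% Size of one reverse-mapped set: keys per bucket under a uniform hash
|A_j| \approx \frac{\#\text{keys}}{K} = \frac{2^{32}}{2^{12}} = 2^{20}

% Expected number of other keys that land in the heavy bucket of all H = 5 tables
E[\text{false positives}] \approx \#\text{keys} \cdot \left(\frac{1}{K}\right)^{H}
  = 2^{32} \cdot \left(2^{-12}\right)^{5} = 2^{-28} \ll 1
```

So the intersection is almost always exact, but each of the five sets being intersected holds about a million keys.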

11

The problem with simple intersection

• Why is this difficult?

• Each set Ai can be very large!

• Solution: modular hashing

12

Modular hashing reduces the set size

[Figure: the 32-bit key 10010100 10101011 10010101 10100011, shown as four 8-bit words, is hashed by h() to the 12-bit value 010 110 001 101]

13

Modular hashing reduces the set size

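[Figure: the 32-bit key 10010100 10101011 10010101 10100011 is split into four 8-bit words; h1(), h2(), h3(), h4() hash the words to 010, 110, 001, 101, which are concatenated into the 12-bit value 010 110 001 101]

Greatly reduces size of reverse mapped sets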

14

Modular hashing reduces the set size

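[Figure: the 32-bit key 10010100 10101011 10010101 10100011 is split into four 8-bit words; h1(), h2(), h3(), h4() hash the words to 010, 110, 001, 101, which are concatenated into the 12-bit value 010 110 001 101]

Greatly reduces size of reverse mapped sets

2^8 / 2^3 = 2^5 candidate words per 3-bit output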
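The word-by-word scheme above can be sketched as follows. The names (`make_word_hashes`, `modular_hash`, `reverse_words`) are hypothetical, and the per-word hash functions are assumed to be balanced random lookup tables so that every 3-bit output has exactly 2^8 / 2^3 = 32 preimages; the slides do not say how h1() to h4() are actually chosen.

```python
import random

WORDS, WORD_BITS, OUT_BITS = 4, 8, 3  # 32-bit key -> 12-bit bucket index


def make_word_hashes(seed=0):
    """One balanced random 8-bit -> 3-bit table per word (illustrative choice)."""
    rng = random.Random(seed)
    tables = []
    for _ in range(WORDS):
        # Each 3-bit output value appears exactly 32 times before shuffling.
        table = [v % (1 << OUT_BITS) for v in range(1 << WORD_BITS)]
        rng.shuffle(table)
        tables.append(table)
    return tables


def modular_hash(key, tables):
    """Hash each 8-bit word separately and concatenate the 3-bit results."""
    bucket = 0
    for i, table in enumerate(tables):
        shift = WORD_BITS * (WORDS - 1 - i)
        word = (key >> shift) & 0xFF
        bucket = (bucket << OUT_BITS) | table[word]
    return bucket


def reverse_words(bucket, tables):
    """For each word position, list the 8-bit words that map to the bucket's 3-bit chunk."""
    result = []
    for i, table in enumerate(tables):
        shift = OUT_BITS * (WORDS - 1 - i)
        chunk = (bucket >> shift) & ((1 << OUT_BITS) - 1)
        result.append([w for w in range(1 << WORD_BITS) if table[w] == chunk])
    return result
```

The reverse-mapped set of a bucket is then stored implicitly as four lists of 32 words each, rather than as an explicit list of 2^20 keys.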

15

Modular hashing reduces the set size

[Figure: hash tables 1 to 5 with heavy buckets b1, b2, b3, b4, b5]

A1: 2^5 × 2^5 × 2^5 × 2^5

Intersection: only 32 elements per partition

16

Modular hashing reduces the set size

[Figure: hash tables 1 to 5 with heavy buckets b1, b2, b3, b4, b5]

A1: 2^5 × 2^5 × 2^5 × 2^5

A2: 2^5 × 2^5 × 2^5 × 2^5

Intersection: only 32 elements per partition
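Because each reverse-mapped set is a product of four small word sets, the intersection across tables can be taken one word position at a time. The following is a minimal sketch for the case of one heavy bucket per table; the helper names and the balanced lookup-table hashes are assumptions for illustration, not the algorithm from the tech report.

```python
import random
from itertools import product

H, WORDS, WORD_BITS, OUT_BITS = 5, 4, 8, 3


def make_tables(seed):
    """Balanced random 8-bit -> 3-bit lookup table for each of the 4 words."""
    rng = random.Random(seed)
    tables = []
    for _ in range(WORDS):
        t = [v % (1 << OUT_BITS) for v in range(1 << WORD_BITS)]
        rng.shuffle(t)
        tables.append(t)
    return tables


def bucket_of(key, tables):
    """Modular hash: concatenate the 3-bit hashes of the four 8-bit words."""
    b = 0
    for i, t in enumerate(tables):
        word = (key >> (WORD_BITS * (WORDS - 1 - i))) & 0xFF
        b = (b << OUT_BITS) | t[word]
    return b


def reverse_single(buckets, all_tables):
    """Keys that hash to buckets[j] in every table j, found word by word."""
    per_word = []
    for i in range(WORDS):
        shift = OUT_BITS * (WORDS - 1 - i)
        candidates = set(range(1 << WORD_BITS))
        for b, tables in zip(buckets, all_tables):
            chunk = (b >> shift) & ((1 << OUT_BITS) - 1)
            # Keep only words whose hash in this table matches the chunk.
            candidates &= {w for w in candidates if tables[i][w] == chunk}
        per_word.append(sorted(candidates))
    keys = []
    for words in product(*per_word):  # recombine surviving words into keys
        key = 0
        for w in words:
            key = (key << WORD_BITS) | w
        keys.append(key)
    return keys


all_tables = [make_tables(seed) for seed in range(H)]
key = 0x94AB95A3
buckets = [bucket_of(key, t) for t in all_tables]
recovered = reverse_single(buckets, all_tables)
```

Each per-word intersection touches sets of at most 32 words, so the work is tiny compared with intersecting five explicit sets of 2^20 keys.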

17

Handling Multiple Intersections…

[Figure: hash tables 1 to 5 with two sets of heavy buckets b1, b2, b3, b4, b5]

2^H different intersections

Much more difficult: need sophisticated reverse hashing algorithms (see tech report)

18

Problem: Too many collisions

129.105.56.23, 129.105.56.28, 129.105.56.109, 129.105.56.35, 129.105.56.98, ...

7 . 4 . 0 . *

[Figure labels: 32 bits (addresses), 12 bits (hash values)]

19

Problem: Too many collisions

129.105.56.23, 129.105.56.28, 129.105.56.109, 129.105.56.35, 129.105.56.98, ...

7 . 4 . 0 . *

[Figure labels: 32 bits (addresses), 12 bits (hash values)]

Solution: IP mangling

20

IP-mangling

21

Invertible Modular Linear Equation

f(x) ≡ a·x mod n

To be invertible: a and n must be relatively prime

• a is odd, chosen randomly
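A minimal sketch of such an invertible mangling function is shown below. It assumes n = 2^32 (the 32-bit IP address space), which is why an odd multiplier is relatively prime to n; the function name `make_mangler` is hypothetical. The inverse is computed with Python's built-in `pow(a, -1, n)`, available from Python 3.8.

```python
import ipaddress
import random

N = 1 << 32  # assumed modulus: the 32-bit IPv4 key space


def make_mangler(seed=None):
    """Return (mangle, unmangle) for f(x) = a*x mod N with a random odd a."""
    rng = random.Random(seed)
    a = rng.randrange(1, N, 2)  # odd, so gcd(a, N) == 1 when N is a power of two
    a_inv = pow(a, -1, N)       # modular inverse of a

    def mangle(x):
        return (a * x) % N

    def unmangle(y):
        return (a_inv * y) % N

    return mangle, unmangle


mangle, unmangle = make_mangler(seed=7)
x = int(ipaddress.IPv4Address("129.105.56.23"))
y = mangle(x)          # mangled key that would be fed to the modular hash
restored = unmangle(y) # reverse IP mangling recovers the original address
```

Keys are mangled before hashing and unmangled after reverse hashing, so the reported heavy-change keys are the original addresses.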

22

Modular Hashing

Optimal Hashing

23

Modular Hashing

Modular Hashing with IP Mangling

Optimal Hashing

24

Recap:

[Figure: block diagram. Streaming data recording: key → IP mangling → modular hashing → reversible k-ary sketch; value → stored value. Heavy change detection: reversible k-ary sketch, change threshold → reverse hashing → reverse IP mangling → heavy change keys. Annotations: $(n^{1/\log\log n})$ and $\left(\frac{\log n}{\log\log n}\right)$.]

25

Evaluation

• Traffic traces from Northwestern University edge router

– 5-minute intervals, averaging 7.5 GB of traffic per interval

• Compared with ground truth

• 6 hash tables, 4K buckets each, 192 KB of memory in total

• Up to 140 true heavy change keys in 1.5 seconds

– Over 95% TPP

– Less than 2% FPP

• All missing changes are due to boundary effects

26

Conclusions / Future Work

• Sketches: efficient summary structures

• Our contribution: Reversible Sketches

– Efficient online detection of keys with heavy changes

Work in Progress (see tech report)

• Improved reverse hashing

• Statistical guarantee on detection accuracy

• More advanced applications:

– Hierarchical change detection

• E.g. 129.105.100.* shows a big change!

27

See tech report for more!

http://list.cs.northwestern.edu

Thank you!
