
Sep 03, 2018


Network Data Streaming – A Computer Scientist’s Journey in Signal Processing

Jun (Jim) Xu
Networking and Telecommunications Group
College of Computing
Georgia Institute of Technology

Joint work with:
Abhishek Kumar, Qi Zhao, Minho Sung, Jun Li, Ellen Zegura, Georgia Tech
Jia Wang, Olivier Spatscheck, AT&T Labs - Research
Li Li, Bell Labs

Networking area Qualifying Exam: Oral Presentation


Outline

• Motivation and introduction

• Our six representative data streaming works

– 3 in Single-node single-stream data streaming (like SISD)

– 1 in Distributed Collaborative Data Streaming (like SIMD)

– 2 in Distributed Coordinated Data Streaming (like MIMD)


Motivation for network data streaming

Problem: to monitor network links for quantities such as

• Elephant flows (traffic engineering, billing)

• Number of distinct flows, average flow size (queue management)

• Flow size distribution (anomaly detection)

• Per-flow traffic volume (anomaly detection)

• Entropy of the traffic (anomaly detection)

• Other “unlikely” applications: traffic matrix estimation, P2P routing, IP traceback


The challenge of high-speed monitoring

• Monitoring at high speed is challenging

– packets arrive every 25ns on a 40 Gbps (OC-768) link

– per-packet processing has to use SRAM

– per-flow state is too large to fit into SRAM

• Traditional solution using sampling:

– Sample a small percentage of packets

– Process these packets using per-flow state stored in slow memory (DRAM)

– Use some type of scaling to recover the original statistics, hence high inaccuracy at low sampling rates

– Fighting a losing cause: higher link speed requires lower sampling rate
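
As a rough check of the timing figure above, the inter-arrival time on a fully loaded link is the packet size in bits divided by the line rate. The packet sizes below are assumptions chosen for illustration (the slide does not state one), and `interarrival_ns` is a hypothetical helper name. Under a nominal 40 Gbps rate, a 125-byte packet gives the quoted 25 ns.

```python
def interarrival_ns(packet_bytes: int, link_gbps: float) -> float:
    """Time between back-to-back packets on a fully loaded link, in nanoseconds."""
    bits = packet_bytes * 8
    return bits / link_gbps  # bits / (Gbit/s) comes out in nanoseconds


for size in (40, 125, 1500):  # illustrative packet sizes, not from the slides
    print(f"{size:5d} bytes -> {interarrival_ns(size, 40.0):6.1f} ns")
```

Smaller packets shorten the budget further, which is why per-packet work has to fit in a handful of fast-memory accesses.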


Network data streaming – a smarter solution

• Computational model: process a long stream of data (packets) in one pass using a small (yet fast) memory

• Problem to solve: need to answer some queries about the stream at the end or continuously

• Trick: try to remember the most important information about the stream pertinent to the queries, and learn to forget unimportant things

• Comparison with sampling: streaming peruses every piece of data for the most important information, while sampling digests a small percentage of the data and absorbs all information therein.


The “hello world” data streaming problem

• Given a long stream of data (say packets), count the number of distinct elements in it

• For example, in a, b, c, a, c, b, d, a this number is 4

• Think about trillions of packets belonging to billions of flows...

• A simple algorithm: choose a hash function $h$ with range (0, 1)

• $\hat{X} := 1/\min(h(d_1), h(d_2), \ldots)$

• We can prove $\hat{X}$ is an unbiased estimator

• Then average hundreds of such $\hat{X}$ to get an accurate result
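
The idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes SHA-256 with a seed prefix as a stand-in for independent hash functions, and the names `unit_hash` and `estimate_distinct` are hypothetical. It also swaps in a common variant of the averaging step: it averages the minima across hash functions and then inverts, using the fact that the minimum of n independent uniform values has expectation 1/(n+1).

```python
import hashlib


def unit_hash(item: str, seed: int) -> float:
    """Map (seed, item) to a pseudo-random float in [0, 1) via SHA-256."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_distinct(stream, num_hashes: int = 500) -> float:
    """One-pass distinct-count estimate keeping one minimum per hash function."""
    minima = [1.0] * num_hashes
    for item in stream:
        for seed in range(num_hashes):
            value = unit_hash(item, seed)
            if value < minima[seed]:
                minima[seed] = value
    mean_min = sum(minima) / num_hashes
    # For n distinct items E[min] = 1/(n+1), so invert the averaged minimum.
    return 1.0 / mean_min - 1.0


print(estimate_distinct(["a", "b", "c", "a", "c", "b", "d", "a"]))
```

Memory use is one number per hash function, regardless of stream length, and repeated elements never change a minimum, which is why duplicates are ignored for free.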


Data Streaming Algorithm for Estimating Flow Size Distribution [Sigmetrics04]

• Problem: To estimate the probability distribution of flow sizes. In other words, for each positive integer $i$, estimate $n_i$, the number of flows of size $i$.

• Applications: Traffic characterization and engineering, network billing/accounting, anomaly detection, etc.

• Importance: The mother of many other flow statistics such as average flow size (first moment) and flow entropy

• Definition of a flow: All packets with the same flow-label. The flow-label can be defined as any combination of fields from the IP header, e.g., <Source IP, Source Port, Dest. IP, Dest. Port, Protocol>.

• Existing sampling-based work is not very accurate.


Our approach: network data streaming

• Design philosophy: “Lossy data structure + Bayesian statistics = Accurate streaming”

– Information loss is unavoidable: (1) memory is very small compared to the data stream; (2) there is too little time to put data into the “right place”

– Control the loss so that Bayesian statistical techniques such as Maximum Likelihood Estimation can still recover a decent amount of information.


Architecture of our Solution — Lossy data structure

• Measurement proceeds in epochs (e.g. 100 seconds).

• Maintain an array of counters in fast memory (SRAM).

• For each packet, a counter is chosen via hashing, and incremented.

• No attempt to detect or resolve collisions.

• Each 32-bit counter uses only 9 bits of SRAM (due to [Ramabhadran & Varghese 2003])

• Data collection is lossy (erroneous), but very fast.
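
The update path described above can be sketched as follows. This is an illustration under assumptions: SHA-256 stands in for a fast hardware hash, the flow labels are made up, and the helper names `counter_index` and `run_epoch` are hypothetical. The 9-bit counter compression is not modeled.

```python
import hashlib


def counter_index(flow_label: tuple, num_counters: int, seed: int = 0) -> int:
    """Hash a flow label to a counter index (SHA-256 stands in for a hardware hash)."""
    digest = hashlib.sha256(f"{seed}|{flow_label}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_counters


def run_epoch(packets, num_counters: int) -> list[int]:
    """Process one epoch: one hash and one increment per packet, no collision handling."""
    counters = [0] * num_counters
    for flow_label in packets:
        counters[counter_index(flow_label, num_counters)] += 1
    return counters


# Hypothetical packets, each identified by its 5-tuple flow label.
flow_a = ("10.0.0.1", 1234, "10.0.0.2", 80, "TCP")
flow_b = ("10.0.0.3", 5353, "10.0.0.4", 53, "UDP")
counters = run_epoch([flow_a, flow_b, flow_a, flow_a], num_counters=1024)
print(sum(counters), max(counters))
```

Because flows that hash to the same slot are simply added together, a counter value is the sum of the sizes of all flows mapped to it. Undoing that mixing is left to the offline estimation step.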


The shape of the “Counter Value Distribution”

[Figure: log-log plot of frequency (y axis) versus flow size (x axis). Curves: actual flow distribution, and raw counter values for m = 1024K, 512K, 256K, and 128K.]

The distribution of flow sizes and raw counter values (both x and y axes are in log scale). m = number of counters.


Estimating $n$ and $n_1$

• Let the total number of counters be $m$.

• Let the number of value-0 counters be $m_0$

• Then $\hat{n} = m \ln(m/m_0)$

• Let the number of value-1 counters be $y_1$

• Then $\hat{n}_1 = y_1 e^{\hat{n}/m}$

• Generalizing this process to estimate $n_2$, $n_3$, and the whole flow size distribution will not work

• Solution: joint estimation using Expectation Maximization
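
The two closed-form estimators above can be tried on synthetic data. The sketch below assumes a made-up workload (1200 flows of size 1 and 800 flows of size 3) and models the hash with a seeded random slot choice. The function names `estimate_n` and `estimate_n1` are hypothetical. The intuition for the second formula is that a size-1 flow leaves a counter at value 1 only when it lands alone, which happens with probability of roughly $e^{-n/m}$.

```python
import math
import random


def estimate_n(counters: list[int]) -> float:
    """n_hat = m * ln(m / m0), where m0 is the number of zero-valued counters."""
    m = len(counters)
    m0 = sum(1 for c in counters if c == 0)
    return m * math.log(m / m0)


def estimate_n1(counters: list[int]) -> float:
    """n1_hat = y1 * exp(n_hat / m), where y1 is the number of counters equal to 1."""
    m = len(counters)
    y1 = sum(1 for c in counters if c == 1)
    return y1 * math.exp(estimate_n(counters) / m)


# Synthetic epoch (assumed workload): 1200 flows of size 1 and 800 flows of size 3.
rng = random.Random(7)
m = 4096
flow_sizes = [1] * 1200 + [3] * 800
counters = [0] * m
for size in flow_sizes:
    counters[rng.randrange(m)] += size  # uniform random slot models the hash
print(round(estimate_n(counters)), round(estimate_n1(counters)))
```

The printed values should land near the true 2000 flows and 1200 size-1 flows. Note that `estimate_n` divides by `m0`, so it is undefined if no counter is empty.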


Estimating the entire distribution, $\phi$, using EM

• Begin with a guess of the flow distribution, $\phi^{ini}$.

• Based on this $\phi^{ini}$, compute the various possible ways of “splitting” a particular counter value and the respective probabilities of such events.

• This allows us to compute a refined estimate of the flow distribution, $\phi^{new}$.

• Repeating this multiple times allows the estimate to converge to a local maximum.

• This is an instance of Expectation Maximization.
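
The “ways of splitting” a counter value are its integer partitions: the multisets of positive flow sizes that add up to the value. A small enumeration sketch follows. The function name `splits` is hypothetical, and this shows only the enumeration step, not the probability computation or the full EM loop.

```python
from typing import Optional


def splits(value: int, largest: Optional[int] = None) -> list[list[int]]:
    """All multisets of positive flow sizes summing to `value`, largest part first."""
    if largest is None:
        largest = value
    if value == 0:
        return [[]]
    result = []
    for part in range(min(value, largest), 0, -1):
        for rest in splits(value - part, part):
            result.append([part] + rest)
    return result


print(splits(3))  # [[3], [2, 1], [1, 1, 1]]
```

The number of partitions grows quickly with the counter value, so a practical estimator would presumably need to bound or approximate this enumeration for large counters.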


Estimating the entire flow distribution — an example

• For example, a counter value of 3 could be caused by three events:

– 3 = 3 (no hash collision);

– 3 = 1 + 2 (a flow of size 1 colliding with a flow of size 2);

– 3 = 1 + 1 + 1 (three flows of size 1 hashed to the same location)

• Suppose the probabilities of these three events are 0.5, 0.3, and 0.2 respectively, and there are 1000 counters with value 3.

• Then we estimate that 500, 300, and 200 counters split in the three above ways, respectively.

• So we credit 300 * 1 + 200 * 3 = 900 to $n_1$, the count of size-1 flows, and credit 300 and 500 to $n_2$ and $n_3$, respectively.
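
The crediting arithmetic in this example can be reproduced directly. The sketch below takes the split probabilities as given (0.5, 0.3, 0.2, as in the example) rather than deriving them from a flow distribution, and `credit_flows` is a hypothetical helper name.

```python
from collections import Counter


def credit_flows(num_counters: int, split_probs: dict[tuple[int, ...], float]) -> Counter:
    """Distribute counters over split patterns and tally the implied flows by size."""
    credits = Counter()
    for pattern, prob in split_probs.items():
        expected_counters = num_counters * prob
        for flow_size in pattern:
            credits[flow_size] += expected_counters
    return credits


# Probabilities taken from the example above for counters of value 3.
split_probs = {(3,): 0.5, (1, 2): 0.3, (1, 1, 1): 0.2}
credits = credit_flows(1000, split_probs)
print({size: round(count) for size, count in sorted(credits.items())})
# {1: 900, 2: 300, 3: 500}
```

Summing such credits over every observed counter value yields the refined flow size estimate for one iteration.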


Evaluation — Before and after running the Estimation algorithm

[Figure: log-log plot of frequency (y axis) versus flow size (x axis). Curves: actual flow distribution, raw counter values, and estimation using our algorithm.]


Sampling vs. array of counters – Web traffic.

[Figure: log-log plot of frequency (y axis) versus flow size (x axis). Curves: actual flow distribution, inferred from sampling (N=10), inferred from sampling (N=100), and estimation using our algorithm.]


Sampling vs. array of counters – DNS traffic.

[Figure: log-log plot of frequency (y axis) versus flow size (x axis). Curves: actual flow distribution, inferred from sampling (N=10), inferred from sampling (N=100), and estimation using our algorithm.]


Extending the work to estimating subpopulation FSD [Sigmetrics 05]

• Motivation: there is often a need to estimate the FSD of a subpopulation (e.g., “what is the FSD of all the DNS traffic”).

• Definitions of subpopulations are not known in advance, and there can be a large number of potential subpopulations.

• Our scheme can estimate the FSD of any subpopulation defined after data collection.

• Main idea: perform both streaming and sampling, and then correlate these two outputs.


Space Code Bloom Filter for per-flow measurement [IMC03, Infocom04]

Problem: To keep track of the total number of packets belonging to each flow at a high-speed link.

Applications: Network billing and anomaly detection

Challenges: traditional techniques such as sampling will not work at high link speed.

Our solution: SCBF encodes the frequency of elements in a multiset like BF encodes the existence of elements in a set.
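
For readers unfamiliar with the analogy, here is a minimal sketch of a plain Bloom filter, the set-membership structure that SCBF generalizes. This is not SCBF itself. The class layout, the SHA-256 based hashing, and the parameter values are assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: set membership with false positives but no false negatives."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item: str):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(num_bits=4096, num_hashes=4)
for flow in ("flow-a", "flow-b"):
    bf.add(flow)
print("flow-a" in bf, "flow-b" in bf)  # True True
```

A plain filter like this can only answer whether a flow was seen. Recording how many packets each flow sent is the extra capability the slide attributes to SCBF.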


Distributed Collaborative Data Streaming

[Figure: architecture diagram. Monitoring stations 1 through m each run a Data Collection Module fed by a packet stream, and a Central Monitoring Station runs a Data Analysis Module. Labeled steps: 1. Update; 2. Transmit digest; 3. Compose bitmap; 4. Analyze & sound alarm. Other labels: “1 x n”, “m x n”, “Feedback to stations”, “Valued customers”.]


Application to Traffic Matrix Estimation [Sigmetrics05]

• Traffic matrix quantifies the traffic volume between origin/destination pairs in a network.

• Accurate estimation of traffic matrix $T_{i,j}$ in a high-speed network is very challenging.

• Our solution based on distributed collaborative data streaming:

– Each ingress/egress node maintains a synopsis data structure (cost < 1 bit per packet).

– Correlating data structures generated by nodes $i$ and $j$ allows us to obtain $T_{i,j}$.

– Average accuracy around 1%, which is about one order of magnitude better than the current approaches.
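
The slide does not say how the two synopses are correlated. One plausible way, sketched here purely as an assumption, is for both nodes to keep a bitmap indexed by a shared hash of each packet, and to estimate the packets seen at both nodes by inclusion-exclusion over distinct-count estimates of the form $m \ln(m/\text{zeros})$. All names, sizes, and the synthetic traffic below are hypothetical and untuned.

```python
import hashlib
import math


def make_bitmap(packet_ids, num_bits: int) -> list[int]:
    """Set one bit per packet, chosen by hashing the packet identifier."""
    bitmap = [0] * num_bits
    for pid in packet_ids:
        digest = hashlib.sha256(str(pid).encode()).digest()
        bitmap[int.from_bytes(digest[:8], "big") % num_bits] = 1
    return bitmap


def distinct_estimate(bitmap: list[int]) -> float:
    """Estimate the number of distinct packets as m * ln(m / zeros)."""
    m = len(bitmap)
    zeros = m - sum(bitmap)
    return m * math.log(m / zeros)


def common_packets(bitmap_i: list[int], bitmap_j: list[int]) -> float:
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    union = [a | b for a, b in zip(bitmap_i, bitmap_j)]
    return distinct_estimate(bitmap_i) + distinct_estimate(bitmap_j) - distinct_estimate(union)


# Synthetic traffic: packets 0..2999 enter at node i, packets 2000..4999 leave at node j.
ingress = make_bitmap(range(0, 3000), num_bits=16384)
egress = make_bitmap(range(2000, 5000), num_bits=16384)
print(round(common_packets(ingress, egress)))
```

In this synthetic setup the true overlap is 1000 packets, so the printed estimate should fall near that value. The bitwise OR only makes sense because both nodes use the same hash function and bitmap size.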


Distributed coordinated data streaming – a new paradigm

• A network of streaming nodes

• Every node is both a producer and a consumer of data streams

• Every node exchanges data with neighbors, “streams” the data received, and passes it on further

• We applied this kind of data streaming to two unlikely network applications: (1) P2P routing [Infocom05] and (2) IP traceback [IEEE S&P04].


Related publications

• Kumar, A., Xu, J., Zegura, E. “Efficient and Scalable Query Routing for Unstructured Peer-to-Peer Networks”, to appear in Proc. of IEEE Infocom 2005.

• Kumar, A., Sung, M., Xu, J., Wang, J. “Data Streaming Algorithms for Efficient and Accurate Estimation of Flow Distribution”, in Proc. of ACM Sigmetrics 2004/IFIP WG 7.3 Performance 2004, Best Student Paper Award.

• Li, J., Sung, M., Xu, J., Li, L. “Large-Scale IP Traceback in High-Speed Internet: Practical Techniques and Theoretical Foundation”, in 2004 IEEE Symp. on Security & Privacy.

• Kumar, A., Xu, J., Wang, J., Spatscheck, O., Li, L. “Space-Code Bloom Filter for Efficient Per-Flow Traffic Measurement”, in Proc. of IEEE Infocom 2004.


Related publications (continued)

• Zhao, Q., Kumar, A., Wang, J., Xu, J. “Data Streaming Algorithms for Accurate and Efficient Measurement of Traffic and Flow Matrices”, to appear in Proc. of ACM SIGMETRICS 2005.

• Kumar, A., Sung, M., Xu, J., Zegura, E. “A Data Streaming Algorithm for Estimating Subpopulation Flow Size Distribution”, to appear in Proc. of ACM SIGMETRICS 2005.
