Multisite Internet Data Analysis
Alfred O. Hero, Clyde Shih, David Barsic
University of Michigan - Ann Arbor
[email protected] http://www.eecs.umich.edu/~hero
Research supported in part by: NSF CCR-0325571

1. Network Data Collection
2. Distributed Data Analysis
   2.1 Dimension Reduction
   2.2 Model-Based Data Analysis
3. Conclusions
1. Network Data Collection
– Global: monitoring centers aggregate statistics from sites distributed around the network to detect, classify, or estimate the global network state while respecting information privacy constraints
– Local: collection sites gather data relevant to the local network state and share information as necessary to enhance local analysis
• Types of data measured
– Active: queries and requests, packet probes
– Passive: netflow, router fields, honeypots, backscatter
[Figure: network diagram spanning ISP 1, ISP 2, and ISP 3. Labels: local data collection and probing site, monitoring center, data collection site, data collector]
Abilene Netflow Data
[Figure: statistics by protocol for Dataset 1 and Dataset 2. Variables: No. Flows, Avg. Duration, Std. Duration, Avg. Packets, Std. Packets, Avg. Bytes, Std. Bytes]
Abilene Netflow Data
[Figure: statistics by router for Dataset 1 and Dataset 2. Variables: No. Flows, Avg. Duration, Std. Duration, Avg. Packets, Std. Packets, Avg. Bytes, Std. Bytes]
Abilene Netflow Data
[Figure: Total Number of Flows for Data Set 1. x-axis: time in sec. (0 to 200); y-axis: number of flows (5.5 to 8, x 10^4)]
Challenges and Approaches
• Challenges
– High dimensional measurement space
– Non-linear dependencies and non-stationarity
– Privacy and proprietary concerns
– Insufficient bandwidth for continuously sampled data
• Approaches
– Dimension reduction
– Model-based distributed inference
– Controlled information sharing
– Hierarchical and modular collection/analysis
Hierarchical Architecture
2. Distributed Data Analysis
• Hypothesis: data collected at sites A, B, C follow a statistical distribution defined over a lower dimensional manifold.
• Overall objective: find distributed strategies that perform reliable statistical inference with a minimum amount of data sharing
[Figure: Site A, Site B, and Site C, each sampling the data]
2.1 Distributed Dimension Reduction
[Figure: an unknown distribution on an unknown manifold is mapped through an unknown embedding to the observed sample]
Geodesic Entropic Graphs
[Figure: A Planar Sample and its Euclidean MST]
[Figure: MST Length for 3 Land Vehicles (=1, m=20). x-axis: n; y-axis: Ln]
GMST Estimates: d=13, H=120 (bits)
Distributed GMST Estimator
• Principal MST convergence result:
• Distributed BHH (Aggregation rule):
• Tight upper and lower bounds on the limit are available if sites exchange rooted dual graphs [Yukich:97]
BHH Theorem:
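The BHH (Beardwood-Halton-Hammersley) scaling can be illustrated numerically. For n points sampled on a d-dimensional set, the MST length with edge exponent gamma grows like n^((d-gamma)/d), so the slope of log(Ln) against log(n) gives a dimension estimate d = gamma / (1 - slope). The sketch below is only an illustration under stated assumptions: the sample is a flat 2-D sheet embedded isometrically in 3-D (so Euclidean and geodesic distances coincide and no geodesic graph is needed), and the names `mst_length`, `estimate_dimension`, and `sample_plane` are hypothetical, not taken from the slides.

```python
import math
import random

def mst_length(points, gamma=1.0):
    """Sum of (edge length)**gamma over the Euclidean MST, via O(n^2) Prim."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # shortest known distance from each point to the tree
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        total += best[u] ** gamma
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return total

def estimate_dimension(sizes, lengths, gamma=1.0):
    """Least-squares slope of log(Ln) vs log(n), then d = gamma / (1 - slope)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(length) for length in lengths]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return gamma / (1.0 - slope)

rng = random.Random(0)

def sample_plane(n):
    """Assumed toy data: uniform points on a flat 2-D sheet embedded in 3-D."""
    pts = []
    for _ in range(n):
        u, v = rng.random(), rng.random()
        pts.append((u, v / math.sqrt(2), v / math.sqrt(2)))
    return pts

sizes = [100, 200, 400, 800]
lengths = [mst_length(sample_plane(n)) for n in sizes]
d_hat = estimate_dimension(sizes, lengths)
print("estimated intrinsic dimension:", round(d_hat, 2))
```

The ambient space here has three coordinates, but the growth rate of the MST length reflects the two-dimensional sheet the points actually lie on. Finite-sample boundary effects bias the slope slightly, so the estimate is approximate.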
2.2 Distributed Model-based Inference
• Global likelihood model
• Global M-estimator recursion:
– Global Fisher score function
• Local Fisher score functions
Distributed M-estimator
[Figure: flow diagram for sites A and B; each site repeats a Compute step and then sets k=k+1]
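A minimal sketch of one way such a recursion can be organized, assuming independent data across sites so that the global log-likelihood, and hence the global Fisher score, is the sum of the local ones. The model (exponentially distributed observations with unknown rate), the Fisher-scoring update, and the function names are assumptions chosen for illustration; the slides do not specify them.

```python
import random

def local_score_and_info(samples, rate):
    """Local Fisher score and Fisher information for an exponential(rate) model."""
    n = len(samples)
    score = n / rate - sum(samples)   # derivative of the local log-likelihood
    info = n / rate ** 2              # local Fisher information
    return score, info

def distributed_fisher_scoring(sites, rate0, iters=50, tol=1e-12):
    """Each update uses only the per-site (score, info) pairs, never raw data."""
    rate = rate0
    for _ in range(iters):
        messages = [local_score_and_info(s, rate) for s in sites]
        global_score = sum(m[0] for m in messages)  # sum of local scores
        global_info = sum(m[1] for m in messages)
        step = global_score / global_info
        rate += step
        if abs(step) < tol:
            break
    return rate

rng = random.Random(1)
true_rate = 2.0  # hypothetical value, used only to simulate data
sites = [[rng.expovariate(true_rate) for _ in range(200)] for _ in range(3)]

# Start from site 0's local MLE, then refine with the shared scores.
rate0 = len(sites[0]) / sum(sites[0])
rate_hat = distributed_fisher_scoring(sites, rate0)
print("distributed estimate:", rate_hat)
```

In this toy model the recursion converges to the same value a single center would obtain from the pooled data, while each site transmits only two numbers per update.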
Properties
• Communication requirement: 2p bytes/update/site.
• If the data are independent, the recursion attains stationary points of the global likelihood
• All local MLE’s are available to each site.
• For multimodal likelihood, improvement on local MLE’s can be achieved by aggregation under mixture model.
[Figure: Global Likelihood Function, with the global maximum and local maxima marked; the local MLE's are plotted as x marks along the parameter axis]
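Key Theoretical Result
• The asymptotic distribution of local estimates is a Gaussian mixture dependent on global likelihood

f_{\hat{\theta}}(t) = \sum_{m=0}^{M} \frac{p_m}{(2\pi)^{K/2} |C_m|^{1/2}} \exp\left( -\frac{1}{2} (t - \theta_m)^T C_m^{-1} (t - \theta_m) \right)

• Parameters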
Proof: asymptotic normal theory of local maxima (Huber:67): see Blatt&Hero:2003
Local Estimator Aggregation Algorithm
[Block diagram: Estimator 1, Estimator 2, ..., Estimator N feed Estimation of Gaussian Mixture Parameters (FS, EM…), followed by Aggregation to Final Estimate; a Sample Covariance Analysis block is also shown]
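The aggregation idea can be sketched as follows, under loudly stated assumptions. Each site reports a scalar local estimate together with its average log-likelihood value; the estimates are split into two clusters with a 1-D two-means pass, which stands in here for the mixture-parameter estimation step (FS, EM) and is not the CEM2 algorithm; the cluster with the higher mean log-likelihood is then averaged. The selection rule, the function names, and the toy reports are all hypothetical.

```python
def two_means_1d(values, iters=100):
    """Plain 1-D two-means clustering; returns the two cluster centers."""
    centers = [min(values), max(values)]
    for _ in range(iters):
        groups = ([], [])
        for v in values:
            k = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[k].append(v)
        new_centers = [sum(g) / len(g) if g else centers[i]
                       for i, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def aggregate(reports):
    """reports: list of (local_estimate, local_avg_loglik) pairs, one per site."""
    centers = two_means_1d([est for est, _ in reports])
    clusters = ([], [])
    for est, loglik in reports:
        k = 0 if abs(est - centers[0]) <= abs(est - centers[1]) else 1
        clusters[k].append((est, loglik))
    nonempty = [c for c in clusters if c]
    # Assumed rule: trust the cluster whose members report the higher likelihood.
    best = max(nonempty, key=lambda c: sum(ll for _, ll in c) / len(c))
    return sum(est for est, _ in best) / len(best)

# Toy reports (made up for illustration): four sites converged near 1.0,
# two sites were trapped near 2.5 with lower likelihood values.
reports = [(1.02, -1.40), (0.97, -1.42), (2.51, -1.65),
           (1.01, -1.39), (2.47, -1.70), (0.99, -1.41)]
final_estimate = aggregate(reports)
print("aggregated estimate:", final_estimate)
```

Averaging within the selected cluster, rather than over all sites, keeps estimates trapped at a local maximum from pulling the final estimate away from the dominant mode.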
Simple Example
[Figure: ambiguity function, with the local maximum and the global maximum marked]
IID Observation Model:
• Each site observes 2 component Gaussian mixture
• Identical component variances
• Unknown mixing parameters
• Unknown component means
• 200 data collection sites
• 100 samples/site
• CEM2 algorithm implemented for estimation and aggregation
Clustering and Discrimination
[Figure: axes labeled 1 and 2, each running from 0 to 3. Labeled items: global maximum with inverse FIM, local maximum, and empirically estimated covariances via CEM2]
Validation of Key Result
[Figure: QQ plots for Cluster 1 and Cluster 2]
Conclusions
• Lossless distributed dimension reduction and model-based inference require:
– Reliable local inference methods
– Aggregation rules for combining local statistics
• Information sharing constraints?
• Effects of bandwidth constraints - data compression?
• Tracking in dynamical models?
References
• A. O. Hero, B. Ma, O. Michel and J. D. Gorman, "Application of entropic graphs," IEEE Signal Processing Magazine, Sept. 2002.
• J. Costa and A. O. Hero, "Manifold learning with geodesic minimal spanning trees," accepted in IEEE T-SP (Special Issue on Machine Learning), 2004.
• D. Blatt and A. O. Hero, "Asymptotic distribution of log-likelihood maximization based algorithms and applications," in Energy Minimization Methods in Computer Vision and Pattern Recognition (EMM-CVPR), Eds. M. Figueiredo, A. Rangarajan, J. Zerubia, Springer-Verlag, 2003.
• M. F. Shih and A. O. Hero, "Unicast-based inference of network link delay distributions using mixed finite mixture models," IEEE T-SP, vol. 51, no. 9, pp. 2219-2228, Aug. 2003.
• N. Patwari, A. O. Hero, and B. Sadler, "Hierarchical censoring sensors for change detection," Proc. of SSP, St. Louis, Sept. 2003.
Information Sharing Game
Addition of other Discriminants
[Figure: 3-D plot. Horizontal axes labeled 1 and 2, each running from 0 to 3; vertical axis: log f(y_i; 1, 2) - E{ log f(y_i; 1, 2) }, running from -3 to 1]
Value-added due to transmission of likelihood values