Decentralized Real-Time Monitoring of Network-Wide Aggregates
Rolf Stadler
Mads Dam, Alberto Gonzalez, Fetahi Wuhib
KTH Royal Institute of Technology, Stockholm, Sweden
www.ee.kth.se/~stadler
Large-scale Distributed Systems and Middleware (LADIS ’08), IBM TJ Watson Research Lab, NY, Sept 15-17, 2008
Outline
A self-organizing Monitoring Layer
Continuous Monitoring of Aggregates with Accuracy Objectives (A-GAP)
Performance comparison gossip vs. tree-based monitoring (GAP vs. G-GAP)
Monitoring and configuration, generally FCAPS functions, performed on a per-device basis.
[Figure label: Managed System]
Successful for
- small number of components
- small rate of change
A Management Layer inside the Network
[Figure labels: analyze; exceptions, reports, policies]
A Monitoring System for Large-scale Dynamic Environments
1. Engineer a self-organizing monitoring layer inside the managed system.
2. Support monitoring of aggregates in real time.
   - across neighborhood, domain, network
   - sum, max, average, percentile, histogram, …
3. Provide primitives for polling, continuous monitoring, detection of threshold crossings.
4. Support controlling the performance trade-offs.
   - accuracy, overhead, execution time, robustness
Continuous Monitoring of Aggregates with Accuracy Objectives (A-GAP)
The Problem
• Find an efficient solution for continuous monitoring of aggregates in large-scale dynamic network environments
  - Aggregation functions: SUM, MAX, AVERAGE, …
  - Sample aggregates: total number of VoIP flows, maximum link utilization, histogram of current load across routers in a network domain
• Key Application Areas: Network Supervision, Quality Assurance, Proactive Fault Management
Tradeoff between Estimation and Overhead
[Figure: overhead plotted against estimation error]
• Management solutions deployed today usually provide qualitative control of the accuracy
• Goal: Control trade-off through error objective
Decentralized in-Network Aggregation
Computing Aggregates
• Self-stabilizing spanning tree
• Incremental, in-network aggregation
• Push-based
Efficient Operation
• Local filters conform to error objective
• Adapt dynamically to network statistics
[Figure: aggregation tree over the physical network, with numeric values at the nodes. Labels: Management Station, Root, Physical Node, Aggregating Node, Leaf Node, Global Aggregate, Partial Aggregate, Local variable]
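The push-based, incremental aggregation along a tree can be illustrated with a minimal Python sketch for the SUM aggregate. The `Node` class, the `link` helper, and the node names are hypothetical and only serve the illustration; the sketch has no filters and no self-stabilization, so it is not the A-GAP protocol itself.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One node of an aggregation tree (hypothetical structure)."""
    name: str
    local: float = 0.0                       # local variable, e.g. flows seen here
    parent: Optional["Node"] = None
    child_partials: dict = field(default_factory=dict)  # last partial per child

    def partial(self) -> float:
        # Partial aggregate = local variable + partial aggregates of the children.
        return self.local + sum(self.child_partials.values())

    def set_local(self, value: float) -> None:
        self.local = value
        self._push()

    def _push(self) -> None:
        # Push-based: every change is sent to the parent, which re-aggregates
        # incrementally and pushes its own new partial aggregate upward.
        if self.parent is not None:
            self.parent.child_partials[self.name] = self.partial()
            self.parent._push()


def link(parent: Node, child: Node) -> None:
    """Attach `child` below `parent` and report its current partial aggregate."""
    child.parent = parent
    parent.child_partials[child.name] = child.partial()
    parent._push()


root = Node("root")
a, b = Node("a"), Node("b")
leaf1, leaf2 = Node("leaf1"), Node("leaf2")
link(root, a)
link(root, b)
link(a, leaf1)
link(a, leaf2)

leaf1.set_local(3)
leaf2.set_local(5)
a.set_local(1)
b.set_local(2)
print("global aggregate at root:", root.partial())
```

Each call to `set_local` touches only the nodes on the path to the root, which is what makes the aggregation incremental.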
Local Adaptive Filters
Local filter on a node
• Controls the management overhead by filtering updates
• Drops updates with small change to partial aggregate
• Periodically adapts to the dynamics of network environment
Filter Exceeded:
1) Triggers an update to parent
2) Filter is shifted
[Figure: local variable or partial aggregate over time, with the last update value and the filter width marked]
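A local filter of this kind can be sketched in a few lines of Python. The sketch assumes that the filter is an interval of total width `width` centered on the last update value; the slide does not state how the interval is positioned, so this is an assumption, as are the class and method names. The periodic adaptation of the width is left out.

```python
class LocalFilter:
    """Suppresses updates while the value stays inside the filter interval.

    Assumption: the interval is centered on the last reported value and has
    total width `width`.
    """

    def __init__(self, width, initial=0.0):
        self.width = width
        self.last_update = initial

    def offer(self, value):
        """Return True if `value` must be reported to the parent."""
        if abs(value - self.last_update) > self.width / 2:
            # Filter exceeded: 1) trigger an update, 2) shift the filter.
            self.last_update = value
            return True
        return False  # small change: the update is dropped


flt = LocalFilter(width=4, initial=10)
samples = [11, 12, 14, 13, 9, 9.5]
sent = [v for v in samples if flt.offer(v)]
print("updates sent to parent:", sent)
```

With these samples only two of the six changes leave the node; the parent sees a value that is at most half the filter width away from the true one.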
Problem Formalization
Find filter widths to monitor an aggregate for a given accuracy objective, with minimal overhead.
Overhead: maximum processing load $\omega^{n}$ over all management processes
Accuracy objective:
• average error: Minimize $\max_{n}\{\omega^{n}\}$ s.t. $E[|E^{root}|] \le \varepsilon$
• percentile error: Minimize $\max_{n}\{\omega^{n}\}$ s.t. $p(|E^{root}| > \gamma) \le \theta$
• maximum error: Minimize $\max_{n}\{\omega^{n}\}$ s.t. $|E^{root}| \le \kappa$
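The three accuracy objectives can be made concrete by evaluating them on a series of error samples at the root. This is a minimal Python sketch: the function names and the sample values are made up for illustration, and the parameters `epsilon`, `gamma`, `theta`, and `kappa` mirror the symbols of the three constraints.

```python
def average_error(errors):
    """Empirical counterpart of E[|E_root|]."""
    return sum(abs(e) for e in errors) / len(errors)


def exceedance_probability(errors, gamma):
    """Empirical counterpart of p(|E_root| > gamma)."""
    return sum(1 for e in errors if abs(e) > gamma) / len(errors)


def maximum_error(errors):
    """Largest observed |E_root|."""
    return max(abs(e) for e in errors)


# Hypothetical error samples (estimate minus true aggregate).
errors = [-3, 1, 0, 2, -6, 1, 0, -1]

epsilon, gamma, theta, kappa = 2.0, 2.0, 0.1, 5.0
meets_average = average_error(errors) <= epsilon
meets_percentile = exceedance_probability(errors, gamma) <= theta
meets_maximum = maximum_error(errors) <= kappa
print(meets_average, meets_percentile, meets_maximum)
```

The same samples can satisfy one objective and violate another, which is why the objective type has to be chosen together with its bound.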
A-GAP: A Distributed Heuristic
• The global problem is mapped onto a local problem for each node n:
  Minimize $\max_{\pi}\{\omega^{\pi}\}$ s.t. $E[|E^{n}_{out}|] \le \varepsilon^{n}$
• Attempts to minimize the maximum processing load over all nodes by minimizing the load within each node’s neighborhood
• Filter computation: decentralized and asynchronous
• Each node independently runs a control cycle:
  every τ seconds {
    request model variables from children
    compute new filters and accuracy objectives for children
    compute model variables for local node
  }
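The top-down step of the control cycle, in which a node hands accuracy objectives to its children, can be illustrated with a deliberately naive rule. The Python sketch below is not the A-GAP heuristic: it assumes a SUM aggregate and a maximum-error objective, keeps a fixed share of the budget for the node's own filter, and splits the rest equally among the children. The function names, the `own_share` parameter, and the example tree are hypothetical.

```python
def allocate_budgets(tree, node, budget, own_share=0.5, filters=None):
    """Assign a filter threshold to every node in the subtree of `node`.

    `budget` is the maximum error the node may show towards its parent.
    A threshold f means: changes of at most f are not reported.
    """
    if filters is None:
        filters = {}
    children = tree.get(node, [])
    if not children:
        filters[node] = budget           # a leaf spends the whole budget itself
        return filters
    own = budget * own_share
    filters[node] = own
    child_budget = (budget - own) / len(children)
    for child in children:
        allocate_budgets(tree, child, child_budget, own_share, filters)
    return filters


def worst_case_error(tree, node, filters):
    """For SUM, the errors of the children add up to the node's own threshold."""
    return filters[node] + sum(
        worst_case_error(tree, child, filters) for child in tree.get(node, [])
    )


# Hypothetical tree: the root has children a and b; a has two leaves.
tree = {"root": ["a", "b"], "a": ["leaf1", "leaf2"]}
kappa = 8.0
filters = allocate_budgets(tree, "root", kappa)
print(filters)
print("worst-case error at root:", worst_case_error(tree, "root", filters))
```

A static split like this ignores how fast each subtree changes, whereas the control cycle above recomputes filters and objectives every τ seconds from model variables.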
A Stochastic Model for the Monitoring Process
• Model based on discrete-time Markov chains
• It relates, for each node n,
  - the error of its partial aggregate
  - the evolution of the partial aggregate
  - the rate of updates n sends
  - the width of the local filter
• It permits computing, for each node,
  - the distribution of the estimation error
  - the protocol overhead
[Figure: model of a node n. Symbols: $\lambda^{n}$ (update rate), $S^{n}_{in}$, $S^{n}_{out}$, $E^{n}$, $E^{n}_{in}$ (estimation error), $E^{n}_{out}$, $G^{n}$, $F^{n}$, $\omega^{n}$. Labels: updates from children, updates to parent, step sizes, estimation error, update rate (processing load), node state, filter width]
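How a discrete-time Markov chain can tie filter width, update rate, and estimation error together is shown by the following toy model in Python. It is an assumption-laden illustration, not the model from the slides: the monitored value moves by +1 or -1 per time step with fixed probabilities `p_up` and `p_down`, the chain state is the difference between the current value and the last reported value, and the state resets to 0 whenever the difference would exceed `width`.

```python
def stationary_error_model(width, p_up, p_down, iterations=5000):
    """Return (error distribution, update rate per step, mean absolute error)."""
    states = list(range(-width, width + 1))
    index = {e: i for i, e in enumerate(states)}
    p_stay = 1.0 - p_up - p_down

    def step(dist):
        new = [0.0] * len(states)
        for e, mass in zip(states, dist):
            for delta, prob in ((1, p_up), (-1, p_down), (0, p_stay)):
                target = e + delta
                if abs(target) > width:
                    target = 0      # filter exceeded: update sent, error resets
                new[index[target]] += mass * prob
        return new

    # Power iteration from "no error" towards the stationary distribution.
    dist = [0.0] * len(states)
    dist[index[0]] = 1.0
    for _ in range(iterations):
        dist = step(dist)

    # An update is sent when the chain tries to leave the interval.
    update_rate = dist[index[width]] * p_up + dist[index[-width]] * p_down
    mean_abs_error = sum(abs(e) * m for e, m in zip(states, dist))
    return dist, update_rate, mean_abs_error


for w in (1, 2, 4):
    _, rate, err = stationary_error_model(w, p_up=0.25, p_down=0.25)
    print(f"width={w}  updates/step={rate:.4f}  mean |error|={err:.3f}")
```

In this toy chain a wider filter lowers the update rate and raises the mean absolute error, which is the trade-off the model is meant to quantify.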
• Aggregate: Number of HTTP flows in the domain
• Traces
  - From two 1 Gbit/s links that connect University of Twente to a research network
• Control cycle
  - τ = 1 sec
Tradeoff: Accuracy vs Overhead
[Figure: updates/sec (0 to 600) vs. average error (0 to 30). Curve labels: A-GAP with ε = 0, 2, 5, 10, 15, 20; ARC with Tmin = 0.03, 0.04, 0.05, 0.10, 0.20]
• Overhead decreases monotonically as the average error increases
• Overhead depends on the changes of the aggregate, not on its value.
• A-GAP outperforms a rate-control scheme (ARC)
Scalability
[Figure: normalized updates/sec (0 to 1) vs. average error (0 to 40) for the topologies labeled Grid 25, Grid 221, and Grid 613; emin marked]
• Minimum error emin increases with the network size
• Overhead increases linearly with network size for the same error objective
Robustness
[Figure: two plots over time (140 to 175): estimation error (-500 to 3500) and maximum load in updates/sec (0 to 80). Markers: "Node A fails", "End of Transient"]
• Estimation error: several spikes during sub-second transient period
• Overhead: single peak with a long transient
Error Prediction by A-GAP vs Actual Error
[Figure: error distribution over the error range -40 to 40, actual error vs. error predicted by A-GAP; absolute average error marked]
• Accurate prediction of the error distribution
• Maximum error >> average error (one order of magnitude)
A-GAP Prototype
Lab testbed at KTH
• 16 monitoring nodes
• 16 Cisco 2600 Series routers
• Smartbits 6000 traffic generator
• A-GAP implemented in Java
[Figure: management station, aggregation tree (Node 1 to Node 7), and physical network]
Prototype: Management Station Interface
[Screenshot of the management station interface. Annotations:]
• Select Aggregation Function
• Select Accuracy Objective
• Select Root Node
• Show Aggregation Tree
• Evolution of the Aggregate (True Value and A-GAP Estimation)
• Overhead Distribution and Evolution
• Real-time Estimation of Error Distribution and Trade-off
Simulation vs Testbed Measurements
[Figure: updates/sec (0 to 14) vs. average error (0 to 12), testbed experiment vs. simulation]
• Curves are close: difference in overhead below 3.5%
• Prototype validates simulation model
Prototype: Error Estimation by A-GAP vs Actual Error
[Figure: error distribution over the error range -30 to 30, measured error vs. error estimated by A-GAP; absolute average error marked]
• Accurate estimation of the error distribution
• Maximum error >> average error (one order of magnitude)
Prototype: Overhead Estimation by A-GAP vs Actual Overhead
[Figure: updates/sec (0 to 3) for node 1 to node 7, measured vs. estimated]
• Accurate estimation of the overhead
• Estimation tends to be more accurate for nodes close to the root
Gossip vs. Tree-based Aggregation
Gossip protocols
Gossip protocols are round-based: during each round, a node randomly selects a subset of neighbors and interacts with them.
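A round-based gossip computation can be sketched with push-sum averaging, one well-known scheme of this kind. The slides do not say that this is the protocol under comparison, so the choice is an assumption made for illustration, as are the function name, the synchronous rounds, the fully connected topology, and the sample values.

```python
import random


def push_sum(values, neighbors, rounds, seed=0):
    """Estimate the network-wide average at every node by push-sum gossip.

    Each node keeps a pair (s, w). In every round it keeps half of the pair
    and sends the other half to one randomly selected neighbor.
    """
    rng = random.Random(seed)
    n = len(values)
    s = [float(v) for v in values]
    w = [1.0] * n
    for _ in range(rounds):
        new_s = [0.0] * n
        new_w = [0.0] * n
        for i in range(n):
            j = rng.choice(neighbors[i])
            half_s, half_w = s[i] / 2, w[i] / 2
            new_s[i] += half_s          # half stays at the sender
            new_w[i] += half_w
            new_s[j] += half_s          # half goes to the selected neighbor
            new_w[j] += half_w
        s, w = new_s, new_w
    return [si / wi for si, wi in zip(s, w)]


values = [4, 12, 1, 10, 3, 25]                       # hypothetical local variables
nodes = range(len(values))
neighbors = {i: [j for j in nodes if j != i] for i in nodes}  # full mesh
estimates = push_sum(values, neighbors, rounds=100)
true_average = sum(values) / len(values)
print("true average:", true_average)
print("largest deviation:", max(abs(e - true_average) for e in estimates))
```

Unlike the tree-based scheme, no node has a distinguished role: every node ends up with its own estimate of the aggregate.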
• Continuous monitoring of aggregates with accuracy objectives
  - Efficient, scalable and adaptable monitoring using aggregation trees is feasible.
  - Model-based monitoring allows for performance prediction.
• Tree-based vs. gossip-based continuous monitoring
  - In a traditional wireline networking scenario, tree-based aggregation outperforms gossip-based aggregation
References
F. Wuhib, M. Dam, R. Stadler: “Decentralized Detection of Global
A. Gonzalez Prieto, R. Stadler: “A-GAP: An Adaptive Protocol for Continuous Network Monitoring with Accuracy Objectives,” IEEE Transactions on Network and Service Management (TNSM), Vol. 4, No. 1, June 2007.
F. Wuhib, M. Dam, R. Stadler, A. Clemm: “Robust Monitoring of Network-wide Aggregates through Gossiping,” 10th IFIP/IEEE International Symposium on Integrated Network Management (IM 2007), Munich, Germany, May 21-25, 2007.
K.S. Lim and R. Stadler: “Real-time views of network traffic using decentralized management,” 9th IFIP/IEEE International Symposium on Integrated Network Management (IM 2005), Nice, France, May 16-19, 2005.