Decentralized Real-Time Monitoring of Network-Wide Aggregates

Rolf Stadler
Mads Dam, Alberto Gonzalez, Fetahi Wuhib

KTH Royal Institute of Technology
Stockholm, Sweden

www.ee.kth.se/~stadler

Large-scale Distributed Systems and Middleware (LADIS ’08)
IBM TJ Watson Research Lab, NY, Sept 15-17, 2008

Outline

A self-organizing Monitoring Layer

Continuous Monitoring of Aggregates with Accuracy Objectives (A-GAP)

Performance comparison gossip vs. tree-based monitoring (GAP vs. G-GAP)


Today’s Management Systems for Today’s Network Technologies

[Figure: Management System (observe, analyze, act) and Managed System]

Management intelligence outside managed system.

Clear separation between management system and managed system, by design.

Today’s Management Systems for Today’s Network Technologies (2)

[Figure: Management System (analyze) and Managed System]

Monitoring and configuration, generally FCAPS functions, performed on a per-device basis.

Successful for
- small number of components
- small rate of change.

A Management Layer inside the Network

[Figure: management layer inside the network. Labels: analyze, exceptions, reports, policies]

A Monitoring System for Large-scale Dynamic Environments

1. Engineer a self-organizing monitoring layer inside the managed system.

2. Support monitoring of aggregates in real-time.
   across neighborhood, domain, network
   sum, max, average, percentile, histogram, …

3. Provide primitives for polling, continuous monitoring, detection of threshold crossings.

4. Support controlling the performance trade-offs.
   accuracy, overhead, execution time, robustness

Continuous Monitoring of Aggregates with Accuracy Objectives (A-GAP)


The Problem

• Find an efficient solution for continuous monitoring of aggregates in large-scale dynamic network environments
- Aggregation functions: SUM, MAX and AVERAGE, …
- Sample aggregates: total number of VoIP flows, maximum link utilization, histogram of current load across routers in a network domain

• Key Application Areas: Network Supervision, Quality Assurance, Proactive Fault Management

Tradeoff between Estimation and Overhead

[Figure: trade-off curve. Axes: Overhead, Estimation Error]

• Management solutions deployed today usually provide qualitative control of the accuracy

• Goal: Control trade-off through error objective

Decentralized in-Network Aggregation

Computing Aggregates
• Self-stabilizing spanning tree
• Incremental, in-network aggregation
• Push-based

Efficient Operation
• Local filters conform to error objective
• Adapt dynamically to network statistics

[Figure: aggregation tree over the physical network, annotated with example values. Labels: Management Station, Root, Aggregating Node, Leaf Node, Physical Node, Global Aggregate, Partial Aggregate, Local variable]
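To make the idea of incremental, push-based in-network aggregation concrete, here is a minimal Python sketch for the SUM aggregate. It is not the GAP or A-GAP implementation: the `TreeNode` class and `push_update` method are hypothetical names, the tree is built by hand rather than by a self-stabilizing protocol, and message passing between nodes is collapsed into a simple loop towards the root.

```python
class TreeNode:
    """Node of an aggregation tree (hypothetical structure, for illustration only)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.local_value = 0.0   # local variable monitored at this node
        self.partial = 0.0       # partial aggregate (SUM) of the subtree rooted here

    def push_update(self, new_local_value):
        """Apply a local change and push only the difference towards the root."""
        delta = new_local_value - self.local_value
        self.local_value = new_local_value
        node = self
        while node is not None:
            node.partial += delta   # incremental: ancestors never rescan their subtree
            node = node.parent


# Hand-built tree: root has children a and b; a has leaves c and d.
root = TreeNode("root")
a = TreeNode("a", parent=root)
b = TreeNode("b", parent=root)
c = TreeNode("c", parent=a)
d = TreeNode("d", parent=a)

for node, value in [(a, 1.0), (b, 2.0), (c, 3.0), (d, 5.0)]:
    node.push_update(value)

global_aggregate = root.partial   # global aggregate held at the root
c.push_update(4.0)                # one local change touches only c, a, and root
```

Each change costs one update per tree level, which is why the filters on the following slides, which suppress small changes, translate directly into lower overhead.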

Local Adaptive Filters

[Figure: a local variable or partial aggregate plotted over time, with the last update value and the filter width marked]

Filter Exceeded:
1) Triggers an update to parent
2) Filter is shifted

Local filter on a node
• Controls the management overhead by filtering updates
• Drops updates with small change to partial aggregate
• Periodically adapts to the dynamics of network environment
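The filter behaviour described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the A-GAP code: the class name `LocalFilter` is hypothetical, the filter is assumed to be symmetric around the last value sent to the parent, and the periodic adaptation of the width is left out.

```python
class LocalFilter:
    """Suppresses updates while the value stays within `width` of the last update.

    Assumption: the filter is centred on the last value sent to the parent.
    """

    def __init__(self, width, initial=0.0):
        self.width = width
        self.last_sent = initial

    def observe(self, value):
        """Return the value to send to the parent, or None if the update is dropped."""
        if abs(value - self.last_sent) > self.width:
            self.last_sent = value   # filter exceeded: send update and shift the filter
            return value
        return None                  # small change: no message, no overhead


flt = LocalFilter(width=2.0, initial=10.0)
observations = [10.5, 11.9, 12.5, 13.0, 15.0]
sent = [v for v in observations if flt.observe(v) is not None]
```

With a width of 2.0, only two of the five observations leave the node. A wider filter sends fewer updates but lets the parent's view drift further from the true value, which is the trade-off the error objective controls.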

Problem Formalization

Find filter widths to monitor aggregate for a given accuracy objective, with minimal overhead

Overhead: maximum processing load $\omega^{n}$ over all management processes

Accuracy objective:
- average error: Minimize $\max_{n}\{\omega^{n}\}$ s.t. $E[|E^{root}|] \le \varepsilon$
- percentile error: Minimize $\max_{n}\{\omega^{n}\}$ s.t. $p(|E^{root}| > \gamma) \le \theta$
- maximum error: Minimize $\max_{n}\{\omega^{n}\}$ s.t. $|E^{root}| \le \kappa$
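The three accuracy objectives can be read as three different tests on the estimation error at the root. The Python sketch below checks each constraint empirically on a list of error samples. The function names and the sample values are made up for illustration; A-GAP itself works with the error distribution predicted by its stochastic model rather than with measured samples.

```python
def average_error_ok(errors, eps):
    """Average-error objective: mean of |E_root| must not exceed eps."""
    return sum(abs(e) for e in errors) / len(errors) <= eps


def percentile_error_ok(errors, gamma, theta):
    """Percentile objective: fraction of samples with |E_root| > gamma is at most theta."""
    exceed = sum(1 for e in errors if abs(e) > gamma)
    return exceed / len(errors) <= theta


def maximum_error_ok(errors, kappa):
    """Maximum-error objective: |E_root| never exceeds kappa."""
    return max(abs(e) for e in errors) <= kappa


# Hypothetical error samples (estimate minus true aggregate), for illustration only.
samples = [-3, 1, 0, 2, -1, 8, 0, -2]

avg_ok = average_error_ok(samples, eps=2.5)                  # mean |error| is 2.125
pct_ok = percentile_error_ok(samples, gamma=5, theta=0.2)    # 1 of 8 samples exceeds 5
max_ok = maximum_error_ok(samples, kappa=5)                  # largest |error| is 8
```

The same samples satisfy the average and percentile objectives but violate the maximum-error objective, which shows why the three formulations lead to different filter widths.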

A-GAP: A Distributed Heuristic

• The global problem is mapped onto a local problem for each node $n$:

  Minimize $\max_{\pi}\{\omega^{\pi}\}$ s.t. $E(|E^{n}_{out}|) \le \varepsilon^{n}$

• Attempts to minimize the maximum processing load over all nodes by minimizing the load within each node’s neighborhood

• Filter computation: decentralized and asynchronous

• Each node independently runs a control cycle:
  every τ seconds {
    request model variables from children
    compute new filters and accuracy objectives for children
    compute model variables for local node
  }

A Stochastic Model for the Monitoring Process

• Model based on discrete-time Markov chains

[Figure: model of node n. Variables: $S^{n}_{in}$ (updates from children, step sizes), $S^{n}_{out}$ (updates to parent, step sizes), $E^{n}_{in}$ and $E^{n}_{out}$ (estimation error), $G^{n}$ (node state), $F^{n}$ (filter width), $\lambda^{n}$ (update rate), $\omega^{n}$ (update rate, processing load)]

• It relates, for each node n,
- the error of its partial aggregate
- the evolution of the partial aggregate
- the rate of updates n sends
- the width of the local filter

• It permits computing, for each node,
- the distribution of the estimation error
- the protocol overhead

Stochastic Model (leaf node)

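Evolution of local variable: step size $X^{n}$, estimated by MLE

Node state, moving from state $i^{n}$ to state $j^{n}$:

$$j^{n} = \begin{cases} i^{n} + X^{n} & \text{if } -F^{n} \le i^{n} + X^{n} \le F^{n} \\ 0 & \text{otherwise} \end{cases}$$

Transition Matrix:

$$t^{n}_{ij} = \begin{cases} P(X^{n} = j - i) & j \neq 0,\ |j| \le F^{n} \\ P(X^{n} = -i) + P(X^{n} < -F^{n} - i) + P(X^{n} > F^{n} - i) & j = 0 \\ 0 & \text{otherwise} \end{cases}$$

Step Size:

$$P(S^{n}_{out} = s) = \begin{cases} \sum_{z = s - F^{n}}^{s + F^{n}} P(X^{n} = z)\, P(G^{n} = s - z) & |s| > F^{n} \\ \sum_{d = -F^{n}}^{F^{n}} \sum_{z = -F^{n} - d}^{F^{n} - d} P(X^{n} = z)\, P(G^{n} = d) & s = 0 \\ 0 & \text{otherwise} \end{cases}$$

Estimation Error:

$$E^{n}_{out} = G^{n}$$

Management Overhead:

$$\lambda^{n} = 1 - P(S^{n}_{out} = 0)$$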
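As a rough numerical illustration of the leaf-node model, the Python sketch below builds the Markov chain of the node state for a given step-size distribution and filter width, and derives the update rate and the mean absolute estimation error from its stationary distribution. Several things here are assumptions rather than part of the slides: the function name `leaf_model`, integer-valued steps, one step per time slot, the reading of the state as the offset from the last value sent (reset to 0 on an update), and the use of a damped power iteration to find the stationary distribution.

```python
def leaf_model(step_pmf, F, iters=10_000, tol=1e-12):
    """Return (update_rate, mean_abs_error) for a leaf node.

    step_pmf: dict mapping an integer step X to its probability.
    F: integer filter width; the state G takes values in -F..F.
    """
    states = list(range(-F, F + 1))
    idx = {g: k for k, g in enumerate(states)}
    n = len(states)

    # Transition matrix: stay inside the filter, or reset to 0 when it is exceeded.
    T = [[0.0] * n for _ in range(n)]
    for i in states:
        for x, p in step_pmf.items():
            j = i + x
            if not -F <= j <= F:
                j = 0
            T[idx[i]][idx[j]] += p

    # Stationary distribution via a damped ("lazy") power iteration,
    # which also converges for periodic chains.
    pi = [1.0 / n] * n
    for _ in range(iters):
        nxt = [0.5 * pi[k] + 0.5 * sum(pi[a] * T[a][k] for a in range(n))
               for k in range(n)]
        converged = max(abs(u - v) for u, v in zip(pi, nxt)) < tol
        pi = nxt
        if converged:
            break

    # P(S_out = 0): the step keeps the state inside the filter, so no update is sent.
    p_no_update = sum(pi[idx[d]] * p
                      for d in states
                      for x, p in step_pmf.items()
                      if -F <= d + x <= F)
    update_rate = 1.0 - p_no_update                      # lambda = 1 - P(S_out = 0)
    mean_abs_error = sum(abs(g) * pi[idx[g]] for g in states)
    return update_rate, mean_abs_error


# Symmetric random walk with unit steps, filter widths 0 and 1.
walk = {-1: 0.5, 1: 0.5}
rate_f0, err_f0 = leaf_model(walk, F=0)
rate_f1, err_f1 = leaf_model(walk, F=1)
```

For this toy input, widening the filter from 0 to 1 lowers the update rate from one update per slot to one every four slots, at the price of a nonzero average error. This is the overhead versus accuracy relation that the model lets each node evaluate locally.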

A-GAP: Model-based Monitoring

[Figure: A-GAP control loop. Labels: Measurements of local variables $x_i(t)$, Tree-based aggregation, Aggregate Estimation, Estimation of model variables, Step Sizes, Stochastic Model of Monitoring Process, Optimization Problem, Error Objective, Filter Widths, Estimation Error, Overhead. Inset plots: distribution of the estimation error; overhead in Updates/sec for node 1 to node 7, Measured vs. Estimated]

A-GAP: Evaluation through Simulation

• Overlay topologies
- Abovenet: 654 nodes, 1332 links
- Grids: 25, 85, 221, 613 nodes

• Aggregate: Number of HTTP flows in the domain

• Traces
- From two 1 Gbit/s links that connect University of Twente to a research network

• Control cycle
- τ = 1 sec

Tradeoff: Accuracy vs Overhead

[Figure: overhead (Updates/sec) versus Avg Error. Curves: A-GAP for ε = 0, 2, 5, 10, 15, 20; ARC for Tmin = 0.03, 0.04, 0.05, 0.10, 0.20]

• Overhead decreases monotonically as the error objective increases
• Overhead depends on the changes of the aggregate, not on its value
• A-GAP outperforms a rate-control scheme (ARC)

Scalability

[Figure: overhead (Updates/sec, normalized) versus Avg Error. Curves: Grid 25, Grid 221, Grid 613; e_min marked]

• Minimum error e_min increases with the network size
• Overhead increases linearly with network size for same error objective

Robustness

[Figure: two plots over Time. Top: Estimation Error. Bottom: Maximum Load (Updates/sec). Markers: Node A fails, End of Transient]

• Estimation error: several spikes during sub-second transient period
• Overhead: single peak with a long transient

Error Prediction by A-GAP vs Actual Error

[Figure: distribution of the Error. Curves: Error Predicted by A-GAP, Actual Error; Absolute Avg Error marked]

• Accurate prediction of the error distribution
• Maximum error >> average error (one order of magnitude)

A-GAP Prototype

Lab testbed at KTH
• 16 monitoring nodes
• 16 Cisco 2600 Series routers
• Smartbits 6000 traffic generator
• A-GAP implemented in Java

[Figure: Management Station, Aggregation Tree (Node 1 to Node 7), Physical Network]

Prototype: Management Station Interface

[Screenshot of the management station interface, with callouts:]
• Select Aggregation Function
• Select Accuracy Objective
• Select Root Node
• Show Aggregation Tree
• Evolution of the Aggregate (True Value and A-GAP Estimation)
• Overhead Distribution and Evolution
• Real-time Estimation of Error Distribution and Trade-off

Simulation vs Testbed Measurements

[Figure: overhead (Updates/sec) versus Avg Error. Curves: Testbed Experiment, Simulation]

• Curves are close: difference in overhead below 3.5%
• Prototype validates simulation model

Prototype: Error Estimation by A-GAP vs Actual Error

[Figure: distribution of the Error. Curves: Measured Error, Error Estimated by A-GAP; Absolute Avg Error marked]

• Accurate estimation of the error distribution
• Maximum error >> average error (one order of magnitude)

Prototype: Overhead Estimation by A-GAP vs Actual Overhead

[Figure: overhead (Updates/sec) for node 1 to node 7, Measured vs. Estimated]

• Accurate estimation of the overhead
• Estimation tends to be more accurate for nodes close to the root

Gossip vs. Tree-based Aggregation


Gossip protocols

Gossip protocols are round-based: during each round a node randomly selects a subset of neighbors and interacts with them.

Applications
- information dissemination
- database replication
- failure detection
- resource discovery
- computing aggregates
- …

Computing aggregates with gossiping

Push Synopses [Kempe et al. ‘03]

• The protocol computes AVERAGE of the local variables $x_i$.

• After each round a new estimate of the aggregate is computed as $s_i/w_i$.

• Exponential convergence for uniform gossip and complete graphs

• Protocol Invariants: $\sum_i s_{r,i} = \sum_i x_i$, $\sum_i w_{r,i} = n$

Round 0 {
  1. $s_i = x_i$;
  2. $w_i = 1$;
  3. send $(s_i, w_i)$ to self }

Round r+1 {
  1. Let $\{(s^*_l, w^*_l)\}$ be all pairs sent to i during round r
  2. $s_i = \sum_l s^*_l$; $w_i = \sum_l w^*_l$
  3. choose shares $\alpha_{i,j} \ge 0$ for all nodes j such that $\sum_j \alpha_{i,j} = 1$
  4. for all j send $(\alpha_{i,j} \cdot s_i, \alpha_{i,j} \cdot w_i)$ to each j }
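The rounds of Push Synopses can be simulated in a few lines of Python. The sketch below is a synchronous, single-process simulation, not a networked implementation. It assumes uniform gossip on a complete graph with the common choice of shares in which each node keeps one half and sends one half to a peer chosen uniformly at random; the function name `push_synopses` and the input values are made up for illustration.

```python
import random


def push_synopses(x, rounds, rng):
    """Simulate Push Synopses; returns the final lists (s, w)."""
    n = len(x)
    s = list(x)        # round 0: s_i = x_i
    w = [1.0] * n      # round 0: w_i = 1
    for _ in range(rounds):
        new_s = [0.0] * n
        new_w = [0.0] * n
        for i in range(n):
            j = rng.randrange(n)          # uniformly chosen peer (may be i itself)
            for target in (i, j):         # shares: 1/2 to self, 1/2 to the peer
                new_s[target] += 0.5 * s[i]
                new_w[target] += 0.5 * w[i]
        s, w = new_s, new_w               # each node sums the pairs it received
    return s, w


values = [float(v) for v in range(20)]    # hypothetical local variables x_i
true_average = sum(values) / len(values)

s, w = push_synopses(values, rounds=100, rng=random.Random(42))
estimates = [si / wi for si, wi in zip(s, w)]   # every node's estimate s_i / w_i
```

Because every node keeps half of its own pair, each weight stays positive and the division is always defined. The sums of `s` and `w` are preserved in every round, which is the invariant that makes each local ratio approach the global average.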

The G-GAP protocol

Round 0 {
  1. $s_i = x_i$;
  2. $w_i = 1$;
  3. $L_i = \{i\}$;
  4. for each node j $(rs_{i,j}, rw_{i,j}) = (0,0)$;
  5. for each node j $(srs_{i,j}, srw_{i,j}) = (0,0)$;
  6. send $(s_i, w_i, 0, 0, 0, 0)$ to self;
  7. for all $j \ne i$ send $(0,0,0,0,0,0)$ to j }

Round r+1 {
  1. Let M be all messages received by i during round r
  2. $s_i = \sum_{m \in M} s(m) + (x_{i,r} - x_{i,r-1})$; $w_i = \sum_{m \in M} w(m)$
  3. for all j $(acks_{i,j}, ackw_{i,j}) = (0,0)$
  4. $L_i = L_i \cup orig(M)$
  5. for all $j \in Neighbors$ {
     a. $(rs_{i,j}, rw_{i,j}) = (rs_{i,j}, rw_{i,j}) + \sum_{m: orig(m)=j} ((rs(m), rw(m)) - (acks(m), ackw(m)))$
     b. $(acks_{i,j}, ackw_{i,j}) = (srs_{i,j}, srw_{i,j}) + \sum_{m: orig(m)=j} (s(m), w(m))$
     c. if (detected_failure(j)) {
        i. $(s_i, w_i) = (s_i, w_i) + (rs_{i,j}, rw_{i,j})$
        ii. $(rs_{i,j}, rw_{i,j}) = (srs_{i,j}, srw_{i,j}) = (0,0)$
        iii. $L_i = L_i \setminus j$ } }
  6. for all $j \in L_i$ {
     a. choose $\alpha_{i,j} \ge 0$ such that $\sum_j \alpha_{i,j} = 1$
     b. choose $\beta_{i,j} \ge 0$ such that $\sum_j \beta_{i,j} = 1$ and $\beta_{i,i} = 0$
     c. $(srs_{i,j}, srw_{i,j}) = (\beta_{i,j}\alpha_{i,i}(s_i - x_i), \beta_{i,j}\alpha_{i,i}(w_i - 1))$
     d. send $(\alpha_{i,j} s_i, \alpha_{i,j} w_i, srs_{i,j}, srw_{i,j}, acks_{i,j}, ackw_{i,j})$ to j
     e. $(rs_{i,j}, rw_{i,j}) = (rs_{i,j} + \alpha_{i,j} s_i, rw_{i,j} + \alpha_{i,j} w_i)$
  } }

Accuracy vs. Overhead: gossip- and tree-based aggregation protocol

GAP and G-GAP
• 654 node network
• GoCast overlay, connectivity 10
• aggregation: AVERAGE
• UT trace
• 4 rounds/sec
• no failures

Accuracy vs. Failure Rate: gossip- and tree-based aggregation protocol

GAP and G-GAP
• 654 node network
• GoCast overlay, connectivity 10
• aggregation: AVERAGE
• UT trace
• 4 rounds/sec
• nodes fail randomly, recover after 10 sec

Summary

• A self-organizing monitoring layer inside the managed system
- Monitoring network-wide aggregates.
- Polling, continuous monitoring, threshold detection.
- Controlling the performance trade-offs.

• Continuous monitoring of aggregates with accuracy objectives
- Efficient, scalable and adaptable monitoring using aggregation trees is feasible.
- Model-based monitoring allows for performance prediction.

• Tree-based vs. gossip-based continuous monitoring
- In a traditional wireline networking scenario, tree-based aggregation outperforms gossip-based aggregation

References

F. Wuhib, M. Dam, R. Stadler: “Decentralized Detection of Global Threshold Crossings Using Aggregation Trees,” Computer Networks, Vol. 52, No. 9, pp. 1745-1761, 2008.

A. Gonzalez Prieto, R. Stadler: “A-GAP: An Adaptive Protocol for Continuous Network Monitoring with Accuracy Objectives,” IEEE Transactions on Network and Service Management (TNSM), Vol. 4, No. 1, June 2007.

F. Wuhib, M. Dam, R. Stadler, A. Clemm: “Robust Monitoring of Network-wide Aggregates through Gossiping,” 10th IFIP/IEEE International Symposium on Integrated Management (IM 2007), Munich, Germany, May 21-25, 2007.

K.S. Lim and R. Stadler: “Real-time views of network traffic using decentralized management,” 9th IFIP/IEEE International Symposium on Integrated Network Management (IM 2005), Nice, France, May 16-19, 2005.