1
High Performance Switching and Routing
Telecom Center Workshop: Sept 4, 1997.
Limiting the Impact of Failures on Network Performance
Joint work with Supratik Bhattacharyya and Christophe Diot
High Performance Networking Group, 25 Feb. 2004
Yashar Ganjali
Computer Systems Lab.
Stanford University
yganjali@stanford.edu
http://www.stanford.edu/~yganjali
2
Motivation
The core of the Internet consists of several large networks (IP backbones).
IP backbones are carefully provisioned to guarantee low latency and jitter for packet delivery.
Failures occur daily as a result of physical-layer malfunctions, router hardware/software failures, maintenance, human errors, …
Failures affect the quality of service delivered to backbone customers.
3
Outline
Background
• Sprint’s IP backbone
• Data
Impact Metrics
• Time-based metrics
• Link-based metrics
Measurements
Reducing the impact
• Identifying critical failures
• Causes analysis
• Reducing critical failures
4
Background – Sprint’s IP backbone
IP layer operates above DWDM with SONET framing.
IS-IS protocol used to route traffic inside the network.
IP-level restoration
• When an IP link fails, all routers in the network independently compute a new path around the failure.
No protection in the underlying optical infrastructure.
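A minimal sketch of the IP-level restoration idea: a router reruns a shortest-path computation on the topology with the failed link left out. The four-node topology, the link weights, and the function name `shortest_path` are hypothetical; a real IS-IS router works from its flooded link-state database.

```python
import heapq

def shortest_path(graph, src, dst, failed=frozenset()):
    """Dijkstra over an undirected weighted graph, skipping failed links.

    graph: dict mapping node -> dict of neighbor -> link weight.
    failed: set of frozenset({u, v}) links treated as down.
    Returns (cost, path), or (inf, []) if dst is unreachable.
    """
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        if u == dst:
            break
        for v, w in graph[u].items():
            if frozenset((u, v)) in failed:
                continue  # link is down, route around it
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return float("inf"), []
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return dist[dst], path[::-1]

# Hypothetical topology (not Sprint's network).
topology = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 1, "D": 5},
    "C": {"A": 4, "B": 1, "D": 1},
    "D": {"B": 5, "C": 1},
}
before = shortest_path(topology, "A", "D")
after = shortest_path(topology, "A", "D", failed={frozenset(("B", "C"))})
```

In this toy case the path from A to D moves from A-B-C-D to A-C-D once link B-C is marked as failed.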
5
Data
IS-IS Link State PDU logs
• Collected by passive listeners from Sprint’s North America backbone.
• Feb. 1st, 2003 to Jun. 30th, 2003.
SNMP logs
• Link loads recorded once every 5 minutes.
SONET layer alarms
• Corresponding to minor and major problems in the optical layer.
• We are only interested in two alarms: SLOS and SLOS cleared.
10
Two Perspectives
For a given impact metric
• Time-based analysis: Measure the impact of failures on the given metric as a function of time.
• Link-based analysis: Measure the impact of failures on the given metric as a function of failing links.
11
Time-based Impact Metrics
1. Number of Simultaneous Link Failures
2. Number of affected O-D pairs
3. Number of affected BGP prefixes
4. Path unavailability
5. Total rerouted traffic
6. Maximum load
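As an illustration of the first metric, the number of simultaneous link failures over time can be derived from failure intervals with a sweep over start and end events. The `(link, start, end)` record format and the sample values are assumptions made for this sketch; the slides do not describe how the logs were processed.

```python
def simultaneous_failures(failures):
    """Return (time, count) change points for the number of links down.

    failures: iterable of (link, start, end) tuples with start < end.
    """
    events = []
    for _link, start, end in failures:
        events.append((start, 1))   # link goes down
        events.append((end, -1))    # link comes back up
    # Tuples sort by time first; at equal times -1 sorts before +1,
    # so recoveries are applied before new failures.
    events.sort()
    series = []
    count = 0
    for t, delta in events:
        count += delta
        if series and series[-1][0] == t:
            series[-1] = (t, count)  # merge events at the same instant
        else:
            series.append((t, count))
    return series

# Hypothetical failure intervals (arbitrary time units).
sample = [("L1", 0, 10), ("L2", 5, 15), ("L3", 8, 9)]
series = simultaneous_failures(sample)
peak = max(count for _, count in series)
```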
26
Maximum Load Throughout the Network
96% of link failures were not followed by an immediate change in maximum load.
35
Link-based Impact Metrics
1. Number of Link Failures
2. Number of affected O-D pairs
3. Number of affected BGP prefixes
4. Path coverage
5. Total rerouted traffic
6. Peak factor
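A link-based metric assigns one value to each link. As a rough sketch, the number of affected O-D pairs for a link can be counted as the O-D pairs whose current path traverses that link. The path table, the node names, and the function name `affected_od_pairs` are hypothetical and serve only to show the shape of the computation.

```python
def affected_od_pairs(paths, failed_link):
    """Return the set of O-D pairs whose path uses failed_link.

    paths: dict mapping (origin, destination) -> list of nodes on the path.
    failed_link: pair of node names; direction is ignored.
    """
    u, v = failed_link
    hit = set()
    for od, nodes in paths.items():
        # Walk consecutive node pairs, i.e. the links on the path.
        for a, b in zip(nodes, nodes[1:]):
            if {a, b} == {u, v}:
                hit.add(od)
                break
    return hit

# Hypothetical routing state.
paths = {
    ("A", "D"): ["A", "B", "C", "D"],
    ("A", "C"): ["A", "B", "C"],
    ("B", "D"): ["B", "D"],
}
links = [("A", "B"), ("B", "C"), ("C", "D"), ("B", "D")]
per_link = {link: len(affected_od_pairs(paths, link)) for link in links}
```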
37
Critical Failures
For each time-based metric
• Removing failures occurring during 1-5% of the time improves the metric by a factor of at least 5.
For each link-based metric
• Removing failures on 1-7% of links improves the metric by a factor of at least 3.
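One way to picture the link-based statement is to rank links by their metric value, drop the top fraction, and compare totals. The slides do not say how the improvement factor is computed, so the ratio of totals used here is an assumption, as are the sample values and the name `improvement_factor`.

```python
def improvement_factor(per_link_metric, fraction):
    """Rank links by metric value and drop the worst `fraction` of them.

    Returns (critical_links, factor), where factor is the total metric
    before removal divided by the total after removal (an assumed
    definition, not taken from the slides).
    """
    ranked = sorted(per_link_metric.items(), key=lambda kv: kv[1], reverse=True)
    k = max(1, round(fraction * len(ranked)))  # always drop at least one link
    critical = [link for link, _ in ranked[:k]]
    total = sum(per_link_metric.values())
    remaining = sum(value for _, value in ranked[k:])
    factor = total / remaining if remaining else float("inf")
    return critical, factor

# Hypothetical per-link failure counts: one bad link among ten.
counts = {"L0": 40}
counts.update({f"L{i}": 2 for i in range(1, 10)})
critical, factor = improvement_factor(counts, 0.1)
```

With these made-up counts, removing the single worst link (10% of links) reduces the total by a factor of a little over 3.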
39
Critical Links
Any link that has a critical failure is called a Critical Link.
We are interested in fixing such links.
41
Correlation of the Critical Sets
Metric                      Size  1     2     3     4     5     6     7     8     9     10
1) Simultaneous failures    11    -     0.38  0.33  0.27  0.23  0.11  0.13  0.08  0.15  0.05
2) # of O-D pairs           9     -     -     0.37  0.21  0.25  0.12  0.14  0.06  0.09  0.06
3) # of BGP prefixes        6     -     -     -     0.18  0.32  0.09  0.05  0.1   0.07  0.03
4) Path unavailability      5     -     -     -     -     0.41  0.14  0.11  0.08  0.12  0.04
5) Total rerouted traffic   6     -     -     -     -     -     0.09  0.11  0.09  0.08  0.08
6) # of failures            2     -     -     -     -     -     -     0.29  0.31  0.25  0.17
7) # of O-D pairs           3     -     -     -     -     -     -     -     0.29  0.3   0.18
8) # of BGP prefixes        2     -     -     -     -     -     -     -     -     0.13  0.19
9) Path coverage            6     -     -     -     -     -     -     -     -     -     0.08
10) Total rerouted traffic  1     -     -     -     -     -     -     -     -     -     -
Overall 23% of all links are critical.
42
Cause Analysis
Markopoulou et al. used IS-IS update messages to classify link failures into the following categories [MIB+04].
Maintenance
Unplanned
• Shared failures
  – Router-related
  – Optical-related
  – Unspecified
• Individual failures
About 70% of all unplanned failures
43
Matching SLOS Alarms with IP Link Failures
[Figure: timeline of an IP link failure with the SONET alarms marked: SLOS (~20 ms) and SLOS Cleared (~12 sec).]
58% of all link failures are due to optical layer problems.
84% of critical failures are due to optical layer problems.
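The matching step can be sketched as a time-window join between the two logs. The slides do not give the matching rule, so everything here is an assumption: both logs are taken to use the same link identifiers, a failure counts as optical if an SLOS alarm on the same link falls within `window` seconds of it, and the 1-second window and sample records are made up.

```python
def match_failures_to_alarms(failures, alarms, window):
    """Return the IP link failures that have a nearby SLOS alarm.

    failures: list of (link, time) IP-level failure records.
    alarms:   list of (link, time) SLOS alarm records.
    window:   maximum allowed time difference, in seconds (assumed rule).
    """
    alarms_by_link = {}
    for link, t in alarms:
        alarms_by_link.setdefault(link, []).append(t)
    matched = []
    for link, t in failures:
        # A failure matches if any alarm on the same link is close in time.
        if any(abs(t - a) <= window for a in alarms_by_link.get(link, [])):
            matched.append((link, t))
    return matched

# Hypothetical log excerpts (times in seconds).
failures = [("L1", 100.02), ("L2", 200.0), ("L1", 500.0)]
alarms = [("L1", 100.0), ("L2", 260.0)]
optical = match_failures_to_alarms(failures, alarms, window=1.0)
share = len(optical) / len(failures)
```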
44
Reducing Critical Failures
Replace old optical fibers/parts.
Optical Protection.
Push the traffic away.
• Also works for maximum load and peak factor.
45
Performance Improvement
Time-based metrics
Metric                      % improvement
# of failures               41
# of affected O-D pairs     36
# of BGP prefixes           32
Path unavailability         39
Total rerouted traffic      29

Link-based metrics
Metric                      % improvement
# of failures               45
# of affected O-D pairs     37
# of BGP prefixes           29
Path coverage               42
Total rerouted traffic      38
46
Reducing Link Down-time
Low-failure links: Failures are very rare. Damping doesn’t help.
High-failure links: Failure rate changes very slowly. Fixed damping is wasteful.
47
Adaptive Damping
Input:
  : time difference between the last two failures
  : threshold
  : constant
function Adaptive_Damping
begin
  if ( < )
    ADT := x ;
  else
    ADT := 0;
end;
Output:
  ADT: Adaptive damping timer
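The symbols in the pseudocode did not survive in this transcript, so the following is only a sketch with placeholder names. It assumes the comparison is between the time since the previous failure and the threshold. The expression assigned in the first branch is not recoverable, so the sketch takes that value as a caller-supplied argument, `damped_value`, rather than guessing it.

```python
def adaptive_damping(gap, threshold, damped_value):
    """Return the adaptive damping timer (ADT) for a link.

    gap:          time difference between the last two failures
                  (placeholder name; original symbol lost).
    threshold:    gap below which the link is damped
                  (placeholder name; original symbol lost).
    damped_value: timer to apply when damping; the original expression
                  is unknown, so it is passed in (hypothetical parameter).
    """
    if gap < threshold:
        return damped_value  # failures are close together: damp the link
    return 0                 # failures are far apart: no damping

# Hypothetical values, all in seconds.
flapping = adaptive_damping(gap=5, threshold=60, damped_value=30)
stable = adaptive_damping(gap=120, threshold=60, damped_value=30)
```

Under these assumptions a link whose failures are far apart gets no damping, which fits the observation that damping does not help low-failure links.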