Page 1
High Performance Switching and Routing, Telecom Center Workshop: Sept 4, 1997.
Limiting the Impact of Failures on Network Performance
Joint work with Supratik Bhattacharyya and Christophe Diot
High Performance Networking Group, 25 Feb. 2004
Yashar Ganjali
Computer Systems Lab.
Stanford University
[email protected]
http://www.stanford.edu/~yganjali
Page 2
Motivation
The core of the Internet consists of several large networks (IP backbones).
IP backbones are carefully provisioned to guarantee low latency and jitter for packet delivery.
Failures occur on a daily basis as a result of physical layer malfunction, router hardware/software failures, maintenance, human errors, …
Failures affect the quality of service delivered to backbone customers.
Page 3
Outline
Background
  • Sprint’s IP backbone
  • Data
Impact Metrics
  • Time-based metrics
  • Link-based metrics
Measurements
Reducing the impact
  • Identifying critical failures
  • Causes analysis
  • Reducing critical failures
Page 4
Background – Sprint’s IP backbone
IP layer operates above DWDM with SONET framing.
IS-IS protocol is used to route traffic inside the network.
IP-level restoration
  • When an IP link fails, all routers in the network independently compute a new path around the failure.
  • No protection in the underlying optical infrastructure.
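As a rough illustration of IP-level restoration, the sketch below recomputes a shortest path after a link is marked as failed. The topology, link weights, and the names `shortest_path` and `topology` are hypothetical and only stand in for the shortest-path computation each router performs over its link-state database; this is not Sprint's configuration.

```python
import heapq

def shortest_path(graph, src, dst, failed=frozenset()):
    """Dijkstra over an undirected weighted graph, skipping failed links.

    graph:  dict mapping node -> {neighbor: weight} (hypothetical format).
    failed: set of frozenset({u, v}) links that are currently down.
    Returns the node list of a shortest path, or None if dst is unreachable.
    """
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            if frozenset((u, v)) in failed:
                continue  # the link is down, route around it
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

# Hypothetical four-router topology.
topology = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 1},
    "C": {"A": 4, "B": 1, "D": 1},
    "D": {"C": 1},
}
before = shortest_path(topology, "A", "C")
after = shortest_path(topology, "A", "C", failed={frozenset(("A", "B"))})
```

Here `before` goes through B, while `after` falls back to the direct but more expensive A-C link once A-B is down.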
Page 5
Data
IS-IS Link State PDU logs
  • Collected by passive listeners from Sprint’s North America backbone.
  • Feb. 1st, 2003 to Jun. 30th, 2003.
SNMP logs
  • Link loads recorded once every 5 minutes.
SONET layer alarms
  • Corresponding to minor and major problems in the optical layer.
  • We are only interested in two alarms: SLOS and SLOS cleared.
Page 6
Link Failures in Sprint’s IP Backbone – 9408 Failures
Page 7
Inter-POP vs. Intra-POP
[Figure: node labels ANA-1, ANA-2, ANA-3, ANA-4]
Page 9
Inter-POP Link Failures in Sprint’s IP Backbone
Page 10
Two Perspectives
For a given impact metric:
  • Time-based analysis: measure the impact of failures on the given metric as a function of time.
  • Link-based analysis: measure the impact of failures on the given metric as a function of failing links.
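The two perspectives amount to grouping the same failure records by different keys. The sketch below is only illustrative: the record layout, the `impact` field, and the function names are assumptions, not the format of the IS-IS logs used in this work.

```python
from collections import defaultdict

# Hypothetical failure records: (start_time_s, end_time_s, link_id, impact).
failures = [
    (0, 30, "L1", 5),
    (10, 20, "L2", 3),
    (3600, 3660, "L1", 7),
]

def impact_by_time(records, bin_seconds):
    """Time-based view: total impact of the failures starting in each time bin."""
    totals = defaultdict(int)
    for start, _end, _link, impact in records:
        totals[start // bin_seconds] += impact
    return dict(totals)

def impact_by_link(records):
    """Link-based view: total impact attributed to each failing link."""
    totals = defaultdict(int)
    for _start, _end, link, impact in records:
        totals[link] += impact
    return dict(totals)

per_hour = impact_by_time(failures, 3600)
per_link = impact_by_link(failures)
```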
Page 11
Time-based Impact Metrics
1. Number of Simultaneous Link Failures
2. Number of affected O-D pairs
3. Number of affected BGP prefixes
4. Path unavailability
5. Total rerouted traffic
6. Maximum load
Page 12
Number of Simultaneous Failures
Page 13
Number of Simultaneous Failures
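One way to obtain this metric from a list of failure intervals is a sweep over start and end events. This is a minimal sketch under the assumption that each failure is available as a (start, end) pair; the function names are hypothetical.

```python
def simultaneous_failures(intervals):
    """Return [(time, count)] change points of the number of links down.

    intervals: list of (start, end) pairs with start < end.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    # At equal timestamps (t, -1) sorts before (t, +1), so a recovery is
    # processed before a new failure that starts at the same instant.
    events.sort()
    series = []
    count = 0
    for time, delta in events:
        count += delta
        if series and series[-1][0] == time:
            series[-1] = (time, count)  # keep one entry per timestamp
        else:
            series.append((time, count))
    return series

def max_simultaneous(intervals):
    """Largest number of failures in progress at the same time."""
    return max((c for _t, c in simultaneous_failures(intervals)), default=0)

example = [(0, 10), (5, 15), (8, 9), (20, 25)]
peak = max_simultaneous(example)
```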
Page 15
Number of Affected O-D Pairs
[Figure: example topology with nodes A, B, C, D, E, F]
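The slide text does not spell out when an O-D pair counts as affected. The sketch below assumes a pair is affected when its current shortest path traverses the failed link, and uses hop-count routing on a hypothetical six-node ring; the edges and function names are assumptions, not the topology in the figure.

```python
from collections import deque
from itertools import permutations

def bfs_path(adj, src, dst):
    """Hop-count shortest path from src to dst, or None if unreachable."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in sorted(adj[u]):  # sorted for deterministic tie-breaking
            if v not in prev:
                prev[v] = u
                queue.append(v)
    if dst not in prev:
        return None
    path = []
    node = dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

def affected_od_pairs(adj, failed_link):
    """Ordered O-D pairs whose pre-failure path uses failed_link (assumed definition)."""
    u, v = failed_link
    affected = []
    for src, dst in permutations(sorted(adj), 2):
        path = bfs_path(adj, src, dst)
        if path is None:
            continue
        hops = set(zip(path, path[1:]))
        if (u, v) in hops or (v, u) in hops:
            affected.append((src, dst))
    return affected

# Hypothetical ring A-B-C-F-E-D-A.
ring = {
    "A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "F"},
    "D": {"A", "E"}, "E": {"D", "F"}, "F": {"C", "E"},
}
affected = affected_od_pairs(ring, ("C", "F"))
```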
Page 16
Number of Affected O-D Pairs
Page 18
Number of Affected BGP Prefixes
Page 20
Path Unavailability
[Figure: example topology with nodes A, B, C, D, E, F]
Page 21
Path Unavailability
Page 23
Total Rerouted Traffic
Page 25
Maximum Load Throughout the Network
Page 26
Maximum Load Throughout the Network
96% of link failures were not followed by an immediate change in maximum load.
Page 28
Number of Failures per Link
Page 29
Number of Affected OD Pairs per Link
Page 30
Number of Affected BGP Prefixes per Link
Page 31
Path Coverage
[Figure: example topology with nodes A, B, C, D, E, F]
Page 32
Path Coverage of Links
Page 33
Total Rerouted Traffic on a Link
Page 34
Peak Factor of a Link
Page 35
Link-based Impact Metrics
1. Number of Link Failures
2. Number of affected O-D pairs
3. Number of affected BGP prefixes
4. Path coverage
5. Total rerouted traffic
6. Peak factor
Page 37
Critical Failures
For each time-based metric:
  • Removing failures occurring during 1-5% of the time improves the metric by a factor of at least 5.
For each link-based metric:
  • Removing failures on 1-7% of links improves the metric by a factor of at least 3.
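A critical set of this kind can be found by ranking contributors and removing the largest ones first. The sketch below assumes the metric is a plain sum of per-link (or per-time-bin) contributions, which the slides do not state; `critical_set` and the example numbers are hypothetical.

```python
def critical_set(contributions, factor):
    """Smallest set of top contributors whose removal improves the total by `factor`.

    contributions: dict mapping item (link or time bin) -> non-negative
                   contribution to an additive metric (assumed form).
    factor:        required improvement, e.g. 3 means total / 3 or less remains.
    """
    total = sum(contributions.values())
    if total == 0:
        return []
    target = total / factor
    remaining = total
    chosen = []
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    for item, value in ranked:
        if remaining <= target:
            break
        chosen.append(item)
        remaining -= value
    return chosen

# Hypothetical per-link contributions to some impact metric.
per_link_impact = {"L1": 70, "L2": 15, "L3": 5, "L4": 5, "L5": 5}
critical_links_3x = critical_set(per_link_impact, 3)
critical_links_5x = critical_set(per_link_impact, 5)
```

For an additive metric, taking the largest contributors first gives the smallest possible set; for non-additive metrics such as maximum load this greedy ranking is only a heuristic.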
Page 38
Critical Time Periods
Page 39
Critical Links
Any link that has a critical failure is called a critical link.
We are interested in fixing such links.
Page 40
Correlation of Critical Sets
Page 41
Correlation of the Critical Sets
Metric                      Size  1     2     3     4     5     6     7     8     9     10
1) Simultaneous failures    11    -     0.38  0.33  0.27  0.23  0.11  0.13  0.08  0.15  0.05
2) # of O-D pairs           9     -     -     0.37  0.21  0.25  0.12  0.14  0.06  0.09  0.06
3) # of BGP prefixes        6     -     -     -     0.18  0.32  0.09  0.05  0.1   0.07  0.03
4) Path unavailability      5     -     -     -     -     0.41  0.14  0.11  0.08  0.12  0.04
5) Total rerouted traffic   6     -     -     -     -     -     0.09  0.11  0.09  0.08  0.08
6) # of failures            2     -     -     -     -     -     -     0.29  0.31  0.25  0.17
7) # of O-D pairs           3     -     -     -     -     -     -     -     0.29  0.3   0.18
8) # of BGP prefixes        2     -     -     -     -     -     -     -     -     0.13  0.19
9) Path coverage            6     -     -     -     -     -     -     -     -     -     0.08
10) Total rerouted traffic  1     -     -     -     -     -     -     -     -     -     -
Overall 23% of all links are critical.
Page 42
Cause Analysis
Markopoulou et al. have used IS-IS update messages to classify link failures into the following categories [MIB+04]:
Maintenance
Unplanned
  • Shared failures
      – Router-related
      – Optical-related
      – Unspecified
  • Individual failures
About 70% of all unplanned failures
Page 43
Matching SLOS Alarms with IP Link Failures
[Figure: timeline of an IP link failure, with SLOS (~20 ms) and SLOS cleared (~12 sec) marked]
58% of all link failures are due to optical layer problems.
84% of critical failures are due to optical layer problems.
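Matching alarms with IP link failures can be sketched as a windowed join on timestamps. The version below assumes the failure and alarm times belong to the same link and are given in integer milliseconds, and it treats the ~20 ms figure as a matching tolerance; that reading of the timeline, the function name, and the sample values are all assumptions.

```python
from bisect import bisect_left

def match_failures_to_alarms(failure_times, alarm_times, tolerance):
    """For each failure time, report whether an alarm lies within +/- tolerance.

    All times are in the same unit (milliseconds in the example below).
    """
    alarms = sorted(alarm_times)
    matched = []
    for t in failure_times:
        # First alarm at or after the start of the window [t - tol, t + tol].
        i = bisect_left(alarms, t - tolerance)
        matched.append(i < len(alarms) and alarms[i] <= t + tolerance)
    return matched

# Hypothetical timestamps in milliseconds for a single link.
ip_failures_ms = [100_000, 250_000, 400_000]
slos_alarms_ms = [99_990, 400_500]
flags = match_failures_to_alarms(ip_failures_ms, slos_alarms_ms, tolerance=20)
optical_fraction = sum(flags) / len(flags)
```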
Page 44
Reducing Critical Failures
Replace old optical fibers/parts.
Optical protection.
Push the traffic away.
  • Also works for maximum load and peak factor.
Page 45
Performance Improvement
Time-based metrics                        Link-based metrics
Metric                    % improvement   Metric                    % improvement
# of failures             41              # of failures             45
# of affected O-D pairs   36              # of affected O-D pairs   37
# of BGP prefixes         32              # of BGP prefixes         29
Path unavailability       39              Path coverage             42
Total rerouted traffic    29              Total rerouted traffic    38
Page 46
Reducing Link Down-time
Low-failure links: Failures are very rare. Damping doesn’t help.
High-failure links: Failure rate changes very slowly. Fixed damping is wasteful.
Page 47
Adaptive Damping
Input:
    : time difference between the last two failures
    : threshold
    : constant

function Adaptive_Damping
begin
    if ( < )
        ADT := x ;
    else
        ADT := 0;
end;

Output:
    ADT: adaptive damping timer
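The symbols in the pseudocode above did not survive extraction, so the following Python sketch fills them in under stated assumptions: `gap` is the time difference between the last two failures, `threshold` and `scale` are the threshold and the constant, the test is `gap < threshold`, and the timer is `scale * gap`. The names and the choice of multiplying by `gap` (rather than by the threshold) are guesses, not the authors' definition.

```python
def adaptive_damping_timer(gap, threshold, scale):
    """Return the adaptive damping timer (ADT) under the assumptions above.

    gap:       time between the last two failures on the link (assumed name).
    threshold: failures closer together than this trigger damping (assumed name).
    scale:     constant multiplier applied to the gap (assumed role).
    """
    if gap < threshold:
        # Recent repeated failure: hold the link down for a scaled interval.
        return scale * gap
    # Failures are far apart: no damping.
    return 0

adt_flapping = adaptive_damping_timer(gap=10, threshold=60, scale=2)
adt_stable = adaptive_damping_timer(gap=100, threshold=60, scale=2)
```

Under these assumptions a link that rarely fails is never damped, while a link that failed recently gets a timer that follows its own failure spacing instead of a fixed value.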
Page 48
Number – Duration Pareto Curve