Daniel Turner, Kirill Levchenko, Alex C. Snoeren, Stefan Savage

Daniel Turner, Kirill Levchenko, Alex C. Snoeren, Stefan Savage · conferences.sigcomm.org/sigcomm/2010/slides/S9Turner.pdf · understanding of real failures ... A time series of Layer-3

Jul 19, 2020

Page 1: Daniel Turner Kirill Levchenko, Alex C. Snoeren, Stefan Savageconferences.sigcomm.org/sigcomm/2010/slides/S9Turner.pdf · understanding of real failures ... A time series of Layer-3

Daniel Turner

Kirill Levchenko,

Alex C. Snoeren,

Stefan Savage


Failure is a reality for large networks

Achieving high availability requires engineering the network to be robust to failure

Designing mechanisms to effectively mitigate failures requires deep understanding of real failures


Big failures generate news stories

◦ Rarely contain useful details

◦ Most network failures are not catastrophic


Collecting comprehensive failure data is difficult

◦ Lightweight techniques are limited

◦ Special-purpose monitoring is expensive

Access to network data is limited

◦ A few publicly available studies [A. Markopoulou ToN ’08] [C. Cranor SIGMOD ’03]

◦ Many networks consider data proprietary

Some networks can’t invest time or capital


Methodology to reconstruct failure history of a network

◦ Using only commonly available data

◦ No need for additional instrumentation

Analyze a production network


A time series of Layer-3 failure events

◦ I.e., for each link, a set of state transitions between up and down

And, where possible, annotated with:

◦ What caused the failure?

◦ What was the impact of the failure?


interface GigabitEthernet1/1

ip address 137.211.22.8 255.255.255.254


interface GigabitEthernet0/2

ip address 137.211.23.2 255.255.255.254


interface GigabitEthernet1/1

ip address 137.211.22.9 255.255.255.254


interface GigabitEthernet3/2

ip address 137.211.25.9 255.255.255.254


Router x:

Interface 1/1

DOWN


Router Y:

Interface 2/3

DOWN


Router x:

Interface 1/1

UP


Router Y:

Interface 2/3

UP


This message is to alert you that the CENIC network engineering team has scheduled an emergency repair

Start 0001 PDT, FRI 9/02/06

End 0200 PDT, FRI 9/02/06

SCOPE: Shark bites through cable

IMPACT: Loss of redundancy between San Francisco and Los Angeles

COMMENTS: It left behind a tooth


How can we reconstruct a failure 4 years later?

◦ Syslog: describes interface state changes

◦ Router configuration files: map interfaces to links

◦ Operational announcements

Caveat: data not intended for failure reconstruction


interface GigabitEthernet1/1

ip address 137.211.22.8 255.255.255.254

137.211.22.9


interface GigabitEthernet1/1

ip address 137.211.22.8 255.255.255.254

interface GigabitEthernet0/2

ip address 137.211.23.2 255.255.255.254

137.211.22.9


interface GigabitEthernet1/1

ip address 137.211.22.8 255.255.255.254

interface GigabitEthernet1/1

ip address 137.211.22.9 255.255.255.254

137.211.22.9
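The matching step shown on these slides (an interface configured as 137.211.22.8 with mask 255.255.255.254 pairs with the interface holding 137.211.22.9) can be sketched in Python. This is a minimal illustration and not the authors' tooling: the function name `infer_links` and the router names are hypothetical, and it assumes every link is a point-to-point /31 whose two ends appear in the configuration files of different routers.

```python
import ipaddress
from collections import defaultdict


def infer_links(interfaces):
    """Group (router, interface, address, netmask) tuples by subnet.

    Assumption: two interfaces that fall in the same subnet are the
    two ends of one point-to-point link.
    """
    by_subnet = defaultdict(list)
    for router, ifname, addr, mask in interfaces:
        # ip_interface accepts "address/netmask" and exposes the enclosing network.
        iface = ipaddress.ip_interface(f"{addr}/{mask}")
        by_subnet[iface.network].append((router, ifname))
    # Keep only subnets where both ends were found.
    return {net: ends for net, ends in by_subnet.items() if len(ends) == 2}


# Router names are hypothetical; addresses and masks are the ones on the slides.
interfaces = [
    ("router-a", "GigabitEthernet1/1", "137.211.22.8", "255.255.255.254"),
    ("router-b", "GigabitEthernet0/2", "137.211.23.2", "255.255.255.254"),
    ("router-c", "GigabitEthernet1/1", "137.211.22.9", "255.255.255.254"),
    ("router-d", "GigabitEthernet3/2", "137.211.25.9", "255.255.255.254"),
]
links = infer_links(interfaces)
```

With the four example interfaces, only the 137.211.22.8/31 subnet has two members, so only that pair is reported as a link; the other two interfaces stay unmatched until their peers are found.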


SYSLOG
02:40:05 x.cenic.net: %LINK-3-UPDOWN: Interface GigE1/1, changed state to down
02:40:05 Y.cenic.net: %LINK-3-UPDOWN: Interface GigE2/3, changed state to down
02:45:35 x.cenic.net: %LINK-3-UPDOWN: Interface GigE1/1, changed state to up
02:45:35 Y.cenic.net: %LINK-3-UPDOWN: Interface GigE2/3, changed state to up
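As a rough sketch of how such messages might be turned into state transitions, the Python below extracts the time, router, interface, and new state from each line. It assumes the simplified line layout shown on the slide; real syslog lines usually carry extra fields (dates, sequence numbers), so the pattern `LINE_RE` and the helper `parse_updown` are illustrative assumptions rather than a general parser.

```python
import re

# Pattern for the simplified layout on the slide (an assumption, not a full syslog grammar).
LINE_RE = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}) (?P<router>\S+): %LINK-3-UPDOWN: "
    r"Interface (?P<iface>[^,\s]+), changed state to (?P<state>up|down)"
)


def parse_updown(lines):
    """Return (time, router, interface, state) for every matching line."""
    events = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:  # silently skip lines that are not link up/down messages
            events.append(
                (match["time"], match["router"], match["iface"], match["state"])
            )
    return events


log = [
    "02:40:05 x.cenic.net: %LINK-3-UPDOWN: Interface GigE1/1, changed state to down",
    "02:40:05 Y.cenic.net: %LINK-3-UPDOWN: Interface GigE2/3, changed state to down",
    "02:45:35 x.cenic.net: %LINK-3-UPDOWN: Interface GigE1/1, changed state to up",
    "02:45:35 Y.cenic.net: %LINK-3-UPDOWN: Interface GigE2/3, changed state to up",
]
events = parse_updown(log)
```

Each (router, interface) pair can then be mapped to a link using the configuration files, so that the two Down messages at 02:40:05 are attributed to the same link failure.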


This message is to alert you that the CENIC network engineering team is performing an emergency repair

Start 0001 PDT, FRI 9/02/06

End 0200 PDT, FRI 9/02/06

SCOPE: Shark bites through cable

IMPACT: Loss of redundancy between San Francisco and Los Angeles

COMMENTS: It left behind a tooth
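One plausible way to connect an announcement like this to a reconstructed failure is to compare time windows. The Python sketch below parses the Start and End values and tests for overlap with a failure interval. It is an assumption-laden illustration, not the authors' procedure: `parse_when` and `overlaps` are hypothetical helpers, the time zone label is ignored, and the failure interval in the example is made up.

```python
from datetime import datetime


def parse_when(field):
    """Parse a value such as '0001 PDT, FRI 9/02/06' into a naive datetime.

    Assumption: the first token is HHMM and the last token is M/DD/YY;
    the time zone and weekday are ignored.
    """
    clock = field.split()[0]
    date_part = field.split()[-1]
    return datetime.strptime(f"{date_part} {clock}", "%m/%d/%y %H%M")


def overlaps(start_a, end_a, start_b, end_b):
    """True when the two half-open intervals share any time."""
    return start_a < end_b and start_b < end_a


window_start = parse_when("0001 PDT, FRI 9/02/06")
window_end = parse_when("0200 PDT, FRI 9/02/06")

# Hypothetical failure interval reconstructed from syslog.
failure_start = datetime(2006, 9, 2, 0, 30)
failure_end = datetime(2006, 9, 2, 0, 45)

explained = overlaps(window_start, window_end, failure_start, failure_end)
```

A time match alone does not prove the announcement explains the failure; the SCOPE and IMPACT fields would still need to be checked against the affected link.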


Serving California educational institutions

Over 200 routers

5 years of data

[Map of the network; labeled sites: LAX, SLO, SOL, SVL, OAK]


This message is to alert you that the CENIC network engineering team has scheduled maintenance

Start 0001 PDT, FRI 8/17/05

End 0200 PDT, FRI 8/17/05

SCOPE: Routing protocol parameter change

IMPACT: San Francisco PoP

COMMENTS: Other PoPs to follow


This message is to alert you that the CENIC network engineering team has scheduled a repair

Start 1930 PDT, FRI 11/17/06

End 2000 PDT, FRI 11/17/06

SCOPE: Faulty optical amplifier

IMPACT: San Diego PoP

COMMENTS: …


Motivation

Methodology

◦ Limitations

◦ Validation

Findings in the CENIC network


Syslog messages are sent from routers to a central server

◦ Using UDP

Messages are lost


[Figure: timeline of link state (Up/Down) against time 0 to 5, annotated with the syslog Down and Up messages that mark each transition]

Two Down messages in a row with no Up in between: what happened?

◦ Message lost

◦ Spurious message

Exclude time between 2 & 3

Same issue with double Ups
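The handling of repeated Down or Up messages can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: `reconstruct` is a hypothetical helper, the input is one link's messages already sorted by time, and the example times (2, 3, 4) are chosen only to mirror the "exclude time between 2 & 3" case.

```python
def reconstruct(transitions):
    """Split one link's (time, state) messages into failures and ambiguous spans.

    A Down followed by an Up yields a failure interval. Two identical
    states in a row mean a message was lost or spurious, so the span
    between them is set aside instead of being counted as up or down.
    """
    failures, ambiguous = [], []
    prev_time, prev_state = None, None
    for time, state in transitions:
        if prev_state is not None:
            if state == prev_state:
                ambiguous.append((prev_time, time))  # double Down or double Up
            elif prev_state == "down":
                failures.append((prev_time, time))   # Down followed by Up
        prev_time, prev_state = time, state
    return failures, ambiguous


# Double Down: the link's state between times 2 and 3 is unknown.
failures, ambiguous = reconstruct([(2, "down"), (3, "down"), (4, "up")])

# Double Up: the same treatment applies.
failures_up, ambiguous_up = reconstruct([(1, "down"), (2, "up"), (3, "up")])
```

Setting ambiguous spans aside shrinks the observation period slightly, but it avoids guessing whether the link was up or down during an interval the logs cannot resolve.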


Configuration files are logged intermittently

Configuration files do not describe layer 2 topology


Operational announcements are written by humans

◦ Selection bias

◦ Categorization is subjective


Are there events mentioned in announcements that aren’t in syslog?

◦ Manually checked a random 1% of announcements

◦ 97% of events were confirmed


How do we know syslog is accurate?

CAIDA Skitter project (now Ark)

◦ Traceroutes to every /24 on the Internet

◦ 75 million probes over 6 months traversed CENIC

◦ Confirmed no traffic over any interface that we thought was down


Can we verify links were down?

◦ Routing protocols aim to mask failures

◦ Isolation is externally visible: BGP updates are sent

Route Views project records BGP traffic

◦ Verified 105 out of 147 isolation events


Motivation

Methodology

◦ Limitations

◦ Validation

Findings in the CENIC network


[Map of the network; labeled sites: LAX, SLO, SOL, SVL, OAK]

Three types of links:

◦ Backbone

◦ Customer Access

◦ High Performance Backbone


[Figure: plot with reference levels marked at 99.9%, 99.99%, and 99.999%]
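If these percentages are read as availability levels (an assumption, since the figure itself does not survive in this transcript), each one corresponds to a yearly downtime budget. The Python sketch below does the arithmetic for a 365-day year; `downtime_per_year` is a hypothetical helper name.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 365-day year, leap days ignored


def downtime_per_year(availability_percent):
    """Seconds of downtime per year allowed at the given availability."""
    return (100 - availability_percent) / 100 * SECONDS_PER_YEAR


for level in (99.9, 99.99, 99.999):
    minutes = downtime_per_year(level) / 60
    print(f"{level}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Three nines allow roughly 8.8 hours of downtime per year, four nines under an hour, and five nines only a few minutes, which is why each additional nine is much harder to reach.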


> 60% of failures last less than 1 minute


7,000 email announcements

3,000 events

28% of events describe a failure

18% of observed failures are explained


Other

* Machine room flooded

* DoS attack

* Construction crews demolished a manhole with active cables

* Or unsolved


Not all downtime is equal

◦ Some failures are unexpected: fiber cuts

◦ Some failures are scheduled: software upgrades


Scheduled vs. Unscheduled

◦ Simple metric to evaluate impact

Difficult to gauge impact of most failures

◦ Only 18% of failures are covered by an email

Customer isolation events have a clear impact

◦ Recall, BGP traffic makes these easy to spot


Engineering for failure requires real data

◦ Data has historically been difficult to obtain

Methodology to perform historical failure analysis with low-quality data sources

Shared our findings in the CENIC network

◦ Reliability of individual components

◦ Causes of failures

◦ Impact of failures