Scalability & Stability of the Internet Infrastructure Farnam Jahanian Department of EECS University of Michigan <[email protected]>

Dec 20, 2015

Page 1: Scalability & Stability of the Internet Infrastructure Farnam Jahanian Department of EECS University of Michigan.

Scalability & Stability of the Internet Infrastructure

Farnam Jahanian

Department of EECS

University of Michigan <[email protected]>

Page 2

Context

LIGHTHOUSE: Survivable Network Infrastructure

Joint projects between U. Michigan & Merit Network

Network Infrastructure
• Routers
• Name Servers
• Critical Services

Anomalous Network Events
• Network Attacks
• S/H Failures
• Operational Faults

Coarse and Fine Grained Measurement Tools
• Windmill Probes
• Netflow Statistics
• Protocol Scrubbers

Analysis Engines
• Event Aggregation
• Data Mining

Active Response Capabilities
• Replication schemes
• Countermeasures

Page 3

Motivation

Increasing reliance of financial and national utility infrastructures on interconnected IP-based networks

Explosive growth in both size and topological complexity of the underlying communication infrastructure

Reliance on off-the-shelf infrastructure & shrink-wrapped code

Network infrastructure is vulnerable:
– inherent instability and transient oscillations
– delayed convergence and long failover
– coordinated denial of service attacks on network resources
– hardware and software failures
– operational faults and misconfigurations

Page 4

Imminent Collapse of the Internet

[Figure labels: "Now", "?", "Collapse of the Internet"]

Page 5

Internet Growth

Explosive growth in both size and topological complexity

– Internet end-system growth
– Traffic volume & characteristics
– Infrastructure topological evolution

Page 6

Infrastructure Topological Evolution

Between 1995 and 1999:

Decentralization: from a single backbone network to a conglomeration of hundreds of backbones and thousands of ISPs.

Loss of hierarchy and abstraction: from a strict hierarchical network to an increasingly full-mesh interconnection.

Significant bandwidth increase: from a single T3 (45 Mbps) circuit and T1 (1.5 Mbps) links to multiple OC48 (2.4 Gbps) circuits and OC12 (622 Mbps) lines between nodes.

Page 7

Internet Evolution: NSFNet

[Figure: the NSFNet Backbone at the top, Regional networks below it, Campus networks at the bottom; links labeled Hello/EGP]

Hierarchical network with a single central backbone

Page 8

Internet Evolution: Today

[Figure: autonomous systems AS1, AS2, AS3, AS4 and customers C1, C2, C3, C4]

Full-mesh interconnection of ISP backbones and customers

Page 9

Impact of Instability & Failures

– Increased end-to-end Loss/Latency

– Increased delay in convergence & network reachability

– Backbone infrastructure CPU/Memory requirements

– Backbone “route flap storms”

– Network management complexity

Page 10

Background: Internet Architecture

[Figure: Internet architecture diagram with links labeled BGP]

Page 11

Background: Internet Routing

Two major categories
– Inter-domain (BGP between autonomous systems)
– Intra-domain (OSPF, IS-IS, IGRP inside an AS)

BGP
– Incremental: announcements and withdraws
– Updates include policy (e.g. MED, ASPath)
– Maintain multiple possible routes

Page 12

Background: BGP Routing Protocol

BGP is an incremental protocol that sends update information only upon changes in network topology or routing policy.

Two forms of messages:
– announcements: a new network is accessible, or another route to a network destination is preferred
– withdrawals: a destination network is no longer accessible

Routing policies vs. shortest number of hops
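
The incremental behavior described on this slide can be sketched in a few lines. This is a toy illustration, not a BGP implementation: the class name `Rib`, its methods, the peer names, and the shortest-AS-path selection rule are all assumptions made for the example (a real router applies routing policy before path length, which is the point of the last bullet above).

```python
from collections import defaultdict


class Rib:
    """Toy routing table that keeps every route learned for a prefix."""

    def __init__(self):
        # prefix -> {peer: AS path learned from that peer}
        self.routes = defaultdict(dict)

    def announce(self, peer, prefix, as_path):
        """Announcement: a new route, or a replacement for the peer's old one."""
        self.routes[prefix][peer] = tuple(as_path)

    def withdraw(self, peer, prefix):
        """Withdrawal: this peer can no longer reach the prefix."""
        self.routes[prefix].pop(peer, None)

    def best(self, prefix):
        """Return (peer, as_path) with the shortest AS path, or None."""
        candidates = self.routes.get(prefix)
        if not candidates:
            return None
        # Hypothetical selection rule: shortest path, ties broken by peer name.
        return min(candidates.items(), key=lambda item: (len(item[1]), item[0]))


rib = Rib()
rib.announce("peer-a", "192.0.2.0/24", [65001, 65010])
rib.announce("peer-b", "192.0.2.0/24", [65002, 65003, 65010])
print(rib.best("192.0.2.0/24"))   # the route via peer-a has the shorter AS path
rib.withdraw("peer-a", "192.0.2.0/24")
print(rib.best("192.0.2.0/24"))   # the alternate route kept from peer-b takes over
```

Because the table keeps multiple possible routes per prefix, a single withdrawal does not necessarily make the destination unreachable; it only removes one candidate.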

Page 13

Background: Internet Core

Networks aggregated into CIDR (Classless Inter-Domain Routing) prefixes

Prefix represents a set of destination IP addresses

At Internet “core” all routers maintain paths to “default-free” routes

Originally 5 major Internet Exchange Points (IXPs)

In 1996, approximately 30,000 default-free routes
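
To make "a prefix represents a set of destination IP addresses" concrete, here is a small sketch using Python's standard `ipaddress` module. The prefixes are documentation address ranges chosen for illustration, and `longest_prefix_match` is a hypothetical helper, not something from the talk.

```python
import ipaddress


def longest_prefix_match(table, address):
    """Return the most specific prefix in `table` that contains `address`."""
    addr = ipaddress.ip_address(address)
    matches = [net for net in table if addr in net]
    return max(matches, key=lambda net: net.prefixlen, default=None)


# Illustrative prefixes only (documentation address space).
table = [
    ipaddress.ip_network("198.51.100.0/22"),
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]

aggregate = table[0]
print(aggregate.num_addresses)                       # addresses covered by the /22
print(longest_prefix_match(table, "198.51.100.7"))   # the more specific /24
print(longest_prefix_match(table, "198.51.102.1"))   # only the /22 covers this one
print(longest_prefix_match(table, "192.0.2.1"))      # no covering prefix
```

A router without a default route must hold a covering prefix for every reachable destination, which is why the count of default-free routes is a useful measure of routing state in the core.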

Page 14

Roadmap

Study of stability of routing in the Internet backbone
– Transient oscillations, pathological redundant updates
– Congestion collapse and correlation to network usage
– SIGCOMM’97 and INFOCOM’99

Study of route availability and failover rates
– Long-term availability of Internet backbone routes
– Case study of regional provider
– FTCS’99

Study of convergence behavior of routing protocols
– Injection of route changes into the Internet backbone
– Impact of convergence delay on end-to-end path
– 18-month study & ongoing

Page 15

Internet Exchange Points

Deployed probe machines at five public exchange points

Collected all routing updates at IXPs over a four-year period

Page 16

Internet Routing Instability Results

Number of BGP routing updates exchanged per day in the Internet core is orders of magnitude larger than expected.

Routing information is dominated by pathological, or redundant, updates that do not directly reflect changes in routing policy or topology.

Instability and redundant updates exhibit a specific periodicity of 30 and 60 seconds.

Instability and redundant updates show a surprising correlation to network usage and exhibit corresponding daily and weekly cyclic trends.

Page 17

Instability Results (Continued)

Instability is not dominated by a small set of autonomous systems or routes.

Instability is not disproportionately dominated by prefixes of specific lengths, i.e. independent of aggregation.

Discounting policy fluctuation and pathological behavior, there remains a significant level of Internet forwarding instability.

Details: SIGCOMM’97 & INFOCOM’99

Page 18

Growth in Routing State

Linear growth in routing table

Page 19

Initial Findings (SIGCOMM’97)

Up to 60 million BGP updates/day for only 30,000 default-free routes!
– On avg. 2-6 million withdraws per day (mostly duplicates)
– e.g., ISP A had 259 routes but withdrew 2.4 million routes

All state changes well distributed across prefix lengths, autonomous systems

Unexpected frequency components
– 30 second inter-arrival time between updates
– Daily/weekly components
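
A quick back-of-the-envelope calculation shows why these numbers were surprising. The figures are the ones on this slide; the variable names are illustrative, and the ISP A ratio assumes that its route count and its withdrawal count refer to the same measurement period, which the slide does not state.

```python
# Figures quoted on the slide.
updates_per_day = 60_000_000
default_free_routes = 30_000

# Average number of updates per route per day at the peak.
updates_per_route = updates_per_day / default_free_routes
# Equivalent spacing if the updates were spread evenly over 24 hours.
seconds_between_updates = 86_400 / updates_per_route

# ISP A: routes held versus routes withdrawn (same period assumed).
isp_a_routes = 259
isp_a_withdrawals = 2_400_000
withdrawals_per_route = isp_a_withdrawals / isp_a_routes

print(updates_per_route)               # updates per route per day
print(seconds_between_updates)         # seconds between updates for one route
print(round(withdrawals_per_route))    # withdrawals per route ISP A actually had
```

At the peak, every default-free route would be changing state thousands of times per day on average, far more than real topology or policy changes could explain.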

Page 20

More Initial Observations

Most routing updates pathological (millions!)

– Some due to misconfiguration
   • Private networks
   • Host routes
   • Multicast routes

– Majority duplicate updates
   • Duplicate withdraws (WWDup > 99.99%)
   • Duplicate announcements (AADup)
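
The duplicate categories can be illustrated with a small classifier over an update stream. This sketch assumes that WWDup means a withdrawal for a prefix the same peer has already withdrawn, and that AADup means an announcement identical to the route the peer has already announced; the function name, tuple layout, and the plain "W"/"A" labels for non-duplicates are hypothetical.

```python
def classify_updates(updates):
    """Label each update as "WWDup", "AADup", "W", or "A".

    `updates` is an iterable of (peer, prefix, kind, attrs) tuples where
    kind is "A" (announcement) or "W" (withdrawal) and attrs is a tuple of
    route attributes (ignored for withdrawals).
    """
    state = {}   # (peer, prefix) -> current attrs, or None once withdrawn
    labels = []
    for peer, prefix, kind, attrs in updates:
        key = (peer, prefix)
        if kind == "W":
            # A first-seen withdrawal is left as "W": the trace does not
            # show whether the prefix was ever announced by this peer.
            duplicate = key in state and state[key] is None
            labels.append("WWDup" if duplicate else "W")
            state[key] = None
        else:
            duplicate = state.get(key) is not None and state[key] == attrs
            labels.append("AADup" if duplicate else "A")
            state[key] = attrs
    return labels


# Made-up trace for illustration.
trace = [
    ("peer-a", "192.0.2.0/24", "A", ("65001 65010",)),
    ("peer-a", "192.0.2.0/24", "A", ("65001 65010",)),   # same route again
    ("peer-a", "192.0.2.0/24", "W", None),
    ("peer-a", "192.0.2.0/24", "W", None),               # already withdrawn
    ("peer-a", "192.0.2.0/24", "A", ("65001 65020 65010",)),
]
print(classify_updates(trace))
```

Neither kind of duplicate changes what the receiver knows, so none of them reflects a change in topology or policy.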

Page 21

BGP Updates

Page 22

30 Second Frequency Components 1997

Page 23

Origins of Pathological Updates (INFOCOM99)

Majority stem from two router software implementation issues:
– stateless BGP withdraws
– non-transitive attribute filtering

Frequency due to non-jittered router timers
– lack of precise specification

Other sources of pathologies:
– BGP/IBGP misconfiguration
– Still others due to DSU/CSU oscillation
– And still others due to the distance-vector algorithm
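
The effect of stateless withdraws can be sketched as follows, under the assumption that "stateless" means the router does not remember which peers it previously announced a prefix to, and therefore sends a withdrawal to every peer. The function and peer names are hypothetical.

```python
def withdrawals_sent(peers, announced_to, stateful):
    """Return the peers that receive a withdrawal for one prefix.

    peers        -- all BGP peers of the router
    announced_to -- peers that were actually sent an announcement earlier
    stateful     -- whether the router remembers `announced_to`
    """
    if stateful:
        # Only peers that hold the route need to be told it is gone.
        return sorted(p for p in peers if p in announced_to)
    # Without per-peer state, the router has to tell everyone.
    return sorted(peers)


peers = {"peer-a", "peer-b", "peer-c", "peer-d"}
announced_to = {"peer-b"}

print(withdrawals_sent(peers, announced_to, stateful=False))  # all four peers
print(withdrawals_sent(peers, announced_to, stateful=True))   # only peer-b
```

In the stateless case, three of the four withdrawals refer to a route the receiver never had, which is one way duplicate withdraws arise.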

Page 24

After Initial Publication of Results

One popular vendor validated our conjectures and released updated software in 1997
– Software rapidly deployed by ISPs
– Stateful BGP reduced updates by orders of magnitude
– Addition of random intervals to timers diminished frequency components
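
Why random intervals help can be seen in a small simulation. This is a sketch under simplifying assumptions, not a model of any vendor's software: all routers are assumed to start their timers at the same instant, the nominal interval is 30 seconds, and a jittered timer waits a random fraction of that interval. The function name and parameters are illustrative.

```python
import random
from collections import Counter


def peak_updates_per_second(num_routers, interval, duration, jitter, seed=0):
    """Largest number of timer firings that land in the same one-second bin."""
    rng = random.Random(seed)
    bins = Counter()
    for _ in range(num_routers):
        t = 0.0
        while True:
            # jitter=0.0 waits exactly `interval`; jitter=0.25 waits a
            # random time between 75% and 100% of `interval`.
            t += interval * (1.0 - jitter * rng.random())
            if t >= duration:
                break
            bins[int(t)] += 1
    return max(bins.values())


unjittered = peak_updates_per_second(100, 30.0, 600.0, jitter=0.0)
jittered = peak_updates_per_second(100, 30.0, 600.0, jitter=0.25)
print(unjittered)   # every router fires in the same second
print(jittered)     # firings are spread over many seconds
```

With fixed timers the updates arrive in synchronized bursts every 30 seconds, which is the kind of frequency component reported above; with jitter the bursts spread out.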

Page 25

BGP Announcements and Withdraws

[Figure annotations: NANOG presentation; ISP Geeks Release; Mainline Release]

Page 26

Frequency Components

[Figure panels: 1997; 1998]

Page 27

BGP Failures -- Congestion Collapse (BGP Frequency)

Page 28

A Short Story

SIGCOMM’97 findings were puzzling:

Bandwidth utilization correlated with instability

Hypothesis:

Congestion causes underlying TCP to back off

BGP-level timers expire, causing termination

Page 29

Border Gateway Protocol (BGP)

[Figure labels: MCI, Sprint]

– Interdomain protocol between Autonomous Systems
– Routing peers exchange reachability information incrementally
– BGP uses TCP as the transport protocol between peer routers

Page 30

BGP Congestion Collapse Hypothesis

Congestion causes underlying TCP to back off

BGP-level timers expire, causing termination

Interaction between BGP and TCP leads to router congestion collapse

High bandwidth utilization leads to BGP instability

Validated using Windmill tool (SIGCOMM98)
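
The hypothesized interaction can be sketched numerically. The sketch assumes a 30-second keepalive interval and a 90-second hold timer (commonly suggested BGP values, not figures from the talk), an initial TCP retransmission timeout of 1 second that doubles after every loss, and in-order delivery. Function names and the loss patterns are made up for illustration.

```python
def arrivals_with_backoff(send_times, losses, initial_rto=1.0):
    """Arrival time of each keepalive after TCP retransmissions.

    losses[i] is how many times message i is lost before it gets through;
    the retransmission timeout doubles after every loss.
    """
    arrivals = []
    previous = 0.0
    for sent, lost in zip(send_times, losses):
        delay = sum(initial_rto * 2 ** k for k in range(lost))
        # TCP delivers in order, so a message cannot overtake the one before.
        arrival = max(sent + delay, previous)
        arrivals.append(arrival)
        previous = arrival
    return arrivals


def session_survives(arrivals, hold_time=90.0):
    """True if no gap between consecutive arrivals reaches the hold time."""
    last = 0.0
    for t in arrivals:
        if t - last >= hold_time:
            return False
        last = t
    return True


send_times = [30.0, 60.0, 90.0, 120.0]
light = arrivals_with_backoff(send_times, [0, 3, 0, 0])   # brief congestion
heavy = arrivals_with_backoff(send_times, [0, 7, 0, 0])   # sustained congestion

print(light, session_survives(light))
print(heavy, session_survives(heavy))
```

In the heavy case, repeated backoff delays one keepalive past the hold timer, the session is torn down, and the routes learned over it would have to be withdrawn and later re-announced, adding load to an already congested router.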

Page 31

What about Failures?

Some state changes due to policy changes & network failures

Cannot distinguish between policy, intra-domain and inter-domain failures

Methodology:
– Measure long-term rate of failure for Internet backbone routes
– Case study of regional provider

Page 32

Internet Infrastructure Failures (FTCS99)

Internet significantly less reliable and available than the telephone network (PSTN).

After a network becomes unreachable, in most cases it takes longer than 5 minutes before it is reachable again.

Even for transient oscillations, convergence of backbone routing states may be on the order of minutes!

Route failover (re-routing of traffic to a given network) occurs on average once every three days or more.

A small fraction of network paths contribute disproportionately to number of long-term outages

Page 33

Definitions

Route Failure: Prefix destination unavailable for 30 or more minutes

Route Repair: A failed route becomes available

Route Failover: A route replaced with one associated with a different path

Page 34

Route Failures: How long before a network is unreachable?

Page 35

Route Repairs: How long before a network is reachable again?

Page 36

Failover: How long before traffic is re-routed?

Page 37

Conventional Wisdom on Convergence

Internet is highly redundant
– Just reroute around in a few milliseconds

Routing protocol convergence takes only a few ????

“Bad news travels fast”
– Fast withdraw propagation valid goal
– Announcements slower because bundled

BGP has great convergence properties
– Path vector solved the convergence and counting to infinity (looping) problems

All my customers are multi-homed, triple-homed
– Convergence -- what, me worry?

Not True!

Page 38

18-Month Study of Convergence Behavior

Instrument the Internet

– Inject routes into geographically and topologically diverse provider BGP peering sessions (Japan, Michigan, US Exchange Points, Canada, UK)

– Periodically fail and change these routes (i.e. send withdraws or new attributes)

– Time events using ICMP ping and NTP synchronized BGP “routeviews” monitoring machines

– Wait 18 months… (50,000 routing events)

Page 39

Passive & Active Measurement Infrastructure

[Figure: measurement topology. Labels: Fault Injection Server; Stub AS (two); Upstream ISP1; Upstream ISP2; ISP3; ISP4; ISP5; ISP6; Internet; RouteViews Data Collection Probe; links labeled BGP, BGP Fault, and ICMP Echos]

Page 40

Terminology

Tdown: A previously available route is withdrawn. This is a route failure.

Tup: A previously unavailable route is announced as available. This is a route repair.

Tshort: A route is replaced with another route having a shorter path. This is a route failover.

Tlong: A route is replaced by another route with a longer path. This is a route failover.
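
The four event types can be written as a small classifier over the route before and after an event. This is an illustrative sketch: the function name is hypothetical, a route is represented as a tuple of AS numbers with `None` meaning "no route", and a replacement with an equal-length path is reported as `None` because the definitions above do not cover it.

```python
def classify_event(old_path, new_path):
    """Map a route change to Tdown, Tup, Tshort, or Tlong."""
    if old_path is None and new_path is None:
        raise ValueError("no route before or after: nothing happened")
    if new_path is None:
        return "Tdown"    # available route withdrawn (failure)
    if old_path is None:
        return "Tup"      # unavailable route announced (repair)
    if len(new_path) < len(old_path):
        return "Tshort"   # failover to a shorter path
    if len(new_path) > len(old_path):
        return "Tlong"    # failover to a longer path
    return None           # same length: outside the four definitions


print(classify_event((65001, 65010), None))                   # Tdown
print(classify_event(None, (65001, 65010)))                   # Tup
print(classify_event((65001, 65002, 65010), (65001, 65010)))  # Tshort
print(classify_event((65001, 65010), (65001, 65002, 65010)))  # Tlong
```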

Page 41

Avg. number of messages generated by each ISP following a routing update event

[Figure: bar chart. x-axis: Tdown, Tlong, Tup, Tshort; y-axis: 1 to 3.5; series: Japan, Verio, Michnet, CANet]

• Tdown and Tlong generated more messages than Tup and Tshort

• Significant variation among ISPs within each category of message

Page 42

Withdraw Convergence (Tdown)

After a BGP route is withdrawn, barring other failures, how long does it take Internet routing tables to reach steady-state?

Page 43

Convergence delay after a Tdown

[Figure: Withdraw Convergence. x-axis: Seconds Until Convergence (0 to 160); y-axis: Cumulative Percentage of Faults (0 to 100); series: per.japan, per.canet, per.michnet, per.verio]

Page 44

Withdraw Convergence

Different providers exhibit different behavior

70% of withdraws from most ISPs take more than a minute

For the ISP in Canada, 20% of withdraws took more than three minutes to converge

Observed latencies of up to 10 minutes for certain events

No correlation between convergence latency and geography or topology (except for MichNet)
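
Statements such as "70% take more than a minute" are read off a cumulative distribution of per-event convergence delays. The sketch below shows that computation on made-up delays; the sample values and function names are illustrative and are not data from the study.

```python
def percent_exceeding(delays, threshold):
    """Percentage of events whose convergence delay exceeds `threshold`."""
    if not delays:
        raise ValueError("no delays given")
    return 100.0 * sum(d > threshold for d in delays) / len(delays)


def cumulative_percentages(delays, ticks):
    """Cumulative percentage of events converged by each tick (a CDF)."""
    if not delays:
        raise ValueError("no delays given")
    return [100.0 * sum(d <= t for d in delays) / len(delays) for t in ticks]


# Made-up convergence delays in seconds, for illustration only.
sample = [20, 45, 70, 80, 95, 110, 130, 150, 170, 200]

print(percent_exceeding(sample, 60))                  # share slower than a minute
print(percent_exceeding(sample, 180))                 # share slower than three minutes
print(cumulative_percentages(sample, [60, 120, 180])) # points on the CDF curve
```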

Page 45

Failovers and Repairs

What are the relative convergence latencies for failovers and repairs?

Does bad news (withdraws) travel faster?

Page 46

Failures, Failovers and Repairs

[Figure: x-axis: Seconds Until Convergence (0 to 160); y-axis: Cumulative Percentage of Events (0 to 100); series: Tup, Tshort, Tlong, Tdown]

Bad News Does Not Travel Fast!

Page 47

Failures, Failovers and Repairs

Bad news does not travel fast…

Repairs (Tup) exhibit similar convergence properties as long-to-short path failovers (Tshort)

Failures (Tdown) and short-to-long failovers (Tlong) are also similar

– Slower than Tup (e.g. a repair)

– 60% take longer than two minutes

– Failover times degrade the greater the degree of multi-homing!

Page 48

End2End Connectivity

Impact of delayed convergence on E2E connectivity?

After a failover, how long before my site is reachable?

– Modified ICMP pings sent once a second

– Source IP address block of pseudo-AS

– 100 randomly chosen web sites from cache logs

Page 49

Impact of Convergence Delay on End-to-End Path

[Figure: x-axis: One Minute Bins Before and After Fault (-9 to 9); y-axis: Percent Packet Loss (0 to 60); series: Tlong, Tshort; annotation: Fault]

Avg. packet loss to 100 web sites (1 min bins in the ten mins preceding and following a routing update)
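
The binning behind this plot can be sketched as follows: once-a-second probes are grouped into one-minute bins relative to the fault, and the percentage of unanswered probes is computed per bin. The function name, the probe tuple layout, and the sample trace (30 seconds of loss right after the fault) are made up for illustration.

```python
from collections import defaultdict


def loss_by_minute(probes, fault_time):
    """Percent packet loss per one-minute bin relative to `fault_time`.

    probes -- iterable of (send_time_seconds, answered) pairs
    Bin 0 is the minute starting at the fault; bin -1 is the minute before.
    """
    sent = defaultdict(int)
    lost = defaultdict(int)
    for t, answered in probes:
        minute = int((t - fault_time) // 60)   # floor division handles negatives
        sent[minute] += 1
        if not answered:
            lost[minute] += 1
    return {m: 100.0 * lost[m] / sent[m] for m in sorted(sent)}


fault_time = 1000
# One probe per second from two minutes before to two minutes after the fault;
# probes in the first 30 seconds after the fault go unanswered (made-up data).
probes = [(t, not (0 <= t - fault_time < 30)) for t in range(880, 1120)]

print(loss_by_minute(probes, fault_time))
```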

Page 50

What is Happening?

Non-deterministic ordering of BGP update messages leads to

– Transient oscillations

– Each change in FIB adds delay (CPU, BGP bundling timer)

– At extreme, convergence triggers BGP dampening

Page 51

BGP Bad News

Given best current routing practices, inter-domain BGP convergence times degrade exponentially with increase in the degree of interconnectivity for a given route

… and the degree of inter-connectivity (multi-homing, transit, etc) is increasing
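
One way to get a feel for why interconnectivity hurts is to count how many distinct loop-free paths exist toward a destination as the mesh grows. The sketch below only counts candidate paths in a full mesh of n autonomous systems; it does not simulate BGP and is not the analysis behind the slide. The function name is hypothetical.

```python
from math import factorial


def loop_free_paths(n):
    """Number of simple paths from one AS to a destination AS in a full mesh of n ASes."""
    if n < 2:
        raise ValueError("need at least a source and a destination")
    m = n - 2   # ASes that may appear as intermediate hops
    # Choose and order k of the m intermediate ASes, for every k from 0 to m.
    return sum(factorial(m) // factorial(m - k) for k in range(m + 1))


for n in range(2, 9):
    print(n, loop_free_paths(n))
```

The count grows faster than exponentially with n, so a path-vector protocol that may try alternates one after another has many more candidates to discard as the degree of interconnection rises.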

Page 52

Internet vs. Telephone Network

Packet-switched vs. circuit-switched

No explicit reservation on the Internet

Fault-tolerant switches in telephone networks

Significantly shorter development, testing and deployment cycle in the Internet world

Reliability vs. time-to-market

Relative degree of operational experience

Small number of telecommunication companies vs. a conglomeration of thousands of ISPs

Page 53

The Next Challenge Jeopardizing the Explosive Growth of the Web is AVAILABILITY.

Growing reliance on the Internet for commerce, healthcare, education, ...

Challenges Facing Today’s Internet are Bandwidth and Latency

Page 54

Acknowledgements

Michigan Students & Merit Staff: Abha Ahuja, Mukesh Agrawal, Paul Howell, Craig Labovitz, Rob Malan, Matt Smart, David Watson

Sponsors: National Science Foundation, DARPA, Intel, IBM, HP