Experimental Measurement of Delayed Convergence Abha Ahuja Internap/Merit Network, Inc. Craig Labovitz Microsoft Research/Merit Network, Inc. Farnam Jahanian, Abhijit Bose University of Michigan Apricot - 2000


Page 1:

Experimental Measurement of Delayed Convergence

Abha Ahuja Internap/Merit Network, Inc.

Craig Labovitz Microsoft Research/Merit Network, Inc.

Farnam Jahanian, Abhijit Bose University of Michigan

Apricot - 2000

Page 2:

The Internet: Failure Analysis

[Figure: timeline along a Time axis. "Mostly seems to work", then "Something happens. Doesn’t work.", then "Mostly seems to work" again.]

Page 3:

Routing Protocol Convergence

• Unlike the connection-oriented PSTN (~30 ms fail-over), the Internet does not have fail-over.

• Instead, each node recalculates routes on a hop-by-hop basis (i.e. no flooding of changes)

• Distance-vector algorithms (e.g. RIP, BGP) exhibit slower convergence than link-state protocols

• During convergence
– Latency, loss, out-of-order delivery
– Additional update messages (CPU processing)

Page 4:

Distance Vector (BF) Protocols

• Suffer from the counting-to-infinity problem

• Solutions
– Poison reverse
– Split horizon
– Path vectors

[Figure: example topology with nodes A, B, and C]
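The counting-to-infinity problem can be sketched with a small simulation. The slide's figure is not recoverable from the transcript, so this sketch assumes a simple line topology A - B - C with unit link costs, destination C, and a failure of the B - C link. The function name and the alternating update order are illustrative choices; the metric limit of 16 is the value RIP uses for "unreachable".

```python
INFINITY = 16  # RIP treats a metric of 16 as unreachable


def count_to_infinity(split_horizon=False):
    """Return the (cost at A, cost at B) to C after each update exchange.

    Assumed topology: A - B - C, unit costs, and the B - C link has just
    failed. A still holds its stale route to C via B with cost 2.
    """
    cost_a, cost_b = 2, INFINITY  # state immediately after the failure
    history = [(cost_a, cost_b)]
    while True:
        # With split horizon, A does not advertise back to B a route
        # that it learned from B in the first place.
        advertised_by_a = INFINITY if split_horizon else cost_a
        new_b = min(INFINITY, advertised_by_a + 1)
        # B is A's only neighbor, so A always follows B's advertisement.
        new_a = min(INFINITY, new_b + 1)
        if (new_a, new_b) == (cost_a, cost_b):
            break  # no change: the two routers have converged
        cost_a, cost_b = new_a, new_b
        history.append((cost_a, cost_b))
    return history


if __name__ == "__main__":
    for label, flag in (("plain distance vector", False), ("split horizon", True)):
        steps = count_to_infinity(split_horizon=flag)
        print(f"{label}: {len(steps) - 1} exchanges, final costs {steps[-1]}")
```

Without split horizon, A and B keep raising each other's metric until both reach 16. With split horizon, A never offers the stale route back to B, so both mark C unreachable after a single exchange.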

Page 5:

Conventional Wisdom

• “Restoral is not an issue in the IP world”
– Just reroute around in a few milliseconds or whatever

• BGP convergence takes only a few _____

• “Bad news travels fast”
– Fast withdraw propagation valid goal
– Announcements slower because bundled

• BGP has great convergence properties
– ASPath solved the convergence and counting to infinity problems

• All my customers are multi-homed, triple-homed
– Convergence -- what, me worry?

Page 6:

More Conventional Wisdom

• Enough bandwidth will solve anything

“It will all be one big network one day soon anyways”

(Especially after yesterday)

Page 7:

Internet Failures

• Replication, round-robin DNS, etc. help the reliability of inter-domain, content-oriented services

• Inter-domain, transaction-oriented services (e.g. VoIP, eBay, database commits, etc.) still pose a challenge

• Important to model: how long does it take for the Internet to converge
– After Failure
– After Fail-Over
– After Repair

Page 8:

BGP: Bad news

• With unconstrained policies (Griffin99, Varadhan96)
– Divergence
– Possible to create mutually unsatisfiable policies
– NP-complete to identify these policies in IRR
– Happening today?

• With constrained policies (e.g. shortest path first)
– Transient oscillations
– BGP usually converges
– It might just take a very long time…

• This talk is about constrained policies

Page 9:

Some Observations

• How do we study convergence?
– From BGP logs (e.g. debug ip bgp), difficult to determine causal relationships
– Earlier work studied BGP pathologies and failures
– Still lots of BGP duplicates and oscillations

• Failure/repair data (next slide) for default-free routes shows 30 minute curve
– Examined long-lived default-free routes from 24 providers for a year
– Restoral time for given provider after failure (i.e. route withdrawn)

Page 10:

How long until routes return? (From A Study of Internet Failures)

What is happening here?

Page 11:

16 Month Study of Convergence

• Instrument the Internet
– Inject routes into geographically and topologically diverse provider BGP peering sessions (Mae-West, Japan, Michigan, London)
– Periodically fail and change these routes (i.e. send withdraws or new attributes)
– Time events using ICMP echoes and NTP-synchronized BGP “routeviews” monitoring machines (also HTTP GETs)
– Write lots of Perl scripts
– Wait sixteen months… (45,000 routing events)

Page 12:

Setup

Page 13:

How Many Announcements Does it Take For an AS to Withdraw a Route?

7/5 19:33:25 Route R is withdrawn

7/5 19:34:15 AS6543 announce R 6543 66665 8918 1 5696 999

7/5 19:35:00 AS6543 announce R 6543 66665 8918 67455 6461 5696 999

7/5 19:35:37 AS6543 announce R 6543 66665 4332 6461 5696 999

7/5 19:35:39 AS6543 announce R 6543 66665 5378 6660 67455 6461 5696 999

7/5 19:35:39 AS6543 announce R 6543 66665 65 6461 5696 999

7/5 19:35:52 AS6543 announce R 6543 66665 6461 5696 999

7/5 19:36:00 AS6543 announce R 6543 66665 5378 6765 6660 67455 6461 5696 999

7/5 19:38:22 AS6543 withdraw R

Answer: Up to 19 (AS6543 chosen as an example; all ASes exhibit similar behavior)

Abha made me change the AS numbers
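As a rough illustration of how such a log can be reduced to numbers, the sketch below parses the excerpt from this slide and counts the announcements seen between the original withdrawal and AS6543's own withdraw. The log text is copied from the slide; the function names and the line format assumptions (a date, a time, then free text) are illustrative and are not the authors' Perl scripts.

```python
from datetime import datetime

# Log excerpt copied from the slide (AS numbers were changed by the authors).
LOG = """\
7/5 19:33:25 Route R is withdrawn
7/5 19:34:15 AS6543 announce R 6543 66665 8918 1 5696 999
7/5 19:35:00 AS6543 announce R 6543 66665 8918 67455 6461 5696 999
7/5 19:35:37 AS6543 announce R 6543 66665 4332 6461 5696 999
7/5 19:35:39 AS6543 announce R 6543 66665 5378 6660 67455 6461 5696 999
7/5 19:35:39 AS6543 announce R 6543 66665 65 6461 5696 999
7/5 19:35:52 AS6543 announce R 6543 66665 6461 5696 999
7/5 19:36:00 AS6543 announce R 6543 66665 5378 6765 6660 67455 6461 5696 999
7/5 19:38:22 AS6543 withdraw R
"""


def parse_time(line):
    """Parse the leading 'month/day hh:mm:ss' stamp (the year is not logged)."""
    date, clock = line.split()[:2]
    return datetime.strptime(f"{date} {clock}", "%m/%d %H:%M:%S")


def summarize(log_text):
    """Return (announcements, distinct AS paths, seconds until final withdraw)."""
    lines = [line for line in log_text.splitlines() if line.strip()]
    start = parse_time(lines[0])  # first line: the route is withdrawn
    announces = [line for line in lines[1:] if " announce " in line]
    final = next(line for line in lines[1:] if " withdraw " in line)
    paths = {tuple(line.split(" announce R ")[1].split()) for line in announces}
    elapsed = (parse_time(final) - start).total_seconds()
    return len(announces), len(paths), elapsed


if __name__ == "__main__":
    count, distinct, seconds = summarize(LOG)
    print(f"{count} announcements, {distinct} distinct paths, {seconds:.0f} s")
```

For this excerpt, every announcement carries a different AS path, which is the path exploration the slide is pointing at. The excerpt itself shows fewer announcements than the "up to 19" worst case quoted in the answer.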

Page 14:

Withdraw Convergence

After a BGP route is withdrawn, barring other failures, how long does it take Internet routing tables to reach steady-state?

Page 15:

Withdraw Convergence

[Figure: topology with AS1, AS2, AS3, and AS4]

Page 16:

Withdraw Convergence

• Probability distribution

• Providers exhibit different, but related convergence behaviors

• 80% of withdraws from all ISPs take more than a minute

• For ISP4, 20% of withdraws took more than three minutes to converge
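Statements such as "80% take more than a minute" are tail fractions of the measured latency distribution. A minimal sketch of that calculation follows; the function name is hypothetical and the sample latencies are made-up placeholder values, not data from the study.

```python
def fraction_exceeding(latencies_s, threshold_s):
    """Fraction of convergence latencies (seconds) above a threshold."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    over = sum(1 for value in latencies_s if value > threshold_s)
    return over / len(latencies_s)


if __name__ == "__main__":
    # Placeholder values for illustration only; NOT measurements from the talk.
    sample = [35, 70, 95, 110, 130, 150, 170, 185, 200, 240]
    print("over one minute:   ", fraction_exceeding(sample, 60))
    print("over three minutes:", fraction_exceeding(sample, 180))
```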

Page 17:

Fail-Overs and Repairs

What are the relative convergence latencies for fail-overs and repairs?

Does bad news (withdraws) travel faster?

Page 18:

Failures, Fail-overs and Repairs

Page 19:

Failures, Fail-overs and Repairs

• Bad news does not travel fast…

• Repairs (Tup) exhibit similar convergence properties to long-short ASPath fail-over

• Failures (Tdown) and short-long fail-overs (e.g. primary to secondary path) are also similar
– Slower than Tup (e.g. a repair)
– 60% take longer than two minutes
– Fail-over times degrade the greater the degree of multi-homing!

Page 20:

End2End Connectivity

After a repair, how long before my site is reachable?

– Modified ICMP pings and HTTP requests sent once a second
– Source IP address block of pseudo-AS
– 100 randomly chosen web sites from parent cache logs
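With one probe per second, the time until a site is reachable again can be read off as the offset of the first answered probe after the repair. The sketch below assumes probe results are available as (send time in seconds, reply received) pairs; that record format and the function name are assumptions for illustration, not the authors' tooling.

```python
def seconds_until_reachable(repair_time, probes):
    """Seconds from the repair to the first answered probe, or None.

    probes: iterable of (send_time_seconds, got_reply) pairs, for example
    one ICMP echo per second toward a single destination.
    """
    for sent_at, got_reply in sorted(probes):
        if sent_at >= repair_time and got_reply:
            return sent_at - repair_time
    return None


if __name__ == "__main__":
    # Synthetic example: repair announced at t=100, replies resume at t=147.
    probes = [(t, t >= 147) for t in range(90, 200)]
    print(seconds_until_reachable(100, probes))
```

Probes sent before the repair are ignored, so a stray reply from before the event does not shorten the measured delay.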

Page 21:

ICMP Response after Repairs

[Figure: Seconds Until Reachable (y-axis, 0 to 140) versus Time (2 hours + Jitter) (x-axis, 0 to 70), 22 series]

Page 22:

What is Happening?

• Non-deterministic ordering of BGP update messages leads to
– Transient oscillations
– Each change in FIB adds delay (CPU, BGP bundling timer)
– At extreme, convergence triggers BGP dampening
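The transient behavior can be illustrated with a toy path-vector simulation. This is a sketch under strong simplifying assumptions that are not from the talk: a full mesh of n ASes, each with its own direct route to prefix R, a simultaneous withdrawal of all direct routes, synchronous update rounds, shortest-AS-path selection, and ties broken by lowest AS number. It is deterministic, so it shows only the exploration of progressively longer paths, not the message-ordering effects or timers named above.

```python
def simulate_withdraw(n):
    """Simulate a withdrawal in a full mesh of n ASes (hypothetical model).

    Every AS starts with a direct route to prefix R, which fails everywhere
    at once. Returns (rounds, announcements) until no AS has a route left.
    """
    nodes = range(n)
    # rib_in[i][j] is the last AS path that neighbor j advertised to i.
    # Initially every neighbor advertises its direct route, path (j,).
    rib_in = {i: {j: (j,) for j in nodes if j != i} for i in nodes}
    best = {i: (i,) for i in nodes}  # direct routes, about to be withdrawn
    rounds = 0
    announcements = 0
    while any(path is not None for path in best.values()):
        new_best = {}
        for i in nodes:
            # Path-vector loop detection: ignore paths that already contain i.
            usable = [p for p in rib_in[i].values()
                      if p is not None and i not in p]
            if usable:
                # Shortest AS path wins; ties go to the lowest AS numbers.
                new_best[i] = (i,) + min(usable, key=lambda p: (len(p), p))
            else:
                new_best[i] = None  # nothing left: this AS sends a withdraw
        for i in nodes:
            if new_best[i] is not None and new_best[i] != best[i]:
                announcements += 1  # i announces a new, longer path
            for j in nodes:
                if j != i:
                    rib_in[j][i] = new_best[i]
        best = new_best
        rounds += 1
    return rounds, announcements


if __name__ == "__main__":
    for size in (3, 4, 5, 6):
        print(size, simulate_withdraw(size))
```

Even in this idealized model, ASes announce stale alternative paths learned from their peers before finally withdrawing the route, and the number of such announcements grows with the size of the mesh.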

Page 23:

BGP Bad News

Given best current routing practices, inter-domain BGP convergence times degrade exponentially with an increase in the degree of interconnectivity for a given route

… and the degree of interconnectivity (multi-homing, transit, etc.) is increasing