A Bug-Tolerant Router Jennifer Rexford Princeton University http://verb.cs.princeton.edu Joint work with Eric Keller (Princeton), Minlan Yu (Princeton), and Matt Caesar (UIUC)

btr-upenn.ppt

Feb 11, 2015

Transcript
Page 1: btr-upenn.ppt

A Bug-Tolerant Router

Jennifer Rexford, Princeton University

http://verb.cs.princeton.edu

Joint work with Eric Keller (Princeton), Minlan Yu (Princeton), and Matt Caesar (UIUC)

Page 2: btr-upenn.ppt

Routers run complex software, so…


Page 3: btr-upenn.ppt

Router Bugs in the News


Page 4: btr-upenn.ppt

Example of Router Bugs

• One misconfiguration tickled 2 bugs (2 vendors)
  – Real bugs on Feb 16, 2009
  – Huge increase in the global rate of updates
  – 10x increase in global instability for an hour

[Figure: how the bad announcement spread, from AS47868 through AS29113. Labels: Misconfiguration: "as-path prepend 47868". MikroTik bug: no range check, so the AS was prepended 252 times. AS29113 did not filter the route. Cisco bug: long AS paths. After further AS path prepending the length exceeded 255, triggering a Notification.]

[Figure: Global Instability by Country]
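The "no range check" label can be illustrated with a small sketch. It assumes, and the slide does not say this, that the prepend count was kept in a single byte: 47868 mod 256 is 252, which is consistent with the "prepended 252 times" label. The function names and the limit of 255 are hypothetical and are not taken from any vendor's code.

```python
MAX_PREPEND = 255  # assumed limit for illustration: one byte holds 0..255


def unchecked_prepend_count(configured: int) -> int:
    """Mimic a missing range check: silently keep only the low 8 bits."""
    return configured & 0xFF


def checked_prepend_count(configured: int) -> int:
    """Reject out-of-range values instead of truncating them."""
    if not 0 <= configured <= MAX_PREPEND:
        raise ValueError(f"prepend count {configured} outside 0..{MAX_PREPEND}")
    return configured


if __name__ == "__main__":
    # An AS number typed where a count was expected.
    print(unchecked_prepend_count(47868))
    try:
        checked_prepend_count(47868)
    except ValueError as err:
        print("rejected:", err)
```

With the check in place, the mistyped value is refused at configuration time rather than turned into a very long AS path.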

Page 5: btr-upenn.ppt

Router Bugs

• Router bugs are a serious problem
  – Routers are getting more complicated
    • Quagga 220K lines, XORP 826K lines
  – Vendors are allowing third-party software
  – Other outages are becoming less common
• Router bugs are hard to detect and fix
  – Byzantine failures don’t simply crash the router
  – Violate protocol, can cause cascading outages
  – Often discovered after serious outage

How to detect bugs and stop their effects before they spread?

Page 6: btr-upenn.ppt

Avoiding Bugs via Diversity

• Run multiple, diverse routing instances
  – Use voting to select majority result
  – Software and Data Diversity (SDD)
    • E.g., XORP and Quagga, different update timing
• SDD is an old idea, applied in other fields
  – But routing raises new challenges and opportunities

[Figure label: Vote]
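A minimal sketch of "use voting to select majority result", assuming each routing instance's output can be compared as an opaque value. The function name `majority_vote` and the strict-majority rule are illustrative choices, not details given on the slide.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the output produced by a strict majority of instances, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


if __name__ == "__main__":
    # Two instances agree; the third (perhaps buggy) one is outvoted.
    print(majority_vote([b"route-A", b"route-A", b"route-B"]))
    # No strict majority, e.g. while the instances are still converging.
    print(majority_vote([b"route-A", b"route-B", b"route-C"]))
```

Returning `None` when there is no majority leaves the caller free to wait, which matters because instances can disagree briefly during convergence.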

Page 7: btr-upenn.ppt

SDD Challenges in Routers

• Making replication transparent
  – Interoperate with existing routers
  – Duplicate network state to routing instances
  – Present a common configuration interface
• Handling transient, real-time nature of routers
  – React quickly to network events
    • E.g., buggy behaviors, link failures
  – But not over-react to transient inconsistency

[Figure: timeline showing Routing Instance I and Routing Instance II emitting outputs A, B, and C over time]

Page 8: btr-upenn.ppt

SDD Opportunities in Routers

• Easy to vote on standardized output
  – Control plane: IETF-standardized routing protocols
  – Data plane: forwarding-table entries
• Easy to recover from errors via bootstrap
  – Routing has limited dependency on history
  – Don’t need much information to bootstrap instance
• Diversity is effective in avoiding router bugs
  – Based on our studies on router bugs and code

Page 9: btr-upenn.ppt

Outline

• Exploiting software and data diversity (SDD)
  – Effective in avoiding bugs
  – Enough hardware resources to support diversity
• Bug-tolerant router (BTR) architecture
  – Make replication transparent with low overhead
  – React quickly and handle transient inconsistency
• Prototype and evaluation
  – Small, trusted code base
  – Low processing overhead

Page 11: btr-upenn.ppt

Why Does Diversity Work?

• Enough diversity in routers
  – Software: Quagga, XORP, BIRD
  – Protocols: OSPF and IS-IS
  – Environment: timing, ordering, memory
• Enough resources for diversity
  – Extra processor blades for hardware reliability
  – Multi-core processors, separate route servers
• Effective in avoiding bugs

Page 12: btr-upenn.ppt

Evaluating Benefits of Diversity

• Most bugs can be avoided by diversity
  – Reproduce and avoid real bugs
  – … in Bugzilla database for XORP and Quagga
• Diversity of execution environment

Diversity mechanism                  Bugs in database avoided
Timing/order of messages             39%
Configuration                        25%
Timing/order of connections          12%
Combining all execution diversity    88%

Page 13: btr-upenn.ppt

Effect of Software Diversity

• Sanity check on implementation diversity
  – Picked 10 bugs from XORP, 10 bugs from Quagga
  – None were present in the other implementation
• Static code analysis on version diversity
  – Overlap decreases quickly between versions
    • 75% of bugs in Quagga 0.99.1 are fixed in Quagga 0.99.9
    • 30% of bugs in Quagga 0.99.9 are newly introduced
• Vendors can also achieve software diversity
  – Different code versions, different code trains
  – Code from acquired companies, open-source

Page 15: btr-upenn.ppt

Bug-tolerant Router Architecture

[Figure: architecture diagram. Components shown: three routing instances (each a protocol daemon with its routing table), a hypervisor with the UPDATE VOTER, FIB VOTER, and REPLICA MANAGER, the forwarding table (FIB), Interface 1, and Interface 2.]

Page 16: btr-upenn.ppt

Replicating Incoming Routing Messages

[Figure: the architecture diagram from the previous slide, with an incoming Update for 12.0.0.0/8 delivered to the routing instances.]

No need for protocol parsing – operates at socket level
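A rough sketch of replication "at socket level": the same bytes are forwarded to every replica and never parsed. The helper `replicate` and the use of `socket.socketpair()` as the channel to each replica are assumptions made for this illustration; the slide does not describe the actual mechanism.

```python
import socket
from typing import Iterable, List


def replicate(message: bytes, replica_socks: Iterable[socket.socket]) -> None:
    """Forward the same opaque bytes to every replica without parsing them."""
    for sock in replica_socks:
        sock.sendall(message)


def demo() -> List[bytes]:
    # socketpair() stands in for the channel between hypervisor and each replica.
    pairs = [socket.socketpair() for _ in range(3)]
    try:
        update = b"opaque routing message bytes"  # placeholder payload
        replicate(update, [hyp_side for hyp_side, _ in pairs])
        return [replica_side.recv(4096) for _, replica_side in pairs]
    finally:
        for hyp_side, replica_side in pairs:
            hyp_side.close()
            replica_side.close()


if __name__ == "__main__":
    print(demo())
```

Because the payload is treated as raw bytes, the replication code needs no knowledge of BGP or any other routing protocol.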

Page 17: btr-upenn.ppt

Voting: Updates to Forwarding Table

[Figure: the same architecture diagram. An Update for 12.0.0.0/8 reaches the routing instances, and the entry "12.0.0.0/8 IF 2" is shown for the forwarding table.]

Transparent by intercepting calls to “Netlink”

Page 18: btr-upenn.ppt

Voting: Control-Plane Messages

[Figure: the same architecture diagram, with labels "12.0.0.0/8 IF 2" and "12.0.0.0/8 Update".]

Transparent by intercepting socket system calls

Page 19: btr-upenn.ppt

Simple Voting Mechanisms

• Tolerate transient periods of disagreement
  – Different replicas can have different outputs
  – … during routing-protocol convergence
• Several different voting mechanisms
  – Master-slave: speeding reaction time
  – Continuous majority: handling transient differences

[Figure: timeline of outputs A, B, and C from Routing Instances I, II, and III, with one instance marked as master]

Page 20: btr-upenn.ppt

Simple Voting Mechanisms (continued)

[Figure: the same timeline of outputs A, B, and C from Routing Instances I, II, and III, annotated with the continuous-majority result over time]
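The two voting mechanisms named on these slides can be sketched as follows. This is one plausible reading of the bullets, not the authors' implementation: the master-slave voter forwards the master's output at once and uses the other instances only as a check, while the continuous-majority voter re-votes on every update and emits only when the majority value changes. The class and method names are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Optional


class ContinuousMajorityVoter:
    """Re-vote on every replica update; emit only when the majority changes."""

    def __init__(self, num_instances: int) -> None:
        self.num_instances = num_instances
        self.latest: Dict[str, Hashable] = {}  # instance name -> latest output
        self.emitted: Optional[Hashable] = None

    def on_update(self, instance: str, output: Hashable) -> Optional[Hashable]:
        self.latest[instance] = output
        value, count = Counter(self.latest.values()).most_common(1)[0]
        if count > self.num_instances / 2 and value != self.emitted:
            self.emitted = value
            return value  # new majority: forward it
        return None  # no majority yet, or majority unchanged


class MasterSlaveVoter:
    """Forward the master's output at once; use the others to check it."""

    def __init__(self, master: str, num_instances: int) -> None:
        self.master = master
        self.num_instances = num_instances
        self.latest: Dict[str, Hashable] = {}

    def on_update(self, instance: str, output: Hashable) -> Optional[Hashable]:
        self.latest[instance] = output
        return output if instance == self.master else None

    def master_disagrees(self) -> bool:
        """True if a strict majority holds a value the master does not."""
        if not self.latest:
            return False
        value, count = Counter(self.latest.values()).most_common(1)[0]
        return count > self.num_instances / 2 and self.latest.get(self.master) != value


if __name__ == "__main__":
    cm = ContinuousMajorityVoter(num_instances=3)
    for name, out in [("I", "A"), ("II", "B"), ("III", "A")]:
        print(name, out, "->", cm.on_update(name, out))
```

The trade-off mirrors the slide: master-slave reacts as soon as the master speaks, whereas continuous majority waits for agreement and so rides out short-lived differences.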

Page 21: btr-upenn.ppt

Simple Voting and Recovery

• Recovery
  – Hiding replica failure from neighboring routers
  – Hypervisor kills faulty instance, invokes new one
• Small, trusted software component
  – No parsing, treats data as opaque strings
  – Just 514 lines of code in voter implementation
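The recovery step can be sketched under simple assumptions: outputs are compared as opaque byte strings, an instance counts as faulty when it disagrees with a strict majority, and restarting is delegated to a caller-supplied hook. The names `find_faulty`, `recover`, and `restart` are hypothetical. A real system would presumably wait out transient disagreement before acting; this sketch ignores timing.

```python
from collections import Counter
from typing import Callable, Dict, List


def find_faulty(outputs: Dict[str, bytes]) -> List[str]:
    """Names of instances whose output differs from a strict-majority value."""
    if not outputs:
        return []
    value, count = Counter(outputs.values()).most_common(1)[0]
    if count <= len(outputs) / 2:
        return []  # no majority, so no instance can be blamed yet
    return sorted(name for name, out in outputs.items() if out != value)


def recover(outputs: Dict[str, bytes], restart: Callable[[str], None]) -> List[str]:
    """Restart every faulty instance through the caller-supplied hook."""
    faulty = find_faulty(outputs)
    for name in faulty:
        restart(name)  # hypothetical hook: kill the instance, start a fresh one
    return faulty


if __name__ == "__main__":
    outputs = {"instance-1": b"x", "instance-2": b"x", "instance-3": b"y"}
    print(recover(outputs, lambda name: print("restarting", name)))
```

Comparing raw bytes keeps the trusted component small, in the spirit of the "no parsing" bullet.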

Page 23: btr-upenn.ppt

Prototype

• Prototype implementation
  – No modification of routing software
  – Simple, trusted hypervisor
  – Built on Linux with XORP and Quagga
• Evaluation environment
  – Evaluated on a 3 GHz Intel Xeon
  – BGP trace from Route Views, March 2007
• Evaluation metrics
  – Voting delay and fault rate of different voting algorithms
  – Delay of hypervisor

Page 24: btr-upenn.ppt

Effectiveness of Voting

• 3 XORP and 3 Quagga routing instances
• Inject bugs of realistic frequency and duration
  – 1.2 million sec interarrival, 600 sec duration

Voting algorithm       Avg voting delay (sec)   Fault rate
Single router          -                        0.066%
Master-slave           0.02                     0.0006%
Continuous-majority    0.035                    0.00001%
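The numbers on this slide can be put side by side with a little arithmetic. The snippet below only re-uses the slide's values; the "duty cycle" (bug duration divided by interarrival time) is a back-of-the-envelope figure that the slide does not state, included here to show that it has the same order of magnitude as the single-router fault rate.

```python
# Values copied from the slide.
INTERARRIVAL_SEC = 1_200_000  # time between injected bugs
DURATION_SEC = 600            # how long each injected bug lasts
FAULT_RATE_PCT = {
    "single router": 0.066,
    "master-slave": 0.0006,
    "continuous-majority": 0.00001,
}


def bug_duty_cycle_pct() -> float:
    """Percentage of time one instance spends in an injected-bug state."""
    return 100 * DURATION_SEC / INTERARRIVAL_SEC


def improvement_over_single(algorithm: str) -> float:
    """How many times lower the fault rate is than for a single router."""
    return FAULT_RATE_PCT["single router"] / FAULT_RATE_PCT[algorithm]


if __name__ == "__main__":
    print(f"bug duty cycle: {bug_duty_cycle_pct():.3f}%")
    for name in ("master-slave", "continuous-majority"):
        print(f"{name}: about {improvement_over_single(name):.0f}x lower fault rate")
```

On these figures, master-slave lowers the fault rate by roughly two orders of magnitude and continuous majority by more than three, at the cost of a slightly longer average voting delay (0.035 sec versus 0.02 sec).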

Page 25: btr-upenn.ppt

Small Overhead

• Small increase in FIB pass-through time
  – Time between receiving an update and the resulting FIB change
  – Delay overhead of just hypervisor is 0.1% (0.06 sec)
  – Delay overhead of 5 routing instances is 4.6%
• Little effect on network-wide convergence
  – ISP networks from Rocketfuel, and cliques
  – Found no significant change in convergence (beyond the pass-through time)

Page 26: btr-upenn.ppt

Conclusion

• Seriousness of routing software bugs
  – Cause outages, misbehaviors, vulnerabilities
  – Violate protocol semantics, so not handled by traditional failure detection and recovery
• Software and data diversity (SDD)
  – Effective, has reasonable overhead
• Design and prototype of bug-tolerant router
  – Works with Quagga and XORP software
  – Low overhead, and small trusted code base

Page 27: btr-upenn.ppt

• More information at http://verb.cs.princeton.edu

• Thanks!

• Questions?
