★ Detecting BGP Configuration Faults with Static Analysis
★ IP Fault Localization Via Risk Modeling
★ Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network

Nick Feamster et al. / Ramana Rao Kompella et al. / Jian Wu et al.

Presented by Mikyung Han
Detecting BGP Configuration Faults with Static Analysis

2nd Symposium on Networked Systems Design and Implementation (NSDI), Boston, MA, May 2005

Nick Feamster, Hari Balakrishnan
★ Best Paper Award
3/53
The Internet is increasingly becoming part of the mission-critical infrastructure (a public utility!).
Big problem: Very poor understanding of how to manage it.
Is correctness really that important?
4/53
Why does routing go wrong?
Complex policies: competing / cooperating networks, each with only limited visibility
Large scale: tens of thousands of networks
  …each with hundreds of routers
  …each routing to hundreds of thousands of IP prefixes
5/53
What can go wrong?
Two-thirds of the problems are caused by errors in the configuration of the routing protocol
Some things are out of the hands of networking research
But…
6/53
Categories of BGP Configurations
Ranking: route selection
Dissemination: internal route advertisement
Filtering: route advertisement (which routes may be advertised to, or accepted from, neighbors)
  e.g., customer vs. competitor neighbors, primary vs. backup links

More flexibility brings more COMPLEXITY!
7/53
These problems are real

“…a glitch at a small ISP… triggered a major outage in Internet access across the country. The problem started when MAI Network Services...passed bad router information from one of its customers onto Sprint.” -- news.com, April 25, 1997

“Microsoft's websites were offline for up to 23 hours...because of a [router] misconfiguration…it took nearly a day to determine what was wrong and undo the changes.” -- wired.com, January 25, 2001

“WorldCom Inc…suffered a widespread outage on its Internet backbone that affected roughly 20 percent of its U.S. customer base. The network problems…affected millions of computer users worldwide. A spokeswoman attributed the outage to 'a route table issue.'” -- cnn.com, October 3, 2002

“A number of Covad customers went out from 5pm today due to, supposedly, a DDOS (distributed denial of service attack) on a key Level3 data center, which later was described as a route leak (misconfiguration).” -- dslreports.com, February 23, 2004
8/53
Routing Faults Discussed on NANOG mailing List
[Bar chart: number of NANOG mailing-list threads per stated period (1994-1997, 1998-2001, 2001-2004) for each fault category: Filtering, Route Leaks, Route Hijacks, Route Instability, Routing Loops, Blackholes]
9/53
Why is routing hard to get right?
Defining correctness is hard
Interactions cause unintended consequences
  Each network independently configured
  Unintended policy interactions
Operators make mistakes
  Configuration is difficult
  Complex policies, distributed configuration
10/53
Today: Tweak-N-Pray
Problems cause downtime
Problems often not immediately apparent

What happens if I tweak this policy…?

[Flowchart: Configure → Observe → Desired effect? If no, revert; if yes, wait for the next problem]
11/53
Goal: Proactive Approach

Idea: Analyze configuration before deployment

[Workflow: Configure → Detect Faults (rcc) → Deploy]

Many faults can be detected with static analysis.
12/53
Router Configuration Checker (rcc)
A tool that finds faults in BGP configuration with static analysis
  Does not require additional work from operators
Detects Path Visibility faults and Route Validity faults
Only detects faults within a single AS
Only detects faults that cause persistent failures
13/53
What is so cool about rcc?
Finds faults proactively, before deployment
No additional work required from network operators!
  Just convenient for now: BGP might need a high-level specification of policies in the future
  To do so:
    A high-level specification language is needed
    Network operators need to learn and deploy it
    Even so, they may well write it incorrectly!
14/53
rcc Overview

[Pipeline: distributed router configurations (single AS) → normalized representation → constraints → faults, guided by a correctness specification]

Challenges:
  Analyzing complex, distributed configuration
  Defining a correctness specification
  Mapping the specification to constraints
15/53
rcc Implementation
[Pipeline: distributed router configurations (Cisco, Avici, Juniper, Procket, etc.), analyzed offline
  → Preprocessor: produces a more parsable version
  → Parser: builds the normalized representation in a relational database (MySQL)
  → Verifier: runs simple queries (select, join, etc.) against the constraints and outputs faults]
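To make the "simple queries" concrete, here is a minimal sketch of how one verifier constraint could be phrased as a relational query. The schema, table, and fault below are illustrative assumptions, not rcc's actual database or code.

```python
import sqlite3

# Hypothetical, simplified schema: one row per configured iBGP session.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sessions (router TEXT, neighbor TEXT, is_route_reflector_client INTEGER);
INSERT INTO sessions VALUES ('A', 'B', 0), ('B', 'A', 0), ('A', 'C', 1);
""")

# Example constraint: every iBGP session should be configured on both endpoints.
# A session (r, n) with no matching (n, r) row is reported as incomplete.
incomplete = conn.execute("""
SELECT s1.router, s1.neighbor
FROM sessions s1
LEFT JOIN sessions s2
  ON s1.router = s2.neighbor AND s1.neighbor = s2.router
WHERE s2.router IS NULL
""").fetchall()

for router, neighbor in incomplete:
    print(f"Incomplete iBGP session: {router} -> {neighbor} has no reverse session")
```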
16/53
Which faults does rcc detect?
[Diagram: the faults rcc finds are latent faults; a subset of these are potentially active faults, which may in turn lead to end-to-end failures]
17/53
Correctness Specification

Safety: the protocol converges to a stable path assignment for every possible initial state and message ordering (the protocol does not oscillate)

Path Visibility: every destination with a usable path has a route advertisement
  ("If there exists a path, then there exists a route")
  Example violation: network partition

Route Validity: every route advertisement corresponds to a usable path
  ("If there exists a route, then there exists a path")
  Example violation: routing loop
18/53
Path Visibility in iBGP
[Diagram: iBGP sessions among routers; "RR" = route reflector, "c" = client]

Default: "full mesh" iBGP. Doesn't scale.
Large ASes use route reflection:
  Route reflector: reflects non-client routes over client sessions, and client routes over all sessions
  Client: doesn't re-advertise iBGP routes
19/53
iBGP Fault Example

Network partition:
  W learns route r1 to destination d via eBGP
  X does not re-advertise it over its other iBGP sessions
  Then Y and Z never learn r1 to d

Suboptimal routing:
  Even if Y and Z learn a route to d via eBGP, it may be worse than the r1 learned by W
20/53
iBGP Signaling: Static Check

Theorem. Suppose the iBGP reflector-client relationship graph contains no cycles. Then path visibility is satisfied if, and only if, the set of routers that are not route-reflector clients forms a full mesh.

rcc checks whether the iBGP signaling graph G is connected and acyclic, and whether the routers at the top layer of G form a full mesh.
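A minimal sketch of what this static check might look like in code, under assumed data structures (router names, a set of iBGP sessions, reflector-client edges); this is an illustration, not rcc's implementation.

```python
from itertools import combinations

def check_ibgp_signaling(routers, ibgp_sessions, rr_client_edges):
    """
    routers:          set of router names
    ibgp_sessions:    set of frozenset({r1, r2}) iBGP sessions
    rr_client_edges:  set of (reflector, client) pairs
    Returns a list of detected faults (empty if the check passes).
    """
    faults = []

    # 1. The reflector -> client relationship graph must contain no cycles.
    graph = {r: set() for r in routers}
    for rr, client in rr_client_edges:
        graph[rr].add(client)

    visiting, done = set(), set()
    def has_cycle(node):
        if node in visiting:
            return True
        if node in done:
            return False
        visiting.add(node)
        cyclic = any(has_cycle(nxt) for nxt in graph[node])
        visiting.discard(node)
        done.add(node)
        return cyclic

    if any(has_cycle(r) for r in routers):
        faults.append("reflector-client graph contains a cycle")

    # 2. Routers that are not clients of any reflector (the "top layer")
    #    must form a full iBGP mesh.
    clients = {c for _, c in rr_client_edges}
    top_layer = routers - clients
    for r1, r2 in combinations(sorted(top_layer), 2):
        if frozenset({r1, r2}) not in ibgp_sessions:
            faults.append(f"missing iBGP session between top-layer routers {r1} and {r2}")

    return faults
```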
21/53
Route Validity: Policy Related Problems
rcc operates without a specification of the intended policy (for operators' convenience)
rcc instead forms "beliefs":
  Assume intended policies conform to best common practice
  Analyze the configuration for common patterns and look for deviations from those patterns
Still useful, but produces some false positives
22/53
Route Validity: Best Common Practice
A route learned from one peer should not be re-advertised to another peer
  Ex: ensure no routes learned from WorldCom propagate to Sprint
An AS should advertise routes with equally good attributes to each peer at every peering point
  Violations occur when routers in the AS apply different policies to the same peer, or when there is an iBGP signaling partition
23/53
Route Validity: Configuration Anomalies
When the session configurations toward a neighboring AS are the same at all routers except one or two, rcc reports a fault
Of course, this produces some false positives…
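A rough sketch of this "deviation from the common pattern" belief; the record format, names, and threshold below are my assumptions for illustration, not rcc's code.

```python
from collections import Counter

def flag_config_outliers(sessions_to_neighbor, max_outliers=2):
    """
    sessions_to_neighbor: dict mapping router name -> canonicalized session config
                          (e.g., a tuple of export-policy statements) for one neighbor AS.
    Returns the routers whose config deviates from the dominant one.
    """
    counts = Counter(sessions_to_neighbor.values())
    dominant_config, _ = counts.most_common(1)[0]
    outliers = [r for r, cfg in sessions_to_neighbor.items() if cfg != dominant_config]
    # Only report when the deviation is small; larger splits may be intentional policy.
    if 0 < len(outliers) <= max_outliers:
        return outliers
    return []

# Example: three routers export the same policy to a neighbor AS, one does not.
print(flag_config_outliers({
    "nyc": ("export", "peer-out"), "chi": ("export", "peer-out"),
    "sea": ("export", "peer-out"), "lax": ("export", "old-peer-out"),
}))  # -> ['lax']  (possibly a false positive)
```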
24/53
Analyzing Real-World Configuration
rcc was downloaded by 70 network operators; some of them shared their configurations
  Many are reluctant to share: configurations are proprietary, and operators don't like researchers finding faults in their networks
Detected more than 1,000 previously undiscovered faults across 17 ASes
25/53
Summary: Faults across 17 ASes
[Bar chart: number of ASes (out of 17) exhibiting each fault type. Path visibility faults: iBGP signaling partition, duplicate loopback, incomplete iBGP session. Route validity faults: inconsistent export, inconsistent import, transit between peers, undefined filter, incomplete filter]
Every AS had faults, regardless of network size
Most faults can be attributed to distributed configuration
26/53
rcc: Take-home lessons

A better intra-AS route dissemination protocol is needed: current route reflection causes many faults!
BGP should be configured with a centralized, higher-level specification language
  The current distributed, low-level approach introduces complexity, obscurity, and opportunities for misconfiguration
  But: this trades off against flexibility and expressiveness
27/53
Discussion

Strengths:
  Proves that static configuration analysis uncovers many errors
  Identifies major causes of error: distributed configuration, overly complex intra-AS dissemination, mechanistic expression of policy
Weaknesses:
  rcc is neither sound nor complete
  More room for improvement in the "beliefs"
IP Fault Localization via Risk Modeling

2nd ACM/USENIX Symposium on Networked Systems Design and Implementation (NSDI), Boston, MA, May 2005

Ramana Rao Kompella, Jennifer Yates, Albert Greenberg, Alex C. Snoeren
29/53
IP Network Fault-Tolerance
[Diagram: Alice and Eve communicate across the Internet through routers; when an IP link fails, traffic shifts to an alternate path]

IP networks are designed to be fault-tolerant!
Any failure that causes an IP link to fail is termed an "IP fault"
30/53
Fault Repair

Fast repair is necessary because:
  The probability of a simultaneous failure increases with down-time
  It is expensive to provision too many alternate paths

Fault localization is the bottleneck for fault repair!
31/53
What makes fault localization hard?
A typical Tier-1 ISP network has:
  About a thousand routers
  A few thousand IP links
  Tens of thousands of optical components
  About 50-100 thousand miles of optical fiber
  Complicated topologies (mesh, ring, etc.)
Current alarms do not indicate the root cause
It is often problematic to monitor actual component failures
Failure alerts can get lost

Operators need an automated tool for fast fault localization
32/53
Key Ideas: Shared Risk!
Risk modeling to localize faults across the IP and optical layers
SRLG (Shared Risk Link Group): a physical object that represents a shared risk for a group of logical entities at the IP layer
SCORE (Spatial Correlation Engine): cross-correlates dynamic fault information from two disparate network layers
33/53
Logical/Physical IP Network

[Map: Qwest IP backbone with POPs at Los Angeles, San Jose, Washington, Atlanta, Houston]
34/53
Logical/Physical IP Network

[Diagram: the same IP topology (Los Angeles, San Jose, Washington, Atlanta, Houston) overlaid on the optical layer; DWDM elements and O-E-O conversion points sit beneath multiple IP links. When several IP links fail together, did a shared DWDM element fail?]

Links that share a risk form a Shared Risk Link Group (SRLG)
35/53
Various types of SRLGs

Physical shared risks:
  SONET network elements (e.g., DWDM, ADM, optical amplifiers)
  Fiber
  Fiber span
  Router
  Module
  Port
Logical shared risks:
  Autonomous System
  OSPF areas
36/53
SRLG Prevalence
[CDF of SRLG cardinality (number of links per group, log scale) for each SRLG type: fiber spans, fiber, SONET network elements, ports, router modules, routers, areas, and the aggregated database]

At least 47% of all SRLGs have at least two links
More than 85% of OSPF areas have at least 10 links

Source: a section of the AT&T backbone network
37/53
Problem Formulation

A set of links C = {c1, c2, …, cn}
A set of risk groups G = {G1, G2, …, Gm}, where Gi = {ci1, ci2, …, cik} such that the cij are likely to fail simultaneously
An observation O = {ce1, ce2, …, cem} of failed links
Find a hypothesis H = {Gh1, Gh2, …, Ghk} that explains O:
  Every member of O belongs to at least one group in H, and every member of each chosen group Ghi belongs to O
  Many such H exist!
Occam's Razor: do not assume more than what is necessary; the simplest explanation is the best
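Written as an optimization problem (my notation, not the slide's), the Occam's-razor rule amounts to a minimum set cover over the risk groups:

```latex
\min_{H \subseteq \{G_1,\dots,G_m\}} |H|
\quad \text{s.t.} \quad
O \subseteq \bigcup_{G \in H} G
\quad \text{and} \quad
G \subseteq O \ \text{ for all } G \in H .
```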
38/53
SRLG Database (example):
  R0 = {L0, L1}
  R1 = {L0, L2, L3, L4}
  R2 = {L4, L5}
  R3 = {L3, L5, L6}
  R4 = {L1, L2, L6}
  D1 = {L0, L1, L2}
  D2 = {L3, L5, L6}
  D3 = {L3, L4, L5}
  F0 = {L0, L1}
  F1 = {L0, L2}
  …

[Diagram: routers R0-R4 connected by IP links L0-L6, shown both at the IP layer and over the optical layer with DWDM elements D1-D3 and fibers F0-F7, each defining a risk group]
39/53
Bipartite Graph Formulation

[Bipartite graph: risk groups (R0-R4, DWDM1-DWDM3, FiberSpan0, FiberSpan1) on one side, links L0-L6 on the other; edges connect each group to its links. The observation is a set of temporally correlated failed links; a hypothesis is a possible explanation, i.e., a set of risk groups covering them]
40/53
Bipartite Graph Formulation

[Same bipartite graph, with a different set of failed links marked: a hypothesis can contain multiple simultaneous failures]

Finding a minimum set cover of a given observation is NP-hard
41/53
Greedy Approximation

[Bipartite graph with the observation O = four failed links marked]

Hit ratio of a group Gi:      |Gi ∩ O| / |Gi|   (e.g., for R0: 1/2 = 50%)
Coverage ratio of a group Gi: |Gi ∩ O| / |O|    (e.g., for R0: 1/4 = 25%)
42/53
Greedy Approximation

[Same example; (hit ratio, coverage ratio) for each group:
 R0=(50%,25%), R1=(75%,75%), R2=(100%,50%), R3=(33%,25%), R4=(66%,50%),
 D1=(66%,50%), D2=(33%,25%), D3=(66%,50%), F0=(50%,25%), F1=(100%,50%)]

Out of all groups with hit ratio 100%, pick the group with maximum coverage
Prune the links associated with this group and add the group to the hypothesis
Repeat with the pruned observation until no observation is left unexplained
43/53
Modeling Imperfections

Ideally, if a shared component fails, all associated links fail
Not always true in practice:
  Failure messages can get lost (transported over UDP)
  Risk-group modeling can be inaccurate
Solution: use an error threshold on the hit ratios
  Accounts for losses in data and inaccurate SRLG modeling
44/53
Modified Greedy Approximation
Select the groups whose hit ratio exceeds the error threshold
Among these groups, pick the one with maximum coverage
Prune the set of links explained by this group
Repeat the above steps until all observed links are explained
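A minimal sketch of this threshold-modified greedy, using the hit and coverage ratios defined on the earlier slides; the data structures and the `localize` name are my own, not SCORE's code. Setting `error_threshold=1.0` recovers the unmodified greedy.

```python
def localize(srlg_db, observation, error_threshold=1.0):
    """
    srlg_db:         dict mapping risk-group name -> set of links in that group
    observation:     set of links reported as failed (temporally correlated)
    error_threshold: minimum hit ratio a group needs to be considered
                     (lower values tolerate lost notifications and SRLG modeling errors)
    Returns a hypothesis: a list of risk-group names explaining the observation.
    """
    unexplained = set(observation)
    hypothesis = []
    while unexplained:
        best_group, best_coverage = None, 0.0
        for group, links in srlg_db.items():
            overlap = links & unexplained
            if not overlap:
                continue
            hit_ratio = len(overlap) / len(links)       # |G ∩ O| / |G|
            coverage = len(overlap) / len(unexplained)  # |G ∩ O| / |O|
            # Keep only candidates above the error threshold; among them,
            # prefer the one covering the most still-unexplained links.
            if hit_ratio >= error_threshold and coverage > best_coverage:
                best_group, best_coverage = group, coverage
        if best_group is None:   # nothing above threshold can explain the rest
            break
        hypothesis.append(best_group)
        unexplained -= srlg_db[best_group]
    return hypothesis

# Example using the router risk groups from the SRLG database slide:
# links L3, L5, L6 fail together, and R3 explains all of them.
srlgs = {"R0": {"L0", "L1"}, "R1": {"L0", "L2", "L3", "L4"}, "R2": {"L4", "L5"},
         "R3": {"L3", "L5", "L6"}, "R4": {"L1", "L2", "L6"}}
print(localize(srlgs, {"L3", "L5", "L6"}))  # -> ['R3']
```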
45/53
SCORE Spatial Correlation Module
Intelligence is built into the SRLG database and reflected in the SCORE queries
Obtains a minimum-set hypothesis
46/53
SCORE System Architecture

[Architecture: router syslogs, SNMP traps, and SONET PM data feed Data Translators, which drive the Spatial Correlation engine (SCORE) together with the SRLG database. Fault-localization policies call SCORE through an API: input <Ckt1, Ckt2, …> plus an error threshold; output <Grp1, Grp2, …>; multiple queries are supported]

Fault localization policies:
  1. Event clustering: captures events close together in time
  2. Localization heuristics:
     - use multiple error thresholds and keep the output H with minimum cost (|H| / eThresh)
     - query clustered events with similar signatures
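A short sketch of the min-cost localization heuristic, reusing the hypothetical `localize` function from the greedy sketch above; only the cost metric |H| / eThresh comes from the slide, the rest is assumed for illustration.

```python
def localize_multi_threshold(srlg_db, observation, thresholds=(1.0, 0.9, 0.8, 0.7, 0.6)):
    """Run the greedy at several error thresholds and keep the lowest-cost hypothesis."""
    best_hypothesis, best_cost = None, float("inf")
    for thresh in thresholds:
        hypothesis = localize(srlg_db, observation, error_threshold=thresh)
        if not hypothesis:
            continue
        cost = len(hypothesis) / thresh   # smaller hypotheses at higher confidence win
        if cost < best_cost:
            best_hypothesis, best_cost = hypothesis, cost
    return best_hypothesis
```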
47/53
Evaluation : Artificial Faults
Artificially generated faults, but a real SRLG database from a section of the AT&T backbone network
Pick a set of components to fail; the resulting observation is fed to SCORE
  No losses in data, no database inconsistency
The hypothesis is compared with the injected faults
48/53
Perfect Fault Notification
[Plot: fraction of correct hypotheses vs. number of simultaneously induced failures (0-20), one curve per SRLG type (fiber, span, port, module, router, area, SONET) plus the aggregated database]

Accuracy is greater than 95% for up to 5 simultaneous failures
49/53
Imperfect Fault Notifications
[Plot: fraction of correct hypotheses vs. loss probability (0.05-0.3) at error threshold 0.6, for one to five simultaneous failures]

Accuracy degrades almost linearly with loss probability
50/53
Evaluation: Real Faults

A set of 18 faults, studied and diagnosed, whose root causes were well known

One case study:
  An OSPF-area-wide problem affected about 70 links
  SCORE identified about 20 SRLG groups as the hypothesis
  Further analysis revealed the error was due to incorrect SRLG modeling: only OSPF interfaces with MPLS enabled were affected by the protocol bug
  Relaxing the error threshold to 0.7 brought the hypothesis down to 4 groups
51/53
Evaluation: Real Faults
Similarly, SCORE uncovered:
  Database problems
  Missing error reports from certain links
  Other inconsistencies
This shows how error thresholds are effective at working around such inconsistencies and data losses
52/53
Localization Precision
[CDF of localization fraction, i.e., the fraction of components a fault is narrowed down to]

About 40% of faults could be localized to less than 5% of components
About 80% of faults could be localized to less than 10% of components
53/53
Discussion

Strengths:
  Captures the spatial correlation between IP links
  Database inconsistencies are tolerated in SCORE using a simple error-threshold scheme
Weaknesses:
  Fails to model either very high-level or very low-level risk groups
  Extremely hard to select a single error threshold for all observations!
  The fault-localization policy needs more intelligent heuristics
Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network

Proc. Networked Systems Design and Implementation (NSDI), May 2005

Jian Wu, Z. Morley Mao, Jennifer Rexford, Jia Wang
55/53
Challenges & Goals
Large volume of BGP updates: millions daily, very bursty; too much for an operator to manage
Different from root-cause analysis:
  Identify changes and their effects
  Focus on actionable events
  Diagnose causes only in or near the AS
Goal: convert millions of BGP updates into a few dozen actionable reports!
56/53
System Architecture

[Pipeline, fed by BGP updates from the AS's border routers:
  BGP Updates (~10^6 per day)
    → BGP Update Grouping → Events (~10^5), plus Persistent Flapping Prefixes (~10^1)
    → Event Classification → "Typed" Events
    → Event Correlation → Clusters (~10^3), plus Frequent Flapping Prefixes (~10^1)
    → Traffic Impact Prediction (using Netflow data) → Large Disruptions (~10^1)]
57/53
Grouping BGP Updates into Events

Challenge: a single routing change
  leads to multiple update messages
  affects routing decisions at multiple routers

Solution (sketched in code below):
  Group all updates for a prefix with inter-arrival time < 70 seconds into one event
  Flag prefixes whose changes last > 10 minutes as persistent flapping prefixes

[Pipeline stage: BGP Updates → BGP Update Grouping → Events + Persistent Flapping Prefixes]
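A small sketch of the grouping rule above (hypothetical update format: time-sorted (timestamp, prefix) pairs); the 70-second and 10-minute constants come from the slide, everything else is illustrative.

```python
EVENT_GAP = 70          # seconds between updates for the same prefix
FLAP_DURATION = 600     # events lasting > 10 minutes mark a persistently flapping prefix

def group_updates(updates):
    """updates: list of (timestamp_seconds, prefix), assumed sorted by timestamp."""
    events = []           # closed events: (prefix, start, end)
    open_events = {}      # prefix -> [start, last_update]
    flapping = set()
    for ts, prefix in updates:
        if prefix in open_events and ts - open_events[prefix][1] < EVENT_GAP:
            open_events[prefix][1] = ts            # extend the current event
        else:
            if prefix in open_events:              # close the previous event
                start, end = open_events[prefix]
                events.append((prefix, start, end))
            open_events[prefix] = [ts, ts]         # start a new event
        start, end = open_events[prefix]
        if end - start > FLAP_DURATION:            # change has lasted too long
            flapping.add(prefix)
    events += [(p, s, e) for p, (s, e) in open_events.items()]
    return events, flapping
```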
58/53
Event Classification

Challenge: the major concerns in network management are
  changes in reachability
  heavy load of routing messages on the routers
  changes in the flow of traffic through the network

Solution: classify events by the severity of their impact

[Pipeline stage: Events → Event Classification → "Typed" Events]
59/53
Event Correlation

Challenge: a single routing change affects multiple destination prefixes

Solution: group events of the same type that occur close in time

[Pipeline stage: "Typed" Events → Event Correlation → Clusters]
60/53
Statistics on Event Classification

  Category                       Events   Updates
  No Disruption                  50.3%    48.6%
  Internal Disruption            15.6%     3.4%
  Single External Disruption     20.7%     7.9%
  Multiple External Disruption    7.4%    18.2%
  Loss/Gain of Reachability       6.0%    21.9%

The first three categories show significant variation from day to day
The number of updates per event depends on the type of event and the number of affected routers
61/53
Traffic Impact Prediction

Challenge: routing changes have different impacts on the network, depending on the popularity of the affected destinations

Solution: weigh each cluster by its traffic volume

[Pipeline stage: Clusters + Netflow data → Traffic Impact Prediction → Large Disruptions]
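A tiny illustration of weighting clusters by traffic volume; the inputs (cluster -> affected prefixes, prefix -> Netflow volume) and the function name are assumptions for this sketch, not the paper's code.

```python
def rank_clusters(clusters, traffic_by_prefix, top_n=10):
    """
    clusters:          dict mapping cluster id -> set of affected prefixes
    traffic_by_prefix: dict mapping prefix -> measured traffic volume (e.g., bytes/day)
    Returns the top_n clusters by estimated traffic impact.
    """
    impact = {cid: sum(traffic_by_prefix.get(p, 0) for p in prefixes)
              for cid, prefixes in clusters.items()}
    return sorted(impact.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```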
62/53
Conclusion
BGP anomaly detection in a fast, online fashion
Significant information reduction (to a few dozen actionable reports!)
Uncovered important network behaviors:
  Persistent flapping prefixes
  Hot-potato changes
  Session resets and interface failures