8/6/2019 Bgp Trouble
BGP Anomaly Detection in an ISP
Jian Wu (U. Michigan), Z. Morley Mao (U. Michigan),
Jennifer Rexford (Princeton), Jia Wang (AT&T Labs)
http://www.cs.princeton.edu/~jrex/papers/nsdi05-jian.pdf
Goal
Identify important anomalies:
  Lost reachability
  Persistent flapping
  Large traffic shifts
Contributions:
  Build a tool to identify a small number of important routing disruptions from a large volume of raw BGP updates in real time.
  Use the tool to characterize routing disruptions in an operational network.
Capturing Routing Changes
[Figure: routers connected by eBGP sessions to neighbors and by iBGP sessions to a BGP Monitor; arrows labeled "Updates" and "Best routes".]
Large operational network (8/16/2004 to 10/10/2004)
Challenges
Large volume of BGP updates
  Millions daily, very bursty
  Too much for an operator to manage
Different than root-cause analysis
  Identify changes and their effects
  Focus on actionable events
  Diagnose causes only in/near the AS
System Architecture
[Figure: processing pipeline. BGP Updates (10^6) -> BGP Update Grouping -> Events (10^5) -> Event Classification -> Typed Events -> Event Correlation -> Clusters (10^3) -> Traffic Impact Prediction (with Netflow Data) -> Large Disruptions (10^1). Side outputs: Persistent Flapping Prefixes (10^1) from grouping, Frequent Flapping Prefixes (10^1) from correlation.]
Grouping BGP Updates into Events
Challenge: a single routing change
  leads to multiple update messages
  affects routing decisions at multiple routers
Solution:
  Group all updates for a prefix with inter-arrival time < 70 seconds
  Flag prefixes with changes lasting > 10 minutes
[Figure: BGP Updates -> BGP Update Grouping -> Events, with Persistent Flapping Prefixes as a side output.]
Grouping Thresholds
Based on data analysis and our understanding of BGP
Event timeout: 70 seconds
  2 * MRAI timer + 10 seconds
  98% of inter-arrival times < 70 seconds
Convergence timeout: 10 minutes
  BGP usually converges within minutes
  99.9% of events < 10 minutes
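
The grouping rule and its two thresholds can be sketched as follows. This is a minimal offline illustration, not the authors' tool: the `Event` type and the `group_updates` function are hypothetical names, updates are assumed to arrive as (timestamp in seconds, prefix) pairs, and the 30-second MRAI value is the commonly used eBGP default rather than a figure from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple

MRAI_SECONDS = 30  # assumption: common eBGP default, not stated in the slides
EVENT_TIMEOUT = 2 * MRAI_SECONDS + 10  # 70 seconds, as on the slide
CONVERGENCE_TIMEOUT = 10 * 60  # 10 minutes, as on the slide


@dataclass
class Event:
    """All updates for one prefix that arrived close together in time."""
    prefix: str
    timestamps: List[float] = field(default_factory=list)

    @property
    def duration(self) -> float:
        return self.timestamps[-1] - self.timestamps[0]

    @property
    def persistent_flapping(self) -> bool:
        # The routing change has not settled within the convergence timeout.
        return self.duration > CONVERGENCE_TIMEOUT


def group_updates(updates: Iterable[Tuple[float, str]]) -> List[Event]:
    """Group (timestamp, prefix) updates into per-prefix events."""
    open_events: Dict[str, Event] = {}
    events: List[Event] = []
    for ts, prefix in sorted(updates):
        current = open_events.get(prefix)
        # Start a new event when the gap since the prefix's previous
        # update reaches the event timeout.
        if current is None or ts - current.timestamps[-1] >= EVENT_TIMEOUT:
            current = Event(prefix)
            open_events[prefix] = current
            events.append(current)
        current.timestamps.append(ts)
    return events
```

A prefix updated every 60 seconds for 11 minutes stays in a single event under this rule and is flagged as persistently flapping, since no gap reaches 70 seconds while the total duration exceeds 10 minutes.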
Persistent Flapping Prefixes
Causes of persistent flapping
  Conservative damping parameters (78.6%)
  Protocol oscillations due to MED (18.3%)
  Unstable interface or BGP session (3.0%)
Surprising finding: 15.2% of updates were caused by persistent flapping prefixes, even though flap damping was enabled!
Example: Unstable eBGP Session
[Figure: ISP, Peer, and Customer networks; routers EA, EB, EC, ED; prefix p.]
Flap damping parameters are session-based
Damping is not implemented for iBGP sessions
Event Classification
Challenge: major concerns in network management
  Changes in reachability
  Heavy load of routing messages on the routers
  Change in the flow of traffic through the network
[Figure: Events -> Event Classification -> Typed Events]
Solution: classify events by the severity of their impact
Event Category: No Disruption
[Figure: ISP with routers EA, EB, EC, ED, EE; neighbors AS1 and AS2; prefix p; label "No Traffic Shift".]
No Disruption: none of the border routers has a traffic shift. (50.3%)
Event Category: Internal Disruption
[Figure: ISP with routers EA, EB, EC, ED, EE; neighbors AS1 and AS2; prefix p; label "Internal Traffic Shift".]
Internal Disruption: all of the traffic shifts are internal traffic shifts. (15.6%)
Event Category: Single External Disruption
[Figure: ISP with routers EA, EB, EC, ED, EE; neighbors AS1 and AS2; prefix p; label "External Traffic Shift".]
Single External Disruption: only one of the traffic shifts is an external traffic shift. (20.7%)
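
The category rules can be sketched as a small decision function. This is an illustration under stated assumptions, not the authors' classifier: the per-router shift labels (`"none"`, `"internal"`, `"external"`, `"lost_or_gained"`) and the function name are hypothetical, and the rules for Multiple External Disruption and Loss/Gain of Reachability are inferred from the category names in the statistics table, since the slides define only the first three.

```python
from collections import Counter
from typing import Iterable

# Hypothetical labels for what happened at each border router in one event.
VALID_SHIFTS = {"none", "internal", "external", "lost_or_gained"}


def classify_event(shifts: Iterable[str]) -> str:
    """Map per-router traffic shifts for one event to an event category."""
    counts = Counter(shifts)
    unknown = set(counts) - VALID_SHIFTS
    if unknown:
        raise ValueError(f"unknown shift labels: {sorted(unknown)}")
    # Assumption: any gain or loss of a route dominates other shifts.
    if counts["lost_or_gained"] > 0:
        return "Loss/Gain of Reachability"
    # Assumption: "multiple" means two or more external shifts.
    if counts["external"] >= 2:
        return "Multiple External Disruption"
    if counts["external"] == 1:
        return "Single External Disruption"
    if counts["internal"] > 0:
        return "Internal Disruption"
    return "No Disruption"
```

For example, `classify_event(["none", "internal", "external"])` yields `"Single External Disruption"`, because exactly one router saw an external shift.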
Statistics on Event Classification
                               Events   Updates
No Disruption                  50.3%    48.6%
Internal Disruption            15.6%     3.4%
Single External Disruption     20.7%     7.9%
Multiple External Disruption    7.4%    18.2%
Loss/Gain of Reachability       6.0%    21.9%
First 3 categories have significant variations from day to day
Updates per event depend on the type of event and the number of affected routers
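
The dependence of updates per event on event type can be read off the table: dividing a category's share of updates by its share of events gives its updates per event relative to the overall average. A short sketch of that arithmetic, using only the percentages above (the variable names are illustrative):

```python
# (share of events %, share of updates %) per category, from the table.
shares = {
    "No Disruption": (50.3, 48.6),
    "Internal Disruption": (15.6, 3.4),
    "Single External Disruption": (20.7, 7.9),
    "Multiple External Disruption": (7.4, 18.2),
    "Loss/Gain of Reachability": (6.0, 21.9),
}

# Ratio > 1 means the category produces more updates per event than average.
relative_updates_per_event = {
    name: round(updates / events, 2)
    for name, (events, updates) in shares.items()
}

for name, ratio in relative_updates_per_event.items():
    print(f"{name:30s} {ratio:.2f}x average")
```

Under these numbers, the two categories that touch several routers or remove a route entirely (Multiple External Disruption and Loss/Gain of Reachability) generate well above the average number of updates per event, while Internal Disruption events generate far fewer.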
Event Correlation
Challenge: a single routing change affects multiple destination prefixes
[Figure: Typed Events -> Event Correlation -> Clusters]
Solution: group events of the same type that occur close in time
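
A minimal sketch of this correlation step follows. It is a simplification for illustration only: the slide does not state how close in time events must be, so `window` is a hypothetical parameter, and `TypedEvent` and `correlate` are assumed names.

```python
from typing import Dict, List, NamedTuple


class TypedEvent(NamedTuple):
    start: float      # event start time in seconds
    event_type: str   # category assigned by event classification
    prefix: str


def correlate(events: List[TypedEvent], window: float) -> List[List[TypedEvent]]:
    """Cluster events of the same type whose start times are within `window`."""
    clusters: List[List[TypedEvent]] = []
    latest: Dict[str, List[TypedEvent]] = {}  # most recent cluster per type
    for ev in sorted(events):  # NamedTuple sorts by start time first
        cluster = latest.get(ev.event_type)
        # Open a new cluster when this type has none yet, or when the gap
        # to the cluster's last event exceeds the window.
        if cluster is None or ev.start - cluster[-1].start > window:
            cluster = []
            latest[ev.event_type] = cluster
            clusters.append(cluster)
        cluster.append(ev)
    return clusters
```

With a 10-second window, two internal-disruption events at t=0 and t=3 land in one cluster, while a third at t=100 starts a new one.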
eBGP Session Reset
Caused most single external disruption events
Check if the number of prefixes using that session as the best route changes dramatically
Validation with Syslog router report (95%)
[Figure: number of prefixes vs. time, marking session failure and session recovery.]
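
The check described above can be sketched as a scan over a time series of prefix counts for one session. The slides do not say what counts as a dramatic change, so the `drop_fraction` threshold, the function name, and the assumption that the session is up at the first sample are all hypothetical.

```python
from typing import List, Tuple


def detect_session_changes(
    counts: List[int], drop_fraction: float = 0.9
) -> List[Tuple[int, str]]:
    """Return (sample index, "failure" or "recovery") for sharp changes.

    `counts[i]` is the number of prefixes whose best route uses the
    session at sample i. The session is assumed to be up at sample 0.
    """
    changes: List[Tuple[int, str]] = []
    if not counts:
        return changes
    up = True
    baseline = counts[0]  # last count seen while the session was up
    for i, n in enumerate(counts[1:], start=1):
        if up and baseline > 0 and n <= baseline * (1 - drop_fraction):
            changes.append((i, "failure"))
            up = False
        elif not up and n >= baseline * drop_fraction:
            changes.append((i, "recovery"))
            up = True
        if up:
            baseline = n  # track slow drift while the session is healthy
    return changes
```

Small fluctuations around the baseline are ignored; only a drop of at least 90% of the prefixes, and the later return to near the earlier level, are reported.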
Hot-Potato Changes
Caused internal disruption events
Validation with OSPF measurement (95%) [Teixeira et al., SIGMETRICS 04]
[Figure: ISP with routers EA, EB, EC and destination P; IGP cost labels 10, 11, and 9.]
Hot-potato routing = route to closest egress point
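
The definition above can be made concrete with a small sketch: a router picks the egress point with the lowest IGP cost, and a hot-potato change occurs when an IGP cost change alters that choice. The function names, egress names, and cost values here are illustrative assumptions.

```python
from typing import Dict


def closest_egress(igp_cost: Dict[str, int]) -> str:
    """Pick the egress with the lowest IGP cost; ties break by name."""
    return min(sorted(igp_cost), key=lambda egress: igp_cost[egress])


def hot_potato_change(before: Dict[str, int], after: Dict[str, int]) -> bool:
    """True when an IGP cost change moves the router to another egress."""
    return closest_egress(before) != closest_egress(after)


# Illustrative costs from one router to two egress points, A and B.
before = {"A": 9, "B": 10}
after = {"A": 11, "B": 10}  # the cost to A rises from 9 to 11
print(closest_egress(before), "->", closest_egress(after))
```

No BGP route was withdrawn or changed by a neighbor in this example; the egress shifts only because of the internal cost change, which is why such events show up as internal disruptions.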
Traffic Impact Prediction
Challenge: routing changes have different impacts on the network, depending on the popularity of the destinations
[Figure: Clusters and Netflow Data -> Traffic Impact Prediction -> Large Disruptions]
Solution: weigh each cluster by traffic volume
Traffic Impact Prediction
Traffic weight
  Per-prefix measurement from Netflow
  10% of prefixes account for 90% of traffic
Traffic weight of a cluster
  Sum of the traffic weights of its prefixes
  A few clusters have large traffic weight
  Mostly session resets & hot-potato changes
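
The cluster weight computation can be sketched as below. The per-prefix weights are assumed to be fractions of total traffic already derived from Netflow; the `threshold` for calling a disruption large is a hypothetical parameter, as the slides do not give one, and the function names are illustrative.

```python
from typing import Dict, Iterable, List, Tuple


def cluster_weight(prefixes: Iterable[str], prefix_weight: Dict[str, float]) -> float:
    """Sum the traffic weights of a cluster's distinct prefixes."""
    # Prefixes with no Netflow measurement are assumed to carry no traffic.
    return sum(prefix_weight.get(p, 0.0) for p in set(prefixes))


def large_disruptions(
    clusters: Dict[str, List[str]],
    prefix_weight: Dict[str, float],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Return (cluster name, weight) pairs at or above `threshold`, heaviest first."""
    weighted = [
        (name, cluster_weight(prefixes, prefix_weight))
        for name, prefixes in clusters.items()
    ]
    heavy = [(name, w) for name, w in weighted if w >= threshold]
    return sorted(heavy, key=lambda item: item[1], reverse=True)
```

Because a small share of prefixes carries most of the traffic, most clusters fall below any reasonable threshold and only a handful are reported.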
Performance Evaluation
Memory
  Static memory: current routes, 600 MB
  Dynamic memory: clusters, 300 MB
Speed
  99% of 1-second intervals of updates can be processed within 1 second
  Occasional execution lag
  Every 70-second interval of updates can be processed within 70 seconds
Measurements were based on a 900 MHz CPU
Conclusion
BGP anomaly detection
  Fast, online fashion
  Operator concerns (reachability, flapping, traffic)
  Significant information reduction
Uncovered important network behaviors
  Persistent flapping prefixes
  Hot-potato changes
  Session resets and interface failures
Detecting Peering Violations
Consistent export requirement
  Peer should advertise prefixes at all peering points, with the same AS path length
  Allows the AS to do hot-potato routing
Detecting violations
  Using iBGP feeds from the border routers
  Some inference tricks to identify inconsistencies
Results of the study
  http://www.nanog.org/mtg-0410/feamster.html
  http://www.cs.princeton.edu/~jrex/papers/imc04.pdf
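
The consistent export requirement can be expressed as a simple check. This sketch assumes the peer's advertisements at every peering point are fully known, which is a simplification: iBGP feeds expose only each border router's best route, which is why the slide mentions inference tricks. The function name and the peering point names in the usage example are hypothetical.

```python
from typing import Dict, Set


def find_violations(adverts: Dict[str, Dict[str, int]]) -> Dict[str, str]:
    """Check one peer's advertisements for consistent export.

    `adverts` maps peering point -> {prefix: AS path length}.
    Returns prefix -> description of the inconsistency.
    """
    all_prefixes: Set[str] = set().union(*adverts.values())
    violations: Dict[str, str] = {}
    for prefix in sorted(all_prefixes):
        missing = sorted(pt for pt, routes in adverts.items() if prefix not in routes)
        lengths = {routes[prefix] for routes in adverts.values() if prefix in routes}
        if missing:
            violations[prefix] = "not advertised at " + ", ".join(missing)
        elif len(lengths) > 1:
            violations[prefix] = "AS path lengths differ: " + str(sorted(lengths))
    return violations


adverts = {
    "NYC": {"p1": 2, "p2": 3, "p3": 1},
    "LA": {"p1": 2, "p2": 4},
}
print(find_violations(adverts))
```

In the example, `p1` is consistent, `p2` is advertised with different AS path lengths, and `p3` is missing at one peering point; either inconsistency would distort the hot-potato choice between the two peering points.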