  • 5/28/2018 BRKRST-3320 - Troubleshooting BGP

    2013 Cisco and/or its affiliates. All rights reserved. BRKRST-3320 Cisco Public

    Introduction: Housekeeping

    Cell Phones
    Who am I?
    Who are you?
        Service Provider
        Enterprise
        Studying for CCIE
    Advanced Class
        Assume BGP Operational Experience
            Basic configuration
            Show commands
            Understand BGP attributes

    Introduction: Operating Systems

    IOS vs. IOS-XR vs. NX-OS
        Troubleshooting concepts are the same
        Some variation in show command syntax and output
        Will use all three in this presentation

    Introduction: Agenda

    Generic Troubleshooting Advice
    Troubleshooting Peers
    Bestpath Algorithm
    Table Version
    Initial Convergence
    Periodic Convergence
    High Utilization
    Layer 3 VPNs
    Looking Glasses


    Generic Troubleshooting Advice

    Generic Troubleshooting Advice

    Narrow down the problem
        Can you reproduce it?
        Which device(s) are the cause of the problem?
        Reduce your configs
    Troubleshoot one thing at a time
        100k routes flapping? Pick one route and focus on that one route
    Have a co-worker take a look
        Forces you to talk through the problem
        Different set of eyes may spot something
    Sniffer capture, sniffer capture, sniffer capture

    Generic Troubleshooting Advice: Syslogs

    Use NTP to sync timestamps on your routers
        clock timezone EST -5 0
        clock summer-time EDT recurring
        ntp server x.x.x.x
    Use a syslog server
        logging monitor informational
        logging host x.x.x.x
        service timestamps log datetime msec localtime

    Generic Troubleshooting Advice: Syslogs

    Centralized, time-synced syslogs are a great troubleshooting tool

    Generic Troubleshooting Advice: log-neighbor-changes

    bgp log-neighbor-changes
        Generates a syslog message when a peer goes up or down
        Always configure this
    OSPF, ISIS, and EIGRP all have log-neighbor-changes too

    Generic Troubleshooting Advice: Define Normal

    "The CPU on this router is high"
        High compared to what?
        What is the CPU load normally at this time of day?
    Things to keep track of
        CPU load
        Free memory
        Largest block of memory
        Input/output load for interfaces
        Rate of BGP bestpath changes
        Etc., etc.
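
    A minimal sketch of "define normal", assuming you already poll a metric such as CPU load and keep earlier samples for the same time of day. The function name, the sample values, and the 3-sigma threshold are illustrative assumptions, not something the slides prescribe.

```python
from statistics import mean, stdev


def is_abnormal(history, current, sigmas=3.0):
    """Return True if `current` is further than `sigmas` standard
    deviations away from the mean of the baseline samples."""
    if len(history) < 2:
        raise ValueError("need at least two baseline samples")
    baseline = mean(history)
    spread = stdev(history)
    # A perfectly flat baseline has no spread: any change counts as abnormal.
    if spread == 0:
        return current != baseline
    return abs(current - baseline) > sigmas * spread


# Hypothetical CPU load (%) recorded at 14:00 on previous days.
cpu_at_2pm = [22, 25, 19, 24, 21, 23, 20]

print(is_abnormal(cpu_at_2pm, 24))  # close to the usual load
print(is_abnormal(cpu_at_2pm, 85))  # far above the usual load
```

    The same comparison works for free memory, interface load, or the rate of bestpath changes, as long as the baseline was collected under comparable conditions.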

    Generic Troubleshooting Advice: Define Normal

    Cacti is a handy tool for polling and graphing data from various network devices
        http://www.cacti.net/

    Generic Troubleshooting Advice: Sniffer Captures

    Use SPAN to get traffic to your sniffer
        monitor session 1 source interface Te2/4 rx
        monitor session 1 destination interface Te2/2
    IOS-XR
        Only supported on ASR-9000
        Use ACLs to control what packets to SPAN
    RSPAN
        RSPAN has all the features of SPAN, plus support for source ports and destination ports that are distributed across multiple switches, allowing one to monitor any destination port located on the RSPAN VLAN. Hence, one can monitor the traffic on one switch using a device on another switch.

    Generic Troubleshooting Advice: Embedded Packet Capture

    Ability to capture packets on the router
        Primarily for control-plane traffic
        Difficult to capture transit traffic on distributed platforms
            Is supported on some platforms
    Very handy if a dedicated sniffer is not available
    Available on IOS and NX-OS

    Generic Troubleshooting Advice: IOS Embedded Packet Capture

    Create a buffer
        monitor capture buffer buf1 size 512 max-size 512 circular
    Define which interface and direction to capture
        monitor capture point ip cef dwalton-cap gig 0/0 in
    Associate the buffer with the capture
        monitor capture point associate dwalton-cap buf1
    Start/Stop the capture
        monitor capture point start dwalton-cap
        monitor capture point stop dwalton-cap
    Export the capture to a .pcap file
        monitor capture buffer buf1 export tftp://172.26.2.254/buf1.

    Generic Troubleshooting Advice: Wireshark

    You probably know this already but
        Wireshark is your best friend
        It is free
        You can get it here
            http://www.wireshark.org/


    Generic Troubleshooting Advice: Wireshark

    Can do complex filters
        ANDs, ORs, ()s, etc.
    If the filter is red, your syntax is busted
    If the filter is green, your syntax is correct

    Generic Troubleshooting Advice: Wireshark

    Wireshark does a LOT
        Enough for someone to write an 800 page book on how to use it
        ISBN-13: 978-18939399

    Generic Troubleshooting Advice: Debugs

    Send output to the logging buffer, not the console
        logging buffered
        no logging console
    Use millisecond timestamps
        service timestamps debug datetime msec localtime
        service timestamps log datetime msec localtime
    Use ACLs to limit output
        brain1(config)#access-list 100 permit ip host 1.1.1.1 host
        brain1#debug ip packet 100
        IP packet debugging is on for access list 100
        brain1#
    If you need to enable a very chatty debug
        reload in 10
        Run your debug
        reload cancel

    Generic Troubleshooting Advice: Event Tracing

    Collects event information for various protocols
    Runs in the background
    Events are stored in memory
        Debug output is not generated
        Syslogs are not generated
        Finite number of most recent events are stored
    Use show commands later to
        Display an event in a debug-like format
        Merge events from various protocols
    Easier on the box than debugs

    Generic Troubleshooting Advice: Event Tracing

        brain1(config)#monitor event-trace ?
          adjacency   Adjacency Events
          all-traces  Configure merged event traces
          atom        AToM Event Trace
          cef         CEF traces
          [snip]
        brain1(config)#monitor event-trace adjacency enable
        brain1(config)#end

    Generic Troubleshooting Advice: Event Tracing

        brain1#show monitor event-trace adjacency all
        Feb 14 17:15:48.270: GLOBAL: adj mgr notified of fibidb state change int FastEthernet0/0 to
        Feb 14 17:15:50.958: GLOBAL: adj mgr notified of fibidb state change int FastEthernet0/0 to
        Feb 14 17:15:51.682: GLOBAL: adj ipv4 bundle changed to IPv4 no fixup adj oce [OK]
        Feb 14 17:15:51.682: ADJ: IP 172.26.38.1 FastEthernet0/0/0: update oce bundle, IPv4 incompl[OK]
        Feb 14 17:15:51.682: ADJ: IP 172.26.38.1 FastEthernet0/0/0: allocate [OK]
        Feb 14 17:15:51.686: ADJ: IP 172.26.38.1 FastEthernet0/0/0: request resolution [OK]
        Feb 14 17:15:51.734: ADJ: IP 172.26.38.1 FastEthernet0/0/0: request to add ARP [OK]
        Feb 14 17:15:51.734: ADJ: IP 172.26.38.1 FastEthernet0/0/0: allocate [Ignr]
        Feb 14 17:15:51.734: ADJ: IP 172.26.38.1 FastEthernet0/0/0: add source ARP [OK]
        Feb 14 17:15:51.734: ADJ: IP 172.26.38.1 FastEthernet0/0/0: request to update [OK]
        Feb 14 17:15:51.734: ADJ: IP 172.26.38.1 FastEthernet0/0/0: update oce bundle, IPv4 no fixu
        Feb 14 17:15:51.734: ADJ: IP 172.26.38.1 FastEthernet0/0/0: update [OK]
        brain1#

    Generic Troubleshooting Advice: Out Of Band Access

    Don't be the person who ha
    hours to console into a box
    If you don't have out of band access
    for every router and/or switch in your network... get it... please


    Troubleshooting Peers

    Failed Peering: Configurations

    R1
        interface Loop0
         ip address 1.1.1.1/32
        !
        router bgp 100
         neighbor 2.2.2.2 remote-a
         neighbor 2.2.2.2 update-s

    R2
        interface Loop0
         ip address 2.2.2.2/32
        !
        router bgp 100
         neighbor 1.1.1.1 remote-a
         neighbor 1.1.1.1 update-s

        R1#sh tcp brief all
        TCB       Local Address  Foreign Address  (state)
        64328548  *.179          2.2.2.2.*        LISTEN
        R1#

    Check
        AS Numbers
        IP addresses for TCP
        eBGP Multihop?

    Failed Peering: Connectivity

    R1
        interface Loop0
         ip address 1.1.1.1/32
        !
        router bgp 100
         neighbor 2.2.2.2 remote-as
         neighbor 2.2.2.2 update-so

    R2
        interface Loop0
         ip address 2.2.2.2/32
        !
        router bgp 100
         neighbor 1.1.1.1 remote-as
         neighbor 1.1.1.1 update-so

    Check
        Extended ping between BGP peering addresses

        R1#ping 2.2.2.2 source Loop0
        Sending 5, 100-byte ICMP Echos to 2.2.2.2
        Packet sent with a source address of 1.1.1.1
        .....
        Success rate is 0 percent (0/5)
        R1#

    Failed Peering: Connectivity

    BGP runs on top of IP and can be affected by many things
    No connectivity?
        IGP issues
        Access Lists
        TCP problems
    Peers come up but flap, are s
        MTU issues: extended ping an address ranges, DF bit, etc.
        Rate limiting
        Traffic shaping
    Debugs may be needed

    Failed Peering: Notifications

    BGP NOTIFICATIONs consist of an error code, subcode and data
        All Error Codes and Subcodes can be found here
            http://www.iana.org/assignments/bgp-parameters/bgp-parameters.xml
            http://tinyurl.com/bgp-notification-codes
        Data portion may contain what triggered the notification
            Example: corrupt part of the UPDATE
    Pay attention to who sent vs. received the NOTIFICATION
        If Router X sent the NOTIFICATION, it means he noticed the issue
        Does not mean Router X is the cause of the issue

    Failed Peering: Notifications

        %BGP-3-NOTIFICATION: sent to neighbor 2.2.2.2 2/2 (peer in wrong
        2 bytes 00C8 FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 002D 0104 00C00B4 0202 0202 1002 0601 0400 0100 0102 0280 0002 0202 00

    Value  Name                        Reference
    1      Message Header Error        RFC 4271
    2      OPEN Message Error          RFC 4271
    3      UPDATE Message Error        RFC 4271
    4      Hold Timer Expired          RFC 4271
    5      Finite State Machine Error  RFC 4271
    6      Cease                       RFC 4271

    The first 2 in 2/2 is the Error Code... so "OPEN Message Error"

    Failed Peering: Notifications

    OPEN Message Subcodes

    Subcode #  Subcode Name                    Subcode Description
    1          Unsupported BGP version         The version of BGP the peer is running is not compatible with the local version of BGP
    2          Bad Peer AS                     The AS this peer is locally configured for does not match the AS the peer is advertising
    3          Bad BGP Identifier              The BGP router ID is the same as the local BGP router ID
    4          Unsupported Optional Parameter  There is an option in the packet which the local BGP speaker doesn't recognize
    6          Unacceptable Hold Time          The remote BGP peer has requested a hold time which is not allowed (too low)
    7          Unsupported Capability          The peer has asked for support for a feature which the local router does not support

    The second 2 in 2/2 is the Error Subcode... so "Bad Peer AS"

    Failed Peering: Notifications

    Sniff of BGP Notification Sent from R2 to R1 (diagram: R2 sends to R1 at 10.1.2.1)

        R2#show log | include NOTIFICATION
        %BGP-3-NOTIFICATION: sent to neighbor 10.1.2.1 2/2 (peer in wrong AS)
        2 bytes 0064 FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 002D 0104
        0064 00B4 0101 0101 1002 0601 0400 0100 0102 0280 0002 0202 00

    x0064 = data of NOTIFICATION
    x0064 = decimal 100
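
    The code/subcode lookup described on the NOTIFICATION slides can be sketched as a small log decoder. This is an illustration only: the function name and the returned dictionary layout are assumptions, the tables contain just the values listed on the slides, and the regular expressions assume the IOS message layout shown above.

```python
import re

# Error codes and OPEN subcodes exactly as listed on the slides (RFC 4271).
ERROR_CODES = {
    1: "Message Header Error",
    2: "OPEN Message Error",
    3: "UPDATE Message Error",
    4: "Hold Timer Expired",
    5: "Finite State Machine Error",
    6: "Cease",
}
OPEN_SUBCODES = {
    1: "Unsupported BGP version",
    2: "Bad Peer AS",
    3: "Bad BGP Identifier",
    4: "Unsupported Optional Parameter",
    6: "Unacceptable Hold Time",
    7: "Unsupported Capability",
}

_HEADER = re.compile(
    r"NOTIFICATION: (sent to|received from) neighbor (\S+) (\d+)/(\d+)"
)
_DATA = re.compile(r"\b\d+ bytes ([0-9A-Fa-f]{4})")


def decode_notification(line):
    """Decode one %BGP-3-NOTIFICATION syslog line, or return None."""
    header = _HEADER.search(line)
    if header is None:
        return None
    code, subcode = int(header.group(3)), int(header.group(4))
    result = {
        "direction": header.group(1),
        "neighbor": header.group(2),
        "error": ERROR_CODES.get(code, "Unknown"),
        # Only the OPEN subcodes are tabulated on the slides.
        "subcode": OPEN_SUBCODES.get(subcode, "Unknown") if code == 2 else None,
        "peer_as": None,
    }
    data = _DATA.search(line)
    # For 2/2 (Bad Peer AS) the first data word is the AS the peer advertised.
    if (code, subcode) == (2, 2) and data:
        result["peer_as"] = int(data.group(1), 16)
    return result


log = "%BGP-3-NOTIFICATION: sent to neighbor 10.1.2.1 2/2 (peer in wrong AS) 2 bytes 0064"
print(decode_notification(log))
```

    For the R2 log line this reports an OPEN Message Error with subcode Bad Peer AS and a peer AS of 100, matching the x0064 = decimal 100 note.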

    Failed Peering: Notifications

    Question: What did R1 see?

        R1#sh log | include NOTIFICATION
        %BGP-3-NOTIFICATION: received from neighbor 10.1.2.2 2/2 (peer in wrong AS

    R2 (10.1.2.2)
        router bgp 200
         no synchronization
         bgp log-neighbor-changes
         neighbor 10.1.2.1 remote-as 10
         no auto-summary

    R1 (10.1.2.1)
        router bgp 100
         no synchronization
         bgp log-neighbor-changes
         neighbor 10.1.2.2 remote-as 200
         no auto-summary

    Failed Peering: Decoding Hex

    What if a peer sends you a message that causes us to send a NOTIFICATION?
        Corrupt UPDATE
        Bad OPEN message, etc.
    View the message that triggered the NOTIFICATION

        show ip bgp neighbor 1.1.1.1 | begin Last reset
        Last reset 5d12h, due to BGP Notification sent, invalid or corrupt
        Message received that caused BGP to send a Notification:
        FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF
        005C0200 00004140 01010040 0206065D
        1CFC059F 400304D5 8C20F480 04040000
        05054005 04000000 55C0081C 329C4844
        329C6E28 329C6E29 58F50082 58F5EACE
        58F5FA02 58F5FA6E 18D14E70

    Failed Peering: Decoding Hex

    You don't like reading hex?
    Nice write-up here on converting hex output to a Wireshark .pcap
        http://ccie-in-3-months.blogspot.com/2010/08/decoding-ripe-expe
        http://tinyurl.com/bgp-hex-decode
    In a nutshell, put the hex dump in this format

    Failed Peering: Decoding Hex

    Now use Wireshark's text2pcap.exe to add the needed headers
    Open bgp_message.pcap with Wireshark
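
    The reformatting step can be scripted. The slide's own example layout did not survive in this transcript, so the sketch below assumes the generic text2pcap input form: a hex offset followed by space-separated byte values. The function name and the 16-bytes-per-line choice are illustrative.

```python
def to_text2pcap(hex_dump, bytes_per_line=16):
    """Turn a router hex dump (words separated by whitespace) into
    offset-prefixed lines that text2pcap can read."""
    digits = "".join(hex_dump.split())
    raw = bytes.fromhex(digits)  # raises ValueError on malformed input
    lines = []
    for offset in range(0, len(raw), bytes_per_line):
        chunk = raw[offset:offset + bytes_per_line]
        lines.append(f"{offset:06x} " + " ".join(f"{b:02x}" for b in chunk))
    return "\n".join(lines) + "\n"


# The message from the "Last reset" output shown earlier.
dump = """
FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF
005C0200 00004140 01010040 0206065D
1CFC059F 400304D5 8C20F480 04040000
05054005 04000000 55C0081C 329C4844
329C6E28 329C6E29 58F50082 58F5EACE
58F5FA02 58F5FA6E 18D14E70
"""
print(to_text2pcap(dump), end="")
```

    Saving the output to a text file and running something like `text2pcap -T 179,179 bgp_message.txt bgp_message.pcap` should then prepend dummy headers with TCP port 179 so Wireshark dissects the payload as BGP; check the options of your text2pcap version before relying on that invocation.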

    Troubleshooting Peers: eBGP TTL

    BGP uses a TTL of 1 for eBGP peers
        Also verifies if NEXTHOP is directly connected
    For eBGP peers that are more than 1 hop away a larger TTL must be used
        No longer verifies if NEXTHOP is directly connected
        neighbor x.x.x.x ebgp-multihop [2-255]

    Troubleshooting Peers: eBGP TTL

    Loopback peering to directly connected eBGP peer
        Typically used to load-balance over multiple links
    Two options for configuring this
    Option #1: The old way
        Use ebgp-multihop
        Change the TTL to 2
        Disables the "is the NEXTHOP on a connected subnet" check

        R1#
        router bgp 100
         no synchronization
         bgp log-neighbor-changes
         neighbor 2.2.2.2 remote-as 200
         neighbor 2.2.2.2 ebgp-multihop 2
         neighbor 2.2.2.2 update-source Loopback0
         no auto-summary

    Troubleshooting Peers: eBGP TTL

    Option #2: The new way
        Use disable-connected-check
        Still uses a TTL of 1
        Disables the "is the NEXTHOP on a connected subnet" check

        R1#
        router bgp 100
         no synchronization
         bgp log-neighbor-changes
         neighbor 2.2.2.2 remote-as 200
         neighbor 2.2.2.2 disable-connected-check
         neighbor 2.2.2.2 update-source Loopback0
         no auto-summary

    Failed Peering: Notifications (Hold Time Expired)

    Assume R1 sends hold time expired NOTIFICATION to R2
        R1 did not receive a KA from R2 for holdtime seconds
    One of two issues
        R2 is not generating keepalives
        R2 is generating keepalives but R1 is not receiving them
    First figure out if R2 is building keepalives
        Is R2 out of memory or CPU?
        Output drops on the outbound interface towards R1?
        When did R2 last build a keepalive?
            R2#show ip bgp neighbors 1.1.1.1
            Last read 00:00:15, last write 00:00:44, hold time is 180
            keepalive interval is 60 seconds
        Is the TCP window open?
        show ip bgp summary
            Watch R2's MsgSent counter for R1... does it increment?

    Failed Peering: Notifications (Hold Time Expired)

    Assuming R2 is sending keepalives, why isn't R1 receiving them?
        Input drops on R1
        Lost in transit?
    Do R1 and R2 still have IP connectivity?
        Ping using peering addresses (loopback to loopback)
        Ping with mss (max-segment-size) with df-bit set
    MSS: Max Segment Size
        536 bytes by default
        Path MTU Discovery finds smallest MTU between R1 and R2
        Subtract 40 bytes for TCP/IP overhead
    Note the MSS and ping accordingly

        R1#sh ip bgp neighbors
        BGP neighbor is 2.2.2.2, remote AS 2, external link
        Datagrams (max data segment is 1460 bytes):
        R1#ping 2.2.2.2 source loop0 size 1500 df-bit
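
    The relationship between the MSS and the ping size is simple arithmetic. A small sketch, assuming the 40 bytes of overhead mentioned above (a 20-byte IPv4 header plus a 20-byte TCP header, no options); the function names are illustrative.

```python
TCP_IP_OVERHEAD = 40  # 20-byte IPv4 header + 20-byte TCP header, no options


def ping_size_for_mss(mss):
    """IP packet size to ping with (DF bit set) to exercise a full-size segment."""
    if mss <= 0:
        raise ValueError("MSS must be positive")
    return mss + TCP_IP_OVERHEAD


def mss_for_path_mtu(mtu):
    """MSS that results from a given path MTU."""
    if mtu <= TCP_IP_OVERHEAD:
        raise ValueError("MTU too small to carry a TCP segment")
    return mtu - TCP_IP_OVERHEAD


print(ping_size_for_mss(1460))  # the session above reports a 1460-byte max data segment
print(ping_size_for_mss(536))   # the default MSS when Path MTU Discovery is not used
```

    With the 1460-byte segment reported by the neighbor output, this gives the 1500-byte ping used on the slide.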

    Failed Peering: Notifications (Hold Time Expired)

    show ip bgp summary
        Watch R1's MsgRcvd counter for R2... it should be incrementing
    When did R1 last receive keepalive?

        R1#show ip bgp neighbors 2.2.2.2
        Last read 00:00:95, last write 00:00:44, hold time is 1
        keepalive interval is 60 seconds

    Speaker Flap: Case Study

    R2 sends a NOTIFICATION to R1

        %BGP-5-ADJCHANGE: neighbor 1.1.1.1 Down BGP Notification sent
        %BGP-3-NOTIFICATION: sent to neighbor 1.1.1.1 4/0 (hold time expired)

        R2#show ip bgp neighbor 1.1.1.1 | include last reset
        Last reset 00:01:02, due to BGP Notification sent, hold time expired

    There are lots of possibilities here
        Is R1 having a problem sending keepalives?
            CPU at 100%?
            Out of memory?
        Are the keepalives being lost in the cloud?
        Is R2 having a problem receiving the keepalive?

    Speaker Flap: Case Study

    Did R1 build and transmit a keepalive for R2?
        debug ip bgp keepalive
        show ip bgp neighbor
    When did we last send or receive data with the peer?

        R2#show ip bgp neighbors 1.1.1.1
        BGP neighbor is 1.1.1.1, remote AS 100, external link
        BGP version 4, remote router ID 1.1.1.1
        BGP state = Established, up for 00:12:49
        Last read 00:01:15, last write 00:00:44, hold time is 180,
        keepalive interval is 60 seconds

    R2 hasn't received a keepalive in more than "keepalive interval" seconds
    Time to check R1
        How is R1 on memory?
        What is R1's CPU load?
        Is R2's TCP window open?
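
    The "last read vs. keepalive interval" check can be automated. The sketch below assumes the IOS line format shown in the neighbor output above; the function names and the returned keys are illustrative.

```python
import re

_TIMERS = re.compile(
    r"Last read (\d+:\d+:\d+), last write (\d+:\d+:\d+), "
    r"hold time is (\d+), keepalive interval is (\d+) seconds",
    re.IGNORECASE,
)


def hms_to_seconds(text):
    """Convert an hh:mm:ss string to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def keepalive_status(neighbor_output):
    """Compare last read/write timers with the keepalive interval."""
    flat = " ".join(neighbor_output.split())  # undo line wrapping
    match = _TIMERS.search(flat)
    if match is None:
        return None
    last_read = hms_to_seconds(match.group(1))
    last_write = hms_to_seconds(match.group(2))
    keepalive = int(match.group(4))
    return {
        "last_read": last_read,
        "last_write": last_write,
        "hold_time": int(match.group(3)),
        "keepalive": keepalive,
        # Nothing heard from the peer for longer than one keepalive interval.
        "rx_overdue": last_read > keepalive,
        # Nothing written to the peer for longer than one keepalive interval.
        "tx_overdue": last_write > keepalive,
    }


sample = """
BGP state = Established, up for 00:12:49
Last read 00:01:15, last write 00:00:44, hold time is 180,
keepalive interval is 60 seconds
"""
print(keepalive_status(sample))
```

    For R2's output, `rx_overdue` is true (75 seconds since the last read against a 60-second interval) while `tx_overdue` is false, which points the investigation at R1 or the path from R1.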

    Speaker Flap: Case Study

        R1#show ip bgp sum | begin Neighbor
        Neighbor  MsgRcvd  MsgSent  TblVer  InQ  OutQ  Up/Down   State/PfxRcd
        2.2.2.2        53      284   10167    0    97  00:01:20  0

        R1#show ip bgp sum | begin Neighbor
        Neighbor  MsgRcvd  MsgSent  TblVer  InQ  OutQ  Up/Down   State/PfxRcd
        2.2.2.2        53      284   10167    0    98  00:02:24  0

    The two samples are at least one BGP keepalive interval apart
    The number of packets generated is increasing
        OutQ is incrementing due to keepalives
    The number of packets transmitted is not increasing
        MsgSent is not incrementing
    Something is "stuck" in the OutQ
    The keepalives aren't leaving R1!
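
    The reasoning on this slide (OutQ grows while MsgSent stays flat) can be expressed as a comparison of two counter samples. A sketch with hypothetical names; the samples are assumed to be taken at least one keepalive interval apart, as on the slide.

```python
from typing import NamedTuple


class PeerSample(NamedTuple):
    msg_rcvd: int
    msg_sent: int
    out_q: int


def keepalives_stuck(earlier, later):
    """True if messages are being queued for the peer but none are leaving."""
    queued_more = later.out_q > earlier.out_q
    sent_nothing = later.msg_sent == earlier.msg_sent
    return queued_more and sent_nothing


# Counters for neighbor 2.2.2.2 from the two outputs above.
first = PeerSample(msg_rcvd=53, msg_sent=284, out_q=97)
second = PeerSample(msg_rcvd=53, msg_sent=284, out_q=98)
print(keepalives_stuck(first, second))
```

    A healthy peer would show MsgSent increasing between samples and OutQ draining back toward zero.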

    Speaker Flap: Case Study

    This is a layer 2 or 3 transport issue, etc.
    BGP OPENs and keepalives are small
    UPDATEs can be much larger
    So maybe small packets work but larger packets do not?

        R1#ping 2.2.2.2
        Type escape sequence to abort.
        Sending 5, 100-byte ICMP Echos to 2.2.2.2, timeout is 2 seconds:
        !!!!!
        Success rate is 100 percent (5/5), round-trip min/avg/max = 16/21/24 ms

        R1#ping 2.2.2.2 size 1500 df-bit
        Type escape sequence to abort.
        Sending 5, 1500-byte ICMP Echos to 2.2.2.2, timeout is 2 seconds:
        Packet sent with the DF bit set
        .....
        Success rate is 0 percent (0/5)


    Bestpath Algorithm

    Best Path Algorithm

    Quick bestpath review
        Remember BGP only advertises one path per prefix... the bestpath
        Cannot advertise path from one iBGP peer to another
    Bestpath selection process is a little lengthy
    First eliminate paths that are ineligible for bestpath

    #  Ineligible path       Notes
    1  Not synchronized      Only happens if sync is configured AND the route isn't in your IGP
    2  Inaccessible NEXTHOP  IGP does not have a route to the BGP NEXTHOP
    3  Received-only paths   Happens if soft-reconfig inbound is applied. A path will be received-only if it was denied/modified by inbound policy.

    Best Path Algorithm

    #   Attribute              Rule           Notes
    1   Weight                 Highest wins   Scope is router only
    2   LOCAL_PREFERENCE       Highest wins   Scope is AS only
    3   Locally Originated                    Redistribution or network statement favored over aggregate-address
    4   AS_PATH                Shortest wins  Skipped if bgp bestpath as-path ignore is configured
                                              AS_SET counts as 1
                                              CONFED parts do not count
    5   ORIGIN                 Lowest wins    IGP < EGP < Incomplete
    6   MED                    Lowest wins    MEDs are compared only if the first AS in the AS_PATH is the same
    7   eBGP over iBGP
    8   Metric to Next Hop     Lowest wins    IGP cost to the BGP NEXTHOP
    9   Multiple Paths in RIB                 Flag path as multipath if max-paths is configured
    10  Oldest External        Wins           Unless bgp bestpath compare-routerid is configured
    11  BGP Router ID          Lowest
    12  CLUSTER_LIST           Smallest       Shorter CLUSTER_LIST wins
    13  Neighbor Address       Lowest         Lowest neighbor address
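
    To make the ordering concrete, here is a simplified pairwise comparison that follows the table above. It is a teaching sketch, not a model of any router's implementation: the `Path` fields and function names are hypothetical, steps 9 (multipath flagging) and 10 (oldest external) are skipped, and the AS_PATH is assumed to be pre-normalised (an AS_SET as one entry, confederation parts removed).

```python
from dataclasses import dataclass
from ipaddress import ip_address

ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}


@dataclass(frozen=True)
class Path:
    neighbor: str            # peer address the path was learned from
    router_id: str           # BGP router ID of that peer
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    as_path: tuple = ()      # normalised AS numbers (see assumptions above)
    origin: str = "igp"
    med: int = 0
    ebgp: bool = True
    igp_metric: int = 0      # IGP cost to the BGP NEXTHOP
    cluster_list_len: int = 0


def _first_difference(a, b, keys):
    """Return the winner at the first key where a and b differ, else None."""
    for key in keys:
        ka, kb = key(a), key(b)
        if ka != kb:
            return a if ka < kb else b  # lower key wins
    return None


def prefer(a, b):
    """Pick the better of two eligible paths for the same prefix."""
    winner = _first_difference(a, b, [
        lambda p: -p.weight,                        # 1: highest weight
        lambda p: -p.local_pref,                    # 2: highest LOCAL_PREFERENCE
        lambda p: 0 if p.locally_originated else 1, # 3: locally originated
        lambda p: len(p.as_path),                   # 4: shortest AS_PATH
        lambda p: ORIGIN_RANK[p.origin],            # 5: lowest ORIGIN
    ])
    if winner is not None:
        return winner
    # 6: MED is compared only when both paths start with the same AS.
    if a.as_path[:1] == b.as_path[:1] and a.med != b.med:
        return a if a.med < b.med else b
    winner = _first_difference(a, b, [
        lambda p: 0 if p.ebgp else 1,               # 7: eBGP over iBGP
        lambda p: p.igp_metric,                     # 8: lowest metric to next hop
        lambda p: int(ip_address(p.router_id)),     # 11: lowest router ID
        lambda p: p.cluster_list_len,               # 12: shortest CLUSTER_LIST
        lambda p: int(ip_address(p.neighbor)),      # 13: lowest neighbor address
    ])
    return winner if winner is not None else a


long_but_preferred = Path("10.0.0.1", "1.1.1.1", local_pref=200, as_path=(100, 200, 300))
short = Path("10.0.0.2", "2.2.2.2", as_path=(100,))
print(prefer(long_but_preferred, short).neighbor)  # LOCAL_PREFERENCE decides before AS_PATH
```

    MED is handled outside the key lists because it only applies to some pairs of paths, which is also why a single sort key cannot represent the full algorithm.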

    Best Path Algorithm

    show ip bgp x.x.x.x bestpath
        Will show you only the bestpath for x.x.x.x
        Handy if you have lots of paths for a prefix

        R2#sh ip bgp 7.4.4.0/24 bestpath
        BGP routing table entry for 7.4.4.0/24, version 2
        Paths: (20 available, best #13, table Default-IP-Routing
        Flag: 0x820
          Not advertised to any peer
          100
            192.150.6.11 from 192.150.6.11 (192.150.6.11)
              Origin IGP, metric 0, localpref 100, valid, extern
        R2#

    show ip bgp x.x.x.x multipath
        Same concept but will show you all of the multipaths for x.x.x.x

    Best Path Algorithm

    IOS-XR has sh ip bgp x.x.x.x bestpath-compare
        Explains why the bestpath is the best


    BGP Table Version

    BGP Table Version

    Lots of things must happen when bestpaths change
        RIB must be notified
        Peers must be informed
        Must have a way to track who has been informed of which bestpath
    Prefix Table Version
        Each prefix has a 32 bit number that is its table version
        A prefix's table version is bumped for every bestpath change
        "Bumped" means the table version changes from the current version to the next available version #
        Assume 10.0.0.0/8 has a table version of #27 and the highest table version used by any prefix is #30. If 10.0.0.0/8 has a bestpath change, its table version will be bumped to #31.
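
    The bump rule can be modelled in a few lines. This is a toy model with hypothetical names, intended only to reproduce the #27 to #31 example; it ignores the 32-bit wrap and everything else a real BGP table does.

```python
class BgpTable:
    """Toy model: one version number per prefix plus the highest one in use."""

    def __init__(self, prefix_versions=None, highest=0):
        self.prefix_versions = dict(prefix_versions or {})
        self.highest = max([highest, *self.prefix_versions.values()])

    def bestpath_changed(self, prefix):
        """Bump the prefix to the next available table version."""
        self.highest += 1
        self.prefix_versions[prefix] = self.highest
        return self.highest


# 10.0.0.0/8 is at #27 and the highest version used by any prefix is #30.
table = BgpTable({"10.0.0.0/8": 27}, highest=30)
print(table.bestpath_changed("10.0.0.0/8"))
```

    Because every bestpath change consumes a new number, the highest version in the table doubles as a running count of bestpath changes.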

    BGP Table Version

    show ip bgp x.x.x.x will show you a prefix's table version

        R1#sh ip bgp 10.0.0.0
        BGP routing table entry for 10.0.0.0/8, version 31
        Paths: (1 available, best #1, table Default-IP-Routing-Table)
        Flag: 0x820
          Not advertised to any peer
          200
            2.2.2.2 from 2.2.2.2 (2.2.2.2)
              Origin IGP, metric 0, localpref 100, valid, external, best
        R1#

    BGP Table Version

    RIB & Peer Table Versions
        We have a table version for the RIB
        Also have a table version for each peer
        Used to keep track of which bestpath changes have been propagated
            If peer 1.1.1.1 has a table version of #60 this tells us we have informed 1.1.1.1 of all bestpath changes for prefixes with a table version <= #60
            If any prefix has a table version > #60 then we need to inform 1.1.1.1 of that prefix's bestpath
            Once 1.1.1.1 has been updated his table version will be updated accordingly
        Same concept for the RIB and its table version

    BGP Table Version

    show ip bgp summary is best for viewing RIB and peer version #s

        R2#show ip bgp summ
        BGP router identifier 2.2.2.2, local AS number 200
        BGP table version is 13, main routing table version 13
        3 network entries using 351 bytes of memory
        3 path entries using 156 bytes of memory

        Neighbor  V   AS  MsgRcvd  MsgSent  TblVer  InQ  OutQ  Up/Down   State/Pfx
        1.1.1.1   4  100     4386     4388      13    0     0  01:20:24  1
        R2#

    Highest table version of any prefix = main routing table version
        RIB is converged
    Highest table version of any prefix = the TblVer for 1.1.1.1
        1.1.1.1 is converged

    BGP Table Version

    Example
        Assume the highest table version of any prefix is #10
            The RIB has a table version of #10
                The RIB is up to date for all prefixes
            All peers have a table version of #10
                Our peers are currently converged
        5 prefixes experience a bestpath change
            Highest table version is now #15
        Inform the RIB of these 5 changes
            Do RIB adds, deletes, and/or modifies
            When complete, set the RIB table version to #15
        Inform our peers of these 5 changes
            Build updates and/or withdraws for each peer
            When complete, set our peers' table versions to #15
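
    The example above amounts to "send every prefix whose version is higher than the consumer's version, then move the consumer's version forward". A sketch with made-up prefixes and hypothetical function names; the same helper serves the RIB and each peer.

```python
def pending_prefixes(prefix_versions, consumer_version):
    """Prefixes whose bestpath change has not reached this consumer yet."""
    return sorted(p for p, v in prefix_versions.items() if v > consumer_version)


def mark_converged(prefix_versions):
    """Version a consumer holds once it has been told about everything."""
    return max(prefix_versions.values(), default=0)


# Seven hypothetical prefixes; five of them just changed bestpath (#11 to #15).
versions = {
    "10.1.0.0/16": 4, "10.2.0.0/16": 9,
    "10.3.0.0/16": 11, "10.4.0.0/16": 12, "10.5.0.0/16": 13,
    "10.6.0.0/16": 14, "10.7.0.0/16": 15,
}
peer_version = 10

todo = pending_prefixes(versions, peer_version)
print(len(todo), "updates to build")
peer_version = mark_converged(versions)
print("peer now at version", peer_version)
```

    After the updates are built the peer sits at #15 and `pending_prefixes` returns an empty list, which is the "converged" state visible in show ip bgp summary.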

    BGP Table Version


    Why am I babbling about this?
    Gives you a way to know who has been informed about what
    Provides a way to tell how many bestpath changes your network is experiencing
    You have 150k routes and see the table version increase by 150k every minute: something is wrong!!
    You have 150k routes and see the table version increase by 300 every minute: normal network churn
    You should monitor the table version in your network to determine what is normal for you
    If the table version is increasing rapidly then that could explain why BGP Router and BGP IO are busy


    Initial Convergence

    BGP Convergence


    Hey, who are you calling slow?
    Two general convergence situations
    Initial startup
    Periodic route changes

    Convergence: Initial Startup


    Initial convergence happens when:

    A router boots

    RP failover
    clear ip bgp *
    How long initial convergence takes is a factor of the amount of work to be done and the router/network's ability to do this fast and efficiently

    Convergence: Initial Startup


    Initial convergence can be stressful: if you are approaching BGP scalability limits, this is where you will see issues.

    Convergence: Initial Startup


    What work needs to be done?

    1) Accept routes from all peers
    Not too difficult
    2) Calculate bestpaths
    This is easy
    3) Install bestpaths in the RIB
    Also fairly easy
    4) Advertise bestpaths to all peers
    This can be difficult and may take several minutes depending on the following variables

    Convergence: Key Variables


    BGP Variables

    The number of routes
    The number of peers
    The number of update-groups
    The ability to advertise routes to each update-group efficiently

    Router Variables
    CPU horsepower
    Code version
    Outbound Interface Bandwidth

    Convergence: UPDATE Packing


    An UPDATE contains a set of Attributes and a list of prefixes (NLRI)
    BGP starts an UPDATE by building an attribute set
    BGP then packs as many destinations (NLRIs) as it can into the UPDATE
    NLRI = Network Layer Reachability Information
    Only NLRI with a matching attribute set can be placed in the UPDATE
    NLRI are added to the UPDATE until it is full (4096 bytes max)
    UPDATE Packing refers to how efficiently an implementation packs NLRI into UPDATEs
    Least efficient: BGP only puts one NLRI per UPDATE
    Most efficient: BGP puts all NLRI with a certain Attribute set in one UPDATE

    [Figure: Least Efficient, three UPDATEs that each carry MED 50, Origin IGP and a single NLRI (10.1.1.0/24, 10.1.2.0/24, 10.1.3.0/24). Most Efficient, one UPDATE that carries MED 50, Origin IGP and all three NLRI.]
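As a rough illustration of the packing described above, the sketch below counts how many UPDATEs are needed with and without packing. The fixed overhead (19-byte BGP header plus the two 2-byte length fields) and the NLRI encoding (one length byte plus the prefix bytes) follow RFC 4271; the 30-byte attribute size and the function names are assumptions made for this example.

```python
import math
from collections import defaultdict

MAX_MSG = 4096        # maximum UPDATE size quoted on the slide
FIXED = 19 + 2 + 2    # BGP header + withdrawn-routes length + path-attribute length


def nlri_size(prefix):
    # One length byte plus just enough bytes to hold the prefix bits.
    plen = int(prefix.split("/")[1])
    return 1 + math.ceil(plen / 8)


def count_updates(routes, attr_bytes, pack=True):
    """routes: (prefix, attribute_set) pairs; attr_bytes: assumed encoded attribute size."""
    groups = defaultdict(list)
    for prefix, attrs in routes:
        groups[attrs].append(prefix)      # only NLRI sharing an attribute set can share an UPDATE

    updates = 0
    for prefixes in groups.values():
        if not pack:
            updates += len(prefixes)      # least efficient: one NLRI per UPDATE
            continue
        room = MAX_MSG - FIXED - attr_bytes
        used = 0
        updates += 1
        for p in prefixes:
            size = nlri_size(p)
            if used + size > room:        # UPDATE is full, start another one
                updates += 1
                used = 0
            used += size
    return updates


attrs = ("MED 50", "Origin IGP")
three = [("10.1.1.0/24", attrs), ("10.1.2.0/24", attrs), ("10.1.3.0/24", attrs)]
many = [(f"10.{i // 256}.{i % 256}.0/24", attrs) for i in range(2000)]

unpacked = count_updates(three, attr_bytes=30, pack=False)
packed = count_updates(three, attr_bytes=30)
packed_many = count_updates(many, attr_bytes=30)
```

With these assumptions the three prefixes need three UPDATEs unpacked and one packed, and 2000 /24s sharing one attribute set fit in two UPDATEs.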

    Convergence: UPDATE Packing

    The fewer attribute sets you have the better
    More NLRI will share an attribute set
    Fewer UPDATEs to converge
    Things you can do to reduce attribute sets
    next-hop-self for all iBGP sessions
    Don't accept/send communities you don't need
    Use cluster-id to put RRs in the same POP in a cluster
    To see how many attribute sets you have

    show ip bgp summary
    190844 network entries using 21565372 bytes of memory
    302705 path entries using 15740660 bytes of memory
    57469/31045 BGP path/bestpath attribute entries using 620665... bytes of memory

    Convergence: TCP MSS (Max Segment Size)


    TCP MSS (max segment size) is also a factor in convergence times. The larger the MSS, the fewer TCP packets it takes to transport the BGP updates. Fewer packets means less overhead and faster convergence.

    [Figure: Default MSS, the BGP UPDATE (Attribute plus NLRIs) is split into two TCP packets, each with its own IP Header and TCP Header. Increased MSS, the entire BGP UPDATE can fit in one TCP packet.]

    Convergence: TCP MSS (Max Segment Size)


    MSS = Max Segment Size
    Limit on packet size for a TCP socket
    536 bytes by default
    Path MTU Discovery
    Finds smallest MTU between R1 and R2
    Subtract 40 bytes for TCP/IP overhead
    Enabled by default for BGP
    neighbor 2.2.2.2 transport path-mtu-discovery disable
    To find the MSS

    R1#sh ip bgp neighbors
    BGP neighbor is 2.2.2.2, remote AS 3, external link
    Datagrams (max data segment is 1460 bytes):
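The MSS arithmetic can be checked with a short Python sketch. It assumes IPv4 and TCP headers without options (20 bytes each) and a 1500-byte path MTU; the function names are made up for this example.

```python
import math

TCP_IP_OVERHEAD = 40   # 20-byte IPv4 header + 20-byte TCP header, no options


def mss_from_mtu(path_mtu):
    # Path MTU Discovery finds the smallest MTU; the MSS is what is left for data.
    return path_mtu - TCP_IP_OVERHEAD


def segments_needed(message_bytes, mss):
    return math.ceil(message_bytes / mss)


default_segments = segments_needed(4096, 536)                  # default MSS
pmtud_segments = segments_needed(4096, mss_from_mtu(1500))     # MSS of 1460
```

Under these assumptions a full 4096-byte UPDATE needs 8 segments at the default MSS and 3 segments at an MSS of 1460.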

    Convergence: Update Groups

    BGP must create updates based on the policies towards each peer
    Peers with a common outbound policy are members of the same update-group
    iBGP vs. eBGP
    Outbound route-map, prefix-lists, etc
    UPDATEs are generated for one member of an update-group and then replicated to the other members

    [Figure: Less Efficient, two peers in two update-groups, so the UPDATE (Attribute plus NLRI) is generated twice. More Efficient, two peers in the same update-group, so one UPDATE is generated and replicated.]
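One way to picture update-groups is as peers bucketed by their outbound policy. The sketch below is a simplification: it keys on session type, route-map and prefix-list only, whereas a real implementation compares more settings. The peer addresses and policy names are hypothetical.

```python
from collections import defaultdict


def build_update_groups(peers):
    """peers: dict of peer address -> dict describing its outbound policy."""
    groups = defaultdict(list)
    for addr, policy in peers.items():
        # Peers share a group only when every outbound knob in the key matches.
        key = (policy["session"], policy.get("route_map"), policy.get("prefix_list"))
        groups[key].append(addr)
    return dict(groups)


peers = {
    "10.0.0.1": {"session": "ibgp"},
    "10.0.0.2": {"session": "ibgp"},
    "192.0.2.1": {"session": "ebgp", "route_map": "CUSTOMER-OUT"},
    "192.0.2.5": {"session": "ebgp", "route_map": "CUSTOMER-OUT"},
    "192.0.2.9": {"session": "ebgp", "route_map": "PEER-OUT"},
}
groups = build_update_groups(peers)
```

Five peers collapse into three groups here, so UPDATEs are built three times rather than five; giving the last peer the same route-map as the other eBGP peers would bring that down to two.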

    Convergence: Dropping TCP Acks

    Primarily an issue on RRs (Route Reflectors) with
    One or two interfaces connecting to the core
    Hundreds of RRCs (Route Reflector Clients)
    RR sends out tons of UPDATEs to RRCs
    RRCs send TCP ACKs
    RR core facing interface(s) receive huge wave of TCP ACKs

    [Figure: the RR sends BGP UPDATEs to the RRCs; the RRCs send TCP ACKs back to the RR.]

    Convergence: Dropping TCP Acks


    Interface input queue fills up and TCP ACKs are dropped
    Each time a TCP packet is dropped, the session goes into slow start
    It takes a good deal of time for a TCP session to come out of slow start
    Increase the input queue

    hold-queue 1000 in

    If you still see drops increase to 4096

    Convergence


    How do you know if BGP has converged?

    Watch the global table version

    Increases by 1 for every bestpath change
    In the lab: Table version stabilizes
    In the real world: Reaches your normal rate of change
    Watch peer InQ and OutQs
    Wait for all InQ and OutQs to be empty
    To list peers with non-empty queues
    show ip bgp summ | e 0 0
    Watch peer table versions
    show ip bgp summ
    If peer table version == global table version and InQ/OutQ empty, BGP has converged for that peer
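The per-peer check can be automated against captured `show ip bgp summary` text. The sketch below assumes the IOS column layout shown earlier (TblVer, InQ and OutQ as the sixth to eighth columns); the second neighbor row is invented to show a peer that has not converged, and the function name is hypothetical.

```python
import re

SAMPLE = """\
BGP router identifier 2.2.2.2, local AS number 200
BGP table version is 13, main routing table version 13
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
1.1.1.1         4   100    4386    4388       13    0    0 01:20:24        1
3.3.3.3         4   300    1200    1250       11    0    7 00:02:10        5
"""


def unconverged_peers(summary):
    table_version = int(re.search(r"BGP table version is (\d+)", summary).group(1))
    lagging = []
    for line in summary.splitlines():
        fields = line.split()
        # Neighbor rows start with an IPv4 address and have at least 8 columns.
        if len(fields) >= 8 and re.fullmatch(r"\d+(\.\d+){3}", fields[0]):
            tbl_ver, in_q, out_q = int(fields[5]), int(fields[6]), int(fields[7])
            if tbl_ver != table_version or in_q or out_q:
                lagging.append(fields[0])
    return lagging


lagging = unconverged_peers(SAMPLE)
```

Here `lagging` contains only 3.3.3.3, whose table version trails the global version and whose OutQ is not empty.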

    Convergence: Initial Convergence Summary

    Initial convergence time is a factor of the amount of work that needs to be done and the router/network's ability to do this fast and efficiently
    Reduce the number of attribute sets in BGP
    Use next-hop-self, don't send communities you don't need, etc.
    Reduce the number of unique outbound policies towards all peers
    Try to find a small set of common policies, rather than individualizing policies

    The fewer update-groups the better

    MSS/PMTU

    Efficient packaging of BGP messages in TCP

    Stop TCP ACK drops

    Increase interface input queues on RRs


    Periodic Convergence

    Convergence: Route Changes

    There are 2 elements to route change convergence for BGP
    Failure Detection
    How long does it take to see the failure? (t0 to t1)
    Convergence
    How long does it take to process and propagate information about the failure? (t1 to t2)

    [Figure: timeline with the Failure at t0, Process and Propagate starting at t1, and Recovery at t2.]

    Convergence: Route Changes


    Time to Detect Failure
    Address Tracking Filter
    Nexthop Tracking
    Peer Down Detection
    Time to Respond to Failure
    MRAI (Min Route Advertisement Interval)
    Advertising the new information

    Convergence: Address Tracking Filter


    Quick ATF review

    ATF = Address Tracking Filter

    ATF is a middle man between the RIB and RIB clients
    BGP, OSPF, EIGRP, etc are clients of the RIB
    A client tells ATF what prefixes it is interested in
    ATF tracks each prefix
    Notify the client when the route to a registered prefix changes
    Client is responsible for taking action based on ATF notification
    Provides a scalable event driven model for dealing with RIB changes

    Convergence: Nexthop Tracking


    BGP nexthop tracking

    Relies on ATF

    Event driven convergence model

    Register NEXTHOPs with ATF
    10.1.1.3
    10.1.1.5
    ATF filters out changes for 10.1.1.1/32, 10.1.1.2/32, and 10.1.1.4/32
    BGP has not registered for these
    Changes to 10.1.1.3/32 and 10.1.1.5/32 are passed along to BGP
    Recompute bestpath for prefixes that use these NEXTHOPs
    No need to wait for BGP Scanner

    [Figure: the RIB holds 10.1.1.1/32 through 10.1.1.5/32; ATF sits between the RIB and BGP and passes along only the registered BGP NEXTHOPs, 10.1.1.3 and 10.1.1.5.]
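The register-and-notify pattern behind nexthop tracking can be sketched as follows. This is a toy model: the class and method names are hypothetical, and it matches registered prefixes exactly, whereas a real RIB would track the route that covers each nexthop.

```python
import ipaddress


class AddressTrackingFilter:
    """Toy event filter sitting between a RIB and its clients (hypothetical API)."""

    def __init__(self):
        self.registrations = {}            # network -> list of callbacks

    def register(self, prefix, callback):
        net = ipaddress.ip_network(prefix)
        self.registrations.setdefault(net, []).append(callback)

    def rib_change(self, prefix):
        # Only registered prefixes generate a notification; the rest are filtered out.
        net = ipaddress.ip_network(prefix)
        callbacks = self.registrations.get(net, [])
        for cb in callbacks:
            cb(str(net))
        return bool(callbacks)


recomputed = []                            # stands in for BGP re-running bestpath
atf = AddressTrackingFilter()
for nexthop in ("10.1.1.3/32", "10.1.1.5/32"):
    atf.register(nexthop, recomputed.append)

for changed in ("10.1.1.1/32", "10.1.1.2/32", "10.1.1.3/32", "10.1.1.4/32", "10.1.1.5/32"):
    atf.rib_change(changed)
```

Of the five RIB changes, only the two registered nexthops reach the BGP callback, so BGP reacts to events instead of polling the whole table.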

    Convergence: Nexthop Tracking

    Enabled by default
    [no] bgp nexthop trigger enable
    BGP registers all nexthops with ATF
    show ip bgp attr next-hop ribfilter
    Trigger delay is configurable
    bgp nexthop trigger delay
    5 seconds by default
    Debugs
    debug ip bgp events nexthop
    debug ip bgp rib-filter

    Convergence: Peer Down Detection


    BGP must learn that the peer is down
    Default keepalive/holdtime values are 60 seconds and 180 seconds
    My 2c: use a 3 second KA with a 9 second holdtime
    Tune your IGP to converge in under 9 seconds
    Use BFD (bidirectional forwarding detection) if you need to be more aggressive
    eBGP directly connected
    bgp fast-external-fallover
    If the interface goes down so does the eBGP peer
    Reduce carrier-delay settings
    0 msec for down
    100 msec for up
    eBGP multihop
    Relies on holdtime or BFD

    Convergence: Peer Down Detection


    iBGP peers

    Relies on holdtime or BFD

    BFD on iBGP peers
    Know how fast your IGP converges!
    Your BFD dead timer must be greater than that amount
    iBGP peer down detection isn't as critical as eBGP. Why?
    IGP should be tuned to converge quickly
    Fast IGP + BGP Nexthop Tracking = BGP reacts quickly to nexthop changes
    BGP can route around a change in the core prior to bringing down iBGP peers
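The rule about the BFD dead timer is simple arithmetic, sketched below. The dead timer is taken as transmit interval times multiplier; the interval, multiplier and IGP convergence figures are made-up values, and the function names are hypothetical.

```python
def bfd_detection_ms(interval_ms, multiplier):
    # The session is declared down after `multiplier` consecutive missed packets.
    return interval_ms * multiplier


def safe_for_ibgp(interval_ms, multiplier, igp_convergence_ms):
    # The rule from the slide: the BFD dead timer must exceed IGP convergence time,
    # otherwise a core reroute tears down iBGP sessions that would have survived.
    return bfd_detection_ms(interval_ms, multiplier) > igp_convergence_ms


relaxed = safe_for_ibgp(300, 3, igp_convergence_ms=500)     # 900 ms dead timer
aggressive = safe_for_ibgp(50, 3, igp_convergence_ms=500)   # 150 ms dead timer
```

With an IGP that converges in 500 ms, the 900 ms dead timer passes the check and the 150 ms one does not.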

    Convergence: Fast Session Deactivation

    Fast Session Deactivation
    neighbor x.x.x.x fall-over
    Register peer's address with ATF
    ATF informs BGP of routing changes to the peer
    When we lose our route to the peer, bring the peer down.
    No need to wait for holdtime to expire
    Primary use case is eBGP multihop

    [Figure: eBGP multihop example in three numbered steps, ending with FSD bringing the peer down.]

    Convergence: Fast Session Deactivation


    Very dangerous for iBGP peers

    IGP may not have a route to a peer for a split second

    FSD would tear down the BGP session

    Imagine if you lose your IGP route to your RR (Route Reflector) for just 100ms
    Every RR to RRC session would flap
    Off by default
    neighbor x.x.x.x fall-over

    Convergence: FSD vs. BFD


    Why do we have both?
    FSD was developed first
    Goal was fast BGP neighbor detection without the expense of fast keepalives
    BFD came later
    Goal was fast neighbor detection for multiple protocols
    Fast keepalives not as much of a concern
    BFD KAs are generated by linecards
    CPUs are also much faster today
    FSD
    Relies on control plane (absence of a route in the RIB) to tear down the peer
    We could have a route but not have connectivity
    BFD
    Relies on forwarding plane to detect down peer
    If we lose connectivity, the peer comes down

    Convergence: MRAI (Minimum Route Advertisement Interval)


    How is the timer enforced for peer X?

    Timer starts when all routes have been advertised to X

    For the next MRAI (seconds) we will not propagate any bestpath changes to peer X
    Once X's MRAI timer expires, send it updates and withdraws
    Restart the timer and the process repeats
    User may see a wave of updates and withdraws to peer X every MRAI seconds
    User will NOT see a delay of MRAI between each individual update and/or withdraw
    BGP would never converge if this were the case

    Convergence: MRAI


    MRAI timeline for BGP peer w/ MRAI of 5 seconds

    T0: The big bang
    T7: Bestpath Change #1
    UPDATE sent immediately
    MRAI timer starts, will expire at T12
    T10: Bestpath Change #2
    Must wait until T12 for MRAI to expire
    T12: MRAI expires
    Bestpath Change #2 is transmitted
    MRAI timer starts, will expire at T17
    T17: MRAI expires
    No pending UPDATEs

    [Figure: timeline from t0 past t15, marking Bestpath Change #1 with TX update #1 and Start MRAI, Bestpath Change #2, then MRAI Expires with TX update #2 and Start MRAI.]
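The timeline can be reproduced with a small simulation. This is a sketch of the behavior as described on these slides (send immediately when the timer is idle, otherwise hold changes until expiry and send them as one batch), not a model of any particular implementation; `mrai_schedule` is a hypothetical name.

```python
def mrai_schedule(change_times, mrai):
    """Return (send_time, [change_times]) batches for one peer."""
    batches = []
    timer_expiry = None          # None means the MRAI timer is not running
    pending = []
    for t in sorted(change_times):
        # Handle any timer expiry that happened at or before this change.
        while timer_expiry is not None and t >= timer_expiry:
            if pending:
                batches.append((timer_expiry, pending))
                pending = []
                timer_expiry += mrai      # sending restarts the timer
            else:
                timer_expiry = None       # expired with nothing to send
        if timer_expiry is None:
            batches.append((t, [t]))      # timer idle: send immediately
            timer_expiry = t + mrai
        else:
            pending.append(t)             # timer running: hold the change
    if pending:
        batches.append((timer_expiry, pending))
    return batches


slide_example = mrai_schedule([7, 10], mrai=5)
burst_example = mrai_schedule([7, 10, 11, 20], mrai=5)
```

`slide_example` sends change #1 at T7 and change #2 at T12, matching the timeline. `burst_example` shows the wave effect: the changes at T10 and T11 go out together at T12, and the change at T20 goes out immediately because the timer had expired at T17 with nothing pending.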

    Convergence: MRAI

    BGP is not a link state protocol, it is path vector
    May take several rounds/cycles of exchanging updates and withdraws for the network to converge
    MRAI must expire between each round!
    The more fully meshed the network and the more tiers of ASes, the more rounds required for convergence
    Think about
    How many tiers of ASes there are in the Internet
    How meshy peering can be in the Internet

    Convergence: MRAI


    Internet churn means we are constantly setting and waiting on MRAI
    One flapping prefix slows convergence for all prefixes
    Internet table sees roughly 6 bestpath changes per second
    For iBGP and PE-CE eBGP peers
    neighbor x.x.x.x advertisement-interval 0

    Has been the default since 12.0(32)S

    For regular eBGP peers

    Lowering to 0 may get you dampened

    OK to lower for eBGP peers if they are not using dampening


    High CPU Utilization

    High Utilization


    Define High
    Know what normal CPU utilization is for the router in question
    Is the CPU spiking due to BGP Scanner or is it constant?
    Look at the scenario
    Is BGP going through Initial Convergence?
    If not then route churn is the usual culprit
    Illegal recursive lookup or some other factor causes bestpath changes for the entire table

    Router#show process cpu
    CPU utilization for five seconds: 100%/0%; one minute: 99%; five min
    ....
    139 6795740 1020252 6660 88.34% 91.63% 74.01% 0 BGP Router

    High Utilization


    How to identify route churn?

    Do sh ip bgp summary, note the table version
    Wait 60 seconds
    Do sh ip bgp summary, compare the table version from 60 seconds ago
    You have 150k routes and see the table version increase by 300
    This is probably normal route churn
    Know how many bestpath changes you normally see per minute
    You have 150k routes and see the table version increase by 150k
    This is bad and is the cause of your high CPU
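The two-sample comparison is easy to script once the command output has been captured as text. In the sketch below, the captured strings are invented, and the `tolerance` multiplier used to flag abnormal churn is an arbitrary choice for illustration; pick a threshold from your own baseline.

```python
import re


def table_version(summary_text):
    return int(re.search(r"BGP table version is (\d+)", summary_text).group(1))


def churn_per_minute(first, second, seconds_between=60):
    delta = table_version(second) - table_version(first)
    return delta * 60 / seconds_between


def looks_abnormal(rate, baseline_per_minute, tolerance=10):
    # `tolerance` is an assumed multiplier, not a recommended value.
    return rate > baseline_per_minute * tolerance


before = "BGP table version is 1000300, main routing table version 1000300"
after_normal = "BGP table version is 1000600, main routing table version 1000600"
after_bad = "BGP table version is 1150300, main routing table version 1150300"

normal_rate = churn_per_minute(before, after_normal)   # 300 changes per minute
bad_rate = churn_per_minute(before, after_bad)         # 150000 changes per minute
```

Against a baseline of 300 changes per minute, `normal_rate` is not flagged and `bad_rate` is.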

    High Utilization


    What causes massive table version changes?

    Flapping peers

    Hold-timer expiring?

    Corrupt UPDATE?

    Route churn

    Don't try to troubleshoot the entire BGP table at once

    Identify one prefix that is churning and troubleshoot that one prefix

    Will likely fix the problem with the rest of the BGP table churn

    High Utilization


    Table Version Changing Rapidly: A Little Lab Fun

    RP/0/RP0/CPU0:XR#sh route | include 00:00:
    Wed Apr 27 13:53:40.201 EDT
    O 1.0.0.0/30 [110/3] via 10.1.2.1, 00:00:00, GigabitEthernet0/0/0/
    O 1.0.0.4/30 [110/3] via 10.1.2.1, 00:00:00, GigabitEthernet0/0/0/
    O 1.0.0.8/30 [110/3] via 10.1.2.1, 00:00:00, GigabitEthernet0/0/0/
    O 1.0.0.12/30 [110/3] via 10.1.2.1, 00:00:00, GigabitEthernet0/0/0
    ...

    4 seconds later

    RP/0/RP0/CPU0:XR#sh route | include 00:00:
    Wed Apr 27 13:53:44.162 EDT
    B 1.0.0.0/30 [20/2] via 1.1.1.1, 00:00:01
    B 1.0.0.4/30 [20/2] via 1.1.1.1, 00:00:01
    B 1.0.0.8/30 [20/2] via 1.1.1.1, 00:00:01
    B 1.0.0.12/30 [20/2] via 1.1.1.1, 00:00:01
    ...

    High Utilization

    Table Version Changing Rapidly: A Little Lab Fun


    RP/0/RP0/CPU0:aggies#sh ip bgp 1.0.0.4
    Wed Apr 27 14:00:36.066 EDT
    ...
    Last Modified: Apr 27 14:00:35.387 for 00:00:00
    Paths: (1 available, no best path)
    ...
    100
    1.1.1.1 (inaccessible) from 1.1.1.1 (1.1.1.1)
    ...

    3 seconds later

    RP/0/RP0/CPU0:aggies#sh ip bgp 1.0.0.4
    Wed Apr 27 14:00:38.710 EDT
    ...
    Last Modified: Apr 27 14:00:38.387 for 00:00:00
    Paths: (1 available, no best path)
    ...
    1.1.1.1 (metric 2) from 1.1.1.1 (1.1.1.1)
    ...

    1.1.1.1 (NH) flapping

    High Utilization


    Something is wrong with NEXTHOP 1.1.1.1

    Flip flops between inaccessible and accessible with an IGP cost of 2

    Troubleshoot 1.1.1.1 and the churning will stop


    Layer 3 VPNs

    Layer 3 VPNs


    Troubleshooting Checklist

    #1 PE1 to PE2 core connectivity
    Verify you can ping from loopback to loopback
    Verify you can mpls ping from loopback to loopback
    PE loopbacks must be /32
    Check IGP
    Check LDP
    #2 PE1 to CE1 and PE2 to CE2 connectivity
    Can each PE ping their directly connected CE?
    Remember to do ping vrf FOO x.x.x.x

    Layer 3 VPNs


    #3 PE to PE vrf connectivity
    Can PEs ping the vrf interface of the other PE?
    If not, double check your import/export Route Targets
    #4 PE to CE connectivity
    Verify each PE can ping the CE connected to the other PE
    #5 CE to CE connectivity
    At this point you should be able to ping CE to CE


    Looking Glasses

    The Internet: BGP Looking Glasses

    You are advertising your address space to your ISPs
    Q: How can you verify they are receiving it?
    Q: How can you verify the rest of the Internet is receiving it?

    A: BGP Looking Glasses

    BGP Looking Glass servers are computers on the Internet running one of a variety of publicly available Looking Glass software implementations. A Looking Glass server (or LG server) is accessed remotely for the purpose of viewing routing info. Essentially, the server acts as a limited, read-only portal to routers of whatever organization is running the Looking Glass server. Typically, publicly accessible looking glass servers are run by ISPs or NOCs.

    http://www.bgp4.as

    The Internet: BGP Looking Glasses

    https://www.sprint.


    Show bgp route 72.16

    72.163.0.0/20

    The Internet: BGP Looking Glasses

    host$ nslookup www.cisc
    ...
    Address: 72.163.4.161
    host$

    http://whois.arin.net/ui

    The Internet: BGP Looking Glasses

    Huge list of looking glasses here

    http://www.bgp4.as/looking-glasses

    The Level3 looking glass will translate AS #s to company names

    The Internet: BGP Looking Glasses

    AS-PATH: 3549 6327

    AS-PATH Translation: GBLX SHAWFIBER

    The Internet: Whose AS is that anyway?

    Long list here

    http://bgp.potaroo.net/cidr/autnums.html

    Or lookup a specific AS

    http://whois.arin.net/rest/asn/AS1239/pft

    The University's Route Views project was originally conceived as a tool for Internet operators to obtain real-time information about the global routing system from the perspectives of several different backbones and locations around the Internet. Although other tools handle related tasks, such as the various Looking Glass Collections (see e.g. NANOG, or the DTI NSPIXP-2 Looking Glass), they typically either provide only a constrained view of the routing system (e.g., either a single provider, or the route server) or they do not provide real-time access to routing data.

    While the Route Views project was originally motivated by interest on the part of operators in determining how the global routing system viewed their prefixes and/or AS space, there have been many other interesting uses of this Route Views data. For example, NLANR has used Route Views data for AS path visualization (see also NLANR), and to study IPv4 address space utilization (archive). Others have used Route Views data to map IP addresses to origin AS for various topological studies. CAIDA has used it in conjunction with the NetGeo database in generating geographic locations for hosts, functionality that both CoralReef and the Skitter project support.

    University of Oregon Route

    http://www.r
