Troubleshooting the Cisco Nexus 5000 / 2000 Series Switches (BRKCRS-3145)
Source: d2zmdbbm9feqrf.cloudfront.net/2011/las/pdf/BRKCRS-3145.pdf
Post on 06-Mar-2018
© 2011 Cisco and/or its affiliates. All rights reserved. Cisco Public. BRKCRS-3145
Objectives
Be able to quickly isolate problematic nodes in the datacenter
Become familiar with troubleshooting in NX-OS
Understand Nexus 5000 and Nexus 2000 platform details
Gain comfort using Nexus 5000 and Nexus 2000 day to day
Troubleshooting Nexus 5000 / 2000
Problem Isolation
Network Diagrams
Types of logging
Outputs
When to call TAC
Platform Overview and troubleshooting
Redundancy operation and troubleshooting
Problem Isolation
“A problem well stated is a problem half solved”
Source: Charles F. Kettering, Engineer and Inventor
Troubleshooting Tool #1
A current, accurate diagram
Physical ports
Logical ports
Spanning-tree root and blocked ports
Helpful to use standard formats
.jpg, .bmp, .pdf
If you cannot describe how your network should be operating, time may be wasted
[Figure: example network diagram showing N7k-1/N7k-2 and N5k-1 through N5k-5, annotated with vPC domains 100, 101 and 102, their peer-link port-channels (Po100, Po101, Po102) and member ports, the vPC peer-keepalive link, the RSTP root, and an STP-blocked port.]
Grab a “show tech-support”
Sometimes too general
Large file, time consuming
If time permits, use targeted outputs or a specific show tech
If there is no time, use tac-pac and copy off
Much quicker than transmitting to terminal
Zips entire output to file in volatile:
Copy file off of switch for analysis
Or not…
N5k-1# tac-pac
N5k-1# dir volatile:
180242 Jan 28 4:37:26 2011 show_tech_out.gz
Which show tech?
As of 5.0(3), there are 68
N5k-1# show tech-support ?
aaa Display aaa information
aclmgr ACL commands
adjmgr Display Adjmgr information
arp Display ARP information
ascii-cfg Show ascii-cfg information for technical support personnel
assoc_mgr Gather detailed information for assoc_mgr troubleshooting
bcm-usd Gather detailed information for BCM USD troubleshooting
bootvar Gather detailed information for bootvar troubleshooting
brief Display the switch summary
btcm Gather detailed information for BTCM component
callhome Callhome troubleshooting information
cdp Gather information for CDP trouble shooting
...
session-mgr Gather information for troubleshooting session manager
snmp Gather info related to snmp
sockets Display sockets status and configuration
spm Service Policy Manager
stp Gather detailed information for STP troubleshooting
sysmgr Gather detailed information for sysmgr troubleshooting
time-optimized Gather tech-support faster, requires more memory & disk space
track Show track tech-support information
vdc Gather detailed information for VDC troubleshooting
vpc Gather detailed information for VPC troubleshooting
vtp Gather detailed information for vtp troubleshooting
xml Gather information for xml trouble shooting
Log your output
Redirect and Append
N5k-1# show clock > bootflash:debug-file.txt
N5k-1# show mac address-table >> bootflash:debug-file.txt
N5k-1# show running-config | count >> bootflash:debug-file.txt
N5k-1# show file bootflash:debug-file.txt
Mon Apr 4 02:39:41 UTC 2011 <==== output from show clock
Legend: <==== output from show mac address-table
* - primary entry, G - Gateway MAC, (R) - Routed MAC, O -
Overlay MAC
age - seconds since last seen,+ - primary entry using vPC Peer-
Link
VLAN MAC Address Type age Secure NTFY Ports
---------+-----------------+--------+---------+------+---+-----------
+ 99 0021.5ad8.c424 dynamic 0 F F Po500
* 1 0021.5ad8.c424 dynamic 250 F F Eth101/1/2
845 <==== output from show running-config | count
Logging
show logging logfile
Basis for tracing events chronologically
Try using start-time or last
show accounting log
Basis for tracing configuration changes
terminal log-all to also log show commands
All commands end with (SUCCESS) or (FAILURE)
Often overlooked, but very important
N5k-1# show logging logfile start-time 2011 Mar 9 20:00:00
2011 Mar 9 20:17:18 esc-n5548-1 %ETHPORT-5-IF_DOWN_NONE: Interface Ethernet1/1 is
down (None)
2011 Mar 9 20:17:18 esc-n5548-1 %ETHPORT-5-IF_DOWN_NONE: Interface Ethernet1/3 is
down (None)
N5k-1# show logging last ?
<1-9999> Enter number of lines to display
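Once the logfile has been copied off the switch, the same chronological filtering can be done offline. The following is a minimal Python sketch, assuming log lines begin with the `YYYY Mon D HH:MM:SS` timestamp seen in the output above; the function names are hypothetical and are not part of NX-OS.

```python
from datetime import datetime

def parse_syslog_time(line):
    """Return the leading 'YYYY Mon D HH:MM:SS' timestamp of a log line, or None."""
    parts = line.split()
    if len(parts) < 4:
        return None
    try:
        return datetime.strptime(" ".join(parts[:4]), "%Y %b %d %H:%M:%S")
    except ValueError:
        # Wrapped continuation lines and banners do not start with a timestamp.
        return None

def events_between(lines, start, end):
    """Keep only the timestamped lines that fall inside [start, end]."""
    kept = []
    for line in lines:
        stamp = parse_syslog_time(line)
        if stamp is not None and start <= stamp <= end:
            kept.append(line)
    return kept

# Lines taken from the 'show logging logfile' output above (wrapped text rejoined).
LOG = [
    "2011 Mar 9 20:17:18 esc-n5548-1 %ETHPORT-5-IF_DOWN_NONE: Interface Ethernet1/1 is down (None)",
    "2011 Mar 9 20:17:18 esc-n5548-1 %ETHPORT-5-IF_DOWN_NONE: Interface Ethernet1/3 is down (None)",
]

window = events_between(LOG, datetime(2011, 3, 9, 20, 0, 0), datetime(2011, 3, 9, 21, 0, 0))
for entry in window:
    print(entry)
```

Lines without a parsable timestamp are skipped rather than attached to the previous event, which keeps the sketch short at the cost of dropping wrapped text.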
Other System Logs
show logging nvram
Persistent logging survives reloads – helpful for crash or reload issues.
esc-n5020-1# show logging nvram
2011 Jan 26 14:58:10 esc-n5020-1 %$ VDC-1 %$ %PFMA-2-FEX_STATUS: Fex 124 is
online
2011 Jan 28 02:47:38 esc-n5020-1 %$ VDC-1 %$ %PFMA-2-PFM_SYSTEM_RESET: Manual
system restart from Command Line Interface
2011 Jan 28 02:47:38 esc-n5020-1 %$ VDC-1 %$ %KERN-0-SYSTEM_MSG: Shutdown
Ports.. - kernel
2011 Jan 28 02:47:38 esc-n5020-1 %$ VDC-1 %$ %KERN-0-SYSTEM_MSG: writing
reset reason 9, - kernel
2011 Jan 28 02:47:40 esc-n5020-1 %$ VDC-1 %$ %NOHMS-2-NOHMS_ENV_FEX_OFFLINE:
FEX-101 Off-line (Serial Number JAF132XXXXX)
2011 Jan 28 02:47:40 esc-n5020-1 %$ VDC-1 %$ %PFMA-2-FEX_STATUS: Fex 101 is
offline
2011 Jan 28 02:47:40 esc-n5020-1 %$ VDC-1 %$ %NOHMS-2-NOHMS_ENV_FEX_OFFLINE:
FEX-124 Off-line (Serial Number JAF140XXXXX)
2011 Jan 28 02:47:40 esc-n5020-1 %$ VDC-1 %$ %PFMA-2-FEX_STATUS: Fex 124 is
offline
2011 Jan 28 02:47:43 esc-n5020-1 %$ VDC-1 %$ %VPC-2-PEER_KEEP_ALIVE_RECV_FAIL:
In domain 500, VPC peer keep-alive receive has failed
When to call TAC
Most efficient if you have the following:
A description of the problem observed, with evidence / clues, along with time and scope
A current network diagram
All parties involved in the problem
Any targeted outputs, especially around the time of the event in question
show tech is not necessary, but if you must make drastic changes such as reloading or replacing hardware, grab this first
If you think you have found a bug, a quick search of defects or release notes on cisco.com may be faster
Troubleshooting Nexus 5000 / 2000
Problem Isolation
Platform Overview
NX-OS Operation
FSM
MTS
Crashes
Nexus 5000
Nexus 2000
Platform Overview and troubleshooting
Redundancy operation and troubleshooting
NX-OS: Operation Tips
Support for tab auto-complete within the current context, but commands will execute at higher levels if available.
Filesystems dynamically auto-complete
N5k-3(config-if)# switch?
switchport Configure switchport parameters <=== matching in config-if mode
N5k-3(config-if)# switchn?
switchname Configure system's host name <=== matching in config mode
N5k-3(config)# show file bootflash:s?
bootflash:stp.log.1
N5k-3(config)# install all system bootflash:n5<tab>
bootflash:n5000-uk9.5.0.3.N1.1.bin
bootflash:n5000-uk9.5.0.2.N2.1.bin
NX-OS: Operation Tips
CLI list and grep
ctrl-c terminates output
N5k-3# show cli list | grep switchport
show system default switchport san
show interface switchport
show interface <if-mr> switchport
N5k-3# show tech-support
---- show tech-support ----
ctrl-c
N5k-3#
NX-OS: File Structure
Mounts can fill up; watch /var/tmp. It is cleared only by a reload or with TAC assistance.
A full /var/tmp can cause upgrade errors and unexpected logs
N5k-1# show system internal flash
Mount-on 1K-blocks Used Available Use% Filesystem
/ 204800 111460 93340 55 /dev/root
/proc 0 0 0 0 proc
/sys 0 0 0 0 none
/isan 1536000 453760 1082240 30 none
/var/tmp 131072 108 130964 1 none
/var/sysmgr 512000 4700 507300 1 none
/var/sysmgr/ftp 204800 48604 156196 24 none
/var/sysmgr/ftp/cores 20480 0 20480 0 none
/callhome 32768 0 32768 0 none
/dev/shm 262144 95936 166208 37 none
/volatile 61440 0 61440 0 none
/debug 2048 4 2044 1 none
/dev/mqueue 0 0 0 0 none
/mnt/cfg/0 39257 4332 32898 12 /dev/sda5
/mnt/cfg/1 37242 4332 30987 13 /dev/sda6
/var/sysmgr/startup-cfg 102400 3112 99288 4 none
/dev/pts 0 0 0 0 devpts
/mnt/plog 56192 1784 54408 4 /dev/mtdblock2
/mnt/pss 39273 6058 31187 17 /dev/sda4
/bootflash 859848 768664 47504 95 /dev/sda3
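A saved copy of this output can be scanned for nearly full mounts. The sketch below is a hedged Python illustration, assuming the six-column layout shown above; the 80% threshold and the function name are arbitrary choices for the example, not Cisco guidance.

```python
def full_mounts(output, threshold=80):
    """Return (mount, use_percent) pairs at or above threshold.

    Expects rows shaped like the 'show system internal flash' table:
    Mount-on, 1K-blocks, Used, Available, Use%, Filesystem.
    """
    flagged = []
    for line in output.splitlines():
        fields = line.split()
        # Skip the header, the prompt line, and anything that is not a data row.
        if len(fields) != 6 or not fields[0].startswith("/"):
            continue
        try:
            use_percent = int(fields[4])
        except ValueError:
            continue
        if use_percent >= threshold:
            flagged.append((fields[0], use_percent))
    return flagged

# A few rows copied from the output above.
SAMPLE = """Mount-on 1K-blocks Used Available Use% Filesystem
/ 204800 111460 93340 55 /dev/root
/var/tmp 131072 108 130964 1 none
/bootflash 859848 768664 47504 95 /dev/sda3
"""

for mount, use in full_mounts(SAMPLE):
    print(f"{mount} is {use}% full")
```

In the sample above only /bootflash crosses the example threshold.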
NX-OS: File Structure
volatile: filesystem is virtual, use as scratch if needed
Obviously volatile, will not survive a reload
log: filesystem is in root /
N5k-1# debug logfile CiscoLive_debugs
N5k-1# show debug
Output forwarded to file CiscoLive_debugs (size: 4194304 bytes)
Debug level is set to Minor(1)
N5k-1# dir log:
0 Apr 04 01:14:01 2011 CiscoLive_debugs
31 Mar 11 11:38:35 2011 dmesg
0 Mar 11 11:38:57 2011 libfipf.4365
79101 Apr 04 00:34:02 2011 messages
6670 Apr 04 00:06:01 2011 startupdebug
N5k-1# copy log:CiscoLive_debugs tftp:
Enter vrf: management
Enter hostname for the tftp server: 10.91.42.134
Trying to connect to tftp server......
Connection to Server Established.
|
TFTP put operation was successful
N5k-1# clear debug-logfile CiscoLive_debugs
-OR-
N5k-1# undebug all
NX-OS: FSM
NX-OS records the finite state machine (FSM) transitions for many important processes
Using this event-history of FSM states and triggers, debugging can be done after a problem has occurred.
Some common processes:
ethpc – Ethernet port client: responsible for talking to the MAC and PHY
ethpm – Ethernet port manager: responsible for translating between the configuration and ethpc. For example, ethpc informs ethpm that the link is up, and ethpm then gives instructions on what the configuration for the port is
port-channel – port-channeling process responsible for aggregating physical links into logical channels
lacp – Link Aggregation Control Protocol, the 802.3ad standard for aggregating links
fwm – forwarding manager: responsible for programming hardware according to the software configuration
It is important to compare timestamps and watch for inter-process communication.
NX-OS: FSM
Sometimes it is enough to look at one process FSM, other times you are looking for related events.
Timestamps should line up when there is causality.
Example: A fex comes online after e1/3 is brought up
N5k-1# show logg
2005 Feb 2 13:16:49 esc-n5020-1 %ETHPORT-5-IF_UP: Interface Ethernet1/3 is up
in mode Fex Fabric
2005 Feb 2 13:16:47 esc-n5020-1 %SYSMGR-FEX100-5-MODULE_ONLINE: System
Manager has received notification of local module becoming online.
2005 Feb 2 13:16:47 esc-n5020-1 %SATCTRL-FEX100-2-SATCTRL: FEX-100 Module 1:
Cold boot
NX-OS: FSM
N5k-1# show platform software ethpc event-history interface e100/1/4
1) Event IF_PCFG_RSP, len: 8, at 243054 usecs after Wed Feb 2 13:16:54 2011
Sent port cfg message response to ethpm - Id: 0x2cc1819, Status: success
N5k-1# show port-channel internal event-history interface e100/1/4
>>>>FSM: <Ethernet100/1/4> has 1 logged transitions<<<<<
1) FSM:<Ethernet100/1/4> Transition at 447889 usecs after Wed Feb 2 13:16:54
2011
Previous state: [PCM_ETH_PORT_ST_INIT_DOWN]
Triggered event: [PCM_PORT_EV_IF_CREATE]
Next state: [FSM_ST_NO_CHANGE]
Curr state: [PCM_ETH_PORT_ST_INIT_DOWN]
A given fex host interface shows “port cfg” message
Indicates preparation to enable the interface
port-channel history shows an IF_CREATE event near this time
This is all related to a fex coming online, while e100/1/4 is configured as a port-channel member and is coming up
*e1/3 up at 13:16:49
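Comparing event-history timestamps across processes is easier once they are converted to a common type. The following Python sketch assumes the "at N usecs after <day> <month> <date> <time> <year>" wording seen in the two outputs above; the regular expression and function name are illustrative assumptions.

```python
import re
from datetime import datetime, timedelta

# Matches e.g. "at 243054 usecs after Wed Feb 2 13:16:54 2011".
_STAMP = re.compile(
    r"at (\d+) usecs after (\w{3} \w{3} \d{1,2} \d{2}:\d{2}:\d{2} \d{4})"
)

def event_time(text):
    """Return the event time as a datetime with microseconds, or None."""
    flat = " ".join(text.split())  # undo terminal line wrapping
    match = _STAMP.search(flat)
    if match is None:
        return None
    base = datetime.strptime(match.group(2), "%a %b %d %H:%M:%S %Y")
    return base + timedelta(microseconds=int(match.group(1)))

# Entries copied from the ethpc and port-channel event histories above.
ETHPC = "1) Event IF_PCFG_RSP, len: 8, at 243054 usecs after Wed Feb 2 13:16:54 2011"
PCM = "1) FSM:<Ethernet100/1/4> Transition at 447889 usecs after Wed Feb 2 13:16:54\n2011"

gap = event_time(PCM) - event_time(ETHPC)
print("port-channel IF_CREATE followed the ethpc response by", gap)
```

With both events on one timeline, the ordering (ethpc response first, port-channel IF_CREATE shortly after) can be checked without reading microsecond offsets by eye.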
NX-OS: MTS
NX-OS uses the Message and Transaction Service (MTS) to communicate between processes.
When troubleshooting CPU issues, we can check MTS for a large queue of messages.
When troubleshooting a specific process, we may see specific MTS messages queued.
NX-OS: MTS
Useful to check when troubleshooting:
high CPU
unresponsive CLI / timeout
control-plane disruption
When troubleshooting a process, we may look for specific MTS messages queued.
MTS messages may be coming in too fast, or there could be a message stuck at the top of the queue
NX-OS: MTS
The persistent queue is allowed to grow old
N5k-1# show system internal mts buffers details
Node/Sap/queue Age(ms) SrcNode SrcSAP DstNode DstSAP OPC MsgId MsgSize
sup/284/pers 2387380 0x101 1231 0x101 284 86017 1301448368 868
sup/284/pers 14398 0x101 1238 0x101 284 86017 1301470493 868
sup/284/pers 3028 0x101 1897 0x101 284 86017 1301473115 868
sup/284/pers 818 0x101 1328 0x101 284 86017 1301473633 868
sup/284/pers 577 0x101 1236 0x101 284 86017 1301473693 868
sup/284/pers 42 0x101 32562 0x101 284 86017 1301473831 868
N5k-1# sh system internal mts sup sap 284 description
TCPUDP process client MTS queue
N5k-1# sh system internal mts sup sap 1231 description
dcos-xinetd
N5k-1# sh system internal mts opcodes | grep 86017
86017 MTS_OPC_TCP:
The first entry is from dcos-xinetd (internet services). It makes sense for it to be old, since it is a server that is always running (for Fabric Manager).
NX-OS: MTS
The recv queue should not grow old
SAP 0 is an invalid identifier; messages addressed to it have caused 300 messages to queue, and the count is growing.
The observed impact is various show commands timing out, such as show log and show run
N5k-1# show system internal mts buffers details
Node/Sap/queue Age(ms) SrcNode SrcSAP DstNode DstSAP OPC MsgId MsgSize
sup/32/recv 319672424 0x101 25330 0x101 0 7662 1221952768 192
sup/32/recv 319669986 0x101 25336 0x101 32 188 1221953842 328
sup/32/recv 319609082 0x101 25344 0x101 0 7663 1221971222 2452
...
sup/32/recv 227324 0x101 32550 0x101 32 188 1301415915 328
sup/32/recv 165509 0x101 32560 0x101 0 7663 1301432732 2452
sup/32/recv 101893 0x101 32565 0x101 0 7662 1301448663 192
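The check described above (old messages in a recv queue) can be applied to a captured copy of the buffer listing. This is a rough Python sketch, assuming the nine-column layout shown; the 60-second age limit and the function name are assumptions made for illustration only.

```python
def stale_recv_messages(output, max_age_ms=60_000):
    """Return recv-queue entries older than max_age_ms.

    Expects rows from 'show system internal mts buffers details':
    Node/Sap/queue, Age(ms), SrcNode, SrcSAP, DstNode, DstSAP, OPC, MsgId, MsgSize.
    """
    stale = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 9 or not fields[1].isdigit():
            continue  # header, prompt, or elided rows
        parts = fields[0].split("/")
        if len(parts) != 3:
            continue
        _node, sap, queue = parts
        age_ms = int(fields[1])
        # Persistent queues may legitimately hold old messages; recv should not.
        if queue == "recv" and age_ms > max_age_ms:
            stale.append({
                "sap": int(sap),
                "age_ms": age_ms,
                "dst_sap": int(fields[5]),
                "opcode": int(fields[6]),
            })
    return stale

# Rows copied from the two outputs above.
SAMPLE = """Node/Sap/queue Age(ms) SrcNode SrcSAP DstNode DstSAP OPC MsgId MsgSize
sup/284/pers 2387380 0x101 1231 0x101 284 86017 1301448368 868
sup/32/recv 319672424 0x101 25330 0x101 0 7662 1221952768 192
sup/32/recv 101893 0x101 32565 0x101 0 7662 1301448663 192
"""

for entry in stale_recv_messages(SAMPLE):
    print(entry)
```

The old persistent-queue entry for SAP 284 is ignored, while both recv entries on SAP 32 are reported.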
NX-OS: MTS
MTS messages have been addressed to SAP 0 due to a bug.
Reload was needed to clear this scenario
N5k-1# sh system internal mts sup sap 0 description
Not implemented
N5k-1# sh system internal mts sup sap 32 description
Syslog Sup Node Cfg
N5k-1# show system internal sysmgr service name syslogd
Service "syslogd" ("syslogd", 75):
UUID = 0x21, PID = 3924, SAP = 32
State: SRV_STATE_HANDSHAKED (entered at time Sat May 15 05:01:20
2010). Restart count: 1
Time of last restart: Sat May 15 05:01:20 2010. The service never
crashed since the last reboot.
Tag = N/A
Plugin ID: 0
NX-OS: Crashes
NX-OS attempts to create a core file with information to help find and fix the problem:
stack trace
memory contents
Some processes in NX-OS can be restarted in a stateful manner.
Nexus 5000 is a single-supervisor platform; a crash of a critical process requires a system restart.
A syslog message is sent just before the crash and system restart:
2010 Sep 10 16:19:27.411 N5k-1 %$ VDC-1 %$ %SYSMGR-2-SERVICE_CRASHED: Service "fwm" (PID 2723) hasn't caught signal 6 (core will be saved).
NX-OS: Crashes
show process log
View status of all processes, including if a core was created
N5k-1# show process log
Process PID Normal-exit Stack Core Log-create-time
--------------- ------ ----------- ----- ----- ---------------
eth_port_channel 2743 N Y N Wed Mar 17 17:20:57 2010
eth_port_channel 2761 N Y N Tue Aug 3 19:14:58 2010
fwm 2703 N Y N Fri Oct 8 19:24:12 2010
...
N5k-1# show process log pid 2703
======================================================
Service: fwm
Description: Forwarding manager Daemon
Started at Thu Oct 7 14:51:51 2010 (151707 us)
Stopped at Fri Oct 8 19:24:12 2010 (203577 us)
Uptime: 1 days 4 hours 32 minutes 21 seconds
Start type: SRV_OPTION_RESTART_STATELESS (23)
Death reason: SYSMGR_DEATH_REASON_FAILURE_SIGNAL (2)
...
NX-OS: Crashes
When the NX-OS system manager (sysmgr) resets the switch, a core file for the offending process is often generated.
Copy off the core file for TAC analysis
N5k-1# show cores
Module-num Instance-num Process-name PID Core-create-time
---------- ------------ ------------ --- ----------------
1 1 fwm 2723 Sep 17 16:34
N5k-1# copy core://1/fwm/1/ ?
bootflash: Select destination filesystem
ftp: Select destination filesystem
scp: Select destination filesystem
sftp: Select destination filesystem
tftp: Select destination filesystem
NX-OS: Crashes
OBFL (onboard failure logging) captures information related to hardware, bootup, and environmental conditions. It is non-volatile.
show logging onboard obfl-logs
show logging onboard exception-log
show logging onboard kernel-trace
obfl-logs – per module; tracks environmental logs, bootup records, uptime at bootup, version at each boot, stack trace if applicable
exception-log – crash/exception history and details
kernel-trace – displays the stack of the last kernel exception
Sometimes a core file does not exist:
not enough room in the file system
kernel crashes
third-party processes: ntpd, telnetd, others...
NX-OS: Crashes
In addition to the core file, circumstantial evidence around the time of the crash is helpful:
Was there a configuration change?
Was there a physical topology change?
Can this be reproduced?
Was there a recent upgrade?
Are you using an uncommon configuration? – less likely to have been tested or seen by other customers
The more details pointing to a root cause, the more feasible it is to find the problem and provide a workaround and a fix.
Additional detail regarding NX-OS:
BRKARC-3471 Cisco NXOS Software - Architecture
Troubleshooting Nexus 5000 / 2000
Problem Isolation
Platform Overview and troubleshooting
NX-OS Operation
Nexus 5000
CRC errors
Ethanalyzer / CPU
Queuing and forwarding
SPAN
Spanning-tree
Nexus 2000
Redundancy operation and troubleshooting
Hardware overview
Any discussion of forwarding errors and troubleshooting usually involves drops
We need to know the basic hardware layout in order to know where to look for problems
The following hardware overview is a preview of
BRKARC-3452 – Cisco Nexus 5000/5500 and 2000 Switch Architecture
Nexus 5000 Hardware Overview
Data Plane Elements
Nexus 5000 is a distributed forwarding architecture
Unified Port Controller (UPC) ASICs interconnected by a single-stage Unified Crossbar Fabric (UCF)
Unified Port Controllers provide distributed packet forwarding capabilities
All port-to-port traffic passes through the UCF (Fabric)
Four switch ports managed by each UPC
14 UPCs in Nexus 5020
7 UPCs in Nexus 5010
[Figure: several Unified Port Controllers, each serving four SFP ports, all connected to the Unified Crossbar Fabric.]
Nexus 5500 Hardware Overview
Data and Control Plane Elements
[Figure: Nexus 5500 block diagram. Data plane: Gen 2 UPCs around the Gen 2 Unified Crossbar Fabric (10 Gig and 12 Gig links), plus an expansion module. Control plane: Intel Jasper Forest CPU with DDR3 DRAM, NVRAM, serial flash memory, South Bridge, a PEX 8525 4-port PCIe switch (PCIe x4 / x8), PCIe dual-Gig NICs, and the Mgmt 0, Console, L1 and L2 ports.]
Nexus 5000/5500 Hardware Overview
Data Plane Elements - Unified Crossbar Fabric
Nexus 5000 (Gen-1)
58-port packet-based crossbar and scheduler
Three unicast and one multicast crosspoint per egress port
Nexus 5500 (Gen-2)
100-port packet-based crossbar and new schedulers
4 crosspoints per egress port, dynamically configurable between multicast and unicast traffic
Central tightly coupled scheduler
Request, propose, accept, grant, and acknowledge semantics
Packet-enhanced iSLIP scheduler
Distinct unicast and multicast schedulers (see slides later for differences in Gen-1 vs. Gen-2 multicast schedulers)
Eight classes of service within the Fabric
[Figure: Unified Crossbar Fabric with a unicast iSLIP scheduler and a multicast scheduler.]
Nexus 5000 Hardware Overview
Unified Port Controller
Each UPC supports four ports and contains:
Multimode Media access controllers (MAC)
Support 1/10 G Ethernet and 1/2/4 G Fibre Channel
1G is available on the first 8 ports of the 5010 and the first 16 ports of the 5020
(2/4/8 G Fibre Channel MAC is located on the Expansion Module)
Packet buffering and queuing
480 KB of buffering per port
Forwarding controller
Ethernet and Fibre Channel Forwarding and Policy
[Figure: Unified Port Controller with four per-port blocks, each labeled MMAC + Buffer + Forwarding.]
Nexus 5500 Hardware Overview
Data Plane Elements - Unified Port Controller (Gen 2)
Each UPC supports eight ports and contains:
Multimode Media access controllers (MAC)
Support 1/10 G Ethernet and 1/2/4/8 G Fibre Channel
All MAC/PHY functions supported on the UPC (5548UP and 5596UP)
Packet buffering and queuing
640 KB of buffering per port
Forwarding controller
Ethernet (Layer 2 and FabricPath) and Fibre Channel Forwarding and Policy (L2/L3/L4 + all FC zoning)
[Figure: Unified Port Controller 2 with eight per-port blocks, each labeled MMAC + Buffer + Forwarding.]
Nexus 5000/5500 Hardware Overview
Control Plane Elements
In-band traffic is identified by the UPC and punted to the CPU via two dedicated UPC interfaces, 5/0 and 5/1, which are in turn connected to the eth3 and eth4 interfaces in the CPU complex
Eth3 handles Rx and Tx of low-priority control packets
IGMP, CDP, TCP/UDP/IP/ARP (for management purposes only)
Eth4 handles Rx and Tx of high-priority control packets
STP, LACP, DCBX, FC and FCoE control frames (FC packets come to the switch CPU as FCoE packets)
There is a built-in control-plane policer to limit the amount of traffic punted to the CPU
[Figure: CPU connected through the South Bridge to a NIC for mgmt0 and to a NIC whose eth3 and eth4 interfaces connect to a Unified Port Controller.]
Nexus 5000/5500 Hardware Overview
Control Plane Elements
[Figure: CPU (Intel LV Xeon, 1.66 GHz), South Bridge and NIC.]
CPU queuing structure provides strict protection and prioritization of inbound traffic
Each of the two in-band ports has 8 queues, and traffic is scheduled for those queues based on control plane priority (traffic CoS value)
Prioritization of traffic between queues on each in-band interface
CLASS 7 is configured for strict priority scheduling (e.g. BPDU)
CLASS 6 is configured for DRR scheduling with 50% weight
Default classes (0 to 5) are configured for DRR scheduling with 10% weight
Additionally, each of the two in-band interfaces has a priority service order from the CPU
Eth4 interface has high priority to service packets (no interrupt moderation)
Eth3 interface has low priority (interrupt moderation)
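To make the interaction between strict priority and DRR (deficit round robin) concrete, here is a small Python simulation. It is only a sketch of the general technique: the quantum values, packet sizes, and the function name are hypothetical, the weights are treated as relative per-round byte quanta, and nothing here claims to reproduce the actual in-band queue implementation.

```python
from collections import deque

STRICT_CLASS = 7  # served ahead of every DRR class, as described for CLASS 7

def service_order(queues, quantum):
    """Return the (class, packet_size) service order.

    queues:  dict of class -> deque of packet sizes in bytes (consumed in place)
    quantum: dict of DRR class -> bytes credited per round (relative weight)
    """
    for cls in queues:
        if cls != STRICT_CLASS and quantum.get(cls, 0) <= 0:
            raise ValueError(f"class {cls} needs a positive quantum")
    order = []
    deficit = {cls: 0 for cls in quantum}
    while any(queues.values()):
        # Strict priority: drain the strict class before any DRR round.
        while queues.get(STRICT_CLASS):
            order.append((STRICT_CLASS, queues[STRICT_CLASS].popleft()))
        for cls in sorted(quantum, reverse=True):
            queue = queues.get(cls)
            if not queue:
                deficit[cls] = 0  # idle classes do not bank credit
                continue
            deficit[cls] += quantum[cls]
            while queue and queue[0] <= deficit[cls]:
                size = queue.popleft()
                deficit[cls] -= size
                order.append((cls, size))
            if not queue:
                deficit[cls] = 0
    return order

# Hypothetical backlog: class 6 gets five times the quantum of class 0.
demo = service_order(
    {7: deque([64, 64]), 6: deque([500, 500]), 0: deque([500, 500])},
    {6: 500, 0: 100},
)
print(demo)
```

In this backlog the class 7 packets leave first, and class 6 finishes before class 0 sends anything because class 0 needs several rounds to accumulate enough credit for one 500-byte packet.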
[Figure: eth3 and eth4 in-band interfaces with example traffic types BPDU, ICMP and CFS in their queues.]
Nexus 5000 Hardware Overview
Control Plane Elements
Monitoring of in-band traffic via the NX-OS built-in ethanalyzer (sniffer)
Eth3 is equivalent to 'inbound-low'
Eth4 is equivalent to 'inbound-hi'
[Figure: CPU (Intel LV Xeon, 1.66 GHz), South Bridge and NIC; eth3 and eth4 connect to a Unified Port Controller.]
N5k-2# ethanalyzer local sniff-interface ?
inbound-hi Inbound(high priority) interface
inbound-low Inbound(low priority) interface
mgmt Management interface
N5k-2# sh hardware internal cpu-mac inband counters
eth3 Link encap:Ethernet HWaddr 00:0D:EC:B2:0C:83
UP BROADCAST RUNNING PROMISC ALLMULTI MULTICAST MTU:2200 Metric:1
RX packets:3 errors:0 dropped:0 overruns:0 frame:0
TX packets:630 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:252 (252.0 b) TX bytes:213773 (208.7 KiB)
Base address:0x6020 Memory:fa4a0000-fa4c0000
eth4 Link encap:Ethernet HWaddr 00:0D:EC:B2:0C:84
UP BROADCAST RUNNING PROMISC ALLMULTI MULTICAST MTU:2200 Metric:1
RX packets:85379 errors:0 dropped:0 overruns:0 frame:0
TX packets:92039 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:33960760 (32.3 MiB) TX bytes:25825826 (24.6 MiB)
Base address:0x6000 Memory:fa440000-fa460000
CLI view of in-band control plane data
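The per-interface counters above can be pulled out of a saved capture for comparison over time. A minimal Python sketch follows, assuming the ifconfig-style layout shown; the function name and the returned dictionary shape are assumptions for illustration.

```python
import re

_IFACE = re.compile(r"\s*(eth\d+)\s+Link encap")
_COUNTS = re.compile(r"\s*(RX|TX) packets:(\d+) errors:(\d+) dropped:(\d+)")

def inband_counters(output):
    """Return {interface: {"RX"/"TX": {"packets", "errors", "dropped"}}}."""
    stats = {}
    current = None
    for line in output.splitlines():
        iface = _IFACE.match(line)
        if iface:
            current = iface.group(1)
            stats[current] = {}
            continue
        counts = _COUNTS.match(line)
        if counts and current is not None:
            stats[current][counts.group(1)] = {
                "packets": int(counts.group(2)),
                "errors": int(counts.group(3)),
                "dropped": int(counts.group(4)),
            }
    return stats

# Lines copied from the output above.
SAMPLE = """eth3 Link encap:Ethernet HWaddr 00:0D:EC:B2:0C:83
RX packets:3 errors:0 dropped:0 overruns:0 frame:0
TX packets:630 errors:0 dropped:0 overruns:0 carrier:0
eth4 Link encap:Ethernet HWaddr 00:0D:EC:B2:0C:84
RX packets:85379 errors:0 dropped:0 overruns:0 frame:0
TX packets:92039 errors:0 dropped:0 overruns:0 carrier:0
"""

counters = inband_counters(SAMPLE)
for name, directions in counters.items():
    for direction, values in directions.items():
        print(name, direction, values)
```

Taking two captures a known interval apart and subtracting the packet counts gives a rough view of how busy each in-band interface is.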
Nexus 5000 Hardware Overview
Packet Forwarding Overview
1. Ingress MAC - MAC decoding, MACSEC processing (not supported currently), synchronize bytes
2. Ingress Forwarding Logic - Parse frame, perform forwarding and filtering searches, perform learning, apply internal DCE header
3. Ingress Buffer (VoQ) - Queue frames, request service of fabric, dequeue frames to fabric, and monitor queue usage to trigger congestion control
4. Cross Bar Fabric - Scheduler determines fairness of access to fabric and determines when a frame is de-queued across the fabric
5. Egress Buffers - Landing spot for frames in flight when egress is paused
6. Egress Forwarding Logic - Parse, extract fields, learning and filtering searches, perform learning, and finally convert to desired egress format
7. Egress MAC - MAC encoding, pack, synchronize bytes and transmit
[Figure: the seven numbered stages along the path from the ingress SFP ports through the ingress UPC, the Unified Crossbar Fabric, and the egress UPC to the egress SFP ports.]
Nexus 5000 Forwarding
Cut-through vs. store-and-forward
Store and forward switching is still utilized when the ingress data rate is slower than the egress data rate.
Cut-through switching is utilized to achieve low latency through the switch fabric.
Bits are serialized in from the ingress port until enough of the packet header has been received to perform a forwarding and policy lookup
Once a lookup decision has been made and the fabric has granted access to the egress port bits are forwarded through the fabric
Egress port performs any header rewrite (e.g. CoS marking) and MAC begins serialization of bits out the egress port
A drop cannot happen on ingress due to any switching logic or even a CRC error. Only faulty hardware or connections can cause a drop on ingress.
Discards can occur on ingress due to queuing configuration and traffic patterns.
Nexus 5000 Forwarding
Cut-through vs. store-and-forward
Source Interface      Destination Interface   Switching Mode
10 GigabitEthernet    10 GigabitEthernet      Cut-Through
10 GigabitEthernet    1 GigabitEthernet       Cut-Through
1 GigabitEthernet     1 GigabitEthernet       Store-and-Forward
1 GigabitEthernet     10 GigabitEthernet      Store-and-Forward
FCoE                  Fibre Channel           Cut-Through
Fibre Channel         FCoE                    Store-and-Forward
Fibre Channel         Fibre Channel           Store-and-Forward
FCoE                  FCoE                    Cut-Through
Simple way to remember: 10G ingress interfaces are always cut-through
Note: 10G interfaces can be configured for Ethernet or FCoE
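The table can be expressed as a simple lookup, which also makes it easy to confirm the "10G ingress is always cut-through" rule of thumb. This Python sketch only restates the rows above; the short interface labels and the function name are made up for the example.

```python
# (source interface, destination interface) -> switching mode, from the table above.
SWITCHING_MODE = {
    ("10GE", "10GE"): "cut-through",
    ("10GE", "1GE"): "cut-through",
    ("1GE", "1GE"): "store-and-forward",
    ("1GE", "10GE"): "store-and-forward",
    ("FCoE", "FC"): "cut-through",
    ("FC", "FCoE"): "store-and-forward",
    ("FC", "FC"): "store-and-forward",
    ("FCoE", "FCoE"): "cut-through",
}

def switching_mode(source, destination):
    """Look up the switching mode for an ingress/egress interface pair."""
    try:
        return SWITCHING_MODE[(source, destination)]
    except KeyError:
        raise ValueError(f"combination not in the table: {source} -> {destination}")

# FCoE runs on 10G interfaces, so both labels count as 10G ingress here.
TEN_GIG_INGRESS = {"10GE", "FCoE"}
rule_holds = all(
    (mode == "cut-through") == (src in TEN_GIG_INGRESS)
    for (src, _dst), mode in SWITCHING_MODE.items()
)
print(switching_mode("1GE", "10GE"), rule_holds)
```

Combinations that the table does not list (for example 1 Gigabit Ethernet to Fibre Channel) raise an error rather than guessing a mode.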
Cut-through mode and CRC errors
Received errors
Cut-through switching changes how we troubleshoot problems in the switch.
The Ethernet CRC is at the end of the frame, so even a CRC error cannot cause a drop on a cut-through port.
The frame is already being forwarded by the time the ingress MAC can read the CRC value.
[Figure: a frame (Ethernet header, IPv4 header, IP payload, FCS) being parsed. The forwarding decision is made after the headers; the corruption lies later in the frame, and the CRC is found to be bad only at the FCS.]
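The ordering problem can be shown with a few lines of Python. This is a simplified sketch: `zlib.crc32` uses the same CRC-32 polynomial as the Ethernet FCS, but on-wire bit and byte ordering is ignored, the 14-byte header length assumes an untagged frame, and the function name is hypothetical.

```python
import zlib

ETH_HEADER_LEN = 14  # destination MAC + source MAC + EtherType, untagged frame

def cut_through_receive(frame, expected_crc, header_len=ETH_HEADER_LEN):
    """Walk a frame byte by byte as a cut-through port would see it.

    Returns (byte_count_when_forwarding_started, crc_ok). The forwarding
    decision is taken once the header has arrived; the CRC verdict is only
    available after the final byte.
    """
    running = 0
    forwarded_at = None
    for index in range(len(frame)):
        running = zlib.crc32(frame[index:index + 1], running)
        if index + 1 == header_len:
            forwarded_at = index + 1  # lookup done, bits start leaving
    return forwarded_at, running == expected_crc

header = bytes(ETH_HEADER_LEN)
payload = b"example payload " * 4
good_frame = header + payload
good_crc = zlib.crc32(good_frame)

# Corrupt the last payload byte, as a bad cable or optic might.
bad_frame = good_frame[:-1] + bytes([good_frame[-1] ^ 0x01])

print(cut_through_receive(good_frame, good_crc))
print(cut_through_receive(bad_frame, good_crc))
```

In both cases forwarding starts after the header, so the corrupted frame is already on its way out when the CRC mismatch becomes known.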
Cut-through mode and CRC errors
Received errors
The corrupted frame must be forwarded, but is accounted for as an output error.
N5k-1# show interface e1/1
...
TX
10157 unicast packets 105 multicast packets 52 broadcast packets
11314 output packets 5317822 bytes
0 jumbo packets
1000 output errors 0 collision 0 deferred 0 late collision
0 lost carrier 0 no carrier 0 babble 0 Tx pause
0 interface resets
Animation frames for printouts
[Figure sequence: a frame (Ethernet header, IPv4 header, IP payload, FCS) passing through the parser.]
1. A frame arrives to be parsed but is corrupted.
2. Cut-through switching reads only enough of the header (such as the destination MAC address) to make the forwarding decision. Here, the decision to forward is made while unaware of the corruption to follow.
3. It is not until the FCS field in the Ethernet trailer that the CRC value can be calculated and found to be bad.
Cut-through mode and CRC “stomping”
Originated Errors
In addition to receiving errored frames, the Nexus 5000 can generate a bad CRC for several reasons:
MTU violation
IP length error
Ethernet length error
when the EtherType/length field is <= 1500 (0x5dc), it is interpreted as a length
Invalid Ethernet preamble
Received and originated errors will count as TX output errors.
Only received errors will count as RX CRC errors.
You are more likely to see CRC errors in a network with a cut-through switch.
The errors will pass through all cut-through switches and finally drop at the first store-and-forward buffer.
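The length-versus-EtherType rule above can be sketched in a few lines. This is a minimal illustration, not the switch's actual logic: the function name is hypothetical, and the 1500 and 0x0600 boundaries are the IEEE 802.3 conventions rather than anything stated on the slide.

```python
def classify_type_field(value: int) -> str:
    """Classify the 16-bit field that follows the source MAC address.

    Assumption: IEEE 802.3 convention, where values up to 1500 (0x05DC)
    are a payload length and values from 1536 (0x0600) are an EtherType.
    """
    if not 0 <= value <= 0xFFFF:
        raise ValueError("field must fit in 16 bits")
    if value <= 1500:
        return "length"
    if value >= 0x0600:
        return "ethertype"
    return "undefined"


if __name__ == "__main__":
    for sample in (0x05DC, 0x0800, 1501):
        print(hex(sample), classify_type_field(sample))
```

A frame whose field is classified as a length, but whose real payload size disagrees with it, is the kind of frame that could raise an Ethernet length error.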
Finding the source of CRC errors
CRC errors are introduced in 3 ways:
Bad physical connection
copper, fiber, transceiver, PHY
“Stomping” due to intentionally originated errors
A bad CRC received after being “stomped” by a neighboring cut-through switch
Start by finding any RX CRC counters.
If there are none, then this switch is responsible for originating the errors.
Use interrupt counters to find the reason and port, if the stomp was intentional.
If RX CRC counters are present, log in to the next switch upstream of them and check for RX CRC there.
Use the same logic to determine whether that switch is originating any errors.
Finally, inspect optics/pluggables and fiber/cables, and troubleshoot as a Layer 1 issue. Change the cable and the port to find where the problem follows.
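The decision steps above can be sketched as a small classifier. This is only an illustration under stated assumptions: the `SwitchCounters` type and the `classify` function are hypothetical, and the counter values would have to be collected by hand from `show interface` on each switch.

```python
from dataclasses import dataclass


@dataclass
class SwitchCounters:
    """Per-switch totals taken from 'show interface' (hypothetical container)."""
    name: str
    rx_crc: int            # sum of RX CRC counters on the switch
    tx_output_errors: int  # sum of TX output errors on the switch


def classify(sw: SwitchCounters) -> str:
    """Apply the slide's logic to one switch."""
    if sw.rx_crc > 0:
        # Errors arrived from outside: keep walking upstream, then check Layer 1.
        return "received"
    if sw.tx_output_errors > 0:
        # No RX CRC but TX errors: this switch stomped the frames itself.
        return "originated"
    return "clean"


if __name__ == "__main__":
    print(classify(SwitchCounters("N5k-1", rx_crc=1, tx_output_errors=1)))
    print(classify(SwitchCounters("N5k-1", rx_crc=0, tx_output_errors=1)))
```

A result of "originated" points at the interrupt counters on that switch; "received" points at the upstream neighbor or the physical link.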
Finding the source of CRC errors: Observations, scenario #1
[Figure: topology with N7k-1 (e1/11, e1/12), N5k-1 (e1/1, e1/5, e1/7) and N5k-2 (e1/3, e1/4, e1/5, e1/7); VLAN 7 and VLAN 8]
N5k-1# show interface e1/1
RX
20949142 unicast packets 1147746 multicast packets 6 broadcast packets
22096894 input packets 30452432662 bytes
18967009 jumbo packets 0 storm suppression packets
0 runts 0 giants 1 CRC 0 no buffer
Finding the source of CRC errors: Observations, scenario #1
N5k-1# show interface e1/5
TX
1266 unicast packets 1147746 multicast packets 6 broadcast packets
0 output packets 0 bytes
0 jumbo packets
1 output errors 0 collision 0 deferred 0 late collision
Finding the source of CRC errors: Observations, scenario #1
N5k-2# show interface e1/5
RX
1266 unicast packets 1147746 multicast packets 6 broadcast packets
0 input packets 0 bytes
0 jumbo packets 0 storm suppression packets
0 runts 0 giants 1 CRC 0 no buffer
Finding the source of CRC errors: Observations, scenario #1
N5k-2# show interface e1/3
TX
1266 unicast packets 1147746 multicast packets 6 broadcast packets
0 output packets 0 bytes
0 jumbo packets
1 output errors 0 collision 0 deferred 0 late collision
Finding the source of CRC errors: Scenario #1 (Physical Issue)
[Figure: same topology, with a bad fiber marked]
N5k-1# show interface e1/1
RX
20949142 unicast packets 1147746 multicast packets 6 broadcast packets
22096894 input packets 30452432662 bytes
18967009 jumbo packets 0 storm suppression packets
0 runts 0 giants 1 CRC 0 no buffer
The frame enters the switch as a CRC error
Finding the source of CRC errors: Scenario #1 (Physical Issue)
N5k-1# show hardware internal gatos all-ports | egrep name|1/1
name |log|gat|mac|flag|adm|opr|c:m:s:l|ipt|fab|xgat|xpt|if_index|diag
xgb1/1 |0 |7 |2 |b7 |en |up |1:2:2:f|2 |6 |7 |4 |1a000000|pass
Front Panel Internal
e1/1 7:2
Look up the internal ASIC port
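The lookup above can be sketched as a parser for the two lines shown in the capture. This is a hedged illustration: `parse_all_ports` is a hypothetical helper, and it assumes the `gat` and `mac` columns are decimal and that the output keeps the pipe-separated layout shown on the slide.

```python
def parse_all_ports(output: str) -> dict:
    """Map each port name to its internal (ASIC, port) pair.

    Assumes the first pipe-separated line is the header row and that the
    'gat' and 'mac' columns hold the ASIC instance and internal port.
    """
    lines = [line for line in output.splitlines() if "|" in line]
    header = [col.strip() for col in lines[0].split("|")]
    mapping = {}
    for line in lines[1:]:
        row = dict(zip(header, (col.strip() for col in line.split("|"))))
        mapping[row["name"]] = (int(row["gat"]), int(row["mac"]))
    return mapping


# Sample text copied from the capture above.
SAMPLE = """name |log|gat|mac|flag|adm|opr|c:m:s:l|ipt|fab|xgat|xpt|if_index|diag
xgb1/1 |0 |7 |2 |b7 |en |up |1:2:2:f|2 |6 |7 |4 |1a000000|pass"""

if __name__ == "__main__":
    print(parse_all_ports(SAMPLE))
```

The resulting pair for xgb1/1 matches the "e1/1 7:2" entry in the Front Panel / Internal table.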
Finding the source of CRC errors: Scenario #1 (Physical Issue)
N5k-1# show hardware internal gatos asic 7 counters interrupt
Gatos 7 interrupt statistics:
Interrupt name |Count |ThresRch|ThresCnt|Ivls
-----------------------------------------------+--------+--------+--------+----
gat_fw2_INT_ig_pkt_err_cb_bm_eof_err |1 |0 |1 |0
gat_fw2_INT_ig_pkt_err_eth_crc_stomp |1 |0 |1 |0
gat_fw2_INT_ig_pkt_err_e802_3_len_err |1 |0 |1 |0
gat_mm0_INT_rlp_rx_pkt_crc_err |1 |0 |1 |0
gat_mm0_INT_rlp_rx_pkt_crc_stomped |1 |0 |1 |0
Front Panel Internal
e1/1 7:2
Interrupt counters will increment on receipt of a bad CRC
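When the interrupt table is long, it can help to pull out only the CRC-related entries and their direction. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the direction is inferred purely from the `_ig_`, `_eg_`, `_rx_` and `_tx_` fragments in the names shown in these captures, and the Count column is read as hex, as a later slide in this deck states.

```python
def parse_interrupts(output: str) -> dict:
    """Return {interrupt name: count} for rows that start with 'gat_'."""
    counts = {}
    for line in output.splitlines():
        cols = [col.strip() for col in line.split("|")]
        if len(cols) < 2 or not cols[0].startswith("gat_"):
            continue  # skips the header and separator rows
        counts[cols[0]] = int(cols[1], 16)  # Count column is hex
    return counts


def direction(name: str) -> str:
    """Guess the direction from the interrupt name (naming assumption)."""
    if "_ig_" in name or "_rx_" in name:
        return "ingress"
    if "_eg_" in name or "_tx_" in name:
        return "egress"
    return "unknown"


# Sample rows copied from the capture above.
SAMPLE = """Interrupt name |Count |ThresRch|ThresCnt|Ivls
gat_fw2_INT_ig_pkt_err_eth_crc_stomp |1 |0 |1 |0
gat_mm0_INT_rlp_rx_pkt_crc_err |1 |0 |1 |0"""

if __name__ == "__main__":
    for name, count in parse_interrupts(SAMPLE).items():
        if "crc" in name:
            print(direction(name), name, count)
```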
Finding the source of CRC errors: Scenario #1 (Physical Issue)
N5k-1# show interface e1/5
TX
1266 unicast packets 1147746 multicast packets 6 broadcast packets
0 output packets 0 bytes
0 jumbo packets
1 output errors 0 collision 0 deferred 0 late collision
Front Panel Internal
e1/1 7:2
e1/5 7:1
10Gb/s interfaces will cut-through switch these bad frames and increment an output error at the egress port
Finding the source of CRC errors: Scenario #1 (Physical Issue)
N5k-1# show hardware internal gatos asic 7 counters interrupt
Gatos 7 interrupt statistics:
Interrupt name |Count |ThresRch|ThresCnt|Ivls
-----------------------------------------------+--------+--------+--------+----
gat_fw1_INT_eg_pkt_err_cb_bm_eof_err |1 |0 |0 |0
gat_fw1_INT_eg_pkt_err_eth_crc_stomp |1 |0 |0 |0
gat_fw1_INT_eg_pkt_err_e802_3_len_err |1 |0 |0 |0
Front Panel Internal
e1/1 7:2
e1/5 7:1
Interrupt counters increment upon transmit of the errored frame
Finding the source of CRC errors: Scenario #1 (Physical Issue)
N5k-2# show interface e1/5
RX
1266 unicast packets 1147746 multicast packets 6 broadcast packets
0 input packets 0 bytes
0 jumbo packets 0 storm suppression packets
0 runts 0 giants 1 CRC 0 no buffer
Front Panel Internal
e1/1 7:2
e1/5 7:1
Another cut-through port receives the bad frame
Finding the source of CRC errors: Scenario #1 (Physical Issue)
N5k-2# show hardware internal gatos asic 7 counters interrupt
Gatos 7 interrupt statistics:
Interrupt name |Count |ThresRch|ThresCnt|Ivls
-----------------------------------------------+--------+--------+--------+----
gat_fw1_INT_ig_pkt_err_cb_bm_eof_err |1 |0 |1 |0
gat_fw1_INT_ig_pkt_err_eth_crc_stomp |1 |0 |1 |0
gat_fw1_INT_ig_pkt_err_e802_3_len_err |1 |0 |1 |0
gat_mm0_INT_rlp_rx_pkt_crc_err |1 |0 |1 |0
gat_mm0_INT_rlp_rx_pkt_crc_stomped |1 |0 |1 |0
Front Panel Internal
e1/1 7:2
e1/5 7:1
Interrupt counters will increment on receipt of a bad CRC
Finding the source of CRC errors: Scenario #1 (Physical Issue)
N5k-2# show interface e1/3
TX
1266 unicast packets 1147746 multicast packets 6 broadcast packets
0 output packets 0 bytes
0 jumbo packets
1 output errors 0 collision 0 deferred 0 late collision
Front Panel Internal
e1/1 7:2
e1/5 7:1
Finding the source of CRC errors: Scenario #1 (Physical Issue)
N5k-2# show hardware internal gatos asic 0 counters interrupt
Gatos 0 interrupt statistics:
Interrupt name |Count |ThresRch|ThresCnt|Ivls
-----------------------------------------------+--------+--------+--------+----
gat_fw2_INT_eg_pkt_err_cb_bm_eof_err |1 |0 |0 |0
gat_fw2_INT_eg_pkt_err_eth_crc_stomp |1 |0 |0 |0
gat_fw2_INT_eg_pkt_err_e802_3_len_err |1 |0 |0 |0
Front Panel Internal
e1/1 7:2
e1/5 7:1
e1/3 0:2
Interrupt counters increment upon transmit of the errored frame
Finding the source of CRC errors: Scenario #1 (Physical Issue)
Front Panel Internal
e1/1 7:2
e1/5 7:1
e1/3 0:2
The host will drop the bad frame in its Rx buffer
Finding the source of CRC errors: Observations, scenario #2
N5k-1# show interface e1/1
RX
20995002 unicast packets 1150262 multicast packets 6 broadcast packets
22145270 input packets 30519119563 bytes
1 jumbo packets 0 storm suppression packets
Finding the source of CRC errors: Observations, scenario #2
N5k-1# show interface e1/7
TX
1266 unicast packets 1147746 multicast packets 6 broadcast packets
0 output packets 0 bytes
0 jumbo packets
1 output errors 0 collision 0 deferred 0 late collision
Finding the source of CRC errors: Observations, scenario #2
N7k-1# show interface e1/11
RX
4 unicast packets 0 multicast packets 0 broadcast packets
4 input packets 5672 bytes
0 jumbo packets 0 storm suppression packets
0 runts 0 giants 1 CRC 0 no buffer
1 input error 0 short frame 0 overrun 0 underrun 0 ignored
Finding the source of CRC errors: Scenario #2 (MTU Exceeded)
Front Panel Internal
e1/1 7:2
4000B frame transmitted
N5k-1# show interface e1/1
RX
20995002 unicast packets 1150262 multicast packets 6 broadcast packets
22145270 input packets 30519119563 bytes
1 jumbo packets 0 storm suppression packets
Jumbo packets increment whenever the Ethernet payload is greater than 1500 bytes. This is not always an error!
Finding the source of CRC errors: Scenario #2 (MTU Exceeded)
Front Panel Internal
e1/1 7:2
4000B frame transmitted
N5k-1# show hardware internal gatos port e1/1 counters rx
RX_PKT_SIZE_IS_1519_TO_2047 | 0
RX_PKT_SIZE_IS_2048_TO_4095 | 1
RX_PKT_SIZE_IS_4095_TO_8191 | 0
RX_PKT_SIZE_IS_8192_TO_9216 | 0
RX_PKT_SIZE_GT_9216 | 0
Hardware counters keep track of size ranges.
Finding the source of CRC errors: Scenario #2 (MTU Exceeded)
Front Panel Internal
e1/1 7:2
N5k-1# show hardware internal gatos asic 7 counters interrupt
Gatos 7 interrupt statistics:
Interrupt name |Count |ThresRch|ThresCnt|Ivls
-----------------------------------------------+--------+--------+--------+----
gat_bm_port2_INT_err_ig_mtu_vio |1 | | |
In this case, the class-based MTU is set to the default of 1500 in class-default.
So we enter an error condition.
Finding the source of CRC errors: Scenario #2 (MTU Exceeded)
Front Panel Internal
e1/1 7:2
MTU is configured per class, under network-qos.
This allows for a separate FCoE MTU and Ethernet MTU.
N5k-1# show policy-map type network-qos
Type network-qos policy-maps
===============================
policy-map type network-qos default-nq-policy
class type network-qos class-fcoe
pause no-drop
mtu 2158
class type network-qos class-default
mtu 1500
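The per-class MTU check behind this scenario can be sketched as follows. This is a simplified illustration, not the ASIC's behavior: the table copies the two MTU values from the policy-map above, the function name is hypothetical, and the comparison ignores header overhead and the slightly larger hardware MTU the switch actually programs.

```python
# MTU per network-qos class, copied from the default policy-map above.
CLASS_MTU = {"class-fcoe": 2158, "class-default": 1500}


def violates_mtu(frame_bytes: int, qos_class: str = "class-default") -> bool:
    """Return True when a frame is larger than the MTU of its class.

    Simplification: compares the frame size directly with the configured
    value and ignores header overhead.
    """
    return frame_bytes > CLASS_MTU[qos_class]


if __name__ == "__main__":
    # The 4000-byte frame from scenario #2 lands in class-default.
    print(violates_mtu(4000))
    print(violates_mtu(2000, "class-fcoe"))
```

Under these assumptions the 4000-byte frame exceeds the class-default MTU, which matches the `mtu_vio` interrupt seen on the ingress ASIC.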
Finding the source of CRC errors: Scenario #2 (MTU Exceeded)
N5k-1# show hardware internal gatos asic 0 counters interrupt
Gatos 0 interrupt statistics:
Interrupt name |Count |ThresRch|ThresCnt|Ivls
-----------------------------------------------+--------+--------+--------+----
gat_fw1_INT_eg_pkt_err_cb_bm_eof_err |1 |0 |1 |0
gat_fw1_INT_eg_pkt_err_eth_crc_stomp |1 |0 |1 |0
gat_fw1_INT_eg_pkt_err_ip_pyld_len_err |1 |0 |1 |0
gat_mm1_INT_rlp_tx_pkt_crc_err |1 |0 |1 |0
Front Panel Internal
e1/1 7:2
e1/7 0:1
Leaving the egress interface, the CRC has been stomped and other interrupts have fired.
Note that the egress interface aggregates all frames from various source interfaces, so adding up counters can be tricky.
Finding the source of CRC errors: Scenario #2 (MTU Exceeded)
N7k-1# show interface e1/11
RX
4 unicast packets 0 multicast packets 0 broadcast packets
4 input packets 5672 bytes
0 jumbo packets 0 storm suppression packets
0 runts 0 giants 1 CRC 0 no buffer
1 input error 0 short frame 0 overrun 0 underrun 0 ignored
Front Panel Internal
e1/1 7:2
e1/7 0:1
The store-and-forward card on the Nexus 7000 parses the entire frame and finds a bad CRC value. A drop occurs on N7k-1, so the frame never makes it to N5k-2.
Troubleshooting Nexus 5000 / 2000
Problem Isolation
Platform Overview and troubleshooting
NX-OS Operation
Crashes
Nexus 5000
CRC errors
Ethanalyzer / CPU
Queuing and forwarding
Spanning-tree
Nexus 2000
Redundancy operation and troubleshooting
NX-OS: High CPU
Hardware-accelerated switches do not rely on the CPU for frame forwarding and processing.
*Some L3 paths do require the CPU path if hardware entries are missing (“punt”).
The CPU is critical for control-plane activities:
LACP: without keeping up with LACPDUs, 802.3ad port channels would go down.
STP and STP Bridge Assurance: a downstream switch that misses BPDUs will go forwarding on a blocked port, so if the CPU cannot keep up with sending BPDUs, loops can form. Bridge Assurance helps in some ways: instead of going forwarding, a BA-enabled switch stops forwarding on the interface.
vPC programming: MAC addresses learned on vPC interfaces must be installed on both switches in order to prevent flooding and to deliver frames to their destination.
Redundancy: in the event of a switch outage, the CPU needs to reprogram state information for all processes and configure MAC addresses on interfaces in their respective VLANs.
Configuration and management: an unresponsive switch is not useful as a troubleshooting tool, and you are blind without a reliable interface to the network.
NX-OS: High CPU
N5k-1# show process cpu sort | exclude 0.0
PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
4120 1137 10931494 0 17.5% pfma
4204 1477 84831831 0 1.9% gatosusd
N5k-1# show system resources
Load average: 1 minute: 0.63 5 minutes: 1.35 15 minutes: 1.41
Processes : 281 total, 1 running
CPU states : 1.0% user, 8.9% kernel, 90.1% idle
Memory usage: 2073408K total, 1412108K used, 661300K free
Ideally you have a baseline, so you can compare the current CPU trends with a known nominal state
Always gather these 3 commands, repeating them frequently:
show process cpu sort | exclude 0.0
show system resources
show process cpu history
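Comparing against a baseline is easier when the idle figure is extracted the same way each time. The sketch below is a hedged example: the function name is hypothetical, it assumes the `CPU states` line keeps the format shown in these captures, and the two sample lines are copied from the `show system resources` outputs in this deck.

```python
import re

# Matches the 'CPU states' line as it appears in the captures in this deck.
CPU_RE = re.compile(
    r"CPU states\s*:\s*([\d.]+)% user,\s*([\d.]+)% kernel,\s*([\d.]+)% idle"
)


def cpu_busy_percent(output: str) -> float:
    """Return 100 minus the idle percentage from 'show system resources'."""
    match = CPU_RE.search(output)
    if match is None:
        raise ValueError("no 'CPU states' line found")
    return round(100.0 - float(match.group(3)), 1)


BASELINE = "CPU states : 1.0% user, 8.9% kernel, 90.1% idle"
SPIKE = "CPU states : 26.7% user, 26.7% kernel, 46.5% idle"

if __name__ == "__main__":
    print("baseline busy %:", cpu_busy_percent(BASELINE))
    print("current busy %:", cpu_busy_percent(SPIKE))
```

Run against repeated captures, the same function gives a quick trend to set beside `show process cpu history`.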
NX-OS: High CPU
N5k-1# show process cpu history
1 1 1 1 1 1 11
789509607796857706878950694778698849688895079850886958858500
753105000482598603786430941227125016911055026100692801248500
100 ** * * * * * * * * * * **
90 ** ** * * * * * ** * * * ** * * * * *** * * **
80 *** ** * * * *** **** * * * *** * **** * ** *** * ** * **
70 *** ** **** * *** **** *** *** *** ****** **** *** * ** * **
60 *** ****************** *** ******* *********** ***** ** ****
50 ************************** ******* *************************
40 ************************************************************
30 ***********************************************************#
20 *##**#*******#***********#*#*#**#**##*###*###**##****#****##
10 ############################################################
0....5....1....1....2....2....3....3....4....4....5....5....
0 5 0 5 0 5 0 5 0 5
CPU% per minute (last 60 minutes)
* = maximum CPU% # = average CPU%
Note the difference between * (maximum CPU) and # (average CPU).
This is a completely normal-looking graph; focus instead on extended periods of high average CPU.
NX-OS: Ethanalyzer
Displaying and capturing control-plane frames with the built-in Ethanalyzer utility
Based on the Wireshark project, with an NX-OS command front end
Can display like tshark, or capture to a .pcap file to analyze elsewhere
Can be used on mgmt0 as well as eth3 and eth4, the low- and high-priority CPU queues
[Figure: CPU, South Bridge and NICs; eth0 to MGMT0, eth3 (low) and eth4 to the UPC; traffic types shown: ICMP, CFS, BPDU, CDP, LACPDU, ARP, DCBX]
NX-OS: Ethanalyzer example
N5k-1# ethanalyzer local interface mgmt write bootflash:managementCAP
Program exited with status 0.
N5k-1# dir bootflash: | inc management
1224 Apr 04 16:56:33 2011 managementCAP
N5k-1# ethanalyzer local read bootflash:managementCAP
2011-04-04 16:56:33.763150 172.18.118.165 -> 64.102.131.28 SSH Encrypted response packet len=68
2011-04-04 16:56:33.763527 172.18.118.165 -> 64.102.131.28 SSH Encrypted response packet len=52
2011-04-04 16:56:33.763968 172.18.118.165 -> 64.102.131.28 SSH Encrypted response packet len=52
2011-04-04 16:56:33.764391 172.18.118.165 -> 64.102.131.28 SSH Encrypted response packet len=52
2011-04-04 16:56:33.764811 172.18.118.165 -> 64.102.131.28 SSH Encrypted response packet len=52
2011-04-04 16:56:33.765230 172.18.118.165 -> 64.102.131.28 SSH Encrypted response packet len=52
2011-04-04 16:56:33.765649 172.18.118.165 -> 64.102.131.28 SSH Encrypted response packet len=52
2011-04-04 16:56:33.765928 64.102.131.28 -> 172.18.118.165 TCP 53538 > ssh [ACK] Seq=0 Ack=68 Win=65535 Len=0 TSV=597611264 TSER=19040186
2011-04-04 16:56:33.765930 64.102.131.28 -> 172.18.118.165 TCP 53538 > ssh [ACK] Seq=0 Ack=120 Win=65535 Len=0 TSV=597611264 TSER=19040186
2011-04-04 16:56:33.765932 64.102.131.28 -> 172.18.118.165 TCP 53538 > ssh [ACK] Seq=0 Ack=172 Win=65535 Len=0 TSV=597611264 TSER=19040186
Capture mgmt0 traffic and save to a file on bootflash
View capture files
Copy off for further analysis
NX-OS: Ethanalyzer example
N5k-1# ethanalyzer local interface inbound-hi capture-filter "not ip"
Capturing on eth4
wireshark-broadcom-rcpu-dissector: ethertype=0xde08, devicetype=0x0
2005-02-11 20:36:50.251412 00:0d:ec:d6:02:e4 -> 01:80:c2:00:00:00 STP RST. Root = 8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x809d
2005-02-11 20:36:50.252075 00:0d:ec:d6:02:e0 -> 01:80:c2:00:00:00 STP RST. Root = 8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x8099
2005-02-11 20:36:50.252204 00:0d:ec:d6:02:e1 -> 01:80:c2:00:00:00 STP RST. Root = 8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x809a
2005-02-11 20:36:50.252317 00:0d:ec:d6:02:e9 -> 01:80:c2:00:00:00 STP Conf. Root = 8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x80a2
2005-02-11 20:36:50.252426 00:0d:ec:d6:02:e8 -> 01:80:c2:00:00:00 STP RST. Root = 8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x80a1
2005-02-11 20:36:50.391691 00:0d:ec:d3:b5:f4 -> 01:80:c2:00:00:0e LLC U, func=UI; SNAP, OUI 0x00000C (Cisco), PID 0x0134
2005-02-11 20:36:50.803069 00:12:43:01:b0:98 -> 01:80:c2:00:00:00 STP Conf. Root = 8291/00:d0:03:62:4c:00 Cost = 0 Port = 0x8081
2005-02-11 20:36:52.251349 00:0d:ec:d6:02:e4 -> 01:80:c2:00:00:00 STP RST. Root = 8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x809d
2005-02-11 20:36:52.251366 00:0d:ec:d6:02:e0 -> 01:80:c2:00:00:00 STP RST. Root = 8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x8099
2005-02-11 20:36:52.251373 00:0d:ec:d6:02:e1 -> 01:80:c2:00:00:00 STP RST. Root = 8291/00:d0:03:62:4c:00 Cost = 2 Port = 0x809a
Capture high-priority traffic with capture-filter and display to terminal
NX-OS: Ethanalyzer and CPU
N5k-1# show system resources
Load average: 1 minute: 0.95 5 minutes: 1.54 15 minutes: 1.46
Processes : 281 total, 4 running
CPU states : 26.7% user, 26.7% kernel, 46.5% idle
Memory usage: 2073408K total, 1412172K used, 661236K free
N5k-1# show process cpu sort | exclude 0.0
PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
4230 398 5011881 0 22.0% snmpd
4204 1467 84869127 0 20.2% gatosusd
4226 433 5601856 0 5.5% statsclient
4264 1380 391510 3 3.7% ethpm
4302 254 103 2468 1.8% netstack
Using Ethanalyzer to aid in identifying external causes of high CPU utilization
NX-OS: Ethanalyzer and CPU
esc-n5020-1# show process cpu history
211111111131111111111121111111131111111114111111831112111111
002244240786947901001225201001390000110010000902910013010023
100
90 #
80 #
70 #
60 #
50 #
40 # # # #
30 # # # ##
20 # #### ## ## # # # ## #
10 ############################################################
0....5....1....1....2....2....3....3....4....4....5....5....
0 5 0 5 0 5 0 5 0 5
CPU% per second (last 60 seconds)
# = average CPU%
Baseline per second
NX-OS: Ethanalyzer and CPU
N5k-1# show process cpu history
1 1
754669098990899966777977656766876775178734455655456466545645
006186077990796258300801881187120477641015900150830621684070
100 ### ### ## #
90 ########### #
80 ########### # # # #
70 # ##################### ##### ## ###
60 # ################################# ### ## # ### #
50 #################################### ### ###################
40 #################################### ### ###################
30 #################################### #######################
20 ############################################################
10 ############################################################
0....5....1....1....2....2....3....3....4....4....5....5....
0 5 0 5 0 5 0 5 0 5
CPU% per second (last 60 seconds)
# = average CPU%
<continued>
Observed spike in CPU (per second)
NX-OS: Ethanalyzer and CPU
Baseline per minute
N5k-1# show process cpu history
1 1 1 1 1 1 11
789509607796857706878950694778698849688895079850886958858500
753105000482598603786430941227125016911055026100692801248500
100 ** * * * * * * * * * * **
90 ** ** * * * * * ** * * * ** * * * * *** * * **
80 *** ** * * * *** **** * * * *** * **** * ** *** * ** * **
70 *** ** **** * *** **** *** *** *** ****** **** *** * ** * **
60 *** ****************** *** ******* *********** ***** ** ****
50 ************************** ******* *************************
40 ************************************************************
30 ***********************************************************#
20 *##**#*******#***********#*#*#**#**##*###*###**##****#****##
10 ############################################################
0....5....1....1....2....2....3....3....4....4....5....5....
0 5 0 5 0 5 0 5 0 5
CPU% per minute (last 60 minutes)
* = maximum CPU% # = average CPU%
NX-OS: Ethanalyzer and CPU
1 1 1 1 1 1 1
899074676686870687895096077968577068789506947786988496888950
189068779462040167531050004825986037864309412271250169110550
100 *** * ** * * * * * * *
90 *** * * * ** ** * * * * * ** * * * ** * * *
80 ***** * * * * **** ** * * * *** **** * * * *** * **** *
70 ***** *** * *** **** ** **** * *** **** *** *** *** ****** *
60 **#** ************** ****************** *** ******* ********
50 *##**************************************** ******* ********
40 ###*#*******************************************************
30 ######******************************************************
20 #######******#****##**#*******#***********#*#*#**#**##*###*#
10 ############################################################
0....5....1....1....2....2....3....3....4....4....5....5....
0 5 0 5 0 5 0 5 0 5
CPU% per minute (last 60 minutes)
* = maximum CPU% # = average CPU%
We also notice a spike in average CPU over the past 5 minutes
NX-OS: Ethanalyzer and CPU
N5k-1# ethanalyzer local interface mgmt capture-filter "not host 10.116.114.157"
Capturing on eth0
wireshark-broadcom-rcpu-dissector: ethertype=0xde08, devicetype=0x0
2005-02-11 21:25:48.452632 172.18.118.162 -> 172.18.118.34 SNMP get-response
2005-02-11 21:25:48.455871 172.18.118.34 -> 172.18.118.162 SNMP get-next-request
2005-02-11 21:25:48.458120 172.18.118.162 -> 172.18.118.34 SNMP get-response
2005-02-11 21:25:48.459968 172.18.118.34 -> 172.18.118.162 SNMP get-next-request
2005-02-11 21:25:48.462428 172.18.118.162 -> 172.18.118.34 SNMP get-response
2005-02-11 21:25:48.464066 172.18.118.34 -> 172.18.118.162 SNMP get-next-request
2005-02-11 21:25:48.466903 172.18.118.162 -> 172.18.118.34 SNMP get-response
2005-02-11 21:25:48.468165 172.18.118.34 -> 172.18.118.162 SNMP get-next-request
2005-02-11 21:25:48.471662 172.18.118.162 -> 172.18.118.34 SNMP get-response
2005-02-11 21:25:48.472263 172.18.118.34 -> 172.18.118.162 SNMP get-next-request
Capturing on mgmt, we see there is an snmpwalk occurring
This should be a temporary condition and should not affect switching performance, but you may “feel” latency on the terminal
It could affect other control-plane transactions such as configuration backups, collection scripts, etc.
Now you can check with your network management team to work out when this is appropriate, or whether it is a mistake. A full walk is not very efficient to run regularly.
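Spotting a pattern such as the snmpwalk above is quicker with a per-talker count of the captured lines. This is a minimal sketch under stated assumptions: `summarize` is a hypothetical helper, and it relies on the brief text format shown in these captures (date, time, source, `->`, destination, protocol).

```python
from collections import Counter


def summarize(capture_text: str) -> Counter:
    """Count captured frames per (source, protocol) pair."""
    talkers = Counter()
    for line in capture_text.splitlines():
        fields = line.split()
        # Expected layout: date time src -> dst protocol ...
        if len(fields) < 6 or fields[3] != "->":
            continue
        talkers[(fields[2], fields[5])] += 1
    return talkers


# Sample lines copied from the capture above.
SAMPLE = """2005-02-11 21:25:48.452632 172.18.118.162 -> 172.18.118.34 SNMP get-response
2005-02-11 21:25:48.455871 172.18.118.34 -> 172.18.118.162 SNMP get-next-request
2005-02-11 21:25:48.458120 172.18.118.162 -> 172.18.118.34 SNMP get-response
2005-02-11 21:25:48.459968 172.18.118.34 -> 172.18.118.162 SNMP get-next-request"""

if __name__ == "__main__":
    for (source, protocol), count in summarize(SAMPLE).most_common():
        print(source, protocol, count)
```

A single source dominating the count with one protocol is a reasonable first hint about which external system to ask about.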
Nexus 5000/5500 Queuing
Nexus 5000/5500 utilize ingress queuing
Ingress queuing is helpful for data flows where many ports talk to few; the load is spread across the sources
A simple flow control mechanism can be implemented
End-to-end flow control is necessary for FCoE
Ingress queuing is implemented by Virtual Output Queuing (VOQ)
VOQ prevents head-of-line blocking
One egress interface can be congested, but the ingress buffer still accepts frames into the other queues
8 class-based unicast VOQs per egress interface on every ingress interface
8 class-based multicast VOQs per ingress interface
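The way VOQ avoids head-of-line blocking can be shown with a toy model. This is only a conceptual sketch, not the ASIC implementation: the `IngressPort` class and its methods are hypothetical, and scheduling, buffer limits and multicast are left out.

```python
from collections import defaultdict, deque


class IngressPort:
    """Toy ingress port with one queue per (egress port, class) pair."""

    def __init__(self):
        self.voq = defaultdict(deque)

    def enqueue(self, egress: str, qos_group: int, frame: str) -> None:
        self.voq[(egress, qos_group)].append(frame)

    def dequeue_ready(self, congested: set) -> list:
        """Send everything whose egress port is not congested."""
        sent = []
        for (egress, _qos_group), queue in self.voq.items():
            if egress in congested:
                continue  # only this egress port's queues wait
            while queue:
                sent.append(queue.popleft())
        return sent


if __name__ == "__main__":
    port = IngressPort()
    port.enqueue("e1/5", 0, "frame-A")  # destined to a congested port
    port.enqueue("e1/7", 0, "frame-B")  # destined to an idle port
    print(port.dequeue_ready(congested={"e1/5"}))
```

With a single shared FIFO, frame-B would have waited behind frame-A; with one queue per egress port it is sent immediately while frame-A stays queued.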
Nexus 5000/5500 Queuing
Ingress queuing implication on troubleshooting:
Drops occur at INGRESS!
You must think about where the flow originates on the switch to determine where you would like to look for drops.
Nexus 5000/5500 Queuing
N5k-1# show queuing interface e1/5
Ethernet1/5 queuing information:
TX Queuing
qos-group sched-type oper-bandwidth
0 WRR 50
1 WRR 50
RX Queuing
qos-group 0
q-size: 243200, HW MTU: 1600 (1500 configured)
drop-type: drop, xon: 0, xoff: 1520
Statistics:
Pkts received over the port : 100882627
Ucast pkts sent to the cross-bar : 100877529
Mcast pkts sent to the cross-bar : 0
Ucast pkts received from the cross-bar : 786990
Pkts sent to the port : 692821
Pkts discarded on ingress : 5098
Per-priority-pause status : Rx (Inactive), Tx (Inactive)
Ingress discards are present when buffering is not sufficient for the traffic flow.
For example – 2 interfaces transmitting toward 1 interface in sustained oversubscription.
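When checking many ingress ports, the discard counter can be pulled out of the output programmatically. The sketch below assumes the statistics labels appear exactly as in the capture above; the helper name is hypothetical.

```python
import re


def ingress_discards(output: str):
    """Return the 'Pkts discarded on ingress' value, or None if absent."""
    match = re.search(r"Pkts discarded on ingress\s*:\s*(\d+)", output)
    return int(match.group(1)) if match else None


# Two statistics lines copied from the capture above.
SAMPLE = """Pkts received over the port : 100882627
Pkts discarded on ingress : 5098"""

if __name__ == "__main__":
    print(ingress_discards(SAMPLE))
```

Because drops occur at ingress, this would be run against the ports where the flow enters the switch rather than the congested egress port.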
Nexus 5000/5500 Queuing: Scenario
[Figure: topology with Server A, N5k-1 (e1/1, e1/5), Trunk, N5k-2 (e1/5, e1/3), Server B]
Server A is sending some traffic toward Server B
Both servers have had static ARP entries applied for troubleshooting
Server B does not see traffic from Server A when sniffing locally
They are both configured to be in the same VLAN
Nexus 5000/5500 Queuing: Scenario
Start at the ingress interface facing Server A
N5k-1# show hardware internal gatos port e1/1 | grep "gatos i"
gatos instance : 7
gatos iport : 2
-----------------------------------------------------------------
N55k-1# show hardware internal carmel port e1/1 | grep "carmel i"
carmel instance : 0
carmel iport : 1
Nexus 5000: “gatos”
Nexus 5500: “carmel”
For this example, we will use Nexus 5000 outputs, but on a Nexus 5500 you can substitute carmel for gatos, as the two are laid out in a similar architecture.
The actual counters and errors may vary; the methodology does not.
Front Panel Internal
e1/1 7:2
e1/5 7:1
Nexus 5000/5500 Queuing: Scenario
Start at the ingress interface facing Server A
N5k-1# show platform fwm info pif e1/1 | grep stats
Eth1/1 pd: tx stats: bytes 147694477 frames 0 discard 0 drop 0
Eth1/1 pd: rx stats: bytes 26022500 frames 0 discard 0 drop 0
Eth1/1 pd fcoe: tx stats: bytes 0 frames 0 discard 0 drop 0
Eth1/1 pd fcoe: rx stats: bytes 0 frames 0 discard 0 drop 0
Front Panel Internal
e1/1 7:2
e1/5 7:1
These outputs are clean
Nexus 5000/5500 Queuing: Scenario
N5k-1# show platform fwm info asic-errors 7
Printing non zero Gatos error registers:
N5k-1# show hardware internal gatos asic 7 counters interrupt
Gatos 7 interrupt statistics:
Interrupt name |Count |ThresRch|ThresCnt|Ivls
Front Panel Internal
e1/1 7:2
e1/5 7:1
These outputs are also clean
Move on to the egress interface e1/5
In this case, e1/5 is on the same ASIC, so we have already gathered the output needed
Nexus 5000/5500 Queuing: Scenario
N5k-1# show platform fwm info pif e1/5 | grep stats
Eth1/5 pd: tx stats: bytes 476497477 frames 0 discard 0 drop 0
Eth1/5 pd: rx stats: bytes 232322392 frames 0 discard 0 drop 0
Eth1/5 pd fcoe: tx stats: bytes 0 frames 0 discard 0 drop 0
Eth1/5 pd fcoe: rx stats: bytes 0 frames 0 discard 0 drop 0
Front Panel Internal
e1/1 7:2
e1/5 7:1
These outputs are clean
Nexus 5000/5500 Queuing: Scenario
N5k-1# show platform fwm info pif e1/5 | grep stats
Eth1/5 pd: tx stats: bytes 332298390 frames 0 discard 0 drop 0
Eth1/5 pd: rx stats: bytes 176797274 frames 0 discard 0 drop 208
Eth1/5 pd fcoe: tx stats: bytes 0 frames 0 discard 0 drop 0
Eth1/5 pd fcoe: rx stats: bytes 0 frames 0 discard 0 drop 0
208 drops seen in the receive direction on port e1/5
Next we try to find the reason for these drops
Nexus 5000/5500 Queuing Scenario
N5k-1# show platform fwm info asic-errors 7
Printing non zero Gatos error registers:
DROP_SRC_VLAN_MBR: res0 = 624 res1 = 0
DROP_SRC_VLAN_MBR is 624
This counter is 3x the number of frame drops - hardware caveat
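The 3x caveat can be checked against the 208 receive drops seen on e1/5. The short Python sketch below only does that arithmetic; the function name and the default factor of 3 are taken from the statement on this slide, not from any published register documentation.

```python
def frames_from_register(res0, factor=3):
    """Convert the raw register value to frames, using the 3x caveat from the slide."""
    frames, remainder = divmod(res0, factor)
    return frames, remainder


frames, remainder = frames_from_register(624)
print(frames, remainder)  # 208 0
```

A remainder of 0 and a result of 208 mean the register and the pif drop counter are describing the same frames.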
Nexus 5000/5500 Queuing Scenario
N5k-1# show hardware internal gatos asic 7 counters interrupt
...
gat_lu_lkup1_INT_func_lo_drop_src_vlan_mbr|74 |
...
Interrupt counters will agree that a given error has fired from the hardware, but the number is in hex, and not every interrupt is recorded because of the rate at which interrupts can hit the CPU. Generally this number will be somewhat less than the drop count from show platform fwm info pif.
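As a worked example, the count in the line above can be converted from hex. This is a small Python sketch that assumes the second pipe-separated field is the Count column, as the header of the interrupt output suggests; parse_interrupt_line is a made-up helper name.

```python
def parse_interrupt_line(line):
    """Split 'name|count_hex |...' and return (name, count as a decimal int)."""
    fields = [f.strip() for f in line.split("|")]
    name, count_hex = fields[0], fields[1]
    return name, int(count_hex, 16)


name, count = parse_interrupt_line("gat_lu_lkup1_INT_func_lo_drop_src_vlan_mbr|74 |")
print(name, count)  # gat_lu_lkup1_INT_func_lo_drop_src_vlan_mbr 116
```

Hex 74 is 116 decimal, which is less than the 208 drops reported by the pif counters, as expected.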
Nexus 5000/5500 Queuing Scenario
From the outputs gathered, we can say either STP is blocking or the VLAN is not allowed
The configs confirm the VLAN is not allowed
Use this same methodology to find counters incrementing with your dropped traffic. Where the numbers increment, you can find a reason
Various scenarios cause drops, and the register list is not publicly available. Open a TAC case for scenarios with conflicting or confusing output.
N5k-1# interface Ethernet1/5
switchport mode trunk
switchport trunk allowed vlan 100-102
N5k-1# interface Ethernet1/5
switchport mode trunk
switchport trunk allowed vlan 100-103
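Spotting a one-VLAN difference by eye is easy here but harder with long allowed lists. The sketch below is a hedged Python example that expands two allowed-VLAN strings and diffs them. The slide does not say which snippet belongs to which side or point in time, so the code simply compares the two lists; expand_vlans is a made-up helper name.

```python
def expand_vlans(spec):
    """Expand an allowed-VLAN string such as "100-102,200" into a set of ints."""
    vlans = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            vlans.update(range(int(low), int(high) + 1))
        else:
            vlans.add(int(part))
    return vlans


list_a = expand_vlans("100-102")  # first snippet above
list_b = expand_vlans("100-103")  # second snippet above
print(sorted(list_b - list_a))  # [103]
```

VLAN 103 is carried by one list and not the other, which matches the DROP_SRC_VLAN_MBR reason found in the counters.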
Troubleshooting Nexus 5000 / 2000
Problem Isolation
Platform Overview and troubleshooting
NX-OS Operation
Crashes
Nexus 5000
CRC errors
Ethanalyzer / CPU
Queuing and forwarding
Spanning-tree
Nexus 2000
Redundancy operation and troubleshooting
Spanning-tree
NX-OS keeps a long history of STP states
Usually you can trace back the change that caused an outage, as long as it has not wrapped in the logs.
STP logs shouldn't normally wrap unless there are constant topology changes.
It is also a good idea to log STP at level 6:
N5k-2(config)# logging level spanning-tree 6
N5k-2# 2011 Jan 21 01:58:23 N5k-2 %STP-6-PORT_ROLE: Port port-channel14 instance VLAN007 role changed to designated
Spanning-tree
N5k-1# show spanning-tree internal event-history all
-------------------- All the active STPs -----------
VDC01 VLAN0001
0) Transition at 848207 usecs after Thu Jan 13 05:05:54 2005
Root: 0000.0000.0000.0000 Cost: 0 Age: 0 Root Port: none Port: none [STP_TREE_EV_UP]
1) Transition at 367168 usecs after Thu Jan 13 05:05:57 2005
Root: 8001.000d.ecd6.02fc Cost: 0 Age: 0 Root Port: none Port: Ethernet1/15 [STP_TREE_EV_UPDATE_TOPO_RCVD_SUP_BPDU]
2) Transition at 373395 usecs after Thu Jan 13 05:05:57 2005
Root: 2063.00d0.0362.4c00 Cost: 2 Age: 1 Root Port: Ethernet1/15 Port: none [STP_TREE_EV_MULTI_FLUSH_LOCAL]
3) Transition at 434563 usecs after Thu Jan 13 05:06:00 2005
Root: 2063.00d0.0362.4c00 Cost: 2 Age: 1 Root Port: Ethernet1/15 Port: Ethernet1/15 [STP_TREE_EV_MULTI_FLUSH_RCVD]
Checking all trees
Spanning-tree
N5k-1# show spanning-tree internal event-history tree 1 brief
2005:01:13 05h:05m:54s:848207us T_EV_UP VLAN0001 [0000.0000.0000.0000 C 0 A 0 R none P none]
2005:01:13 05h:05m:57s:367168us T_UT_SBPDU VLAN0001 [8001.000d.ecd6.02fc C 0 A 0 R none P Eth1/15]
2005:01:13 05h:05m:57s:373395us T_EV_M_FLUSH_L VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P none]
2005:01:13 05h:06m:00s:434563us T_EV_M_FLUSH_R VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P Eth1/15]
2005:01:13 05h:06m:01s:407259us T_EV_M_FLUSH_R VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P Eth1/15]
2005:01:13 05h:06m:02s:947220us T_EV_M_FLUSH_R VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P Eth1/15]
2005:01:13 05h:06m:04s:947216us T_EV_M_FLUSH_R VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P Eth1/15]
2005:01:13 05h:06m:06s:947457us T_EV_M_FLUSH_R VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P Eth1/15]
2005:01:13 05h:06m:08s:837586us T_EV_M_FLUSH_R VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P Eth1/15]
... or just the tree you are interested in
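A long brief history is easier to read once it is summarized. The following Python sketch is an illustration only: it assumes the brief output has been saved as text in the layout shown above, and it counts events by type and measures the time span they cover. The regular expression and function name are assumptions made for this example.

```python
import re
from collections import Counter
from datetime import datetime

# A few lines from the brief output above.
SAMPLE = """\
2005:01:13 05h:05m:54s:848207us T_EV_UP VLAN0001 [0000.0000.0000.0000 C 0 A 0 R none P none]
2005:01:13 05h:05m:57s:367168us T_UT_SBPDU VLAN0001 [8001.000d.ecd6.02fc C 0 A 0 R none P Eth1/15]
2005:01:13 05h:05m:57s:373395us T_EV_M_FLUSH_L VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P none]
2005:01:13 05h:06m:00s:434563us T_EV_M_FLUSH_R VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P Eth1/15]
2005:01:13 05h:06m:01s:407259us T_EV_M_FLUSH_R VLAN0001 [2063.00d0.0362.4c00 C 2 A 1 R Eth1/15 P Eth1/15]
"""

# Assumed layout: "YYYY:MM:DD HHh:MMm:SSs:UUUUUUus EVENT TREE [..."
EVENT_RE = re.compile(
    r"^(\d{4}):(\d{2}):(\d{2}) (\d{2})h:(\d{2})m:(\d{2})s:(\d{6})us (\S+) (\S+) \["
)


def parse_events(text):
    """Return a list of (timestamp, event_name, tree) tuples."""
    events = []
    for line in text.splitlines():
        m = EVENT_RE.match(line)
        if m is None:
            continue
        when = datetime(*(int(g) for g in m.groups()[:7]))
        events.append((when, m.group(8), m.group(9)))
    return events


events = parse_events(SAMPLE)
counts = Counter(name for _, name, _ in events)
span = (events[-1][0] - events[0][0]).total_seconds()
print(counts.most_common())
print(f"{len(events)} events in {span:.6f} s")
```

A tree whose history is dominated by flush events over a short span is a good candidate for the change that caused an outage.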
Troubleshooting Nexus 5000 / 2000
Problem Isolation
Platform Overview and troubleshooting
NX-OS Operation
Crashes
Nexus 5000
Nexus 2000
Management
Queuing and forwarding
Logs
FEX Management
FEX fabric interfaces run SDP (Satellite Discovery Protocol)
You can view the status of a FEX and see some logs from the N5k:
N5k-1# show fex 100
FEX: 100 Description: FEX0100 state: Online
FEX version: 5.0(3)N1(1b) [Switch version: 5.0(3)N1(1b)]
Extender Model: N2K-C2148T-1GE, Extender Serial: JAF1326BBRC
Part No: 73-12009-05
pinning-mode: static Max-links: 1
Fabric port for control traffic: Eth1/3
Fabric interface state:
Eth1/3 - Interface Up. State: Active
FEX Management
N5k-1# show fex 100 detail
FEX: 100 Description: FEX0100 state: Online
FEX version: 5.0(3)N1(1b) [Switch version: 5.0(3)N1(1b)]
FEX Interim version: 5.0(3)N1(1b)
Switch Interim version: 5.0(3)N1(1b)
Extender Model: N2K-C2148T-1GE, Extender Serial: JAF1326BBRC
Part No: 73-12009-05
Card Id: 70, Mac Addr: 00:0d:ec:d3:b5:c2, Num Macs: 64
Module Sw Gen: 21 [Switch Sw Gen: 21]
post level: complete
...
Logs:
02/02/2005 13:09:06.946120: Module register received
02/02/2005 13:09:06.947614: Image Version Mismatch
02/02/2005 13:09:06.947960: Registration response sent
02/02/2005 13:09:06.948392: Requesting satellite to download image
02/02/2005 13:14:54.149480: Image preload successful.
02/02/2005 13:14:55.375447: Deleting route to FEX
02/02/2005 13:14:55.384270: Module disconnected
02/02/2005 13:14:55.386372: Module Offline
02/02/2005 13:16:52.847574: Module register received
02/02/2005 13:16:52.849146: Registration response sent
02/02/2005 13:16:53.419079: Module Online Sequence
02/02/2005 13:17:09.507541: Module Online
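The timestamps in these logs show how long each phase of the image upgrade took. The Python sketch below is a minimal example under stated assumptions: the log lines are saved as text in the format above, the date is read as month/day/year (it makes no difference for 02/02), and the helper names are invented for illustration.

```python
import re
from datetime import datetime

# Four of the log lines shown above.
LOG = """\
02/02/2005 13:09:06.948392: Requesting satellite to download image
02/02/2005 13:14:54.149480: Image preload successful.
02/02/2005 13:14:55.386372: Module Offline
02/02/2005 13:17:09.507541: Module Online
"""

LOG_RE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\.\d{6}): (.*)$")


def parse_fex_log(text):
    """Return a list of (timestamp, message) tuples."""
    entries = []
    for line in text.splitlines():
        m = LOG_RE.match(line.strip())
        if m:
            # Month/day order is an assumption about the log format.
            when = datetime.strptime(m.group(1), "%m/%d/%Y %H:%M:%S.%f")
            entries.append((when, m.group(2)))
    return entries


def elapsed_seconds(entries, start_msg, end_msg):
    """Seconds from the first start_msg to the first end_msg at or after it."""
    start = next(t for t, msg in entries if msg == start_msg)
    end = next(t for t, msg in entries if msg == end_msg and t >= start)
    return (end - start).total_seconds()


entries = parse_fex_log(LOG)
download = elapsed_seconds(
    entries, "Requesting satellite to download image", "Image preload successful."
)
reload_time = elapsed_seconds(entries, "Module Offline", "Module Online")
print(f"image download: {download:.1f} s, offline to online: {reload_time:.1f} s")
```

For the log above, the image download takes a little under six minutes and the FEX is offline for a little over two minutes while it reloads.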
FEX Management
N5k-1# show system internal fex log fport e1/3
Satmgr debug messages for If 0x1a002000:
[19952]02/02/2005 13:08:32.191646: if [0x1a002000]:Phy cleanup rcvd
[19956]02/02/2005 13:08:32.192257: fport [0x1a002000]:Log - Interface Down
[19957]02/02/2005 13:08:32.192266: fport [0x1a002000]:satmgr_fport_fsm: even:t Port Down. curr state: Discovered
[19958]02/02/2005 13:08:32.192654: fport [0x1a002000]:Log - State changed to: Created
[19962]02/02/2005 13:08:32.192853: fport [0x1a002000]:satmgr_fport_fsm: new state: Created
[19967]02/02/2005 13:08:32.193991: fport [0x1a002000]:Log - fport phy cleanup retry end: sending out resp
[19970]02/02/2005 13:08:32.206315: if [0x1a002000]:Pre Cfg rcvd
[19971]02/02/2005 13:08:32.206606: fport [0x1a002000]:Log - pre config: is not a port-channel member
[19977]02/02/2005 13:08:33.727893: fport [0x1a002000]:Log - Interface Up
[19978]02/02/2005 13:08:33.727904: fport [0x1a002000]:satmgr_fport_fsm: even:t Port Down. curr state: Created
[19982]02/02/2005 13:08:33.729944: fport [0x1a002000]:Log - Port Bringup rcvd
[19986]02/02/2005 13:08:33.731201: fport [0x1a002000]:Log - Suspending Fabric port. reason: Fex not configured
[19987]02/02/2005 13:08:33.731216: fport [0x1a002000]:Log - fport bringup retry end: sending out resp
[19997]02/02/2005 13:08:34.120031: fport [0x1a002000]:Log - Fcot message sent to Ethpm
[19998]02/02/2005 13:08:34.120092: fport [0x1a002000]:Log - Satellite discovered msg sent
[19999]02/02/2005 13:08:34.120459: fport [0x1a002000]:Log - State changed to: Discovered
FEX Drops
Network interface drops can be seen from N5k “show queuing interface” as of 5.0(3)N1(1)
Best to “attach” to FEX to get detailed logs
Similar to Cat 6k or Nexus 7k linecard commands
It is important to check here because FEXes also have crash logs and their own CPU, and they are responsible for communicating link state and offloading some protocols like CDP.
N5k-1# attach fex 100
Attaching to FEX 100 ...
To exit type 'exit', to abort type '$.'
fex-100#
FEX Drops
The scenario we are looking for is big pipe to little pipe or many to one.
Know the flow of traffic! If you know the pattern, finding where it is likely to stress the network will be easier.
10G to 1G is especially difficult to buffer. The FEX is the last stop where 10G traffic is buffered for your 1G hosts, so drops are likely to occur here and not elsewhere in your 10G network.
FEX queue-limit and buffer-threshold can be adjusted globally, per FEX type, or per FEX
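A rough calculation shows why the big-pipe-to-little-pipe case is hard. The Python sketch below uses a deliberately simplified model (one burst arriving at line rate, a single egress port draining at line rate, no flow control), and the burst and buffer sizes are hypothetical numbers chosen for illustration, not FEX specifications.

```python
TEN_G = 10_000_000_000  # ingress rate in bits per second
ONE_G = 1_000_000_000   # egress rate in bits per second


def peak_queue_bytes(burst_bytes, in_bps, out_bps):
    """Bytes still queued at the moment a line-rate burst finishes arriving."""
    if out_bps >= in_bps:
        return 0  # egress keeps up, nothing accumulates
    return burst_bytes * (in_bps - out_bps) // in_bps


def max_burst_bytes(buffer_bytes, in_bps, out_bps):
    """Largest line-rate burst that fits in the buffer without a drop."""
    if out_bps >= in_bps:
        raise ValueError("egress is not slower than ingress; no burst limit")
    return buffer_bytes * in_bps // (in_bps - out_bps)


# Hypothetical 1 MB burst from a 10G sender to a 1G host.
print(peak_queue_bytes(1_000_000, TEN_G, ONE_G))  # 900000
# Hypothetical 450 KB of buffer available to that port.
print(max_burst_bytes(450_000, TEN_G, ONE_G))     # 500000
```

In this model 90% of a 10G burst has to sit in the buffer while the 1G port drains, so the buffer available to the port limits the burst size almost one for one.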
FEX Drops: 2148
fex-100# dbgexec rw
rw> show ints <0-6>
ASIC: 0:
+-------+--------------------------+--------------+-----------+-----------+-----------+
| ASIC | Interrupt Bit Field | Count1 | Thresh1 | Count2 | Thresh2 |
| Port | | | | | |
+-------+--------------------------+--------------+-----------+-----------+-----------+
| 0-NI1 | not_synced_lane_3 | 1 | 0 | 0 | 1 |
| 0-NI1 | not_synced_lane_2 | 1 | 0 | 0 | 1 |
| 0-NI1 | not_synced_lane_0 | 1 | 0 | 0 | 1 |
| 0-NI1 | synced_lane_3 | 1 | 0 | 0 | 1 |
| 0-NI1 | synced_lane_2 | 1 | 0 | 0 | 1 |
| 0-NI1 | synced_lane_1 | 1 | 0 | 0 | 1 |
| 0-NI1 | synced_lane_0 | 1 | 0 | 0 | 1 |
| 0-NI1 | loc_fault | 1 | 0 | 0 | 1 |
| 0-NI1 | not_aligned | 1 | 0 | 0 | 1 |
| 0-NI1 | aligned | 1 | 0 | 0 | 1 |
+-------+--------------------------+--------------+-----------+-----------+-----------+
This output is clean: there are no wo_cr counters. Note that the command shows only non-zero counters.
wo_cr indicates the buffer is “without credit”
FEX Drops: 2148
rw> drops <0-6> hi<0-8>
Dropped packet counters for 0-HI0:
red_hix_cnt_rx_allow_vntag_drop : 0
red_hix_cnt_rx_echannel_drop : 0
red_hix_cnt_rx_fwd_drop : 0
red_hix_cnt_rx_mc_drop : 0
red_hix_cnt_rx_runt_pkt_drop : 0
red_hix_cnt_rx_src_vif_out_of_range_drop: 0
red_hix_cnt_tx_lb_drop : 11892
0-SS0 DDROP counters:
OQ0: Class0: 0 Class1: 0 Class2: 0 Class3: 0
OQ1: Class0: 0 Class1: 0 Class2: 0 Class3: 0
OQ2: Class0: 0 Class1: 0 Class2: 0 Class3: 0
OQ3: Class0: 0 Class1: 0 Class2: 0 Class3: 0
OQ4: Class0: 0 Class1: 0 Class2: 0 Class3: 0
0-SS0 ECC1: 0 ECC2: 0
0-SS0 wo_cr: 0 no cells: 0 mtu_vio: 0
FEX Drops: 2248
N5k-1# attach fex 130
fex-130# dbgexec satctrl
satctrl/qosctrl> show port 0 0 2 <0-3>
* Uplink interfaces queue on ingress
...
Rx Discard (WR_DISC): 0
Rx Multicast Discard (WR_DISC_MC): 0
Rx Error (WR_RCV_ERR): 0
...
This output is clean: there are no wr_disc or wr_rcv_err counts.
FEX Drops: 2248
satctrl/qosctrl> show asic 0 0
SS Statistics:
SS No Credit* No Cells MTU Error OQ Discard Free Cells
---+-----------+-----------+-----------+-----------+----------
0 0 0 0 0 10213
1 0 0 0 0 10213
...
Dropped packets per CoS due to OQ head-drop, OQ is per 8 port group:
OQ CoS 0 CoS 1 CoS 2 CoS 3 CoS 4 CoS 5 CoS 6 CoS 7
----+----------+----------+----------+----------+----------+----------+----------+-----------
NR0 0 0 0 0 0 0 0 0
NR1 0 0 0 0 0 0 0 0
NR2 0 0 0 0 0 0 0 0
NR3 0 0 0 0 0 0 0 0
NR4 0 0 0 0 0 0 0 0
NR5 0 0 0 0 0 0 0 0
----+----------+----------+----------+----------+----------+----------+----------+-----------
HR0 0 0 0 0 0 0 0 0
HR1 0 0 0 0 0 0 0 0
HR2 0 0 0 0 0 0 0 0
HR3 0 0 0 0 0 0 0 0
HR4 0 0 0 0 0 0 0 0
HR5 0 0 0 0 0 0 0 0
FEX Drops: 2248
fex-130# dbgexec prt
prt> drops
PRT_SS_CNT_TAIL_DROP8 : 2 SS0
prt> show rmon 0 ni<0-3>
+----------------------+----------------------+-----------------+----------------------+----------------------+-----------------+
| TX | Current | Diff | RX | Current | Diff |
+----------------------+----------------------+-----------------+----------------------+----------------------+-----------------+
| TX_PKT_LT64 | 0| 0| RX_PKT_LT64 | 0| 0|
| TX_PKT_64 | 5| 1| RX_PKT_64 | 8| 0|
| TX_PKT_65 | 2062219| 264039| RX_PKT_65 | 4073560| 521532|
| TX_PKT_128 | 2149866| 274780| RX_PKT_128 | 2060397| 263419|
| TX_PKT_256 | 1920669| 245601| RX_PKT_256
...
The rmon counters are similar to the “counters detailed” output on the N5k ports and are helpful for error tracking and for finding packets of a certain size.
They update immediately, whereas “show counters” on the N5k waits for the statsclient.
FEX Logs
attach fex <n>
dbgexec rw/prt (rw=2148, prt=2248)
show ctx – driver information
show oper – link states for L1 status
show elog – event log chronicling hardware and software interaction, helpful for L1 issues
show ints – interrupt counters
show bootlog – bootup messages
show log – any other logs