Troubleshooting Methods for UCS Customer POCs and Labs

© 2012 Cisco and/or its affiliates. All rights reserved. Cisco Confidential. Internal Only - Do not Distribute


August 2012


Agenda
• Why would we need this presentation?
• Overview of some recurring items to address
   - Infrastructure items
• Adapter and IOM systems troubleshooting
• Server systems troubleshooting
• Operating systems troubleshooting
• Chassis systems troubleshooting
• Fabric Interconnect systems troubleshooting


Presentation Goals
• Often there are “lessons learned” that can be shared to simplify the POC process
• Some of the known bugs and operational details get lost in the depths of planning customer scenarios
• This is not a review of how to develop a testing plan, nor a script for running a POC
• The key goal is to share information so that we put our best foot forward
• All required UCS system training and real-world hands-on experience is assumed
• This is a living presentation: the goal is to keep it updated with these lessons learned and common bug issues, and to present it to the field on request


Basic Connection Points
• Data plane and many control plane functions run as an Active/Active cluster
• User access to UCSM is Active/Standby, through the Virtual IP (VIP)
• These management connections are where the blade CIMC connections are reached via the unified I/O of UCS
• Blade CIMC addresses are actually NAT entries on the mgmt port
• Use the UCSM client (Centrale) or the CLI to manage and troubleshoot
[Figure: management addressing, labeled UCS VIP, FI-A IP, and FI-B IP]


What Runs on NX-OS Kernel
• The fundamental items on a UCS are the Managed Information Tree (MIT) and the Application Gateways (AGs) that do the work
• This is for any UCS form factor device
[Figure: XML API, MIT, and the AGs listed below]
• Blade/Rack AG: BIOS, RAID, CPLD, boot method, BMC setup, alerting, etc.
• NIC AG: # NICs, networks to tie in, QoS and security policy; # HBAs, VSANs to tie in, QoS and security policy, etc.
• Switch AG: Ether port networks, QoS policy, security policy, linkages to server NICs, network segments, etc.
• Fabric AG: storage segments, VSAN mappings, F port trunking, F port channeling, zoning **, etc.
• Other AGs: VMM AG, etc.


Components on the UCS


Following the Progress of AG Work
• A given process runs through many stages (FSM stages)
• Some stages can be skipped if unneeded, or depending on the type of action (shallow vs. deep)
• Almost all actions contain a verification step that the action completed
• Logs are retained
• View and monitor


FSM Return Codes and System Faults
• These feed into the normal fault policy of UCSM
   - FSM faults are just one type; refer to the link below for a listing of types
   - Highly recommend at least becoming familiar with the layout of the UCS faults and error message reference at the URL below
• Severity can change over the life of a fault
• For POC labs, recommend eliminating Critical, Major, and Minor faults
   - Others will be present in the normal course of all the actions waiting and being performed

http://www.cisco.com/en/US/partner/docs/unified_computing/ucs/ts/faults/reference/ErrMess.html


UCS 2xxx IO Module (Fabric Extender)
• Each UCS IOM in a UCS 5100 Blade Server Chassis is connected to a 6000 Series Fabric Interconnect for redundancy or bandwidth aggregation
• The Fabric Extender provides 10GE ports to the Fabric Interconnect
• Link physical health and chassis discovery occur over these links
[Figure: back of a UCS 5100 Series Blade Server Chassis, with its UCS 2xxx Series IOMs and UCS 6000 Series FI A and FI B]


Interface Statistics and Reports
• Various points of monitoring and visibility
[Figure: statistics view, labeled IOM]


Statistics Breakdown
• Visibility into many counters within UCS
• These are “count up” counters with raw numbers
   - Use the Delta to monitor the changes over the collection intervals
[Figure: statistics screens, labeled History and Live/now]
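The delta idea can be sketched in a few lines of Python. This is only an illustration of turning raw count-up values into per-interval changes; the function name and the sample values are hypothetical and are not taken from UCSM.

```python
from typing import List


def counter_deltas(samples: List[int]) -> List[int]:
    """Return the change between consecutive samples of a count-up counter."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


# Hypothetical raw values of one error counter over four collection intervals.
rx_errors = [1200, 1200, 1215, 1215]

# A non-zero delta marks the interval in which the counter actually moved.
print(counter_deltas(rx_errors))
```

A large raw number alone says little in a POC; an interval with a non-zero delta is what points at the time the problem occurred.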


Recurring Items – UCS Infrastructure
• Setting a system baseline
   - Most of our initial issues in CPOC situations are due to firmware issues
   - All system components must be on the same firmware package version
   - Host and mgmt firmware policies are excellent tools for this, rather than going server by server
   - [Figure: viewing the components of a package]
   - When demo FIs arrive, individually set them to a common UCSM version and erase the configurations before attempting to join them in a cluster
   - First get the package onto the UCS
   - Create the right FW packages for the POC
   - Conformance to a package can be checked in a single screen
   - Can upgrade via the bundle mechanism at the POC start
   - The bundle option covers both update and activate, and handles the whole upgrade
   - This is totally disruptive, so don’t use this method during the POC (after staging)


Recurring Items – UCS Infrastructure
• Upgrade prep, checkpoints, cleanup: when uptime is key (not a POC)
   - Implement a management interface monitoring policy
   - Prior to upgrading one fabric, disable all of its upstream data and FC ports
   - Disable its mgmt interface also (this keeps KVM traffic on the fabric that will not be taken down)
   - This forces traffic to the fabric that will stay up (and allows quick recovery if there is an error)
   - Upgrade the fabric, then restore the uplinks and the mgmt interface
   - Repeat on the peer fabric, but only after the cluster state shows HA_READY: in the CLI, connect to local management and run “show cluster extended-state”
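The cluster-state checkpoint before touching the peer fabric can be scripted as a simple text check. The sketch below assumes the cluster state output has already been captured as a string; the function name is hypothetical, and the exact spelling of the ready marker ("HA READY" or "HA_READY") is an assumption, so both are accepted.

```python
import re


def is_ha_ready(cluster_output: str) -> bool:
    """Return True if the captured text reports an HA-ready cluster.

    Accepts "HA READY" and "HA_READY" (spelling is an assumption) and
    deliberately does not match "HA NOT READY".
    """
    return re.search(r"\bHA[ _]READY\b", cluster_output) is not None


# Hypothetical captured text, for illustration only.
captured = "A: UP, PRIMARY\nB: UP, SUBORDINATE\nHA READY"
if is_ha_ready(captured):
    print("Cluster is HA ready: safe to start on the peer fabric")
else:
    print("Cluster is not HA ready: do not upgrade the peer fabric yet")
```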


Recurring Items – UCS Infrastructure
• Discovery policy vs. re-acknowledgement behavior
   - The discovery policy is just that: a floor on the number of links required before a chassis will be discovered
   - The link policy dictates bringing up port-channels from the IOMs to the FIs, after discovery
   - The chassis must then be re-acknowledged (disruptive to blade connectivity) for all connections beyond those used for discovery to be put into use
   - Always re-acknowledge the chassis after it is discovered, and after any cabling changes


Recurring Items – UCS Infrastructure
• Multicast behavior
   - In all current versions, IGMP snooping is enabled and cannot be turned off
   - Only the 224.0.0.X range is flooded within the UCS
   - This is fundamentally different from traditional switches, which flood
   - An upstream PIM router or IGMP snooping querier is needed for proper multicast flow beyond the new-flow timeout (~180 seconds)
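As a quick aid when checking which groups a test application uses, the flooding rule above can be expressed with Python's standard `ipaddress` module. This is a sketch only: the function name is hypothetical, and it treats "224.0.0.X" as the 224.0.0.0/24 block.

```python
import ipaddress

# "224.0.0.X" from the slide, interpreted here as 224.0.0.0/24.
FLOODED_RANGE = ipaddress.ip_network("224.0.0.0/24")


def is_flooded_in_ucs(group: str) -> bool:
    """Return True if the group falls in the range that UCS floods.

    Any other multicast group depends on IGMP snooping state, and so on an
    upstream PIM router or IGMP snooping querier.
    """
    addr = ipaddress.ip_address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    return addr in FLOODED_RANGE


for group in ("224.0.0.5", "239.1.1.1"):
    verdict = "flooded" if is_flooded_in_ucs(group) else "needs a querier upstream"
    print(group, verdict)
```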


Recurring Items – UCS Infrastructure
• It is always best, as preparation, to review the release notes
• This is the PRIMARY method we use to notify the field of issues to be aware of
• They can be large given the product breadth, but for a POC or install they are a great starting step


Adapter and IOM Systems Troubleshooting
[Figure: interface listing displayed from FI “A”, annotated with the internal VLAN interface for management, the ports to the blade adapters, the 10 Gig links to chassis 1, and the 10 Gig links to chassis 2]

CiscoLive-A# connect nxos
CiscoLive-A(nxos)# show interface brief
CiscoLive-A(nxos)# show interface fex-fab

Eth X/Y/Z, where:
   X = chassis number
   Y = mezz card number
   Z = IOM port number
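When sifting through a lot of interface names, the X/Y/Z convention can be decoded mechanically. The following is a minimal sketch; the class and function names are hypothetical, and the field meanings are exactly those given on the slide.

```python
import re
from typing import NamedTuple


class FexHostPort(NamedTuple):
    chassis: int    # X = chassis number
    mezz_card: int  # Y = mezz card number
    iom_port: int   # Z = IOM port number


_ETH_XYZ = re.compile(r"^Eth(?:ernet)?\s*(\d+)/(\d+)/(\d+)$", re.IGNORECASE)


def parse_fex_host_port(name: str) -> FexHostPort:
    """Split an 'Eth X/Y/Z' style interface name into its three numbers."""
    match = _ETH_XYZ.match(name.strip())
    if match is None:
        raise ValueError(f"not an Eth X/Y/Z interface name: {name!r}")
    return FexHostPort(*(int(part) for part in match.groups()))


port = parse_fex_host_port("Eth2/1/8")
print(f"chassis {port.chassis}, mezz card {port.mezz_card}, IOM port {port.iom_port}")
```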


Path Tracing through UCS
• How to locate the MAC of the interfaces
   - Find the interesting adapter in UCSM or from the NX-OS CLI
   - In the example, the MAC address is found on Fabric Interconnect A. It should not be visible on Fabric Interconnect B. If it is, the customer is doing per-flow/per-packet load balancing at the host level, which is not allowed on UCS B-Series


Displaying the VIF Path on the UCS
• From the UCS CLI
• From UCSM


VIF Path Details
[Figure: VIF path diagram. Labels: Slot 7, Slot 8; ports 1/1, 1/1, 2/2, 2/2; vCON-1 and vCON-2; Interface 1 and Interface 2; Path 1 and Path 2; fabrics A and B; management link]

From FI NXOS:
   show interface veth 752
   show int vfc 756


Chassis Mappings
[Figure: chassis port mapping for the UCS 2104XP IOMs. Labels: network interfaces NI0 to NI3 (fabric ports 1 to 4) and host interfaces HI0 to HI7 on each IOM; blade slots 1 to 8 holding Blade1 to Blade7 (UCS B200 M1 blades and one UCS B250 M1); vCON-1 and vCON-2 interfaces per blade]


VIFs for FC Adapters
• All VIFs associated with an EthX/Y/Z interface are pinned to the fabric port that the EthX/Y/Z interface is pinned to
• Check the VLAN to VSAN mapping (show vlan fcoe)

FarNorth-A(nxos)# sh int vethernet 9463
vethernet9463 is up
    Bound Interface is Ethernet2/1/8
    Hardware: VEthernet
    Encapsulation ARPA
    Port mode is access
    Last link flapped 1week(s) 1day(s)
    Last clearing of "show interface" counters never
    1 interface resets

FarNorth-A(nxos)# show vifs interface ethernet 2/1/8
Interface      VIFS
-------------- ---------------------------------------------------------
Eth2/1/8       veth1241, veth1243, veth9461, veth9463,

FarNorth-A(nxos)# show vlan fcoe
VLAN      VSAN      Status
--------  --------  --------
1         1         Operational
100       100       Operational

FarNorth-A(nxos)# show int vfc1271
vfc1271 is up
    Bound interface is vethernet9463
    Hardware is Virtual Fibre Channel
    Port WWN is 24:f6:00:0d:ec:d0:7b:7f
    Admin port mode is F, trunk mode is off
    snmp link state traps are enabled
    Port mode is F, FCID is 0x710005
    Port vsan is 100

FarNorth-A(nxos)# show vifs interface vethernet 9463
Interface      VIFS
-------------- ---------------------------------------------------------
veth9463       vfc1271,
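When the same trace has to be repeated for many adapters, the "show vifs interface" output can be turned into a lookup table. The sketch below works on output captured as text, using the two samples from this slide; the function name is hypothetical and the parsing assumes the two-column layout shown here.

```python
from typing import Dict, List

ETH_SAMPLE = """\
FarNorth-A(nxos)# show vifs interface ethernet 2/1/8
Interface      VIFS
-------------- ---------------------------------------------------------
Eth2/1/8       veth1241, veth1243, veth9461, veth9463,
"""

VETH_SAMPLE = """\
FarNorth-A(nxos)# show vifs interface vethernet 9463
Interface      VIFS
-------------- ---------------------------------------------------------
veth9463       vfc1271,
"""


def parse_show_vifs(output: str) -> Dict[str, List[str]]:
    """Map each parent interface to the list of VIFs riding on it."""
    mapping: Dict[str, List[str]] = {}
    for line in output.splitlines():
        if "#" in line:                      # skip the CLI prompt line
            continue
        fields = line.split(None, 1)
        if len(fields) != 2:
            continue
        parent, rest = fields
        if parent == "Interface" or set(parent) == {"-"}:
            continue                         # skip header and separator rows
        mapping[parent] = [vif.strip() for vif in rest.split(",") if vif.strip()]
    return mapping


eth_to_veth = parse_show_vifs(ETH_SAMPLE)
veth_to_vfc = parse_show_vifs(VETH_SAMPLE)

# Walk Eth2/1/8 -> veth -> vfc for every veth we have data for.
for veth in eth_to_veth["Eth2/1/8"]:
    print(veth, "->", veth_to_vfc.get(veth, ["(not collected)"]))
```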


Common Helpful Outputs
• All baseline troubleshooting should be done from connect nxos

CiscoLive-A(nxos)# show flogi database vsan 100
---------------------------------------------------------------------------
INTERFACE  VSAN  FCID      PORT NAME                NODE NAME
---------------------------------------------------------------------------
vfc703     100   0xdc0002  20:00:00:25:b5:00:00:1b  20:00:00:25:b5:00:00:2a
vfc725     100   0xdc0000  20:00:00:25:b5:10:10:01  20:00:00:25:b5:00:00:0e
vfc731     100   0xdc0001  20:00:00:25:b5:10:20:10  20:00:00:25:b5:00:00:2c

CiscoLive-A(nxos)# show zoneset active
zoneset name ZS_mn_bootcamp_v100 vsan 100
  zone name Server-1-Palo vsan 100
  * fcid 0xdc0000 [pwwn 20:00:00:25:b5:10:10:01]
  * fcid 0x2400d9 [pwwn 21:00:00:20:37:42:4a:b2]

CiscoLive-A(nxos)# show fcdomain domain-list vsan 100
Number of domains: 3
Domain ID    WWN
---------    -----------------------
0x24 (36)    20:64:00:0d:ec:20:97:c1 [Principal]
0x40 (64)    20:64:00:0d:ec:ee:ef:c1
0xdc (220)   20:64:00:0d:ec:d0:7b:41 [Local]

CiscoLive-A(nxos)# show fcns database vsan 100
VSAN 100:
---------------------------------------------------------------------------
FCID      TYPE  PWWN                     (VENDOR)    FC4-TYPE:FEATURE
---------------------------------------------------------------------------
0x2402ef  N     50:06:01:6d:44:60:4a:41  (Clariion)  scsi-fcp:target
0x2400d9  NL    21:00:00:20:37:42:4a:b2  (Seagate)   scsi-fcp:target
0x400002  N     50:0a:09:88:87:d9:6e:b7  (NetApp)    scsi-fcp:target
0x40000e  N     10:00:00:00:c9:9c:de:9f  (Emulex)    ipfc scsi-fcp:init
0xdc0000  N     20:00:00:25:b5:10:10:01              scsi-fcp:init fc-gs
0xdc0001  N     20:00:00:25:b5:10:20:10              scsi-fcp:init fc-gs
0xdc0002  N     20:00:00:25:b5:00:00:1b              scsi-fcp:init
Total number of entries = 6
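A common question in a POC is which logged-in vHBAs are not yet in the active zoneset. The following sketch cross-checks the two outputs above once they are captured as text. The function names are hypothetical, the parsing assumes the column layout shown on this slide, and the zoneset sample may only be an excerpt of the full active zoneset.

```python
import re
from typing import Dict, List

FLOGI_SAMPLE = """\
vfc703 100 0xdc0002 20:00:00:25:b5:00:00:1b 20:00:00:25:b5:00:00:2a
vfc725 100 0xdc0000 20:00:00:25:b5:10:10:01 20:00:00:25:b5:00:00:0e
vfc731 100 0xdc0001 20:00:00:25:b5:10:20:10 20:00:00:25:b5:00:00:2c
"""

ZONESET_SAMPLE = """\
zoneset name ZS_mn_bootcamp_v100 vsan 100
  zone name Server-1-Palo vsan 100
  * fcid 0xdc0000 [pwwn 20:00:00:25:b5:10:10:01]
  * fcid 0x2400d9 [pwwn 21:00:00:20:37:42:4a:b2]
"""

_ZONE_PWWN = re.compile(r"\[pwwn\s+([0-9a-fA-F:]{23})\]")


def parse_flogi(output: str) -> List[Dict[str, str]]:
    """Return one dict per login row (interface, vsan, fcid, pwwn, nwwn)."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 5 and fields[0].startswith("vfc"):
            rows.append(dict(zip(("interface", "vsan", "fcid", "pwwn", "nwwn"), fields)))
    return rows


def unzoned_logins(flogi_output: str, zoneset_output: str) -> List[str]:
    """List the vfc interfaces whose pWWN does not appear in the zoneset text."""
    zoned = {pwwn.lower() for pwwn in _ZONE_PWWN.findall(zoneset_output)}
    return [row["interface"] for row in parse_flogi(flogi_output)
            if row["pwwn"].lower() not in zoned]


print(unzoned_logins(FLOGI_SAMPLE, ZONESET_SAMPLE))
```

With the excerpts above, only vfc725 appears in a zone; the other two logins would be the first place to look if their servers cannot see storage.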


NPV FC Views
• No FC services run in NPV mode
• FCIDs are assigned from the core NPIV switch
• The NP port to the core switch must be up and assigned to the proper VSANs

FarNorth-B(nxos)# show npv flogi-table
--------------------------------------------------------------------------------------
SERVER                                                                        EXTERNAL
INTERFACE  VSAN  FCID      PORT NAME                NODE NAME                 INTERFACE
--------------------------------------------------------------------------------------
vfc1205    100   0x240007  20:00:00:25:b5:00:00:0a  20:00:00:25:b5:00:00:06   fc2/1
vfc1206    100   0x240006  20:00:00:25:b5:00:00:09  20:00:00:25:b5:00:00:06   fc2/1
vfc1210    100   0x240008  20:00:10:25:b5:00:00:09  20:00:00:10:b5:00:00:09   fc2/2
vfc1238    100   0x240002  20:00:00:25:b5:00:00:10  20:00:00:25:b5:00:00:0f   fc2/1
vfc1240    100   0x240003  20:00:00:25:b5:00:00:04  20:00:00:25:b5:00:00:0f   fc2/2
Total number of flogi = 5.

FarNorth-B(nxos)# show int brief
-------------------------------------------------------------------------------
Interface  Vsan  Admin  Admin  Status     SFP  Oper  Oper    Port
                 Mode   Trunk                  Mode  Speed   Channel
                        Mode                         (Gbps)
-------------------------------------------------------------------------------
fc2/1      100   NP     off    up         swl  NP    2       --
fc2/2      100   NP     off    up         swl  NP    2       --
fc2/3      1     NP     off    sfpAbsent  --   --    --
fc2/4      1     NP     off    sfpAbsent  --   --    --
fc2/5      1     NP     off    sfpAbsent  --   --    --
fc2/6      1     NP     off    sfpAbsent  --   --    --
fc2/7      1     NP     off    sfpAbsent  --   --    --
fc2/8      1     NP     off    sfpAbsent  --   --    --

FarNorth-B(nxos)# show npv status
npiv is enabled
disruptive load balancing is disabled

External Interfaces:
====================
  Interface: fc2/1, VSAN: 100, FCID: 0x240000, State: Up
  Interface: fc2/2, VSAN: 100, FCID: 0x240001, State: Up

  Number of External Interfaces: 2

Server Interfaces:
==================
  Interface: vfc1205, VSAN: 100, State: Up
  Interface: vfc1206, VSAN: 100, State: Up
  Interface: vfc1210, VSAN: 100, State: Up
  Interface: vfc1238, VSAN: 100, State: Up
  Interface: vfc1240, VSAN: 100, State: Up
  Interface: vfc1270, VSAN: 100, State: Up
  Interface: vfc1272, VSAN: 100, State: Up
  Interface: vfc1280, VSAN: 100, State: Up
  Interface: vfc1284, VSAN: 100, State: Up

  Number of Server Interfaces: 9
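To see how server logins are spread across the NP uplinks, the flogi table can be tallied by its last column. This is a sketch against the sample rows from this slide; the function name is hypothetical and the six-column row layout is assumed from the output shown.

```python
from collections import Counter

NPV_FLOGI_SAMPLE = """\
vfc1205 100 0x240007 20:00:00:25:b5:00:00:0a 20:00:00:25:b5:00:00:06 fc2/1
vfc1206 100 0x240006 20:00:00:25:b5:00:00:09 20:00:00:25:b5:00:00:06 fc2/1
vfc1210 100 0x240008 20:00:10:25:b5:00:00:09 20:00:00:10:b5:00:00:09 fc2/2
vfc1238 100 0x240002 20:00:00:25:b5:00:00:10 20:00:00:25:b5:00:00:0f fc2/1
vfc1240 100 0x240003 20:00:00:25:b5:00:00:04 20:00:00:25:b5:00:00:0f fc2/2
Total number of flogi = 5.
"""


def logins_per_uplink(output: str) -> Counter:
    """Count server logins per external (NP uplink) interface."""
    counts: Counter = Counter()
    for line in output.splitlines():
        fields = line.split()
        # Data rows have six columns and start with a vfc interface name.
        if len(fields) == 6 and fields[0].startswith("vfc"):
            counts[fields[5]] += 1
    return counts


for uplink, logins in sorted(logins_per_uplink(NPV_FLOGI_SAMPLE).items()):
    print(uplink, logins)
```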


Server Systems Troubleshooting
• Server upgrade items
   - Do NOT use a BIOS recovery as a mechanism to perform a BIOS upgrade
   - Do this through the update method (M3 blades) or a host FW package
   - In general, we want the CIMC version to be later than the BIOS version, so that the CIMC properly understands the data returned to it by the BIOS (this differs from the documentation today)
   - All firmware components must be from the same B (blade components) and C (rack components) packages, matched to the A (infrastructure) package


Server Systems - CIMC Booting Issues
• Corrupt CIMC firmware
   - POST failure
   - Not completing boot
• Connect to the CIMC in band to test connectivity
• Manually reboot the CIMC

** Note: today there is a bug on B230 and B440 where network performance on VMware hosts can be negatively affected by a CIMC-only reboot


Connecting to CIMC
• A quick test to verify the health
• This is a very low level data point
• Source of blade issue reporting

CiscoLive-A# connect cimc 1/1
Trying 127.5.1.1...
Connected to 127.5.1.1.
Escape character is '^]'.

CIMC Debug Firmware Utility Shell
__________________________________________
 Debug Firmware Utility
__________________________________________
Command List
__________________________________________
alarms
cores
exit
help [COMMAND]
images
mctools
memory
messages
network
obfl
post
power
sensors
sel
fru
mezz1fru
mezz2fru
tasks
top
update
users
version
__________________________________________
 Notes:
"enter Key" will execute last command
"COMMAND ?" will execute help for that command
__________________________________________


Rebooting the CIMC
• Non-disruptive to the data path **

** With the exception of the current bug in VMware environments


Server Systems Troubleshooting
• KVM access
• Independent of Centrale
• UCS AAA login


Server Systems - Memory
• This shows errors detected and reported by the BIOS and the CIMC
• These are also stored in the System Event Log (SEL)
• Uncorrectable errors are an issue; correctable errors are ones that ECC has handled

CiscoLive-A /chassis/server # show sel 3/1 | include Memory
487 | 03/18/2011 00:16:49 | BIOS | Memory #0x02 | Uncorrectable ECC/other uncorrectable memory error | RUN, Rank: 0, DIMM Socket: 4, Channel: C, Socket: 0, DIMM: C4 | Asserted
5f1 | 04/16/2011 09:53:12 | BIOS | Memory #0x02 | Uncorrectable ECC/other uncorrectable memory error | RUN, Rank: 3, DIMM Socket: 7, Channel: A, Socket: 0, DIMM: A7 | Asserted
731 | 04/21/2011 01:59:28 | BIOS | Memory #0x02 | Correctable ECC/other correctable memory error | RUN, Rank: 1, DIMM Socket: 1, Channel: B, Socket: 0, DIMM: B1 | Asserted
732 | 04/21/2011 10:50:55 | BIOS | Memory #0x02 | Uncorrectable ECC/other uncorrectable memory error | RUN, Rank: 2, DIMM Socket: 6, Channel: A, Socket: 0, DIMM: A6 | Asserted
799 | 04/29/2011 02:50:31 | BIOS | Memory #0x02 | Correctable ECC/other correctable memory error | RUN, Rank: 0, DIMM Socket: 0, Channel: B, Socket: 0, DIMM: B0 | Asserted
79a | 04/29/2011 04:41:33 | BIOS | Memory #0x02 | Uncorrectable ECC/other uncorrectable memory error | RUN, Rank: 3, DIMM Socket: 3, Channel: B, Socket: 0, DIMM: B3 | Asserted
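For a blade with a long SEL, a per-DIMM tally makes the suspect parts obvious. The sketch below parses captured "show sel" text in the pipe-separated layout shown on this slide; the function name is hypothetical, and only three of the sample entries are repeated here to keep it short.

```python
import re
from collections import defaultdict
from typing import Dict

SEL_SAMPLE = """\
487 | 03/18/2011 00:16:49 | BIOS | Memory #0x02 | Uncorrectable ECC/other uncorrectable memory error | RUN, Rank: 0, DIMM Socket: 4, Channel: C, Socket: 0, DIMM: C4 | Asserted
731 | 04/21/2011 01:59:28 | BIOS | Memory #0x02 | Correctable ECC/other correctable memory error | RUN, Rank: 1, DIMM Socket: 1, Channel: B, Socket: 0, DIMM: B1 | Asserted
732 | 04/21/2011 10:50:55 | BIOS | Memory #0x02 | Uncorrectable ECC/other uncorrectable memory error | RUN, Rank: 2, DIMM Socket: 6, Channel: A, Socket: 0, DIMM: A6 | Asserted
"""

# Matches the trailing "DIMM: C4" label, not the earlier "DIMM Socket: 4".
_DIMM_LABEL = re.compile(r"\bDIMM: (\w+)")


def memory_errors_by_dimm(sel_output: str) -> Dict[str, Dict[str, int]]:
    """Tally correctable and uncorrectable memory events per DIMM label."""
    summary = defaultdict(lambda: {"correctable": 0, "uncorrectable": 0})
    for line in sel_output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 7 or not fields[3].startswith("Memory"):
            continue
        dimm = _DIMM_LABEL.search(fields[5])
        if dimm is None:
            continue
        is_uncorrectable = fields[4].lower().startswith("uncorrectable")
        kind = "uncorrectable" if is_uncorrectable else "correctable"
        summary[dimm.group(1)][kind] += 1
    return dict(summary)


for dimm, counts in sorted(memory_errors_by_dimm(SEL_SAMPLE).items()):
    print(dimm, counts)
```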


Server Systems - Memory
• We want to know of both correctable errors (for prediction of failure) and uncorrectable errors, via a threshold policy


Stress Testing and Baseline
• Cycle through the servers, performing testing on the deployed hardware
   - Evacuate the VMs from a given server and put it in maintenance mode
   - Mount the e2e diagnostic .ISO and reboot the server to it
   - Run utilities to stress test the memory and CPU
      Test 1: ./burnin/bin/stress -c 8 -i 4 -m 2 --vm-bytes 128M -t 100s -v
      Test 2: ./burnin/bin/pmemtest -a -l 1000000000
      Test 3: ./burnin/bin/stream
      Test 4: ./burnin/bin/cachebench -rwbsp -x1 -m24 -d5 -e1
   - DO NOT RUN THE DISK STRESS (it will corrupt the existing RAID)
   - Record the results
   - Remove the .ISO, reboot into VMware, and exit maintenance mode
• Identify any suspect devices from the tests and plan for maintenance of those items


Server Systems Troubleshooting
• Initial server deployment or suspected issues
• Example results from one customer POC:

            B200-M2 / X5570 / 96G    B230-M1 / X6550 / 256G
Test #1     1m 40s                   1m 40s
Test #2     50s                      1m 20s
Test #3     5m 4s                    5m 11s
Test #4     13m 45s                  13m 46s
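Recorded run times are easier to compare across servers once converted to seconds. The sketch below uses the example results above; the function names are hypothetical, and what counts as a suspicious difference is left to the reader since the slide sets no threshold.

```python
import re

# Run times from the example results: (B200-M2, B230-M1).
RESULTS = {
    "Test #1": ("1m 40s", "1m 40s"),
    "Test #2": ("50s", "1m 20s"),
    "Test #3": ("5m 4s", "5m 11s"),
    "Test #4": ("13m 45s", "13m 46s"),
}


def to_seconds(duration: str) -> int:
    """Convert a duration such as '13m 45s' or '50s' to whole seconds."""
    match = re.fullmatch(r"\s*(?:(\d+)m)?\s*(?:(\d+)s)?\s*", duration)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {duration!r}")
    minutes, seconds = (int(part) if part else 0 for part in match.groups())
    return minutes * 60 + seconds


def percent_slower(baseline: str, candidate: str) -> float:
    """How much longer the candidate ran than the baseline, in percent."""
    base, cand = to_seconds(baseline), to_seconds(candidate)
    return (cand - base) * 100 / base


for test, (b200, b230) in RESULTS.items():
    print(f"{test}: B230-M1 vs B200-M2 {percent_slower(b200, b230):+.1f}%")
```

Comparing like-for-like servers in the same way makes an outlier blade stand out before the customer finds it.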


Operating Systems Troubleshooting
• Windows items
   - With the latest BIOS on B230 and B440 M1, the PCI devices are ordered correctly on a 1.4 to 2.0 upgrade, but interfaces can be renumbered regardless (fix coming)
   - We can define PCI order, but the adapter definitions presented to the OS depend on the order in which you map the VIC driver to them
• Red Hat items
   - We have very good control over these, using the files in /etc/sysconfig/network-scripts to map the HW address to the eth number
   - There are kernel parameters which can affect performance; contact the TME teams directly
• ESX items
   - In-box drivers occasionally need to be updated
   - This is due to the time sync requirements for in-box deployments (the lag can be 6+ months)


Chassis Troubleshooting
• Intra-chassis component communications
   - Inter-Integrated Circuit communications (I2C)
   - The System Management Bus is a later subset
   - Multi-master bus for simple communications between system elements
   - In use inside a standard industry server, and also between chassis components (inside a single chassis only)
• I2C bug cases, with some components coming too close to certain margins
   - Locking the I2C bus
   - Creating spurious noise on the bus
   - Manifests as unpredictable behavior
• What does this mean for POC and initial customer deployments?
   - Be certain to run software at or later than 1.4(3s), which includes SW fixes for these situations
   - For additional HW margin:
      Power supplies should be ordered as MFG_NEW if possible
      IO Modules that are 2104 should be ordered as MFG_NEW if possible


Fabric Interconnect Troubleshooting
• 6100 top considerations
   - P*V count limit: 3k prior to UCS 1.4(1), then 6k with UCS 1.4(1), and 14k as of UCS 1.4(3q)
   - VIF limits can be very restrictive in C-Series implementations
• 6200 top considerations
   - 32k P*V count limit at UCS v2.x
   - Multicast when using port channels upstream (only do this on UCS v2.0(2) and later)
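A rough pre-check of a planned design against these limits can be done on paper or in a few lines of Python. The sketch below assumes that the P*V count is the sum, over every port, of the number of VLANs carried on that port; that definition, the function names, and the example port names are assumptions for illustration, while the limit values are the ones quoted above.

```python
from typing import Dict

# Limits quoted on this slide.
PV_LIMIT_6100 = 14_000   # as of UCS 1.4(3q)
PV_LIMIT_6200 = 32_000   # at UCS v2.x


def pv_count(vlans_per_port: Dict[str, int]) -> int:
    """Sum the VLANs carried on each port (assumed definition of P*V)."""
    return sum(vlans_per_port.values())


def fits(vlans_per_port: Dict[str, int], limit: int) -> bool:
    """Return True if the planned design stays within the given limit."""
    return pv_count(vlans_per_port) <= limit


# Hypothetical design: 4 uplinks trunking 500 VLANs, 160 vNICs with 50 each.
design = {f"uplink-{n}": 500 for n in range(4)}
design.update({f"vnic-{n}": 50 for n in range(160)})

print("P*V count:", pv_count(design))
print("fits 6100:", fits(design, PV_LIMIT_6100))
print("fits 6200:", fits(design, PV_LIMIT_6200))
```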


Fabric Interconnect Troubleshooting
• Gathering tech support files
   - We have the ability to gather the tech support data from UCSM to your local host
   - Always recommend gathering these when asking questions on the various internal mailers


Fabric Interconnect Troubleshooting
• Gathering any core dumps
   - Once the TFTP core exporter is configured, core dumps will be moved off the system
   - Move exported cores to the trash can


Fabric Interconnect Troubleshooting
• Viewing data plane traffic within the UCS
   - We can SPAN from most sources within the UCS
   - Can SPAN the physical and virtual interfaces
   - Hardware or software analyzer


Thank you.
