Troubleshooting Methods for UCS Customer POCs and Labs
August 2012
Agenda
• Why would we need this presentation?
• Overview of some recurring items to address
  Infrastructure Items
• Adapter and IOM systems troubleshooting
• Server systems troubleshooting
• Operating systems troubleshooting
• Chassis systems troubleshooting
• Fabric Interconnect systems troubleshooting
Following the Progress of AG Work
• A given process runs through many stages (FSM stages)
• Some stages can be skipped if they are not needed, or depending on the type of action (Shallow vs. Deep)
• Almost all actions contain a verification step that the action completed
• Logs are retained
• View and Monitor
FSM Return Codes and System Faults
• These will feed into the normal fault policy of UCSM
  FSM faults are just one type – refer to the link below for a listing of types
  Highly recommend at least becoming familiar with the layout of the UCS faults and error message reference at the URL below
• Severity can change over the life of a fault
• For POC labs, recommend eliminating Critical, Major, and Minor faults
  Other faults will be present in the normal course of actions that are waiting or being performed
Recurring Items – UCS Infrastructure
• Setting a system baseline
  Most of our initial issues in CPOC situations are due to firmware
  All system components must be on the same firmware package version
  Host and mgmt firmware policies are excellent tools to do this, rather than going server by server
  Viewing the components of a package (shown)
  When demo FIs arrive, individually set them to a common UCSM version and erase their configurations before attempting to join them in a cluster
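The "same firmware package version" rule above can be checked mechanically once you have an inventory. The following is a minimal sketch, assuming you have already collected a mapping of component name to package version by whatever means you prefer. The function name, the component names, and the "other-version" placeholder are hypothetical, and treating the most common version as the baseline is an assumption, not UCSM behavior.

```python
from collections import Counter


def firmware_outliers(inventory):
    """Return (baseline_version, outliers) for a component -> version mapping.

    `inventory` is a hypothetical dict of component name -> firmware package
    version string. The baseline is assumed to be the most common version.
    """
    if not inventory:
        return None, {}
    baseline, _count = Counter(inventory.values()).most_common(1)[0]
    # Anything not on the baseline needs attention before the POC starts.
    outliers = {name: ver for name, ver in inventory.items() if ver != baseline}
    return baseline, outliers


# Illustrative inventory only; names and the placeholder version are made up.
sample = {
    "fabric-interconnect-A": "1.4(3s)",
    "fabric-interconnect-B": "1.4(3s)",
    "chassis-1-iom-1": "1.4(3s)",
    "blade-1/3-adapter": "other-version",
}

baseline, outliers = firmware_outliers(sample)
print("baseline:", baseline)
for name, ver in sorted(outliers.items()):
    print(f"off baseline: {name} is on {ver}")
```

An empty outlier set is the state you want before staging the POC.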
Recurring Items – UCS Infrastructure
• Setting a system baseline
  Can upgrade via the bundle mechanism at the POC start
  The bundle option is there for both update and activate, and handles the entire upgrade
  This is totally disruptive, so don’t use this method during the POC (after staging)
Recurring Items – UCS Infrastructure
• Upgrade prep, checkpoints, cleanup – when uptime is key (not a POC)
  Implement a management interface monitoring policy
  Prior to upgrading one fabric, disable all of its upstream data and FC ports
  Also disable its mgmt interface (this moves KVM traffic to the fabric that will not be taken down)
  This forces traffic to the fabric that will stay up (and allows quick recovery if there is an error)
  Upgrade the fabric, then restore its uplinks and mgmt interface
  Repeat on the peer fabric, but only after the cluster state shows HA_READY when you connect to local management in the CLI and run “show cluster extended”
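The HA_READY checkpoint is the gate between upgrading the first fabric and touching its peer. Here is a minimal sketch of that gate, assuming you paste or capture the cluster state text yourself. The function name is hypothetical, and the exact wording of the real CLI output is not confirmed by this deck, so the pattern accepts both "HA_READY" as written above and "HA READY".

```python
import re

# Accepts "HA_READY" or "HA READY"; will not match "HA NOT READY".
_HA_READY = re.compile(r"\bHA[ _]READY\b")


def cluster_ha_ready(cli_text):
    """Return True if the captured cluster state text contains an HA-ready marker."""
    return _HA_READY.search(cli_text) is not None


def may_upgrade_peer(cli_text):
    """Gate for the 'repeat on peer fabric' step: only proceed when HA is ready."""
    if cluster_ha_ready(cli_text):
        return "proceed: cluster reports HA ready"
    return "wait: cluster does not report HA ready yet"


# Hypothetical snippets standing in for captured CLI text.
print(may_upgrade_peer("HA_READY"))
print(may_upgrade_peer("HA NOT READY"))
```

Keeping the check as a separate function makes it easy to poll in a loop while the upgraded fabric rejoins the cluster.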
Recurring Items – UCS Infrastructure
• Discovery Policy vs. Re-Acknowledgement behavior
  The discovery policy is just that – a floor on the number of links required before a chassis will be discovered
  The link policy dictates bringing up port-channels from the IOMs to the FIs, after discovery
  You must then re-acknowledge the chassis (disruptive to blade connectivity) for all connections beyond those used at discovery to be used
  Always re-acknowledge the chassis after it is discovered, or after any cabling changes
In all current versions, IGMP snooping is enabled and cannot be turned off
Only 224.0.0.X is flooded within the UCS
This is fundamentally different from traditional switches, which flood
We need an upstream PIM router or IGMP snooping querier for proper multicast flow beyond a new-flow timeout (~180 seconds)
Recurring Items – UCS Infrastructure
• It is always best, as preparation, to review the release notes
• This is the PRIMARY method we use to notify the field of issues to be aware of
• Can be large with the product breadth, but for a POC or install will be a
Path Tracing through UCS
• How to locate the MAC of the interfaces
  Find the adapter of interest in UCSM or from the NXOS CLI
# Found MAC address on Fabric Interconnect A. It should not be visible on Fabric Interconnect B. If it is, the customer is doing per-flow/per-packet load balancing at the host level, which is not allowed on UCS B-Series.
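The "visible on A, must not be visible on B" check is a set intersection once MAC notation is normalized. This is a minimal sketch, assuming you have already extracted the server-side MAC addresses learned on each fabric into two lists. The function names and the sample addresses are hypothetical.

```python
import re


def normalize_mac(mac):
    """Reduce a MAC in colon, dash, or dotted notation to 12 lowercase hex digits."""
    digits = re.sub(r"[^0-9a-fA-F]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return digits


def macs_on_both_fabrics(macs_a, macs_b):
    """Return the sorted MACs that were learned on both Fabric A and Fabric B."""
    a = {normalize_mac(m) for m in macs_a}
    b = {normalize_mac(m) for m in macs_b}
    return sorted(a & b)


# Hypothetical addresses, in the dotted and colon notations you may encounter.
fabric_a = ["0025.b500.001b", "0025.b500.002a"]
fabric_b = ["00:25:b5:00:00:1b"]

for mac in macs_on_both_fabrics(fabric_a, fabric_b):
    print("seen on both fabrics, check host load balancing:", mac)
```

Normalizing first matters because the same address can appear dotted in one place and colon-separated in another.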
All vifs associated with an EthX/Y/Z interface are pinned to the fabric port that the EthX/Y/Z interface is pinned to.
Check the VLAN to VSAN mapping (show vlan fcoe)
FarNorth-A(nxos)# sh int vethernet 9463
vethernet9463 is up
    Bound Interface is Ethernet2/1/8
    Hardware: VEthernet
    Encapsulation ARPA
    Port mode is access
    Last link flapped 1week(s) 1day(s)
    Last clearing of "show interface" counters never
    1 interface resets

FarNorth-A(nxos)# show int vfc1271
vfc1271 is up
    Bound interface is vethernet9463
    Hardware is Virtual Fibre Channel
    Port WWN is 24:f6:00:0d:ec:d0:7b:7f
    Admin port mode is F, trunk mode is off
    snmp link state traps are enabled
    Port mode is F, FCID is 0x710005
    Port vsan is 100

FarNorth-A(nxos)# show vifs interface vethernet 9463
Interface      VIFS
-------------- ---------------------------------------------------------
veth9463       vfc1271,
CiscoLive-A(nxos)# show flogi database vsan 100
---------------------------------------------------------------------------
INTERFACE  VSAN  FCID      PORT NAME                NODE NAME
---------------------------------------------------------------------------
vfc703     100   0xdc0002  20:00:00:25:b5:00:00:1b  20:00:00:25:b5:00:00:2a
vfc725     100   0xdc0000  20:00:00:25:b5:10:10:01  20:00:00:25:b5:00:00:0e
vfc731     100   0xdc0001  20:00:00:25:b5:10:20:10  20:00:00:25:b5:00:00:2c

CiscoLive-A(nxos)# show zoneset active
zoneset name ZS_mn_bootcamp_v100 vsan 100
  zone name Server-1-Palo vsan 100
  * fcid 0xdc0000 [pwwn 20:00:00:25:b5:10:10:01]
  * fcid 0x2400d9 [pwwn 21:00:00:20:37:42:4a:b2]
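Cross-checking the flogi database against the active zoneset answers two questions for a given initiator: did it log in, and is it zoned with the FCID it logged in with? This is a minimal sketch, assuming the output is pasted as text in the layout shown above. The function names are hypothetical, and the parser only handles rows shaped like the ones on this slide.

```python
import re

WWN = r"(?:[0-9a-f]{2}:){7}[0-9a-f]{2}"
# One flogi row: interface, VSAN, FCID, port WWN, node WWN.
FLOGI_ROW = re.compile(
    rf"^(vfc\d+)\s+(\d+)\s+(0x[0-9a-f]+)\s+({WWN})\s+({WWN})\s*$", re.I
)
# One zone member line as shown above: "* fcid 0x... [pwwn ...]".
ZONE_MEMBER = re.compile(rf"\*\s+fcid\s+(0x[0-9a-f]+)\s+\[pwwn\s+({WWN})\]", re.I)


def parse_flogi(text):
    """Map port WWN -> (interface, vsan, fcid) from pasted flogi database text."""
    rows = {}
    for line in text.splitlines():
        m = FLOGI_ROW.match(line.strip())
        if m:
            intf, vsan, fcid, pwwn, _nwwn = m.groups()
            rows[pwwn.lower()] = (intf, int(vsan), fcid.lower())
    return rows


def active_zone_members(text):
    """Map port WWN -> FCID for members listed with a leading '*'."""
    return {pwwn.lower(): fcid.lower() for fcid, pwwn in ZONE_MEMBER.findall(text)}


def check_initiator(pwwn, flogi_text, zoneset_text):
    """Report login and zoning status for one initiator port WWN."""
    flogi = parse_flogi(flogi_text)
    zoned = active_zone_members(zoneset_text)
    pwwn = pwwn.lower()
    return {
        "logged_in": pwwn in flogi,
        "in_active_zone": pwwn in zoned,
        "fcid_matches": pwwn in flogi and zoned.get(pwwn) == flogi[pwwn][2],
    }


# Rows copied from the output on this slide.
FLOGI_TEXT = """
vfc703 100 0xdc0002 20:00:00:25:b5:00:00:1b 20:00:00:25:b5:00:00:2a
vfc725 100 0xdc0000 20:00:00:25:b5:10:10:01 20:00:00:25:b5:00:00:0e
vfc731 100 0xdc0001 20:00:00:25:b5:10:20:10 20:00:00:25:b5:00:00:2c
"""
ZONESET_TEXT = """
zoneset name ZS_mn_bootcamp_v100 vsan 100
  zone name Server-1-Palo vsan 100
  * fcid 0xdc0000 [pwwn 20:00:00:25:b5:10:10:01]
  * fcid 0x2400d9 [pwwn 21:00:00:20:37:42:4a:b2]
"""

print(check_initiator("20:00:00:25:b5:10:10:01", FLOGI_TEXT, ZONESET_TEXT))
print(check_initiator("20:00:00:25:b5:10:20:10", FLOGI_TEXT, ZONESET_TEXT))
```

With the data above, the first initiator is logged in and zoned, while the second is logged in but does not appear in the active zoneset, which is the usual reason a blade cannot see its boot LUN.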
All baseline troubleshooting should be done from Connect NXOS
CiscoLive-A(nxos)# show fcdomain domain-list vsan 100
Number of domains: 3
Domain ID    WWN
---------    -----------------------
0x24 (36)    20:64:00:0d:ec:20:97:c1 [Principal]
0x40 (64)    20:64:00:0d:ec:ee:ef:c1
0xdc (220)   20:64:00:0d:ec:d0:7b:41 [Local]

CiscoLive-A(nxos)# show fcns database vsan 100
VSAN 100:
--------------------------------------------------------------------------
FCID      TYPE  PWWN                     (VENDOR)    FC4-TYPE:FEATURE
--------------------------------------------------------------------------
0x2402ef  N     50:06:01:6d:44:60:4a:41  (Clariion)  scsi-fcp:target
0x2400d9  NL    21:00:00:20:37:42:4a:b2  (Seagate)   scsi-fcp:target
0x400002  N     50:0a:09:88:87:d9:6e:b7  (NetApp)    scsi-fcp:target
0x40000e  N     10:00:00:00:c9:9c:de:9f  (Emulex)    ipfc scsi-fcp:init
0xdc0000  N     20:00:00:25:b5:10:10:01              scsi-fcp:init fc-gs
0xdc0001  N     20:00:00:25:b5:10:20:10              scsi-fcp:init fc-gs
0xdc0002  N     20:00:00:25:b5:00:00:1b              scsi-fcp:init
Total number of entries = 6
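The domain list and the name server output tie together through the FCID: a Fibre Channel ID is 24 bits, and its top byte is the domain ID of the switch the device logged in to. This is a minimal sketch that splits an FCID into domain, area, and port bytes, so each fcns entry can be matched to a domain from the list above. The function names are hypothetical.

```python
def fcid_fields(fcid):
    """Split a 24-bit Fibre Channel ID string into (domain, area, port) bytes."""
    value = int(fcid, 16)  # accepts an optional "0x" prefix
    if not 0 <= value <= 0xFFFFFF:
        raise ValueError(f"FCID out of 24-bit range: {fcid!r}")
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF


def is_local(fcid, local_domain):
    """True if the FCID was assigned by the switch with the given domain ID."""
    return fcid_fields(fcid)[0] == local_domain


# Domain 0xdc (220) is marked [Local] in the domain list on this slide.
LOCAL_DOMAIN = 0xDC

for fcid in ("0xdc0000", "0x2400d9", "0x400002"):
    domain, area, port = fcid_fields(fcid)
    where = "local FI" if is_local(fcid, LOCAL_DOMAIN) else "upstream switch"
    print(f"{fcid}: domain {domain:#04x} ({domain}), area {area}, port {port} -> {where}")
```

Applied to the output above, the 0xdc entries are the UCS initiators on this fabric interconnect, while the 0x24 and 0x40 entries are targets reached through the other two domains.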
Server Systems Troubleshooting
• Server Upgrade Items
  Do NOT use a BIOS recovery as a mechanism to upgrade the BIOS
  We should do this through the update method (M3 blades) or a Host FW package
  In general, we want the CIMC version to be later than the BIOS version, so that the CIMC properly understands the data returned to it from BIOS (a gap in the documentation today)
  All firmware components must be from the same B (blade components) and C (rack components) packages, matched to the A (infrastructure) package
__________________________________________
Notes:
"enter Key" will execute last command
"COMMAND ?" will execute help for that command
__________________________________________
CiscoLive-A# connect cimc 1/1
Trying 127.5.1.1...
Connected to 127.5.1.1.
Escape character is '^]'.
Server Systems - Memory
• This will show errors detected and reported by BIOS and the CIMC
• These are also stored in the System Event Log (SEL)
• Uncorrectable errors are an issue; correctable errors are being handled by ECC
With the latest BIOS on B230 and B440 M1, the PCI devices are ordered correctly on a 1.4 to 2.0 upgrade, but interfaces can be renumbered regardless – fix coming
We can define the PCI order, but the adapter definitions seen by the OS depend on the order in which you map the VIC driver to them
• Red Hat Items
  We have very good control over these, using /etc/sysconfig/network-scripts to map the HW address to the eth number
  There are kernel parameters which can affect performance – contact the TME teams directly
• ESX Items
  In-box drivers occasionally need to be updated
  This is due to the lead time needed to sync drivers into in-box deployments (can be 6+ months)
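For the Red Hat item above, the mapping lives in one ifcfg file per interface under /etc/sysconfig/network-scripts, where HWADDR pins a MAC address to the DEVICE name. This is a minimal sketch that renders such files from a MAC-to-name plan. The function names and MAC addresses are hypothetical, and only the three keys shown are emitted, so IP settings would still need to be added.

```python
def render_ifcfg(device, mac, onboot=True):
    """Render minimal ifcfg-<device> contents that pin a MAC to a device name."""
    lines = [
        f"DEVICE={device}",
        f"HWADDR={mac.upper()}",
        f"ONBOOT={'yes' if onboot else 'no'}",
    ]
    return "\n".join(lines) + "\n"


def render_plan(plan):
    """Map file name -> file contents for a dict of device name -> MAC."""
    return {f"ifcfg-{dev}": render_ifcfg(dev, mac) for dev, mac in plan.items()}


# Hypothetical vNIC MACs, e.g. taken from the service profile in UCSM.
plan = {
    "eth0": "00:25:b5:00:00:1b",
    "eth1": "00:25:b5:00:00:2a",
}

for name, body in sorted(render_plan(plan).items()):
    print(f"# /etc/sysconfig/network-scripts/{name}")
    print(body)
```

Generating the files from the MACs in the service profile keeps the eth numbering stable even if the PCI enumeration order changes after a firmware upgrade.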
Chassis Troubleshooting
• Intra-chassis component communications
  Inter-Integrated Circuit communications (I2C)
  System Management Bus (SMBus) was a later subset
  Multi-master bus for simple communications between system elements
  In use inside a standard industry server, and also between chassis components (inside a single chassis only)
• I2C bug cases, with some components coming too close to certain margins
  Locking the I2C bus
  Creating spurious noise on the bus
  Manifests as unpredictable behavior
• What does this mean for POC and Initial Customer Deployments?
  Be certain to be running software at or later than 1.4(3s), which includes SW fixes for these situations
  For additional HW margin:
  Power supplies should be ordered as MFG_NEW if possible
  IO Modules that are 2104 should be ordered as MFG_NEW if possible
Fabric Interconnect Troubleshooting
• Gathering Tech Support Files
  We have the ability to gather the tech support data from UCSM to your local host
  Always recommend gathering these when asking questions on the various internal mailers