This presentation may contain VMware confidential information.
Copyright © 2005 VMware, Inc. All rights reserved. All other marks and names mentioned herein may be trademarks of their respective companies.
Learning Objectives
• Know how to work with performance problems
• Prevent and diagnose network problems
• Avoid and troubleshoot SAN and storage problems
• Learn what to do about system reliability problems
Performance Problems: Before You Begin
• Create a clear statement of the problem
  • Problems that can’t be measured can’t be repaired
  • “Poor performance” and “poor speeds” are relative
    • Compared to previous results? Or to expectations?
• Gather performance and utilization data
  • Capture an empirical measurement of a quantitative value
  • Compare to a normal baseline (“benchmark”) taken before the problem appeared
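As an illustration of comparing an empirical measurement to a baseline, here is a minimal sketch; the metric names and the 20% tolerance are invented for the example:

```python
# Illustrative sketch: quantify "poor performance" by comparing current
# measurements to a pre-problem baseline, rather than relying on perception.
# Metric names and the 20% tolerance are assumptions, not VMware guidance.

def deviates_from_baseline(baseline, current, tolerance=0.20):
    """Return the metrics whose current value is more than `tolerance`
    (as a fraction) worse than the baseline value. Assumes higher is better."""
    regressions = {}
    for metric, base_value in baseline.items():
        value = current.get(metric)
        if value is None:
            continue
        if value < base_value * (1.0 - tolerance):
            regressions[metric] = (base_value, value)
    return regressions

baseline = {"transactions_per_sec": 450.0, "requests_per_sec": 1200.0}
current = {"transactions_per_sec": 300.0, "requests_per_sec": 1150.0}

print(deviates_from_baseline(baseline, current))
# transactions_per_sec dropped ~33% (a regression);
# requests_per_sec dropped ~4% (within tolerance)
```

With a baseline in hand, “slow” becomes a measurable statement: which metric regressed, and by how much.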
Troubleshooting Performance Problems
A flowchart for working a performance complaint:
1. Clarify and quantify the initial complaint.
2. Perception or reality?
   • Perception: supply education.
   • Reality: continue.
3. Rule out common errors.
4. Measure performance again.
5. Identify the resource that is the bottleneck.
6. Relieve the bottleneck: allocate more resources, or decrease competition.
7. Satisfactory performance?
   • Yes: done.
   • No: is another bottleneck present?
     • Yes: return to step 5.
     • No: consider physical hardware for this application.
Performance: Perception vs. Reality
• Common perceived performance problems:
  • The display is sluggish: a virtual machine without VMware Tools installed appears sluggish in Remote Console
    • The virtual machine lacks proper graphics drivers
  • The mouse is slow: hardware mouse acceleration is turned off by default in Windows Server 2003
    • Result: poor mouse responsiveness
  • Virtual machines perceived as generally slow: bad network connectivity between the client and the Service Console
• Remote Console performance indicates nothing about the actual performance of the virtual machines
• High utilization does not necessarily mean bad performance
  • True measures of performance are user-facing metrics:
    • Transactions per unit time
    • Response time
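The two user-facing metrics above can be computed directly from request timings. A minimal sketch (the sample request data is invented):

```python
# Illustrative sketch: derive the two user-facing metrics named above
# (transactions per unit time, response time) from request start/end times.
# The sample data is invented for demonstration.

def user_facing_metrics(requests):
    """`requests` is a list of (start_seconds, end_seconds) pairs.
    Returns (throughput in transactions/sec, mean response time in sec)."""
    if not requests:
        return 0.0, 0.0
    first_start = min(start for start, _ in requests)
    last_end = max(end for _, end in requests)
    elapsed = last_end - first_start
    throughput = len(requests) / elapsed if elapsed > 0 else 0.0
    mean_response = sum(end - start for start, end in requests) / len(requests)
    return throughput, mean_response

sample = [(0.0, 0.5), (0.2, 0.9), (1.0, 1.4), (1.5, 2.0)]
throughput, response = user_facing_metrics(sample)
print(f"{throughput:.1f} tx/sec, mean response {response:.3f} sec")
# 4 transactions over 2.0 seconds of wall time -> 2.0 tx/sec
```

Tracking these numbers, rather than CPU utilization alone, shows whether users actually experience a slowdown.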
Performance: Common Errors in Virtual Machine Configuration
• Failure to use high-performance virtual devices
  • Virtual SCSI adapters: choose LSILogic rather than BusLogic
  • Virtual Ethernet adapters: choose vmxnet rather than vlance
• Failure to size virtual machines properly
  • Set the virtual machine’s maximum memory high enough to avoid paging inside the guest OS
  • Set the virtual machine’s minimum memory to accommodate steady-state memory needs
  • Employ Virtual SMP only for:
    • Guest operating systems supported for SMP use
    • Applications that benefit from multiple CPUs
• Failure to distinguish between high- and low-priority consumers
  • Give high-priority virtual machines many shares of their key resources
  • Give low-priority virtual machines comparatively few shares
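The shares model amounts to proportional allocation: under contention, each virtual machine receives its shares divided by the total shares of the competing machines. The VM names and share counts below are invented for illustration:

```python
# Hypothetical sketch of proportional-share allocation. Under contention,
# a VM's slice of a resource is its shares over the total active shares.
# VM names and share counts are invented, not VMware defaults.

def share_fractions(shares_by_vm):
    """Map each VM to the fraction of a contended resource it receives."""
    total = sum(shares_by_vm.values())
    return {vm: shares / total for vm, shares in shares_by_vm.items()}

# A high-priority VM gets many shares; low-priority VMs get few.
fractions = share_fractions({"db-prod": 2000, "test-1": 500, "test-2": 500})
print(fractions)
# db-prod receives 2/3 of the contended resource; each test VM receives 1/6
```

Note that shares only matter under contention; an idle high-share VM does not starve the others.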
Performance: Common Errors in Virtual Machine Configuration
• Failure to set up the guest OS and application environment wisely
  • Disable screensavers
  • Run antivirus software in scheduled-scan mode, not continuously
  • Choose the right Windows HAL or Linux kernel for the number of virtual CPUs
• Failure to lay out virtual machine storage correctly
  • If an application benefits from a particular RAID setting, put its virtual disks in a VMFS volume on a LUN with that setting
  • Do not use undoable disks casually
  • Avoid using many undoable disks in the same VMFS volume simultaneously
Performance: Common Errors in Server Configuration
• Failure to match system configuration to workload
  • Virtual machines need at least as much CPU and RAM as physical machines running the same application
  • Size the system’s physical RAM to accommodate all virtual machines’ steady-state memory needs
    • If the virtual machines’ total memory use exceeds the RAM allocated to the VMkernel, VMkernel swapping will occur
• Failure to lay out PCI buses properly
  • Connect no more than one high-traffic adapter to each bus
  • If possible, place the storage adapter on one bus and the NICs on another
  • If possible, spread the NICs in a bond across multiple buses
• Failure to design the storage layout properly
  • Choose the right failover policy for your disk array
  • Spread the traffic to your LUNs across the available paths
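The RAM-sizing rule above can be expressed as a simple check: VMkernel swapping occurs when the virtual machines’ combined steady-state memory exceeds the RAM available to the VMkernel. This is an illustrative sketch; the figures are invented:

```python
# Minimal sketch of the sizing rule above: if the VMs' total steady-state
# memory exceeds the RAM available to the VMkernel, VMkernel swapping will
# occur. The example memory figures are invented.

def vmkernel_will_swap(vm_steady_state_mb, vmkernel_ram_mb):
    """True when combined VM steady-state memory exceeds VMkernel RAM."""
    return sum(vm_steady_state_mb) > vmkernel_ram_mb

# Three VMs needing 1.5 GB, 2 GB, and 1 GB against 4 GB of VMkernel RAM:
print(vmkernel_will_swap([1536, 2048, 1024], 4096))  # -> True (4.5 GB > 4 GB)
```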
Performance: Limiting Resource for Some Applications
(Application — limiting resources and recommended actions)
• Citrix MetaFrame XP terminal services
  • CPU: Size the physical system properly; choose the Terminal Services workload option
• ATG Dynamo application server
  • CPU: Preallocate database pools
  • RAM: Size the virtual machine’s minimum memory to hold the Java heap
• Apache Web server
  • CPU: Avoid system calls by configuring Apache with minimal logging, DNS lookups, and process creation
  • RAM: Size the virtual machine’s minimum memory to hold content
• Microsoft SQL Server database server
  • CPU: Monitor the scheduler queue length; do not run 2-VCPU virtual machines on 2-PCPU hardware
  • Disk: Place logs and data in virtual disks on different physical LUNs
Performance: Run Physical? or Run Virtual?
• Workloads that use more resources than virtual machines’ maxima should stay on physical hardware
  • Examples: applications that can use more than 3.6 GB of RAM or more than two CPUs
• Virtualization is always a tradeoff between performance and manageability
• Workloads that spend much time executing operating-system code suffer the greatest performance cost
  • Examples: process creation and destruction, thread management
• Spectrum of virtualization cost, from less to more:
  • Computationally intensive workloads (least cost)
  • I/O-intensive workloads
  • Workloads with high system-call overhead (most cost)
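A minimal sketch of screening a workload against the virtual-machine maxima cited above (3.6 GB of RAM, two CPUs); the function name and workload figures are invented for illustration:

```python
# Illustrative screen against the VM maxima cited on this slide.
# The constants come from the slide; the function is an invention.

VM_MAX_RAM_GB = 3.6
VM_MAX_CPUS = 2

def fits_in_a_vm(ram_gb, cpus):
    """True if the workload's needs fall within the virtual-machine maxima."""
    return ram_gb <= VM_MAX_RAM_GB and cpus <= VM_MAX_CPUS

print(fits_in_a_vm(4.0, 1))  # -> False: needs more RAM than a VM can offer
```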
Network: Problems in Functional Layers
(Layer — reasons for connectivity loss)
• Physical hardware: gateway; router; switch
• Virtual machine or guest OS: firewall running inside the guest OS; packets dropped inside the guest OS
• VMkernel or Service Console: virtual switch configuration; VLAN configuration settings; NIC teaming configuration; Service Console NIC device driver
• You may not be able to tell which layer the problem is in from the symptom alone:
  • Example: a ping timeout (lost network connectivity) to a virtual machine could originate in any layer
Network: Issues With Physical Hardware
Eliminate any problems with physical network components:
• Bad cables
• Bad switch ports
• Broken firmware or hardware
• Software or configuration problems on the switch
Network: VMkernel or Service Console Issues
• Configure networking components to work properly with the ESX Server:
  • Configure the Service Console NIC to properly negotiate a link with the switch
    • For example, see KB Article #1564
  • Configure VLAN switch ports as trunk ports
  • Configure link aggregation on the switch-port side if using NIC teaming with out-IP balancing mode
Network: VMkernel or Service Console Issues
• Ensure that your overall network design is correct
  • Check for proper VLAN configuration
  • Check for proper firewall design
• Ensure that applications do not conflict with VMkernel networking functionality
  • Unicast NLB does not work with VMotion: the two features require conflicting VMkernel parameters
    • For more information, see KB Article #1573
Network: Virtual Machine or Guest OS Problems
• Prevent random connectivity loss when using the vlance virtual NIC with Windows 2003:
  • Install the updated AMD PCNET driver
  • For more information, see KB Article #1631
• Isolate problems by switching the guest OS network adapter from vlance to vmxnet, or vice versa
• Eliminate duplicate-IP-address errors that result from changing a network connection’s TCP/IP configuration from DHCP to a static IP address in a Windows guest OS
  • The problem can also occur in a native environment
  • For more information, see KB Article #1179
• Allow a Red Hat Linux 9.0 virtual machine to properly receive a DHCP-assigned IP address
  • Symptom is the error message “Determining IP information for eth0… failed; no link present. Check cable?”
  • For more information, see KB Article #977
Storage: Document and Verify SAN Topology
• How many hosts? How many HBAs in each?
• How many arrays? How many SPs in each? How many ports per SP?
• How many FC switches? How is each zoned? How are they interconnected?
• How is each HBA cabled to each FC switch?
• How is each SP port cabled to each FC switch?
• Are your disk arrays and HBAs supported?
  • Disk arrays can be supported for basic LUN connectivity but not for advanced features
Storage: Avoid Common Errors
• When installing ESX Server on a production system:
  • Disconnect Fibre Channel HBAs if doing a local install
  • Carefully verify zoning before doing a boot-from-SAN install
    • The installer lets you wipe any accessible disk, including SAN LUNs that others may be using
• If possible, keep the VMkernel’s resources on a path not exposed to SAN-administrator error:
  • The VMkernel’s core-dump partition (filesystem type vmkcore)
  • The VMkernel swap file (a VMFS partition)
• Dedicate HBAs to the VMkernel; do not share them with the Service Console
  • Eliminates the risk of I/O contention between the two
  • Preserves the ability to dynamically scan for new SAN LUNs
Storage: Troubleshoot SAN Connectivity
When a SAN-based VMFS is not available, work along the path step by step:
1. Does the VMkernel see the LUN? Check /proc/vmware/scsi.
   • LUN is present: there is no VMFS in the LUN.
   • LUN is absent: continue.
2. Does the FC card see the LUN? Boot into its menu.
   • LUN is present: VMkernel configuration problem.
   • LUN is absent: continue.
3. Does the FC switch see the FC card? Check for fabric login.
   • No, the switch does not see the card: cabling problem.
   • Yes, the switch sees the card: continue.
4. Will the switch allow the FC card to talk to storage? Check for port login.
   • No port login to the target: zoning problem.
   • Port login to the target occurred: LUN masking problem.
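The decision sequence in this flowchart can be encoded as a function, which makes the diagnosis order explicit. This is an illustrative sketch, not a VMware tool; the boolean inputs correspond to the four checks above:

```python
# Illustrative encoding of the SAN-connectivity flowchart. Each boolean
# is the answer to one of the four checks; the function returns the
# diagnosis the flowchart would reach. Not a VMware utility.

def diagnose_san_vmfs(vmkernel_sees_lun, fc_card_sees_lun,
                      switch_sees_card, port_login_ok):
    if vmkernel_sees_lun:
        return "No VMFS in LUN"            # LUN visible in /proc/vmware/scsi
    if fc_card_sees_lun:
        return "VMkernel configuration problem"
    if not switch_sees_card:
        return "Cabling problem"           # no fabric login
    if not port_login_ok:
        return "Zoning problem"            # no port login to the target
    return "LUN masking problem"           # login occurred, LUN still hidden

print(diagnose_san_vmfs(False, False, True, False))  # -> "Zoning problem"
```

Working from the host outward this way rules out each layer before blaming the next one.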
Storage: SAN Multipathing
• Multipathing allows continuous availability of a SAN LUN in the event of a hardware failure
  • The administrator may set preferred paths for each LUN
• ESX Server supports failover with any supported HBAs
  • Failover occurs automatically, with a configurable delay
• Do not attempt to combine ESX Server failover with other multipathing solutions
  • Other software or hardware multipathing will conflict with the VMkernel
• Use zones to enforce access from your ESX Server to your disk array
• Choose the right failover policy for your disk array
Storage: Failover Policies, Disk-Array Types
• ESX Server multipath failover policies: MRU and Fixed
  • The Fixed policy “fails back” once the original failed path is restored (that path is preferred)
  • MRU (“most recently used”) does NOT fail back, even when the original failed path is restored (with MRU there is no preferred path)
• Disk-array types: active/active and active/passive
  • Active/active: any SP may access any LUN at any time
  • Active/passive: one SP is active at any time; the other is a hot standby
• Examples of disk-array types:
  • Active/active: EMC Symmetrix, IBM ESS (“Shark”), Hitachi 9900
  • Active/passive: HP MSA 1000, EMC CLARiiON, IBM FAStT
    • HP EVA is technically active/active, but ESX Server uses it as active/passive
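The difference between the two policies can be illustrated by replaying path events: Fixed returns to the preferred path when it is restored, while MRU stays on whatever path it most recently used. This sketch is not VMware code; the two-path model, path names, and event format are invented:

```python
# Illustrative two-path model of the Fixed vs. MRU failover policies.
# "pref" is the preferred path, "alt" the alternate; events are
# ("fail", path) or ("restore", path). Invented for demonstration.

def active_path(policy, preferred, events):
    """Replay failure/restore events; return the path in use afterward."""
    current = preferred
    for action, path in events:
        if action == "fail" and path == current:
            # The current path died: move to the other path.
            current = "alt" if current == preferred else preferred
        elif action == "restore" and path == preferred and policy == "fixed":
            current = preferred   # Fixed fails back; MRU does not
    return current

events = [("fail", "pref"), ("restore", "pref")]
print(active_path("fixed", "pref", events))  # -> "pref" (failed back)
print(active_path("mru", "pref", events))    # -> "alt"  (no fail-back)
```

The no-fail-back behavior is why MRU suits active/passive arrays: it avoids bouncing the LUN between storage processors.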
Storage: Clustering Correctly
(Cluster type — actions required)
• Cluster-in-a-box:
  • In the virtual machine configurations, set the bus sharing type to virtual
• Cluster-across-boxes:
  • In the virtual machine configurations, set the bus sharing type to physical
  • Set the VMFS accessibility mode to shared, whether using virtual disks or raw-device mappings
  • If not using RDM, replace VMFS volume labels in the virtual machine configuration with vmhbaC:T:L:P paths
• Physical-to-virtual clustering:
  • In the virtual machine configuration, set the bus sharing type to physical
  • Use raw-device mapping in physical compatibility mode
Storage: VMFS Accessibility Modes
The VMFS accessibility mode controls whether more than one virtual machine at a time can access a file in a VMFS (which resides in a SCSI LUN).
• public mode
  • How virtual disks are protected from corruption: the VMkernel locks entire virtual disks
  • How the VMFS structure is protected from corruption: VMkernels cooperate to make one change at a time
  • When to use: general use
• shared mode
  • How virtual disks are protected from corruption: software inside the virtual machines (MSCS, etc.) requests SCSI reservations
  • How the VMFS structure is protected from corruption: the VMFS structure is read-only
  • When to use: cluster across boxes
• Note: SCSI reservations can limit access to one ESX Server at a time
Storage: Troubleshoot Boot-From-SAN
Boot from a SAN LUN fails. Work through these checks:
1. Supported hardware? Check the SAN guide.
   • Unsupported: replace the unsupported gear.
   • Supported: continue.
2. Was the installation done with the boot-from-san option?
   • No: repeat the install.
   • Yes: continue.
3. Is the server’s BIOS set to boot from FC?
   • No: modify the BIOS boot order.
   • Yes: continue.
4. Is the FC card pointed at the correct disk-array WWN and LUN?
   • No: modify the FC card configuration.
   • Yes: mask any lower-numbered LUN; rule out cabling or zoning problems.
Storage: Boot-From-SAN BIOS Configuration
• Ensure that the Fibre Channel adapter’s boot code is enabled
• Configure the BIOS so that the Fibre Channel adapter is the boot device and the desired LUN is the boot volume
• Disable the built-in IDE controller if present
Reliability: Avoid Common Causes of Crashes
• Hardware problems may produce a crash or hang:
  • System memory: “burn in” new memory for 72 hours
  • CPU failures or problems
  • (Local) boot disk
  • Storage: the controller’s battery dies and it loses its cache
• VMware software: apply patches or upgrades
• ASR: Automatic System Reset (or reboot)
  • If intermittent: difficult to predict, track, and document
• Some problems may have to be referred to the OEM for diagnosis:
  • Third-party vendor software (backup or management agents)
  • Disk hardware
Reliability: Avoid Common Causes of Crashes
ESX Server misconfigurations:
• BIOS settings:
  • For example, MPS table mode needs to be set to full table APIC; otherwise devices do not get PCI IRQ routing entries
  • For details, see Knowledge Base Article #1081
• ESX Server SAN configuration:
  • Each LUN must have at least one active path
  • If the SAN is active/passive, the ESX Server multipathing policy should be set to MRU
    • If set to Fixed, it can easily cause a crash or hang
• VMkernel and Service Console device allocation:
  • When using vmkpcidivy, make sure that devices are properly allocated (to the Service Console, to virtual machines, or to both)
External and/or environmental factors:
• Connections physically broken (to the network, attached storage, etc.)
• Temperature and other environmental changes
• SAN hardware online maintenance:
  • Some SAN products have an online-maintenance feature (Storage Processor service downtime)
  • It does not work reliably with ESX Server and can sometimes cause a crash or hang
  • It may need to be disabled
Reliability: PSOD (Purple Screen of Death)
• Caused by hardware problems, VMkernel problems, and Service Console oopses
  • These typically halt the system (“PSOD”, or Purple Screen of Death)
• On the next reboot, the Service Console copies the contents of the vmkcore partition to a core file placed in user root’s home directory (/root)
• The customer needs to run vm-support and upload the resulting file to ftpsite.vmware.com
Reliability: Server Management
Problems not with the entire Service Console, but with one of its key daemon processes:
• vmware-serverd
• vmware-ccagent (if managed by a VirtualCenter server)
• vmware-authd
A crash usually produces a conventional Unix core file:
• Path: /var/log/vmware/core
• Also produces a characteristic “511 error”
Reliability: vmm (Virtual Machine Monitor)
If the vmm world crashes:
• If the virtual machine has one VCPU, a core file named vmware-core is generated
• If the virtual machine has two VCPUs, two core files, vmware-core0 and vmware-core1, are generated
• Core files are placed in the virtual machine’s configuration-file directory
• Core files are archived and compressed
  • For example, vmware-core is archived and compressed into vmware-core.gz
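The naming rule can be sketched as a small helper. The one- and two-VCPU cases come from this slide; extending the same pattern to more VCPUs is an assumption:

```python
# Sketch of the core-file naming rule above. The 1- and 2-VCPU cases are
# from the slide; generalizing the numbered pattern beyond two VCPUs is
# an assumption for illustration.

def expected_core_files(num_vcpus):
    """Names of the core files generated when the vmm world crashes."""
    if num_vcpus == 1:
        return ["vmware-core"]
    return [f"vmware-core{i}" for i in range(num_vcpus)]

print(expected_core_files(2))  # -> ['vmware-core0', 'vmware-core1']
```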
Reliability: the Virtual Machine’s Processes
Service Console processes:
• vmware-vmx processes support removable virtual devices (virtual floppy disk, virtual CD-ROM, etc.)
• The vmware-mks process supports the virtual mouse, keyboard, and video through VMware Remote Console
• A crash of any of these typically produces a conventional Unix core file: “core” (or “core.pid”)
Reliability: the Virtual Machine’s Guest OS
Guest OS: Windows, Linux, NetWare, etc.
• Produces a bluescreen (BSOD) or an application fault
• Capture the screen
• Minidump (64 KB) or kernel dump (size of the kernel image)
• Dr. Watson output file (Windows)
• Core file (Linux)
• Abend (NetWare)
Reliability: Guest OS Bluescreen
Troubleshoot just as you would a similar problem on a physical machine.
Summary
• Performance problems
• Network problems
• SAN and storage problems
• System reliability problems