Page 1: ESX performance problems 10 steps

Monitoring and Intelligently Reacting to ESX Performance

Greg Shields
Partner and Principal Technologist
Concentrated Technology
www.ConcentratedTech.com

Page 2: ESX performance problems 10 steps

This slide deck was used in one of our many conference presentations. We hope you enjoy it, and invite you to use it within your own organization however you like.

For more information on our company, including information on private classes and upcoming conference appearances, please visit our Web site, www.ConcentratedTech.com.

For links to newly-posted decks, follow us on Twitter: @concentrateddon or @concentratdgreg

This work is copyright ©Concentrated Technology, LLC

Page 3: ESX performance problems 10 steps

Class Discussion

What kinds of performance things should one monitor on an ESX server?
– Why?

Page 5: ESX performance problems 10 steps

ESX Performance 101

Processor Use
– Processor use on any server > 80%

Consider this “overuse”.
– Reduce processing requirements on VMs.
– Migrate VMs elsewhere, rebalance.

Memory Use
– Memory use on any server > 80%

Consider this “overuse”.
– Reduce assigned vRAM to VMs, if possible.
– Migrate VMs elsewhere, rebalance.
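A minimal sketch of the 80% rule of thumb above, assuming the host's CPU and memory use are already available as percentages. The function name and sample numbers are illustrative only and are not part of any VMware API.

```python
# Rule of thumb from the slide: use above 80% counts as "overuse".
OVERUSE_PCT = 80.0

def overused_resources(cpu_pct, mem_pct, threshold=OVERUSE_PCT):
    """Return the names of host resources whose use exceeds the threshold."""
    usage = {"cpu": cpu_pct, "memory": mem_pct}
    return [name for name, pct in usage.items() if pct > threshold]

# Hypothetical host: CPU is over the line, memory is not.
print(overused_resources(cpu_pct=86.5, mem_pct=72.0))
```

Anything the function returns is a candidate for reducing VM requirements or rebalancing VMs to another host.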

Page 7: ESX performance problems 10 steps

ESX Performance 201

Network throughput
– Network throughput > 80% and steady

Begin analyzing throughput consumption
– Consider re-routing heavy consumption to independent pNICs & independent vSwitches.
– Rebalance load, although this tends to just shift problems.

Context Switches
– Context switches significantly higher than baseline
– Analyze workload. Consider V2P.
– Rebalance.
– Upgrade hardware to Nehalem / Opteron
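The slide does not say how much higher than baseline counts as "significantly higher". One possible reading, sketched below, flags a value that sits more than a few standard deviations above the baseline mean. The three-sigma default, the function name, and the sample numbers are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

def above_baseline(baseline, current, sigmas=3.0):
    """True if `current` exceeds the baseline mean by more than `sigmas` standard deviations."""
    return current > mean(baseline) + sigmas * stdev(baseline)

# Hypothetical context-switch rates recorded while the host was healthy.
baseline = [4100, 3950, 4200, 4050, 4000]

print(above_baseline(baseline, 9500))   # far above the recorded baseline
print(above_baseline(baseline, 4150))   # within normal variation
```

The important part is having a recorded baseline at all; the exact cutoff can be tuned to your environment.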

Page 8: ESX performance problems 10 steps

ESX Performance 201

IOPS
– IOPS demand > IOPS supply

Consider this “overuse”
– Analyze with esxtop or Disk | Usage in Performance tab
– Adding disks spreads spindle demand, reduces contention
– Consider more/smaller datastores
– Consider new storage hardware that can rebalance internally based on observed contention. $$$

DEMO: ESX performance tab.
DEMO: Customizing perf stats intervals

Page 10: ESX performance problems 10 steps

Thank you!
Class Dismissed!

“Uh, Gimme’ a Break, Greg. Is that All You’ve Got?”

Page 11: ESX performance problems 10 steps

ESX Performance 301

The Structured Approach!
– Greg’s TEN STEP Plan to VM Happiness
– Computers are deterministic.
– Virtual computers are as well, however they are much more complicated.
– Virtual computers have so many more dependencies than traditional computers. Makes the ad hoc process less intuitive.
– Your “gut feeling” with virtual environments is less effective.

Homework Reading: Performance Troubleshooting for VMware vSphere 4
Get it at VMware.com

Page 12: ESX performance problems 10 steps

Step 1: VMware Tools

If the VMware Tools aren’t working, this will cause numerous low-level issues.
– Always start by verifying their functionality

DEMO: Verifying VMware Tools status

Page 13: ESX performance problems 10 steps

Step 2: Verify Host CPU Saturation

CPU saturation on an ESX host creates contention, which slows down all VMs.
– Performance | Advanced
– CPU | Usage
– Is this number consistently above 75%?
– If yes, go to Step 3.
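"Consistently above 75%" is different from a single spike. A rough sketch of one way to test it, assuming you have exported a series of CPU usage samples as percentages: treat the host as saturated only if most samples exceed the threshold. The 90% cutoff for "most" and the function name are assumptions, not VMware guidance.

```python
def consistently_above(samples, threshold=75.0, min_fraction=0.9):
    """True if at least `min_fraction` of the samples exceed `threshold`."""
    if not samples:
        return False
    over = sum(1 for s in samples if s > threshold)
    return over / len(samples) >= min_fraction

# Hypothetical exports from the host's CPU | Usage chart.
sustained = [80, 82, 79, 91, 77, 85, 88, 76, 90, 83]
one_spike = [20, 25, 99, 22]

print(consistently_above(sustained))  # sustained load: go to Step 3
print(consistently_above(one_spike))  # a single spike is not saturation
```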

Page 14: ESX performance problems 10 steps

Step 3: Verify VM Ready Time

If host CPU usage is high, then the next step is to see which VM is causing the problem.
– Select Host | Virtual Machines tab | Host CPU – MHz column.
– Locate high-use VM.
– Select VM | Performance tab | CPU | Ready (all vCPUs)

If Ready > 2000ms for any vCPU, then host CPU saturation exists.
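The Ready counter is a summation in milliseconds over one sample interval, so it helps to see it as a percentage of that interval. The sketch below assumes the real-time chart, which samples every 20 seconds; at that interval 2000ms of ready time means the vCPU spent 10% of the interval waiting for a physical CPU. Other chart ranges use longer intervals, so pass a different `interval_s` for those. Function names and vCPU labels are illustrative.

```python
REALTIME_INTERVAL_S = 20  # assumption: values come from the real-time chart

def ready_percent(ready_ms, interval_s=REALTIME_INTERVAL_S):
    """Convert a Ready summation in ms to a percentage of the sample interval."""
    return ready_ms * 100 / (interval_s * 1000)

def saturated_vcpus(ready_ms_by_vcpu, limit_ms=2000):
    """Return the vCPUs whose Ready time exceeds the slide's 2000ms limit."""
    return [vcpu for vcpu, ms in ready_ms_by_vcpu.items() if ms > limit_ms]

# Hypothetical per-vCPU Ready values for one VM.
ready = {"vcpu0": 450, "vcpu1": 2600}

print(ready_percent(2000))     # the limit expressed as a percentage
print(saturated_vcpus(ready))  # vCPUs over the limit
```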

Page 15: ESX performance problems 10 steps

Step 3: Solutions

Rebalance VMs. Move VMs off this host.

Increase CPU shares available to host, if resource constrained.
– Resource Pools can do this.

Reduce the number of vCPUs assigned to VMs.

Add hosts.

Page 16: ESX performance problems 10 steps

Step 4: Verify Guest CPU Saturation

Remember that CPU saturation can happen on the host, but it can also happen in the VM.
– Shares/Limits/Other can restrict guest processing.
– “Everything looks good on the host, but the guest is running at 100%”

Check VM CPU for saturation
– Select VM | Performance tab | CPU | Usage
– Is this number consistently above 75%?

Page 17: ESX performance problems 10 steps

Step 4: Solutions

The VM is working too hard
– (Aren’t we all?)
– Not getting enough resources to accomplish its task. Assign more CPU shares.
– Installed workload not well-throttled. Throttle or reconfigure applications. Balance processing across time of day.
– Add vCPUs. Only do this if the application is multi-threaded.
– Remove pinning of processes to processors.

Page 18: ESX performance problems 10 steps

Step 4½: Verify VMs are Actually Using their vCPUs

An interesting reverse!

Assigning multiple vCPUs to a VM that isn’t using them wastes resources.
– If that VM isn’t using the vCPU, remove it so another VM can use it instead.
– Select VM | Performance tab | CPU | Usage
– Look at all vCPU objects.
– Is usage for all vCPUs but one close to 0?
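The "all vCPUs but one close to 0" test can be sketched as below, assuming per-vCPU usage has been read off the chart as percentages. The 5% cutoff for "close to 0" and the function name are assumptions.

```python
def vcpus_to_reclaim(usage_by_vcpu, idle_pct=5.0):
    """Return how many vCPUs look unused when all but (at most) one are near idle."""
    busy = [pct for pct in usage_by_vcpu if pct > idle_pct]
    if len(usage_by_vcpu) > 1 and len(busy) <= 1:
        return len(usage_by_vcpu) - 1
    return 0

# Hypothetical 4-vCPU VM running a single-threaded workload.
print(vcpus_to_reclaim([62.0, 1.2, 0.4, 0.9]))
# Hypothetical 2-vCPU VM that really uses both.
print(vcpus_to_reclaim([55.0, 48.0]))
```

A non-zero result suggests the VM could drop to a single vCPU and free the rest for other VMs.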

Page 19: ESX performance problems 10 steps

Step 4½: Solutions

Reduce assigned vCPUs to one.
– …and don’t do that again!

Page 20: ESX performance problems 10 steps

Step 5: Check for Host Memory Swapping

Memory swapping is almost always a condition you want to avoid.
– Swapping exerts an incredible tax on performance.
– A solution of last resort.
– Select Host | Performance tab | Memory | Swap In/Out Rate
– Is either of these above 0?
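Because the check is simply "above 0", it is easy to automate once the two swap-rate series are exported. A small sketch, with illustrative names and made-up sample values:

```python
def host_swap_activity(swap_in_rates, swap_out_rates):
    """Report, per direction, whether any sample of the swap rate is above zero."""
    return {
        "swap_in": any(rate > 0 for rate in swap_in_rates),
        "swap_out": any(rate > 0 for rate in swap_out_rates),
    }

# Hypothetical samples: the host swapped out briefly but never swapped in.
activity = host_swap_activity([0, 0, 0, 0], [0, 128, 64, 0])
print(activity)
print("host is swapping" if any(activity.values()) else "no swapping")
```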

Page 21: ESX performance problems 10 steps

Step 5: Solutions

Limited solutions for memory swapping.
– Reduce memory overcommit. Drop the level of assigned memory in each VM as appropriate.
– Most of us over-assign memory to VMs anyway. So, at least at first, this can sometimes be effective.
– Reduce reservations. Too many reservations can impact optimization of memory sharing.
– Add RAM.
– Enable resource controls. Note that this might cause VM memory swapping.

DEMO: Verifying a VM’s balloon driver is functioning.

Page 22: ESX performance problems 10 steps

Step 5½: Check for VM Memory Swapping

The solutions for Step 5 can cause downstream effects in each VM.
– You decrease available RAM
– VM doesn’t have enough
– VM itself has to swap

This is a situation just as bad as host swapping.
– Select Host | Performance tab | Memory | Real-Time | Stacked Graph (per VM)
– Are any VMs reporting memory swapping > 0?
– If so, then that VM needs more RAM.

Page 23: ESX performance problems 10 steps

Step 5½: Solutions

That VM needs more RAM.
– You’ve gone too far with restricting its resources.

Page 24: ESX performance problems 10 steps

Step 6: Check for Overloaded Storage

Many paths for verifying storage utilization.
– IOPS is an emerging metric.
– Can also verify Command Aborts. Identifies the number of SCSI commands that were aborted.
– Select host | Performance tab | Disk | Command Aborts | Attached LUNs.
– Are any LUNs showing Command Aborts > 0?
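With many attached LUNs, it can be easier to filter exported counters than to eyeball the chart. A minimal sketch, assuming a mapping of LUN name to Command Aborts count; the LUN names and values are hypothetical.

```python
def luns_with_aborts(aborts_by_lun):
    """Return the LUNs (sorted by name) that report any aborted SCSI commands."""
    return sorted(lun for lun, count in aborts_by_lun.items() if count > 0)

# Hypothetical Command Aborts counters per attached LUN.
aborts = {"lun-prod-01": 0, "lun-prod-02": 7, "lun-test-01": 0}

print(luns_with_aborts(aborts))
```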

Page 25: ESX performance problems 10 steps

Step 6: Solutions

This indicates that the storage layer cannot keep up with the demands of VMs.
– Increase storage performance. $$$
– Segregate storage. Modularity assists here.
– Spread VMFS LUNs across more spindles. Add disks. Reduces storage contention.
– Use tools like vscsiStats to quantify storage behaviors.
– Balance memory with storage. Sometimes throwing more RAM at a VM lessens its storage demand.
– Buy new storage. Buy more storage. $$$

Page 26: ESX performance problems 10 steps

Step 6: vscsiStats

http://communities.vmware.com/docs/DOC-10095
– IO size
– Seek distance
– Outstanding IOs
– Latency in ms

Page 27: ESX performance problems 10 steps

Step 7-1: Check for Inbound Networking Problems

An inbound network problem is a VM that cannot process receive packets.
– Packets are coming in over the wire, but the VM lacks the resources to process them.
– Thus, those packets must be dropped and retransmitted, reducing effective performance.
– This creates a cascading problem. More dropped packets == more retransmitted ones == more to do == more oversubscription. Yikes!
– Select host | Performance tab | Network | Receive Packets Dropped
– Is this value greater than 0?

Page 28: ESX performance problems 10 steps

Step 7-1: Solutions

An inability to process inbound packets usually relates to vProc overutilization.
– With vNICs, your processor is needed to process their workloads.
– Not enough processor == a less-capable vNIC
– Reduce VM CPU utilization
– Increase VM CPU reservation
– Add pCPUs. Add servers.
– Verify VMs are using the most-effective driver (VMXNET3 for most workloads).

Page 29: ESX performance problems 10 steps

Step 7-2: Check for Outbound Networking Problems

An outbound network problem is a VM that cannot effectively send packets.
– Outbound VM packets are buffered at the vSwitch.
– Heavy traffic at the vSwitch can overload its attached pNIC.
– When this happens, packets get dropped and must be retransmitted.
– Select host | Performance tab | Network | Transmit Packets Dropped
– Is this value greater than 0?

Page 30: ESX performance problems 10 steps

Step 7-2: Solutions

An inability to process outbound packets often requires additional pNICs.
– Aggregate more pNICs to handle outbound load.
– Ensure you’re not using failover mode, but load balancing.
– Rebalance high network use VMs to other hosts.
– Rebalance high network use VMs to other vSwitches (which should be attached to different pNICs).
– Add networking.
– Reduce ambient network traffic. Isolate subnets.
– Ahhh, the old backups network problem. Or, the n00b who multicasts on the server net! We’ve all been that n00b at some point…
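Steps 7-1 and 7-2 point in different directions: receive drops usually lead back to VM CPU, while transmit drops usually lead to pNIC capacity. The sketch below encodes that triage, assuming the two dropped-packet counters have been read from the host's Network chart. The function name and wording of the hints are illustrative.

```python
def diagnose_drops(rx_dropped, tx_dropped):
    """Map dropped-packet counters to the step whose solutions apply."""
    findings = []
    if rx_dropped > 0:
        findings.append("inbound drops: check VM CPU utilization (Step 7-1)")
    if tx_dropped > 0:
        findings.append("outbound drops: check pNIC capacity and teaming (Step 7-2)")
    return findings

# Hypothetical host dropping only transmit packets.
for finding in diagnose_drops(rx_dropped=0, tx_dropped=312):
    print(finding)
```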

Page 31: ESX performance problems 10 steps

Step 8: Check for Slow Storage

“Slow” storage is represented by high storage latency.
– Essentially, the storage isn’t responding fast enough.
– Storage layer itself could be insufficient, or overloaded.
– Select host | Performance tab | Disk | Physical Device Read/Write Latency (all LUNs)
– Are any average latencies greater than 10ms, or any peaks above 20ms?*
– * These are VMware’s suggested starting values. Yours may be different based on storage architecture.
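The two latency limits above (average over 10ms, or any peak over 20ms) can be applied per LUN as in this sketch. It assumes latency samples in milliseconds have been exported for each LUN; the LUN names and samples are made up, and both limits are parameters because the slide notes that your starting values may differ.

```python
from statistics import mean

def slow_luns(latency_ms_by_lun, avg_limit=10.0, peak_limit=20.0):
    """Return {lun: (average, peak)} for LUNs that exceed either latency limit."""
    slow = {}
    for lun, samples in latency_ms_by_lun.items():
        avg, peak = mean(samples), max(samples)
        if avg > avg_limit or peak > peak_limit:
            slow[lun] = (avg, peak)
    return slow

# Hypothetical latency samples in ms.
latencies = {
    "lun0": [4.0, 6.0, 5.0, 7.0],      # healthy
    "lun1": [3.0, 4.0, 22.0, 3.0],     # low average, but one peak above 20ms
    "lun2": [12.0, 11.0, 13.0, 12.0],  # average above 10ms
}

print(sorted(slow_luns(latencies)))
```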

Page 32: ESX performance problems 10 steps

Step 8: Solutions

This indicates that the storage layer cannot keep up with the demands of VMs.
– Increase storage performance. $$$
– Segregate storage. Modularity assists here.
– Spread VMFS LUNs across more spindles. Add disks. Reduces storage contention.
– Use tools like vscsiStats to quantify storage behaviors.
– Balance memory with storage. Sometimes throwing more RAM at a VM lessens its storage demand.
– Buy new storage. Buy more storage. $$$
– Notice that these are the same as for Step 6!

Page 33: ESX performance problems 10 steps

Step 8: Solutions

[Diagram labels: ESX Server, ESX Server, SAN Storage Device]

Page 34: ESX performance problems 10 steps

Step 9: Check for Low VM CPU Utilization

Wait a minute! Isn’t low VM CPU utilization a good thing? Isn’t this why virtualization works?
– Yes, and no.
– Low VM CPU utilization can mean a low-needs workload.
– It can also mean a workload in a wait state.
– Only check here if end user experience is suffering.
– Select VM | Performance tab | CPU | Usage (VM)
– Is this a lower than expected value?
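The key point in this step is that low CPU use only matters when users are already suffering. A tiny sketch of that two-part condition; "expected" usage is whatever you normally see for the workload, and every name and number here is an assumption.

```python
def suspect_wait_state(cpu_usage_pct, expected_pct, users_suffering):
    """True only when users are suffering AND the VM's CPU use is below what is expected."""
    return users_suffering and cpu_usage_pct < expected_pct

# Hypothetical VM that normally runs near 40% CPU.
print(suspect_wait_state(cpu_usage_pct=6.0, expected_pct=40.0, users_suffering=True))
# Same low usage, but nobody is complaining: probably just a low-needs workload.
print(suspect_wait_state(cpu_usage_pct=6.0, expected_pct=40.0, users_suffering=False))
```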

Page 35: ESX performance problems 10 steps

Step 9: Solutions

Suffering end user experience but low CPU utilization usually indicates a wait state.
– Verify other counters: Network, storage.
– Storage response time?
– Network response time?
– Other servers or virtual servers that this workload relies upon to do its job?
– Another common source: Overly restrictive resource allocations.

Page 37: ESX performance problems 10 steps

Step 10: Check for Memory Reclamation

Remember that ESX’s balloon driver will reclaim memory that it doesn’t believe a VM needs.
– However, that driver has very limited visibility into what each VM is actually doing with its memory.
– It becomes a problem when memory that the VM needs is reclaimed. Kind of like a double page fault.
– Select host | Performance tab | Memory | Balloon
– If this value is greater than 0, then…
– Select VM | Performance tab | Memory | Stacked Graph (per VM) | Balloon.
– Is this value greater than 0 for the specific VMs which are experiencing problems?
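This step is a two-stage check: look at the host first, and only then narrow down to the VMs that are actually misbehaving. A sketch under the assumption that balloon sizes have been exported as numbers; the VM names and values are hypothetical.

```python
def ballooned_problem_vms(host_balloon, balloon_by_vm, problem_vms):
    """Return the problem VMs that are being ballooned, if the host is ballooning at all."""
    if host_balloon <= 0:
        return []  # stage 1: no ballooning on the host, nothing more to check
    # Stage 2: only VMs that are both misbehaving and ballooned are of interest.
    return sorted(vm for vm in problem_vms if balloon_by_vm.get(vm, 0) > 0)

# Hypothetical per-VM balloon sizes; only "sql01" and "web02" have user complaints.
balloon = {"sql01": 512, "web01": 256, "web02": 0}

print(ballooned_problem_vms(768, balloon, ["sql01", "web02"]))
```

Here "web01" is ballooned but healthy, so it is left alone; ballooning is only a problem when it takes memory a VM needs.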

Page 38: ESX performance problems 10 steps

Step 10: Solutions

Ballooning occurs when there’s not enough memory to go around.
– You’re oversubscribing your RAM.
– This can be a good thing, unless it takes memory from where it’s actually needed.
– Eliminate memory overcommitment on the host. Essentially, stop assigning more RAM to VMs than you have.
– Use reservations to ensure adequate memory for VMs.
– Be aware that this may just shift the problem elsewhere.
– Buy RAM. Buy servers. $$$
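"Assigning more RAM to VMs than you have" is simple arithmetic: sum the assigned memory and divide by the host's physical RAM. A ratio above 1.0 means the host is overcommitted. The sketch below ignores hypervisor overhead, and the VM names and sizes are made up.

```python
def overcommit_ratio(assigned_gb_by_vm, host_ram_gb):
    """Total RAM assigned to VMs divided by the host's physical RAM."""
    return sum(assigned_gb_by_vm.values()) / host_ram_gb

# Hypothetical host with 32 GB of RAM and 40 GB assigned across three VMs.
assigned = {"sql01": 16, "web01": 16, "web02": 8}
ratio = overcommit_ratio(assigned, host_ram_gb=32)

print(ratio)
print("overcommitted" if ratio > 1.0 else "not overcommitted")
```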

Page 41: ESX performance problems 10 steps

ESX Performance 401

Honestly…
– …go buy a product. Let someone else do the work!

This analysis takes time.
– Time that you probably don’t have.
– What you want is actionable information
– “Convert all this math into a ‘click here’ response.”

Page 42: ESX performance problems 10 steps

ESX Performance 401

Another problem throughout these approaches relates to their “perspective”.
– Virtualization touches everything in the datacenter and introduces dependencies everywhere.
– vSphere’s perspective means that it can only see behaviors as it observes them.
– Metaphor: Einstein’s Theory of Relativity.

Third-party products tie into networking, storage, applications, user experience, etc.
– They can interrelate performance from multiple perspectives.

Page 43: ESX performance problems 10 steps

ESX Performance 401

Who’s Who in Virtualization Performance and Capacity Management

Source: http://www.virtualizationpractice.com/blog/?p=6749

Page 44: ESX performance problems 10 steps

Final Thoughts

Virtualization adds ridiculous interdependencies to the IT datacenter that weren’t there before.
– No human alive can monitor all those metrics effectively and at all times.
– You need actionable information.
– Use these tips to get you started, solve the immediate problems.
– Consider investing in a set-it-and-forget-it solution.

Page 45: ESX performance problems 10 steps

Monitoring and Intelligently Reacting to ESX Performance

Greg Shields
Partner and Principal Technologist
Concentrated Technology
www.ConcentratedTech.com

Please fill out evaluations, or more servers will crash!

!!!
