VMware vCenter Server Fault Tolerance John Browne/Adly Taibi/Cormac Hogan Product Support Engineering Rev E. VMware Confidential

Apr 01, 2015


Tyrone Stalker
VMware vCenter Server Fault Tolerance

John Browne/Adly Taibi/Cormac Hogan

Product Support Engineering

Rev E.

VMware Confidential

VI4 - Mod 2-3 - Slide 2

Module 2 Lessons

Lesson 1 – vCenter Server High Availability

Lesson 2 – vCenter Server Distributed Resource Scheduler

Lesson 3 – Fault Tolerance

Lesson 4 – Enhanced vMotion Compatibility

Lesson 5 – DPM - IPMI

Lesson 6 – vApps

Lesson 7 – Host Profiles

Lesson 8 – Reliability, Availability, Serviceability ( RAS )

Lesson 9 – Web Access

Lesson 10 – vCenter Update Manager

Lesson 11 – Guided Consolidation

Lesson 12 – Health Status

VI4 - Mod 2-3 - Slide 3

Module 2-3 Lessons

Lesson 1 – Understanding Fault Tolerance

Lesson 2 – Prerequisites for Fault Tolerance

Lesson 3 – Setting up Fault Tolerance

Lesson 4 – Viewing information about Fault Tolerant VMs

Lesson 5 – Fault Tolerant Guidelines

Lesson 6 – Troubleshooting Fault Tolerance

VI4 - Mod 2-3 - Slide 4

Understanding VMware Fault Tolerance

The VMware Fault Tolerance (FT) feature creates a virtual machine configuration that can provide continuous availability.

VMware Fault Tolerance (FT) is built on the ESX/ESXi 4.0 host platform. FT is provided using the Record/Replay functionality implemented in the VM monitor.

VMware FT works by creating an identical copy of a virtual machine.

One copy of the virtual machine, called the primary, is in the active state, receiving requests, serving information, and running applications.

Another copy, called the secondary, receives the same input that is received by the primary.

VI4 - Mod 2-3 - Slide 5

Understanding VMware Fault Tolerance (ctd)

VI4 - Mod 2-3 - Slide 6

Understanding VMware Fault Tolerance (ctd)

VMware FT provides a higher level of business continuity than HA.

In the case of FT, the secondary immediately comes on-line and all (or almost all) information about the state of the virtual machine is preserved.

The state of the secondary machine depends on the latency and lag between the primary and secondary VMs.

VMware FT does not require a virtual machine restart, and applications and data stored in memory do not need to be re-entered or reloaded.

VI4 - Mod 2-3 - Slide 7

Virtual Machine Record & Replay

[Figure: two stacks labeled RECORD and REPLAY, each showing Application / Operating System / Virtualization Layer]

Record: logging causes of non-determinism

• Input (network, user), asynchronous I/O (disk, devices), CPU timer interrupts

Replay: deterministic delivery of events previously logged

• Result = repeatable VM execution
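
The record/replay idea can be illustrated with a toy model. This is only a sketch under stated assumptions: the "guest" is a hypothetical function whose state depends solely on the inputs it receives, and random numbers stand in for network packets or timer interrupts. It is not how the VM monitor implements the feature.

```python
import random


def run_vm(next_input, steps):
    """Toy 'guest': its final state depends only on the inputs delivered to it."""
    state = 0
    for _ in range(steps):
        state = (state * 31 + next_input()) % 1_000_003
    return state


def record(steps, seed=None):
    """Run with live non-deterministic input and log every value delivered."""
    rng = random.Random(seed)
    log = []

    def live_input():
        value = rng.randrange(256)  # stands in for a packet, keystroke, timer tick
        log.append(value)
        return value

    return run_vm(live_input, steps), log


def replay(log):
    """Re-run the guest, delivering the logged values in the same order."""
    entries = iter(log)
    return run_vm(lambda: next(entries), len(log))


primary_state, event_log = record(steps=1000)
secondary_state = replay(event_log)
print(primary_state == secondary_state)
```

Because every non-deterministic value is taken from the log during replay, the second run reaches the same state as the first.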

VI4 - Mod 2-3 - Slide 8

Virtual Machine Record & Replay (ctd)

For a given primary VM, FT runs a secondary VM on a different host.

The secondary shares virtual disks with the primary.

The secondary VM is kept in “virtual lockstep” via logging information sent over a private network connection.

Only the primary VM sends and receives network packets; the secondary is “passive”.

If primary host fails, secondary VM takes over with no interruption to applications.

VI4 - Mod 2-3 - Slide 9

FT in the VMkernel

The FT vmkernel module is called vmklogger.

Log entries are put in the log buffer, which is flushed/filled asynchronously.

Log entries are sent and received through a socket on a VMkernel NIC.

There should be a dedicated VMkernel network for logging which has FT Logging enabled.

[Figure: log entries flowing from the vmkernel on the primary host to the vmkernel on the backup host]

VI4 - Mod 2-3 - Slide 10

Determining Node Failure

FT sends frequent heartbeats over multiple NICs to determine when the primary or backup host is down.

The backup “goes live” and becomes the new primary if it declares the current primary dead.

There must be a method to distinguish a crashed host from a network failure (“split-brain”).

The method used is an atomic operation (rename) on a file on shared VMFS.

Whenever the primary or the backup believes the other host is down, it renames a common file.

The winner of the rename “race” survives; the loser of the rename “race” terminates itself.
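
The rename race can be sketched on an ordinary local directory. This is an illustration only: the helper name `try_go_live` is hypothetical, the `.ft-generation<N>` file naming is borrowed from the vmkernel messages shown later in this module, and a temporary directory stands in for the shared VMFS volume.

```python
import os
import tempfile


def try_go_live(directory, generation):
    """Try to claim the next generation; True means this side may go live."""
    current = os.path.join(directory, f".ft-generation{generation}")
    claimed = os.path.join(directory, f".ft-generation{generation + 1}")
    try:
        os.rename(current, claimed)  # a single atomic step on a POSIX filesystem
        return True
    except FileNotFoundError:
        return False  # the other side already renamed the file


with tempfile.TemporaryDirectory() as shared:
    # Both sides agree on generation 2 before the failure.
    open(os.path.join(shared, ".ft-generation2"), "w").close()
    primary_wins = try_go_live(shared, 2)  # first contender to reach the file
    backup_wins = try_go_live(shared, 2)   # second contender finds it gone
    outcome = (primary_wins, backup_wins)

print(outcome)
```

Only one contender can rename the file, so at most one side ever believes it is entitled to run as the primary.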

VI4 - Mod 2-3 - Slide 11

Record/Replay and FT Requirements: ESX/HW

CPUs: limited set of supported processors (AMD Barcelona+, Intel Penryn+); processors must be from the same family (i.e., no mix/match)

Hardware Virtualization must be enabled in the BIOS

Hosts must be in an HA-enabled cluster

Storage: shared storage (FC, iSCSI, or NAS)

Network: minimum of 3 NICs for various types of traffic (ESX Management/VMotion, VM traffic, FT Logging)

GigE required for VMotion and FT Logging

Minimize single points of failure in the environment, e.g. NIC teaming, multiple network switches, storage multipathing

Primary and secondary hosts must be running the same build of ESX

VI4 - Mod 2-3 - Slide 12

VMware Fault Tolerance and HA Work Together

FT VMs run only in an HA cluster

Mission-critical VMs are protected by FT and HA; the remaining VMs are protected by HA

When a host fails:

FT secondary takes over

New FT secondary is started by HA

HA-only VMs are restarted

[Figure: resource pool of hosts under VMware HA; failed components are marked X and FT-protected VMs are labeled VMware FT]

VI4 - Mod 2-3 - Slide 13

Module 2-3 Lessons

Lesson 1 – Understanding Fault Tolerance

Lesson 2 – Prerequisites for Fault Tolerance

Lesson 3 – Setting up Fault Tolerance

Lesson 4 – Viewing information about Fault Tolerant VMs

Lesson 5 – Fault Tolerant Guidelines

Lesson 6 – Troubleshooting Fault Tolerance

VI4 - Mod 2-3 - Slide 14

Prerequisites for VMware Fault Tolerance

For VMware FT to perform as expected, it must run in an environment that meets specific requirements.

The primary and secondary fault tolerant virtual machines must be in a VMware HA cluster.

Primary and secondary ESX/ESXi hosts should be of the same CPU model family.

Primary and secondary virtual machines must not run on the same host. FT will automatically place the secondary VM on a different host.

VI4 - Mod 2-3 - Slide 15

Prerequisites for VMware Fault Tolerance (ctd)

Storage

Virtual machine files must be stored on shared storage.

Shared storage solutions include NFS, FC, and iSCSI.

For virtual disks on VMFS-3, the virtual disks must be thick, meaning they cannot be "thin" or sparsely allocated.

Turning on VMware FT automatically converts the VM's disks to thick eager-zeroed disks.

Virtual-mode Raw Device Mapping (RDM) is supported. Physical-mode RDM is not supported.

VI4 - Mod 2-3 - Slide 16

Prerequisites for VMware Fault Tolerance (ctd)

Networking

Multiple gigabit Network Interface Cards (NICs) are required.

A minimum of two VMkernel Gigabit NICs must be dedicated to VMware FT Logging and VMotion.

The FT Logging interface is used for logging events from the primary virtual machine to the secondary FT virtual machines.

For best performance, use a 10 Gbit NIC rather than a 1 Gbit NIC, and enable jumbo frames.

VI4 - Mod 2-3 - Slide 17

Prerequisites for VMware Fault Tolerance (ctd)

Processor

SMP Virtual Machines are not supported.

The hosts running the primary and secondary virtual machines must be of the same CPU model family. Supported processors include the following:

Intel Core 2, also known as Merom

Intel 45nm Core 2, also known as Penryn.

Intel Next Generation, also known as Nehalem.

AMD 2nd Generation Opteron, also known as Rev E/F common feature set.

AMD 3rd Generation Opteron, also known as Greyhound.

VI4 - Mod 2-3 - Slide 18

Prerequisites for VMware Fault Tolerance (ctd)

Host BIOS

VMware FT requires that Hardware Virtualization (HV) be turned on in the BIOS. The process for enabling HV varies among BIOSes.

If HV is not enabled, attempts to power on a primary copy of a fault tolerant virtual machine produce the following error message:

"Fault tolerance requires that Record/Replay is enabled for the virtual machine. Module Statelogger power on failed."

VI4 - Mod 2-3 - Slide 19

Prerequisites for VMware Fault Tolerance (ctd)

If HV is enabled for the ESX/ESXi host that is hosting a primary copy of a fault tolerant virtual machine, but not on any other hosts in the cluster, the primary can be successfully powered on.

After the primary is powered on, VMware FT automatically attempts to start the fault tolerant secondary. This fails after a brief delay and produces the following error message:

"Secondary virtual machine could not be powered on as there are no compatible hosts that can accommodate it."

The primary remains powered on in live mode, but fault tolerance is not established.

VI4 - Mod 2-3 - Slide 20

Prerequisites for VMware Fault Tolerance (ctd)

Turn off power-management (also known as power-capping) in the BIOS.

If power management is left enabled, the secondary hosts may enter lower performance, power-saving modes.

Such modes can leave the secondary virtual machine with insufficient CPU resources, potentially making it impossible for the secondary to complete all tasks completed on a primary in a timely fashion.

Turn off hyperthreading in the BIOS.

If hyperthreading is left enabled and the secondary virtual machine is sharing a CPU with another demanding virtual machine, the secondary virtual machine may run too slowly to complete all tasks completed on the primary in a timely fashion.

VI4 - Mod 2-3 - Slide 21

Module 2-3 Lessons

Lesson 1 – Understanding Fault Tolerance

Lesson 2 – Prerequisites for Fault Tolerant VMs

Lesson 3 – Setting up Fault Tolerance

Lesson 4 – Viewing information about Fault Tolerant VMs

Lesson 5 – Fault Tolerant Guidelines

Lesson 6 – Troubleshooting Fault Tolerance

VI4 - Mod 2-3 - Slide 22

Setting Up Fault Tolerance

To enable Fault Tolerance, connect the vSphere client to the vCenter Server using an account with cluster administrator permissions.

1. In the Hosts & Clusters view, select a Virtual Machine.

2. Right-click the virtual machine and select Fault Tolerance > Turn Fault Tolerance On.

If the virtual machine is stored on thin-provisioned disk(s), or on thick disk(s) that are not eager-zeroed, those disk files must be converted to Thick-EagerZeroed before FT can be enabled.

When FT is enabled, a message appears informing users of this requirement and of the fact that the conversion will be completed.

The specified virtual machine is marked as a primary and a secondary is established on another host. FT is now enabled.

VI4 - Mod 2-3 - Slide 23

Setting Up Fault Tolerance (ctd)

VI4 - Mod 2-3 - Slide 24

Setting Up Fault Tolerance (ctd)

VI4 - Mod 2-3 - Slide 25

Module 2-3 Lessons

Lesson 1 – Understanding Fault Tolerance

Lesson 2 – Prerequisites for Fault Tolerant VMs

Lesson 3 – Setting up Fault Tolerance

Lesson 4 – Viewing information about Fault Tolerant VMs

Lesson 5 – Fault Tolerant Guidelines

Lesson 6 – Troubleshooting Fault Tolerance

VI4 - Mod 2-3 - Slide 26

Viewing Information about Fault Tolerant VMs

Fault Tolerant VMs have an additional Fault Tolerance pane on their summary tab which provides information about the Fault Tolerance setup and performance.

Fault Tolerance Status - Indicates the status of fault tolerance - Protected or Not Protected/Disabled.

VI4 - Mod 2-3 - Slide 27

Viewing Information about Fault Tolerant VMs (ctd)

Secondary Location - Displays the ESX/ESXi host on which the secondary virtual machine is hosted.

Total Secondary CPU - Indicates all secondary CPU usage, displayed in MHz.

Total Secondary Memory - Indicates all secondary memory usage, displayed in MB.

Secondary VM Lag Time - Shows the current delay between the primary and secondary VM.

Log Bandwidth - Shows the bandwidth consumed on the link by Record/Replay operations between the primary and secondary VM.

This value is based on the FT operations only, and is not the bandwidth usage on the wire (i.e., with TCP/IP/Ethernet headers).

VI4 - Mod 2-3 - Slide 28

FT Virtual Machine files

Before VM is FT enabled

After VM is FT Enabled

VI4 - Mod 2-3 - Slide 29

Maps View of an FT VM

VI4 - Mod 2-3 - Slide 30

Module 2-3 Lessons

Lesson 1 – Understanding Fault Tolerance

Lesson 2 – Prerequisites for Fault Tolerant VMs

Lesson 3 – Setting up Fault Tolerance

Lesson 4 – Viewing information about Fault Tolerant VMs

Lesson 5 – Fault Tolerant Guidelines

Lesson 6 – Troubleshooting Fault Tolerance

VI4 - Mod 2-3 - Slide 31

VMware FT Restrictions

Many VMware Infrastructure features and third-party products are supported for use with VMware FT, but the following features are not:

Microsoft Cluster Services (MSCS): MSCS does its own failover and management. As a result, conflicts may arise with coexistence of VMware FT and MSCS solutions.

Nested Page Tables/Extended Page Tables (NPT/EPT): A restriction of the record/replay implementation. This restriction does not affect the user experience. Record/replay for virtual machines automatically disables NPT/EPT, even though other virtual machines on the same host can continue to use these features.

Paravirtualization: A restriction of the record/replay implementation. Record/replay does not work with paravirtualized guests.

Hot-plugging devices: A restriction of the record/replay implementation. Users cannot hot add and remove devices.

Automatic DRS recommendation application: For this release, an FT virtual machine cannot be used with DRS, though manual VMotion is allowed.

VI4 - Mod 2-3 - Slide 32

Features not supported with VMware FT

Symmetric multiprocessor (SMP) virtual machines.

Storage VMotion.

NPIV – N-Port ID Virtualization.

NIC passthrough.

Devices which do not have Record/Replay support such as USB and sound.

Some network interfaces for legacy network hardware such as vlance.

While some legacy drivers are not supported, VMware FT does revert to the supported vmxnet2 driver, thereby handling cases where vlance would otherwise be required.

Virtual Machine snapshots.

VI4 - Mod 2-3 - Slide 33

Fault Tolerance Best Practices

Ratio of Fault Tolerant VMs to ESX/ESXi hosts

Maintaining consistency between primary and secondary fault tolerant virtual machines makes significant use of disk and network resources.

You should have no more than four to eight fault tolerant virtual machines (primaries or secondaries) on any single host.

The number of fault tolerant virtual machines that you can safely run on each host cannot be stated precisely because the number is based on the ESX/ESXi host and virtual machine size and workload factors, all of which can vary widely.

VI4 - Mod 2-3 - Slide 34

Fault Tolerance Use Cases

Several typical situations can benefit from the use of VMware FT. For example:

Any application that needs to be available at all times. This especially applies to applications that have long-lasting client connections that users want to maintain during hardware failure.

Custom applications that have no other way of doing clustering.

Cases where high availability might be provided through MSCS, but MSCS is too complicated to configure and maintain.

VI4 - Mod 2-3 - Slide 35

Module 2-3 Lessons

Lesson 1 – Understanding Fault Tolerance

Lesson 2 – Prerequisites for Fault Tolerant VMs

Lesson 3 – Setting up Fault Tolerance

Lesson 4 – Viewing information about Fault Tolerant VMs

Lesson 5 – Fault Tolerant Guidelines

Lesson 6 – Troubleshooting Fault Tolerance

VI4 - Mod 2-3 - Slide 36

Primary vmware.log FT Startup Messages

Mar 04 15:40:41.556: vmx| MigrateStateUpdate: Transitioning from state 0 to 1.
Mar 04 15:40:41.557: vmx| Migrating to become primary
Mar 04 15:40:41.557: vmx| StateLogger_MigrateStart: VMotion srcIp 192.168.0.65, dstIp 192.168.0.55
Mar 04 15:40:41.557: vmx| StateLogger_MigrateStart: Logging srcIp 172.16.0.65, dstIp 172.16.0.55
...

Mar 04 15:40:49.538: vmx| VMXVmdbCbVmVmxMigrate: Got SET callback for /vm/#_VMX/vmx/migrateState/cmd/##1_202/op/=start
Mar 04 15:40:49.539: vmx| VmxMigrateGetStartParam: mid=464539447b562 dstwid=4953
Mar 04 15:40:49.539: vmx| Received migrate 'start' request for mig id 1236210039633250, dest world id 4953.
Mar 04 15:40:49.541: vmx| MigrateStateUpdate: Transitioning from state 1 to 2.
Mar 04 15:40:49.817: vcpu-0| MigrateStateUpdate: Transitioning from state 2 to 3.
Mar 04 15:40:49.818: vcpu-0| Migrate: Preparing to suspend.
Mar 04 15:40:49.819: vcpu-0| Migrating a secondary VM
Mar 04 15:40:49.819: vcpu-0| CPT current = 0, requesting 1
Mar 04 15:40:49.819: vcpu-0| Migrate: VM stun started, waiting 8 seconds for go/no-go message.
...

VI4 - Mod 2-3 - Slide 37

Primary vmware.log FT Startup Messages (ctd)

Mar 04 15:40:49.852: vmx| Migrate_Open: Migrating to <192.168.0.55> with migration id 1236210039633250
Mar 04 15:40:49.852: vmx| Checkpointed in VMware ESX, 4.0.0 build-151628, build-151628, Linux Host
Mar 04 15:40:49.853: vmx| BusMemSample: checkpoint 3 initPercent 75 touched 98304
Mar 04 15:40:49.854: vmx| FT saving on primary to create new backup
Mar 04 15:40:49.889: vmx| Connection accepted, ft id 2487727458.
Mar 04 15:40:49.892: vmx| STATE LOGGING ENABLED (interponly 0 interpbt 0)
Mar 04 15:40:49.893: vmx| LOG data
...

Mar 04 15:40:50.275: vmx| Migrate: VM successfully stunned.
Mar 04 15:40:50.276: vmx| MigrateStateUpdate: Transitioning from state 3 to 4.
Mar 04 15:40:50.890: vmx| MigrateSetStateFinished: type=1 new state=5
Mar 04 15:40:50.890: vmx| MigrateStateUpdate: Transitioning from state 4 to 5.
Mar 04 15:40:50.891: vmx| StateLogger_MigrateSucceeded: Backup connected
Mar 04 15:40:50.891: vmx| Migrate: Attempting to continue running on the source.
Mar 04 15:40:50.893: vmx| CPT current = 3, requesting 6
...

Mar 04 15:40:50.915: vmx| Continue sync while logging or replaying 8428
Mar 04 15:40:50.924: vmx| Migrate: cleaning up migration state.
Mar 04 15:40:50.924: vmx| MigrateStateUpdate: Transitioning from state 5 to 0.

VI4 - Mod 2-3 - Slide 38

Migration Transition States - Primary

MIGRATE_VMX_NONE – state 0
Base state. No migration currently in progress.

MIGRATE_TO_VMX_READY – state 1
VMX has received a MIGRATE_TO message. Waiting for the start message along with the world ID of the destination.

MIGRATE_TO_VMX_PRECOPY – state 2
VMX has received a MIGRATE_START message. Precopying data to destination.

MIGRATE_TO_VMX_CHECKPT – state 3
Precopy done. Saving checkpoint.

MIGRATE_TO_VMX_WAIT_HANDSHAKE – state 4
Done saving checkpoint. Waiting for acknowledgement from destination that the VMX started. Until the acknowledgement is received, the migration may still fail back to the source.

MIGRATE_TO_VMX_FINISHED – state 5
Migration succeeded or failed. On success, the VMX process needs to power down and clean up. On failure, the VM will continue running and be ready for the next migration operation after this state passes.

VI4 - Mod 2-3 - Slide 39

Migration Transition States - Secondary

MIGRATE_VMX_NONE – state 0
Base state. No migration currently in progress.

MIGRATE_FROM_VMX_INIT – state 7
VMX has received a MIGRATE_FROM message. Getting ready to receive VM.

MIGRATE_FROM_VMX_WAITING – state 8
VMX is ready and waiting for source to send VM data.

MIGRATE_FROM_VMX_PRECOPY – state 9
Both memory and checkpoint data are being copied to destination.

MIGRATE_FROM_VMX_CHECKPT – state 10
Data was precopied. Restoring checkpoint.

MIGRATE_FROM_VMX_FINISHED – state 11
Migration succeeded or failed. On success, the VMX process runs the migrated VM. After this state passes, VMX is ready for the next migration operation. On failure, the VM will power down and clean up.
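
When reading a vmware.log, it can help to translate the numeric states into the names listed on these two slides. The following is a minimal sketch: the enum simply mirrors the tables above, the helper name `transitions` is made up for this example, and state 6 is left out because the slides do not describe it (a log containing it would raise a ValueError here).

```python
import re
from enum import IntEnum


class MigrateState(IntEnum):
    MIGRATE_VMX_NONE = 0
    MIGRATE_TO_VMX_READY = 1
    MIGRATE_TO_VMX_PRECOPY = 2
    MIGRATE_TO_VMX_CHECKPT = 3
    MIGRATE_TO_VMX_WAIT_HANDSHAKE = 4
    MIGRATE_TO_VMX_FINISHED = 5
    MIGRATE_FROM_VMX_INIT = 7
    MIGRATE_FROM_VMX_WAITING = 8
    MIGRATE_FROM_VMX_PRECOPY = 9
    MIGRATE_FROM_VMX_CHECKPT = 10
    MIGRATE_FROM_VMX_FINISHED = 11


TRANSITION = re.compile(
    r"MigrateStateUpdate: Transitioning from state (\d+) to (\d+)\."
)


def transitions(log_text):
    """Return (from_state, to_state) pairs in the order they appear in the log."""
    return [
        (MigrateState(int(src)), MigrateState(int(dst)))
        for src, dst in TRANSITION.findall(log_text)
    ]


# Transition lines taken from the primary startup messages on slides 36 and 37.
primary_log = """\
Mar 04 15:40:41.556: vmx| MigrateStateUpdate: Transitioning from state 0 to 1.
Mar 04 15:40:49.541: vmx| MigrateStateUpdate: Transitioning from state 1 to 2.
Mar 04 15:40:49.817: vcpu-0| MigrateStateUpdate: Transitioning from state 2 to 3.
Mar 04 15:40:50.276: vmx| MigrateStateUpdate: Transitioning from state 3 to 4.
Mar 04 15:40:50.890: vmx| MigrateStateUpdate: Transitioning from state 4 to 5.
Mar 04 15:40:50.924: vmx| MigrateStateUpdate: Transitioning from state 5 to 0.
"""

path = transitions(primary_log)
for src, dst in path:
    print(f"{src.name} -> {dst.name}")
```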

VI4 - Mod 2-3 - Slide 40

FT Troubleshooting – Primary vmkernel logs

Immediately following the FT migration you will see messages like these on the ESX hosts. Note the migration ID and the statelogger ID, which are needed to tell FT VMs apart when there are many of them:

Primary:

Mar  4 10:51:35 prme-stft053 vmkernel: 0:16:24:12.912 cpu2:4281)VMotion: 2582: 1236192688557132 S: Stopping pre-copy: only 11178 pages were modified, which can be sent within the switchover time goal of 0.500 seconds (network bandwidth ~122.213 MB/s)

Mar  4 10:51:35 prme-stft053 vmkernel: 0:16:24:12.917 cpu3:4280)VSCSI: 5850: handle 8193(vscsi0:0):Destroying Device for world 4281 (pendCom 0)

Mar  4 10:51:36 prme-stft053 vmkernel: 0:16:24:13.663 cpu7:4230)VMKStateLogger: 6856: 2316520524: accepting connection from secondary at 10.0.57.10

VI4 - Mod 2-3 - Slide 41

FT Troubleshooting – Secondary vmkernel logs

Secondary:

Mar  4 10:51:34 prme-stft057 vmkernel: 0:19:53:47.483 cpu2:4286)VMotion: 1805: 1236192688557132 D: Set ip address '192.168.57.10' worldlet affinity to recv World ID 4289

Mar  4 10:51:34 prme-stft057 vmkernel: 0:19:53:47.644 cpu7:4228)MigrateNet: vm 4228: 1096: Accepted connection from <192.168.53.10>

Mar  4 10:51:34 prme-stft057 vmkernel: 0:19:53:47.644 cpu7:4228)MigrateNet: vm 4228: 1110: dataSocket 0x4100b6092e60 send buffer size is 263536

Mar  4 10:51:35 prme-stft057 vmkernel: 0:19:53:49.427 cpu3:4289)VMotionRecv: 226: 1236192688557132 D: Estimated network bandwidth 100.872 MB/s during pre-copy

Mar  4 10:51:36 prme-stft057 vmkernel: 0:19:53:50.055 cpu7:4286)VSCSI: 3469: handle 8193(vscsi0:0):Creating Virtual Device for world 4287 (FSS handle 163860)

Mar  4 10:51:36 prme-stft057 vmkernel: 0:19:53:50.176 cpu7:4286)VMKStateLogger: 1949: 2316520524:  Connected to primary
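
With many FT VMs on a host, grouping vmkernel messages by statelogger ID and migration ID makes the primary and secondary logs easier to line up. This is a rough sketch: the regular expressions are inferred from the sample lines on these two slides only, and the function name `index_by_id` is hypothetical.

```python
import re
from collections import defaultdict

# "VMKStateLogger: <number>: <statelogger id>: <message>"
STATELOGGER = re.compile(r"VMKStateLogger: \d+: (\d+):\s+(.*)")
# "VMotion: <number>: <migration id> S|D: <message>" (S = source, D = destination)
VMOTION = re.compile(r"VMotion(?:Recv)?: \d+: (\d+) ([SD]): (.*)")


def index_by_id(lines):
    """Group messages by statelogger ID and by migration ID."""
    ft_pairs = defaultdict(list)
    migrations = defaultdict(list)
    for line in lines:
        match = STATELOGGER.search(line)
        if match:
            ft_pairs[match.group(1)].append(match.group(2))
            continue
        match = VMOTION.search(line)
        if match:
            migrations[match.group(1)].append((match.group(2), match.group(3)))
    return dict(ft_pairs), dict(migrations)


# Lines copied from the primary and secondary vmkernel samples above.
sample = [
    "Mar  4 10:51:35 prme-stft053 vmkernel: 0:16:24:12.912 cpu2:4281)VMotion: 2582: "
    "1236192688557132 S: Stopping pre-copy: only 11178 pages were modified, which can "
    "be sent within the switchover time goal of 0.500 seconds (network bandwidth ~122.213 MB/s)",
    "Mar  4 10:51:36 prme-stft053 vmkernel: 0:16:24:13.663 cpu7:4230)VMKStateLogger: 6856: "
    "2316520524: accepting connection from secondary at 10.0.57.10",
    "Mar  4 10:51:35 prme-stft057 vmkernel: 0:19:53:49.427 cpu3:4289)VMotionRecv: 226: "
    "1236192688557132 D: Estimated network bandwidth 100.872 MB/s during pre-copy",
    "Mar  4 10:51:36 prme-stft057 vmkernel: 0:19:53:50.176 cpu7:4286)VMKStateLogger: 1949: "
    "2316520524:  Connected to primary",
]

ft_pairs, migrations = index_by_id(sample)
print(ft_pairs)
print(sorted(migrations))
```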

VI4 - Mod 2-3 - Slide 42

FT Troubleshooting – vmware.log

The FT pair ID (logged from the StateLogger vmkernel module to identify the FT pair) is also found in the vmware.log file.

This is an example of a secondary whose primary died:

Mar 03 20:03:56.457: vcpu-0| StateLoggerSetEndOfLog: BCnt: 344876494570 fSz: 0 bufPos 14775374

...

Mar 03 20:03:56.464: vmx| Preparing to go live

...

Mar 03 20:03:56.503: vmx| Done going live

Mar 03 20:03:56.503: vmx| Failover initiated via vmdb

Mar 03 20:03:56.504: vmx| Gone live because of Lost connection to primary.

Mar 03 20:03:56.506: vmx| Unstunning after golive

...

Mar 03 20:04:08.199: vmx| FT saving on primary to create new backup

Mar 03 20:04:08.203: vmx| Connection accepted, ft id 607078005.
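
The timestamps in this excerpt show how quickly the secondary took over. A small sketch that measures the gap between "Preparing to go live" and "Done going live"; the parsing is an assumption based on the line layout shown here, only the time of day is read (so it does not handle a failover that spans midnight), and `go_live_ms` is a name made up for this example.

```python
import re

# "<HH>:<MM>:<SS>.<mmm>: <thread>| <message>"
LINE = re.compile(r"(\d{2}):(\d{2}):(\d{2})\.(\d{3}): [\w-]+\| (.*)")


def parse(line):
    """Return (milliseconds since midnight, message), or None if the line does not match."""
    match = LINE.search(line)
    if not match:
        return None
    hours, minutes, seconds, millis, message = match.groups()
    total = ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)
    return total, message


def go_live_ms(lines):
    """Milliseconds between 'Preparing to go live' and 'Done going live'."""
    start = end = None
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        stamp, message = parsed
        if message.startswith("Preparing to go live"):
            start = stamp
        elif message.startswith("Done going live"):
            end = stamp
    if start is None or end is None:
        return None
    return end - start


secondary_log = [
    "Mar 03 20:03:56.464: vmx| Preparing to go live",
    "Mar 03 20:03:56.503: vmx| Done going live",
    "Mar 03 20:03:56.504: vmx| Gone live because of Lost connection to primary.",
]

print(go_live_ms(secondary_log))  # 39
```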

VI4 - Mod 2-3 - Slide 43

FT Troubleshooting – Split Brain

For support purposes, the vmkernel log files will display messages similar to the following on the host running the VM that lost the race for the generation file (and thus did not go live):

Mar  4 10:52:45 prme-stft057 vmkernel: 0:19:54:58.861 cpu2:4291)VMKStateLogger: 7823: Rename of .ft-generation2 to .ft-generation3 failed: Not found

Mar  4 10:52:45 prme-stft057 vmkernel: 0:19:54:58.861 cpu2:4291)VMKStateLogger: 2792: 2316520524: Can *NOT* golive

On the host running the VM that won the race and successfully renamed the file (and did go live), you will see a corresponding message:

Mar  4 10:52:45 prme-stft053 vmkernel: 0:16:25:22.150 cpu6:4283)VMKStateLogger: 2792: 2316520524: Can golive

The other thing you'll want to note is the statelogger ID if there are multiple FT enabled VMs.
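
When vmkernel logs from both hosts are collected, the winner and loser of each race can be picked out per statelogger ID. A minimal sketch, assuming the syslog-style line layout shown on this slide; the pattern and the function name `race_outcome` are illustrative and not taken from any VMware tool.

```python
import re

GOLIVE = re.compile(
    r"(\S+) vmkernel: .*VMKStateLogger: \d+: (\d+): Can (\*NOT\* )?golive"
)


def race_outcome(lines):
    """Map each statelogger ID to the hosts that won and lost the rename race."""
    result = {}
    for line in lines:
        match = GOLIVE.search(line)
        if not match:
            continue
        host, ft_id, lost = match.groups()
        result.setdefault(ft_id, {})["loser" if lost else "winner"] = host
    return result


# Lines copied from the two hosts' vmkernel logs above.
combined = [
    "Mar  4 10:52:45 prme-stft057 vmkernel: 0:19:54:58.861 cpu2:4291)VMKStateLogger: 7823: "
    "Rename of .ft-generation2 to .ft-generation3 failed: Not found",
    "Mar  4 10:52:45 prme-stft057 vmkernel: 0:19:54:58.861 cpu2:4291)VMKStateLogger: 2792: "
    "2316520524: Can *NOT* golive",
    "Mar  4 10:52:45 prme-stft053 vmkernel: 0:16:25:22.150 cpu6:4283)VMKStateLogger: 2792: "
    "2316520524: Can golive",
]

outcome = race_outcome(combined)
print(outcome)
```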

VI4 - Mod 2-3 - Slide 44

VMware SiteSurvey Tool

We have created a new utility which analyzes a cluster of ESX hosts and tells you whether the configuration is suitable for FT.

This includes checking for FT-compatible processors, shared storage, BIOS settings, etc.

The utility is called VMware SiteSurvey and a Beta copy is available in the "Documents" tab.

To use it, download the VMware SiteSurvey executable from that page and run it, which will install the utility on your local Windows machine.

VI4 - Mod 2-3 - Slide 45

VMware SiteSurvey Tool (ctd)

VI4 - Mod 2-3 - Slide 46

Troubleshooting Fault Tolerance

When attempting to power on a virtual machine with VMware FT enabled, an error message may appear in a pop-up dialog box.

"Fault tolerance requires that Record/Replay is enabled for the virtual machine. Module Statelogger power on failed."

What is a possible root cause?

VI4 - Mod 2-3 - Slide 47

Troubleshooting Fault Tolerance (ctd)

After powering on a virtual machine with VMware FT enabled, an error message may appear in the Recent Tasks pane.

"Secondary virtual machine could not be powered on as there are no compatible hosts that can accommodate it."

What is a possible root cause?

VI4 - Mod 2-3 - Slide 48

Troubleshooting Fault Tolerance (ctd)

When selecting a VM to enable Fault Tolerance, you find that the ‘Turn on Fault Tolerance’ option is greyed out. What are the possible causes?

1. The host on which the Virtual Machine resides is not part of a VMware HA Cluster.

2. The host on which the Virtual Machine resides does not have Hardware Virtualization turned on in the BIOS for the CPUs.

3. The Virtual Machine does not support VMware Fault Tolerance. Update the virtual machine to a more recent version.

4. The Virtual Machine has snapshots. Delete any snapshots.
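
The four causes above can be walked through as a checklist. The sketch below assumes a hypothetical `VmFacts` record that is filled in by hand; it does not query vCenter or call any VMware API, and the field names are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class VmFacts:
    """Hand-collected facts about a VM and its host (hypothetical fields)."""
    host_in_ha_cluster: bool
    hv_enabled_in_bios: bool
    vm_version_supports_ft: bool
    snapshot_count: int


def blockers(facts):
    """Return the reasons, from the list above, why the option may be greyed out."""
    reasons = []
    if not facts.host_in_ha_cluster:
        reasons.append("Host is not part of a VMware HA cluster")
    if not facts.hv_enabled_in_bios:
        reasons.append("Hardware Virtualization is not turned on in the host BIOS")
    if not facts.vm_version_supports_ft:
        reasons.append("Virtual machine version does not support FT; update it")
    if facts.snapshot_count > 0:
        reasons.append("Virtual machine has snapshots; delete them")
    return reasons


example = VmFacts(
    host_in_ha_cluster=True,
    hv_enabled_in_bios=False,
    vm_version_supports_ft=True,
    snapshot_count=1,
)
for reason in blockers(example):
    print(reason)
```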

VI4 - Mod 2-3 - Slide 49

Lesson 2-3 Summary

vSphere 4.0 introduces a new concept called Fault Tolerance.

This enhances the VM availability provided by VMware HA, in that the VM experiences no downtime when a hardware failure occurs on the ESX host.

However, in this initial release, a number of restrictions are placed on the configuration of a VM that is to use FT.

VI4 - Mod 2-3 - Slide 50

Lesson 2-3 - Lab 1

Lab 1 involves creating Fault Tolerant VMs

Create a Fault Tolerant VM

Watch a Fault Tolerant VM failover to another host

Fault Tolerant VM settings