© 2011 VMware Inc. All rights reserved
Confidential
Business Continuity & Disaster Recovery in
Virtual & Cloud Environments
Liam Ferrel
Basic Outline for this session…
CAMPUS
METRO / SYNC
DISTANCE SYNC / ASYNC
Availability Design Myths
One solution fits all
One solution is cheaper
One is easier to implement
One is easier to manage: it’s just “Next > Next > Finish”
So you and this PowerPoint will tell me which solution I need?
• No
• All solutions have pros / cons
• All implementations are different
• All customer profiles are different
• Use this information to define what will work for you; you have to live and breathe it
Keep it Simple (Don’t take offence at the next 4 slides)
“Disaster” Avoidance – Host Level
“Hey… That host WILL need to go down
for maintenance. Let’s vMotion to avoid
a disaster and outage.”
This is vMotion.
Most important characteristics:
• By definition, avoidance, not recovery.
• “Non-disruptive” is massively different than “almost non-disruptive”
“Disaster” Recovery – Host Level
Hey… That host WENT down due to an unplanned
failure, causing an unplanned outage due to that
disaster. Let’s automate the RESTART of the
affected VMs on another host.
This is VMware HA.
Most important characteristics:
• By definition recovery (restart), not avoidance
• Simplicity, automation, sequencing
Disaster Avoidance – Site Level
Hey… That site WILL need to go down
for maintenance. Let’s vMotion to avoid
a disaster and outage.
This is Long Distance vMotion.
Most important characteristics:
• By definition, avoidance, not recovery.
• “Non-disruptive” is massively different than “almost non-disruptive”
Disaster Recovery – Site Level
Hey… That site WENT down due to an unplanned
failure, causing an unplanned outage due to that
disaster. Let’s automate the RESTART of the
affected VMs at another site.
This is Disaster Recovery.
Most important characteristics:
• By definition recovery (restart), not avoidance
• Simplicity, testing, split brain behavior, automation, sequencing, IP address changes
Types
Type 1: “Stretched Single vSphere Cluster”
[Diagram: one vCenter Server and a single vSphere Cluster containing Site A hosts (4 × ESXi) and Site B hosts (4 × ESXi); vMotion between the sites; Site A Datastore and Site B Datastore on Active / Active Storage]
One little note re: “Intra-Cluster” vMotion
Intra-cluster vMotions can be highly parallelized
• With vSphere 4.1 and vSphere 5: up to 4 per host / 128 per datastore if using 1GbE
• Up to 8 per host / 128 per datastore if using 10GbE
Need to meet the vMotion network requirements
• 622Mbps or more
• 5ms RTT (raised to 10ms RTT if using Metro vMotion, vSphere 5 Enterprise Plus)
• Layer 2 equivalence for vmkernel (support requirement)
• Layer 2 equivalence for VM network traffic (required)
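The network requirements above lend themselves to a quick pre-flight check. The following is a minimal sketch, not a VMware tool: the thresholds come from this slide, while the `VmotionLink` class and `vmotion_link_problems` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass

MIN_BANDWIDTH_MBPS = 622   # from the slide: 622Mbps or more
MAX_RTT_MS = 5             # from the slide: 5ms RTT
MAX_RTT_MS_METRO = 10      # from the slide: 10ms RTT with Metro vMotion


@dataclass
class VmotionLink:
    """Hypothetical description of the vMotion network between two sites."""
    bandwidth_mbps: float
    rtt_ms: float
    l2_vmkernel: bool     # Layer 2 equivalence for vmkernel traffic
    l2_vm_network: bool   # Layer 2 equivalence for VM network traffic


def vmotion_link_problems(link, metro_vmotion=False):
    """Return a list of unmet requirements; an empty list means all are met."""
    problems = []
    rtt_limit = MAX_RTT_MS_METRO if metro_vmotion else MAX_RTT_MS
    if link.bandwidth_mbps < MIN_BANDWIDTH_MBPS:
        problems.append(f"bandwidth below {MIN_BANDWIDTH_MBPS}Mbps")
    if link.rtt_ms > rtt_limit:
        problems.append(f"RTT above {rtt_limit}ms")
    if not link.l2_vmkernel:
        problems.append("no Layer 2 equivalence for vmkernel")
    if not link.l2_vm_network:
        problems.append("no Layer 2 equivalence for VM network traffic")
    return problems
```

For example, a 1000Mbps link with 8ms RTT fails the standard check on RTT alone, but passes once `metro_vmotion=True` raises the limit to 10ms.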
Type 2: “Multiple vSphere Clusters”
[Diagram: one vCenter Server and two vSphere Clusters; Site A hosts (4 × ESXi) and Site B hosts (4 × ESXi); vMotion between the clusters; Site A Datastore and Site B Datastore on Active / Active Storage]
One little note re: “Inter-Cluster” vMotion
Inter-Cluster vMotions are serialized
• Each migration involves additional calls into vCenter, which imposes a hard limit
• VMs lose their cluster properties (HA restart priority, DRS settings, etc.)
Need to meet the vMotion network requirements
• 622Mbps or more
• 5ms RTT (raised to 10ms RTT if using Metro vMotion with vSphere 5 Enterprise Plus)
• Layer 2 equivalence for vmkernel (support requirement)
• Layer 2 equivalence for VM network traffic (required)
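To make the difference between serialized inter-cluster and parallel intra-cluster vMotion concrete, here is a rough back-of-the-envelope sketch. It only counts migration "waves" and assumes every VM takes about the same time to move, which is a simplification; the per-host figures are the ones quoted for intra-cluster vMotion, and the function names are illustrative.

```python
import math

# Concurrency figures quoted in this deck for vSphere 4.1 / vSphere 5.
PER_HOST_LIMIT = {"1GbE": 4, "10GbE": 8}
PER_DATASTORE_LIMIT = 128


def migration_waves(vm_count, concurrency):
    """Number of sequential batches needed to move vm_count VMs."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return math.ceil(vm_count / concurrency)


def intra_cluster_waves(vm_count, nic="10GbE"):
    """Waves to evacuate one host within a cluster (parallel vMotion)."""
    concurrency = min(PER_HOST_LIMIT[nic], PER_DATASTORE_LIMIT)
    return migration_waves(vm_count, concurrency)


def inter_cluster_waves(vm_count):
    """Waves when migrations are serialized, i.e. one VM at a time."""
    return migration_waves(vm_count, 1)
```

Under these assumptions, evacuating a host with 40 VMs takes 5 waves on 10GbE or 10 waves on 1GbE inside a cluster, versus 40 one-at-a-time migrations between clusters.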
Stretched Cluster Considerations
Most networks lack site awareness, so stretched clusters introduce new networking challenges.
With all storage configurations:
• Moving VMs from one site to another doesn’t update the network
• VM movement causes “horseshoe routing” (LISP and other technologies help address this)
• You’ll need to use multiple isolation addresses in your VMware HA configuration
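As a sketch of what "multiple isolation addresses" can look like in practice, the helper below builds a set of HA advanced options with one address per site. The option names `das.isolationaddressN` and `das.usedefaultisolationaddress` are given as I recall them for vSphere HA, so verify them against the documentation for your version; the helper itself and the IP addresses (taken from documentation-only ranges) are purely illustrative.

```python
def ha_isolation_options(addresses, use_default_gateway=False):
    """Build HA advanced options for several isolation addresses.

    `addresses` would typically hold one pingable address per site.
    """
    if not addresses:
        raise ValueError("at least one isolation address is required")
    if len(addresses) > 10:
        raise ValueError("this sketch assumes at most 10 isolation addresses")
    options = {
        # Whether HA should also keep using the default gateway (assumed option name).
        "das.usedefaultisolationaddress": str(use_default_gateway).lower(),
    }
    for index, address in enumerate(addresses):
        options[f"das.isolationaddress{index}"] = address
    return options


# Hypothetical addresses: one reachable at Site A, one at Site B.
options = ha_isolation_options(["192.0.2.1", "198.51.100.1"])
for key, value in sorted(options.items()):
    print(f"{key} = {value}")
```

The point of one address per site is that a host can still reach an isolation address when the inter-site link fails, so it does not wrongly conclude it is isolated.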
Type 3: “Site to Site Replication & Recovery”
[Diagram: Site A with vCenter Server A, SRM Server, vSphere Cluster A (4 × ESXi hosts) and Site A Datastore; Site B with vCenter Server B, SRM Server, vSphere Cluster B (4 × ESXi hosts) and Site B Datastore]
Array-based (sync, async or continuous) replication or vSphere Replication (async)
Protection Groups
Collection of VMs that are protected together
• Grouping is enforced by storage layout (LUN or consistency group) for array-based replication (ABR); it is per VM for vSphere Replication
[Diagram: three VMFS LUNs collected into Datastore Groups, which map to Protection Groups]
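The grouping rule for array-based replication can be sketched as follows: VMs whose datastores sit in the same LUN or consistency group end up in the same protection group, whether you want that or not. This is an illustration only; the function name and the sample VM, datastore, and group names are hypothetical.

```python
from collections import defaultdict


def abr_protection_groups(vm_to_datastore, datastore_to_group):
    """Group VMs by the LUN / consistency group that holds their datastore."""
    groups = defaultdict(list)
    for vm, datastore in vm_to_datastore.items():
        groups[datastore_to_group[datastore]].append(vm)
    return {group: sorted(vms) for group, vms in groups.items()}


# Hypothetical layout: two datastores share one consistency group.
vm_to_datastore = {"db01": "ds-1", "app01": "ds-2", "web01": "ds-3"}
datastore_to_group = {"ds-1": "cg-A", "ds-2": "cg-A", "ds-3": "cg-B"}

groups = abr_protection_groups(vm_to_datastore, datastore_to_group)
for name in sorted(groups):
    print(name, groups[name])
```

With vSphere Replication there is no such constraint, because replication is selected per VM rather than dictated by the storage layout.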
Recovery Plans & Protection Groups
[Diagram: a single Recovery Plan for Groups A&B covering both Protection Group - A and Protection Group - B]
Recovery Plans
Essentially an automated “runbook” for recovery
• Consists of one or more protection groups
• Controls every step of the recovery process:
• Storage (presentation)
• Network customization (port group connection and/or address changes)
• Power-on sequencing (dependencies settable)
• Suspension of non-essential workloads
• Invocation of customer-defined pre/post power-on scripts (optional)
• The steps executed depend on the workflow selected, i.e. Planned Migration, DR, or Test Failover
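The "runbook" idea can be illustrated with a small sketch that builds an ordered step list for a chosen workflow. It is not SRM's actual step list: the storage and network differences between test and real failover follow the Test Failover and Failover slides in this deck, the sketch treats Planned Migration and DR identically (a simplification), and all names are hypothetical.

```python
from enum import Enum


class Workflow(Enum):
    PLANNED_MIGRATION = "planned migration"
    DISASTER_RECOVERY = "disaster recovery"
    TEST = "test failover"


def recovery_steps(workflow, vms, run_scripts=False):
    """Return ordered runbook steps; `vms` is already in power-on order."""
    steps = []
    if workflow is Workflow.TEST:
        # Test runs against a writable snapshot, leaving the replica untouched.
        steps.append("storage: present writable snapshot of replica")
        steps.append("network: connect VMs to test network")
    else:
        steps.append("storage: promote replica to read/write")
        steps.append("network: connect VMs to live network")
    steps.append("suspend non-essential workloads")
    for vm in vms:
        if run_scripts:
            steps.append(f"pre-power-on script: {vm}")
        steps.append(f"power on: {vm}")
        if run_scripts:
            steps.append(f"post-power-on script: {vm}")
    return steps


for step in recovery_steps(Workflow.TEST, ["db01", "app01"]):
    print(step)
```

The same plan definition yields different steps depending on the workflow, which is why a non-disruptive test exercises most of the real recovery path.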
Sequencing
[Diagram: VMs (Master Database, Database, App Server, Apache, Exchange, Mail Sync, Desktop) assigned to Group 1 through Group 5]
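Sequencing by group can be sketched as a simple flattening of groups into a power-on order. The assumption here is that the lowest-numbered group starts first; the assignment of VMs to groups below is invented for illustration, since the slide's exact layout is not recoverable from the text.

```python
def power_on_order(groups):
    """Flatten {group_number: [vm, ...]} into one list, lowest group first."""
    order = []
    for number in sorted(groups):
        order.extend(groups[number])
    return order


# Hypothetical assignment: back-end tiers start before the tiers that depend on them.
groups = {
    3: ["App Server"],
    1: ["Master Database"],
    2: ["Database"],
    5: ["Desktop"],
    4: ["Apache"],
}

print(power_on_order(groups))
```

Dependencies inside a group would need extra handling; this sketch only orders between groups.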
Test Failover – Non Disruptive
[Diagram, steps 1 to 3: Protection Group on Source Storage (VMFS LUN, R/W); Replica Storage (VMFS LUN, R/O); Snapshot of Replica Storage (VMFS SNAP, R/W); Placeholder VMs on the Test Network]
Failover
[Diagram, steps 1 to 3: Protection Group on Source Storage (VMFS LUN, R/W); Replica Storage (VMFS LUN, R/O); Replica Storage Promoted (VMFS LUN, R/W); Placeholder VMs on the Live Network]
Automated Failback
• Available for all SAN replication SRAs
• Implemented via new “Reprotect” workflow
• Resets protected state for workloads migrated or recovered
• Single button invocation
• “Flips” Protection Group and Recovery Plan states: A->B becomes B->A
• No requirement to manually recreate objects
Automated Failback (Reprotect)
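The "flip" performed by Reprotect can be pictured with a tiny model. This is only a conceptual sketch of the direction reversal described above, not SRM's API; the `ProtectionGroup` class and its field names are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ProtectionGroup:
    """Hypothetical model of a protection group's replication direction."""
    name: str
    protected_site: str
    recovery_site: str


def reprotect(group):
    """Reverse the direction: the old recovery site becomes the protected site."""
    return replace(
        group,
        protected_site=group.recovery_site,
        recovery_site=group.protected_site,
    )


before = ProtectionGroup("Protection Group - A", protected_site="A", recovery_site="B")
after = reprotect(before)
print(f"{before.protected_site}->{before.recovery_site} becomes "
      f"{after.protected_site}->{after.recovery_site}")
```

Running `reprotect` twice returns the original direction, which mirrors the idea that failback is just another recovery in the opposite direction and no objects have to be recreated by hand.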
What if we don’t have SAN replication?
Introducing vSphere Replication (VR)
VR Basics
Adding native replication to SRM
• Virtual machines can be replicated regardless of the underlying storage
• Enables replication between heterogeneous datastores
• Replication is managed as a property of a virtual machine
• Efficient replication minimizes impact on VM workloads
[Diagram: VM replicated from source to target]
When To Use Stretched vSphere Clusters?
Campus / nearby sites
• Sites within Synchronous distance
• Two buildings on a common campus
• Two datacenters within a city
Planned migration important
• Long-distance vMotion for planned maintenance, disaster avoidance, or load balancing
DR Features less critical
• No testing, orchestration, or automation
• VMware HA typically not sufficient for automation: requires scripting / manual process due to VM placement with primary / secondary arrays
• RTOs typically longer
When To Use Site Recovery Manager?
Longer-distance DR sites
• Any sites separated by >100km
• Any sites separated by <100km which could still be categorized as “DR”, where DR features are important
DR Features critical
• Non-disruptive testing
• Automated / Reliable / Repeatable / Auditable DR process
• Customizable recovery workflows
Planned migration with downtime OK
• A couple of hours of downtime is acceptable
• Planned migration is not done routinely: mostly for disaster avoidance, and infrequently for planned maintenance
Software Defined Availability: vCloud Services
Any Class of Protection
• Unified: Clustering, Disaster Recovery, Replication, Data Protection
Flexible Service Level Options
• Service level attributes: Availability, DR RTO, RPO, Storage, Performance
• Example values shown: 99.99%, 1 hour, 10 Min, High I/O, High Security, 1 TB
• Service levels: Tier 1, Tier 2, Tier 3
• Bronze: 99%, Low Performance
• Silver: 99.9%, Medium Performance
• Gold: 99.99%, High Performance
Any Application
• Traditional / Next-Gen Applications
Apps Anywhere
• Application Mobility Over Any Distance (Metro/Geo)
• Public Cloud
Any Failure Scenario
• Component Failures to Large Scale Disasters
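To put the Bronze / Silver / Gold availability percentages in perspective, the arithmetic below converts each into the downtime it allows per year. It assumes a 365-day year and counts all downtime, planned or unplanned, against the target; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent):
    """Hours of downtime per year permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for tier, percent in [("Bronze", 99.0), ("Silver", 99.9), ("Gold", 99.99)]:
    hours = allowed_downtime_hours(percent)
    print(f"{tier}: {percent}% allows about {hours:.2f} hours/year "
          f"({hours * 60:.0f} minutes)")
```

Each extra "nine" cuts the allowance by a factor of ten: roughly 87.6 hours a year at 99%, 8.76 hours at 99.9%, and under an hour at 99.99%.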
Thank you