Page 1: Disaster Recovery 2.0

© 2009 VMware Inc. All rights reserved

Disaster Recovery 2.0 A paradigm shift in DR Architecture.

Singapore, Q2 2013

Page 2: Disaster Recovery 2.0

Co-authors & Reviewers

Reviewers
• Michael White, Staff Product Integration Architect, VMware
  ca.linkedin.com/pub/michael-white/3/bb0/619
• Michael Webster, Strategic Architect, VMware, longwhiteclouds.com
  VCDX, vExpert 2012

Co-authors
• Iwan ‘e1’ Rahabok, Staff SE, Strategic Accounts, VMware
  [email protected] | sg.linkedin.com/in/e1ang
  VCAP-DCD, TOGAF Certified
• Lim Wei Chiang, SE, Strategic Accounts, VMware
  [email protected] | sg.linkedin.com/in/weichiang
  VCP, CCDP, CCNP

Page 3: Disaster Recovery 2.0

Business Requirements

Protect the Business in the event of Disaster
• DR is similar to insurance: it is no longer acceptable to run a business without DR protection.
• The question now is how to cut DR cost & complexity: people cost, technology cost, etc.

Page 4: Disaster Recovery 2.0

Disaster did strike in Singapore

29 June 2004: Electricity Supply Interruption
• More than 300,000 homes were left in the dark.
• About 30% of Singapore was affected. If both your Prod and DR datacenters were in this 30%....
• Caused by the disruption of natural gas supply from West Natuna, Indonesia. A valve at the gas receiving station operated by ConocoPhillips tripped. The natural gas supply was disrupted, causing 5 units of the combined-cycle gas turbines (CCGT) at Tuas Power Station, Power Seraya Power Station and SembCorp Cogen to trip.
• Some of the CCGTs could not switch to diesel successfully. Investigation into the incident is in progress.

Other Similar Incidents
• The first disruption in natural gas supply occurred on 5 Aug 2002, due to the tripping of a valve in the gas receiving station, which led to a power blackout.

Page 5: Disaster Recovery 2.0

Disaster Recovery (DR) >< Disaster Avoidance (DA)

DA requires that the Disaster be avoidable.
• DA implies there is time to respond to an impending Disaster. The time window must be large enough to evacuate all necessary systems.

Once avoided, for all practical purposes, there is no more disaster.
• There is no recovery required.
• There is no panic & chaos.

DA is about Preventing (no downtime). DR is about Recovering (already down).
• Two opposite contexts.

It is insufficient to have DA only.
• DA does not protect the business when Disaster strikes.

Get DR in place first, then DA.

Page 6: Disaster Recovery 2.0

DR Context: It’s a Disaster, so…

It might strike when we’re not ready.
• E.g. the IT team is having an offsite meeting, and the next flight is 8 hours away.
• Key technical personnel are not around (e.g. sick or on holiday).

We can’t assume Production is up.
• There might be nothing for us to evacuate or migrate to the DR site.
• Even if the servers are up, we might not even be able to access them (e.g. the network is down).

Even if it’s up, we can’t assume we have time to gracefully shut down or migrate.
• Shutting down multi-tier apps is complex and takes time when you have hundreds of them…

We can’t assume certain systems will not be affected.
• The DR Exercise should involve the entire datacenter.

Assume the worst, and start from that point.

Page 7: Disaster Recovery 2.0

Singapore MAS Guidelines

MAS is very clear that DR means a Disaster has happened, as there is an outage.

Clause 8.3.3 states that the Total Site should be tested. So if you are not doing an entire-DC test, you are not in compliance.

Page 8: Disaster Recovery 2.0

DR: Assumptions

A company-wide DR Solution shall assume:
• Production is down or not accessible.
  The entire datacenter, not just some systems.
• Key personnel are not available.
  Storage admin, Network admin, AD admin, VMware admin, DBA, security, Windows admin, RHEL admin, etc. Intelligence should be built into the system to eliminate reliance on human experts.
• Manual Run Books are not 100% up to date.
  Manual documents (Word, Excel, etc.) covering every step to recover the entire datacenter are prone to human error. They contain thousands of steps, written by multiple authors. Automation & virtualisation reduce this risk (see the sketch below).
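Where intelligence must be built into the system, the runbook itself can be expressed as code, so each step carries a machine-checkable success criterion instead of relying on a human expert. A minimal sketch in Python, with hypothetical step names; a real datacenter-wide solution would sit on an orchestrator such as SRM:

```python
# A minimal sketch of a runbook-as-code step; step names and checks are
# hypothetical. Each step verifies its own outcome, so recovery does not
# depend on a human reading a manually maintained document.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    name: str
    run: Callable[[], None]     # the recovery action
    verify: Callable[[], bool]  # machine-checkable success criterion

def execute(runbook: List[Step]) -> None:
    """Run every step in order; halt loudly on the first failed verification."""
    for step in runbook:
        step.run()
        if not step.verify():
            raise RuntimeError(f"Runbook halted at step: {step.name}")
        print(f"OK: {step.name}")

# Hypothetical usage: the real actions would call infrastructure APIs.
execute([
    Step("Bring up AD/DNS in the bubble network", lambda: None, lambda: True),
    Step("Mount replicated LUNs", lambda: None, lambda: True),
])
```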

Page 9: Disaster Recovery 2.0

DR Principles

To Business Users, the actual DR experience must be identical to the Dry Run they experience.
• In a panic or chaotic situation, users should deal with something they are trained on.
• This means the Dry Run has to simulate Production (without shutting down Production).

Dry Runs must be done regularly.
• This ensures:
  New employees are covered.
  Existing employees do not forget.
  The procedures are not outdated (hence incorrect or damaging).
• Annual is too long a gap, especially if many users or departments are involved.

The DR System must be a replica of the Production System.
• Testing with a system that is not identical to production renders the Dry Run invalid.
• Manually maintaining two copies of hundreds of servers, networks, storage and security settings is a classic example of an invalid Dry Run, as the DR System is not the Production system.
• System >< Datacenter. Normally, the DR DC is smaller. System here means the collection of servers, storage, network and security that make up “an application from the business point of view”.

Page 10: Disaster Recovery 2.0

Datacenter-wide DR Solution: Technical Requirements

Fully Automated
• Eliminate reliance on many key personnel.
• Eliminate outdated (hence misleading) manual runbooks.

Enable frequent Dry Runs, with zero impact on Production.
• Production must not be shut down, as this impacts the business. Once you shut down production, it is no longer a Dry Run. An Actual Run is great, but it is not practical, as the Business will not allow the entire datacenter to go down regularly just for IT to test infrastructure.
• No clashing with Production hostnames and IP addresses.
• If Production is not impacted, then users can take their time to test DR. There is no need to finish within a certain time window anymore.

Scalable to the entire datacenter
• Thousands of servers.
• Cover all aspects of infrastructure, not just server + storage. Network, Security and Backup have to be included so the entire datacenter can be failed over automatically.

Page 11: Disaster Recovery 2.0

DR 1.0 architecture (current thinking)

A typical DR 1.0 solution (at the infrastructure layer) has the following properties:

Area: Server
• The data drive (LUN) is replicated. The OS/App drive is not. So there are 2 copies: Production and DR. They have different host names and IP addresses. They can’t be the same, as an identical hostname/IP would result in a conflict, since the network spans both datacenters.
• This means the DR system is actually different from Production, even during an actual DR. Production never fails over to DR. Only the data gets mounted.
• Technically, this is not a “production recovery” solution, but a “Site 2 mounting Site 1 data” solution. IT has been telling the Business that IT is recovering Production, while what IT actually does is run a different system, and the only thing used from Production is the data.

Area: Storage
• Not integrated with the server. Practically 2 different solutions, manually run by 2 different teams, with a lot of manual coordination and unhappiness.

Area: Network
• Not aware of DR Test and Dry Run. It’s 1 network for all purposes.
• Firewall rules are manually maintained on both sides.

Page 12: Disaster Recovery 2.0

DR 1.0 architecture: Limitations

Technically, it is not even a DR solution.
• We do not recover the Production System. We merely mount Production Data on a different System. The only way for the System to be recovered is to do a SAN boot on the DR Site.
• Can’t prove to audit that DR = Production.
• Registry changes, config changes, etc. are hard to track at the OS and Application level.

Manual mapping of data drives to the associated servers on the DR site.
• Not a scalable solution, as manual updates don’t scale well to thousands of servers.

Heavy on scripting, which is not tested regularly.

DR Testing relies heavily on IT expertise.

Page 13: Disaster Recovery 2.0

DR Requirements: Summary

R01: DR copy = Production copy. Dry Run = Actual Run.
• This avoids an invalid Dry Run, where the System Under Test itself is not the same. No changes are allowed (e.g. IP address and host name), as changes mean Dry Run >< real DR.

R02: Identical User Experience.
• From the business users’ point of view, the entire Dry Run exercise must match the real/actual DR experience.

R03: No impact on Production during Dry Run.
• The DR test should not require Production to be shut down, as that would be a real failover. A real failover can’t be done frequently, as it impacts the business. The Business will resist testing, making the DR Solution risky due to rare testing.

R04: Frequent Dry Runs.
• This is only possible if Production is not affected.

R05: No reliance on human experts.
• A datacenter-wide DR needs experts from many disciplines, making it an expensive effort. The actual procedure should be simple; it should not have to recover from an error state.

R06: Scalable to the entire datacenter.
• The DR solution should scale to thousands of servers while maintaining RTO and simplicity.

Page 14: Disaster Recovery 2.0

R01: DR Copy = Production Copy

Solution: replicate System + Data, not just the data drive (LUN).
• OS, Apps, settings, firewall, load balancer, etc.

Implication of the solution:
• If the Production network is not stretched, the server will be unreachable. Changing the IP will break the Application.
• If the Production network is stretched, the IP Address and Hostname will conflict with Production. Changing the Hostname will definitely break the Application. A stretched L2 network is not a full solution. Entire-LAN isolation is the solution.

Solution: the entire Dry Run network must be isolated (a bubble network).
• No conflict with Production, as it’s actually identical. It’s a shadow of the Production LAN.
• All network services (AD, DNS, DHCP, Proxy) must exist in the Shadow Prod LAN.

Implication of the solution:
• For VMs, this is easily done via vSphere and SRM.
• Physical Servers need to be connected to the Dry Run LAN. A permanent connection simplifies this and eliminates the risk of accidental updates to production.

Page 15: Disaster Recovery 2.0

R02: Identical User Experience

[Diagram: Production desktop pools at desktop.ABC Corp.com; on-demand DR Test desktop pools at Desktop-DRTest.ABC Corp.com]

VDI is a natural companion to DR, as it makes the “front-end” experience seamless.
• Users use a Virtual Desktop as their day-to-day desktop.
• VDI enables us to DR the desktop too.

During a Dry Run
• Users connect to desktop.vmware.com for production and desktop-DR.vmware.com for the Dry Run. Having 2 desktops means the environment is completely isolated.

During an actual Disaster
• Desktop-DR.vmware.com is renamed to desktop.vmware.com, as the original desktop.vmware.com is down (affected by the same Disaster). Users connect to desktop.vmware.com, just like they do day to day, hence creating an identical experience.
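The rename in the last bullet is, in practice, a DNS change. A minimal sketch, assuming dnspython and a DNS server that accepts dynamic updates; the server address is hypothetical:

```python
# A minimal sketch of repointing desktop.vmware.com at the surviving DR
# desktop pool during an actual disaster. Assumes dnspython and a DNS
# server that accepts dynamic updates; 10.0.0.53 is hypothetical.
import dns.query
import dns.rdatatype
import dns.update

update = dns.update.Update("vmware.com")
update.delete("desktop")  # drop the record pointing at the dead Production site
update.add("desktop", 300, dns.rdatatype.CNAME, "desktop-dr.vmware.com.")
dns.query.tcp(update, "10.0.0.53")  # send the dynamic update to the DNS server
```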

Page 16: Disaster Recovery 2.0

R03: No impact on Production during Dry Run

To achieve the above, the DR Solution:
• Cannot require Production to be shut down or stopped. It must be Business as Usual.
• Must be an independent, full copy altogether, with no reliance on Production components: Network, security, AD, DNS, Load Balancer, etc.

Page 17: Disaster Recovery 2.0

R04: Frequent Dry Run

To achieve the above, the DR Solution cannot:
• Be laborious or prone to human error. A fully automated solution addresses this.
• Touch the production system or network. So it has to be an isolated environment. A Shadow Production LAN solves this.
• VMware SRM provides the automation component for VMs. Physical Machines are harder to isolate; they need physical isolation.

You should have full confidence that the Actual Fail Over will work. This can only be achieved if you can do frequent Dry Runs.

Page 18: Disaster Recovery 2.0

Solution: Isolating

[Diagram: 1 ESXi host (one physical box) with four port groups; the uplinks go to the physical switches, carrying both the main network on Site 2 and the isolated DR Test network alongside the connected Prod network.]
• Non-Prod LAN Portgroup: VLAN 10, type VM Network
• Production LAN Portgroup: VLAN 20, type VM Network
• DR Test LAN Portgroup: VLAN 30, type VM Network (the isolated DR Test network)
• ESXi Mgmt Portgroup: VLAN 40, type vmkernel Network
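As an illustration, the isolated DR Test port group above could be created programmatically. A minimal sketch, assuming pyVmomi; the vCenter address, credentials and vSwitch name are hypothetical:

```python
# A minimal sketch, assuming pyVmomi: create the "DR Test LAN" port group
# (VLAN 30) on a standard vSwitch of the first ESXi host found. The vCenter
# address, credentials and vSwitch name are hypothetical.
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; verify certificates in production
si = SmartConnect(host="vcenter.example.com", user="admin",
                  pwd="secret", sslContext=ctx)

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem], True)
host = view.view[0]  # the ESXi host from the diagram

spec = vim.host.PortGroup.Specification()
spec.name = "DR Test LAN"
spec.vlanId = 30                       # the isolated DR Test network
spec.vswitchName = "vSwitch0"          # hypothetical standard vSwitch
spec.policy = vim.host.NetworkPolicy()
host.configManager.networkSystem.AddPortGroup(portgrp=spec)
```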

Page 19: Disaster Recovery 2.0

Solution: Dealing with Physical Servers

[Diagram: Singapore (Prod Site) runs CRM-Web-Server.vmware.com (10.10.10.10), CRM-App-Server.vmware.com (10.10.10.20) and CRM-DB-Server.vmware.com (10.10.10.30). Singapore (DR Site) runs the identical hostnames and IPs inside the Shadow Production LAN, plus CRM-DB-Server-Test.vmware.com (20.20.20.30) outside it.]

Page 20: Disaster Recovery 2.0

Physical Servers: Dual boot option

Physical Servers must be dual-boot (OS):
• Normal Operation: Test/Dev environment (default boot).
• Dry Run or DR: Shadow Production network.

[Diagram: one physical server (OS drive + Data) attached to both the Shadow Production LAN (10.10.10.x) and the LAN on Datacenter 2 (20.20.20.x); 2 NICs need to be added to each server. A Jump Box VM, running on an ESXi host connected to both LANs, is the only way into the Shadow Production LAN during a Dry Run.]

Page 21: Disaster Recovery 2.0

Physical Servers: Dual partition option

[Diagram: 1 physical box with a DR Partition and a Test/Dev Partition, attached to the Shadow Production LAN (10.10.10.x) and the LAN on Datacenter 2 (20.20.20.x). The same Jump Box VM, on an ESXi host connected to both LANs, provides access to the Shadow Production LAN during a Dry Run.]

Page 22: Disaster Recovery 2.0

Production Networks

Typical Physical Network: it’s 1 network

[Diagram: three sites, Singapore (Prod Site), Singapore (DR Site) and Country X (any site), plus a Users Site. Each datacenter hosts AD/DNS, non-AD DNS, Production VMs and Production PMs, all routed together as one network.]

Users (from any country) can access any server (physical or virtual) in any country, as there is basically only 1 “network”; routing connects the various LANs. In 1 “network”, we can’t have 2 machines with the same host name or the same IP. Each LAN has its own network address, hence a change of IP address is required when moving from the Prod Site to the DR Site.

ABC Corp operates in many countries in Asia, with Singapore being the HQ. A system may consist of multiple servers from more than 1 country. DNS service for Windows is provided by MS AD. DNS service for non-Windows is provided by non-AD DNS.

Page 23: Disaster Recovery 2.0

Site 2 needs to have 2 distinct Networks

[Diagram: Site 2 hosts a DR Server on the Shadow Production LAN (10.10.10.x) and a Test/Dev Server on the Datacenter 2 LAN (20.20.20.x). A Jump Box VM, running on an ESXi host connected to both LANs, is the only way into the Shadow Production LAN during a Dry Run.]

Page 24: Disaster Recovery 2.0

Mode 1: Normal Operation or During Dry Run

[Diagram: Site 1 runs the Production LAN (10.10.10.x). Site 2 runs the Non-Prod LAN (20.20.20.x) and the Shadow Production LAN (10.10.10.x), reachable only via the Jump Box. The Users Site is on the Desktop LAN (30.30.30.x); its path to the Shadow Production LAN is blocked (marked x in the diagram).]

Page 25: Disaster Recovery 2.0

Mode 2: Partial DR

[Diagram: Partial DR. Site 1 keeps running the Production LAN (10.10.10.x); Site 2 runs the Non-Prod LAN (20.20.20.x); the Users Site is on the Desktop LAN (30.30.30.x) and connects to both sites.]

Page 26: Disaster Recovery 2.0

Summary: DR 2.0 and 1.0

Requirement by requirement:
• R01: DR 1.0 does not meet (it uses 2 copies, which are manually synced). DR 2.0 meets.
• R02: DR 1.0 does not meet (the DR system >< the Production system). DR 2.0 meets.
• R03: DR 1.0 does not meet (the Dry Run is done on another system, not the production copy). DR 2.0 meets.
• R04: DR 1.0 does not meet. DR 2.0 meets.
• R05: DR 1.0 does not meet (resource intensive: dual boot, scripts, etc.). DR 2.0 meets.
• R06: DR 1.0 does not meet. DR 2.0 meets.

DR 1.0 works for Physical Servers but does not work well in a Virtual Environment. In DR 2.0, VMs fit much better than Physical Servers, and the network must have a Shadow Production LAN.

Page 27: Disaster Recovery 2.0

DA


From the view of DR

Page 28: Disaster Recovery 2.0

DA & DR in a virtual environment

DR and DA solutions do not fit well together in vCloud Suite 5.1.
• DA requires 1 vCenter. DA needs long-distance migration, which doesn’t work across 2 vCenters.
• DR requires 2 vCenters. vCenter prevents the same VM from appearing twice in the same vCenter.

There is confusion about DR + DA.
• You cannot have DA + DR on the same “system”. You need 3 instances: 1 primary, 1 secondary for DR purposes, and 1 secondary for DA purposes.
• The next slide explains the limitations of some DA solutions for the DR use case. This is not to criticise the DA solutions, as they are good solutions for the DA use case.

Page 29: Disaster Recovery 2.0

DA Solution: Stretched Cluster (+ Long Distance vMotion)

When an actual Disaster strikes…
• We can’t assume Production is up. Hence vMotion is not a solution.
• HA will kick in and boot all VMs. Boot orders will not be honoured.

Challenge of the above solution: how do we Test?
• The DR Solution must be tested regularly, as per Requirement R04.
• The test must be identical from the user’s point of view, as per Requirement R02.
• So the test will have to be like this: cut replication, then mount the LUNs, then add the VMs into vCenter, then boot the VMs. But… we cannot mount the LUNs in the same vCenter, as they have the same signature! Even if we could, we would need to know the exact placement of each VM (which is complex). Even if we could, we cannot boot 2 copies of a VM in the same vCenter! This means the Production VMs must be down, which fails Requirement R03.

Conclusion:
A Stretched Cluster does not even qualify as a DR Solution, as it can’t be tested & it’s 100% manual.

Page 30: Disaster Recovery 2.0

DA Solution: 2 Clusters in 1 VC (+ Long Distance vMotion)

This is a variant of the Stretched Cluster.
• It fixes the risk & complexity of the Stretched Cluster, with no performance impact from uncontrolled long-distance vMotion.

When an actual Disaster strikes…
• We can’t assume Production is up. Hence vMotion is not a solution.
• HA will not even kick in, as it’s a separate cluster. In fact, the VMs will be in an error state, appearing italicised in vCenter. Can they be removed from vCenter while the host is not responding?
• We can assume vCenter will be up on the DR Site, if it’s separately protected by Heartbeat.

Challenge of the above solution: how do we Test?
• All the issues facing the Stretched Cluster apply.

Conclusion:
The 2-Cluster variant is inferior to the Stretched Cluster from a DR point of view.

Page 31: Disaster Recovery 2.0

Active/Active or Active/Passive


Which one makes sense?

Page 32: Disaster Recovery 2.0

Just what is a Software-Defined Datacenter anyway?

[Diagram: a single Virtual Datacenter spans Physical Datacenter 1 and Physical Datacenter 2. Each physical DC supplies a Physical Compute Function (Compute Vendor 1, Compute Vendor 2), a Physical Network Function (Network Vendor 1, Network Vendor 2) and a Physical Storage Function (Storage Vendor 1, Storage Vendor 2).]

Shared Nothing Architecture at every layer:
• Network: not stretched between the 2 physical DCs. Production might be 10.10.x.x; DR might be 20.20.x.x.
• Storage: no replication between the 2 physical DCs. Production might be FC; DR might be iSCSI.
• Compute: no stretched cluster between the 2 physical DCs. Each site has its own vCenter.

Page 33: Disaster Recovery 2.0

Level of Availability

• Tier 0: Active/Active at Application Layer; RPO 0 hours; RTO 0 hours.
• Tier 1: Array-based Replication; RPO 0 hours; RTO 2 hours.
• Tier 2: vSphere Replication; RPO 15 min; RTO 1 hour.
• Tier 3: No replication (Backup & Restore); RPO 1 day; RTO 8 hours.
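A worked example of using this table: given an application's RPO/RTO requirement, pick the least expensive tier that satisfies both. A minimal sketch; the tier data is taken from the table above:

```python
# Tier table from the slide: (tier, technology, RPO in minutes, RTO in minutes),
# ordered from least to most expensive.
TIERS = [
    (3, "No replication. Backup & Restore",   24 * 60, 8 * 60),
    (2, "vSphere Replication",                15,      60),
    (1, "Array-based Replication",            0,       120),
    (0, "Active/Active at Application Layer", 0,       0),
]

def pick_tier(rpo_minutes: int, rto_minutes: int):
    """Return the least expensive tier whose RPO and RTO both fit the requirement."""
    for tier, tech, rpo, rto in TIERS:
        if rpo <= rpo_minutes and rto <= rto_minutes:
            return tier, tech
    raise ValueError("no tier meets the requirement")

# An app that tolerates 30 min of data loss and 2 hours of downtime:
print(pick_tier(rpo_minutes=30, rto_minutes=120))  # -> (2, 'vSphere Replication')
```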

Page 34: Disaster Recovery 2.0

Background

An Active/Active Datacenter has many levels of definition:
• Both DCs are actively running workloads, so neither is idle. This means Site 2 can be running non-Production workloads, like Test/Dev and DR.
• Both DCs are actively running Production workloads. Building on the previous level, this means Site 2 must run Production workloads.
• Both DCs are actively running Production workloads, with cluster failover. Building on the previous level, the same App runs on both sides, but the instance on Site 2 is not serving users; it is waiting for an application-level failover. This is typically done via a geo-cluster solution.
• Both DCs are actively running Production workloads, with the application itself Active/Active. Both Apps are running, normally behind a global Load Balancer. There is no need to fail over, as each App is “complete”: it has the full data, and it does not need to tell the other App when its data is updated. No transaction-level integrity is required. This is the ideal, but most apps cannot do this, as the data cannot be split; you can only have 1 copy of the data.

In the vSphere context, this is what Active/Active vSphere means: both vSphere instances are actively running Production VMs.

Page 35: Disaster Recovery 2.0

A closer look at Active/Active

[Diagram: two designs. Active/Active: each site runs its own vCenter, with 250 Prod VMs in Prod Clusters and 500 Test/Dev VMs in T/D Clusters per site, and lots of replication traffic between the sites (Prod to Prod, T/D to T/D). Active/Passive: Site 1 runs 500 Prod VMs in Prod Clusters under one vCenter; Site 2 runs 1000 Test/Dev VMs in T/D Clusters + a DR Cluster under its own vCenter.]

Page 36: Disaster Recovery 2.0

MAS TRM Guideline

It states “near” 0, not 0.

It states “should”, not “must”.

It states “critical”, not all systems. So A/A is only for a subset. This points to an Application-level solution, not an Infrastructure-level one. We can add this capability without changing the architecture, as shown on the next slide.

Page 37: Disaster Recovery 2.0

Adding Active/Active to a mostly Active/Passive vSphere

[Diagram: before, Site 1 runs 500 Prod VMs (Prod Clusters, own vCenter) and Site 2 runs 1000 Test/Dev VMs (T/D Clusters, own vCenter). After, Site 1 runs 450 Prod VMs, and Site 2 adds 1 Cluster running the remaining 50 VMs next to its 1000 Test/Dev VMs; a Global LB at each site directs users to the Active/Active subset.]

Page 38: Disaster Recovery 2.0

© 2009 VMware Inc. All rights reserved

Thank You

The next few slides give perspective from a different solution (NAT-based). This solution does not leverage Network Virtualisation; it is based on the classical networking solution.

Page 39: Disaster Recovery 2.0

Application Analysis

SRM should be done after a Business Impact Analysis (BIA).
• The BIA will list all the apps, owners, RTO, RPO, regulatory requirements, dependencies, etc.
• Some applications are important to the business, but do not have a high DR priority. These are normally scheduled/batch apps, like Payroll and Employee Appraisal.

Group applications by Services.
• 1 Service has many VMs.
• Put them in the same Datastore Group, as they will need to fail over together.

For each app to be protected, document the dependencies.
• Upstream and downstream.
• A large multi-tier app can easily span more than 10 VMs.
• Some apps require basic services like AD and DNS to be up.

Types of “DR”
• Sometimes there is not a big disaster but a small one. Examples: the core switch is going to be out for 12 hours; the annual power cycle of the entire building. The latter happens at Suntec City, which is considered the vertical Silicon Valley of Singapore.
• Define all the Recovery Plans (see the sketch below).

Consider the time it takes the CIO to decide to trigger DR as part of the RTO/RPO. Do you have enough CPU/RAM to boot the Production VMs during a Test Run?
• Identify DR VMs that can be suspended, and add up their total Reservation.
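To show how documented dependencies turn into a recovery boot order, here is a minimal sketch using Python's standard-library topological sort. The dependency map is hypothetical, reusing the CRM tiers and AD/DNS from the earlier slides:

```python
# A minimal sketch: derive a boot order from documented dependencies.
# Each app maps to the set of services it depends on (all hypothetical,
# reusing names from the earlier CRM example).
from graphlib import TopologicalSorter

dependencies = {
    "CRM-Web-Server": {"CRM-App-Server"},
    "CRM-App-Server": {"CRM-DB-Server", "AD/DNS"},
    "CRM-DB-Server":  {"AD/DNS"},
    "AD/DNS":         set(),
}

# static_order() yields dependencies before their dependants.
boot_order = list(TopologicalSorter(dependencies).static_order())
print(boot_order)  # ['AD/DNS', 'CRM-DB-Server', 'CRM-App-Server', 'CRM-Web-Server']
```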

Page 40: Disaster Recovery 2.0

Pre-Failover

[Diagram: the User (10.30.30.30) queries the Global DNS for www.abc.com and receives Virtual IP 1 (10.10.10.10), which lives on the Prod Site Load Balancer. The HTTP GET goes to 10.10.10.10. The Load Balancer maps the VIP to the server IP (LOAD BALANCE: 10.10.10.10 => 10.20.20.31) and rewrites the source IP (SOURCE NAT: 10.30.30.30 => 10.20.20.20) before the traffic reaches the Production VMs and PMs. The DR Site Load Balancer holds VIP 2 with its own SNAT, unused.]

Page 41: Disaster Recovery 2.0

Post-Failover

[Diagram: the same DNS query for www.abc.com now returns Virtual IP 2 (192.168.10.10) on the DR Site Load Balancer. The HTTP GET goes to 192.168.10.10; the DR Site Load Balancer maps the VIP to the server IP (LOAD BALANCE: 192.168.10.10 => 10.20.20.31) and rewrites the source IP (SOURCE NAT: 10.30.30.30 => 10.20.20.20).]

Page 42: Disaster Recovery 2.0

DR Dry Run

[Diagram: the User queries www-dr-test.abc.com and the Global DNS returns Virtual IP 2 (192.168.10.10). The HTTP GET goes to 192.168.10.10, where the DR Site Load Balancer applies the same mappings (LOAD BALANCE: 192.168.10.10 => 10.20.20.31; SOURCE NAT: 10.30.30.30 => 10.20.20.20) to reach the DR Test VMs and PMs. The Production VMs and PMs keep serving www.abc.com (VIP 1) on the Prod Site, untouched.]
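The three diagrams reduce to one lookup table in the Global DNS. A minimal sketch of the resolver behaviour in each mode; the dictionary is an illustrative stand-in for the Global DNS, not a real load-balancer API:

```python
# A minimal sketch of the Global DNS behaviour across the three modes shown
# above. Hostnames and VIPs are taken from the diagrams; the dictionaries
# are illustrative stand-ins, not a real GSLB API.
PRE_FAILOVER  = {"www.abc.com": "10.10.10.10"}      # VIP 1, Prod Site
POST_FAILOVER = {"www.abc.com": "192.168.10.10"}    # VIP 2, DR Site
DRY_RUN = {
    "www.abc.com":         "10.10.10.10",           # Production untouched
    "www-dr-test.abc.com": "192.168.10.10",         # test name resolves to VIP 2
}

def resolve(dns_table: dict, hostname: str) -> str:
    """Stand-in for the Global DNS: map a hostname to the active VIP."""
    return dns_table[hostname]

assert resolve(PRE_FAILOVER, "www.abc.com") == "10.10.10.10"
assert resolve(POST_FAILOVER, "www.abc.com") == "192.168.10.10"
assert resolve(DRY_RUN, "www-dr-test.abc.com") == "192.168.10.10"
```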

Page 43: Disaster Recovery 2.0

Design Challenges…

[Diagram: a shared DR infrastructure. Customer A (192.168.10.0/23), Customer B (10.1.1.0/24) and Customer C (10.1.1.0/24), each with several OS/APP servers, arrive over MPLS and IPSEC VPN (Internet) and map onto Organizational VDC A, B and C, which share one hypervisor, compute & storage.]

Design challenges on the shared infrastructure:
• L3 & Firewall: eliminate the need for customers to re-IP their virtual servers when failing over to the DR site.
• L2 Switching: provide hardware-level isolation to avoid overlapping IP subnets (Customers B and C both use 10.1.1.0/24) in a shared switching infrastructure.
• Provide multiple VLAN segments & VLAN routing per customer (VLAN B, VLAN C) without the use of NAT.

Page 44: Disaster Recovery 2.0

vCloud 5.1 / vShield 5.1 Solution…

[Diagram: the same Customers A (192.168.10.0/24), B (10.1.1.0/24) and C (10.1.1.0/24) arrive over MPLS and IPSEC VPN (Internet VLAN), terminate on pass-thru VLANs, and land on VXLAN segments inside Organizational VDC A, B and C on one hypervisor with shared Storage.]

• Simplify L2 edge configuration by using simple pass-thru VLANs for customer WAN termination and segmentation.
• Use vShield Edge Gateway and VXLAN to provide multiple routable segments & isolation within an organizational VDC.
• Consolidate compute and network services into a single common hardware platform.