High availability and Disaster Recovery in a Multi-Site Virtual Environment using virtualization Henk Den Baes Technology Advisor Microsoft BeLux.

HA & DR with Multi-Site Clustering

• Introduction
• Networking
• Storage
• Quorum

Session Objectives

• Session Objective(s):
– Clustering is not too expensive and not that complex
– Understanding the need and benefit of multi-site clusters
– What to consider as you plan, design, and deploy your first multi-site cluster

• Clustering your Hyper-V servers is a great solution for not only high availability, but also disaster recovery

Hyper-V Virtualization Scenarios

Business Continuity

Dynamic Datacenter

Server Consolidation

Test and Dev

• Business Continuity: Resumption of full operations combining People, Processes and Platforms
• Disaster Recovery: Site-level crisis, data and IT operations resumption
• Backup and Restore: Presumes infrastructure is whole; 97% is file/small unit related
• High Availability: Presumes that the rest of the environment is active

Keeping the Business Running

[Diagram: Primary Site and Secondary Site, each with a storage array; VHDs on shared storage (CSV); Backup/Recovery]

Virtualization reduces BC costs and minimizes business downtime by:

• increasing the availability of infrastructure
• extending protection to more applications
• simplifying backups, recovery and DR testing

[Diagram: Business Continuity layers: High Availability, Disaster Recovery, Backup and Recovery; enabling technologies: Clustering, Quick/Live Migration, Backup/Recovery]

Business Continuity with Virtualization

Differences Between Single-site & Multi-site Clusters

• Single-site cluster (for example, a two node cluster on one SAN): VMs move between nodes on the same SAN and share common storage
• Multi-site cluster (WAN connectivity, SAN replication between a primary and a secondary storage array): VMs move between physical nodes on different SANs and without true shared storage between the sites

Microsoft Stretch Clustering & Storage Continuity

Multi-site stretch configurations can provide automatic failover:

• Geographically distributed clusters are extended to different physical locations
• Stretch clustering uses the same concept as local site clustering
• Storage array or third party software provides SAN data replication
• Primary site data is replicated to the secondary site
• Stretch clustering automatically fails VMs over to a geographically different site

Benefits of a Multi-Site Cluster

• Protects against loss of an entire datacenter
• Automates failover
– Reduced downtime
– Lower complexity disaster recovery plan
• Reduces administrative overhead
– Automatically synchronize application and cluster changes
– Easier to keep consistent than standalone servers

The primary reason DR solutions fail is dependence on people

DR: NMBS VDI use case - NOC

• Windows 7 master
• NOC is installed on 1 site
– DRP is costly & has to be tested yearly
– There is no automatic app sync
– Dedicated master, manual upgrades
– No persistent image, need for admin rights
• Solution:
– Remote Desktop Services
– Hyper-V R2
– Remote Desktop Connection



Network Considerations
• Network Deployment Options:
1. Stretch VLANs across sites
2. Cluster nodes can reside in different subnets

[Diagram: Site A and Site B joined by a public network (10.10.10.1, 20.20.20.1) and a redundant network (30.30.30.1, 40.40.40.1)]

Stretching the Network
• Longer distance traditionally means greater network latency
• Missed inter-node health checks can cause false failover
• Cluster inter-node heartbeating is fully configurable:
– SameSubnetDelay (default = 1 second): how often heartbeats are sent
– SameSubnetThreshold (default = 5 heartbeats): missed heartbeats before an interface is considered down
– CrossSubnetDelay (default = 1 second): how often heartbeats are sent to nodes on dissimilar subnets
– CrossSubnetThreshold (default = 5 heartbeats): missed heartbeats before an interface to nodes on dissimilar subnets is considered down
• Command Line: Cluster.exe /prop
• PowerShell (R2): Get-Cluster | fl *
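The delay and threshold settings combine into a rough failure-detection window. The following is a minimal Python sketch of that arithmetic, not cluster code: the `HeartbeatSettings` class, the helper names, and the "relaxed" values are hypothetical and only illustrate why raising the cross-subnet values can help avoid false failover on a high-latency link.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatSettings:
    delay_ms: int   # interval between heartbeats (1 second by default per the slide)
    threshold: int  # missed heartbeats before the interface is considered down


def detection_time_ms(settings: HeartbeatSettings) -> int:
    """Approximate time until an interface is considered down."""
    return settings.delay_ms * settings.threshold


def tolerates_outage(settings: HeartbeatSettings, outage_ms: int) -> bool:
    """True if a network interruption of outage_ms stays below the detection window."""
    return outage_ms < detection_time_ms(settings)


defaults = HeartbeatSettings(delay_ms=1000, threshold=5)
relaxed = HeartbeatSettings(delay_ms=2000, threshold=10)  # hypothetical tuning, not a recommendation

print(detection_time_ms(defaults))       # 5000
print(tolerates_outage(defaults, 8000))  # False
print(tolerates_outage(relaxed, 8000))   # True
```

The trade-off is that a wider window also delays detection of a real failure.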

Updating VM’s IP on Subnet Failover

• On cross-subnet failover, if guest is…
– DHCP: IP updated automatically
– Static IP: Admin needs to configure new IP; can be scripted
• Best to use DHCP in guest OS for cross-subnet failover

Client Reconnect Considerations
• Nodes in dissimilar subnets
• VM obtains new IP address
• Clients need that new IP address from DNS to reconnect

[Diagram: the VM moves from 10.10.10.111 in Site A to 20.20.20.222 in Site B; its DNS record is created, then updated, and the update propagates between DNS Server 1 and DNS Server 2 via DNS replication]

Solutions
• Solution #1: Prefer Local Failover
– Scale up for local failover for higher availability
  • No change in IP addresses for HA
  • Means not going over the WAN and is still usually preferred
– Cross-site failover for disaster recovery
• Solution #2: Stretch VLANs
– Deploying a VLAN minimizes client reconnection times
  • IP of the VM never changes
• Solution #3: Abstraction in Network Device
– Network device uses 3rd IP
– 3rd IP is the one registered in DNS & used by client



Storage in Multi-Site Clusters
• Different than local clusters:
– Multiple storage arrays, independent per site
– Nodes commonly access own site storage
– No ‘true’ shared disk visible to all nodes

Storage Considerations
• Requires data replication mechanism between sites
• Changes are made on Site A and replicated to Site B

[Diagram: Site A SAN replicating to a replica on the Site B SAN]

Synchronous Replication
• Host receives “write complete” response from the storage after the data is successfully written on both storage devices
• Sequence: Write Request → Replication to Secondary Storage → Acknowledgement → Write Complete

Asynchronous Replication
• Host receives “write complete” response from the storage after the data is successfully written to just the primary storage device, then replication to the secondary storage follows
• Sequence: Write Request → Write Complete → Replication to Secondary Storage

Synchronous vs. Asynchronous

Synchronous                                      | Asynchronous
No data loss                                     | Potential data loss on hard failures
Requires high bandwidth/low latency connection   | Enough bandwidth to keep up with data replication
Stretches over shorter distances                 | Stretches over longer distances
Write latencies impact application performance   | No significant impact on application performance
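The trade-off between the two modes can be put into rough numbers. This Python sketch is a simplified model under stated assumptions (writes and replication happen strictly in sequence, rates are constant); the function names and sample figures are hypothetical and do not describe any vendor's array.

```python
def sync_write_latency_ms(local_write_ms: float, round_trip_ms: float,
                          remote_write_ms: float) -> float:
    """Synchronous: the host waits for both writes plus the inter-site round trip."""
    return local_write_ms + round_trip_ms + remote_write_ms


def async_write_latency_ms(local_write_ms: float) -> float:
    """Asynchronous: the host only waits for the primary storage write."""
    return local_write_ms


def async_backlog_mb(write_rate_mb_s: float, link_rate_mb_s: float,
                     seconds: float) -> float:
    """Data written but not yet replicated, i.e. what a hard failure could lose."""
    return max(0.0, (write_rate_mb_s - link_rate_mb_s) * seconds)


# Assumed figures: 1 ms array writes, 10 ms round trip between sites.
print(sync_write_latency_ms(1.0, 10.0, 1.0))  # 12.0
print(async_write_latency_ms(1.0))            # 1.0

# Assumed burst: 50 MB/s of writes over a 40 MB/s link for 60 seconds.
print(async_backlog_mb(50.0, 40.0, 60.0))     # 600.0
```

In this model the synchronous penalty grows with distance (round trip time), while the asynchronous risk grows whenever the link cannot keep up with the write rate.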

Hardware Replication Partners
• Hardware storage-based replication
– EMC Cluster Enabler: SRDF/CE for DMX arrays; RecoverPoint/CE for Clariion arrays
– HP Cluster Extension: HP StorageWorks CLX; HP LeftHand
– NetApp: MetroCluster
– IBM: IBM XIV Storage System
– Compellent: LiveVolume
– HDS: Hitachi Storage Cluster (HSC)

Software Replication Partners
• Software host-based replication
– Double-Take Availability
– SteelEye DataKeeper Cluster Edition
– Symantec Storage Foundation for Windows

Storage Virtualization Abstraction
• Some replication solutions provide complete abstraction in storage array
• Servers are unaware of accessible disk location
• Fully compatible with Cluster Shared Volumes (CSV)

[Diagram: virtualized storage presents a logical LUN across Site A and Site B; servers are abstracted from storage]

Focus on Double-Take for Hyper-V

• Product Features
– Host level filter driver replication
– Simplified management
  • Auto discovery, guest level policies, & guest protection schema
  • Not a file level protection product (block based)
  • One click failover and failover management
– WAN support (bandwidth throttling, compression…)
– Integration with SCOM and SCVMM
  • All managed via one familiar console
• Licensed per Hyper-V Host
– Unlimited number of VMs

Basic Double-Take Configuration / How Double-Take Replication Works

[Diagram: two servers, each with Applications, Operating System, File System and Hardware Layer; the Double-Take Filter sits in one stack, and replication follows an initial mirror of data across any IP network]

• Capabilities: WAN optimized, with three levels of data compression and scheduled bandwidth limiting

Host-Level Protection for Hyper-V

[Diagram: two Hyper-V hosts, each holding a set of VHDs]

Double-Take GeoCluster
• Integrates with Microsoft Failover Clustering
• Uses Double-Take Patented Replication
• Extends Clusters Across Geographical Distances
• Eliminates Single Point of Disk Failure

How Double-Take GeoCluster Works
• GeoCluster nodes use separate disks, kept synchronized by real-time replication
• Only the active node accesses its disks
• Data is replicated to all passive nodes
• At failover, the new active node resumes with current, replicated data

GeoCluster for Hyper-V Workloads

• Product Features
– Provides redundancy of storage
– Allows cluster nodes to be geographically distributed
– Utilizes GeoCluster technology to extend Hyper-V clustering across virtual hosts without the use of shared disk
– Replicates cluster data to a secondary node, eliminating single point of failure
– Allows manual and automatic moves of cluster resources between virtual hosts

CSV with Replicated Storage
• Traditional architectural assumptions may collide…
– Traditional replication solutions typically assume only 1 array accessed at a time
– Cluster Shared Volumes assumes all nodes can concurrently access a LUN
• Talk to your storage vendor for their support story

[Diagram: the VHD on Site A storage is Read/Write, the replica on Site B is Read/Only; a VM attempts to access the replica]



Quorum Overview
• Majority is greater than 50%
• Possible Voters: Nodes (1 each) + 1 Witness (Disk or File Share)
• 4 Quorum Types:
– Disk only (not recommended)
– Node and Disk majority
– Node majority
– Node and File Share majority
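The "greater than 50%" rule can be checked with simple vote counting. This is a minimal Python sketch of the arithmetic only; the function names are hypothetical and it does not call any cluster API.

```python
def total_votes(nodes: int, has_witness: bool) -> int:
    """Each node has one vote; a disk or file share witness adds one more."""
    return nodes + (1 if has_witness else 0)


def votes_required(nodes: int, has_witness: bool) -> int:
    """Smallest number of votes that is strictly greater than 50%."""
    return total_votes(nodes, has_witness) // 2 + 1


def has_quorum(reachable_votes: int, nodes: int, has_witness: bool) -> bool:
    return reachable_votes >= votes_required(nodes, has_witness)


print(votes_required(5, has_witness=False))  # 3
print(votes_required(4, has_witness=True))   # 3
print(has_quorum(2, 4, has_witness=False))   # False
```

The last line shows why an even node count without a witness is fragile: a 2/2 split leaves neither half with a majority.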

Replicated Disk Witness
• A witness is a tie breaker when nodes lose network connectivity
– When a witness is not a single decision maker, problems occur
• Do not use in multi-site clusters unless directed by vendor

[Diagram: node votes plus a questionable (“?”) witness vote on replicated storage]

Node Majority
• 5 Node Cluster: Majority = 3, with the majority in the primary site
• Cross-site network connectivity broken!
– Primary site nodes: “Can I communicate with majority of the nodes in the cluster?” Yes, then stay up
– Other site nodes: “Can I communicate with majority of the nodes in the cluster?” No, drop out of cluster membership

Node Majority
• 5 Node Cluster: Majority = 3, with the majority in the primary site
• Disaster at Site 1
– Site A: We are down!
– Site B: “Can I communicate with majority of the nodes in the cluster?” No, drop out of cluster membership
– Need to force quorum manually
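Both Node Majority scenarios follow from the same rule: a site that can only see its own nodes keeps quorum only if it holds a majority by itself. The Python sketch below models that under the assumption of a complete cross-site partition; `sites_with_quorum` and the 3 + 2 node layout are illustrative, not taken from a real cluster.

```python
from typing import Dict, Iterable


def sites_with_quorum(nodes_per_site: Dict[str, int],
                      failed_sites: Iterable[str] = ()) -> Dict[str, bool]:
    """Node Majority with every cross-site link broken.

    Each surviving site can only count its own nodes, so it keeps quorum
    only if it alone holds a strict majority of all cluster nodes.
    """
    failed = set(failed_sites)
    total_nodes = sum(nodes_per_site.values())
    majority = total_nodes // 2 + 1
    return {
        site: site not in failed and count >= majority
        for site, count in nodes_per_site.items()
    }


cluster = {"Site A": 3, "Site B": 2}  # 5 nodes, majority = 3 (assumed layout)

# Cross-site link broken: the primary site stays up, the other drops out.
print(sites_with_quorum(cluster))  # {'Site A': True, 'Site B': False}

# Primary site lost: no site has quorum, so quorum must be forced manually.
print(sites_with_quorum(cluster, failed_sites=["Site A"]))  # {'Site A': False, 'Site B': False}
```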

Forcing Quorum
• Forcing quorum is a way to manually override and start a node even though it has not achieved quorum
– Always understand why quorum was lost
– Used to bring cluster online without quorum
– Cluster starts in a special “forced” state
– Once majority achieved, drops out of “forced” state
• Command Line:
– net start clussvc /fixquorum (or /fq)
• PowerShell (R2):
– Start-ClusterNode -FixQuorum (or -fq)

Multi-Site with File Share Witness

SCENARIO: Complete resiliency and automatic recovery from the loss of any 1 site

[Diagram: Site A and Site B connected over the WAN, with the File Share Witness (\\Foo\Share) hosted at Site C (branch office)]

Multi-Site with File Share Witness

SCENARIO: Complete resiliency and automatic recovery from the loss of connection between sites

• One site: “Can I communicate with majority of the nodes (+FSW) in the cluster?” Yes, then stay up
• The other site: “Can I communicate with majority of the nodes in the cluster?” No (lock failed), drop out of cluster membership

[Diagram: Site A and Site B have lost their direct connection; the File Share Witness (\\Foo\Share) sits at Site C (branch office) across the WAN]
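The tie-breaking role of the witness can be sketched with the same vote counting. This Python model assumes two sites that cannot see each other and a witness whose single vote goes to whichever site obtains the lock; `quorum_with_fsw` and the 2 + 2 node layout are hypothetical illustrations.

```python
from typing import Dict, Optional


def quorum_with_fsw(nodes_a: int, nodes_b: int,
                    lock_holder: Optional[str]) -> Dict[str, bool]:
    """Sites A and B cannot see each other; lock_holder won the witness lock.

    lock_holder is "A", "B", or None when no site can reach the witness.
    """
    total_votes = nodes_a + nodes_b + 1  # one vote per node plus the witness
    majority = total_votes // 2 + 1
    votes = {"A": nodes_a, "B": nodes_b}
    if lock_holder in votes:
        votes[lock_holder] += 1          # the witness vote goes to the lock holder
    return {site: count >= majority for site, count in votes.items()}


# Link between sites lost, Site A locks the witness first.
print(quorum_with_fsw(2, 2, lock_holder="A"))   # {'A': True, 'B': False}

# Witness unreachable as well: neither half reaches 3 of 5 votes.
print(quorum_with_fsw(2, 2, lock_holder=None))  # {'A': False, 'B': False}
```

The same call with `lock_holder="B"` covers the loss of Site A as a whole: Site B takes the lock and stays up without manual intervention.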

File Share Witness (FSW) Considerations

• Simple Windows File Server
• Single file server can serve as a witness for multiple clusters
– Each cluster requires its own share
– Can be made highly available on a separate cluster
• Recommended to be at 3rd separate site to enable automatic site failover
• FSW cannot be on a node in the same cluster
• FSW should not be in a VM running on the same cluster

Quorum Model Recap
• Node and File Share Majority: Even number of nodes; best availability solution, FSW in 3rd site
• Node Majority: Odd number of nodes; more nodes in primary site
• Node and Disk Majority: Use as directed by vendor
• No Majority: Disk Only: Not recommended; use as directed by vendor
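The first two rows of the recap amount to a simple rule of thumb based on node count. The Python sketch below encodes only that rule; `recommend_quorum_model` is a hypothetical helper, and the disk-based models are left out because the recap says to use them only as directed by the storage vendor.

```python
def recommend_quorum_model(node_count: int) -> str:
    """Rule of thumb from the recap: even node count -> add a file share witness."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    if node_count % 2 == 0:
        # An even count can split evenly; a witness in a 3rd site breaks the tie.
        return "Node and File Share Majority"
    # An odd count already has a tie-free majority; keep more nodes in the primary site.
    return "Node Majority"


print(recommend_quorum_model(4))  # Node and File Share Majority
print(recommend_quorum_model(5))  # Node Majority
```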

Datacenter Recovery Partners

• Citrix Essentials for Hyper-V augments Hyper-V DR by automating disaster recovery configuration
– StorageLink Site Recovery manages storage automation
– Workflow orchestration for VM site failover
– Non-disruptive testing & staging of VM prior to failover
– Single click failback
– Recovery plans
– Integrates with SCVMM
– Plus more…

Microsoft Site Recovery Solution Stack: End to end Disaster Recovery

Storage and Data Availability

Server and Application Availability

• Hyper-V
• Clustering
• Quick and Live Migration

• Synchronous & Asynchronous Replication

• Array state and application restart

• Workflow automation

• DR Run-book

• Simplified configuration & testing

Management

Storage Partner Data Replication

Automation

• Physical and Virtual

• Performance and Resource Optimization

Microsoft Private Cloud – Server Platform

Simplify with integrated physical, virtual and cloud management

Improve agility with private cloud computing infrastructure

Optimize service delivery across datacenter infrastructure and business critical services

Lower costs through automation

“We don’t have to manage our infrastructure with multiple tools…we have one central monitoring and management console from which we can care for every aspect of our environment” - Doug Miller, Practice Architect, Microsoft Practice Group, CDW

Design, Configure & Deploy

Data Protection & Recovery

Virtualize, Deploy & Manage

Monitor & Manage Service End to End

IT Service Management


© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.

The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
