High Availability and Disaster Recovery in a Multi-Site Virtual Environment Using Virtualization
Henk Den Baes, Technology Advisor, Microsoft BeLux
Mar 28, 2015
HA & DR with Multi-Site Clustering
Introduction | Networking | Storage | Quorum
Session Objectives
• Session Objective(s):
– Clustering is not too expensive and not that complex
– Understanding the need and benefit of multi-site clusters
– What to consider as you plan, design, and deploy your first multi-site cluster
• Clustering your Hyper-V servers is a great solution for not only high availability, but also disaster recovery
Hyper-V Virtualization Scenarios
Business Continuity
Dynamic Datacenter
Server Consolidation
Test and Dev
Business Continuity: Resumption of full operations, combining people, processes and platforms
Disaster Recovery: Site-level crisis, data and IT operations resumption
Backup and Restore: Presumes infrastructure is whole; 97% is file/small unit related
High Availability: Presumes that the rest of the environment is active
Keeping the Business Running
[Diagram: primary site and secondary site, each with a storage array; VHDs on shared storage (CSV); backup/recovery]
Virtualization reduces BC costs and minimizes business downtime by:
• increasing the availability of infrastructure
• extending protection to more applications
• simplifying backups, recovery and DR testing
[Diagram: Business Continuity spans High Availability (clustering, Quick/Live Migration), Disaster Recovery, and Backup and Recovery]
Business Continuity with Virtualization
[Diagram: a two-node single-site cluster attached to one SAN storage array, next to a multi-site cluster whose primary and secondary sites each have their own SAN storage array and are linked by WAN connectivity]
Differences Between Single-site & Multi-site Clusters
Single-site cluster: VMs move between nodes on the same SAN and share common storage
Multi-site cluster: VMs move between physical nodes on different SANs, without true shared storage between the sites
Multi-site stretch configurations can provide automatic failover
[Diagram: primary site and secondary site, each with its own storage array, connected by SAN replication]
Geographically distributed clusters are extended to different physical locations
Stretch clustering uses the same concept as local site clustering
Storage array or third party software provides SAN data replication
Stretch clustering automatically fails VMs over to a geographically different site
Primary site data is replicated to the secondary site, which holds the replicated data from site A
Microsoft Stretch Clustering & Storage Continuity
Benefits of a Multi-Site Cluster
• Protects against loss of an entire datacenter
• Automates failover
– Reduced downtime
– Lower complexity disaster recovery plan
• Reduces administrative overhead
– Automatically synchronize application and cluster changes
– Easier to keep consistent than standalone servers
The primary reason DR solutions fail is dependence on people
DR: NMBS VDI use case - NOC
• Windows 7 master
• NOC is installed on 1 site
– DRP is costly & has to be tested yearly
– There is no automatic app. sync
– Dedicated master, manual upgrades
– No persistent image, need for admin rights
• Solution:
– Remote Desktop Services
– Hyper-V R2
– Remote Desktop Connection
HA & DR with Multi-Site Clustering: Networking
Network Considerations
• Network Deployment Options:
1. Stretch VLANs across sites
2. Cluster nodes can reside in different subnets
[Diagram: Site A and Site B nodes joined by a public network (addresses 10.10.10.1 and 20.20.20.1) and a redundant network (addresses 30.30.30.1 and 40.40.40.1)]
Stretching the Network
• Longer distance traditionally means greater network latency
• Missed inter-node health checks can cause false failover
• Cluster inter-node heartbeating is fully configurable
• SameSubnetDelay (default = 1 second)
– Frequency at which heartbeats are sent
• SameSubnetThreshold (default = 5 heartbeats)
– Missed heartbeats before an interface is considered down
• CrossSubnetDelay (default = 1 second)
– Frequency at which heartbeats are sent to nodes on dissimilar subnets
• CrossSubnetThreshold (default = 5 heartbeats)
– Missed heartbeats to nodes on dissimilar subnets before an interface is considered down
• Command Line: Cluster.exe /prop
• PowerShell (R2): Get-Cluster | fl *
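As a rough illustration of how the delay and threshold settings interact, the Python sketch below multiplies the heartbeat interval by the number of tolerated misses to estimate how long a link can be silent before an interface is considered down. This is a simplified model, not the cluster service's actual algorithm; the class name `HeartbeatSettings` and the "relaxed" values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatSettings:
    """Simplified model of one delay/threshold pair (hypothetical helper)."""
    delay_seconds: float  # how often a heartbeat is sent
    threshold: int        # missed heartbeats before the interface is considered down

    def detection_time(self) -> float:
        # Approximate silence needed before the interface is declared down.
        return self.delay_seconds * self.threshold

    def tolerates_outage(self, outage_seconds: float) -> bool:
        # True if a network blip of this length would not trigger a failover.
        return outage_seconds < self.detection_time()


# Defaults quoted on the slide: 1 second delay, 5 missed heartbeats.
DEFAULT = HeartbeatSettings(delay_seconds=1.0, threshold=5)

# Hypothetical relaxed values for a high-latency cross-site link (assumption).
RELAXED = HeartbeatSettings(delay_seconds=2.0, threshold=10)

print(DEFAULT.detection_time())       # 5.0
print(DEFAULT.tolerates_outage(6.0))  # False
print(RELAXED.tolerates_outage(6.0))  # True
```

Under this model, raising either value makes the cluster more tolerant of WAN hiccups, at the cost of slower detection of a real failure.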
Updating VM’s IP on Subnet Failover
• On cross-subnet failover, if guest is…
– DHCP: IP updated automatically
– Static IP: Admin needs to configure new IP; can be scripted
• Best to use DHCP in guest OS for cross-subnet failover
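For the static IP case, the "can be scripted" step might start from a lookup like the Python sketch below, which picks the guest address that belongs to whichever site subnet the VM landed on. The /24 masks, the per-site plan and the function name `guest_ip_for_host` are assumptions made for illustration; the addresses reuse the example values from the slides. Applying the address inside the guest is left out.

```python
import ipaddress

# Hypothetical per-site plan: subnet of the hosting node -> static IP for the guest.
# The /24 prefix lengths are assumed; the slides only show individual addresses.
SITE_PLAN = {
    ipaddress.ip_network("10.10.10.0/24"): ipaddress.ip_address("10.10.10.111"),
    ipaddress.ip_network("20.20.20.0/24"): ipaddress.ip_address("20.20.20.222"),
}


def guest_ip_for_host(host_ip: str) -> ipaddress.IPv4Address:
    """Return the guest's static IP for the site whose subnet contains host_ip."""
    host = ipaddress.ip_address(host_ip)
    for subnet, guest_ip in SITE_PLAN.items():
        if host in subnet:
            return guest_ip
    raise LookupError(f"No site subnet contains host address {host_ip}")


print(guest_ip_for_host("10.10.10.1"))  # 10.10.10.111
print(guest_ip_for_host("20.20.20.1"))  # 20.20.20.222
```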
Client Reconnect Considerations
• Nodes in dissimilar subnets
• VM obtains new IP address
• Clients need that new IP address from DNS to reconnect
[Diagram: the VM moves from 10.10.10.111 at Site A to 20.20.20.222 at Site B; the DNS record is created, then updated to VM = 20.20.20.222, and the update reaches the other DNS server through DNS replication]
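A back-of-the-envelope way to reason about client reconnect time is sketched below in Python. It assumes a simplified worst case: a client resolves the VM name from its local DNS server just before the updated record arrives by replication, and then caches the stale address for the record's full TTL. The function name and the example values are hypothetical; real behaviour also depends on client-side caching and resolver configuration.

```python
def worst_case_reconnect_seconds(dns_replication_seconds: float,
                                 record_ttl_seconds: float) -> float:
    """Upper bound on how long a client may keep using the old IP address.

    Simplified model: the stale record can be served until replication
    completes, and a client that cached it then holds it for one full TTL.
    """
    if dns_replication_seconds < 0 or record_ttl_seconds < 0:
        raise ValueError("durations must be non-negative")
    return dns_replication_seconds + record_ttl_seconds


# Hypothetical values: 15 minutes of replication delay, 20 minute record TTL.
print(worst_case_reconnect_seconds(900, 1200))  # 2100
```

Under this model, lowering the record TTL or shortening DNS replication latency both reduce the worst case, which is why keeping the VM's IP unchanged avoids the problem entirely.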
Solutions
• Solution #1: Prefer Local Failover
– Scale up for local failover for higher availability
  • No change in IP addresses for HA
  • Means not going over the WAN and is still usually preferred
– Cross-site failover for disaster recovery
• Solution #2: Stretch VLANs
– Deploying a VLAN minimizes client reconnection times
  • IP of the VM never changes
• Solution #3: Abstraction in Network Device
– Network device uses 3rd IP
– 3rd IP is the one registered in DNS & used by client
HA & DR with Multi-Site Clustering: Storage
Storage in Multi-Site Clusters
• Different than local clusters:
– Multiple storage arrays, independent per site
– Nodes commonly access own site storage
– No ‘true’ shared disk visible to all nodes
[Diagram: Site A and Site B, each with its own SAN]
Storage Considerations
Changes are made on Site A and replicated to Site B
Requires data replication mechanism between sites
[Diagram: the SAN at Site A replicates to a replica at Site B]
Synchronous Replication
• Host receives “write complete” response from the storage after the data is successfully written on both storage devices
[Diagram: write request to primary storage, replication to secondary storage, acknowledgement back to primary storage, then write complete to the host]
Asynchronous Replication
• Host receives “write complete” response from the storage after the data is successfully written to just the primary storage device, then replication takes place
[Diagram: write request to primary storage, write complete to the host, then replication to secondary storage]
Synchronous vs. Asynchronous
Synchronous | Asynchronous
No data loss | Potential data loss on hard failures
Requires high bandwidth/low latency connection | Enough bandwidth to keep up with data replication
Stretches over shorter distances | Stretches over longer distances
Write latencies impact application performance | No significant impact on application performance
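The first row of the comparison (no data loss versus potential data loss on hard failures) can be illustrated with a toy model. The Python sketch below is not how any storage array is implemented; the class `ReplicatedVolume` and its methods are hypothetical. It only shows that a synchronous write reaches both copies before it is acknowledged, while an asynchronous write is acknowledged first and replicated later, so anything still queued is lost if the primary fails hard.

```python
from collections import deque


class ReplicatedVolume:
    """Toy model of a primary/secondary storage pair (hypothetical)."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.primary: list[str] = []
        self.secondary: list[str] = []
        self._pending: deque[str] = deque()  # acknowledged but not yet replicated

    def write(self, block: str) -> str:
        self.primary.append(block)
        if self.synchronous:
            # Replicate before acknowledging the host.
            self.secondary.append(block)
        else:
            # Acknowledge now, replicate in the background later.
            self._pending.append(block)
        return "write complete"

    def replicate_pending(self) -> None:
        """Drain the background replication queue (asynchronous mode)."""
        while self._pending:
            self.secondary.append(self._pending.popleft())

    def lost_on_primary_failure(self) -> list[str]:
        """Blocks the secondary would be missing if the primary failed now."""
        return list(self._pending)


sync_vol = ReplicatedVolume(synchronous=True)
async_vol = ReplicatedVolume(synchronous=False)
for vol in (sync_vol, async_vol):
    vol.write("block-1")
    vol.write("block-2")

print(sync_vol.lost_on_primary_failure())   # []
print(async_vol.lost_on_primary_failure())  # ['block-1', 'block-2']
async_vol.replicate_pending()
print(async_vol.lost_on_primary_failure())  # []
```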
Hardware Replication Partners
• Hardware storage-based replication
• EMC Cluster Enabler
– SRDF/CE for DMX arrays
– RecoverPoint/CE for Clariion arrays
• HP Cluster Extension
– HP StorageWorks CLX
– HP LeftHand
• NetApp
– MetroCluster
• IBM
– IBM XIV Storage System
• Compellent
– LiveVolume
• HDS
– Hitachi Storage Cluster (HSC)
Software Replication Partners
• Software host-based replication
• Double-Take Availability
• SteelEye DataKeeper Cluster Edition
• Symantec Storage Foundation for Windows
Storage Virtualization Abstraction
• Some replication solutions provide complete abstraction in storage array
• Servers are unaware of accessible disk location
• Fully compatible with Cluster Shared Volumes (CSV)
[Diagram: virtualized storage presents a logical LUN to Site A and Site B; servers are abstracted from storage]
Focus on Double-Take for Hyper-V
• Product Features
– Host level filter driver replication
– Simplified management
  • Auto discovery, guest level policies, & guest protection schema
  • Not a file level protection product (block based)
  • One click failover and failover management
– WAN support (bandwidth throttling, compression…)
– Integration with SCOM and SCVMM
  • All managed via one familiar console
• Licensed per Hyper-V Host
– Unlimited number of VMs
Basic Double-Take Configuration
How Double-Take Replication Works
[Diagram: two server stacks (applications, operating system, file system, hardware layer) with the Double-Take filter; after an initial mirror of data, changes are replicated over any IP network. WAN optimized: three levels of data compression and scheduled bandwidth limiting capabilities]
Host-Level Protection for Hyper-V
[Diagram: VHDs replicated from one Hyper-V host to another Hyper-V host]
• Integrates with Microsoft Failover Clustering
• Uses Double-Take Patented Replication
• Extends Clusters Across Geographical Distances
• Eliminates Single Point of Disk Failure
Double-Take GeoCluster
How Double-Take GeoCluster Works
GeoCluster nodes use separate disks, kept synchronized by real-time replication
Only the active node accesses its disks
Data is replicated to all passive nodes
At failover, the new active node resumes with current, replicated data
GeoCluster for Hyper-V Workloads
• Product Features
– Provides redundancy of storage
– Allows cluster nodes to be geographically distributed
– Utilizes GeoCluster technology to extend Hyper-V clustering across virtual hosts without the use of shared disk
– Replicates cluster data to a secondary node, eliminating single point of failure
– Allows manual and automatic moves of cluster resources between virtual hosts
CSV with Replicated Storage
[Diagram: the VHD is read/write at one site and read-only at the other; a VM attempts to access the replica]
• Traditional architectural assumptions may collide…
– Traditional replication solutions typically assume only 1 array accessed at a time
– Cluster Shared Volumes assumes all nodes can concurrently access a LUN
• Talk to your storage vendor for their support story
HA & DR with Multi-Site Clustering: Quorum
Quorum Overview
• Majority is greater than 50%
• Possible Voters: Nodes (1 each) + 1 Witness (Disk or File Share)
• 4 Quorum Types:
– Disk only (not recommended)
– Node and Disk majority
– Node majority
– Node and File Share majority
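The voting rule above (each node has one vote, an optional witness adds one, and quorum needs more than 50% of the votes) can be written as a short Python sketch. The function names are hypothetical and the code models only the arithmetic, not the cluster service.

```python
def total_votes(node_count: int, has_witness: bool) -> int:
    """Each node contributes one vote; a disk or file share witness adds one."""
    return node_count + (1 if has_witness else 0)


def has_quorum(reachable_votes: int, votes_in_cluster: int) -> bool:
    """Quorum requires a strict majority: more than 50% of all votes."""
    return 2 * reachable_votes > votes_in_cluster


# 5 nodes, no witness: 3 votes are a majority, 2 are not.
print(has_quorum(3, total_votes(5, has_witness=False)))  # True
print(has_quorum(2, total_votes(5, has_witness=False)))  # False

# 4 nodes, no witness: an even 2/2 split leaves neither side with quorum.
print(has_quorum(2, total_votes(4, has_witness=False)))  # False

# 4 nodes plus a witness: 2 nodes that also hold the witness vote have 3 of 5.
print(has_quorum(3, total_votes(4, has_witness=True)))   # True
```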
Replicated Disk Witness
• A witness is a tie breaker when nodes lose network connectivity
– When a witness is not a single decision maker, problems occur
• Do not use in multi-site clusters unless directed by vendor
[Diagram: with replicated storage, each site sees its own copy of the witness disk, so it is unclear which side holds the witness vote]
Node Majority
5 Node Cluster: Majority = 3
Majority in Primary Site
Cross-site network connectivity broken!
• Nodes on the majority side: Can I communicate with a majority of the nodes in the cluster? Yes, then stay up
• Nodes on the minority side: Can I communicate with a majority of the nodes in the cluster? No, drop out of cluster membership
[Diagram: Site A and Site B with the cross-site network link broken]
Node Majority
Disaster at Site 1
5 Node Cluster: Majority = 3
Majority in Primary Site
• Site A: We are down!
• Site B: Can I communicate with a majority of the nodes in the cluster? No, drop out of cluster membership
Need to force quorum manually
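The two Node Majority scenarios can be replayed with a small Python sketch. It assumes a hypothetical 3/2 split of the five nodes between the sites (the slides state only that the majority sits in the primary site) and uses made-up node names; it checks which groups of nodes that can still talk to each other hold a strict majority.

```python
def surviving_partitions(partitions: list[list[str]], total_votes: int) -> list[list[str]]:
    """Return the partitions (groups of mutually reachable nodes) with a strict majority."""
    return [group for group in partitions if 2 * len(group) > total_votes]


# Hypothetical layout: 3 nodes in the primary site, 2 in the secondary site.
SITE_A = ["A1", "A2", "A3"]
SITE_B = ["B1", "B2"]
TOTAL_VOTES = len(SITE_A) + len(SITE_B)  # 5, so majority = 3

# Scenario 1: the cross-site link breaks; each site only sees its own nodes.
link_broken = surviving_partitions([SITE_A, SITE_B], TOTAL_VOTES)
print(link_broken)  # [['A1', 'A2', 'A3']]

# Scenario 2: the primary site is lost entirely; only Site B nodes remain.
primary_lost = surviving_partitions([SITE_B], TOTAL_VOTES)
print(primary_lost)  # []
```

An empty result in the second scenario corresponds to the slide's conclusion: the surviving site cannot reach a majority on its own, so quorum has to be forced manually.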
Forcing Quorum
• Forcing quorum is a way to manually override and start a node even though it has not achieved quorum
– Always understand why quorum was lost
– Used to bring cluster online without quorum
– Cluster starts in a special “forced” state
– Once majority achieved, drops out of “forced” state
• Command Line:
– net start clussvc /fixquorum (or /fq)
• PowerShell (R2):
– Start-ClusterNode -FixQuorum (or -fq)
Multi-Site with File Share Witness
SCENARIO: Complete resiliency and automatic recovery from the loss of any 1 site
[Diagram: Site A and Site B connected over the WAN; the File Share Witness \\Foo\Share is hosted at Site C (branch office)]
Multi-Site with File Share Witness
SCENARIO: Complete resiliency and automatic recovery from the loss of connection between sites
• Site that locks the witness: Can I communicate with a majority of the nodes (+FSW) in the cluster? Yes, then stay up
• Other site: Can I communicate with a majority of the nodes in the cluster? No (lock failed), drop out of cluster membership
[Diagram: Site A and Site B have lost their direct connection; both can still reach the File Share Witness \\Foo\Share at Site C (branch office) over the WAN]
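To see why a witness in a third site allows automatic recovery from the loss of any one site, the Python sketch below removes each site in turn and checks whether the remaining votes still form a strict majority. The node counts (two per site) and the function name are assumptions for illustration only.

```python
def survives_loss_of(votes_by_site: dict[str, int]) -> dict[str, bool]:
    """For each site, report whether the cluster keeps quorum if that site is lost."""
    total = sum(votes_by_site.values())
    return {
        lost_site: 2 * (total - votes) > total
        for lost_site, votes in votes_by_site.items()
    }


# Hypothetical layout: 2 nodes per site, file share witness in a third site.
with_fsw = {"Site A": 2, "Site B": 2, "Site C (FSW)": 1}
print(survives_loss_of(with_fsw))
# {'Site A': True, 'Site B': True, 'Site C (FSW)': True}

# Same nodes without a witness: losing either site leaves 2 of 4 votes, no majority.
without_fsw = {"Site A": 2, "Site B": 2}
print(survives_loss_of(without_fsw))
# {'Site A': False, 'Site B': False}
```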
File Share Witness (FSW) Considerations
• Simple Windows File Server
• Single file server can serve as a witness for multiple clusters
– Each cluster requires its own share
– Can be made highly available on a separate cluster
• Recommended to be at 3rd separate site to enable automatic site failover
• FSW cannot be on a node in the same cluster
• FSW should not be in a VM running on the same cluster
Quorum Model Recap
Node and File Share Majority
• Even number of nodes
• Best availability solution – FSW in 3rd site
Node Majority
• Odd number of nodes
• More nodes in primary site
Node and Disk Majority
• Use as directed by vendor
No Majority: Disk Only
• Not recommended
• Use as directed by vendor
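The recap's rule of thumb for multi-site clusters can be expressed as a tiny Python helper. This is a sketch of the slide's guidance only; the function name is hypothetical, and the disk-based models are deliberately left out because the slide reserves them for cases directed by the storage vendor.

```python
def recommended_quorum_model(node_count: int) -> str:
    """Map a node count to the quorum model suggested in the recap."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    if node_count % 2 == 0:
        # Even number of nodes: add a file share witness, ideally in a 3rd site.
        return "Node and File Share Majority"
    # Odd number of nodes: place more nodes in the primary site.
    return "Node Majority"


print(recommended_quorum_model(4))  # Node and File Share Majority
print(recommended_quorum_model(5))  # Node Majority
```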
Datacenter Recovery Partners
• Citrix Essentials for Hyper-V augments Hyper-V DR by automating disaster recovery configuration
– StorageLink Site Recovery manages storage automation
– Workflow orchestration for VM site failover
– Non-disruptive testing & staging of VM prior to failover
– Single click failback
– Recovery plans
– Integrates with SCVMM
– Plus more…
Microsoft Site Recovery Solution Stack: End to End Disaster Recovery
Server and Application Availability
• Hyper-V
• Clustering
• Quick and Live Migration
Storage and Data Availability (Storage Partner Data Replication)
• Synchronous & Asynchronous Replication
• Array state and application restart
Automation
• Workflow automation
• DR Run-book
• Simplified configuration & testing
Management
• Physical and Virtual
• Performance and Resource Optimization
Microsoft Private Cloud – Server Platform
Simplify with integrated physical, virtual and cloud management
Improve agility with private cloud computing infrastructure
Optimize service delivery across datacenter infrastructure and business critical services
Lower costs through automation
“We don’t have to manage our infrastructure with multiple tools…we have one central monitoring and management console from which we can care for every aspect of our environment” - Doug Miller, Practice Architect, Microsoft Practice Group, CDW
Design, Configure & Deploy
Data Protection & Recovery
Virtualize, Deploy & Manage
Monitor & Manage Service End to End
IT Service Management
© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after
the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.