MAXIMIZE AVAILABILITY AND UPTIME
BY CLUSTERING PHYSICAL DATA
CENTERS WITHIN METRO DISTANCES
MICHAEL NAKAMURA, SENIOR SOLUTIONS ARCHITECT
HENRY CHU, SENIOR SOLUTIONS ARCHITECT
OCTOBER 2012
WEBTECH EDUCATIONAL SERIES
Maximize Availability and Uptime by Clustering Your Physical Data Centers
within Metro Distances
As IT infrastructures continue to be virtualized, data center architects are looking
for ways to increase the mobility and high availability of virtual machines beyond a
single data center.
Expanding data centers across multiple locations has become an increasingly
common strategy to address high-availability and disaster recovery needs for
businesses with high uptime requirements.
Join Hitachi Data Systems for this Webinar and learn how you can:
• Accelerate tier-1 virtualization adoption by providing best-in-class SLAs
• Dynamically move workloads within and across data centers to avoid
contention, and support utility-on-demand models
• Provide automated recovery of applications with high return on investment
UPCOMING WEBTECHS
November
Comprehensive and Simplified Management for VMware vSphere
environments, November 14, 11 a.m. PT, 2 p.m. ET
Microsoft SQL Server 2012 Data Warehouse solutions on Hitachi
converged platform, November 27, 9 a.m. PT, 12 p.m. ET
Check www.hds.com/webtech for
Links to the recording, the presentation and Q&A (available next week)
Schedule and registration for upcoming WebTech sessions
AGENDA
Customer challenges
VMware Metro Storage Cluster overview
Hitachi Storage Cluster for VMware vSphere
technical review
Best practices
© Hitachi Data Systems Corporation and Brocade Communications Systems, Inc. 2012. All Rights Reserved.
CUSTOMER CHALLENGES
Downtime
‒ Key component(s) failure in single data center
‒ Planned maintenance
‒ No disaster recovery without downtime
Reluctance to migrate mission-critical apps
‒ Fear of performance degradation
‒ Data recovery is an issue; inability to meet recovery time objectives (RTO) and recovery point objectives (RPO)
Lack of a single point of management
across data centers
Inability to pool resources across data
centers limits application deployment flexibility
VMWARE METRO STORAGE CLUSTER OVERVIEW
VMware vSphere Metro Storage Cluster (vMSC) is a new
certified configuration in which a storage device spans
storage systems at multiple geographic sites
Hitachi Storage Cluster certification is complete and listed on the
VMware Hardware Compatibility List
Implemented for disaster and downtime avoidance
WHAT IS A METRO STORAGE CLUSTER?
WHY USE A METRO STORAGE CLUSTER?
Maximize availability and uptime by clustering physical
data centers within metro distances
Leverage VMware infrastructure high-availability benefits
with storage-based synchronous replication awareness
Stretched storage clusters provide new architectures that
enable
Nondisruptive workload mobility
Cross-site load balancing of resources
Avoidance of disaster and downtime
Uniform host access model – provides a single view of a datastore
across sites
Data consistency across 2 sites in the case of failure
HITACHI STORAGE CLUSTER FOR VMWARE VSPHERE: INFRASTRUCTURE OVERVIEW
HITACHI STORAGE CLUSTER FOR VMWARE VSPHERE: MANAGEMENT OVERVIEW
vCenter Server contains
these management
components:
‒ vCenter
‒ Hitachi Dynamic Link Manager (HDLM) command
‒ vSphere CLI
‒ CCI (RAID Manager)
Command device (Cmd Dev) presented from both Hitachi Virtual Storage Platform (VSP) systems
Best practice: Place vCenter at
a 3rd site so that virtual
infrastructure management is
not affected by the sitewide
failure of any 1 site
HITACHI STORAGE CLUSTER FOR VMWARE VSPHERE: ARCHITECTURE OVERVIEW
Hitachi High Availability Manager (HAM)
installed on each VSP
P-VOL and S-VOL seen as a single
volume
‒ RCU takes MCU serial number upon failover
Write data transferred from MCU to RCU
cache via synchronous Hitachi TrueCopy®
‒ Supports external storage and Hitachi Dynamic Provisioning volumes
Quorum disk on external storage
‒ Used by both MCU and RCU
‒ Unique quorum disk for each MCU-RCU relationship
‒ Allows verification of data integrity before failover
‒ Denotes location of most recent host data
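As a rough illustration of the synchronous write path described above, the Python sketch below models a host write that is acknowledged only once the data sits in both the MCU and RCU caches. The class and method names (`Array`, `ReplicatedVolume`, `write`) are hypothetical and do not correspond to any Hitachi API. It is a toy model of the normal (no-failure) case only.

```python
class Array:
    """Toy storage system with a write cache (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.cache = {}

    def store(self, lba, data):
        # Place the block in this array's cache.
        self.cache[lba] = data


class ReplicatedVolume:
    """A P-VOL on the MCU mirrored synchronously to an S-VOL on the RCU."""

    def __init__(self, mcu, rcu):
        self.mcu = mcu
        self.rcu = rcu

    def write(self, lba, data):
        # Synchronous replication: the acknowledgement is returned to the
        # host only after both caches hold the data.
        self.mcu.store(lba, data)
        self.rcu.store(lba, data)
        return "ack"


mcu = Array("MCU")
rcu = Array("RCU")
volume = ReplicatedVolume(mcu, rcu)
status = volume.write(0, b"payload")
```

After the call, both `mcu.cache` and `rcu.cache` contain the block, which is why the two volumes can be presented as one.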
HITACHI DYNAMIC LINK MANAGER (HDLM) WITH HIGH AVAILABILITY MANAGER (HAM): INTRODUCTION
Virtual storage represents P-VOL and
S-VOL as a single volume
‒ P-VOL and S-VOL have same VOL ID in SCSI inquiry
HDLM in ESX manages path
selection
‒ Active I/O sent to P-VOL
‒ S-VOL in standby state in normal operation
‒ Load balancing algorithm
Extended round robin
Extended least I/O
Extended least blocks
HAM uses synchronous TrueCopy to
replicate from P-VOL to S-VOL
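The path selection behavior above can be sketched in Python as follows. This is a simplified model, not HDLM's implementation: the `Path` type and `select_path` function are hypothetical, the choice rule is a plain least-outstanding-I/O pick rather than any of the "extended" algorithms, and the quorum check that precedes a real storage failover is left out.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    volume: str              # "P-VOL" or "S-VOL"
    online: bool = True
    outstanding_io: int = 0  # I/Os currently in flight on this path


def select_path(paths):
    """Pick the online path with the fewest outstanding I/Os.

    Paths to the P-VOL are preferred. S-VOL paths stay on standby and
    are considered only when no P-VOL path is online.
    """
    for volume in ("P-VOL", "S-VOL"):
        candidates = [p for p in paths if p.volume == volume and p.online]
        if candidates:
            return min(candidates, key=lambda p: p.outstanding_io)
    raise RuntimeError("no online path to either volume")


paths = [
    Path("p1", "P-VOL", outstanding_io=4),
    Path("p2", "P-VOL", outstanding_io=1),
    Path("s1", "S-VOL"),
]
chosen = select_path(paths)  # the least busy P-VOL path
```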
HDLM WITH HAM: VMOTION AND DYNAMIC RESOURCE SCHEDULER
vMotioned VMs
‒ Hosts within the cluster will use active paths to P-VOLs
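Because hosts keep using the active paths to the P-VOL after a vMotion, a VM that moves to the other site would likely send its I/O across the inter-site link. The sketch below shows one way to list such VMs from a simple inventory. The dictionaries and the `cross_site_vms` function are hypothetical and are not taken from the vSphere API.

```python
def cross_site_vms(vm_host, host_site, vm_datastore, datastore_active_site):
    """Return VMs whose host sits at a different site than the active volume."""
    result = []
    for vm, host in vm_host.items():
        active_site = datastore_active_site[vm_datastore[vm]]
        if host_site[host] != active_site:
            result.append(vm)
    return sorted(result)


# Hypothetical inventory: two hosts, one per site, and one shared datastore
# whose P-VOL (active volume) is at site1.
vm_host = {"vm-a": "esx1", "vm-b": "esx2"}
host_site = {"esx1": "site1", "esx2": "site2"}
vm_datastore = {"vm-a": "ds1", "vm-b": "ds1"}
datastore_active_site = {"ds1": "site1"}

remote = cross_site_vms(vm_host, host_site, vm_datastore, datastore_active_site)
```

In this example only `vm-b` runs at a site other than the one holding the active volume.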
HDLM WITH HAM: VMWARE HIGH AVAILABILITY (HA)
VMware HA failover
‒ VMs fail over to existing ESX nodes in HA cluster
‒ I/O continues to active P-VOL paths
HDLM WITH HAM: PATH FAILOVER
When paths to P-VOL fail,
HDLM PSP handles the
path failover
HDLM WITH HAM: STORAGE FAILOVER
When all paths to P-VOL or
MCU fail
‒ Paths to S-VOL become active
‒ Verify data integrity with quorum disk before failover
‒ RCU splits S-VOL with write-enabled status
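The storage failover conditions above can be read as a small decision function. The Python sketch below is an interpretation under stated assumptions: the function name and arguments are hypothetical, and the case where the quorum check does not succeed is not covered by the slides, so returning `None` (no volume serves I/O) is an assumption.

```python
def decide_active_volume(pvol_paths_online, quorum_check_ok):
    """Return which volume should serve host I/O.

    pvol_paths_online: list of booleans, one per path to the P-VOL.
    quorum_check_ok:   True if the quorum disk confirms data integrity.
    """
    if any(pvol_paths_online):
        # At least one P-VOL path survives: ordinary path failover applies.
        return "P-VOL"
    if quorum_check_ok:
        # All P-VOL paths are gone and integrity is verified:
        # the S-VOL is split write-enabled and its paths become active.
        return "S-VOL"
    # Assumption: without a successful quorum check, no failover occurs.
    return None


normal = decide_active_volume([True, True], True)
failed_over = decide_active_volume([False, False], True)
```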
HDLM WITH HAM: PATH RECOVERY
Storage recovery will
require reverse sync
‒ pairresync -swaps or pairresync -swapp
When storage recovers
and paths to P-VOL
recover
‒ Paths to S-VOL become standby
‒ P-VOL paths become active
QUORUM FAILURE
Remote mirroring between P-VOL and S-VOL stops
P-VOL continues to process host I/O
REPLICATION LINK FAILURE
P-VOL continues to process host I/O
HDLM WITH HAM: SITE FAILURE
VM failover handled by VMware HA
Storage failover handled by HAM
Path failover to replicated storage handled by HDLM
WAN LINK FAILURE (UNDER REVIEW)
Link for replication and
remote site has failed but
links to local site are
active
‒ P-VOL cannot process host I/O
‒ HDLM switches the I/O path to S-VOL
‒ Site 1: I/O paths to S-VOL also cannot be used, so Site 1 cannot continue to access both P-VOL and S-VOL
‒ Site 2: S-VOL continues to process host I/O
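For quick reference, the failure scenarios covered above can be collected into a lookup table. The Python sketch below only paraphrases the slides. The dictionary and function names are hypothetical, and the WAN link entry inherits the "under review" status of its slide.

```python
# Outcomes per failure type, paraphrased from the slides above.
FAILURE_OUTCOMES = {
    "path": ["HDLM PSP handles the path failover"],
    "storage": [
        "Data integrity is verified with the quorum disk",
        "RCU splits the S-VOL with write-enabled status",
        "Paths to the S-VOL become active",
    ],
    "quorum": [
        "Remote mirroring between P-VOL and S-VOL stops",
        "P-VOL continues to process host I/O",
    ],
    "replication link": ["P-VOL continues to process host I/O"],
    "site": [
        "VM failover handled by VMware HA",
        "Storage failover handled by HAM",
        "Path failover to replicated storage handled by HDLM",
    ],
    # Under review in the source slide.
    "wan link": [
        "Site 1 cannot access either P-VOL or S-VOL",
        "Site 2: S-VOL continues to process host I/O",
    ],
}


def outcomes(failure):
    """Look up the documented outcomes for a failure type (case-insensitive)."""
    key = failure.strip().lower()
    if key not in FAILURE_OUTCOMES:
        raise KeyError(f"unknown failure type: {failure!r}")
    return FAILURE_OUTCOMES[key]


quorum_result = outcomes("Quorum")
```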
BEST PRACTICE DESIGN RECOMMENDATIONS
Performance depends on WAN latency and bandwidth, which
can become the bottleneck
‒ Optionally combine VMware HA N+1 settings with DRS affinity rules to keep VMs at the site where the active volume resides
Quorum disk should be located at a 3rd site so that
quorum access is not affected by the sitewide failure
of any 1 site.
vCenter should be located at a 3rd site so that virtual
infrastructure management is not affected by the sitewide
failure of any 1 site.
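The dependence on WAN latency noted in the recommendations above can be estimated with simple arithmetic. The Python sketch below assumes roughly 5 microseconds of one-way propagation delay per kilometer of optical fiber and one round trip per synchronous write. Both figures are assumptions for illustration: real links add equipment and protocol overhead, and the example distance is not from the presentation.

```python
FIBER_US_PER_KM = 5.0  # approximate one-way propagation delay in fiber


def added_write_latency_ms(distance_km, round_trips=1):
    """Estimate latency added to each synchronous write by distance alone."""
    one_way_us = distance_km * FIBER_US_PER_KM
    return 2 * one_way_us * round_trips / 1000.0


metro_latency = added_write_latency_ms(100)  # about 1 ms for 100 km
```

Under these assumptions, every 100 km of separation adds about 1 ms to each replicated write, before any bandwidth limits are considered.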
BEST PRACTICE DESIGN RECOMMENDATIONS
Perform storage failback during scheduled downtime
‒ Perform a clean and controlled storage failback by migrating high-uptime virtual machines to a single host via VMware vMotion and then performing storage failback
Avoid single points of failure by architecting with
redundancy in mind
QUESTIONS
THANK YOU MICHAEL NAKAMURA HENRY CHU