Data Center Business Continuance and Disaster Recovery
Transcript
Data Center Business Continuance and Disaster Recovery
• Cost of application downtime, lost data, and productivity
• Regulatory mandates (Homeland Defense, Basel II, HIPAA, GLB, SEC)
– Firms must recover business operations the same business day a disruption occurs
– "Out-of-region" data center, 200+ km away; mandates backup data centers on separate grids
Business Continuance Is More Critical than Ever
75% of IT decision-makers have altered Disaster Recovery/Business Continuance programs as a result of September 11
Following a disaster, 43% of directly affected businesses do not reopen, and 29% fail within 24 months as a result
Only 15% of Global 2000 enterprises have a full-fledged business continuity plan.
Disasters: fire, storm, floods, earthquakes, chemical accidents, nuclear accidents, wars
Introduction to Data Center - The Evolution
Data Center Disaster Recovery: Objectives, Failure Scenarios, Design Options
Components of Disaster Recovery: Site Selection - Front End GSLB, Server High Availability - Clustering, Data Replication and Synchronization - SAN Extension
Recovery of data and resumption of service - ensuring the business can recover and continue after a failure or disaster
Ability of a business to adapt, change and continue when confronted with various outside impacts
• Business Impact Analysis (BIA): Determines the impacts of various disasters on specific business functions and company assets
• Risk Analysis: Identifies important functions and assets that are critical to the company's operations
• Disaster Recovery Plan (DRP): Restores operability of the target systems, applications, or computing facility at the secondary Data Center after the disaster
Recovery Point Objective (RPO)
The point in time (prior to the outage) to which systems and data must be restored
The tolerable loss of data in the event of a disaster or failure; the impact of data loss and the cost associated with the loss
Recovery Time Objective (RTO)
The period of time after an outage within which the systems and data must be restored to the predetermined RPO
The maximum tolerable outage time
Recovery Access Objective (RAO)
Time required to reconnect users to the recovered application
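To see how these objectives constrain a design, here is a minimal worked sketch; the replication interval and recovery step durations are hypothetical numbers, not figures from this deck.

```python
# Minimal sketch (hypothetical numbers) of how RPO and RTO translate into
# concrete design limits for a replication and recovery plan.

replication_interval_min = 15    # async replication cycle every 15 minutes (assumed)
worst_case_rpo_min = replication_interval_min   # data written since the last cycle can be lost

detect_failure_min = 10          # time to detect the outage and declare disaster (assumed)
failover_storage_min = 20        # promote replicated storage at the secondary site (assumed)
restart_apps_min = 30            # restart applications and redirect users (assumed)
achieved_rto_min = detect_failure_min + failover_storage_min + restart_apps_min

print(f"Worst-case RPO: {worst_case_rpo_min} min of lost transactions")
print(f"Achieved RTO:   {achieved_rto_min} min until service is restored")
```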
Introduction to Data Center - The Evolution
Data Center Disaster Recovery: Objectives, Failure Scenarios, Design Options
Components of Disaster Recovery: Site Selection - Front End GSLB, Server High Availability - Clustering, Data Replication and Synchronization - SAN Extension
A data center that is environmentally ready and has sufficient hardware and software to provide data processing service with little or no downtime
Hot Backup offers Disaster Recovery, with little or no human intervention
Application data is replicated from the primary site
Introduction to Data Center - The Evolution
Data Center Disaster Recovery: Objectives, Failure Scenarios, Design Options
Components of Disaster Recovery: Site Selection - Front End GSLB, Server High Availability - Clustering, Data Replication and Synchronization - SAN Extension
Site Selection Mechanisms
Site selection mechanisms depend on the technology or mix of technologies adopted for request routing:
1. HTTP Redirect
2. DNS-Based
3. L3 Routing with Route Health Injection (RHI)
Health of servers and/or applications needs to be taken into account
Optionally, other metrics (like load) can be measured and utilized for a better selection
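As an illustration of the health checks mentioned above, here is a minimal sketch; it is not any particular vendor's GSLB or CSS implementation, and the site names, VIP hostnames, and the /health URL are hypothetical.

```python
# Minimal sketch of the probes a site selector might run before handing out a
# site: check each site's VIP and keep only sites whose servers/applications answer.
import socket
import urllib.request

SITES = {
    "dc1": "vip.dc1.example.com",   # hypothetical primary data center VIP
    "dc2": "vip.dc2.example.com",   # hypothetical backup data center VIP
}

def tcp_alive(host, port=80, timeout=2):
    """Basic keepalive: can we open a TCP connection to the server farm?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def http_alive(host, timeout=2):
    """Application-level check: does the web tier answer below the error range?"""
    try:
        with urllib.request.urlopen(f"http://{host}/health", timeout=timeout) as r:
            return r.status < 400
    except OSError:
        return False

healthy = [name for name, vip in SITES.items() if tcp_alive(vip) and http_alive(vip)]
print("Sites eligible for selection:", healthy)
```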
HTTP Redirection – The Idea
Leveraging the HTTP redirect function: HTTP return code 302
Proper site selection is made after the initial DNS request has been resolved, via redirection
Mainly used as a method of providing site persistence while allowing local server farm failure recovery
Can be used with the "Location Cookie" feature of the CSS to provide redirection after a wrong site selection
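The following is a minimal sketch of the redirect idea only; it is not the CSS "Location Cookie" feature itself, and the hostnames and health flag are hypothetical. The first site answers the request and, if its local farm is unhealthy, returns a 302 pointing the client at the other site.

```python
# Minimal HTTP-redirect site selection sketch using the standard library.
from http.server import BaseHTTPRequestHandler, HTTPServer

BACKUP_SITE = "http://www.dc2.example.com"   # assumed URL of the secondary data center
LOCAL_FARM_HEALTHY = False                   # in practice this comes from server farm health checks

class RedirectingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if LOCAL_FARM_HEALTHY:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"Served locally by dc1\n")
        else:
            # HTTP return code 302: send the client to the other site
            self.send_response(302)
            self.send_header("Location", BACKUP_SITE + self.path)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), RedirectingHandler).serve_forever()
```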
DNS-Based Site Selection – The Idea
The client D-proxy (local name server) performs iterative queries
The device which acts as "site selector" is the authoritative name server for the domain(s) distributed in multiple locations
The "site selector" sends keepalives to servers or server load balancers in the local and remote locations
The "site selector" selects a site for the name resolution, according to the pre-defined answers and the site load balancing method
The user traffic is sent to the selected location
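To make the selection step concrete, here is a minimal sketch of the decision a "site selector" makes; it is not a real authoritative name server, and the site names, addresses, and ordered-preference load-balance method are hypothetical.

```python
# Minimal DNS-based selection sketch: pick which site's address to answer with.
ANSWERS = {                      # pre-defined answers per site
    "dc1": "192.0.2.10",         # VIP advertised by the primary data center (assumed)
    "dc2": "198.51.100.10",      # VIP advertised by the backup data center (assumed)
}
PREFERENCE = ["dc1", "dc2"]      # site load-balance method: simple ordered preference

def resolve(healthy_sites):
    """Return the A record the 'site selector' would hand back for the domain."""
    for site in PREFERENCE:
        if site in healthy_sites:
            return ANSWERS[site]
    return None                  # no site available: nothing to answer

print(resolve({"dc1", "dc2"}))   # normal operation -> 192.0.2.10
print(resolve({"dc2"}))          # dc1 failed        -> 198.51.100.10
```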
Introduction to Data Center - The Evolution
Data Center Disaster Recovery: Objectives, Failure Scenarios, Design Options
Components of Disaster Recovery: Site Selection - Front End GSLB, Server High Availability - Clustering, Data Replication and Synchronization - SAN Extension
Cluster Overview
A cluster is two or more servers configured to appear as one
Two types of clustering: Load Balancing (LB) and High Availability (HA)
Clustering provides benefits for availability, reliability, scalability, and manageability
LB clustering: multiple copies of the same application against the same data set, usually read-only
HA clustering: multiple copies of a long-running application that requires access to a common data repository, usually read and write
Typical HA Cluster Components
Application software that is clustered to provide High Availability. Example: Microsoft Exchange, SQL, Oracle database, File and Print Services
Operating System that runs on the server hardware. Example: Microsoft Windows 2000 or 2003, Linux (and the other flavors of UNIX), IBM VMS or z/OS (for mainframe)
Cluster Software that provides the HA clustering service for the application. Example: Microsoft MSCS, EMC AutoStart (Legato), Veritas Cluster Server, HP TruCluster and OpenVMS
Optionally, Cluster Enabler, software that synchronizes the cluster software with the storage disk array software
Active/Standby:
– Active node takes client requests and writes to the data
– Standby takes over when it detects a failure on the active node
– Two-node or multi-node
Active/Active:
– Database requests are load balanced to both nodes
– Lock mechanism ensures data integrity
– Most scalable design
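To make the active/standby behavior concrete, here is a minimal sketch; it is not MSCS, VCS, or any real cluster product, receive_heartbeat and take_over_virtual_ip are placeholder hooks, and the timing values are assumptions.

```python
# Minimal active/standby failover sketch: the standby watches heartbeats on the
# private network and claims the virtual service address after repeated misses.
import time

HEARTBEAT_INTERVAL = 1.0     # seconds between heartbeats on the private network (assumed)
MISSES_BEFORE_FAILOVER = 3   # tolerate brief drops before declaring the active node dead (assumed)

def standby_loop(receive_heartbeat, take_over_virtual_ip):
    """receive_heartbeat() -> bool and take_over_virtual_ip() are supplied by the platform."""
    missed = 0
    while True:
        if receive_heartbeat():
            missed = 0
        else:
            missed += 1
            if missed >= MISSES_BEFORE_FAILOVER:
                # Active node presumed failed: mount the shared disk, start the
                # application, and claim the virtual IP the clients know about.
                take_over_virtual_ip()
                return
        time.sleep(HEARTBEAT_INTERVAL)
```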
File System Approaches for HA Clusters
Shared Everything
– Equal access to all storage
– Each node mounts all storage resources
– Provides a single layout reference system for all nodes
– Changes updated in the layout reference
Shared Nothing
– Traditional file system with peer-to-peer communication
– Each node mounts only its "semi-private" storage
– Data stored on the peer system's storage is accessed via the peer-to-peer communication
– Failed node's storage needs to be mounted by the peer
Considerations for HA Clusters
Split Brain: Cluster partitioning when nodes cannot communicate with each other but are equally capable of forming a cluster and mounting disks (see the quorum sketch after this list)
Extended L2 required in most implementations for:
– Public Network, since the client only knows about the Virtual IP address
– Private Network, used for heartbeats
Storage:
– Direct Attached Storage (DAS) cannot be used
– Shared disk needs to be visible to both nodes
– Needs to interface with the cluster software for disk failover, zoning, and LUN masking when there is a node failure
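A common way to avoid split brain is a quorum/majority rule. The following is a minimal sketch under that assumption; the vote count and witness disk are hypothetical, and no specific cluster product is implied.

```python
# Minimal split-brain guard sketch: when the heartbeat network partitions, only
# the partition holding a majority of votes (nodes plus an optional quorum/witness
# disk) may form the cluster and mount the shared disks; the minority side stops.

TOTAL_VOTES = 3   # e.g. 2 nodes + 1 quorum disk/witness (assumed configuration)

def may_form_cluster(votes_reachable_in_my_partition: int) -> bool:
    """True only for the partition that can see a strict majority of votes."""
    return votes_reachable_in_my_partition > TOTAL_VOTES // 2

print(may_form_cluster(2))   # node that still sees its peer or the witness -> True, may own the disks
print(may_form_cluster(1))   # isolated node -> False, must not mount shared storage
```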
In certain cases an L3 routed solution is possible
Microsoft MSCS
– Requires that the 2 nodes be on the same subnet
– The communication between the 2 nodes is UDP unicast
– Local Area Mobility (LAM) allows the placement of the nodes on 2 different subnets
Veritas VCS
– Allows having nodes with IP addresses in different subnets
– The Virtual Address needs to change when moving from node1 to node2
– DNS can be used to provide name resolution
The Cluster Enabler (CE) provides the interface between the Clustering Software and the Disk Array's software
When the Clustering Software detects a failure and wants to fail the node, the Cluster Enabler instructs the Disk Array to perform a failover
Cluster Enabler also allows node1 to be zoned to sym1320 and node2 to be zoned to 1291
The Cluster Enabler running on each node typically communicates with the Cluster Enabler software running on the remote node with Local Multicast messages
Introduction to Data Center - The Evolution
Data Center Disaster Recovery: Objectives, Failure Scenarios, Design Options
Components of Disaster Recovery: Site Selection - Front End GSLB, Server High Availability - Clustering, Data Replication and Synchronization - SAN Extension
Direct Attached Storage (DAS)
Storage is "local" behind the server
No storage sharing possible
Costly to scale; complex to manage
Network Attached Storage (NAS)
Storage is accessed at a file level over an IP network
Storage can be shared between servers
Storage Area Networks (SAN)
Storage is accessed at a block level
Separation of storage from the server
High-performance interconnect providing high I/O throughput
Replication: Modes of Operation
Synchronous
All data is written to the cache of the local and remote arrays before the I/O is complete and acknowledged to the host
Asynchronous
Write acknowledged after the write to the local array cache; changes (writes) are replicated to the remote array asynchronously
Semi-synchronous
Write acknowledged with a single subsequent WRITE command pending from the remote array
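To contrast the first two modes, here is a minimal sketch; it is not any array vendor's replication engine, and local_write and remote_write stand in for the actual array operations.

```python
# Minimal sketch of the synchronous vs. asynchronous write paths.
import queue

replication_queue = queue.Queue()   # async mode: changes waiting to be shipped to the remote array

def write_synchronous(block, local_write, remote_write):
    """Synchronous: the host's I/O completes only after both arrays hold the data (zero RPO)."""
    local_write(block)
    remote_write(block)          # wait for the remote array before acknowledging
    return "ack to host"

def write_asynchronous(block, local_write):
    """Asynchronous: acknowledge after the local cache write; replicate later (non-zero RPO)."""
    local_write(block)
    replication_queue.put(block)
    return "ack to host"

def replicate_pending(remote_write):
    """Run periodically (or in a background thread) to drain queued changes to the remote array."""
    while not replication_queue.empty():
        remote_write(replication_queue.get())
```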
[Diagram: database replication between the Primary Site and the Secondary Site over a SAN extension transport. Redo Logs (Cyclic): a copy of every committed transaction is synchronously replicated for zero loss. Archive Logs: replicated/copied to the secondary site. Database copy at time t0: point-in-time copy taken when the DB is quiescent, alongside earlier DB backups.]
A mixture of sync and async replication technologies is commonly used
Usually only the redo logs are synchronously replicated to the remote site
Archive logs are created from the redo log and copied when the redo log switches
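A minimal sketch of that mixed approach follows; it is not Oracle Data Guard or any specific array feature, and the transport and archive-creation functions are placeholder hooks.

```python
# Minimal sketch: redo entries are shipped synchronously so committed transactions
# are never lost, while completed archive logs are copied asynchronously on log switch.
import queue

archive_copy_queue = queue.Queue()     # async path: completed archive logs awaiting copy

def commit_transaction(redo_entry, write_local_redo, ship_redo_to_remote):
    """Sync path: the commit completes only after the remote site holds the redo entry (zero loss)."""
    write_local_redo(redo_entry)
    ship_redo_to_remote(redo_entry)
    return "commit acknowledged"

def on_redo_log_switch(completed_redo_log, create_archive_log):
    """When the redo log switches, turn it into an archive log and queue it for asynchronous copy."""
    archive_log = create_archive_log(completed_redo_log)
    archive_copy_queue.put(archive_log)
```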
Summary - Design Details
Data centers 1 and 2 are in the primary location, close enough together to provide DC HA for active/active access
Data Center 3 (DR) is more than a tolerable disaster radius away from primary DC 1 and 2
Web/App server farms are load balanced geographically
DB servers are within a geo-HA cluster and running in an L3 design
Synchronous data replication between the data centers within the primary location
Asynchronous data replication is done between the primary and secondary storage systems