© 2005 EMC Corporation. All rights reserved.
Disaster Recovery of Technology Services: Issues, Strategies, Directions
Presented by Dave Purdy, 6-23-2005
Achieving Continuity of Operations (COOP)
Jan 16, 2016
Ever-Increasing Need for COOP: People, Data, and Services Availability
Drivers/trends for improved recoverability and/or availability of Services:
– Current measures increasingly deemed inadequate
– Physical vs. Electronic transport of Data
– Melding of “DR” and “Operational Availability”
– Self Insurance for DR
– Public Safety/Service Availability vs. Cost
Maturity in understanding COOP issues:
– Recovery vs. Restart
– Identification of App/DB inter-dependencies
– DR vs. Operational Availability (HA)
– Breaking down the problem:
• Information Availability
• Application Availability
Production Availability and Disaster Recovery: Converging?
Planned occurrences: Competing workloads (87% of occurrences)
– Backup, reporting
– Data warehouse extracts
– Application and data restore
Unplanned occurrences: Failure (13% of occurrences)
– Database corruption
– Component failure
– Human error
Disaster: Natural or man-made (<1% of occurrences)
– Flood, fire, earthquake
– Contaminated building
[Figure labels: “DR”, “CA”, “HA”; Insurance, ROI]
Creating a context: Government moving up the Continuum
Differing levels of IT Architectural dependency with regard to Availability Strategies:
[Figure: availability continuum running from 100% Procedural (0% IT Architectural Redundancy) to 100% Automatic (100% IT Architectural Redundancy). Scale labels: Manual, Transparent, Failsafe, 24 hrs x 7 days, Low Failsafe, High Failsafe, Resources, Low Volume, High Volume, Low security, High Security. Industries shown: Non Critical Business / Small Industries; Essential Services (Government, Airlines, Hospital); Banks / Financial Services; Telecommunications; Food Manufacturer; Consumer Goods Manufacturing; Manufacturing; Retail; Transportation / Logistics]
Availability Drivers
Increased realization that critical services depend on IT availability
– Pervasive requirements to protect people and data
– Increasing nature of real-time “transactions”
– “Lost” transactions cannot be re-created
Increased recognition that traditional recovery from tape is no longer viable
New vision: Merger of production and DR disciplines to focus on continuous availability
Public Service, Safety, and Inter-Agency dependencies driving criticality of COOP
Traditional Disaster Recovery: Tape
Tape Backup with Offsite Tape Storage
– RPO = 24+ hours or time of last backup stored offsite
– RTO = 24 - 96 hours or time required to restart operations
• Transport tapes to recovery site
• Set up systems to receive data
• Restore from tape
• Synchronize systems and DB for resumption
[Figure: RPO/RTO timeline with a Secs, Mins, Hrs, Days, Wks scale on each side; Tape Backup and Offsite Storage on the RPO side; Retrieve Tape, Set Up Systems, and Restore from Tape on the RTO side]
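The RPO and RTO ranges on this slide follow from simple arithmetic, which the Python sketch below illustrates. Only the four step names come from the slide; the timestamps and per-step durations are made-up assumptions, and the sketch assumes the recovery steps run strictly one after another.

```python
from datetime import datetime, timedelta

def tape_rpo(last_offsite_backup: datetime, disaster: datetime) -> timedelta:
    """Data-loss window: everything written after the last backup stored offsite."""
    return disaster - last_offsite_backup

def tape_rto(step_hours: dict[str, float]) -> timedelta:
    """Downtime: steps are assumed sequential, so their durations add up."""
    return timedelta(hours=sum(step_hours.values()))

# Hypothetical durations (hours) for the four recovery steps named on the slide.
steps = {
    "Transport tapes to recovery site": 8,
    "Set up systems to receive data": 12,
    "Restore from tape": 24,
    "Synchronize systems and DB for resumption": 4,
}

# Hypothetical timestamps: last offsite backup vs. moment of the disaster.
rpo = tape_rpo(datetime(2005, 6, 22, 2, 0), datetime(2005, 6, 23, 9, 0))
rto = tape_rto(steps)

print(f"RPO: {rpo.total_seconds() / 3600:.0f} hours of data lost")
print(f"RTO: {rto.total_seconds() / 3600:.0f} hours until restart")
```

With these assumed inputs the result is 31 hours of lost data and 48 hours of downtime, both inside the ranges quoted on the slide.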
Consistency = Usability: This is not a platform or application issue…
Getting All the Data at the Same Time
Across databases, applications, and platforms…
[Figure: Consistency Groups spanning Mainframe, UNIX, and Windows systems]
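The idea of "getting all the data at the same time" can be sketched as a check that every copy in a group reflects the same point in time. This is an illustrative Python model, not any vendor's API: `VolumeCopy`, `copy_point`, and the volume names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VolumeCopy:
    platform: str    # e.g. "Mainframe", "UNIX", "Windows"
    name: str        # hypothetical volume label
    copy_point: int  # hypothetical sequence number of the captured point in time

def is_consistent(group: list[VolumeCopy]) -> bool:
    """A group is restartable only if every member reflects the same point in time."""
    return len({v.copy_point for v in group}) <= 1

def stragglers(group: list[VolumeCopy]) -> list[VolumeCopy]:
    """Members whose copy lags behind the newest copy point in the group."""
    newest = max(v.copy_point for v in group)
    return [v for v in group if v.copy_point < newest]

group = [
    VolumeCopy("Mainframe", "ledger", 1042),
    VolumeCopy("UNIX", "orders-db", 1042),
    VolumeCopy("Windows", "web-sessions", 1041),  # one step behind the others
]

print(is_consistent(group))
print([v.name for v in stragglers(group)])
```

Here the Windows copy lags the other two, so the group as a whole is not usable for restart even though each individual copy may be intact.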
Patterns of DR Program Evolution:
[Figure labels: Insourced “CA” to 2/3 sites (Active, Triangulate); Insourced DR & HA to 2nd Site (Passive, Active); Commercial Hotsite with Electronic Vaulting or Replication; Commercial Hotsite QuickShip; Offsite Vital Records; Local, Remote]
Key Learnings:
– Restore is very different than Restart
– Testing effectiveness and control: Subset vs. Full / Hotsite vs. Internal
– Application/Agency Inter-dependencies
– Traditional recovery and restore techniques being deemed inadequate
– Increased complexity (and benefit) in justifying “DR” versus “DR + HA” as 2nd site becomes more integrated with primary site
A Practical Approach to Unifying Requirements and IT Capabilities for Mutual Agreement…
Customer Problem Area:
– Maximum Acceptable Data Loss (RPO)
– Maximum Acceptable Downtime (RTO)
Scale: Zero, Sec., Mins, Hours, > 24 hrs.
[Figure: technologies plotted against the RPO/RTO scale and Market Requirements: TAPE BACKUP & RECOVERY; DISK DATA REPLICATION; SERVER CLUSTERING & VIRTUALIZATION; with LOCAL and REMOTE labels]
Availability Strategies:
– Disaster Recovery (DR)
– High Availability (HA)
– “Continuous Availability” (CA)
[Figure: site topology with sites labeled Primary, Secondary, and Secondary -or- Tertiary; links labeled Synch, Asynch, and Asynch Network; regions labeled In-Region and Out-Region; Commercial Hotsite (SunGard, IBM BCRS)]
Remote Replication Capability Continuum Summary
Asynchronous
– Seconds of data exposure
– No performance impact
– Unlimited distance
Asynchronous Point-in-Time
– Hours of data exposure
– No performance impact
– Unlimited distance
Synchronous
– No data exposure
– Limited distance
Triangulated Synch & Asynch
– Simultaneous Synchronous and Asynchronous
– Three site awareness
[Figure: Source-to-Target diagrams for each mode (Unlimited Distance for both asynchronous modes, Limited Distance for Synchronous); the triangulated diagram shows a Primary site, a 2nd Site, and a Long-distance site, with links labeled Limited, Unlimited, and Unlimited]
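The trade-offs in this summary can be read as a small decision rule, sketched below in Python. The mode names and their data-exposure and distance properties come from the slide; the function name, the boolean distance input, and the one-hour boundary between "seconds" and "hours" of exposure are assumptions made for illustration.

```python
from typing import Optional

def pick_replication_mode(max_data_loss_seconds: float,
                          within_sync_distance: bool) -> Optional[str]:
    """Return the least demanding single mode that meets the data-loss target.

    Returns None when no single mode on the slide fits (zero data loss
    required at a site beyond synchronous distance).
    """
    if max_data_loss_seconds < 0:
        raise ValueError("data-loss tolerance cannot be negative")
    if max_data_loss_seconds == 0:
        # Only synchronous replication has no data exposure, and it is distance-limited.
        return "Synchronous" if within_sync_distance else None
    if max_data_loss_seconds < 3600:  # assumed boundary, not from the slide
        return "Asynchronous"  # seconds of data exposure, unlimited distance
    return "Asynchronous Point-in-Time"  # hours of data exposure, unlimited distance

print(pick_replication_mode(0, within_sync_distance=True))
print(pick_replication_mode(30, within_sync_distance=False))
print(pick_replication_mode(4 * 3600, within_sync_distance=False))
```

The `None` case is where the triangulated configuration on the slide becomes relevant: it pairs a synchronous copy at a nearby site with an asynchronous copy at a long-distance site.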
Best Practices for Achieving Business Continuity
Determine requirements / service levels
– System / application mapping
Validate ability to achieve service-level agreements
– Evaluate costs / tradeoffs of technologies to meet service levels
Create the right level of protection for your Agency’s (or Inter-Agency?) specific business and application requirements
Integrate it
– Across information storage platforms
– Across processing infrastructure (servers, networks, applications)
– Across data centers and geographic locations
– Integrate with Change Management
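System / application mapping and service-level validation can be combined in a short check, sketched here in Python. All application names, restart times, and service levels are hypothetical, and the sketch assumes dependencies are recovered in parallel, so an application is usable only once its slowest dependency is back.

```python
def effective_rto(app: str,
                  own_rto: dict[str, float],
                  depends_on: dict[str, list[str]],
                  _path: frozenset = frozenset()) -> float:
    """Hours until `app` is usable: its own restart time or its slowest dependency's."""
    if app in _path:
        raise ValueError(f"circular dependency involving {app}")
    below = [effective_rto(dep, own_rto, depends_on, _path | {app})
             for dep in depends_on.get(app, [])]
    return max([own_rto[app], *below])

# Hypothetical per-system restart times (hours) and dependency map.
own_rto = {"portal": 2, "customer_db": 4, "gis": 1, "auth": 8}
depends_on = {"portal": ["customer_db", "gis"], "customer_db": ["auth"]}

# Hypothetical service levels (maximum acceptable downtime in hours).
sla_hours = {"portal": 4, "gis": 4}

violations = {}
for app, limit in sla_hours.items():
    actual = effective_rto(app, own_rto, depends_on)
    if actual > limit:
        violations[app] = actual

print(violations)
```

In this example the portal restarts in 2 hours on its own but misses its 4-hour service level, because the authentication system it depends on indirectly needs 8 hours.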
Business Continuity Planning: Lessons from the Nation’s Capital in the Post-9/11 World
Mary Kaye Vavasour
eGov Services
Office of the Chief Technology Officer
District of Columbia
Recent History as Context
• 1996-1999: Y2K made continuity a priority
– Internet made networks a focus and eGovernment a reality
• 2001: 9/11, the unthinkable happened; security of data, network, and infrastructure became key to recovery
• 2002: Federal Patriot Act made continuity planning a legal mandate
• 2003: Sarbanes-Oxley Act added more regulatory requirements
– Hurricane Isabel caused regional power outages that lasted 4-7 days
Key Elements of Business Continuity Strategy
• High availability platform and procedures
• Proven Emergency Operations Process
– Detailed, service-based procedures
– Dedicated staff
– Regional coordination
– Frequent practice with planned events
• Focus on Continuity of Communications
– Public safety wireless network
– Public portal resiliency with specialized content
– High availability messaging platform
High Availability Platform; Centralized Process
• In-sourced, high availability Disaster Recovery
• Active-Active for availability
– Multiple servers behind hardware load balancers (millisecond fail-over)
– Separate web application and database tiers
– 95% of public web services covered (104 sites + main portal)
• Active-Passive for Disaster Recovery
– Two data centers
– Multiple types of replication: cluster synch for dynamic portal content; MS/CRS for legacy applications and static pages; SQL and Oracle replication for the database tier
• Future: tertiary site for continuous availability of portal
• Centralized failure recovery process run by senior staff
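The Active-Active arrangement above relies on hardware load balancers; the Python sketch below is only a software illustration of the same idea, in which requests rotate across servers and any server whose heartbeat has gone stale is skipped. The class name, server names, and timeout value are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
class HeartbeatPool:
    """Round-robin routing that skips servers with a stale heartbeat."""

    def __init__(self, servers: list[str], timeout: float):
        self.servers = list(servers)
        self.timeout = timeout  # seconds a server may stay silent before it is skipped
        self.last_seen: dict[str, float | None] = {s: None for s in self.servers}
        self._next = 0

    def heartbeat(self, server: str, now: float) -> None:
        """Record that `server` reported in at time `now`."""
        self.last_seen[server] = now

    def healthy(self, now: float) -> list[str]:
        """Servers that have reported in within the timeout window."""
        return [s for s in self.servers
                if self.last_seen[s] is not None
                and now - self.last_seen[s] <= self.timeout]

    def route(self, now: float) -> str:
        """Pick the next healthy server in rotation."""
        alive = self.healthy(now)
        if not alive:
            raise RuntimeError("no healthy servers")
        server = alive[self._next % len(alive)]
        self._next += 1
        return server

pool = HeartbeatPool(["web-a", "web-b"], timeout=0.5)
pool.heartbeat("web-a", now=0.0)
pool.heartbeat("web-b", now=0.0)
first, second = pool.route(now=0.1), pool.route(now=0.2)

# web-a stops reporting; only web-b sends a fresh heartbeat.
pool.heartbeat("web-b", now=1.0)
after_failure = [pool.route(now=1.2), pool.route(now=1.3)]
print(first, second, after_failure)
```

Once web-a misses its heartbeat window, all traffic goes to web-b without any manual step, which is the property the slide describes as fail-over.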
Comprehensive Emergency Operations Process
• Started with Y2K; focus on manual processes to back up automated systems
• Post 9/11: focus on continuity of services
• Dedicated staff = DC EMA + agency representatives + key service providers (utilities, suppliers, Federal public safety, regional emergency agency staff, etc.)
• Hardened site
• 14 Emergency Liaison Officers for key services
• Two-tiered operational structure (EOC and JIC)
• Clearly defined decision-making process and lines of authority
• Redundant communication channels with all levels of responders, and the public at large
• Frequent practice, using planned events
Specialized Content for Public Communication
• Public portal’s Emergency Center provides detailed content for emergency response plans
• Extensive use of GIS-based content
– Content tailored to individual’s location
– Facilitates location of shelters, evacuation routes, and major transportation services
• Specialized “Emergency Mode” will take over entire portal during catastrophic events
Focus on Continuity of Communications
• Public safety wireless network for voice and data
• Federal and regional voice interoperability
• 99% of District geography is covered
• Dedicated transmission towers, and mobile repeater systems
• Signals can penetrate thick building walls, metro system tunnels, underground locations
District of Columbia Office of the Chief Technology Officer – 1
Coverage Improvement With New Network
[Figure: coverage improvement with new MPD network]
Public Portal Resiliency
• Active-Active failover, with load balancing on heartbeat for high availability
• Actual = 99.99999%
• Active-Passive disaster recovery between two local data centers
• Future: Tertiary site for continuous availability
GOAL = Never Go Dark
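To put the availability figure above in concrete terms, the Python sketch below converts an availability percentage into allowed downtime per year. It is plain arithmetic under two stated assumptions: a 365-day year, and availability measured over the full year. The function name is illustrative.

```python
from decimal import Decimal

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; leap years ignored

def downtime_seconds_per_year(availability_percent: str) -> Decimal:
    """Seconds of downtime per year implied by an availability percentage.

    The percentage is passed as a string so Decimal keeps it exact.
    """
    unavailable = (Decimal(100) - Decimal(availability_percent)) / Decimal(100)
    return unavailable * SECONDS_PER_YEAR

for pct in ("99.9", "99.999", "99.99999"):
    print(f"{pct}% -> {downtime_seconds_per_year(pct):.2f} seconds/year")
```

At 99.99999% the budget works out to about 3.15 seconds of downtime per year, compared with roughly 8.76 hours at 99.9%.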
High Availability Messaging Platform
• Completely fault tolerant email system enables government officials to communicate and share data during significant outages
• High volume synchronous data replication between primary and secondary data centers, using EMC’s CLARiiON MirrorView
• Homeland Security funding ($900k) made public safety agencies the priority focus during implementation:
– MPD
– FEMS
– DMH
– CFSA
– DOC
• Can fail over email accounts and the most recent data from 4 hours prior to the outage
Key Success Factors
• People
• Process
• Practice
Mary Kaye Vavasour
Program Manager
eGovernment Services
Office of the Chief Technology Officer
District of Columbia