Business Continuity & Disaster Recovery
Business Impact Analysis: RPO/RTO
Disaster Recovery: Testing, Backups, Audit
Imagine a system failure…
• Server failure
• Disk system failure
• Hacker break-in
• Denial of Service attack
• Extended power failure
• Snow storm
• Spyware
• Malevolent virus or worm
• Earthquake, tornado
• Employee error or revenge
How will each of these affect the business?
Event Damage Classification
• Negligible: No significant cost or damage
• Minor: A non-negligible event with no material or financial impact on the business
• Major: Impacts one or more departments and may impact outside clients
• Crisis: Has a major material or financial impact on the business
Minor, Major, and Crisis events should be documented and tracked until repaired.
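A minimal Python sketch of the tracking rule, assuming the four classes can be ordered by severity. The incident descriptions in the log are hypothetical examples, not taken from the slides.

```python
from enum import IntEnum


class EventDamage(IntEnum):
    """Event damage classes, ordered from least to most severe."""
    NEGLIGIBLE = 0
    MINOR = 1
    MAJOR = 2
    CRISIS = 3


def must_track(level: EventDamage) -> bool:
    """Minor, Major, and Crisis events are documented and tracked until repaired."""
    return level >= EventDamage.MINOR


# Hypothetical incident log: (description, classification)
incident_log = [
    ("Printer jam in one office", EventDamage.NEGLIGIBLE),
    ("Spyware found on a staff PC", EventDamage.MINOR),
    ("Registration server fire", EventDamage.CRISIS),
]

tracked = [name for name, level in incident_log if must_track(level)]
print(tracked)  # ['Spyware found on a staff PC', 'Registration server fire']
```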
Workbook: Disasters and Impact
(Assumes a university)
Problematic Event or Incident | Affected Business Process(es) | Impact Classification & Effect on Finances, Legal Liability, Human Life, Reputation
Fire | Classrooms, business departments | Crisis, at times Major; human life
Hacking attack | Registration, advising | Major; legal liability
Network unavailable | Registration, advising, classes, homework, education | Crisis
Social engineering / fraud | Registration | Major; legal liability
Server failure (disk/server) | Registration, advising, classes, homework, education | Major, at times Crisis
Recovery Time: Terms
• Interruption Window: Time the organization can wait between the point of failure and service resumption
• Service Delivery Objective (SDO): Level of service in Alternate Mode
• Maximum Tolerable Outage: Maximum time in Alternate Mode
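[Figure: timeline from Regular Service through the Interruption to Alternate Mode (operating at the SDO) and back to Regular Service. The Interruption Window runs from the interruption until the Disaster Recovery Plan is implemented; the Maximum Tolerable Outage covers the time in Alternate Mode until the Restoration Plan is implemented.]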
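A small Python sketch of checking one outage against these terms. It assumes the Interruption Window is measured until Alternate Mode service begins and the Maximum Tolerable Outage is measured as time spent in Alternate Mode; the target values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTargets:
    interruption_window_h: float   # max hours from failure until service resumes (in Alternate Mode)
    max_tolerable_outage_h: float  # max hours the organization can stay in Alternate Mode


def check_outage(targets: RecoveryTargets, hours_to_alternate: float,
                 hours_in_alternate: float) -> dict:
    """Compare one outage's timeline against the recovery-time targets."""
    return {
        "interruption_window_ok": hours_to_alternate <= targets.interruption_window_h,
        "mto_ok": hours_in_alternate <= targets.max_tolerable_outage_h,
    }


# Hypothetical targets: resume within 4 hours, return to Regular Service within 72 hours.
targets = RecoveryTargets(interruption_window_h=4, max_tolerable_outage_h=72)
print(check_outage(targets, hours_to_alternate=3, hours_in_alternate=96))
# {'interruption_window_ok': True, 'mto_ok': False}
```

In this example the Disaster Recovery Plan worked fast enough, but the Restoration Plan took too long.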
Definitions
• Business Continuity: Offer critical services in the event of a disruption
• Disaster Recovery: Survive an interruption to computer information systems
• Alternate Process Mode: Service offered by the backup system
• Disaster Recovery Plan (DRP): How to transition to Alternate Process Mode
• Restoration Plan: How to return to regular system mode
Classification of Services
• Critical $$$$: Cannot be performed manually. Tolerance to interruption is very low
• Vital $$: Can be performed manually for a very short time
• Sensitive $: Can be performed manually for a period of time, but may cost more in staff
• Nonsensitive ¢: Can be performed manually for an extended period of time with little additional cost and minimal recovery effort
Determine Criticality of Business Processes
[Figure: priority tree. Corporate branches into Sales (1), Shipping (2), and Engineering (3); lower levels include Web Service (1), Sales Calls (2), Orders (1), Inventory (2), and Products A (1), B (2), and C (3).]
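One way to turn such a priority tree into a recovery order, sketched in Python. The tree below is hypothetical (the exact branch layout of the figure is not recoverable), and ranking leaves lexicographically by their priority path is an assumption: it finishes the highest-priority branch before starting the next.

```python
# Hypothetical tree: each node maps a name to (priority, children).
# The structure is assumed; the slide's exact branch layout is not recoverable.
TREE = {
    "Sales": (1, {"Web Service": (1, {}), "Sales Calls": (2, {})}),
    "Shipping": (2, {"Orders": (1, {}), "Inventory": (2, {})}),
    "Engineering": (3, {}),
}


def leaves_with_priority(children, prio_path=(), name_path=()):
    """Yield (priority path, name path) for every leaf process, depth first."""
    for name, (prio, sub) in children.items():
        p, n = prio_path + (prio,), name_path + (name,)
        if sub:
            yield from leaves_with_priority(sub, p, n)
        else:
            yield p, n


def recovery_order(tree):
    """Sort leaves lexicographically by priority path: finish the
    highest-priority branch before moving to the next one."""
    ranked = sorted(leaves_with_priority(tree))
    return [" / ".join(names) for _, names in ranked]


print(recovery_order(TREE))
# ['Sales / Web Service', 'Sales / Sales Calls', 'Shipping / Orders',
#  'Shipping / Inventory', 'Engineering']
```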
RPO and RTO
• Recovery Point Objective (RPO): How far back can you fail to? One week’s worth of data?
• Recovery Time Objective (RTO): How long can you operate without a system? Which services can last how long?
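[Figure: timeline centered on the Interruption. The Recovery Point Objective is measured backward from it (1 hour, 1 day, 1 week before) and the Recovery Time Objective forward (1 hour, 1 day, 1 week after). Along the RPO axis, mirroring (RAID) sits closest to the interruption and backup images further back.]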
Orphan Data: Data that is lost and never recovered.
The RPO determines the backup period: copies must be taken at least as often as the RPO allows.
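A short Python sketch of how the backup period relates to the RPO: the orphan data is whatever was written between the last backup and the failure. The nightly schedule and failure time are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def orphan_data_window(backup_times, failure):
    """Span of data lost (orphaned) when restoring from the last backup
    taken at or before the failure."""
    times = sorted(backup_times)
    i = bisect_right(times, failure)
    if i == 0:
        raise ValueError("no backup exists before the failure")
    return failure - times[i - 1]


# Hypothetical schedule: nightly backups at 02:00 for one week.
backups = [datetime(2024, 3, day, 2, 0) for day in range(1, 8)]
failure = datetime(2024, 3, 5, 17, 30)

loss = orphan_data_window(backups, failure)
print(loss)                        # 15:30:00
print(loss <= timedelta(days=1))   # True: nightly backups can meet a 1-day RPO
print(loss <= timedelta(hours=4))  # False: a 4-hour RPO needs more frequent copies
```

In the worst case (failure just before the next backup) the loss approaches the full backup interval, which is why the interval must not exceed the RPO.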
Business Impact Analysis Summary
Service | Recovery Point Objective (hours) | Recovery Time Objective (hours) | Critical Resources (computer, people, peripherals) | Special Notes (unusual treatment at specific times, unusual risk conditions)
Registration | 0 hours | 4 hours | SOLAR, network, Registrar | High priority during Nov-Jan, March-June, August
Personnel | 2 hours | 8 hours | PeopleSoft | Can operate manually for some time
Teaching | 1 day | 1 hour | D2L, network, faculty files | High priority during the school semester
Workbook: Partial BIA for a university
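A Python sketch that reads the partial BIA as data, assuming services are recovered in order of shortest RTO and that a zero RPO calls for mirroring (as the RPO figure suggests) while any other RPO calls for backups at least that frequent. The seasonal notes are ignored here.

```python
# Partial university BIA from the table above; "1 day" converted to 24 hours.
BIA = [
    # (service, RPO hours, RTO hours)
    ("Registration", 0, 4),
    ("Personnel", 2, 8),
    ("Teaching", 24, 1),
]


def recovery_plan(bia):
    """Order services by RTO (shortest first) and suggest a data-protection
    approach from the RPO: zero RPO implies mirroring, otherwise backups
    taken at least every RPO hours."""
    plan = []
    for service, rpo_h, rto_h in sorted(bia, key=lambda row: row[2]):
        if rpo_h == 0:
            protection = "mirroring"
        else:
            protection = f"backup at least every {rpo_h} h"
        plan.append((service, rto_h, protection))
    return plan


for row in recovery_plan(BIA):
    print(row)
# ('Teaching', 1, 'backup at least every 24 h')
# ('Registration', 4, 'mirroring')
# ('Personnel', 8, 'backup at least every 2 h')
```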
RAID – Data Mirroring
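Redundant Array of Independent Disks
• RAID 0: Striping. Data is split across disks (AB | CD), with no redundancy.
• RAID 1: Mirroring. Each disk holds a full copy (ABCD | ABCD).
• Higher-level RAID: Striping & redundancy. Data is striped and a parity block is added (AB | CD | Parity).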
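A minimal Python sketch of how a parity block provides redundancy, assuming XOR parity as used in common striping-with-parity layouts (the slide does not name a specific RAID level). Any one lost block equals the XOR of the surviving blocks.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("blocks must have equal length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Striping with parity: data stripes on two disks, parity on a third (AB | CD | Parity).
disk1 = b"AB"
disk2 = b"CD"
parity = xor_blocks(disk1, disk2)

# Disk 2 fails: XOR of the surviving data and the parity rebuilds it.
rebuilt = xor_blocks(disk1, parity)
print(rebuilt)  # b'CD'
```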
Network Disaster Recovery
• Redundancy: Includes routing protocols, fail-over, and multiple paths
• Alternative Routing: More than one medium or more than one network provider
• Diverse Routing: Multiple paths over one medium type
• Last-mile circuit protection: E.g., local microwave and cable
• Long-haul network diversity: Redundant network providers
• Voice Recovery: Voice communication backup
Disruption vs. Recovery Costs
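[Figure: cost vs. time. The cost of service downtime rises with time, while the cost of alternative recovery strategies (hot site, warm site, cold site) falls as slower recovery is accepted; a minimum-cost point is marked where the combined cost is lowest.]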
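The minimum-cost idea can be sketched in Python by adding each strategy's own cost to the downtime cost it implies and picking the smallest total. All figures below are hypothetical; a real comparison would also weigh the likelihood of a disaster.

```python
# Hypothetical figures for one disaster scenario (not from the slides).
DOWNTIME_COST_PER_HOUR = 5_000
STRATEGIES = {
    # name: (hours until service resumes, cost of maintaining the strategy)
    "hot site": (4, 500_000),
    "warm site": (48, 150_000),
    "cold site": (336, 40_000),
}


def total_cost(hours, strategy_cost, downtime_cost_per_hour):
    """Disruption cost grows with downtime; add the strategy's own cost."""
    return hours * downtime_cost_per_hour + strategy_cost


def cheapest_strategy(strategies, downtime_cost_per_hour):
    return min(
        strategies,
        key=lambda name: total_cost(*strategies[name], downtime_cost_per_hour),
    )


print(cheapest_strategy(STRATEGIES, DOWNTIME_COST_PER_HOUR))  # warm site
print(cheapest_strategy(STRATEGIES, 50_000))                  # hot site
```

As downtime becomes more expensive per hour, the minimum moves toward faster (and costlier) strategies.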
Alternative Recovery Strategies
• Hot Site: Fully configured, ready to operate within hours
• Warm Site: Ready to operate within days. Has no main computer, or only a low-power one, but does contain disks, network, and peripherals.
• Cold Site: Ready to operate within weeks. Contains electrical wiring, air conditioning, and flooring.
• Duplicate or Redundant Information Processing Facility: Standby hot site within the organization
• Reciprocal Agreement: With another organization or division
• Mobile Site: A fully or partially configured trailer comes to your site, with microwave or satellite communications
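A Python sketch of choosing the cheapest site type that can meet a service's RTO. The readiness times in hours are assumptions standing in for the slide's "hours", "days", and "weeks", and the cheapest-first ordering (cold, warm, hot) follows the cost curve.

```python
# Assumed worst-case readiness in hours; the slide only says "hours",
# "days", and "weeks". Listed from cheapest to most expensive.
SITE_READINESS_H = [
    ("cold site", 4 * 7 * 24),   # assume about four weeks
    ("warm site", 7 * 24),       # assume up to a week
    ("hot site", 12),            # assume up to 12 hours
]


def cheapest_site_for(rto_hours):
    """Return the cheapest site type ready within the RTO, or None."""
    for site, ready_h in SITE_READINESS_H:
        if ready_h <= rto_hours:
            return site
    return None  # faster than any off-site option: consider a redundant IPF


for rto in (4, 24, 240, 1000):
    print(rto, cheapest_site_for(rto))
# 4 None
# 24 hot site
# 240 warm site
# 1000 cold site
```

Under these assumptions, a 4-hour RTO such as Registration's is not met by any off-site option, which points toward mirroring or a duplicate facility.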
What is Cloud Computing?
[Figure: a laptop and a PC connect to a cloud hosting a database, an app server, a web server, and a VPN server.]
This would cost $200/month.
Introduction to Cloud
NIST Visual Model of Cloud Computing Definition (National Institute of Standards and Technology), www.cloudstandards.org
Cloud Service Models
• Software (SaaS): Provider runs its own applications on cloud infrastructure.
• Platform (PaaS): Consumer provides apps; provider provides the system and development environment.
• Infrastructure (IaaS): Provides customers access to processing, storage, networks, or other fundamental resources.
Cloud Deployment Models
• Private Cloud: Dedicated to one organization
• Community Cloud: Several organizations with shared concerns share computer facilities
• Public Cloud: Available to the public or a large industry group
• Hybrid Cloud: Two or more clouds (private, community, or public) remain distinct but are bound together by standardized or proprietary technology
Disaster Recovery
Disaster Recovery Testing
An Incident Occurs…
• Call the Security Officer (SO) or a committee member
• SO follows pre-established protocol
• Security officer declares a disaster
• Then:
  • Emergency Response Team: human life is the first concern
  • Phone tree notifies relevant participants
  • IT follows the Disaster Recovery Plan
  • Public relations interfaces with the media (everyone else stays quiet)
  • Management and legal counsel act
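A phone tree is a call fan-out, which a breadth-first traversal models naturally. A Python sketch, with a hypothetical tree of roles; the `seen` set ensures no one is called twice even if the tree accidentally lists a person under two callers.

```python
from collections import deque

# Hypothetical phone tree: each person calls the people listed under them.
PHONE_TREE = {
    "Security Officer": ["IT Manager", "PR Lead", "Legal Counsel"],
    "IT Manager": ["Network Admin", "DBA"],
    "PR Lead": [],
    "Legal Counsel": [],
}


def call_sequence(tree, root):
    """Breadth-first list of (caller, callee) pairs; each person is called once."""
    calls, seen, queue = [], {root}, deque([root])
    while queue:
        caller = queue.popleft()
        for callee in tree.get(caller, []):
            if callee not in seen:
                seen.add(callee)
                calls.append((caller, callee))
                queue.append(callee)
    return calls


for caller, callee in call_sequence(PHONE_TREE, "Security Officer"):
    print(f"{caller} calls {callee}")
```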
Concerns for a BCP/DR Plan
• Evacuation plan: People’s lives always take first priority
• Disaster declaration: Who, how, for what?
• Responsibility: Who covers the necessary disaster recovery functions
• Procedures for disaster recovery
• Procedures for Alternate Mode operation
• Resource allocation: During recovery & continued operation
• Copies of the plan should be off-site
Disaster Recovery Responsibilities
General Business:
• First responder: evacuation, fire, health…
• Damage assessment
• Emergency management
• Legal affairs
• Transportation/relocation/coordination (people, equipment)
• Supplies
• Salvage
• Training
IT-Specific Functions:
• Software
• Applications
• Emergency operations
• Network recovery
• Hardware
• Database/data entry
• Information security
BCP Documents (arranged by focus, IT or business, and by phase, event or recovery)
• Disaster Recovery Plan: Procedures to recover at an alternate site
• Business Recovery Plan: Recover the business after a disaster
• IT Contingency Plan: Recovers a major application or system
• Occupant Emergency Plan: Protect life and assets during a physical threat
• Cyber Incident Response Plan: Respond to a malicious cyber incident
• Crisis Communication Plan: Provide status reports to the public and personnel
• Business Continuity Plan
• Continuity of Operations Plan: Longer-duration outages
Workbook: Business Continuity Overview
Classification (Critical or Vital) | Business Process | Incident or Problematic Event(s) | Procedure for Handling (Section 5)
Vital | Registration | Computer failure | If total failure, forward requests to UW-System; otherwise, use the 1-week-old database for read purposes only
Critical | Teaching | Computer failure | Faculty DB Recovery Procedure
MTBF = MTTF + MTTR
• Mean Time to Failure (MTTF)
• Mean Time to Repair (MTTR)
• Mean Time Between Failures (MTBF)
Measure of availability:
• 5 9s = 99.999% of time working = about 5.26 minutes of failure per year.
[Figure: timeline alternating works, repair, works, repair, works; each repair period lasts 1 day and each working period 84 days.]
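A Python sketch of the availability arithmetic, using availability = MTTF / MTBF with MTBF = MTTF + MTTR. It applies this to the 84-day / 1-day timeline and converts five nines into downtime per year (a 365-day year is assumed).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (ignores leap years)


def availability(mttf, mttr):
    """Fraction of time the system works: MTTF / MTBF, where MTBF = MTTF + MTTR."""
    return mttf / (mttf + mttr)


def downtime_minutes_per_year(avail):
    return (1 - avail) * MINUTES_PER_YEAR


# 84 days working, then 1 day of repair: MTBF = 85 days.
print(round(availability(mttf=84, mttr=1), 4))        # 0.9882
print(round(downtime_minutes_per_year(0.99999), 2))   # 5.26 (five nines)
```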
Disaster Recovery Test Execution
Always tested in this order:
• Desk-Based Evaluation/Paper Test: A group steps through a paper procedure and mentally performs each step.
• Preparedness Test: Part of the full test is performed. Different parts are tested regularly.
• Full Operational Test: Simulation of a full disaster
Testing Objectives
Main objective: existing plans will result in successful recovery of infrastructure & business processes
Testing can also:
• Identify gaps or errors
• Verify assumptions
• Test timelines
• Train and coordinate staff
Testing Procedures
• Tests start simple and become more challenging as testing progresses.
• Include an independent third party (e.g., an auditor) to observe the test.
• Retain documentation for audit reviews.
Test cycle: develop test objectives, execute the test, evaluate the test, develop recommendations to improve test effectiveness, then follow up to ensure the recommendations are implemented.
Test Stages
• PreTest: Set the stage. Set up equipment and prepare staff.
• Test: The actual test.
• PostTest: Cleanup. Return resources; calculate metrics (time required, % success rate in processing, ratio of successful transactions in Alternate Mode vs. normal mode); delete test data; evaluate the plan; implement improvements.
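The three PostTest metrics can be computed from a handful of raw measurements. A Python sketch; the field names and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TestRun:
    """Raw PostTest measurements (all field names hypothetical)."""
    minutes_required: float
    jobs_attempted: int
    jobs_succeeded: int
    alt_txn_succeeded: int     # successful transactions in Alternate Mode
    normal_txn_succeeded: int  # successful transactions in normal mode, same period


def post_test_metrics(run: TestRun) -> dict:
    return {
        "time_required_min": run.minutes_required,
        "success_rate_pct": 100 * run.jobs_succeeded / run.jobs_attempted,
        "alt_to_normal_ratio": run.alt_txn_succeeded / run.normal_txn_succeeded,
    }


print(post_test_metrics(TestRun(190, 40, 38, 300, 1200)))
# {'time_required_min': 190, 'success_rate_pct': 95.0, 'alt_to_normal_ratio': 0.25}
```

A ratio of 0.25 would mean Alternate Mode handled a quarter of normal volume, which can then be compared with the SDO.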
Gap Analysis
Comparing the current level with the desired level:
• Which processes need to be improved?
• Where is staff or equipment lacking?
• Where does additional coordination need to occur?
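A Python sketch of one quantitative gap check: desired RTOs come from the partial university BIA, while the "current" RTOs are hypothetical results from a recovery test.

```python
# Desired RTOs from the partial BIA; "current" RTOs are hypothetical test results.
DESIRED_RTO_H = {"Registration": 4, "Personnel": 8, "Teaching": 1}
CURRENT_RTO_H = {"Registration": 6, "Personnel": 5, "Teaching": 3}


def rto_gaps(desired, current):
    """Hours by which each process misses its desired RTO (only shortfalls)."""
    return {
        process: current[process] - target
        for process, target in desired.items()
        if current[process] > target
    }


print(rto_gaps(DESIRED_RTO_H, CURRENT_RTO_H))  # {'Registration': 2, 'Teaching': 2}
```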
Insurance
Coverage areas: IPF & equipment, data & media, employee damage
• IS Equipment & Facilities: Loss of IPF & equipment due to damage
• Business Interruption: Loss of profit due to IS interruption
• Extra Expense: Extra cost of operation following IPF damage
• Valuable Papers & Records: Covers cash value of lost/damaged paper & records
• Media Reconstruction: Cost of reproduction of media
• Media Transportation: Loss of data during transport
• Fidelity Coverage: Loss from dishonest employees
• Errors & Omissions: Liability for an error resulting in loss to a client
IPF = Information Processing Facility
Auditing BCP
Includes:
• Is the BIA complete, with RPO/RTO defined for all services?
• Is the BCP in line with business goals, effective, and current?
• Is it clear who does what in the BCP and DRP?
• Is everyone trained, competent, and happy with their jobs?
• Is the DRP detailed, maintained, and tested?
• Are the BCP and DRP consistent in their recovery coverage?
• Are the people listed in the BCP/phone tree current, and do they have a copy of the BC manual?
• Are the backup/recovery procedures being followed?
• Does the hot site have correct copies of all software?
• Is the backup site maintained to expectations, and are the expectations effective?
• Was the DRP test documented well, and was the DRP updated?
Summary of BC Security Controls
• RAID
• Backups: Incremental backup, differential backup
• Networks: Diverse routing, alternative routing
• Alternative Site: Hot site, warm site, cold site, reciprocal agreement, mobile site
• Testing: Checklist, structured walkthrough, simulation, parallel, full interruption
• Insurance
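The difference between the two backup types is which reference time a file's modification is compared against. A Python sketch with a hypothetical week and hypothetical file names, assuming Monday night's backup was itself an incremental one.

```python
from datetime import datetime


def changed_since(mtimes, since):
    """Names of files modified after `since`."""
    return sorted(name for name, mtime in mtimes.items() if mtime > since)


# Hypothetical week: full backup Sunday night, nightly backups afterwards.
full_backup = datetime(2024, 3, 3, 23, 0)    # Sunday
monday_backup = datetime(2024, 3, 4, 23, 0)  # Monday
mtimes = {
    "grades.db": datetime(2024, 3, 4, 10, 0),  # changed Monday
    "roster.csv": datetime(2024, 3, 5, 9, 0),  # changed Tuesday
    "policy.pdf": datetime(2024, 2, 1, 8, 0),  # unchanged since the full backup
}

# Tuesday night's backup:
incremental = changed_since(mtimes, monday_backup)  # since the last backup of any kind
differential = changed_since(mtimes, full_backup)   # since the last full backup
print(incremental)   # ['roster.csv']
print(differential)  # ['grades.db', 'roster.csv']
```

The trade-off shows up at restore time: an incremental scheme needs the full backup plus every incremental since, while a differential scheme needs only the full backup plus the latest differential.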
Step 1: Define Threats Resulting in Business Disruption
Key questions:
• Which business processes are of strategic importance?
• What disasters could occur?
• What impact would they have on the organization financially? Legally? On human life? On reputation?
Impact Classification
• Negligible: No significant cost or damage
• Minor: A non-negligible event with no material or financial impact on the business
• Major: Impacts one or more departments and may impact outside clients
• Crisis: Has a major financial impact on the business
Step 1: Define Threats Resulting in Business Disruption
Problematic Event or Incident | Affected Business Process(es) | Impact Classification & Effect on Finances, Legal Liability, Human Life, Reputation
Fire | |
Hacking incident | |
Network unavailable (e.g., ISP problem) | |
Social engineering, fraud | |
Server failure (e.g., disk) | |
Power failure | |
Step 2: Define Recovery Objectives
[Figure: RPO/RTO timeline around the Interruption, with 1 hour, 1 day, and 1 week marked in each direction.]
Business Process | Recovery Time Objective (hours) | Recovery Point Objective (hours) | Critical Resources (computer, people, peripherals) | Special Notes (unusual treatment at specific times, unusual risk conditions)
Business Continuity
Step 3: Attaining Recovery Point Objective (RPO)
Step 4: Attaining Recovery Time Objective (RTO)
Classification (Critical or Vital) | Business Process | Problem Event(s) or Incident | Procedure for Handling (Section 5)
Criticality Classification
• Critical: Cannot be performed manually. Tolerance to interruption is very low
• Vital: Can be performed manually for a very short time
• Sensitive: Can be performed manually for a period of time, but may cost more in staff
• Non-sensitive: Can be performed manually for an extended period of time with little additional cost and minimal recovery effort