MICROSOFT SQL SERVER HIGH AVAILABILITY AND DISASTER RECOVERY Michael Poremba // October 2008
Nov 19, 2014
MICROSOFT SQL SERVER HIGH AVAILABILITY
AND DISASTER RECOVERY
Michael Poremba // October 2008
Database HA & DR Experience…
Work with business to determine HA or DR requirements for applications and data?
Design HA or DR solutions?
Administer HA or DR process?
Still learning MS SQL Server HA & DR capabilities?
2
Scope of this Presentation
Data Availability Data recovery High availability Disaster recovery
Technology Focus MS SQL Server Physical servers SANs
In-depth how-to(available elsewhere)
Partitioned views (federated) Advanced DBA techniques Custom application logic 3rd-party software solutions Alternate DBMS engines
(e.g. Oracle; DB2) HA on virtual machines Complex scenarios &
solutions Load balancing
Presentation Focus Beyond Scope of Presentation
3
So, you need to make yourproduction database bulletproof…
Introduction to Data Availability
4
Data Availability Continuum
Degrees of protection for information systems:
Business Risk Solution
Data Recovery
Data loss Redundant data
High Availability
Downtime ofdatabase service
Redundant system components
Disaster Recovery
Downtime ofbusiness operations
Redundant systemsand facilities
5
Business Case for Availability
Keep business-critical applications available
Secondary: Server
maintenance
Protect against loss of data center
Secondary: Application
upgrades Infrastructure
upgrades
High Availability Disaster Recovery
6
Service Level Agreement (SLA) Permitted downtime (planned vs. unplanned?)
Acceptable data/transaction loss Application response times Mean time to recovery
Note: Database uptime is not equivalent to application availability Failures of other application services Network outages
Uptime SLA
Downtimeper Year
Downtimeper Month
99.9% 8.76 hours 43.8 minutes
99.99% 52.6 minutes 4.38 minutes
99.999% 5.26 minutes 0.438 minutes
7
Protect What?
Application data stores Databases Files Other data repositories
Database services DBMS availability for applications
Application services Application availability for users and external
systems
Databases are the heart of most information systems;they deserve the highest affordable protection.
8
Database Failure Scenarios
Storage subsystem Disk Controller
Network Server Power
Operator errors DBMS interruption Drops / deletes
Application defects
DBMS defects Data corruption
Physical Infrastructure Failures
Logical Data Failures
9
Service Recovery Strategies
StandbyMode
Failover Behavior SQL Server Feature
Coldstandby
• Manual intervention required to restore offline data copy
• Backup and restore
Warm standby
• Data copy online and ready
• Manual failover required
• Transaction log shipping
• Database mirroring
Hot standby
• Automatic failover • Database mirroring
• Failover clustering
10
Data Recovery—TerminologyTerminology varies for source vs. copy
High Availability Strategy
Data Source Data Copy
Backup and Restore
Database Backup
Log Shipping Primary SecondaryStandby
Database Mirroring Principal Mirror
Failover Clustering PrimaryActive
SecondaryPassiveStandbyInactive
11
[Briefly…]
Data Recovery12
Database Backups
Traditional backup types Full backup Differential backup Transaction log backup
Disk is better than tape First backup to disk (separate physical disk
volume) Detect exceptions encountered during backup Verify backup files Copy backup files to tape or remote disk
Data retention policy for backup files
13
Database Backup Strategy
Backup of user databases not sufficient for recovery
System database Master database MSDB database Model database External data stores…
14
Synch with External Data StoresSynchronize recovered database with external data stores:Identity column seedsFull-text indexes(SQL Server 2000)
LDAP entriesFile system objectsOther databases
15
Backup Retention Policy
Location of backup files Duration of retention Protection of sensitive data
Sarbanes/Oxley (SOX) HIPAA Internal policies for data management and
protection Access to backups from offsite data
storage
16
Data Recovery Process
Backup file sets Full baseline,
differential, and transaction logs
Retrieving backup files Offsite storage Tape Network copy Dependency on
multiple people to get access to backup files
Recovery strategy depends on failure scenario Create comprehensive
failure matrix Devise recovery strategy
for each scenario Does worst-case
recovery scenario fit within SLA parameters?
Recovery time; SLA Include future data
growth in recovery plan Fully test recovery
strategies—practice is essential
17
High Availability18
High Availability
Minimize or avoid service downtime Whether planned or unplanned
When components fail,service interruption is brief or non-existent Automatic failover
Eliminate single points of failure (as affordable) Redundant components Fault-tolerant servers
19
Redundant Components
Objective: Avoid single points of failure (where affordable)Approach: Use redundant components for database service Database server nodes Server components
ECC RAM; failure-tolerant HW & OS DBMS instance User databases Storage devices Storage unit components
MPIO: Interfaces; paths; switches; controllers RAID: Disks
Networking MPIO: Interfaces; paths; switches
Data copies E.g. Recovering torn page from mirror in SQL Server 2008
20
Transaction Log Shipping
Warm standby solution Duplicate user database
Copy transaction logs to standby server & restore
Database available for read-only access Users must disconnect for logs to be applied Two database licenses required if querying
standby Manual application failover Supported on standard hardware Possible data loss (unapplied transactions)
21
Database Mirroring Redundancy at user database level
Duplicate copy of user database Independent storage devices Multiple copies of instance databases
Mirrored over private network channel Mirror always redoing transactions from principal Negligible impact on transaction throughput
Multiple mirroring modes: High-availability: commit @ log on mirror;
automatic failover High-protection: commit @ log on mirror; manual
failover High-performance: commit when logged on
principal Very fast automatic failover—seconds
Requires witness server Mirror-aware application client connection
Provided by client library Database connection string must specify both
servers Mirror may be available for read-only access
(snapshots) Works with standard hardware
Local Storage local sys DBs mirror user DB
Local Storage local sys DBs source user DB
node A node B
witness(optional)
22
Mirror Witness
With mirroring, more than one server is required to decide on failover
Witness automates failover from primary to mirror Watches database availability Reports observations back to principal and mirror
Runs in separate SQL Server instance (Express is OK)
Prevents “split brain” scenario Very low resource consumption
Can be witness for multiple databases Not a single point of failure
23
SQL Server Failover Clustering
Shared Storage system DBs user DBs quorum
node A node B
Two clustered nodes Active/Passive config
MS SQL services Running on virtual
server Shared storage device
User databases System databases Quorum drive Redundant internal
components
24
Active/Passive Failover Clustering Redundancy at database instance
level All databases fail over together Shared copy of system databases
Single data copy on shared storage device No I/O overhead reducing throughput Storage unit is single point of failure
for cluster All database services are clustered
SQL Agent; Analysis Services; Full-Text engine, MS DTC
Automatic failover (up to minutes) DBMS accessed over virtual IP Database not available from
inactive node for DB client connections Storage is controlled by one cluster
node at a time Requires hardware certified by
Microsoft for Microsoft Cluster Service
Shared Storage system DBs user DBs quorum
node A node B
25
HA Comparison
Scope: user DB Standard hardware One SQL license
(unless querying snapshots on mirror)
Very fast failover (seconds)
OS flexible (e.g. 32/64) Independent storage Independent services Reporting on mirror Geographic separation OK
Scope: DBMS instance Certified hardware One SQL license
(only one node can access database)
Automatic failover (up to minutes)
Enterprise OS Shared storage Clustered services Standby not available Servers are usually co-
located
Database Mirroring Failover Clustering
26
Considerations for HA
HA complements backup and recovery strategy Does not replace data recovery plan
Application service availability is often determined by a network of interdependent services Availability can be difficult to define (e.g. partial
failures) Failure probability difficult to measure or compute
Increased system complexity could lead to lower service availability! Operator error a leading cause of availability issues Increased number/types of system components More complex to configure and administer
27
Data Recovery Requirements
28
Disaster Recovery29
Disaster Recovery
Minimize downtime of business operations Redundant systems and facilities
SQL Server features: Transaction log shipping Database mirroring Failover clustering
Other technologies Storage-based mirroring
30
Disaster Recovery Planning
Data security requirements Clarify SLA, data loss allowance Evaluate system cost vs. data protection Failure analysis System redundancy Process validation Training for personnel
Prevention practices Executing disaster recovery and business
continuity Practice, practice, practice
31
Business Continuity Facility
System redundancy Systems: Web servers app servers; database, etc. Data: Databases; data files on OS; security info,
etc. Networking: Domain, routing, subnet, VIPs, etc.
Alternate facilities Network bandwidth Physical or network access by operations staff
Failover Often a deliberate decision, using manual failover
32
Data Redundancy
Synchronous redundancy Network bandwidth cost Network latency and application performance Network reliability
Asynchronous redundancy Risk of data loss More cost-effective Resilient to network latency issues
Candidate Technologies SQL Server database mirroring Failover clustering with SAN-based mirroring
33
DR Using Database Mirroring Two sites: Primary and DR location Separate failover clusters at each site SQL Server database mirroring between
sites witness(optional)
Shared Storage B local sys DBs local quorum mirror user DB
node B1 node B2
Shared Storage A local sys DBs local quorum source user DB
node A1 node A2
failover cluster at site A failover cluster at site B
databasemirroring
34
DR Using SAN-Based Mirroring Two sites: Primary and DR location Four-node failover cluster; one virtual IP
address SAN-based mirroring between sites Manual cluster failover
Shared Storage B system DBs quorum user DBs
node B1 node B2
Shared Storage A system DBs quorum user DBs
node A1 node A2
failover cluster nodes at site A failover cluster nodes at site B
storage-based
mirroring
35
[Skip if time is running short.]
Complimentary Technologies
36
SAN-Based Data Mirroring
Data blocks duplicated at storage level Similar to transaction log shipping
Copy performed in sequence and coordinated with database checkpoint Ensures consistency of mirrored data files
Synchronous or asynchronous mirroring Co-located or geographically dispersed—both
are OK SAN link bandwidth must support database I/O rate
May require extra feature support from SAN vendor
Could rely on Failover Clustering for HA
37
SQL Server Database Snapshots Read-only point-in-time database
snapshot No data is copied—instantaneous
Historical snapshot pages tracked separately from changing pages
Snapshots can be maintained indefinitely Limited only by available storage
Snapshot copy can be used for reporting Read-only, so no locking issues
38
SQL Server Replication
Transactional replication High transaction volume Low data latency required Mixed technologies:
Integrates with other DBMS
Merge replication Bi-directional data
changes Typically server-to-client
Snapshot replication Large, infrequent data
changes Data change latency OK Best for smaller data sets
Subscriber databases available for reporting
Replicate data subsets
Some data loss is possible
Periodically validate replicated data
39
App Development and Admin
40
Considerations for App Developers App services tolerant to database service interruptions Application transactions must be handled in code—data
consistency Exception handling for transaction retry, connection
recovery Requires coding standards, code reviews, and testing Bulk data operations Transaction volume impacts rollback time during failover Batch jobs must be run on alternate nodes Don’t bypass transaction logging Synchronization with external data sources? Be aware of database recovery model Mirroring uses FailoverPartner in connection string Use TCP/IP as client protocol
41
Considerations for Admins
Use identical server hardware, when possible Design network redundancies, when feasible
Consider network latency for geographic separation Always manage through virtual cluster, not individual cluster
nodes Retest failover/failback after HA maintenance Diagnose after failover
Repair alternate node Resynchronize data, as necessary Be aware of primary/secondary locations Ensure application services are connected and functioning
properly Keep server node configurations synchronized:
Service pack and patch levels Duplicate non-redundant resources Jobs; logins and permissions; OS & sys objects
42
HA Risks
System performance degradation HA system complexity leads to availability
issues Some system failures not planned for Backup and recovery planning incomplete Administrators not fully trained or informed User databases not synchronized with
other data sources
43
Common Admin Use Cases
Maintain HA nodes Hardware maintenance Rolling upgrades and software patches
Resynchronize the redundant copy Re-synch mirror Restart log shipping
Diagnose and repair Diagnose cause of failover Repair failed node and restore failover
capabilities Test failover and failback
44
Common Admin Actions
Train and practice administrators to: Initiate a database mirror Manually failover mirror database or
cluster node Add/remove passive node from mirror or
cluster Upgrade/patch servers nodes Restart or redirect application services
45
More Information46
References—Books
Microsoft SQL Server 2008 High Availability with Clustering & Database Mirroringby Michael Otey, 2009.
Microsoft SQL Server High Availabilityby Paul Bertucci, 2004.
Pro SQL Server 2005 High Availabilityby Allan Hirt, 2007.
Pro SQL Server 2005 Replication by Sujoy Paul, 2006.
Pro SQL Server 2005 Service Brokerby Klaus Aschenbrenner, 2007.
The Rational Guide to SQL Server 2005 Service Brokerby Roger Wolter, 2006.
High Availability Related Topics
47
References—Presentations48
Microsoft Load Balancing and Clusteringhttp://ce.sharif.edu/courses/84-85/2/ce317/resources/root/lecture%20slides/14.%20Microsoft%20Load%20Balancing%20and%20Clustering.ppt
SQL Server 2005 High Availabilityhttp://www.atlantamdf.com/Presentations/AtlantaMDF_111207HA.ppt
High Availability Technologies In SQL Server 2000 And SQL Server 2005 http://202.181.238.2/hk/teched2004/ppt/Day_2_Rm407/DAT431(1330-1445).ppt
Meeting the Availability Challengehttp://download.microsoft.com/download/E/D/C/EDCF54DB-19CD-4882-9FC4-4F7D46FCEAA6/HighAvailability.ppt
Disaster Recovery Mistakeshttp://www.sqlsig.org/Oct%2011%20DASSUG%20-%20Jason%20Hall%2010-11-07%20MM.ppt
SQL Server 2005 High Availabilityhttp://blogs.msdn.com/sql2005event/attachment/564303.ashx
Effective Usage of SQL Server 2005 Database Mirroring http://www.sqlserver-qa.net/SSQA-Effective%20Usage%20of%20SQL%20Server%202005%20Database%20Mirroring_show.ppt
References—Articles
Achieve High Availability for SQL Serverhttp://technet.microsoft.com/en-us/magazine/cc162477.aspx
Geographically Dispersed Clusters in Windows Server 2003http://www.microsoft.com/windowsserver2003/techinfo/overview/clustergeo.mspx
Restoring file and filegroup backups http://support.microsoft.com/kb/281122/en-us
Restoring specific tables or rows from backupshttp://support.microsoft.com/kb/321836/en-us
Maintaining Availability During Upgradeshttp://msdn.microsoft.com/en-us/library/ms191449.aspx
49