SQL Server High Availability

MICROSOFT SQL SERVER HIGH AVAILABILITY AND DISASTER RECOVERY

Michael Poremba // October 2008

Database HA & DR Experience…

Work with business to determine HA or DR requirements for applications and data?

Design HA or DR solutions?

Administer HA or DR process?

Still learning MS SQL Server HA & DR capabilities?


Scope of this Presentation

Presentation Focus

• Data availability
  • Data recovery
  • High availability
  • Disaster recovery
• Technology focus
  • MS SQL Server
  • Physical servers
  • SANs

Beyond Scope of Presentation

• In-depth how-to (available elsewhere)
• Partitioned views (federated)
• Advanced DBA techniques
• Custom application logic
• 3rd-party software solutions
• Alternate DBMS engines (e.g. Oracle; DB2)
• HA on virtual machines
• Complex scenarios & solutions
• Load balancing

So, you need to make your production database bulletproof…

Introduction to Data Availability

Data Availability Continuum

Degrees of protection for information systems:

Degree              Business Risk                     Solution
Data Recovery       Data loss                         Redundant data
High Availability   Downtime of database service      Redundant system components
Disaster Recovery   Downtime of business operations   Redundant systems and facilities

Business Case for Availability

High Availability
• Keep business-critical applications available
• Secondary: server maintenance

Disaster Recovery
• Protect against loss of data center
• Secondary: application upgrades; infrastructure upgrades

Service Level Agreement (SLA)
• Permitted downtime (planned vs. unplanned?)
• Acceptable data/transaction loss
• Application response times
• Mean time to recovery

Note: Database uptime is not equivalent to application availability
• Failures of other application services
• Network outages

Uptime SLA   Downtime per Year   Downtime per Month
99.9%        8.76 hours          43.8 minutes
99.99%       52.6 minutes        4.38 minutes
99.999%      5.26 minutes        0.438 minutes
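The downtime figures above follow directly from the uptime percentage. A minimal sketch of the arithmetic, assuming a 365-day year and a month taken as one twelfth of a year (the function and constant names are illustrative, not from the presentation):

```python
HOURS_PER_YEAR = 365 * 24            # 8,760 hours; leap years ignored
HOURS_PER_MONTH = HOURS_PER_YEAR / 12


def downtime_minutes(uptime_percent: float, period_hours: float) -> float:
    """Return the downtime, in minutes, that an uptime SLA permits per period."""
    return (1 - uptime_percent / 100) * period_hours * 60


for sla in (99.9, 99.99, 99.999):
    per_year = downtime_minutes(sla, HOURS_PER_YEAR)
    per_month = downtime_minutes(sla, HOURS_PER_MONTH)
    print(f"{sla}%: {per_year:.2f} min/year, {per_month:.3f} min/month")
```

The same function can be used to check whether a planned maintenance window alone would consume the permitted downtime for a month.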


Protect What?

Application data stores
• Databases
• Files
• Other data repositories

Database services
• DBMS availability for applications

Application services
• Application availability for users and external systems

Databases are the heart of most information systems; they deserve the highest affordable protection.

Database Failure Scenarios

Physical Infrastructure Failures
• Storage subsystem
  • Disk
  • Controller
• Network
• Server
• Power

Logical Data Failures
• Operator errors
  • DBMS interruption
  • Drops / deletes
• Application defects
• DBMS defects
• Data corruption

Service Recovery Strategies

Standby Mode   Failover Behavior                                           SQL Server Feature
Cold standby   Manual intervention required to restore offline data copy   Backup and restore
Warm standby   Data copy online and ready; manual failover required        Transaction log shipping; database mirroring
Hot standby    Automatic failover                                          Database mirroring; failover clustering

Data Recovery: Terminology

Terminology varies for source vs. copy

High Availability Strategy   Data Source       Data Copy
Backup and Restore           Database          Backup
Log Shipping                 Primary           Secondary; Standby
Database Mirroring           Principal         Mirror
Failover Clustering          Primary; Active   Secondary; Passive; Standby; Inactive

[Briefly…]

Data Recovery

Database Backups

Traditional backup types
• Full backup
• Differential backup
• Transaction log backup

Disk is better than tape
• First backup to disk (separate physical disk volume)
• Detect exceptions encountered during backup
• Verify backup files
• Copy backup files to tape or remote disk

Data retention policy for backup files
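The three backup types above combine at restore time: the most recent full backup, then the most recent differential taken after it, then the transaction log backups that follow. A minimal sketch of selecting that restore chain, assuming each backup is described only by its type and completion time (the `Backup` class and `restore_chain` function are hypothetical, not a SQL Server API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    kind: str       # "full", "diff", or "log"
    finished: int   # completion time in minutes (illustrative unit)


def restore_chain(backups, target):
    """Return the backups to restore, in order, to reach time `target`."""
    ordered = sorted(backups, key=lambda b: b.finished)
    fulls = [b for b in ordered if b.kind == "full" and b.finished <= target]
    if not fulls:
        raise ValueError("no full backup available before the target time")
    chain = [fulls[-1]]
    # Only the latest differential after the chosen full backup is needed.
    diffs = [b for b in ordered if b.kind == "diff"
             and fulls[-1].finished < b.finished <= target]
    if diffs:
        chain.append(diffs[-1])
    # Then every later log backup, stopping at the first one that covers target.
    start = chain[-1].finished
    for b in ordered:
        if b.kind == "log" and b.finished > start:
            chain.append(b)
            if b.finished >= target:
                break
    return chain


history = [Backup("full", 0), Backup("log", 60), Backup("diff", 120),
           Backup("log", 180), Backup("log", 240), Backup("full", 1440)]
print([(b.kind, b.finished) for b in restore_chain(history, target=200)])
```

If any file in the chain is missing or unreadable, recovery stops at the previous file, which is one reason to verify backup files as they are written.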


Database Backup Strategy

Backup of user databases not sufficient for recovery
• System databases
  • Master database
  • MSDB database
  • Model database
• External data stores…

Synch with External Data Stores

Synchronize recovered database with external data stores:
• Identity column seeds
• Full-text indexes (SQL Server 2000)
• LDAP entries
• File system objects
• Other databases

Backup Retention Policy

• Location of backup files
• Duration of retention
• Protection of sensitive data
  • Sarbanes-Oxley (SOX)
  • HIPAA
  • Internal policies for data management and protection
• Access to backups from offsite data storage

Data Recovery Process

Backup file sets
• Full baseline, differential, and transaction logs

Retrieving backup files
• Offsite storage
• Tape
• Network copy
• Dependency on multiple people to get access to backup files

Recovery strategy depends on failure scenario
• Create comprehensive failure matrix
• Devise recovery strategy for each scenario
• Does worst-case recovery scenario fit within SLA parameters?

Recovery time; SLA
• Include future data growth in recovery plan
• Fully test recovery strategies; practice is essential
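The point about future data growth can be made concrete with a rough projection. A minimal sketch, assuming compound annual growth and a constant restore throughput (both are simplifying assumptions, and all names and figures are illustrative):

```python
def projected_restore_hours(size_gb, annual_growth, years, restore_gb_per_hour):
    """Estimate restore time after `years` of compound data growth."""
    future_size = size_gb * (1 + annual_growth) ** years
    return future_size / restore_gb_per_hour


def years_until_sla_breach(size_gb, annual_growth, restore_gb_per_hour,
                           sla_hours, horizon=10):
    """Return the first whole year whose estimate exceeds the SLA, or None."""
    for year in range(horizon + 1):
        estimate = projected_restore_hours(size_gb, annual_growth, year,
                                           restore_gb_per_hour)
        if estimate > sla_hours:
            return year
    return None


# A 500 GB database growing 40% a year, restored at 200 GB/hour, 4-hour SLA.
print(years_until_sla_breach(500, 0.40, 200, sla_hours=4))
```

A measured restore rate from an actual recovery test is a better input than a vendor figure, which is part of why practice is essential.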


High Availability

High Availability

Minimize or avoid service downtime
• Whether planned or unplanned

When components fail, service interruption is brief or non-existent
• Automatic failover

Eliminate single points of failure (as affordable)
• Redundant components
• Fault-tolerant servers

Redundant Components

Objective: Avoid single points of failure (where affordable)

Approach: Use redundant components for database service
• Database server nodes
  • Server components: ECC RAM; failure-tolerant HW & OS
  • DBMS instance
  • User databases
• Storage devices
  • Storage unit components
  • MPIO: Interfaces; paths; switches; controllers
  • RAID: Disks
• Networking
  • MPIO: Interfaces; paths; switches
• Data copies
  • E.g. recovering torn page from mirror in SQL Server 2008

Transaction Log Shipping

Warm standby solution
• Duplicate user database
• Copy transaction logs to standby server & restore
• Database available for read-only access
  • Users must disconnect for logs to be applied
  • Two database licenses required if querying standby
• Manual application failover
• Supported on standard hardware
• Possible data loss (unapplied transactions)
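The data-loss exposure of log shipping depends on how far the standby lags behind the primary. A minimal sketch of that bookkeeping, assuming each log backup is tracked by the minute it was taken, copied, and restored (the `LogBackup` class and `exposure_at_failure` function are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class LogBackup:
    taken_at: int      # minute the log backup finished on the primary
    copied_at: int     # minute the file arrived on the standby server
    restored_at: int   # minute the standby finished restoring it


def exposure_at_failure(backups, failure_minute):
    """Return (minutes_of_lost_work, logs_still_to_restore) if the primary is
    lost at `failure_minute` and only files already on the standby survive."""
    on_standby = [b for b in backups if b.copied_at <= failure_minute]
    if not on_standby:
        raise ValueError("no log backup had reached the standby yet")
    newest = max(on_standby, key=lambda b: b.taken_at)
    pending = [b for b in on_standby if b.restored_at > failure_minute]
    return failure_minute - newest.taken_at, len(pending)


# Log backups every 15 minutes; copy takes 2 minutes, restore 3 more.
shipped = [LogBackup(15, 17, 20), LogBackup(30, 32, 35), LogBackup(45, 47, 50)]
print(exposure_at_failure(shipped, failure_minute=46))
```

In this model the worst-case loss is roughly the backup interval plus the copy time, so shortening either one narrows the exposure.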


Database Mirroring

Redundancy at user database level
• Duplicate copy of user database
• Independent storage devices
• Multiple copies of instance databases

Mirrored over private network channel
• Mirror always redoing transactions from principal
• Negligible impact on transaction throughput

Multiple mirroring modes:
• High-availability: commit @ log on mirror; automatic failover
• High-protection: commit @ log on mirror; manual failover
• High-performance: commit when logged on principal

Very fast automatic failover (seconds)
• Requires witness server

Mirror-aware application client connection
• Provided by client library
• Database connection string must specify both servers

Mirror may be available for read-only access (snapshots)

Works with standard hardware

[Diagram: node A with local storage (local sys DBs, source user DB); node B with local storage (local sys DBs, mirror user DB); optional witness]

Mirror Witness

With mirroring, more than one server is required to decide on failover

Witness automates failover from principal to mirror
• Watches database availability
• Reports observations back to principal and mirror

Runs in separate SQL Server instance (Express is OK)

Prevents “split brain” scenario

Very low resource consumption
• Can be witness for multiple databases

Not a single point of failure
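The "split brain" protection can be pictured as a two-out-of-three vote among principal, mirror, and witness. A minimal sketch of that rule, assuming each server only knows which of the other two it can currently reach; the function names are hypothetical, and the sketch ignores whether the mirror is fully synchronized:

```python
def mirror_may_take_over(mirror_sees_principal: bool,
                         mirror_sees_witness: bool) -> bool:
    """The mirror promotes itself only when it has lost the principal
    but still forms a majority (2 of 3) together with the witness."""
    return (not mirror_sees_principal) and mirror_sees_witness


def principal_may_serve(principal_sees_mirror: bool,
                        principal_sees_witness: bool) -> bool:
    """The principal keeps serving only while it can reach at least one partner."""
    return principal_sees_mirror or principal_sees_witness


# Principal isolated from both partners: it stops serving, the mirror takes over.
print(principal_may_serve(False, False), mirror_may_take_over(False, True))

# Witness lost, partners still connected: no failover, principal keeps serving.
print(principal_may_serve(True, False), mirror_may_take_over(True, False))
```

Because neither side acts without a majority, losing the witness alone does not interrupt service, which matches the point that the witness is not a single point of failure.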


SQL Server Failover Clustering

Two clustered nodes
• Active/Passive config

MS SQL services
• Running on virtual server

Shared storage device
• User databases
• System databases
• Quorum drive
• Redundant internal components

[Diagram: node A and node B attached to shared storage (system DBs, user DBs, quorum)]

Active/Passive Failover Clustering

Redundancy at database instance level
• All databases fail over together
• Shared copy of system databases

Single data copy on shared storage device
• No I/O overhead reducing throughput
• Storage unit is single point of failure for cluster

All database services are clustered
• SQL Agent; Analysis Services; Full-Text engine; MS DTC

Automatic failover (up to minutes)
• DBMS accessed over virtual IP
• Database not available from inactive node for DB client connections
• Storage is controlled by one cluster node at a time

Requires hardware certified by Microsoft for Microsoft Cluster Service

[Diagram: node A and node B attached to shared storage (system DBs, user DBs, quorum)]

HA Comparison

            Database Mirroring                                       Failover Clustering
Scope       User DB                                                  DBMS instance
Hardware    Standard hardware                                        Certified hardware
Licensing   One SQL license (unless querying snapshots on mirror)    One SQL license (only one node can access database)
Failover    Very fast failover (seconds)                             Automatic failover (up to minutes)
OS          OS flexible (e.g. 32/64)                                 Enterprise OS
Storage     Independent storage                                      Shared storage
Services    Independent services                                     Clustered services
Standby     Reporting on mirror                                      Standby not available
Location    Geographic separation OK                                 Servers are usually co-located

Considerations for HA

HA complements backup and recovery strategy
• Does not replace data recovery plan

Application service availability is often determined by a network of interdependent services
• Availability can be difficult to define (e.g. partial failures)
• Failure probability difficult to measure or compute

Increased system complexity could lead to lower service availability!
• Operator error a leading cause of availability issues
• Increased number/types of system components
• More complex to configure and administer
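A rough calculation shows why a chain of interdependent services is less available than any one of them, and why redundancy helps. A minimal sketch, assuming component failures are independent (rarely true in practice, which is part of why real failure probability is hard to compute); the function names and figures are illustrative:

```python
from math import prod


def serial_availability(components):
    """Every component must be up, so the availabilities multiply."""
    return prod(components)


def redundant_availability(replicas):
    """The service is up if any replica is up, so the unavailabilities multiply."""
    return 1 - prod(1 - a for a in replicas)


# Web tier, app tier and database, each at 99.9%, all required.
print(f"{serial_availability([0.999, 0.999, 0.999]):.6f}")

# Two redundant database nodes, each at 99%.
print(f"{redundant_availability([0.99, 0.99]):.6f}")
```

Every component added in series lowers the product, so extra HA machinery that the application depends on has to earn back more availability than it costs.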


Data Recovery Requirements


Disaster Recovery

Disaster Recovery

Minimize downtime of business operations
• Redundant systems and facilities

SQL Server features:
• Transaction log shipping
• Database mirroring
• Failover clustering

Other technologies
• Storage-based mirroring

Disaster Recovery Planning

• Data security requirements
• Clarify SLA, data loss allowance
• Evaluate system cost vs. data protection
• Failure analysis
• System redundancy
• Process validation
• Training for personnel
  • Prevention practices
  • Executing disaster recovery and business continuity
  • Practice, practice, practice

Business Continuity Facility

System redundancy
• Systems: Web servers; app servers; database, etc.
• Data: Databases; data files on OS; security info, etc.
• Networking: Domain, routing, subnet, VIPs, etc.

Alternate facilities
• Network bandwidth
• Physical or network access by operations staff

Failover
• Often a deliberate decision, using manual failover

Data Redundancy

Synchronous redundancy
• Network bandwidth cost
• Network latency and application performance
• Network reliability

Asynchronous redundancy
• Risk of data loss
• More cost-effective
• Resilient to network latency issues

Candidate Technologies
• SQL Server database mirroring
• Failover clustering with SAN-based mirroring
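The trade-off between synchronous and asynchronous redundancy is largely about network round-trip time on every commit. A minimal sketch of a simplified latency model, assuming the local log write and the remote copy proceed in parallel and that a synchronous commit waits for the slower of the two (the model, names and figures are assumptions for illustration, not measurements):

```python
def commit_latency_ms(local_log_write_ms, round_trip_ms=0.0,
                      remote_log_write_ms=0.0, synchronous=False):
    """Rough per-commit latency seen by the application."""
    if synchronous:
        # The commit waits until the remote copy has hardened the log record.
        return max(local_log_write_ms, round_trip_ms + remote_log_write_ms)
    # Asynchronous: the commit returns after the local log write only.
    return local_log_write_ms


def max_serial_commits_per_second(latency_ms):
    """Upper bound for one connection committing one transaction at a time."""
    return 1000.0 / latency_ms


sync_ms = commit_latency_ms(1.0, round_trip_ms=20.0, remote_log_write_ms=1.0,
                            synchronous=True)
async_ms = commit_latency_ms(1.0)
print(sync_ms, max_serial_commits_per_second(sync_ms))
print(async_ms, max_serial_commits_per_second(async_ms))
```

Under these assumptions a 20 ms link between sites dominates commit time, which is why geographically separated copies are often run asynchronously despite the risk of data loss.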


DR Using Database Mirroring

• Two sites: Primary and DR location
• Separate failover clusters at each site
• SQL Server database mirroring between sites

[Diagram: failover cluster at site A (node A1, node A2; Shared Storage A with local sys DBs, local quorum, source user DB) linked by database mirroring to failover cluster at site B (node B1, node B2; Shared Storage B with local sys DBs, local quorum, mirror user DB); optional witness]

DR Using SAN-Based Mirroring

• Two sites: Primary and DR location
• Four-node failover cluster; one virtual IP address
• SAN-based mirroring between sites
• Manual cluster failover

[Diagram: failover cluster nodes at site A (node A1, node A2; Shared Storage A with system DBs, quorum, user DBs) and failover cluster nodes at site B (node B1, node B2; Shared Storage B with system DBs, quorum, user DBs), linked by storage-based mirroring]

[Skip if time is running short.]

Complementary Technologies

SAN-Based Data Mirroring

Data blocks duplicated at storage level
• Similar to transaction log shipping

Copy performed in sequence and coordinated with database checkpoint
• Ensures consistency of mirrored data files

Synchronous or asynchronous mirroring

Co-located or geographically dispersed (both are OK)
• SAN link bandwidth must support database I/O rate

May require extra feature support from SAN vendor

Could rely on Failover Clustering for HA

SQL Server Database Snapshots

Read-only point-in-time database snapshot
• No data is copied (instantaneous)
• Historical snapshot pages tracked separately from changing pages

Snapshots can be maintained indefinitely
• Limited only by available storage

Snapshot copy can be used for reporting
• Read-only, so no locking issues
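The reason a snapshot is instantaneous is that pages are preserved only when they are first changed afterwards. A minimal sketch of that copy-on-write idea as a toy in-memory model, not SQL Server's actual storage format (the `Database` and `Snapshot` classes are hypothetical):

```python
class Database:
    def __init__(self, pages):
        self.pages = dict(pages)      # page number -> contents
        self.snapshots = []

    def create_snapshot(self):
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap

    def write(self, page_no, value):
        # Before the first change to a page, preserve its original
        # contents in every snapshot that has not saved it yet.
        for snap in self.snapshots:
            if page_no not in snap.saved:
                snap.saved[page_no] = self.pages.get(page_no)
        self.pages[page_no] = value


class Snapshot:
    def __init__(self, source):
        self.source = source
        self.saved = {}               # starts empty: nothing is copied up front

    def read(self, page_no):
        # Changed pages come from the saved originals; unchanged pages
        # are read straight from the source database.
        if page_no in self.saved:
            return self.saved[page_no]
        return self.source.pages.get(page_no)


db = Database({1: "a", 2: "b"})
snap = db.create_snapshot()
db.write(1, "A")
db.write(1, "AA")
print(snap.read(1), snap.read(2), db.pages[1], len(snap.saved))
```

The snapshot grows only as the source changes, which is why its lifetime is limited by available storage rather than by time.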


SQL Server Replication

Transactional replication
• High transaction volume
• Low data latency required
• Mixed technologies: integrates with other DBMS

Merge replication
• Bi-directional data changes
• Typically server-to-client

Snapshot replication
• Large, infrequent data changes
• Data change latency OK
• Best for smaller data sets

Subscriber databases available for reporting

Replicate data subsets

Some data loss is possible

Periodically validate replicated data

App Development and Admin


Considerations for App Developers

App services tolerant to database service interruptions
• Application transactions must be handled in code (data consistency)
• Exception handling for transaction retry, connection recovery
• Requires coding standards, code reviews, and testing

Bulk data operations
• Transaction volume impacts rollback time during failover
• Batch jobs must be run on alternate nodes
• Don’t bypass transaction logging

Synchronization with external data sources?

Be aware of database recovery model

Mirroring uses FailoverPartner in connection string

Use TCP/IP as client protocol
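Two of the points above, naming the failover partner and retrying after a dropped connection, can be sketched together. This is a minimal illustration, assuming the ADO.NET spelling of the keyword (`Failover Partner`; the exact spelling depends on the client library) and using Python's built-in `ConnectionError` as a stand-in for whatever exception a real driver raises. The helper names and server names are hypothetical:

```python
import time


def mirrored_connection_string(principal, mirror, database):
    """Build an ADO.NET-style connection string that names both partners."""
    return (f"Data Source={principal};Failover Partner={mirror};"
            f"Initial Catalog={database};Integrated Security=True")


def run_with_retry(transaction, attempts=3, delay_seconds=0.0,
                   retryable=(ConnectionError,)):
    """Run `transaction` again if the connection drops, e.g. during a failover."""
    for attempt in range(1, attempts + 1):
        try:
            return transaction()
        except retryable:
            if attempt == attempts:
                raise                  # give up and surface the last error
            time.sleep(delay_seconds)  # let the failover complete


print(mirrored_connection_string("SQLA", "SQLB", "Sales"))
```

The whole transaction is retried, not a single statement, because a failover rolls back any work that was in flight on the old principal.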


Considerations for Admins

• Use identical server hardware, when possible
• Design network redundancies, when feasible
  • Consider network latency for geographic separation
• Always manage through virtual cluster, not individual cluster nodes
• Retest failover/failback after HA maintenance
• Diagnose after failover
  • Repair alternate node
  • Resynchronize data, as necessary
  • Be aware of primary/secondary locations
  • Ensure application services are connected and functioning properly
• Keep server node configurations synchronized:
  • Service pack and patch levels
  • Duplicate non-redundant resources
  • Jobs; logins and permissions; OS & sys objects
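Keeping node configurations synchronized is easier to enforce when drift is checked mechanically. A minimal sketch, assuming each node's settings have already been collected into a dictionary by some inventory step that is not shown (the function name, keys and values are illustrative):

```python
def config_drift(node_a, node_b):
    """Return {setting: (value_on_a, value_on_b)} for every mismatch."""
    drift = {}
    for key in sorted(set(node_a) | set(node_b)):
        a, b = node_a.get(key), node_b.get(key)
        if a != b:
            drift[key] = (a, b)
    return drift


node_a = {"service_pack": "SP2", "logins": 42, "agent_jobs": 17}
node_b = {"service_pack": "SP1", "logins": 42}
for setting, (a, b) in config_drift(node_a, node_b).items():
    print(f"{setting}: node A = {a}, node B = {b}")
```

Running such a comparison after each maintenance window, before retesting failover, catches a missing job or login while the primary is still healthy.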


HA Risks

• System performance degradation
• HA system complexity leads to availability issues
• Some system failures not planned for
• Backup and recovery planning incomplete
• Administrators not fully trained or informed
• User databases not synchronized with other data sources

Common Admin Use Cases

Maintain HA nodes
• Hardware maintenance
• Rolling upgrades and software patches

Resynchronize the redundant copy
• Re-synch mirror
• Restart log shipping

Diagnose and repair
• Diagnose cause of failover
• Repair failed node and restore failover capabilities
• Test failover and failback

Common Admin Actions

Train administrators, and have them practice how to:
• Initiate a database mirror
• Manually fail over mirror database or cluster node
• Add/remove passive node from mirror or cluster
• Upgrade/patch server nodes
• Restart or redirect application services

