Mmckeown hadr that_conf

High Availability and Disaster Recovery in Windows Azure

MIKE MCKEOWN

BLOG: HTTP://WWW.MICHAELMCKEOWN.COM

TWITTER: NWOEKCM

LINKEDIN: WWW.LINKEDIN.COM/PUB/MIKE-MCKEOWN/20/B73/389/

CLOUD SOLUTIONS ARCHITECT - ADITI TECHNOLOGIES

CORPORATE OVERVIEW OF ADITI

- TRUSTED, RESPECTED, TECHNOLOGY SERVICES LEADER

2012 Partner of the Year, Windows Azure, Finalist

2011 Partner of the Year, Windows Azure SI, Finalist

2010 Partner of the Year, Windows Azure, Winner

Best companies to work for

Top 10 IT Workplace

Global Cloud MVPs

Top 50 Cloud influencers

1:114 hiring ratio

The Best ‘OF’

Vendor Award

52% of our customers rate us 5/5.

45+ active customers.

1200+ engagements.

1600 people, globally

18 years, 12 locations

You might be from Wisconsin if…

You have been both frostbitten and sunburned all in the same week.

You owe more money on your snowmobile than you do on your family car.

You consider a six pack of beer and a bug-zapper quality entertainment

You go to your family reunion looking to meet new women

You learned to drive a tractor before the training wheels were off your bike.

You think that John Deere Green, Ford Blue, and Primer Gray are the three primary colors.

Your school loses half its student body during deer season.

The blue book value of your truck goes up and down depending on how much gas it has in it.

Agenda

High Availability (HA) and Disaster Recovery (DR)

Definitions

Service Level Agreements (SLAs)

Designing for Failure

HA/DR Architectures

Failover Demo – Azure Traffic Manager

Tips and Best Practices

Introduction to HA/DR

High Availability (HA) includes a Disaster Recovery (DR) plan

Cloud failure is inevitable

Proper management means fast recognition to minimize effects

Define tolerance thresholds and an associated strategy

Consider budget and strategic location of resources

Cloud provides affordable and easily configurable geo-redundancy

Azure builds resiliency into some of its services

For other services, you must build it in yourself

What is your Cloud HA/DR strategy?

1. HA = Flat tire and spare donut tire

With the spare tire, the car continues to run

Can’t reach top speeds

Can’t maneuver as well

Example of Azure HA:

An instance of a Web role crashes due to a fault on its rack

SLA allows app to keep running

High Availability Definitions

1. Fault Tolerance

Detects and maneuvers around failed elements to continue and return the correct results within a specific timeframe

Use one or more design strategies: app redundancy, data replication, or degraded functionality (e.g. an order processing system)

2. Availability

HA systems are measured by the % of their availability in terms of planned/unplanned service outages for users

Azure Availability SLA

Techniques can improve availability so the system stays available during problems

Redundant and reliable design

Redundancy in Windows Azure

• Windows Azure Storage with 2x replicas

• Azure SQL Database built-in 2x backup servers

• Windows Azure Caching with high availability enabled

• Multi-instance Windows Azure Web Sites and Cloud Services

• Failover with Windows Azure Traffic Manager

Reliability in Windows Azure

• Auto recovery of crashed/nonresponsive instances

• Fault domain to scatter instances across racks

• Virtual machine availability set to allocate VMs across Fault domains

• Upgrade domain to avoid shutting down all instances at the same time

• Handle transient errors using the Transient Fault Handling Application block
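The last bullet refers to retrying operations that fail for temporary reasons such as throttling or a dropped connection. The Transient Fault Handling Application Block is a .NET library, so the following is only a minimal Python sketch of the general retry-with-backoff pattern, not that library's API. `TransientError`, `call_with_retry`, and `make_flaky` are hypothetical names introduced for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def call_with_retry(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run operation(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # The delay doubles on each attempt; jitter avoids synchronized retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))


def make_flaky(failures_before_success):
    """Build a demo operation that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientError("simulated throttling")
        return "ok after %d calls" % state["calls"]

    return operation
```

Calling `call_with_retry(make_flaky(2))` succeeds on the third call. Errors that are not transient (bad credentials, missing resources) should not be wrapped this way, since retrying them only delays the failure.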

High Availability Definitions (continued)

3. Scalability

Meet increased demand with consistent results in acceptable time windows

Horizontal scale out (dynamic) vs vertical scale up (restart)

What does HA require?

Strategies to absorb outage of key components

No single points of failure

Multiple web servers and data replication

Graceful failover when individual components fail (and they will)

Backup components and systems

2. DR = Bad Car Crash

Entire data center is down and there is no connection to the database

Network goes down and you can’t contact on-premises machines

Disaster Recovery

Process, policies, and procedures to restore critical systems after a catastrophic event

Application failure, data corruption (including human error), network down, failure of a connected service, DC down

A DR Plan is a part of a good HA strategy

Invest time and resources to continually plan, prepare, rehearse, document, train, and update processes

One point of responsibility

Real World DR Plan – Dilbert Technical Services

Establish RPO and RTO and know your SLAs

http://www.youtube.com/watch?v=FAHPYnXDca8

Recovery Point Objective (RPO)

How much data can you lose and still be okay after rollback?

How consistent does data need to be after a rollback?

Higher RPO = less critical data, lower cost ($)

Lower RPO = more critical data, higher cost ($)

Recovery Time Objective (RTO)

How much time does it take to recover?

Higher RTO = less critical system, lower cost ($)

Lower RTO = more critical system, higher cost ($)
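One way to make RPO and RTO concrete is to check a candidate design against them. The sketch below assumes that worst-case data loss equals the replication interval and that recovery time can be estimated up front. `RecoveryObjectives` and `meets_objectives` are hypothetical names, and the numbers in the usage note are illustrative only.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    rpo: timedelta  # maximum tolerable window of lost data
    rto: timedelta  # maximum tolerable time to restore service


def meets_objectives(objectives, replication_interval, estimated_recovery_time):
    """Return (rpo_ok, rto_ok) for a candidate DR design.

    Worst-case data loss equals the replication interval: a disaster just
    before the next sync loses everything written since the last one.
    """
    rpo_ok = replication_interval <= objectives.rpo
    rto_ok = estimated_recovery_time <= objectives.rto
    return rpo_ok, rto_ok
```

For example, with an RPO of 15 minutes and an RTO of 1 hour, a design that replicates every 5 minutes but needs 2 hours to bring up the backup site meets the RPO and misses the RTO.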

What’s in a Hot Dog?

Animal organs

Kidneys, livers, hearts, etc.

Reproductive organs?

Plastic, glass, bugs, and animal bones

Mechanically Separated Meats

“A paste-like meat product produced by forcing bones, with attached edible meat, under high pressure through a sieve to separate the bone from the edible meat tissue.”

SLAs are like hot dogs

The closer to 100% (more 9’s), the more uptime, but the higher the cost and maintenance

Azure has non-cumulative monthly SLAs

Service Level Agreements

Compounding of SLAs

Effective availability - Considers the SLAs of each dependent service and their cumulative effect on the total system availability

Windows Azure Compute (2 instances) = 99.95%

SQL Azure Database = 99.9%

Windows Azure Storage = 99.9%

Total allowed downtime per year (out of 8,760 hours)

4.38 hours + 8.76 hours + 8.76 hours = 21.9 hours

Effective Availability: 99.75%

Is that good enough for your app?

Can Effective availability of SLAs meet RPO and RTO of your app?
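The compounding above can be reproduced with a few lines of arithmetic. This sketch assumes the three services fail independently and that the app needs all of them at once, so their availabilities multiply. The function names are hypothetical; the SLA figures are the ones from the slide.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def effective_availability(slas):
    """Multiply the availability of services that must all be up at once."""
    result = 1.0
    for sla in slas:
        result *= sla
    return result


def downtime_hours_per_year(availability):
    """Hours per year a service may be down and still meet its availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


slas = {"compute": 0.9995, "sql_database": 0.999, "storage": 0.999}
combined = effective_availability(slas.values())
print(f"effective availability: {combined:.4%}")
print(f"allowed downtime: {downtime_hours_per_year(combined):.1f} hours/year")
```

The product comes to roughly 99.75%, or about 21.9 hours of allowed downtime per year, which matches the sum of the individual downtime budgets on the slide. Adding another dependent service always lowers the result.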

SLA – Downtime vs Costs

Azure HA/DR Architecture Concepts

Failure Design

Multi-Site Data Backup/Recovery Strategies

Immediately or eventually consistent systems

Fabric Controller (FC) and Fault Domains

PaaS and IaaS

Windows Azure Traffic Manager

Design For Failure

Large scale failures in any Cloud are rare but will happen

Cloud Data Centers don’t magically remove failures

Fabric Controller helps to quickly recover from problems in one DC

Understand RPO/RTO requirements to design for failures

Balance cost and complexity of HADR efforts against risk(s) you’re willing to bear

Cloud has made DR and HA remarkably easy and affordable

Multiple configurations possible with a few clicks

Application owners are ultimately responsible for failure management

Owners of DR Plans and HA strategy

Multi-Site Data Recovery Approaches

1. Azure Data Sync Services (PaaS)

Recommended between Azure SQL Database instances only

5 minutes minimum replication

If you need a lower RPO, you need to do it yourself

Creates clutter in synced databases

2. SQL Server Merge Replication (IaaS)

Two SQL Server databases (IaaS VMs) in two different regions

Update in DB A goes to DB B also and vice versa

Synchronous transactional operations lock tables and affect performance

3. SQL Server 2012 Always-On Availability Groups (IaaS)

Two SQL Server databases (IaaS VMs) in different regions

Immediate replication in master and its replicas

Non-transactional so no locking or performance degradation

1. Azure Data Sync Services

SQL Azure Database only (pure PaaS)

5 minute minimum replication

Transactional and blocking

One way or two way

Not recommended with SQL Server

[Diagram: two Azure SQL Database instances connected through Data Sync Services]

2. SQL Server Merge Replication/Azure IaaS VMs

Two databases in two different Regions in IaaS VMs

Update in DB A goes to DB B and vice versa

Synchronous transactional operations lock tables and affect performance

[Diagram: two Azure IaaS VMs, SQL Server 1 hosting Database A and SQL Server 2 hosting Database B, with transactional sync from A to B and from B to A]

3. SQL Server 2012 Always-On Availability Groups

Two databases in two different Regions in IaaS VMs

Immediate replication in master and its replicas

Non-transactional so no locking or performance degradation

[Diagram: two Azure IaaS VMs running SQL Server 2012, a master DB and a replica DB, linked by Always On (non-blocking) synchronization]

Consistency Models

Immediately consistent systems

Traditional synchronous pattern of all at once

Can hurt performance with locking/blocking

Possibly lose something at failure and recovery

The “C” in ACID

Transactional consistency to all affected data based upon rules, triggers, constraints

Eventually consistent systems

Asynchronous patterns using durable queues

Nothing lost in recovery

The ability to recreate system after failure

Improves fault tolerance in systems

Customer may not need to see immediate updates

Posts to Twitter/Facebook

DB may have some inconsistencies at any point in time

All nodes eventually consistent when all updates are done

Both have a role in HADR based upon RTO and RPO
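The eventually consistent pattern can be sketched with a primary that accepts writes at once and replicas that catch up from a queue. This is a toy in-memory model, with a `deque` standing in for a durable queue. All class and method names are hypothetical and do not correspond to any Azure API.

```python
from collections import deque


class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, update):
        key, value = update
        self.data[key] = value


class EventuallyConsistentStore:
    """Primary applies writes at once; replicas catch up from a queue."""

    def __init__(self, replica_count=2):
        self.primary = Replica()
        self.replicas = [Replica() for _ in range(replica_count)]
        # Stand-in for a durable queue: one pending list per replica.
        self.pending = [deque() for _ in range(replica_count)]

    def write(self, key, value):
        update = (key, value)
        self.primary.apply(update)  # the caller is not blocked on replicas
        for queue in self.pending:
            queue.append(update)  # replicas will see it later

    def drain(self):
        """Deliver queued updates in order, as a background worker would."""
        for replica, queue in zip(self.replicas, self.pending):
            while queue:
                replica.apply(queue.popleft())
```

Between `write` and `drain` the replicas are stale, which is the inconsistency window the slide describes. Once the queues are empty, every node holds the same data.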

Fault Domains - PaaS

“A fault domain is a set of hardware components – computers, switches, and more – that share a single point of failure.”

Can’t control FDs; they are assigned by Azure

Fault Domains do not span data centers

FC provisions multiple role instances across Fault Domains

FC monitors Fault Domains to reduce localized failures

Upon failure FC enforces SLA and re-provisions instances
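To see why spreading instances across fault domains matters, consider a simple round-robin placement. The Fabric Controller's real placement algorithm is not described in this deck, so the policy below is an assumption made for illustration, and both function names are hypothetical.

```python
from collections import defaultdict


def spread_across_fault_domains(instances, fault_domain_count=2):
    """Assign instances to fault domains round-robin (illustrative policy)."""
    placement = defaultdict(list)
    for index, instance in enumerate(instances):
        placement[index % fault_domain_count].append(instance)
    return dict(placement)


def survivors(placement, failed_domain):
    """Instances still running after one fault domain (rack) is lost."""
    return [name
            for domain, names in placement.items()
            if domain != failed_domain
            for name in names]
```

With four web role instances and two fault domains, losing either domain leaves two instances running. With a single instance there is nothing to spread, which is why the compute SLA quoted earlier applies to two or more instances.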

Fault Domains – IaaS Virtual Machines

VM Availability Sets

Different Fault Domains/Racks

Azure locates VMs in different fault domains to prevent localized failure

Required for 99.95% VM SLA

Ex. Web & SQL Server

Windows Azure Traffic Manager (WATM)

Automated priority of routing

1. Failover

2. Performance

3. Round-robin

Gives a new DNS prefix for users

Key point – You decide if your failover domain is dormant or active while NOT in failover mode

WATM rolls over regardless of whether the site is up or down

You need to manage whether the failover domain is active or dormant
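The three routing methods can be sketched as selection functions over a list of endpoints. This is not Traffic Manager's API (the real service answers DNS queries and uses health probes); the `Endpoint` type, its fields, and the function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Endpoint:
    name: str
    healthy: bool = True
    latency_ms: float = 0.0  # measured latency from the caller's region


def failover(endpoints):
    """Return the first healthy endpoint in priority (list) order."""
    for endpoint in endpoints:
        if endpoint.healthy:
            return endpoint
    return None


def performance(endpoints):
    """Return the healthy endpoint with the lowest latency."""
    healthy = [e for e in endpoints if e.healthy]
    return min(healthy, key=lambda e: e.latency_ms) if healthy else None


def round_robin(endpoints):
    """Yield healthy endpoints in rotation, skipping unhealthy ones."""
    for i in count():
        candidates = [e for e in endpoints if e.healthy]
        if not candidates:
            return
        yield candidates[i % len(candidates)]
```

Note that `failover` only chooses where traffic goes. Whether the secondary endpoint is already running and up to date when traffic arrives is the application owner's decision, which is the key point of the slide.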

HA/DR Cloud Architectures

HA/DR Types and Terms

Mostly PaaS concepts with a bit of IaaS

Example: home phone

1. Cold

Backup has nothing active, pre-loaded, or updated

Least expensive and slowest recovery

Ex. Have to go out and buy new home phone

2. Warm or Passive

Backup has some parts loaded/current and others made active upon failure

Ex. Home phone at house but still packed and not charged

3. Hot or Active

Backup is loaded and ready to receive load upon failure but not active

Ex. Home phone with charged battery but not plugged into home circuit

4. Highly Available

Backup is loaded and active and receiving load as part of normal processing

Most expensive and quickest recovery

Ex. Home phone with charged battery and plugged into home phone circuit

http://www.youtube.com/watch?v=EKRI10G-MAA

RTO vs. Cost

[Architecture diagrams for the options below; instances are shown spread across Fault Domain #1 and Fault Domain #2]

Single Region Deployment

Cold DR

Warm DR

Hot DR – Option 1

Hot DR – Option 2

High Availability

Demo: HA using Azure Traffic Manager

HA/DR Checklist for Risk Mitigation

1. Conduct a risk assessment for each application

Each can have different requirements.

Some applications are more critical than others

Justify extra cost to architect them for disaster recovery

Use this information to define the RTO and RPO for each application.

2. Design for failure starting with the application architecture

3. Implement best practices for high availability

Balancing cost, complexity, and risk

4. Implement disaster recovery plans and processes.

5. Establish backup strategies for all reference and transactional data.

6. Consider failures that span the module level all the way to a complete Cloud outage.

7. Choose a multi-site disaster recovery architecture.

General HA Best Practices

Avoid single points of failure

Always place (at least) one of each component (load balancers, app servers, databases, …) in at least two regions or fault domains

Maintain sufficient capacity to absorb region/fault domain failures

Reserved Instances (hot) – guarantee capacity is available in a separate region/cloud

Replicate data across clouds/regions for failover

Set up monitoring, alerts, and operations to identify problems and automate problem resolution or the failover process

Design stateless applications for resilience to reboot / relaunch

Summary

Plan and design for failure

Work with business and IT - RPO and RTO

Understand cumulative SLAs

Implement correct HA/DR Architectures

Best Practices and Checklist

Start with some DR strategy and improve continually

Resources

Disaster Recovery and High Availability for Windows Azure Applications

Mike McKeown and Hanu Kommalapati

http://msdn.microsoft.com/en-us/library/dn251004.aspx

Contingency Planning Guide for Information Technology Systems

National Institute of Standards and Technology

https://www.fismacenter.com/sp800-34.pdf

Failsafe: Guidance for Resilient Cloud Architectures

Marc Mercuri, Ulrich Homann, and Andrew Townhill

http://msdn.microsoft.com/en-us/library/windowsazure/jj853352.aspx

Business Continuity for Windows Azure

Patrick Wickline, Adam Skewgar, Walter Myers III

http://msdn.microsoft.com/en-us/library/windowsazure/hh873027.aspx

Questions?

AUGUST 11TH – 13TH 2014

SAME PLACE, SAME TIME