Resilient Cloud Applications
MICROSOFT CONFIDENTIAL – INTERNAL ONLY

Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Jan 15, 2016

Page 1: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Resilient Cloud Applications
Mark Simms (@mabsimms)
Principal Program Manager
Windows Azure Customer Advisory Team

Page 2: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Session Objectives

Designing resilient large-scale services requires careful design and architecture choices

This session will explore key patterns & practices for highly available cloud services, illustrated with customer examples

Interactivity rocks -> please ask questions throughout!

Page 3: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Setting the Stage

Page 4: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Setting the stage
• Scalability
• Availability
• Insight

Page 5: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Setting the stage

Maximize service availability for consumers
• Ensure customers (and client devices) can access and use the service

Minimize impact of failure on consumers
• Degrade gracefully, isolate faults, fall back to alternate delivery paths

Maximize performance and capacity
• Services that are “live” but cannot handle desired/required demand are not available

Page 6: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Musings on application design
• Traditional web service design (N-tier)
• Make “everything stateless”

[Diagram: Load Balancer, Web Servers, App Servers]

Page 7: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Musings on application design
• Traditional web service design (N-tier)
• Make “everything stateless”
• Separate logic from data (state)
• Leverage specialized external state services: cache, load balancer, relational database, document database, key/value store, etc.

[Diagram: Load Balancer, Web Servers, App Servers, Database, Distributed Cache, Doc Store, ...]

Page 8: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Musings on application design
• No service is an island
• Dependencies on other internal and external services
• Trading control for time-to-market and agility

[Diagram: Load Balancer, Web Servers, App Servers, Database, Distributed Cache, Doc Store, ..., External Services (SendGrid, Twitter, Facebook, etc.)]

Page 9: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

What’s in a workload?
#1: Without the relational database, the application cannot fulfill any workloads.
#2: The relational database is an external service, subject to partial availability.

Page 10: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Designing for Failure

Page 11: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Decompose by Workload

Applications are composed of one or more workloads
• Products like SharePoint and Windows Server are designed with this principle in mind
• Each with different profiles, requirements and boundaries: management, availability, operational, cost, health, security, capacity, etc.

Decomposition allows for workload-specific optimization
• Technology selections, scalability and availability approaches, etc.

Page 12: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

What are the “9”s

Availability %            Downtime per year   Downtime per month*   Downtime per week
90% ("one nine")          36.5 days           72 hours              16.8 hours
99% ("two nines")         3.65 days           7.20 hours            1.68 hours
99.9% ("three nines")     8.76 hours          43.2 minutes          10.1 minutes
99.99% ("four nines")     52.56 minutes       4.32 minutes          1.01 minutes
99.999% ("five nines")    5.26 minutes        25.9 seconds          6.05 seconds
99.9999% ("six nines")    31.5 seconds        2.59 seconds          0.605 seconds

• Study Windows Azure Platform SLAs:

• Compute External Connectivity: 99.95% (2 or more instances)

• Compute Instance Availability: 99.9% (2 or more instances)

• Storage Availability: 99.9%

• SQL Azure Availability: 99.9%
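
The downtime figures for each number of nines follow directly from the availability percentage. The C# sketch below shows that arithmetic. The `Availability` class and its method names are hypothetical. `Composite` rests on an assumption that the slides do not state: a request needs every listed service, and the services fail independently, so their availabilities multiply.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration; not part of any Azure SDK.
public static class Availability
{
    // Allowed downtime in a period for a given availability (0.999 = "three nines").
    public static TimeSpan Downtime(double availability, TimeSpan period)
    {
        if (availability < 0.0 || availability > 1.0)
            throw new ArgumentOutOfRangeException(nameof(availability));
        return TimeSpan.FromSeconds(period.TotalSeconds * (1.0 - availability));
    }

    // Assumption: every dependency is required and failures are independent,
    // so the combined availability is the product of the individual ones.
    public static double Composite(params double[] availabilities)
    {
        return availabilities.Aggregate(1.0, (acc, a) => acc * a);
    }

    public static void Main()
    {
        TimeSpan year = TimeSpan.FromDays(365);
        Console.WriteLine(Downtime(0.999, year).TotalHours);    // about 8.76 hours

        // External connectivity, storage and SQL Azure SLAs from the list above.
        double combined = Composite(0.9995, 0.999, 0.999);
        Console.WriteLine(combined);                             // about 0.9975
        Console.WriteLine(Downtime(combined, year).TotalHours);  // roughly 21.9 hours
    }
}
```

Under these assumptions, three dependencies that each promise about three nines leave the application itself well short of three nines.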

Page 13: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

The Truth About 9s

SLA = *

Page 14: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Define Your SLAs

Page 15: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Design for Failure

Given enough scale, time and pressure all components or services will fail

Your application will experience 1..N failures

How will your application behave?
• Gracefully handle failure modes, continue to deliver value
• Not so gracefully …

Fault types:
• Transient: temporary service interruptions, self-healing
• Enduring: require intervention

Page 16: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Failure Scope

Node: individual nodes may fail
• Connectivity issues (transient failures), hardware failures

Service: entire services may fail
• Service dependencies (internal and external), configuration and code issues

Region: regions may become unavailable
• Connectivity issues, acts of nature

Page 17: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Handling Transient and Enduring Failures
• Use fault-handling frameworks that recognize transient errors
• Make it part of the background “noise”
• Appropriate retry and backoff policies
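
As a rough illustration of retrying only transient errors, here is a minimal C# sketch. `TransientException` and `Retry.Execute` are hypothetical names. A real fault-handling framework would recognize transient errors from provider-specific error codes rather than from a marker exception type.

```csharp
using System;
using System.Threading;

// Hypothetical marker for errors that are worth retrying (timeouts, throttling).
public class TransientException : Exception
{
    public TransientException(string message) : base(message) { }
}

public static class Retry
{
    // Runs the operation, retrying up to maxRetries times when it throws a
    // TransientException. The delay doubles after every failed attempt.
    public static T Execute<T>(Func<T> operation, int maxRetries, TimeSpan initialDelay)
    {
        TimeSpan delay = initialDelay;
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (TransientException) when (attempt < maxRetries)
            {
                Thread.Sleep(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2);
            }
            // Any other exception (an enduring fault), or a transient one after
            // the last retry, propagates to the caller.
        }
    }

    public static void Main()
    {
        int calls = 0;
        string result = Execute(() =>
        {
            calls++;
            if (calls < 3) throw new TransientException("simulated timeout");
            return "ok";
        }, maxRetries: 3, initialDelay: TimeSpan.FromMilliseconds(10));
        Console.WriteLine(result + " after " + calls + " calls"); // ok after 3 calls
    }
}
```

Keeping the retry logic in one helper is what lets transient faults become background noise: callers wrap the operation once and only see failures that survive the policy.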

Page 18: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Handling Transient and Enduring Failures

Page 19: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.
Page 20: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Handling Transient and Enduring Failures

[Chart: Web Request Response Latency. Series: Avg Latency, Response latency]

• At some point, your request is blocking the line
• Fail gracefully, and get out of the queue!
• Anti-patterns:
  • Too much trust in downstream services and client proxies
  • Not bounding non-deterministic calls
  • Blocking synchronous operations
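
One way to bound a non-deterministic call is to race it against a timer and return a fallback when the timer wins. The C# sketch below uses `Task.WhenAny` for this. `Bounded.CallAsync` and the simulated slow dependency are hypothetical names for illustration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class Bounded
{
    // Waits for the call for at most `timeout`, then returns `fallback`
    // so the request can leave the queue instead of blocking it.
    public static async Task<T> CallAsync<T>(
        Func<CancellationToken, Task<T>> call, TimeSpan timeout, T fallback)
    {
        using (var cts = new CancellationTokenSource())
        {
            Task<T> work = call(cts.Token);
            Task finished = await Task.WhenAny(work, Task.Delay(timeout));
            if (finished == work)
                return await work;   // completed (or faulted) within the bound
            cts.Cancel();            // ask the slow call to stop
            return fallback;
        }
    }

    public static async Task Main()
    {
        // Hypothetical slow dependency that takes 5 seconds to answer.
        Func<CancellationToken, Task<string>> slow = async token =>
        {
            await Task.Delay(TimeSpan.FromSeconds(5), token);
            return "fresh data";
        };
        string result = await CallAsync(slow, TimeSpan.FromMilliseconds(100), "cached data");
        Console.WriteLine(result); // cached data
    }
}
```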

Page 21: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Sample Retry Policies

Platform       Context                                   Sample target e2e latency max   “Fast First”   Retry count   Delay    Backoff
SQL Database   Synchronous (e.g. render web page)        200 ms                          Yes            3             50 ms    Linear
SQL Database   Asynchronous (e.g. process queue item)    60 seconds                      No             4             5 s      Exponential
Azure Cache    Synchronous (e.g. render web page)        100 ms                          Yes            3             10 ms    Linear
Azure Cache    Asynchronous (e.g. process queue item)    500 ms                          Yes            3             100 ms   Exponential
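
To see how a row of these sample policies could translate into actual waits, the C# sketch below computes the delay before each retry. It assumes that “Fast First” means the first retry is issued immediately, and that linear and exponential backoff grow as 1x, 2x, 3x and 1x, 2x, 4x of the base delay. Real frameworks may use different formulas, often with added randomness. `RetrySchedule` and `Backoff` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public enum Backoff { Linear, Exponential }

public static class RetrySchedule
{
    // Returns the wait before each retry. Assumption: "fast first" means the
    // first retry is issued immediately, and the backoff applies afterwards.
    public static List<TimeSpan> Delays(
        int retryCount, TimeSpan delay, Backoff backoff, bool fastFirst)
    {
        var delays = new List<TimeSpan>();
        int step = 1; // counts the retries that actually wait
        for (int retry = 1; retry <= retryCount; retry++)
        {
            if (fastFirst && retry == 1)
            {
                delays.Add(TimeSpan.Zero);
                continue;
            }
            long factor = backoff == Backoff.Linear ? step : 1L << (step - 1);
            delays.Add(TimeSpan.FromTicks(delay.Ticks * factor));
            step++;
        }
        return delays;
    }

    public static void Main()
    {
        // SQL Database, synchronous: 3 retries, 50 ms, linear, fast first.
        foreach (TimeSpan d in Delays(3, TimeSpan.FromMilliseconds(50), Backoff.Linear, true))
            Console.WriteLine(d.TotalMilliseconds); // 0, 50, 100

        // Azure Cache, asynchronous: 3 retries, 100 ms, exponential, fast first.
        foreach (TimeSpan d in Delays(3, TimeSpan.FromMilliseconds(100), Backoff.Exponential, true))
            Console.WriteLine(d.TotalMilliseconds); // 0, 100, 200
    }
}
```

Summing a schedule gives a quick check that the waits alone stay inside the target end-to-end latency for that context.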

Page 22: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Circuit Breaker at Netflix

Criteria:
• A request to a remote service times out
• Thread pool and bounded task queue used to interact with a service dependency are at 100%
• Client library used to interact with a service dependency throws an exception

[Diagram labels: Criteria, Error Rate, Threshold, On, Off]
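
The following C# sketch shows the general circuit breaker idea and is not Netflix's implementation. As a simplification, it trips on a count of consecutive failures rather than on an error rate, treats any exception from the call as a failure, and is not thread-safe. All names are hypothetical.

```csharp
using System;

public class CircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openDuration;
    private readonly Func<DateTime> _clock;
    private int _failures;
    private DateTime _openedAt;

    public CircuitBreaker(int failureThreshold, TimeSpan openDuration,
                          Func<DateTime> clock = null)
    {
        _failureThreshold = failureThreshold;
        _openDuration = openDuration;
        _clock = clock ?? (() => DateTime.UtcNow);
    }

    // True while calls are being short-circuited.
    public bool IsOpen =>
        _failures >= _failureThreshold && _clock() - _openedAt < _openDuration;

    // Runs the call, or the fallback when the circuit is open or the call fails.
    public T Execute<T>(Func<T> call, Func<T> fallback)
    {
        if (IsOpen) return fallback();
        try
        {
            T result = call();
            _failures = 0;   // a success closes the circuit
            return result;
        }
        catch (Exception)
        {
            _failures++;
            if (_failures >= _failureThreshold) _openedAt = _clock();
            return fallback();
        }
    }
}

public static class CircuitBreakerDemo
{
    public static void Main()
    {
        var breaker = new CircuitBreaker(2, TimeSpan.FromSeconds(30));
        int attempts = 0;
        Func<string> failing = () => { attempts++; throw new TimeoutException(); };
        for (int i = 0; i < 5; i++)
            Console.WriteLine(breaker.Execute(failing, () => "fallback"));
        Console.WriteLine(attempts); // 2: later calls never reached the dependency
    }
}
```

Once the open period has elapsed, the next call is let through as a trial: a success resets the breaker, while a failure reopens it.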

Page 23: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Circuit Breaker at Netflix - Fallbacks

Page 24: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Deployment Redundancy

Page 25: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Failure Points

Focus on identifying design elements that are subject to external change. For example:
• Database connection
• Website connection
• Configuration file
• Registry key

Categories of common Failure Points: ACLs, database access, external web site/service access, transactions, configuration, capacity, network

Definition: design elements that can cause an outage.

Page 26: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Failure Modes

Examples of failure modes:
• Configuration file is not in correct location
• Too much traffic overusing resources
• Database reaches maximum capacity

The following would not be considered a failure mode:
• Product bugs
• Symptoms of problems
• Informational occurrences

Definition: a predictable root cause of the outage that occurs at a Failure Point.

Page 27: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Failure Mode Example

public int GetBusinessData(string[] parameters)
{
    try
    {
        // Failure point: configuration file
        var config = Config.Open(_configPath);
        // Failure point: database server
        var conn = ConnectToDB(config.ConnectString);
        // Failure point: database table (through the stored procedure)
        var data = conn.GetData(_sproc, parameters);
        return data;
    }
    catch (Exception)
    {
        // Every failure mode is reported as the same generic event
        WriteEventLogEvent(100, E_ExceptionInDal);
        throw;
    }
}

Potential Failure Points:
• Database Server
• Database Table
• Configuration File

Potential Failure Modes:
• DB Server not responding
• DB offline
• DB access denied
• Sproc execute denied
• DB doesn’t exist
• DB timeout on connect
• Index corrupt
• Database corrupt
• Table doesn’t exist
• Table corrupt
• Config file missing or invalid
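
The example logs a single generic event for every failure mode. One possible refinement, sketched below in C#, maps each exception to a distinct event ID before rethrowing. The event IDs, the `Dal` class and the exception-to-mode mapping are all hypothetical. Real data providers signal these conditions with their own exception types and error codes.

```csharp
using System;
using System.IO;

public static class Dal
{
    // Hypothetical event IDs, one per failure mode, so an operator can tell a
    // missing config file from a database timeout without reading a stack trace.
    public const int Unclassified = 100;
    public const int ConfigMissing = 101;
    public const int DbTimeout = 102;
    public const int DbAccessDenied = 103;

    // Maps an exception to the failure mode it most likely represents.
    public static int Classify(Exception e)
    {
        if (e is FileNotFoundException) return ConfigMissing;
        if (e is TimeoutException) return DbTimeout;
        if (e is UnauthorizedAccessException) return DbAccessDenied;
        return Unclassified;
    }

    // Wraps a data access call; logs the failure mode, then rethrows.
    public static T Guard<T>(Func<T> dataAccess, Action<int, string> log)
    {
        try
        {
            return dataAccess();
        }
        catch (Exception e)
        {
            log(Classify(e), e.Message);
            throw;
        }
    }

    public static void Main()
    {
        try
        {
            Guard<int>(() => { throw new TimeoutException("connect timed out"); },
                (id, message) => Console.WriteLine("event " + id + ": " + message));
        }
        catch (TimeoutException)
        {
            // Already logged as "event 102: connect timed out".
        }
    }
}
```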

Page 28: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Design for operations

Page 29: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Running a Live Site Service

Page 30: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Running without Insight / Telemetry

Page 31: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Capturing Insight

Log all internal/external “transactions” (database, web services, etc.)
• Application context (module/component)
• Host context (server/role/instance/process)
• Timing information (start/stop/duration)
• Activity identifier

Consolidate logs to central system / dashboard for health monitoring and troubleshooting

Page 32: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Capturing Insight
• Capture timing and context information through helper delegates (background noise)
• Capture contextual errors (inner exceptions, etc.) on error
• Logging library is asynchronous (fire-and-forget) to avoid blocking
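
A helper delegate of this kind might look like the C# sketch below. `Telemetry.Track`, the `Sink` field and the log format are hypothetical. A production logging library would batch entries and ship them to a central store instead of writing to the console.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class Telemetry
{
    // Hypothetical sink; replace with a writer to a central log store.
    public static Action<string> Sink = Console.WriteLine;

    // Runs the operation, recording its duration, context and outcome.
    public static T Track<T>(string component, string operation, Func<T> body)
    {
        Guid activityId = Guid.NewGuid();
        Stopwatch timer = Stopwatch.StartNew();
        string outcome = "ok";
        try
        {
            return body();
        }
        catch (Exception e)
        {
            // Keep the innermost exception: it usually names the real cause.
            outcome = "error: " + e.GetBaseException().Message;
            throw;
        }
        finally
        {
            string entry = string.Format("{0} {1}.{2} {3}ms {4} host={5}",
                activityId, component, operation, timer.ElapsedMilliseconds,
                outcome, Environment.MachineName);
            // Fire-and-forget so that logging never blocks the caller.
            Task.Run(() => Sink(entry));
        }
    }

    public static void Main()
    {
        int rows = Track("Orders", "LoadOrders", () => 42);
        Console.WriteLine(rows);
        Thread.Sleep(100); // demo only: give the background log entry time to print
    }
}
```

Because the timing and context capture lives in one wrapper, every database or web service call can be instrumented the same way with a single line at the call site.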

Page 33: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Many Options

Windows Azure Diagnostics

Page 34: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Designing for Insight

Instrument for production logging
• If you didn’t capture it, it didn’t happen

Implement inter-service monitoring and alerting
• Capture and quantify inter-service behavior and activity

Run-time configurable logging
• Enable activation (capture or delivery) of additional channels at run-time

Page 35: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Define ALM

[Diagram: ALM cycle. Environments: Dev Fabric, Dev on Azure, QA/Pre-release on Azure, Production Release on Azure. Steps: Code, Unit Test, Run, Check In, Build (CI), Automated Test, Deploy, Test, Stage, Monitor, Log Defect, Defect/Feature Triage, Plan Fixes/Updates, Plan, Design, Scope]

Page 36: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Updating Configuration

For a production service, configuration == code.

Need a rigorous ALM process for rolling out (and rolling back) updates to both.

Page 37: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Updating Services

“We want global, simultaneous production rollouts of our new code”
• Are you sure about that?

Production rollouts:
• Running N, N+1 concurrently
• Rolling load over to N+1, with the ability to fall back
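
One way to roll load over gradually while versions N and N+1 run side by side is to assign each user to a stable bucket and send a configurable percentage of buckets to N+1. The C# sketch below illustrates the idea. The `Rollout` class, the user-ID routing key and the FNV-1a style hash are assumptions made for illustration and are not taken from the slides.

```csharp
using System;

public static class Rollout
{
    // Stable bucket in [0, 100) for a user, using an FNV-1a style hash.
    // string.GetHashCode is avoided because it can differ between processes.
    public static int Bucket(string userId)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (char c in userId)
                hash = (hash ^ c) * 16777619;
            return (int)(hash % 100);
        }
    }

    // Sends percentOnNext percent of users to version N+1, the rest to N.
    // Setting percentOnNext back to 0 is the fallback path.
    public static string VersionFor(string userId, int percentOnNext)
    {
        return Bucket(userId) < percentOnNext ? "N+1" : "N";
    }

    public static void Main()
    {
        foreach (int percent in new[] { 0, 10, 50, 100 })
            Console.WriteLine(percent + "% -> " + VersionFor("user-42", percent));
    }
}
```

Because the bucket depends only on the user ID, a given user stays on the same version as the percentage grows, which keeps behavior consistent during the rollout.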

Page 38: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

What is a health model?

Managed Entity
• Logical piece of an application
• A component that makes sense to an operator
• Each entity has a health state
• Entities can be external or internal
• Multiple instances of an entity may exist

Aspect
• Break down health state by functional team
• Must be mutually exclusive
• Group by organizational responsibility, e.g. security, performance, backup
• May be specific or non-technology, e.g. orders shipped

Operational Condition
• Defines level of operation currently available
• Normal state is fully functional
• Well-designed applications may support partial operation, e.g. read only
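
The three concepts can be sketched as a small data model. In the C# below, the type names mirror the slide terms, while the specific conditions (`ReadOnly`, `Offline`) and the rule that an entity is only as healthy as its worst aspect are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Ordered from best to worst so that the worst condition is the maximum.
public enum OperationalCondition { Normal, ReadOnly, Offline }

public class ManagedEntity
{
    public string Name { get; }

    // One condition per aspect (e.g. "security", "performance", "backup").
    public Dictionary<string, OperationalCondition> Aspects { get; } =
        new Dictionary<string, OperationalCondition>();

    public ManagedEntity(string name) { Name = name; }

    // Assumption: an entity is only as healthy as its worst aspect.
    public OperationalCondition Health =>
        Aspects.Count == 0 ? OperationalCondition.Normal : Aspects.Values.Max();
}

public static class HealthDemo
{
    public static void Main()
    {
        var db = new ManagedEntity("OrdersDatabase");
        db.Aspects["performance"] = OperationalCondition.Normal;
        db.Aspects["backup"] = OperationalCondition.ReadOnly;
        Console.WriteLine(db.Name + ": " + db.Health); // OrdersDatabase: ReadOnly
    }
}
```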

Page 39: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Troubleshooting Workflow

Detection: Is there a problem?

Classification: What’s not working, how bad is it?

Diagnosis: Why is there a problem?

Recovery: What needs to be done to fix it?

Verification: Is the problem really gone?

Page 40: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Resources
• Failsafe: Guidance for Resilient Cloud Architectures (http://msdn.microsoft.com/en-us/library/jj853352.aspx)
• Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services (http://msdn.microsoft.com/en-us/library/windowsazure/jj717232.aspx)
• Designing and Deploying Internet Scale Services (https://www.usenix.org/events/lisa07/tech/full_papers/hamilton/hamilton.pdf)

Page 41: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Design for Scale

Page 42: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Scale

Resources

Demands

Unit of Scale

Workloads

Page 43: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Scale by Units

Page 44: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

[Chart labels: Workload 1, Workload 2; Bottom, Ramp, Peak]

Page 45: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Data Partitioning

Page 46: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Understanding the 3Vs

Page 47: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Understanding Queryability

Page 48: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Horizontal Partitioning

Page 49: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Vertical Partitioning

Page 50: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Hybrid Partitioning

Page 51: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Data – to cache or not to cache….

Page 52: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Push vs. Pull

Load Balanced Push
• Sync and good for sequential processing
• Dependent on downstream services
• Throttling vs. performance

Managed Pull/Throughput
• Asynchronous and event-driven processing
• Easy parallelisation and pipelining
• Extending logic is easy

Pull strategies:
• Logic based: priority, date, amount, etc.
• Time based: ASAP, gradually, periodically, on-demand
• Volume based: single, in batches
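
A managed pull can be as simple as a worker that takes items from a bounded queue at its own pace. The C# sketch below pulls in batches from a `BlockingCollection<T>`. The `PullWorker` class, the batch size and the capacity are hypothetical, and an in-process queue stands in for whatever durable queue a real service would use.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public static class PullWorker
{
    // Pulls up to batchSize items, waiting at most `wait` for the first one.
    public static List<T> PullBatch<T>(
        BlockingCollection<T> queue, int batchSize, TimeSpan wait)
    {
        var batch = new List<T>();
        T item;
        if (!queue.TryTake(out item, wait))
            return batch;                    // nothing to do right now
        batch.Add(item);
        while (batch.Count < batchSize && queue.TryTake(out item))
            batch.Add(item);
        return batch;
    }

    public static void Main()
    {
        // Bounded queue: producers block (are throttled) once 100 items are waiting.
        var queue = new BlockingCollection<int>(boundedCapacity: 100);
        for (int i = 1; i <= 5; i++)
            queue.Add(i);

        List<int> batch = PullBatch(queue, 3, TimeSpan.FromMilliseconds(50));
        Console.WriteLine(string.Join(",", batch)); // 1,2,3
    }
}
```

The consumer decides when and how much to pull, so a slow downstream service delays the queue instead of failing the producer.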

Page 53: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Data on the inside – Data on the outside
http://msdn.microsoft.com/en-us/library/ms954587.aspx

Reference Data
• Immutable (versions)
• Requires open schema for interop

Activity Data
• Low concurrency updates (e.g. shopping basket)

Resource (shared) Data
• Highly concurrent update (e.g. inventory)
• Should live in worker role

Page 54: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

“Query Ready” Cache

Query patterns
• Push the data close to where it is queried (example: BING Maps)
• Process, structure, produce, format, etc. data and cache “query ready” data
• Light/cheap data production is OK
• Pure and idempotent operations are usually good candidates
• Duplication is OK
  • Same data in a different format
  • Same data in multiple places
• This requires processing data before it is queried, NOT at query time
• All data can be cached
• Some data can be cached:
  • Frequently used
  • Process-heavy, expensive data
  • Build as you go
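
The idea of processing data before it is queried can be sketched as a cache that is filled on the write path and only read on the query path. In the C# below, `QueryReadyCache`, the region key and the summary format are hypothetical.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

public class QueryReadyCache
{
    private readonly ConcurrentDictionary<string, string> _ready =
        new ConcurrentDictionary<string, string>();

    // Called when the source data changes: do the expensive shaping here,
    // once, instead of on every query.
    public void Publish(string region, IEnumerable<int> prices)
    {
        List<int> sorted = prices.OrderBy(p => p).ToList();
        _ready[region] = sorted.Count == 0
            ? "no data"
            : "min=" + sorted.First() + " max=" + sorted.Last() + " count=" + sorted.Count;
    }

    // Query path: a dictionary lookup with no processing.
    public string Query(string region)
    {
        string summary;
        return _ready.TryGetValue(region, out summary) ? summary : "no data";
    }
}

public static class QueryReadyDemo
{
    public static void Main()
    {
        var cache = new QueryReadyCache();
        cache.Publish("west", new[] { 30, 10, 20 });
        Console.WriteLine(cache.Query("west")); // min=10 max=30 count=3
        Console.WriteLine(cache.Query("east")); // no data
    }
}
```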

Page 55: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Distributed Caching

Simple to administer
• No need to manage and host a distributed cache yourself

Integrates easily into existing applications
• ASP.NET session state and output cache providers enable no-code integration

Same managed interfaces as Windows Server AppFabric Cache

[Diagram: an On-Premises App and a Windows Azure App, each with Core Logic calling the AppFabric Cache APIs; the on-premises app is backed by Windows Server AppFabric Cache, the Windows Azure app by Windows Azure AppFabric Caching]

Page 56: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Data Resiliency

Page 57: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Backup and Restore

Page 58: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Backing Up Table and Blob Storage

[Diagram labels: Source Replica, Log, Log Replica]

Page 59: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

Managing Backed Up Data

Page 60: Mark Simms (@mabsimms) Principal Program Manager Windows Azure Customer Advisory Team.

CDN

[Diagram: Content Delivery Network with the Blob Service and three Edge Locations serving pic1.jpg]