Windows Azure Compute Brad Calder General Manager Windows Azure

Feb 25, 2016

Page 1: Windows Azure Compute

Windows Azure Compute
Brad Calder
General Manager, Windows Azure

Page 2: Windows Azure Compute

Large-Scale Workload Patterns

[Figure: four charts of compute usage over time, each with an average-usage line]

“On and Off” (RiskMetrics)
• On and off workloads (e.g. batch job), separated by inactivity periods
• Over-provisioned capacity is wasted

“Growing Fast” (Docs.com on Facebook)
• Successful services need to grow/scale
• Keeping up with growth is a big IT challenge
• Growth can be hard to predict

“Unpredictable Bursting” (Twitter)
• Unexpected/unplanned peak in demand
• Sudden spike impacts performance
• Hard/costly to over-provision for extreme cases

“Predictable Bursting” (Walmart)
• Services with micro seasonality trends
• Peaks due to periodic increased demand
• Wasted capacity off season

Page 3: Windows Azure Compute

Computing Demand Daily Fluctuation

[Figure: Bing searches, Japan vs. Great Britain. Query volume over one day, from 12:00 AM to 10:48 PM. Source: Microsoft]

Page 4: Windows Azure Compute

Computing Demand Yearly Variability

[Figure: two traffic charts, Jan 2009 to Jan 2010. Source: Alexa]
• Retail sites (target.com, walmart.com, toysrus.com, barnesandnoble.com): ~4x normal load (holiday shopping)
• Tax sites (turbotax.com, taxcut.com, hrblock.com, taxact.com): ~10x normal load (tax season)

Page 5: Windows Azure Compute

What is a “Cloud”?
• Cloud: on-demand, scalable compute and storage resources

[Figure: two charts of demand over time. Self Server Provisioning: capacity is alternately overprovisioned and underprovisioned. Cloud Provisioning: capacity follows demand.]
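The gap between fixed and on-demand provisioning can be made concrete with a small calculation. The sketch below is illustrative only: the demand series and the function name `provisioning_gap` are hypothetical, not taken from the talk. It counts the capacity that sits idle (overprovisioned) and the demand that goes unserved (underprovisioned) for a fixed capacity.

```python
def provisioning_gap(demand, capacity):
    """Return (wasted, unmet) capacity-hours for a fixed capacity.

    wasted: provisioned capacity that demand never used (overprovisioned).
    unmet:  demand that exceeded the provisioned capacity (underprovisioned).
    """
    wasted = sum(max(capacity - d, 0) for d in demand)
    unmet = sum(max(d - capacity, 0) for d in demand)
    return wasted, unmet


# Hypothetical hourly demand, in servers, with a single burst.
demand = [2, 2, 3, 10, 3, 2]

# Provisioning for the peak serves every request but leaves capacity idle.
peak_wasted, peak_unmet = provisioning_gap(demand, max(demand))

# Provisioning for the (rounded) average wastes less but drops load in the burst.
avg_capacity = round(sum(demand) / len(demand))
avg_wasted, avg_unmet = provisioning_gap(demand, avg_capacity)

print(f"peak-sized: wasted={peak_wasted}, unmet={peak_unmet}")
print(f"avg-sized:  wasted={avg_wasted}, unmet={avg_unmet}")
```

Capacity that tracks demand hour by hour would drive both numbers toward zero, which is the point of the cloud provisioning chart.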

Page 6: Windows Azure Compute

What is Under the Covers of a Service

Business logic
Service “glue”
Buy and provision hardware
Datacenter (Power and Cooling)
Respond to hardware failures
Monitoring and alerting infrastructure
Reliable/Secure storage and computation
Metering and billing infrastructure
Live upgrades and OS patches
Add compute/storage capacity on the fly
Overprovision for peak traffic

Page 7: Windows Azure Compute

What is Windows Azure?
An operating system for the cloud:

[Figure: Service 1, Service 2, Service 3, …, Service N]

Page 8: Windows Azure Compute

Cloud Terminology
• Infrastructure as a Service (IaaS): basic compute and storage resources
  • On-demand servers
  • Amazon EC2, VMware vCloud, etc.
• Platform as a Service (PaaS): cloud application infrastructure
  • On-demand application-hosting environment
  • Google AppEngine, Salesforce.com, Windows Azure, etc.
• Software as a Service (SaaS): cloud applications
  • On-demand applications
  • Office 365, GMail, etc.

Page 9: Windows Azure Compute

IaaS

[Figure: Developer/Ops works from a library of VM images to build VMs, each with its own operating system, hosting the web server, application, DBMS, and data, behind a load balancer]

1) Choose image, then create VM for DBMS and configure DBMS
2) Choose image, then create and configure VM(s) for application
3) Provision database, then create tables and add data
4) Install application
5) Configure load balancer
6) Manage VMs and DBMS (e.g., deploying new OS images in VMs)

Page 10: Windows Azure Compute

PaaS

[Figure: Developer/Ops deploys to a platform that supplies the VMs, operating systems, web server, DBMS, data, and load balancer]

1) Provision database, then create tables and add data
2) Deploy application w/ service model
3) Automated Service Management

Page 11: Windows Azure Compute

Windows Azure
• Windows Azure is an OS for the data center
  • Handles resource management, provisioning, and monitoring
  • Manages application lifecycle
  • Allows developers to concentrate on business logic
• Provides common building blocks for distributed applications
  • Reliable queuing
  • Simple unstructured and structured storage
  • SQL storage
  • Application services like access control, caching, and connectivity

Page 12: Windows Azure Compute

Windows Azure Platform

[Figure: platform stack. Layers: Windows Azure Applications, Windows Azure Compute, Windows Azure Middleware Services, Windows Azure Data Services. Components shown: Fabric Controller, Windows Azure Networking, AppFabric Caching, AppFabric Access Control Server, AppFabric Service Bus, SQL Azure, Windows Azure Storage, Windows Azure CDN.]

Page 13: Windows Azure Compute

Fabric Controller
• Owns all the hardware in the data center
• Uses the inventory to host services
• Similar to what a per machine operating system does with applications
• Provisions the hardware as necessary
• Maintains the health of the hardware
• Deploys applications to free resources
• Maintains the health of those applications

Page 14: Windows Azure Compute

Windows Azure Fabric Controller

[Figure: highly-available Fabric Controller with hardware control and software control paths. Elements shown: switches, load-balancers, and nodes running the WS08 hypervisor with VMs and a fabric agent.]

Page 15: Windows Azure Compute

Scaling with the Fabric Controller Service Model

Page 16: Windows Azure Compute

Scaling
• There are two basic scaling models: Scale Up and Scale Out

[Figure: Scale Up shown as one larger compute unit; Scale Out shown as several compute units side by side]

Page 17: Windows Azure Compute

Scaling Lessons
• Use few, well-defined scaling units
• Define scaling boundaries
• Scale out those units as needed

Page 18: Windows Azure Compute

Scale-Out Applications

[Figure: a network load balancer in front of stateless front ends and stateless “workers”, both of which scale out. Already-provided scalable storage: shared filesystem (Azure Blobs), partitioned RDBMS (SQL Azure), key/value datastore (Azure Tables), and Azure Queues.]

Page 19: Windows Azure Compute

The Windows Azure Service Model
• A Windows Azure application is called a “service”
  • Definition information
  • Configuration information
  • At least one “role”
• A role is the scaling boundary within a service
  • Roles are like DLLs in the service “process”
  • Collection of code with an entry point that runs in its own virtual machine
• Virtual machine is scale unit
  • Role code runs in a virtual machine
  • Role scales by instances of a virtual machine size

[Figure: LB, Front End, Middle Tier, Durable Store]

Page 20: Windows Azure Compute

Multi-Tier Cloud Application
• A cloud application is typically made up of different components
  • Front end: e.g. load-balanced stateless web servers
  • Middle worker tier: e.g. order processing, encoding
  • Backend storage: e.g. Azure Blobs, Azure Tables, SQL Azure
• Multiple instances of each for scalability and availability
  • Requires at least 2 instances of each to achieve the SLA

[Figure: HTTP/HTTPS traffic reaches a load balancer, then the front-end instances and the middle tier of the cloud application, backed by Windows Azure Storage and SQL Azure]

Page 21: Windows Azure Compute

Service Model and Role Contents
• Definition:
  • Role name
  • Role type
  • VM size (e.g. small, medium, etc.)
  • Network endpoints
• Configuration:
  • Number of instances
  • Number of update and fault domains
• Code:
  • Web/Worker Role: Hosted DLL and other executables
  • VM Role: VHD

Service Model
Role: Front-End
  Definition: Type: Web; VM Size: Small; Endpoints: External-1
  Configuration: Instances: 2; Update Domains: 3; Fault Domains: 2
Role: Middle-Tier
  Definition: Type: Worker; VM Size: Large; Endpoints: Internal-1
  Configuration: Instances: 3; Update Domains: 3; Fault Domains: 2

Page 22: Windows Azure Compute

Service Model Files
• Service definition is in ServiceDefinition.csdef
• Service configuration is in ServiceConfiguration.cscfg
• CSPack program zips service binaries and definition into the Service Package File (service.cspkg)

Page 23: Windows Azure Compute

ServiceDefinition.csdef

<?xml version="1.0" encoding="utf-8"?>
<ServiceDefinition name="Sample" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition" upgradeDomainCount="3">
  <WorkerRole name="Middle-Tier" vmsize="Large">
    <Endpoints>
      <InternalEndpoint name="Internal-1" protocol="tcp" />
    </Endpoints>
  </WorkerRole>
  <WebRole name="Front-End" vmsize="Small">
    <Sites>
      <Site name="Web">
        <Bindings>
          <Binding name="Endpoint1" endpointName="External-1" />
        </Bindings>
      </Site>
    </Sites>
    <Endpoints>
      <InputEndpoint name="External-1" protocol="http" port="80" />
    </Endpoints>
    <Imports>
      <Import moduleName="Diagnostics" />
    </Imports>
  </WebRole>
</ServiceDefinition>

Page 24: Windows Azure Compute

ServiceConfiguration.cscfg

<?xml version="1.0" encoding="utf-8"?>
<ServiceConfiguration serviceName="Sample" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration" osFamily="2" osVersion="*">
  <Role name="Middle-Tier">
    <Instances count="3" />
  </Role>
  <Role name="Front-End">
    <Instances count="2" />
    <ConfigurationSettings>
      <Setting name="Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString" value="DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=key" />
    </ConfigurationSettings>
  </Role>
</ServiceConfiguration>
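Because the service configuration is plain XML, tooling can read it directly. The following is a minimal sketch, not part of the original deck, that uses Python's standard `xml.etree.ElementTree` to pull the per-role instance counts out of a trimmed copy of the file above. The helper name `instance_counts` and the `sc` namespace prefix are illustrative choices; the XML declaration and the diagnostics setting are omitted to keep the sample short.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the ServiceConfiguration element above.
# The "sc" prefix is an arbitrary local alias used only for lookups.
NS = {"sc": "http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration"}

# Trimmed copy of the slide's ServiceConfiguration.cscfg.
CSCFG = """<ServiceConfiguration serviceName="Sample"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration"
    osFamily="2" osVersion="*">
  <Role name="Middle-Tier">
    <Instances count="3" />
  </Role>
  <Role name="Front-End">
    <Instances count="2" />
  </Role>
</ServiceConfiguration>"""


def instance_counts(cscfg_text):
    """Map each role name to its configured instance count."""
    root = ET.fromstring(cscfg_text)
    counts = {}
    for role in root.findall("sc:Role", NS):
        instances = role.find("sc:Instances", NS)
        counts[role.get("name")] = int(instances.get("count"))
    return counts


print(instance_counts(CSCFG))
```

Keeping instance counts in the configuration file rather than the definition file is what lets a role be scaled out without repackaging the service.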

Page 25: Windows Azure Compute

Windows Azure Push-button Deployment
• Step 1: Allocate VMs/nodes, VDIPs/VIPs
  • Across fault domains
  • Across update domains
• Step 2: Place role images on nodes
• Step 3: Start roles in VM instances
• Step 4: Configure load-balancers
• Step 5: Maintain desired number of role instances
  • Failed roles automatically restarted
  • Node failure results in new VMs automatically allocated

[Figure: allocation across fault and update domains, behind load-balancers]

Page 26: Windows Azure Compute

FC Automated Management
• Windows Azure FC monitors the health of roles
  • FC detects if a role dies
  • Restarts the role to bring it back to a healthy state
• If a failed node can’t be recovered, FC migrates role instances to a new node
  • A suitable replacement location is found
  • Existing role instances are notified of the configuration change

Page 27: Windows Azure Compute

Availability and Fault/Upgrade Domains

Page 28: Windows Azure Compute

Availability Service Level Agreements (SLA)

• Windows Azure Platform SLAs:
  • Compute External Connectivity: 99.95% (2 or more instances)
  • Storage Availability: 99.9%
  • SQL Azure Availability: 99.9%

Availability % | Downtime per year | Downtime per month* | Downtime per week
99% ("two nines") | 3.65 days | 7.20 hours | 1.68 hours
99.9% ("three nines") | 8.76 hours | 43.2 minutes | 10.1 minutes
99.99% ("four nines") | 52.56 minutes | 4.32 minutes | 1.01 minutes
99.999% ("five nines") | 5.26 minutes | 25.9 seconds | 6.05 seconds
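The downtime figures follow directly from the availability percentage. As a hedged illustration (the function name and the period lengths are assumptions, chosen because a 365-day year and a 30-day month reproduce the table's numbers), the allowed downtime is simply the unavailable fraction multiplied by the length of the period:

```python
# Period lengths in hours. The 30-day month and 365-day year are assumptions
# that happen to reproduce the figures in the table above.
HOURS = {"year": 365 * 24, "month": 30 * 24, "week": 7 * 24}


def downtime_seconds(availability_percent, period_hours):
    """Seconds of allowed downtime in a period at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_hours * 3600


for sla in (99.0, 99.9, 99.95, 99.99, 99.999):
    per_year_hours = downtime_seconds(sla, HOURS["year"]) / 3600
    per_month_minutes = downtime_seconds(sla, HOURS["month"]) / 60
    print(f"{sla}%: {per_year_hours:.2f} h/year, {per_month_minutes:.2f} min/month")
```

By the same arithmetic, the 99.95% compute connectivity SLA corresponds to roughly 21.6 minutes per 30-day month.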

Page 29: Windows Azure Compute

Maintaining Availability: Assume Failure
• Hardware fails
  • 3-5% of servers experience failures annually
• Software fails
  • Inevitable in any evolving, complex system
• Tolerating failure means:
  • Redundancy where possible
  • Need to build in retries and backoff
  • Fast recovery
  • Big red buttons
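The "retries and backoff" advice can be sketched in a few lines. This is one common pattern (exponential backoff with full jitter), shown as an assumption rather than anything the talk prescribes; the function name, the default limits, and the injectable `sleep` parameter are all illustrative.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.5,
                      max_delay=30.0, sleep=time.sleep):
    """Call operation(), retrying failures with exponential backoff and jitter.

    The delay ceiling doubles after each failed attempt, capped at max_delay.
    The actual wait is a random value below that ceiling so that many clients
    recovering at once do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # A real client would catch only errors known to be transient.
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))


# Usage: a hypothetical operation that fails twice before succeeding.
attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


waits = []
result = call_with_retries(flaky, sleep=waits.append)  # record waits, do not sleep
```

Passing `sleep` in as a parameter keeps the sketch testable without real delays.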

Page 30: Windows Azure Compute

Hardware Redundancy

[Figure: datacenter topology. Datacenter routers connect to aggregation routers and load balancers (Agg, LB), which connect to racks. Each rack has a top-of-rack switch (TOR), a power distribution unit (PDU), and a set of nodes.]

Top of Rack Switch is a Single Point of Failure

Page 31: Windows Azure Compute

Maintaining Availability: Fault Domains
• Avoid single points of failure
• Unit of failure based on data center topology is a rack
  • E.g. top-of-rack switch on a rack of machines
• Windows Azure considers fault domains when allocating service roles
  • At least 2 fault domains per service
  • Will try to spread roles out across more

[Figure: Front-End-1, Front-End-2, Middle Tier-1, Middle Tier-2, and Middle Tier-3 spread across Fault Domains 1 to 3]
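One simple way to picture the spreading is round-robin assignment. The deck does not describe the Fabric Controller's actual placement algorithm, so the sketch below is only an assumption-laden illustration; the function name and the 1-based domain numbering are hypothetical.

```python
def spread_across_fault_domains(instances, fault_domain_count=2):
    """Assign role instances to fault domains round-robin.

    Returns a dict mapping fault domain number (1-based) to instance names.
    Round-robin is an illustrative policy, not the Fabric Controller's.
    """
    if fault_domain_count < 2:
        raise ValueError("a service needs at least 2 fault domains")
    placement = {fd: [] for fd in range(1, fault_domain_count + 1)}
    for index, instance in enumerate(instances):
        placement[index % fault_domain_count + 1].append(instance)
    return placement


front_end = spread_across_fault_domains(["Front-End-1", "Front-End-2"], 2)
middle_tier = spread_across_fault_domains(
    ["Middle Tier-1", "Middle Tier-2", "Middle Tier-3"], 3)
print(front_end)
print(middle_tier)
```

With two instances in two fault domains, losing one rack (for example to a top-of-rack switch failure) still leaves one instance of the role serving traffic.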

Page 32: Windows Azure Compute

Update Domains
• Update domains specify what percentage of your service you will take offline for an upgrade
  • Specify the # of update domains for your service
  • Default is 5 and max is 20
• Role instances are evenly assigned to update domains
• Used to update only one domain at a time
  • Rolling update

[Figure: upgrade domains allocated across fault domains]

Page 33: Windows Azure Compute

Service Deployment and Maintenance

Page 34: Windows Azure Compute

Containing Failure: Datacenter Clusters
• Datacenters are divided into “clusters”
  • Approximately 1000 rack-mounted servers (we call them “nodes”)
  • Provides a unit of fault isolation
  • Each cluster is managed by a Fabric Controller (FC)

[Figure: Cluster 1, Cluster 2, …, Cluster n, each with its own FC, connected by the datacenter network]

Page 35: Windows Azure Compute

The Fabric Controller (FC)
• The “kernel” of the cloud operating system
  • Manages datacenter hardware
  • Manages Windows Azure services
• Four main responsibilities:
  • Datacenter resource allocation
  • Datacenter resource provisioning
  • Service lifecycle management
  • Service health management
• Inputs:
  • Description of the hardware and network resources it will control
  • Service model and binaries for cloud applications

Page 36: Windows Azure Compute

Service Deployment Steps
• Process service model files
  • Determine resource requirements
  • Create role images
• Allocate compute and network resources
• Prepare nodes
  • Place role images on nodes
  • Create virtual machines
  • Start virtual machines and roles
• Configure networking
  • Dynamic IP addresses (DIPs) assigned to nodes
  • Virtual IP addresses (VIPs) + ports allocated and mapped to sets of DIPs
  • Configure packet filter for VM to VM traffic within service
  • Program load balancers to allow traffic to external endpoints

Page 37: Windows Azure Compute

Service Resource Allocation
• Goal: allocate service components to available resources while satisfying all hard constraints
  • Size of VM
  • HW requirements: CPU, Memory, Storage, Network
  • Upgrade domains
  • Fault domains
• Secondary goal: Satisfy soft constraints
  • Optimize network proximity: pack different roles into same node
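The split between hard and soft constraints can be illustrated with a small greedy placement routine. Everything here is a hypothetical sketch: the `Node` type, the use of free cores as the only hardware requirement, and the greedy tie-breaking are assumptions, and the real allocator is not described in the deck. Hard constraints filter the candidate nodes; the remaining preferences only order them.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    fault_domain: int
    free_cores: int
    roles: set = field(default_factory=set)


def place_role(nodes, role, count, cores_per_instance):
    """Greedily place `count` instances of `role`; return chosen node names."""
    used_domains = {}
    chosen = []
    for _ in range(count):
        # Hard constraint: the node must have room for the VM size.
        candidates = [n for n in nodes if n.free_cores >= cores_per_instance]
        if not candidates:
            raise RuntimeError("no node satisfies the hard constraints")
        # Ordering: least-used fault domain first (spreads the role), then
        # the soft constraint: prefer nodes already hosting a different role.
        best = min(
            candidates,
            key=lambda n: (used_domains.get(n.fault_domain, 0),
                           0 if n.roles - {role} else 1),
        )
        best.free_cores -= cores_per_instance
        best.roles.add(role)
        used_domains[best.fault_domain] = used_domains.get(best.fault_domain, 0) + 1
        chosen.append(best.name)
    return chosen


# Hypothetical three-node cluster spanning two fault domains.
nodes = [Node("n1", 1, 8), Node("n2", 2, 8), Node("n3", 1, 2)]
middle = place_role(nodes, "Middle-Tier", count=3, cores_per_instance=4)
front = place_role(nodes, "Front-End", count=2, cores_per_instance=2)
print(middle, front)
```

In this run the first front-end instance lands on a node that already hosts the middle tier, which is the network-proximity preference from the slide; the second goes to a different fault domain.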

Page 38: Windows Azure Compute

Deploying a Service

Role A: Front-End Role (Front End); Count: 2; Update Domains: 3; Size: Medium
Role B: Middle-Tier Role; Count: 3; Update Domains: 3; Size: Large

[Figure: www.mycloudapp.net served through a load balancer; addresses 10.100.0.36 and 10.100.0.122 are shown; role instances are laid out by fault domain and upgrade domain]

Page 39: Windows Azure Compute

Inside a Deployed Node

[Figure: a physical node runs a host partition with the FC host agent and several guest partitions, each with a guest agent and a role instance. The Fabric Controller primary and its replicas manage the node. A trust boundary and an image repository (OS VHDs, role ZIP files) are also shown.]

Page 40: Windows Azure Compute

Detection: Load Balancer Operation
• FC programs load balancers (LB) to “probe” guest agent (GA) every 15 seconds
  • If the guest misses two probes, the LB stops forwarding traffic
• The role can report “busy” status to the GA
  • GA stops responding to probes
• LB keeps an idle connection open for 60s
  • Use keep-alive commands if the connection needs to be open longer
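The probe rule above amounts to a small state machine per instance. The sketch below models only that rule; the class name is hypothetical, and the choice to count consecutive misses and to restore traffic on the next successful probe is an assumption, since the slide does not say how an instance returns to rotation.

```python
PROBE_INTERVAL_SECONDS = 15        # from the slide: GA is probed every 15 seconds
MISSED_PROBES_BEFORE_REMOVAL = 2   # from the slide: two missed probes stop traffic


class ProbeTracker:
    """Tracks whether one role instance should receive load-balanced traffic."""

    def __init__(self):
        self.missed = 0
        self.in_rotation = True

    def record_probe(self, responded):
        """Record one probe result and return the current rotation state."""
        if responded:
            # Assumption: a successful probe restores the instance immediately.
            self.missed = 0
            self.in_rotation = True
        else:
            self.missed += 1
            if self.missed >= MISSED_PROBES_BEFORE_REMOVAL:
                self.in_rotation = False
        return self.in_rotation


# A role that reports "busy" makes the GA stop answering, so it drains out
# after two probe intervals and rejoins once it answers again.
tracker = ProbeTracker()
history = [tracker.record_probe(ok) for ok in (True, False, False, True)]
worst_case_detection = PROBE_INTERVAL_SECONDS * MISSED_PROBES_BEFORE_REMOVAL
print(history, worst_case_detection)
```

Under these assumptions an unresponsive instance keeps receiving traffic for up to about 30 seconds, which is one reason clients still need their own retries.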

Page 41: Windows Azure Compute

Recovery: Server and Role Health
• FC maintains service availability by monitoring the software and hardware health
  • Based primarily on heartbeats
  • Automatically “heals” affected roles

Problem | Fabric Detection | Fabric Response
Role instance crashes | FC guest agent monitors role termination | FC restarts role
Guest VM or agent crashes | FC host agent notices missing guest agent heartbeats | FC restarts VM and hosted role
Host OS or agent crashes | FC notices missing host agent heartbeat | Tries to recover node; FC reallocates roles to other nodes
Detected node hardware issue | Host agent informs FC | FC migrates roles to other nodes; marks node “out for repair”

Page 42: Windows Azure Compute

Updating Your Service
• There are two update types:
  • In-place: used for large scale services and used to update services with local state
  • VIP swap: for ease of testing and fail-back for smaller services
• In-place (rolling) update:
  • Role instances updated one update domain at a time
  • Two modes: automatic and manual

Page 43: Windows Azure Compute

In-Place Update
• Purpose: Ensure service stays up while updating
  • Used by Windows Azure OS updates
• System considers update domains when upgrading a service
  • 1/(number of update domains) = fraction of service that will be offline
  • Default is 5 and max is 20
• The Windows Azure SLA is based on at least two update domains and two role instances of each role

[Figure: Front-End-1, Front-End-2, Middle Tier-1, Middle Tier-2, and Middle Tier-3 distributed across Update Domains 1 to 3]
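A rolling update over update domains can be sketched as follows. This is an illustrative model only: the function names are hypothetical and round-robin assignment is an assumption standing in for "evenly assigned". It shows why the fraction of a role offline at any moment is about one over the number of update domains.

```python
def assign_update_domains(instances, update_domain_count=5):
    """Spread instances evenly (round-robin) over the update domains."""
    if not 1 <= update_domain_count <= 20:  # the deck gives 20 as the maximum
        raise ValueError("update domain count must be between 1 and 20")
    domains = [[] for _ in range(update_domain_count)]
    for index, instance in enumerate(instances):
        domains[index % update_domain_count].append(instance)
    return domains


def rolling_update(domains, update_instance):
    """Update one domain at a time; return the largest fraction offline."""
    total = sum(len(domain) for domain in domains)
    worst_fraction = 0.0
    for domain in domains:
        # Only the instances of this domain are offline during this step.
        for instance in domain:
            update_instance(instance)
        if total:
            worst_fraction = max(worst_fraction, len(domain) / total)
    return worst_fraction


instances = ["Middle Tier-1", "Middle Tier-2", "Middle Tier-3"]
domains = assign_update_domains(instances, update_domain_count=3)
updated = []
worst = rolling_update(domains, updated.append)
print(domains, worst)
```

With three instances in three update domains, at most one third of the role is offline at a time; with the default of five domains and enough instances the figure drops to one fifth.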

Page 44: Windows Azure Compute

Windows Azure Compute Summary
• Platform as a Service is all about reducing management and operations overhead
• The Windows Azure Fabric Controller is the foundation for Windows Azure’s PaaS
  • Provisions machines
  • Deploys services
  • Configures hardware for services
  • Monitors service and hardware health

Page 45: Windows Azure Compute