LECTURE 1: INTRODUCTION TO CLOUD COMPUTING
    Essential Characteristics
    Service Models
    Deployment Models
LECTURE 2: Introducing Windows Azure
    Azure Overview
    Is Your Application a Good Fit for Windows Azure?
    Understand the Benefits of Windows Azure
    Target Scenarios that Leverage the Strengths of Windows Azure
    Scenarios that Do Not Require the Capabilities of Windows Azure
    Evaluate Architecture and Development
    Summary
LECTURE 3: Main Components of Windows Azure
    Table of Contents
    The Components of Windows Azure
    Execution Models
    Data Management
    Networking
    Business Analytics
    Messaging
    Caching
    Identity
    High-Performance Computing
    Media
    Commerce
    SDKs
    Getting Started
Lecture 4: WINDOWS AZURE COMPUTE
    Web Sites vs Cloud Services vs Virtual Machines
    WINDOWS AZURE CLOUD SERVICES
    WEB ROLE AND WORKER ROLE
    THE THREE RULES OF THE WINDOWS AZURE PROGRAMMING MODEL
    A WINDOWS AZURE APPLICATION IS BUILT FROM ONE OR MORE ROLES
Transcript
LECTURE 1: INTRODUCTION TO CLOUD COMPUTING
Different workloads can, and often do, have different requirements, different levels of criticality to the
business, and different levels of financial consideration associated with them. By decomposing an application
into workloads, an organization provides itself with valuable flexibility. A workload-centric approach provides
better controls over costs, more flexibility in choosing technologies best suited to the workload, workload
specific approaches to availability and security, flexibility and agility in adding and deploying new
capabilities, etc.
Scenarios
When thinking about resiliency, it's sometimes helpful to do so in the context of scenarios. The following are
examples of typical scenarios:
Scenario 1 – Sports Data Service
A customer provides a data service that provides sports information. The service has two primary workloads.
The first provides statistics for the player and teams. The second provides scores and commentary for games
that are currently in progress.
Scenario 2 – E-Commerce Web Site
An online retailer sells goods via a website in a well-established model. The application has a number of
workloads, with the most popular being "search and browse" and "checkout."
Scenario 3 – Social
A high profile social site allows members of a community to engage in shared experiences around forums,
user generated content, and casual gaming. The application has a number of workloads, including
registration, search and browse, social interaction, gaming, email, etc.
Scenario 4 - Web
An organization wishes to provide an experience to customers via its web site. The application needs to
deliver experiences on both PC-based browsers as well as popular mobile device types (phone, tablet). The
application has a number of workloads including registration, search and browse, content publishing, social
commenting, moderation, gaming, etc.
Example of Decomposing by Workload
Let's take a closer look at one of the scenarios and decompose it into its child workloads. Scenario #2, an
ecommerce web site, could have a number of workloads – browse & search, checkout & management, user
registration, user generated content (reviews and rating), personalization, etc.
Example definitions of two of the core workloads for the scenario would be:
Browse & Search enables customers to navigate through a product catalog, search for specific items, and
perhaps manage baskets or wish lists. This workload can have attributes such as anonymous user access,
sub-second response times, and caching. Performance degradation may occur in the form of increased
response times with unexpected user load or application-tolerant interrupts for product inventory refreshes.
In those cases, the application may choose to continue to serve information from the cache.
Checkout & Management helps customers place, track, and cancel orders; select delivery methods and
payment options; and manage profiles. This workload can have attributes such as secure access, queued
processing, access to third-party payment gateways, and connectivity to back-end on-premises systems.
While the application may tolerate increased response time, it may not tolerate loss of orders; therefore, it is
designed to guarantee that customer orders are always accepted and captured, regardless of whether the
application can process the payment or arrange delivery.
Establish a Lifecycle Model
An application lifecycle model defines the expected behavior of an application when operational. At different
phases and times, an application will put different demands on the system whether at a functional or scale
level. The lifecycle model(s) will reflect this.
Workloads should have defined lifecycle models for all relevant and applicable scenarios. Services may have
hourly, daily, weekly, or seasonal lifecycle differences that, when modeled, identify specific capacity,
availability, performance, and scalability requirements over time.
Many services will have a minimum of two applicable models, particularly if service demand bursts in a predictable fashion. Whether it's a spike related to peak demand during a holiday period, increased filing of tax returns just before their due date, morning and afternoon commuter time windows, or end-of-year filing of employee performance reviews, many organizations have an understanding of predictable spikes in demand for a service that should be modeled.
Figure 1. A view of the lifecycle model on a month by month basis
Figure 2. A look at the lifecycle model more granularly, at the daily level
Establish an Availability Model and Plan
Once a lifecycle model is identified, the next step is to establish an availability model and plan. An availability model for your application identifies the level of availability that is expected for your workload. It is critical, as it will inform many of the decisions you'll make when establishing your service.
There are a number of things to consider and a number of potential actions that can be taken.
SLA Identification
When developing your availability plan, it's important to understand what the desired availability is for your
application, the workloads within that application, and the services that are utilized in the delivery of those
workloads.
Defining the Desired SLA for Your Workload
Understanding the lifecycle of your workload will help you understand the desired Service Level Agreement that you'd like to deliver. Even if an SLA is not provided for your service publicly, this is the baseline you'll aspire to meet in terms of availability.
There are a number of options that can be taken that will provide scalability and resiliency. These vary in cost and can be applied in multiple layers. At the application level, utilizing all of these is unfeasible for most projects due to cost and implementation time. By decomposing your application to the workload level, you gain the ability to make these investments at a more targeted level: the workload.
Even at the workload level, you may not choose to implement every option. What you choose to implement
or not is determined by your requirements. Regardless of the options you do choose, you should make a
conscious choice that's informed and considerate of all of the options.
Autonomy
Autonomy is about independence and reducing dependency between the parts which make up the service
as a whole. Dependency on components, data, and external entities must be examined when designing
services, with an eye toward building related functionality into autonomous units within the service. Doing so
provides the agility to update versions of distinct autonomous units, finer tuned control of scaling these
autonomous units, etc.
Workload architectures are often composed of autonomous components that do not rely on manual
intervention, and do not fail when the entities they depend upon are not available. Applications composed
of autonomous parts are:
available and operational
resilient and easily fault-recoverable
lower-risk for unhealthy failure states
easy to scale through replication
less likely to require manual interventions
These autonomous units will often leverage asynchronous communication, pull-based data processing, and
automation to ensure continuous service.
Looking forward, the market will evolve to a point where there are standardized interfaces for certain types
of functionality for both vertical and horizontal scenarios. When this future vision is realized, a service
provider will be able to engage with different providers and potentially different implementations that solve
the designated work of the autonomous unit. For continuous services, this will be done autonomously and
be based on policies.
As much as autonomy is an aspiration, most services will take a dependency on a third party service – if only
for hosting. It's imperative to understand the SLAs of these dependent services and incorporate them into
your availability plan.
Understanding the SLAs and Resiliency Options for Service Dependencies
This section identifies the different types of SLAs that can be relevant to your service. For each of these
service types, there are key considerations and approaches, as well as questions that should be asked.
Public Cloud Platform Services
Services provided by a commercial cloud computing platform, such as compute or storage, have service level
agreements that are designed to accommodate a multitude of customers at significant scale. As such, the
SLAs for these services are non-negotiable. A provider may offer tiered levels of service with different
SLAs, but these tiers will be non-negotiable.
Questions to consider for this type of service:
Does this service allow only a certain number of calls to the Service API?
Does this service place limits on the call frequency to the Service API?
Does the service limit the number of servers that can call the Service API?
What is the publicly available information on how the service delivers on its availability promise?
How does this service communicate its health status?
What is the stated Service Level Agreement (SLA)?
What are the equivalent platform services provided by other 3rd parties?
3rd Party "Free" Services
Many third parties provide "free" services to the community. For private sector organizations, this is largely
done to help generate an ecosystem of applications around their core product or service. For the public sector, this is done to provide data to the citizenry and businesses that have ostensibly paid for its collection through taxes.
Most of these services will not come with service level agreements, so availability is not guaranteed. When
SLAs are provided, they typically focus on restrictions that are placed on consuming applications and
mechanisms that will be used to enforce them. Examples of restrictions can include throttling or blacklisting
your solution if it exceeds a certain number of service calls, exceeds a certain number of calls in a given time
period (x per minute), or exceeds the number of allowable servers that are calling the service.
Questions to consider for this type of service:
Does this service allow only a certain number of calls to the Service API?
Does this service place limits on the call frequency to the Service API?
Does the service limit the number of servers that can call the Service API?
What is the publicly available information on how the service delivers on its availability promise?
How does this service communicate its health status?
What is the stated Service Level Agreement (SLA)?
Is this a commodity service where the required functionality and/or data are available from multiple service
providers?
If a commodity service, is the interface interoperable across other service providers (directly or through an
available abstraction layer)?
What are the equivalent platform services provided by other 3rd parties?
3rd Party Commercial Services
Commercial services provided by third parties have service level agreements that are designed to
accommodate the needs of paying customers. A provider may offer tiered levels of SLAs with different
levels of availability, but these SLAs will be non-negotiable.
Questions to consider for this type of service:
Does this service allow only a certain number of calls to the Service API?
Does this service place limits on the call frequency to the Service API?
Does the service limit the number of servers that can call the Service API?
What is the publicly available information on how the service delivers on its availability promise?
How does this service communicate its health status?
What is the stated Service Level Agreement (SLA)?
Is this a commodity service where the required functionality and/or data are available from multiple service
providers?
If a commodity service, is the interface interoperable across other service providers (directly or through an
available abstraction layer)?
What are the equivalent platform services provided by other 3rd parties?
Community Cloud Services
A community of organizations, such as a supply chain, may make services available to member
organizations.
Questions to consider for this type of service:
Does this service allow only a certain number of calls to the Service API?
Does this service place limits on the call frequency to the Service API?
Does the service limit the number of servers that can call the Service API?
What is the publicly available information on how the service delivers on its availability promise?
How does this service communicate its health status?
What is the stated Service Level Agreement (SLA)?
As a member of the community, is there a possibility of negotiating a different SLA?
Is this a commodity service where the required functionality and/or data are available from multiple service
providers?
If a commodity service, is the interface interoperable across other service providers (directly or through an
available abstraction layer)?
What are the equivalent platform services provided by other 3rd parties?
1st Party Internal Enterprise Wide Cloud Services
An enterprise may make core services, such as stock price data or product metadata, available to its divisions
and departments.
Questions to consider for this type of service:
Does this service allow only a certain number of calls to the Service API?
Does this service place limits on the call frequency to the Service API?
Does the service limit the number of servers that can call the Service API?
What is the publicly available information on how the service delivers on its availability promise?
How does this service communicate its health status?
What is the stated Service Level Agreement (SLA)?
As a member of the organization, is there a possibility of negotiating a different SLA?
Is this a commodity service where the required functionality and/or data are available from multiple service
providers?
If a commodity service, is the interface interoperable across other service providers (directly or through an
available abstraction layer)?
What are the equivalent platform services provided by other 3rd parties?
1st Party Internal Divisional or Departmental Cloud Services
An enterprise division or department may make services available to other members of their immediate
organization.
Questions to consider for this type of service:
Does this service allow only a certain number of calls to the Service API?
Does this service place limits on the call frequency to the Service API?
Does the service limit the number of servers that can call the Service API?
What is the publicly available information on how the service delivers on its availability promise?
How does this service communicate its health status?
What is the stated Service Level Agreement (SLA)?
As a member of the division, is there a possibility of negotiating a different SLA?
Is this a commodity service where the required functionality and/or data are available from multiple service
providers?
If a commodity service, is the interface interoperable across other service providers (directly or through an
available abstraction layer)?
What are the equivalent platform services provided by other 3rd parties?
The "True 9s" of Composite Service Availability
Taking advantage of existing services can provide significant agility in delivering solutions for your
organization or for commercial sale. While attractive, it is important to truly understand the impacts these
dependencies have on the overall SLA for the workload.
Availability is typically expressed as a percentage of uptime in a given year. This availability percentage is referred to as the number of "9s." For example, 99.9% represents a service with "three nines."
Figure 3. Downtime related to the more common "9s"
One common misconception relates to the number of "9s" a composite service provides. Specifically, it is often assumed that if a given service is composed of 5 services, each with a promised 99.99% uptime in their SLAs, the resulting composite service has an availability of 99.99%. This is not the case.
The percentage is actually a calculation that considers the amount of downtime per year. A service with an SLA of "four 9s" (99.99%) can be offline up to 52.56 minutes per year. Incorporating 5 of these services into a composite introduces an identified SLA risk of 262.8 minutes, or 4.38 hours. This reduces the availability to 99.95% before a single line of code is written! You generally can't change the availability of a third-party service; however, when writing your code, you can increase the overall availability of your application using concepts laid out in this document.
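To make the arithmetic concrete, the calculation can be sketched in a few lines of Python (used here purely for illustration; the SLA figures are the ones from the example above):

    # Composite availability of five dependencies consumed in series:
    # the workload is only up when every dependency is up.
    MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

    dependency_slas = [0.9999] * 5    # five services, each with "four 9s"

    composite = 1.0
    for sla in dependency_slas:
        composite *= sla

    downtime = (1 - composite) * MINUTES_PER_YEAR
    print(f"Composite availability: {composite:.4%}")          # ~99.9500%
    print(f"Potential downtime: {downtime:.1f} minutes/year")  # ~262.8

Multiplying the availabilities gives nearly the same result as summing the individual downtime allowances, and both make the same point: the composite is weaker than any one of its parts.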
When leveraging external services, the importance of understanding SLAs, both individually and in their impact on the composite, cannot be stressed enough.
Identify Failure Points and Failure Modes
To create a resilient architecture, it's important to understand it: specifically, to make a proactive effort to understand and document what can cause an outage.
Understanding the failure points and failure modes for an application and its related workload services can
enable you to make informed, targeted decisions on strategies for resiliency and availability.
Failure Points
A failure point is a design element that can cause an outage. An important focus is on design elements that
are subject to external change.
Examples of failure points include:
Database connections
Website connections
Configuration files
Registry keys
Categories of common failure points include:
ACLs
Database access
External web site/service access
Transactions
Configuration
Capacity
Network
Failure Modes
While failure points define the areas that can result in an outage, failure modes identify the root cause of an outage at those failure points.
Examples of failure modes include:
A missing configuration file
Significant traffic exceeding resource capacity
A database reaching maximum capacity
Resiliency Patterns and Considerations
This document will look at key considerations across compute, storage, and platform services. Before covering these topics, it is important to recap several basic topics that affect resiliency and are often misunderstood and/or not implemented.
Default to Asynchronous
As mentioned previously, a resilient architecture should optimize for autonomy. One of the ways to achieve
autonomy is by making communication asynchronous. A resilient architecture should default to
asynchronous interaction, with synchronous interactions happening only as the result of an exception.
Stateless web-tiers or web-tiers with a distributed cache can provide this on the front end of a solution.
Queues can provide this capability for communication between workload services or for services within a workload service.
The latter allows messages to be placed on queues so that secondary services can retrieve them, based on logic-, time-, or volume-driven triggers. In addition to making the process asynchronous, it also allows scaling of the tiers "pushing" to or "pulling" from the queues as appropriate.
Timeouts
A common area where transient faults will occur is where your architecture connects to a service or a resource such as a database. When consuming these services, it's a common practice to implement logic that introduces the concept of a timeout. This logic identifies an acceptable timeframe in which a response is expected and generates an identifiable error when that timeframe is exceeded. When the timeout error appears, appropriate steps are taken based on the context in which the error occurs. Context
can include the number of times this error has occurred, the potential impact of the unavailable resource,
SLA guarantees for the current time period for the given customer, etc.
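As a minimal sketch of this timeout logic (in Python; the worker-thread approach and the error type are assumptions for illustration, and a client library's own timeout parameter should be preferred where one exists):

    import concurrent.futures

    class ResourceTimeoutError(Exception):
        # Raised when a dependency does not respond within its budget.
        pass

    _pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

    def call_with_timeout(func, timeout_seconds, *args, **kwargs):
        # Run the call on a worker thread and fail fast if it overruns.
        # Note: the abandoned call keeps running on its thread; the caller
        # chooses the next step based on context (retry count, SLA, etc.).
        future = _pool.submit(func, *args, **kwargs)
        try:
            return future.result(timeout=timeout_seconds)
        except concurrent.futures.TimeoutError:
            raise ResourceTimeoutError(
                f"{func.__name__} exceeded its {timeout_seconds}s budget")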
Handle Transient Faults
When designing the service(s) that will deliver your workload, you must accept and embrace that failures will
occur and take the appropriate steps to address them.
One of the common areas to address is transient faults. As no service has 100% uptime, it's realistic to expect
that you may not be able to connect to a service that a workload has taken a dependency on. The inability to
connect to or faults seen from one of these services may be fleeting (less than a second) or permanent (a
provider shuts down).
Degrade Gracefully
Your workload service should aspire to handle these transient faults gracefully. Netflix, for example, during an outage at its cloud provider, served customers from an older copy of the video queue when the primary data store was not available. Another example would be an ecommerce site continuing to collect orders if its payment
gateway is unavailable. This provides the ability to process orders when the payment gateway is once again
available or after failing over to a secondary payment gateway.
When doing this, the ideal scenario is to minimize the impact to the overall system. In both cases, the service
issues are largely invisible to end users of these systems.
Transient Fault Handling Considerations
There are several key considerations for the implementation of transient fault handling, as detailed in the
following sections.
Retry logic
The simplest form of transient fault handling is to retry the operation that failed. If using a commercial third
party service, implementing "retry logic" will often resolve this issue.
It should be noted that designs should typically limit the number of times the logic will be retried. The logic
will typically attempt to execute the action(s) a certain number of times, registering an error and/or utilizing
a secondary service or workflow if the fault continues.
Exponential Backoff
If the transient fault is the result of throttling by the service due to heavy load, repeated attempts to call the service will only extend the throttling and impact overall availability.
It is often desirable to reduce the volume of calls to the service to help avoid or reduce throttling. This is typically done algorithmically, such as immediately retrying after the first failure, waiting 1 second after the second failure, 5 seconds after the third failure, etc., until ultimately succeeding or hitting an application-defined threshold for failures.
This approach is referred to as "exponential backoff."
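A sketch of retry logic with backoff follows; the TransientFaultError type, the delay schedule, and the retry cap are assumptions for illustration:

    import random
    import time

    class TransientFaultError(Exception):
        # Assumed application-defined marker for faults worth retrying.
        pass

    def retry_with_backoff(operation, max_attempts=5, base_delay=1.0):
        # Double the wait after each failure, plus a little random jitter so
        # that many clients backing off at once do not retry in lockstep.
        for attempt in range(max_attempts):
            try:
                return operation()
            except TransientFaultError:
                if attempt == max_attempts - 1:
                    raise  # application-defined threshold reached; escalate
                time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))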
Idempotency
A core assumption with connected services is that they will not be 100% available and that transient fault
handling with retry logic is a core implementation approach. In cases where retry logic is implemented, there
is the potential for the same message to be sent more than once, for messages to be sent out of sequence,
etc.
Operations should be designed to be idempotent, ensuring that sending the same message multiple times
does not result in an unexpected or polluted data store.
For example, inserting data from all requests may result in multiple records being added if the service
operation is called multiple times. An alternate approach would be to implement the code as an intelligent
‗upsert‘. A timestamp or global identifier could be used to identify new from previously processed messages,
inserting only newer ones into the database and updating existing records if the message is newer than what
was received in the past.
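Such an 'intelligent upsert' might look like the sketch below, which uses SQLite as a stand-in data store; the table, columns, and message fields are hypothetical:

    import sqlite3

    SCHEMA = """CREATE TABLE IF NOT EXISTS customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        updated_at  TEXT)"""

    def apply_message(conn, customer_id, name, sent_at):
        # Insert new customers; update existing rows only when the incoming
        # message is newer, so re-delivered or out-of-order copies are no-ops.
        # Timestamps are ISO-8601 strings, which compare correctly as text.
        conn.execute("""
            INSERT INTO customers (customer_id, name, updated_at)
            VALUES (?, ?, ?)
            ON CONFLICT(customer_id) DO UPDATE SET
                name = excluded.name,
                updated_at = excluded.updated_at
            WHERE excluded.updated_at > customers.updated_at""",
            (customer_id, name, sent_at))
        conn.commit()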
Compensating Behavior
In addition to idempotency, another area for consideration is the concept of compensating behavior. In a world of an ever-growing set of connected systems and the emergence of composite services, understanding how to handle compensating behavior is important.
For many developers of line of business applications, the concepts of transactions are not new, but the frame
of reference is often tied to the transactional functionality exposed by local data technologies and related
code libraries. When looking at the concept in terms of the cloud, this mindset needs to take into account new considerations related to the orchestration of distributed services.
A service orchestration can span multiple distributed systems and be long running and stateful. The orchestration itself is rarely synchronous, can span multiple systems, and can run for anywhere from seconds to years based on the business scenario.
In a supply chain scenario that could tie together 25 organizations in the same workload activity, for
example, there may be a set of 25 or more systems that are interconnected in one or more service
orchestrations.
If success occurs, the 25 systems must be made aware that the activity was successful. For each connection point in the activity, participant systems can provide a correlation ID for the messages they receive from other systems. Depending on the type of activity, the receipt of that correlation ID may satisfy the party that the transaction is notionally complete. In other cases, upon the completion of the interactions of all 25 parties, a confirmation message may be sent to all parties (either directly from a single service or via the specific orchestration interaction points for each system).
To handle failures in composite and/or distributed activities, each service would expose a service interface
and operation(s) to receive requests to cancel a given transaction by a unique identifier. Behind the service
façade, workflows would be in place to compensate for the cancellation of this activity. Ideally these would
be automated procedures, but they can be as simple as routing to a person in the organization to remediate
manually.
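As a sketch of the automated case, each service could record an undo action for every completed step, keyed by the activity's correlation ID; the names here are hypothetical, and the notification stub represents routing to a person for manual remediation:

    compensations = {}  # correlation_id -> undo actions, in completion order

    def record_step(correlation_id, undo_action):
        # Called as each step of the distributed activity completes.
        compensations.setdefault(correlation_id, []).append(undo_action)

    def cancel_activity(correlation_id):
        # Exposed behind the service facade; unwinds steps newest-first.
        for undo in reversed(compensations.pop(correlation_id, [])):
            try:
                undo()
            except Exception as exc:
                notify_operator(correlation_id, exc)

    def notify_operator(correlation_id, exc):
        print(f"Manual remediation needed for {correlation_id}: {exc}")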
Circuit Breaker Pattern
A circuit breaker is a switch that automatically interrupts the flow of electric current if the current exceeds a
preset limit. Circuit breakers are used most often as a safety precaution where excessive current through a
circuit could be hazardous. Unlike a fuse, a circuit breaker can be reset and re-used.
The same pattern is applicable to software design, and particularly applicable for services where availability
and resiliency are a key consideration.
When a resource is unavailable, implementing a software circuit breaker allows the system to detect the condition and respond appropriately.
A common implementation of this pattern is related to accessing of databases or data services. Once an
established type and level of activity fails, the circuit breaker would react. With data, this is typically caused
by the inability to connect to a database or a data service in front of that database.
If a call to a database resource failed after 100 consecutive attempts to connect, there is likely little value in
continuing to call the database. A circuit breaker could be triggered at that threshold and the appropriate
actions can be taken.
In some cases, particularly when connecting to data services, this could be the result of throttling based on a
client exceeding the number of allowed calls within a given time period. The circuit breaker may inject delays
between calls until such time that connections are successfully established and meet the tolerance levels.
In other cases, the data store may be unavailable. If a redundant copy of the data is available, the system
may fail over to that replica. If a true replica is unavailable or if the database service is down broadly across
all data centers within a provider, a secondary approach may be taken. This could include sourcing a version of the requested data via an alternate data service provider. This alternate source could be a cache, an alternate persistent data store type on the current cloud provider, a separate cloud provider, or an on-premises data center. When such an alternate is not available, the service could also return a recognizable
error that could be handled appropriately by the client.
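A bare-bones version of the pattern is sketched below; the failure threshold and cool-off period are illustrative, and the fallback stands in for any of the secondary approaches described above:

    import time

    class CircuitBreaker:
        # Stop calling a failing dependency until a cool-off period passes.

        def __init__(self, threshold=5, cooloff_seconds=30.0):
            self.threshold = threshold
            self.cooloff_seconds = cooloff_seconds
            self.failures = 0
            self.opened_at = None  # None means the circuit is closed

        def call(self, operation, fallback):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.cooloff_seconds:
                    return fallback()     # tripped: skip the resource entirely
                self.opened_at = None     # cool-off elapsed: probe again
                self.failures = 0
            try:
                result = operation()
            except Exception:
                self.failures += 1
                if self.failures >= self.threshold:
                    self.opened_at = time.monotonic()  # trip the breaker
                return fallback()
            self.failures = 0
            return result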
Circuit Breaker Example: Netflix
Netflix, a media streaming company, is often held up as a great example of a resilient architecture. When
discussing the circuit breaker pattern on the Netflix Tech Blog, the team calls out several criteria that are included in its circuit breaker. These include:
1. A request to the remote service times out.
2. The thread pool and bounded task queue used to interact with a service dependency are at 100% capacity.
3. The client library used to interact with a service dependency throws an exception.
All of these contribute to the overall error rate. When that error rate exceeds their defined thresholds, the
circuit breaker is "tripped" and the circuit for that service immediately serves fallbacks without even
attempting to connect to the remote service.
In that same blog entry, the Netflix team states that the circuit breaker for each of their services implements
a fallback using one of the following three approaches:
1. Custom fallback – a service client library provides an invokable fallback method or locally available data on
an API server (e.g., a cookie or local cache) is used to generate a fallback response.
2. Fail silent – a method returns a null value to the requesting client, which works well when the data being
requested is optional.
3. Fail fast – when data is required or no good fallback is available, a 5xx response is returned to the client. This
approach focuses on keeping API servers healthy and enabling a quick recovery when impacted services
come back online, but does so at the expense of negatively impacting the client UX.
Handling SLA Outliers: Trusted Parties and Bad Actors
To enforce an SLA, an organization should address how its data service will deal with two categories of outliers: trusted parties and bad actors.
Trusted Parties and White Listing
Trusted parties are organizations with whom the organization could have special arrangements, and for
whom certain exceptions to standard SLAs might be made.
There may be some users of a service that want to negotiate special pricing terms or policies. In some cases,
a high volume of calls to the data service might warrant special pricing. In other cases, demand for a given
data service could exceed the volume specified in standard usage tiers. Such customers should be defined as
trusted parties to avoid inadvertently being flagged as bad actors.
White Listing
The typical approach to handling trusted parties is to establish a white list. A white list, which identifies a list
of trusted parties, is used by the service when it determines which business rules to apply when processing
customer usage. White listing is typically done by authorizing either an IP address range or an API key.
When establishing a consumption policy, an organization should identify if white listing is supported; how a
customer would apply to be on the white list; how to add a customer to the white list; and under what
circumstances a customer is removed from the white list.
Handling Bad Actors
If trusted parties stand at one end of the customer spectrum, the group at the opposite end is what is
referred to as "bad actors." Bad actors place a burden on the service, typically through attempted
"overconsumption." In some cases bad behavior is genuinely accidental. In other cases it is intentional, and,
in a few situations, it is malicious. These actors are labeled "bad," as their actions, intentional or otherwise,
have the ability to impact the availability of one or more services.
The burden of bad actors can introduce unnecessary costs to the data service provider and compromise
access by consumers who faithfully follow the terms of use and have a reasonable expectation of service, as
spelled out in an SLA. Bad actors must therefore be dealt with in a prescribed, consistent way. The typical
responses to bad actors are throttling and black listing.
Throttling
Organizations should define a strategy for dealing with spikes in usage by data service consumers.
Significant bursts of traffic from any consumer can put an unexpected load on the data service. When such
spikes occur, the organization might want to throttle access for that consumer for a certain period of time. In
this case the service refuses all requests from the consumer for a certain period of time, such as one minute,
five minutes, or ten minutes. During this period, service requests from the targeted consumer result in an
error message advising that they are being throttled for overuse.
The consumer making the requests can respond accordingly, such as by altering its behavior.
The organization should determine whether it wants to implement throttling and set the related business
rules. If it determines that consumers can be throttled, the organization will also need to decide what
behaviors should trigger the throttling response.
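A fixed-window rate limiter is one simple way to implement this; the per-minute quota, the consumer key, and the advisory error are assumptions for illustration:

    import time

    WINDOW_SECONDS = 60
    MAX_CALLS_PER_WINDOW = 100  # assumed per-consumer quota

    _usage = {}  # consumer_id -> (window_start, call_count)

    class ThrottledError(Exception):
        # Advises the consumer that it is being throttled for overuse.
        pass

    def check_rate_limit(consumer_id):
        now = time.monotonic()
        start, count = _usage.get(consumer_id, (now, 0))
        if now - start >= WINDOW_SECONDS:
            start, count = now, 0  # a new window begins
        if count >= MAX_CALLS_PER_WINDOW:
            wait = WINDOW_SECONDS - (now - start)
            raise ThrottledError(f"Throttled for overuse; retry in {wait:.0f}s")
        _usage[consumer_id] = (start, count + 1)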
Black listing
Although throttling should correct the behavior of bad actors, it might not always be successful. In cases in
which it does not work, the organization might want to ban a consumer. The opposite of a white list, a black
list identifies consumers that are barred from access to the service. The service will respond to access
requests from black-listed customers appropriately, and in a fashion that minimizes the use of data service
resources.
Black listing, as with white listing, is typically done by using either an API key or with an IP address range.
When establishing a consumption policy, the organization should specify what behaviors will place a
consumer on the black list; how black listing can be appealed; and how a consumer can be removed from
the black list.
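Combining the two lists, the service can make a cheap admit/deny decision before any other business rules run; the API keys below are placeholders:

    WHITELIST = {"key-partner-001"}  # trusted parties: relaxed rules apply
    BLACKLIST = {"key-banned-007"}   # barred consumers: refuse cheaply

    def classify_consumer(api_key):
        # Run before throttling or any expensive processing.
        if api_key in BLACKLIST:
            return "deny"      # minimal-cost rejection
        if api_key in WHITELIST:
            return "trusted"   # skip or relax the throttling rules
        return "standard"      # normal SLA, throttling, and quotas apply

A "standard" consumer would then pass through a rate-limit check like the one sketched in the Throttling section.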
“Automate All the Things”
People make mistakes. Whether it's a developer making a code change that could have unexpected
consequences, a DBA accidentally dropping a table in a database, or an operations person who makes a
change but doesn‘t document it, there are multiple opportunities for a person to inadvertently make a
service less resilient.
To reduce human error, a logical approach is to reduce the number of humans in the process. Through the introduction of automation, you limit the opportunity for ad hoc, inadvertent deviations from expected behavior to jeopardize your service.
There is a meme in the DevOps community with a cartoon character saying "Automate All the Things." In the
cloud, most services are exposed with an API. From development tools to virtualized infrastructure to
platform services to solutions delivered as Software as a Service, most things are scriptable.
Scripting is highly recommended. Scripting makes deployment and management consistent and predictable
and pays significant dividends for the investment.
Automating Deployment
One of the key areas of automation is in the building and deployment of a solution. Automation can make it
easy for a developer team to test and deploy to multiple environments. Development, test, staging, beta,
and production can all be deployed readily and consistently through automated builds. The ability to deploy
consistently across environments works toward ensuring that what's in production is representative of what's
been tested.
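As a sketch of the idea, a single parameterized script can push the same build artifact to every environment; the deploy_package helper is hypothetical and stands in for whatever provider API or tooling is actually used:

    ENVIRONMENTS = {
        "development": {"instances": 1},
        "test":        {"instances": 2},
        "staging":     {"instances": 2},
        "production":  {"instances": 6},
    }

    def deploy_package(artifact, environment, instances):
        # Hypothetical stand-in for the real provider API or CLI call.
        print(f"Deploying {artifact} to {environment} ({instances} instances)")

    def deploy_all(artifact):
        # The same artifact and the same steps for every environment, so
        # that what reaches production is exactly what was tested.
        for env, settings in ENVIRONMENTS.items():
            deploy_package(artifact, env, settings["instances"])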
Establishing and Automating a Test Harness
Testing is another area that can be automated. Like automated deployment, establishing automated testing
is valuable in ensuring that your system is resilient and stays resilient over time. As the code and usage of your service evolve, it's important to ensure that all appropriate testing is done, both functionally and at scale.
Automating Data Archiving and Purging
One of the areas that gets little attention is that of data archiving and purging. Data continues to grow at a higher volume and in greater variety than at any time in history. Depending on the database technology and the types of queries required, unnecessary data can reduce the response time of
database technology and the types of queries required, unnecessary data can reduce the response time of
your system and increase costs unnecessarily. For resiliency plans that include one or more replicas of a data
store, removing all but the necessary data can expedite management activities such as backing up and
restoring data.
Identify the requirements for your solution related to data needed for core functionality, data that is needed for compliance purposes but can be archived, and data that is no longer necessary and can be purged.
Utilize the APIs available from the related products and services to automate the implementation of these
requirements.
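A retention sweep along these lines might look like the following sketch, again with SQLite as a stand-in and with hypothetical table names and retention windows:

    import sqlite3
    from datetime import datetime, timedelta

    RETENTION = {
        "orders_archive": timedelta(days=7 * 365),  # compliance: keep 7 years
        "session_logs":   timedelta(days=30),       # purge once unnecessary
    }

    def purge_expired(conn):
        # Delete rows older than each table's retention window; run this on
        # a schedule via the platform's automation APIs.
        for table, keep_for in RETENTION.items():
            cutoff = (datetime.utcnow() - keep_for).isoformat()
            conn.execute(f"DELETE FROM {table} WHERE created_at < ?", (cutoff,))
        conn.commit()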
Understand Fault Domains and Upgrade Domains
When building a resilient architecture, it's also important to understand the concepts of fault domains and upgrade domains.
Fault Domains
Fault domains constrain the placement of services based on known hardware boundaries and the likelihood
that a particular type of outage will affect a set of machines. A fault domain is defined as a series of machines that can fail simultaneously, usually determined by physical properties (a particular rack of machines, a series of machines sharing the same power source, etc.).
Upgrade Domains
Upgrade domains are similar to fault domains. Upgrade domains define a physical set of services that are
updated by the system at the same time. The load balancer at the cloud provider must be aware of upgrade domains in order to ensure that, if a particular domain is being updated, the overall system remains balanced and services remain available.
Depending on the cloud provider and platform services utilized, fault domains and upgrade domains may be
provided automatically, be something your service can opt-in to via APIs, or require a 1st or 3rd party
solution.
Identify Compute Redundancy Strategies
On-premises solutions have often relied on redundancy to help them with availability and scalability. From
an availability standpoint, redundant data centers provided the ability to increase likelihood of business
continuity in the face of infrastructure failures in a given data center or part of a data center.
For applications with geo-distributed consumers, traffic management and redundant implementations
routed users to local resources, often with reduced latency.
Note: Data resiliency, which includes redundancy, is covered as a separate topic in the section titled Establishing a Data Resiliency Approach.
Redundancy and the Cloud
On-premises, redundancy has historically been achieved through duplicate sets of hardware, software, and
networking. Sometimes this is implemented in a cluster in a single location or distributed across multiple
data centers.
When devising a strategy for the cloud, it is important to rationalize the need for redundancy across three
vectors. These vectors include deployed code within a cloud provider‘s environment, redundancy of
providers themselves, and redundancy between the cloud and on premises.
Deployment Redundancy
When an organization has selected a cloud provider, it is important to establish a redundancy strategy for
the deployment within the provider.
If deployed to Platform as a Service (PaaS), much of this may be handled by the underlying platform. In an
Infrastructure as a Service (IaaS) model, much of this is not.
Deploy n Roles within a Data Center
The simplest form of redundancy is deploying your solution to multiple compute nodes within a single cloud
provider. By deploying to multiple nodes, the solution can limit downtime that would occur when only a
single node is deployed.
In many Platform as a Service environments, the state of the virtual machine hosting the code is monitored
and virtual machines detected to be unhealthy can be automatically replaced with a healthy node.
Deploy Across Multiple Data Centers
While deploying multiple nodes in a single data center will provide benefits, architectures must consider that
an entire data center could potentially be unavailable. While not a common occurrence, events such as
natural disasters, war, etc. could result in a service disruption in a particular geo-location.
To achieve your SLA, it may be appropriate for you to deploy your solution to multiple data centers for your
selected cloud provider. There are several approaches to achieving this, as identified below.
1. Fully Redundant Deployments in Multiple Data Centers
The first option is a fully redundant solution in multiple data centers done in conjunction with a traffic
management provider. A key consideration for this approach will be impact to the compute-related costs for
this type of redundancy, which will increase 100% for each additional data center deployment.
2. Partial Deployment in Secondary Data Center(s) for Failover
Another approach is to deploy a partial deployment to a secondary data center of reduced size. For example,
if the standard configuration utilized 12 compute nodes, the secondary data center would contain a
deployment containing 6 compute nodes.
This approach, done in conjunction with traffic management, would allow for business continuity with
degraded service after an incident that solely impacted the primary center.
Given the limited number of times a data center goes offline entirely, this is often seen as a cost-effective
approach for compute – particularly if a platform allows the organization to readily onboard new instances in
the second data center.
3. Divided Deployments across Multiple Data Centers with Backup Nodes
For certain workloads, particularly those in the financial services vertical, there is a significant amount of data
that must be processed within a short, immovable time window. In these circumstances, work is done in
shorter bursts and the costs of redundancy are warranted to deliver results within that window.
In these cases, code is deployed to multiple data centers. Work is divided and distributed across the nodes
for processing. In the instance that a data center becomes unavailable, the work intended for that node is
delivered to the backup node which will complete the task.
4. Multiple Data Center Deployments with Geography Appropriate Sizing per Data Center
This approach utilizes redundant deployments that exist in multiple data centers but are sized appropriately
for the scale of a geo-relevant audience.
Provider Redundancy
While data-center-centric redundancy is good, Service Level Agreements apply at the service level rather than the data center level. There is the possibility that the services delivered by a provider could become unavailable across
multiple or all data centers.
Based on the SLAs for a solution, it may be desirable to also incorporate provider redundancy. To realize this,
cloud-deployable products or cloud services that will work across multiple cloud platforms must be
identified. Microsoft SQL Server, for example, can be deployed in a Virtual Machine inside of Infrastructure as
a Service offerings from most vendors.
For cloud provided services, this is more challenging as there are no standard interfaces in place, even for
core services such as compute, storage, queues, etc. If provider redundancy is desired for these services, it is
often achievable only through an abstraction layer. An abstraction layer may provide enough functionality for a solution, but it will not evolve as fast as the underlying services and may inhibit an organization
from being able to readily adopt new features delivered by a provider.
If redundant provider services are warranted, this can be at one of several levels: an entire application, a
workload, or an aspect of a workload. At the appropriate level, evaluate the need for compute, data, and
platform services and determine what must truly be redundant and what can be handled via approaches to
provide graceful degradation.
On-Premises Redundancy
While taking a dependency on a cloud provider may make fiscal sense, there may be certain business
considerations that require on-premises redundancy for compliance and/or business continuity.
Based on the SLAs for a solution, it may be desirable to also incorporate on-premises redundancy. To realize
this, private cloud-deployable products or cloud services that will work across multiple cloud types must be
identified. As with the case of provider redundancy, Microsoft SQL Server is a good example of a product
that can be deployed on-premises or in an IaaS offering.
For cloud provided services, this is more challenging as there are often no on-premises equivalents with
interface and capability symmetry.
If redundant provider services are required on premises, this can be at one of several levels: an entire
application, a workload, or an aspect of a workload. At the appropriate level, evaluate the need for compute,
data, and platform services and determine what must truly be redundant and what can be handled via
approaches to provide graceful degradation.
Redundancy Configuration Approaches
When identifying your redundancy configuration approaches, classifications that existed pre-cloud also
apply. Depending on the types of services utilized in your solution, some of this may be handled by the
underlying platform automatically. In other cases, this capability is handled through technologies like
Windows Fabric.
1. Active/active — Traffic intended for a failed node is either passed onto an existing node or load balanced
across the remaining nodes. This is usually only possible when the nodes utilize a homogeneous software
configuration.
2. Active/passive — Provides a fully redundant instance of each node, which is only brought online when its
associated primary node fails. This configuration typically requires the most extra hardware.
3. N+1 — Provides a single extra node that is brought online to take over the role of the node that has failed.
In the case of heterogeneous software configuration on each primary node, the extra node must be
universally capable of assuming any of the roles of the primary nodes it is responsible for. This normally
refers to clusters which have multiple services running simultaneously; in the single service case, this
degenerates to active/passive.
4. N+M — In cases where a single cluster is managing many services, having only one dedicated failover node
may not offer sufficient redundancy. In such cases, more than one (M) standby server is included and
available. The number of standby servers is a tradeoff between cost and reliability requirements.
5. N-to-1 — Allows the failover standby node to become the active one temporarily, until the original node can
be restored or brought back online, at which point the services or instances must be failed-back to it in order
to restore high availability.
6. N-to-N — A combination of active/active and N+M, N to N redistributes the services, instances or
connections from the failed node among the remaining active nodes, thus eliminating (as with active/active)
the need for a 'standby' node, but introducing a need for extra capacity on all active nodes.
Traffic Management
Whether traffic is always geo-distributed or routed to different data centers to satisfy business continuity
scenarios, traffic management functionality is important to ensure that requests to your solution are being
routed to the appropriate instance(s).
It is important to note that taking a dependency on a traffic management service introduces a single point of failure. It is important to investigate the SLA of your application's primary traffic management service and
determine if alternate traffic management functionality is warranted by your requirements.
Establish a Data Partitioning Strategy
While many high scale cloud applications have done a fine job of partitioning their web tier, they are less successful in scaling their data tier in the cloud. With an ever-growing diversity of connected devices, the
level of data generated and queried is growing at levels not seen before in history. The need to be able to
support 500,000 new users per day, for example, is now considered reasonable.
Having a partitioning strategy is critically important across multiple dimensions, including storing, querying,
or maintaining that data.
Decomposition and Partitioning
Because of the benefits and tradeoffs of different technologies, it is common to leverage technologies that
are most optimal for the given workload.
Having a solution that is decomposed by workloads provides you with the ability to choose data
technologies that are optimal for a given workload. For example, a website may utilize table storage for
content for an individual, utilizing partitions at the user level for a responsive experience. Those table rows
may be aggregated periodically into a relational database for reporting and analytics.
Partitioning strategies may, and often will, vary based on the technologies chosen.
Understanding the 3 Vs
To properly devise a partitioning strategy, an organization must first understand its data.
The 3 Vs, made popular by Gartner, look at three different aspects of data. Understanding how the 3 Vs
relate to your data will assist you in making an informed decision on partitioning strategies.
Volume
Volume refers to the size of the data. Volume has very real impacts on the partitioning strategy. Volume
limitations on a particular data technology may force partitioning due to size limitations, query speeds at
volume, etc.
Velocity
Velocity refers to the rate at which your data is growing. You will likely devise a different partitioning strategy
for a slow growing data store vs. one that needs to accommodate 500,000 new users per day.
Variety
Variety refers to the different types of data that are relevant to the workload. Whether it's relational data, key-value pairs, social media profiles, images, audio files, videos, or other types of data, it's important to
understand it. This is both to choose the right data technology and make informed decisions for your
partitioning strategy.
Horizontal Partitioning
Likely the most popular approach to partitioning data is to partition it horizontally. When partitioning
horizontally, a decision is made on criteria to partition a data store into multiple shards. Each shard contains
the entire schema, with the criteria driving the placement of data into the appropriate shards.
Based on the type of data and the data usage, this can be done in different ways. For example, an
organization could choose to partition their data based on a customer last name. In another case, the
partition could be date centric, partitioning on the relevant calendar interval of hour, day, week, or month.
Figure 4. An example of horizontal partitioning by last name
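A shard-routing function for the last-name scheme in Figure 4 might look like the sketch below; the shard names and letter ranges are illustrative:

    # Route customer rows to shards by the first letter of the last name.
    SHARD_RANGES = [
        ("A", "H", "shard-0"),
        ("I", "Q", "shard-1"),
        ("R", "Z", "shard-2"),
    ]

    def shard_for(last_name):
        initial = last_name[:1].upper()
        for low, high, shard in SHARD_RANGES:
            if low <= initial <= high:
                return shard
        return "shard-other"  # digits, punctuation, non-Latin initials

Every shard holds the entire schema; only the rows differ, with the routing criterion deciding placement.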
Vertical Partitioning
Another approach is vertical partitioning. This optimizes the placement of data in different stores, often tied
to the variety of the data. Figure 5 shows an example where metadata about a customer is placed in one
store while thumbnails and photos are placed in separate stores.
Vertical partitioning can result in optimized storage and delivery of data. In Figure 5, for example, if the photo is rarely displayed for a customer, returning 3 megabytes per record can add unnecessary costs in a pay-as-you-go model.
Figure 5. An example of vertical partitioning.
Hybrid Partitioning
In many cases it will be appropriate to establish a hybrid partitioning strategy. This approach provides the
efficiencies of both approaches in a single solution.
Figure 6 shows an example of this, where the vertical partitioning seen earlier is now augmented to take
advantage of horizontal partitioning of the customer metadata.
Figure 6. An example of hybrid partitioning.
Cloud computing == network computing
At the heart of cloud computing is the network. The network is crucial as it provides the fabric or backbone
for devices to connect to services as well as services connecting to other services. There are three network
boundaries to consider in any FailSafe application.
Those network boundaries are detailed below with Windows Azure used as an example to provide context:
1. Role boundaries are traditionally referred to as tiers. Common examples are a web tier or a business logic
tier. If we look at Windows Azure as an example, it formally introduced roles as part of its core design to
provide infrastructure support for the multi-tier nature of modern, distributed applications. Windows Azure guarantees that role instances belonging to the same service are hosted within the scope of a single network
environment and managed by a single fabric controller.
2. Service boundaries represent dependencies on functionality provided by other services. Common examples
are a SQL environment for relational database access and a Service Bus for pub/sub messaging support.
Within Windows Azure, for example, service boundaries are enforced through the network: no guarantee will
be given that a service dependency will be part of the same network or fabric controller environment. That
might happen, but the design assumption for any responsible application has to be that any service
dependency is on a different network managed by a different fabric controller.
3. Endpoint boundaries are external to the cloud. They include any consuming endpoint, generally assumed to
be a device, connecting to the cloud in order to consume services. Special consideration is needed in this
part of the design because of the variable and unreliable nature of the network. Role boundaries and service
boundaries are within the boundaries of the cloud environment, and one can assume a certain level of
reliability and bandwidth. For the external dependencies, no such assumptions can be made, and extra care
has to be given to the device's ability to consume services, meaning both data and interactions.
The network by its very nature introduces latency as it passes information from one point of the network to
another. In order to provide a great experience both for users and for dependent services or roles, the
application architecture and design should look for ways to reduce latency as much as is sensible and to manage
unavoidable latency explicitly. One of the most common ways to reduce latency is to avoid service calls that
involve the network: local access to data and services is a key approach to reducing latency and improving
responsiveness. Using local data and services also provides another layer of protection against failure; as long
as the requests of the user or application can be served from the local environment, there is no need to interact
with other roles or services, which removes the unavailability of a dependent component as a failure mode.
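A minimal read-through cache sketch in C# (the ICatalogService dependency and Product type are hypothetical) that serves warm reads locally and touches the network only on a miss:

```csharp
using System.Collections.Concurrent;

class Product { public string Id; public string Name; }

interface ICatalogService { Product GetProduct(string id); }

class ProductReader
{
    private readonly ConcurrentDictionary<string, Product> cache =
        new ConcurrentDictionary<string, Product>();
    private readonly ICatalogService remoteCatalog;

    public ProductReader(ICatalogService remoteCatalog)
    {
        this.remoteCatalog = remoteCatalog;
    }

    public Product GetProduct(string id)
    {
        // Warm reads never cross the network, avoiding both its latency and
        // the possibility that the remote dependency is unavailable.
        return cache.GetOrAdd(id, key => remoteCatalog.GetProduct(key));
    }
}
```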
Millions of developers around the world know how to create applications using the Windows Server
programming model. Yet applications written for Windows Azure, Microsoft’s cloud platform, don’t
exactly use this familiar model. While most of a Windows developer’s skills still apply, Windows Azure
provides its own programming model.
Why? Why not just exactly replicate the familiar world of Windows Server in the cloud? Many vendors’
cloud platforms do just this, providing virtual machines (VMs) that act like on-premises VMs. This
approach, commonly called Infrastructure as a Service (IaaS), certainly has value, and it’s the right
choice for some applications. Yet cloud platforms are a new world, offering the potential for solving
today’s problems in new ways. Instead of IaaS, Windows Azure offers a higher-level abstraction that’s
typically categorized as Platform as a Service (PaaS). While it’s similar in many ways to the on-premises
Windows world, this abstraction has its own programming model meant to help developers build better
applications. The Windows Azure programming model focuses on improving applications in three areas:
Administration: In PaaS technologies, the platform itself handles the lion’s share of administrative
tasks. With Windows Azure, this means that the platform automatically takes care of things such as
applying Windows patches and installing new versions of system software. The goal is to reduce the
effort—and the cost—of administering the application environment.
Availability: Whether it’s planned or not, today’s applications usually have down time for Windows
patches, application upgrades, hardware failures, and other reasons. Yet given the redundancy that
cloud platforms make possible, there’s no longer any reason to accept this. The Windows Azure
programming model is designed to let applications be continuously available, even in the face of
software upgrades and hardware failures.
Scalability: The kinds of applications that people want to write for the cloud are often meant to
handle lots of users. Yet the traditional Windows Server programming model wasn’t explicitly
designed to support Internet-scale applications. The Windows Azure programming model, however,
was intended from the start to do this. Created for the cloud era, it’s designed to let developers build
the scalable applications that massive cloud data centers can support. Just as important, it also allows
applications to scale down when necessary, letting them use just the resources they need.
Whether a developer uses an IaaS technology or a PaaS offering such as Windows Azure, building
applications on cloud platforms has some inherent benefits. Both approaches let you pay only for the
computing resources you use, for example, and both let you avoid waiting for your IT department to
deploy servers. Yet important as they are, these benefits aren’t the topic here. Instead, the focus is
entirely on making clear what the Windows Azure programming model is and what it offers.
THE THREE RULES OF THE WINDOWS AZURE
PROGRAMMING MODEL
To get the benefits it promises, the Windows Azure programming model imposes three rules on
applications:
A Windows Azure application is built from one or more roles.
A Windows Azure application runs multiple instances of each role.
A Windows Azure application behaves correctly when any role instance fails.
It’s worth pointing out that Windows Azure can run applications that don’t follow all of these rules—it
doesn’t actually enforce them. Instead, the platform simply assumes that every application obeys all
three. Still, while you might choose to run an application on Windows Azure that violates one or more
of the rules, be aware that this application isn’t actually using the Windows Azure programming model. Unless you understand and follow the model’s rules, the application might not run as you expect it to.
A WINDOWS AZURE APPLICATION IS BUILT FROM ONE OR
MORE ROLES
Whether an application runs in the cloud or in your data center, it can almost certainly be divided into
logical parts. Windows Azure formalizes these divisions into roles. A role includes a specific set of code,
such as a .NET assembly, and it defines the environment in which that code runs. Windows Azure today
lets developers create three different kinds of roles:
Web role: As the name suggests, Web roles are largely intended for logic that interacts with the
outside world via HTTP. Code written as a Web role typically gets its input through Internet
Information Services (IIS), and it can be created using various technologies, including ASP.NET,
Windows Communication Foundation (WCF), PHP, and Java.
Worker role: Logic written as a Worker role can interact with the outside world in various ways—it’s
not limited to HTTP. For example, a Worker role might contain code that converts videos into a
standard format or calculates the risk of an investment portfolio or performs some kind of data
analysis.
Virtual Machine (VM) role: A VM role runs an image—a virtual hard disk (VHD)—of a Windows Server
2008 R2 virtual machine. This VHD is created using an on-premises Windows Server machine, then
uploaded to Windows Azure. Once it's stored in the cloud, the VHD can be loaded on demand into a VM role and executed. (From January 2012 onwards, the VM role has been superseded by Windows Azure Virtual Machines.)
All three roles are useful. The VM role was made available quite recently, however, and so it’s fair to say
that the most frequently used options today are Web and Worker roles. Figure 1 shows a simple
Windows Azure application built with one Web role and one Worker role. This application might use a Web role to accept HTTP requests from users, then hand off the work these users request, such as reformatting a video file and making it available for viewing, to a Worker role. A
primary reason for this two-part breakdown is that dividing tasks in this way can make an application easier to scale.
It’s also fine for a Windows Azure application to consist of just a single Web role or a single Worker role—
you don’t have to use both. A single application can even contain different kinds of Web and Worker
roles. For example, an application might have one Web role that implements a browser interface, perhaps
built using ASP.NET, and another Web role that exposes a Web services interface implemented using
WCF. Similarly, a Windows Azure application that performed two different kinds of data analysis might
define a distinct Worker role for each one. To keep things simple, though, we’ll assume that the example
application described here has just one Web role and one Worker role.
As part of building a Windows Azure application, a developer creates a service definition file that names
and describes the application’s roles. This file can also specify other information, such as the ports each
role can listen on. Windows Azure uses this information to build the correct environment for running
the application.
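The service definition file is XML; a minimal sketch for the one-Web-role, one-Worker-role example might look like the following (the service and role names are illustrative):

```xml
<ServiceDefinition name="ExampleService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
  <WebRole name="WebRole">
    <Endpoints>
      <!-- The port this role listens on for HTTP traffic -->
      <InputEndpoint name="HttpIn" protocol="http" port="80" />
    </Endpoints>
  </WebRole>
  <WorkerRole name="WorkerRole" />
</ServiceDefinition>
```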
A WINDOWS AZURE APPLICATION RUNS MULTIPLE
INSTANCES OF EACH ROLE
Every Windows Azure application consists of one or more roles. When it executes, an application
that conforms to the Windows Azure programming model must run at least two copies—two distinct
instances—of each role it contains. Each instance runs as its own VM, as Figure 2 shows.
Figure 2: A Windows Azure application runs multiple instances of each role.
As described earlier, the example application shown here has just one Web role and one Worker role. A
developer can tell Windows Azure how many instances of each role to run through a service
configuration file (which is distinct from the service definition file mentioned in the previous section).
Here, the developer has requested four instances of the application’s Web role and three instances of its
Worker role.
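In the service configuration file (again XML, with role names matching the service definition), the instance counts from this example would be expressed roughly as:

```xml
<ServiceConfiguration serviceName="ExampleService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
  <Role name="WebRole">
    <Instances count="4" />   <!-- four Web role instances -->
  </Role>
  <Role name="WorkerRole">
    <Instances count="3" />   <!-- three Worker role instances -->
  </Role>
</ServiceConfiguration>
```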
Every instance of a particular role runs the exact same code. In fact, with most Windows Azure
applications, each instance is just like all of the other instances of that role—they’re interchangeable. For
example, Windows Azure automatically load balances HTTP requests across an application’s Web role
instances. This load balancing doesn’t support sticky sessions, so there’s no way to direct all of a client’s
requests to the same Web role instance. Storing client-specific state, such as a shopping cart, in a
particular Web role instance won’t work, because Windows Azure provides no way to guarantee that all
of a client’s requests will be handled by that instance. Instead, this kind of state must be stored
externally, as described later.
A WINDOWS AZURE APPLICATION BEHAVES CORRECTLY
WHEN ANY ROLE INSTANCE FAILS
An application that follows the Windows Azure programming model must be built using roles, and it
must run two or more instances of each of those roles. It must also behave correctly when any of those
role instances fails. Figure 3 illustrates this idea.
Figure 3: A Windows Azure application behaves correctly even when a role instance fails.
Here, the application shown in Figure 2 has lost two of its Web role instances and one of its Worker role
instances. Perhaps the computers they were running on failed, or maybe the physical network
connection to these machines has gone down. Whatever the reason, the application’s performance is
likely to suffer, since there are fewer instances to carry out its work. Still, the application remains up and
functioning correctly.
If all instances of a particular role fail, an application will stop behaving as it should—this can’t be
helped. Yet the requirement to work correctly during partial failures is fundamental to the Windows
Azure programming model. In fact, the service level agreement (SLA) for Windows Azure requires
running at least two instances of each role. Applications that run only one instance of any role can’t get
the guarantees this SLA provides.
The most common way to achieve this is by making every role instance equivalent, as with load-balanced
Web roles accepting user requests. This isn’t strictly required, however, as long as the failure of a single
role instance doesn’t break the application. For example, an application might use a group of Worker
role instances to cache data for Web role instances, with each Worker role instance holding different
data. If any Worker role instance fails, a Web role instance trying to access the cached data it contained
behaves just as it would if the data wasn’t found in the cache (e.g., it accesses persistent storage to
locate that data). The failure might cause the application to run more slowly, but as seen by a user, it still
behaves correctly.
One more important point to keep in mind is that even though the sample application described so far
contains only Web and Worker roles, all of these rules also apply to applications that use VM roles. Just
like the others, every VM role must run at least two instances to qualify for the Windows Azure SLA,
and the application must continue to work correctly if one of these instances fails. Even with VM roles,
Windows Azure still provides a form of PaaS—it's not traditional IaaS.
WHAT THE WINDOWS AZURE PROGRAMMING MODEL
PROVIDES
The Windows Azure programming model is based on Windows, and the bulk of a Windows developer’s
skills are applicable to this new environment. Still, it’s not the same as the conventional Windows Server
programming model. So why bother to understand it? How does it help create better applications? To
answer these questions, it’s first worth explaining a little more about how Windows Azure works. Once
this is clear, understanding how the Windows Azure programming model can help create better software
is simple.
SOME BACKGROUND: THE FABRIC CONTROLLER
Windows Azure is designed to run in data centers containing lots of computers. Accordingly, every
Windows Azure application runs on multiple machines simultaneously. Figure 4 shows a simple example
of how this looks.
Figure 4: The Windows Azure fabric controller creates instances of an application’s roles on different
machines, then monitors their execution.
As Figure 4 shows, all of the computers in a particular Windows Azure data center are managed by
an application called the fabric controller. The fabric controller is itself a distributed application that
runs across multiple computers.
When a developer gives Windows Azure an application to run, he provides the code for the application’s
roles together with the service definition and service configuration files for this application. Among
other things, this information tells the fabric controller how many instances of each role it should create.
The fabric controller chooses a physical machine for each instance, then creates a VM on that machine
and starts the instance running. As the figure suggests, the role instances for a single application are
spread across different machines within this data center.
Once it’s created these instances, the fabric controller continues to monitor them. If an instance fails for
any reason—hardware or software—the fabric controller will start a new instance for that role. While
failures might cause an application’s instance count to temporarily drop below what the developer
requested, the fabric controller will always start new instances as needed to maintain the target number
for each of the application’s roles. And even though Figure 4 shows only Web and Worker roles, VM roles
are handled in the same way, with each of the role’s instances running on a different physical machine.
THE BENEFITS: IMPROVED ADMINISTRATION, AVAILABILITY, AND SCALABILITY
Applications built using the Windows Azure programming model can be easier to administer, more
available, and more scalable than those built on traditional Windows servers. These three attributes are
worth looking at separately.
The administrative benefits of Windows Azure flow largely from the fabric controller. Like every operating
system, Windows must be patched, as must other system software. In on-premises environments, doing
this typically requires some human effort. In Windows Azure, however, the process is entirely automated:
The fabric controller handles updates for Web and Worker role instances (although not for VM role
instances). When necessary, it also updates the underlying Windows servers those VMs run on. The result
is lower costs, since administrators aren’t needed to handle this function.
Lowering costs by requiring less administration is good. Helping applications be more available is also
good, and so the Windows Azure programming model helps improve application availability in
several ways. They are the following:
Protection against hardware failures. Because every application is made up of multiple instances of
each role, hardware failures—a disk crash, a network fault, or the death of a server machine—won’t
take down the application. To help with this, the fabric controller doesn’t choose machines for an
application’s instances at random. Instead, different instances of the same role are placed in different
fault domains. A fault domain is a set of hardware—computers, switches, and more—that share a
single point of failure. (For example, all of the computers in a single fault domain might rely on the
same switch to connect to the network.) Because of this, a single hardware failure can’t take down an
entire application. The application might temporarily lose some instances, but it will continue to
behave correctly.
Protection against software failures. Along with hardware failures, the fabric controller can also
detect failures caused by software. If the code in an instance crashes or the VM in which it’s running
goes down, the fabric controller will start either just the code or, if necessary, a new VM for that role.
While any work the instance was doing when it failed will be lost, the new instance will become part
of the application as soon as it starts running.
The ability to update applications with no application downtime. Whether for routine maintenance or to install a whole new version, every application needs to be updated. An application built using the
Windows Azure programming model can be updated while it’s running—there’s no need to take it
down. To allow this, different instances for each of an application’s roles are placed in different
update domains (which aren’t the same as the fault domains described earlier). When a new version
of the application needs to be deployed, the fabric controller can shut down the instances in just one
update domain, update the code for these, then create new instances from that new code. Once
those instances are running, it can do the same thing to instances in the next update domain, and so
on. While users might see different versions of the application during this process, depending on
which instance they happen to interact with, the application as a whole remains continuously
available.
The ability to update Windows and other supporting software with no application downtime. The
fabric controller assumes that every Windows Azure application follows the three rules listed earlier,
and so it knows that it can shut down some of an application’s instances whenever it likes, update the
underlying system software, then start new instances. By doing this in chunks, never shutting down
all of a role’s instances at the same time, Windows and other software can be updated beneath a
continuously running application.
Availability is important for most applications—software isn’t useful if it’s not running when you need it—
but scalability can also matter. The Windows Azure programming model helps developers build more
scalable applications in two main ways:
Automatically creating and maintaining a specified number of role instances. As already described, a
developer tells Windows Azure how many instances of each role to run, and the fabric controller
creates and monitors the requested instances. This makes application scalability quite
straightforward: Just tell Windows Azure what you need. Because this cloud platform runs in very
large data centers, getting whatever level of scalability an application needs isn’t generally a problem.
Providing a way to modify the number of executing role instances for a running application: For
applications whose load varies, scalability is more complicated. Setting the number of instances just
once isn’t a good solution, since different loads can make the ideal instance count go up or down
significantly. To handle this situation, Windows Azure provides both a Web portal for people and an
API for applications to allow changing the desired number of instances for each role while an
application is running. Making applications simpler to administer, more available, and more scalable is useful, and so using the Windows Azure programming model generally makes sense. But as mentioned earlier, it’s possible to run
applications on Windows Azure that don’t follow this model. Suppose, for example, that you build an
application using a single role (which is permitted) but then run only one instance of that role (violating
the second and third rules). You might do this to save money, since Windows Azure charges separately
for each running instance. Anybody who chooses this option should understand, however, that the fabric
controller won’t know that his application doesn’t follow all three rules. It will shut down this single
instance at unpredictable times to patch the underlying software, then restart a new one. To users, this
means that the application will go down from time to time, since there’s no other instance to take over.
This isn’t a bug in Windows Azure; it’s a fundamental aspect of how the technology works.
Getting all of the benefits that Windows Azure offers requires conforming to the rules of its programming
model. Moving existing applications from Windows Server to Windows Azure can require some work, a topic
addressed in more detail later in this paper. For new applications, however, the argument for using the
Windows Azure model is clear. Why not build an application that costs less to administer? Why not build an
application that need never go down? Why not build an application that can easily scale up and down? Over
time, it’s reasonable to expect more and more applications to be created using the Windows
Azure programming model.
IMPLICATIONS OF THE WINDOWS AZURE PROGRAMMING
MODEL: WHAT ELSE CHANGES?
Building applications for Windows Azure means following the three rules of its programming model.
Following these rules isn’t enough, though—other parts of a developer’s world must also adjust. The
changes the Windows Azure programming model brings to the broader development environment can be
grouped into three areas:
How role instances interact with the operating system.
How role instances interact with persistent storage.
How role instances interact with other role instances.
This section looks at all three.
INTERACTIONS WITH THE OPERATING SYSTEM
For an application running on a typical Windows Server machine, the administrator of that machine is in
control. She can reboot VMs or the machine they run on, install Windows patches, and do whatever
else is required to keep that computer available. In Windows Azure, however, all of the servers are
owned by the fabric controller. It decides when VMs or machines should be rebooted, and for Web and
Worker roles (although not for VM roles), the fabric controller also installs patches and other updates to
the system software in every instance.
This approach has real benefits, as already described. It also creates restrictions, however. Because the
fabric controller owns the physical and virtual machines that Windows Azure applications use, it’s free to
do whatever it likes with them. This implies that letting a Windows Azure application modify the system
it runs on—letting it run in administrator mode rather than user mode—presents some challenges. Since
the fabric controller can modify the operating system at will, there’s no guarantee that changes a role
instance makes to the system it’s running on won’t be overwritten. Besides, the specific virtual (and
physical) machines an application runs in change over time. This implies that any changes made to the
default local environment must be made each time a role instance starts running.
In its first release, Windows Azure simply didn’t allow applications to modify the systems they ran on—
applications only ran in user mode. This restriction has been relaxed—both Web and Worker roles now
give developers the option to run applications in admin mode—but the overall programming model hasn’t
changed. Anybody creating a Windows Azure application needs to understand what the fabric controller
is doing, then design applications accordingly.
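In practice, this means putting any local-environment setup in the role's startup path so it is re-applied on every (re)start. A minimal sketch using the SDK's RoleEntryPoint follows; the setup helper is hypothetical:

```csharp
using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // The fabric controller may move this instance to a fresh VM at any
        // time, so local changes must be re-applied on every start.
        ConfigureLocalEnvironment();
        return base.OnStart();
    }

    private void ConfigureLocalEnvironment()
    {
        // Hypothetical per-instance setup: create working directories,
        // adjust local settings, and so on.
    }
}
```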
INTERACTIONS WITH PERSISTENT STORAGE
Applications aren’t just code—they also use data. And just as the programming model must change to
make applications more available and more scalable, the way data is stored and accessed must also
change. The big changes are these:
Storage must be external to role instances. Even though each instance is its own VM with its own file
system, data stored in those file systems isn’t automatically made persistent. If an instance fails, any
data it contains may be lost. This implies that for applications to work correctly in the face of failures,
data must be stored persistently outside role instances. Another role instance can now access data
that otherwise would have been lost if that data had been stored locally on a failed instance.
Storage must be replicated. Just as a Windows Azure application runs multiple role instances to allow
for failures, Windows Azure storage must provide multiple copies of data. Without this, a single
failure would make data unavailable, something that’s not acceptable for highly available
applications.
Storage must be able to handle very large amounts of data. Traditional relational systems aren’t
necessarily the best choice for very large data sets. Since Windows Azure is designed in part for
massively scalable applications, it must provide storage mechanisms for handling data at this scale.
To allow this, the platform offers blobs for storing binary data along with a non-SQL approach called tables for storing large structured data sets.
Figure 5 illustrates these three characteristics, showing how Windows Azure storage looks to an application.
Figure 5: While applications see a single copy, Windows Azure storage replicates all blobs and tables three times.
In this example, a Windows Azure application is using two blobs and one table from Windows Azure
storage. The application sees each blob and table as a single entity, but under the covers, Windows Azure
storage actually maintains three instances of each one. These copies are spread across different physical
machines, and as with role instances, those machines are in different fault domains. This improves the
application’s availability, since data is still accessible even when some copies are unavailable. And because
persistent data is stored outside any of the application’s role instances, an instance failure loses only
whatever data it was using at the moment it failed.
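Accessing this storage from code is straightforward. Below is a sketch assuming the Windows Azure SDK's StorageClient library of this era; the container and blob names are illustrative, and the development-storage connection string stands in for a real account string kept in configuration:

```csharp
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class BlobExample
{
    static void Main()
    {
        // Development-storage account; a real connection string would
        // come from configuration.
        CloudStorageAccount account =
            CloudStorageAccount.Parse("UseDevelopmentStorage=true");

        // Write a blob: the platform replicates it three times behind the scenes.
        CloudBlobClient blobClient = account.CreateCloudBlobClient();
        CloudBlobContainer container = blobClient.GetContainerReference("photos");
        container.CreateIfNotExist();
        CloudBlob blob = container.GetBlobReference("customer-17/thumbnail.jpg");
        blob.UploadText("thumbnail bytes would go here in a real application");
    }
}
```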
The Windows Azure programming model requires an application to behave correctly when a role instance
fails. To do this, every instance in an application must store all persistent data in Windows Azure storage
or another external storage mechanism (such as SQL Azure, Microsoft’s cloud-based service for relational
data). There’s one more option worth mentioning, however: Windows Azure drives. As already described,
any data an application writes to the local file system of its own VM can be lost when that VM stops
running. Windows Azure drives change this, using a blob to provide persistent storage for the file system
of a particular instance. These drives have some limitations—only one instance at a time is allowed to
both read from and write to a particular Windows Azure drive, for example, with all other instances in this application allowed only read access—but they can be useful in some situations.
INTERACTIONS AMONG ROLE INSTANCES
When an application is divided into multiple parts, those parts commonly need to interact with one
another. In a Windows Azure application, this is expressed as communication between role instances.
For example, a Web role instance might accept requests from users, then pass those requests to a
Worker role instance for further processing. The way this interaction happens isn't identical to how it's done with ordinary Windows applications.
Once again, a key fact to keep in mind is that, most often, all instances of a particular role are
equivalent—they’re interchangeable. This means that when, say, a Web role instance passes work to a
Worker role instance, it shouldn’t care which particular instance gets the work. In fact, the Web role
instance shouldn’t rely on instance-specific things like a Worker role instance’s IP address to
communicate with that instance. More generic mechanisms are required.
The most common way for role instances to communicate in Windows Azure applications is through Windows Azure queues. Figure 6 illustrates the idea.
Figure 6: Role instances can communicate through queues, each of which replicates the messages it holds three times.
In the example shown here, a Web role instance gets work from a user of the application, such as a person
making a request from a browser (step 1). This instance then creates a message containing this work and
writes it into a Windows Azure queue (step 2). These queues are implemented as part of Windows Azure
storage, and so like blobs and tables, each queue is replicated three times, as the figure
shows. As usual, this provides fault-tolerance, ensuring that the queue’s messages are still available if a failure occurs.
Next, a Worker role instance reads the message from the queue (step 3). Notice that the Web role
instance that created this message doesn’t care which Worker role instance gets it—in this application,
they’re all equivalent. That Worker role instance does whatever work the message requires (step 4),
then deletes the message from the queue (step 5).
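A sketch of steps 2 through 5 using the StorageClient library of this era (the queue name and message contents are illustrative):

```csharp
using System;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class QueueExample
{
    static void Main()
    {
        CloudStorageAccount account =
            CloudStorageAccount.Parse("UseDevelopmentStorage=true");
        CloudQueue queue =
            account.CreateCloudQueueClient().GetQueueReference("workitems");
        queue.CreateIfNotExist();

        // Step 2: the Web role instance writes a message describing the work.
        queue.AddMessage(new CloudQueueMessage("reformat-video:42"));

        // Step 3: some Worker role instance reads it. The message becomes
        // invisible to other readers for the visibility timeout.
        CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromMinutes(5));

        // Step 4: do the work the message describes.
        // ...

        // Step 5: explicitly delete the message. If the instance crashes
        // before this line, the message reappears after the timeout and
        // another instance processes it (at-least-once delivery).
        queue.DeleteMessage(msg);
    }
}
```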
This last step—explicitly removing the message from the queue—is different from what on-premises
queuing technologies typically do. In Microsoft Message Queuing (MSMQ), for example, an application can
do a read inside an atomic transaction. If the application fails before completing its work, the transaction
aborts, and the message automatically reappears on the queue. This approach guarantees that every
message sent to an MSMQ queue is delivered exactly once in the order in which it was sent.
Windows Azure queues don’t support transactional reads, and so they don’t guarantee exactly-once,
in-order delivery. In the example shown in Figure 6, for instance, the Worker role instance might finish
processing the message, then crash just before it deletes this message from the queue. If this happens,
the message will automatically reappear after a configurable timeout period, and another Worker role
instance will process it. Unlike MSMQ, Windows Azure queues provide at-least-once semantics: A
message might be read and processed one or more times.
This raises an obvious question: Why don’t Windows Azure queues support transactional reads? The
answer is that transactions require locking, and so they necessarily slow things down (especially with the
message replication provided by Windows Azure queues). Given the primary goals of the platform, its
designers opted for the fastest, most scalable approach.
Most of the time, queues are the best way for role instances within an application to communicate. It’s
also possible for instances to interact directly, however, without going through a queue. To allow this,
Windows Azure provides an API that lets an instance discover all other instances in the same application
that meet specific requirements, then send a request directly to one of those instances. In the most
common case, where all instances of a particular role are equivalent, the caller should choose a target
instance randomly from the set the API returns. This isn’t always true—maybe a Worker role
implements an in-memory cache with each role instance holding specific data, and so the caller must
access a particular one. Most often, though, the right approach is to treat all instances of a role as
interchangeable.
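A sketch of that discovery API from the SDK's ServiceRuntime library; the role and endpoint names are illustrative and assume an internal endpoint declared in the service definition:

```csharp
using System;
using System.Net;
using Microsoft.WindowsAzure.ServiceRuntime;

class InstanceDiscovery
{
    static readonly Random random = new Random();

    // Pick one instance of a role at random, since equivalent instances
    // are interchangeable.
    static IPEndPoint PickCacheEndpoint()
    {
        var instances = RoleEnvironment.Roles["CacheWorker"].Instances;
        RoleInstance target = instances[random.Next(instances.Count)];
        return target.InstanceEndpoints["CachePort"].IPEndpoint;
    }
}
```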
MOVING WINDOWS SERVER APPLICATIONS TO WINDOWS
AZURE
Anybody building a new Windows Azure application should follow the rules of the Windows Azure
programming model. To move an existing application from Windows Server to Windows Azure,
however, that application should also be made to follow the same rules. In addition, the application
might need to change how it interacts with the operating system, how it uses persistent storage, and the
way its components interact with each other.
How easy it is to make these changes depends on the application. Here are a few representative examples:
An ASP.NET application with multiple load-balanced instances that share state stored in SQL Server.
This kind of application typically ports easily to Windows Azure, with each instance of the original
application becoming an instance of a Web or Worker role. Applications like this don’t use sticky
sessions, which helps make them a good fit for Windows Azure. (Using ASP.NET session state is
acceptable, however, since Windows Azure provides an option to store session state persistently in
Windows Azure Storage tables.) And moving an on-premises SQL Server database to SQL Azure is
usually straightforward.
An ASP.NET application with multiple instances that maintains per-instance state and relies on sticky
sessions. Because it maintains client-specific state in each instance between requests, this application
will need some changes. Windows Azure doesn’t support sticky sessions, and so making the
application run on this cloud platform will require redesigning how it handles state.
A Silverlight or Windows Presentation Foundation (WPF) client that accesses WCF services running in
a middle tier. If the services don’t maintain per-client state between calls, moving them to Windows
Azure is straightforward. The client will continue to run on user desktops, as always, but it will now
call services running on Windows Azure. If the current services do maintain per-client state, however,
they’ll need to be redesigned.
An application with a single instance running on Windows Server that maintains state on its own
machine. Whether the clients are browsers or something else, many enterprise applications are built
this way today, and they won’t work well on Windows Azure without some redesign. It might be
possible to run this application unchanged in a single VM role instance, but its users probably won’t
be happy with the results. For one thing, the Windows Azure SLA doesn’t apply to applications with
only a single instance. Also, recall that the fabric controller can at any time reboot the machine on
which this instance runs to update that machine’s software. The application has no control over when
this happens; it might be smack in the middle of a workday. Since there’s no second instance to take
over—the application wasn’t built to follow the rules of the Windows Azure programming model—it
will be unavailable for some period of time, and so anybody using the application will have their work
interrupted while the machine reboots. Even though the VM role makes it easy to move a Windows
Server binary to Windows Azure, this doesn’t guarantee that the application will run successfully in its
new home. The application must also conform to the rules of the Windows Azure programming
model.
A Visual Basic 6 application that directly accesses a SQL Server database, i.e., a traditional
client/server application. Making this application run on Windows Azure will most likely require
rewriting at least the client business logic. While it might be possible to move the database (including
any stored procedures) to SQL Azure, then redirect the clients to this new location, the application’s
desktop component won’t run as is on Windows Azure. Windows Azure doesn’t provide a local user
interface, and it also doesn’t support using Remote Desktop Services (formerly Terminal Services) to
provide remote user interfaces.
Windows Azure can help developers create better applications. Yet the improvements it offers require
change, and so moving existing software to this new platform can take some effort. Making good
decisions requires understanding both the potential business value and any technical challenges that
moving an application to Windows Azure might bring.
CONCLUSION
Cloud platforms are a new world, and they open new possibilities. Reflecting this, the Windows Azure
programming model helps developers create applications that are easier to administer, more available, and
more scalable than those built in the traditional Windows Server environment. Doing this requires following
three rules:
A Windows Azure application is built from one or more roles.
A Windows Azure application runs multiple instances of each role.
A Windows Azure application behaves correctly when any role instance fails.
Using this programming model successfully also requires understanding the changes it brings to how applications
interact with the operating system, use persistent storage, and communicate between role instances. For
developers willing to do this, however, the value is clear. While it’s not right for every scenario, the Windows
Azure programming model can be useful for anybody who wants to create easier to administer, more available,
and more scalable applications.
FOR FURTHER READING
Introducing Windows Azure: http://go.microsoft.com/?linkid=9682907
Introducing the Windows Azure Platform: http://go.microsoft.com/?linkid=9752185