Microsoft Office 365 Service Reliability and Disaster Recovery
Kumar Venkateswar, Sr. Program Manager, Microsoft Corporation
OSP324
Session Objectives and Takeaways
Session Objective(s):
• Learn how Office 365 is engineered for reliability and predictable service delivery.
• Understand the customer-facing information that provides insight into service availability and interruptions.
• Describe the process of continuous improvement of service availability.
Takeaways:
• The picture is more nuanced than the numbers.
• Office 365 provides end-to-end reliability through a thoughtful product and service offering.
Agenda
• What is the uptime in Office 365?
• Why is it good? What does Microsoft do to make sure it is?
• How do these numbers translate into my organization?
• What happens when I have an outage?
• How does our approach differ from our competitors?
• What’s next? How does Microsoft make sure it keeps getting better?
Uptime in Office 365 – The Number
99.9%
Microsoft Has Years of Experience Running Infrastructure at High Scale
Hotmail
1997
Windows Update
1995
Bing / MSN search
1998
Xbox Live
2002
Exchange Hosted Services (now part of Office 365)
2005
Microsoft Office 365 offers the most resilient and predictable service availability experience for the cloud, backed by the most responsive support available and a comprehensive 99.9% financially backed SLA.
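To make the 99.9% figure concrete, the sketch below converts an uptime percentage into a monthly downtime budget. It assumes a 30-day month and a simple wall-clock reading of the percentage; the function name `downtime_budget_minutes` is illustrative and not part of any Microsoft tooling.

```python
def downtime_budget_minutes(uptime_percent: float, days_in_month: int = 30) -> float:
    """Return the minutes of downtime allowed in a month at a given uptime level."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    # Compare a few common availability levels side by side.
    for level in (99.0, 99.9, 99.99):
        minutes = downtime_budget_minutes(level)
        print(f"{level}% uptime allows {minutes:.1f} minutes of downtime per 30-day month")
```

Under these assumptions, 99.9% corresponds to roughly 43 minutes of downtime in a 30-day month.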
The Many Faces of Resilience
As described by Madni and Jackson
Resilience
• Reconfiguration
• Absorption
• Restoration
• Anticipation
Many Design Heuristics for Implementing Resilience Exist
As described by Madni and Jackson
• Functional redundancy
• Physical redundancy
• Reorganization
• Human backup
• “Human-in-loop”
• Predictability
• Complexity avoidance
• Context spanning
• Graceful degradation
• Drift correction
• “Neutral” state
• Inspectability
• Intent awareness
• Learning/Adaptation
Design of Office 365
Functional Redundancy
• Online and offline functionality, so that work can continue even in the face of failures
Physical Redundancy
• Office 365 provides physical redundancy at multiple levels to protect against hardware failures
• Data in transit and at rest
• Network and hardware redundancy
• Facilities and power redundancy
• At least 2 datacenters per region
Design of Office 365
Reorganization
• Active load balancing to restructure the system against rare extreme load conditions
• Response to hardware failures
Design of Office 365
Human Backup and “Human-in-Loop”
• The monitoring system attempts automated recovery actions and alerts the 24x7 on-call engineer when recovery does not succeed
• On-call engineers are core product group members (dev, test, and PM) in the area relevant to the alert, which enables rapid response and relevant information collection
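The recover-first, escalate-second pattern described on this slide can be sketched as follows. This is a minimal illustration, not the Office 365 monitoring system: `handle_alert`, the health probe, the recovery actions, and the paging callback are all hypothetical names supplied by the caller.

```python
from typing import Callable, List


def handle_alert(
    is_healthy: Callable[[], bool],
    recovery_actions: List[Callable[[], None]],
    page_on_call: Callable[[str], None],
) -> str:
    """Try each automated recovery action in turn; page a human only if none restores health."""
    if is_healthy():
        return "healthy"
    for action in recovery_actions:
        action()
        if is_healthy():
            return "recovered"
    # Automation is exhausted, so hand the problem to the on-call engineer.
    page_on_call("automated recovery did not succeed")
    return "escalated"


if __name__ == "__main__":
    service = {"up": False}
    pages: List[str] = []

    def restart_service() -> None:
        # Hypothetical recovery action that happens to fix the problem.
        service["up"] = True

    outcome = handle_alert(lambda: service["up"], [restart_service], pages.append)
    print(outcome, pages)
```

Keeping the human as the last step means the engineer is paged only for problems that automation could not resolve.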
Design of Office 365
Inspectability and Predictability
• Detailed logging and tracing to avoid unnecessary assumptions by on-call engineers
• Deviations from normal behavior deliver alerts to on-call engineers, enabling relevant information collection and rapid resolution
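One simple way to flag a "deviation from normal behavior" is to compare the latest measurement with the mean and standard deviation of recent history. The slide does not say which detection method Office 365 uses, so the z-score test and the threshold of 3 below are assumptions chosen purely for illustration.

```python
from statistics import mean, stdev
from typing import Sequence


def deviates_from_normal(history: Sequence[float], latest: float, threshold: float = 3.0) -> bool:
    """Return True when `latest` is more than `threshold` standard deviations from the historical mean."""
    if len(history) < 2:
        return False  # not enough data to define "normal"
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is a deviation.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


if __name__ == "__main__":
    latency_ms = [100, 102, 98, 101, 99]  # made-up sample data
    print(deviates_from_normal(latency_ms, 101))
    print(deviates_from_normal(latency_ms, 130))
```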
Design of Office 365
Complexity Avoidance
• Standardized hardware and automated deployment model
Context Spanning
• Recovery across “failure domains” tested regularly, including regional disasters
• Service component isolation to avoid failure cascades
Graceful Degradation and Drift Correction
• Built-in workload management mechanisms to avoid catastrophic impact and degrade gracefully
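Workload management for graceful degradation can take many forms. The sketch below shows one of them, priority-based admission, in which low-priority work is deferred when capacity is short so that important requests keep succeeding. The priority scheme, the `Request` type, and the capacity units are hypothetical and are not taken from the Office 365 implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    name: str
    priority: int  # lower number = more important (hypothetical scheme)
    cost: int      # abstract units of capacity the request consumes


def admit(requests: List[Request], capacity: int) -> Tuple[List[Request], List[Request]]:
    """Admit requests in priority order while capacity remains; defer the rest."""
    admitted: List[Request] = []
    deferred: List[Request] = []
    remaining = capacity
    for req in sorted(requests, key=lambda r: r.priority):
        if req.cost <= remaining:
            admitted.append(req)
            remaining -= req.cost
        else:
            deferred.append(req)
    return admitted, deferred


if __name__ == "__main__":
    work = [Request("index rebuild", 3, 5), Request("send mail", 1, 3), Request("read mail", 1, 3)]
    ok, later = admit(work, capacity=7)
    print([r.name for r in ok], [r.name for r in later])
```

Under load, the interactive requests are served and the background job waits, which is a degraded but still usable service rather than a failed one.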
What is Uptime?
The Office 365 service level agreement expresses uptime as a percentage weighted by both outage duration and the number of users affected.
Why?
• Longer outages have a greater impact on the percentage.
• Outages that affect a greater number of users have a greater impact.
• More severe outages, in terms of users or duration, lead to greater deviations from 100%, which is desirable for service credit remedies.
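The weighting described in the bullets above can be sketched in a few lines. The formula on the original slide is not preserved here, so this sketch assumes the form used in the published Microsoft Online Services SLA: (User Minutes − Downtime) / User Minutes × 100, where downtime is the sum over incidents of duration multiplied by users impacted. The function name and sample numbers are illustrative.

```python
from typing import Iterable, Tuple


def monthly_uptime_percent(user_minutes: float, incidents: Iterable[Tuple[float, int]]) -> float:
    """Compute user-minute weighted uptime for one month.

    user_minutes: minutes in the month multiplied by the total number of users.
    incidents: (duration_in_minutes, users_impacted) pairs.
    """
    downtime = sum(duration * users for duration, users in incidents)
    return (user_minutes - downtime) / user_minutes * 100.0


if __name__ == "__main__":
    total = 1000 * 30 * 24 * 60  # 1,000 users over a 30-day month
    # The same 60-minute outage counts ten times more when it hits every user.
    print(monthly_uptime_percent(total, [(60, 100)]))
    print(monthly_uptime_percent(total, [(60, 1000)]))
```

A 60-minute incident that reaches 100 of 1,000 users costs 6,000 user-minutes; the same incident reaching all users costs 60,000, so it moves the percentage ten times further from 100%.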
What is Uptime?
The aggregate uptime of service components can also be expressed in a similar way.
• The objective is to describe the risk of outage to an individual customer based on the aggregate uptime of the service.
• CAUTION: This does not capture the full risk picture!
Looking Behind the Uptime Numbers
What are the caveats with this aggregate uptime number, particularly when it is used to compare different services?
• Downtime that is not dependent on the number of users is still adjusted by the number of users.
• The aggregate uptime is heavily dependent on the definition of downtime.
• Different cloud services provide different functionality, making uptime hard to compare.
• Productivity loss due to downtime differs by service.
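A small worked example shows why an aggregate, user-weighted number can differ sharply from what one customer experiences. The tenant names, user counts, and outage length below are invented for illustration, and the calculation assumes a 30-day month (43,200 minutes).

```python
from typing import Dict, Tuple


def uptime_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Plain uptime percentage for a given amount of downtime."""
    return (total_minutes - downtime_minutes) / total_minutes * 100.0


def aggregate_and_per_tenant(
    users: Dict[str, int],
    downtime: Dict[str, float],
    minutes_in_month: float = 43_200,
) -> Tuple[float, Dict[str, float]]:
    """Return (user-weighted aggregate uptime, uptime as seen by each tenant)."""
    total_user_minutes = sum(users.values()) * minutes_in_month
    lost_user_minutes = sum(users[t] * downtime.get(t, 0.0) for t in users)
    aggregate = uptime_percent(total_user_minutes, lost_user_minutes)
    per_tenant = {t: uptime_percent(minutes_in_month, downtime.get(t, 0.0)) for t in users}
    return aggregate, per_tenant


if __name__ == "__main__":
    users = {"large_tenant": 9900, "small_tenant": 100}
    downtime = {"small_tenant": 432.0}  # minutes of outage affecting only the small tenant
    print(aggregate_and_per_tenant(users, downtime))
```

In this made-up case the service as a whole reports about 99.99%, while the small tenant sees 99.0%, which is the kind of gap the caveats above warn about.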
Office 365 Service Communication Experiences
• Planned Maintenance: Notification of planned service maintenance, including transitions/upgrades, repair, and update scenarios.
• Service Alteration: Notification about changes to service features, capabilities, or business terms of service.
• Service Incident: Notification regarding major service-interrupting incidents.
• Account Lifecycle: Notification of milestones in the subscription lifecycle.
Channels
To: Customer Email
Service Health Dashboard
Community
Post Incident Review
Experiences When Customers’ Access to Services Is Impacted
Office 365 Service Level Agreement
Backed by the most responsive support available and a comprehensive 99.9% financially backed SLA
Service Incident Communication Flow
Incident Identification → Ongoing Communication → Post Incident Wrap Up
Service Incident Communication Channels
• Facebook
• Twitter
• Email
• Service Health Dashboard
• Community
• Post Incident Review
• RSS Feed
Additional Actions
Roles and Responsibilities
Incident Manager (IM)
• IMs are on-call engineers and core product group members with deep expertise in their relevant area
• Determine scope of the outage
• Determine root cause and the fastest and best path to resolution
Communication Manager (CM)
• CMs are on-call engineers and core product group members with deep expertise in their relevant area
• Coordinate and supply issue information across internal teams
• Post customer communications to the Service Health Dashboard (SHD)
Support
• Update internal support communications
• Monitor support channels
Taxonomy for Service Incident Status
Each status below has a corresponding icon on the Service Health Dashboard (SHD).
• Investigating: Monitors have indicated a service anomaly, and/or we have received reports of a potential service incident, and we are currently investigating the reports.
• Service Interruption: We have confirmed that the normal services are being impacted. We are taking immediate action to understand the cause of the failure and determine the best course of action to restore service(s).
• Degraded Service: The services are currently experiencing degraded performance due to a service incident. Services are still active, but service responsiveness and/or delivery times may be slower than usual. We are currently working to restore normal service responsiveness.
• Restoring Service: We have isolated the likely cause of the incident and are in the process of restoring normal services.
• Extended Recovery: System services are restored. Due to an existing backlog of items, the services may be slower than usual while the backlog clears.
• Service Restored: Normal system services have been restored.
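For anyone tracking incidents in their own tooling, the six statuses in this taxonomy map naturally onto an enumeration. The sketch below is one possible encoding; the class name, the `is_active` helper, and the rule that every status except "Service Restored" counts as active are assumptions made for illustration, not an official Office 365 API.

```python
from enum import Enum


class IncidentStatus(Enum):
    INVESTIGATING = "Investigating"
    SERVICE_INTERRUPTION = "Service Interruption"
    DEGRADED_SERVICE = "Degraded Service"
    RESTORING_SERVICE = "Restoring Service"
    EXTENDED_RECOVERY = "Extended Recovery"
    SERVICE_RESTORED = "Service Restored"


# Assumption: an incident is treated as active until it reaches "Service Restored".
ACTIVE_STATUSES = {s for s in IncidentStatus if s is not IncidentStatus.SERVICE_RESTORED}


def is_active(status: IncidentStatus) -> bool:
    """Return True while the incident has not yet reached 'Service Restored'."""
    return status in ACTIVE_STATUSES


if __name__ == "__main__":
    # Look a status up by the label shown on the dashboard.
    current = IncidentStatus("Extended Recovery")
    print(current.name, is_active(current))
```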
O365 Service Incident Notification Process
• Incident occurs
• Service Health Dashboard (SHD) updated to “Investigating”
• Incident status posted to SHD
• SHD updated until service restoration
• Closure summary posted to SHD
• Post Incident Review posted to SHD within 5 business days
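The "within 5 business days" commitment can be turned into a concrete due date with a small helper. The sketch assumes that business days are Monday through Friday and ignores public holidays; the slide does not state the exact starting point of the five days, so the start date passed in below is only an example.

```python
from datetime import date, timedelta


def add_business_days(start: date, business_days: int) -> date:
    """Return the date `business_days` working days after `start`, skipping Saturdays and Sundays."""
    current = start
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


if __name__ == "__main__":
    start = date(2012, 6, 27)  # a Wednesday, used as an example start date
    print(add_business_days(start, 5))
```

Starting from Wednesday 27 June 2012, five business days later is Wednesday 4 July 2012, because the intervening weekend is skipped.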
Office 365 Planned Maintenance Communication
Type: Description (Channel)
• Planned Maintenance Update: 5-day prior notification of planned service maintenance that falls within approved maintenance timeframes. (Channel: Service Health Dashboard)
• Planned Maintenance Update (Outside Window): Notification of planned service maintenance that falls outside the approved maintenance timeframes. (Channel: Service Health Dashboard, To: Customer Email)
• Transitions / Upgrades: Notification of service transitions and/or upgrades. (Channel: Service Health Dashboard, To: Customer Email)
Planned maintenance statuses (each has a corresponding SHD icon):
• Scheduled (5 business days advance notice): The planned maintenance activity has been scheduled.
• In Progress: The planned maintenance activity is in progress. Please see the details for the expected time for completion.
• Completed: The planned maintenance activity is complete.
• Postponed: The planned maintenance activity has been postponed. Please see the details regarding the updated schedule.
• Cancelled: The planned maintenance activity has been cancelled.
Productivity is Our Core Business
Microsoft is the company that businesses look to for the software they need to boost productivity and operate with efficiency, effectiveness, and intelligence.
Microsoft Office Division, the division that produces Office 365, produced over half of Microsoft’s operating income. For some of our competitors, productivity is a minor side business at best.
Office Software Stack Provides Resilience…
To Network Interruptions…
To Cloud Disruptions…
To The Realities Of Business Life
More Than Just Login
The Office 365 service level agreement covers all services – no exceptions!
The definition of downtime for Office 365 is more than the “server-side error rate” – it covers real functionality, such as when users are unable to read, write, access, send, or receive data.
Every User Counts
The Office 365 service level agreement refers to all end users, not just those exceeding a particular threshold.
Testing, Not Wishful Thinking
The recovery time objective (RTO) and recovery point objective (RPO) are based on regular verification and what we believe we can deliver in a real disaster.
Some of our competitors claim a zero RTO and zero RPO, even if they have needed to restore from tape in the past!
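Verifying objectives by testing, rather than asserting them, comes down to measuring a recovery drill and comparing the result with the stated RTO and RPO. The sketch below shows that comparison; the `DrillResult` fields and the objective values in the example are hypothetical, since the slides do not publish specific RTO or RPO figures.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    failure_time: datetime          # when the simulated disaster began
    service_restored: datetime      # when service was available again
    last_recovered_write: datetime  # timestamp of the newest data present after recovery


def meets_objectives(result: DrillResult, rto: timedelta, rpo: timedelta) -> bool:
    """Compare measured recovery time and data-loss window against the stated objectives."""
    recovery_time = result.service_restored - result.failure_time
    data_loss_window = result.failure_time - result.last_recovered_write
    return recovery_time <= rto and data_loss_window <= rpo


if __name__ == "__main__":
    drill = DrillResult(
        failure_time=datetime(2012, 6, 27, 12, 0),
        service_restored=datetime(2012, 6, 27, 12, 45),
        last_recovered_write=datetime(2012, 6, 27, 11, 58),
    )
    # Hypothetical objectives: recover within 1 hour, lose at most 5 minutes of data.
    print(meets_objectives(drill, rto=timedelta(hours=1), rpo=timedelta(minutes=5)))
```

A claim of zero RTO and zero RPO would require both measured values to be zero in every drill, which is why objectives grounded in regular verification tend to be more modest.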
Microsoft is Committed to Cloud Productivity Services
Office 365 Service Availability Experience is Good at the Start…
Commitment to Continuous Improvement
• Biweekly service updates
• Feature and capability releases every 90 days
• Major feature and capability releases every 12-24 months
Quarterly Verification
Anomaly Detection
Improvement Development
We’re Not Planning On Stopping
Microsoft Office 365 will continue to offer the most resilient and predictable service availability experience for the cloud.
Backed by the most responsive support available and the most comprehensive financially backed SLA to reflect our commitment to meet your service availability needs.
In Review: Session Objectives and Takeaways
Session Objective(s):
• Learn how Office 365 is engineered for reliability and predictable service delivery.
• Understand the customer-facing information that provides insight into service availability and interruptions.
• Describe the process of continuous improvement of service availability.
Takeaways:
• The picture is more nuanced than the numbers.
• Office 365 provides end-to-end reliability through a thoughtful product and service offering.
Related Resources
Office 365 TechCenter: technet.microsoft.com/Office365
Office Client TechCenter: technet.microsoft.com/office
Office, Office 365 and SharePoint Demo Area Includes:
• Office 365 IT Pro Command Center
• Office 365 Data Center Exhibit
Resources
Connect. Share. Discuss.
http://europe.msteched.com
Learning
Microsoft Certification & Training Resources
www.microsoft.com/learning
TechNet
Resources for IT Professionals
http://microsoft.com/technet
Resources for Developers
http://microsoft.com/msdn
© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.