Microsoft Office 365 Service Reliability and Disaster Recovery Kumar Venkateswar, Sr. Program Manager Microsoft Corporation OSP324.

Apr 01, 2015

Kelsi Bowell
Transcript
Page 1: Microsoft Office 365 Service Reliability and Disaster Recovery Kumar Venkateswar, Sr. Program Manager Microsoft Corporation OSP324.
Page 2:

Microsoft Office 365 Service Reliability and Disaster Recovery

Kumar Venkateswar, Sr. Program Manager, Microsoft Corporation

OSP324

Page 3:

Session Objectives and Takeaways

Session Objective(s):
• Learn how Office 365 is engineered for reliability and predictable service delivery.
• Understand the customer-facing information that provides insight into service availability and interruptions.
• Describe the process of continuous improvement of service availability.

Takeaways:
• The picture is more nuanced than the numbers.
• Office 365 provides end-to-end reliability through a thoughtful product and service offering.

Page 4:

Agenda

• What is the uptime in Office 365?
• Why is it good? What does Microsoft do to make sure it is?
• How do these numbers translate into my organization?
• What happens when I have an outage?
• How does our approach differ from our competitors?
• What’s next? How does Microsoft make sure it keeps getting better?

Page 6:

Uptime in Office 365 – The Number

99.9%
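To put the number in perspective, here is a minimal sketch of the arithmetic. It assumes a 30-day measurement period, which the slide does not state, and the function name is purely illustrative.

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted in a period at a given uptime percentage."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


# At 99.9% over an assumed 30-day month, the budget is roughly 43 minutes.
print(round(allowed_downtime_minutes(99.9), 1))
```

Each additional "nine" shrinks the budget by a factor of ten, which is why the definition of downtime matters as much as the headline percentage.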

Page 8:

Microsoft Has Years of Experience Running Infrastructure at High Scale

• Hotmail: 1997
• Windows Update: 1995
• Bing / MSN search: 1998
• Xbox Live: 2002
• Exchange Hosted Services (now part of Office 365): 2005

Page 9:

Microsoft Office 365 offers the most resilient and predictable service availability experience for the cloud, backed by the most responsive support available and a comprehensive 99.9% financially backed SLA.

Page 10:

The Many Faces of Resilience

As described by Madni and Jackson

Resilience

• Anticipation
• Absorption
• Reconfiguration
• Restoration

Page 11:

Many Design Heuristics for Implementing Resilience Exist

As described by Madni and Jackson

• Functional redundancy
• Physical redundancy
• Reorganization
• Human backup
• “Human-in-loop”
• Predictability
• Complexity avoidance
• Context spanning
• Graceful degradation
• Drift correction
• “Neutral” state
• Inspectability
• Intent awareness
• Learning/Adaptation

Page 12:

Design of Office 365

Functional Redundancy

• Online and offline functionality, to provide continued functionality even in the face of failures

Physical Redundancy

• Office 365 provides physical redundancy at multiple levels to protect against hardware failures:
  • Data in transit and at rest
  • Network and hardware redundancy
  • Facilities and power redundancy
  • At least 2 datacenters per region

Page 13:

Design of Office 365

Reorganization

• Active load balancing to restructure the system against rare extreme load conditions
• Response to hardware failures

Page 14:

Design of Office 365

Human Backup and “Human-in-Loop”

• The monitoring system attempts automated recovery actions, and alerts the 24x7 on-call engineer when recovery does not succeed
• On-call engineers are core product group members (dev, test, and PM) in the area relevant to the alert, for rapid response and relevant information collection
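The recover-automatically-then-escalate pattern described on this slide can be sketched as follows. This is an illustration only, not Microsoft's monitoring system: `handle_alert`, the recovery callables, and the alert name are all hypothetical.

```python
from typing import Callable, List


def handle_alert(recovery_actions: List[Callable[[], bool]],
                 page_on_call: Callable[[str], None],
                 alert_name: str) -> bool:
    """Try each automated recovery action in order; page a human if none succeeds."""
    for action in recovery_actions:
        try:
            if action():
                return True  # recovered automatically, no human needed
        except Exception:
            continue  # an action that raises is treated as unsuccessful
    page_on_call(alert_name)  # human backup: escalate to the on-call engineer
    return False


# Hypothetical usage: both automated actions fail, so the engineer is paged.
pages: List[str] = []
recovered = handle_alert([lambda: False, lambda: False],
                         pages.append, "mailbox-db-unhealthy")
print(recovered, pages)
```

Keeping the human as the last step rather than the first is what lets routine failures resolve without waking anyone, while still guaranteeing that unresolved ones reach a person.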

Page 15:

Design of Office 365

Inspectability and Predictability

• Detailed logging and tracing to avoid unnecessary assumptions by on-call engineers
• Deviations from normal behavior deliver alerts to on-call engineers, enabling relevant information collection and rapid resolution
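The slide does not say how "deviations from normal behavior" are detected. As one simple illustration, the sketch below flags a metric sample that lies several standard deviations from its recent history; the z-score rule, the threshold of 3, and the function name are assumptions made for this example.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], latest: float,
                 threshold: float = 3.0) -> bool:
    """Flag `latest` if it is more than `threshold` standard deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to define "normal"
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # any change from a perfectly flat baseline is a deviation
    return abs(latest - mu) / sigma > threshold


# Hypothetical latency samples in milliseconds
baseline = [100, 102, 98, 101, 99]
print(is_anomalous(baseline, 101), is_anomalous(baseline, 140))
```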

Page 16:

Design of Office 365

Complexity Avoidance

• Standardized hardware and automated deployment model

Context Spanning

• Recovery across “failure domains” tested regularly, including regional disasters
• Service component isolation to avoid failure cascades

Graceful Degradation and Drift Correction

• Built-in workload management mechanisms to avoid catastrophic impact and degrade gracefully
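One common way to degrade gracefully under overload is to keep serving the most important work and defer the rest. The sketch below illustrates that idea only; the priority scheme, request names, and `admit_requests` function are hypothetical and are not taken from Office 365's workload management.

```python
from typing import List, Tuple


def admit_requests(requests: List[Tuple[str, int]],
                   capacity: int) -> Tuple[List[str], List[str]]:
    """Admit the highest-priority requests up to `capacity`; defer the rest.

    Each request is (name, priority); a lower number means more important.
    """
    # sorted() is stable, so arrival order is preserved within a priority level
    ordered = sorted(requests, key=lambda r: r[1])
    admitted = [name for name, _ in ordered[:capacity]]
    deferred = [name for name, _ in ordered[capacity:]]
    return admitted, deferred


incoming = [("search-index", 2), ("send-mail", 0), ("read-mail", 0), ("report", 3)]
print(admit_requests(incoming, capacity=2))
```

Under this scheme an overloaded system slows background work first, so users keep core functionality instead of experiencing a full outage.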

Page 18:

What is Uptime?

The Office 365 service level agreement expresses uptime in this way:

Why?

• Longer outages have a greater impact on the percentage.
• Outages that affect a greater number of users have a greater impact.
• More severe outages, in terms of users or duration, lead to greater deviations from 100%, which is desirable for determining service credit remedies.
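The slide's formula is not reproduced in this transcript. The following is a minimal sketch consistent with the bullets on this slide, under the assumption that uptime is weighted by user-minutes (downtime counted as outage minutes multiplied by users affected). The function and parameter names are illustrative, and the SLA text itself remains the authoritative definition.

```python
from typing import Iterable, Tuple


def monthly_uptime_percent(total_users: int,
                           minutes_in_month: int,
                           incidents: Iterable[Tuple[int, int]]) -> float:
    """Uptime as a percentage of user-minutes.

    Each incident is (duration_minutes, users_affected).
    """
    user_minutes = total_users * minutes_in_month
    downtime = sum(duration * users for duration, users in incidents)
    return 100.0 * (user_minutes - downtime) / user_minutes


# Hypothetical tenant: 1,000 users, 30-day month, one 60-minute outage for 500 users
print(monthly_uptime_percent(1000, 30 * 24 * 60, [(60, 500)]))
```

Doubling either the duration or the number of affected users doubles the deviation from 100%, which matches the behavior the bullets describe.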

Page 19:

What is Uptime?

The aggregate uptime of service components can also be expressed similarly:

• The objective is to describe the risk of outage to an individual customer based on the aggregate uptime of the service.
• CAUTION – This does not capture the full risk picture!

Page 20:

Looking Behind the Uptime Numbers

What are the caveats with this aggregate uptime number, particularly when it is used to compare different services?

• Downtime that is not dependent on the number of users is still adjusted by the number of users.
• The aggregate uptime is heavily dependent on the definition of downtime.
• Different cloud services provide different functionality, making uptime hard to compare.
• Productivity loss due to downtime differs by service.
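A small worked example shows why an aggregate figure can hide an individual customer's experience. It assumes user-minute weighting and a 30-day month; the tenants, user counts, and outage length are hypothetical.

```python
MINUTES_IN_MONTH = 30 * 24 * 60  # assumes a 30-day month


def uptime_percent(users: int, downtime_user_minutes: float) -> float:
    """Uptime as a percentage of the user-minutes available to `users`."""
    return 100.0 * (1 - downtime_user_minutes / (users * MINUTES_IN_MONTH))


tenants = {"A": 900, "B": 100}       # hypothetical tenants and user counts
outage_minutes = {"A": 0, "B": 432}  # only tenant B is affected

per_tenant = {name: uptime_percent(users, outage_minutes[name] * users)
              for name, users in tenants.items()}
aggregate = uptime_percent(
    sum(tenants.values()),
    sum(outage_minutes[name] * users for name, users in tenants.items()))
print(per_tenant, aggregate)
```

In this scenario the service as a whole lands at about 99.9%, while tenant B experiences about 99.0%: the same incident looks ten times worse from inside the affected organization.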

Page 22:

Office 365 Service Communication Experiences

• Planned Maintenance: Notification of planned service maintenance, including transitions/upgrades, repair, and update scenarios.
• Service Alteration: Notification about changes to service features, capabilities, or business terms of service.
• Service Incident: Notification regarding major service-interrupting incidents.
• Account Lifecycle: Notification of milestones in the subscription lifecycle.

Channels

• Facebook
• Twitter
• Email (To: Customer)
• Service Health Dashboard
• Community
• Post Incident Review

Experiences when Customers’ Access to Services is Impacted.

Page 23:

Office 365 Service Level Agreement

Backed by the most responsive support available and a comprehensive 99.9% financially backed SLA

Page 24:

Service Incident Communication Flow

Incident Identification → Ongoing Communication → Post Incident Wrap Up

Page 25:

Service Incident Communication Channels

• Facebook
• Twitter
• Email
• Service Health Dashboard
• Community
• Post Incident Review
• RSS Feed

Additional Actions

Page 26:

Roles and Responsibilities

Incident Manager

• IMs are on-call engineers and core product group members with deep expertise in their relevant area
• Determine scope of the outage
• Determine root cause and the fastest and best path to resolution

Communication Manager

• CMs are on-call engineers and core product group members with deep expertise in their relevant area
• Coordinate and supply issue information across internal teams
• Post customer communications to the Service Health Dashboard (SHD)

Support

• Update internal support communications
• Monitor support channels

Page 27:

Taxonomy for Service Incident Status

Status: Description (SHD icons not reproduced)

• Investigating: Monitors have indicated a service anomaly, and/or we have received reports of a potential service incident, and we are currently investigating the reports.
• Service Interruption: We have confirmed that normal services are being impacted. We are taking immediate action to understand the cause of the failure and determine the best course of action to restore service(s).
• Degraded Service: The services are currently experiencing degraded performance due to a service incident. Services are still active, but service responsiveness and/or delivery times may be slower than usual. We are currently working to restore normal service responsiveness.
• Restoring Service: We have isolated the likely cause of the incident and are in the process of restoring normal services.
• Extended Recovery: System services are restored. Due to an existing backlog of items, the services may be slower than usual while the backlog clears.
• Service Restored: Normal system services have been restored.

Page 28:

O365 Service Incident Notification Process

• Incident occurs
• Service Health Dashboard (SHD) updated to “Investigating”
• Incident status posted to SHD
• SHD updated until service restoration
• Closure summary posted to SHD
• Post Incident Review posted to SHD within 5 business days
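The 5-business-day window can be turned into a concrete due date with a short calculation. This sketch assumes a Monday-to-Friday working week with no holiday calendar, and assumes the clock starts on the day the incident is closed; the slide specifies neither, and the function name is illustrative.

```python
from datetime import date, timedelta


def add_business_days(start: date, business_days: int) -> date:
    """Return the date `business_days` working days after `start` (Mon-Fri, no holidays)."""
    current = start
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


# Hypothetical incident closed on Thursday 2012-06-28
print(add_business_days(date(2012, 6, 28), 5))
```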

Page 29:

Office 365 Planned Maintenance Communication

Type: Description (Channel)

• Planned Maintenance Update: 5-day prior notification of planned service maintenance that falls within approved maintenance timeframes. (Service Health Dashboard)
• Planned Maintenance Update (Outside Window): Notification of planned service maintenance that falls outside the approved maintenance timeframes. (Service Health Dashboard; Email, To: Customer)
• Transitions / Upgrades: Notification of service transitions and/or upgrades. (Service Health Dashboard; Email, To: Customer)

Status: Description (SHD icons not reproduced)

• Scheduled (5 business days advance notice): The planned maintenance activity has been scheduled.
• In Progress: The planned maintenance activity is in progress. Please see the details for the expected time for completion.
• Completed: The planned maintenance activity is complete.
• Postponed: The planned maintenance activity has been postponed. Please see the details regarding the updated schedule.
• Cancelled: The planned maintenance activity has been cancelled.

Page 31:

Productivity is Our Core Business

Microsoft is the company that businesses look to for the software they need to boost productivity and operate with efficiency, effectiveness, and intelligence.

Microsoft Office Division, the division that produces Office 365, produced over half of Microsoft’s operating income. For some of our competitors, productivity is a minor side business at best.

Page 32:

Office Software Stack Provides Resilience…

To Network Interruptions…

To Cloud Disruptions…

To The Realities Of Business Life

Page 33:

More Than Just Login

The Office 365 service level agreement covers all services – no exceptions!

The definition of downtime for Office 365 is more than the “server-side error rate” – it covers real functionality, that is, whenever users are unable to read, write, access, send, or receive data.

Page 34:

Every User Counts

The Office 365 service level agreement refers to all end users, not just those exceeding a particular threshold.

Page 35:

Testing, Not Wishful Thinking

The recovery time objective (RTO) and recovery point objective (RPO) are based on regular verification and what we believe we can deliver in a real disaster.

Some of our competitors claim a zero RTO and zero RPO, even if they have needed to restore from tape in the past!
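Verifying RTO and RPO through drills amounts to comparing measured values against the stated objectives. The sketch below shows that comparison under assumed definitions (recovery time measured from failure to restoration, data loss measured from the last replicated write to the failure). The class, field names, and figures are hypothetical and are not Microsoft's published objectives.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    failure_time: datetime           # when the simulated disaster began
    last_replicated_write: datetime  # newest data present at the recovery site
    service_restored: datetime       # when users could work again


def meets_objectives(drill: DrillResult, rto: timedelta, rpo: timedelta) -> bool:
    """True if measured recovery time and data-loss window are within the objectives."""
    recovery_time = drill.service_restored - drill.failure_time
    data_loss_window = drill.failure_time - drill.last_replicated_write
    return recovery_time <= rto and data_loss_window <= rpo


# Hypothetical drill: 45 minutes to restore, 30 seconds of unreplicated data
drill = DrillResult(
    failure_time=datetime(2012, 6, 1, 10, 0, 0),
    last_replicated_write=datetime(2012, 6, 1, 9, 59, 30),
    service_restored=datetime(2012, 6, 1, 10, 45, 0),
)
print(meets_objectives(drill, rto=timedelta(hours=1), rpo=timedelta(minutes=1)))
```

A claimed zero RTO and zero RPO would require both measured intervals to be zero in every drill, which is the point the slide makes about testing rather than wishful thinking.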

Page 37:

Microsoft is Committed to Cloud Productivity Services

Page 38:

Office 365 Service Availability Experience is Good at the Start…

Page 39:

Commitment to Continuous Improvement

• Biweekly service updates
• Feature and capability releases every 90 days
• Major feature and capability releases every 12-24 months

Quarterly Verification

Anomaly Detection

Improvement Development

Page 40:

We’re Not Planning On Stopping

Microsoft Office 365 will continue to offer the most resilient and predictable service availability experience for the cloud.

Backed by the most responsive support available and the most comprehensive financially backed SLA to reflect our commitment to meet your service availability needs.

Page 41:

In Review: Session Objectives and Takeaways

Session Objective(s):
• Learn how Office 365 is engineered for reliability and predictable service delivery.
• Understand the customer-facing information that provides insight into service availability and interruptions.
• Describe the process of continuous improvement of service availability.

Takeaways:
• The picture is more nuanced than the numbers.
• Office 365 provides end-to-end reliability through a thoughtful product and service offering.

Page 42:

Related Resources

Office 365 TechCenter: technet.microsoft.com/Office365

Office Client TechCenter: technet.microsoft.com/office

Office, Office 365 and SharePoint Demo Area Includes:

• Office 365 IT Pro Command Center
• Office 365 Data Center Exhibit

Page 43:

Resources

Connect. Share. Discuss.

http://europe.msteched.com

Learning

Microsoft Certification & Training Resources

www.microsoft.com/learning

TechNet

Resources for IT Professionals

http://microsoft.com/technet

Resources for Developers

http://microsoft.com/msdn

Page 44:

Evaluations

http://europe.msteched.com/sessions

Submit your evals online

Page 45:

© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.