Scaling Humans: Ops teams and incident management. dotScale, Paris 2015. David Mytton, CEO, Server Density
Page 1: Scaling humans - Ops teams and incident management

Scaling Humans: Ops teams and incident management

dotScale, Paris 2015 David Mytton, CEO, Server Density

Page 8: Scaling humans - Ops teams and incident management

Cost of uptime?

$2.9bn (Q1 2015)

$870m (Q1 2015)

$4.1bn (Q1 2015)

Page 10: Scaling humans - Ops teams and incident management

How much are you spending?

Page 11: Scaling humans - Ops teams and incident management

Expect downtime

• Prepare

• Respond

• Postmortem

Page 13: Scaling humans - Ops teams and incident management

Prepare

• On call

• Primary/secondary

• Reachability
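
A primary/secondary on-call rotation can be sketched in a few lines. This is only an illustration, not the speaker's tooling: the weekly shift length, the `epoch` date, the function name `on_call_pair`, and the engineer names are all assumptions.

```python
from datetime import date

def on_call_pair(engineers, day, epoch=date(2015, 1, 5)):
    """Return (primary, secondary) for the week containing `day`.

    Assumes weekly shifts counted from `epoch` (a Monday, chosen
    arbitrarily for this sketch). The secondary is next in line, so
    every engineer backs up one week and leads the following week.
    """
    if len(engineers) < 2:
        raise ValueError("primary/secondary cover needs at least two engineers")
    week = (day - epoch).days // 7
    primary = engineers[week % len(engineers)]
    secondary = engineers[(week + 1) % len(engineers)]
    return primary, secondary

# Hypothetical team, for illustration only.
team = ["alice", "bob", "carol"]
first_week = on_call_pair(team, date(2015, 1, 5))
second_week = on_call_pair(team, date(2015, 1, 12))
```

With three engineers, each person is off call one week in three, which ties in with the "Off call" point on the next slide.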

Page 17: Scaling humans - Ops teams and incident management

Prepare

• On call

• Off call

• Docs

• Searchable

• Independent

Page 21: Scaling humans - Ops teams and incident management

Prepare

• Key info

• Team contacts

• Vendor contacts

• Key credentials

Page 24: Scaling humans - Ops teams and incident management

Prepare

• Key info

• Unexpected situations

• Communication

• Internet access

• Support access

Page 28: Scaling humans - Ops teams and incident management

Respond

• First responder

1. Load incident response checklist

2. Log into Ops War Room

3. Log incident in JIRA

4. Begin investigation
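
The four steps above could be enforced in order by a small checklist helper. This is a minimal sketch: the class name `ChecklistRun` and its methods are hypothetical, and the steps are only recorded here, not actually performed against the war room or JIRA.

```python
from datetime import datetime, timezone

# Steps copied from the slide, in order.
CHECKLIST = [
    "Load incident response checklist",
    "Log into Ops War Room",
    "Log incident in JIRA",
    "Begin investigation",
]

class ChecklistRun:
    """Tracks which checklist steps the first responder has completed."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.completed = []  # list of (step, UTC timestamp)

    def next_step(self):
        """Return the next outstanding step, or None when all are done."""
        if len(self.completed) < len(self.steps):
            return self.steps[len(self.completed)]
        return None

    def complete(self, step):
        """Mark `step` done; refuse steps taken out of order."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        self.completed.append((step, datetime.now(timezone.utc)))

run = ChecklistRun(CHECKLIST)
run.complete("Load incident response checklist")
```

Recording a timestamp for each step also gives the postmortem a timeline to start from.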

Page 32: Scaling humans - Ops teams and incident management

Respond

• Key response principles

• Log everything

• Frequent public updates

• Gather the team

• Escalate!
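
"Log everything" and "frequent public updates" can be combined in one timestamped incident log that also flags when a public update is overdue. This is a sketch under assumptions: the class `IncidentLog`, its methods, and the 30-minute update interval are illustrative and do not come from the talk.

```python
from datetime import datetime, timedelta, timezone

class IncidentLog:
    """Timestamped incident log that tracks the last public update."""

    def __init__(self, update_interval=timedelta(minutes=30)):
        # The 30-minute default is an assumption, not a figure from the talk.
        self.update_interval = update_interval
        self.entries = []  # list of (timestamp, message, is_public)
        self.last_public_update = None

    def log(self, message, when=None, public=False):
        """Record an entry; public entries reset the update timer."""
        when = when or datetime.now(timezone.utc)
        self.entries.append((when, message, public))
        if public:
            self.last_public_update = when

    def public_update_due(self, now=None):
        """True if no public update has been posted within the interval."""
        now = now or datetime.now(timezone.utc)
        if self.last_public_update is None:
            return True
        return now - self.last_public_update >= self.update_interval

# Hypothetical incident timeline, for illustration only.
t0 = datetime(2015, 6, 8, 12, 0, tzinfo=timezone.utc)
incident = IncidentLog()
incident.log("Alert received", when=t0)
incident.log("Investigating elevated error rates", when=t0 + timedelta(minutes=5), public=True)
```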

Page 36: Scaling humans - Ops teams and incident management

Postmortem

• Within a few days

• Tell the story

• Appropriate technical detail

• What failed, why?

Page 37: Scaling humans - Ops teams and incident management

Postmortem

• How it’s going to be fixed

Page 39: Scaling humans - Ops teams and incident management

Thank you very much

@davidmytton