Page 1
Scaling Humans
Ops teams and incident management
dotScale, Paris 2015
David Mytton, CEO, Server Density
$870mQ1: 2015
Page 8
Cost of uptime?
$2.9bn (Q1 2015)
$870m (Q1 2015)
$4.1bn (Q1 2015)
Page 10
How much are you spending?
Page 11
Expect downtime
• Prepare
• Respond
• Postmortem
Page 13
Prepare
• On call
• Primary/secondary
• Reachability
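
A minimal sketch of primary/secondary escalation, assuming a simple ordered rota: page the primary first, and fall through to the secondary if there is no acknowledgement. The `Responder` type, the `escalate` function, the names in the rota, and the `page` callback are all hypothetical and not taken from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Responder:
    name: str
    role: str  # "primary" or "secondary"


def escalate(
    responders: Sequence[Responder],
    page: Callable[[Responder], bool],
) -> Optional[Responder]:
    """Page each responder in order; return the first one who acknowledges."""
    for responder in responders:
        if page(responder):
            return responder
    return None  # nobody was reachable, so escalate by other means


# Hypothetical rota: primary first, then secondary.
rota = [Responder("alice", "primary"), Responder("bob", "secondary")]

# Simulate an unreachable primary: only the secondary acknowledges the page.
acknowledged_by = escalate(rota, page=lambda r: r.role == "secondary")
print(acknowledged_by.name)
```

In practice `page` would wrap whatever alerting channel the team uses and apply a timeout before moving on; that part is left out here.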
Page 17
Prepare
• On call
• Off call
• Docs
• Searchable
• Independent
Page 21
Prepare
• Key info
• Team contacts
• Vendor contacts
• Key credentials
Page 24
Prepare
• Key info
• Unexpected situations
• Communication
• Internet access
• Support access
Page 28
Respond
• First responder
1. Load incident response checklist
2. Log into Ops War Room
3. Log incident in JIRA
4. Begin investigation
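
The first-responder steps above could be kept as data and logged with a timestamp as each one is worked through, which also supports the "log everything" principle. This is only a sketch: the step text is copied from the slide, while `CHECKLIST`, `run_checklist`, and the log format are assumptions made for illustration.

```python
from datetime import datetime, timezone

# Step text copied from the slide; the tooling around it is hypothetical.
CHECKLIST = [
    "Load incident response checklist",
    "Log into Ops War Room",
    "Log incident in JIRA",
    "Begin investigation",
]


def run_checklist(steps, now=lambda: datetime.now(timezone.utc)):
    """Return one timestamped log entry per step, in order."""
    log = []
    for number, step in enumerate(steps, start=1):
        log.append(f"{now().isoformat()} step {number}: {step}")
    return log


entries = run_checklist(CHECKLIST)
for entry in entries:
    print(entry)
```

Passing `now` as a parameter keeps the function easy to test with a fixed clock.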
Page 32
Respond
• Key response principles
• Log everything
• Frequent public updates
• Gather the team
• Escalate!
Page 36
Postmortem
• Within a few days
• Tell the story
• Appropriate technical detail
• What failed, why?
Page 37
Postmortem
• How it’s going to be fixed
Page 39
Thank you very much
[email protected]
@davidmytton