How To Run a Post-Mortem With Humans (Not Robots) Dan Milstein Hut 8 Labs @danmil
How to Run a Post-Mortem (With Humans, Not Robots), Velocity 2013

Jan 15, 2015

Dan Milstein

Slides (with annotations) from a talk on post-mortems at Velocity CA, 2013.

This is an expanded version of my earlier slides, from the Lean Startup Conf.
Transcript
Page 1: How to Run a Post-Mortem (With Humans, Not Robots), Velocity 2013

How To Run a Post-Mortem With Humans (Not Robots)

Dan Milstein, Hut 8 Labs, @danmil

Page 2

Act I: What The Hell Is a Post-Mortem Anyways?

Page 3

Ahhhhhh! Something Very Bad Just Happened

Page 4

What Is a Post-Mortem Anyways?

• Something you do when your company has badly screwed up

• E.g. your CEO demos your cloud storage system to an early prospective customer, and, when he runs a search, it shows other customers’ data (I have done this, it was not awesome)

• You get a bunch of people into a room and say: “How on earth did that happen? And how can we make sure it never, ever happens again?”

• That’s a Post-Mortem

• But, there’s a problem....

Page 5

Shameful Mistakes: Humans vs Robots

Page 6

Human Beings Will Eff It Up

• Humans (unlike robots) feel this intense emotion called shame

• Shame will suggest (strongly) “Slow Down, Stop Making So Many Mistakes”

• Aka “Destroy your company by way of opportunity costs, immediately!”

• Has potential to be incredibly damaging to your startup

• And I have some bad news...

Page 7

You Will Totally Experience Shame (I Still Do)

F.A.E.

Page 8

This Emotional Experience Cannot Be Avoided

• I’ve run c. 50 post-mortems, have studied failure... and I still have this emotional reaction

• You will, too. And so will your team.

• Much more strongly than you realize right now

• This is the “Fundamental Attribution Error” (FAE), from psychology

• FAE = humans vastly underestimate the power of the situation over behavior, and blame the person instead

Page 9

Big Idea: Adopt Economic, Not Moral Mindset

$, FTW

Page 10

What Does That Mean?

• Let me tell you a story...

Page 11

Parable: A Tale of Two Factories

Page 12

Two Factories

• Both make widgets

• Both are missing their monthly Widget Production goals by 10%

• But for different reasons...

Page 13

Factory 1... Broken Machine

Page 14

When The Machine Breaks...

• Belt slips off every once in a while

• Ruins a bunch of widgets

• Gotta replace it, drift a little behind plan

• So... what questions do humans ask in this situation?

Page 15

• “How much is it costing us?”

• “How much does it cost to repair?”

• “Can we kludge a partial fix?”

• “What are the risks if we delay a fix?”

Economic Mindset = Broken Machine
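
As a minimal sketch of what those questions look like once numbers are attached, the comparison below totals an upfront cost plus an ongoing monthly loss for each option. The `Option` class, the option names, and every dollar figure are hypothetical, made up for illustration; none of them come from the talk.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    upfront_cost: float   # one-time cost to adopt this option (hypothetical)
    monthly_loss: float   # expected ongoing loss per month afterwards (hypothetical)


def total_cost(option: Option, months: int) -> float:
    """Upfront cost plus the accumulated monthly loss over the horizon."""
    return option.upfront_cost + option.monthly_loss * months


def cheapest(options: list, months: int) -> Option:
    """Pick the option with the lowest total cost over the given horizon."""
    return min(options, key=lambda o: total_cost(o, months))


# Illustrative figures only: "How much is it costing us?",
# "Can we kludge a partial fix?", "How much does it cost to repair?"
options = [
    Option("do nothing", upfront_cost=0, monthly_loss=10_000),
    Option("kludge a partial fix", upfront_cost=2_000, monthly_loss=4_000),
    Option("full repair", upfront_cost=30_000, monthly_loss=0),
]

for horizon in (3, 12):
    print(horizon, "months:", cheapest(options, horizon).name)
```

With these made-up figures the kludge wins over a three-month horizon and the full repair wins over a year, which is the kind of trade-off the economic mindset lets a team discuss openly.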

Page 16

Note the Key Words

• “Cost”, “Partial”, “Risk”

• These are things you hear a lot in an economic discussion

• Okay, meanwhile in Factory 2, also missing by 10%, different reason...

Page 17

Factory 2... One Employee Is an Axe Murderer

Page 18

After Every Axe Murdering...

• Have to, like, hire a new guy, train him on the machine, takes forever

• Questions we asked before are now somehow deeply wrong:

• “What if we just cut down on the rate, so there’s less axe murdering?”

• “Hey, we can train a pool of temps on all the machines, when someone gets killed, we’ll just swap some new guy in, bang, problem solved!”

• “How much is it really costing us, anyways?”

• These ideas seem obscene, not merely bad

Page 19

Moral Mindset = Axe Murderer

“Search for villains, elevation of accusers, and mobilization of authority to mete out punishment” (Pinker, The Blank Slate)

Page 20

Moral Mindset, Key Words

• “Villains”, “Accusers”, “Authority”, “Punishment”

• I believe that most companies, in investigating outages, act much more like they’re looking for an axe murderer than like they’re trying to fix a broken machine

Page 21

Act II: What To Do in the Post-Mortem Room

Page 22

Challenge #1, As Person Running Post-Mortems

Get team out of moral mindset.

Note: this is not, in fact, easy.

Page 23

Why It’s Hard

• Mindsets control how we interpret the world...

• ...including what people say to us

• So, when a team sitting there, fearing moral censure, hears you say “We’re not looking to blame anyone”, they just think you’re lying. How could you mean that, when the thing that happened was so terrible and wrong?

• The deep trick (and this is the point of this whole presentation, frankly) is that you have to take advantage of the thing that separates humans from robots...

Page 24

Fundamental Tool: Make ‘Em Laugh

Page 25

Humor == Breaking Frames

• That’s what humor actually is -- something that stretches or breaks the mental frame that people are using to interpret a situation

• So, you use humor to break the frame, release people from the blame/fear/punishment of the moral mindset, and then refocus them on the economic challenges you’re facing

• The humor is, IMHO, not a nice-to-have. It’s absolutely central. I’ve seen smart, caring leaders get this one wrong, and finish their post-mortems with a room full of tense, closed-up team members (and no good ideas on the table)

• Talk has specific examples of this, but this is a central point

Page 26

Tip 1: Share Your Personal “Bad Things”

Page 27

Place The Bad Thing on a Continuum

• Moral mindset is very absolutist: this bad thing is The Worst Thing Ever

• I like to say “Okay, well it’s pretty bad, let’s compare it to some things”

• Did we irretrievably lose customer data? (I’ve done that, not awesome)

• Did we almost get our customer fired by her boss? (also, not awesome)

• Did we send hundreds of emails to everyone on our customer’s mailing list... but the emails were all question marks? For a customer who was in the proofreading business? (done that, very much not awesome)

• People laugh, and then say “Okay, how bad was this, really?” Win.

Page 28

More Stories of Actual Failures (Just For Fun)

• Did we break our allergies-to-medicines module, and risk having a doctor prescribe the wrong medication to someone?

• Did our internet-connected home thermostat system have a server crash, causing all the thermostats to set the temp to the default... of 85 degrees?

• Did our high-frequency trading program have flaws that led to our company losing 450 million dollars? (that is a tough one to beat, IMHO)

• Collect your own! It’s fun!

Page 29

Tip 2: Mock Hindsight Bias To Its Face

“Let’s plan for a future where we’re all as stupid as we are today.”

Page 30

How Hindsight Bias Shows up in Post-Mortems

• Someone says “Oh, yeah, I screwed that one up, I knew I had to run the deploy in that one order, and I just forgot. I’m really sorry, I won’t make that mistake again, totally my bad.”

• You have to utterly reject this. It’s pure hindsight bias (easy to see errors after the fact, very difficult in the moment).

• I say “It’s like we’re saying ‘I was stupid, this one time, and we’ll fix that problem by never being stupid again.’”

• Hence: “planning for a future where we’re as stupid as we are today”

• aka “Must create a system which is resilient to occasional bouts of really intense stupidity”.
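
One way to picture a system that tolerates a forgotten deploy order is a small guard that checks the planned steps against written-down ordering rules before anything runs. This is only a sketch under assumptions: the function name `order_violations`, the rule format, and the step names `migrate_db` and `restart_app` are all hypothetical and do not come from the talk.

```python
from typing import List, Tuple


def order_violations(
    planned_steps: List[str],
    must_run_before: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return every (earlier, later) rule that the planned order breaks.

    A rule (a, b) means step a has to run before step b. Rules that mention
    a step missing from the plan are skipped rather than reported.
    """
    position = {step: index for index, step in enumerate(planned_steps)}
    violations = []
    for earlier, later in must_run_before:
        if earlier in position and later in position:
            if position[earlier] > position[later]:
                violations.append((earlier, later))
    return violations


# Hypothetical rule: the database migration must run before the app restart.
rules = [("migrate_db", "restart_app")]

forgetful_plan = ["restart_app", "migrate_db"]
careful_plan = ["migrate_db", "restart_app"]

print(order_violations(forgetful_plan, rules))
print(order_violations(careful_plan, rules))
```

The point of a guard like this is that the ordering knowledge lives in the tool, so it still holds on the day someone forgets it.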

Page 31

Tip 3: Relish Absurdities of Your System

Page 32

You Will Find That Your Code is a Mess

• E.g. you’ve refactored, and rewritten in Python (or Node or something), and moved to the cloud, but this Five Whys exercise is making clear that your most important report is still run by a VisualCron job on a Windows server that never quite made it out of the office... and someone just tripped on the power cord

• Team will feel ashamed, you have to give them license to relish absurdity

• I often point out “There are two kinds of startups: the ones that achieve some modest traction on top of a pile of code of which they are vaguely ashamed... and the ones that go out of business. That’s it. No third kind.”

• Also sometimes it helps to just laugh: “It’s kind of amazing this works at all”

Page 33

Interlude: A Worked Example

Page 34

Three Axioms For Leading Post-Mortems

• Everyone involved acted in good faith

• Everyone involved is competent

• We’re doing this to find improvements

Page 35

Axioms == Ground Truth From Which You Start

• If you don’t start with these as givens...

• ...you’ll find yourself seeing every incident as human error

• Whereas, if you can convince/trick yourself into such beliefs...

• ...you’ll find a thousand valuable improvements to make

• Or, to put it another way:

Page 36

Human Error is the Question, Not the Answer

Page 37

Restate the Problem To Include TTR

We broke the db access code.

Page 38

Restate the Problem To Include TTR

We pushed a deploy...which broke the db access code.

Page 39

Restate the Problem To Include TTR

We pushed a deploy...which broke the db access code...and didn’t find out until customers complained.

Page 40

Restate the Problem To Include TTR

We pushed a deploy...which broke the db access code...and didn’t find out until customers complained...and couldn’t fix it for three hours.

Page 41

Redefining Problem Is Very Valuable

• People tend to focus on a single mistake

• Broaden that, to include full cycle back to restored service

• At what point was the triggering decision made?

• How long did it take to find out something was wrong?

• How long did it take to restore service?
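
The three questions above map onto three timestamps, and the gaps between them can be computed directly. The sketch below assumes a hypothetical `Incident` record and invented clock times; only the three-hour fix window echoes the worked example, and reading "TTR" as time to restore service is an interpretation, not something the slides define.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    triggered_at: datetime   # when the triggering decision was made (e.g. the deploy)
    detected_at: datetime    # when someone found out something was wrong
    restored_at: datetime    # when service was restored

    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.triggered_at

    def time_to_restore(self) -> timedelta:
        return self.restored_at - self.detected_at

    def total_impact(self) -> timedelta:
        return self.restored_at - self.triggered_at


# Hypothetical timeline: deploy at 09:00, customers complain at 10:30,
# service back three hours after that.
incident = Incident(
    triggered_at=datetime(2013, 6, 18, 9, 0),
    detected_at=datetime(2013, 6, 18, 10, 30),
    restored_at=datetime(2013, 6, 18, 13, 30),
)

print("time to detect: ", incident.time_to_detect())
print("time to restore:", incident.time_to_restore())
print("total impact:   ", incident.total_impact())
```

Laying the incident out this way makes it obvious that the original mistake is only one of three places where an improvement could shorten the outage.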

Page 42

“Broadest Fix” vs “Root Cause”

Page 43

Handling a Fork in the Road

• Which is the Root Cause? DB access bug or monitoring failure?

• Answer: don’t care about “root causes”. They don’t exist (multiple things conspire for failures to happen). Also, kind of moral/blame-ish.

• Ask instead: if we made an incremental improvement in area A or area B, which would prevent the broadest class of problems going ahead?

• Much better conversation. Answer here is clear: monitoring.
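
A rough way to run the "broadest fix" comparison is to list, for each candidate improvement, the classes of problems it would have caught, then pick the candidate covering the most. Everything in this sketch is an assumption for illustration: the `broadest_fix` helper, the candidate names, and the problem classes are invented, and counting classes equally is a simplification a real team would argue about.

```python
from typing import Dict, Set


def broadest_fix(candidates: Dict[str, Set[str]]) -> str:
    """Return the candidate improvement that covers the most problem classes."""
    return max(candidates, key=lambda name: len(candidates[name]))


# Hypothetical candidates and the problem classes each would help with.
candidates = {
    "fix the db access bug": {"db access regression"},
    "improve monitoring": {
        "db access regression",
        "bad config push",
        "disk filling up",
    },
}

print(broadest_fix(candidates))
```

The ranking itself matters less than the conversation it forces: the team has to name the other failures each improvement would have caught.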

Page 44

Act III: Corrective Actions / Remediations / Fixes

Page 45

Incrementalism Or You’re Fired

Page 46

Require Small Steps From Your Team

• Team will tell you they have no option but to do Some Huge Thing

• You have to totally reject this, push for a small step

• e.g. “What’s the simplest, dumbest thing that will make it slightly better?”

• After some hemming and hawing, great, cheap ideas emerge

• Might be: small steps towards Huge Thing

• Or: installing data collection to prove Huge Thing is necessary

Page 47

“Automation” vs “Tools”

Page 48

“Automation” => Humans Cause Your Problems

• Strong

• Silent

• Clumsy

• Difficult to Direct

David Woods, “Decomposing Automation: Apparent Simplicity, Real Complexity”

Page 49

Automation Written By People Who Don’t Do the Job

Page 50

“Tooling” => Humans Solve Your Problems

• How do the humans currently do their jobs?

• What tools do they use?

• When you give them a new tool, do they actually use it?

• How badly did you just screw up their jobs?

• YOU MUST ITERATE

Page 51

Dan Mongers Some Fear

Page 52

Here’s What’s Happening, Right Now

• Your systems are experiencing constant, small-scale failures... invisibly

• Your team is struggling to keep your systems running... but is so habituated to it that they don’t even realize that’s true

• Your smart people are spending their smart cycles just trying to work around the complexity in your system

• The business side is making plans which aren’t supported by your infrastructure

• Customers are getting ready to surprise you, and it won’t be fun

Page 53

Do This

• Elect a Post-Mortem Boss (Man|Lady)

• Look for a Goldilocks incident

• Expect awkwardness

• THERE MUST BE FIXES

• Incrementally improve the incremental improvements

Page 54

Read This

• How Complex Systems Fail, Richard Cook (SOOOOO GOOOD)

• How the Mind Works, Steven Pinker (moral instinct, much other awesome)

• Thinking, Fast and Slow, Daniel Kahneman

• The Field Guide to Understanding Human Error, Sidney Dekker

• Complications and Better, Atul Gawande (marvelous narratives)

• Kitchen Soap, blog by John Allspaw

Page 55

Photo Credits, I

• “Wonderworks Upside Down Building”, by Andy Leonard, http://www.flickr.com/photos/rover75/3901166997/

• “Robot de Martillo”, by Luis Perez, http://www.flickr.com/photos/pe5pe/2454661748/

• “Helios-Factory floor”, http://commons.wikimedia.org/wiki/File:Helioshall2.jpg

• “old machine”, by Jun Aoyama, http://www.flickr.com/photos/jam343/1730140/

• “Axe Marks The Spot”, by Alan Levine, http://www.flickr.com/photos/cogdog/4461665810/

• “Failboat Has Arrived”, http://www.rotskyinstitute.com/rotsky/wp-content/uploads/2008/02/failboat2.jpg

Page 56

Photo Credits, II

• “14 plugs but only 6 sockets”, by Jason Rogers, http://www.flickr.com/photos/restlessglobetrotter/2661016046/

• “shame in scranton”, by Shira Golding Evergreen, http://www.flickr.com/photos/boojee/3613772785/

• “tiny dollhouse steps”, by Yi-Tao “Timo” Lee, http://www.flickr.com/photos/timojazz/6235519218/

• “Computers can be stupid”, by Brent Moore, http://www.flickr.com/photos/brent_nashville/2634912345/

• “Robot Uprising”, http://gordonandthewhale.com/wp-content/uploads/2010/10/How-To-Survive-a-Robot-Uprising.jpeg

• “Shark”, by Steve Garner, http://www.flickr.com/photos/22032337@N02/8314569214/

Page 57

Thanks...

Dan Milstein, Hut 8 Labs, @danmil