Do’s & Don’ts of post-incident analysis
Jason Hand
DevOps Evangelist
DevOps, Dogs, Horses, and Mountain Living
Twitter: @jasonhand

VictorOps
Incident management & notifications
Makes on-call suck less!
Twitter: @victorops

Jason Yee
Technical writer/evangelist
Travel hacker & Chef
Twitter: @gitbisect

Datadog
SaaS-based full stack monitoring
Over a trillion data points per day
Twitter: @datadoghq
Agenda
Service Disruptions
Detection
Diagnosis
Post-incident analysis
Framework
Follow & Share on Twitter
#VOWebinar
@gitbisect @jasonhand
@datadogHQ @VictorOps
Service Disruptions
There is no such thing as being so good that you’ll never fail.
Service disruptions are a reality in ALL complex systems.
Complex Systems
● Diversity
● Interdependent
● Adaptive
● Connectedness (i.e., we can be connected but not dependent on each other)
Cynefin Framework
● Obvious - cause & effect is obvious to all
● Complicated - cause & effect requires analysis or expert knowledge
● Complex - cause & effect can only be perceived in retrospect
● Chaotic - no relationship between cause & effect
Cynefin diagram by Dave Snowden CC BY-SA 3.0
Contributing Factors
Systems Thinking: an understanding of a
system by examining the linkages and
interactions between the components that
comprise the entirety of that defined system
MTTR vs MTBF
Mean Time To Repair
vs
Mean Time Between Failures
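To make the distinction concrete, here is a small worked example with made-up incident timestamps. Definitions of MTBF vary (start-to-start vs. end-to-start), so treat this as a sketch of one common convention rather than a canonical formula.

```python
# Hypothetical worked example of MTTR vs. MTBF -- the incident timestamps
# below are made up; this shows one common way to compute each.
from datetime import datetime, timedelta

# (detected_at, resolved_at) for three fictional incidents
incidents = [
    (datetime(2017, 3, 1, 9, 0),  datetime(2017, 3, 1, 9, 45)),
    (datetime(2017, 3, 8, 14, 0), datetime(2017, 3, 8, 14, 20)),
    (datetime(2017, 3, 20, 2, 0), datetime(2017, 3, 20, 3, 0)),
]

# MTTR: average time from detection to resolution
repair_times = [resolved - detected for detected, resolved in incidents]
mttr = sum(repair_times, timedelta()) / len(repair_times)

# MTBF: average uptime between the end of one incident and the start of the next
uptimes = [nxt[0] - prev[1] for prev, nxt in zip(incidents, incidents[1:])]
mtbf = sum(uptimes, timedelta()) / len(uptimes)

print(f"MTTR: {mttr}  MTBF: {mtbf}")
```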
Detection
Collecting data is cheap.
Not having it when you need it can be expensive.
4 qualities of good metrics
Not all metrics are created equal (a tagging sketch follows the list below)
1. Well understood
2. Granular
3. Tagged & filterable
4. Long-lived
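As one illustration of “granular” and “tagged & filterable,” here is a minimal sketch of submitting a tagged metric with the open-source datadogpy DogStatsD client. It assumes a local Datadog Agent listening on the default DogStatsD port; the metric name, value, and tags are hypothetical.

```python
# A minimal sketch, assuming the `datadog` (datadogpy) package is installed
# and a local Datadog Agent is listening on the default DogStatsD port.
from datadog import initialize, statsd

initialize(statsd_host="localhost", statsd_port=8125)

# Tags keep the metric filterable later: slice by service, environment, or version.
statsd.histogram(
    "checkout.request.latency",          # well understood: one clear meaning
    0.253,                               # granular: per-request, not pre-averaged
    tags=["service:checkout", "env:production", "version:1.4.2"],
)
```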
Diagnosis
Real-time Notification
Getting “the right” Humans Involved
Paging has evolved to: Smart & Actionable alerts ...
Routed to the right teams and people …
With valuable context
Graphs, Logs, Runbooks
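As a sketch of what “valuable context” can look like in practice, here is a hypothetical alert payload that carries links to the relevant graph, logs, and runbook instead of a bare metric name. The field names and URLs are illustrative only, not any product’s schema.

```python
# Hypothetical "actionable alert" context: everything a responder needs to
# start diagnosing, attached to the page itself.
alert_context = {
    "title": "High error rate on checkout service",
    "severity": "critical",
    "route_to": "payments-oncall",
    "graph": "https://example.com/dashboards/checkout-errors",
    "logs": "https://example.com/logs?query=service:checkout+status:error",
    "runbook": "https://example.com/runbooks/checkout-error-rate",
    "first_step": "Check the most recent deploy before considering a rollback.",
}

for key, value in alert_context.items():
    print(f"{key}: {value}")
```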
Automation
ChatOps
jhand.co/chatopsbook
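Here is a minimal, framework-agnostic ChatOps sketch, assuming a hypothetical bot that relays channel messages to this handler: a “!runbook &lt;service&gt;” command that surfaces the relevant runbook in the channel where the incident is being worked. The command name and runbook URLs are made up for illustration.

```python
# A framework-agnostic ChatOps sketch: the bot integration is assumed,
# and the command name and runbook URLs are hypothetical.
from typing import Optional

RUNBOOKS = {
    "checkout": "https://example.com/runbooks/checkout",
    "search": "https://example.com/runbooks/search",
}

def handle_message(text: str) -> Optional[str]:
    """Respond to '!runbook <service>' messages; ignore everything else."""
    parts = text.strip().split()
    if len(parts) == 2 and parts[0] == "!runbook":
        service = parts[1]
        return RUNBOOKS.get(service, f"No runbook found for '{service}'")
    return None

print(handle_message("!runbook checkout"))
```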
The Full Incident Lifecycle
What we are really here to learn about...
Post-incident Analysis (a.k.a. learning review, postmortem)
Do: Establish that we are here to learn
The primary objective of these exercises is to learn
Do: Establish timeline of events
Identify when the anomaly was first detected, who the first responders were, which SMEs were pulled in to assist, key conversations, commands run, etc.
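One way to capture that timeline as structured data while memories are fresh; this is a sketch only, and the fields and sample events are hypothetical.

```python
# A sketch of a structured incident timeline; fields and events are made up.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TimelineEvent:
    time: datetime
    actor: str     # person, team, or automated system
    action: str    # what was observed, said, or done
    source: str    # monitor, chat channel, command history, etc.

timeline = [
    TimelineEvent(datetime(2017, 3, 1, 9, 0), "monitor", "Error-rate alert fired", "monitoring"),
    TimelineEvent(datetime(2017, 3, 1, 9, 4), "on-call engineer", "Acknowledged page, opened dashboard", "paging tool"),
    TimelineEvent(datetime(2017, 3, 1, 9, 20), "payments SME", "Pulled into the incident channel", "chat"),
]

for event in sorted(timeline, key=lambda e: e.time):
    print(f"{event.time:%H:%M} [{event.source}] {event.actor}: {event.action}")
```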
Don’t: Hijack the Discussion
Having an objective moderator run the exercise can help prevent one person (or a small group) from steamrolling the conversation, and it helps avoid “groupthink.”
Do: Describe What Happened
Gather a detailed account of what happened from team members. What services, components, etc. were affected? Include how customers were impacted (i.e., accountability).
Don’t: Explain What Happened
Explaining often leads to a less than objective understanding of what took place as well as finger pointing and blame
Do: Ask “How” Things Happened
Understand in great detail “how” things happened including multiple contributing factors
Don’t: Ask “Why” Things Happened
Asking “why” often contains bias and leads to blame
“Why” brings us to the very mysterious incentives we have in the workplace.
“How” brings us to the conditions that allowed the event to take place to begin with. - John Allspaw (CTO, Etsy)
Do: Understand Contributing Factors
Use Systems Thinking to see more holistically
“Cause is not something found in the rubble. Cause is created
in the minds of the investigators” - Sidney Dekker
Don’t: Focus on a ‘Root Cause’
Rather than focusing on the ‘Root Cause’ of service disruption, understand all of the contributing factors.
Newtonian thinking … Why some still seek a root cause
We’ve created the idea that a single cause has an equal and opposite effect.
In complex systems, it doesn’t.
● Humans adapt to the work they have
● Root cause analysis ONLY works in SIMPLE systems
● Root cause analysis = retrospective cover of ass
Do: Watch For Bias
We are easily susceptible to cognitive biases such as confirmation, hindsight, anchoring, outcome, and availability bias.
Don’t: Blame Humans
Humans are only one part of the problem and the response, never the sole contributing factor in an issue.
Do: Include What Went Well
Much can be learned from what worked during the response to a service disruption. Capture and discuss what efforts actually went well.
Don’t: Hide What Happened
Customers and end-users are savvy. Being transparent about what took place and what was done will help build trust
Do: Conduct Analysis Soon
Gather the team and conduct the post-incident analysis as soon as everyone is rested
Don’t: Wait longer than 48 hours
The more time that passes, the less accurate accounts of what took place will be.
Do: Assign Action Items
Look for small, incremental improvements to take action on. Each improvement item should be assigned an owner and tracked for follow-up.
Don’t: Debate Without Action
Don’t allow extended debate on action items. Place ideas into a “parking lot” for later, but come up with at least one action item to be implemented immediately.
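A small sketch of what tracking action items with owners might look like; the fields and sample items are hypothetical, and in practice these usually live in your existing ticketing or project tool rather than in code.

```python
# Hypothetical action-item tracking: each improvement gets an owner and a
# follow-up date so it doesn't get lost after the review.
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str
    owner: str
    follow_up: date
    status: str = "open"   # open / in progress / done

action_items = [
    ActionItem("Add a monitor on checkout error rate", "jane", date(2017, 4, 3)),
    ActionItem("Link the runbook in the alert message", "sam", date(2017, 4, 10)),
]

for item in action_items:
    print(f"[{item.status}] {item.description} -- owner: {item.owner}, follow up: {item.follow_up}")
```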
Do: Hear from everyone
To fully understand the disruption and response you want to hear from all parties involved. Everyone’s experience was different. The more
voices you hear from, the more accurate the story and timeline become.
Do: Encourage Many Possible Improvements
We are looking for many possible areas for incremental improvements to our systems, processes, tools, incident response, and team members. Encourage people to build on top of existing ideas in
addition to posing alternatives.
Don’t: Overpromise or Overcommit
We are looking for ideas, not binding commitments. This helps ensure you get suggestions from a wide group.
Do: Archive Your Postmortem
Save and store your postmortem where it is available to everyone internally for future review or as assistance during future similar
incidents
Do: Rinse & Repeat
Be disciplined in your post-incident analysis exercises and perform them for all incidents regardless of the severity. Practice makes
perfect and these will become more efficient and useful over time
Framework
Post-incident analysis framework
1. Summary: what happened?
2. How was the incident detected?
3. How did we respond?
4. How did it happen?
5. How can we improve?
(a fill-in template sketch of this framework follows the section details below)
Summary: what happened?
● Impact on customers
● Severity of the incident
● Components affected
● What ultimately resolved the incident?
● Externally shared information
How was the incident detected?
● Did we have a metric that showed the incident?
● Was there a monitor/alerting on that metric?
● How long did it take to declare an incident?
How did we respond?
● Who was involved?
● ChatOps archive links
● Timeline of events
● What went well?
● What didn’t go so well?
How did it happen?
● Technical deep-dive
● Include context
● Identify contributing factors
● Ask “How,” not “Why”
How can we improve?
● Now (immediate actions)
● Next (in current or following sprint)
● Later (after the next sprint)
● Follow-up notes
● Ensure all items are actionable and tracked
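Here is a minimal sketch of the framework above as a fill-in template. The headings mirror the five sections; the real, fuller template is linked in the Resources below, so treat this as illustrative only.

```python
# A minimal postmortem template sketch based on the five-section framework;
# the placeholder text is illustrative, not the official template.
POSTMORTEM_TEMPLATE = """\
Post-incident analysis: <incident title> (<date>)

1. Summary: what happened?
   - Customer impact, severity, components affected, what resolved it, external comms

2. How was the incident detected?
   - Metric, monitor/alert, time to declare an incident

3. How did we respond?
   - Who was involved, ChatOps links, timeline, what went well / not so well

4. How did it happen?
   - Technical deep-dive, context, contributing factors (ask "how," not "why")

5. How can we improve?
   - Now / Next / Later action items, each with an owner and follow-up notes
"""

print(POSTMORTEM_TEMPLATE)
```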
Resources
● Post-incident analysis framework/template
  ○ http://bit.ly/2dxDIT3
● Blameless postmortems & a just culture - John Allspaw
  ○ https://codeascraft.com/2012/05/22/blameless-postmortems/
● The infinite hows - John Allspaw
  ○ http://www.kitchensoap.com/2014/11/14/the-infinite-hows-or-the-dangers-of-the-five-whys/
● The human side of postmortems - Dave Zwieback
  ○ http://www.oreilly.com/webops-perf/free/the-human-side-of-postmortems.csp
● Writing your first postmortem - Mathias Lafeldt
  ○ https://medium.com/production-ready/writing-your-first-postmortem-8053c678b90f
Q&A
Do: Start a free trial
https://app.datadoghq.com/signup
https://victorops.com/start-free-trial