Jason Hand – DevOps Evangelist Tips & Tricks to Reduce TTR for the Next Incident @jasonhand
Time to Resolution (TTR)
• The total amount of time taken to resolve an incident
• MTTR – Mean Time To Resolution* – summary over time – measurement used to describe the most
"typical" value in a set of values – the lower the better
*Resolve = Repair = Recover
• Incident Lifecycle – Alerting – Triage – Investigation – Identification – Resolution – Documentation
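As a rough illustration of the definitions above, TTR for one incident is the resolved time minus the start time, and MTTR is the mean of those durations. The sketch below assumes a hypothetical list of (started, resolved) timestamp pairs; the dates and values are made up for the example.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (alert fired, incident resolved).
incidents = [
    (datetime(2015, 1, 5, 9, 0), datetime(2015, 1, 5, 9, 45)),      # 45 min
    (datetime(2015, 1, 12, 22, 10), datetime(2015, 1, 12, 23, 40)),  # 90 min
    (datetime(2015, 1, 20, 3, 30), datetime(2015, 1, 20, 3, 45)),    # 15 min
]


def time_to_resolution(started, resolved):
    """TTR: total time taken to resolve a single incident."""
    return resolved - started


def mean_time_to_resolution(records):
    """MTTR: arithmetic mean of the TTR values (lower is better)."""
    durations = [time_to_resolution(s, r) for s, r in records]
    return sum(durations, timedelta()) / len(durations)


mttr = mean_time_to_resolution(incidents)
print(mttr)  # 0:50:00
```

Because a mean is sensitive to a single very long incident, it may also be worth looking at the individual TTR values rather than the summary alone.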
Alerting
A “zero time” alerting platform to find people instantly can only really affect average TTR by a very small percentage
Notify on-call members
Victor’s Tips
“Include useful content & context in the alerts.”
“Use custom notifications to distinguish critical alerts.”
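The two tips above could be sketched as follows: an alert carries context alongside its summary, and the notification style depends on severity. Everything here (the `Alert` fields, the `NOTIFICATION_STYLE` mapping, the service name and runbook URL) is a hypothetical illustration, not any particular product's API.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    service: str
    severity: str  # assumed levels: "critical", "warning", "info"
    summary: str
    context: dict = field(default_factory=dict)  # graphs, runbook links, etc.


# Hypothetical mapping from severity to how the on-call person is notified.
NOTIFICATION_STYLE = {
    "critical": {"channel": "phone_call", "sound": "siren"},
    "warning": {"channel": "push", "sound": "chime"},
    "info": {"channel": "email", "sound": None},
}


def build_notification(alert):
    """Combine summary and context into one message, styled by severity."""
    style = NOTIFICATION_STYLE.get(alert.severity, NOTIFICATION_STYLE["info"])
    lines = [f"[{alert.severity.upper()}] {alert.service}: {alert.summary}"]
    lines += [f"{key}: {value}" for key, value in alert.context.items()]
    return {
        "channel": style["channel"],
        "sound": style["sound"],
        "body": "\n".join(lines),
    }


alert = Alert(
    service="checkout-api",
    severity="critical",
    summary="5xx error rate above threshold",
    context={"runbook": "https://wiki.example.com/runbooks/checkout-5xx"},
)
notification = build_notification(alert)
print(notification["body"])
```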
Victor’s Tips
“Get the right alerts to the right people through routing.”
“Establish a single source of truth for all activities of an incident.”
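A minimal sketch of the routing tip, assuming each alert arrives with a routing key that maps to an on-call team. The keys, team names, and fallback team are hypothetical.

```python
# Hypothetical routing table: routing key -> on-call team.
ROUTES = {
    "database": "dba-oncall",
    "web": "frontend-oncall",
}
DEFAULT_TEAM = "ops-oncall"  # assumed fallback so no alert goes unowned


def route_alert(routing_key):
    """Return the team that should be paged for this routing key."""
    return ROUTES.get(routing_key, DEFAULT_TEAM)


print(route_alert("database"))  # dba-oncall
print(route_alert("payments"))  # ops-oncall
```

The fallback is a design choice: an unrecognised key still reaches someone instead of being dropped.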
Resolution
Self-documenting what teams do to solve the problem
Bidirectional integration with your favorite chat client and the VictorOps timeline
Team members performing system actions to fix the problem(s)
“Conduct (blameless) post-mortems.”
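One way to picture self-documenting resolution is a single timeline that records both chat messages and system actions as they happen. The class below is a generic, hypothetical sketch and does not represent the VictorOps timeline or any chat client's API.

```python
from datetime import datetime, timezone


class IncidentTimeline:
    """Append-only log of everything said and done during an incident."""

    def __init__(self):
        self.entries = []

    def record(self, source, author, text):
        # source is an assumed label such as "chat" or "action".
        entry = {
            "at": datetime.now(timezone.utc),
            "source": source,
            "author": author,
            "text": text,
        }
        self.entries.append(entry)
        return entry

    def render(self):
        """Return the timeline as readable lines, oldest first."""
        return [
            f"{e['at']:%H:%M:%S} [{e['source']}] {e['author']}: {e['text']}"
            for e in self.entries
        ]


timeline = IncidentTimeline()
timeline.record("chat", "alice", "Seeing elevated latency on db-1")
timeline.record("action", "bob", "Restarted the connection pool on db-1")
for line in timeline.render():
    print(line)
```

A log like this can later serve as raw material for a post-mortem, since the sequence of events is already written down.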
“Be vocal & share what is taking place.”
“Provide quick access to accurate metrics & runbooks.”
“Collaborate & Share.”
“Connect with the right resources and team members.”
“Get the right alerts to the right people through routing.”
“Establish a single source of truth for all activities of an incident.”
“Include useful content & context in the alerts.”
“Use custom notifications to distinguish critical alerts.”
Thank You