Three years of breaking things to make them better - Devops Days Sydney 2016

Jan 19, 2017

Donny Nadolny
Transcript
Page 1: Three years of breaking things to make them better - Devops Days Sydney 2016

Three years of breaking things
to make them better
Donny Nadolny
[email protected]

Page 2

Conclusions

Page 5

Conclusions

1. Failure Friday is awesome, you should do it
2. Don’t automate it… yet
3. When it gets boring, switch it up

Page 6

Page 7

What is Failure Friday?

Page 8

What is Failure Friday?

Failure Friday is a fault injection test against our production environment.

Page 9

Page 10

Page 11

Our Basic Attacks

• Stop/start a process
• Suspend/resume a process
• Reboot a machine
• Network isolation
• Add latency & packet loss
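The suspend/resume attack can be sketched with POSIX signals. This is a minimal illustration, not the tooling used in the talk: the helper names `suspend`, `resume`, and `suspend_for` are hypothetical, and the sketch assumes a Unix-like host where the caller is allowed to signal the target process.

```python
import os
import signal
import time


def suspend(pid: int) -> None:
    """Freeze the process with SIGSTOP; it stays alive but stops running."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid: int) -> None:
    """Let a previously suspended process continue with SIGCONT."""
    os.kill(pid, signal.SIGCONT)


def suspend_for(pid: int, seconds: float) -> None:
    """Suspend a process for a fixed window, then always resume it."""
    suspend(pid)
    try:
        time.sleep(seconds)
    finally:
        # Resume even if the sleep is interrupted, so the attack cleans up.
        resume(pid)
```

A suspended process keeps its sockets open but stops answering, so this kind of attack is a reasonable way to check whether clients time out and fail over instead of hanging.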

Page 14

Larger-scale events

• Ramp up traffic to the disaster recovery site
• Fail over database master
• Take down one data centre (one region, one AZ)

Page 15

Benefits of Failure Friday

• Are you sure your process comes up after a reboot?
• If one machine is slow, does it act as a tarpit and slow down others?
• Does your DR work?
• Get people comfortable touching production
• Make sure your monitoring and alerting work

Page 16

How to get started

1. Don’t automate
2. Pick reasonable problems, test in staging first
3. Track results in your task tracker (JIRA, etc.)

Page 17

Page 18

What’s new?

Page 20

Our Basic Attacks

2013
• Stop/start a process
• Suspend/resume a process
• Reboot a machine
• Network isolation
• Add latency & packet loss

2016
• Stop/start a process
• Suspend/resume a process
• Reboot a machine
• Network isolation
• Add latency & packet loss

Page 22

Game Day Hour - Number 1

• Test major incident response
• Cause fake incident, fix it, retro on our response focusing on communication
• Don’t make it a semi-surprise

Page 25

Game Day Hour - Number 2

• Database issue, but… most of the ops team is at Devops Days Sydney
• Many variations on this:
  • team X is at an offsite
  • office Y is hit by a natural disaster

Page 27

Chaos Cat

• Reboot a random machine
• Add latency/packet loss to a machine for 7 minutes
• Ultimately: run a full plan
• Remember: it was 3 years before we started automating
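A dry-run sketch of the first two Chaos Cat actions, assuming Linux hosts where latency and loss are injected with `tc` and the netem qdisc. The talk does not say how Chaos Cat is implemented: the `Step`, `plan_reboot`, and `plan_netem` names, the `eth0` interface, and the 100 ms / 5% values are all illustrative assumptions. The functions only build commands and never execute them.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Step:
    host: str       # machine the command is meant for
    offset_s: int   # seconds after the start of the attack
    command: str    # shell command, built here but never executed


def plan_reboot(hosts: Sequence[str], rng: random.Random) -> List[Step]:
    """Pick one machine at random and plan a reboot for it."""
    return [Step(rng.choice(hosts), 0, "sudo reboot")]


def plan_netem(
    hosts: Sequence[str],
    rng: random.Random,
    duration_s: int = 7 * 60,   # the 7-minute window from the slide
    iface: str = "eth0",        # assumed interface name
    delay_ms: int = 100,        # assumed latency
    loss_pct: int = 5,          # assumed packet loss
) -> List[Step]:
    """Plan adding latency and loss to one random machine, then removing it."""
    host = rng.choice(hosts)
    add = (
        f"sudo tc qdisc add dev {iface} root netem "
        f"delay {delay_ms}ms loss {loss_pct}%"
    )
    remove = f"sudo tc qdisc del dev {iface} root"
    return [Step(host, 0, add), Step(host, duration_s, remove)]
```

Printing the result of `plan_netem(["app-1", "app-2"], random.Random())` gives a reviewable plan. Keeping execution separate from planning is one way to respect the "don't automate it… yet" advice while still preparing for automation.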

Page 29

Team-specific FFs

• A single FF is a bottleneck
• Individual teams can run their own:
  • On-call training - cause pages on a test account
  • “What’s your status?” - look at dashboards for previous time periods: some healthy, some unhealthy
• Reviewing dashboards is a gold mine!
• For all exercises, have a pair act with the rest of the team observing

Page 30

What’s next?

Page 31

More Game Days

• Team / office is unavailable
• GitHub is down
• Slack / Hangouts is down
• CI server is down

Page 32

Capacity planning

• If your traffic spiked by 20%, do you have enough capacity?
• Take down servers and find out!
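The arithmetic behind this exercise can be sketched as follows, assuming load is spread evenly across identical servers (an assumption; the slides do not state it). Removing k of n servers multiplies the load on each remaining server by n / (n - k), so a 20% spike corresponds to removing about one server in six. The function names are hypothetical.

```python
def load_multiplier(total: int, removed: int) -> float:
    """Per-server load factor after removing `removed` of `total` servers."""
    if not 0 <= removed < total:
        raise ValueError("must leave at least one server running")
    return total / (total - removed)


def servers_to_remove(total: int, spike_percent: int) -> int:
    """Fewest servers to remove so the rest see at least the given spike.

    Solves total / (total - k) >= 1 + spike_percent / 100 for k, which
    gives k >= total * p / (100 + p). Integer arithmetic avoids float
    rounding when taking the ceiling.
    """
    if total < 2 or spike_percent <= 0:
        raise ValueError("need at least 2 servers and a positive spike")
    k = -(-total * spike_percent // (100 + spike_percent))  # ceiling division
    if k >= total:
        raise ValueError("spike too large to simulate with this pool")
    return k
```

With 12 servers, `servers_to_remove(12, 20)` asks for 2 to be taken down, which leaves 10 servers each carrying 1.2 times their usual load. With small pools the ceiling overshoots, so the simulated spike is at least, not exactly, the requested one.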

Page 34

Error budget

“100% is the wrong reliability target for basically everything”

https://landing.google.com/sre/interview/ben-treynor.html

Page 35

Error budget

Nines  Availability   Monthly unavailability
1      90%            3 days
2      99%            7.2 hours
3      99.9%          43.8 minutes
4      99.99%         4.38 minutes
5      99.999%        25.9 seconds
6      99.9999%       2.6 seconds
7      99.99999%      263 milliseconds
8      99.999999%     26.3 milliseconds
9      99.9999999%    2.63 milliseconds
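The figures in this table follow from multiplying the length of a month by the allowed unavailability. A minimal sketch, with hypothetical function names and an assumed default of a 30-day month: that default reproduces the 3 days, 7.2 hours, and 25.9 seconds rows, while rows such as 43.8 minutes appear to use an average month of 365.25 / 12 days instead, which can be passed as `days_in_month`.

```python
def monthly_downtime_seconds(nines: int, days_in_month: float = 30.0) -> float:
    """Allowed downtime per month for an availability of `nines` nines.

    For example, nines=3 means 99.9% availability, so 0.1% of the month
    may be spent unavailable.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** -nines
    return days_in_month * 24 * 3600 * unavailability


def describe(seconds: float) -> str:
    """Render a duration in the largest unit that keeps the value >= 1."""
    units = (("days", 86400), ("hours", 3600), ("minutes", 60), ("seconds", 1))
    for unit, size in units:
        if seconds >= size:
            return f"{seconds / size:.3g} {unit}"
    return f"{seconds * 1000:.3g} milliseconds"
```

Calling `describe(monthly_downtime_seconds(n))` for n from 1 to 9 regenerates the table under the chosen month length, which makes it easy to see how little room each extra nine leaves.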

Page 38

Error budget

• Crazy idea: if you’re under budget for 3 months in a row, take down your service to use up the budget
• Find hidden dependencies (priority inversion)
• Gut-check your target availability

Page 39

Conclusions

Page 40

Conclusions

1. Failure Friday is awesome, you should do it
2. Don’t automate it… yet
3. When it gets boring, switch it up