CHAOS DRIVEN DEVELOPMENT Future Insights Live 2015, Las Vegas Bruce Wong
Transcript
Page 1: Chaos Driven Development

CHAOS DRIVEN DEVELOPMENT

Future Insights Live 2015, Las Vegas

Bruce Wong

Page 2: Chaos Driven Development

A LITTLE ABOUT ME

• Founder of Chaos Engineering @ Netflix

• Computer Science Background

• Multiple roles scaling Netflix from 8M to 60M+ subscribers

• Currently Taking a Break

@bruce_m_wong

Page 3: Chaos Driven Development

Most enterprises hire people to fix things. Netflix hires people to break things….

…we should embrace Netflix's culture of "chaos engineering" throughout organizations of all shapes and sizes.

http://readwrite.com/2014/09/17/netflix-chaos-engineering-for-everyone

Page 4: Chaos Driven Development

http://www.techrepublic.com/article/serious-about-cloud-it-might-be-time-to-look-into-chaos-engineering/

https://gigaom.com/2014/09/11/netflixs-new-chaos-engineering-push-aims-to-hire-staff-to-help-break-its-cloud-based-system/

Page 5: Chaos Driven Development

http://www.cnbc.com/id/102394893

Page 6: Chaos Driven Development

http://www.cnbc.com/id/102394893

Page 7: Chaos Driven Development

CHAOS DEFINED

“If it ain’t broke, don’t fix it”

- Bert Lance, Nation’s Business, 1977

“If it ain’t broke, try harder” - chaos philosophy


Page 8: Chaos Driven Development

CHAOS DEFINED

Intentionally introducing failure into a system with the purpose of validating resilience design.


Page 9: Chaos Driven Development

WHY CHAOS?

Failure happens.


Page 10: Chaos Driven Development

WHY CHAOS?

• Hardware fails

• Power outages

• Software has bugs

• Human error

• Natural disasters

Page 11: Chaos Driven Development

http://money.cnn.com/2012/10/30/technology/netflix-hurricane-sandy/

Page 12: Chaos Driven Development

http://www.pcworld.com/article/2691772/how-netflix-survived-the-amazon-ec2-reboot.html

https://gigaom.com/2014/10/03/netflix-lost-218-database-servers-during-aws-reboot-and-stayed-online/


Page 13: Chaos Driven Development
Page 14: Chaos Driven Development

BLUE MOONS

“Once in a blue moon” will eventually happen

Page 15: Chaos Driven Development

FAULT-TOLERANT DESIGN PRINCIPLES

• Eliminate Single Points of Failure

• Allow parts of the system to fail independently (Failure Isolation)

• Prevent propagation (Failure Containment)
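As one illustration of failure isolation and containment, the sketch below wraps a non-critical dependency so that its failure degrades a single section of a page instead of the whole response. The names `call_with_fallback`, `fetch_recommendations`, and `render_home_page` are hypothetical and do not come from the talk; this is a minimal example of the principle, not a description of how Netflix implements it.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T], fallback: T) -> T:
    """Return primary() or, if it raises, the fallback value.

    Catching the exception here contains the failure so it does not
    propagate to the caller.
    """
    try:
        return primary()
    except Exception:
        return fallback


def fetch_recommendations() -> List[str]:
    # Hypothetical non-critical dependency that is currently failing.
    raise ConnectionError("recommendation service unavailable")


def render_home_page() -> dict:
    # The page still renders when recommendations fail: the failure is
    # isolated to one section instead of taking down the whole page.
    return {
        "title": "Home",
        "recommendations": call_with_fallback(fetch_recommendations, []),
    }
```

Here the fallback is an empty list; a real system might instead serve cached or generic content, depending on what the product can tolerate.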


Page 16: Chaos Driven Development

START WITH CONSEQUENCES

Chaos Driven Development


Page 17: Chaos Driven Development

MINIMUM VIABLE PRODUCT

• Understand your users

• Understand your value proposition

• Understand your business


Page 18: Chaos Driven Development

PRIORITIZE

• Many aspects and features are important

• Each has different consequences when it does not work

• A product’s value proposition is what drives your business


Page 19: Chaos Driven Development

DESIGN FOR FAILURE

What failure isolation might look like


Page 20: Chaos Driven Development
Page 21: Chaos Driven Development
Page 22: Chaos Driven Development
Page 23: Chaos Driven Development
Page 24: Chaos Driven Development
Page 25: Chaos Driven Development
Page 26: Chaos Driven Development
Page 27: Chaos Driven Development
Page 28: Chaos Driven Development

APPLYING CHAOS

Validation of fault-tolerant design


Page 29: Chaos Driven Development

BREAKING THE CONNECTION

How confident are you?

- Next week?

- Next month?

- After that “quick patch”?

Page 30: Chaos Driven Development

WHAT DOES CHAOS LOOK LIKE?

• Types - errors, latency

• Duration - how long?

• Intensity - how much?
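The three dimensions above can be captured as a small data structure describing one experiment. This is only a sketch: the class and field names are hypothetical, and it assumes that "intensity" means the fraction of requests affected, which is one of several reasonable readings.

```python
from dataclasses import dataclass
from enum import Enum


class FailureType(Enum):
    ERROR = "error"
    LATENCY = "latency"


@dataclass(frozen=True)
class ChaosExperiment:
    failure_type: FailureType  # type: errors or latency
    duration_seconds: float    # duration: how long?
    intensity: float           # intensity: assumed fraction of requests, 0.0 to 1.0

    def __post_init__(self) -> None:
        # Reject parameters that would make the experiment unbounded.
        if self.duration_seconds <= 0:
            raise ValueError("duration_seconds must be positive")
        if not 0.0 <= self.intensity <= 1.0:
            raise ValueError("intensity must be between 0.0 and 1.0")


# Errors on 1% of requests for one minute.
one_percent_errors = ChaosExperiment(
    FailureType.ERROR, duration_seconds=60.0, intensity=0.01
)
```

Making all three parameters explicit and validated keeps every experiment bounded in both time and scope.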


Page 31: Chaos Driven Development

WHAT DOES CHAOS LOOK LIKE?

• Return errors for a percentage of requests

• e.g. return HTTP 500 for 1% of requests for 1 minute
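A minimal sketch of this kind of error injection, assuming a request handler can be modelled as a function returning a `(status, body)` pair. The `inject_errors` wrapper and its parameters are hypothetical names for illustration; the random source and clock are injectable so the behaviour can be tested deterministically.

```python
import random
import time
from typing import Callable, Tuple

Response = Tuple[int, str]  # (HTTP status code, body)


def inject_errors(
    handler: Callable[[], Response],
    error_rate: float,
    duration_seconds: float,
    rng: Callable[[], float] = random.random,
    clock: Callable[[], float] = time.monotonic,
) -> Callable[[], Response]:
    """Wrap handler so a fraction of calls return HTTP 500 for a limited time."""
    deadline = clock() + duration_seconds

    def wrapped() -> Response:
        # Inject failures only while the experiment window is open.
        if clock() < deadline and rng() < error_rate:
            return 500, "injected failure"
        return handler()

    return wrapped


def ok_handler() -> Response:
    return 200, "ok"


# Roughly 1% of requests fail for 1 minute; afterwards the wrapper is a no-op.
chaotic_handler = inject_errors(ok_handler, error_rate=0.01, duration_seconds=60)
```

Because the deadline is fixed when the wrapper is created, the experiment ends on its own even if nobody remembers to switch it off.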


Page 32: Chaos Driven Development

WHAT DOES CHAOS LOOK LIKE?

• Make it slow(er) - Introduce Latency

• e.g. sleep for 10ms on every request for 1 minute
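Latency injection can be sketched the same way: delay every call by a fixed amount while the experiment window is open. `inject_latency` and its parameters are hypothetical names, and the sleep function and clock are injectable only to make the example testable.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def inject_latency(
    handler: Callable[[], T],
    delay_seconds: float,
    duration_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Callable[[], T]:
    """Wrap handler so every call is delayed for a limited time."""
    deadline = clock() + duration_seconds

    def wrapped() -> T:
        # Add the delay only while the experiment window is open.
        if clock() < deadline:
            sleep(delay_seconds)
        return handler()

    return wrapped


# 10 ms of extra latency on every request for 1 minute.
slow_handler = inject_latency(lambda: "ok", delay_seconds=0.010, duration_seconds=60)
```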


Page 33: Chaos Driven Development

WHAT DOES CHAOS LOOK LIKE?

Gradually increase

• e.g. sleep for 10ms on every request for 1 minute

• then sleep for 100ms on every request for 3 minutes
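A gradual increase can be expressed as a schedule of stages, each with its own delay and length. The sketch below uses the two stages from this slide; `RAMP` and `delay_at` are hypothetical names, and running the stages back to back is an assumption about how the ramp is sequenced.

```python
from typing import List, Optional, Tuple

# (delay in seconds, how long the stage lasts in seconds)
Stage = Tuple[float, float]

RAMP: List[Stage] = [
    (0.010, 60.0),   # 10 ms on every request for 1 minute
    (0.100, 180.0),  # 100 ms on every request for 3 minutes
]


def delay_at(elapsed_seconds: float, stages: List[Stage]) -> Optional[float]:
    """Return the delay to inject at a given time since the experiment began.

    Returns None before the start and once every stage has finished.
    """
    if elapsed_seconds < 0:
        return None
    stage_start = 0.0
    for delay, length in stages:
        if elapsed_seconds < stage_start + length:
            return delay
        stage_start += length
    return None
```

Starting with the mildest stage means a design flaw is likely to show up while the injected impact is still small.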


Page 34: Chaos Driven Development

WHAT DOES CHAOS LOOK LIKE?

The design/implementation worked!

• microscopic impact, high confidence

What if it didn’t work?

• smaller impact than an outage

• proactively fix it and try again

Page 35: Chaos Driven Development

WHAT DOES AN OUTAGE LOOK LIKE?

• Detection takes time (TTD)

• Analysis takes time

• Resolution takes time (TTR)

• Inconvenient times


Page 36: Chaos Driven Development

CHAOS VS OUTAGE

Chaos

• Controlled

• Planned

• Intentional

• Microscopic user impact

Outages

• Uncontrolled

• Unpredictable

• Unintended

• Large impact

Page 37: Chaos Driven Development

WHAT ABOUT TESTING?

• Testing is good - do it, automate it

• While great testing disciplines can find most functional bugs, they can miss problems with:

• Scale, traffic, and capacity

• System misconfiguration and design limitations


Page 38: Chaos Driven Development

LESSONS LEARNED

• You learn more from chaos exercises than from outages

• Fixing a failure mode will uncover new ones

• Configuration is often overlooked

• Tools can break


Page 39: Chaos Driven Development

WHY IS THIS HARD?


Page 40: Chaos Driven Development

WHAT MAKES RESILIENCE DESIGN HARD?

• Product and Engineering Decision

• Tradeoffs are difficult

• Organizational Silos


Page 41: Chaos Driven Development

ORGANIZATIONAL SILOS

• Services by Domain

• Dev/Ops/Product

• Incomplete context


Page 42: Chaos Driven Development

WHAT MAKES CHAOS HARD?

In addition to the technical challenges:

• Organizations rarely incentivize people to try and break production

• Misconceptions about complex systems and scale


Page 43: Chaos Driven Development

TAKEAWAYS

• What are the consequences?

• Start small, start early

• Work together - share context

• Validate, don’t assume


Page 44: Chaos Driven Development

QUESTIONS?
