Antifragility and testing distributed systems: Approaches for testing and improving resiliency

Feb 11, 2017

DiUS
Transcript
Page 1

Antifragility and testing distributed systems: Approaches for testing and improving resiliency

Page 2

Failure: It’s inevitable

Page 3

Microservice Architectures

■ Bounded contexts
■ Deterministic in nature
■ Simple behaviour
■ Independently testable (e.g. Pact)

Page 4
Page 5

Distributed Architectures

Conversely…

■ Unbounded context
■ Non-determinism
■ Exhibit chaotic behaviour
■ Emergent behaviour
■ Complex testing

Page 6
Page 7

Problems with traditional approaches

■ Integration test hell
■ Need to get by without E2E environments
■ Learnings are non-representative anyway
■ Slower
■ Costly (effort + $$)

Page 8

Alternative?

Create an isolated, simulated environment

■ Run locally or on a CI environment
■ Fast - no need to set up complex test data, scenarios etc.
■ Enables single-variable hypothesis testing
■ Automatable

Page 9

Lab Testing with Docker Compose: Hypothesis testing in simulated environments

Page 10

Docker Compose

■ Docker container orchestration tool
■ Run locally or remotely
■ Works across platforms (Windows, Mac, *nix)
■ Easy to use
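To make this concrete, here is a minimal, illustrative docker-compose.yml for the kind of simulated environment described above. The service names, images, and layout are assumptions for illustration, not the configuration from the talk:

```yaml
# docker-compose.yml - illustrative lab environment (names/images assumed)
version: "2"
services:
  api:
    image: myorg/api:latest      # hypothetical system under test

  muxy:
    image: mefellows/muxy        # fault-injecting proxy (image name assumed)
    volumes:
      - ./conf:/opt/muxy/conf    # Muxy proxy/middleware configuration
    ports:
      - "8181:8181"
    links:
      - api

  tests:
    build: ./harness             # hypothetical test harness driving traffic via muxy
    links:
      - muxy
```

Run it with `docker-compose up`, locally or on CI; because everything is containerised, the same experiment is repeatable anywhere.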

Page 11
Page 12

Nginx

Let’s take a practical, real-world example: Nginx as an API Proxy.
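A minimal, illustrative nginx.conf for that role might look like the following; the upstream host and timeout values are assumptions, not the production configuration used in the talk:

```nginx
# nginx.conf - minimal API proxy sketch (host and timeouts are assumed)
events {}

http {
  server {
    listen 80;

    location /api/ {
      # In the demo setup, DNSMasq resolves a production hostname like
      # this one to a local Docker container instead of the real API.
      proxy_pass http://api.example.com/;
      proxy_connect_timeout 1s;  # fail fast on an unreachable upstream
      proxy_read_timeout 2s;     # bound the impact of a slow upstream
    }
  }
}
```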

Page 13
Page 14

Simulating failure with Muxy

“A tool to help simulate distributed systems failures”
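Muxy is driven by a YAML file listing proxies and middleware. The sketch below follows the shape of the project’s README, but treat the exact keys, middleware names, and units as assumptions and verify them against the Muxy repository:

```yaml
# muxy config sketch - verify key names against the Muxy README
proxy:
  - name: http_proxy          # listen locally, forward to the real API
    config:
      host: 0.0.0.0
      port: 8181
      proxy_host: api         # e.g. a linked Docker container
      proxy_port: 80

middleware:
  - name: delay               # inject latency into the request/response
    config:
      request_delay: 500      # assumed to be milliseconds
      response_delay: 1000
  - name: logger              # log traffic passing through the proxy
```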

Page 15

Hypothesis testing

Our job is to hypothesise, test, learn, change, and repeat

Page 16

Nginx Testing
H0 = Introducing network latency does not cause errors

Test setup:

● Nginx running locally, with Production configuration
● DNSMasq used to resolve production URLs to other Docker containers
● Muxy container set up, proxying the API
● A test harness to hit the API via Nginx n times, expecting 0 failures (sketched below)
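The harness itself can be tiny: a loop that hits the endpoint and counts non-200 responses. A Go sketch, where the endpoint URL and request count are illustrative assumptions:

```go
// harness.go - hit the API n times via the proxy; any failure falsifies H0.
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	const n = 100                             // assumed request count
	endpoint := "http://localhost/api/health" // assumed: via Nginx -> Muxy -> API
	client := &http.Client{Timeout: 5 * time.Second}

	failures := 0
	for i := 0; i < n; i++ {
		resp, err := client.Get(endpoint)
		if err != nil {
			failures++
			continue
		}
		if resp.StatusCode != http.StatusOK {
			failures++
		}
		resp.Body.Close()
	}

	fmt.Printf("%d/%d requests failed\n", failures, n)
	if failures > 0 {
		os.Exit(1) // reject H0: the injected latency caused errors
	}
}
```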

Page 17
Page 18

Demo

Fingers crossed...

Page 19

Knobs and Levers

We now have a number of levers to pull. What if we...

● Want to improve on our SLA?
● Want to see how it performs if the API is hard down?
● ...

Page 20

Antifragility: Failure is inevitable, let’s make it normal

Page 21

Titanic Architectures


Page 22

Titanic Architectures

“Titanic architectures are architectures that are good in theory, but haven’t been put into practice”

Page 23

Anti-titanic architectures?

“What doesn’t kill you makes you stronger”

Page 24

Antifragility

“The resilient resists shocks and stays the same; the antifragile gets better” - Nassim Taleb

Page 25

Chaos Engineering

● We expect our teams to build resilient applications
  ○ Fault tolerance across and within service boundaries
● We expect servers and dependent services to fail
● Let’s make that normal
● Production is a playground
● Levelling up

Page 26

Chaos Engineering - Principles

1. Build a hypothesis around Steady State Behavior
2. Vary real-world events
3. Run experiments in production
4. Automate experiments to run continuously

Requires the ability to measure - you need metrics!!

http://www.principlesofchaos.org/

Page 27

Production Hypothesis Testing

H0 = Loss of an AWS region does not result in errors

Test setup:

● Multi-region application setup for the video playing API
● Apply Chaos Kong to us-west-2
● Measure aggregate production traffic for ‘normal’ levels

Page 28

Kill an AWS region

http://techblog.netflix.com/2015/09/chaos-engineering-upgraded.html

Page 29

Go/Hystrix API Demo

H0 = Introducing network latency does not cause API errors

Test setup:

● API1 running with a Hystrix circuit breaker, enabled if API2 does not respond within SLAs
● Muxy container set up, proxying upstream API2
● A test harness to hit API1 n times, expecting 0 failures (see the sketch below)
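The linked demo is written in Go; a minimal sketch of API1’s circuit-breaker side using the hystrix-go library is below. The command name, thresholds, and API2 URL are illustrative assumptions, not the demo’s exact code:

```go
// api1.go - call API2 through a Hystrix circuit breaker with a fallback.
package main

import (
	"io/ioutil"
	"log"
	"net/http"

	"github.com/afex/hystrix-go/hystrix"
)

func main() {
	hystrix.ConfigureCommand("api2", hystrix.CommandConfig{
		Timeout:               1000, // ms; our SLA on API2 (assumed value)
		MaxConcurrentRequests: 100,
		ErrorPercentThreshold: 25, // trip the breaker beyond this error rate
	})

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		output := make(chan []byte, 1)
		errs := hystrix.Go("api2", func() error {
			// Happy path: call API2 (hostname assumed) within the SLA.
			resp, err := http.Get("http://api2:8080/resource")
			if err != nil {
				return err
			}
			defer resp.Body.Close()
			body, err := ioutil.ReadAll(resp.Body)
			if err != nil {
				return err
			}
			output <- body
			return nil
		}, func(err error) error {
			// Fallback: degrade gracefully so latency injected by Muxy
			// upstream does not surface as errors to API1's callers.
			output <- []byte("{}")
			return nil
		})

		select {
		case body := <-output:
			w.Write(body)
		case err := <-errs:
			http.Error(w, err.Error(), http.StatusInternalServerError)
		}
	})

	log.Fatal(http.ListenAndServe(":8181", nil))
}
```

When Muxy delays API2 beyond the configured timeout, the breaker opens and the fallback serves a degraded response, which is exactly what the “expecting 0 failures” hypothesis exercises.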

Page 30

Human Factors: Technology is only part of the problem; can we test that too?

Page 31
Page 32

Chernobyl

● Worst nuclear disaster of all time (1986)
● Public information sketchy
● Estimated > 3M Ukrainians affected
● Radioactive clouds sent over Europe
● Combination of system + human errors
● Series of seemingly logical steps -> catastrophe

Page 33

What we know about human factors

● Accidents happen
● 1am - 8am = higher incidence of human errors
● Humans will ignore directions
  ○ They sometimes need to (e.g. override)
  ○ Other times they think they need to (mistake)
● Computers are better at following processes

Page 34

Translation

Let’s use a Production deployment as a key example:

● CI -> CD pipeline used to deploy
● Production incident occurs 6 hours later (2am)
● ...what do we do?
● We trust the build pipeline, avoid non-standard actions

These events help us understand and improve our systems

Page 35

Game Day Exercises

“A game day exercise is where we intentionally try to break our system, with the goal of being able to understand it better and learn from it”

Page 36

Game Day Exercises

Prerequisites:

● A game plan
● All team members and affected staff aware of it
● Close collaboration between Dev, Ops, Test, Product people etc.
● An open mind
● Hypotheses
● Metrics
● Bravery

Page 37

Game Day Exercises

● Get the entire team together
● Make a simple diagram of the system on a whiteboard
● Come up with ~5 failure scenarios
● Write down hypotheses for each scenario
● Back up any data you can’t lose
● Induce each failure and observe the results

https://stripe.com/blog/game-day-exercises-at-stripe

Page 38

Game Day Exercises

Examples of things that fail:

● Application dies
● Hard disk fails
● Machine dies < AZ < Region…
● GitHub/Source control goes down
● Build server dies
● Loss of or degraded network connectivity
● Loss of a dependent API
● ...

Page 39

Wrapping up: I hope I didn’t fail

Page 40

Wrapping up

■ Apply the scientific method
■ Use metrics to learn and make decisions
■ Docker Compose + Muxy to automate failure testing
■ Build resilience into software & architecture
■ Regularly test Production resilience until it’s normal
■ Production outages are opportunities to learn
■ Start small!

Page 41

Thank you

PRESENTED BY:

@matthewfellows

Page 42

References

■ Antifragility (https://en.wikipedia.org/wiki/Antifragile)
■ Chaos Engineering (http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html)
■ Principles of Chaos (http://www.principlesofchaos.org/)
■ Human factors in large-scale technological systems’ accidents: Three Mile Island, Bhopal, Chernobyl (http://oae.sagepub.com/content/5/2/133.abstract)

Page 43

Code / Tool References

■ Docker Compose (https://www.docker.com/docker-compose)
■ Muxy (https://github.com/mefellows/muxy)
■ Nginx resilience testing with Docker Compose (www.onegeek.com.au/articles/resilience-testing-nginx-with-docker-dnsmasq-and-muxy)
■ Golang + Hystrix resilience testing with Docker Compose (https://github.com/mefellows/muxy/tree/mst-meetup-demo/examples/hystrix)