Why Resilience? A primer at varying flight altitudes Uwe Friedrichsen, codecentric AG, 2014

Why resilience - A primer at varying flight altitudes

May 11, 2015



This session provides a primer to resilience at varying flight altitudes.

It starts at a management level and motivates why resilience is important, why it is important today and what the business case for resilience is (or actually is not).

Then it descends to a high-level architectural view and explains resilience in a bit more detail, its correlation to availability and the difference between resilience and robustness.

Afterwards it descends to a design level and explains some selected core principles of resilience, some of them garnished with code examples at grass-roots flight altitude.

At the end the flight altitude rises again: some recommendations on how to introduce resilient software design into your software development process are given, and the correlation to some related topics is explained.

Of course this slide deck only shows a fraction of the actual talk contents, as the voice track is missing, but I hope it will be helpful anyway.
Transcript
Page 1: Why resilience - A primer at varying flight altitudes

Why Resilience? A primer at varying flight altitudes

Uwe Friedrichsen, codecentric AG, 2014

Page 2: Why resilience - A primer at varying flight altitudes

@ufried Uwe Friedrichsen | http://slideshare.net/ufried | http://ufried.tumblr.com

Page 3: Why resilience - A primer at varying flight altitudes

Resilience? Never heard of it …

Page 4: Why resilience - A primer at varying flight altitudes

re•sil•ience (rɪˈzɪl yəns) also re•sil′ien•cy, n.

1.  the power or ability to return to the original form, position, etc., after being bent, compressed, or stretched; elasticity.

2.  ability to recover readily from illness, depression, adversity, or the like; buoyancy.

Random House Kernerman Webster's College Dictionary, © 2010 K Dictionaries Ltd. Copyright 2005, 1997, 1991 by Random House, Inc. All rights reserved.

http://www.thefreedictionary.com/resilience

Page 5: Why resilience - A primer at varying flight altitudes

Resilience (IT)

The ability of an application to handle unexpected situations

-  without the user noticing it (best case)

-  with a graceful degradation of service (worst case)

Page 6: Why resilience - A primer at varying flight altitudes

Resilience is not about testing your application

(You should definitely test your application, but that's a different story)

    public class MySUTTest {
        @Test
        public void shouldDoSomething() {
            MySUT sut = new MySUT();
            MyResult result = sut.doSomething();
            assertEquals(<Some expected result>, result);
        }
        …
    }

Page 7: Why resilience - A primer at varying flight altitudes

It's all about production!

Page 8: Why resilience - A primer at varying flight altitudes

Why should I care?

Page 9: Why resilience - A primer at varying flight altitudes

Business

Production

Availability

Resilience

Page 10: Why resilience - A primer at varying flight altitudes

Your web server doesn't look good …

Page 11: Why resilience - A primer at varying flight altitudes

The dreaded SiteTooSuccessfulException …

Page 12: Why resilience - A primer at varying flight altitudes

Reasons to care about resilience

•  Loss of lives

•  Loss of goods (manufacturing facilities)

•  Loss of money

•  Loss of reputation

Page 13: Why resilience - A primer at varying flight altitudes

Why should I care about it today?

(The risks you mention are not new)

Page 14: Why resilience - A primer at varying flight altitudes

Resilience drivers

•  Cloud-based systems

•  Highly scalable systems

•  Zero Downtime

•  IoT & Mobile

•  Social

→ Reliably running distributed systems

Page 15: Why resilience - A primer at varying flight altitudes

What’s the business case?

(I don’t see any money to be made with it)

Page 16: Why resilience - A primer at varying flight altitudes

Counter question

Can you afford to ignore it?

(It’s not about making money, it’s about not losing money)

Page 17: Why resilience - A primer at varying flight altitudes

Resilience business case

•  Identify risk scenarios

•  Calculate current occurrence probability

•  Calculate future occurrence probability

•  Calculate short-term losses

•  Calculate long-term losses

•  Assess risks and money

•  Do not forget the competitors

Page 18: Why resilience - A primer at varying flight altitudes

Let’s dive deeper into resilience

Page 19: Why resilience - A primer at varying flight altitudes

Classification attempt

ISO/IEC 9126 software quality characteristics: Functionality, Reliability, Usability, Efficiency, Maintainability, Portability

Reliability: A set of attributes that bear on the capability of software to maintain its level of performance under stated conditions for a stated period of time.

Available with acceptable latency

Resilience goes beyond that

Page 20: Why resilience - A primer at varying flight altitudes

How can I maximize availability?

Page 21: Why resilience - A primer at varying flight altitudes

Availability ≔ MTTF / (MTTF + MTTR)

MTTF: Mean Time To Failure

MTTR: Mean Time To Recovery

Page 22: Why resilience - A primer at varying flight altitudes

Traditional approach (robustness)

Availability ≔ MTTF / (MTTF + MTTR)

Maximize MTTF

Page 23: Why resilience - A primer at varying flight altitudes

A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.

Leslie Lamport

Page 24: Why resilience - A primer at varying flight altitudes

Failures in today's complex, distributed, interconnected systems are not the exception.

They are the normal case.

Page 25: Why resilience - A primer at varying flight altitudes

Contemporary approach (resilience)

Availability ≔ MTTF / (MTTF + MTTR)

Minimize MTTR
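
To make the formula concrete, here is a small sketch that evaluates it for made-up numbers. The class name `Availability` and all figures are assumptions chosen for illustration, not values from the talk. It compares a baseline with two improvements: failing ten times less often (the robustness route) and recovering ten times faster (the resilience route).

```java
/** Illustrative helper, not from the talk: evaluates Availability = MTTF / (MTTF + MTTR). */
final class Availability {

    /** Both arguments must use the same time unit (hours in the example below). */
    static double of(double mttf, double mttr) {
        if (mttf < 0 || mttr < 0 || mttf + mttr == 0) {
            throw new IllegalArgumentException("MTTF and MTTR must be non-negative and not both zero");
        }
        return mttf / (mttf + mttr);
    }

    public static void main(String[] args) {
        // Hypothetical system: fails every 1,000 hours and needs 10 hours to recover.
        double baseline = of(1_000, 10);
        // Robustness route: make failures ten times rarer.
        double tenfoldMttf = of(10_000, 10);
        // Resilience route: recover ten times faster.
        double tenthOfMttr = of(1_000, 1);
        System.out.printf("baseline=%.5f  10 x MTTF=%.5f  MTTR / 10=%.5f%n",
                baseline, tenfoldMttf, tenthOfMttr);
    }
}
```

Both improvements yield the same availability, because scaling MTTF up by a factor is algebraically the same as scaling MTTR down by that factor. The argument for the resilience route is therefore about which of the two is achievable in a distributed system, not about the formula itself.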

Page 26: Why resilience - A primer at varying flight altitudes

Do not try to avoid failures. Embrace them.

Page 27: Why resilience - A primer at varying flight altitudes

What kinds of failures do I need to deal with?

Page 28: Why resilience - A primer at varying flight altitudes

Failure types

•  Crash failure

•  Omission failure

•  Timing failure

•  Response failure

•  Byzantine failure

Page 29: Why resilience - A primer at varying flight altitudes

How do I implement resilience?

Page 30: Why resilience - A primer at varying flight altitudes

Bulkheads

Page 31: Why resilience - A primer at varying flight altitudes

•  Divide the system into failure units

•  Isolate failure units

•  Define fallback strategy
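
The three bullets above can be sketched with nothing but the JDK. This is one possible reading, not the implementation from the talk: the class `Bulkhead` and its `call` method are hypothetical names, and a semaphore is only one of several ways to isolate a failure unit (dedicated thread pools or separate processes are others).

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical helper: caps the number of concurrent calls into one failure unit. */
final class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the action if a permit is free; otherwise returns the fallback immediately. */
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get();   // unit is saturated: do not queue, degrade instead
        }
        try {
            return action.get();
        } catch (RuntimeException e) {
            return fallback.get();   // the failure stays inside this unit
        } finally {
            permits.release();
        }
    }
}
```

A caller would create one `Bulkhead` per downstream dependency, so that a slow dependency can exhaust only its own permits and not the threads of the whole application.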

Page 32: Why resilience - A primer at varying flight altitudes

Redundancy

Page 33: Why resilience - A primer at varying flight altitudes

•  Elaborate use case: Minimize MTTR / scale transactions / handle response errors / …

•  Define routing & balancing strategy: Round robin / master-slave / fan-out & quickest one wins / …

•  Consider admin involvement: Automatic vs. manual / notification – monitoring / …
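
As an illustration of the "fan-out & quickest one wins" strategy named above, the following sketch sends the same request to several redundant replicas and takes the first successful answer. The class `FanOut` and the method `quickestWins` are hypothetical; only `ExecutorService.invokeAny` is a real JDK call.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper illustrating "fan-out & quickest one wins". */
final class FanOut {

    /** Sends the same request to all replicas and returns the first successful answer. */
    static <T> T quickestWins(List<Callable<T>> replicas, long timeout, TimeUnit unit)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(replicas.size());
        try {
            // invokeAny returns the result of one task that completed without throwing.
            // It throws ExecutionException if every replica fails and
            // TimeoutException if none succeeds within the given time.
            return pool.invokeAny(replicas, timeout, unit);
        } finally {
            pool.shutdownNow();   // interrupt replicas that are still running
        }
    }
}
```

The price of this strategy is load: every request costs as many calls as there are replicas, which is why the slide lists it as one option next to round robin and master-slave rather than as a default.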

Page 34: Why resilience - A primer at varying flight altitudes

Loose Coupling

Page 35: Why resilience - A primer at varying flight altitudes

•  Isolate failure units (complements bulkheads)

•  Go asynchronous wherever possible

•  Use timeouts & circuit breakers

•  Make actions idempotent
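
Timeouts and retries only combine safely if repeating a request does no harm, which is what the last bullet asks for. One common way to get there, sketched here under the assumption that clients attach a unique request ID, is to remember the result per ID. `IdempotentReceiver` and `handle` are hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical receiver that makes an action idempotent via a client-supplied request ID. */
final class IdempotentReceiver<R> {
    private final Map<String, R> resultsByRequestId = new ConcurrentHashMap<>();

    /**
     * Executes the action once per request ID; a retried request gets the stored result.
     * Note: an action returning null is not recorded and would run again on retry.
     */
    R handle(String requestId, Supplier<R> action) {
        return resultsByRequestId.computeIfAbsent(requestId, id -> action.get());
    }
}
```

The map grows without bound in this sketch; a real service would expire entries after the longest retry window it expects and would keep them in storage that survives a restart.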

Page 36: Why resilience - A primer at varying flight altitudes

Implementation Example #1

Timeouts

Page 37: Why resilience - A primer at varying flight altitudes

Timeouts (1)

    // Basics
    myObject.wait();          // Do not use this by default
    myObject.wait(TIMEOUT);   // Better use this

    // Some more basics
    myThread.join();          // Do not use this by default
    myThread.join(TIMEOUT);   // Better use this

Page 38: Why resilience - A primer at varying flight altitudes

Timeouts (2)

    // Using the Java concurrent library
    Callable<MyActionResult> myAction = <My Blocking Action>
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<MyActionResult> future = executor.submit(myAction);
    MyActionResult result = null;
    try {
        result = future.get();                    // Do not use this by default
        result = future.get(TIMEOUT, TIMEUNIT);   // Better use this
    } catch (TimeoutException e) {                // Only thrown if timeouts are used
        ...
    } catch (...) {
        ...
    }
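
The slide leaves the blocking action and the catch blocks open. A self-contained variant might look like the sketch below; the class `TimeoutDemo`, the method `callWithTimeout` and the decision to return a fallback value are assumptions made for illustration. It also shows two details that are easy to forget: cancelling the task after a timeout and shutting down the executor.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper around Future.get(timeout, unit). */
final class TimeoutDemo {

    /** Runs the action, waiting at most the given time; returns the fallback otherwise. */
    static <T> T callWithTimeout(Callable<T> action, long timeout, TimeUnit unit, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(action);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true);                  // interrupt the still-running action
            return fallback;
        } catch (ExecutionException e) {
            return fallback;                      // the action itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // preserve the interrupt for the caller
            future.cancel(true);
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Creating an executor per call keeps the sketch short; production code would more likely share a bounded pool per downstream dependency, which doubles as a bulkhead.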

Page 39: Why resilience - A primer at varying flight altitudes

Timeouts (3)

    // Using Guava SimpleTimeLimiter
    Callable<MyActionResult> myAction = <My Blocking Action>
    SimpleTimeLimiter limiter = new SimpleTimeLimiter();
    MyActionResult result = null;
    try {
        result = limiter.callWithTimeout(myAction, TIMEOUT, TIMEUNIT, false);
    } catch (UncheckedTimeoutException e) {
        ...
    } catch (...) {
        ...
    }

Page 40: Why resilience - A primer at varying flight altitudes

Implementation Example #2

Circuit Breaker

Page 41: Why resilience - A primer at varying flight altitudes

Circuit Breaker – concept

[Diagram: a Client sends a Request to a Resource via a Circuit Breaker, which distinguishes "Resource available" from "Resource unavailable". Lifecycle states: Closed, Open, Half-Open.]
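
The lifecycle can be sketched in plain Java. The transitions below follow the commonly described pattern (Closed to Open after repeated failures, Open to Half-Open after a waiting period, Half-Open back to Closed on a successful trial call or back to Open on a failed one). The class, the failure threshold, the waiting period and the injected clock are all assumptions for illustration and not taken from the slides or from any library.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Minimal illustrative circuit breaker; names and thresholds are hypothetical. */
final class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures before opening
    private final long openMillis;        // how long to stay open before a trial call
    private final LongSupplier clock;     // injected so the behaviour can be tested
    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0;

    CircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Current state; an open breaker becomes half-open once the waiting period is over. */
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    <T> T call(Supplier<T> resource, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get();         // fail fast, do not touch the resource
        }
        try {
            T result = resource.get();     // CLOSED: normal call, HALF_OPEN: trial call
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized void onSuccess() {
        failures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        failures++;
        if (state == State.HALF_OPEN || failures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }
}
```

A caller would construct it as `new CircuitBreaker(5, 10_000, System::currentTimeMillis)` and route every access to the resource through `call`. The sketch lets several concurrent trial calls through while half-open; a production implementation would usually admit exactly one.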

Page 42: Why resilience - A primer at varying flight altitudes
Page 43: Why resilience - A primer at varying flight altitudes

Implemented patterns

•  Timeout

•  Circuit breaker

•  Load shedder

Page 44: Why resilience - A primer at varying flight altitudes

Supported patterns

•  Bulkheads (a.k.a. Failure Units)

•  Fail fast

•  Fail silently

•  Graceful degradation of service

•  Failover

•  Escalation

•  Retry

•  ...

Page 45: Why resilience - A primer at varying flight altitudes

Hello, world!

Page 46: Why resilience - A primer at varying flight altitudes

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    public class HelloCommand extends HystrixCommand<String> {
        private static final String COMMAND_GROUP = "default";
        private final String name;

        public HelloCommand(String name) {
            super(HystrixCommandGroupKey.Factory.asKey(COMMAND_GROUP));
            this.name = name;
        }

        // The guarded action, invoked by Hystrix when the command is executed
        @Override
        protected String run() throws Exception {
            return "Hello, " + name;
        }
    }

    // In a JUnit test class
    @Test
    public void shouldGreetWorld() {
        String result = new HelloCommand("World").execute();
        assertEquals("Hello, World", result);
    }

Page 47: Why resilience - A primer at varying flight altitudes

Source: https://github.com/Netflix/Hystrix/wiki/How-it-Works

Page 48: Why resilience - A primer at varying flight altitudes

Fallbacks

Page 49: Why resilience - A primer at varying flight altitudes

•  What will you do if a request fails?

•  Consider failure handling from the very beginning

•  Supplement with general failure handling strategies
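
One possible answer to "What will you do if a request fails?" is a layered fallback: serve the fresh value when the call succeeds, the last known good value when it fails, and a static default when nothing has been cached yet. The sketch below assumes that stale data is acceptable for the use case; the class `WithFallback` is a hypothetical name.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical wrapper: fresh value, else last known good value, else a static default. */
final class WithFallback<T> {
    private final Supplier<T> primary;
    private final T defaultValue;
    private final AtomicReference<T> lastKnownGood = new AtomicReference<>();

    WithFallback(Supplier<T> primary, T defaultValue) {
        this.primary = primary;
        this.defaultValue = defaultValue;
    }

    T get() {
        try {
            T fresh = primary.get();
            lastKnownGood.set(fresh);   // remember the value for later failures
            return fresh;
        } catch (RuntimeException e) {
            T cached = lastKnownGood.get();
            return cached != null ? cached : defaultValue;
        }
    }
}
```

Whether a stale or default answer is tolerable is a business decision, which is one reason the slide asks to consider failure handling from the very beginning rather than leaving it to the catch block.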

Page 50: Why resilience - A primer at varying flight altitudes

Scalability

Page 51: Why resilience - A primer at varying flight altitudes

•  Define scaling strategy

•  Think full stack

•  Apply D-I-D rule

•  Design for elasticity

Page 52: Why resilience - A primer at varying flight altitudes

… and many more

•  Supervision patterns

•  Recovery & mitigation patterns

•  Anti-fragility patterns

•  Supporting patterns

•  A rich pattern family

Different approach than traditional enterprise software development

Page 53: Why resilience - A primer at varying flight altitudes

How do I integrate resilience into my software development process?

Page 54: Why resilience - A primer at varying flight altitudes

Steps to adopt resilient software design

1.  Create awareness: Go DevOps

2.  Create capability: Coach your developers

3.  Create sustainability: Inject errors
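
Step 3 can start very small. The following sketch wraps any call so that it fails with a configurable probability, which lets a team check in a test environment whether timeouts, fallbacks and circuit breakers react as intended. `FaultInjector` is a hypothetical helper, not a tool mentioned in the talk.

```java
import java.util.Random;
import java.util.function.Supplier;

/** Hypothetical decorator that makes a call fail with a configurable probability. */
final class FaultInjector {
    private final double failureRate;   // 0.0 = never fail, 1.0 = always fail
    private final Random random;

    FaultInjector(double failureRate, Random random) {
        if (failureRate < 0.0 || failureRate > 1.0) {
            throw new IllegalArgumentException("failureRate must be between 0.0 and 1.0");
        }
        this.failureRate = failureRate;
        this.random = random;
    }

    /** Returns a supplier that throws an injected failure at the configured rate. */
    <T> Supplier<T> wrap(Supplier<T> target) {
        return () -> {
            if (random.nextDouble() < failureRate) {
                throw new IllegalStateException("injected failure");
            }
            return target.get();
        };
    }
}
```

Passing in a seeded `Random` makes a failing run reproducible, and keeping the failure rate in configuration allows injection to be switched off without a code change.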

Page 55: Why resilience - A primer at varying flight altitudes

Related topics

Reactive

Anti-fragility

Fault-tolerant software design

Recovery-oriented computing

Page 56: Why resilience - A primer at varying flight altitudes

Wrap-up

•  Resilience is about availability

•  Crucial for today's complex systems

•  Not caring is a risk

•  Go DevOps to create awareness

Page 57: Why resilience - A primer at varying flight altitudes

Do not avoid failures. Embrace them!

Page 58: Why resilience - A primer at varying flight altitudes

@ufried Uwe Friedrichsen | http://slideshare.net/ufried | http://ufried.tumblr.com

Page 59: Why resilience - A primer at varying flight altitudes