Why Resilience? A primer at varying flight altitudes Uwe Friedrichsen, codecentric AG, 2014
May 11, 2015
@ufried Uwe Friedrichsen | [email protected] | http://slideshare.net/ufried | http://ufried.tumblr.com
Resilience? Never heard of it …
re•sil•ience (rɪˈzɪl yəns) also re•sil′ien•cy, n.
1. the power or ability to return to the original form, position, etc., after being bent, compressed, or stretched; elasticity.
2. ability to recover readily from illness, depression, adversity, or the like; buoyancy.
Random House Kernerman Webster's College Dictionary, © 2010 K Dictionaries Ltd. Copyright 2005, 1997, 1991 by Random House, Inc. All rights reserved.
http://www.thefreedictionary.com/resilience
Resilience (IT)
The ability of an application to handle unexpected situations
- without the user noticing it (best case)
- with a graceful degradation of service (worst case)
Resilience is not about testing your application
(You should definitely test your application, but that’s a different story)
    public class MySUTTest {
        @Test
        public void shouldDoSomething() {
            MySUT sut = new MySUT();
            MyResult result = sut.doSomething();
            assertEquals(<Some expected result>, result);
        }
        …
    }
It’s all about production!
Why should I care?
Business
Production
Availability
Resilience
Your web server doesn’t look good …
The dreaded SiteTooSuccessfulException …
Reasons to care about resilience
• Loss of lives
• Loss of goods (manufacturing facilities)
• Loss of money
• Loss of reputation
Why should I care about it today?
(The risks you mention are not new)
Resilience drivers
• Cloud-based systems
• Highly scalable systems
• Zero Downtime
• IoT & Mobile
• Social
→ Reliably running distributed systems
What’s the business case?
(I don’t see any money to be made with it)
Counter question
Can you afford to ignore it?
(It’s not about making money, it’s about not losing money)
Resilience business case
• Identify risk scenarios
• Calculate current occurrence probability
• Calculate future occurrence probability
• Calculate short-term losses
• Calculate long-term losses
• Assess risks and money
• Do not forget the competitors
Let’s dive deeper into resilience
Classification attempt
ISO/IEC 9126 software quality characteristics: Functionality, Reliability, Usability, Efficiency, Maintainability, Portability
Reliability: A set of attributes that bear on the capability of software to maintain its level of performance under stated conditions for a stated period of time.
Available with acceptable latency
Resilience goes beyond that
How can I maximize availability?
Availability ≔ MTTF / (MTTF + MTTR)
MTTF: Mean Time To Failure MTTR: Mean Time To Recovery
Traditional approach (robustness)
Availability ≔ MTTF / (MTTF + MTTR)
Maximize MTTF
A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.
Leslie Lamport
Failures in today’s complex, distributed, interconnected systems are not the exception.
They are the normal case.
Contemporary approach (resilience)
Availability ≔ MTTF / (MTTF + MTTR)
Minimize MTTR
Do not try to avoid failures. Embrace them.
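A small worked example may make the two approaches easier to compare. The sketch below only evaluates the availability formula; the class name `AvailabilityExample` and the sample numbers (hours) are assumptions chosen for illustration, not figures from the talk.

```java
// Illustrative sketch: class name and sample numbers are assumptions.
final class AvailabilityExample {

    /** Availability = MTTF / (MTTF + MTTR); both arguments use the same time unit. */
    static double availability(double mttf, double mttr) {
        if (mttf < 0 || mttr < 0 || mttf + mttr == 0) {
            throw new IllegalArgumentException("MTTF and MTTR must be non-negative and not both zero");
        }
        return mttf / (mttf + mttr);
    }

    public static void main(String[] args) {
        // Baseline: one failure per 1000 hours on average, 10 hours to recover.
        System.out.println("baseline:    " + availability(1000, 10));
        // Robustness approach: double the time between failures.
        System.out.println("double MTTF: " + availability(2000, 10));
        // Resilience approach: recover ten times faster.
        System.out.println("MTTR / 10:   " + availability(1000, 1));
    }
}
```

With these assumed numbers, cutting MTTR to a tenth raises availability more than doubling MTTF does, which is the intuition behind "Minimize MTTR".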
What kinds of failures do I need to deal with?
Failure types
• Crash failure
• Omission failure
• Timing failure
• Response failure
• Byzantine failure
How do I implement resilience?
Bulkheads
• Divide system into failure units
• Isolate failure units
• Define fallback strategy
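One possible way to sketch the three bullets above in plain Java is a semaphore that caps concurrent calls into one failure unit and returns a fallback when the unit is saturated or fails. `Bulkhead` is a hypothetical helper written for this illustration, not a library class, and the limit of 2 is an arbitrary assumption.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Illustrative sketch: Bulkhead is a hypothetical helper, not a library class.
final class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the action if a permit is free; otherwise returns the fallback immediately. */
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get();   // failure unit is saturated: do not queue up
        }
        try {
            return action.get();
        } catch (RuntimeException e) {
            return fallback.get();   // the failure stays inside this unit
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        Bulkhead paymentUnit = new Bulkhead(2);
        String answer = paymentUnit.call(() -> "payment accepted", () -> "payment queued for later");
        System.out.println(answer);
    }
}
```

A semaphore is only one isolation mechanism; separate thread pools or separate processes isolate more strongly at a higher cost.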
Redundancy
• Elaborate use case: Minimize MTTR / scale transactions / handle response errors / …
• Define routing & balancing strategy: Round robin / master-slave / fan-out & quickest one wins / …
• Consider admin involvement: Automatic vs. manual / notification – monitoring / …
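The "fan-out & quickest one wins" strategy mentioned above can be sketched with the standard `ExecutorService.invokeAny` method, which returns the result of one task that completed successfully. The class name `FanOutExample` and the simulated replicas are assumptions made for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: the replicas are simulated by simple lambdas.
final class FanOutExample {

    /** Sends the same request to all replicas and returns the first successful answer. */
    static String quickestWins(List<Callable<String>> replicas, long timeout, TimeUnit unit)
            throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(replicas.size());
        try {
            // invokeAny returns the result of one task that completed successfully;
            // it throws if no task succeeds before the timeout elapses.
            return executor.invokeAny(replicas, timeout, unit);
        } finally {
            executor.shutdownNow();  // interrupt replicas that are still running
        }
    }

    public static void main(String[] args) throws Exception {
        List<Callable<String>> replicas = Arrays.asList(
                () -> { Thread.sleep(500); return "slow replica"; },
                () -> "fast replica",
                () -> { throw new IllegalStateException("replica down"); });
        System.out.println(quickestWins(replicas, 1, TimeUnit.SECONDS));
    }
}
```

Fan-out trades extra load on every replica for lower latency and tolerance of single-replica failures, so it fits idempotent, read-style requests best.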
Loose Coupling
• Isolate failure units (complements bulkheads)
• Go asynchronous wherever possible
• Use timeouts & circuit breakers
• Make actions idempotent
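To illustrate "Make actions idempotent": when a caller retries after a timeout, the same request may arrive twice, and the second delivery must not change the state again. The sketch below deduplicates by a request ID; `IdempotentAccount` and the ID scheme are hypothetical and exist only for this example.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical example: an account that ignores duplicate deliveries of the same request.
final class IdempotentAccount {
    private final Set<String> processedRequestIds = ConcurrentHashMap.newKeySet();
    private final AtomicLong balance = new AtomicLong();

    /** Applies the deposit once per requestId; returns false for a duplicate. */
    boolean deposit(String requestId, long amount) {
        if (!processedRequestIds.add(requestId)) {
            return false;  // already handled: a retry must not change the balance again
        }
        balance.addAndGet(amount);
        return true;
    }

    long balance() {
        return balance.get();
    }

    public static void main(String[] args) {
        IdempotentAccount account = new IdempotentAccount();
        account.deposit("req-1", 100);
        account.deposit("req-1", 100);  // retried delivery of the same request
        System.out.println(account.balance());
    }
}
```

This in-memory version only shows the idea; a production system would have to persist the processed IDs together with the state change.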
Implementation Example #1
Timeouts
Timeouts (1)
    // Basics
    myObject.wait();        // Do not use this by default
    myObject.wait(TIMEOUT); // Better use this

    // Some more basics
    myThread.join();        // Do not use this by default
    myThread.join(TIMEOUT); // Better use this
Timeouts (2)
    // Using the Java concurrent library
    Callable<MyActionResult> myAction = <My Blocking Action>
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<MyActionResult> future = executor.submit(myAction);
    MyActionResult result = null;
    try {
        result = future.get();                  // Do not use this by default
        result = future.get(TIMEOUT, TIMEUNIT); // Better use this
    } catch (TimeoutException e) { // Only thrown if timeouts are used
        ...
    } catch (...) {
        ...
    }
Timeouts (3)
    // Using Guava SimpleTimeLimiter
    Callable<MyActionResult> myAction = <My Blocking Action>
    SimpleTimeLimiter limiter = new SimpleTimeLimiter();
    MyActionResult result = null;
    try {
        result = limiter.callWithTimeout(myAction, TIMEOUT, TIMEUNIT, false);
    } catch (UncheckedTimeoutException e) {
        ...
    } catch (...) {
        ...
    }
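The snippets above use placeholders, so here is a self-contained variant of the `Future.get(timeout, unit)` approach that can be run as is. The helper `callWithTimeout`, the class name `TimeoutExample`, and the choice to return a fallback value are assumptions for this sketch.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch: one worker thread per call keeps the example short.
final class TimeoutExample {

    /** Runs the action on a worker thread; returns the fallback if it times out or fails. */
    static <T> T callWithTimeout(Callable<T> action, long timeout, TimeUnit unit, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(action);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true);                 // interrupt the worker; the caller stops waiting
            return fallback;
        } catch (ExecutionException e) {
            return fallback;                     // the action itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt for the caller
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        String fast = callWithTimeout(() -> "done", 1, TimeUnit.SECONDS, "timed out");
        String slow = callWithTimeout(() -> { Thread.sleep(5_000); return "done"; },
                200, TimeUnit.MILLISECONDS, "timed out");
        System.out.println(fast + " / " + slow);
    }
}
```

Creating an executor per call is wasteful; a real service would more likely share a bounded pool, which also acts as a bulkhead.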
Implementation Example #2
Circuit Breaker
Circuit Breaker – concept
[Figure: Client, Circuit Breaker, Resource; labels: Request, Resource unavailable, Resource available. Lifecycle states: Closed, Open, Half-Open]
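The three lifecycle states can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not how any particular library implements it: `SimpleCircuitBreaker`, the consecutive-failure threshold, and the fixed wait time before a trial call are all hypothetical choices.

```java
import java.util.function.Supplier;

// Minimal sketch; SimpleCircuitBreaker is a hypothetical class, not a library API.
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;  // consecutive failures before opening
    private final long openMillis;       // how long to stay open before a trial call
    private State state = State.CLOSED;
    private int failures;
    private long openedAt;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized State state() {
        return state;
    }

    /** Calls the resource unless the breaker is open; returns the fallback on rejection or failure. */
    synchronized <T> T call(Supplier<T> resource, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt < openMillis) {
                return fallback.get();   // fail fast without touching the resource
            }
            state = State.HALF_OPEN;     // wait time elapsed: allow one trial call
        }
        try {
            T result = resource.get();
            failures = 0;
            state = State.CLOSED;        // resource available: resume normal operation
            return result;
        } catch (RuntimeException e) {
            failures++;
            if (state == State.HALF_OPEN || failures >= failureThreshold) {
                state = State.OPEN;      // resource unavailable: stop sending requests
                openedAt = System.currentTimeMillis();
            }
            return fallback.get();
        }
    }
}
```

For brevity the sketch holds its lock while calling the resource, which serializes all calls; a production breaker would track state without blocking callers.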
Implemented patterns
• Timeout
• Circuit breaker
• Load shedder
Supported patterns
• Bulkheads (a.k.a. Failure Units)
• Fail fast
• Fail silently
• Graceful degradation of service
• Failover
• Escalation
• Retry
• ...
Hello, world!
    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    public class HelloCommand extends HystrixCommand<String> {
        private static final String COMMAND_GROUP = "default";
        private final String name;

        public HelloCommand(String name) {
            super(HystrixCommandGroupKey.Factory.asKey(COMMAND_GROUP));
            this.name = name;
        }

        @Override
        protected String run() throws Exception {
            return "Hello, " + name;
        }
    }

    // In a JUnit test class:
    @Test
    public void shouldGreetWorld() {
        String result = new HelloCommand("World").execute();
        assertEquals("Hello, World", result);
    }
Source: https://github.com/Netflix/Hystrix/wiki/How-it-Works
Fallbacks
• What will you do if a request fails?
• Consider failure handling from the very beginning
• Supplement with general failure handling strategies
Scalability
• Define scaling strategy
• Think full stack
• Apply D-I-D rule
• Design for elasticity
… and many more
• Supervision patterns
• Recovery & mitigation patterns
• Anti-fragility patterns
• Supporting patterns
• A rich pattern family
Different approach than traditional enterprise software development
How do I integrate resilience into my software development process?
Steps to adopt resilient software design
1. Create awareness: Go DevOps
2. Create capability: Coach your developers
3. Create sustainability: Inject errors
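As a rough idea of what "Inject errors" in step 3 could look like at code level, the sketch below wraps a call so that it fails with a configurable probability. `FaultInjector`, its failure rate, and the seed parameter are hypothetical and only meant for experiments in this illustration; the talk does not prescribe a specific tool.

```java
import java.util.Random;
import java.util.function.Supplier;

// Hypothetical helper for experiments; not part of any library mentioned in the talk.
final class FaultInjector {
    private final Random random;
    private final double failureRate;  // 0.0 = never fail, 1.0 = always fail

    FaultInjector(double failureRate, long seed) {
        if (failureRate < 0.0 || failureRate > 1.0) {
            throw new IllegalArgumentException("failureRate must be in [0, 1]");
        }
        this.failureRate = failureRate;
        this.random = new Random(seed);  // fixed seed keeps experiments repeatable
    }

    /** Wraps a call so that it fails with the configured probability. */
    <T> Supplier<T> wrap(Supplier<T> delegate) {
        return () -> {
            if (random.nextDouble() < failureRate) {
                throw new IllegalStateException("injected failure");
            }
            return delegate.get();
        };
    }

    public static void main(String[] args) {
        Supplier<String> flaky = new FaultInjector(0.3, 42L).wrap(() -> "ok");
        int failures = 0;
        for (int i = 0; i < 100; i++) {
            try {
                flaky.get();
            } catch (IllegalStateException e) {
                failures++;
            }
        }
        System.out.println("injected failures: " + failures);
    }
}
```

Wrapping a dependency this way lets a team check that timeouts, circuit breakers, and fallbacks actually engage before a real outage forces the question.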
Related topics
Reactive
Anti-fragility
Fault-tolerant software design
Recovery-oriented computing
Wrap-up
• Resilience is about availability
• Crucial for today’s complex systems
• Not caring is a risk
• Go DevOps to create awareness
Do not avoid failures. Embrace them!