Fault Tolerance in a High Volume, Distributed System Ben Christensen Software Engineer – API Platform at Netflix @benjchristensen http://www.linkedin.com/in/benjchristensen 1
May 11, 2015
Dozens of dependencies.
One going down takes everything down.
99.99%^30 = 99.7% uptime
0.3% of 1 billion = 3,000,000 failures
2+ hours downtime/month
even if all dependencies have excellent uptime.
Reality is generally worse.
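The arithmetic on this slide can be checked with a few lines of Java. This is only a sketch: the class and method names are illustrative, and the 30-day month is an assumption used to convert the failure rate into hours.

```java
// Sketch: compound availability of a request that depends on N services.
// AvailabilityMath and compoundAvailability are illustrative names, not from the talk.
final class AvailabilityMath {

    /** Availability when every one of `dependencies` services must be up. */
    static double compoundAvailability(double perDependency, int dependencies) {
        return Math.pow(perDependency, dependencies);
    }

    public static void main(String[] args) {
        double uptime = compoundAvailability(0.9999, 30);     // 30 dependencies at 99.99% each
        double failureRate = 1.0 - uptime;                    // roughly 0.3%
        long failedRequests = Math.round(failureRate * 1_000_000_000L);
        double downtimeHoursPerMonth = failureRate * 30 * 24; // assumes a 30-day month

        System.out.printf("uptime: %.2f%%%n", uptime * 100);
        System.out.printf("failures per billion requests: %,d%n", failedRequests);
        System.out.printf("downtime per month: %.1f hours%n", downtimeHoursPerMonth);
    }
}
```

The slide rounds the failure rate to 0.3%, which is where the 3,000,000 figure comes from; the unrounded result is slightly lower.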
2
No single dependency should take down the entire app.
Fail fast. Fail silent. Fallback.
Shed load.
6
Options
Aggressive Network Timeouts
Semaphores (Tryable)
Separate Threads
Circuit Breaker
7
Semaphores (Tryable): Limited Concurrency
TryableSemaphore executionSemaphore = getExecutionSemaphore();
// acquire a permit
if (executionSemaphore.tryAcquire()) {
    try {
        return executeCommand();
    } finally {
        executionSemaphore.release();
    }
} else {
    circuitBreaker.markSemaphoreRejection();
    // permit not available so return fallback
    return getFallback();
}
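A self-contained version of the same pattern might look like the sketch below. It assumes the JDK's `java.util.concurrent.Semaphore` as a stand-in for the slide's `TryableSemaphore`; the `SemaphoreGuard` class and its constructor parameters are hypothetical, and the rejection metric from the slide is left out.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Sketch only: SemaphoreGuard is a hypothetical name; the JDK Semaphore
// stands in for the TryableSemaphore shown on the slide.
final class SemaphoreGuard<T> {
    private final Semaphore permits;
    private final Supplier<T> command;
    private final Supplier<T> fallback;

    SemaphoreGuard(int maxConcurrent, Supplier<T> command, Supplier<T> fallback) {
        this.permits = new Semaphore(maxConcurrent);
        this.command = command;
        this.fallback = fallback;
    }

    T execute() {
        // tryAcquire() never blocks: it returns false at once when no permit is free.
        if (permits.tryAcquire()) {
            try {
                return command.get();
            } finally {
                permits.release(); // always hand the permit back
            }
        }
        // No permit available: shed load by returning the fallback immediately.
        return fallback.get();
    }
}
```

Because the caller's own thread runs the command, this limits concurrency but cannot time out a call that hangs, which fits the later point that semaphores are for "trusted" clients.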
10
Separate Threads: Limited Concurrency
try {
    if (!threadPool.isQueueSpaceAvailable()) {
        // we are at the property defined max so want to throw the RejectedExecutionException to simulate
        // reaching the real max and go through the same codepath and behavior
        throw new RejectedExecutionException("Rejected command because thread-pool queueSize is at rejection threshold.");
    }
    ... define Callable that performs executeCommand() ...
    // submit the work to the thread-pool
    return threadPool.submit(command);
} catch (RejectedExecutionException e) {
    circuitBreaker.markThreadPoolRejection();
    // rejected so return fallback
    return getFallback();
}
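The rejection path can be reproduced with a plain JDK `ThreadPoolExecutor` and a bounded queue. The sketch below is an assumption about how such a guard could be built, not the Netflix implementation: `ThreadPoolGuard` is a hypothetical name, and it relies on the executor's default `AbortPolicy` instead of the slide's explicit `isQueueSpaceAvailable()` check.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch only: ThreadPoolGuard is a hypothetical name, one instance per dependency.
final class ThreadPoolGuard {
    private final ThreadPoolExecutor pool;

    ThreadPoolGuard(int threads, int queueSize) {
        // Fixed-size pool with a bounded queue. With the default AbortPolicy,
        // submit() throws RejectedExecutionException once threads and queue are full.
        this.pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize));
    }

    <T> Future<T> submit(Callable<T> command, T fallback) {
        try {
            return pool.submit(command);
        } catch (RejectedExecutionException e) {
            // Pool saturated: answer immediately with the fallback instead of queueing.
            return CompletableFuture.completedFuture(fallback);
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

With one thread and a queue of one, a third concurrent submission is rejected and resolves to the fallback while the first two are still pending.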
14
Separate Threads: Timeout
public K get() throws CancellationException, InterruptedException, ExecutionException {
    try {
        long timeout = getCircuitBreaker().getCommandTimeoutInMilliseconds();
        return get(timeout, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
        // report timeout failure
        circuitBreaker.markTimeout(System.currentTimeMillis() - startTime);
        // retrieve the fallback
        return getFallback();
    }
}
Override of Future.get()
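The same "give up and move on" behavior can be sketched as a small helper around any `Future`. The `TimedCall` class and `getOrFallback` method are hypothetical names; cancelling the future on timeout is an addition that the slide does not show.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch only: TimedCall and getOrFallback are illustrative names.
final class TimedCall {

    /** Waits at most timeoutMillis for the result, then returns the fallback. */
    static <T> T getOrFallback(Future<T> future, long timeoutMillis, T fallback)
            throws InterruptedException, ExecutionException {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Assumption: interrupt the worker so a hung call does not hold its thread.
            future.cancel(true);
            return fallback;
        }
    }
}
```

The caller is released after the timeout even if the worker thread stays blocked on the network, which is what running the command on a separate thread makes possible.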
17
Circuit Breaker
if (circuitBreaker.allowRequest()) {
    return executeCommand();
} else {
    // short-circuit and go directly to fallback
    circuitBreaker.markShortCircuited();
    return getFallback();
}
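The slide shows only the call site, not how `allowRequest()` decides. The sketch below fills that in under stated assumptions: it trips after a number of consecutive failures and allows trial requests once a sleep window has passed. That trip policy, the `SimpleCircuitBreaker` name, and the injectable clock are all illustrative and are not taken from the talk.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Sketch only: the consecutive-failure trip policy is an assumption for illustration.
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long sleepWindowMillis;
    private final LongSupplier clock;   // injectable so the behavior can be tested
    private int consecutiveFailures = 0;
    private long openedAt = -1;         // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    synchronized boolean allowRequest() {
        if (openedAt < 0) {
            return true; // closed: let everything through
        }
        // Open: allow trial requests only after the sleep window has elapsed.
        return clock.getAsLong() - openedAt >= sleepWindowMillis;
    }

    synchronized void markSuccess() {
        consecutiveFailures = 0;
        openedAt = -1; // close the circuit again
    }

    synchronized void markFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = clock.getAsLong(); // trip, or re-trip after a failed trial
        }
    }

    <T> T execute(Supplier<T> command, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get(); // short-circuit and go directly to fallback
        }
        try {
            T result = command.get();
            markSuccess();
            return result;
        } catch (RuntimeException e) {
            markFailure();
            return fallback.get();
        }
    }
}
```

While the circuit is open the dependency receives no traffic at all, which gives it room to recover and keeps callers from waiting on calls that are likely to fail.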
21
Netflix uses all 4 in combination
24
Tryable semaphores for “trusted” clients and fallbacks
Separate threads for “untrusted” clients
Aggressive timeouts on threads and network calls to “give up and move on”
Circuit breakers as the “release valve”
26
Benefits of Separate Threads
Protection from client libraries
Lower risk to accept new/updated clients
Quick recovery from failure
Client misconfiguration
Client service performance characteristic changes
Built-in concurrency
30
Drawbacks of Separate Threads
Some computational overhead
Load on machine can be pushed too far
...
Benefits outweigh drawbacks when clients are “untrusted”
31
Visualizing Circuits in Realtime (generally sub-second latency)
Video available at https://vimeo.com/33576628
33
Rolling 10 second counter – 1 second granularity
Median Mean 90th 99th 99.5th
Latent Error Timeout Rejected
Error Percentage = (error + timeout + rejected) / (success + latent success + error + timeout + rejected)
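A rolling 10-second counter with 1-second buckets, and the error percentage derived from it, could be sketched as follows. The `RollingStats` class, its `Event` enum, and the choice to pass the current time in explicitly are assumptions made so the example is small and testable; they do not describe the actual Netflix code.

```java
import java.util.Arrays;

// Sketch only: a 10-bucket ring of per-second counters. All names are illustrative.
final class RollingStats {
    enum Event { SUCCESS, LATENT_SUCCESS, ERROR, TIMEOUT, REJECTED }

    private static final int BUCKETS = 10;           // 10-second window
    private static final long BUCKET_MILLIS = 1000L; // 1-second granularity

    private final long[][] counts = new long[BUCKETS][Event.values().length];
    private final long[] bucketStart = new long[BUCKETS]; // second each slot currently holds

    RollingStats() {
        Arrays.fill(bucketStart, -1L); // -1 marks a slot that has never been used
    }

    synchronized void record(Event event, long nowMillis) {
        long start = nowMillis - (nowMillis % BUCKET_MILLIS);
        int slot = (int) ((nowMillis / BUCKET_MILLIS) % BUCKETS);
        if (bucketStart[slot] != start) {
            // The slot still holds an older second: reset it before reuse.
            Arrays.fill(counts[slot], 0L);
            bucketStart[slot] = start;
        }
        counts[slot][event.ordinal()]++;
    }

    synchronized long sum(Event event, long nowMillis) {
        long total = 0;
        for (int i = 0; i < BUCKETS; i++) {
            boolean live = bucketStart[i] >= 0
                    && nowMillis - bucketStart[i] < BUCKETS * BUCKET_MILLIS;
            if (live) {
                total += counts[i][event.ordinal()];
            }
        }
        return total;
    }

    /** (error + timeout + rejected) / (all outcomes), as a percentage of the live window. */
    synchronized double errorPercentage(long nowMillis) {
        long failed = sum(Event.ERROR, nowMillis)
                + sum(Event.TIMEOUT, nowMillis)
                + sum(Event.REJECTED, nowMillis);
        long total = failed
                + sum(Event.SUCCESS, nowMillis)
                + sum(Event.LATENT_SUCCESS, nowMillis);
        return total == 0 ? 0.0 : 100.0 * failed / total;
    }
}
```

Old buckets drop out of the sum as time advances, so the percentage reflects only the last ten seconds and recovers quickly once a dependency is healthy again.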
34
Netflix DependencyCommand Implementation
35
Netflix DependencyCommand Implementation
Fallbacks
Cache
Eventual Consistency
Stubbed Data
Empty Response
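One way these fallback styles could fit together is sketched below: serve the last known good value from a cache (possibly stale, hence eventually consistent), and otherwise return stubbed data, where an empty stub gives the "empty response" case. The `FallbackingClient` class and its parameters are hypothetical and only illustrate the idea.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch only: FallbackingClient is a hypothetical wrapper around a remote call.
final class FallbackingClient {
    private final Function<String, List<String>> remoteCall;
    private final List<String> stub; // pass an empty list for the "empty response" style
    private final Map<String, List<String>> lastKnownGood = new ConcurrentHashMap<>();

    FallbackingClient(Function<String, List<String>> remoteCall, List<String> stub) {
        this.remoteCall = remoteCall;
        this.stub = stub;
    }

    List<String> fetch(String key) {
        try {
            List<String> fresh = remoteCall.apply(key);
            lastKnownGood.put(key, fresh); // remember it for later failures
            return fresh;
        } catch (RuntimeException e) {
            // Cache fallback: possibly stale, but better than an error.
            List<String> cached = lastKnownGood.get(key);
            if (cached != null) {
                return cached;
            }
            // Nothing cached: fail silently with stubbed (or empty) data.
            return stub;
        }
    }
}
```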
41
Rolling Number
Realtime Stats and Decision Making
44
Request Collapsing
Take advantage of resiliency to improve efficiency
45
Fail fast. Fail silent. Fallback.
Shed load.
48
Questions & More Information
Fault Tolerance in a High Volume, Distributed System
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
Making the Netflix API More Resilient
http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
Ben Christensen
@benjchristensen
http://www.linkedin.com/in/benjchristensen
49