© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Chris Sanden, Netflix Roy Rapoport, Netflix October 2015 BDT207 Real-Time Analytics In Service of Self-Healing Ecosystems

(BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Apr 16, 2017

Transcript
Page 1: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Chris Sanden, Netflix

Roy Rapoport, Netflix

October 2015

BDT207

Real-Time Analytics In Service of Self-Healing Ecosystems

Page 2: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

@chris_sanden

Chris & Roy

@royrapoport

Page 3: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Prerequisites

Page 4: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Expectations

[Diagram: (Reasonable) Telemetry System, Real-Time Analytics System(s), and Orchestration Systems, connected by flows labeled Data, Decision, and Observation]

Page 5: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Bad News: An Evolution

Page 6: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Bad News: An Evolution

Page 7: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Bad News: An Evolution

Page 8: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Not Bad

• Absolutely necessary

• Pretty useful

• Insufficient

Page 9: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Scale At Scale

We’ve Got

1,982,562,395

Problems

And Boredom Ain’t One

Page 10: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Scale At Scale

Complexity in a

Few* Dimensions

* For sufficiently large values of “few”

Page 11: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Scale At Scale

421,010

Page 12: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Scale At Scale

Telemetry Volume

is silly

Page 13: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Scale At Scale

2,000,000,000

is silly

Page 14: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Scale At Scale

14”

Page 15: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Scale At Scale

Page 16: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Scale At Scale

Page 17: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Scale At Scale

Page 18: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

MMO: Most Memorable Outage

• One device (out of ~10^3)

• One test cell (out of ~10^1)

• One test (out of ~10^4)

• Couldn’t view House of Cards S3E1

• For a week

Page 19: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Scale At Scale

We have weird, device-specific problems all the time, and interactions with A/B tests only make them more complicated, so I'm not sure we have a pat moral of the story except that we really like alerting and fast responses.

- Matt McCarthy

Page 20: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Bad News About the Cloud

• Infrastructure no longer the bottleneck

• Before: Weeks to change infrastructure

• After: API call

• TTD expectations vastly higher

• AWS makes us the lameness bottleneck

Page 21: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Good News About the Cloud

• Infrastructure no longer the bottleneck

• Before: Weeks to change infrastructure

• After: API call

• Rapid recovery, automated response possible

• AWS: Enabling productive laziness for 9 years and counting

Page 22: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Don’t Forget to Bring a Towel!

Monitoring Capabilities You’ll Find Useful

• Time series

• Event Streaming

• Dependency Discovery and Inspection

Page 23: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Real-Time Analytics

1. Prediction

2. Detection

3. Correlation

Page 24: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

1. Prediction

Page 25: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

1.1 Predictive Scaling

Page 26: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Predictive Scaling

Auto Scaling is reactive.

• SCALE UP by 10%

• WHEN Requests Per Second > 120

• FOR 10 consecutive minutes

• FOLLOWED-BY a cool-down of 15 minutes
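The rule above can be read as a small state machine. The following is a minimal sketch, assuming one RPS sample per minute; the class name, its parameters, and the choice to ignore breaches during the cool-down are illustrative assumptions, not the AWS Auto Scaling API.

```python
class ReactiveScaleUpRule:
    """Hypothetical evaluator for the slide's rule: scale up by 10% when
    RPS > 120 for 10 consecutive minutes, then cool down for 15 minutes."""

    def __init__(self, threshold=120, breach_minutes=10,
                 cooldown_minutes=15, step_percent=10):
        self.threshold = threshold
        self.breach_minutes = breach_minutes
        self.cooldown_minutes = cooldown_minutes
        self.step_percent = step_percent
        self.breaches = 0        # consecutive minutes above the threshold
        self.cooldown_left = 0   # minutes remaining in the cool-down

    def observe(self, rps, capacity):
        """Feed one per-minute RPS sample; return the (possibly new) capacity."""
        if self.cooldown_left > 0:
            # Assumption: samples seen during the cool-down are ignored.
            self.cooldown_left -= 1
            return capacity
        self.breaches = self.breaches + 1 if rps > self.threshold else 0
        if self.breaches >= self.breach_minutes:
            self.breaches = 0
            self.cooldown_left = self.cooldown_minutes
            # Integer ceiling of capacity * step_percent / 100.
            return capacity + (capacity * self.step_percent + 99) // 100
        return capacity


rule = ReactiveScaleUpRule()
capacity = 100
history = []
for minute in range(30):
    capacity = rule.observe(rps=150, capacity=capacity)
    history.append(capacity)
```

With a sustained 150 RPS, the capacity stays at 100 for the first nine samples and only changes on the tenth, which is the lag that makes a purely reactive rule slow to respond to rapid spikes.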

Page 27: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Predictive Scaling

Advanced Use Cases

• Rapid, recurring spikes in demand

• Variable traffic patterns

• Outages

Page 28: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Predictive Scaling

Concept

• Anticipate change in traffic and workload.

• Predict the resources needed ahead of time.

• Proactively scale up or down.

Page 29: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Predictive Scaling

Metric Selection

• Clear, relatively stable, and recurring pattern.

• Independent of cluster performance.

Page 30: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Predictive Scaling

Requests Per Second (RPS)

Page 31: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Predictive Scaling

Fast Fourier Transform (FFT)

Page 32: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Predictive Scaling

FFT-based Prediction
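The slides do not spell out how the FFT is turned into a forecast. One common reading, sketched here as an assumption, is to keep only the strongest frequency components of a recurring metric and extend them forward in time. The function names are hypothetical, and a naive O(n^2) discrete Fourier transform stands in for an FFT so the sketch needs only the standard library.

```python
import cmath
import math


def dft(values):
    """Naive discrete Fourier transform; an FFT computes the same result faster."""
    n = len(values)
    return [sum(values[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def predict(history, horizon, keep=3):
    """Extrapolate `history` by `horizon` steps from its `keep` strongest
    frequency components. The forecast repeats with period len(history)."""
    n = len(history)
    spectrum = dft(history)
    # For real input, bins above n // 2 mirror the lower ones, so rank only
    # bins 0..n//2 by magnitude.
    strongest = sorted(range(n // 2 + 1),
                       key=lambda k: abs(spectrum[k]), reverse=True)[:keep]
    forecast = []
    for t in range(n, n + horizon):
        value = 0.0
        for k in strongest:
            term = (spectrum[k] * cmath.exp(2j * math.pi * k * t / n)).real / n
            # Every bin except DC (and Nyquist when n is even) has a mirror image.
            mirrored = k != 0 and not (n % 2 == 0 and k == n // 2)
            value += 2 * term if mirrored else term
        forecast.append(value)
    return forecast


# Synthetic example: two days of hourly RPS with a clean daily cycle.
history = [100 + 50 * math.sin(2 * math.pi * t / 24) for t in range(48)]
forecast = predict(history, horizon=24)
```

Dropping the weak components acts as a smoother, which is why the metric selection slide asks for a clear, stable, recurring pattern: a metric without one has no dominant frequencies to keep.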

Page 33: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Prediction

Predictive Scaling

Action Plan

Page 34: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Predictive Scaling

Prediction Workflow

Metric → FFT → Prediction → Action Plan → Scale

Page 35: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Predictive Scaling

Predictive-reactive Auto Scaling

• A hybrid approach

• Predict the workload of a cluster in advance and proactively scale.

• Use auto scaling to handle unexpected surges in workload.
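A minimal sketch of the hybrid idea, assuming capacity is sized from requests per second with a fixed per-server throughput and a headroom percentage. All names and numbers are hypothetical; the slides do not describe how the two signals are combined, so taking the larger of the two is an assumption.

```python
def servers_needed(rps, rps_per_server, headroom_percent=20):
    """Servers required for `rps` plus headroom, rounded up (integer math)."""
    padded = rps * (100 + headroom_percent)
    return -(-padded // (100 * rps_per_server))  # ceiling division


def hybrid_capacity(predicted_rps, observed_rps, rps_per_server):
    """Scale proactively to the prediction, but never below what the
    currently observed load requires (the reactive safety net)."""
    proactive = servers_needed(predicted_rps, rps_per_server)
    reactive = servers_needed(observed_rps, rps_per_server)
    return max(proactive, reactive)


normal_day = hybrid_capacity(predicted_rps=1000, observed_rps=800, rps_per_server=100)
surprise_surge = hybrid_capacity(predicted_rps=1000, observed_rps=1500, rps_per_server=100)
```

On the normal day the prediction drives capacity; during the unexpected surge the observed load takes over.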

Page 36: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

2. Detection

Page 37: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

2.1 Anomaly Detection

Page 38: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

a·nom·a·ly de·tec·tion [uh-nom-uh-lee] [dih-tek-shuhn]

1. identification of observations which do not conform to an expected pattern.

2. a task that keeps data scientists up at night.

Page 39: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Anomaly Detection

Anomaly Types

“Blips”

“Bloops”

Page 40: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Anomaly Detection

Static Threshold

Page 41: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Anomaly Detection Static Threshold

Page 42: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Anomaly Detection

Prediction Algorithms

• FFT-based Prediction

• Double Exponential Smoothing (DES)

• Holt-Winters

• ARIMA

• Etc.

Page 43: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Anomaly Detection

Detection Workflow

Metric → Prediction → Residual → Threshold

Page 44: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Anomaly Detection

Double Exponential Smoothing

Page 45: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Anomaly Detection

Double Exponential Smoothing

Page 46: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Anomaly Detection

Double Exponential Smoothing
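A minimal sketch of double exponential smoothing (Holt's linear method) used as the prediction step of the detection workflow: each point is forecast one step ahead from a smoothed level and trend, and the residual is the actual value minus that forecast. The function name, the initialization from the first two points, and the smoothing factors are illustrative assumptions.

```python
def des_residuals(series, alpha=0.5, beta=0.5):
    """One-step-ahead residuals from double exponential smoothing.

    residuals[i] is series[i + 1] minus its forecast. The first residual is
    always 0 because the trend is initialized from the first two points.
    """
    level = series[0]
    trend = series[1] - series[0]
    residuals = []
    for actual in series[1:]:
        predicted = level + trend            # forecast made before seeing `actual`
        residuals.append(actual - predicted)
        previous_level = level
        level = alpha * actual + (1 - alpha) * predicted
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return residuals


# A steady upward trend with one spike at the fourth point.
series = [10.0, 12.0, 14.0, 30.0, 18.0]
residuals = des_residuals(series)
```

The spike produces a large positive residual, and because the smoother partly follows it, the next point shows a large negative residual as well. That echo is one reason a threshold on residuals needs care.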

Page 47: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Anomaly Detection

Statistical Techniques

• Three-sigma (3-sigma)

• Kolmogorov-Smirnov (KS)

• Interquartile Range (IQR)

• Grubbs Test

• Least Squares

• Etc.

Page 48: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Anomaly Detection

Detection Workflow

Metric → Prediction → Residual → Threshold

Page 49: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Anomaly Detection

Detection Workflow

Metric → Prediction → Residual → 3-sigma
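The 3-sigma step of this workflow can be sketched as follows, assuming residuals (actual minus predicted) arrive as a list and are judged against a trailing window. The function name and the window length are hypothetical choices.

```python
import statistics


def three_sigma_anomalies(residuals, window=30):
    """Indices whose residual lies more than three standard deviations
    from the mean of the preceding `window` residuals."""
    flagged = []
    for i in range(window, len(residuals)):
        recent = residuals[i - window:i]
        mean = statistics.mean(recent)
        sigma = statistics.pstdev(recent)
        if abs(residuals[i] - mean) > 3 * sigma:
            flagged.append(i)
    return flagged


# Thirty residuals alternating between +1 and -1, then a small and a large one.
residuals = [1.0 if i % 2 == 0 else -1.0 for i in range(30)] + [0.5, 10.0]
flagged = three_sigma_anomalies(residuals)
```

Only the last residual is flagged; the 0.5 before it sits well inside the band.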

Page 50: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Anomaly Detection

Detection Workflow - Ensemble Approach

Metric → Prediction → Residual → (3-sigma, IQR, KS) → Combine Votes
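A minimal sketch of the ensemble idea, assuming each technique compares a recent window of residuals against a reference window and casts one vote, with a simple majority deciding. The function names, the window framing, and the majority rule are assumptions; the slides say only that the votes are combined. The KS vote uses the standard large-sample critical value for the two-sample test.

```python
import math
import statistics


def vote_three_sigma(reference, recent):
    mean = statistics.mean(reference)
    sigma = statistics.pstdev(reference)
    return any(abs(x - mean) > 3 * sigma for x in recent)


def vote_iqr(reference, recent, k=1.5):
    q1, _, q3 = statistics.quantiles(reference, n=4)
    spread = q3 - q1
    return any(x < q1 - k * spread or x > q3 + k * spread for x in recent)


def vote_ks(reference, recent, alpha=0.05):
    """Two-sample Kolmogorov-Smirnov test (large-sample approximation)."""
    def ecdf(sample, x):
        return sum(1 for v in sample if v <= x) / len(sample)

    # The largest gap between the two empirical CDFs occurs at a sample point.
    d = max(abs(ecdf(reference, x) - ecdf(recent, x)) for x in reference + recent)
    n, m = len(reference), len(recent)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return d > critical


def is_anomalous(reference, recent, min_votes=2):
    votes = [vote_three_sigma(reference, recent),
             vote_iqr(reference, recent),
             vote_ks(reference, recent)]
    return sum(votes) >= min_votes


reference = [float(i % 5) - 2.0 for i in range(50)]   # values cycling through -2..2
recent_ok = [-1.0, 0.0, 1.0, 0.0, 2.0, -2.0, 1.0, -1.0, 0.0, 0.0]
recent_bad = [9.0, 10.0, 11.0, 10.0, 9.5, 10.5, 10.0, 9.0, 11.0, 10.0]
```

Requiring agreement between techniques trades some sensitivity for fewer false alarms from any single test.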

Page 51: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Anomaly Detection

Advanced Detection Techniques

• Robust Anomaly Detection (RAD) - Netflix

• Seasonal Hybrid ESD - Twitter

• Extendible Generic Anomaly Detection System (EGADS) - Yahoo

• Kale - Etsy

Page 52: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

2.2 Outlier Detection

Page 53: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

out·li·er de·tec·tion [out-lahy-er] [dih-tek-shuhn]

1. identification of unusual members from a set of generating mechanisms.

2. not to be confused with anomaly detection.

Page 54: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

[Figure: Outlier Detection and Anomaly Detection shown against two axes, Time and Population]

Page 55: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Server Outlier Detection

Netflix runs on thousands of servers

• A small percentage of servers become unhealthy.

• Customer experience may be degraded.

• Time wasted looking for evidence.

Page 56: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Server Outlier Detection

Page 57: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Server Outlier Detection

Page 58: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Server Outlier Detection

Cluster Analysis

• Unsupervised machine learning.

• If a server belongs to a group it should be near lots of other points as measured by some distance function.

Assumption

• Servers running the same hardware and software should behave similarly.

Page 59: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Server Outlier Detection

DBSCAN - Density-Based Spatial Clustering of Applications with Noise

Page 60: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Server Outlier Detection

Detection Workflow

Metric → DBSCAN → Filter → Action
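A minimal sketch of the DBSCAN step, assuming each server is summarized by a small metric vector and that servers labeled as noise are the outlier candidates passed on to the filter. This is a plain, unoptimized DBSCAN written for illustration; the instance names, metric values, `eps`, and `min_pts` are hypothetical, and in practice the metrics would likely need normalizing before distances are meaningful.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: one cluster label per point, NOISE for outliers."""
    labels = [None] * len(points)

    def neighbors(i):
        # A point counts as its own neighbor, as in the usual formulation.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels


# Hypothetical (error rate %, latency ms) per server.
servers = {
    "i-a": (1.0, 100.0), "i-b": (1.1, 102.0), "i-c": (0.9, 99.0),
    "i-d": (1.0, 101.0), "i-e": (1.2, 100.0), "i-f": (6.0, 180.0),
}
labels = dbscan(list(servers.values()), eps=5.0, min_pts=3)
outliers = [name for name, label in zip(servers, labels) if label == NOISE]
```

The five similar servers form one dense cluster, and the one with elevated errors and latency is left as noise, matching the assumption that servers running the same hardware and software should behave similarly.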

Page 61: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Server Outlier Detection

Actions / Remediation

• Send e-mail

• Page service owner

• Terminate instance

• Remove from service

• Detach from a load balancer

Page 62: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

2.3 Automated Canary Analysis

Page 63: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Automated Canary Analysis

Canary Release Process

• A change is gradually rolled out to production.

• Checkpoints are performed along the way.

• A decision is made at each checkpoint.

Page 64: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Automated Canary Analysis

Advantages

• Better degree of trust and safety in deployments.

• Faster deployment cadence.

• Lower investment in simulation engineering.

Page 65: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Automated Canary Analysis

Canary Process

[Diagram: Traffic enters a Load Balancer. Current Version (v1.0): 100 Servers, 95% of traffic. New Version (v1.1): 5 Servers, 5% of traffic. Metrics.]

Page 66: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Automated Canary Analysis

Canary Process

[Diagram: Traffic enters a Load Balancer. Current Version (v1.0): 0 Servers. New Version (v1.1): 100 Servers, 100% of traffic. Metrics.]

Page 67: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Automated Canary Analysis

Page 68: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Automated Canary Analysis

Automated Analysis

• Identify a set of metrics to compare.

• Use a statistical test to identify the difference between v1.0 and v1.1.

• Mann–Whitney

• Kolmogorov-Smirnov

• Generate a score that indicates overall similarity.

• Percentage of metrics that match in performance.
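The steps above can be sketched as follows, assuming each metric is a list of samples from the baseline and from the canary, and that a metric "matches" when a Mann–Whitney U test finds no significant difference. The function names, the metric names, and the 0.05 significance level are hypothetical. The p-value uses the normal approximation without tie or continuity corrections, which is a simplification.

```python
import math


def mann_whitney_p(a, b):
    """Two-sided Mann-Whitney U p-value via the normal approximation."""
    combined = sorted((value, group) for group, sample in enumerate((a, b))
                      for value in sample)
    rank_sum_a = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2          # average of ranks i+1 .. j for ties
        rank_sum_a += midrank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    n, m = len(a), len(b)
    u = rank_sum_a - n * (n + 1) / 2
    z = (u - n * m / 2) / math.sqrt(n * m * (n + m + 1) / 12)
    return math.erfc(abs(z) / math.sqrt(2))


def canary_score(baseline, canary, alpha=0.05):
    """Percentage of metrics with no significant baseline/canary difference."""
    matching = sum(1 for name in baseline
                   if mann_whitney_p(baseline[name], canary[name]) >= alpha)
    return 100.0 * matching / len(baseline)


baseline = {
    "latency_ms": [100, 102, 98, 101, 99, 103, 97, 100.5],
    "error_rate": [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3, 1.05],
}
canary = {
    "latency_ms": [101.2, 99.1, 100.1, 102.5, 98.5, 101.5, 99.5, 100.2],
    "error_rate": [5.0, 5.5, 4.8, 6.0, 5.2, 5.1, 4.9, 5.3],
}
score = canary_score(baseline, canary)
```

Here latency matches and the error rate does not, so the score is 50%. A deployment decision would then compare the score with a pass threshold chosen by the service owner.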

Page 69: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Automated Canary Analysis

Analysis Workflow

Metrics → Statistical Test → Calculate Score → Decision

Page 70: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Automated Canary Analysis

Augmented Canary Process

[Diagram: Traffic enters a Load Balancer. Previous Version (v1.0): 88 Servers. New Version (Canary - v1.1): 6 Servers. Previous Version (Control - v1.0): 6 Servers. Analysis. Metrics.]

Page 71: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Automated Canary Analysis

Page 72: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

3. Correlation

Page 73: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Correlation Analysis

Automated Finger-Pointing for Fun and Profit

You Want Service-Oriented Architecture?

We’ve got Service-Oriented Architecture

Page 74: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Correlation Analysis

Automated Finger-Pointing for Fun and Profit

[Diagram: graph of nodes A, B, C, D, E, F, G, H, I, J]

Page 75: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Correlation Analysis

Automated Finger-Pointing for Fun and Profit

[Diagram: graph of nodes A, B, C, D, E, F, G, H, I, J]

Page 76: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Correlation Analysis

Something Else Is Also Weird!

• CPU up: Alert triggered

• HTTP 400: Correlated spike

• HTTP requests: Correlated drop
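A minimal sketch of how an alerting metric could be compared with other metrics over the same time window, assuming the Pearson correlation coefficient as the measure. The metric names and values are made up for illustration, and a constant series would need special handling because its standard deviation is zero.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)  # fails on a constant series


def rank_correlated(alerting, candidates):
    """Candidate metrics sorted by strength of correlation with the alert."""
    scored = [(name, pearson(alerting, series)) for name, series in candidates.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


cpu = [20, 22, 21, 60, 85, 90]                      # the metric that alerted
candidates = {
    "http_400": [5, 6, 5, 40, 70, 75],              # spikes with CPU
    "http_requests": [1000, 990, 1005, 600, 300, 250],  # drops as CPU rises
    "disk_free_pct": [50, 49, 51, 50, 49, 51],      # unrelated
}
ranked = rank_correlated(cpu, candidates)
```

Ranking by absolute value surfaces both the correlated spike and the correlated drop, while the unrelated metric falls to the bottom of the list.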

Page 77: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Correlation Analysis

If you care about this, you don’t care about that …

I Care About

This Metric!

I Also Care About

This Metric!

Maybe not?

Page 78: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Conclusion

Page 79: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

In Conclusion

Magic!

Not Magic!

DES

IQR

FFT

DBSCAN

RAD/RPCA

Page 80: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

In Conclusion

[Diagram: (Reasonable) Telemetry System, Real-Time Analytics System(s), and Orchestration Systems, connected by flows labeled Data, Decision, and Observation]

Page 81: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

In Conclusion

Wanna play?

Page 82: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Useful Links

• Prediction

• Predictive Auto Scaling: http://techblog.netflix.com/2013/12/scryer-netflixs-predictive-auto-scaling.htm

• FFT: https://en.wikipedia.org/wiki/Fast_Fourier_transform

• Detection

• Double Exponential Smoothing: https://en.wikipedia.org/wiki/Exponential_smoothing

• Interquartile Range (IQR): https://en.wikipedia.org/wiki/Interquartile_range

• Ensemble Learning: http://www.scholarpedia.org/article/Ensemble_learning

• Robust Anomaly Detection (RAD): http://techblog.netflix.com/2015/02/rad-outlier-detection-on-big-data.html

• DBSCAN: https://en.wikipedia.org/wiki/DBSCAN

• Server Outlier Detection: http://techblog.netflix.com/2015/07/tracking-down-villains-outlier.html

• Canary Release Process: http://martinfowler.com/bliki/CanaryRelease.html

• Automated Canary Analysis: http://www.infoq.com/presentations/canary-analysis-deployment-pattern

• Nonparametric tests: https://en.wikipedia.org/wiki/Nonparametric_statistics

• Correlation

• Pearson Correlation: https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

Page 83: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Attributions

• http://aggronaut.com

• http://designsold.com/pictures-of-kittens/

• http://slate.com, Illustration by Phil Plait

• http://www-rohan.sdsu.edu/

• http://scikit-learn.org/stable/documentation.html

Page 84: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Remember to complete your evaluations!

Page 85: (BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems

Thank you!

@chris_sanden

@royrapoport