Beyond Pretty Charts Analytics for the Cloud Infrastructure Velocity Europe 2013 Toufic Boubez, Ph.D. Co-Founder, CTO Metafor Software toufi[email protected] @tboubez
Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Jan 26, 2015

Technology

tboubez

My presentation from Velocity Europe 2013 in London: Beyond Pretty Charts: Analytics for the cloud infrastructure.

IT Ops collect tons of data on the status of their data center or cloud environment. Much of that data ends up as graphs on big screens so ops folks can keep an eye on the behavior of their systems. But unless a threshold is crossed, behavioral issues will often fall through the cracks. Thresholds are reactive, and humans are, well, human. Applying analytics and machine learning to detect anomalies in dynamic infrastructure environments can catch these behavioral changes before they become critical.

Current tools used to monitor web environments rely on fundamental assumptions that are no longer true, such as assuming that the underlying system being monitored is relatively static or that the behavioral limits of these systems can be defined by static rules and thresholds. Thus, interest in applying analytics and machine learning to predict and detect anomalies in these dynamic environments is gaining steam. However, understanding which algorithms should be used to identify and predict anomalies accurately within all the data we generate is not so easy.

This talk will begin with a brief definition of the types of anomalies commonly found in dynamic data center environments and then discuss some of the key elements to consider when thinking about anomaly detection such as:

• Understanding your data’s characteristics
• The two main approaches for analyzing operations data: parametric and non-parametric methods
• Simple data transformations that can give you powerful results

By the end of this talk, attendees will understand the pros and cons of the key statistical analysis techniques and walk away with examples as well as practical rules of thumb and usage patterns.
Transcript
Page 1: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Beyond Pretty Charts: Analytics for the Cloud Infrastructure

Velocity Europe 2013

Toufic Boubez, Ph.D. Co-Founder, CTO Metafor Software [email protected] @tboubez

Page 2: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Toufic intro – who I am

• Co-Founder/CTO Metafor Software
• Co-Founder/CTO Layer 7 Technologies
– Acquired by Computer Associates in 2013
– I escaped
• Co-Founder/CTO Saffron Technology
• IBM Chief Architect for SOA
• Co-Author, Co-Editor: WS-Trust, WS-SecureConversation, WS-Federation, WS-Policy
• Building large scale software systems for 20 years (I’m older than I look, I know!)

Page 3: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Genesis of this talk

• Evolving from various conference presentations
– Blog: http://www.metaforsoftware.com/category/anomaly-detection-101/
– Many briefly mentioned issues, never explored
– Needed more details and examples
• Note: real data
• Note: no y-axis labels on charts – on purpose!!
• Note to self: remember to SLOW DOWN!
• Note to self: mention the cats!! Everybody loves cats!!

Page 4: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Wall of Charts™

Page 5: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

The WoC side-effects: alert fatigue

“Alert fatigue is the single biggest problem we have right now … We need to be more intelligent about our alerts or we’ll all go insane.”

- John Vincent (@lusis)

(#monitoringsucks)

Page 6: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

The fallacy of thresholds

• So what if my unicorn usage is at 89-91%, and has been stable?
• I’d much rather know if it’s at 60% and has been rapidly increasing

• Static thresholds and rules won’t help you in this case
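
A minimal Python sketch of that contrast. The 95% limit, the slope limit of 2 points per sample, and both sample series are made-up values for illustration, not figures from the talk; the function names are hypothetical.

```python
def static_threshold_alert(values, limit=95.0):
    """Fire only when the latest value crosses a fixed limit (assumed 95%)."""
    return values[-1] >= limit


def trend_alert(values, max_slope=2.0):
    """Fire when the least-squares slope over the window exceeds max_slope per sample."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var > max_slope


# Illustrative data only.
stable_high = [90, 89, 91, 90, 89, 91]  # hovering at 89-91%, stable
rising_fast = [40, 44, 48, 52, 56, 60]  # only at 60%, but climbing 4 points per sample

print(static_threshold_alert(stable_high), trend_alert(stable_high))
print(static_threshold_alert(rising_fast), trend_alert(rising_fast))
```

The static rule stays silent on both series, while the slope check fires only on the rapidly rising one, which is the case the slide cares about.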

Page 7: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Work smarter not harder

• We don’t need more metrics
• We don’t need more thresholds and rules
• We DO need better, smarter tools

Page 8: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

TO THE RESCUE: Anomaly Detection!!

• Anomaly detection (also known as outlier detection) is the search for items or events which do not conform to an expected pattern. [Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A survey". ACM Computing Surveys 41 (3): 1]

• For devops: Need to know when one or more of our metrics is going wonky

Page 9: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

#monitoringsucks vs #imonitoring

• Proper monitoring tools should give us all the information we need to be PROACTIVE
– But they don’t
• Current monitoring tools assume that the underlying system is relatively static
– Surround it with static thresholds and rules.
– Good for detecting catastrophic events but not much else
– WHY!!??

Page 10: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

“Traditional” analytics …

• Roots in manufacturing process QC

Page 11: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

… are based on Gaussian distributions

• Make assumptions about probability distributions and process behaviour
– Usually assume data is normally distributed with a useful and usable mean and standard deviation

Page 12: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

What’s normal!!??

Page 13: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

THIS is normal

Page 14: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Three-Sigma Rule

• Three-sigma rule
– ~68% of the values lie within 1 std deviation of the mean
– ~95% of the values lie within 2 std deviations
– ~99.73% of the values lie within 3 std deviations: anything else is an outlier
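
The percentages above follow from the normal distribution, and the rule itself is a few lines of code. A small sketch in Python using only the standard library; the function names and the sample data are illustrative assumptions, not from the talk.

```python
import math
import statistics


def coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


def three_sigma_outliers(values, k=3.0):
    """Return the values lying more than k sample standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]


for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage(k):.4f}")

# Illustrative data: twenty identical readings and one spike.
readings = [10] * 20 + [100]
print(three_sigma_outliers(readings))
```

Note that the rule only means what it says if the data really is close to normal; on other distributions the 99.73% figure no longer holds.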

Page 15: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Aaahhhh

• The mysterious red lines explained

Page 16: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

The four horsemen

• Four horsemen of the modelpocalypse™ [Abe Stanway & Jon Cowie http://www.slideshare.net/jonlives/bring-the-noise]
– Seasonality
– Spike influence
– Normality
– Parameters

Page 17: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Moving Averages for detecting outliers

• Moving Averages “Big idea”:
– At any point in time in a well-behaved time series, your next value should not significantly deviate from the general trend of your data
– Mean as a predictor is too static, relies on too much past data (ALL of the data!)
– Instead of overall mean use a finite window of past values, predict most likely next value
– Alert if actual value “significantly” (3 sigmas?) deviates from predicted value
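
The "big idea" can be sketched as a trailing-window detector. This is a minimal Python illustration, assuming a window of 10 samples and a 3-sigma cutoff; both parameters, the function name, and the sample series are hypothetical choices, not values from the talk.

```python
import statistics


def window_anomalies(series, window=10, k=3.0):
    """Flag indices whose value deviates more than k sigmas from the trailing-window mean."""
    flagged = []
    for t in range(window, len(series)):
        past = series[t - window:t]           # finite window of past values
        predicted = statistics.fmean(past)    # predicted next value
        sigma = statistics.stdev(past)
        if sigma > 0 and abs(series[t] - predicted) > k * sigma:
            flagged.append(t)
    return flagged


# Illustrative series with a single spike at index 11.
series = [10, 11, 10, 9, 10, 11, 10, 9, 10, 11, 10, 30, 10]
print(window_anomalies(series))
```

Once the spike enters the window it inflates both the mean and the standard deviation of the following windows, which is the "spike influence" problem named among the four horsemen.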

Page 18: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Simple and Weighted Moving Averages

• Simple Moving Average
– Average of last N values in your time series
• S[t] <- sum(X[t-(N-1):t])/N
– Each value in the window contributes equally to prediction
– …INCLUDING spikes and outliers
• Weighted Moving Average
– Similar to SMA but assigns linearly (arithmetically) decreasing weights to every value in the window
– Older values contribute less to the prediction
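
A short Python sketch of both averages. The weighting scheme 1..N (newest value weighted N) is one common way to get linearly decreasing weights and is an assumption here; the function names and data are illustrative.

```python
def sma(x, n):
    """Simple moving average: every value in the last-n window counts equally."""
    window = x[-n:]
    return sum(window) / len(window)


def wma(x, n):
    """Weighted moving average: weights 1..n, so the newest value counts most."""
    window = x[-n:]
    weights = range(1, len(window) + 1)
    return sum(w * v for w, v in zip(weights, window)) / sum(weights)


# Illustrative data: the same spike, once old and once recent.
old_spike = [50, 10, 10, 10, 10]
new_spike = [10, 10, 10, 10, 50]

print(sma(old_spike, 5), sma(new_spike, 5))   # SMA cannot tell them apart
print(wma(old_spike, 5), wma(new_spike, 5))   # WMA discounts the older spike
```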

Page 19: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Exponential Smoothing

• Exponential Smoothing
– Similar to weighted average, but weights decay exponentially over the whole set of historic samples
• S[t] = αX[t-1] + (1-α)S[t-1]
– Does not deal with trends in data
• DES (double exponential smoothing)
– In addition to data smoothing factor (α), introduces a trend smoothing factor (β)
– Better at dealing with trending
– Does not deal with seasonality in data
• TES (triple exponential smoothing), Holt-Winters
– Introduces additional seasonality factor
– … and so on
• ALL assume Gaussian!
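
A sketch of the first two smoothers in Python. The single-smoothing function follows the recurrence on the slide; the double-smoothing function uses Holt's linear method, and the seeding choices (S[0] = X[0], initial trend X[1] - X[0]) are assumptions made for illustration. Holt-Winters is left out to keep the example short.

```python
def exponential_smoothing(x, alpha):
    """S[t] = alpha*X[t-1] + (1-alpha)*S[t-1], seeded with S[0] = X[0] (assumed)."""
    s = [x[0]]
    for t in range(1, len(x)):
        s.append(alpha * x[t - 1] + (1 - alpha) * s[t - 1])
    return s


def double_exponential_smoothing(x, alpha, beta):
    """Holt's linear method: one-step-ahead forecasts from a smoothed level and trend."""
    level, trend = x[0], x[1] - x[0]       # assumed initialisation
    forecasts = [x[0], level + trend]      # forecasts for t=0 and t=1
    for t in range(1, len(x) - 1):
        prev_level = level
        level = alpha * x[t] + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        forecasts.append(level + trend)    # forecast for t+1
    return forecasts


# Illustrative data: a steadily trending metric.
trending = [0, 2, 4, 6, 8]
print(exponential_smoothing(trending, alpha=0.5))
print(double_exponential_smoothing(trending, alpha=0.5, beta=0.5))
```

On the trending series the single smoother lags further and further behind, while the version with a trend factor tracks it, which matches the "does not deal with trends" bullet.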

Page 20: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Gaussian distributions are powerful because:

• Far far in the future, in a galaxy far far away:
– I can make the same predictions because the statistical properties of the data haven’t changed
– I can easily compare different metrics since they have similar statistical properties
• BUT…
• Cue in DRAMATIC MUSIC

Page 21: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

What’s my distribution?

Page 22: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Another common distribution

Page 23: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Let’s look at an example

Page 24: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

3-sigma rule

Page 25: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Holt-Winters predictions

Page 26: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Histogram – probability distribution

Page 27: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Another example

Page 28: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

3-sigma rule

Page 29: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Holt-Winters predictions

Page 30: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Histogram – probability distribution

Page 31: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Are we doomed?

• No!
• There are lots of other non-Gaussian based techniques:
– Adaptive Mixture of Gaussians
– Non-parametric techniques (http://www.metaforsoftware.com/everything-you-should-know-about-anomaly-detection-know-your-data-parametric-or-non-parametric/)
– Spectral analysis

Page 32: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Kolmogorov-Smirnov test

• Non-parametric test
– Compare two probability distributions
– Makes no assumptions (e.g. Gaussian) about the distributions of the samples
– Measures maximum distance between cumulative distributions
– Can be used to compare periodic/seasonal metric periods (e.g. day-to-day or week-to-week)

http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
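
A minimal two-sample KS sketch in plain Python. The decision rule uses the standard large-sample critical value rather than the bootstrap variant the talk goes on to show, and the 0.05 significance level, the function names, and the "yesterday" / "today" samples are assumptions for illustration.

```python
import bisect
import math


def ks_statistic(sample_a, sample_b):
    """Maximum vertical distance between the two empirical cumulative distributions."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in a + b:  # the ECDF difference can only change at observed values
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_differs(sample_a, sample_b, alpha=0.05):
    """Large-sample approximation: reject 'same distribution' if D exceeds the critical value."""
    n, m = len(sample_a), len(sample_b)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(sample_a, sample_b) > critical


# Illustrative data: the same metric over two periods, the second shifted upwards.
yesterday = list(range(100))
today = [v + 50 for v in yesterday]

print(ks_statistic(yesterday, today), ks_differs(yesterday, today))
print(ks_statistic(yesterday, yesterday), ks_differs(yesterday, yesterday))
```

Nothing in the statistic depends on a mean or standard deviation, which is why it still works when the metric is far from Gaussian.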

Page 33: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

KS test with bootstrap

Page 34: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

What about slow trends?

Page 35: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

KS test on slow memory leak

Page 36: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Histogram – probability distribution

Page 37: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

We’re not doomed, but: Know your data!!

• You need to understand the statistical properties of your data, and where it comes from, in order to determine what kind of analytics to use.

• A large amount of data center data is non-Gaussian
– Gaussian statistics won’t work
– Use appropriate techniques

Page 38: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Pet Peeve: How much data do we need?

• Trend towards higher and higher sampling rates in data collection

• Reminds me of Jorge Luis Borges’ story about Funes the Memorious
– Perfect recollection of the slightest details of every instant of his life, but lost the ability for abstraction
• Our brain works on abstraction
– We notice patterns BECAUSE we can abstract

Page 39: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

The danger of over-abstraction

[image] + [image] = comfortable?

Page 40: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

So, how much data DO you need?

• You don’t need more resolution than twice your highest frequency (Nyquist-Shannon sampling theorem)

• Most of the algorithms for analytics will smooth, average, filter, and pre-process the data.

• Watch out for correlated metrics (e.g. used vs. available memory)
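
Two of these points are easy to check numerically. The Python sketch below assumes a hypothetical 16 GB host and made-up memory readings; the function names are illustrative.

```python
import math


def max_sampling_interval(shortest_period_seconds):
    """Nyquist rule of thumb: sample at least twice per cycle of the fastest pattern of interest."""
    return shortest_period_seconds / 2


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# If the fastest behaviour worth seeing repeats every 10 minutes,
# sampling every 5 minutes is enough to capture it.
print(max_sampling_interval(600))

# Used vs. available memory on an assumed 16 GB host: one metric determines the other.
total_gb = 16.0
used = [4.0, 5.5, 7.0, 6.0, 9.5]
available = [total_gb - u for u in used]
print(pearson(used, available))
```

A correlation of essentially -1 means the second metric adds no information, so collecting and analyzing both mostly doubles the cost.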

Page 41: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Think: Is all data important to collect?

• Two camps:
– Data is data, let’s collect and analyze everything and figure out the trends.
– Not all data is important, so let’s figure out what’s important first and understand the underlying model so we don’t waste resources on the rest.
• Similar to the very public bun fight between Noam Chomsky and Peter Norvig
– http://norvig.com/chomsky.html

• Unresolved as far as I know

Page 42: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Shout out to Etsy

• Check out kale for some analytics:

– http://codeascraft.com/2013/06/11/introducing-kale/

– https://github.com/etsy/skyline/blob/master/src/analyzer/algorithms.py

Page 43: Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

More?

• Only scratched the surface
• I want to talk more about algorithms, analytics, current issues, etc, in more depth, but time’s up!!
– Go back in time to my Office Hours session, or
– Come talk to me or email me if interested.
• Thank you!

[email protected] @tboubez