Anomaly Detection Analytics for the Data Centre devopsdays Vancouver 25 October 2013 Toufic Boubez, Ph.D. Co-Founder, CTO Metafor Software
Jan 26, 2015
2
Toufic intro – who I am
• Co-Founder/CTO Metafor Software
• Co-Founder/CTO Layer 7 Technologies
– Acquired by Computer Associates in 2013
– I escaped
• Co-Founder/CTO Saffron Technology
• Chief Architect IBM (SOA)
• Building large scale software systems for 20 years (I’m older than I look, I know!)
3
Why this talk?
• April: devopsdays Austin: Open Space talk
– Blog: http://metaforsoftware.com/beyond-the-pretty-charts-a-report-from-devopsdays-in-austin/
• June: devopsdays Silicon Valley presentation:
– Five major lessons learned
• Explore issues mentioned in June
• Note: real data
• Note: no labels on charts – on purpose!!
• Note to self: remember to SLOW DOWN!
• Note to self: mention the cats!! Everybody loves cats!!
4
Wall of Charts™
5
The Wall of Charts side-effects
“Alert fatigue is the single biggest problem we have right now … We need to be more intelligent about our alerts or we’ll all go insane.”
- John Vincent, Monitorama, March 2013
Alert Overload Metrics Overload
6
Need mo’ better alerting
– So what if my unicorn usage is at 89-91%, and has been stable?
– I’d much rather know if it’s at 60% and has been rapidly increasing
– Static thresholds and rules won’t help you in this case
– Need some intelligent Anomaly Detection mechanism
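
To make the contrast concrete, here is a minimal sketch comparing a static threshold with a simple rate-of-change check. The function names, the 85% threshold, the 2-points-per-sample limit and the sample values are all made up for illustration; they are not from the talk.

```python
def static_alert(values, threshold=85.0):
    """Alert whenever the latest value exceeds a fixed threshold (hypothetical 85%)."""
    return values[-1] > threshold


def trend_alert(values, max_rise_per_step=2.0):
    """Alert when the average step-to-step increase exceeds a (hypothetical) limit."""
    if len(values) < 2:
        return False
    avg_rise = (values[-1] - values[0]) / (len(values) - 1)
    return avg_rise > max_rise_per_step


# Illustrative data only: usage percentages sampled at regular intervals.
stable_high = [90, 89, 91, 90, 89, 90]   # busy but steady
rising_fast = [40, 44, 48, 52, 56, 60]   # only 60%, but climbing 4 points per sample

print(static_alert(stable_high), trend_alert(stable_high))  # True False
print(static_alert(rising_fast), trend_alert(rising_fast))  # False True
```

The static rule fires on the stable series and stays silent on the one that is actually heading for trouble, which is the failure mode described above.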
7
Anomaly Detection for DevOps
• Anomaly detection (also known as outlier detection) is the search for items or events which do not conform to an expected pattern. [Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A survey". ACM Computing Surveys 41 (3): 1]
• For devops: Need to know when one or more of our metrics is going wonky
8
#monitoringsucks vs #iheartmonitoring
• Proper monitoring tools should give us all the information we need to be PROACTIVE
– But they don’t
• Current monitoring tools assume that the underlying system is relatively static
– Surround it with static thresholds and rules.
– Good for detecting catastrophic events but not much else
– BUT WHY!!??
9
“Traditional” analytics …
• Roots in manufacturing process QC
10
… are based on Gaussian distributions
• Makes assumptions about probability distributions and process behaviour
– Usually assumes data is normally distributed with a useful and usable mean and standard deviation
• Blah blah blah what does it mean?
11
What’s normal!!??
12
Distribution Schmistribution
13
Three-Sigma Rule
• Three-sigma rule
– ~68% of the values lie within 1 std deviation of the mean
– ~95% of the values lie within 2 std deviations
– 99.73% of the values lie within 3 std deviations
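
As a rough sketch of how the rule turns into an alert, the snippet below computes mean ± 3 standard deviations over a history window and flags anything outside that band. The helper names and the sample history are hypothetical, and the whole approach only makes sense if the data really is close to normally distributed.

```python
import statistics


def three_sigma_bounds(history, k=3.0):
    """Return (low, high) = mean -/+ k population standard deviations."""
    mean = statistics.mean(history)
    std = statistics.pstdev(history)
    return mean - k * std, mean + k * std


def is_outlier(value, history, k=3.0):
    """True if value falls outside the k-sigma band of the history."""
    low, high = three_sigma_bounds(history, k)
    return value < low or value > high


# Illustrative history only: mean is 50, standard deviation is about 1.67.
history = [50, 52, 48, 51, 49, 50, 53, 47, 50, 50]

print(three_sigma_bounds(history))   # roughly (44.98, 55.02)
print(is_outlier(54, history))       # False
print(is_outlier(60, history))       # True
```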
14
Aaahhhh
• The mysterious red lines explained
15
Moving Averages for detecting outliers
• Big idea:
– Based on past values, predict most likely next value
– Alert if actual value “significantly” deviates from predicted value
• Simple Moving Average
– Average of last N values in your time series: S[t] <- sum(X[t-(N-1):t])/N
– Each value in the window contributes equally to prediction
– Idea is that your next value should not significantly deviate from the general trend of your data
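
A minimal sketch of the simple moving average used as a predictor, assuming a fixed absolute tolerance stands in for "significantly". The function names, the tolerance of 10 and the sample series are illustrative assumptions.

```python
def sma_forecast(series, n):
    """Predict the next value as the plain mean of the last n observations."""
    window = series[-n:]
    return sum(window) / len(window)


def deviates(series, new_value, n=5, tolerance=10.0):
    """True if new_value is further than `tolerance` (hypothetical) from the SMA forecast."""
    return abs(new_value - sma_forecast(series, n)) > tolerance


# Illustrative data only.
series = [20, 22, 21, 23, 24, 22, 23]

print(sma_forecast(series, 5))   # 22.6
print(deviates(series, 24))      # False
print(deviates(series, 45))      # True
```

In practice the tolerance is usually derived from the spread of the window (for example a multiple of its standard deviation), which brings back the Gaussian assumption discussed earlier.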
16
Weighted Moving Average
• Weighted Moving Average
– Similar to SMA but assigns linearly (arithmetically) decreasing weights to every value in the window
– Older values contribute less to the prediction
• Neither SMA nor WMA deals well with periodicity in your data
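
A small sketch of the weighted version, assuming the common scheme where the newest value gets weight N and the oldest gets weight 1. The function name and the sample series are illustrative.

```python
def wma_forecast(series, n):
    """Predict the next value with linearly decreasing weights for older samples."""
    window = series[-n:]
    weights = range(1, len(window) + 1)   # oldest sample gets 1, newest gets len(window)
    return sum(w * x for w, x in zip(weights, window)) / sum(weights)


# Illustrative data only: a flat series with a jump in the latest sample.
series = [10, 10, 10, 20]

print(wma_forecast(series, 4))   # 14.0 (the plain average of the same window is 12.5)
```

The recent jump pulls the weighted forecast further than it pulls the plain average, which is the intended effect of the weighting.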
17
Exponential Smoothing
• Exponential Smoothing
– Similar to weighted average, but with weights that decay exponentially over the whole set of historic samples: S[t] = αX[t-1] + (1-α)S[t-1]
– Is almost as bad as moving averages in dealing with periodicity and trending time series!!
• DES (double exponential smoothing): Holt-Winters
– In addition to data smoothing factor (α), introduces a trend smoothing factor (β)
– Better at dealing with trending; periodicity needs the seasonal variant, which adds a third smoothing factor (γ)
• ALL assume Gaussian!
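
The sketch below puts single exponential smoothing next to double exponential smoothing (Holt's linear trend method) on a steadily rising series. The initial values (first sample as the starting level, first difference as the starting trend) are common conventions assumed here, and the function names and data are illustrative.

```python
def ses_next(x, alpha):
    """Single exponential smoothing: S[t] = alpha*X[t-1] + (1-alpha)*S[t-1].

    Starts from S = x[0] (an assumption) and returns the forecast for the
    next, not yet seen, value.
    """
    s = x[0]
    for value in x:
        s = alpha * value + (1 - alpha) * s
    return s


def holt_next(x, alpha, beta):
    """Double exponential smoothing with a level and a trend component."""
    level, trend = x[0], x[1] - x[0]      # assumed initialisation
    for value in x[1:]:
        prev_level = level
        level = alpha * value + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + trend                  # one-step-ahead forecast


# Illustrative data only: a metric growing by 2 per sample, so the next value is 20.
x = [10, 12, 14, 16, 18]

print(ses_next(x, alpha=0.5))              # 16.125, lags behind the trend
print(holt_next(x, alpha=0.5, beta=0.5))   # 20.0, follows the trend
```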
18
Gaussian distributions are powerful because:
• Far far in the future, in a galaxy far far away:
– I can make the same predictions because the statistical properties of the data haven’t changed
– I can compare different metrics since they have similar statistical properties
• BUT…
• Cue in DRAMATIC MUSIC
19
What’s my distribution?
20
Another common distribution
21
Let’s look at an example
22
Histogram – probability distribution
23
3-sigma rule
24
Holt-Winters predictions
25
Are we doomed?
• There’s A LOT you can do with the data, other than just looking at it and putting thresholds!
– Adaptive Mixture of Gaussians
– Non-parametric techniques (http://www.metaforsoftware.com/everything-you-should-know-about-anomaly-detection-know-your-data-parametric-or-non-parametric/)
– Spectral analysis
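
The slide does not say which non-parametric technique to use. As one possible example, chosen here for illustration, the two-sample Kolmogorov-Smirnov statistic compares a recent window against a baseline window without assuming any particular distribution. The function name and sample values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = no overlap)."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for v in a + b:
        cdf_a = bisect_right(a, v) / len(a)   # fraction of a that is <= v
        cdf_b = bisect_right(b, v) / len(b)   # fraction of b that is <= v
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Illustrative data only.
baseline = [1, 2, 3, 4]
similar = [1, 2, 3, 4]
shifted = [3, 4, 5, 6]

print(ks_statistic(baseline, similar))   # 0.0
print(ks_statistic(baseline, shifted))   # 0.5
```

A window would be flagged when the statistic exceeds some cut-off; picking that cut-off depends on the window sizes and is left out of this sketch.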
26
Mixture of Gaussians
27
We’re not doomed, but: Know your data!!
• You need to understand the statistical properties of your data, and where it comes from, in order to determine what kind of analytics to use.
• A large amount of data center data is non-Gaussian
– Gaussian statistics won’t work
– Use appropriate techniques
28
Pet Peeve #1: How much data do we need?
• Trend towards higher and higher sampling rates in data collection
• Reminds me of Jorge Luis Borges’ story about Funes the Memorious
– Perfect recollection of the slightest details of every instant of his life, but lost the ability for abstraction
• Our brain works on abstraction
– We notice patterns BECAUSE we can abstract
29
The danger of over-abstraction
+
= comfortable?
30
So, how much data DO you need?
• You don’t need more resolution than twice your highest frequency (Nyquist-Shannon sampling theorem)
• Most of the algorithms for analytics will smooth, average, filter, and pre-process the data.
• Watch out for correlated metrics (e.g. used vs. available memory)
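
One way to spot redundant metrics is to check how strongly they are correlated. The sketch below computes the Pearson correlation by hand for used vs. available memory; the function name, the 16 GB total and the sample readings are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series.

    Assumes neither series is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Illustrative data only: on a 16 GB host, available = total - used.
total = 16.0
used = [4.0, 6.5, 7.0, 9.5, 12.0]
available = [total - u for u in used]

r = pearson(used, available)
print(round(r, 6))   # -1.0: one metric fully determines the other
```

A correlation this close to -1 or +1 means the second metric adds no information, so collecting and alerting on both mostly duplicates work.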
31
Think: Is all data important to collect?
• Two camps:
– Data is data, let’s collect and analyze everything and figure out the trends.
– Not all data is important, so let’s figure out what’s important first and understand the underlying model so we don’t waste resources on the rest.
• Similar to the very public bun fight between Noam Chomsky and Peter Norvig
– http://norvig.com/chomsky.html
• Unresolved as far as I know
32
Do we need both metrics?
33
More?
• Only scratched the surface
• I want to talk more about analytics, in more depth, but time’s up!!
– (Actually Jenny won’t let me)
• Come talk to me during the breaks!
• Thank you!