Why dynamic & adaptive thresholds matters

Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

Jun 10, 2015

Anders Haal's presentation on using adaptive thresholds with Nagios.
The presentation was given during the Nagios World Conference North America held Sept 25-28th, 2012 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna
Transcript
Page 1: Nagios Conference 2012 - Anders Haal - Why dynamic and adaptive thresholds matters

Why dynamic & adaptive thresholds matters

Page 2:

anders håål, ingenjörsbyn ab, [email protected], @thenodon

Page 3:

Bischeck - dynamic & adaptive thresholds for Nagios

www.bischeck.org

Page 4:

Threshold

Page 5:

What are the limitations of static thresholds?

✗ Not static
✗ Load varies throughout the day, week
✗ Too many or too few alarms
✗ Collecting and thresholding in the same context
✗ Based on the current measurement
✗ Do not consider dependencies on other services

Page 6:

How to make thresholds dynamic & adaptive?

Page 7:

“Database table size should not be more than 5% bigger than yesterday's max size”

{example 1}

Page 8:

“Database table size should not be more than 5% bigger than yesterday's max size”

[Chart: table size (0 to 200000) per hour of day (7 to 23), yesterday vs. today]

table size < max(yesterday)*1.05

{example 1}
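
The rule in example 1 can be sketched in a few lines of Python. This is only an illustration of the comparison, not bischeck code; the function name `table_size_ok` and the sample readings are made up.

```python
def table_size_ok(current_size, yesterday_sizes, factor=1.05):
    """Return True while current_size stays below max(yesterday) * factor."""
    if not yesterday_sizes:
        raise ValueError("need at least one measurement from yesterday")
    return current_size < max(yesterday_sizes) * factor


# Hypothetical table sizes collected yesterday.
yesterday = [120000, 135000, 150000, 160000]

print(table_size_ok(165000, yesterday))  # compared against 160000 * 1.05
print(table_size_ok(170000, yesterday))
```

The threshold moves with the data: if yesterday's maximum grows, today's limit grows with it.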

Page 9:

“Number of on-line users should not be more than 10% higher than the average number of on-line users for the last 10 data points”

{example 2}

Page 10:

users < avg(X0, X1, ..., X9)*1.1

where X0 to X9 are the last 10 historical data points for on-line users

{example 2}

“Number of on-line users should not be more than 10% higher than the average number of on-line users for the last 10 data points”
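
A minimal sketch of example 2, assuming the last 10 data points are kept in memory. The class name `RollingAverageThreshold` and its behaviour when no history exists yet are assumptions made for this illustration, not part of bischeck.

```python
from collections import deque


class RollingAverageThreshold:
    """Flag a value that is not below avg(last `window` points) * factor."""

    def __init__(self, window=10, factor=1.1):
        self.history = deque(maxlen=window)  # oldest points drop out automatically
        self.factor = factor

    def check(self, value):
        """Return True if value is below the threshold, then record the value."""
        if self.history:
            limit = sum(self.history) / len(self.history) * self.factor
            ok = value < limit
        else:
            ok = True  # assumption: with no history there is nothing to compare against
        self.history.append(value)
        return ok


users = RollingAverageThreshold()
for _ in range(10):
    users.check(100)      # build up ten historical data points
print(users.check(105))   # limit is about 110
print(users.check(120))   # limit is about 110.55 after the 105 reading
```

Every reading is recorded, including one that breaks the threshold, so the limit adapts to the most recent load.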

Page 11:

{example 3}

“The number of orders with errors should be lower than 5% of the total number of registered orders”

Page 12:

“The number of orders with errors should be lower than 5% of the total number of registered orders”

{example 3}

Total #orders

#orders with errors < 0.05 * (Total #orders)
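
Example 3 makes the threshold of one measurement depend on another measurement. A minimal sketch, with a made-up function name and made-up numbers:

```python
def order_errors_ok(error_orders, total_orders, max_ratio=0.05):
    """Return True while error_orders < max_ratio * total_orders."""
    return error_orders < max_ratio * total_orders


print(order_errors_ok(40, 1000))  # limit is 0.05 * 1000 = 50
print(order_errors_ok(60, 1000))
```

With 10000 registered orders, the same 60 errors would be below the limit, which a static threshold on the error count cannot express.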

Page 13:

“Message queue size should be above the defined Friday threshold profile”

{example 4}

[Chart: count vs. time of the day]

Page 14:

“Message queue size should be above the defined Friday threshold profile”

{example 4}

[Charts, two panels: count vs. time of the day]

Page 15:

How to make thresholds dynamic & adaptive?

✗ Time profiles
✗ Historical data points
✗ Math and statistical operations

Page 16:

We did not want a check_XYZ hack 

We wanted a tool

Page 17:

Collecting

Page 18:

Thresholding

Collecting

Separation

Page 19:

Historical data

Collecting

Page 20:

Historical data

Logic

Collecting

Page 21:

Historical data

Logic

Collecting

Day profile

Page 22:

Historical data

Calendar

Logic

Collecting

Day profile

Page 23:

Historical data

Calendar

Logic

Collecting

Day profile

Nagios

Page 24:

Historical data

Calendar

Logic

Collecting

Day profile

Scheduling

Server Interface

Nagios, OpenTSDB, XYZ

Page 25:
Page 26:
Page 27:

bischeck basics

● Configuration like Nagios – host, service but also service item
● Host is just a container of the rest
● Service specifies the connection and scheduling
● Service item specifies the “query” and the threshold class to use
● Host and service name must be the same as in the Nagios configuration
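
The host, service and service item hierarchy described above can be pictured as a small data model. The sketch below is written in Python purely for illustration: the class and field names, the connection string, the schedule text, the query and the threshold class name are all hypothetical and are not bischeck's actual configuration schema. The `cache_key` helper assumes that cached data is addressed as host-service-serviceitem, which is how the names in the threshold examples later in the deck appear to be built.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServiceItem:
    name: str
    query: str            # what to execute over the service's connection
    threshold_class: str  # which threshold implementation evaluates the result


@dataclass
class Service:
    name: str             # must match the Nagios service name
    connection: str
    schedule: str
    items: List[ServiceItem] = field(default_factory=list)


@dataclass
class Host:
    name: str             # must match the Nagios host name
    services: List[Service] = field(default_factory=list)


def cache_key(host, service, item):
    """Assumed naming scheme: host-service-serviceitem."""
    return f"{host.name}-{service.name}-{item.name}"


# All values below are placeholders.
item = ServiceItem("ediOrders", "select count(*) from orders", "ExampleThreshold")
service = Service("orders", "jdbc:example", "every 5 minutes", [item])
host = Host("erpserver", [service])

print(cache_key(host, service, item))
```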

Page 28:

Threshold – 24 hour day profile

● Divide the day into 24 hourly points, where every point can be:
  ● Static value
  ● Dynamic value
    – Math expression on single value or range of data from the cache
    – Based on cached data points retrieved by
      ● Index – single value or index range
      ● Time – single value (closest) or time range (between)

Page 29:

....
<!-- 12:00 Static -->
<hour>7000</hour>
....

Page 30:

....
<!-- 12:00 Static -->
<hour>7000</hour>
<!-- 13:00 Adaptive -->
<hour>erpserver-orders-ediOrders[0] / 3</hour>
....

Page 31:

....
<!-- 12:00 Static -->
<hour>7000</hour>
<!-- 13:00 Adaptive -->
<hour>erpserver-orders-ediOrders[0] / 3</hour>
<!-- 14:00 Adaptive with math function -->
<hour>avg(erpserver-orders-ediOrders[-30M:-60M]) / 2</hour>
....
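
To make the two adaptive expressions concrete, here is a rough Python emulation of the lookups they imply. It is not how bischeck evaluates them internally. It assumes that index 0 is the most recent cached value and that `[-30M:-60M]` selects the values between 30 and 60 minutes old; the cache layout, the helper names `latest` and `between`, and the sample data are made up.

```python
from datetime import datetime, timedelta
from statistics import mean


def latest(cache, index=0):
    """Value by index; index 0 is assumed to be the most recent point."""
    return cache[index][1]


def between(cache, now, newer_min, older_min):
    """Values whose age in minutes lies between newer_min and older_min."""
    newest = now - timedelta(minutes=newer_min)
    oldest = now - timedelta(minutes=older_min)
    return [value for ts, value in cache if oldest <= ts <= newest]


now = datetime(2012, 9, 27, 14, 0)
# Hypothetical cache, newest first: one (timestamp, value) pair every 15 minutes.
cache = [(now - timedelta(minutes=15 * i), 9000 - 300 * i) for i in range(6)]

adaptive_13 = latest(cache) / 3                       # like ediOrders[0] / 3
adaptive_14 = mean(between(cache, now, 30, 60)) / 2   # like avg(ediOrders[-30M:-60M]) / 2

print(adaptive_13, adaptive_14)
```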

Page 32:

Threshold – 24 hour day profile

Between every “full” hour a linear equation is calculated

[Chart: day profile, threshold value (0 to 60000) per hour (0 to 23)]
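
The linear equation between two full hours amounts to linear interpolation. A minimal sketch, assuming the profile is a list of 24 hourly values; the function name is made up, and wrapping from 23:00 to the 00:00 value is an assumption of this sketch, not something the slide states.

```python
def threshold_at(profile, hour, minute):
    """Linearly interpolate between the thresholds of two adjacent full hours."""
    start = profile[hour]
    end = profile[(hour + 1) % 24]  # assumption: after 23:00, interpolate toward 00:00
    return start + (end - start) * minute / 60


# Hypothetical day profile: the threshold rises by 1000 every hour.
profile = [1000 * h for h in range(24)]

print(threshold_at(profile, 12, 0))   # exactly the 12:00 point
print(threshold_at(profile, 12, 30))  # halfway between 12:00 and 13:00
```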

Page 33:

Threshold – 24 hour day profile

● Connect calendar to the day profile and evaluate according to the following order:
  1. Month and day of month
  2. Week and day of week
  3. Day in month
  4. Day in the week
  5. Month
  6. Week
● Holiday – exception days
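
The evaluation order above is a first-match lookup. The sketch below shows one way to express it in Python. The rule key layout, the function name and the profile names are invented for the illustration, and treating holidays as overriding every other rule is an assumption, since the slide only lists them as exception days.

```python
from datetime import date


def select_profile(day, rules, holidays=(), holiday_profile=None, default=None):
    """Pick a day profile name using the precedence order from the slide.

    `rules` maps a tuple key to a profile name (key layout is hypothetical).
    """
    if day in holidays:  # assumption: exception days override every rule
        return holiday_profile
    week = day.isocalendar()[1]
    weekday = day.isoweekday()  # 1 = Monday ... 7 = Sunday
    candidates = [
        ("month+dom", day.month, day.day),  # 1. month and day of month
        ("week+dow", week, weekday),        # 2. week and day of week
        ("dom", day.day),                   # 3. day in month
        ("dow", weekday),                   # 4. day in the week
        ("month", day.month),               # 5. month
        ("week", week),                     # 6. week
    ]
    for key in candidates:
        if key in rules:
            return rules[key]
    return default


rules = {
    ("dow", 5): "friday",
    ("month", 9): "september",
    ("month+dom", 9, 28): "quarter-end",
}

print(select_profile(date(2012, 9, 28), rules))  # a Friday, but rule 1 matches first
print(select_profile(date(2012, 9, 21), rules))  # a Friday, rule 4 beats rule 5
print(select_profile(date(2012, 9, 20), rules))  # a Thursday, only rule 5 matches
```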

Page 34:

And more....

● Multi-threaded and multi-scheduling schema per service
  ● interval
  ● cron
● Data collection – jdbc, livestatus, internal cache
● Virtual services
● Date macros in execution statements
● Customize
  ● connection (service classes)
  ● execution (service item classes)
  ● thresholds (threshold classes)
  ● server integration (server classes)
● XML configuration supported with WEBui (beta)
● GPL 2 license

Page 35:

Future

● Improved time series database
● Patterns/baselines
● More statistic functions
● “Sensors” - alarms on multiple/aggregated data points
● Any ideas?

Page 36:

Infrastructure monitoring

Application performance monitoring [APM]

Business activity monitoring [BAM]

Operational Business intelligence [OBI]

Page 37:

Questions & Feedback

Pictures – Creative Commons
www.flickr.com/photos/loneprimate/4017405677
www.flickr.com/photos/catatronic/2397319483
www.flickr.com/photos/dtrimarchi/6815004766
www.flickr.com/photos/bikeracer/6740232